detect metadata right from epub files #4
Reference
izzy/miniCalOPe#4
Author: @IzzySoft
This issue has been raised with ticket:17 in the original tracker, and was transferred here slightly altered:
The referenced "ticket" includes a patch, which relies on a backend written in C that is also attached there.
Author: @IzzySoft
I strongly advise against using such a feature in automatic runs, especially when unsupervised:
For the latter, an example: Bertha von Suttner. Most of the time (in my experience, running checks against the ~7,000 books in the German catalog on ebooks.qumran.org), she turns up as either "Bertha Suttner" or "Suttner, Bertha". That already makes two entries for the same author. But she has a title, so she might also turn up as "Bertha von Suttner", "Suttner, Bertha von" and even "von Suttner, Bertha", making 5 different variants. Now, her title really would be "Freifrau von Suttner". And her full name is "Bertha Sophia Felicita Freifrau von Suttner" (and if you think that's already the most complicated name, check Ida Marie Louise Sophie Friederike Gustave Gräfin von Hahn 😇). Unsupervised automated runs would leave all possible combinations in the catalog, making "books by author X" quite … well, a broken concept.
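To illustrate why these variants are hard to reconcile automatically, here is a minimal Python sketch (not part of miniCalOPe; the function name `author_key` and the particle/title lists are assumptions made up for this example) that reduces a name to a rough match key. It collapses the variants listed above, but it would also merge distinct authors who share a surname and first given name, which is exactly why a human check remains advisable.

```python
# Nobility particles and titles ignored when building the key.
# Both sets are illustrative assumptions, not taken from miniCalOPe.
PARTICLES = {"von", "van", "de", "zu"}
TITLES = {"freifrau", "freiherr", "gräfin", "graf"}


def author_key(name):
    """Return (surname, first given name), lowercased, as a rough match key."""
    name = name.strip()
    if "," in name:
        # "Suttner, Bertha von": family part before the comma, given part after
        last, first = (part.strip() for part in name.split(",", 1))
        family_tokens, given_tokens = last.split(), first.split()
    else:
        # "Bertha von Suttner": treat the last word as the surname
        tokens = name.split()
        given_tokens, family_tokens = tokens[:-1], tokens[-1:]
    drop = PARTICLES | TITLES
    given = [t for t in given_tokens if t.lower() not in drop]
    family = [t for t in family_tokens if t.lower() not in drop]
    return (
        family[-1].lower() if family else "",
        given[0].lower() if given else "",
    )


variants = [
    "Bertha Suttner",
    "Suttner, Bertha",
    "Bertha von Suttner",
    "Suttner, Bertha von",
    "von Suttner, Bertha",
]
# All five spellings collapse to a single key.
print({author_key(v) for v in variants})
```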
So what I plan in a first run is:
- […] `class.epub.php`)
- […] `.desc` and `.data` files for a given book (next on my schedule; `class.epubdesc.php`)
- […] `doc/` directory (script already exists and has been tested by me for the past couple of weeks; needs rework incl. splitting out the `class.epubdesc.php`)
That way you can at least have all the metadata extracted semi-automatically (e.g. `epubmeta book.epub` would create the `.desc` and `.data` in the same place), and you can check (and fix/extend) the created files.
This is the next feature I have planned (of course, bug fixes have higher priority if bugs pop up 😉)
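For readers unfamiliar with where this metadata lives inside an EPUB, the following is a rough Python sketch of the general technique: open the ZIP container, follow `META-INF/container.xml` to the OPF package file, and collect its Dublin Core elements. It is not the project's PHP classes or the `epubmeta` script; the function name `read_epub_metadata` and the chosen field list are assumptions for illustration, and writing `.desc`/`.data` files is left out because their format is not described in this thread.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and OPF package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(path):
    """Return a dict of Dublin Core fields found in the EPUB's OPF file.

    `path` may be a filename or a binary file-like object.
    """
    with zipfile.ZipFile(path) as z:
        # container.xml points at the OPF package file inside the archive.
        container = ET.fromstring(z.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        opf = ET.fromstring(z.read(rootfile.get("full-path")))

    metadata = opf.find("opf:metadata", NS)
    if metadata is None:
        return {}

    meta = {}
    # Field selection is illustrative; an EPUB may carry more (or fewer).
    for field in ("title", "creator", "publisher", "subject", "description"):
        values = [
            e.text.strip()
            for e in metadata.findall(f"dc:{field}", NS)
            if e.text
        ]
        if values:
            meta[field] = values  # keep every value: fields can repeat
    return meta
```

Every field is returned as a list because elements such as `dc:creator` or `dc:subject` can occur more than once; deciding which value to keep is the part that benefits from a manual check.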
Author: @IzzySoft
Implemented as described above 😇
Author: @IzzySoft
This feature has now been added for metadata (by default, the `.data` files). As outlined above, there might be a few issues, depending on who built the `.epub` and how they've set up the metadata. I will outline the possible fields and their pitfalls here:
- `author`: see above
- `isbn`: safe. This is either an ISBN, or not present at all.
- `publisher`: in my experience, this often holds more than just the publisher, usually also the publication place and year. Up to you whether you want that.
- `rating`: not sure. Rarely found in EPUBs.
- `series`: might not be the one you wish to file it under
- `series_index`: ditto
- `tag`: probably not one of those you are using to file your books, but you might wish to try
- `title`: should be pretty safe, but no guarantees
- `uri`: also pretty safe (and rarely used)
Author: @IzzySoft
`5cd0535` completed this task, so I'll close the issue now. Some remarks on extracting the book description you should be aware of:
- […] `.epub` files, it's not always really useful (even if it fills the page)
- […] `$skip_broken_xml` set to TRUE if you care for OPDS – otherwise OPDS users might be unable to access such a book)
- […] `head` is useful or not is your decision. It doesn't usually break anything, but you never know how the metadata are set up (believe me, there are strange things around).
You can always check ebooks manually using the `doc/epubmeta` script, which extracts the full load of available values. Now enjoy!
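As a rough illustration of what "broken XML" in an embedded description can mean for an XML-based feed such as OPDS, here is a small Python sketch of a well-formedness check. It is not how `$skip_broken_xml` is implemented in miniCalOPe; the function name `is_wellformed_fragment` is hypothetical, and the approach (wrapping the fragment in a dummy root element and trying to parse it) is just one simple way to test this.

```python
import xml.etree.ElementTree as ET


def is_wellformed_fragment(fragment):
    """Return True if the fragment parses as XML inside a dummy root element."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        # Typical causes: unclosed tags such as <br>, or a bare "&".
        return False


print(is_wellformed_fragment("<p>A fine description.</p>"))
print(is_wellformed_fragment("<p>Line one<br>line two</p>"))
```

A description that fails such a check could be skipped or replaced with plain text before it is written into a feed, so that a single malformed book entry does not make the whole feed unreadable.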