detect metadata right from epub files #4

Closed

opened 2015-04-30 21:06:36 +02:00 by izzy · 4 comments

izzy commented

2015-04-30 21:06:36 +02:00

Owner

Author: @IzzySoft

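This issue was originally raised as [ticket:17](http://projects.izzysoft.de/trac/minicalope/ticket/17) in the original tracker and was transferred here in slightly altered form: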

> It would be nice if minicalope could fetch the title, author, language, etc. directly from the epub file, instead of taking them from the dir/filename or relying on additional `*.data` files that will be hard for the user to maintain in the long run.
>
> This might be tricky to do in PHP, so an alternative would be to let the user plug in a 'backend' that parses the epub file and returns the metadata in the format expected by minicalope.

The referenced ticket includes a patch that relies on a backend written in C, which is attached there as well.

izzy commented

2015-04-30 22:41:37 +02:00

Author: @IzzySoft

I strongly advise against using such a feature in automatic runs, especially when unsupervised:

- epub description might contain "invalid HTML" (e.g. missing closing tags for lists), which then would break OPDS (while working fine in HTML)
- the same author might turn up in many different spellings

For the latter, an example: [Bertha von Suttner](https://de.wikipedia.org/wiki/Bertha_von_Suttner). In my experience (from running checks against the ~7,000 books in the German catalog on ebooks.qumran.org), she mostly turns up as either "Bertha Suttner" or "Suttner, Bertha", which already makes two entries for the same author. But she held a title, so she might also turn up as "Bertha von Suttner", "Suttner, Bertha von" and even "von Suttner, Bertha", making five different variants. Strictly speaking, her title would be "Freifrau von Suttner", and her full name is "Bertha Sophia Felicita Freifrau von Suttner" (and if you think that's already the most complicated name, check [Ida Marie Louise Sophie Friederike Gustave Gräfin von Hahn](http://de.wikipedia.org/wiki/Ida_Hahn-Hahn) 😇). Unsupervised automated runs would leave all possible combinations in place, making "books by author X" quite … well, a broken concept.
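To make the spelling problem concrete, here is a small Python sketch (not part of minicalope; the function `author_key` and the `PARTICLES` set are made up for illustration). It shows that the five variants are distinct strings, and that even a naive normalization needs a hand-maintained list of particles and titles, which is exactly why a manual check remains advisable:

```python
# Illustrative only: a hand-picked, far from complete list of particles/titles.
PARTICLES = {"von", "freifrau", "gräfin"}

def author_key(name):
    """Naive heuristic: undo 'Last, First' order, drop particles,
    and key the author by (last word, first word), lowercased."""
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    words = [w for w in name.split() if w.lower() not in PARTICLES]
    if not words:
        return ()
    return (words[-1].lower(), words[0].lower())

variants = [
    "Bertha Suttner",
    "Suttner, Bertha",
    "Bertha von Suttner",
    "Suttner, Bertha von",
    "von Suttner, Bertha",
]

# Five different strings in the metadata, one key after normalization.
print(len(set(variants)), {author_key(v) for v in variants})
```

Note that such a key would also merge two different authors who happen to share first and last name, so it can only support a human review, not replace it.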

So what I plan as a first step is:

- creating a class for reading epub metadata (done and currently in testing: `class.epub.php`)
- creating a class extending it that takes care of creating the `.desc` and `.data` files for a given book (next on my schedule; `class.epubdesc.php`)
- creating a simple script that makes use of the two, and including it in e.g. the `doc/` directory (the script already exists and I have been testing it for the past couple of weeks; it needs rework, including splitting out `class.epubdesc.php`)

That way you can at least have all the metadata extracted semi-automatically (e.g. `epubmeta book.epub` would create the `.desc` and `.data` files in the same place), and you can then check (and fix/extend) the created files.
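For readers unfamiliar with the file format, the extraction step can be sketched as follows. This is a minimal Python illustration, not the code of `class.epub.php` (which is not shown in this thread); the function name `read_epub_metadata` is an assumption. It relies only on how epub files are laid out: a ZIP archive whose `META-INF/container.xml` points to the package (OPF) file, which carries Dublin Core metadata elements.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the epub container and package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}

FIELDS = ("title", "creator", "language", "publisher",
          "identifier", "subject", "description")

def read_epub_metadata(path):
    """Return a dict mapping Dublin Core field names to lists of values.

    `path` may be a file name or a file-like object (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        # container.xml tells us where the package (OPF) file lives.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        package = ET.fromstring(archive.read(rootfile.get("full-path")))

    metadata = package.find("opf:metadata", NS)
    result = {}
    for field in FIELDS:
        values = [(el.text or "").strip()
                  for el in metadata.findall(f"dc:{field}", NS)]
        values = [v for v in values if v]  # skip empty elements
        if values:
            result[field] = values
    return result
```

Calling `read_epub_metadata("book.epub")` returns the raw values exactly as the epub's creator entered them, so the result still needs the manual check described above before it goes into the `.data` file.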

This is the next feature I have planned (of course, bug-fixes have higher priority, if bugs pop up 😉)

izzy commented

2015-05-02 23:08:26 +02:00

Author: @IzzySoft

Implemented as described above 😇

izzy commented

2015-05-26 08:38:56 +02:00

Author: @IzzySoft

This feature has now been added for metadata (by default, the `.data` files). As outlined above, there might be a few issues, depending on who built the `.epub` and how they set up the metadata. Here are the possible fields and their pitfalls:

- `author`: see above
- `isbn`: safe. This is either an ISBN, or not present at all.
- `publisher`: in my experience, this often holds more than just the publisher, usually also the place and year of publication. It is up to you whether you want that.
- `rating`: not sure. Rarely found in epubs.
- `series`: might not be the one you wish to file the book under
- `series_index`: ditto
- `tag`: probably not one of those you are using to file your books, but you might wish to try
- `title`: should be pretty safe, but no guarantees
- `uri`: also pretty safe (and rarely used)

izzy commented

2015-05-26 21:15:07 +02:00

Author: @IzzySoft

5cd0535 completed this task, so I'll close the issue now. Some remarks on extracting the book description that you should be aware of:

- though a TOC is always present in `.epub` files, it's not always really useful (even if it fills the page)
- a book description _may_ be available. If it is, it might contain HTML tags which _might_ break the XML for the OPDS part (make sure to have `$skip_broken_xml` set to TRUE if you care about OPDS; otherwise OPDS users might be unable to access such a book)
- whether the `head` is useful or not is your decision. It doesn't usually break anything, but you never know how the metadata are set up (believe me, there are strange things around).
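To check by hand whether a description would be safe for an XML feed, a well-formedness test along the following lines can help. This is a Python sketch under my own assumptions (the helper name `is_wellformed_fragment` is made up, and it is not a description of how `$skip_broken_xml` works internally): the fragment is wrapped in a dummy root element and handed to an XML parser.

```python
import xml.etree.ElementTree as ET

def is_wellformed_fragment(html):
    """Return True if the HTML fragment also parses as XML.

    The fragment is wrapped in a dummy <root> element because a
    description usually has several top-level elements or bare text."""
    try:
        ET.fromstring(f"<root>{html}</root>")
        return True
    except ET.ParseError:
        return False

# A list with proper closing tags passes ...
print(is_wellformed_fragment("<ul><li>one</li><li>two</li></ul>"))
# ... while the "missing closing tags for lists" case does not.
print(is_wellformed_fragment("<ul><li>one<li>two</ul>"))
```

HTML-only entities such as `&nbsp;` are also rejected by this check, since they are undefined in plain XML.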

You can always check ebooks manually using the `doc/epubmeta` script, which extracts the full set of available values. Now enjoy!