Posts

Showing posts with the label CRF Model

2021-09-19: Conditional Random Field with Textual and Visual Features to Extract Metadata From Scanned ETDs

Our previous blog described Electronic Theses and Dissertations (ETDs) before 1997, and a significant fraction of ETDs after 1997 are scanned from physical copies. These ETDs are valuable for digital library preservation, but they must be indexed to be accessible. Many ETD repositories come with incomplete, little, or no metadata, which makes the documents hard to find. For example, advisor names that appear on the scanned ETDs may be missing from the metadata provided in the library repository. An automatic approach is therefore needed to extract metadata from scanned ETDs. We proposed a conditional random field (CRF) based sequence tagging model that combines textual and visual features. The source code can be found in our GitHub repository.

Introduction

Automatic metadata extraction is important to build scalable digital library search engines. Most existing tools such as GROBID [1], CERMINE [2], and ParsCit [3] developed and applied t...
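To make the idea of combining textual and visual features concrete, here is a minimal sketch of per-token feature extraction for a CRF sequence tagger. It is not the code from our repository: the token fields (`text`, `x`, `y`, `height`, `page_width`, `page_height`), the feature names, and the sample values are all hypothetical, chosen only for illustration. Lists of feature dictionaries like these are the kind of input that CRF toolkits such as sklearn-crfsuite typically accept.

```python
import re


def token_features(tokens, i):
    """Build a feature dict for token i of an OCR'd page.

    Each token is a dict with hypothetical keys: text, x, y (top-left
    corner in pixels), height, page_width, page_height.
    """
    tok = tokens[i]
    text = tok["text"]
    feats = {
        # Textual features: what the token looks like as a string.
        "bias": 1.0,
        "lower": text.lower(),
        "is_title": text.istitle(),
        "is_upper": text.isupper(),
        "has_digit": any(c.isdigit() for c in text),
        "looks_like_year": bool(re.fullmatch(r"(19|20)\d{2}", text)),
        # Visual features: where the token sits on the page and how large
        # it is, normalized by page size so scans of different resolution
        # are comparable.
        "x_rel": round(tok["x"] / tok["page_width"], 2),
        "y_rel": round(tok["y"] / tok["page_height"], 2),
        "height_rel": round(tok["height"] / tok["page_height"], 3),
    }
    # Context features: neighbouring tokens help the CRF model transitions.
    if i == 0:
        feats["BOS"] = True
    else:
        feats["prev_lower"] = tokens[i - 1]["text"].lower()
    if i == len(tokens) - 1:
        feats["EOS"] = True
    else:
        feats["next_lower"] = tokens[i + 1]["text"].lower()
    return feats


def page_features(tokens):
    """Feature sequence for one page: one dict per token, in reading order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Made-up tokens from an imaginary scanned cover page.
page = {"page_width": 1000, "page_height": 1400}
sample = [
    {"text": "Advisor:", "x": 100, "y": 900, "height": 20, **page},
    {"text": "Jane", "x": 220, "y": 900, "height": 20, **page},
    {"text": "Doe", "x": 290, "y": 900, "height": 20, **page},
    {"text": "1995", "x": 450, "y": 1200, "height": 18, **page},
]
features = page_features(sample)
```

Each page becomes one sequence of feature dicts, paired at training time with one label per token (for example, a hypothetical `advisor` or `year` tag). The positional features let the model learn layout cues, such as a field tending to appear in a particular region of the cover page, that text alone does not capture.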