Dominika Tkaczyk

Dublin, County Dublin, Ireland
572 followers 500+ connections

Experience

  • Crossref

    Dublin, County Dublin, Ireland

  • -

    Dublin, Ireland

  • -

    Dublin, Ireland

  • -

    ICM, University of Warsaw

  • -

    ICM, Networking Group, Warszawa

Education

  • Systems Research Institute, Polish Academy of Sciences


Publications

  • CERMINE: automatic extraction of structured metadata from scientific literature

    International Journal on Document Analysis and Recognition

    CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, makes it easy to improve or replace independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementations of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.

  • CERMINE — automatic extraction of metadata and references from scientific literature

    11th IAPR International Workshop on Document Analysis Systems

    CERMINE is a comprehensive open-source system for extracting metadata and parsed bibliographic references from scientific articles in born-digital form. The system is based on a modular workflow whose architecture allows each step to be trained and evaluated separately, makes it easy to modify or replace individual components, and simplifies further expansion of the architecture. Most steps are implemented with supervised and unsupervised machine-learning techniques, which simplifies adjusting the system to new document layouts. The paper describes the overall workflow architecture, provides details about individual implementations, and reports the evaluation methodology and results. The CERMINE service is available at http://cermine.ceon.pl.

  • GROTOAP2 — The Methodology of Creating a Large Ground Truth Dataset of Scientific Articles

    D-Lib Magazine

    Scientific literature analysis improves knowledge propagation and plays a key role in understanding and assessing scholarly communication in the scientific world. In recent years, many tools and services for analysing the content of scientific articles have been developed. One of the most important tasks in this research area is understanding the roles of different parts of a document. It is impossible to build effective solutions for problems related to document fragment classification, or to evaluate their performance, without a reliable test set that contains both input documents and the expected classification results. In this paper we present GROTOAP2, a large dataset of ground truth files containing labelled fragments of scientific articles in PDF format, useful for training and evaluating document content analysis solutions. GROTOAP2 was successfully used for training CERMINE, our system for extracting metadata and content from scientific articles. The dataset is based on articles from the PubMed Central Open Access Subset and is published under an Open Access license. The semi-automatic method used to construct GROTOAP2 is scalable and can be adjusted for building large datasets from other data sources. The article presents the content of GROTOAP2, describes the entire creation process, and reports the evaluation methodology and results.

  • Large Scale Citation Matching Using Apache Hadoop

    Research and Advanced Technology for Digital Libraries, volume 8092 of Lecture Notes in Computer Science, Springer Berlin Heidelberg

    During the process of citation matching, links from bibliography entries to referenced publications are created. Such links indicate topical similarity between the linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.

  • Methodology for evaluating citation parsing and matching

    Intelligent Tools for Building a Scientific Information Platform, volume 467 of Studies in Computational Intelligence, Springer Berlin Heidelberg

    Bibliographic references between scholarly publications contain valuable information for researchers and developers involved with digital repositories. They indicate topical similarity between linked texts, help assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. Consequently, several approaches to extracting, parsing and resolving such references have been proposed to date. In this paper we develop a methodology for evaluating parsing and matching algorithms and choosing the most appropriate one for a given document collection. We apply the methodology to evaluate the reference parsing and matching module of the YADDA2 software platform.

  • A Modular Metadata Extraction System for Born-Digital Articles

    10th IAPR International Workshop on Document Analysis Systems

    We present a comprehensive system for extracting metadata from scholarly articles. In our approach the entire document is inspected, including the headers and footers of all pages as well as the bibliographic references. The system is based on a modular workflow which allows for evaluation, unit testing and replacement of individual components. The workflow is optimized for processing born-digital documents, but may accept scanned document images as well. The machine-learning approaches we have chosen for solving individual tasks increase the ability to adapt to new document layouts and formats. The evaluation tests we have performed showed good results for the individual implementations and for the entire metadata extraction process.

    Other authors
    • Łukasz Bolikowski
    • Artur Czeczko
    • Krzysztof Rusek
  • GROTOAP: ground truth for open access publications

    JCDL '12 Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries

    The field of digital document content analysis includes many important tasks, for example page segmentation and zone classification. It is impossible to build effective solutions for such problems and evaluate their performance without a reliable test set that contains both input documents and the expected results of segmentation and classification. In this paper we present GROTOAP, a test set useful for training and performance evaluation of page segmentation and zone classification tasks. The test set contains input articles in digital form and the corresponding ground truth files. All input documents included in the test set have been selected from the DOAJ database, which indexes articles published under the CC-BY license. The whole test set is available under the same license.

    Other authors
    • Artur Czeczko
    • Krzysztof Rusek
    • Łukasz Bolikowski
    • Roman Bogacewicz
  • Workflow of Metadata Extraction from Retro-Born Digital Documents

    Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011

    In this work-in-progress report we propose a workflow for metadata extraction from articles in digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of each. We report the progress of implementation and testing, and outline future work.

    Other authors
    • Łukasz Bolikowski

Honors & Awards

  • ESWC 2015 SemPub Best Performing Approach Award

    Semantic Publishing Challenge at 12th Extended Semantic Web Conference

    CERMINE, the tool for mining scientific publications, won the Best Performing Approach Award at the Semantic Publishing Challenge hosted by the 12th Extended Semantic Web Conference (http://2015.eswc-conferences.org/).

  • DAS 2014 Best Student Paper Award

    11th IAPR International Workshop on Document Analysis Systems

    The paper entitled "CERMINE - automatic extraction of metadata and references from scientific literature" won the Best Student Paper Award at the Document Analysis Systems conference (http://das2014.sciencesconf.org/resource/page/id/27).

Languages

  • Polish

    Native or bilingual proficiency

  • English

    Professional working proficiency
