Activity
-
And we find ourselves at the final session of #Crossref2025 with a fascinating panel on "Research Nexus in the real world: What is the impact and…
Liked by Dominika Tkaczyk
-
Why did Crossref need a dedicated Data Science team? Dominika Tkaczyk, Director of Technology, explains why and gives the team's mission as "The…
Liked by Dominika Tkaczyk
-
I’m thrilled to be joining Comarch this November as Chief AI Officer. We are living in extraordinary times, at the cusp of the greatest…
Liked by Dominika Tkaczyk
Experience
Education
-
Systems Research Institute, Polish Academy of Sciences
Licenses & Certifications
Publications
-
CERMINE: automatic extraction of structured metadata from scientific literature
International Journal on Document Analysis and Recognition
CERMINE is a comprehensive open-source system for extracting structured metadata from scientific articles in born-digital form. The system is based on a modular workflow whose loosely coupled architecture allows individual components to be evaluated and adjusted, enables effortless improvement and replacement of independent parts of the algorithm, and facilitates future expansion of the architecture. Most steps are implemented with supervised and unsupervised machine learning techniques, which simplifies adapting the system to new document layouts and styles. An evaluation of the extraction workflow on a large dataset showed good performance for most metadata types, with an average F-score of 77.5%. The CERMINE system is available under an open-source licence and can be accessed at http://cermine.ceon.pl. In this paper, we outline the overall workflow architecture and provide details about the implementation of individual steps. We also thoroughly compare CERMINE to similar solutions, describe the evaluation methodology, and report its results.
-
Extracting Contextual Information from Scientific Literature Using CERMINE System
Semantic Web Evaluation Challenges
-
CERMINE — automatic extraction of metadata and references from scientific literature
11th IAPR International Workshop on Document Analysis Systems
CERMINE is a comprehensive open-source system for extracting metadata and parsed bibliographic references from scientific articles in born-digital form. The system is based on a modular workflow whose architecture allows each step to be trained and evaluated on its own, enables effortless modification and replacement of individual components, and simplifies further expansion of the architecture. Most steps are implemented with supervised and unsupervised machine-learning techniques, which simplifies adjusting the system to new document layouts. The paper describes the overall workflow architecture, provides details about individual implementations, and reports the evaluation methodology and results. The CERMINE service is available at http://cermine.ceon.pl.
-
GROTOAP2 — The Methodology of Creating a Large Ground Truth Dataset of Scientific Articles
D-Lib Magazine
Scientific literature analysis improves knowledge propagation and plays a key role in understanding and assessing scholarly communication in the scientific world. In recent years, many tools and services for analysing the content of scientific articles have been developed. One of the most important tasks in this research area is understanding the roles of different parts of the document. It is impossible to build effective solutions for problems related to document fragment classification, or to evaluate their performance, without a reliable test set that contains both input documents and the expected results of classification. In this paper we present GROTOAP2, a large dataset of ground truth files containing labelled fragments of scientific articles in PDF format, useful for training and evaluating document content analysis solutions. GROTOAP2 was successfully used for training CERMINE, our system for extracting metadata and content from scientific articles. The dataset is based on articles from the PubMed Central Open Access Subset. GROTOAP2 is published under an Open Access license. The semi-automatic method used to construct GROTOAP2 is scalable and can be adjusted for building large datasets from other data sources. The article presents the content of GROTOAP2, describes the entire creation process, and reports the evaluation methodology and results.
-
Large Scale Citation Matching Using Apache Hadoop
Research and Advanced Technology for Digital Libraries, volume 8092 of Lecture Notes in Computer Science, Springer Berlin Heidelberg
During the process of citation matching, links from bibliography entries to referenced publications are created. Such links are indicators of topical similarity between linked texts, are used in assessing the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it up to handle large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.
-
Methodology for evaluating citation parsing and matching
Intelligent Tools for Building a Scientific Information Platform, volume 467 of Studies in Computational Intelligence, Springer Berlin Heidelberg
Bibliographic references between scholarly publications contain valuable information for researchers and developers involved with digital repositories. They indicate topical similarity between linked texts and the impact of the referenced document, and they improve navigation in the user interfaces of digital libraries. Consequently, several approaches to extracting, parsing and resolving such references have been proposed to date. In this paper we develop a methodology for evaluating parsing and matching algorithms and choosing the most appropriate one for the document collection at hand. We apply the methodology to evaluate the reference parsing and matching module of the YADDA2 software platform.
-
A Modular Metadata Extraction System for Born-Digital Articles
10th IAPR International Workshop on Document Analysis Systems
We present a comprehensive system for extracting metadata from scholarly articles. In our approach the entire document is inspected, including headers and footers of all the pages as well as bibliographic references. The system is based on a modular workflow which allows for evaluation, unit testing and replacement of individual components. The workflow is optimized towards processing of born-digital documents, but may accept scanned document images as well. The machine-learning approaches we have chosen for solving individual tasks increase the ability to adapt to new document layouts and formats. The evaluation tests we have performed showed good results of the individual implementations and the entire metadata extraction process.
-
GROTOAP: ground truth for open access publications
JCDL '12 Proceedings of the 12th ACM/IEEE-CS joint conference on Digital Libraries
The field of digital document content analysis includes many important tasks, for example page segmentation and zone classification. It is impossible to build effective solutions for such problems and evaluate their performance without a reliable test set that contains both input documents and the expected results of segmentation and classification. In this paper we present GROTOAP, a test set useful for training and performance evaluation of page segmentation and zone classification tasks. The test set contains input articles in digital form and the corresponding ground truth files. All input documents included in the test set have been selected from the DOAJ database, which indexes articles published under the CC-BY license. The whole test set is available under the same license.
-
Workflow of Metadata Extraction from Retro-Born Digital Documents
Towards a Digital Mathematics Library. Bertinoro, Italy, July 20-21st, 2011
In this work-in-progress report we propose a workflow for metadata extraction from articles in a digital form. We decompose the problem into clearly defined sub-tasks and outline possible implementations of the sub-tasks. We report the progress of implementation and tests, and state future work.
Honors & Awards
-
ESWC 2015 SemPub Best Performing Approach Award
Semantic Publishing Challenge at 12th Extended Semantic Web Conference
CERMINE, the tool for mining scientific publications, won the Best Performing Approach Award at the Semantic Publishing Challenge hosted by the 12th Extended Semantic Web Conference (http://2015.eswc-conferences.org/).
-
DAS 2014 Best Student Paper Award
11th IAPR International Workshop on Document Analysis Systems
The paper entitled "CERMINE - automatic extraction of metadata and references from scientific literature" won the Best Student Paper Award at the Document Analysis Systems conference (http://das2014.sciencesconf.org/resource/page/id/27).
Languages
-
Polish
Native or bilingual proficiency
-
English
Professional working proficiency
More activity by Dominika
-
Another new open dataset just dropped! Last time it was affiliations, now it's GRANTS! Here are over 250,000 Crossref grant<>publication matches for…
Liked by Dominika Tkaczyk
-
If Carlsberg did jobs... Probably the best role in #ScholarlyPublishing. Public Knowledge Project is hiring a Managing Director, responsible for…
Liked by Dominika Tkaczyk
-
We've got metadata. Lots of it. And we need a Program Technical Lead to help keep it all connected, open, and sustainable. ✔ Work remotely ✔ Lead a…
Liked by Dominika Tkaczyk
-
We are hiring a remote Program Technical Lead at Crossref to help shape the future of open infrastructure for global scholarly communication. This…
Shared by Dominika Tkaczyk
-
Sooo, this happened (it actually did). Most reactions have been "Wait, I thought Crossref was already in the cloud". Well, nope, we just talked…
Liked by Dominika Tkaczyk
-
🚀 Crossref is hiring a DevOps Engineer! Join our fully remote, mission-driven team and help build and support critical infrastructure for the…
Liked by Dominika Tkaczyk
-
I was fortunate to be a panellist at an excellent event organised by Institute of International and European Affairs on Ireland and AI-readiness. My…
Liked by Dominika Tkaczyk
-
Crossref is hosting a pub watch party on the last afternoon of the Metascience 2025 conference, screening and discussing two of the pre-conference…
Liked by Dominika Tkaczyk