Franco Niccolucci & Achille Felicetti
(PIN, University of Florence, Italy)
EOSC-hub Week 2018
Malaga, 16/4/2018
EOSCpilot is a project funded by the EC H2020 programme
 Domain: Archaeology
 Goal: semantic enrichment of texts
 Archaeological documentation largely based on texts
◦ Excavation diaries, reports, surveys, grey literature
◦ Literary/historical sources, research articles, monographs …
◦ Huge number of small (<100 KB) files in different languages
 Registry of 2,000,000 archaeological datasets (70% texts) in ARIADNE
 ARIADNE’s data infrastructure popular among archaeologists
◦ ARIADNE users in 2016: 25-30% of the European research community
◦ Strong support by
 Professional associations (EAA, EAC) & national archaeological/cultural heritage authorities
 National research institutions (CNR, CNRS, CAS, ÖAW, KNAW, BAS, ATHENA RC, FORTH)
 International recognition (USA, Mexico, Japan, Argentina)
 Needed for cloud-based data infrastructure to be developed in ARIADNEplus
◦ Deeper integration between texts, databases, GIS etc.
◦ Advanced services & VREs for data-centric archaeological research
 Open-source NLP & NER engine
 Syntactic rules (tailored to specific writing style)
 Texts stating facts, not stories
◦ Data fuzziness, provenance, reliability, reasoning
 Domain ontology: CIDOC CRM (ISO 21127:2006)
◦ ... and not TEI
 Terminology
◦ Specialized vocabularies
 Terra sigillata is not just “sealed earth”
◦ Gazetteers for modern (Geonames) and ancient (Pleiades) place names
 Málaga (modern) vs Màlaka (Phoenician) vs Màlaca (Roman)
◦ Named time period management
 Bronze Age (∼ 3200-600 BC), Recent Orientalizing Period (∼ 630-570 BC)
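Named time period management can be pictured as a lookup from a period name to a year range. The following is a minimal sketch, not TextCrowd's implementation: the table and the function names are hypothetical, and only the two ranges quoted above come from the slide.

```python
import re

# Hypothetical period table; the two ranges are the approximate ones quoted
# on the slide. Negative years denote BC.
PERIODS = {
    "bronze age": (-3200, -600),
    "recent orientalizing period": (-630, -570),
}


def resolve_period(name):
    """Return (start_year, end_year) for a named period, or None if unknown."""
    key = re.sub(r"\s+", " ", name.strip().lower())
    return PERIODS.get(key)


def periods_overlap(a, b):
    """True when both periods are known and share at least one year."""
    range_a, range_b = resolve_period(a), resolve_period(b)
    if range_a is None or range_b is None:
        return False
    return range_a[0] <= range_b[1] and range_b[0] <= range_a[1]
```

A real component would also have to handle the fact that the same period name covers different years in different regions, which is why an external source of period definitions is useful.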
 Modular framework based on GATE toolchain: https://gate.ac.uk
◦ Advanced stemming/lemmatization components
 OpenNLP (https://opennlp.apache.org): sentence segmentation and part-of-speech (POS) tagging
 OpeNER (http://www.opener-project.eu): neural network for advanced named entity recognition (NER), developed in the OpeNER FP7 project
◦ Machine learning framework for automatic training
 Annotated corpus required
 Ontology: CRMarcheo (CRM extension for archaeology)
 Vocabularies, gazetteers and terminological tools
◦ ICCD vocabularies for Italian archaeology, augmented with purpose-built term lists
◦ Geonames (modern places), Pleiades (historical places)
◦ Timespan and named period component based on PeriodO
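To illustrate how sentence segmentation and vocabulary lookup fit together, here is a minimal pure-Python sketch. It is a stand-in only: it does not use the GATE, OpenNLP or OpeNER APIs, the gazetteer entries are hypothetical, and a regular-expression lookup is far simpler than the trained NER model described above.

```python
import re

# Hypothetical mini-gazetteer: lower-cased surface form -> entity type.
GAZETTEER = {
    "terra sigillata": "Artefact",
    "bronze age": "TimePeriod",
    "málaga": "Place",
    "bronze": "Material",
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tag_entities(sentence):
    """Return (start, end, surface, type) tuples found by gazetteer lookup."""
    # Longer terms come first so that "bronze age" wins over "bronze".
    terms = sorted(GAZETTEER, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), GAZETTEER[m.group(0).lower()])
        for m in pattern.finditer(sentence)
    ]
```

For example, `tag_entities("A bronze blade from the Bronze Age.")` yields one Material match for "bronze" and one TimePeriod match for "Bronze Age".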
 TextCrowd detects:
◦ Artefacts
◦ Colours
◦ Materials
◦ Time periods
◦ Persons
◦ Places
◦ Sites
◦ Time spans
◦ Techniques
 Target output formats:
◦ Textual documents automatically annotated and enriched
◦ CIDOC CRM semantic triples (RDF)
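The RDF output can be pictured with a small sketch that serialises detected entities as N-Triples typed with CIDOC CRM classes. The category-to-class mapping, the `example.org` base IRI and the function names are assumptions made for illustration; the slides do not state which CRM classes TextCrowd actually emits.

```python
CRM = "http://www.cidoc-crm.org/cidoc-crm/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

# Assumed mapping from detected categories to CIDOC CRM classes.
# E22 is named "Man-Made Object" in CRM releases before version 7.
CRM_CLASS = {
    "Artefact": "E22_Man-Made_Object",
    "Material": "E57_Material",
    "Person": "E21_Person",
    "Place": "E53_Place",
    "TimePeriod": "E4_Period",
    "TimeSpan": "E52_Time-Span",
}


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(entities, base="http://example.org/entity/"):
    """Serialise (surface, category) pairs as typed, labelled resources."""
    lines = []
    for i, (surface, category) in enumerate(entities, start=1):
        subject = f"<{base}{i}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{CRM}{CRM_CLASS[category]}> .")
        lines.append(f'{subject} <{RDFS_LABEL}> "{escape_literal(surface)}" .')
    return "\n".join(lines)
```

Each entity becomes two triples: one giving its CRM class and one carrying the text span it was detected from.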
 No annotated text corpora available in Italian to be used as training data for
machine learning algorithms
◦ Manual annotation of 400 pages of Italian archaeology reports (< 1 Person-Month)
 Preparation and adaptation of vocabularies
 Availability of user-friendly cloud-based environments and of necessary tools, to
migrate standalone prototype to cloud
◦ Several cloud solutions tested in early development, limited support provided except in
D4Science
◦ Implementation in D4Science infrastructure, but portable to other cloud services if support and
required modules available
 Authentication and Authorization
◦ No access control to metadata/data implemented so far
◦ Demonstrator focused on freely accessible textual documents
◦ Fasti Online (http://www.fastionline.org), an Open Access collection of archaeological reports, used as the source
 Operated and maintained by CNR-ISTI on the D4Science platform
https://www.d4science.org
 Modular engine based on GATE toolchain + OpenNLP-OpeNER
modules, natively provided by D4Science
 Web-based user interface for
◦ User and access management
◦ Cloud storage (private and shared files)
◦ Results available for other Virtual Research Environments (VRE) within D4Science
 Released for open use, for tests & comments
 Deliberately plain interface, so that it can adapt to any look and feel
 Machine-readable results: RDF encoding produced
 Human-readable results: color-encoded text (for testing)
 Interoperability of extracted knowledge
◦ Semantic information in CRM format: full integration and interoperability with
other archaeological semantic data (to be fully implemented in ARIADNEplus)
 Supporting FAIR Principles implementation
◦ Metadata to be stored in various registries for easy findability and accessibility
◦ Results ready to be reused within the same environment or consumed by other
services and/or in different scenarios
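The colour-coded, human-readable view mentioned above could be produced along these lines. This is only a sketch under stated assumptions: the colour table, the HTML output and the `highlight` function are hypothetical and do not describe the actual TextCrowd renderer.

```python
import html

# Hypothetical colour per entity category; unknown categories fall back to grey.
COLOURS = {"Artefact": "#ffd54f", "Place": "#81d4fa", "TimePeriod": "#a5d6a7"}


def highlight(text, annotations):
    """Wrap annotated spans in coloured <span> tags.

    annotations: non-overlapping (start, end, category) tuples.
    """
    out, cursor = [], 0
    for start, end, category in sorted(annotations):
        # Copy the plain text between the previous span and this one.
        out.append(html.escape(text[cursor:start]))
        colour = COLOURS.get(category, "#e0e0e0")
        out.append(
            f'<span style="background:{colour}" title="{html.escape(category)}">'
            f"{html.escape(text[start:end])}</span>"
        )
        cursor = end
    out.append(html.escape(text[cursor:]))
    return "".join(out)
```

The same annotation tuples can feed both this view and the RDF encoding, which keeps the human-readable and machine-readable results consistent.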
 TEXTCROWD has proven useful for its main purpose: demonstrating
the importance and usefulness of EOSC for scientific research in the cultural
heritage domain
 Adoption by other research teams in the EOSCpilot framework
◦ Integration of TEXTCROWD with new VisualMedia Demonstrator: a service for
sharing and visualizing visual media files on the web - automatic metadata extraction
from controlled lists or textual documents for 2D and 3D models
 Testing on real use cases in progress
◦ Open Access papers of the Italian journal Archeologia e Calcolatori, ongoing
 Clean visualization
 Language extension
◦ English, Dutch: from standalone to cloud-based (annotated corpora available)
◦ French, Spanish, German: new from scratch (annotated corpora to be prepared)
◦ Other EU languages: OpeNER extension required
 Additional work required to suit it to everyday use – but not too much
 TEXTCROWD Official Pages:
https://eoscpilot.eu/science-demos/textcrowd
https://textcrowd.d4science.org
 TEXTCROWD Pilot:
https://services.d4science.org/group/textcrowd/data-miner
(registration required)
1. Upload the file(s) to analyze
2. Launch TextCrowd
3. Select the file(s) to process
4. Collect the results
Franco Niccolucci: franco.niccolucci@gmail.com – Achille Felicetti: achille.felicetti@pin.unifi.it

Franco Niccolucci: Example of an EOSCpilot Science Demonstrator - TextCrowd
