CrawlerLD - 
Distributed Crawler 
for Linked Data 
RAPHAEL DO VALE
Summary 
Introduction 
Until now 
Issues 
Large Memory Footprint 
Graphical Interface
Introduction 
How can we recommend linked data sources to a beginner user? 
◦ Data sources may not use popular ontologies. 
◦ There might be more than one ontology for the same domain. 
◦ The user may not know all (if any) of the ontologies. 
Introduction 
Our solution: 
◦ Create a recommender system that receives a small set of generic URI 
resources and returns a complete report of related resources (URIs, Datasets 
and Ontologies). 
◦ Why generic? Because our user is a beginner exploring Linked Data. They do not need 
to know about specific datasets or ontologies, only how to get started. 
◦ The recommender system would benefit from a Linked Data crawler, based 
on metadata. 
Introduction 
Metadata focused crawler 
◦ INPUT: 
◦ User should summarize the desired domain with a small set of related terms (URI Resources). 
◦ OUTPUT: 
◦ The tool returns a list of vocabulary terms, as well as provenance data indicating how the output 
was generated. 
◦ With the output results, the user should evaluate the most relevant 
vocabularies for the triplification or linkage process. 
◦ This step could be manual or use another tool (e.g., a recommender system). 
Introduction 
Our solution: 
◦ Executes several SPARQL Queries over all the LOD Cloud (Linked Open Data 
Cloud). 
◦ For each dataset, applies several queries trying to discover relationships 
between datasets and the crawling resource. 
◦ A breadth-first algorithm is used to discover more data in cycles. 
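The cycle-by-cycle crawl can be sketched as below. This is a minimal illustration in Python, not CrawlerLD's code: the `links` dictionary is a toy stand-in for whatever the SPARQL queries would discover, and `crawl_by_levels`, `max_levels` and the `ex:` resources are hypothetical names.

```python
def crawl_by_levels(seeds, links, max_levels):
    """Breadth-first crawl: each cycle expands every resource found in the previous one.

    `links` is a toy stand-in for the relationships the SPARQL queries
    would discover (assumption: resource -> list of related resources).
    """
    seen = set(seeds)
    levels = [sorted(seen)]              # level 0 is the user's seed terms
    frontier = list(levels[0])
    for _ in range(max_levels):
        next_frontier = []
        for resource in frontier:
            for related in links.get(resource, []):
                if related not in seen:  # never crawl a resource twice
                    seen.add(related)
                    next_frontier.append(related)
        if not next_frontier:            # nothing new was found: stop early
            break
        levels.append(next_frontier)
        frontier = next_frontier
    return levels


toy_links = {
    "ex:Music": ["ex:Album", "ex:Artist"],
    "ex:Album": ["ex:Track", "ex:Artist"],
    "ex:Artist": ["ex:Band"],
}
result = crawl_by_levels(["ex:Music"], toy_links, max_levels=2)
```

Each inner list in `result` is one level, so the number of levels is simply how many cycles the crawl is allowed to run.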
Until now 
Simplified Workflow: 
[Diagram: List of Terms, Processor, Mediator]
Until now 
Processors: 
◦ Each way of retrieving data from Linked Data is mapped to a processor. 
◦ Small pieces of code that can be plugged and unplugged. 
◦ Any user can create a new processor. 
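One way to picture "plugged and unplugged" is a registry of functions. The sketch below is an assumption about the general idea, not CrawlerLD's actual plugin mechanism; `PROCESSORS`, `processor`, `run_all` and both example processors are hypothetical, and their bodies are placeholders rather than real dereferencing or SPARQL calls.

```python
from typing import Callable, Dict, List

# Registry of processors; each maps a resource URI to a list of findings.
PROCESSORS: Dict[str, Callable[[str], List[str]]] = {}


def processor(name):
    """Decorator that plugs a function into the registry under `name`."""
    def register(func):
        PROCESSORS[name] = func
        return func
    return register


@processor("dereference")
def dereference(uri):
    # Placeholder: a real processor would fetch the URI and parse the RDF.
    return [f"dereferenced {uri}"]


@processor("count_instances")
def count_instances(uri):
    # Placeholder: a real processor would run a SPARQL COUNT query.
    return [f"counted instances of {uri}"]


def run_all(uri):
    """Apply every plugged-in processor to one resource."""
    return {name: func(uri) for name, func in PROCESSORS.items()}


results = run_all("ex:Music")
PROCESSORS.pop("count_instances")        # unplugging is just removing the entry
after_unplug = run_all("ex:Music")
```

Adding a new processor then means writing one function and decorating it; nothing else in the crawl loop has to change.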
Until now 
Crawling stages. 
◦ Challenge: based on generic terms, how can we discover more data? 
◦ Answer: using strong relationships (sameAs, subclassOf, seeAlso and 
instanceOf). 
[Diagram: Schema.org, DBpedia, WordNet, Music Ontology, BBC Music; axis labeled "More specific"]
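A query over these strong relationships might look like the sketch below. The mapping from the slide's names to `owl:sameAs`, `rdfs:subClassOf`, `rdfs:seeAlso` and `rdf:type` is an assumption (in particular, reading "instanceOf" as `rdf:type`), and `strong_relationship_query` and the `example.org` URI are hypothetical. The sketch only builds the query text and follows outgoing links; it does not contact any endpoint.

```python
# Assumed mapping from the slide's relationship names to standard predicates.
STRONG_PREDICATES = {
    "sameAs": "owl:sameAs",
    "subclassOf": "rdfs:subClassOf",
    "seeAlso": "rdfs:seeAlso",
    "instanceOf": "rdf:type",
}

PREFIXES = (
    "PREFIX owl: <http://www.w3.org/2002/07/owl#>\n"
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
    "PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\n"
)


def strong_relationship_query(resource_uri, limit=100):
    """Build a SELECT query for resources strongly related to `resource_uri`."""
    values = " ".join(STRONG_PREDICATES.values())
    return (
        f"{PREFIXES}"
        "SELECT DISTINCT ?p ?related WHERE {\n"
        f"  VALUES ?p {{ {values} }}\n"       # restrict ?p to the strong predicates
        f"  <{resource_uri}> ?p ?related .\n"
        "}\n"
        f"LIMIT {limit}"
    )


query = strong_relationship_query("http://example.org/Music", limit=10)
```

The same query text could be sent to each dataset's endpoint, with the `?related` values feeding the next crawling level.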
Issues 
Large Memory Footprint 
◦ A 2-level task with 20 concurrent threads consumes 40 GB of RAM (!!) 
Absence of Graphical Interface 
‘Locked code’ 
◦ Open source on roadmap 
Small number of processors
LARGE MEMORY FOOTPRINT
Identifying the issue 
[Diagram: Processor, ResultSets; stages labeled "Asynchronous" and "Synchronous"]
◦ One request for each dataset 
◦ Over 500 distinct datasets 
◦ Several processors running at the same time 
◦ Each of them with an increasing resultset 
◦ Jena resultset is far from being small
Theoretical Solution 
[Diagram: Processor, ResultSets; both stages labeled "Asynchronous"]
◦ One request for each dataset 
◦ Over 500 distinct datasets 
◦ Several processors running at the same time 
◦ The results are immediately processed 
◦ Even with bigger resultsets, the memory is controlled
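The difference between the two slides can be illustrated with a toy counting model. This is a sketch under stated assumptions, not a measurement of Jena or CrawlerLD: `query_dataset` fakes a SPARQL request, the row counts are invented, and "rows in memory" is simply counted rather than profiled.

```python
def query_dataset(dataset_id, rows_per_dataset):
    """Toy stand-in for one SPARQL request: yields rows one at a time."""
    for i in range(rows_per_dataset):
        yield (dataset_id, i)


def accumulate_then_process(datasets, rows_per_dataset):
    """First pattern: hold every result set until all requests have finished."""
    held = [list(query_dataset(d, rows_per_dataset)) for d in datasets]
    peak_rows_in_memory = sum(len(rs) for rs in held)   # everything is alive at once
    processed = sum(len(rs) for rs in held)
    return processed, peak_rows_in_memory


def process_as_results_arrive(datasets, rows_per_dataset):
    """Second pattern: handle each row immediately and let it go."""
    processed = 0
    peak_rows_in_memory = 0
    for d in datasets:
        for _row in query_dataset(d, rows_per_dataset):
            processed += 1
            peak_rows_in_memory = max(peak_rows_in_memory, 1)  # one row alive at a time
    return processed, peak_rows_in_memory


datasets = range(500)                    # roughly the scale mentioned on the slide
eager = accumulate_then_process(datasets, rows_per_dataset=100)
streamed = process_as_results_arrive(datasets, rows_per_dataset=100)
```

Both functions process the same number of rows; only the number of rows held at the same time differs, which is the point of handling results as they arrive.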
The reactive manifesto 
Reactive Systems are 
◦ Responsive 
◦ The system responds in a timely manner if at all possible 
◦ Resilient 
◦ The system stays responsive in the face of failure 
◦ Elastic 
◦ The system stays responsive under varying workload. 
◦ Message Driven 
◦ Reactive Systems rely on asynchronous message-passing to establish a boundary between 
components that ensures loose coupling, isolation, location transparency, and provides the 
means to delegate errors as messages 
◦ Essentially, reactive systems are event-driven applications where modules 
send events (messages) to other modules. Each module should make its requests 
to other modules asynchronously. 
http://www.reactivemanifesto.org/
Actor model 
The actor model in computer science is a mathematical model of 
concurrent computation that treats "actors" as the universal primitives of 
concurrent computation: in response to a message that it receives, an actor 
can make local decisions, create more actors, send more messages, and 
determine how to respond to the next message received. The actor model 
originated in 1973.[1] It has been used both as a framework for a 
theoretical understanding of computation, and as the theoretical basis for 
several practical implementations of concurrent systems. The relationship 
of the model to other work is discussed in Indeterminacy in concurrent 
computation and Actor model and process calculi. 
http://en.wikipedia.org/wiki/Actor_model 
[1] Carl Hewitt; Peter Bishop; Richard Steiger (1973). "A Universal Modular 
Actor Formalism for Artificial Intelligence". IJCAI. 
http://pt.slideshare.net/drorbr/the-actor-model-towards-better-concurrency
Actor model 
http://codermonkey65.blogspot.com.br/2012/09/actors-in-c-with-nact.html
Akka 
http://akka.io/ 
Java or Scala framework for the Actor Model
Akka 
Comparison with Java’s thread model 
◦ + Simpler 
◦ CrawlerLD worked with two thread pools: 
◦ One to run the system’s algorithm 
◦ Another to make calls to datasets 
◦ Using the same thread pool could block all threads in IO operations 
◦ + No thread blocking 
◦ No need to worry about shared resources 
◦ Each actor runs at most one task at a time 
◦ + Better performance 
◦ No blocking 
◦ Allows distributed computing 
◦ + Better error management 
◦ Actor hierarchy allows supervisor actors to manage errors and even repeat the failed tasks 
◦ Support for transactions (atomic operations between several actors, even if distributed over several 
machines) 
◦ + Configuration can change system behavior without code change 
◦ Change number of allocated threads, create thread pools for different actors, distribute over several 
machines, change message priority without touching the code.
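The "one task at a time, no shared resources" property can be shown with a hand-rolled actor. This is a minimal sketch using Python's standard `queue` and `threading` modules, not Akka's API; `Actor`, `CounterActor`, `tell`, `stop` and `send_many` are hypothetical names chosen for the illustration.

```python
import queue
import threading


class Actor:
    """Minimal actor: private state, a mailbox, and one worker thread."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Let the actor drain its queued messages, then wait for it to exit."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)        # one message at a time, so no locks needed

    def receive(self, message):
        raise NotImplementedError


class CounterActor(Actor):
    def __init__(self):
        self.count = 0                   # touched only by the actor's own thread
        super().__init__()

    def receive(self, message):
        self.count += message


def send_many(actor, n):
    for _ in range(n):
        actor.tell(1)


counter = CounterActor()
senders = [threading.Thread(target=send_many, args=(counter, 1000)) for _ in range(4)]
for s in senders:
    s.start()
for s in senders:
    s.join()
counter.stop()
total = counter.count
```

Four threads send concurrently, yet the counter needs no lock because only the actor's own thread ever mutates it.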
Akka 
Comparison with Java’s thread model 
◦ - Much harder to learn 
◦ New paradigm 
◦ - Not native
Results 
[Diagram: actor hierarchy. Actors: CrawlerLDMainActor, LevelActor, ResourceActor, and the processors (DereferenceProcessor, NumberOfInstancesProcessor, PropertyQueryProcessor, Processor). Messages: Calculate, CalculateResource, LevelFinished, ResourceProcessedFromLevel, ResourceProcessed.]
Results 
[Diagram: SPARQL querying. Actors: Processor, SparqlQuerierMasterActor, SparqlQuerierActor. Messages: Calculate, QueryFinishedMessage, ProcessSparqlOnDataset, SparqlResultset. Components: CrawlerLD, UtilitiesSemanticWeb, Jena (modified version).]
◦ Blocking calls are managed by another Akka Dispatcher 
◦ Critical message: must be processed immediately 
◦ One actor for each dataset
Results 
Complete refactor of the code 
◦ Better organization 
◦ Better understanding 
◦ Bugs found and resolved 
◦ Almost two months to understand the paradigm, change the code and test 
Better performance 
◦ Even under heavy workload, the system stays available 
◦ Each new request is just another message to another actor 
◦ Distributed code made easy 
◦ Each SparqlQuerierActor could run on a separate machine 
◦ Not yet implemented / tested 
(Much) better memory footprint 
◦ A 3-level task ran with at most 1.5 GB of RAM (!!) 
◦ Number of levels or any other parameter does not seem to affect the memory 
footprint
Graphical Interface 
60% completed
Graphical Interface 
New actor message to retrieve task status while running 
[Diagram: CrawlerLDMainActor receives Calculate, GetSimplifiedStatus and GetFullStatus; the status replies are CrawlerLDSimplifiedStatus and CrawlerLDFullStatus]
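Retrieving status while a task runs fits the actor model because the request is just one more message in the mailbox. The sketch below is a hand-rolled illustration with Python's standard library, not the Akka implementation: `MainActor`, `ResourceProcessed`, `SimplifiedStatus`, the `reply_to` field and the `None` shutdown signal are all assumptions, loosely named after the messages on the slide.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass
class ResourceProcessed:
    uri: str


@dataclass
class GetSimplifiedStatus:
    reply_to: queue.Queue                # where the status reply should be sent


@dataclass
class SimplifiedStatus:
    processed: int


class MainActor:
    """Toy main actor: counts processed resources and answers status requests."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._processed = 0
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is None:          # None is this sketch's shutdown signal
                break
            if isinstance(message, ResourceProcessed):
                self._processed += 1
            elif isinstance(message, GetSimplifiedStatus):
                message.reply_to.put(SimplifiedStatus(self._processed))

    def stop(self):
        self._mailbox.put(None)
        self._thread.join()


main = MainActor()
for n in range(3):
    main.tell(ResourceProcessed(f"ex:resource{n}"))

inbox = queue.Queue()
main.tell(GetSimplifiedStatus(reply_to=inbox))   # a status request is just another message
status = inbox.get(timeout=5)
main.stop()
```

Because the mailbox is processed in order, the reply reflects every `ResourceProcessed` message queued before the status request, and the caller never touches the actor's internal counter directly.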
Graphical Interface 
Allows creation and monitoring of the tasks 
Takes advantage of actor model 
Anyone will be able to create new tasks 
URL available soon
Questions?
