Multi-threaded web
crawler in Ruby
Hi,
I’m Kamil Durski, Senior Ruby Developer at Polcode
If improving Ruby skills is what you’re after, stick around. I’ll
show you how to use multiple threads to drastically increase
the efficiency of your application.
As I focus on threads, only the relevant code will be displayed in the slideshow.
Find the full source here.
The (much) underestimated threads
Ruby programmers have easy access to threads thanks to
built-in support.
Threads can be very useful, yet for some reason they don’t
receive much love.
Where can you use threads to see their prowess first-hand?
Crawling the web is a perfect example! Threads save much of the
time you’d otherwise spend waiting for data from a remote server.
I’m going to build a simple app so you can really understand
the power of threads. It will fetch info on some popular U.S.
TV shows (the one with dragons and the one with an ex-chemistry
teacher too!) from a bunch of websites.
But before we take a look at the code, let’s start with a few
slides of good old theory.
What’s the difference between
a thread and a process?
A multi-threaded app is capable of doing a lot of things at the
same time.
That’s because the app has the ability to switch between
threads, letting each of them use some of the process time.
But it’s still a single process.
The same thing goes for running many apps on a single-core
processor. It’s the operating system that does the switching.
Another big difference
Use threads within a single process and you can share memory
and variables between all of them, making development easier.
Use multiple processes and processor cores and it’s no longer
the case – sharing data gets harder.
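To make the shared-memory point concrete, here is a minimal sketch (not code from the original slides). The method name collect_from_threads is made up for this illustration, and the Mutex it uses is one of the thread-library components described later in the deck.

```ruby
# Threads in one process see the same objects, so they can all
# append to a single shared array.
# NOTE: collect_from_threads is a hypothetical helper for illustration.
def collect_from_threads(count)
  shared = []        # one object, visible to every thread
  lock = Mutex.new   # guards concurrent writes to the array

  threads = count.times.map do |i|
    Thread.new do
      lock.synchronize { shared << "result from thread #{i}" }
    end
  end

  threads.each(&:join) # wait until every thread has written its entry
  shared
end

puts collect_from_threads(3).sort
```

With separate processes each one would get its own copy of the array, and the results would have to be passed back through pipes, files or a database instead.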
Check Wikipedia to find out more on threads.
Now we can go back to the TV shows. Aside from Ruby on Rails’
Active Record library for database access, all I’m going to use
are:
Three components from Ruby’s thread library:
1) Thread – the core class that runs multiple parts of code at
   the same time,
2) Queue – this class lets me schedule jobs to be picked up by
   all the threads,
3) Mutex – the role of the Mutex component is to synchronize
   access to shared resources. While one thread holds the
   lock, the others wait, so a protected section can’t be
   interrupted halfway through by another thread.
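The three classes can be seen working together in a short sketch. This is only an illustration under my own assumptions: the method name sum_with_threads and the idea of summing numbers are invented here and have nothing to do with the crawler itself.

```ruby
# One job per thread: the Queue hands out the numbers, each Thread
# takes exactly one, and the Mutex protects the shared total.
# NOTE: sum_with_threads is a hypothetical helper for illustration.
def sum_with_threads(numbers)
  queue = Queue.new
  numbers.each { |n| queue << n }   # schedule the jobs

  mutex = Mutex.new
  total = 0

  threads = numbers.size.times.map do
    Thread.new do
      n = queue.pop                     # every thread finds exactly one job
      mutex.synchronize { total += n }  # only one thread updates at a time
    end
  end

  threads.each(&:join)
  total
end

puts sum_with_threads([1, 2, 3, 4]) # prints 10
```

Because there are exactly as many threads as jobs, the blocking form of pop is safe here; a pool with fewer threads than jobs needs a different way to stop.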
The app itself is also divided into three major components:
1) Module
   I’m going to supply the app with a list of modules to
   run. The module creates multiple threads and tells
   the crawler what to do,
2) Crawler
   I’m going to create crawler classes to fetch data
   from websites,
3) Model
   Models will allow me to store and retrieve data
   from the database.
Crawler module
The Crawler module is responsible
for setting up the environment and
connecting to the database.
The autoload calls refer to major
components inside the lib/
directory. The setup_env method
connects to the database, adds the
app/ directories to the $LOAD_PATH
variable and loads all of the files
under the app/ directory.
A new Mutex instance is stored
inside of the @mutex variable.
We can access it through
Crawler.mutex.
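The slide’s code isn’t reproduced in this transcript, so here is only a rough sketch of the mutex part. The names Crawler, @mutex and Crawler.mutex come from the text above; the use of attr_reader is my assumption, and the autoload calls and setup_env are left out because they depend on files and a database that aren’t shown.

```ruby
module Crawler
  # One Mutex for the whole app, created when the module is loaded.
  @mutex = Mutex.new

  class << self
    # Exposes the module-level @mutex as Crawler.mutex.
    # (Assumption: the original may define this accessor differently.)
    attr_reader :mutex
  end
end

# Every caller gets the same lock object back.
puts Crawler.mutex.equal?(Crawler.mutex) # prints true
```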
Crawler::Threads class
core feature
Now I’m going to create the core
feature of the app. I’m initializing a
few variables - @size, to know how
many threads to spawn, @threads
array to keep track of the threads,
and @queue to store the jobs to do.
I’m calling the #add method to add
each job to the queue. It accepts
optional arguments and a block.
If you’re not familiar with blocks
in Ruby, look the concept up first.
Next, the #start method initializes
threads and calls #join on each of
them. It’s essential for the whole app
to work – otherwise, once the main
thread is done with its job, it would
instantly kill the spawned threads and
exit before they finish their jobs.
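A small sketch of why #join matters, using nothing but Thread and Queue. The method name run_and_wait and the 0.05 s sleep (standing in for slow work) are assumptions made for this illustration.

```ruby
# NOTE: run_and_wait is a hypothetical helper for illustration.
def run_and_wait(count)
  finished = Queue.new # thread-safe place to record completed work

  threads = count.times.map do |i|
    Thread.new do
      sleep 0.05       # stands in for a slow network request
      finished << i
    end
  end

  # Without this line the method could return (and a script could exit)
  # while the threads are still sleeping.
  threads.each(&:join)
  finished.size
end

puts run_and_wait(3) # prints 3
```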
To complete the core functionality,
I’m calling the #pop method on the
queue to take a block and then run
the block with the arguments from
the earlier #add method. The true
argument makes sure that it runs in
a non-blocking mode. Otherwise, I
would run into a deadlock, with the
thread waiting for a new job to be
added even after the queue is already
emptied (eventually throwing
an application error: “No live threads
left. Deadlock?”).
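Putting the last few slides together, the class might look roughly like the sketch below. It is a reconstruction from the description, not the author’s actual code: @size, @threads, @queue, #add, #start and pop(true) are named in the text, while the default size, the private work method and storing each job as a [block, args] pair are my assumptions.

```ruby
module Crawler
  class Threads
    attr_reader :size, :threads, :queue

    def initialize(size = 10) # default of 10 is an assumption
      @size = size       # how many threads to spawn
      @threads = []      # keeps track of the spawned threads
      @queue = Queue.new # jobs waiting to be done
    end

    # Stores the block together with its arguments as one job.
    def add(*args, &block)
      queue << [block, args]
    end

    # Spawns the threads and waits for all of them to finish.
    def start
      size.times { threads << Thread.new { work } }
      threads.each(&:join)
    end

    private

    # Hypothetical helper: keeps taking jobs until the queue is empty.
    def work
      loop do
        block, args = queue.pop(true) # true = non-blocking
        block.call(*args)
      end
    rescue ThreadError
      # pop(true) raises ThreadError on an empty queue: nothing left to do.
    end
  end
end

pool = Crawler::Threads.new(3)
titles = Queue.new
%w[first second third fourth].each do |name|
  pool.add(name) { |n| titles << n.upcase }
end
pool.start
puts titles.size # prints 4
```

One trade-off of this sketch: a job that itself raises ThreadError would also end its worker quietly, so real code may want to rescue more narrowly.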
I can use the Crawler::Threads
class to crawl multiple pages at the
same time.
Now I can run some code to see what
all of it amounts to:
10 seconds to visit 10 pages and fetch
some basic information. Alright, now
I’m going to try 10 threads.
All it took to do the same task was 1.51 s!
The app no longer wastes time doing nothing while waiting for
the remote server to deliver data.
Interestingly, the output order is different too – with a
single thread it’s the same as in the config file. With
multiple threads it’s random, as some threads do their job
faster.
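The effect is easy to reproduce without touching the network. In the sketch below sleep stands in for waiting on a remote server; fake_fetch, crawl and timed are hypothetical names, and the 0.1 s delay is an arbitrary choice, so the numbers will not match the 10 s and 1.51 s measured above.

```ruby
# Stand-in for an HTTP request: the thread just waits.
def fake_fetch(delay)
  sleep delay
end

# "Visits" the given number of pages using thread_count worker threads.
def crawl(pages, thread_count, delay)
  queue = Queue.new
  pages.times { |i| queue << i }

  workers = thread_count.times.map do
    Thread.new do
      loop do
        begin
          queue.pop(true)   # non-blocking; raises when the queue is empty
        rescue ThreadError
          break
        end
        fake_fetch(delay)
      end
    end
  end
  workers.each(&:join)
end

# Returns the wall-clock seconds the block took.
def timed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

single   = timed { crawl(10, 1, 0.1) }
threaded = timed { crawl(10, 10, 0.1) }
puts format('1 thread: %.2f s, 10 threads: %.2f s', single, threaded)
```

The gain comes from overlapping the waiting, so it only shows up for work that spends most of its time idle, such as network requests.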
Thread safety
The code I used outputs information
using puts. It’s not a thread-safe
way of doing this, as it does two
separate things:
   - outputs a given string,
   - then outputs the new line (NL)
     character.
This may cause random instances of
NL characters appearing out of place,
as the thread switches in the middle
and another assumes control before
the NL character is printed. See the
example below:
I fixed this with a mutex by creating
a custom #log method that wraps the
console output in it:
Now the console output is always
in order, as each thread waits for
the current puts to finish.
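A sketch of what such a #log method could look like. Crawler.mutex and the name log come from the slides; defining log on the Crawler module and the optional io parameter (added so the output can be captured in a test) are my assumptions.

```ruby
module Crawler
  @mutex = Mutex.new

  class << self
    attr_reader :mutex

    # Writes one complete line while holding the lock, so no other
    # thread can print between the text and its newline.
    # (The io parameter is a hypothetical addition for testing.)
    def log(message, io = $stdout)
      mutex.synchronize { io.puts(message) }
    end
  end
end

threads = 5.times.map do |i|
  Thread.new { Crawler.log("fetched page #{i}") }
end
threads.each(&:join)
```

The lines can still appear in any order, since the threads finish at different times, but each line stays in one piece.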
And that’s it.
Now you know more about how threads work.
I wrote this code as a side project, the topic of web crawling
being an important part of what I do. The previous version
included more features, such as the usage of proxies and TOR
network support. The latter improves anonymity but also slows
down the code a lot.
Thanks for your time and, again, feel free to tackle the entire
code at:
https://github.com/kdurski/crawler
