Scrapy internals
Alexander Sibiryakov, 16-17 July 2017, PyConRU 2017
sibiryakov@scrapinghub.com
Talk scope
• Design of complex asynchronous
application,
• Flow-control issues,
• open source life.
Scrapy: web scraping
• extraction of structured data,
• Selecting and extracting data from HTML/XML
(CSS, XPath, regexps) → Parsel
• Interactive shell,
• Feed exports in JSON, CSV, XML and storing in
FTP, S3, local fs,
• Robust encoding support and auto-detection,
Main features
• Extensible: spiders, signals, middlewares,
extensions, and pipelines,
• Cookies,
• Robots.txt,
• Form submission,
• Telnet console,
• Graceful shutdown by signal
Scrapy architecture
Twisted
• Event-driven network programming
framework
• Event loop and Deferreds ("promises")
• Protocols and transport:
• TCP, UDP, SSL, UNIX sockets
• HTTP, DNS, SMTP/IMAP, IRC
• Cross platform
Creator of Twisted
Glyph Lefkowitz
> self._nameResolver = _SimpleResolverComplexifier(resolver)
(from the Twisted source code)
Twisted event loop
https://stackoverflow.com/questions/35111265/how-does-pythons-twisted-reactor-work
https://www.cs.cmu.edu/~adamchik/15-121/lectures/Binary%20Heaps/heaps.html
events:
[e1: Event, e2: Event, … eN]
Event:
func, args, desired_time
events are kept in a binary heap ordered by desired_time, so finding the minimum is O(1)
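The event list above can be sketched with the standard heapq module. This is a toy model under stated assumptions, not Twisted's reactor: the class name TinyLoop, the method names, and the simulated clock are all hypothetical, and real I/O polling is left out entirely.

```python
import heapq
import itertools


class TinyLoop:
    """Toy event loop: timed calls kept in a binary heap ordered by due time."""

    def __init__(self):
        self.now = 0.0                  # simulated clock, advanced by run()
        self._events = []               # heap of (desired_time, seq, func, args)
        self._seq = itertools.count()   # tie-breaker: equal times keep insertion order

    def call_later(self, delay, func, *args):
        event = (self.now + delay, next(self._seq), func, args)
        heapq.heappush(self._events, event)             # O(log n)

    def run(self):
        while self._events:
            desired_time = self._events[0][0]           # peek at the minimum: O(1)
            self.now = max(self.now, desired_time)      # pretend to sleep until it is due
            _, _, func, args = heapq.heappop(self._events)  # O(log n)
            func(*args)


fired = []
loop = TinyLoop()
loop.call_later(0.5, fired.append, "second")
loop.call_later(0.1, fired.append, "first")
loop.call_later(0.5, fired.append, "third")
loop.run()
print(fired)  # ['first', 'second', 'third']
```

Only peeking at the earliest event is O(1); pushing and popping cost O(log n), which is why a heap suits a loop that mostly asks "how long until the next call is due?".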
x86 time sources
• Real Time Clock - absolute time, 1-second precision,
• 8254 chip (programmable interval timer), used previously,
• HPET (High Precision Event Timer), at least 10 MHz
• single counter for periodic mode,
• many for one-shot mode,
• compares actual timer value and target
• RDTSC/RDTSCP - CPU clock cycles
• Proprietary timers
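From Python, these hardware sources are only visible indirectly, through the clocks the operating system exposes. A small sketch using the standard time.get_clock_info call shows what the interpreter reports; which of the timers listed above actually backs each clock depends on the OS and hardware and cannot be read from this output alone.

```python
import time

# Ask the interpreter which OS-level clocks back its timing functions.
clocks = {
    name: time.get_clock_info(name)
    for name in ("time", "monotonic", "perf_counter")
}

for name, info in clocks.items():
    print(
        f"{name:13} impl={info.implementation} "
        f"resolution={info.resolution} monotonic={info.monotonic}"
    )
```

An event loop would normally schedule against a monotonic clock, since wall-clock time can be adjusted while the program runs.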
Twisted.Deferred
• callback
• errback
• addCallback, addErrback
• cancel
• addTimeout
• pause/unpause
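To make the callback/errback chain concrete, here is a toy model written with the standard library only. It is a simplified sketch, not Twisted's implementation: MiniDeferred is a hypothetical class, errors are plain exception instances rather than Twisted's Failure objects, and cancel, addTimeout and pause/unpause are not modelled.

```python
class MiniDeferred:
    """Toy Deferred: one result passed through a chain of handlers."""

    def __init__(self):
        self._chain = []       # list of (on_success, on_error) pairs
        self._fired = False
        self._result = None

    def addCallback(self, fn):
        return self._add(fn, None)

    def addErrback(self, fn):
        return self._add(None, fn)

    def _add(self, on_success, on_error):
        self._chain.append((on_success, on_error))
        if self._fired:        # handlers added after firing run right away
            self._run()
        return self

    def callback(self, result):
        self._fire(result)

    def errback(self, error):
        self._fire(error)

    def _fire(self, result):
        if self._fired:
            raise RuntimeError("already fired")
        self._fired = True
        self._result = result
        self._run()

    def _run(self):
        while self._chain:
            on_success, on_error = self._chain.pop(0)
            failed = isinstance(self._result, Exception)
            handler = on_error if failed else on_success
            if handler is None:
                continue       # wrong kind of handler: result passes through
            try:
                self._result = handler(self._result)
            except Exception as exc:
                self._result = exc   # switch to the error path


log = []
d = MiniDeferred()
d.addCallback(lambda body: body.upper())
d.addCallback(lambda text: 1 / 0)   # raises, so the chain switches to errbacks
d.addErrback(lambda err: "recovered from " + type(err).__name__)
d.addCallback(log.append)
d.callback("page body")
print(log)  # ['recovered from ZeroDivisionError']
```

The point of the model is the switching between the two paths: an exception in a callback moves the result to the next errback, and an errback that returns normally moves it back to the callbacks.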
Internal components intercommunication
Web agent pipeline
Downloader
Slots:
PROBLEMS
Throttling between internal components
• Downloader,
• Scraper,
• Item pipelines (cleansing, validating, deduplicating,
storing, ...),
• Feed exports (serialization + disk/network I/O),
• ?
Flow control: memory
• Unlimited downloading -> unbounded growth of items
from cascading feed pages.
• Maintain a limit on the amount of memory used by
Responses in the queue (~5 MB).
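The memory limit can be sketched as a byte budget on the queue of downloaded responses. The class and method names below are hypothetical and only illustrate the accounting; the 5,000,000-byte default mirrors the ~5 MB figure on the slide.

```python
from collections import deque


class ResponseQueue:
    """Queue of downloaded response bodies with a byte budget (toy model)."""

    def __init__(self, max_active_size=5_000_000):
        self.max_active_size = max_active_size
        self.active_size = 0          # total bytes currently queued
        self._queue = deque()

    def add(self, body: bytes):
        self._queue.append(body)
        self.active_size += len(body)

    def pop(self) -> bytes:
        body = self._queue.popleft()
        self.active_size -= len(body)
        return body

    def needs_backout(self) -> bool:
        # True while queued bodies exceed the budget: stop downloading.
        return self.active_size > self.max_active_size


queue = ResponseQueue(max_active_size=10)
queue.add(b"123456")
print(queue.needs_backout())   # False: 6 bytes queued
queue.add(b"7890ab")
print(queue.needs_backout())   # True: 12 bytes queued
queue.pop()
print(queue.needs_backout())   # False: back to 6 bytes
```

Counting bytes rather than responses matters here because one large page can cost as much memory as thousands of small ones.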
Flow control: CPU
• spending more time on callbacks (CPU) than on I/O
> reactor.callLater(0.1, d.errback, _failure)
an artificial delay of 100 ms
Summarizing
• concurrent item limits,
• memory consumption limits,
• scheduling of new calls with delays.
if a limit is reached ->
don't pick up new requests from the scheduler
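The rule "if a limit is reached, don't pick up new requests" amounts to a pull loop guarded by a back-out check. The sketch below assumes a single concurrency limit; ToyEngine and its method names are hypothetical and do not reproduce Scrapy's engine.

```python
from collections import deque


class ToyEngine:
    """Pulls requests from a scheduler only while under its limit."""

    def __init__(self, requests, max_in_flight=2):
        self.scheduler = deque(requests)   # pending requests
        self.in_flight = set()             # requests being processed
        self.max_in_flight = max_in_flight
        self.started = []                  # order in which requests were picked up

    def needs_backout(self):
        return len(self.in_flight) >= self.max_in_flight

    def next_request(self):
        # Pull from the scheduler only while no limit is reached.
        while self.scheduler and not self.needs_backout():
            request = self.scheduler.popleft()
            self.in_flight.add(request)
            self.started.append(request)

    def finished(self, request):
        # A completed request frees capacity, so try to pull again.
        self.in_flight.remove(request)
        self.next_request()


engine = ToyEngine(["a", "b", "c", "d"], max_in_flight=2)
engine.next_request()
print(engine.started)   # ['a', 'b']
engine.finished("a")
print(engine.started)   # ['a', 'b', 'c']
```

Nothing is dropped when the limit is hit; the work simply stays in the scheduler until a completion frees capacity.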
It just stopped…
• Why?
• Some Deferred was lost?
• Where?
• How to debug?
No silver bullet.
> self.heartbeat = task.LoopingCall(nextcall.schedule)
+ extensive logging
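The heartbeat idea can be shown without Twisted: if progress depends on a completion callback and that callback is lost, a periodic call that re-runs the "pick the next request" step gets the crawl moving again. In this sketch the class, its methods, and the notify flag are hypothetical, and the periodic timer (LoopingCall on the slide) is replaced by calling heartbeat() by hand.

```python
import logging
from collections import deque

logger = logging.getLogger("toy_engine")


class StallingEngine:
    """Toy engine whose progress normally depends on completion callbacks."""

    def __init__(self, requests):
        self.scheduler = deque(requests)
        self.busy = False
        self.current = None
        self.done = []

    def next_request(self):
        if self.busy or not self.scheduler:
            return
        self.busy = True
        self.current = self.scheduler.popleft()
        logger.debug("picked up %r", self.current)

    def finished(self, notify=True):
        self.done.append(self.current)
        self.busy = False
        if notify:                  # notify=False simulates a lost callback
            self.next_request()

    def heartbeat(self):
        # Runs periodically regardless of callbacks, so a missed
        # notification delays the crawl instead of stopping it.
        logger.debug("heartbeat: %d pending", len(self.scheduler))
        self.next_request()


engine = StallingEngine(["a", "b"])
engine.next_request()
engine.finished(notify=False)   # callback lost: "b" is pending, nothing runs
print(engine.busy)              # False, the engine has stalled
engine.heartbeat()              # the periodic tick picks up "b"
engine.finished()
print(engine.done)              # ['a', 'b']
```

The heartbeat hides the lost callback rather than explaining it, which is why the slide pairs it with extensive logging.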
Design your async application well
• Iterations
• State diagrams
Questions
Alexander Sibiryakov, Scrapinghub Ltd.,
sibiryakov@scrapinghub.com
