Web Scraping Using Nutch and Solr - Part 2
● The following example assumes that you have
– Watched “web scraping with nutch and solr”
– The above movie identity is cAiYBD4BQeE
– Set up Linux based Nutch/Solr environment
– Run the web scrape in the above movie
● Now we will
– Clean up that environment
– Web scrape a parameterised url
– View the urls in the data
Empty Nutch Database
● Clean up the Nutch crawl database
– Previously used apache-nutch-1.6/nutch_start.sh
– This contained -dir crawl option
– This created apache-nutch-1.6/crawl directory
– Which contains our Nutch data
● Clean this with
– cd apache-nutch-1.6; rm -rf crawl
● Do this only because the directory contained dummy data
● The next run of the script will create the directory again
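The cleanup step above can be wrapped in a small guard so that it reports what it did. This is only a sketch: clean_crawl is a hypothetical helper name, and the default path is the one used in these slides.

```shell
# Sketch: remove the Nutch crawl directory only if it exists.
# clean_crawl is a hypothetical helper; the default path comes from the slides.
clean_crawl() {
    nutch_home="${1:-apache-nutch-1.6}"
    crawl_dir="$nutch_home/crawl"
    if [ -d "$crawl_dir" ]; then
        rm -rf "$crawl_dir"
        echo "removed $crawl_dir"
    else
        echo "nothing to remove at $crawl_dir"
    fi
}

# Usage: clean_crawl            (uses apache-nutch-1.6)
#        clean_crawl /some/path (assumed alternative install location)
```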
Empty Solr Database
● Clean Solr database via a url
– Bookmark this url
– Only use it if you need to empty your data
● Run the following ( with solr server running )
– curl "http://localhost:8983/solr/update?commit=true" -d '<delete><query>*:*</query></delete>'
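Some Solr setups expect an explicit XML content type on the update request. The sketch below is a variant of the delete command above, assuming curl is installed; SOLR_URL and solr_delete_all are illustrative names that do not appear in the slides.

```shell
# Sketch: empty a Solr index, sending the delete message as XML.
# SOLR_URL is an assumed variable; the slides use http://localhost:8983/solr.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

solr_delete_all() {
    # Quote the URL so the shell does not treat '?' as a glob character.
    curl -s "$SOLR_URL/update?commit=true" \
         -H "Content-Type: text/xml" \
         --data-binary '<delete><query>*:*</query></delete>'
}

# Usage (destructive, run only against an index you are happy to empty):
#   solr_delete_all
```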
Set up Nutch
● Now we will do something more complex
● Web scrape a url that has parameters i.e.
– http://<site>/<function>?var1=val1&var2=val2
● This web scrape will
– Have extra url characters '?=&'
– Need greater search depth
– Need better url filtering
● Remember that you need to get permission to scrape a third-party web site
Nutch Configuration
● Change seed file for Nutch
● apache-nutch-1.6/urls/seed.txt
● In this instance I will use a url of the form
– http://somesite.co.nz/Search?DateRange=7&industry=62
– ( this is not a real url – just an example )
● Change the conf/regex-urlfilter.txt entries, i.e.
– # skip URLs containing certain characters
– -[*!@]
– # accept only somesite.co.nz Search urls
– +^http://([a-z0-9]*\.)*somesite\.co\.nz/Search
● This will only consider somesite.co.nz Search urls
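Nutch checks the filter rules in file order: the first matching pattern decides whether a url is kept ( + ) or skipped ( - ), and a url that matches no rule is ignored. The sketch below emulates the two rules above with grep -E so that candidate urls can be checked before a crawl. url_allowed is a hypothetical helper, not part of Nutch, and the dots in the host name are escaped here so that they match literally.

```shell
# Sketch: emulate first-match-wins evaluation of the two filter rules.
# url_allowed is a hypothetical helper; grep -E stands in for Nutch's regex engine.
url_allowed() {
    url="$1"
    # Rule 1 ( - ): skip urls containing *, ! or @
    if printf '%s\n' "$url" | grep -Eq '[*!@]'; then
        return 1
    fi
    # Rule 2 ( + ): accept somesite.co.nz Search urls, including sub-domains
    if printf '%s\n' "$url" | grep -Eq '^http://([a-z0-9]*\.)*somesite\.co\.nz/Search'; then
        return 0
    fi
    # No rule matched, so the url is ignored
    return 1
}

# Usage:
#   url_allowed 'http://somesite.co.nz/Search?DateRange=7&industry=62' && echo kept
```

Because '?' and '=' are not in the skip list, the parameterised Search urls pass the first rule and are accepted by the second.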
Run Nutch
● Now run Nutch using the start script
– cd apache-nutch-1.6 ; ./nutch_start.bash
● Monitor for errors in the Solr admin log window
● The Nutch crawl should end with
– crawl finished: crawl
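The contents of the start script are not shown in these slides. As a rough sketch, a Nutch 1.6 start script built around the bin/nutch crawl command might look like the following; run_crawl, the default depth and topN values, and the Solr url are all assumptions made for illustration. A greater -depth value is how the "greater search depth" mentioned earlier would be requested.

```shell
# Hypothetical reconstruction of a start script; the real script from part 1
# is not shown in these slides. All default values below are illustrative.
NUTCH_HOME="${NUTCH_HOME:-apache-nutch-1.6}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

run_crawl() {
    depth="${1:-3}"    # number of link levels to follow from the seed urls
    topn="${2:-50}"    # maximum number of urls fetched per level
    # Run from NUTCH_HOME so that urls/ and crawl/ resolve as in the slides.
    ( cd "$NUTCH_HOME" &&
      bin/nutch crawl urls -solr "$SOLR_URL" -dir crawl -depth "$depth" -topN "$topn" )
}

# Usage: run_crawl 5 100
```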
Checking Data
● Data should have been indexed in Solr
● In Solr Admin window
– Set 'Core Selector' = collection1
– Click 'Query'
– In Query window set fl field = url
– Click Execute Query
● The result ( next ) shows the filtered list of urls in Solr
Checking Data ( screenshot of the Solr query result listing the filtered urls )
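The same check can be made from the command line instead of the Solr Admin window. This is a sketch assuming curl is installed and the core is named collection1 as above; SOLR_URL and solr_list_urls are illustrative names, and the rows default is arbitrary.

```shell
# Sketch: list the url field of indexed documents via the Solr select handler.
# SOLR_URL is an assumed variable; the slides use http://localhost:8983/solr.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

solr_list_urls() {
    rows="${1:-10}"    # number of documents to return
    # q=*:* matches every document; fl=url returns only the url field.
    curl -s "$SOLR_URL/collection1/select?q=*:*&fl=url&rows=$rows&wt=json&indent=true"
}

# Usage: solr_list_urls 20
```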
Results
● Congratulations, you have completed your second crawl
– With parameterised urls
– More complex url filtering
– With a Solr Query search
Contact Us
● Feel free to contact us at
– www.semtech-solutions.co.nz
– info@semtech-solutions.co.nz
● We offer IT project consultancy
● We are happy to hear about your problems
● You pay only for the hours that you need to solve your problems
