Web Crawlers in Perl

Presented by Lambert Lum

Terminology
Web Crawling: larger scale
Screen Scraping: smaller scale
Here we'll use both terms interchangeably

What we don't cover
No database
No data warehousing
No parallel processing
No asynchronous coding

Useful for you?
Just for Google wannabes?
Just for search engine engineers?

cpanminus
Uses screen scraping to get data from cpan.org

WWW::Mechanize
Inherits from LWP::UserAgent
Ported to other languages
Strangely missing in PHP

WWW::Mechanize
(basic)
use WWW::Mechanize;

# Fetch a page and print its raw HTML.
my $mech = WWW::Mechanize->new();
$mech->get("http://sfbay.craigslist.org/ela");
my $content = $mech->content;
print $content;

WWW::Mechanize
(regex)
# Capture the text of the first <h4> element.
my ($h4_text) = $content =~ m{<h4.*?>(.*?)</h4>};
print "$h4_text\n";
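A minimal sketch of the regex approach, run against an inline HTML string so it works offline. The markup is made up for illustration and stands in for `$mech->content`. It shows the difference between taking the first match and taking every match with `/g`.

```perl
use strict;
use warnings;

# Hypothetical page fragment, standing in for $mech->content.
my $content = <<'HTML';
<html><body>
<h4 class="ban">Sat Jan 05</h4>
<p><a href="/ela/1.html">playstation 3</a></p>
<h4 class="ban">Fri Jan 04</h4>
</body></html>
HTML

# In list context, one capture group without /g yields the first match only.
my ($first_h4) = $content =~ m{<h4.*?>(.*?)</h4>}s;

# With /g, list context yields the captured text of every match.
my @all_h4 = $content =~ m{<h4.*?>(.*?)</h4>}gs;

print "first: $first_h4\n";
print "all: ", join(', ', @all_h4), "\n";
```

The `/s` flag lets `.` match newlines, which matters when a heading spans several lines. Regexes like this break easily when the markup changes, which is the motivation for a real HTML parser.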

HTML::TreeBuilder
use HTML::TreeBuilder;

# Build a parse tree from the HTML fetched earlier.
my $tree = HTML::TreeBuilder->new();
$tree->parse_content($content);

HTML::TreeBuilder
# Find the first <h4 class="ban"> element in the tree.
my $elt = $tree->look_down(
    _tag  => 'h4',
    class => 'ban',
);
print "h4: " . $elt->as_text . "\n";
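A sketch of the same lookup on an inline page, so it runs without a network connection. The HTML is made up for illustration. In list context `look_down` returns every matching element; in scalar context it returns the first match or undef, so it is worth checking the result before calling `as_text` on it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical page fragment, standing in for $mech->content.
my $content = <<'HTML';
<html><body>
<h4 class="ban">Sat Jan 05</h4>
<h4 class="other">not a date</h4>
<h4 class="ban">Fri Jan 04</h4>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($content);

# List context: every <h4> whose class attribute is exactly "ban".
my @headings = map { $_->as_text } $tree->look_down(
    _tag  => 'h4',
    class => 'ban',
);
print "h4: $_\n" for @headings;

# Scalar context: first match or undef, so guard before using it.
my $missing = $tree->look_down(_tag => 'h5');
print "no h5 on this page\n" unless $missing;

$tree->delete;    # free the tree explicitly when done
```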

HTML::TreeBuilder
Use Firebug to inspect the page and find the tag and class to look for

HTML::TreeBuilder alternatives
Web::Scraper
HTML::TreeBuilder::XPath
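For comparison, a sketch of the same extraction with HTML::TreeBuilder::XPath, which adds XPath queries on top of the TreeBuilder tree. The HTML is made up for illustration, and the XPath expressions are only one way to phrase the query.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical page fragment, standing in for $mech->content.
my $content = <<'HTML';
<html><body>
<h4 class="ban">Sat Jan 05</h4>
<p><a href="/ela/1.html">playstation 3</a></p>
<h4 class="ban">Fri Jan 04</h4>
<p><a href="/ela/2.html">playstation 2</a></p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($content);

# findnodes returns the matching elements.
my @dates = map { $_->as_text } $tree->findnodes('//h4[@class="ban"]');

# findvalues returns the string value of each match, here the href attributes.
my @hrefs = $tree->findvalues('//p/a/@href');

print "date: $_\n" for @dates;
print "href: $_\n" for @hrefs;

$tree->delete;
```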

Cached Mechanize
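# read_file and write_file are assumed to come from File::Slurp.
use File::Slurp qw(read_file write_file);
use File::Path qw(make_path);

# $ela holds the host and path of the page; it is set elsewhere (not shown).
my $dir = "data/$ela";
my $content;
# Try the on-disk copy first; read_file dies if the file is missing.
eval {
    $content = read_file("$dir/index.html");
};
if (!$content) {
    # Cache miss: fetch the page and save it for next time.
    my $url = "http://$ela";
    $mech->get($url);
    $content = $mech->content;
    make_path($dir);
    write_file("$dir/index.html", $content);
    print "wrote $dir/index.html\n";
}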
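The caching idea can be sketched as a reusable function. This version is an illustration, not the presenter's code: it swaps File::Slurp for plain `open` so it needs only core modules, and `cached_content` and the `$fetch` coderef are hypothetical names. The stub fetcher stands in for `$mech->get` followed by `$mech->content`, so the example runs offline.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Return the cached copy of a page if one exists; otherwise call $fetch,
# store the result in $dir/index.html, and return it.
sub cached_content {
    my ($dir, $fetch) = @_;
    my $file = "$dir/index.html";

    if (-s $file) {
        open my $in, '<:raw', $file or die "cannot read $file: $!";
        local $/;    # slurp mode: read the whole file at once
        return scalar <$in>;
    }

    my $content = $fetch->();
    make_path($dir);
    open my $out, '>:raw', $file or die "cannot write $file: $!";
    print {$out} $content;
    close $out or die "cannot close $file: $!";
    return $content;
}

# Stub fetcher that counts its calls; a real one would wrap $mech->get.
my $calls = 0;
my $fetch = sub { $calls++; return "<html>fake page</html>\n" };

my $dir    = tempdir(CLEANUP => 1) . '/sfbay/ela';
my $first  = cached_content($dir, $fetch);
my $second = cached_content($dir, $fetch);

print "fetcher was called $calls time(s)\n";
```

The second call is served from disk, so the fetcher runs once. Note that this sketch never expires the cache; delete the directory to force a fresh download.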

Other caching
WWW::Mechanize::Plugin::Cache
WWW::Mechanize::Cached

Form submission
$mech->get("http://sfbay.craigslist.org/ela/");
# Fill in the search form, then submit it.
$mech->field('catAbb', 'ela');
$mech->field('query', 'playstation');
$mech->field('maxAsk', 300);
$mech->submit();
# Collect the links on the result page whose text mentions "playstation".
my @links = $mech->find_all_links(
    text_regex => qr{playstation}i,
);
print join "", map { $_->text() . "\n" } @links;

Form submission (2)
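$mech->get("http://sfbay.craigslist.org/ela/");
# Inspect the page's forms to learn which fields can be set.
my @forms = $mech->forms;       # every form on the page, as HTML::Form objects
my $form = $forms[0];
my $action = $form->action;     # URL the form submits to
my @inputs = $form->inputs;     # HTML::Form::Input objects
my @names = $form->param;       # names of the form's fields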
Follow Next Link
my $url = "http://sfbay.craigslist.org/ela";
$mech->get($url);
my $uri = $mech->uri;
print "uri: $uri\n";
my $i = 0;
# Follow the "next >" link for at most 10 pages.
# With autocheck on (the default for new()), follow_link dies when no
# link matches; create $mech with autocheck => 0 to get a false return.
while ($i < 10 && $mech->follow_link(text => 'next >')) {
    $uri = $mech->uri;
    print "uri: $uri\n";
    $i++;
}

Other uses
Test::WWW::Mechanize

Legality
I'm not a lawyer
User agreements may object to screen scraping
eBay has sued a notorious screen scraper
Online games will almost always ban you

No DDoS
Be considerate.
Don't hit the server like a DDoS attack
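One simple way to be considerate is to enforce a minimum gap between requests. The sketch below uses only core modules; `make_throttle` is a hypothetical helper, and the 0.2 second gap is chosen only to keep the demo quick. A real crawl would likely use a gap of a second or more.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Return a closure that blocks until at least $min_gap seconds have
# passed since the previous call.
sub make_throttle {
    my ($min_gap) = @_;
    my $last = 0;
    return sub {
        my $wait = $last + $min_gap - time;
        sleep $wait if $wait > 0;
        $last = time;
        return;
    };
}

my $throttle = make_throttle(0.2);

my $start = time;
for my $page (1 .. 3) {
    $throttle->();
    # A real crawler would call $mech->get($url) here.
    print "fetching page $page\n";
}
my $elapsed = time - $start;
printf "elapsed: %.2f s\n", $elapsed;
```

The first call goes through immediately and each later call waits out the remainder of the gap, so three requests take roughly two gaps in total.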

JavaScript parsing
HTML parsers are easy.
JavaScript parsers are hard.

JavaScript parsing
Selenium lets you hijack your Firefox web browser
Headless WebKit (PhantomJS/Wight)
– WebKit is the base for Chrome/Safari

Homework
Crawl every page of modernperlbooks.com, extracting the title and first paragraph of each
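As a starting point, the "every page" part comes down to a queue of pages to visit plus a record of pages already seen. The sketch below shows only that loop, on a made-up three-page site held in a hash so it runs offline. The link regex is a simplification, and the extraction step is left out.

```perl
use strict;
use warnings;

# Made-up site keyed by path; a real crawler would fetch each path
# with $mech->get instead of reading this hash.
my %site = (
    '/'  => '<a href="/a">a</a> <a href="/b">b</a>',
    '/a' => '<a href="/">home</a> <a href="/b">b</a>',
    '/b' => '<a href="/a">a</a> <a href="http://example.com/">away</a>',
);

my @queue = ('/');
my %seen  = ('/' => 1);
my @visited;

while (defined(my $path = shift @queue)) {
    push @visited, $path;
    my $html = $site{$path} // '';

    # Follow only site-relative links, and queue each page once.
    for my $link ($html =~ m{href="(/[^"]*)"}g) {
        next if $seen{$link}++;
        push @queue, $link;
    }
}

print "visited: @visited\n";    # prints "visited: / /a /b"
```

The `%seen` hash is what stops the crawl from looping forever between pages that link to each other.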
