Wikipedia Web Crawler Case Study

Wikipedia Crawler in Practice


License: CC 4.0 BY-SA

1. Structure of HTML Source Code

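HTML source code is made up of nested tags.

The first tag is the title tag: the text between <title> and its closing tag </title> is the page title.
The next tag is <div id="introduction">, which is closed by </div>.
p is short for paragraph: the text between <p> and its closing tag </p> is the content displayed on the page. Here p is a child of the div tag, and the div is the parent of p.
In the sample HTML document, the second div is more complex. It has one child, a paragraph tag p, and that p has children of its own, img and a. These two tags are descendants of the div but not its children; they are children of the p tag.
Anchor tag: written as <a></a> and used to create links. The href attribute specifies the link target.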

2. Source Code of a Simple Page

<title>My Website</title>
<div id="introduction">
  <p>
    Welcome to my website!
  </p>
</div>    
<div id="image-gallery">
  <p>
    This is my cat!
    <img src="cat.jpg" alt="Meow!">
    <a href="https://en.wikipedia.org/wiki/Cat">Learn more about cats!</a>
  </p>
</div>
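The parent, child, and descendant relationships in this sample page can be checked programmatically. The following is a minimal sketch that uses only Python's standard-library `html.parser` module; the `TagTreePrinter` class and the `VOID_TAGS` set are hypothetical helpers written for this illustration, not part of the original article.

```python
from html.parser import HTMLParser

# The sample page from this section, stored as a string.
SAMPLE_HTML = """
<title>My Website</title>
<div id="introduction">
  <p>
    Welcome to my website!
  </p>
</div>
<div id="image-gallery">
  <p>
    This is my cat!
    <img src="cat.jpg" alt="Meow!">
    <a href="https://en.wikipedia.org/wiki/Cat">Learn more about cats!</a>
  </p>
</div>
"""

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"img", "br", "hr", "meta", "link", "input"}


class TagTreePrinter(HTMLParser):
    """Hypothetical helper: records each tag with its depth and parent."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.records = []  # (depth, tag, parent) tuples in document order

    def handle_starttag(self, tag, attrs):
        parent = self.stack[-1] if self.stack else None
        self.records.append((len(self.stack), tag, parent))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost open tag when its closing tag arrives.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


printer = TagTreePrinter()
printer.feed(SAMPLE_HTML)
for depth, tag, parent in printer.records:
    print("  " * depth + f"<{tag}> (parent: {parent})")
```

In the printed tree, img and a appear two levels deep with p as their parent, which matches the description: they are descendants of the second div but children of p.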

3. Fetching HTML with Python

Install the requests library (for example, with pip install requests).

import requests

# Download the page; response.text holds the decoded HTML as a str.
response = requests.get('https://en.wikipedia.beta.wmflabs.org/wiki/Dog')
html = response.text
print(html)
print(type(html))  # <class 'str'>

4. Parsing HTML

A commonly used library is Beautiful Soup; see its documentation for details.
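As a rough illustration of what parsing looks like, the sketch below uses Beautiful Soup to pull the title and the first link out of a small HTML string. It assumes the `beautifulsoup4` package is installed (for example, with `pip install beautifulsoup4`); the inline HTML is a stand-in for a downloaded page, and the variable names are chosen only for this example.

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page (illustrative only).
html = """
<title>My Website</title>
<div id="image-gallery">
  <p>
    This is my cat!
    <img src="cat.jpg" alt="Meow!">
    <a href="https://en.wikipedia.org/wiki/Cat">Learn more about cats!</a>
  </p>
</div>
"""

# "html.parser" is the parser bundled with Python, so no extra parser is needed.
soup = BeautifulSoup(html, "html.parser")

page_title = soup.find("title").get_text()  # text inside <title>
gallery = soup.find(id="image-gallery")     # look up a tag by its id attribute
first_link = gallery.find("a")              # first <a> inside that div

print(page_title)
print(first_link.get("href"))
print(first_link.get_text())
print(first_link.parent.name)  # name of the tag that directly contains the link
```

For a crawler, the value returned by `first_link.get("href")` is the piece of interest, since it is the URL of the next page to visit.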

5. Approach

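Write a continue_crawl function with the following behavior:

search_history is a list of Wikipedia article URL strings. The last item in the list is the most recently found URL.
target_url is the URL string of the article being searched for; the search stops once it is reached.

If the most recent article in search_history is the target article, stop the search and return False.
If the list contains more than 25 URLs (the default max_steps), return False.
If the list contains a cycle, that is, the most recent URL has already been visited, return False.
Otherwise, continue the search and return True.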

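def continue_crawl(search_history, target_url, max_steps=25):
    """Return True if the crawl should continue, False if it should stop."""
    if search_history[-1] == target_url:
        # The most recent article is the target.
        print("We've found the target article!")
        return False
    elif len(search_history) > max_steps:
        # The history has grown past the step limit.
        print("The search has gone on suspiciously long, aborting search!")
        return False
    elif search_history[-1] in search_history[:-1]:
        # The most recent article was visited before, so we are in a cycle.
        print("We've arrived at an article we've already seen, aborting search!")
        return False
    else:
        return True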
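
To see how continue_crawl might drive a crawl loop, here is a small offline sketch. It repeats the same stopping rules (without the printed messages) so that it runs on its own. The `FAKE_FIRST_LINKS` dictionary, `find_first_link`, and `crawl` are hypothetical names invented for this illustration; a real crawler would download each page and parse out its first link instead of looking it up in a dictionary.

```python
def continue_crawl(search_history, target_url, max_steps=25):
    """Same stopping rules as the function above, without the printed messages."""
    if search_history[-1] == target_url:
        return False
    if len(search_history) > max_steps:
        return False
    if search_history[-1] in search_history[:-1]:
        return False
    return True


# Hypothetical link graph: each article maps to the first link found in it.
FAKE_FIRST_LINKS = {
    "wiki/Dog": "wiki/Mammal",
    "wiki/Mammal": "wiki/Animal",
    "wiki/Animal": "wiki/Biology",
    "wiki/Biology": "wiki/Science",
    "wiki/A": "wiki/B",
    "wiki/B": "wiki/A",
}


def find_first_link(url):
    """Hypothetical stand-in for fetching a page and extracting its first link."""
    return FAKE_FIRST_LINKS[url]


def crawl(start_url, target_url, max_steps=25):
    """Follow first links from start_url until continue_crawl says to stop."""
    search_history = [start_url]
    while continue_crawl(search_history, target_url, max_steps):
        search_history.append(find_first_link(search_history[-1]))
    return search_history


found = crawl("wiki/Dog", "wiki/Science")               # reaches the target
looped = crawl("wiki/A", "wiki/Science")                # stops on a cycle
cut_off = crawl("wiki/Dog", "wiki/Science", max_steps=2)  # stops on the step limit

print(found)
print(looped)
print(cut_off)
```

Because the length check uses `>`, the run with `max_steps=2` stops only once the history holds three URLs, which is the same behavior as the function above with its default of 25.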