Scrapy and Selenium

Latest recommended article published on 2025-07-08 11:02:07

weixin_34240520

Tags: python java

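This article describes how to use Scrapy together with Selenium to scrape web content that is generated dynamically by JavaScript. After installing and configuring Selenium and starting the Selenium Server, Firefox can be controlled remotely. Concrete Python examples show how to initialize Selenium, open a page, wait for it to load, and retrieve the rendered HTML source.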

How do you scrape JavaScript-generated pages with Scrapy?

Scraping JS-generated pages with Scrapy and WebKit: http://www.cnblogs.com/Safe3/archive/2011/10/19/2217965.html

[Figure: Selenium RC architecture diagram]

pip install -U selenium

Selenium IDE
http://docs.seleniumhq.org/projects/ide/

Download the server separately, from: http://selenium-release.storage.googleapis.com/2.40/selenium-server-standalone-2.40.0.jar

java -jar selenium-server-standalone-2.40.0.jar
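Before pointing a spider at the server, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using only the standard library; the helper name is_port_open is hypothetical, and 4444 is assumed to be the port the server uses, as stated in the steps below.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        conn = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False
    conn.close()
    return True
```

Calling is_port_open("localhost", 4444) after starting the jar should return True once the server is up. This only checks that the port accepts connections, not that the process behind it is really Selenium Server.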

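Now let's go through it step by step:
1. First, go to the directory on your machine that contains the Selenium Server jar and run it with java -jar xxx.jar; the server automatically listens on local port 4444.
2. Following my previous post, "How to connect to a server that has no public IP", set up a remote port mapping between local port 4444 and port 4444 on the server.
3. Write the Python program with the Scrapy framework. I won't go through a full example here, since there are many online, such as this one: https://gist.github.com/1045108. Only a few key points:
a) Call selenium from Python like this:
self.sel = selenium("localhost", 4444, "*firefox", "http://example.com/")
Writing just "*firefox" may fail to locate Firefox, however. In that case you can specify the Firefox executable path explicitly, for example: "*firefox D:/Program Files/Mozilla Firefox/firefox.exe".
b) Get the HTML after Firefox has finished rendering it:

sel = self.sel
sel.open(response.url)
sel.wait_for_page_to_load(10000)  # timeout in milliseconds
html = sel.get_eval("selenium.browserbot.getCurrentWindow().document.getElementsByTagName('html')[0].innerHTML")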
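The browser string from point a) can be assembled by a small helper so that the optional Firefox path lives in one place. This is only a sketch: browser_start_command is a hypothetical name, and the path in the usage note is the example path from the text, not a value to rely on.

```python
def browser_start_command(browser="*firefox", executable_path=None):
    """Build the browser string passed as the third argument to selenium(...)."""
    if executable_path:
        # Selenium RC accepts "<launcher> <path to executable>" as one string.
        return "{0} {1}".format(browser, executable_path)
    return browser
```

For instance, browser_start_command(executable_path="D:/Program Files/Mozilla Firefox/firefox.exe") yields the same string as the hard-coded example above, and calling it with no arguments falls back to plain "*firefox".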

A complete spider, for reference:

from selenium import selenium  # Selenium RC client
from scrapy.spider import BaseSpider
from scrapy.http import Request
import time
import lxml.html


class SeleniumSpider(BaseSpider):
    name = "selenium"
    allowed_domains = ['selenium.com']
    start_urls = ["http://localhost"]

    def __init__(self, **kwargs):
        # Let the base spider finish its own setup first.
        BaseSpider.__init__(self, **kwargs)
        print(kwargs)
        # Connect to the Selenium Server on localhost:4444 and launch Firefox.
        self.sel = selenium("localhost", 4444, "*firefox", "http://selenium.com/")
        self.sel.start()

    def parse(self, response):
        sel = self.sel
        # Drive the browser: open the form, fill it in, and submit it.
        sel.open("/index.aspx")
        sel.click("id=radioButton1")
        sel.select("genderOpt", "value=male")
        sel.type("nameTxt", "irfani")
        sel.click("link=Submit")
        time.sleep(1)  # wait a second for the page to load
        # Parse the browser-rendered HTML with lxml.
        root = lxml.html.fromstring(sel.get_html_source())
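A fixed one-second sleep can be too short on a slow page and wasteful on a fast one. One alternative is to poll for a condition with a deadline. The sketch below uses only the standard library; wait_until is a hypothetical helper, not part of Selenium or Scrapy, and the locator in the comment is made up for illustration.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.2):
    """Call predicate() until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise RuntimeError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Possible use inside parse(), assuming the result page contains an element
# with the (hypothetical) id "result":
#     wait_until(lambda: sel.is_element_present("id=result"))
```

The predicate is always checked at least once, so a condition that is already true returns immediately without sleeping.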