Scraping the Top 250 Douban Movie Posters with Scrapy

This article walks through a crawler project built on the Scrapy framework that scrapes the Douban Movie Top 250 pages and downloads each movie's poster image along with its title. It covers the key pieces of the project: the spider logic, the image-download pipeline, and the configuration.


I. The Crawler Code

Project directory structure (screenshot omitted):

items.py file

# -*- coding: utf-8 -*-
import scrapy

class DoubanmovieItem(scrapy.Item):
    # two fields: the list of poster URLs and the list of image names
    url = scrapy.Field()
    img_name = scrapy.Field()
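Because `url` and `img_name` are filled as parallel lists, each poster URL lines up with its movie title by index. A plain-Python sketch of how the pipeline pairs them, using made-up data in place of a real scraped item:

```python
# hypothetical data standing in for a scraped DoubanmovieItem
item = {
    "url": ["https://img.example/p1.jpg", "https://img.example/p2.jpg"],
    "img_name": ["肖申克的救赎", "霸王别姬"],
}

# the image pipeline matches each URL to its title by position
pairs = list(zip(item["url"], item["img_name"]))
for url, name in pairs:
    print(name, "->", url)
```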

pipelines.py file

# -*- coding: utf-8 -*-

from scrapy.pipelines.images import ImagesPipeline
from scrapy.http import Request 

class DoubanmoviePipeline(object):
    # nothing to post-process; pass the item straight through
    def process_item(self, item, spider):
        return item

class MyImagesPipeline(ImagesPipeline):
    # attach the item and the image's position to each request via meta,
    # so that file_path() can look up the matching name;
    # enumerate() avoids the bug where list.index(url) returns the
    # first occurrence when two movies share the same poster URL
    def get_media_requests(self, item, info):
        for index, url in enumerate(item['url']):
            yield Request(url, meta={'item': item, 'index': index})

    # rename the downloaded image after the movie title
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        index = request.meta['index']

        image_name = item['img_name'][index]
        return 'full/%s.jpg' % image_name
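One caveat with naming files after titles: file_path() writes 'full/&lt;title&gt;.jpg' verbatim, so a title containing a path separator or other reserved character would produce a broken path. A small sanitizing helper (not part of the original project, just a hedged sketch) could guard against that:

```python
import re

def safe_image_name(title):
    """Replace characters that are unsafe in file names with underscores."""
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# a movie title that contains a slash would otherwise create a subdirectory
print(safe_image_name('V字仇杀队 / V for Vendetta'))
```

If adopted, file_path() would call it as `return 'full/%s.jpg' % safe_image_name(image_name)`.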

settings.py file

# -*- coding: utf-8 -*-
BOT_NAME = 'doubanMovie'
SPIDER_MODULES = ['doubanMovie.spiders']
NEWSPIDER_MODULE = 'doubanMovie.spiders'
# lower numbers run earlier, so MyImagesPipeline downloads the images
# before DoubanmoviePipeline sees the item
ITEM_PIPELINES = {'doubanMovie.pipelines.DoubanmoviePipeline': 2,
                  'doubanMovie.pipelines.MyImagesPipeline': 1}
IMAGES_URLS_FIELD = 'url'
IMAGES_STORE = r'.'   # images are saved under ./full/

middlewares.py file (Scrapy's generated default template, left unchanged)


# -*- coding: utf-8 -*-
from scrapy import signals
class DoubanmovieSpiderMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class DoubanmovieDownloaderMiddleware(object):
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

doubanMovieSpider.py file

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ..items import DoubanmovieItem

class movieSpider(Spider):
    # name used with "scrapy crawl movie"
    name = "movie"
    # the first page, plus nine more pages offset by 25 movies each
    start_urls = ["https://movie.douban.com/top250"]
    for i in range(1, 10):
        start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))

    def parse(self, response):
        item = DoubanmovieItem()
        sel = Selector(response)
        # each <li> in the ranking list is one movie entry
        movies = sel.xpath('//*[@id="content"]/div/div[1]/ol/li')

        item['url'] = []
        item['img_name'] = []
        # collect the poster URL and movie title from every entry
        for movie in movies:
            site = movie.xpath('div/div[1]/a/img/@src').extract_first()
            img_name = movie.xpath('div/div[1]/a/img/@alt').extract_first()

            item['url'].append(site)
            item['img_name'].append(img_name)
        yield item
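The class body builds start_urls at class-definition time: the bare Top 250 URL plus nine paginated URLs whose start parameter steps by 25. The same construction can be checked in plain Python, outside Scrapy:

```python
# reproduce the start_urls construction from the spider
start_urls = ["https://movie.douban.com/top250"]
for i in range(1, 10):
    start_urls.append("https://movie.douban.com/top250?start=%d&filter=" % (25 * i))

print(len(start_urls))   # 10 pages x 25 movies = 250 entries
print(start_urls[-1])    # last page starts at offset 225
```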

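The two XPath expressions in parse() pull the @src and @alt attributes off each poster's img tag. The same idea can be sketched with only the standard library, run against a made-up fragment of the list page (the HTML below is illustrative, not Douban's real markup):

```python
from html.parser import HTMLParser

# minimal stand-in for one entry of the Top 250 list page
SAMPLE = '<ol><li><img src="https://img.example/p1.jpg" alt="肖申克的救赎"></li></ol>'

class PosterParser(HTMLParser):
    """Collect (src, alt) pairs from every <img> tag, like the spider's XPath does."""
    def __init__(self):
        super().__init__()
        self.posters = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            d = dict(attrs)
            self.posters.append((d.get('src'), d.get('alt')))

parser = PosterParser()
parser.feed(SAMPLE)
print(parser.posters)
```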
II. Scraped Results

Running "scrapy crawl movie" from the project root downloads the posters into the full/ directory, each file named after its movie.

(screenshots of the downloaded posters omitted)
