[Python] Running a Website Scraper on a Schedule

This article shows how to scrape Yahoo!股市 (Yahoo Taiwan stock) ranking data with Python and BeautifulSoup, process it with pandas, store the results in SQLite, and schedule the whole job with crontab.


Today we discuss how to use Python, a SQLite database, and the crontab utility to deploy a scraper to a server and have it fetch and store data on a schedule.

Writing the scraper code

We first write a scraper that uses the requests and beautifulsoup4 packages to fetch and parse the Yahoo!股市 listed-market (上市) and over-the-counter (上柜) price ranking pages, then use pandas to display the parsed results.

import datetime

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_price_ranks():
    # One timestamp shared by all 200 rows scraped in this run
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]

    # "tse" = listed (上市) ranking, "otc" = over-the-counter (上柜) ranking
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]

    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []

    # Each ranking row spans 10 <td> cells and each page lists 100 stocks
    ttl_steps = 10 * 100
    each_step = 10

    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, 'html.parser')

        # The anchor text holds both the ticker and the stock name
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock

        # The ranking table sits inside the third top-level <td>; index into its cells
        cells = soup.find_all("td")[2].find_all("td")
        price = [float(cells[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(cells[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        # Market value is quoted in hundreds of millions; convert to absolute units
        mkt_value = [float(cells[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value

    # The first 100 rows come from the listed ranking, the next 100 from the OTC ranking
    types = ["上市" for _ in range(100)] + ["上柜" for _ in range(100)]
    # Flag KY (foreign-registered) companies by the suffix in the stock name
    ky_registered = ["KY" in st for st in stocks]

    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()
print(price_ranks.shape)

Running the script prints the DataFrame's shape:

## (200, 8)
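The 200 rows are just the two 100-entry rankings stacked together, which can be confirmed with a quick count (the expected split follows directly from how the types column is built above):

print(price_ranks["type"].value_counts())  # expect 100 rows of each type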

Next, use pandas to display the first and last few rows:

price_ranks.head()

price_ranks.tail()
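Before deploying, it is worth hardening the fetch step: as written, a hung connection or an HTTP error page would stall or corrupt a scheduled run. A minimal defensive variant of the two request lines (the timeout value here is an arbitrary choice, not part of the original script):

        r = requests.get(pr_url, timeout=10)  # avoid hanging a cron run indefinitely
        r.raise_for_status()                  # fail loudly instead of parsing an error page
        soup = BeautifulSoup(r.text, 'html.parser')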

Next, we deploy the script to a server. Choosing and configuring a server is out of scope here; the focus is on setting up the scheduled task. First, modify the code so that the results are stored in SQLite:

import datetime
import sqlite3

import requests
from bs4 import BeautifulSoup
import pandas as pd

def get_price_ranks():
    # One timestamp shared by all 200 rows scraped in this run
    current_dt = datetime.datetime.now().strftime("%Y-%m-%d %X")
    current_dts = [current_dt for _ in range(200)]

    # "tse" = listed (上市) ranking, "otc" = over-the-counter (上櫃) ranking
    stock_types = ["tse", "otc"]
    price_rank_urls = ["https://tw.stock.yahoo.com/d/i/rank.php?t=pri&e={}&n=100".format(st) for st in stock_types]

    tickers = []
    stocks = []
    prices = []
    volumes = []
    mkt_values = []

    # Each ranking row spans 10 <td> cells and each page lists 100 stocks
    ttl_steps = 10 * 100
    each_step = 10

    for pr_url in price_rank_urls:
        r = requests.get(pr_url)
        soup = BeautifulSoup(r.text, 'html.parser')

        # The anchor text holds both the ticker and the stock name
        ticker = [i.text.split()[0] for i in soup.select(".name a")]
        tickers += ticker
        stock = [i.text.split()[1] for i in soup.select(".name a")]
        stocks += stock

        # The ranking table sits inside the third top-level <td>; index into its cells
        cells = soup.find_all("td")[2].find_all("td")
        price = [float(cells[i].text) for i in range(5, 5 + ttl_steps, each_step)]
        prices += price
        volume = [int(cells[i].text.replace(",", "")) for i in range(11, 11 + ttl_steps, each_step)]
        volumes += volume
        # Market value is quoted in hundreds of millions; convert to absolute units
        mkt_value = [float(cells[i].text) * 100000000 for i in range(12, 12 + ttl_steps, each_step)]
        mkt_values += mkt_value

    # The first 100 rows come from the listed ranking, the next 100 from the OTC ranking
    types = ["上市" for _ in range(100)] + ["上櫃" for _ in range(100)]
    # Flag KY (foreign-registered) companies by the suffix in the stock name
    ky_registered = ["KY" in st for st in stocks]

    df = pd.DataFrame()
    df["scrapingTime"] = current_dts
    df["type"] = types
    df["kyRegistered"] = ky_registered
    df["ticker"] = tickers
    df["stock"] = stocks
    df["price"] = prices
    df["volume"] = volumes
    df["mktValue"] = mkt_values
    return df

price_ranks = get_price_ranks()

# Append this run's 200 rows; the table is created automatically on the first run
conn = sqlite3.connect('/home/ubuntu/yahoo_stock.db')
price_ranks.to_sql("price_ranks", conn, if_exists="append", index=False)
conn.close()
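After a few scheduled runs, it is easy to verify that rows are accumulating by reading the table back with pandas. A sketch using the same database path as above:

import sqlite3
import pandas as pd

conn = sqlite3.connect('/home/ubuntu/yahoo_stock.db')
# One row per scrape; each run should contribute 200 new records
runs = pd.read_sql("SELECT scrapingTime, COUNT(*) AS n FROM price_ranks GROUP BY scrapingTime", conn)
print(runs)
conn.close()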

To have the scraper run on a schedule, we use Linux's crontab. Suppose we want it to execute once an hour between 9:30 and 16:30 every day: save the script as price_rank_scraper.py, then add the following line to the crontab (e.g. via crontab -e):

30 9-16 * * * /home/ubuntu/miniconda3/bin/python /home/ubuntu/price_rank_scraper.py
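The schedule 30 9-16 * * * fires at minute 30 of every hour from 9 through 16, i.e. 9:30, 10:30, ..., 16:30. Since cron jobs run without a terminal, it also helps to append the script's output to a log file for later inspection; a variant of the same entry (the log path here is an assumption):

30 9-16 * * * /home/ubuntu/miniconda3/bin/python /home/ubuntu/price_rank_scraper.py >> /home/ubuntu/price_rank.log 2>&1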

With that entry installed, we have a working scheduled scraper: each run appends its results to the SQLite database.
