Blog Migration Series (6): Crawling Toutiao Articles

Latest recommended article published on 2024-05-02 01:13:24

rico_zhou

Licensed under CC 4.0 BY-SA

Categories: java, spider, big data

Article link: https://blog.csdn.net/rico_zhou/article/details/83619564

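This is the sixth post in the blog migration series. It looks at how to crawl articles from Toutiao. Because the article list is loaded dynamically by JavaScript, the author analyzes the requests and their parameters, in particular how they are encrypted. The as and cp parameters can be derived from an MD5 hash of the timestamp, but obtaining the _signature parameter is more involved. The articles from the first load can be fetched, while crawling the subsequent pages remains a challenge.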

1. Previously in This Series

Blog Migration Series (1): Introduction: https://blog.csdn.net/rico_zhou/article/details/83619152

Blog Migration Series (2): Crawling CSDN Blogs: https://blog.csdn.net/rico_zhou/article/details/83619509

Blog Migration Series (3): Crawling Cnblogs Blogs: https://blog.csdn.net/rico_zhou/article/details/83619525

Blog Migration Series (4): Crawling Jianshu Articles: https://blog.csdn.net/rico_zhou/article/details/83619538

Blog Migration Series (5): Crawling OSChina Blogs: https://blog.csdn.net/rico_zhou/article/details/83619561

Blog Migration Series (7): Converting Local Word Documents to HTML: https://blog.csdn.net/rico_zhou/article/details/83619573

Blog Migration Series (8): Summary: https://blog.csdn.net/rico_zhou/article/details/83619599

2. Getting Started (Collecting the Article URLs)

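Crawling Toutiao articles is one of the harder tasks in this series. On sites such as CSDN, the basic information can be fetched directly with htmlunit, but the same approach does not work on Toutiao. The reason is simple. Pick any author with plenty of articles, for example https://www.toutiao.com/c/user/101528687217/, open the home page, and right-click to view the source. The source contains no article list, which means the list is loaded dynamically by JavaScript. So, as usual, start by inspecting the page to see which requests it makes.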

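One GET request turns out to be exactly what I want: its preview shows that the returned data is the article list. However, I could not find any page-number parameter in the URL. Scrolling the page triggers another request, and its last three parameters, as, cp, and _signature, differ from those of the first one. When max_behot_time=0, that is, for the first page, these three parameters can be changed to arbitrary values and the request still returns the correct data.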
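
To make the shape of that request concrete, here is a minimal Java sketch that assembles a list URL from the four parameters named above (max_behot_time, as, cp, _signature). The class name `ToutiaoListUrl`, the endpoint path in `BASE`, and the `user_id` parameter name are assumptions made for illustration and are not taken from the captured request, so replace them with whatever the browser's network panel actually shows. The placeholder values for as, cp, and _signature rely only on the observation that arbitrary values are accepted when max_behot_time=0.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ToutiaoListUrl {
    // ASSUMPTION: hypothetical endpoint path; copy the real one from the network panel.
    static final String BASE = "https://www.toutiao.com/c/user/article/";

    // Builds the article-list URL. maxBehotTime = 0 requests the first page.
    static String build(String userId, long maxBehotTime,
                        String as, String cp, String signature) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("user_id", userId); // ASSUMPTION: parameter name not confirmed by the post
        params.put("max_behot_time", Long.toString(maxBehotTime));
        params.put("as", as);
        params.put("cp", cp);
        params.put("_signature", signature);

        // Join the pairs as key=value&key=value, URL-encoding each value.
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            query.add(e.getKey() + "="
                    + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return BASE + "?" + query;
    }

    public static void main(String[] args) {
        // First page: the last three values are arbitrary placeholders.
        System.out.println(build("101528687217", 0L, "placeholder", "placeholder", "placeholder"));
    }
}
```

For later pages the same builder would need real as, cp, and _signature values, which is where the difficulty lies.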