package repo;
import java.util.concurrent.atomic.AtomicInteger;
import dao.ZhihuDao;
import dao.impl.ZhihuDaoImpl;
import entity.ZhihuUser;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;
/**
* A small Zhihu user crawler.<br>
* Given a user-search keyword, it crawls the profile information of every user returned by the search.<br>
* @date 2016-5-3
* @website ghb.soecode.com
* @csdn blog.csdn.net/antgan
* @author antgan
*
*/
public class ZhiHuUserPageProcessor implements PageProcessor{
//Site configuration for the crawl: charset, crawl interval, retry count, and so on
private Site site = Site.me().setRetryTimes(10).setSleepTime(1000);
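//Optional hardening (an assumption, not in the original code): Zhihu may reject
//requests that carry the default HTTP client User-Agent, so a browser UA and an
//explicit charset are often configured as well, e.g.:
//  private Site site = Site.me().setRetryTimes(10).setSleepTime(1000)
//      .setCharset("UTF-8")
//      .setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64)");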
//Number of users crawled so far (AtomicInteger because the spider runs with 5 threads)
private static final AtomicInteger num = new AtomicInteger(0);
//Search keyword (here "北大", i.e. Peking University)
private static String keyword = "北大";
//DAO used to persist the extracted user records to the database
private ZhihuDao zhihuDao = new ZhihuDaoImpl();
/**
* The process method is the core of a WebMagic crawler.<br>
* This is where the logic lives for extracting data and follow-up target links from the downloaded HTML.
*/
@Override
public void process(Page page) {
//1. If this is the user search-result page (the entry page), queue the URL of every user's profile page as a new target.
if(page.getUrl().regex("https://www\\.zhihu\\.com/search\\?type=people&q=[\\s\\S]+").match()){
page.addTargetRequests(page.getHtml().xpath("//ul[@class='list users']/li/div/div[@class='body']/div[@class='line']").links().all());
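//Note: these class-based selectors target Zhihu's 2016-era markup and will
//silently match nothing if the site's DOM has changed since.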
}
//2. Otherwise this is a user profile page
else{
int count = num.incrementAndGet();//bump the user counter (thread-safe across the spider's 5 threads)
/*Instantiate a ZhihuUser entity to hold the extracted data for persistence.*/
ZhihuUser user = new ZhihuUser();
/*Extract the wanted fields from the downloaded profile page, mostly via xpath.*/
/*For readability, each value is stored in a local variable first and assigned to the entity below.*/
String name = page.getHtml().xpath("//div[@class='title-section ellipsis']/span[@class='name']/text()").get();
String identity = page.getHtml().xpath("//div[@class='title-section ellipsis']/span[@class='bio']/@title").get();
String location = page.getHtml().xpath("//div[@class='item editable-group']/span[@class='info-wrap']/span[@class='location item']/@title").get();
String profession = page.getHtml().xpath("//div[@class='item editable-group']/span[@class='info-wrap']/span[@class='business item']/@title").get();
boolean isMale = page.getHtml().xpath("//span[@class='item gender']/i[@class='icon icon-profile-male']").match();
boolean isFemale = page.getHtml().xpath("//span[@class='item gender']/i[@class='icon icon-profile-female']").match();
/*Some Zhihu users hide their gender or never set it, so both flags have to be checked.*/
int sex;
if(isMale&&!isFemale) sex=1;//1 = male
else if(!isMale&&isFemale) sex=0;//0 = female
else sex=2;//2 = unknown
String school = page.getHtml().xpath("//span[@class='education item']/@title").get();
String major = page.getHtml().xpath("//span[@class='education-extra item']/@title").get();
String recommend = page.getHtml().xpath("//span[@class='fold-item']/span[@class='content']/@title").get();
String picUrl = page.getHtml().xpath("//div[@class='body clearfix']/img[@class='Avatar Avatar--l']/@src").get();
/*The counters may be absent on some profiles, so parse them defensively (see parseIntOrZero below).*/
int agree = parseIntOrZero(page.getHtml().xpath("//span[@class='zm-profile-header-user-agree']/strong/text()").get());
int thanks = parseIntOrZero(page.getHtml().xpath("//span[@class='zm-profile-header-user-thanks']/strong/text()").get());
int ask = parseIntOrZero(page.getHtml().xpath("//div[@class='profile-navbar clearfix']/a[2]/span[@class='num']/text()").get());
int answer = parseIntOrZero(page.getHtml().xpath("//div[@class='profile-navbar clearfix']/a[3]/span[@class='num']/text()").get());
int article = parseIntOrZero(page.getHtml().xpath("//div[@class='profile-navbar clearfix']/a[4]/span[@class='num']/text()").get());
int collection = parseIntOrZero(page.getHtml().xpath("//div[@class='profile-navbar clearfix']/a[5]/span[@class='num']/text()").get());
//Populate the entity
user.setKey(keyword);
user.setName(name);
user.setIdentity(identity);
user.setLocation(location);
user.setProfession(profession);
user.setSex(sex);
user.setSchool(school);
user.setMajor(major);
user.setRecommend(recommend);
user.setPicUrl(picUrl);
user.setAgree(agree);
user.setThanks(thanks);
user.setAsk(ask);
user.setAnswer(answer);
user.setArticle(article);
user.setCollection(collection);
System.out.println("num:"+num +" " + user.toString());//输出对象
zhihuDao.saveUser(user);//保存用户信息到数据库
}
}
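/**
* Defensive counter parsing (a minimal sketch added for illustration, not part of
* the original code): the xpath lookups in process() return null when an element
* is missing, and a bare Integer.parseInt would then throw. Assumption: treating
* a missing or non-numeric counter as 0 is acceptable for this crawler.
*/
private static int parseIntOrZero(String text) {
if (text == null) return 0;
try {
return Integer.parseInt(text.trim());
} catch (NumberFormatException e) {
return 0;
}
}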
@Override
public Site getSite() {
return this.site;
}
public static void main(String[] args) {
long startTime, endTime;
System.out.println("======== Zhihu user crawler starting ========");
startTime = System.currentTimeMillis();
//Entry point: https://www.zhihu.com/search?type=people&q=xxx, where xxx is the search keyword
Spider.create(new ZhiHuUserPageProcessor()).addUrl("https://www.zhihu.com/search?type=people&q="+keyword).thread(5).run();
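//Design note (assumption, not from the original author): persistence here happens
//inside process() via ZhihuDao; WebMagic also provides a Pipeline abstraction
//(us.codecraft.webmagic.pipeline.Pipeline) that would let extraction and storage
//be separated via Spider.addPipeline(...).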
endTime = System.currentTimeMillis();
System.out.println("======== Zhihu user crawler finished ========");
System.out.println("Crawled "+num.get()+" users in "+(endTime-startTime)/1000+"s");
}
}