Java Crawler Notes
2021/4/29 20:27:01
Building a crawler in Java with Jsoup
Environment
JDK 1.8
Dependency
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.10.2</version>
</dependency>
Demo
1. Create the thread pool
/**
 * Thread pool for the crawl tasks
 */
public static ExecutorService exec = Executors.newFixedThreadPool(10);
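The snippets in this article never shut the pool down. The following is a minimal sketch of the full lifecycle of a fixed pool (create, submit, wait, shut down). The class `CrawlPoolDemo` and the method `runTasks` are hypothetical names used only for illustration, and the trivial tasks stand in for real crawl work.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; not part of the original project.
class CrawlPoolDemo {

    // Runs taskCount trivial tasks on a fixed pool of 10 threads
    // and returns the sum of their results.
    static int runTasks(int taskCount) throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(10);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int i = 1; i <= taskCount; i++) {
                final int n = i;
                // Stand-in for a real crawl task: it just returns its own number.
                futures.add(exec.submit(() -> n));
            }
            int sum = 0;
            for (Future<Integer> f : futures) {
                sum += f.get(); // blocks until that particular task has finished
            }
            return sum;
        } finally {
            // Stop accepting new tasks and give running ones time to finish.
            exec.shutdown();
            exec.awaitTermination(30, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runTasks(5));
    }
}
```

A shared static pool like the one above usually lives for the whole application, in which case the `shutdown()` call would belong in the application's shutdown hook rather than after each batch.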
2. Query the database for the URLs to crawl
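log.info("Fetching the list of URLs to crawl from the database");
Example example = new Example(DemoPO.class);
example.createCriteria().andEqualTo("type", NUMONE);
// demoMapper is assumed to be the injected mapper instance
List<DemoPO> poList = demoMapper.selectByExample(example);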
3. Use CompletionService to retrieve the results asynchronously
CompletionService<List<DemoPO>> everyWeekCs = new ExecutorCompletionService<>(exec);
4. Submit the tasks to the thread pool
How a page is parsed depends on which source its URL comes from.
for (DemoPO po : poList) {
    if ("test1".equals(po.getSource())) {
        everyWeekCs.submit(() -> getEveryWeekPoFromDocument(po, NUMONE, day));
    }
    if ("test2".equals(po.getSource())) {
        everyWeekCs.submit(() -> getEveryWeekPoFromDocument(po, NUMTWO, day));
    }
    if ("test3".equals(po.getSource())) {
        everyWeekCs.submit(() -> getEveryWeekPoFromDocument(po, NUMTHREE, day));
    }
}
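As the number of sources grows, the chain of `if` checks can be replaced by a lookup table. The sketch below is one possible shape for that. The class `SourceDispatchDemo`, its methods, and the numeric values given to `NUMONE`, `NUMTWO` and `NUMTHREE` are assumptions, since the article does not show how those constants are defined.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical demo class; not part of the original project.
class SourceDispatchDemo {

    // Assumed values: the article does not show these constants' definitions.
    static final int NUMONE = 1;
    static final int NUMTWO = 2;
    static final int NUMTHREE = 3;

    // Maps a source name to the parser type used for its pages.
    static final Map<String, Integer> PARSER_TYPE_BY_SOURCE = new HashMap<>();
    static {
        PARSER_TYPE_BY_SOURCE.put("test1", NUMONE);
        PARSER_TYPE_BY_SOURCE.put("test2", NUMTWO);
        PARSER_TYPE_BY_SOURCE.put("test3", NUMTHREE);
    }

    // Returns the parser type for a source, or null if the source is unknown.
    static Integer parserTypeFor(String source) {
        return PARSER_TYPE_BY_SOURCE.get(source);
    }

    // Counts how many of the given sources would actually produce a task.
    // Knowing this number matters later when collecting results.
    static int countSubmittable(List<String> sources) {
        int submitted = 0;
        for (String source : sources) {
            if (parserTypeFor(source) != null) {
                submitted++; // in the real loop, the submit(...) call would go here
            }
        }
        return submitted;
    }

    public static void main(String[] args) {
        System.out.println(parserTypeFor("test2"));
        System.out.println(parserTypeFor("unknown"));
    }
}
```

Keeping a count of the tasks actually submitted is useful because entries with an unrecognized source produce no task at all.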
5. A simple crawl
Use the Jsoup API to parse the page by its tags and class names and extract the data.
List<DemoPO> list = new ArrayList<>();
String url = po.getUrl();
Connection.Response response = Jsoup.connect(url).execute();
if (response.statusCode() == 200) {
    // Parse the response already fetched instead of requesting the page a second time
    Document doc = response.parse();
    // CSS selector matching elements that carry all four classes
    List<Element> elements = doc.select(".xlayer02.yh.ohd.clear");
    for (Element element : elements) {
        // Fill a fresh object per entry rather than overwriting the input po
        DemoPO demoPo = new DemoPO();
        String title = element.select("a").text();
        demoPo.setTitle(title);
        // The hrefs are presumably protocol-relative ("//host/path"), hence the "http:" prefix
        String contentUrl = element.select("a").attr("href");
        Connection.Response res = Jsoup.connect("http:" + contentUrl).execute();
        if (res.statusCode() == 200) {
            Document contentDoc = res.parse();
            // Keep the HTML of every <p> inside the first content block
            String content = contentDoc.select(".xcc.font14.yh.ohd.clear").get(0).getElementsByTag("p").toString();
            demoPo.setContent(content);
            list.add(demoPo);
        }
    }
}
return list;
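Prepending `"http:"` to the link only works when every `href` is protocol-relative (`//host/path`). As a hedged alternative, the sketch below resolves a link against the URL of the page it was found on using only `java.net.URI`, which also handles absolute and path-relative links. The class `LinkResolveDemo`, the method `resolve`, and the example.com URLs are hypothetical.

```java
import java.net.URI;

// Hypothetical demo class; not part of the original project.
class LinkResolveDemo {

    // Resolves href against the URL of the page that contains it.
    // URI.create throws IllegalArgumentException for malformed input
    // (for example an href containing raw spaces).
    static String resolve(String pageUrl, String href) {
        return URI.create(pageUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/list/index.html";
        // Protocol-relative link: keeps the page's scheme, takes the link's host
        System.out.println(resolve(page, "//news.example.com/a/1.html"));
        // Root-relative link: keeps the page's scheme and host
        System.out.println(resolve(page, "/a/1.html"));
        // Path-relative link: resolved against the page's directory
        System.out.println(resolve(page, "2.html"));
    }
}
```

Jsoup itself offers `absUrl("href")` on an element for the same purpose, provided the document was loaded with a base URI, which is the case when it comes from a `Connection`.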
6. Collect the results produced by the worker threads
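List<DemoPO> list = new ArrayList<>();
// Take results in task-completion order to reduce the time spent blocked waiting
// Note: this assumes exactly one task was submitted per entry in poList; otherwise take() blocks indefinitely
for (int i = 0; i < poList.size(); i++) {
    list.addAll(everyWeekCs.take().get());
}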
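Two things can go wrong in the loop above: `take()` waits forever if fewer tasks were submitted than the loop expects, and a single failed task makes `get()` throw and aborts the whole batch. The sketch below shows one way to guard against both by counting submissions and catching `ExecutionException` per task. It is self-contained and uses plain strings in place of `DemoPO`. The class `CollectResultsDemo` and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; not part of the original project.
class CollectResultsDemo {

    // Submits every task, then gathers the results in completion order.
    // Results of failed tasks are skipped instead of aborting the batch.
    static List<String> collect(List<Callable<List<String>>> tasks) throws InterruptedException {
        ExecutorService exec = Executors.newFixedThreadPool(4);
        CompletionService<List<String>> cs = new ExecutorCompletionService<>(exec);
        try {
            int submitted = 0;
            for (Callable<List<String>> task : tasks) {
                cs.submit(task);
                submitted++; // count what was really submitted
            }
            List<String> all = new ArrayList<>();
            for (int i = 0; i < submitted; i++) {
                try {
                    all.addAll(cs.take().get());
                } catch (ExecutionException e) {
                    // The task threw; real code would log e.getCause() here.
                }
            }
            return all;
        } finally {
            exec.shutdown();
        }
    }

    // Runs three sample tasks, one of which fails, and returns the number of collected items.
    static int demo() throws InterruptedException {
        List<Callable<List<String>>> tasks = new ArrayList<>();
        tasks.add(() -> Arrays.asList("a", "b"));
        tasks.add(() -> { throw new IllegalStateException("simulated fetch failure"); });
        tasks.add(() -> Arrays.asList("c"));
        return collect(tasks).size();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```

Because results arrive in completion order, the order of the collected items is not guaranteed to match the order of submission.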