Implementing a Book Crawler in Java (Jsoup)
This article walks through building a book crawler in Java with Jsoup. It should be a useful reference for anyone solving a similar problem.
- Initial Setup
- Getting Started
Initial Setup
The project will be published on Git and updated there.
1. Target site: https://www.qb5.tw/
The crawler starts from this page.
2. Database tables:
   1. novel: basic information about each novel
   2. novel_chapter: chapter titles
   3. novel_detail: the content of each chapter
3. The project is built on Spring Boot and uses Jsoup for crawling. Create a Spring Boot project and add the following dependencies:
```xml
<dependencies>
    <!-- Spring MVC -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
    </dependency>
    <!-- Spring Data JPA -->
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <!-- MySQL connector -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.11</version>
    </dependency>
    <!-- HttpClient -->
    <dependency>
        <groupId>org.apache.httpcomponents</groupId>
        <artifactId>httpclient</artifactId>
    </dependency>
    <!-- Jsoup -->
    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.3</version>
    </dependency>
    <!-- Utilities -->
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-lang3</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
</dependencies>
```
The application.yml file:
```yaml
spring:
  # Database configuration (MySQL 8)
  datasource:
    driver-class-name: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/crawler?useSSL=false&useUnicode=true&characterEncoding=utf-8&serverTimezone=Asia/Shanghai
    username: root
    password: 123456
  # JPA configuration
  jpa:
    database: MySQL
    show-sql: true
```
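If you don't want to create the three tables by hand, one option (an assumption on my part, not part of the original setup) is to let Hibernate generate them from the entity classes below by extending the `jpa` section:

```yaml
spring:
  jpa:
    hibernate:
      # Assumption: have Hibernate create/update the novel, novel_chapter
      # and novel_detail tables from the @Entity classes.
      ddl-auto: update
```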
Getting Started
1. First, create the POJO classes that map to the database tables.
@Entity @Table(name = "novel") public class Novel { //主键 @Id @GeneratedValue(strategy = GenerationType.IDENTITY) private Long id; //小说名称 private String novel_name; //作者 private String author; //封面 private String img; //文章类型 private String type; //文章状态 private String status; //文章受欢迎程度 private String pop; //文章简介 private String brief; //章节个数 private Long chapter_num; // 生成get/set方法 }
@Entity @Table(name = "novel_chapter") public class NovelChapter { //主键 @Id @GeneratedValue(strategy = GenerationType.IDENTITY) private Long id; //小说id private Long novel_id; //章节名称 private String chapter_name; public NovelChapter() { } }
@Entity @Table(name = "novel_detail") public class NovelDetail { //主键 @Id @GeneratedValue(strategy = GenerationType.IDENTITY) private Long id; //章节id private Long chapter_id; //章节内容 private String chapter_content; public NovelDetail() {} public NovelDetail( Long chapter_id, String chapter_content) { this.chapter_id = chapter_id; this.chapter_content = chapter_content; } }
2. Create the corresponding dao, service, and impl classes.
The dao interfaces extend JpaRepository, which supplies the database operations.
dao:
```java
public interface NovelDao extends JpaRepository<Novel, Long> {
}

public interface NovelDetailDao extends JpaRepository<NovelDetail, Long> {
}

public interface NovelChapterDao extends JpaRepository<NovelChapter, Long> {
}
```
The service layer exposes the JpaRepository operations: save persists an entity to the database, and findAll checks whether matching records already exist.
```java
public interface NovelService {
    void save(Novel item);
    List<Novel> findAll(Novel item);
}

public interface NovelDetailService {
    void save(NovelDetail item);
    List<NovelDetail> findAll(NovelDetail item);
}

public interface NovelChapterService {
    void save(NovelChapter item);
    List<NovelChapter> findAll(NovelChapter item);
}
```
The impl classes implement the service methods; the other implementations follow the same pattern:
```java
@Service
public class NovelChapterServiceImpl implements NovelChapterService {

    @Autowired
    private NovelChapterDao itemDao;

    @Override
    @Transactional
    public void save(NovelChapter item) {
        this.itemDao.save(item);
    }

    @Override
    public List<NovelChapter> findAll(NovelChapter item) {
        // Build a query-by-example probe from the given entity
        Example<NovelChapter> example = Example.of(item);
        // Query using the example as the match condition
        return this.itemDao.findAll(example);
    }
}
```
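The findAll above relies on Spring Data's query-by-example: every non-null field of the probe entity becomes an equality condition, which is how the crawler later checks whether a record was already stored. A minimal usage sketch, assuming the two-argument NovelChapter constructor shown earlier and made-up values:

```java
// Probe entity: the non-null fields (novel_id, chapter_name) become the
// WHERE conditions; the null id field is ignored by Example.of.
NovelChapter probe = new NovelChapter(1L, "Chapter One");  // hypothetical values
List<NovelChapter> hits = novelChapterService.findAll(probe);
boolean alreadyStored = !hits.isEmpty();
```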
The crawler itself:
```java
package com.itcrawler.itcrawler.jd.controller;

import com.itcrawler.itcrawler.jd.pojo.Novel;
import com.itcrawler.itcrawler.jd.pojo.NovelChapter;
import com.itcrawler.itcrawler.jd.pojo.NovelDetail;
import com.itcrawler.itcrawler.jd.service.NovelChapterService;
import com.itcrawler.itcrawler.jd.service.NovelDetailService;
import com.itcrawler.itcrawler.jd.service.NovelService;
import com.itcrawler.itcrawler.jd.util.HttpUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import java.util.List;

@Component
public class ItemTask {

    @Autowired
    private HttpUtils httpUtils;
    @Autowired
    private NovelService itemService;
    @Autowired
    private NovelChapterService novelChapterService;
    @Autowired
    private NovelDetailService novelDetailService;

    // Crawl the home page. fixedDelay: wait this long after one run
    // finishes before starting the next.
    @Scheduled(fixedDelay = 100 * 1000)
    public String first() {
        String url = "https://www.qb5.tw/";
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // In Jsoup selectors, # matches an id and . matches a class. The
        // selector below walks down from div#main to div#mainleft, then
        // div.titletop, li.top, div.pic, and finally the <a> links we need.
        // See the Jsoup documentation if this syntax is unfamiliar.
        Elements ele = doc.select("div#main div#mainleft div.titletop li.top div.pic a");
        for (int i = 0; i < ele.size(); i++) {
            String href = ele.get(i).attr("href");
            this.parse(href);
        }
        return "first";
    }

    // Parse one novel's page and store its data.
    private void parse(String url) {
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        Novel novel = new Novel();
        // From the detail page we extract: title, cover image, author,
        // synopsis, genre, status (ongoing, ...), popularity, and later
        // the chapter count.
        Elements ele = doc.select("div#main div#bookdetail div.nav-mbx a[target=_blank]");
        // Genre
        novel.setType(ele.get(1).text());
        String tit = doc.select("div#main div#bookdetail div#info h1").text();
        String[] split = tit.split("/");
        novel.setNovel_name(split[0]);
        novel.setAuthor(split[1]);
        novel.setImg(doc.select("div#main div#bookdetail div#picbox div.img_in img").attr("src"));
        Elements select = doc.select("div#main div#bookdetail div#info p.booktag span");
        novel.setPop(select.get(0).text());
        novel.setStatus(select.get(1).text());
        // Cap the synopsis at 200 characters
        String intro = doc.select("div#main div#bookdetail div#info div#intro").text();
        String brief = intro.substring(0, Math.min(200, intro.length()));
        brief = brief.replace("<br>", "").replace(" ", "");
        novel.setBrief(brief);
        System.out.println(novel);
        List<Novel> list = this.itemService.findAll(novel);
        if (list.size() == 0) {
            // Not stored yet: save the novel
            this.itemService.save(novel);
            // Collect the chapter list
            Elements as = doc.select("div.zjbox dl.zjlist dd a");
            // Memory is limited, so only the first 10 chapters are crawled
            for (int i = 0; i < as.size() && i < 10; i++) {
                Element a = as.get(i);
                String href = a.attr("href");   // chapter page
                String title = a.text();        // chapter title
                List<Novel> all = this.itemService.findAll(novel);
                long artid = all.get(0).getId();
                // Store the chapter
                NovelChapter novelChapter = new NovelChapter(artid, title);
                if (this.novelChapterService.findAll(novelChapter).size() == 0) {
                    this.novelChapterService.save(novelChapter);
                    System.out.println("href:" + href + " title:" + title);
                    this.addToDb(url, novelChapter, href);
                }
            }
        }
    }

    private void addToDb(String url, NovelChapter novelChapter, String href) {
        System.out.println(novelChapter);
        if (novelChapter.getId() == null) return;
        Long chapterid = novelChapter.getId();
        url = url + href;
        System.out.println("url:" + url);
        String html = httpUtils.doGetHtml(url);
        Document doc = Jsoup.parse(html);
        // Extract the chapter body
        String content = doc.select("div#main div#readbox div#content").html();
        // Clean up: drop the leading navigation text baked into the page
        content = content.substring(90);
        content = content.replace("<br>", " ");
        content = content.replace("\n", "");
        // Strip leftover fragments of broken tags such as "br>"
        String test = "<br> ";
        for (int i = 0; i < test.length(); i++) {
            test = test.substring(i);
            content = content.replace(test, "");
        }
        NovelDetail novelDetail = new NovelDetail(chapterid, content);
        System.out.println(novelDetail);
        if (this.novelDetailService.findAll(novelDetail).size() == 0) {
            this.novelDetailService.save(novelDetail);
        }
    }
}
```
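One fragile spot above is `content.substring(90)`, which assumes the page always carries exactly 90 characters of leading boilerplate. A more defensive sketch (my alternative, not the original author's code) lets Jsoup strip the markup itself, using the same `div#content` selector:

```java
// Hypothetical helper: extract the chapter body as plain text.
// Element.text() drops all tags (including <br>) and decodes HTML
// entities such as &nbsp;, so no manual string surgery is needed.
private String extractChapterText(Document doc) {
    Element body = doc.select("div#main div#readbox div#content").first();
    if (body == null) {
        return ""; // selector missed; the site layout may have changed
    }
    return body.text();
}
```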
HttpUtils wraps the page requests:
```java
package com.itcrawler.itcrawler.jd.util;

import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;
import org.apache.http.util.EntityUtils;
import org.springframework.stereotype.Component;

import java.io.IOException;

@Component
public class HttpUtils {

    private PoolingHttpClientConnectionManager cm;

    public HttpUtils() {
        this.cm = new PoolingHttpClientConnectionManager();
        // Maximum number of connections in the pool
        this.cm.setMaxTotal(100);
        // Maximum number of connections per host
        this.cm.setDefaultMaxPerRoute(10);
    }

    // Download the page at the given URL
    public String doGetHtml(String url) {
        // Build an HttpClient backed by the connection pool
        CloseableHttpClient httpClient = HttpClients.custom().setConnectionManager(this.cm).build();
        // Create the GET request for the given URL
        HttpGet httpGet = new HttpGet(url);
        httpGet.setConfig(this.getConfig());
        // Set request headers; without them the site rejects the crawler
        httpGet.setHeader("User-Agent", "Mozilla/5.0 (Windows NT 6.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        httpGet.setHeader("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        httpGet.setHeader("Accept-Charset", "GB2312,utf-8;q=0.7,*;q=0.7");
        httpGet.setHeader("Accept-Encoding", "gzip, deflate");
        httpGet.setHeader("Accept-Language", "zh-cn,zh;q=0.5");
        httpGet.setHeader("Connection", "keep-alive");
        CloseableHttpResponse response = null;
        try {
            // Execute the request and read the response
            response = httpClient.execute(httpGet);
            if (response.getStatusLine().getStatusCode() == 200) {
                // Make sure the response body is not empty
                if (response.getEntity() != null) {
                    // The site is GBK-encoded
                    return EntityUtils.toString(response.getEntity(), "gbk");
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (response != null) {
                    response.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return "";
    }

    // Request configuration
    private RequestConfig getConfig() {
        return RequestConfig.custom()
                .setConnectTimeout(1000)            // max time to establish a connection
                .setConnectionRequestTimeout(500)   // max time to get a connection from the pool
                .setSocketTimeout(10000)            // max time for data transfer
                .build();
    }
}
```
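A quick sanity check of the wrapper could look like this (a minimal sketch, assuming the Jsoup imports from earlier; the URL is just the site root):

```java
// Fetch the home page through the pooled client and count its links.
HttpUtils httpUtils = new HttpUtils();
String html = httpUtils.doGetHtml("https://www.qb5.tw/");
Document doc = Jsoup.parse(html);
System.out.println("links found: " + doc.select("a").size());
```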
Finally, don't forget the application entry class:
```java
@SpringBootApplication
@EnableScheduling // enables scheduled tasks; without it the crawler never runs
public class ItcrawlerJdApplication {
    public static void main(String[] args) {
        SpringApplication.run(ItcrawlerJdApplication.class, args);
    }
}
```
And that's it. The crawled data ends up in the three tables; you can adapt the select conditions above to your own needs.