Web Scraping: The pyquery Parsing Library
2022/1/1 23:10:31
If you prefer CSS selectors and have some familiarity with jQuery, then pyquery is probably the better fit for you.
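Contents
Initialization
1. Initializing from a string
2. Initializing from a URL
3. Initializing from a file
Basic CSS Selectors
Finding Nodes
1. Child nodes
2. Parent nodes
3. Sibling nodes
Iteration
Getting Information
1. Getting attributes
2. Getting text
Node Manipulation
1. add_class and remove_class
2. attr, text and html
3. remove()
Pseudo-class Selectors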
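Initialization
As with Beautiful Soup, you initialize pyquery by passing in HTML text to create a PyQuery object. There are several ways to do this: pass a string directly, pass a URL, or pass a file name.
1. Initializing from a string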
from pyquery import PyQuery as pq

# The last li is deliberately left unclosed; the parser repairs it.
html = '''
<div>
    <ul>
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
print(doc("li"))

Output:
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
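Passing the HTML string to the PyQuery class as an argument completes the initialization. The resulting object can then be called with a CSS selector; passing li selects all li nodes.
2. Initializing from a URL
The argument does not have to be an HTML string. You can also pass the URL of a web page, in which case you specify it with the url keyword argument.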
from pyquery import PyQuery as pq

doc = pq(url="https://www.taobao.com/")
print(doc("title"))

Output:
<title>淘宝网 - 淘!我喜欢</title>
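When called this way, the PyQuery object first requests the URL and then initializes itself with the HTML it receives. This is equivalent to fetching the page source yourself and passing it in as a string: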
from pyquery import PyQuery as pq
import requests

doc = pq(requests.get("https://www.taobao.com").text)
print(doc("title"))

Output:
<title>淘宝网 - 淘!我喜欢</title>
3. Initializing from a file
Besides a URL, you can also pass the name of a local file by using the filename keyword argument:
from pyquery import PyQuery as pq

# test.html holds the same HTML as the string example above.
doc = pq(filename="test.html")
print(doc("li"))

Output:
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
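pyquery reads the contents of the local file and then initializes the PyQuery object from that content, just as if it had been passed in as a string.
Basic CSS Selectors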
from pyquery import PyQuery as pq

html = '''
<div id="container">
    <ul class="list">
        <li class="item-0"><a href="link1.html">first item</a></li>
        <li class="item-1"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
# id selector, then class selector, then tag selector
print(doc('#container .list li'))
print(type(doc('#container .list li')))

Output:
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
<class 'pyquery.pyquery.PyQuery'>
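This selects the node whose id is container, then all li nodes inside the node with class list within it, and prints them. The result is again a PyQuery object.
Finding Nodes
1. Child nodes
To look up nodes inside a selection, use the find() method.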
from pyquery import PyQuery as pq

html = '''
<div>
    <ul class="list">
        <li class="item-0 active"><a href="link1.html">first item</a></li>
        <li class="item-1 active"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
print(doc('.list'))
print(type(doc('.list')))
# find() searches inside the current selection
print(doc('.list').find('li'))
print(type(doc('.list').find('li')))

Output:
<ul class="list">
    <li class="item-0 active"><a href="link1.html">first item</a></li>
    <li class="item-1 active"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
<class 'pyquery.pyquery.PyQuery'>
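Here we select the node with class list, call find() with the CSS selector li, and print the result. find() selects every node that matches, and the result is of type PyQuery.
Note that find() searches all descendants of the node, not just its direct children. If you only want the direct children, use the children() method instead: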
from pyquery import PyQuery as pq

html = '''
<div>
    <ul class="list">
        <li class="item-0 active"><a href="link1.html">first item</a></li>
        <li class="item-1 active"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
# children() returns only the direct children
lis = items.children()
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
To filter the children by a condition, pass a CSS selector to children(). For example, to keep only the children whose class includes active, pass the selector .active:
from pyquery import PyQuery as pq

html = '''
<div>
    <ul class="list">
        <li class="item-0 active"><a href="link1.html">first item</a></li>
        <li class="item-1 active"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
lis = items.children(".active")
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
2. Parent nodes
Use the parent() method to get the direct parent of a node.
from pyquery import PyQuery as pq

html = '''
<div>
    <ul class="list">
        <li class="item-0 active"><a href="link1.html">first item</a></li>
        <li class="item-1 active"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a>
    </ul>
</div>
'''
doc = pq(html)
items = doc('.list')
# parent() returns the direct parent, here the enclosing div
lis = items.parent()
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<div>
    <ul class="list">
        <li class="item-0 active"><a href="link1.html">first item</a></li>
        <li class="item-1 active"><a href="link2.html">second item</a></li>
        <li class="item-inactive"><a href="link3.html">third item</a></li>
        <li class="item-1"><a href="link4.html">fourth item</a></li>
        <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
</div>
To get all ancestor nodes rather than just the direct parent, use parents():
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list')
# parents() returns every ancestor, outermost first
lis = items.parents()
print(type(lis))
print(lis)

Output:
<class 'pyquery.pyquery.PyQuery'>
<html><body><div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
    </div>
</div>
</body></html><body><div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
    </div>
</div>
</body><div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
    </div>
</div>
<div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a> </li></ul>
    </div>
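The result contains every ancestor of the list. From our own markup these are the node with class warp and the node with id container; in the output above they are preceded by the html and body nodes that the HTML parser wrapped around the fragment.
To pick out just one of these ancestors, pass a CSS selector to parents():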
items = doc('.list')
lis = items.parents(".warp")
3. Sibling nodes
Use the siblings() method to get the sibling nodes.
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# the node that has both classes item-0 and active
items = doc('.list .item-0.active')
print(items.siblings())

Output:
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
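This selects the node inside the class list node that has both the item-0 and active classes, which is the first li node. It has four siblings: the second through fifth li nodes.
You can filter the siblings further by passing a CSS selector to siblings():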
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
items = doc('.list .item-0.active')
print(items.siblings(".active"))

Output:
<li class="item-1 active"><a href="link2.html">second item</a></li>
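Iteration
A pyquery selection may contain several nodes or a single node. Either way its type is PyQuery; it does not return a list the way Beautiful Soup does.
A single node can be printed directly or converted to a string: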
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.list .item-0.active')
print(li)
print(str(li))

Output:
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-0 active"><a href="link1.html">first item</a></li>
When the selection contains several nodes, you need to loop over them to get each one:
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
# items() yields each matched node as its own PyQuery object
lis = doc('li').items()
print(type(lis))
for li in lis:
    print(li, type(li))

Output:
<class 'generator'>
<li class="item-0 active"><a href="link1.html">first item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<class 'pyquery.pyquery.PyQuery'>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
<class 'pyquery.pyquery.PyQuery'>
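Calling items() returns a generator. Iterating over it yields each li node as a PyQuery object.
Getting Information
1. Getting attributes
Once you have a node of type PyQuery, you can call attr() to get the value of one of its attributes.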
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(type(a))
# two equivalent ways to read the href attribute
print(a.attr("href"))
print(a.attr.href)

Output:
<class 'pyquery.pyquery.PyQuery'>
link1.html
link1.html
2. Getting text
After selecting a node, a common next step is to get the text inside it, which you can do by calling the text() method.
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
a = doc('.item-0.active a')
print(a)
print(a.text())

Output:
<a href="link1.html">first item</a>
first item
text() ignores all HTML markup inside the node and returns only the plain text. If you want the HTML inside the node instead, use the html() method.
# html is the same string as in the previous example
doc = pq(html)
li = doc('.item-0.active')
print(li)
print(li.html())

Output:
<li class="item-0 active"><a href="link1.html">first item</a></li>
<a href="link1.html">first item</a>
This returns all of the HTML inside the li node.
So what do text() and html() return when the selection contains several nodes?
# html is the same string as in the previous example
doc = pq(html)
li = doc('li')
print(li.html())
print(li.text())
print(type(li.text()))

Output:
<a href="link1.html">first item</a>
first item second item third item fourth item fifth item
<class 'str'>
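html() returns the inner HTML of the first li node only, whereas text() returns the plain text of all li nodes joined by single spaces, so the result is one string. To get the inner HTML of every node, iterate over the selection.
Node Manipulation
1. add_class and remove_class
The add_class and remove_class methods change a node's class attribute dynamically.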
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.remove_class("active")   # drop the active class
print(li)
li.add_class("active")      # put it back
print(li)

Output:
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-0 active"><a href="link1.html">first item</a></li>
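We call remove_class() to remove the active class from the li node, then call add_class() to add it back. The li node is printed after each step, three times in total: in the second output the active class is gone, and in the third it has been added back.
2. attr, text and html
You can use attr() to modify attributes, and text() and html() to change the content inside a node.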
# html is the same string as in the previous example
doc = pq(html)
li = doc('.item-0.active')
print(li)
li.attr("name", "link")                  # set an attribute
print(li)
li.text("changed item")                  # replace the content with plain text
print(li)
li.html("<span>changed item</span>")     # replace the content with HTML
print(li)

Output:
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-0 active" name="link"><a href="link1.html">first item</a></li>
<li class="item-0 active" name="link">changed item</li>
<li class="item-0 active" name="link"><span>changed item</span></li>
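If attr() is given only the attribute name, it returns that attribute's value; if a second argument is given, it sets the attribute to that value. Likewise, text() and html() called without arguments return the node's plain text and inner HTML; called with an argument, they assign it.
3. remove()
As the name suggests, remove() deletes nodes. This can sometimes make extracting information much easier.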
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
doc = pq(html)
wrap = doc(".wrap")
print(wrap.text())

Output:
Hello, World This is a paragraph.
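Suppose we want only the div's own leading text, without the text of the p node inside it. We can first select the wrap node, then remove its p node, and only then call text():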
from pyquery import PyQuery as pq

html = '''
<div class="wrap">
    Hello, World
    <p>This is a paragraph.</p>
</div>
'''
doc = pq(html)
wrap = doc(".wrap")
# remove the p node so its text no longer appears in text()
wrap.find("p").remove()
print(wrap.text())

Output:
Hello, World
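For other operations, see: http://pyquery.readthedocs.io/en/latest/api.html
Pseudo-class Selectors
Another important reason CSS selectors are so powerful is that they support a wide variety of pseudo-class selectors, for example selecting the first node, the last node, nodes at odd or even positions, or nodes that contain a given text.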
from pyquery import PyQuery as pq

html = '''
<div class="warp">
    <div id="container">
        <ul class="list">
            <li class="item-0 active"><a href="link1.html">first item</a></li>
            <li class="item-1 active"><a href="link2.html">second item</a></li>
            <li class="item-inactive"><a href="link3.html">third item</a></li>
            <li class="item-1"><a href="link4.html">fourth item</a></li>
            <li class="item-0"><a href="link5.html">fifth item</a>
        </ul>
    </div>
</div>
'''
doc = pq(html)
li = doc('li:first-child')
print(li)
li = doc('li:last-child')
print(li)
li = doc('li:nth-child(2)')
print(li)
li = doc('li:gt(2)')             # index greater than 2 (zero-based)
print(li)
li = doc("li:nth-child(2n)")     # even positions
print(li)
li = doc("li:contains(second)")
print(li)

Output:
<li class="item-0 active"><a href="link1.html">first item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a> </li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-1 active"><a href="link2.html">second item</a></li>
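These pseudo-class selectors select, in order: the first li node, the last li node, the second li node, the li nodes after the third, the li nodes at even positions, and the li nodes containing the text second. Note that :gt() and :contains() are jQuery-style extensions rather than standard CSS3 pseudo-classes. For more on CSS selectors, see http://www.w3school.com.cn/css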