Learning Web Scraping: Parsing Web Pages with pyquery

2021/6/9 18:28:19


If you are familiar with the web, prefer CSS selectors, and know a little jQuery, then take a look at this parsing library: pyquery!

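Installation: pip install pyquery

1. Initialization:

To initialize pyquery, pass in HTML text to create a PyQuery object. There are three ways to do this: pass a string directly, pass a URL, or pass a filename.

① Pass a string directly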

from pyquery import PyQuery as pq

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

doc = pq(html_doc)
print(doc('a'))

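Passing the HTML string as an argument to the PyQuery class completes the initialization. Next, pass a CSS selector to the initialized object. Here we pass a, which selects all a nodes.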

② Pass a URL

from pyquery import PyQuery as pq

doc = pq(url='https://www.baidu.com')
print(doc('a'))

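With this, the PyQuery object first requests the URL and then initializes itself with the HTML it gets back. This is equivalent to passing the page source to the PyQuery class as a string: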

from pyquery import PyQuery as pq
import requests
doc = pq(requests.get('https://www.baidu.com').text)
print(doc('a'))

③ Pass a filename

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('a'))

This first reads the contents of the local file, then passes them to the PyQuery class as a string to complete the initialization.

2. Basic CSS selectors

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')
print(doc('#container .list li'))

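After initializing the PyQuery object, we pass in the CSS selector '#container .list li'. It means: first select the node whose id is container, then select all li nodes inside the nodes with class list within it.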

3. Finding nodes

① Descendant nodes

To find nodes, use the find() method. Its argument is a CSS selector.

from pyquery import PyQuery as pq
doc = pq(filename='demo.html')

# First, select the nodes whose class is list
items = doc('.list')
# Then call find() with a CSS selector to select the li nodes inside them.
# (find() selects every matching node; the result is a PyQuery object.)
lis = items.find('li')

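② Direct child nodes

find() searches all descendants of a node. To search only the direct children, use the children() method:

# Find the direct children of the selected nodes whose class is active.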
lis1 = items.children('.active')

③ Parent node (returns only the direct parent)

# Find the direct parent of the selected nodes.
lis2 = items.parent()

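④ Ancestor nodes (returns all ancestors; pass a CSS selector to keep only those that match)

# Find the ancestors of the selected nodes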
lis3 = items.parents()

⑤ Sibling nodes

# Find all siblings of the selected nodes
lis4 = items.siblings()

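4. Traversal

A pyquery selection may contain multiple nodes or a single node. Either way, its type is PyQuery.

① A single node can be printed directly or converted to a string: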

from pyquery import PyQuery as pq
doc = pq(html)  # html: the HTML string to parse
li = doc('.item-0.active')
print(li)
print(str(li))

② A result with multiple nodes has to be traversed:

For example, to iterate over every li node, use the items() method.

from pyquery import PyQuery as pq
doc = pq(html)
lis = doc('li').items()
for li in lis:
    print(li)

Calling items() returns a generator; iterate over it to get each value. Each value is still a PyQuery object.

5. Getting information

① Getting attributes

Method 1:

from pyquery import PyQuery as pq
doc = pq(html)
li = doc('.item-0.active')
print(li.attr('href'))

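This first selects the li node whose classes are item-0 and active (assuming it is the only node in the HTML that matches). The result is a PyQuery object. Calling attr() with the attribute name then returns that attribute's value. If the selected node does not have the attribute, attr() returns None.

Method 2:
print(li.attr.href)
You can also read the attribute through the attr property.

Note:
When the result contains multiple nodes, attr() returns only the first node's attribute. To get the attribute from every li node, iterate over them: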

from pyquery import PyQuery as pq
doc = pq(html)
a = doc('li')
for item in a.items():
    print(item.attr('href'))

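② Getting text

Method 1:
The text() method.
Note: this method ignores all HTML tags inside the node and returns only the plain text.

Method 2:
The html() method.
Note: this method returns the HTML inside the selected node.

Note: when the selection contains multiple nodes, html() returns the inner HTML of the first node only,
while text() returns the plain text of all nodes, joined by single spaces into one string.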

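6. Node manipulation

pyquery provides a set of methods for modifying nodes dynamically, such as adding a class to a node or removing a node.

① addClass (adds the given class value) and removeClass (removes the given class value); the example uses the equivalent snake_case names add_class and remove_class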

from pyquery import PyQuery as pq

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

doc = pq(html_doc)
li = doc('#link3')
print(li)
li.remove_class('sister')
print(li)
li.add_class('sister_new')
print(li)


② attr (operates on attributes), text and html (these two change the content inside a node)

from pyquery import PyQuery as pq

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

doc = pq(html_doc)
li = doc('#link3')
li.attr('name','new_attr')
print(li)
li.text('changed Tillie')
print(li)
li.html('<span>changed Tillie</span>')
print(li)

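attr() modifies an attribute: the first argument is the attribute name and the second is the attribute value.
text() replaces all the text inside the selected node with the string passed in.
html() replaces the content inside the selected node with the HTML passed in.

Note:
With only the attribute name as its argument, attr() gets the attribute's value; with a second argument, it sets the value.
With no argument, text() and html() get the plain text and the HTML inside the node; with an argument, they assign it.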

③ remove()

from pyquery import PyQuery as pq

html_doc = """
<div class="wrap">
    Hello,World
<p>This is a paragraph.</p>
</div>
"""
doc = pq(html_doc)
wrap = doc('.wrap')
print(wrap.text())

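Suppose we want to extract Hello,World without the string inside the p node. The code above first gets the text content of the node whose class is wrap: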