Supplementary notes on Python web-scraping libraries

2021/7/23 11:39:26


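urllib notes:

GET request:

import urllib.request
# Send a GET request
response = urllib.request.urlopen("http://www.baidu.com")  # urlopen returns a response object
print(response.read().decode('utf-8'))  # read the body as bytes and decode it as UTF-8
# Prints the HTML source of the Baidu home page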

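POST request (requires sending data):

Site for testing HTTP requests: http://httpbin.org/

import urllib.parse  # URL-encoding helpers
import urllib.request

# Send a POST request
data = bytes(urllib.parse.urlencode({"hello": "rosie"}), encoding="utf-8")
# The form data must be passed as bytes; in practice it would hold fields such as a username and password
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode('utf-8'))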
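
To see what urlencode produces and why passing data turns the call into a POST, the request can be built and inspected without sending it. This is a minimal sketch; the extra "lang" field is made up for illustration and is not part of the original example.

```python
import urllib.parse
import urllib.request

# Form fields to send; "lang" is a hypothetical extra field.
form = {"hello": "rosie", "lang": "python 3"}

# urlencode joins the pairs with "&" and escapes special characters
# (a space becomes "+"); encode() then turns the string into bytes.
data = urllib.parse.urlencode(form).encode("utf-8")
print(data)  # b'hello=rosie&lang=python+3'

# A Request that carries data is sent as POST; without data it is a GET.
req = urllib.request.Request("http://httpbin.org/post", data=data)
print(req.get_method())  # POST
```

Passing req to urllib.request.urlopen would then send the request exactly as in the example above.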

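Timeout handling: catch the exception and move on

import socket
import urllib.error
import urllib.request

# Timeout handling
try:
    response = urllib.request.urlopen("http://httpbin.org/get", timeout=0.01)
    print(response.read().decode('utf-8'))
except (urllib.error.URLError, socket.timeout) as e:
    # A timeout while connecting is wrapped in URLError; one while reading surfaces as socket.timeout
    print("time out!")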
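
When many pages are fetched in a loop, the same try/except can be wrapped in a small helper so that one slow page does not stop the whole run. The sketch below is one possible shape; the function name fetch and its default timeout are assumptions, not something the original code defines.

```python
import socket
import urllib.error
import urllib.request


def fetch(url, timeout=5.0):
    """Return the decoded page body, or None if the request fails or times out.

    Hypothetical helper for illustration only.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except (urllib.error.URLError, socket.timeout) as e:
        # URLError covers connection problems and HTTP errors;
        # socket.timeout covers a timeout while reading the response.
        print(f"skipping {url}: {e}")
        return None
```

A caller can then write html = fetch("http://httpbin.org/get", timeout=0.01) and simply skip the page when the result is None.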

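Taking Douban as an example, a plain GET request fails with HTTP error 418.
In that case, add information to the request headers so that the request looks like it comes from a browser.
The header values can be copied from the browser developer tools: Network ---> Headers.

url = "https://www.douban.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"}
req = urllib.request.Request(url=url, headers=headers)  # Request object that bundles the URL and the headers
response = urllib.request.urlopen(req)
print(response.read().decode('utf-8'))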
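
The header can be checked on the Request object before anything is sent, and the status code of a rejected request can be read from the HTTPError that urlopen raises. The following is a rough sketch; build_request and status_of are hypothetical helper names, and the User-Agent string is the one from the example above.

```python
import urllib.error
import urllib.request

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
}


def build_request(url):
    """Hypothetical helper: attach the browser-like headers to a URL."""
    return urllib.request.Request(url, headers=HEADERS)


def status_of(req, timeout=5.0):
    """Hypothetical helper: return the HTTP status code for a request."""
    try:
        with urllib.request.urlopen(req, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as e:
        # Error responses such as 418 are raised as HTTPError; e.code holds the status.
        return e.code


req = build_request("https://www.douban.com")
# Request stores header names with only the first letter capitalised.
print(req.get_header("User-agent"))
```

Calling status_of on a request built without the headers is a quick way to see whether a site rejects the default urllib User-Agent.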

BeautifulSoup notes:

'''
Beautiful Soup 4 converts a complex HTML document into a tree structure in which every node is a Python object. All objects fall into four kinds:

-Tag
-NavigableString
-BeautifulSoup
-Comment
'''
import bs4
from bs4 import BeautifulSoup
with open("baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
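# 1. Tag: a tag together with its contents; attribute-style access returns only the first matching tag
print(bs.title)
print(bs.head)
print(bs.a.attrs)  # all attributes of the first <a> tag, as a dictionary
# Example: {'class': ['toindex'], 'href': '/'}

# 2. NavigableString: the text inside a tag (a string)
print(bs.title.string)  # prints only the text content
# print(type(bs.title.string))

# 3. BeautifulSoup: represents the whole document
print(bs)  # contents of the entire document

# 4. Comment: a special NavigableString; its output does not include the comment markers

Each of the above returns only the first match in the document, so in my opinion they are not used often.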
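
The four object kinds are easier to tell apart on a tiny document than on the full Baidu page. Below is a minimal sketch that uses a made-up inline HTML string instead of baidu.html, so the markup and its values are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Made-up sample page; the <a> tag contains only an HTML comment.
html = (
    '<html><head><title>Demo</title></head>'
    '<body><a class="mnav" href="/news"><!--News--></a><p>hello</p></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

print(type(soup))               # the whole document: BeautifulSoup
print(type(soup.title))         # a tag: Tag
print(type(soup.title.string))  # text inside a tag: NavigableString
print(type(soup.a.string))      # comment inside a tag: Comment

print(soup.a.string)  # News  (printed without the <!-- --> markers)
print(soup.a.attrs)   # {'class': ['mnav'], 'href': '/news'}
```

Because Comment is a subclass of NavigableString, code that must ignore comments needs an explicit isinstance check against Comment.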

Traversing the document:

import bs4
from bs4 import BeautifulSoup
with open("baidu.html", "rb") as file:
    html = file.read()
bs = BeautifulSoup(html, "html.parser")
# Traversing the document
print(bs.head.contents)  # children of the <head> tag, returned as a list
print(bs.head.contents[0])  # first element of that list
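
Besides contents, Beautiful Soup offers a few other navigation attributes for walking the tree. The sketch below shows some of them on a made-up inline document (an assumption for illustration, not the saved Baidu page).

```python
from bs4 import BeautifulSoup

# Made-up sample page without whitespace between tags,
# so that siblings are tags rather than newline strings.
html = (
    "<html><head><title>Demo</title><style>p{}</style></head>"
    "<body><p>hi</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
head = soup.head

# .contents is a list of direct children.
print([child.name for child in head.contents])  # ['title', 'style']

# .children yields the same nodes lazily; .descendants walks the whole
# subtree, including the strings inside the child tags.
print(len(list(head.descendants)))  # 4

# .parent and .next_sibling move up and sideways in the tree.
print(soup.title.parent.name)        # head
print(soup.title.next_sibling.name)  # style
```

On a real page the whitespace between tags shows up as extra string nodes, so contents[0] is often a newline rather than a tag.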

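Searching the document

String filter
Regular expression
Function

# Searching the document
'''
# String filter: finds tags whose name matches the string exactly
t_list = bs.find_all("a")  # find all <a> tags
print(t_list)
#------------------------
import re
# Regular-expression search: each tag name is tested with the pattern's search() method
t_list = bs.find_all(re.compile("a"))  # all tags whose name contains "a"
print(t_list)
'''
# Function: pass in a function and search according to what it returns (for reference)
def name_is_exists(tag):
    return tag.has_attr("name")
# Returns the tags that have a name attribute
t_list = bs.find_all(name_is_exists)
print(t_list)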
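
The difference between the three filter kinds is clearest when all of them run against the same small document. This is a sketch on made-up inline HTML; the helper name has_name_attr is hypothetical and plays the role of name_is_exists above.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample page for illustration.
html = (
    '<html><head><title>t</title></head><body>'
    '<a href="/" name="home">Home</a>'
    '<table><tr><td>x</td></tr></table>'
    '<form name="f"></form>'
    '</body></html>'
)
soup = BeautifulSoup(html, "html.parser")

# String filter: the tag name must equal "a" exactly.
by_string = [tag.name for tag in soup.find_all("a")]
print(by_string)  # ['a']

# Regular expression: any tag whose name contains the letter "a".
by_regex = [tag.name for tag in soup.find_all(re.compile("a"))]
print(by_regex)  # ['head', 'a', 'table']


# Function filter: called once per tag, keeps the tag when it returns True.
def has_name_attr(tag):
    return tag.has_attr("name")


by_function = [tag.name for tag in soup.find_all(has_name_attr)]
print(by_function)  # ['a', 'form']
```

Anchoring the pattern, for example re.compile("^a$"), makes the regular-expression filter behave like the string filter.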

kwargs
text
limit

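# 2. kwargs parameters
# t_list = bs.find_all(id="head")
t_list = bs.find_all(class_=True)  # every tag that has a class attribute
for i in t_list:
    print(i)

# 3. text parameter (newer versions of Beautiful Soup call this argument string; text still works)
import re
# t_list = bs.find_all(text="hao123")
# t_list = bs.find_all(text=["hao123", "地图", "贴吧"])
t_list = bs.find_all(text=re.compile(r"\d"))  # find text that contains a digit
# A regular expression matches the strings inside tags that contain specific text
for i in t_list:
    print(i)

# 4. limit parameter
t_list = bs.find_all("a", limit=3)  # return at most three <a> tags
for i in t_list:
    print(i)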
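
The three keyword-style arguments can be compared side by side on a small document. The sketch below uses made-up inline HTML and the newer string argument in place of text, which assumes Beautiful Soup 4.4.0 or later.

```python
import re

from bs4 import BeautifulSoup

# Made-up sample fragment for illustration.
html = (
    '<div id="head">'
    '<a class="mnav" href="#">news 2021</a>'
    '<a class="mnav" href="#">hao123</a>'
    '<a href="#">map</a>'
    '<a class="more" href="#">more</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# kwargs: filter on attribute values; class_ avoids the Python keyword "class".
print(len(soup.find_all(id="head")))      # 1
print(len(soup.find_all(class_=True)))    # 3 (every tag that has a class)
print(len(soup.find_all(class_="mnav")))  # 2

# string: matches the text nodes themselves, not the tags around them.
digits = [str(s) for s in soup.find_all(string=re.compile(r"\d"))]
print(digits)  # ['news 2021', 'hao123']

# limit: stop after the given number of matches.
print(len(soup.find_all("a", limit=3)))  # 3
```

Combining a tag name with string, as in soup.find_all("a", string=re.compile(r"\d")), returns the matching tags instead of the bare strings.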

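CSS selectors (in my test, the child-tag and attribute selectors returned no results)

# CSS selectors
# t_list = bs.select('.mnav')  # select by class name
# t_list = bs.select('#u1')  # select by ID
# t_list = bs.select("a[class='mnav']")  # select by attribute
# t_list = bs.select("head > title")  # select by child relationship
t_list = bs.select('title')  # select by tag name
print(t_list[0].get_text())

# for i in t_list:
#     print(i)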
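
On a small hand-written document all five selector forms return matches, which suggests the empty results above may depend on the markup of the saved page rather than on the selectors themselves. This is a sketch under that assumption; the inline HTML is made up, and the multi-class link is included only to show one way an exact attribute match can come back empty.

```python
from bs4 import BeautifulSoup

# Made-up sample page for illustration.
html = (
    '<html><head><title>Baidu</title></head><body>'
    '<div id="u1">'
    '<a class="mnav" href="/news">news</a>'
    '<a class="mnav" href="/map">map</a>'
    '<a class="mnav extra" href="/more">more</a>'
    '</div></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select(".mnav")))             # 3: class selector matches any tag carrying mnav
print(len(soup.select("#u1")))               # 1: ID selector
print(len(soup.select("a[class='mnav']")))   # 2: attribute value must equal "mnav" exactly
print(len(soup.select("head > title")))      # 1: title as a direct child of head
print(soup.select("title")[0].get_text())    # Baidu
```

If the links on a real page carry several classes, the .mnav form is the more forgiving choice.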

xlwt notes

import xlwt

workbook = xlwt.Workbook(encoding="utf-8")  # create the workbook object
worksheet = workbook.add_sheet("sheet1")  # create a worksheet
worksheet.write(0, 0, "hello")  # write a cell: first argument is the row, second the column, third the content
workbook.save('student.xls')  # save the workbook
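
Scraped results are usually a list of rows rather than a single cell, so write is typically called inside nested loops. The sketch below is one way to do that; the function name save_rows and the sample column names in the usage note are assumptions for illustration.

```python
import xlwt


def save_rows(path, header, rows):
    """Hypothetical helper: write a header line and data rows to an .xls file."""
    workbook = xlwt.Workbook(encoding="utf-8")
    worksheet = workbook.add_sheet("sheet1")

    # Row 0 holds the column titles.
    for col, title in enumerate(header):
        worksheet.write(0, col, title)

    # Data rows start at row 1; each value goes into its own column.
    for row_idx, row in enumerate(rows, start=1):
        for col, value in enumerate(row):
            worksheet.write(row_idx, col, value)

    workbook.save(path)
```

For example, save_rows("student.xls", ["name", "score"], [["rosie", 90], ["tom", 85]]) would produce a sheet with one header row and two data rows.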
