采集数据的html解析方法
2020/11/7 8:15:42
本文主要是介绍采集数据的html解析方法,对大家解决编程问题具有一定的参考价值,需要的程序猿们随着小编来一起学习吧!
通过爬虫请求url一般会获取html数据,需要快捷进行文档解析,定位获取元素数据。Beautiful Soup 能够从HTML或XML文件中提取数据的Python库.可以通过转换器实现惯用的文档导航,查找,修改文档的方法。Beautiful Soup会极大的提高文档分析效率,减少研发的投入时间。
下面将展示BeautifulSoup4中所有主要特性,表明它适合做什么,如何工作和使用,并到达想要的效果和处理异常情况.
html_doc = """ <html><head><title>The Dormouse's story</title></head> <body> <p class="title"><b>The Dormouse's story</b></p> <p class="story">Once upon a time there were three little sisters; and their names were <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>; and they lived at the bottom of a well.</p> <p class="story">...</p> """
使用BeautifulSoup解析这段html,能够获取BeautifulSoup 的对象,并能按照标准的格式结构输出:
from bs4 import BeautifulSoup soup = BeautifulSoup(html_doc) print(soup.prettify()) # <html> # <head> # <title> # The Dormouse's story # </title> # </head> # <body> # <p class="title"> # <b> # The Dormouse's story # </b> # </p> # <p class="story"> # Once upon a time there were three little sisters; and their names were # <a class="sister" href="http://example.com/elsie" id="link1"> # Elsie # </a> # , # <a class="sister" href="http://example.com/lacie" id="link2"> # Lacie # </a> # and # <a class="sister" href="http://example.com/tillie" id="link2"> # Tillie # </a> # ; and they lived at the bottom of a well. # </p> # <p class="story"> # ... # </p> # </body> # </html>
一些浏览结构化数据的方法:
soup.title # <title>The Dormouse's story</title> soup.title.name # u'title' soup.title.string # u'The Dormouse's story' soup.title.parent.name # u'head' soup.p # <p class="title"><b>The Dormouse's story</b></p> soup.p['class'] # u'title' soup.a # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a> soup.find_all('a') # [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>] soup.find(id="link3") # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
从文档中找到所有特定字符<a>标签的链接:
for link in soup.find_all('a'): print(link.get('href')) # http://example.com/elsie # http://example.com/lacie # http://example.com/tillie
这篇关于采集数据的html解析方法的文章就介绍到这儿,希望我们推荐的文章对大家有所帮助,也希望大家多多支持为之网!
- 2024-11-24Vite多环境配置学习:新手入门教程
- 2024-11-23实现OSS直传,前端怎么实现?-icode9专业技术文章分享
- 2024-11-22在 HTML 中怎么实现当鼠标光标悬停在按钮上时显示提示文案?-icode9专业技术文章分享
- 2024-11-22html 自带属性有哪些?-icode9专业技术文章分享
- 2024-11-21Sass教程:新手入门及初级技巧
- 2024-11-21Sass学习:初学者必备的简单教程
- 2024-11-21Elmentplus入门:新手必看指南
- 2024-11-21Sass入门:初学者的简单教程
- 2024-11-21前端页面设计教程:新手入门指南
- 2024-11-21Elmentplus教程:初学者必备指南