Scraping Links from a News Page with Python 3

2021/7/13 11:36:59


Tool used: Jupyter Notebook

Example page: NetEase News, https://3g.163.com/touch/news?referFrom=

Import the requests library (note that the name is requests, not request):

import requests    

Fetch the page source:

r = requests.get("https://3g.163.com/touch/news")
r.encoding = "utf-8"
r.text


(The output is too long to include here.)
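Outside a notebook, it can help to set a timeout and check the HTTP status before reading the response text. The following is a minimal sketch rather than part of the original walkthrough: the helper name `fetch_html`, the 10-second timeout, and the injectable `get` parameter are all assumptions made for illustration.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page source as text, raising on HTTP errors.

    `get` defaults to requests.get. It is a parameter only so that the
    function can be exercised without network access (an assumption of
    this sketch, not something the original code does).
    """
    r = get(url, timeout=timeout)
    r.raise_for_status()   # raises requests.HTTPError on 4xx/5xx responses
    r.encoding = "utf-8"   # same explicit encoding as in the code above
    return r.text
```

A call such as `fetch_html("https://3g.163.com/touch/news")` would then take the place of the three lines above.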

Import the html module from lxml:

from lxml import html

Parse the source into an element tree:

tree = html.fromstring(r.text)
tree


<Element html at 0x1d70dffe130>

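Extract the headline titles. Write the XPath from the element information shown in the browser's developer tools (F12). The first expression returns text; the second returns elements: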

tree.xpath("//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]/text()")


tree.xpath("//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]")
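To see the difference between the two kinds of result without fetching the live page, here is a small sketch on a made-up HTML fragment. The markup, the example.com URLs, and the variable names are assumptions for illustration only; the real NetEase page is structured differently and changes over time.

```python
from lxml import html

# Hypothetical fragment that only mimics the class names used in the XPath above.
sample = """
<div class="tab-content">
  <article><a href="//example.com/a.html"><span class="title">First headline</span></a></article>
  <article><a href="https://example.com/b.html"><span class="title">Second headline</span></a></article>
</div>
"""
sample_tree = html.fromstring(sample)

path = "//div[contains(@class, 'tab-content')]//*[contains(@class, 'title')]"

texts = sample_tree.xpath(path + "/text()")  # a list of strings
elements = sample_tree.xpath(path)           # a list of element objects

print(texts)
print([el.tag for el in elements])
print([el.text_content() for el in elements])
```

The element form is the more flexible of the two: each element can be queried further or read with `text_content()`, whereas the `/text()` form gives only the strings.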

Extract the links. Selecting the @href attribute returns a list of strings rather than elements:

t = tree.xpath("//div[contains(@class, 'tab-content')]//article/a/@href")
t


['//3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0',
 '//3g.163.com/news/article/GEP9GL4K000189FH.html?clickfrom=channel2018_news_newsList#offset=1',
 '//3g.163.com/news/article/GENUISCU000189FH.html?clickfrom=channel2018_news_newsList#offset=2',
 '//3g.163.com/news/article/GEKDOC04000189FH.html?clickfrom=channel2018_news_newsList#offset=3',
 '//3g.163.com/news/article/GENIFQ29053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14890',
 '//3g.163.com/news/article/GEOFCFP5053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14891',
 '//3g.163.com/news/article/GEP44IPC05503FCU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14892',
 '//3g.163.com/news/article/GENR4E7O0515CCSC.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14893',
 '//3g.163.com/news/article/GENSJU9B0001899O.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14894',
 '//3g.163.com/news/article/GENC1JKF051795VD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14895',
 '//3g.163.com/news/article/GEKMO4J80512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14896',
 '//3g.163.com/news/article/GEPCQS8000258152.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14897',
 '//3g.163.com/news/article/GENG81J405528G7P.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14898',
 '//3g.163.com/news/article/GENTU50O05390TQD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14899',
 '//3g.163.com/news/article/GEKDCO2T05238V2G.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14900',
 '//3g.163.com/news/article/GENEQK0K05527WCX.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14901',
 '//3g.163.com/news/article/GEO9ATNH05527EP3.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14902',
 '//3g.163.com/news/article/GENGMEH505128ELF.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14903',
 '//3g.163.com/news/article/GENKOU0S0552C180.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14904',
 '//3g.163.com/news/article/GEO70EKO051796Q9.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14905',
 '//3g.163.com/news/article/GEOC9O4D051484S5.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14906',
 '//3g.163.com/news/article/GEMVDSV90534M1TZ.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14907',
 '//3g.163.com/news/article/GENU1BE505148JTU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14908',
 '//3g.163.com/news/article/GEMV89510534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14909',
 '//3g.163.com/news/article/GENOLQAF0537A693.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14910',
 '//3g.163.com/news/article/GENBORD10537N9PG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14911',
 '//3g.163.com/news/article/GENNO2UA05521A2M.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14912',
 '//3g.163.com/news/article/GEKS7KAG0552CPF4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14913',
 '//3g.163.com/news/article/GEN0IBGO0512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14914',
 '//3g.163.com/news/article/GEN9DQOA0534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14915',
 '//3g.163.com/news/article/GENASTAJ051100DH.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14916',
 '//3g.163.com/news/article/GEN5LQCK0517KC40.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14917',
 '//3g.163.com/news/article/GEN19FNP00058781.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14918',
 '//3g.163.com/news/article/GEN0DVG70514R9P4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14919',
 'https://3g.163.com/news/article/EUM2KO9N000189FH.html?#offset=0',
 'https://3g.163.com/news/article/EUM24BRS000189FH.html?#offset=1',
 'https://3g.163.com/news/article/EUJ790P7000189FH.html?#offset=2',
 'https://3g.163.com/news/article/EUJ76SJ2000189FH.html?#offset=3',
 'https://3g.163.com/news/article/EUJ75M7V000189FH.html?#offset=4',
 'https://3g.163.com/news/article/EUJ72ESB000189FH.html?#offset=5',
 'https://3g.163.com/news/article/EUJ70D5R000189FH.html?#offset=6',
 'https://3g.163.com/news/article/E6KATJC70514HDK6.html?#offset=7',
 'https://3g.163.com/news/article/E6KA44UI0514HDK6.html?#offset=8',
 'https://3g.163.com/news/article/E6K77H210514HDK6.html?#offset=9']
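The title query and the link query each return a separate flat list, so nothing guarantees that the two lists line up. If each title needs to stay paired with its link, one option is to loop over the article elements and query relative to each one. This is a sketch on a hypothetical fragment; the assumption that every article holds exactly one link and one title element is mine, not something confirmed for the real page.

```python
from lxml import html

# Hypothetical fragment; the real page structure may differ.
sample = """
<div class="tab-content">
  <article><a href="//example.com/a.html"><span class="title">First headline</span></a></article>
  <article><a href="https://example.com/b.html"><span class="title">Second headline</span></a></article>
</div>
"""
doc = html.fromstring(sample)

pairs = []
for article in doc.xpath("//div[contains(@class, 'tab-content')]//article"):
    # A leading "." makes the path relative to this article element.
    hrefs = article.xpath("./a/@href")
    titles = article.xpath(".//*[contains(@class, 'title')]/text()")
    if hrefs and titles:  # skip articles that lack either piece
        pairs.append((titles[0].strip(), hrefs[0]))

for title, href in pairs:
    print(title, href)
```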

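Some of the links do not start with https:. Inspecting the original site shows that these links are followed directly from the page itself: they are protocol-relative URLs, which a browser resolves using the scheme of the page that contains them. We therefore add the missing scheme. Import the urljoin function from urllib.parse: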

from urllib.parse import urljoin

Use urljoin to resolve each scraped link against the page URL:

for i in t:
    x = urljoin("https://3g.163.com/touch/news?referFrom=", i)
    print(x)


https://3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0
https://3g.163.com/news/article/GEP9GL4K000189FH.html?clickfrom=channel2018_news_newsList#offset=1
https://3g.163.com/news/article/GENUISCU000189FH.html?clickfrom=channel2018_news_newsList#offset=2
https://3g.163.com/news/article/GEKDOC04000189FH.html?clickfrom=channel2018_news_newsList#offset=3
https://3g.163.com/news/article/GENIFQ29053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14890
https://3g.163.com/news/article/GEOFCFP5053469LG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14891
https://3g.163.com/news/article/GEP44IPC05503FCU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14892
https://3g.163.com/news/article/GENR4E7O0515CCSC.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14893
https://3g.163.com/news/article/GENSJU9B0001899O.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14894
https://3g.163.com/news/article/GENC1JKF051795VD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14895
https://3g.163.com/news/article/GEKMO4J80512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14896
https://3g.163.com/news/article/GEPCQS8000258152.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14897
https://3g.163.com/news/article/GENG81J405528G7P.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14898
https://3g.163.com/news/article/GENTU50O05390TQD.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14899
https://3g.163.com/news/article/GEKDCO2T05238V2G.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14900
https://3g.163.com/news/article/GENEQK0K05527WCX.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14901
https://3g.163.com/news/article/GEO9ATNH05527EP3.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14902
https://3g.163.com/news/article/GENGMEH505128ELF.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14903
https://3g.163.com/news/article/GENKOU0S0552C180.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14904
https://3g.163.com/news/article/GEO70EKO051796Q9.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14905
https://3g.163.com/news/article/GEOC9O4D051484S5.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14906
https://3g.163.com/news/article/GEMVDSV90534M1TZ.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14907
https://3g.163.com/news/article/GENU1BE505148JTU.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14908
https://3g.163.com/news/article/GEMV89510534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14909
https://3g.163.com/news/article/GENOLQAF0537A693.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14910
https://3g.163.com/news/article/GENBORD10537N9PG.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14911
https://3g.163.com/news/article/GENNO2UA05521A2M.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14912
https://3g.163.com/news/article/GEKS7KAG0552CPF4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14913
https://3g.163.com/news/article/GEN0IBGO0512B07B.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14914
https://3g.163.com/news/article/GEN9DQOA0534MH06.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14915
https://3g.163.com/news/article/GENASTAJ051100DH.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14916
https://3g.163.com/news/article/GEN5LQCK0517KC40.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14917
https://3g.163.com/news/article/GEN19FNP00058781.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14918
https://3g.163.com/news/article/GEN0DVG70514R9P4.html?clickfrom=channel2018_news_index_newslist#child=index&offset=14919
https://3g.163.com/news/article/EUM2KO9N000189FH.html#offset=0
https://3g.163.com/news/article/EUM24BRS000189FH.html#offset=1
https://3g.163.com/news/article/EUJ790P7000189FH.html#offset=2
https://3g.163.com/news/article/EUJ76SJ2000189FH.html#offset=3
https://3g.163.com/news/article/EUJ75M7V000189FH.html#offset=4
https://3g.163.com/news/article/EUJ72ESB000189FH.html#offset=5
https://3g.163.com/news/article/EUJ70D5R000189FH.html#offset=6
https://3g.163.com/news/article/E6KATJC70514HDK6.html#offset=7
https://3g.163.com/news/article/E6KA44UI0514HDK6.html#offset=8
https://3g.163.com/news/article/E6K77H210514HDK6.html#offset=9
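Printing is fine for inspection, but for further processing the resolved links are more useful in a list. The sketch below collects them and drops duplicates while keeping the original order. The helper name `absolutize` and the root-relative sample link are assumptions added for illustration.

```python
from urllib.parse import urljoin

BASE = "https://3g.163.com/touch/news?referFrom="


def absolutize(hrefs, base=BASE):
    """Resolve each href against base and drop duplicates, keeping order."""
    resolved = (urljoin(base, h) for h in hrefs)
    # dict keys keep insertion order, so this removes repeats without reordering.
    return list(dict.fromkeys(resolved))


links = absolutize([
    "//3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0",
    "/news/article/EXAMPLE.html",  # hypothetical root-relative link
    "//3g.163.com/news/article/GEP9DPO5000189FH.html?clickfrom=channel2018_news_newsList#offset=0",
])
print(links)
```

A protocol-relative link takes only the scheme from the base URL, while a root-relative link takes both the scheme and the host. In both cases the base URL's own query string is not carried over.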

References: Requests: HTTP for Humans, Requests 2.18.1 documentation (python-requests.org)

XPath Syntax, Runoob tutorial (runoob.com)


