[Python Web Scraping] Scraping the Free Resume Templates on Chinaz
2022/6/16 1:20:16
This post is for learning and exchange purposes only.
Parsing the list page
XPath can be used to extract the template links from the page.
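For readers new to XPath, here is a minimal self-contained sketch of the mechanism, assuming an invented HTML string that mimics the list page's structure:

from lxml import etree

# Hypothetical snippet shaped like the chinaz list page
html = '<div id="main"><div><div><a href="//sc.chinaz.com/jianli/demo.html">demo</a></div></div></div>'
tree = etree.HTML(html)  # build an element tree from the HTML string
links = tree.xpath('//div[@id="main"]/div/div/a/@href')  # collect matching href attributes
print(links)  # ['//sc.chinaz.com/jianli/demo.html']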
Entering a template's page
From each template's own page we can obtain the download URL.
Steps:
Extract the template links from the list page -> follow each link -> extract the download URL inside that page -> download and save.
headers = {
    'User-Agent': 'your own User-Agent string'
}
response = requests.get(url=url, headers=headers).text
# Parse the list page
tree = etree.HTML(response)
# print(tree)
page_list = tree.xpath('//div[@id="main"]/div/div/a')  # locate the template entries
for li in page_list:
    page_list_url = li.xpath('./@href')[0]
    page_list_url = 'https:' + page_list_url  # build the detail-page URL
    # print(page_list_url)
    in_page = requests.get(url=page_list_url, headers=headers).text  # request the detail page
    trees = etree.HTML(in_page)
    # print(trees)
    download_url = trees.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')[0]  # the target file's URL
    name = trees.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0] + '.rar'  # file name
    name = name.encode('iso-8859-1').decode('utf-8')  # repair the mojibake in the name
    # print(download_url)
    if not os.path.exists('./download'):  # save
        os.mkdir('./download')
    download = requests.get(url=download_url, headers=headers).content
    page_name = 'download/' + name
    with open(page_name, 'wb') as fp:
        fp.write(download)
    print(name, 'end!')
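The encode('iso-8859-1').decode('utf-8') round-trip is needed because requests falls back to ISO-8859-1 when the response headers carry no charset, so the UTF-8 title is mis-decoded. A minimal sketch of an alternative, assuming the site really serves UTF-8, is to set the encoding before reading the body (resp is a new name; page_list_url and headers come from the loop above):

resp = requests.get(url=page_list_url, headers=headers)
resp.encoding = 'utf-8'  # override requests' charset guess before decoding
in_page = resp.text      # the title now decodes correctly, no round-trip needed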
Analyzing how the pages link together: page 1 lives at free.html, while page n (n >= 2) follows the free_n.html pattern.
Implementing multi-page extraction
try:
    start = int(input('Enter the last page to scrape: '))
    if start == 1:
        url = 'https://sc.chinaz.com/jianli/free.html'
        get_page(url)
        print('Scraping finished')
    elif start == 2:
        url = 'https://sc.chinaz.com/jianli/free.html'
        get_page(url)
        url = 'https://sc.chinaz.com/jianli/free_2.html'
        get_page(url)
        print('Scraping finished')
    elif start >= 3:
        url = 'https://sc.chinaz.com/jianli/free.html'
        get_page(url)
        for i in range(2, start + 1):  # include the last page; range(2, start) stopped one page early
            url = 'https://sc.chinaz.com/jianli/free_%s.html' % i
            get_page(url)
        print('Scraping finished')
except ValueError:
    print('Please enter a number')
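All three branches follow the same URL pattern, so the selection logic can collapse into a single loop. A branch-free sketch (go_simple is a hypothetical name, not part of the original script):

def go_simple(last_page):
    # Scrape pages 1..last_page: page 1 has its own URL,
    # every later page follows the free_N.html pattern.
    for i in range(1, last_page + 1):
        if i == 1:
            url = 'https://sc.chinaz.com/jianli/free.html'
        else:
            url = 'https://sc.chinaz.com/jianli/free_%d.html' % i
        get_page(url)
    print('Scraping finished')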
Full code:
import requests
from lxml import etree
import os


def get_page(url):
    headers = {
        'User-Agent': 'your own User-Agent string'
    }
    response = requests.get(url=url, headers=headers).text
    # Parse the list page
    tree = etree.HTML(response)
    # print(tree)
    page_list = tree.xpath('//div[@id="main"]/div/div/a')  # locate the template entries
    for li in page_list:
        page_list_url = li.xpath('./@href')[0]
        page_list_url = 'https:' + page_list_url  # build the detail-page URL
        # print(page_list_url)
        in_page = requests.get(url=page_list_url, headers=headers).text  # request the detail page
        trees = etree.HTML(in_page)
        # print(trees)
        download_url = trees.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')[0]  # the target file's URL
        name = trees.xpath('//div[@class="ppt_tit clearfix"]/h1/text()')[0] + '.rar'  # file name
        name = name.encode('iso-8859-1').decode('utf-8')  # repair the mojibake in the name
        # print(download_url)
        if not os.path.exists('./download'):  # save
            os.mkdir('./download')
        download = requests.get(url=download_url, headers=headers).content
        page_name = 'download/' + name
        with open(page_name, 'wb') as fp:
            fp.write(download)
        print(name, 'end!')


def go():  # multi-page selection
    try:
        start = int(input('Enter the last page to scrape: '))
        if start == 1:
            url = 'https://sc.chinaz.com/jianli/free.html'
            get_page(url)
            print('Scraping finished')
        elif start == 2:
            url = 'https://sc.chinaz.com/jianli/free.html'
            get_page(url)
            url = 'https://sc.chinaz.com/jianli/free_2.html'
            get_page(url)
            print('Scraping finished')
        elif start >= 3:
            url = 'https://sc.chinaz.com/jianli/free.html'
            get_page(url)
            for i in range(2, start + 1):  # include the last page; range(2, start) stopped one page early
                url = 'https://sc.chinaz.com/jianli/free_%s.html' % i
                get_page(url)
            print('Scraping finished')
    except ValueError:
        print('Please enter a number')
        go()  # ask again on bad input


if __name__ == '__main__':
    go()
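The script fires three requests per template with no delay or timeout, which is rough on the server if you scrape many pages. A small wrapper like the hypothetical polite_get below (my addition, not in the original code) could stand in for the requests.get calls inside get_page:

import time
import requests

def polite_get(url, headers, delay=1.0, timeout=10):
    # Hypothetical helper: pause before each request and bound its duration,
    # so repeated calls neither hammer the server nor hang on a stalled connection.
    time.sleep(delay)
    return requests.get(url=url, headers=headers, timeout=timeout)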
Result: