[Python] Lab 2, Project 1: Crawl the Mtime TV Series Top 100 with Multiple Coroutines and a Queue
2021/7/3 22:51:20
This article introduces [Python] Lab 2, Project 1: crawling the Mtime TV Series Top 100 with multiple coroutines and a queue. It should be a useful reference for anyone tackling the same programming problem, so let's work through it together!
Using multiple coroutines and a queue, crawl the Mtime TV Series Top 100 data (title, director, cast, and synopsis), and store it with the csv module (file name: time100.csv).
Mtime TV ranking list: http://list.mtime.com/listIndex
Key points:
The site uses cookie-based anti-crawling, so you need to copy your own request headers exactly, for example:
a = '''Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Connection: keep-alive
Cookie: userId=0; defaultCity=%25E5%258C%2597%25E4%25BA%25AC%257C290; waf_cookie=59ca4180-5a16-459e122021f2731eb3889667e33bee3b5cd0; _ydclearance=dca49a10afc623028d11eefe-48d8-4053-bde0-dea67b20ab57-1586501304; userCode=20204101248277038; userIdentity=2020410124827743; tt=731C76D4E29CB5ED5BD5F19F3774A2AC; Hm_lvt_6dd1e3b818c756974fb222f0eae5512e=1586494108; __utma=196937584.377597232.1586494108.1586494108.1586494108.1; __utmc=196937584; __utmz=196937584.1586494108.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmt=1; _utmt~1=1; __utmb=196937584.18.10.1586494108; Hm_lpvt_6dd1e3b818c756974fb222f0eae5512e=1586495472
Host: www.mtime.com
Referer: http://www.mtime.com/top/tv/top100/index-2.html
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'''
Note that the headers string then has to be split into lines and converted into a dictionary, which takes a little string handling:
Split each line once into a key-value pair: line.split(": ", 1)
Iterate over the lines: for line in a.split("\n")
Once the key-value pairs are formed, convert them with dict(...)
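Putting those three pieces together, the conversion might look like this (a shortened header string is used here for illustration):

```python
# Convert a browser-copied headers string into a dict, one "Key: Value" per line.
a = '''Accept-Encoding: gzip, deflate
Accept-Language: zh-CN,zh;q=0.9
Host: www.mtime.com'''

# Split the string into lines, split each line once on ": ",
# and hand the resulting pairs to dict().
headers = dict(line.split(": ", 1) for line in a.split("\n"))
print(headers["Host"])  # www.mtime.com
```

Splitting with maxsplit=1 matters: header values such as dates can themselves contain ": ", and only the first occurrence separates the key from the value.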
Required imports:
from gevent import monkey
monkey.patch_all()
import gevent,requests,bs4,csv
from gevent.queue import Queue
Key points for building a multi-coroutine crawler with gevent:
Define the crawl function
Create tasks with gevent.spawn()
Run the tasks with gevent.joinall()
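A minimal sketch of the spawn/joinall pattern, with a toy worker standing in for the real crawl function:

```python
from gevent import monkey
monkey.patch_all()  # patch the standard library so blocking I/O becomes cooperative
import gevent

results = []

def crawl(start, end):
    # stand-in for a real fetch: just record the range this task was given
    for i in range(start, end):
        results.append(i)

# spawn() takes the function and its arguments separately --
# do not call the function yourself when creating the task
tasks = [gevent.spawn(crawl, x, x + 5) for x in range(0, 20, 5)]
gevent.joinall(tasks)
print(len(results))  # 20
```

Writing gevent.spawn(crawl(x, x + 5)) instead would execute crawl immediately and sequentially, defeating the purpose of the coroutines.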
Key points for using the queue module:
Create a queue with Queue()
Store data with put_nowait()
Retrieve data with get_nowait()
Other queue methods: empty() tests whether the queue is empty, full() tests whether it is full, and qsize() returns how many items are still in it
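For example, a queue of ranking-page URLs shared by the coroutines might be set up like this (the URL pattern is taken from the ranking pages; three pages are queued for illustration):

```python
from gevent.queue import Queue

urls_queue = Queue()
# store the work: one ranking page per queue entry
for page in range(1, 4):
    urls_queue.put_nowait('http://www.mtime.com/top/tv/top100/index-%d.html' % page)

print(urls_queue.qsize())  # 3 items still in the queue
print(urls_queue.full())   # False: no maxsize was given, so it is never full

# drain the queue: each coroutine would pull its next URL this way
fetched = []
while not urls_queue.empty():
    fetched.append(urls_queue.get_nowait())
```

get_nowait() raises an exception on an empty queue instead of blocking, which is why the loop checks empty() first.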
Steps for writing a CSV file:
Open the file with open()
Create a writer object with csv.writer()
Write rows with the writer object's writerow() method
Close the file with close()
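The four steps above, with a couple of sample rows (the file name and data here are placeholders for illustration):

```python
import csv

# Step 1: open the file (newline='' so csv controls line endings itself)
f = open('csv_demo.csv', 'w', newline='', encoding='utf-8')
# Step 2: create the writer object
writer = csv.writer(f)
# Step 3: write rows, one list per row
writer.writerow(['剧名', '导演'])
writer.writerow(['示例剧集', '示例导演'])
# Step 4: close the file so the rows are flushed to disk
f.close()
```
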
Approach:
The site uses cookie-based anti-crawling, so start by locating your request headers and copying them exactly.
Extract the title, director, cast, and synopsis we need, iterating over the top 100 entries; when a field is empty (i.e. director == '' or not isinstance(director, str)), record it as unknown so the script does not crash.
With gevent, create the tasks with gevent.spawn() and run them with gevent.joinall(), then save the results in CSV format.
from gevent import monkey
monkey.patch_all()   # patch the standard library before importing requests
import csv
import gevent
import requests
from json.decoder import JSONDecodeError

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36 Edg/89.0.774.63',
    'Content-Type': 'application/json'
}

# the ranking page loads its data from this JSON API
url = 'http://front-gateway.mtime.com/library/index/app/topList.api?tt=1616811596867&'

ids = []                  # movieId of each top-100 TV series
movie = [None] * 100      # title
director = [None] * 100   # director name
actor = [None] * 100      # list of actor names
story = [None] * 100      # synopsis
task = []

def fetch_ids():
    # fetch the id list once, instead of once per coroutine
    html = requests.get(url, headers=headers).json()
    items = html['data']['tvTopList']['topListInfos'][0]['items']
    for i in range(100):  # first 100 TV series
        ids.append(items[i]['movieInfo']['movieId'])

def catch(x, y):
    for i in range(x, y):
        url1 = ('http://front-gateway.mtime.com/library/movie/detail.api'
                '?tt=1617412224076&movieId=' + str(ids[i]) + '&locationId=290')
        try:
            tvhtml = requests.get(url=url1, headers=headers).json()
        except JSONDecodeError:
            # retry once if the response body is not valid JSON
            tvhtml = requests.get(url=url1, headers=headers).json()
        basic = tvhtml['data']['basic']
        # assign by index so concurrent tasks keep the ranking order;
        # empty fields are recorded as '未知' (unknown) instead of crashing
        m = basic['director']
        if m is None:
            director[i] = '未知'
        elif m['name'] == '':
            director[i] = m['nameEn']
        else:
            director[i] = m['name']
        actor[i] = [j['nameEn'] if j['name'] == '' else j['name'] for j in basic['actors']]
        movie[i] = basic['name']
        story[i] = '未知' if basic['story'] is None else basic['story']

if __name__ == '__main__':
    fetch_ids()
    x = 0
    for i in range(10):
        # pass the function and its arguments separately; writing
        # gevent.spawn(catch(x, x + 10)) would call catch() immediately
        # and run the ten slices sequentially
        task.append(gevent.spawn(catch, x, x + 10))
        x = x + 10
    gevent.joinall(task)
f = open('time100.csv', 'w', newline='', encoding='gb18030')  # file name per the assignment
csv_write = csv.writer(f)
for i in range(100):
    csv_write.writerow(['电视剧', movie[i]])
    csv_write.writerow(['导演', director[i]])
    csv_write.writerow(['演员'])
    for x in actor[i]:
        csv_write.writerow([x])
    csv_write.writerow(['剧情', story[i]])
f.close()
print('完成')
That concludes this article on [Python] Lab 2, Project 1: crawling the Mtime TV Series Top 100 with multiple coroutines and a queue. I hope it is a helpful reference; thanks for reading!