Scraping Douban Movie Rankings by Genre with Python
2021/4/16 20:25:57
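A crawler collects information from web pages by imitating a browser. Too many requests put excessive load on the web server, so sites deploy anti-scraping measures such as limits on the interval between requests. Douban has comparatively few such measures: with a properly disguised request header, the information can be scraped.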
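This article scrapes the rankings for every movie genre on Douban's charts and saves them to an Excel file. The scraped fields are mainly the movie title, release date, actors, link, rank, and rating, and you can set how many entries to scrape.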
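To make the two ideas above concrete (a browser-like request header and a pause between requests), here is a minimal sketch. The helper name fetch_all, the delay value, and the injectable fetch and sleep arguments are hypothetical and chosen only for illustration; the User-Agent string is the one used in the main code.

```python
import time

# Browser-like request header; the same User-Agent string as in the main code.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
}


def fetch_all(fetch, urls, delay=1.0, sleep=time.sleep):
    """Call fetch(url, headers) for each URL, pausing `delay` seconds between calls.

    `fetch` is any callable that performs the request (hypothetical helper,
    passed in so the sketch does not depend on network access).
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results.append(fetch(url, HEADERS))
    return results
```

With the requests library, fetch could be something like lambda u, h: requests.get(u, headers=h, timeout=10). The one-second default is an arbitrary choice, not a limit documented by Douban.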
First, analyze the Douban page URL
First, find the request method: GET.
The response format is also visible.
It is JSON.
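As a rough illustration of what handling such a JSON response looks like, the sketch below parses a made-up sample. The field names (title, release_date, actors, rating, rank, url) are the ones the main code reads; the sample values and the example.com link are placeholders, not real Douban data, and treating actors as a list is an assumption.

```python
import json

# Made-up sample in the assumed shape of the response: a JSON array of objects.
sample = """
[{"title": "Example Movie",
  "release_date": "2000-01-01",
  "actors": ["Actor A", "Actor B"],
  "rating": ["9.0", "45"],
  "rank": 1,
  "url": "https://example.com/movie/1"}]
"""


def to_rows(text):
    """Turn the JSON text into a list of flat dictionaries, one per movie."""
    rows = []
    for item in json.loads(text):
        rows.append({
            "title": item["title"],
            "release_date": item["release_date"],
            "actors": "/".join(item["actors"]),  # assumes actors is a list of names
            "rating": item["rating"][0],         # the main code keeps only the first element
            "rank": item["rank"],
            "url": item["url"],
        })
    return rows


rows = to_rows(sample)
print(rows[0]["title"], rows[0]["rating"])
```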
Next, look at the request parameters:
start: the position at which the request begins
limit: the number of items returned per request
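The two parameters above can also be used to fetch a long ranking in several smaller requests. The sketch below assumes that start is a zero-based offset into the ranking, which the article does not state explicitly; the helper name page_params and the page size are hypothetical. The main code in this article instead sends a single request with start set to 0.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/chart/top_list"


def page_params(type_id, total, page_size=20):
    """Yield one query-parameter dict per request, covering `total` items."""
    for start in range(0, total, page_size):
        yield {
            "type": type_id,
            "interval_id": "100:90",
            "action": "",
            "start": start,                           # assumed zero-based offset
            "limit": min(page_size, total - start),   # last page may be shorter
        }


pages = list(page_params("11", total=50, page_size=20))
urls = [f"{BASE_URL}?{urlencode(p)}" for p in pages]
for u in urls:
    print(u)
```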
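The scraper mainly uses three libraries: requests, json, and pandas.
Analysis shows that the URLs for different genres differ only in the type parameter; the mapping is given by the type_name dictionary in the code.
Changing type through string substitution is enough to scrape the different pages.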
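A small sketch of that substitution: only the type value changes between genres, while the rest of the URL stays the same. The three entries are a subset of the mapping in the main code; the helper name chart_url is hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/j/chart/top_list"

# Subset of the genre-to-type mapping from the main code.
TYPE_IDS = {"剧情": "11", "喜剧": "24", "动作": "5"}


def chart_url(genre, limit=20):
    """Build the ranking URL for one genre by swapping in its type value."""
    params = {
        "type": TYPE_IDS[genre],
        "interval_id": "100:90",
        "action": "",
        "start": 0,
        "limit": limit,
    }
    return f"{BASE_URL}?{urlencode(params)}"


for genre in TYPE_IDS:
    print(genre, chart_url(genre))
```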
The main code of the program is as follows:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import requests
import json
import pandas as pd


def douban(word1, Number):
    # Genre name -> value of the "type" query parameter.
    type_name = {"剧情": "11", "喜剧": "24", "动作": "5", "爱情": "13", "科幻": "17", "动画": "25", "悬疑": "10",
                 "惊悚": "19", "恐怖": "20", "纪录片": "1", "短片": "23", "情色": "6", "同性": "26",
                 "音乐": "14", "歌舞": "7", "家庭": "28", "儿童": "8", "传记": "2", "历史": "4", "战争": "22", "犯罪": "3", "西部": "27",
                 "奇幻": "16", "冒险": "15", "灾难": "12", "武侠": "29", "古装": "30", "运动": "18", "黑色电影": "31"}
    url = "https://movie.douban.com/j/chart/top_list"
    Params = {
        'type': type_name[word1],
        'interval_id': "100:90",
        'action': None,  # requests omits parameters whose value is None
        'start': "0",
        'limit': f"{Number}",
    }
    # Browser-like request header
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36"
    }
    response = requests.get(url=url, params=Params, headers=headers)
    result = response.json()

    # One list per column. The lists start empty: the column names are the
    # dictionary keys below, so they must not be repeated as a first data row.
    title = []
    release_date = []
    actors = []
    rating = []
    rank = []
    links = []
    for i in result:
        title.append(i["title"])
        release_date.append(i["release_date"])
        actors.append(i["actors"])
        rating.append(i["rating"][0])
        rank.append(i["rank"])
        links.append(i["url"])
    # Column headers: title, release date, actors, rating, rank, link
    output_excel = {"片名": title, "发布日期": release_date, "演员": actors,
                    "分数": rating, "排名": rank, "链接": links}
    output = pd.DataFrame(output_excel)
    # Optional: also dump the raw JSON to a file.
    # with open(f"./{word1}.json", "w", encoding="utf-8") as f:
    #     json.dump(result, fp=f, ensure_ascii=False)
    return output


word_list = ["剧情", "喜剧", "动作", "爱情", "科幻", "动画", "悬疑", "惊悚", "恐怖",
             "纪录片", "短片", "情色", "同性", "音乐", "歌舞", "家庭", "儿童", "传记", "历史", "战争", "犯罪", "西部",
             "奇幻", "冒险", "灾难", "武侠", "古装", "运动", "黑色电影"]

# Write to Excel, one sheet per genre
m = 1
writer = pd.ExcelWriter("豆瓣电影.xlsx")
Number = input("请输入要爬取的数量:")  # prompt: number of entries to scrape
for i in word_list:
    # exec creates the variables output1, output2, ... one per genre
    exec('output' + f'{m}' + ' = douban(i,Number)')
    # exec('print(' + 'output' + f'{m}' + ')')
    exec('output' + f'{m}' + '.to_excel(writer, sheet_name=i)')
    m += 1
writer.close()  # ExcelWriter.save() was removed in pandas 2.0; close() writes the file
The program uses exec to store the results in bulk; for details on how exec works, see my other article on using exec to generate variables in bulk.
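As an alternative to exec (this is not the approach used in this article), the per-genre results can be kept in a dictionary keyed by genre name. The sketch below is a minimal illustration: collect is a hypothetical helper, and fake_fetch stands in for douban(genre, Number) so the example runs without network access.

```python
def collect(genres, fetch):
    """Return {genre: fetch(genre)} instead of creating output1, output2, ... with exec."""
    return {genre: fetch(genre) for genre in genres}


# Stand-in for douban(genre, Number); it returns a placeholder list, not real data.
def fake_fetch(genre):
    return [f"{genre}-row"]


tables = collect(["剧情", "喜剧"], fake_fetch)
for sheet_name, table in tables.items():
    print(sheet_name, table)
```

With the real function, this would be tables = collect(word_list, lambda g: douban(g, Number)), followed by table.to_excel(writer, sheet_name=sheet_name) inside the loop. A dictionary keeps each result addressable by genre name and avoids building code as strings.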
Scraping results: