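Day one of writing a Python crawler: practicing on Baidu, getting blocked by anti-crawling with <title>百度安全验证</title> (Baidu security verification), and how I got past it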
2021/9/15 17:34:53
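This is my first blog post and my first time learning to write crawlers. I just want to share what happened, so please bear with me.
First I set up my own User-Agent pool, rather than installing the fake-useragent package (pip install fake-useragent) and picking one at random with something like print(ua.ie).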
import random

# Custom User-Agent pool. The next snippet reads `a` from a module named ua_info,
# so this code is presumably saved as ua_info.py.
ua_list = [
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
    # The original entry began with 'User-Agent:', which is the header name, not part of the value
    'Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11',
    'Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50',
    'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)',
    'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1',
]

# Pick one User-Agent at random when the module is loaded
a = random.choice(ua_list)
print(a)
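Along the way I hit a problem where rs1 could not be obtained. After searching on Baidu, I ended up with the following program: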
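import urllib.request
from urllib import parse

import ua_info  # the module holding the User-Agent pool shown above

# Take the randomly chosen User-Agent and put it in the request headers
rs1 = ua_info.a
headers = {'User-Agent': rs1}

# Encode the search keyword and build the search URL
query_string = {'wd': '爬虫'}
result = parse.urlencode(query_string)
url1 = 'http://www.baidu.com/s?{}'.format(result)

# Create the request object (wrapping the UA info), send it, and decode the response
req = urllib.request.Request(url=url1, headers=headers)
res = urllib.request.urlopen(req)
html = res.read().decode('utf-8')
print(html)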
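The URL-building step in the program above can be checked on its own, without touching the network. The following is only a minimal sketch: the helper name build_search_url is made up for illustration and does not appear in the original program.

```python
from urllib import parse


def build_search_url(keyword):
    """Return a Baidu search URL for `keyword` (hypothetical helper)."""
    # urlencode percent-encodes non-ASCII characters as UTF-8 bytes
    query = parse.urlencode({'wd': keyword})
    return 'http://www.baidu.com/s?{}'.format(query)


print(build_search_url('爬虫'))
# prints http://www.baidu.com/s?wd=%E7%88%AC%E8%99%AB
```

Printing the URL before sending the request makes it easy to confirm that the keyword was encoded as expected.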
After about five requests, I got the following result:
<!DOCTYPE html>
<html lang="zh-CN">
<head>
    <meta charset="utf-8">
    <title>百度安全验证</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">
    <meta name="viewport" content="width=device-width, user-scalable=no, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0">
    <meta name="format-detection" content="telephone=no, email=no">
    <link rel="shortcut icon" href="https://www.baidu.com/favicon.ico" type="image/x-icon">
    <link rel="icon" sizes="any" mask href="https://www.baidu.com/img/baidu.svg">
    <meta http-equiv="X-UA-Compatible" content="IE=Edge">
    <meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests">
    <link rel="stylesheet" href="https://wappass.bdimg.com/static/touch/css/api/mkdjump_0635445.css" />
</head>
<body>
    <div class="timeout hide">
        <div class="timeout-img"></div>
        <div class="timeout-title">网络不给力,请稍后重试</div>
        <button type="button" class="timeout-button">返回首页</button>
    </div>
    <div class="timeout-feedback hide">
        <div class="timeout-feedback-icon"></div>
        <p class="timeout-feedback-title">问题反馈</p>
    </div>
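Since the block page shown above is returned as ordinary HTML rather than raising an error, a crawler can easily mistake it for a real result. One simple way to notice it is to look for the title string from that output. This is a rough sketch under that assumption; the function name is_verification_page is invented here, and Baidu may well change the page title.

```python
# Title string taken from the block page shown above
BLOCK_TITLE = '<title>百度安全验证</title>'


def is_verification_page(html):
    """Return True if `html` looks like the Baidu security verification page."""
    return BLOCK_TITLE in html


blocked = '<html><head><title>百度安全验证</title></head><body></body></html>'
normal = '<html><head><title>爬虫_百度搜索</title></head><body></body></html>'
print(is_verification_page(blocked), is_verification_page(normal))
```

Calling this on the decoded response before parsing it gives an early signal that the request was blocked.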
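The solution I found by searching Baidu told me to add one more field to headers (the Accept header, as shown here) and explained where to find its value. With that change the problem was solved.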
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36 Edg/83.0.478.50',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9'
}
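The Accept header can also be combined with a randomly chosen User-Agent from a pool. The sketch below only builds the request object and inspects its headers, so it runs without network access. The helper name build_request and the one-entry pool are placeholders for illustration, and nothing here guarantees that Baidu will keep accepting such requests.

```python
import random
from urllib import request

# Accept value copied from the headers shown above
ACCEPT = ('text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,'
          'image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9')


def build_request(url, ua_pool):
    """Build a Request with a random User-Agent and the Accept header (hypothetical helper)."""
    headers = {'User-Agent': random.choice(ua_pool), 'Accept': ACCEPT}
    return request.Request(url=url, headers=headers)


# Placeholder pool with a single entry, for demonstration only
pool = ['Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1']
req = build_request('http://www.baidu.com/s?wd=test', pool)

# urllib stores header names capitalized, e.g. 'User-agent'
print(req.get_header('User-agent'))
print(req.get_header('Accept'))
```

The request would then be sent with urllib.request.urlopen(req), as in the earlier program.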
Out of curiosity I then read up on the ongoing contest between crawlers and anti-crawling measures. The article I found:
Article link: 反爬虫策略及破解方法 (Anti-crawler strategies and how to get around them), 特洛伊-Micro, 博客园 (cnblogs), https://www.cnblogs.com/micro-chen/p/8676312.html