Resolving 503 status codes when using Scrapy and requests

2021/7/11 6:05:51



The error log is as follows:

2021-07-11 02:19:11 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <503 https://xxxx.com/tags/undef>: HTTP status code is not handled or not allowed

Problem analysis

Reading the HTML body returned with the 503 response gives the following.

503 error message:

Checking your browser before accessing xxxx.com
This process is automatic. Your browser will redirect to your requested content shortly.
Please allow up to 5 seconds…
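A page like this can be recognized in code before deciding how to handle it. The following is a minimal sketch, assuming the two marker phrases quoted above are representative; the function name `looks_like_browser_check` and the marker list are illustrative, and real challenge pages may use different wording.

```python
# Marker phrases copied from the 503 page quoted above. This list is an
# assumption for illustration, not a complete signature of such pages.
CHALLENGE_MARKERS = (
    "Checking your browser before accessing",
    "Please allow up to 5 seconds",
)


def looks_like_browser_check(status_code, body):
    """Return True if a response resembles the 503 browser-check page."""
    if status_code != 503:
        return False
    return any(marker in body for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    sample = "Checking your browser before accessing xxxx.com"
    print(looks_like_browser_check(503, sample))
```

A check like this only labels the response; it does not get past the waiting page.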

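Judging from this text, the page makes the browser wait about 5 seconds for verification. A quick search suggested this is a Cloudflare mechanism that blocks bots from scraping data, and that cfscrape can be used to get past the waiting page. The setup is as follows: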

Install: pip install cfscrape

import cfscrape
import scrapy

# USER_AGENT is assumed to be defined elsewhere (the project's User-Agent string).

class DrdSpider(scrapy.Spider):
    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # cfscrape returns the Cloudflare cookies and the User-Agent it used
            token, agent = cfscrape.get_tokens(url, USER_AGENT)
            # token, agent = cfscrape.get_tokens(url)
            cf_requests.append(scrapy.Request(url=url, cookies={'__cfduid': token['__cfduid']}, headers={'User-Agent': agent}))
            print("useragent in cfrequest: ", agent)
            print("token in cfrequest: ", token)
        return cf_requests

After this setup, however, running the spider raised the following error:

Traceback (most recent call last):
  File "C:\workspace\new-crm-agent\env\lib\site-packages\scrapy\core\engine.py", line 129, in _next_request
    request = next(slot.start_requests)
  File "C:\workspace\phub\scrapy_obj\mySpider\spiders\drd.py", line 35, in start_requests
    token, agent = cfscrape.get_tokens(url)
  File "C:\workspace\new-crm-agent\env\lib\site-packages\cfscrape\__init__.py", line 398, in get_tokens
    'Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I\'m Under Attack Mode") enabled?'
ValueError: Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM ("I'm Under Attack Mode") enabled?

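The error message suggests the site is not using the Cloudflare mechanism, so I set a breakpoint on the line just before the exception to inspect the response. Its status code was 200.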

So the question is: why does cfscrape get a normal 200 while Scrapy gets a 503?

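I suspected a problem in the Scrapy framework itself, so I sent the same request with the requests module to see whether it would work. It still returned 503: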

from collections import OrderedDict

import requests

if __name__ == "__main__":
    session = requests.session()
    heads = OrderedDict([('Host', None),
             ('Connection', 'keep-alive'),
             ('Upgrade-Insecure-Requests', '1'),
             ('User-Agent',
              'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36'),
             ('Accept',
              'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8'),
             ('Accept-Language', 'en-US,en;q=0.9'),
             ('Accept-Encoding', 'gzip, deflate')])
    session.headers = heads
    resp = session.get("https://drd.com/tags/undi")
    print(resp)

Result:

<Response [503]>
Process finished with exit code 0

Apart from cfscrape, neither requests nor Scrapy could access the site normally, so cfscrape's source presumably does something special. The relevant part of the source is:

class CloudflareAdapter(HTTPAdapter):
    """ HTTPS adapter that creates a SSL context with custom ciphers """

    def get_connection(self, *args, **kwargs):
        conn = super(CloudflareAdapter, self).get_connection(*args, **kwargs)

        if conn.conn_kw.get("ssl_context"):
            conn.conn_kw["ssl_context"].set_ciphers(DEFAULT_CIPHERS)
        else:
            context = create_urllib3_context(ciphers=DEFAULT_CIPHERS)
            conn.conn_kw["ssl_context"] = context

        return conn
        
class CloudflareScraper(Session):
    def __init__(self, *args, **kwargs):
        self.delay = kwargs.pop("delay", None)
        # Use headers with a random User-Agent if no custom headers have been set
        headers = OrderedDict(kwargs.pop("headers", DEFAULT_HEADERS))

        # Set the User-Agent header if it was not provided
        headers.setdefault("User-Agent", DEFAULT_USER_AGENT)

        super(CloudflareScraper, self).__init__(*args, **kwargs)

        # Define headers to force using an OrderedDict and preserve header order
        self.headers = headers
        self.org_method = None

        self.mount("https://", CloudflareAdapter())

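The key is self.mount("https://", CloudflareAdapter()). When I copied this logic into plain requests, the request returned 200. So the difference appears to lie in the TLS handshake for HTTPS requests, specifically in the cipher list set on ssl_context. I then looked up what set_ciphers does. The official Python documentation says: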

SSLContext.set_ciphers(ciphers)
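Set the available ciphers for sockets created with this context. It should be a string in the OpenSSL cipher list format. If no cipher can be selected (because compile-time options or other configuration forbids use of all the specified ciphers), an SSLError will be raised.

Note: when connected, the SSLSocket.cipher() method of SSL sockets will give the currently selected cipher.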
TLS 1.3 cipher suites cannot be disabled with set_ciphers().
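The adapter logic from cfscrape can be reproduced with plain requests along these lines. This is a minimal sketch, assuming the cipher list is the only relevant difference; `CipherAdapter`, `make_session` and the default cipher string are illustrative names and values, not part of cfscrape, whose DEFAULT_CIPHERS value is not shown here.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.ssl_ import create_urllib3_context


class CipherAdapter(HTTPAdapter):
    """HTTPS adapter whose connections use a custom OpenSSL cipher string."""

    def __init__(self, ciphers, **kwargs):
        # Set this first: HTTPAdapter.__init__ calls init_poolmanager.
        self._ciphers = ciphers
        super().__init__(**kwargs)

    def init_poolmanager(self, *args, **kwargs):
        # Hand urllib3 an SSLContext restricted to the requested ciphers.
        kwargs["ssl_context"] = create_urllib3_context(ciphers=self._ciphers)
        return super().init_poolmanager(*args, **kwargs)


def make_session(ciphers="DEFAULT:!DH"):
    """Build a Session whose HTTPS requests use the given cipher string."""
    session = requests.Session()
    session.mount("https://", CipherAdapter(ciphers))
    return session
```

Calling make_session().get(url) then sends the request through the custom context. Whether a given site answers 200 depends on which cipher list it accepts, so the string may need adjusting.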

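What exactly are SSL "available ciphers"? I did not have time to dig into the details, so I searched for how to handle this in Scrapy. The setting is: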

/settings.py

DOWNLOADER_CLIENT_TLS_CIPHERS = "DEFAULT:!DH"
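To see what a cipher string like this changes, the standard ssl module can list the suites a context enables. The sketch below compares "DEFAULT" with "DEFAULT:!DH"; the helper name `cipher_names` is illustrative, and the exact lists depend on the local OpenSSL build. In OpenSSL's cipher list format, "!DH" excludes suites that use Diffie-Hellman key exchange, which is presumably why the handshake looks different to the server.

```python
import ssl


def cipher_names(cipher_string):
    """Return the cipher suite names an SSLContext enables for cipher_string."""
    context = ssl.create_default_context()
    # Raises ssl.SSLError if the string selects no cipher at all.
    context.set_ciphers(cipher_string)
    return [cipher["name"] for cipher in context.get_ciphers()]


if __name__ == "__main__":
    default = cipher_names("DEFAULT")
    restricted = cipher_names("DEFAULT:!DH")
    # Suites present under DEFAULT but dropped by the "!DH" exclusion.
    removed = sorted(set(default) - set(restricted))
    print("removed by '!DH':", removed)
```

TLS 1.3 suites appear in both lists, since set_ciphers() cannot disable them.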

