Scrapy does not capture all of the markup returned for a request

Posted by gk7wooem on 2023-06-29 in Other

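I am trying to capture the markup from an HTTP response, but I only get part of it. For some reason it is cut off in the middle. What could this be related to? Here is my code: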

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.log import configure_logging

class StackOverflowSpider(scrapy.Spider):
    
    name = 'stackoverflow'
    allowed_domains = ['stackoverflow.com']
    start_urls = ['https://stackoverflow.com/questions/tagged/python?tab=newest&page=1&pagesize=15']
    first_request_done = False
    
    def start_requests(self):
        if not self.first_request_done:
            self.first_request_done = True
            for url in self.start_urls:
                yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)
            
    def parse(self, response):
        # Header values are bytes, so the default must be bytes too
        if response.status == 200 and response.headers.get('Content-Type', b'').startswith(b'text/html'):
            html = response.body.decode('utf-8')
            print(html)
        
        yield
    

configure_logging()
process = CrawlerProcess(settings={
    'LOG_ENABLED': False,
    'DOWNLOAD_DELAY': 1,
    'CONCURRENT_REQUESTS': 1
})
process.crawl(StackOverflowSpider)
process.start(stop_after_crawl=False)

Source: https://stackoverflow.com/questions/76530358/scrapy-intercepts-not-all-of-the-markup-that-comes-in-the-request

1 Answer

w51jfk4q1#

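This is just Python's print function not flushing its output properly. You can demonstrate this by splitting the page content into lines and printing them one at a time, or by writing the content to a file and checking that the file contains the complete output.
For example, you can try printing line by line: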

def parse(self, response):
    for line in response.text.splitlines():
        print(line)

Or, if you want to write the content to a file:

def parse(self, response):
    with open('response.html', "wt", encoding="utf8") as htmlfile:
        htmlfile.write(response.text)
    ...
    ...
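
If the truncation really is an output-flushing issue, as this answer suggests, another option is to flush the stream explicitly after printing. The following is only a minimal sketch under that assumption: `dump_text` is a hypothetical helper name that does not appear in the question or in Scrapy, and it uses nothing beyond the standard library.

```python
import io
import sys


def dump_text(text, stream=None):
    """Print text one line at a time, then flush the stream explicitly.

    Hypothetical helper for illustration; returns the number of lines written.
    """
    if stream is None:
        stream = sys.stdout  # resolve at call time so redirection is honoured
    count = 0
    for line in text.splitlines():
        print(line, file=stream)
        count += 1
    stream.flush()  # push anything still buffered out to the destination
    return count


if __name__ == "__main__":
    # Stand-in for response.text; a real spider would pass the page body here.
    sample = "<html>\n<body>hello</body>\n</html>"
    buffer = io.StringIO()
    dump_text(sample, buffer)
    # The buffer holds every line that was passed in, each ending in a newline.
    assert buffer.getvalue() == sample + "\n"
```

Inside the spider, this could be called as `dump_text(response.text)` from `parse`. Whether it resolves the truncation depends on the terminal or console in use, so writing to a file remains the more reliable check.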
