Scrapy keeps getting blocked

Posted by cu6pst1q on 2022-11-09 in Other


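I'm trying to get a list of US movie theaters from http://cinematreasures.org/ as part of learning Python and Scrapy.
I wrote a spider to scrape the site, but I get no response when I run it. Please see the attached images of the HTML tree, my spider, the response when running the spider, and the changes I made to settings.py.
I'd like to try proxy IPs, but I don't know how to use them with Scrapy. Please help.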

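I have tried the code in the Scrapy shell, and it works fine there.
But when I run it with scrapy crawl listor, I get nothing!
I'd just like to be able to export the results to CSV via pandas, if possible.
Here is my code: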

name = 'listall'
allowed_domains = ['cinematreasures.org']
start_urls = ['http://cinematreasures.org/theaters/united-states?page=1&status=all']

# url = 'http://cinematreasures.org/theaters/united-states?page={}&status=all'

def parse(self, response):

    for row in response.xpath('//table//tr')[1:]:
        name =  row.xpath('td//text()')[2].get()
        address = row.xpath('td//text()')[4].get()   
        yield {
            'Name':name,
            'Address':address,
        }
    next_page = response.xpath("//a[@class='next_page']").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page))

Source: https://stackoverflow.com/questions/73662286/scrapy-keeps-getting-blocked

1 Answer


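Your XPath expressions are incorrect. When you use relative XPath expressions, they need to start with "./", and in my opinion class specifiers are much easier to work with than indexes. Note also that the next-page selector has to extract the link's @href attribute; your version returns the whole <a> element as a string, which is not a usable URL.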

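def parse(self, response):
    # Select rows by the table's class instead of relying on positional indexes.
    for row in response.xpath('//table[@class="list"]//tr'):
        # "./" makes each expression relative to the current row.
        name = row.xpath('./td[@class="name"]/a/text()').get()
        address = row.xpath('./td[@class="location"]/text()').get()
        yield {
            'Name': name,
            'Address': address,
        }
    # Extract the link's href attribute, not the whole <a> element.
    next_page = response.xpath("//a[@class='next-page']/@href").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page))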
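
To try the row-relative, class-based lookup without hitting the live site, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the spider itself: it uses xml.etree.ElementTree rather than Scrapy's selectors (ElementTree supports only a subset of XPath and has no text() step, so .text is used instead), and the SAMPLE markup and the extract_rows helper are made up for the example; the real page's HTML may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real page's HTML may differ.
SAMPLE = """
<table class="list">
  <tr><th>Name</th><th>Location</th></tr>
  <tr>
    <td class="name"><a href="/theaters/1">Airdome</a></td>
    <td class="location">Ardmore, OK, United States</td>
  </tr>
  <tr>
    <td class="name"><a href="/theaters/2">Liberty Theatre</a></td>
    <td class="location">Chickamauga, GA, United States</td>
  </tr>
</table>
"""


def extract_rows(markup):
    """Return one dict per <tr>, looking cells up relative to that row."""
    root = ET.fromstring(markup)
    rows = []
    for row in root.findall("./tr"):
        # "./" anchors each query at the current row, not at the document root.
        name = row.find("./td[@class='name']/a")
        location = row.find("./td[@class='location']")
        rows.append({
            # A row without matching cells (here the header row) yields None.
            "Name": name.text if name is not None else None,
            "Address": location.text if location is not None else None,
        })
    return rows


for item in extract_rows(SAMPLE):
    print(item)
```

In this made-up sample the header row contains only th cells, so neither class-based lookup matches and both fields come back as None, while the data rows return their name and location.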

Output of the corrected spider:

...
...
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': None, 'Address': None}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Airdome', 'Address': '\n                Ardmore, OK, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Liberty Theatre', 'Address': '\n                Chickamauga, GA, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Route 54 Drive-In', 'Address': '\n                Tularosa, NM, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Auto Theatre', 'Address': '\n                Daytona Beach, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Drive-In', 'Address': '\n                Apalachicola, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$1.00 Cinema', 'Address': '\n                Sherman, TX, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$uper Cinemas', 'Address': '\n                East Lansing, MI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '0only Outdoor Theatre', 'Address': '\n                Little Chute, WI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '10 Hi Drive-In', 'Address': '\n                St. Cloud, MN, United States\n              '}
...
...
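
The log above contains one item whose fields are both None, and the names and addresses carry leading spaces and newlines. Since the question asks about CSV export, here is a small sketch of how such items could be cleaned and written out using only the standard library. The helper names clean_items and to_csv are hypothetical, and the raw list is copied from three of the log lines above; this is one possible approach, not part of the original answer.

```python
import csv
import io


def clean_items(items):
    """Drop items without a name and strip surrounding whitespace."""
    cleaned = []
    for item in items:
        if item.get("Name") is None:
            continue  # skip rows where the name cell did not match
        cleaned.append(
            {key: (value or "").strip() for key, value in item.items()}
        )
    return cleaned


def to_csv(items, fieldnames=("Name", "Address")):
    """Serialize the items to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# Items copied from the log output above.
raw = [
    {'Name': None, 'Address': None},
    {'Name': ' Airdome',
     'Address': '\n                Ardmore, OK, United States\n              '},
    {'Name': '#1 Drive-In',
     'Address': '\n                Apalachicola, FL, United States\n              '},
]

print(to_csv(clean_items(raw)))
```

Scrapy can also write the yielded items to a file directly through its feed exports, for example scrapy crawl listall -o theaters.csv (the output file name is an arbitrary choice), and the resulting file can then be loaded with pandas if further processing is needed.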

