Scrapy keeps getting blocked

cu6pst1q  posted on 2022-11-09 in Other

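I am trying to get a list of US movie theaters from http://cinematreasures.org/ as part of learning Python and Scrapy.
I wrote a spider to crawl the site, but I get no response when I run it. Please see the attached images of the HTML tree, my spider, the response when running the spider, and the changes I made to settings.py.
I would like to try proxy IPs, but I don't know how to use them with Scrapy. Please help.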

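I have tried the code in the Scrapy shell, and it works fine.
When I try to run it through scrapy crawl listor, I get nothing!
I just want to be able to export to CSV via Pandas, if possible.
Here is my code: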

name = 'listall'
allowed_domains = ['cinematreasures.org']
start_urls = ['http://cinematreasures.org/theaters/united-states?page=1&status=all']

# url = 'http://cinematreasures.org/theaters/united-states?page={}&status=all'

def parse(self, response):

    for row in response.xpath('//table//tr')[1:]:
        name =  row.xpath('td//text()')[2].get()
        address = row.xpath('td//text()')[4].get()   
        yield {
            'Name':name,
            'Address':address,
        }
    next_page = response.xpath("//a[@class='next_page']").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page))

qgelzfjb1#

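Your XPath expressions are incorrect. When you use relative XPath expressions, they need to start with "./", and in my opinion, using class specifiers is much easier than indexing.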

def parse(self, response):
    for row in response.xpath('//table[@class="list"]//tr'):
        name = row.xpath('./td[@class="name"]/a/text()').get()
        address = row.xpath('./td[@class="location"]/text()').get()
        yield {
            'Name': name,
            'Address': address,
        }
    next_page = response.xpath("//a[@class='next-page']/@href").get()
    if next_page:
        yield scrapy.Request(response.urljoin(next_page))
Output:
...
...
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': None, 'Address': None}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Airdome', 'Address': '\n                Ardmore, OK, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Liberty Theatre', 'Address': '\n                Chickamauga, GA, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': ' Route 54 Drive-In', 'Address': '\n                Tularosa, NM, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Auto Theatre', 'Address': '\n                Daytona Beach, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '#1 Drive-In', 'Address': '\n                Apalachicola, FL, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$1.00 Cinema', 'Address': '\n                Sherman, TX, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '$uper Cinemas', 'Address': '\n                East Lansing, MI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '0only Outdoor Theatre', 'Address': '\n                Little Chute, WI, United States\n              '}
2022-09-09 08:22:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://cinematreasures.org/theaters/united-states?page=1&status=all>
{'Name': '10 Hi Drive-In', 'Address': '\n                St. Cloud, MN, United States\n              '}
...
...
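
The log above shows two things worth tidying before export: the first item has None for both fields (presumably the table's header row), and each Address carries surrounding newlines and spaces. The following is a minimal sketch of one way to clean the items and write them to CSV using only the standard library. The helper names clean_items and to_csv are hypothetical, and the sample items are copied from the log rather than scraped live.

```python
import csv
import io


def clean_items(items):
    """Yield items with surrounding whitespace stripped, skipping empty rows."""
    for item in items:
        name, address = item.get("Name"), item.get("Address")
        # The first row in the log (likely the table header) has None for both fields.
        if name is None or address is None:
            continue
        yield {"Name": name.strip(), "Address": address.strip()}


def to_csv(items):
    """Serialize items to CSV text; fields containing commas are quoted automatically."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Name", "Address"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


# Sample items copied from the log output (not fetched from the site).
scraped = [
    {"Name": None, "Address": None},
    {"Name": " Airdome", "Address": "\n                Ardmore, OK, United States\n              "},
    {"Name": "#1 Drive-In", "Address": "\n                Apalachicola, FL, United States\n              "},
]

cleaned = list(clean_items(scraped))
print(to_csv(cleaned))
```

The same stripping could be done directly inside parse before yielding. For the export itself, Scrapy's built-in feed export may be enough, for example scrapy crawl listall -o theaters.csv (the file name is an assumption), and pandas.DataFrame(cleaned).to_csv(...) is an option if Pandas is preferred.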

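Besides the row selectors, the answer's code changes the pagination XPath in two ways: the class is written next-page instead of next_page, and /@href is appended so that .get() returns the link target rather than the whole <a> element. The sketch below illustrates why the second change matters, using urllib.parse.urljoin as a stand-in for response.urljoin. The anchor markup and href value are hypothetical, not taken from the site.

```python
from urllib.parse import urljoin

base = "http://cinematreasures.org/theaters/united-states?page=1&status=all"

# Hypothetical values standing in for what each XPath variant would return.
element_html = '<a class="next-page" href="/theaters/united-states?page=2&status=all">Next</a>'
href_only = "/theaters/united-states?page=2&status=all"

# Selecting the element itself: the raw markup ends up inside the URL.
bad_url = urljoin(base, element_html)

# Selecting @href: the relative link resolves to a clean absolute URL.
good_url = urljoin(base, href_only)

print(bad_url)
print(good_url)
```

If the selector matches nothing at all, as a mismatched class name would cause, .get() returns None and the if next_page check simply skips the request, so the spider would stop after the first page without raising an error.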