Scraping URLs with Scrapy

rpppsulh posted on 2022-12-13 in Other

I am trying to extract URLs, but I get this error:

Ignoring response <403 https://www.askgamblers.com/online-casinos/countries/ca>: HTTP status code is not handled or not allowed

This is the page link: https://www.askgamblers.com/online-casinos/countries/ca

import scrapy
from scrapy.http import Request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from scrapy_selenium import SeleniumRequest

class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.askgamblers.com/online-casinos/countries/ca']

    def parse(self, response):
        books = response.xpath("//div[@class='card__desc']//a[starts-with(@href, '/online')]").extract()
        for book in books:
            url = response.urljoin(book)
            print(url)

Answer #1 (ykejflvf)

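I think your problem is the use of extract(), which returns a list of every match. Try extract_first() or extract()[0] instead.
Another problem may be the way you wrote the XPath expression. //div[@class='card__desc']//a[starts-with(@href, '/online')] appears to retrieve the <a> elements themselves rather than the URLs they contain.
Try this: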

//div[@class='card__desc']//a/@href

Answer #2 (kninwzqo)

Use your own user agent instead of Scrapy's default one. In settings.py:

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
