I am trying to extract URLs, but I get this error: Ignoring response <403 https://www.askgamblers.com/online-casinos/countries/ca>: HTTP status code is not handled or not allowed
This is the page link: https://www.askgamblers.com/online-casinos/countries/ca
import scrapy
from scrapy.http import Request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
from scrapy_selenium import SeleniumRequest


class TestSpider(scrapy.Spider):
    name = 'test'
    start_urls = ['https://www.askgamblers.com/online-casinos/countries/ca']

    def parse(self, response):
        books = response.xpath("//div[@class='card__desc']//a[starts-with(@href, '/online')]").extract()
        for book in books:
            url = response.urljoin(book)
            print(url)
2 Answers
ykejflvf1#
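I think your problem is the use of extract(). Try extract_first() or extract()[0] instead. Another problem may be the way you wrote the XPath expression. //div[@class='card__desc']//a[starts-with(@href, '/online')] appears to retrieve the <a> elements rather than the URLs they contain. Appending /@href to the expression selects the URL itself.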
kninwzqo2#
Use your own user agent instead of the default user agent, in settings.py
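As a rough sketch, the change could look like the fragment below. USER_AGENT is the Scrapy setting that overrides the default user agent; the string shown is only a placeholder in the style of a desktop browser, so substitute the one your own browser sends. This is not guaranteed to clear the 403, since the site may apply other checks as well.

```python
# settings.py
# Placeholder value (an assumption): replace it with your own browser's User-Agent string.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)
```

The same key can also be set for a single spider through its custom_settings dictionary if you prefer not to change the project-wide settings.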