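I am trying to follow every link that appears for commercial contractors on this site: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/ and then extract the email addresses from the pages those links point to. When I run this script, however, Scrapy appends the entire HTML element to the base URL instead of only the link contained in that element.
Does anyone know how I can get the desired result, or what I am doing wrong?
Here is the code I have so far: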
from urllib import request
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    #user_agent = 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'
    #start_urls= ['https://lslbc.louisiana.gov/contractor-search/search-type-contractor/']

    def start_requests(self):
        start_urls = [
            'https://lslbc.louisiana.gov/contractor-search/search-type-contractor/',
        ]
        #request = scrapy.Request(url=urls, callback=self.parse, method="GET", cookies=[{'domain': 'lslbc.louisiana.gov','path': '/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'}], )
        #yield request
        for url in start_urls:
            yield scrapy.Request(url=url, callback=self.parse, cookies=[{'name': 'test', 'value': '', 'domain': 'lslbc.louisiana.gov','path': '/wp-admin/admin-ajax.php?api_action=advanced&contractor_type=Commercial+License&classification=&action=api_actions'}],)

    def parse(self, response):
        links = response.xpath('//*[@id="search-results"]/table/tbody/tr/td/a')
        for link in links:
            yield response.follow(link.get(), callback=self.parse)

    def parse_links(self, response):
        contractors = response.css()
        for contractor in contractors:
            yield {
                'name': contractor.css('').get().strip(),
                'email': contractor.css('td.[email_address]').get().strip(),
            }
It returns:
2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/> (referer: None)
2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/%3Ca%20data-bind=%22attr:%20%7B%20href:%20$row.showURL%20%7D,%20text:%20$row.company_name%22%20target=%22_blank%22%3E%3C/a%3E> (referer: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/)
2022-08-13 16:53:13 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://lslbc.louisiana.gov/contractor-search/search-type-contractor/%3Ca%20data-bind=%22attr:%20%7B%20href:%20$row.showURL%20%7D,%20text:%20$row.qualifying_party%22%20target=%22_blank%22%3E%3C/a%3E> (referer: https://lslbc.louisiana.gov/contractor-search/search-type-contractor/)
1 Answer
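The page has a built-in search option. Whenever you search by selecting commercial contractors, the data is loaded dynamically by JavaScript, in JSON format, through an API. That is why you cannot get the data you want from the plain HTML DOM. Full working code example: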
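The 404 lines in the log show the problem directly: each <a> in the static HTML carries only a data-bind attribute and no href, and link.get() returns the whole element as a string, which response.follow then joins onto the base URL. As a minimal sketch of the API approach (not the answerer's original code), the snippet below builds the admin-ajax.php URL from the query parameters that already appear in the question's cookie path and pulls email-like strings out of whatever JSON comes back. The response schema is unknown here, so the field names in the sample (results, company_name, email_address) are hypothetical, and the helper names build_api_url and extract_emails are made up for illustration.

```python
import json
import re
from urllib.parse import urlencode

# Endpoint and parameters are copied from the cookie path in the question.
# That the site serves contractor data from exactly this URL is an assumption.
API_BASE = "https://lslbc.louisiana.gov/wp-admin/admin-ajax.php"
API_PARAMS = {
    "api_action": "advanced",
    "contractor_type": "Commercial License",
    "classification": "",
    "action": "api_actions",
}

# Deliberately loose pattern: good enough to spot addresses in JSON values.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def build_api_url(base=API_BASE, params=API_PARAMS):
    """Return the JSON endpoint URL with the search parameters encoded."""
    return f"{base}?{urlencode(params)}"


def extract_emails(payload):
    """Walk a nested JSON structure and collect every email-like string."""
    found = []
    if isinstance(payload, dict):
        for value in payload.values():
            found.extend(extract_emails(value))
    elif isinstance(payload, list):
        for item in payload:
            found.extend(extract_emails(item))
    elif isinstance(payload, str):
        found.extend(EMAIL_RE.findall(payload))
    return found


if __name__ == "__main__":
    # Hypothetical payload; the real response layout has not been verified.
    sample = json.loads(
        '{"results": [{"company_name": "A", "email_address": "a@example.com"}]}'
    )
    print(build_api_url())
    print(extract_emails(sample))
```

In a Scrapy spider this would amount to yielding scrapy.Request(build_api_url(), callback=self.parse_api) from start_requests and calling extract_emails(response.json()) in that callback. Whether the endpoint answers a plain GET without extra headers or form data has not been checked here, so it is worth confirming the exact request in the browser's network tab first.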