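I am trying to scrape data from a website that publishes incident reports. I am using Scrapy together with Selenium, but it does not work. I am new to this and trying to understand what is going on. I installed Scrapy and Selenium in a venv. The site's structure is somewhat dated and hard to follow.
Any help would be greatly appreciated!
I am using Firefox, so I put this in my settings: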
    SELENIUM_DRIVER_NAME = 'firefox'
    SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver.exe')
    SELENIUM_DRIVER_ARGUMENTS = ['-headless']
My code looks like this:
    import scrapy
    from selenium.webdriver import firefox
    from http.server import executable
    from lib2to3.pgen2 import driver
    from scrapy.utils.project import get_project_settings

    class MeldingenSpider(scrapy.Spider):
        name = '112meldingen'

        def start_requests(self):
            settings = get_project_settings
            driver_path = settings.get('SELENIUM_DRIVER_EXECUTABLE_PATH')
            driver = firefox(executable_path=driver_path)
            driver.get('http://ftp.112meldingen.nl/index.php')
            xpath = '//*[@id="divContentAlerts"]'
            link_elements = driver.find_elements_by_xpath(xpath)

        def parse(self, response):
            articles = response.css('table::attr(id.alerts)')
            for article in articles:
                #if "haven" in article.css('div.title a::text').get():
                yield {
                    'headline': article.css('td.bold a::text').get(),
                    'timestamp': article.css('td.bold span').get(),
                    'location': article.xpath('td > td').get()[3]
                }
1 Answer
You can scrape the URL with the help of SeleniumRequest. Script:
You have to change the following settings in the settings.py file:
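A sketch of what those settings might look like, reusing the Firefox values from the question and following the configuration described in the scrapy-selenium README. It assumes `geckodriver` can be found on `PATH` (on Windows the name may need the `.exe` suffix, as in the question):

```python
from shutil import which

# Browser that the scrapy-selenium middleware should drive.
SELENIUM_DRIVER_NAME = 'firefox'
# Resolved from PATH; this is None if geckodriver is not installed there.
SELENIUM_DRIVER_EXECUTABLE_PATH = which('geckodriver')
# Run Firefox without opening a window.
SELENIUM_DRIVER_ARGUMENTS = ['-headless']

# Without this middleware, SeleniumRequest objects are fetched like plain
# requests and the page is never rendered in a browser.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_selenium.SeleniumMiddleware': 800,
}
```

Note that `which` has to be imported from `shutil`; the settings snippet in the question uses it without showing the import.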
Output: