Scrapy page source does not keep up with the updates Selenium makes to the page

o75abkj4  posted on 2023-02-04  in  Other

I am using the code below to scrape a web page:

import scrapy
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException, NoSuchElementException, StaleElementReferenceException

class JornaleconomicoSpider(scrapy.Spider):
    name = 'jornaleconomico'
    allowed_domains = ['jornaleconomico.pt']
    start_urls = ['https://jornaleconomico.pt/categoria/economia']

    def parse(self, response):
        options = Options()
        driver_path = '###' #Your Chrome Webdriver Path
        browser_path = '###' #Your Google Chrome Path
        options.binary_location = browser_path
        options.add_experimental_option("detach", True)

        self.driver = webdriver.Chrome(options=options, executable_path=driver_path)
        self.driver.get(response.url)

        ignored_exceptions=(NoSuchElementException,StaleElementReferenceException,)
        wait = WebDriverWait(self.driver, 120, ignored_exceptions=ignored_exceptions)

        self.new_src = None
        self.new_response = None

        i=0

        while i<10:
            # click next link
            try:
                element = wait.until(EC.element_to_be_clickable((By.XPATH, '*//div[@class="je-btn je-btn-more"]')))
                self.driver.execute_script("arguments[0].click();", element)
                self.new_src = self.driver.page_source
                self.new_response = response.replace(body=self.new_src)
                i += 1
            except TimeoutException:
                self.logger.info('No more pages to load.')
                self.driver.quit()
                break
            
        # grab the data
        headlines = self.new_response.xpath('*//h1[@class="je-post-title"]/a/text()').extract()

        for headline in headlines:
            yield {
                'text': headline
            }

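The code above is supposed to click "Ver mais artigos" (*See more articles*) 10 times and collect the text of every headline, but it only returns the first nine original headlines. I inspected the page source in the Selenium-driven Chrome window (kept open with the options.add_experimental_option("detach", True) line) and found that the page source is identical to that of the original page. To me this should not happen, because in that same Selenium window I can see all of the articles, not just the first nine. Even using WebDriverWait does not prevent this. How can I solve this problem?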


eqfvzcg8 #1

Here is an (almost) complete solution:

from json import loads, dumps
from requests import get, post
from lxml.html import fromstring
from re import search, sub, findall

headerz = {
    "accept": "application/json, text/javascript, */*; q=0.01",
    "accept-language": "en-US,en;q=0.9",
    "sec-ch-ua": "'Chromium';v='106', 'Google Chrome';v='106', 'Not;A=Brand';v='99'",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "'Linux'",
    "sec-fetch-dest": "empty",
    "sec-fetch-mode": "cors",
    "sec-fetch-site": "same-site",
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest"
}

url = "https://jornaleconomico.pt/categoria/economia"
pag_href = "https://jornaleconomico.pt/wp-admin/admin-ajax.php"
page_count = 0

r = get(url)
html = fromstring(r.content.decode())

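# Inline script that carries the AJAX nonce. It is fetched here but not parsed,
# and the "nonce" field in the pagination request below is sent empty.
rawnonce = html.xpath("//script[@id='je-main-js-extra']/text()")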

# print the links of the first 9 records
for p in html.xpath("//div[contains(@class,'je-posts-container')]//h1[contains(@class,'je-post-title')]/a"):
    ptitle = p.xpath("./text()")
    if ptitle:  # xpath() always returns a list, so check that it is non-empty
        post_title = ptitle[0]
        post_href = p.xpath("./@href")[0]
        print(post_href)

# pagination
while True:
    page_count += 9
    pag_params = {
        "action":"je_pagination",
        "nonce": "",
        "je_offset": page_count,
        "je_term": "economia"
    }
    r = post(pag_href, headers=headerz, data=pag_params)
    jdata = r.json()
    if (jdata and 'data' in jdata):
        jdata = jdata['data']['posts']
        html = fromstring(jdata)
        for p in html.xpath("//h1[contains(@class,'je-post-title')]/a"):
            ptitle = p.xpath("./text()")
            if ptitle:  # xpath() always returns a list, so check that it is non-empty
                post_title = ptitle[0]
                post_href = p.xpath("./@href")[0]
                print(post_href)
    else:
        break

The output looks like this:

https://jornaleconomico.pt/noticias/ministro-das-financas-diz-que-o-governo-esta-a-acompanhar-de-forma-atenta-inflacao-dos-produtos-alimentares-989146
https://jornaleconomico.pt/noticias/bancos-amortizam-antecipadamente-pagamento-dos-ltro-ao-bce-no-valor-de-16-mil-milhoes-989098
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-terca-feira-51-988633
https://jornaleconomico.pt/noticias/prestacao-da-casa-sobe-quase-200-euros-para-creditos-de-150-mil-euros-a-6-meses-989134
https://jornaleconomico.pt/noticias/portugal-2020-atinge-85-de-execucao-e-116-de-compromisso-ate-dezembro-989132
https://jornaleconomico.pt/noticias/crescimento-do-pib-de-67-da-mais-confianca-para-desempenho-de-2023-diz-fernando-medina-989124
https://jornaleconomico.pt/noticias/apesar-dos-reforcos-salario-minimo-portugues-continua-a-meio-da-tabela-na-europa-988979
https://jornaleconomico.pt/noticias/atividade-turistica-dormidas-aumentaram-863-face-a-2021-988958
https://jornaleconomico.pt/noticias/producao-industrial-cresceu-25-em-dezembro-988947
https://jornaleconomico.pt/noticias/pib-cresce-36-na-ue-e-35-na-zona-euro-988921
https://jornaleconomico.pt/noticias/fundo-soberano-da-noruega-regista-maiores-perdas-desde-2008-988905
https://jornaleconomico.pt/noticias/economia-do-reino-unido-e-a-unica-do-g7-com-perspectivas-de-crescimento-negativo-988900
https://jornaleconomico.pt/noticias/economia-portuguesa-cresceu-67-em-2022-988868
https://jornaleconomico.pt/noticias/revista-de-imprensa-nacional-as-noticias-que-estao-a-marcar-esta-terca-feira-48-988814
https://jornaleconomico.pt/noticias/economia-chinesa-com-fortes-perspetivas-de-crescimento-988855
https://jornaleconomico.pt/noticias/fmi-reve-em-alta-as-previsoes-globais-de-crescimento-global-para-2023-e-agradece-a-china-988823
https://jornaleconomico.pt/noticias/alemanha-vendas-a-retalho-registam-a-maior-queda-desde-abril-de-2021-988817
https://jornaleconomico.pt/noticias/je-bom-dia-ine-divulga-dados-sobre-a-inflacao-e-a-economia-988416
https://jornaleconomico.pt/noticias/economia-francesa-cresce-26-em-2022-988767
https://jornaleconomico.pt/noticias/topo-da-agenda-o-que-nao-pode-perder-nos-mercados-e-na-economia-esta-terca-feira-31-988687
https://jornaleconomico.pt/noticias/auditoria-da-igf-ao-sifide-deteta-319-milhoes-de-euros-em-credito-fiscal-indevido-988699
https://jornaleconomico.pt/noticias/ministerio-das-infraestruturas-esta-a-acompanhar-subida-de-precos-das-operadoras-988697
https://jornaleconomico.pt/noticias/economistas-preveem-crescimento-do-pib-entre-66-e-68-em-2022-988681
https://jornaleconomico.pt/noticias/queda-do-pib-em-cadeia-na-alemanha-faz-soar-alarmes-de-recessao-na-zona-euro-de-novo-988583
https://jornaleconomico.pt/noticias/je-podcast-ouca-aqui-as-noticias-mais-importantes-desta-segunda-feira-49-988067
https://jornaleconomico.pt/noticias/jmj-investimentos-da-igreja-do-governo-e-dos-municipios-somam-pelo-menos-155-milhoes-de-euros-988637
https://jornaleconomico.pt/noticias/da-energia-europeia-a-economia-chinesa-veja-as-escolhas-da-semana-no-mercados-em-acao-988544
https://jornaleconomico.pt/noticias/riscos-de-uma-nova-moeda-comum-para-brasil-e-argentina-ouca-o-podcast-atlantic-connection-988395
https://jornaleconomico.pt/noticias/sindicatos-reunem-se-hoje-com-governo-para-tentar-evitar-greve-na-cp-e-ip-988622
https://jornaleconomico.pt/noticias/fundo-europeu-para-os-media-e-informacao-abre-novos-concursos-988564
https://jornaleconomico.pt/noticias/pt2020-portugal-entre-paises-que-mais-executam-fundos-europeus-988590
https://jornaleconomico.pt/noticias/maiores-bancos-espanhois-preparam-se-para-contestar-taxa-sobre-lucros-caidos-do-ceu-988545
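
If lxml is not available, the same links could in principle be pulled out of a fragment with the standard library. The following is only a sketch: the class name PostLinkParser and the sample markup are hypothetical, modelled on the XPath used in the code above (an a element inside an h1 whose class includes je-post-title). Unlike the XPath contains() test, it matches the class as a whole token.

```python
from html.parser import HTMLParser


class PostLinkParser(HTMLParser):
    """Collect href values of <a> tags nested in <h1 class="je-post-title">."""

    def __init__(self):
        super().__init__()
        self._in_title = False  # True while inside a matching <h1>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h1" and "je-post-title" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False


# Hypothetical fragment; the real markup returned by the site may differ.
fragment = (
    '<div><h1 class="je-post-title"><a href="https://example.invalid/a">A</a></h1>'
    '<h2><a href="https://example.invalid/skip">skip</a></h2>'
    '<h1 class="je-post-title big"><a href="https://example.invalid/b">B</a></h1></div>'
)

parser = PostLinkParser()
parser.feed(fragment)
print(parser.links)
```

Links outside a matching h1 (the h2 entry in the sample) are ignored.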

5gfr0r5j #2

You do not actually need Selenium for this site; it is very easy to scrape. This is how I would do it if I needed to get data from there.
Testing with Postman

POST https://domain.pt/wp-admin/admin-ajax.php
content-type: application/x-www-form-urlencoded; charset=UTF-8
pragma: no-cache
user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36
x-requested-with: XMLHttpRequest

action=je_pagination&nonce=f2e925cd72&je_offset=9&je_term=economia
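
The form body of the request above can also be generated in code for successive offsets. This is a minimal sketch using only the standard library; build_body is a hypothetical helper name, and the nonce value is simply the one from the example request, so it should not be expected to remain valid.

```python
from urllib.parse import urlencode


def build_body(nonce: str, offset: int, term: str = "economia") -> str:
    """Return the URL-encoded form body for one pagination request."""
    return urlencode({
        "action": "je_pagination",
        "nonce": nonce,
        "je_offset": offset,
        "je_term": term,
    })


# Offsets advance in steps of 9, matching the nine posts shown per page.
for offset in (9, 18, 27):
    print(build_body("f2e925cd72", offset))
```

The first printed line is the same string as the Postman body shown above.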

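The first 9 blog records are printed together with their links. Pagination can be done with the Postman example above: just change "je_offset" to [9, 18, 27, etc.] and update the "nonce".
Every time you load the page you need to take a fresh "nonce" from the HTML. The site emits it on every page; try using re.search to get the "ajax_nonce" value.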

<script type='text/javascript' id='je-main-js-extra'>
/* <![CDATA[ */
var ajax_object = {"ajax_url":"https:\/\/domain.pt\/wp-admin\/admin-ajax.php","ajax_nonce":"f2e925cd72"};
/* ]]> */
</script>
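
The re.search step could look like the sketch below. The helper name extract_nonce and the exact pattern are assumptions; the pattern only relies on the nonce appearing as "ajax_nonce":"<value>" in the inline script, as in the snippet above.

```python
import re
from typing import Optional

# Assumed pattern: matches "ajax_nonce":"<value>" and captures the value.
NONCE_RE = re.compile(r'"ajax_nonce"\s*:\s*"([^"]+)"')


def extract_nonce(page_html: str) -> Optional[str]:
    """Return the ajax_nonce value found in the page HTML, or None."""
    match = NONCE_RE.search(page_html)
    return match.group(1) if match else None


# Sample taken from the inline script shown above.
sample = r'''<script type='text/javascript' id='je-main-js-extra'>
/* <![CDATA[ */
var ajax_object = {"ajax_url":"https:\/\/domain.pt\/wp-admin\/admin-ajax.php","ajax_nonce":"f2e925cd72"};
/* ]]> */
</script>'''

print(extract_nonce(sample))
```

Returning None rather than raising lets the caller decide what to do when the script tag is missing or its layout has changed.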

Try loading the page with requests.get and paginating with requests.post. This will make your job super easy, and it is much faster than Selenium.
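
The pagination loop itself can be sketched as follows. The HTTP call is passed in as a callable so the loop can be tried offline; iter_post_fragments, send, and max_pages are hypothetical names, and the response shape ({"data": {"posts": "<html>"}}) is assumed from the code in the first answer rather than verified against the site.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_post_fragments(
    send: Callable[[Dict[str, object]], Optional[dict]],
    nonce: str,
    term: str = "economia",
    step: int = 9,
    max_pages: int = 10,
) -> Iterator[str]:
    """Yield the HTML fragment of each AJAX page until no posts come back."""
    for page in range(1, max_pages + 1):
        payload = {
            "action": "je_pagination",
            "nonce": nonce,
            "je_offset": page * step,  # 9, 18, 27, ...
            "je_term": term,
        }
        reply = send(payload)
        # Assumed response shape: {"data": {"posts": "<html fragment>"}}
        data = reply.get("data") if isinstance(reply, dict) else None
        posts = data.get("posts") if isinstance(data, dict) else None
        if not posts:
            break  # stop on an empty or unexpected reply
        yield posts


# Offline stand-in for the real POST request (canned, made-up replies).
def fake_send(payload):
    canned = {9: "<h1>a</h1>", 18: "<h1>b</h1>"}
    html = canned.get(payload["je_offset"])
    return {"data": {"posts": html}} if html else {"success": False}


print(list(iter_post_fragments(fake_send, nonce="f2e925cd72")))
```

With the requests library, send could be a small wrapper that posts the payload as form data to the admin-ajax.php URL and returns the decoded JSON. The max_pages cap keeps the loop bounded even if the server keeps returning posts.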
