Python Scrapy Selenium WebDriverWait

vwkv1x7d · posted 2022-11-09 in Python

Fellow experts, I'm asking for your help, if you don't mind.
Recently I've been writing a web crawler in Python with Scrapy and Selenium, and it has been driving me crazy.
I just want to ask: is it possible that even when you use the statement
WebDriverWait(driver, 100, 0.1).until(EC.presence_of_all_elements_located((By.XPATH,xxxxx)))
to fetch the elements, you still get empty results? What's more, it doesn't even take the full 100 seconds before they come back empty. Why?
By the way, it happens at random, meaning the empty results can show up anywhere, at any time.
Is there something wrong with my network connection?
Could you help me out or give me some advice on the problem above?
Thanks a lot!

  • -------------Supplementary note----------------
    Thanks for the reminder.
    In short, I'm using Scrapy + Selenium to crawl a review site and writing the username, post time, review content, etc. to an .xlsx file via pipeline.py. I want it to run as fast as possible while still collecting complete information.
    On a page with many reviews, long review texts are collapsed, which means almost 20 reviews per page have their own expand button.
    So I need Selenium to click the expand button and then use the driver to grab the full review. Common sense says that after clicking the expand button the page takes a moment to load, and I believe the loading time depends on network speed, so using WebDriverWait here seemed like a sensible choice. In my experience the default parameters timeout=10, poll_frequency=0.5 looked too slow and error-prone, so I switched to timeout=100, poll_frequency=0.1.
    However, the problem is that every time I run the project with the command scrapy crawl spider, a few of the scraped reviews always come back empty, and the empty ones land in different positions on each run. I thought about forcing a pause with time.sleep(), but doing that on every page would cost a lot of time. Granted, it is a more reliable way to get complete information, but it also strikes me as inelegant and a bit clumsy.
    Is my question clear?
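One thing worth noting about the approach above (an observation, not a guaranteed fix): `EC.presence_of_all_elements_located` returns as soon as at least one matching element exists in the DOM; it does not wait for the elements' text to finish rendering. So the wait can legitimately return long before 100 seconds with elements whose `.text` is still empty. A custom wait condition that also checks the text would look something like the sketch below; the locator in the commented usage mirrors the one in the question, and a live `driver` is assumed:

```python
def all_elements_have_text(locator):
    """Expected-condition-style callable for WebDriverWait: returns the
    matched elements only once every one of them has non-empty text,
    otherwise returns False so the wait keeps polling."""
    def _predicate(driver):
        elements = driver.find_elements(*locator)
        if elements and all(el.text.strip() for el in elements):
            return elements
        return False
    return _predicate

# Intended usage (assumes a live driver on the review page):
# from selenium.webdriver.common.by import By
# from selenium.webdriver.support.wait import WebDriverWait
# unfolds = WebDriverWait(driver, 100, 0.1).until(
#     all_elements_have_text((By.XPATH, "//a[@class='unfold']")))
```

WebDriverWait accepts any callable that takes the driver and returns a truthy value, so this predicate drops in wherever `EC.presence_of_all_elements_located` was used.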

What exactly I mean by "I got somewhere empty" is shown in the screenshot below.


import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
full_content, words = [], []
unfolds = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//a[@class='unfold']")))

# Here's how I think about and design my loop body.

# I click the expand button, grab the text, fold it back up, then move on to the next one.

for i in range(len(unfolds)):
    unfolds[i].click()
    time.sleep(1)
    # After the JavaScript runs, `div[@class='review-content clearfix']` appears,
    # and some of the full review content is put in a `<p></p>` tag
    find_full_content_p = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']/p")))
    full_content_p = [j.text for j in find_full_content_p]
    # and some of it is put in `div[@class='review-content clearfix']` itself.
    find_full_content_div = WebDriverWait(driver,100,0.1).until(EC.presence_of_all_elements_located((By.XPATH,"//div[@class='review-content clearfix']")))
    full_content_div = [j.text for j in find_full_content_div]

    # merge the two lists
    full_content_p.extend(full_content_div)
    full_content.append("".join(full_content_p))
    words.append(len("".join(full_content_p)))
    time.sleep(1)

    # then put it away
    WebDriverWait(driver,100,0.1).until(EC.element_to_be_clickable((By.XPATH,"//a[@class='fold']"))).click()
driver.close()
pd.DataFrame({"users":users, "dates":dates, "full_content":full_content, "words":words})
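The fixed `time.sleep(1)` calls in the loop above are exactly the kind of blind wait the question wants to avoid. A generic alternative is to poll for the condition that actually changes after the click, e.g. the review container losing its `hidden` class. The helper below is a minimal, pure-Python sketch of that idea; `review_div` in the commented usage is a hypothetical element reference, not a name from the code above:

```python
import time

def poll_until(predicate, timeout=10.0, interval=0.1):
    """Tiny stand-in for WebDriverWait: call `predicate` repeatedly until it
    returns a truthy value (which is passed back to the caller) or `timeout`
    seconds elapse, in which case a TimeoutError is raised."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout:.1f}s")
        time.sleep(interval)

# Intended usage after clicking an unfold link (hypothetical names):
# unfolds[i].click()
# poll_until(lambda: 'hidden' not in (review_div.get_attribute('class') or ''))
```

Unlike a fixed sleep, this returns as soon as the condition holds, so fast pages are not penalized and slow pages still get the full timeout.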

Also, here is the code of an expert I truly respect, named sound wave. (It is slightly modified; the core code is unchanged.)

import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By

# from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome()

driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews, words = [], []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    words.append(len(review.text))
    print('done',len(reviews),end='\r')
pd.DataFrame({"users":users,"dates":dates,"reviews":reviews,"words":words})

holgip5t · answer #1

NEW
Added code for the douban site. To export the scraped data to csv, see the Pandas code in the OLD section below.

import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('...'))  # '...' stands for the chromedriver path

driver.get('https://movie.douban.com/subject/5045678/reviews?start=0')

users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'a[class=name]')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span[class=main-meta]')]
reviews = []
for review in driver.find_elements(By.CSS_SELECTOR, 'div.review-short'):
    show_more = review.find_elements(By.CSS_SELECTOR, 'a.unfold')
    if show_more:
        # scroll to the show more button, needed to avoid ElementClickInterceptedException
        driver.execute_script('arguments[0].scrollIntoView({block: "center"});', show_more[0])
        show_more[0].click()
        review = review.find_element(By.XPATH, 'following-sibling::div')
        while review.get_attribute('class') == 'hidden':
            time.sleep(0.2)
        review = review.find_element(By.CSS_SELECTOR, 'div.review-content')
    reviews.append(review.text)
    print('done',len(reviews),end='\r')

OLD
For the site you mentioned (imdb.com), there is no need to click the show-more buttons to scrape the hidden content, because the text is already loaded in the HTML code; it just isn't displayed on the page. So you can scrape all the reviews at once. The code below stores users, dates and reviews in separate lists and finally saves the data to a .csv file.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service(chromedriver_path))  # chromedriver_path: path to your chromedriver binary

driver.get('https://www.imdb.com/title/tt1683526/reviews')

# sets a maximum waiting time for .find_element() and similar commands

driver.implicitly_wait(10)

reviews = [el.get_attribute('innerText') for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
users = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.display-name-link')]
dates = [el.text for el in driver.find_elements(By.CSS_SELECTOR, 'span.review-date')]

# store data in a csv file

import pandas as pd
df = pd.DataFrame(list(zip(users,dates,reviews)), columns=['user','date','review'])
df.to_csv(r'C:\Users\your_name\Desktop\data.csv', index=False)
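A side note on why the code above reads the review bodies with `get_attribute('innerText')` rather than `.text`: Selenium's `.text` only returns text that is currently rendered, so for collapsed or hidden elements it can come back empty, while the `innerText` attribute of a non-rendered element still falls back to its full text content. A small helper (a sketch, not part of the original answer) can combine the two:

```python
def full_text(el):
    """Return an element's visible text, falling back to its innerText
    attribute when the element is hidden (Selenium's .text is empty for
    elements that are not rendered)."""
    return el.text or el.get_attribute('innerText') or ''

# Usage with any Selenium WebElement, e.g.:
# reviews = [full_text(el) for el in driver.find_elements(By.CSS_SELECTOR, 'div.text')]
```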

To print a single review you can do the following

i = 0
print(f'User: {users[i]}\nDate: {dates[i]}\n{reviews[i]}')

The output (truncated) is

User: dschmeding
Date: 26 February 2012
Wow! I was not expecting this movie to be this engaging. Its one of those films...
