Selenium: scraping a website whose data keeps updating at irregular intervals

gmol1639 · posted 2022-12-29

I'm trying to scrape a web application to get the values in a table. How can I scrape the table every time a new value is added to it, or how should I scrape the website? The website
My basic code only lets me scrape manually, which means many values end up not being scraped.

driver.find_elements_by_xpath

returns nothing on its own; only

WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located(...))

works.
Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

website = "https://play.pakakumi.com/"
path = r'D:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.get(website)

page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')

'''
# the plain find_elements_by_xpath call returned nothing for me:
k = driver.find_elements_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')

for item in k:
    print(item.text)
'''
# explicitly waiting for the first-column cells does work:
foo = WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.XPATH, '/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')))
for b in foo:
    print(b.text)
#print(foo)
cvxl0en2

Note: the full definitions of all 3 functions are **pasted here**, and the outputs have been uploaded to this spreadsheet. [Btw, I'm using CSS selectors because I'm more used to them, but the XPath equivalents probably wouldn't be very different.]

Solution 1 [shorter but limited]

You can scrape the link in the first row [thref below] and then wait until it [the "link in first column of first row"] changes.

# wait = WebDriverWait(driver, maxWait) 

    # while rowCt < maxRows and tmoCt < max_tmo:...

        # parsing the whole table to gather as much data as possible 
        tSoup = BeautifulSoup(driver.find_element(
            By.CSS_SELECTOR, 'table:has(th.text-center) tbody'
        ).get_attribute('outerHTML'), 'html.parser')

        # get link from first column of first row
        thref = tSoup.select_one(
            f'tr:first-child>td:first-child>a[href]'
        ).get('href')

        ################### scrape rows' data from tSoup ###################

        try:
            thref = thref.replace('\\/', '/').replace('/', '\\/')
            thSel = 'table:has(th.text-center) tbody>tr:first-child>'
            thSel += 'td:first-child>a[href^="\/games\/"]'
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR, f'{thSel}:not([href="{thref}"])')))
        except: tmoCt += 1 # program ends if tmoCt gets too high

With that, the first function (*scrape_pakakumi_lim*) tries to scrape a set number of rows (maxRows) and then uses pandas to try to save the scraped data to opfn (which defaults to "pakakumi.csv"); a rough sketch of how that outer loop might fit together follows the list below.
The main problems are:

  • you have to specify maxRows [so you can't scrape without a preset limit]
  • if you set maxRows too high, you may end up using too much memory
  • if anything breaks the program [errors, interruptions, etc.], *all* of the scraped data is lost
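
For orientation only, here is a minimal sketch of how those pieces could fit together in a scrape_pakakumi_lim-style loop. This is not the full function (that one is **pasted here**); the row-parsing details, the default values and the helper name are just stand-ins.

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_pakakumi_lim_sketch(driver, maxRows=500, maxWait=30, max_tmo=5, opfn='pakakumi.csv'):
    wait = WebDriverWait(driver, maxWait)
    addedIds, games = [], []
    tmoCt = 0
    while len(games) < maxRows and tmoCt < max_tmo:
        # parse the whole table from the current page
        tSoup = BeautifulSoup(driver.find_element(
            By.CSS_SELECTOR, 'table:has(th.text-center) tbody'
        ).get_attribute('outerHTML'), 'html.parser')

        # one dict per row; the real function pulls out more columns than this
        tGames = [{
            'game_id': r.select_one('td:first-child>a[href]').get('href').split('/')[-1],
            'cells': [td.get_text(strip=True) for td in r.select('td')],
        } for r in tSoup.select('tr:has(td:first-child>a[href])')]

        # filter out rows already collected, then extend the main list
        tGames = [t for t in tGames if t['game_id'] not in addedIds]
        addedIds += [t['game_id'] for t in tGames]
        games += tGames

        # wait until the link in the first row is a different one (i.e. new rows arrived)
        thref = tSoup.select_one('tr:first-child>td:first-child>a[href]').get('href')
        try:
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR,
                'table:has(th.text-center) tbody>tr:first-child>'
                f'td:first-child>a[href]:not([href="{thref}"])')))
        except Exception:
            tmoCt += 1  # the loop ends if this gets too high

    # everything is written out only once, at the very end
    pd.DataFrame(games).to_csv(opfn, index=False)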

Solution 2

*scrape_pakakumi* [the third and last function] relies on *scrape_pakakumi_api* [the second function] to collect extra data using the API [which returns a JSON response if all goes well]; a rough sketch of what such a helper could look like is included right after this paragraph. The API sometimes fails, especially when too many requests are sent too quickly; in that case only the hash and crash from the table are saved, while created_at stays empty and no plays are added.
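
Just to illustrate the shape of it, here is a hypothetical requests-based sketch of such a helper; the actual endpoint and the JSON field names are only in the full pasted definitions, so api_url_tmpl and the keys below are placeholders.

import requests

def scrape_pakakumi_api_sketch(game_id, api_url_tmpl, timeout=10):
    # api_url_tmpl is a placeholder for whatever endpoint the full code uses,
    # i.e. something that can be formatted with the game_id
    try:
        resp = requests.get(api_url_tmpl.format(game_id=game_id), timeout=timeout)
        resp.raise_for_status()
        data = resp.json()  # JSON response if all goes well
        game = {'game_id': game_id,
                'hash': data.get('hash'),          # placeholder key names
                'crash': data.get('crash'),
                'created_at': data.get('created_at')}
        plays = data.get('plays', [])
    except Exception:
        # e.g. rate-limited because too many requests were sent too fast:
        # keep only what the table gave us, leave created_at empty, add no plays
        game, plays = {'game_id': game_id, 'created_at': None}, []
    return game, plays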

Setting maxRows to None lifts the limit (although it defaults to 999), and you can also specify how many new rows to wait for to load (but this has to be less than 39, since the table only holds 40 rows). Instead of checking that the first cell no longer contains the same link, it checks whether the link that used to be at the top is now below the n-th row [n = wAmt below] (don't forget that maxWait should be adjusted so there's enough time for n new rows to load).

# wait = WebDriverWait(driver, maxWait) 

            thSel = 'table:has(th.text-center) tbody'
            if isinstance(wAmt, int) and 1 < wAmt < 39:
                thSel = f'{thSel}>tr:nth-child({wAmt})~tr>td:first-child'
            else: thSel = f'{thSel}>tr:first-child~tr>td:first-child'
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR, f'{thSel}>a[href="{thref}"]')))
        # except: tmoCt += 1

If wAmt is passed as a float (like 10.0 or 3.5), the program simply sleeps for that number of seconds instead of scanning for new rows.

        if isinstance(wAmt, float):
            if not gData: # only wait if there's no new data
                time.sleep(wAmt)
            continue # skip rest of loop

        # try....except: tmoCt += 1

Both solutions keep track of the game_ids that were already added and check against them to avoid duplicates.
In Solution 1, addedIds is initialized as an empty list, and a list comprehension is then used to simply filter out the duplicates.

addedIds, games, thref = [], [], '' # initiated outside loop

    # and then inside the loop:
        tGames = [t for t in tGames if t['game_id'] not in addedIds] # filter out duplicates

        games += tGames # add to main list

    # [main list (games) saved after loop]

In Solution 2, the output file is first checked for data from previous scrapes, and since each game_id is fetched individually [with the API] [in the inner loop], duplicates are skipped with continue. [The IDs are converted to strings because read_csv reads them in as numbers, and the JSON also has them as numbers, but they are originally extracted as strings from the links.]

    maxIds = maxRows if maxRows and 100 < maxRows < 500 else 100 # for trimming [set before the try below]

    try: # [before loop]
        prevData = pd.read_csv(gfn).to_dict('records') # get data from previous scrape
        addedIds = [str(g['game_id']) for g in prevData if 'game_id' in g][-1*maxIds:]
    except: addedIds = []

    # and then inside the loop:
        addedIds = addedIds[-1*maxIds:] # to reduce memory-usage a bit

        # scrape table

        for tg in tGames:
            if str(tg['game_id']) in addedIds: continue
            # tgg, tgp = scrape_pakakumi_api....
        # save scraped data

        addedIds += [str(g['game_id']) for g in gData]
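
For context, a hypothetical way of calling the two approaches; the real signatures are only visible in the full pasted definitions, so the parameter names here are just the ones discussed above.

# hypothetical calls -- actual signatures are in the full pasted definitions
scrape_pakakumi_lim(driver, maxRows=200, maxWait=15, opfn='pakakumi.csv')       # Solution 1: preset row limit
scrape_pakakumi(driver, maxRows=None, wAmt=10, maxWait=60, gfn='pakakumi.csv')  # Solution 2: no limit, wait for ~10 new rows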
