Selenium: scraping a website whose data keeps updating at irregular intervals

gmol1639 · posted 2022-12-29

I'm trying to scrape a web application to get the values in a table. How can I scrape the table every time a new value is added to it, or how should I scrape the website? The website
My basic code only lets me scrape manually, which means many values end up not being scraped.

driver.find_elements_by_xpath

returns nothing on its own; only

WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located(...))

works.
Here is my code:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup
import time

website = "https://play.pakakumi.com/"
path = r'D:\chromedriver_win32\chromedriver.exe'
driver = webdriver.Chrome(path)
driver.get(website)

page = driver.page_source
soup = BeautifulSoup(page, 'html.parser')

'''
# the plain find_elements_by_xpath call returned nothing for me:
k = driver.find_elements_by_xpath('/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')

for item in k:
    print(item.text)
'''
# explicitly waiting for the first-column cells does work:
foo = WebDriverWait(driver, 100).until(EC.presence_of_all_elements_located((By.XPATH, '/html/body/div/div[2]/div[2]/div/div[1]/div/div[3]/div/div[2]/div/table/tbody/tr/td[1]')))
for b in foo:
    print(b.text)
#print(foo)
cvxl0en2

Note: the full definitions of all 3 functions are **pasted here**, and the outputs have been uploaded to this spreadsheet. [Btw, I'm using CSS selectors because I'm more used to them, but the XPath equivalents probably wouldn't be very different.]

Solution 1 [shorter but limited]

You can scrape the link in the first row [thref below] and then wait until it [the "link in first column of first row"] changes.

# wait = WebDriverWait(driver, maxWait) 

    # while rowCt < maxRows and tmoCt < max_tmo:...

        # parsing the whole table to gather as much data as possible 
        tSoup = BeautifulSoup(driver.find_element(
            By.CSS_SELECTOR, 'table:has(th.text-center) tbody'
        ).get_attribute('outerHTML'), 'html.parser')

        # get link from first column of first row
        thref = tSoup.select_one(
            f'tr:first-child>td:first-child>a[href]'
        ).get('href')

        ################### scrape rows' data from tSoup ###################

        try:
            thref = thref.replace('\\/', '/').replace('/', '\\/')
            thSel = 'table:has(th.text-center) tbody>tr:first-child>'
            thSel += 'td:first-child>a[href^="\/games\/"]'
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR, f'{thSel}:not([href="{thref}"])')))
        except: tmoCt += 1 # program ends if tmoCt gets too high

With that, the first function (*scrape_pakakumi_lim*) tries to scrape a set number of rows (maxRows) and then uses pandas to try to save the scraped data to opfn (which defaults to "pakakumi.csv"); a rough sketch of how that outer loop might fit together follows the list below.
The main problems are:

  • you have to specify maxRows [so you can't scrape without a preset limit]
  • if you set maxRows too high, you may end up using too much memory
  • if anything breaks the program [errors, interruptions, etc.], *all* of the scraped data is lost
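
For orientation only, here is a minimal sketch of how those pieces could fit together in a scrape_pakakumi_lim-style loop. This is not the full function (that one is **pasted here**); the row-parsing details, the default values and the helper name are just stand-ins.

import pandas as pd
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_pakakumi_lim_sketch(driver, maxRows=500, maxWait=30, max_tmo=5, opfn='pakakumi.csv'):
    wait = WebDriverWait(driver, maxWait)
    addedIds, games = [], []
    tmoCt = 0
    while len(games) < maxRows and tmoCt < max_tmo:
        # parse the whole table from the current page
        tSoup = BeautifulSoup(driver.find_element(
            By.CSS_SELECTOR, 'table:has(th.text-center) tbody'
        ).get_attribute('outerHTML'), 'html.parser')

        # one dict per row; the real function pulls out more columns than this
        tGames = [{
            'game_id': r.select_one('td:first-child>a[href]').get('href').split('/')[-1],
            'cells': [td.get_text(strip=True) for td in r.select('td')],
        } for r in tSoup.select('tr:has(td:first-child>a[href])')]

        # filter out rows already collected, then extend the main list
        tGames = [t for t in tGames if t['game_id'] not in addedIds]
        addedIds += [t['game_id'] for t in tGames]
        games += tGames

        # wait until the link in the first row is a different one (i.e. new rows arrived)
        thref = tSoup.select_one('tr:first-child>td:first-child>a[href]').get('href')
        try:
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR,
                'table:has(th.text-center) tbody>tr:first-child>'
                f'td:first-child>a[href]:not([href="{thref}"])')))
        except Exception:
            tmoCt += 1  # the loop ends if this gets too high

    # everything is written out only once, at the very end
    pd.DataFrame(games).to_csv(opfn, index=False)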

Solution 2

*scrape_pakakumi* [the third and last function] relies on *scrape_pakakumi_api* [the second function] to collect extra data using the API [which returns a JSON response if all goes well]; a rough sketch of what such a helper could look like is included right after this paragraph. The API sometimes fails, especially when too many requests are sent too quickly; in that case only the hash and crash from the table are saved, while created_at stays empty and no plays are added.
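
Just to illustrate the shape of it, here is a hypothetical requests-based sketch of such a helper; the actual endpoint and the JSON field names are only in the full pasted definitions, so api_url_tmpl and the keys below are placeholders.

import requests

def scrape_pakakumi_api_sketch(game_id, api_url_tmpl, timeout=10):
    # api_url_tmpl is a placeholder for whatever endpoint the full code uses,
    # i.e. something that can be formatted with the game_id
    try:
        resp = requests.get(api_url_tmpl.format(game_id=game_id), timeout=timeout)
        resp.raise_for_status()
        data = resp.json()  # JSON response if all goes well
        game = {'game_id': game_id,
                'hash': data.get('hash'),          # placeholder key names
                'crash': data.get('crash'),
                'created_at': data.get('created_at')}
        plays = data.get('plays', [])
    except Exception:
        # e.g. rate-limited because too many requests were sent too fast:
        # keep only what the table gave us, leave created_at empty, add no plays
        game, plays = {'game_id': game_id, 'created_at': None}, []
    return game, plays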

Setting maxRows to None lifts the limit (although it defaults to 999), and you can also specify how many new rows to wait for to load (but this has to be less than 39, since the table only holds 40 rows). Instead of checking that the first cell no longer contains the same link, it checks whether the link that used to be at the top is now below the n-th row [n = wAmt below] (don't forget that maxWait should be adjusted so there's enough time for n new rows to load).

# wait = WebDriverWait(driver, maxWait) 

            thSel = 'table:has(th.text-center) tbody'
            if isinstance(wAmt, int) and 1 < wAmt < 39:
                thSel = f'{thSel}>tr:nth-child({wAmt})~tr>td:first-child'
            else: thSel = f'{thSel}>tr:first-child~tr>td:first-child'
            wait.until(EC.presence_of_all_elements_located((
                By.CSS_SELECTOR, f'{thSel}>a[href="{thref}"]')))
        # except: tmoCt += 1

If wAmt is passed as a float (like 10.0 or 3.5), the program simply sleeps for that number of seconds instead of scanning for new rows.

        if isinstance(wAmt, float):
            if not gData: # only wait if there's no new data
                time.sleep(wAmt)
            continue # skip rest of loop

        # try....except: tmoCt += 1

Both solutions keep track of the game_ids that were already added and check against them to avoid duplicates.
In Solution 1, addedIds is initialized as an empty list, and a list comprehension is then used to simply filter out the duplicates.

addedIds, games, thref = [], [], '' # initiated outside loop

    # and then inside the loop:
        tGames = [t for t in tGames if t['game_id'] not in addedIds] # filter out duplicates

        games += tGames # add to main list

    # [main list (games) saved after loop]

In Solution 2, the output file is first checked for data from previous scrapes, and since each game_id is fetched individually [with the API] [in the inner loop], duplicates are skipped with continue. [The IDs are converted to strings because read_csv reads them in as numbers, and the JSON also has them as numbers, but they are originally extracted as strings from the links.]

    maxIds = maxRows if maxRows and 100 < maxRows < 500 else 100 # for trimming [set before the try below]

    try: # [before loop]
        prevData = pd.read_csv(gfn).to_dict('records') # get data from previous scrape
        addedIds = [str(g['game_id']) for g in prevData if 'game_id' in g][-1*maxIds:]
    except: addedIds = []

    # and then inside the loop:
        addedIds = addedIds[-1*maxIds:] # to reduce memory-usage a bit

        # scrape table

        for tg in tGames:
            if str(tg['game_id']) in addedIds: continue
            # tgg, tgp = scrape_pakakumi_api....
        # save scraped data

        addedIds += [str(g['game_id']) for g in gData]
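
For context, a hypothetical way of calling the two approaches; the real signatures are only visible in the full pasted definitions, so the parameter names here are just the ones discussed above.

# hypothetical calls -- actual signatures are in the full pasted definitions
scrape_pakakumi_lim(driver, maxRows=200, maxWait=15, opfn='pakakumi.csv')       # Solution 1: preset row limit
scrape_pakakumi(driver, maxRows=None, wAmt=10, maxWait=60, gfn='pakakumi.csv')  # Solution 2: no limit, wait for ~10 new rows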
