Python: scraping LinkedIn company names with Selenium

Posted by 6qqygrtg on 2023-03-07 in Python

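I am trying to scrape the LinkedIn site and save all the company names on the page into a DataFrame, but when I run a for loop over the list elements, it prints the first company name on every iteration.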

from selenium import webdriver
import os
import time
import selenium
from selenium.webdriver.common.by import By
import undetected_chromedriver as uc
from selenium.webdriver.support.ui import Select
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
import pandas as pd

url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'
options = webdriver.ChromeOptions()
options.add_experimental_option('detach',True)
driver = webdriver.Chrome(r"C:\Users\i\Desktop\PPstuff\selenium\chromedriver.exe", options=options)
driver.get(url)
jobs = driver.find_elements(By.TAG_NAME,'li')
company_name = []
for job in jobs:
    company = job.find_element(By.XPATH,"//h4").text
    company_name.append(company)
    print(company)

vshtjzan #1

To extract all the company names, you need to induce WebDriverWait for visibility_of_all_elements_located() and, using a list comprehension, you can use either of the following locator strategies:

  • Using a CSS selector and get_attribute("innerHTML"):
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a.job-card-list__title")))])
  • Using XPath and the text attribute:
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[contains(@class, 'job-card-list__title')]")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
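
A hedged aside on the innerHTML variant: get_attribute("innerHTML") returns raw markup, so the strings can carry surrounding whitespace and HTML entities such as &amp;, whereas the text attribute returns only the rendered text. Below is a minimal sketch of cleaning such strings, assuming the anchor holds plain text with no nested tags. The helper name clean_inner_html and the sample string are hypothetical and not part of Selenium.

```python
import html


def clean_inner_html(raw):
    """Decode HTML entities and collapse runs of whitespace in an innerHTML string."""
    return " ".join(html.unescape(raw).split())


# Hypothetical value, shaped like what get_attribute("innerHTML") might return.
sample = "\n        Data Analyst &amp; Reporting Lead\n      "
print(clean_inner_html(sample))
```

The helper could be applied inside the list comprehension, for example clean_inner_html(my_elem.get_attribute("innerHTML")).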


Useful documentation links:

  • The get_attribute() method
  • The text attribute returns the text of the element.
  • Difference between text and innerHTML using Selenium

ovfsdjhp #2

I located the elements with a CSS selector (just my preference). I used Firefox, but Chrome should work as well. An if condition skips duplicates. This should work.

from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'

driver = webdriver.Firefox()
company_name = []

driver.get(url)
jobs = driver.find_elements(By.CSS_SELECTOR, ".hidden-nested-link")

for job in jobs:
    company = job.text
    # Skip company names that are already in the list
    if company not in company_name:
        company_name.append(company)
        print(company)
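
The skip-duplicates step can also be done once, after all the texts have been collected. Below is a minimal sketch, assuming the names have already been read into a list of strings. The helper name unique_in_order and the sample list are hypothetical.

```python
def unique_in_order(names):
    """Return names without duplicates, keeping first-seen order and dropping blanks."""
    cleaned = (name.strip() for name in names)
    # dict keys preserve insertion order, so this keeps the first occurrence of each name
    return list(dict.fromkeys(name for name in cleaned if name))


# Hypothetical sample, shaped like the texts read from the located elements.
sample = ["CareerMatch", "Turing", "Data2Bots", "Turing", "", "CareerMatch"]
print(unique_in_order(sample))
```

Compared with testing membership in a list on every iteration, this avoids rescanning the list for each element, which may matter on longer result pages.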

jpfvwuh4 #3

Try the code below; it prints the required elements:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
url = 'https://www.linkedin.com/jobs/search/?currentJobId=3492578215&geoId=105365761&keywords=data%20analyst&location=Nigeria&refresh=true'
driver.get(url)
driver.maximize_window()

companyNames = driver.find_elements(By.XPATH, '//h4/a')
for company in companyNames:
    print(company.text)

Console output:

CareerMatch
Turing
Data2Bots
NewGlobe
Canonical
Turing
TEDxMaitama Official
Flutterwave
CareerMatch
Mshel Homes Limited
Jobberman Nigeria
Flutterwave
Zer0Paper
CareerMatch
Renesas Electronics
Turing
KNN Corporate Services Ltd
AppCake
Canonical
Turing
CareerMatch
Turing
Verraki Africa
Shaldag Limited
Turing

Process finished with exit code 0
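
A likely reason the loop in the question repeats the first name: an XPath that begins with // is evaluated from the document root even when find_element is called on a child element, so job.find_element(By.XPATH, "//h4") returns the first h4 on the page each time. Prefixing the expression with a dot, as in job.find_element(By.XPATH, ".//h4"), scopes the search to the current element. The sketch below reproduces the effect without a browser, using the standard library's xml.etree.ElementTree as a stand-in. The markup is hypothetical and is not LinkedIn's real HTML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a job list (not LinkedIn's actual markup).
SAMPLE = """
<ul>
  <li><h3>Data Analyst</h3><h4>CareerMatch</h4></li>
  <li><h3>Data Engineer</h3><h4>Turing</h4></li>
  <li><h3>BI Analyst</h3><h4>Data2Bots</h4></li>
</ul>
""".strip()

root = ET.fromstring(SAMPLE)
items = root.findall("li")

# Mimics "//h4" inside the loop: the search starts at the document root every time,
# so each iteration finds the same first <h4>.
from_root = [root.find(".//h4").text for _ in items]

# Mimics ".//h4": the search is scoped to the current <li>.
scoped = [li.find(".//h4").text for li in items]

print(from_root)
print(scoped)
```

Note that ElementTree only supports a subset of XPath, so this illustrates the scoping idea and does not replicate Selenium's XPath engine.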
