python-3.x How do I scrape the text value of a specific element?

s3fp2yjn · published 2022-12-15 · in Python
Follow (0) | Answers (2) | Views (149)

I'm trying to extract location data from a website for a web-scraping project. Unfortunately, I can only scrape the salary and the date, not the location.
I think this is because Date and Salary both sit inside the same job-details class, and since the class name is identical I can't find a way to get past it when extracting the salary.
Can anyone help?
I plan to scrape with Python and then load the results into SQL. Sadly, I can't find a way to get the location either.

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 '
           '(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}

    
postList = []

def getPosts(page):
    url = f'https://www.technojobs.co.uk/search.phtml?page={page}&row_offset=10&keywords=data%20analyst&salary=0&jobtype=all&postedwithin=all'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')

    
    posts = soup.find_all('div', {'class': 'jobbox'})

    for item in posts:
        post = {
        'title': item.find('div', {'class': 'job-ti'}).text,
        'dateSalaryLocation': item.find('div', {'class': 'job-details'}).text,
        'description': item.find('div', {'class': 'job-body'}).text,
        }
        postList.append(post)
    return



for x in range(1, 37):
    getPosts(x)
    
df = pd.DataFrame(postList)
df.to_csv('TechnoJobs.csv', index=False, encoding='utf-8')
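For the "convert to SQL" step mentioned above, a minimal sketch of loading the scraped DataFrame into SQLite (the table name `jobs`, the in-memory database, and the sample row are made up for illustration):

```python
import sqlite3
import pandas as pd

# Hypothetical row in the same shape as postList above
postList = [
    {'title': 'Data Analyst',
     'dateSalaryLocation': 'Date: 24th November',
     'description': '...'},
]
df = pd.DataFrame(postList)

# Write the DataFrame into a SQLite table; if_exists='replace'
# drops and recreates the table on each run
conn = sqlite3.connect(':memory:')
df.to_sql('jobs', conn, if_exists='replace', index=False)

print(pd.read_sql('SELECT COUNT(*) AS n FROM jobs', conn))
```

Swapping the `sqlite3` connection for a SQLAlchemy engine works the same way if the target is a server database.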

jtjikinw1#

The details are spread across two elements of the same class, so you can either select them one after the other, or generalise: select every <strong> as a key and its next_sibling as the value to build a dict.

for item in posts:
    post = {
    'title': item.find('div', {'class': 'job-ti'}).text,
    'description': item.find('div', {'class': 'job-body'}).text,
    }
    post.update(
        {(x.text[:-1],x.next_sibling) if x.text[:-1] != 'Date' else (x.text[:-1],x.find_next_sibling().text)  for x in item.select('.job-details strong')}
    )
    postList.append(post)
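The pattern is easier to see on a tiny, self-contained snippet. The HTML below is a made-up stand-in for one `.job-details` block, not the real site markup; it mimics the structure where Date's value sits in a following element while Salary/Rate and Location are bare text nodes:

```python
from bs4 import BeautifulSoup

# Made-up markup mimicking the job-details structure
html = '''
<div class="job-details">
  <strong>Date:</strong> <span>24th November</span>
  <strong>Salary/Rate:</strong> £300 - £450
  <strong>Location:</strong> Canary Wharf London
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Each <strong> holds the key (with a trailing colon we strip);
# the value is either the next text node or the next element
details = {}
for strong in soup.select('.job-details strong'):
    key = strong.text[:-1]
    if key == 'Date':
        details[key] = strong.find_next_sibling().text
    else:
        details[key] = strong.next_sibling.strip()

print(details)
```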
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36' 
           '(KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36'}
postList = []
def getPosts(page):
    url = f'https://www.technojobs.co.uk/search.phtml?page={page}&row_offset=10&keywords=data%20analyst&salary=0&jobtype=all&postedwithin=all'
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.text, 'html.parser')
    
    posts = soup.find_all('div', {'class': 'jobbox'})
    
    for item in posts:
        post = {
        'title': item.find('div', {'class': 'job-ti'}).text,
        'description': item.find('div', {'class': 'job-body'}).text,
        }
        post.update(
            {(x.text[:-1],x.next_sibling) if x.text[:-1] != 'Date' else (x.text[:-1],x.find_next_sibling().text)  for x in item.select('.job-details strong')}
        )
        postList.append(post)
    return

for x in range(1, 2):
    getPosts(x)
    
pd.DataFrame(postList)
Output

| | Title | Description | Date | Salary/Rate | Location |
| --- | --- | --- | --- | --- | --- |
| 0 | Big Data Analyst | Job description: Data Analyst with Big Data - Canary Wharf. Our client is looking for a Data Analyst for a data products team that is driving innovation in financial services using Big Data. The client has a highly qualified, focused and mission-driven team. The models we build and the analytics we derive from financial data are of critical importance. | 24th November | £300 - £450 | Canary Wharf London |
| 1 | Big Data Analyst | Big Data Analyst - Canary Wharf. Our client is looking for a Data Analyst for a data products team that is driving innovation in the financial services industry using Big Data. The client has a highly qualified, focused and mission-driven team. The models we build and the analytics derived from financial data are critical to the cutting edge of the financial services industry. | 30th November | £250 - £450 | Canary Wharf Docklands London |
| 2 | ESA - Perm - Data Analyst BI - Data | We are looking for someone with a hybrid profile: familiar with one or more BI/Dataviz solutions, able to liaise directly with the business teams and to pilot the project data. You are flexible and autonomous in the way you work, with three-week sprints ... | 25th November | £45,180 - £63,252 | Paris |
...


abithluo2#

import requests
from bs4 import BeautifulSoup, SoupStrainer
from concurrent.futures import ThreadPoolExecutor, as_completed
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:107.0) Gecko/20100101 Firefox/107.0'
}

def get_soup(content):
    # SoupStrainer restricts parsing to the .jobbox divs, skipping the rest of the page
    return BeautifulSoup(content, 'lxml', parse_only=SoupStrainer('div', attrs={'class': 'jobbox'}))

def worker(req, page):
    params = {
        "jobtype": "all",
        "keywords": "data analyst",
        "page": page,
        "postedwithin": "all",
        "row_offset": "10",
        "salary": "0"
    }
    # Retry until the request succeeds
    while True:
        try:
            r = req.get(
                'https://www.technojobs.co.uk/search.phtml', params=params)
            if r.ok:
                break
        except requests.exceptions.RequestException:
            continue
    soup = get_soup(r.content)
    print(f'Extracted Page# {page}')
    return [
        (
            x.select_one('.job-ti a').get_text(strip=True, separator=' '),
            x.select_one('.job-body').get_text(strip=True, separator=' '),
            list(x.select_one(
                'strong:-soup-contains("Salary/Rate:")').next_elements)[1].get_text(strip=True),
            list(x.select_one(
                'strong:-soup-contains("Location:")').next_elements)[1].get_text(strip=True)
        )
        for x in soup.select('.jobbox')
    ]

def main():
    with requests.Session() as req, ThreadPoolExecutor() as executor:
        req.headers.update(headers)
        futures = (executor.submit(worker, req, page) for page in range(1, 38))
        allin = []
        for res in as_completed(futures):
            allin.extend(res.result())
        df = pd.DataFrame(
            allin, columns=['Name', 'Description', 'Salary', 'Location'])
        # df.to_sql()
        print(df)

if __name__ == '__main__':
    main()

Output:

Name  ...          Location
0                  Junior Data Insights Analyst  ...         Knutsford
1                          Data Insight Analyst  ...         Knutsford
2                   Senior Data Insight Analyst  ...         Knutsford
3                         Business Data Analyst  ...           Glasgow
4                 Data Analyst with VBA and SQL  ...  Docklands London
..                                          ...  ...               ...
331             Power Platform Business Analyst  ...        Manchester
332             Power Platform Business Analyst  ...           Glasgow
333             Power Platform Business Analyst  ...        Birmingham
334           Cyber Security Compliance Analyst  ...            London
335  |Sr Salesforce Admin/ PM| $120,000|Remote|  ...          New York

[336 rows x 4 columns]
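The `:-soup-contains` selector used in `worker` can be tried on a tiny, self-contained snippet. The markup below is a made-up stand-in for one `.job-details` block: the label's first `next_element` is its own text node, so the element after that is the value.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for one .job-details block
html = '''
<div class="job-details">
  <strong>Salary/Rate:</strong> £300 - £450
  <strong>Location:</strong> Glasgow
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# :-soup-contains matches elements whose text contains the given string
label = soup.select_one('strong:-soup-contains("Location:")')
value = list(label.next_elements)[1].get_text(strip=True)
print(value)
```

This is the same trick as in the full answer above, just without the threading and pagination around it.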
