Iterate over a set of URLs and collect the data output in CSV format

eyh26e7m · asked 2023-03-05 · 1 answer · 130 views

Similar to this thread and its task, scrape with BS4 Wikipedia text (pair each heading with paragraphs associated) - and output it to CSV-format, I have a question: how do I iterate over a set of 700 URLs and fetch the data for 700 digital hubs in CSV (or Excel) format?

See the page where the data is provided:

https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool

The list of URLs looks like this:

https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view
https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view

and so on, up to 700 URLs in total. (One way to build the full list programmatically is sketched below.)
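Since the numeric hub IDs in these URLs are not sequential, a simple approach is to keep the IDs in a plain text file and build the URL list from it. A minimal sketch, assuming a hypothetical file hub_ids.txt with one ID (e.g. 3480) per line:

# build the list of hub URLs from a local file of IDs (hub_ids.txt is an
# assumed file name; any iterable of IDs works the same way)
base = "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/{}/view"

with open("hub_ids.txt") as f:
    urls = [base.format(line.strip()) for line in f if line.strip()]

print(len(urls), "URLs built")  # should report 700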

• Question: Can this be applied to the similar task of collecting the digital-hub data? I have already applied a scraper to a single page and it works - but how do I make the scraper iterate over the URLs and write the combined output to CSV, applying the same technique?

I want to pair the scraped paragraphs with the headings scraped from the hubCards. At the moment I scrape the hubCards one page at a time with the find method, but I want all 700 cards scraped together with their headings, so that the data sits side by side in a single file. I want to write the results in a suitable format - probably a CSV file (a sketch of the target CSV layout follows the list below). Note: each hub card carries the following h2 headings:

Title: (probably an h4 tag)
Contact: 
Description:
'Organization', 
'Evolutionary Stage', 
'Geographical Scope', 
'Funding', 
'Partners', 
'Technologies'
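For orientation, these headings map naturally onto CSV columns: one row per hub card, one column per heading. A minimal sketch with csv.DictWriter, assuming each scraped card ends up as a dict keyed by these headings (the example row is hypothetical; restval fills in any column a card lacks):

import csv

fieldnames = ['Title', 'Contact', 'Description', 'Organization',
              'Evolutionary Stage', 'Geographical Scope',
              'Funding', 'Partners', 'Technologies']

# hypothetical example row; real rows would come from the scraper
rows = [{'Title': 'Example Hub', 'Funding': 'Horizon 2020'}]

with open('hubs.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
    writer.writeheader()
    writer.writerows(rows)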

Here is what I have written for a single page:

from bs4 import BeautifulSoup
import requests

page_link = 'https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view'
page_response = requests.get(page_link, verify=False, timeout=5)
page_content = BeautifulSoup(page_response.content, "html.parser")

textContent = []
# walk every h4 heading on the card (skipping the first, non-content one)
for tag in page_content.find_all('h4')[1:]:
    texth4 = tag.text.strip()
    textContent.append(texth4)
    # keep only the <p> siblings whose nearest preceding h4 is this heading,
    # so each paragraph stays paired with the heading it belongs to
    for item in tag.find_next_siblings('p'):
        if texth4 in item.find_previous_siblings('h4')[0].text.strip():
            textContent.append(item.text.strip())

print(textContent)

Output in the console:

Description', 'Link to national or regional initiatives for digitising industry', 'Market and Services', 'Service Examples', 'Leveraging the holding system "EndoTAIX" from scientific development to ready-to -market', 'For one of SurgiTAIX AG\'s products, the holding system "EndoTAIX" for surgical instrument fixation, the SurgiTAIX AG cooperated very closely with the RWTH University\'s Helmholtz institute. The services provided comprised the complete first phase of scientific development. Besides, after the first concepts of the holding system took shape, a prototype was successfully build in the scope of a feasibility study. In the role regarding the self-conception as a transfer service provider offering services itself, the SurgiTAIX AG refined the technology to market level and successfully performed all the steps necessary within the process to the approval and certification of the product. Afterwards, the product was delivered to another vendor with SurgiTAIX AG carrying out the production process as an OEM.', 'Development of a self-adapting robotic rehabilitation system', 'Based on the expertise of different partners of the hub, DIERS International GmbH (SME) was enabled to develop a self-adapting robotic rehabilitation system that allows patients after stroke to relearn motion patterns autonomously. The particular challenge of this cooperation was to adjust the robot to the individual and actual needs of the patient at any particular time of the exercise. Therefore, different sensors have been utilized to detect the actual movement performance of the patient. Feature extraction algorithms have been developed to identify the actual needs of the individual patient and intelligent predicting control algorithms enable the robot to independently adapt the movement task to the needs of the patient. These challenges could be solved only by the services provided by different partners of the hub which include the transfer of the newly developed technologies, access to patient data, acquisition of knowledge and demands from healthcare personal and coordinating the application for public funding.', 'Establishment of a robotic couch lab and test facility for radiotherapy', 'With the help of services provided by different partners of the hub, the robotic integrator SME BEC GmbH was given the opportunity to enhance their robotic patient positioning device "ExaMove" to allow for compensation of lung tumor movements during free breathing. The provided services solved the need to establish a test facility within the intended environment (the radiotherapy department) and provided the transfer of necessary innovative technologies such as new sensors and intelligent automatic control algorithms. Furthermore, the provided services included the coordination of the consortium, identifying, preparing and coordinating the application for public funding, provision of access to the hospital’s infrastructure and the acquisition of knowledge and demands from healthcare personal.', 'Organization', 'Evolutionary Stage', 'Geographical Scope', 'Funding', 'Partners', 'Technologies']

So far, so good. The goal now is a solid solution: how do I iterate over the set of 700 URLs (in other words, 700 hubCards) and fetch the data for the 700 digital hubs in CSV (or Excel) format?

• Update:

Below is sample code that uses Python and BeautifulSoup to scrape the pages and extract the information for each digital hub:

import requests
from bs4 import BeautifulSoup
import csv

# create a list of the URLs for each digital hub
urls = ['https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/details/AL00106',
        'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/details/AT00020',
        # add the rest of the URLs here
       ]

# create an empty list to store the data for each digital hub
data = []

# iterate over each URL and extract the relevant information
for url in urls:
    # make a GET request to the webpage
    response = requests.get(url)
    
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, 'html.parser')
    
    # extract the relevant information from the HTML
    name = soup.find('h3', class_='mb-0').text.strip()
    country = soup.find('div', class_='col-12 col-md-6 col-lg-4 mb-3 mb-md-0').text.strip()
    website = soup.find('a', href=lambda href: href and 'http' in href).get('href')
    description = soup.find('div', class_='col-12 col-md-8').text.strip()
    
    # add the extracted information to the data list as a dictionary
    data.append({'Name': name, 'Country': country, 'Website': website, 'Description': description})

# write the data to a CSV file
with open('digital_hubs.csv', 'w', newline='') as csvfile:
    fieldnames = ['Name', 'Country', 'Website', 'Description']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    
    writer.writeheader()
    for hub in data:
        writer.writerow(hub)

In this sample code, we first create a list of the URLs, one per digital hub. We then iterate over each URL with a for loop and extract the relevant information using BeautifulSoup. The extracted information for each hub is stored as a dictionary in the data list. Finally, we write the data to a CSV file using the csv module.
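One practical caveat for 700 requests: the loop above stops at the first network error and hits the server as fast as it can. A minimal hardening sketch (the retry count and delay are arbitrary assumptions, not requirements of the site):

import time
import requests

# fetch one URL, retrying on network errors with a pause between attempts
def fetch(url, retries=3, delay=1.0):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(delay)
    return None  # caller should skip hubs that never came back

Inside the main loop, response = fetch(url) followed by if response is None: continue keeps a single bad hub from aborting the whole run.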


wn9m85ua1#

You can use zip() to iterate over the tags with class="hubCardTitle" paired with the element that immediately follows each one:

import requests
import pandas as pd
from bs4 import BeautifulSoup

urls = [
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/3480/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/13281/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1417/view",
    "https://s3platform-legacy.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/1349/view",
]

out = []
for url in urls:
    print(f"Getting {url}")
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # the hub name is the first <h2> on the page
    d = {"URL": url, "Title": soup.h2.text}

    # each section heading sits in a div.hubCardTitle; the CSS sibling
    # selector "+ div" grabs the div immediately following each heading
    titles = soup.select("div.hubCardTitle")
    content = soup.select("div.hubCardTitle + div")

    for t, c in zip(titles, content):
        t = t.get_text(strip=True)
        c = c.get_text(strip=True, separator="\n")
        d[t] = c  # the heading text becomes the column name

    out.append(d)

# one dict per hub -> one row per hub; missing sections become NaN
df = pd.DataFrame(out)
df.to_csv("data.csv", index=False)

This creates data.csv (shown in the original answer as a LibreOffice screenshot).
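Since the question also mentions Excel: the same DataFrame can be written straight to .xlsx (this assumes the openpyxl package is installed, which pandas uses as its default .xlsx engine):

# optional: write the same table to Excel instead of, or as well as, CSV
df.to_excel("data.xlsx", index=False)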
