csv "NoneType" object is not subscriptable: bs4 task fails permanently

zyfwsgd6  posted on 2023-03-27  in  Other

**Update:** I tried Driftr95's script in Google Colab and ran into some problems - the script fails and does not succeed. At the start of the script I noticed that some lines are commented out. Why is that? I will try to investigate further - in the meantime, many thanks, this is awesome.

Two ideas come to mind:

a. The detail page behind each result contains much more data (a rough exploration sketch follows the list of categories below). See one result page (out of roughly 700):

Digital Innovation Hub: 4PDIH - Public Private Partnership for Digital Innovation Hub
https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool/-/dih/17265/view?_eu_europa_ec_jrc_dih_web_DihWebPortlet_backUrl=%2Fdigital-innovation-hubs-tool

with datasets in the following categories:

Hub Information
Description
Contact Data
Organisation
Technologies
Market and Services
Service Examples
Funding
Customers
Partners
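
Since the structure of the detail pages is not documented here, the following is only a minimal exploration sketch (the URL is the detail page linked above; the h2/h3 heading tags are an assumption, not a confirmed part of the page) to dump the visible section headings before writing real selectors:

# Exploration sketch: fetch one hub detail page and print its headings so the
# section structure (Hub Information, Description, ...) can be inspected.
# NOTE: the ['h2', 'h3'] selector is an assumption about the page layout.
import requests
from bs4 import BeautifulSoup

detail_url = ('https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
              '/-/dih/17265/view'
              '?_eu_europa_ec_jrc_dih_web_DihWebPortlet_backUrl='
              '%2Fdigital-innovation-hubs-tool')
detail_soup = BeautifulSoup(requests.get(detail_url).content, 'html.parser')

for heading in detail_soup.find_all(['h2', 'h3']):
    print(heading.get_text(' ', strip=True))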

Second idea: using the awesome scripts:

NameError("name 'pd' is not defined") from https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur=1
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-c1f39e3c2547> in <module>
     11 
     12 # pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
---> 13 df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
     14 df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)

NameError: name 'pd' is not defined

Besides that, in the next attempt:

NameError                                 Traceback (most recent call last)
<ipython-input-5-538670405002> in <module>
      1 # df = pd.concat(dfList.....
----> 2 orig_cols = list(df.columns)
      3 for ocn in orig_cols:
      4     if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
      5     if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links

NameError: name 'df' is not defined

And the next one:

NameError                                 Traceback (most recent call last)
<ipython-input-1-4a00208c3fe6> in <module>
     10     pg_num += 1
     11     if isinstance(max_pg, int) and pg_num>max_pg: break
---> 12     pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
     13     rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
     14     all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]

NameError: name 'BeautifulSoup' is not defined
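
All three NameErrors appear to come from the same thing: the answer's snippets start with commented-out import lines (# import pandas as pd, # from bs4 import BeautifulSoup, ...), which assume the modules are already loaded. A minimal fix, assuming nothing has been imported yet in the Colab session, is to run the imports once before the snippets:

# Run these once before the answer's snippets; the commented-out import lines
# in those snippets assume this has already been done.
import requests
import pandas as pd
from bs4 import BeautifulSoup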

End of update.

The full story:

I am currently trying to learn Beautiful Soup (bs4), starting with fetching data: a scraper that uses Beautiful Soup to scrape the dataset from this page and writes the data to CSV, either directly or via Pandas. If I run this in Google Colab, I run into some strange problems. See below:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# Make a request to the webpage
url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
response = requests.get(url)

# Parse the HTML content with Beautiful Soup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table with the data
table = soup.find('table')

# Extract the table headers
headers = []
for th in table.find_all('th'):
    headers.append(th.text.strip())

# Extract the table rows
rows = []
for tr in table.find_all('tr')[1:]:
    row = []
    for td in tr.find_all('td'):
        row.append(td.text.strip())
    rows.append(row)

# Find the total number of pages
num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']

# Loop through each page and extract the data
for page in range(2, int(num_pages) + 1):
    # Make a request to the next page
    page_url = f'{url}?page={page}'
    page_response = requests.get(page_url)

    # Parse the HTML content with Beautiful Soup
    page_soup = BeautifulSoup(page_response.content, 'html.parser')

    # Find the table with the data
    page_table = page_soup.find('table')

    # Extract the table rows
    for tr in page_table.find_all('tr')[1:]:
        row = []
        for td in tr.find_all('td'):
            row.append(td.text.strip())
        rows.append(row)

# Create a Pandas DataFrame with the data
df = pd.DataFrame(rows, columns=headers)

# Save the DataFrame to a CSV file
df.to_csv('digital-innovation-hubs.csv', index=False)

Look at what I get if I run this in Google Colab:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-1-f87e37f02fde> in <module>
     27 
     28 # Find the total number of pages
---> 29 num_pages = soup.find('input', {'id': 'paginationPagesNum'})['value']
     30 
     31 # Loop through each page and extract the data

TypeError: 'NoneType' object is not subscriptable

Update: see what came back:

Thanks to riley.johnson3's help, I found out that the pagination wrapper should be the fix.

  • Many thanks for the quick help and the explanation - I have already collected a set of data - this is a sample. Now I have to find out how to get the full dataset: all 700 records with all the data. I guess we are almost there. Thanks again for the outstanding help - it is great and much appreciated ;)

thtygnil1#

pandas-only solution

If you just want the [immediately visible] table data, you can use pandas read_html in a loop until it raises an exception, and then concat all the scraped DataFrames together:

# import pandas as pd
output_fp = 'digital-innovation-hubs.csv'
dfList, pg_num, max_pg = [], 0, None
base_url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while (pg_num:=pg_num+1) and (not isinstance(max_pg,int) or pg_num<max_pg):
    pg_url = f'{base_url}?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur={pg_num}'
    # try: dfList += pd.read_html(pg_url, extract_links='all')[:1] ## [needs v1.5.0.]
    try: dfList += pd.read_html(pg_url)[:1]
    except Exception as e: pg_num, _ = -1, print(f'\n{e!r} from {pg_url}')
    else: print('', end=f'\rScraped {len(dfList[-1])} rows from {pg_url}')

# pd.concat(dfList).to_csv(output_fp, index=False) ## save without page numbers
df=pd.concat(dfList, keys=list(range(1,len(dfList)+1)),names=['from_pg','pgi'])
df.reset_index().drop('pgi',axis='columns').to_csv(output_fp, index=False)

As you can see in the output, the links are not scraped. It should be noted, however, that since pandas 1.5.0 you can set the extract_links parameter of read_html; the result will then look like this, but it can also be cleaned up, like:

# df = pd.concat(dfList.....
orig_cols = [c for c in df.columns if c != 'from_pg']
for ocn in orig_cols:
    if any(vals:=[cv for cv,*_ in df[ocn]]): df[ocn[0]] = vals
    if any(links:=[c[1] for c in df[ocn]]): df[ocn[0].split()[0]+' Links'] = links
if 'Email Links' in df.columns:
    df['Email'] = df['Email Links'].str.replace('mailto:','',1)
    df = df.drop('Email Links', axis='columns')
df = df.drop(orig_cols, axis='columns')
# df.....to_csv(output_fp, index=False)

requests + bs4 solution

The function below (view outputs for first page) should extract all the data from a single table row (a tr tag):

def get_row_dict(trTag):
    row = { td['data-ecl-table-header']: td.get_text(' ', strip=True) 
            for td in trTag.select('td[data-ecl-table-header]')} 
    for td in trTag.select('td[data-ecl-table-header]:has(a[href])'):
        k, link = td['data-ecl-table-header'], td.find('a',href=True)['href']
        if k=='Email' and link.startswith('mailto:'):
            link = link.replace('mailto:', '', 1)
        row[(k.split()[0]+' Link') if row[k] else k] = link
    return row
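
As a quick usage check (a sketch, assuming the same listing URL and the same tr selector used in the full loop below), it can be applied to the first row of the first page:

# Quick check: parse the listing page and print the dict for the first row.
import requests
from bs4 import BeautifulSoup

url = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
first_row = soup.select_one('tr:has(td[data-ecl-table-header])')
print(get_row_dict(first_row))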

When scraping paginated data, my preferred approach is a while loop whose condition is the existence of a next-page link.

# import requests
# import pandas as pd
# from bs4 import BeautifulSoup
# def get_row_dict...
output_fp = 'digital-innovation-hubs.csv'

all_rows, pg_num, max_pg = [], 0, None
next_link = 'https://s3platform.jrc.ec.europa.eu/digital-innovation-hubs-tool'
while next_link:
    pg_num += 1
    if isinstance(max_pg, int) and pg_num>max_pg: break
    pgSoup = BeautifulSoup((pgReq:=requests.get(next_link)).content, 'lxml')
    rows = pgSoup.select('tr:has(td[data-ecl-table-header])')
    all_rows += [{'from_pg': pg_num, **get_row_dict(r)} for r in rows]
    # all_rows += [get_row_dict(r) for r in rows] # no "from_pg" column

    ## just for printing ##
    pgNum = pgSoup.find('span', {'aria-current':"true", 'aria-label':True})
    if pgNum: pgNum = ['',*pgNum.get_text(' ', strip=True).split()][-1]
    from_pg=int(pgNum) if isinstance(pgNum,str) and pgNum.isdigit() else pg_num
    rowCt = pgSoup.find('div', class_='ecl-u-type-prolonged-s')
    rowCt = rowCt.text.split(':')[-1].strip() if rowCt else 'UNKNOWN'  
    vStr = f'{len(rows)} scraped [total: {len(all_rows)} of {rowCt}] - '
    vStr += f'<{pgReq.status_code} {pgReq.reason}> from {pgReq.url}'
    print(f'\r[{pg_num}][{pgNum}] {vStr}', end='')

    next_link = pgSoup.find('a', {'href':True, 'aria-label':'Go to next page'})
    if next_link: next_link = next_link['href']

pd.DataFrame(all_rows).to_csv(output_fp, index=False)

[ view results spreadsheet ]


o7jaxewo2#

The problem is that the id you are using (paginationPagesNum) does not exist on the page. This statement returns None:

soup.find('input', {'id': 'paginationPagesNum'})

You are trying to access the 'value' attribute on that NoneType, which is what causes the error. To fix it, you need to find the correct tag. This code sample finds the pagination wrapper, selects the individual page items, and takes their count:

pagination_wrapper = soup.select_one(
    '#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator'
)

pagination_items = pagination_wrapper.select(
    'ul > li:not(.ecl-pagination__item--next)'
)

num_pages = len(pagination_items)

Alternatively, here is a one-liner that accomplishes the same thing:

num_pages = len(soup.select('#_eu_europa_ec_jrc_dih_web_DihWebPortlet_list-web-page-iterator > ul > li:not(.ecl-pagination__item--next)'))

Note that :not(.ecl-pagination__item--next) is needed to filter out the next-page button; otherwise num_pages would be off by one.
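
With num_pages available, the original loop can then iterate over the remaining pages. A sketch follows; note that the other answer paginates via the _eu_europa_ec_jrc_dih_web_DihWebPortlet_cur query parameter, so the ?page= parameter from the original script may not be what the site expects:

# Sketch: iterate the remaining pages using num_pages found above.
# The ..._cur query parameter is taken from the other answer.
for page in range(2, num_pages + 1):
    page_url = f'{url}?_eu_europa_ec_jrc_dih_web_DihWebPortlet_cur={page}'
    page_response = requests.get(page_url)
    page_soup = BeautifulSoup(page_response.content, 'html.parser')
    # ... extract the table rows as in the original script ...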
