html: Why is my code giving me an AttributeError?

evrscar2 · posted on 2023-04-27 in Other

I am trying to retrieve legislation-related links by working through several levels of HTML. However, once I reach the second level of links, instead of retrieving the list of links to the individual bills, I get this error:
Exception has occurred: AttributeError
'NoneType' object has no attribute 'startswith'
  File "C:\Users\Justin\Desktop\ilgascrapetest1.py", line 14, in <module>
    if href.startswith('/legislation/BillStatus.asp?'):
       ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'startswith'
Here is the code so far:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)

I am able to retrieve and iterate over the list of links from the "House Bills" table in the HTML of the first page, but at the next level, which should give a list of links to the individual bills, I get the error above instead of the links for HB 0001 through HB 4042. Why am I getting this error?


wvmv3b1j · Answer #1

There are several <a> elements on this site that have no href, and in that case link.get('href') returns None. You cannot call startswith() on None, so you have to add a check for whether href is None.
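For illustration, the failure can be reproduced on a single tag that has no href (a minimal sketch using a hypothetical snippet, not taken from the actual site):

from bs4 import BeautifulSoup

# Hypothetical anchor that has no href attribute
tag = BeautifulSoup('<a name="h_bills">House Bills</a>', 'html.parser').find('a')

href = tag.get('href')
print(href)  # None, because the attribute is missing

# The next line is what fails in the original script:
# href.startswith('/legislation/BillStatus.asp?')
# AttributeError: 'NoneType' object has no attribute 'startswith'

With the None check added, the full script becomes: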

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section
house_bills = soup.find('a', {"name": "h_bills"}).parent

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href
    if href.startswith('/legislation/BillStatus.asp?'):
        bill_url = url + href
        bill_response = requests.get(bill_url)
        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width="100%"
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
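Alternatively, find_all() can do the filtering itself: passing href=True returns only those <a> elements that actually carry an href attribute, so the explicit check is not needed. A minimal sketch of the same first-level loop using that approach:

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Same section lookup as above
house_bills = soup.find('a', {"name": "h_bills"}).parent

# href=True keeps only anchors that have an href attribute
for link in house_bills.find_all('a', href=True):
    href = link['href']  # safe: href is guaranteed to exist here
    if href.startswith('/legislation/BillStatus.asp?'):
        print(href)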

Also, you have the URLs mixed up: first you need to open "grplist.asp", and only those pages contain the links that start with "BillStatus.asp". To visit only the links in the House Bills section, you need to select the div that follows the a element with name h_bills, rather than its parent. I also changed your code so that bill_url is no longer built from the full page URL, which includes "/default.asp".

import requests
from bs4 import BeautifulSoup

url = 'https://www.ilga.gov/legislation/default.asp'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# Find the House Bills section (next div after a with name "h_bills")
house_bills = soup.find('a', {"name": "h_bills"}).find_next_sibling("div")

# Iterate through all links in the House Bills section
for link in house_bills.find_all('a'):
    href = link.get('href')
    if not href:
        continue  # Ignore links without href

    if href.startswith('grplist.asp?'):
        bill_url = "https://www.ilga.gov/legislation/" + href

        bill_response = requests.get(bill_url)
        if bill_response.status_code != 200:  # Prevent crash when response is not valid
            continue

        bill_soup = BeautifulSoup(bill_response.content, 'html.parser')

        # Find the table cell with width="100%"
        td = bill_soup.find('td', {'width': '100%'})
        
        # Iterate through all the <li> elements in table
        for li in td.find_all('li'):
            print(li.text)
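For building bill_url, urllib.parse.urljoin from the standard library is a more robust alternative to string concatenation, because it resolves both relative references like "grplist.asp?..." and absolute paths like "/legislation/BillStatus.asp?..." against the page URL. A minimal sketch (the query strings below are placeholders, not real parameters from the site):

from urllib.parse import urljoin

page_url = 'https://www.ilga.gov/legislation/default.asp'

# Relative reference: replaces the last path segment ('default.asp')
print(urljoin(page_url, 'grplist.asp?x=1'))
# -> https://www.ilga.gov/legislation/grplist.asp?x=1

# Absolute path reference: keeps only the scheme and host from page_url
print(urljoin(page_url, '/legislation/BillStatus.asp?x=1'))
# -> https://www.ilga.gov/legislation/BillStatus.asp?x=1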
