Unable to extract a field when web scraping with Python

knpiaxh1 posted on 2021-08-25 in Java

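I am trying to scrape all of the company names (highlighted) from the website below. This is my first web scraping job, so I am struggling to understand why I cannot get the company names even though I believe I am passing the right parameters: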

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}  # Chrome user agent; any browser string can be used
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content)
soup.find_all("a href")  # this is not getting me company names
soup.find_all('alt')  # this doesn't either

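I located the HTML tags on the page and tried many small variations, but nothing seems to work. Any suggestion for collecting all of the company names in one place would mean a lot to me.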

Answer #1 (63lcw9qa)

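The company name appears as the alt attribute of the img tag inside each <td> tag whose class is company.
You are using soup.find_all('alt'), but alt is not a tag. A bare string passed to find_all is matched against tag names, not attribute names, so you have to select the tag first and then read the attribute from it.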

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')

t = soup.find_all('td', class_='company')  # find_all is the current name; findAll is a legacy alias

for i in t:
    print(i.find('img')['alt'])

Output:

First Focus
Managed Solution
centrexIT
Carbon60
Redcentric
BlackPoint IT Services
.
.
.
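
To make the tag-versus-attribute distinction concrete, here is a minimal sketch that runs against a small inline HTML fragment instead of the live site. The fragment and the name SAMPLE_HTML are assumptions made for illustration; the real page's markup may differ.

```python
import bs4

# Hypothetical fragment imitating the structure described above;
# it is not copied from the real site.
SAMPLE_HTML = """
<table>
  <tr><td class="company"><a href="/a"><img src="a.png" alt="First Focus"></a></td></tr>
  <tr><td class="company"><a href="/b"><img src="b.png" alt="Managed Solution"></a></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(SAMPLE_HTML, "html.parser")

# A bare string is treated as a tag name, so these calls look for
# elements literally named "alt" and "a href" and find nothing.
no_alt_tags = soup.find_all("alt")
no_a_href_tags = soup.find_all("a href")

# Filtering on the attribute instead: every <img> that carries an alt attribute.
imgs_with_alt = soup.find_all("img", alt=True)

# The same selection written as a CSS selector.
names = [img["alt"] for img in soup.select("td.company img[alt]")]
print(names)
```

Both forms select the img tag first and only then read its alt value, which is the step the original attempt skipped.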
Answer #2 (bpzcxfmw)

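You are not referencing the right tags and/or attributes with BeautifulSoup. I suggest working through a short HTML tutorial to understand tags and attributes, then looking at how bs4 selects them. From there you can see how to pull out tags and extract text and/or attribute values from them. Try the following code: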

import requests
import bs4

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content, 'html.parser')
data = soup.find_all('td', {'class':'company'})

for each in data:
    print(each.find('img')['alt'])

Output:

Managed Solution
Redcentric
First Focus
K3 Technology
ICC Managed Services
AffinityMSP
BCA IT, Inc.
CloudCoCo Plc (formerly Adept4 PLC)
SCC
Datacom Systems
Compugen
Cancom
All Covered
Computacenter
q.beyond AG
Atos
Controlware GmbH Firmenzentrale
Trustmarque
Bytes
AHEAD
ACP IT Solutions GmbH
PROFI Engineering Systems AG
PQR
Orbit GmbH
SVA System Vertrieb Alexander GmbH
Ensono
Phoenix Software Ltd
Atea Norge AS
Axians
Kick ICT Group
Atea Sverige AB
Catapult Systems LLC
Valid
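
Since the question asks for all names in one place, the loop can also collect them into a list instead of printing them. The following is only a sketch: the helper name extract_company_names and the sample markup are hypothetical, and it assumes the td/img structure described in the answers. It also skips cells that have no img or no alt text, which the code in the answers would raise an error on.

```python
import bs4


def extract_company_names(html):
    """Return the alt text of the first <img> in each <td class="company">."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    names = []
    for cell in soup.find_all("td", class_="company"):
        img = cell.find("img")
        # Skip cells without an image or without alt text instead of raising.
        if img is not None and img.get("alt"):
            names.append(img["alt"].strip())
    return names


# Hypothetical sample markup for demonstration only.
sample = """
<table>
  <tr><td class="company"><img src="a.png" alt="Managed Solution"></td></tr>
  <tr><td class="company">No logo here</td></tr>
  <tr><td class="other"><img src="x.png" alt="Not a company cell"></td></tr>
  <tr><td class="company"><img src="b.png" alt=" Redcentric "></td></tr>
</table>
"""

print(extract_company_names(sample))
```

To use it on the live page, pass request.content from the code above as the html argument.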
