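I am trying to collect data from the page https://clutch.co/il/it-services and I think there are a couple of options for doing that:
A. using bs4 and requests
B. using Pandas
This first approach uses option A.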
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://clutch.co/il/it-services"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
company_names = soup.find_all("h3", class_="company-name")
locations = soup.find_all("span", class_="locality")
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
df.to_csv("it_services_data.csv", index=False)
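This code is meant to
a. scrape the company names and locations from the given page,
b. store them in a Pandas DataFrame, and
c. save the data to a CSV file named it_services_data.csv in the current working directory.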
But I ended up with an empty result file.
What I did was:
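1. Install the required packages:
pip install beautifulsoup4 requests pandas
2. Import the necessary libraries:
import requests
from bs4 import BeautifulSoup
import pandas as pd
3. Send a GET request to the page and retrieve the HTML content:
url = "https://clutch.co/il/it-services"
response = requests.get(url)
4. Create a BeautifulSoup object to parse the HTML content:
soup = BeautifulSoup(response.content, "html.parser")
5. Identify the HTML elements that contain the data to scrape. Inspect the page source to find the relevant tags and attributes. For example, suppose we want to extract the company names and their locations. In this case, the company names are contained in h3 tags with the class "company-name" and the locations in span tags with the class "locality":
company_names = soup.find_all("h3", class_="company-name")
locations = soup.find_all("span", class_="locality")
6. Extract the data from the HTML elements and store it in lists:
company_names_list = [name.get_text(strip=True) for name in company_names]
locations_list = [location.get_text(strip=True) for location in locations]
7. Create a Pandas DataFrame to organize the extracted data:
data = {"Company Name": company_names_list, "Location": locations_list}
df = pd.DataFrame(data)
8. Optionally, use the Pandas DataFrame for further processing or analysis, or export the data to a file. For example, to save the data to a CSV file:
`df.to_csv("it_services_data.csv", index=False)`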
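That's it, that is all I did. I thought that with this approach I could use Python with the Beautiful Soup, Requests, and Pandas packages to scrape the company names and their locations from the given page.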
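One way to narrow down why the CSV comes out empty is to check, before parsing, whether the response is the real listing or a block page. The following is a minimal sketch only: the names `looks_blocked` and `report` and the marker strings are assumptions for illustration, not something taken from the actual responses of clutch.co, and the check is a heuristic that can misfire.

```python
# Assumed marker phrases that often show up on anti-bot interstitial pages.
BLOCK_MARKERS = ("enable javascript", "just a moment", "captcha")


def looks_blocked(status_code, body):
    """Return True if the response looks like a block page rather than content."""
    # Status codes commonly used when a request is refused or rate limited.
    if status_code in (403, 429, 503):
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


def report(status_code, body):
    """Build a one-line summary to print before handing the body to a parser."""
    state = "blocked" if looks_blocked(status_code, body) else "ok"
    return f"status={status_code} length={len(body)} state={state}"
```

With the code from the question, this could be called as `print(report(response.status_code, response.text))` right after `requests.get(url)`. If the state is "blocked", both `find_all` calls have nothing to match and the lists stay empty.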
Well, I also need the companies' website URLs. It would be great if I could collect even more data.
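When more fields are collected per company, separate `find_all` calls can drift out of step, and `pd.DataFrame` raises a ValueError if the column lists differ in length. A minimal sketch of reading every field from one container per company instead is shown here. The markup in `SAMPLE_HTML`, the class names `provider` and `website-link`, and the function names are all hypothetical; the real page structure would have to be checked in the browser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real clutch.co classes may differ.
SAMPLE_HTML = """
<ul>
  <li class="provider">
    <h3 class="company-name"><a href="/profile/acme">Acme IT</a></h3>
    <span class="locality">Tel Aviv</span>
    <a class="website-link" href="https://acme.example">Visit website</a>
  </li>
  <li class="provider">
    <h3 class="company-name"><a href="/profile/beta">Beta Labs</a></h3>
    <a class="website-link" href="https://beta.example">Visit website</a>
  </li>
</ul>
"""


def text_or_empty(tag):
    """Return the stripped text of a tag, or an empty string if it is missing."""
    return tag.get_text(strip=True) if tag else ""


def extract_companies(html):
    """Return one dict per company card so the fields stay aligned."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for card in soup.select("li.provider"):
        link = card.select_one("a.website-link")
        rows.append({
            "Company Name": text_or_empty(card.select_one("h3.company-name")),
            "Location": text_or_empty(card.select_one("span.locality")),
            "Website": link.get("href", "") if link else "",
        })
    return rows
```

The resulting list of dicts can be passed directly to `pd.DataFrame(rows)`, and a card with a missing field yields an empty cell instead of shifting the other columns.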
Update: many thanks to badduker. I tried it in Colab: after installing cloudscraper, I ran the code and got the following result:
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
During handling of the above exception, another exception occurred:
AttributeError: 'CloudflareChallengeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
AssertionError
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
During handling of the above exception, another exception occurred:
AttributeError: 'CloudflareChallengeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
TypeError: object of type 'NoneType' has no len()
During handling of the above exception, another exception occurred:
AttributeError: 'TypeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
AssertionError
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 Captcha challenge, This feature is not available in the opensource (free) version.
During handling of the above exception, another exception occurred:
AttributeError: 'CloudflareChallengeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
TypeError: object of type 'NoneType' has no len()
During handling of the above exception, another exception occurred:
AttributeError: 'TypeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
TypeError: object of type 'NoneType' has no len()
During handling of the above exception, another exception occurred:
AttributeError: 'TypeError' object has no attribute '_render_traceback_'
During handling of the above exception, another exception occurred:
AssertionError
1 Answer
juzqafwq #1
The site returns an error saying that you need to enable JavaScript. In other words, plain requests is probably not enough. However, you can try the cloudscraper module.
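A minimal sketch of such an attempt, assuming the cloudscraper package is installed. This is not the answer's original code, and `fetch_html` and `main` are hypothetical names. As the update in the question shows, the free version of cloudscraper can raise CloudflareChallengeError on a version 2 captcha challenge, so the sketch catches that error and reports it instead of crashing.

```python
URL = "https://clutch.co/il/it-services"


def fetch_html(url, session):
    """Fetch url with a requests-style session.

    Returns the HTML text, or None when the server does not answer with 200.
    """
    response = session.get(url, timeout=30)
    if response.status_code != 200:
        return None
    return response.text


def main():
    # Third-party imports live here so fetch_html can be used without them.
    import cloudscraper
    from cloudscraper.exceptions import CloudflareChallengeError

    scraper = cloudscraper.create_scraper()  # behaves like a requests.Session
    try:
        html = fetch_html(URL, scraper)
    except CloudflareChallengeError as exc:
        # Raised when the challenge cannot be solved by the free version.
        print(f"Cloudflare challenge not solved: {exc}")
        return
    if html is None:
        print("Request was rejected (non-200 status).")
        return
    print(f"Fetched {len(html)} characters of HTML.")


# Call main() to try this against the live site.
```

If the challenge cannot be solved this way, a browser-driven tool that executes JavaScript may be needed, since the page content is not reachable with a plain HTTP client.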