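I'm not sure where the problem is, but my code does not return the DataFrame I expect to get from the web page. This is my first extraction project and I can't identify the issue.
Here is the code: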
import requests
import sqlite3
import pandas as pd
from bs4 import BeautifulSoup
from datetime import datetime

url = 'https://en.wikipedia.org/wiki/List_of_largest_banks#By_market_capitalization'
db_name = 'Banks.db'
table_name = 'Largest_banks'
csv_path = '/home/project/Largest_banks_data.csv'
log_file = '/home/project/code_log.txt'
table_attribs = {'Bank name': 'Name', 'Market Cap (US$ Billion)': 'MC_USD_Billion'}

### Task 2 - Extract process
def extract(url, table_attribs):
    # Loading the webpage for scraping
    html_page = requests.get(url).text
    # Parse the HTML content of the webpage
    data = BeautifulSoup(html_page, 'html.parser')
    # Find the main table containing the relevant data
    main_table = data.find('table', class_='wikitable sortable')
    # Find the desired `tbody` elements within the main table
    table_bodies = main_table.find_all('tbody', attrs=table_attribs)
    # Extract data from each `tbody` element
    extracted_data = []
    for table_body in table_bodies:
        rows = table_body.find_all('tr')
        for row in rows:
            extracted_data.append([cell.text for cell in row.find_all('td')])
    # Use pandas to create a DataFrame from the extracted data
    df = pd.DataFrame(extracted_data, columns=list(table_attribs.values()))
    return df

# Calling the extract function
df = extract(url, table_attribs)
if df is not None:
    # Print the result DataFrame
    print(df)
else:
    print("Extraction failed.")
1 Answer
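You can read the page directly in pandas, for example with `pd.read_html`, which returns a list of DataFrames, one per table it finds.
For this page that loads 3 DataFrames, corresponding to the 3 tables on the page.
Printing the first element of that list shows the first table ("By market capitalization").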
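A minimal sketch of reading tables directly with pandas, assuming `pd.read_html` is the intended call. To keep it runnable offline, it parses a small inline sample (`SAMPLE_HTML`, a hypothetical stand-in for the Wikipedia page with made-up placeholder rows) instead of the live URL. The helper name `load_tables` is also an assumption, and `pd.read_html` needs an HTML parser such as lxml (or bs4 with html5lib) installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical miniature stand-in for the Wikipedia page: two tables with
# placeholder rows (not real bank data), so the example runs without network.
SAMPLE_HTML = """
<html><body>
<table class="wikitable sortable">
  <thead>
    <tr><th>Rank</th><th>Bank name</th><th>Market cap (US$ billion)</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Placeholder Bank A</td><td>100.0</td></tr>
    <tr><td>2</td><td>Placeholder Bank B</td><td>50.0</td></tr>
  </tbody>
</table>
<table class="wikitable sortable">
  <thead>
    <tr><th>Rank</th><th>Bank name</th><th>Total assets (US$ billion)</th></tr>
  </thead>
  <tbody>
    <tr><td>1</td><td>Placeholder Bank C</td><td>200.0</td></tr>
  </tbody>
</table>
</body></html>
"""


def load_tables(source):
    """Return every <table> found in `source` as a list of DataFrames.

    `source` may be a URL, a file path, or a file-like object.
    """
    return pd.read_html(source)


# Parse the inline sample; StringIO wraps the HTML text as a file-like object.
tables = load_tables(StringIO(SAMPLE_HTML))
first = tables[0]  # the first table on the page

print(len(tables))
print(first)

# Against the live page (needs network access, so left commented out here):
# tables = load_tables('https://en.wikipedia.org/wiki/List_of_largest_banks')
```

This also avoids the `find_all('tbody', attrs=table_attribs)` call in the question: `attrs` filters on HTML attributes, so passing the column-name mapping there matches no `tbody` element and leaves the DataFrame empty.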