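I tried to use BeautifulSoup to scrape the table data rows from a locally available HTML file (download link provided below), but without any success.
Here is my attempt: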
from bs4 import BeautifulSoup
import json

with open("web_summary.html", "r", encoding="utf-8") as file:
    html_file = file.read()

soup = BeautifulSoup(html_file, "html.parser")
# Locate the <script> tag inside the CellRangerSummary component
script = soup.find("div", {"data-component": "CellRangerSummary", "data-key": "summary"}).find("script")
# Treat everything after the first "=" as JSON (assumes a "name = {...}" assignment).
# json.loads() no longer accepts an encoding argument (removed in Python 3.9).
table_data = json.loads(script.text.split("=", 1)[1])
summary_data = table_data["summary"]
summary_tab = summary_data["summary_tab"]
rows = summary_tab["table"]["rows"]
for row in rows:
    print(row[0], row[1])
html file download link
Here is the expected output as a DataFrame (the rows from all tables):
Number of Spots Under Tissue 2,987
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Number of Reads 384,076,450
Valid Barcodes 97.70%
Valid UMIs 99.90%
Sequencing Saturation 80.20%
Q30 Bases in Barcode 98.90%
Q30 Bases in RNA Read 89.60%
Q30 Bases in UMI 98.80%
Reads Mapped to Genome 86.00%
Reads Mapped Confidently to Genome 79.10%
Reads Mapped Confidently to Intergenic Regions 5.20%
Reads Mapped Confidently to Intronic Regions 0.00%
Reads Mapped Confidently to Exonic Regions 73.90%
Reads Mapped Confidently to Transcriptome 65.60%
Reads Mapped Antisense to Gene 1.40%
Fraction Reads in Spots Under Tissue 97.30%
Mean Reads per Spot 128,583
Median Genes per Spot 4,553
Total Genes Detected 21,673
Median UMI Counts per Spot 14,169
Any ideas (BeautifulSoup or any other framework) on how to get my code working?
2 Answers
dced5bon #1
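The table contents you are looking for do not sit neatly in one particular table; instead, they are spread across several different tables inside the script tag. The script I suggest tries to collect the data from all of those tables. However, the closest possible solution using the approach you started with is:
Output: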
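As an illustration of the idea of collecting rows from every table in the embedded JSON, here is a minimal sketch. It assumes the parsed JSON is a nesting of dicts and lists in which each table is a dict with a "rows" list stored under a "table" key, as in the question's `summary_tab['table']['rows']`. The helper name `collect_table_rows`, the sample structure, and the key `hypothetical_section` are made up for the example and are not taken from the real file.

```python
def collect_table_rows(node):
    """Recursively gather the rows of every {"table": {"rows": [...]}} found in node."""
    rows = []
    if isinstance(node, dict):
        table = node.get("table")
        # A dict counts as a table holder when its "table" entry has a "rows" list.
        if isinstance(table, dict) and isinstance(table.get("rows"), list):
            rows.extend(table["rows"])
        # Keep descending: other tables may be nested deeper in the structure.
        for value in node.values():
            rows.extend(collect_table_rows(value))
    elif isinstance(node, list):
        for item in node:
            rows.extend(collect_table_rows(item))
    return rows


# Hypothetical stand-in for the JSON parsed from the script tag.
sample = {
    "summary": {
        "summary_tab": {
            "table": {"rows": [["Number of Spots Under Tissue", "2,987"]]},
            "hypothetical_section": {
                "table": {"rows": [["Valid Barcodes", "97.70%"]]}
            },
        }
    }
}

all_rows = collect_table_rows(sample)
for name, value in all_rows:
    print(name, value)
```

If the real data follows this shape, the resulting list of `[name, value]` pairs could be passed to `pandas.DataFrame(all_rows, columns=["Metric", "Value"])` to get the DataFrame asked for in the question.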
6rqinv9w2#
Pandas has a read_html function that works for your case
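A minimal sketch of how `read_html` is typically used, under some assumptions: `pandas.read_html` only parses `<table>` elements present in the markup and needs an HTML parser such as lxml (or bs4 with html5lib) installed. If the file keeps its data only as JSON inside a script tag, it may find no tables and raise a `ValueError`. The inline HTML below is a hypothetical stand-in for `web_summary.html`, and the column names `Metric` and `Value` are chosen for the example.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML with two small tables, standing in for the real file.
html = """
<table>
  <tr><td>Number of Spots Under Tissue</td><td>2,987</td></tr>
  <tr><td>Mean Reads per Spot</td><td>128,583</td></tr>
</table>
<table>
  <tr><td>Valid Barcodes</td><td>97.70%</td></tr>
</table>
"""

# read_html returns one DataFrame per <table>; thousands=None asks pandas
# not to treat "," as a thousands separator.
tables = pd.read_html(StringIO(html), thousands=None)

# Stack the rows of all tables into a single two-column DataFrame.
combined = pd.concat(tables, ignore_index=True)
combined.columns = ["Metric", "Value"]
print(combined)
```

For the local file, the call would be `pd.read_html("web_summary.html")` instead of the `StringIO` wrapper.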