pandas: web scraping with lxml

omqzjyyz, posted on 2023-06-20 in Other

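I am new to Python. I am trying to scrape an HTML table from https://tshf.sas.com/techsup/download/hotfix/HF2/E1Y_r64.html#E1Y015 using the XPath //table[contains(.,'E1Y015')].
It is essentially the only large table on that page.
Can someone help me write code, using lxml or any other effective approach, that scrapes the data from this particular HTML table into a pandas DataFrame? I would also like colspan to be handled properly.
I tried read_html, but it does not handle colspan well. Any help is much appreciated.
Here is my code using read_html: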

from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
import html5lib  # parser backend that pandas.read_html can fall back to

driver = webdriver.Chrome()
driver.get("xxxxx.html")
# The HTML file above lists URLs; the tables I need to scrape are on the pages behind them.
# Collect the elements that carry those hyperlinks.
links = driver.find_elements(By.XPATH, '//table/tbody/tr/td[3]/a[1]')
for l in links:
    # Read the href attribute of each hyperlink.
    url = str(l.get_attribute('href'))
    # Build a regex from the URL fragment (the part after '#') to locate the right table.
    spl_char = '#'
    r = url.partition(spl_char)[2] + r'\s+'
    all_tables = pd.read_html(url, match=r, keep_default_na=False)
    df = all_tables[0]
    # Finally, append the table to a CSV file. Ignore "arc" (and its index i);
    # it is unrelated to the problem and only supplies the filename.
    with open('./%s.csv' % arc[i], 'a') as f:
        f.write('\n')
    df.to_csv('./%s.csv' % arc[i], mode='a', index=False, header=False)
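
The colspan behaviour of read_html can be reproduced without the live site. The sketch below uses a small made-up inline table (the HTML, the name SAMPLE and the cell texts are assumptions for illustration, not content fetched from the SAS page) to show that a cell spanning three columns comes back with its text repeated in each of them:

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the real page: one header row, one data row,
# and one note row whose single cell spans all three columns.
SAMPLE = """
<table>
  <tr><th>Issue</th><th>Description</th><th>Introduced</th></tr>
  <tr><td>68672</td><td>An XML file cannot be deleted</td><td>E1Y015</td></tr>
  <tr><td colspan="3">NOTE: install hot fix G1Z002</td></tr>
</table>
"""

# match= keeps only tables whose text matches the regex, as in the question.
demo = pd.read_html(StringIO(SAMPLE), match=r"E1Y015")[0]
print(demo)

# The spanned cell is expanded into one copy per column it covers.
note_row = demo.iloc[-1].tolist()
print(note_row)
```

This is the same duplication that later ends up in the CSV file, so it has to be dealt with either before parsing or afterwards on the DataFrame.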

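When a cell has a colspan, the CSV file ends up with duplicate values in the adjacent cells of that row. After reading some Stack Overflow posts and other forums, I understood that lxml can handle this better. I referred to the "XML and HTML: Web Scraping" section of Python for Data Analysis by Wes McKinney.
There I found the following code, which fetches all tables and then picks specific ones by index: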

tables = doc.findall('.//table')
calls = tables[9]
puts = tables[13]

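I cannot use a table index in my case, because I have to go through multiple similar URLs and the required table is not always at the same position. Instead, I could select the required table with an XPath such as .//table[contains(.,'E1Y015')]. I am new to programming and am doing this to simplify a routine task of ours, so I need help figuring this out.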
example screenshot of the current output table
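
The XPath idea from the question can be sketched with lxml directly. The example below is only an illustration under stated assumptions: SAMPLE is a made-up page, table_to_frame is a hypothetical helper name, and the choice to pad a spanned cell with empty strings (instead of repeating its text) is one possible way to handle colspan, not the only one.

```python
from lxml import html
import pandas as pd

# Hypothetical page with two tables; only the second mentions the hot fix ID.
SAMPLE = """
<html><body>
<table><tr><td>some other table</td></tr></table>
<table>
  <tr><th>Issue</th><th>Description</th><th>Introduced</th></tr>
  <tr><td>68672</td><td>An XML file cannot be deleted</td><td>E1Y015</td></tr>
  <tr><td colspan="3">NOTE: applies to the whole row</td></tr>
</table>
</body></html>
"""


def table_to_frame(doc, marker):
    """Select a table by the text it contains and return it as a DataFrame."""
    # $marker is an XPath variable that lxml fills from the keyword argument.
    matches = doc.xpath("//table[contains(., $marker)]", marker=marker)
    if not matches:
        raise ValueError(f"no table contains {marker!r}")
    # Assumption: with nested tables the last match is the most specific one.
    table = matches[-1]
    rows = []
    for tr in table.xpath(".//tr"):
        cells = []
        for cell in tr.xpath("./th | ./td"):
            text = " ".join(cell.text_content().split())
            span = int(cell.get("colspan", 1))
            # Keep the text once and pad the remaining spanned columns.
            cells.append(text)
            cells.extend([""] * (span - 1))
        rows.append(cells)
    # Assumption: the first row of the table is the header row.
    return pd.DataFrame(rows[1:], columns=rows[0])


doc = html.fromstring(SAMPLE)
df = table_to_frame(doc, "E1Y015")
print(df)
```

For a real page, doc could be built from the downloaded HTML text in the same way, and the marker could be taken from the URL fragment after '#'. Rowspan is not handled in this sketch.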

mv1qrgav #1

Here is one way to load that table as a readable DataFrame:

from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup as bs

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

url = 'https://tshf.sas.com/techsup/download/hotfix/HF2/E1Y_r64.html#E1Y015'

soup = bs(requests.get(url).text, 'html.parser')
# select() only takes a CSS selector, so filter on the table text explicitly.
tables = [t for t in soup.select('table[class="ITPtable2022"]') if 'E1Y015' in t.get_text()]
table = tables[-1]
# Newer pandas versions expect a file-like object rather than a raw HTML string.
df = pd.read_html(StringIO(str(table)), skiprows=2, header=0)[0]
print(df)  # use display(df) in a Jupyter notebook
df.to_csv('test_example_csv.csv', index=False, escapechar='\\')

Terminal output:

Issue(s) Addressed:     Issue(s) Addressed:.1   Introduced:
0   63282   The INFOMAPS procedure does not recognize DEFAULT_VALUES=_ALL_ as a valid value for a numeric prompt    E1Y001
1   64719   SAS® XML Mapper contains an XML External Entity (XXE) vulnerability that also affects the XMLV2 LIBNAME engine  E1Y005
2   65891   The XSL procedure contains an authorization-bypass vulnerability    E1Y006
3   66134   SAS® Management Console contains an XML External Entity (XXE) processing vulnerability  E1Y007
4   66307   ALERT - You see a null pointer exception message in the Schedule Manager when you initially schedule a flow with Platform Suite for SAS®    E1Y008
5   67449   A security vulnerability in IBM Platform Process Manager affects Platform Suite for SAS®    E1Y011
6   68479   The GemfireBasedTicketCache cache locator can fail with the error "ClassCastException" in a clustered SAS® 9.4 middle-tier environment  E1Y013
7   68672   An XML file that is read by using the AUTOMAP= option with the XMLV2 LIBNAME engine cannot be deleted within the SAS® session   E1Y015
8   NOTE: If you install this hot fix and have SAS XML Mapper 9.45 installed, you must also install hot fix G1Z002. If you install this hot fix and need the fix documented in SAS Note 65891, you must also install hot fix H7X001 for Base SAS 9.4_m6.    NOTE: If you install this hot fix and have SAS XML Mapper 9.45 installed, you must also install hot fix G1Z002. If you install this hot fix and need the fix documented in SAS Note 65891, you must also install hot fix H7X001 for Base SAS 9.4_m6.    NOTE: If you install this hot fix and have SAS XML Mapper 9.45 installed, you must also install hot fix G1Z002. If you install this hot fix and need the fix documented in SAS Note 65891, you must also install hot fix H7X001 for Base SAS 9.4_m6.
9   Released: December 13, 2021 Documentation: E1Y015r6.html Download: E1Y015pt.zip     Released: December 13, 2021 Documentation: E1Y015r6.html Download: E1Y015pt.zip     Released: December 13, 2021 Documentation: E1Y015r6.html Download: E1Y015pt.zip
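
Rows 8 and 9 in the output above still carry the same text in all three columns, because the original cells span the full table width. One possible clean-up, sketched here on a small hand-built DataFrame (the abbreviated cell texts and the helper name blank_repeated_cells are assumptions for illustration), is to keep the first copy and blank the rest whenever every column of a row holds the same value:

```python
import pandas as pd

# Abbreviated, hand-built stand-in for the scraped table.
raw = pd.DataFrame(
    {
        "Issue": ["68672", "NOTE: install G1Z002", "Released: December 13, 2021"],
        "Description": [
            "An XML file cannot be deleted",
            "NOTE: install G1Z002",
            "Released: December 13, 2021",
        ],
        "Introduced": ["E1Y015", "NOTE: install G1Z002", "Released: December 13, 2021"],
    }
)


def blank_repeated_cells(frame):
    """Blank all but the first column in rows where every column is identical."""
    out = frame.astype(object).copy()
    # A row produced by a full-width colspan has exactly one distinct value.
    spanned = out.nunique(axis=1) == 1
    out.loc[spanned, out.columns[1:]] = ""
    return out


cleaned = blank_repeated_cells(raw)
print(cleaned)
```

This heuristic only catches cells that span every column; a partial colspan, or an ordinary row that happens to repeat one value, would need a check against the HTML itself.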

Related documentation: pandas, requests, BeautifulSoup
