pandas lxml XPath exponential performance behavior

ukxgm1gy posted on 2022-12-02 in Other


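I am trying to use XPath to query a large HTML document that contains many tables, and to extract only the few tables that have a specific pattern in one of their cells. I am running into a performance problem.
I have reduced the problem as far as I could.
Code setup: create 10 tables (300x15) filled with random integers from 0 to 99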

import pandas as pd
import numpy as np

dataframes = [pd.DataFrame(np.random.randint(0,100, (300, 15)), columns=[f"Col-{i}" for i in range(15)]) for k in range(10)]
html_strings = [df.to_html() for df in dataframes]
combined_html = '\n'.join(html_strings)
source_html = f'<html><body>{combined_html}</body></html>'

Code execution: I want to extract every table that contains the value "80" (in this example, all 10 tables will be extracted)

from lxml import etree
root = etree.fromstring(source_html.encode())

PAT = '80' # this should result in returning all 10 tables as 80 will definitely be there in all of them (pandas index)

# method 1: query directly using xpath - this takes a long time to run - and this seems to exhibit exponential time behavior
xpath_query = f"//table//*[text() = '{PAT}']/ancestor::table"  # f-string so that {PAT} is actually substituted
tables = root.xpath(xpath_query)

# method 2: this runs in under a second. first get all the tables and then run the same xpath expression within the table context
all_tables = root.xpath('//table')
table_xpath_individual = f".//*[text() = '{PAT}']/ancestor::table"  # f-string so that {PAT} is actually substituted
selected_tables = [table for table in all_tables if table.xpath(table_xpath_individual)]

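Method 1 takes 40 to 50 seconds to complete, while method 2 takes under 1 second.
I am not sure whether the XPath expression in method 1 is at fault, or whether this is an lxml issue.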
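To make the comparison easier to reproduce, here is a minimal timing sketch. It is not the exact setup above: the build_html helper, the table sizes, and the pattern "8" are assumptions chosen so that it runs quickly and needs only lxml. The tables are built by hand, with a th cell standing in for the pandas index column. Increasing the table count should show how the two methods scale relative to each other.

```python
import random
import time
from lxml import etree

def build_html(n_tables, n_rows=30, n_cols=5, seed=0):
    """Build a well-formed HTML string with n_tables tables of random integers (0-99).

    Hypothetical stand-in for DataFrame.to_html(); the <th> mimics the pandas index.
    """
    rng = random.Random(seed)
    tables = []
    for _ in range(n_tables):
        rows = []
        for r in range(n_rows):
            cells = "".join(f"<td>{rng.randint(0, 99)}</td>" for _ in range(n_cols))
            rows.append(f"<tr><th>{r}</th>{cells}</tr>")
        tables.append("<table><tbody>" + "".join(rows) + "</tbody></table>")
    return "<html><body>" + "".join(tables) + "</body></html>"

def method1(root, pat):
    # One document-wide query that climbs back up from every matching element.
    return root.xpath(f"//table//*[text() = '{pat}']/ancestor::table")

def method2(root, pat):
    # Collect the tables first, then test each one in its own context.
    query = f".//*[text() = '{pat}']/ancestor::table"
    return [table for table in root.xpath("//table") if table.xpath(query)]

def time_call(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

if __name__ == "__main__":
    PAT = "8"  # row index 8 exists in every table, so every table matches
    for n_tables in (2, 4, 8):
        root = etree.fromstring(build_html(n_tables).encode())
        found1, t1 = time_call(method1, root, PAT)
        found2, t2 = time_call(method2, root, PAT)
        print(f"{n_tables} tables: method 1 {t1:.4f}s ({len(found1)} found), "
              f"method 2 {t2:.4f}s ({len(found2)} found)")
```

Passing larger values for n_rows and n_cols brings the sketch closer to the 300x15 tables in the question, at the cost of a longer run for method 1.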

pandas

Source: https://stackoverflow.com/questions/74595362/lxml-xpath-exponential-performance-behavior

1 answer


3htmauhk1#

I don't know whether this is relevant (though I suspect it is), but you could simplify these XPaths by removing the //* step and the trailing /ancestor::table step:

//table[descendant::text() = '{PAT}']

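Note that with the problematic XPath, the processor finds, for each table, every descendant element whose text is 80 (there may be many within a single table), and then, for each of those elements, returns all of that element's table ancestors. In theory there could be several, for example if a table contains another table, so the XPath processor has to laboriously walk every one of those ancestor paths. That produces a large number of results, which then have to be deduplicated so that you don't get multiple instances of any given table (an XPath 1.0 node-set is guaranteed not to contain duplicates).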
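The difference can be seen on a tiny document. This is only an illustrative sketch: the three-table HTML below is made up for the example, and it demonstrates that both expressions select the same tables, not how fast they run.

```python
from lxml import etree

# Hypothetical miniature document: the first table contains "80" twice,
# the second not at all, the third once.
source_html = (
    "<html><body>"
    "<table><tr><td>80</td><td>80</td></tr></table>"
    "<table><tr><td>12</td><td>34</td></tr></table>"
    "<table><tr><td>5</td><td>80</td></tr></table>"
    "</body></html>"
)
root = etree.fromstring(source_html.encode())
tree = root.getroottree()
PAT = "80"

# Intermediate step of the original query: every element whose text is "80".
matching_cells = root.xpath(f"//table//*[text() = '{PAT}']")

# Original form: climb from each matching element back up to its table(s).
# The resulting node-set is deduplicated, so each table appears once.
original = root.xpath(f"//table//*[text() = '{PAT}']/ancestor::table")

# Simplified form: test each table once with a predicate.
simplified = root.xpath(f"//table[descendant::text() = '{PAT}']")

print(len(matching_cells))                    # 3 matching cells
print([tree.getpath(t) for t in original])    # 2 tables after deduplication
print([tree.getpath(t) for t in simplified])  # the same 2 tables
```

In the simplified form each table is either kept or dropped by its predicate, so there is no per-cell ancestor walk and nothing to deduplicate afterwards.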

Answered 2022-12-02
