pandas: how to strip HTML elements from strings in nested lists (Python)

Posted by enyaitl3 on 2022-12-21 in Python

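I decided to use BeautifulSoup to extract the integer strings from a pandas column. BeautifulSoup works fine in a simple example, but it does not work on a column of lists in pandas. I can't find any error. Can you help?
Input: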

import pandas as pd
from bs4 import BeautifulSoup

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

for list in df["col1"]:
    for item in list:
        if "span" in item:
            soup = BeautifulSoup(item, features = "lxml")
            item = soup.get_text()
        else:
            None  

print(df)

Expected output:

df = pd.DataFrame({
        "col1":[["9", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
        "col2":[0, 1, 0, 1],
    })

dldeef67 1#

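In your loop, item = soup.get_text() only rebinds the loop variable; nothing is written back to the list or the DataFrame, so df stays unchanged. Rather than iterating over the Series with a for loop, the simpler approach in pandas is to use apply and assign the result back to the column, as follows: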

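from bs4 import BeautifulSoup

def extract_text(lst):
    # Build a new list instead of reassigning the loop variable.
    new_lst = []
    for item in lst:
        if "span" in item:
            # Parse the markup and keep only its text content.
            new_lst.append(BeautifulSoup(item, features="lxml").text)
        else:
            new_lst.append(item)

    return new_lst

# Assign the result back so the column is actually replaced.
df['col1'] = df['col1'].apply(extract_text)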

Alternatively, you can write it in one line with a list comprehension:

df['col1'] = df['col1'].apply(
    lambda lst: [BeautifulSoup(item, features = "lxml").text if "span" in item else item for item in lst]
)
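
To see why the loop in the question leaves the data untouched, here is a minimal sketch. It assumes plain Python lists in place of df["col1"], and strip_tags is a hypothetical standard-library stand-in for BeautifulSoup(item, features="lxml").get_text(); it is not part of either answer. Assigning to item only rebinds the loop variable, while writing back through the index changes the list itself.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    # Hypothetical helper: stand-in for BeautifulSoup(...).get_text().
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Plain lists standing in for the values of df["col1"] (assumption).
rows = [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"]]

# 1) Rebinding the loop variable: the lists are never modified.
for row in rows:
    for item in row:
        if "span" in item:
            item = strip_tags(item)  # only the local name changes

after_rebinding = [row[:] for row in rows]  # snapshot for comparison

# 2) Writing back by index: the lists are modified in place.
for row in rows:
    for i, item in enumerate(row):
        if "span" in item:
            row[i] = strip_tags(item)

print(after_rebinding[0][0])  # still contains the <span> markup
print(rows)
```

The index-based loop should also work on the lists stored in a DataFrame column, since they are ordinary Python lists, but apply with assignment back to the column is the more common pandas idiom.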

ws51t4hk 2#

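This applies the extract_integer function to every item of each list in col1. If an item contains a "span" tag, it is replaced by the extracted text (the string "9", not an int); otherwise the item is left unchanged.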

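import pandas as pd
from bs4 import BeautifulSoup

def extract_integer(item):
    # Only parse strings that look like they contain a <span> tag.
    if "span" in item:
        soup = BeautifulSoup(item, features = "lxml")
        return soup.get_text()  # text content as a str, e.g. "9"
    return item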

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

df["col1"] = df["col1"].apply(lambda x: [extract_integer(item) for item in x])

print(df)

Output:

              col1  col2
0        [9, abcd]     0
1        [a, b, d]     1
2  [a, b, z, x, y]     0
3  [a, y, y, z, b]     1
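
Note that the printed frame drops the quotes around list items, so row 1 reads like three elements although it holds the two strings "a" and "b, d". A small sketch for checking the result unambiguously, assuming the frame already has the expected values from the question (the variable names as_lists and lengths are only illustrative):

```python
import pandas as pd

# Frame with the expected values from the question (assumed already cleaned).
df = pd.DataFrame({
    "col1": [["9", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y", "y, z, b"]],
    "col2": [0, 1, 0, 1],
})

# tolist() returns the underlying Python lists, so repr keeps the quotes.
as_lists = df["col1"].tolist()
print(as_lists)

# Number of items per row, which the printed frame does not make obvious.
lengths = df["col1"].map(len).tolist()
print(lengths)
```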
