pandas: Merge two DataFrames on a substring match without iterating

Posted by bqucvtff on 2023-03-11 in Other

I have two CSV files imported as DataFrames A and C. I want to match each string in A's content column to the entries in C's data.data column that contain it.

A:
time_a  content
100     f00
101     ba7
102     4242

C:
time_c  data.data
400     otherf00other
402     onlyrandom
407     otherba7other
409     other4242other

Should become:
time_a time_c content
100    400    f00
101    407    ba7
102    409    4242

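My solution below uses iterators, but it is too slow. This answer explains why and suggests ways to improve it, but I am struggling to implement any of them.
How can I do this with pandas' optimized methods?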

import pandas as pd

# reset_index() on both DataFrames first
df_alsa_copy = df_alsa.copy()  # never modify the frame you are iterating over
df_alsa_copy['cap_fno'] = -1   # -1 marks rows of df_alsa without a match

# Nested row-by-row loop: len(df_alsa) * len(df_c) substring checks, hence slow
for aIndex, aRow in df_alsa.iterrows():
    for cIndex, cRow in df_c.iterrows():
        if str(aRow['content']) in str(cRow['data.data']):
            df_alsa_copy.loc[aIndex, 'cap_fno'] = df_c.loc[cIndex, 'frame.number']

# https://stackoverflow.com/questions/31528819/using-merge-on-a-column-and-index-in-pandas
# Merge on the frame.number column (because I chose to include it in alsa_copy as a column)
df_ltnc = pd.merge(df_alsa_copy, df_c, left_on='cap_fno', right_on='frame.number')

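Also tried:

If there is an exact match, this works: Pandas: Join dataframe with condition.
I also managed to match the second frame against a known string with Series.str.contains.
The problem is that I cannot pass the DataFrame column to match against into merge's on= argument; I can only pass a known string.
The same problem occurred when I used apply.
I had no success with isin or similar methods.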
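
For illustration, here is a minimal sketch of the known-string case described above, rebuilt from the sample data in the question. The frames A and C are reconstructed by hand rather than read from the original CSV files, and the names mask and known_match are only illustrative.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question (assumed to be strings)
A = pd.DataFrame({"time_a": [100, 101, 102],
                  "content": ["f00", "ba7", "4242"]})
C = pd.DataFrame({"time_c": [400, 402, 407, 409],
                  "data.data": ["otherf00other", "onlyrandom",
                                "otherba7other", "other4242other"]})

# One known string is applied to every row of C as a literal substring test
mask = C["data.data"].str.contains("f00", regex=False)
known_match = C[mask]
print(known_match)
```

Series.str.contains takes a single pattern for the whole column, which is why this works for one known string but does not directly express "the value from the corresponding row of A".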

1 Answer

eufgjt7s #1

Try this approach using pandas.unique(), pandas.Series.str.contains, and pandas.DataFrame.merge:

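import re
import pandas as pd

# Escape each value so regex metacharacters in content match literally
pattern = '|'.join(re.escape(s) for s in A['content'].astype(str).unique())
matching_rows = C[C['data.data'].str.contains(pattern)]

# Use the substring extracted from data.data as the left merge key
out = pd.merge(matching_rows, A,
               left_on=matching_rows['data.data'].str.extract(f'({pattern})')[0],
               right_on='content')[['time_a', 'time_c', 'content']]
print(out)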

   time_a  time_c content
0     100     400     f00
1     101     407     ba7
2     102     409    4242
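
As an alternative that avoids building a regular expression, the same result can be sketched with a cross join followed by a substring filter. This is not part of the original answer. It assumes pandas 1.2 or later (for how="cross") and that both columns hold strings, and the names pairs, is_match, and result are only illustrative. The sample frames are rebuilt by hand from the question.

```python
import pandas as pd

# Sample frames rebuilt from the tables in the question (assumed to be strings)
A = pd.DataFrame({"time_a": [100, 101, 102],
                  "content": ["f00", "ba7", "4242"]})
C = pd.DataFrame({"time_c": [400, 402, 407, 409],
                  "data.data": ["otherf00other", "onlyrandom",
                                "otherba7other", "other4242other"]})

# Pair every row of A with every row of C: len(A) * len(C) rows in total
pairs = A.merge(C, how="cross")

# Keep only the pairs where content occurs inside data.data.
# This is still a Python-level loop, but over plain strings rather than iterrows() rows.
is_match = [content in data
            for content, data in zip(pairs["content"].astype(str),
                                     pairs["data.data"].astype(str))]

result = pairs.loc[is_match, ["time_a", "time_c", "content"]].reset_index(drop=True)
print(result)
```

The trade-off is memory: the intermediate frame grows with the product of the two row counts, so this sketch suits small or medium inputs. In exchange, no escaping is needed, and a row of C that contains several content strings yields one output row per match.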

Answered on 2023-03-11
