numpy Extract relevant rows from pandas dataframe when duplicate column values are present

Posted by sauutmhj on 2023-08-05 in Other


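I have a pandas dataframe as follows:

| left | top | width | height | Text |
| -- | -- | -- | -- | ------------ |
| 12 | 34 | 12 | 34 | commercial |
| 99 | 42 | 99 | 42 | general |
| 1 | 47 | 9 | 4 | liability |
| 10 | 69 | 32 | 67 | commercial |
| 99 | 72 | 79 | 88 | available |

I want to extract specific rows based on the value in the **Text** column. That is, I want to search the Text column with `re.search` for certain keywords such as `liability commercial` and, if there is a match, extract the matching rows, i.e. the third and fourth rows. So if the input is `liability commercial`, the output should be the following extracted rows:

| left | top | width | height | Text |
| -- | -- | -- | -- | ------------ |
| 1 | 47 | 9 | 4 | liability |
| 10 | 69 | 32 | 67 | commercial |

Keep in mind that the Text column may contain duplicate values. In the case above, two rows contain the word commercial.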
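For reference, here is a minimal sketch of the plain keyword search described above, assuming the sample data from the table. The `id` column is an assumption taken from the printed output in the answer, and `pattern`, `mask` and `matched` are illustrative names only. It shows why the duplicates matter: a simple `re.search` filter also returns the first `commercial` row.

```python
import re

import pandas as pd

# Sample data from the question; the "id" column is assumed from the answer's output.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "left": [12, 99, 1, 10, 99],
    "top": [34, 42, 47, 69, 72],
    "width": [12, 99, 9, 32, 79],
    "height": [34, 42, 4, 67, 88],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})

phrase = "liability commercial"
# One alternation pattern for all keywords; re.escape protects regex metacharacters.
pattern = re.compile("|".join(re.escape(word) for word in phrase.split()))

# True for every row whose Text contains any of the keywords.
mask = df["Text"].apply(lambda text: pattern.search(text) is not None)
matched = df[mask]
print(matched)
```

With this data the filter keeps the rows at index 0, 2 and 3, so the duplicate at index 0 still has to be excluded in some way.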

Thanks in advance!

numpy

Source: https://stackoverflow.com/questions/76626134/extract-relevant-rows-from-pandas-dataframe-when-duplicate-column-values-are-pre

1 Answer


gmxoilav1#

Use:

```python
phrase = 'liability commercial'

# Option 1: substring match on any of the space-separated words
m = df['Text'].str.contains(phrase.replace(' ', '|'))
# Option 2: exact match on any of the space-separated words
# (as written, this assignment overrides option 1; keep only the one you need)
m = df['Text'].isin(phrase.split())

# filter rows by the mask and keep the last occurrence of each duplicated Text value
df = df[m].drop_duplicates(['Text'], keep='last')
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```
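The two masks differ only when a cell contains more than the keyword itself. A small sketch of that difference, using made-up values (`non-commercial` and `liability insurance` are hypothetical and not from the question):

```python
import pandas as pd

# Hypothetical values chosen to show the difference between the two masks.
texts = pd.Series(["commercial", "non-commercial", "liability insurance", "general"])
phrase = "liability commercial"

# Regex alternation: True when a keyword occurs anywhere inside the cell.
substring_mask = texts.str.contains(phrase.replace(" ", "|"))
# Membership test: True only when the whole cell equals one keyword.
exact_mask = texts.isin(phrase.split())

print(substring_mask.tolist())  # [True, True, True, False]
print(exact_mask.tolist())      # [True, False, False, False]
```

If the keywords may contain regex metacharacters, escaping them with `re.escape` before joining with `|` is probably safer for the `str.contains` variant.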
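Alternatively, invert the mask and group the consecutive matching rows. Here the order of the split words and possible duplicates are not taken into account:

```python
phrase = 'liability commercial'
# True for rows that do NOT match any word of the phrase
m = ~df['Text'].str.contains(phrase.replace(' ', '|'))
# exact-match variant:
# m = ~df['Text'].isin(phrase.split())

# cumsum() gives each non-matching row and the matching rows after it one group id;
# keep the matching rows whose group id occurs more than once
df = df[m.cumsum().duplicated(keep=False) & ~m]
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```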
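To see what the `cumsum()` / `duplicated()` combination does, the intermediate values can be printed side by side. This is only an illustrative sketch on the question's `Text` values; the names `no_match`, `group_id`, `in_shared_group` and `selected` are not from the answer.

```python
import pandas as pd

text = pd.Series(["commercial", "general", "liability", "commercial", "available"])
phrase = "liability commercial"

# True where the row matches none of the keywords.
no_match = ~text.str.contains(phrase.replace(" ", "|"))
# The counter increases at every non-matching row, so that row and the
# matching rows directly after it share one group id.
group_id = no_match.cumsum()
# True where the group id occurs on more than one row.
in_shared_group = group_id.duplicated(keep=False)
# Keep only the matching rows of those groups.
selected = in_shared_group & ~no_match

steps = pd.DataFrame({
    "Text": text,
    "no_match": no_match,
    "group_id": group_id,
    "in_shared_group": in_shared_group,
    "selected": selected,
})
print(steps)
```

One thing to be aware of: the non-matching row that opens a group counts toward the duplicate test too, so a single matching row that directly follows a non-matching row is also kept, while a single matching row at the very top of the frame is not.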
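If you need an exact match on the sequence of split words, you can adapt this solution (linked in the code comment):

```python
import numpy as np

phrase = 'liability commercial'

# https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)

def rolling_window(a, window):
    # build a strided view with one row per window of `window` consecutive items
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['Text'].values
# True for each window that equals the pattern word for word
b = np.all(rolling_window(arr, N) == pat, axis=1)
# start positions of the matching windows
c = np.mgrid[0:len(b)][b]

# expand each start position to all N row positions of its window
d = [i for x in c for i in range(x, x + N)]
# np.isin replaces np.in1d, which is deprecated since NumPy 2.0
df = df[np.isin(np.arange(len(arr)), d)]
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```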
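On NumPy 1.20 or newer, the hand-written strided helper could probably be swapped for `np.lib.stride_tricks.sliding_window_view`. The following is a sketch under that assumption, not part of the original answer; the `id` column is assumed from the printed output, and `windows`, `hits` and `keep` are illustrative names.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the "id" column is assumed from the answer's output.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "left": [12, 99, 1, 10, 99],
    "top": [34, 42, 47, 69, 72],
    "width": [12, 99, 9, 32, 79],
    "height": [34, 42, 4, 67, 88],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})

phrase = "liability commercial"
pat = np.asarray(phrase.split(), dtype=object)
n = len(pat)

arr = df["Text"].to_numpy(dtype=object)
# One row per run of n consecutive Text values (read-only view, no copy).
windows = np.lib.stride_tricks.sliding_window_view(arr, n)
# Start positions of the windows that equal the phrase word for word.
hits = np.flatnonzero((windows == pat).all(axis=1))

# Mark every row position covered by a matching window.
keep = np.zeros(len(arr), dtype=bool)
for start in hits:
    keep[start:start + n] = True

result = df[keep]
print(result)
```

Unlike the mask-based variants, this only keeps rows where the words appear consecutively and in the given order, which is why the first `commercial` row is skipped. Note that `sliding_window_view` raises an error when the frame has fewer rows than the phrase has words, so a length check may be needed first.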

