numpy: Extracting relevant rows from a pandas DataFrame when a column contains duplicate values

Posted by sauutmhj on 2023-08-05 in Other

I have a pandas DataFrame that looks like this:

| left | top | width | height | Text |
| --- | --- | --- | --- | --- |
| 12 | 34 | 12 | 34 | commercial |
| 99 | 42 | 99 | 42 | general |
| 1 | 47 | 9 | 4 | liability |
| 10 | 69 | 32 | 67 | commercial |
| 99 | 72 | 79 | 88 | available |
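I want to extract specific rows based on the values in the **Text** column. That is, I want to search the Text column with re.search for certain keywords, such as liability commercial, and extract the rows where there is a match, i.e. the third and fourth rows. So if the input is liability commercial, the output should be the following extracted rows:

| left | top | width | height | Text |
| --- | --- | --- | --- | --- |
| 1 | 47 | 9 | 4 | liability |
| 10 | 69 | 32 | 67 | commercial |

Keep in mind that the Text column may contain duplicate values. In the case above, there are 2 rows that contain the word commercial.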

Thanks in advance!

gmxoilav #1

Use:

```python
phrase = 'liability commercial'

# Option 1: match by substring - the phrase is split on spaces and joined into a regex alternation
m = df['Text'].str.contains(phrase.replace(' ', '|'))
# Option 2: match whole values - the phrase is split on spaces (this overwrites the mask above)
m = df['Text'].isin(phrase.split())

# filter rows by the mask and keep the last occurrence of each duplicated Text value
df = df[m].drop_duplicates(['Text'], keep='last')
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```
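
To make the difference between the two masks concrete, here is a minimal self-contained sketch. It rebuilds the sample frame from the table in the question (without the `id` column that appears in the printed output), and it passes each keyword through `re.escape`, which is an addition on my part in case a keyword contains regex metacharacters. The helper name `keyword_masks` is hypothetical.

```python
import re

import pandas as pd

# Sample data rebuilt from the table in the question (assumption: no id column).
df = pd.DataFrame({
    "left": [12, 99, 1, 10, 99],
    "top": [34, 42, 47, 69, 72],
    "width": [12, 99, 9, 32, 79],
    "height": [34, 42, 4, 67, 88],
    "Text": ["commercial", "general", "liability", "commercial", "available"],
})


def keyword_masks(frame, phrase, column="Text"):
    """Return (substring_mask, exact_mask) for the space-separated keywords."""
    words = phrase.split()
    # Escape each keyword so characters such as '.' or '+' are matched literally.
    pattern = "|".join(re.escape(word) for word in words)
    substring_mask = frame[column].str.contains(pattern, regex=True)
    exact_mask = frame[column].isin(words)
    return substring_mask, exact_mask


substring_mask, exact_mask = keyword_masks(df, "liability commercial")

# Keep only the last occurrence of each matched Text value.
result = df[exact_mask].drop_duplicates(["Text"], keep="last")
print(result)
```

On this data both masks select the same rows. They would differ for a value such as `commercially`, which the substring mask matches and the exact mask does not.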

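Alternatively, if you need to group the matching rows, invert the mask. Here the order of the split values and possible duplicates are not taken into account: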

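```python
phrase = 'liability commercial'
# m is True for rows that do NOT match any keyword
m = ~df['Text'].str.contains(phrase.replace(' ', '|'))
#m = ~df['Text'].isin(phrase.split())

# cumsum() gives each non-matching row and the matching rows after it the same group id;
# keep the matching rows whose group id occurs more than once
df = df[m.cumsum().duplicated(keep=False) & ~m]
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```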
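
The grouping idea can also be written with explicit run labels, which may be easier to follow. This is a sketch under one assumption about the intent: only runs of at least two consecutive matching rows are wanted. The helper name `rows_in_matching_runs` and its `min_run` parameter are hypothetical.

```python
import pandas as pd


def rows_in_matching_runs(frame, phrase, column="Text", min_run=2):
    """Keep matching rows that belong to a run of at least min_run consecutive matches."""
    match = frame[column].isin(phrase.split())
    # Start a new run id whenever the match flag differs from the previous row.
    run_id = match.ne(match.shift()).cumsum()
    # Length of the run each row belongs to.
    run_len = match.groupby(run_id).transform("size")
    return frame[match & (run_len >= min_run)]


texts = pd.DataFrame(
    {"Text": ["commercial", "general", "liability", "commercial", "available"]}
)
# Keeps the rows at index 2 and 3 (liability, commercial).
print(rows_in_matching_runs(texts, "liability commercial"))
```

Unlike the cumulative-sum mask above, this version does not keep an isolated matching row that directly follows a non-matching row.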


If you need an exact match of the split values, in order, you can adapt this solution:

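```python
import numpy as np

phrase = 'liability commercial'

# adapted from https://stackoverflow.com/a/49005205/2901002
pat = np.asarray(phrase.split())
N = len(pat)

def rolling_window(a, window):
    # build a strided view whose rows are the overlapping windows of length `window`
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c

arr = df['Text'].values
# b[i] is True when the window starting at position i equals the keyword sequence
b = np.all(rolling_window(arr, N) == pat, axis=1)
# start positions of the matching windows
c = np.flatnonzero(b)

# expand each start position to all N positions of its window
d = [i for x in c for i in range(x, x + N)]
# np.isin replaces np.in1d, which is deprecated in recent NumPy releases
df = df[np.isin(np.arange(len(arr)), d)]
print(df)
```

```
   id  left  top  width  height        Text
2   3     1   47      9       4   liability
3   4    10   69     32      67  commercial
```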
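
For comparison, the ordered match can also be sketched with `Series.shift` instead of strided windows. This is an alternative technique, not the linked solution, and the helper name `rows_matching_sequence` is hypothetical.

```python
import pandas as pd


def rows_matching_sequence(frame, phrase, column="Text"):
    """Keep rows that are part of a consecutive, in-order occurrence of the keywords."""
    words = phrase.split()
    col = frame[column]
    # starts[i] is True when rows i, i+1, ... equal words[0], words[1], ...
    starts = pd.Series(True, index=frame.index)
    for k, word in enumerate(words):
        starts &= col.shift(-k).eq(word)
    # Mark every row covered by a window that begins at a start position.
    keep = pd.Series(False, index=frame.index)
    for k in range(len(words)):
        keep |= starts.shift(k, fill_value=False)
    return frame[keep]


texts = pd.DataFrame(
    {"Text": ["commercial", "general", "liability", "commercial", "available"]}
)
# Keeps the rows at index 2 and 3, where "liability" is directly followed by "commercial".
print(rows_matching_sequence(texts, "liability commercial"))
```

Because the keyword order matters here, the reversed phrase `commercial liability` selects nothing from this sample.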
