Pandas: drop rows based on previous rows

0lvr5msh  posted on 2024-01-04 in Other

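Suppose I have a pandas DataFrame with the columns first_date, label and last_date, where first_date < last_date in every row. How can I drop a row (call it row z) if its first_date falls between the first_date and last_date of ANY row above z that has the same label (not only the row of the same label directly above it)? The DataFrame looks like this: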

first_date label last_date
2023-09-11 A     2023-09-17
2023-09-11 B     2023-09-15
2023-09-20 A     2023-09-28
2023-09-17 B     2023-09-30
2023-09-30 A     2023-10-05
2023-10-03 A     2023-10-07
2023-10-05 B     2023-10-15
2023-10-19 A     2023-10-20
2023-10-10 B     2023-10-15

The correct output should be:

first_date label last_date
2023-09-11 A     2023-09-17
2023-09-11 B     2023-09-15
2023-09-20 A     2023-09-28
2023-09-17 B     2023-09-30
2023-09-30 A     2023-10-05
2023-10-05 B     2023-10-15
2023-10-19 A     2023-10-20


where the dropped rows are:

2023-10-03 A     2023-10-07
2023-10-10 B     2023-10-15


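because 2023-10-03 for label A falls between 2023-09-30 and 2023-10-05, and 2023-10-10 for label B falls between 2023-10-05 and 2023-10-15.
The DataFrame is huge, more than 10,000 rows, so loops should be avoided as much as possible. Thanks in advance!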
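
To make the rule concrete, here is a minimal reference sketch rather than a tuned solution. It assumes that "between" is inclusive and that every earlier row of the same label counts, even one that is itself dropped; the helper name `covered_by_earlier_row` is made up for illustration. Rows are compared pairwise within each label using NumPy broadcasting, so memory grows with the square of the largest group.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "first_date": ["2023-09-11", "2023-09-11", "2023-09-20", "2023-09-17", "2023-09-30",
                   "2023-10-03", "2023-10-05", "2023-10-19", "2023-10-10"],
    "label": ["A", "B", "A", "B", "A", "A", "B", "A", "B"],
    "last_date": ["2023-09-17", "2023-09-15", "2023-09-28", "2023-09-30", "2023-10-05",
                  "2023-10-07", "2023-10-15", "2023-10-20", "2023-10-15"],
})
df[["first_date", "last_date"]] = df[["first_date", "last_date"]].apply(pd.to_datetime)


def covered_by_earlier_row(group: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: True where a row's first_date lies inside the
    [first_date, last_date] range of any earlier row of the same group."""
    first = group["first_date"].to_numpy()
    last = group["last_date"].to_numpy()
    # inside[i, j]: row i starts within the range of row j
    inside = (first[:, None] >= first[None, :]) & (first[:, None] <= last[None, :])
    # keep only the pairs where row j comes before row i in the original order
    earlier = np.tril(np.ones((len(group), len(group)), dtype=bool), k=-1)
    return pd.Series((inside & earlier).any(axis=1), index=group.index)


# groupby keeps the original row order inside each label
to_drop = pd.concat([covered_by_earlier_row(g) for _, g in df.groupby("label")])
result = df[~to_drop.reindex(df.index)]
print(result)
```

On the sample data this drops the rows with index 5 and 8, which matches the expected output in the question.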

34gzjxbg #1

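I recreated your DataFrame and tried to reproduce your output. I think you may have skipped sorting the dataset by label before filtering on the condition. The condition itself looks simple; just don't forget the bitwise NOT at the end, otherwise you get exactly the inverse of the output you want.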

import pandas as pd

data = {
    'first_date': ['2023-09-11', '2023-09-11', '2023-09-20', '2023-09-17', '2023-09-30', '2023-10-03', '2023-10-05', '2023-10-19', '2023-10-10'],
    'label': ['A', 'B', 'A', 'B', 'A', 'A', 'B', 'A', 'B'],
    'last_date': ['2023-09-17', '2023-09-15', '2023-09-28', '2023-09-30', '2023-10-05', '2023-10-07', '2023-10-15', '2023-10-20', '2023-10-15']
}

df = pd.DataFrame(data)
df['first_date'] = pd.to_datetime(df['first_date'])
df['last_date'] = pd.to_datetime(df['last_date'])
print("Original DataFrame")
print(df)

# I guess you were missing this part.
# A stable sort keeps the original row order within each label, so the
# positional mask built below lines up with the sorted rows.
df = df.sort_values(by=['label'], kind='stable')

# Check whether first_date lies between the first_date and last_date
# of the previous row of the same label (shift() looks one row back only)

mask = []
for label, group in df.groupby('label'):
    date_filter = ~((group['first_date'] > group['first_date'].shift().fillna(pd.to_datetime('1900-01-01'))) & 
                  (group['first_date'] < group['last_date'].shift().fillna(pd.to_datetime('1900-01-01'))))
    mask.extend(date_filter.tolist())

filtered_df = df[mask]
print("\nFiltered DataFrame: ")
print(filtered_df)

Output:

Original DataFrame
  first_date label  last_date
0 2023-09-11     A 2023-09-17
1 2023-09-11     B 2023-09-15
2 2023-09-20     A 2023-09-28
3 2023-09-17     B 2023-09-30
4 2023-09-30     A 2023-10-05
5 2023-10-03     A 2023-10-07
6 2023-10-05     B 2023-10-15
7 2023-10-19     A 2023-10-20
8 2023-10-10     B 2023-10-15

Filtered DataFrame: 
  first_date label  last_date
0 2023-09-11     A 2023-09-17
2 2023-09-20     A 2023-09-28
4 2023-09-30     A 2023-10-05
7 2023-10-19     A 2023-10-20
1 2023-09-11     B 2023-09-15
3 2023-09-17     B 2023-09-30
6 2023-10-05     B 2023-10-15
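
The per-group loop above can also be written with `groupby(...).shift()`, which keeps the original row order and needs no sort. This is a sketch of the same previous-row check, under the assumption that only the row directly above within the same label matters. It would miss a row that is covered by an older row further up, which the question explicitly asks about.

```python
import pandas as pd

df = pd.DataFrame({
    "first_date": ["2023-09-11", "2023-09-11", "2023-09-20", "2023-09-17", "2023-09-30",
                   "2023-10-03", "2023-10-05", "2023-10-19", "2023-10-10"],
    "label": ["A", "B", "A", "B", "A", "A", "B", "A", "B"],
    "last_date": ["2023-09-17", "2023-09-15", "2023-09-28", "2023-09-30", "2023-10-05",
                  "2023-10-07", "2023-10-15", "2023-10-20", "2023-10-15"],
})
df[["first_date", "last_date"]] = df[["first_date", "last_date"]].apply(pd.to_datetime)

grouped = df.groupby("label")
prev_first = grouped["first_date"].shift()  # NaT for the first row of each label
prev_last = grouped["last_date"].shift()

# Comparisons against NaT evaluate to False, so the first row per label is kept
inside_prev = (df["first_date"] > prev_first) & (df["first_date"] < prev_last)
filtered_df = df[~inside_prev]
print(filtered_df)
```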

a1o7rhls #2

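Unfortunately, there is no efficient way to do this in pure pandas.
You can, however, use janitor's efficient conditional_join to perform a left merge, identify the rows that found a match (those with a non-null right-hand index), and drop them: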

# pip install pyjanitor
import pandas as pd
import janitor  # importing janitor registers DataFrame.conditional_join

df[['first_date', 'last_date']] = df[['first_date', 'last_date']].apply(pd.to_datetime)

drop = (df.reset_index()
          .conditional_join(df.reset_index(), 
                            ('first_date', 'first_date', '>='),
                            ('first_date', 'last_date', '<='),
                            ('label', 'label', '=='),
                            ('index', 'index', '>'),
                            how='left')
         .dropna(subset=[('right', 'index')])['left']['index'].unique()
       )

out = df.drop(drop)

  • Note: this assumes a range index. If you don't have one, assign a range to a column named index instead of using reset_index.

Output:

  first_date label  last_date
0 2023-09-11     A 2023-09-17
1 2023-09-11     B 2023-09-15
2 2023-09-20     A 2023-09-28
3 2023-09-17     B 2023-09-30
4 2023-09-30     A 2023-10-05
6 2023-10-05     B 2023-10-15
7 2023-10-19     A 2023-10-20


Intermediate drop: array([5, 8])
Intermediate merge result:

left                             right                            
  index first_date label  last_date index first_date label  last_date
0     0 2023-09-11     A 2023-09-17   NaN        NaT   NaN        NaT
1     1 2023-09-11     B 2023-09-15   NaN        NaT   NaN        NaT
2     2 2023-09-20     A 2023-09-28   NaN        NaT   NaN        NaT
3     3 2023-09-17     B 2023-09-30   NaN        NaT   NaN        NaT
4     4 2023-09-30     A 2023-10-05   NaN        NaT   NaN        NaT
5     5 2023-10-03     A 2023-10-07   4.0 2023-09-30     A 2023-10-05
6     6 2023-10-05     B 2023-10-15   NaN        NaT   NaN        NaT
7     7 2023-10-19     A 2023-10-20   NaN        NaT   NaN        NaT
8     8 2023-10-10     B 2023-10-15   6.0 2023-10-05     B 2023-10-15
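
For moderately sized groups, the same non-equi join can be approximated in plain pandas by merging the frame with itself on `label` and filtering afterwards. This is only a sketch: the intermediate result holds every same-label pair, so it can become very large, which is the cost `conditional_join` is meant to avoid. The column name `pos` is an assumption introduced here to record the original row position, so the approach does not depend on a range index.

```python
import pandas as pd

df = pd.DataFrame({
    "first_date": ["2023-09-11", "2023-09-11", "2023-09-20", "2023-09-17", "2023-09-30",
                   "2023-10-03", "2023-10-05", "2023-10-19", "2023-10-10"],
    "label": ["A", "B", "A", "B", "A", "A", "B", "A", "B"],
    "last_date": ["2023-09-17", "2023-09-15", "2023-09-28", "2023-09-30", "2023-10-05",
                  "2023-10-07", "2023-10-15", "2023-10-20", "2023-10-15"],
})
df[["first_date", "last_date"]] = df[["first_date", "last_date"]].apply(pd.to_datetime)

# Record each row's position explicitly instead of relying on the index
rows = df.assign(pos=range(len(df)))

# Every pair of rows that share a label; "_above" marks the candidate earlier row
pairs = rows.merge(rows, on="label", suffixes=("", "_above"))
covered = pairs[
    (pairs["pos_above"] < pairs["pos"])
    & (pairs["first_date"] >= pairs["first_date_above"])
    & (pairs["first_date"] <= pairs["last_date_above"])
]

out = df[~rows["pos"].isin(covered["pos"]).to_numpy()]
print(out)
```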

egmofgnx #3

I would sort the DataFrame and then go over it with apply and a function that remembers the last row it kept:

import pandas as pd
import numpy as np

data = [
  ['2023-09-11', 'A', '2023-09-17'],
  ['2023-09-11', 'B', '2023-09-15'],
  ['2023-09-20', 'A', '2023-09-28'],
  ['2023-09-17', 'B', '2023-09-30'],
  ['2023-09-30', 'A', '2023-10-05'],
  ['2023-10-03', 'A', '2023-10-07'],
  ['2023-10-05', 'B', '2023-10-15'],
  ['2023-10-19', 'A', '2023-10-20'],
  ['2023-10-10', 'B', '2023-10-15'],
  ['2023-10-12', 'B', '2023-10-18'], # Added by me
  ['2023-10-16', 'B', '2023-10-20'], # Added by me
]

df = pd.DataFrame(
  data,
  columns = ['first_date', 'label', 'last_date'],
)

df['first_date'] = pd.to_datetime(df['first_date'])
df['last_date' ] = pd.to_datetime(df['last_date' ])

print(df.dtypes)
print()

df.sort_values(['label', 'first_date', 'last_date'], inplace = True)
print(df)
print()
class Filter:
  def __init__(self):
    self._label = None
    self._first = None
    self._last  = None
  
  def _persist(self, row):
    self._label = row['label']
    self._first = row['first_date']
    self._last  = row['last_date']
  
  def validate(self, row):
    if self._label != row['label']:
      self._persist(row)
      return True
    
    if self._first <= row['first_date'] <= self._last:
      return False
    
    self._persist(row)
    return True
    

row_filter = Filter()  # avoid shadowing the built-in filter()

df['flag'] = df.apply(row_filter.validate, axis=1)

print(df)
print()

Output (last DataFrame printed):

   first_date label  last_date   flag
0  2023-09-11     A 2023-09-17   True
2  2023-09-20     A 2023-09-28   True
4  2023-09-30     A 2023-10-05   True
5  2023-10-03     A 2023-10-07  False
7  2023-10-19     A 2023-10-20   True
1  2023-09-11     B 2023-09-15   True
3  2023-09-17     B 2023-09-30   True
6  2023-10-05     B 2023-10-15   True
8  2023-10-10     B 2023-10-15  False
9  2023-10-12     B 2023-10-18  False
10 2023-10-16     B 2023-10-20   True
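
Note that this filter compares each row only with the last row it kept, so a dropped row never shields later rows. Under a literal reading of the question, where any earlier row counts even if it was itself dropped, the added row at index 10 would also be dropped, since 2023-10-16 lies inside 2023-10-12 to 2023-10-18. The sketch below illustrates the difference on the label B rows only; both function names are made up for illustration, and the loops are there for readability rather than speed.

```python
import pandas as pd

# Label B rows from the answer above, in first_date order: (index, first_date, last_date)
raw_rows = [
    (1, "2023-09-11", "2023-09-15"),
    (3, "2023-09-17", "2023-09-30"),
    (6, "2023-10-05", "2023-10-15"),
    (8, "2023-10-10", "2023-10-15"),
    (9, "2023-10-12", "2023-10-18"),
    (10, "2023-10-16", "2023-10-20"),
]
rows_b = [(idx, pd.Timestamp(start), pd.Timestamp(stop)) for idx, start, stop in raw_rows]


def keep_vs_last_kept(rows):
    """Hypothetical helper: compare each row with the last kept row only."""
    kept, last = [], None
    for idx, start, stop in rows:
        if last is not None and last[0] <= start <= last[1]:
            continue  # starts inside the last kept range: drop it
        kept.append(idx)
        last = (start, stop)
    return kept


def keep_vs_any_earlier(rows):
    """Hypothetical helper: compare each row with every earlier row, dropped or not."""
    kept = []
    for pos, (idx, start, _) in enumerate(rows):
        if not any(s <= start <= e for _, s, e in rows[:pos]):
            kept.append(idx)
    return kept


print(keep_vs_last_kept(rows_b))    # [1, 3, 6, 10]
print(keep_vs_any_earlier(rows_b))  # [1, 3, 6]
```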

dtcbnfnu #4

import pandas as pd

# self-merge on label (a widely used SQL trick)
new_df = df.merge(df, on="label", how="left", suffixes=("", "_new"))
# filter out pairs with identical first dates (not a necessary step)
new_df = new_df[new_df["first_date"] != new_df["first_date_new"]]
# flag the pairs where the condition is true
new_df["flag"] = (new_df["first_date_new"] < new_df["first_date"]) & (new_df["first_date"] < new_df["last_date_new"])

to_drop = new_df[new_df["flag"]].copy()

to_drop["shift_first_date_new"] = to_drop["first_date"].shift(+1)
to_drop["flag"] = to_drop["flag"] & (to_drop["first_date_new"] != to_drop["shift_first_date_new"])

to_drop = to_drop[to_drop["flag"]][["first_date", "label", "last_date"]]
# rows that appear in both df and to_drop are duplicated, so keep=False removes them
out = pd.concat([df, to_drop]).drop_duplicates(keep=False).sort_values("label")
print(out)

Output:

  first_date label  last_date
0 2023-09-11     A 2023-09-17
1 2023-09-11     B 2023-09-15
2 2023-09-20     A 2023-09-28
3 2023-09-17     B 2023-09-30
4 2023-09-30     A 2023-10-05
6 2023-10-05     B 2023-10-15
7 2023-10-19     A 2023-10-20

Output, apparently for a DataFrame with extra label B rows appended (index 10 is present):

    first_date label   last_date
0   2023-09-11     A  2023-09-17
2   2023-09-20     A  2023-09-28
4   2023-09-30     A  2023-10-05
7   2023-10-19     A  2023-10-20
1   2023-09-11     B  2023-09-15
3   2023-09-17     B  2023-09-30
6   2023-10-05     B  2023-10-15
10  2023-10-16     B  2023-10-20
