python-3.x Merging tuples in a list based on their date items

xzlaal3s asked on 2023-10-21 in Python

I have a list of tuples where the first two items of each tuple are dates and the third is always a name. I want to check: 1) whether two or more tuples share the same name as their third item; and 2) whether a tuple's two dates fall within (or coincide with part of) another tuple's two dates. If both 1) and 2) hold, keep only the tuple with that name that has the longest time span.
A sample data list is below:

data = [(pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-01-21 00:00:00'), 'John'), 
        (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-01-21 00:00:00'), 'John'),
        (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'Jane'),
        (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'John'),
        (pd.Timestamp('2017-01-21 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'John'),
       (pd.Timestamp('2017-01-01 00:00:00'), pd.Timestamp('2017-02-10 00:00:00'), 'Jane'),]

It should output the list below, because the date ranges of these two tuples cover all the other tuples with the same name (i.e. 'John' or 'Jane'):

[(Timestamp('2017-01-02 00:00:00'), Timestamp('2017-02-04 00:00:00'), 'John'),
 (Timestamp('2017-01-01 00:00:00'), Timestamp('2017-02-10 00:00:00'), 'Jane')]

However, my code, shown below,

names = set([x[2] for x in data])
to_remove = []

for i in range(len(data)):
  for j in range(i+1, len(data)):
    if data[i][2] == data[j][2]:
      if data[i][0] >= data[j][0] and data[i][1] <= data[j][1]:
        to_remove.append(j)
      elif data[j][0] >= data[i][0] and data[j][1] <= data[i][1]:
        to_remove.append(i)

data = [x for i,x in enumerate(data) if i not in set(to_remove)]

outputs the wrong answer:

[(Timestamp('2017-01-02 00:00:00'), Timestamp('2017-01-21 00:00:00'), 'John'),
 (Timestamp('2017-01-02 00:00:00'), Timestamp('2017-02-04 00:00:00'), 'Jane'),
 (Timestamp('2017-01-21 00:00:00'), Timestamp('2017-02-04 00:00:00'), 'John')]

disho6za 1#

I think you have to swap i and j when appending the index to the removal list, as implemented below:

names = set([x[2] for x in data])
to_remove = []

for i in range(len(data)):
  for j in range(i+1, len(data)):
    if data[i][2] == data[j][2]:
      if data[i][0] >= data[j][0] and data[i][1] <= data[j][1]:
        to_remove.append(i)
      elif data[j][0] >= data[i][0] and data[j][1] <= data[i][1]:
        to_remove.append(j)

data = [x for i,x in enumerate(data) if i not in set(to_remove)]
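The corrected loop can be checked against the question's sample data. A minimal sketch, using `datetime.datetime` in place of `pd.Timestamp` (the comparisons behave identically) so it runs without pandas:

```python
from datetime import datetime

data = [
    (datetime(2017, 1, 2), datetime(2017, 1, 21), 'John'),
    (datetime(2017, 1, 2), datetime(2017, 1, 21), 'John'),
    (datetime(2017, 1, 2), datetime(2017, 2, 4), 'Jane'),
    (datetime(2017, 1, 2), datetime(2017, 2, 4), 'John'),
    (datetime(2017, 1, 21), datetime(2017, 2, 4), 'John'),
    (datetime(2017, 1, 1), datetime(2017, 2, 10), 'Jane'),
]

to_remove = []
for i in range(len(data)):
    for j in range(i + 1, len(data)):
        if data[i][2] == data[j][2]:
            # Mark whichever tuple's span is contained in the other's.
            if data[i][0] >= data[j][0] and data[i][1] <= data[j][1]:
                to_remove.append(i)
            elif data[j][0] >= data[i][0] and data[j][1] <= data[i][1]:
                to_remove.append(j)

data = [x for i, x in enumerate(data) if i not in set(to_remove)]
print(data)
# Keeps only the widest John span and the widest Jane span.
```

With this fix, equal duplicates are also handled: when two tuples have identical dates, the first condition is true and the earlier index is dropped.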

wwodge7n 2#


I would try a more brute-force approach: accumulate all the dates keyed by name, then find the min and max.
Assuming you start with:

import pandas as pd

data = [
    (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-01-21 00:00:00'), 'John'), 
    (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-01-21 00:00:00'), 'John'),
    (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'Jane'),
    (pd.Timestamp('2017-01-02 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'John'),
    (pd.Timestamp('2017-01-21 00:00:00'), pd.Timestamp('2017-02-04 00:00:00'), 'John'),
    (pd.Timestamp('2017-01-01 00:00:00'), pd.Timestamp('2017-02-10 00:00:00'), 'Jane'),
]

Then you can use:

## -------------------
## Accumulate based on name
## -------------------
results = {}
for *dates, name in data:
    results.setdefault(name, []).extend(dates)
## -------------------

## -------------------
## Reshape to get min and max
## -------------------
results = [(min(dates), max(dates), name) for name, dates in results.items()]
## -------------------

for result in results:
    print(result)

This should give you:

(Timestamp('2017-01-02 00:00:00'), Timestamp('2017-02-04 00:00:00'), 'John')
(Timestamp('2017-01-01 00:00:00'), Timestamp('2017-02-10 00:00:00'), 'Jane')
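One caveat with the min/max accumulation: if a name ever has two disjoint spans, they get bridged into a single range. A hedged sort-and-sweep variant that only merges spans which actually touch (`merge_spans` is a hypothetical helper, not part of the answer above; `datetime` stands in for `pd.Timestamp`):

```python
from datetime import datetime

def merge_spans(spans):
    """Merge overlapping (start, end) pairs; input need not be pre-sorted."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:  # overlaps the last merged span
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

data = [
    (datetime(2017, 1, 2), datetime(2017, 1, 21), 'John'),
    (datetime(2017, 1, 21), datetime(2017, 2, 4), 'John'),
    (datetime(2017, 3, 1), datetime(2017, 3, 5), 'John'),  # disjoint span
]

by_name = {}
for start, end, name in data:
    by_name.setdefault(name, []).append((start, end))

results = [(s, e, name) for name, spans in by_name.items()
           for s, e in merge_spans(spans)]
print(results)
# The first two John spans merge; the March span stays separate.
```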

ax6ht2ek 3#


You can slightly tweak @jakevdp's answer:

# pip install scipy
import pandas as pd
from scipy.sparse.csgraph import connected_components

df = pd.DataFrame(data, columns=["start", "end", "name"])

def reductionFunction(df):
    start = df["start"].to_numpy()
    end = df["end"].to_numpy()

    # Pairwise overlap matrix: graph[i, j] is True when spans i and j intersect.
    graph = (start <= end[:, None]) & (end >= start[:, None])
    n_components, indices = connected_components(graph)

    # Collapse each connected group of overlapping spans to its full extent.
    return df.groupby(indices).agg({"start": "min", "end": "max"})

out = (list(df.groupby("name", sort=False).apply(reductionFunction)
        .reset_index()[df.columns].itertuples(index=False, name=None)))

Output:

print(out)

[(Timestamp('2017-01-02 00:00:00'), Timestamp('2017-02-04 00:00:00'), 'John'),
 (Timestamp('2017-01-01 00:00:00'), Timestamp('2017-02-10 00:00:00'), 'Jane')]
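The same transitive-overlap grouping can also be sketched without scipy, using a small union-find over same-name overlapping pairs. A dependency-free sketch under that assumption; `merge_by_overlap`, `find`, and `union` are hypothetical helpers, not from the answer above:

```python
from datetime import datetime

def merge_by_overlap(data):
    """Group same-name tuples whose spans overlap (directly or transitively)
    and collapse each group to its min start / max end."""
    parent = list(range(len(data)))

    def find(i):  # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            si, ei, ni = data[i]
            sj, ej, nj = data[j]
            if ni == nj and si <= ej and sj <= ei:  # same name, spans overlap
                union(i, j)

    groups = {}
    for i, (s, e, name) in enumerate(data):
        root = find(i)
        if root in groups:
            gs, ge, _ = groups[root]
            groups[root] = (min(gs, s), max(ge, e), name)
        else:
            groups[root] = (s, e, name)
    return list(groups.values())

data = [
    (datetime(2017, 1, 2), datetime(2017, 1, 21), 'John'),
    (datetime(2017, 1, 2), datetime(2017, 1, 21), 'John'),
    (datetime(2017, 1, 2), datetime(2017, 2, 4), 'Jane'),
    (datetime(2017, 1, 2), datetime(2017, 2, 4), 'John'),
    (datetime(2017, 1, 21), datetime(2017, 2, 4), 'John'),
    (datetime(2017, 1, 1), datetime(2017, 2, 10), 'Jane'),
]
print(merge_by_overlap(data))
```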
