pandas Python: group and count consecutive repeated values in a DataFrame column

j13ufse2  posted on 2022-11-27  in Python

I would like to carry out a data-analysis task on a DataFrame in Python. Here is my DataFrame:

import pandas as pd

df = pd.DataFrame({
    "Person": ["P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1", "P1",
               "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2", "P2"],
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A",
                 "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6",
             "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"],
})

I would like to

  • find, per person, the number of groups in which activity "A" is repeated consecutively more than 2 times, and
  • compute the average time of these consecutive "A" runs as each group's end time minus its start time, summed over the groups and divided by the number of groups.

That is, the target result DataFrame should look as follows (P1's AVGTime is computed as (1-0 + 6-1)/2):

solution = pd.DataFrame({"Person": ["P1", "P2"],
                    "Activity": ["A", "A"],
                    "Count": [2, 1], 
                    "AVGTime": [3, 0]})

I know there is an approximate solution here: https://datascience.stackexchange.com/questions/41428/how-to-find-the-count-of-consecutive-same-string-values-in-a-pandas-dataframe
However, that solution does not aggregate over a column such as "Person" in my case. It also does not seem to perform well, given that my DataFrame has about 7 million rows.
I would really appreciate any hints!


pw9qyyiw1#

You can process the data as a stream, without creating a DataFrame that has to fit into memory. I'd suggest trying the convtools library (I must admit, I'm its author).
Since you already have a DataFrame, let's use it as the input:

import pandas as pd

from convtools import conversion as c
from convtools.contrib.tables import Table

# fmt: off
df = pd.DataFrame({
    "Person": ["P1", "P1","P1","P1","P1","P1","P1","P1","P1","P1", "P2", "P2","P2","P2","P2","P2","P2","P2","P2","P2"], 
    "Activity": ["A", "A", "A", "B", "A", "A", "A", "A", "A", "A", "A", "A", "A", "B", "A", "A", "B", "A", "B", "A"],
    "Time": ["0", "0", "1", "1", "1", "3", "5", "5", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6", "6"]
})
# fmt: on

# transform the DataFrame into an iterable of dicts, so we don't allocate all
# rows at once as df.to_dict("records") would
iter_rows = Table.from_rows(
    df.itertuples(index=False), header=list(df.columns)
).into_iter_rows(dict)

result = (
    # chunk by consecutive "person"+"activity" pairs
    c.chunk_by(c.item("Person"), c.item("Activity"))
    .aggregate(
        # each chunk gets transformed into a dict like this:
        {
            "Person": c.ReduceFuncs.First(c.item("Person")),
            "Activity": c.ReduceFuncs.First(c.item("Activity")),
            "length": c.ReduceFuncs.Count(),
            "time": (
                c.ReduceFuncs.Last(c.item("Time")).as_type(float)
                - c.ReduceFuncs.First(c.item("Time")).as_type(float)
            ),
        }
    )
    # remove short groups
    .filter(c.item("length") > 2)
    .pipe(
        # now group by "person"+"activity" pair to calculate avg time
        c.group_by(c.item("Person"), c.item("Activity")).aggregate(
            {
                "Person": c.item("Person"),
                "Activity": c.item("Activity"),
                "avg_time": c.ReduceFuncs.Average(c.item("time")),
                "number_of_groups": c.ReduceFuncs.Count(),
            }
        )
    )
    # should you want to reuse this conversion multiple times, run
    # .gen_converter() to get a function and store it for further reuse
    .execute(iter_rows)
)

Result:

In [37]: result
Out[37]:
[{'Person': 'P1', 'Activity': 'A', 'avg_time': 3.0, 'number_of_groups': 2},
 {'Person': 'P2', 'Activity': 'A', 'avg_time': 0.0, 'number_of_groups': 1}]
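
If you want the output in the same shape as the desired solution DataFrame from the question, the resulting list of dicts can be converted back with pandas. A minimal sketch, assuming pd and result from the snippet above; the rename mapping simply matches the asker's column names:

# assumes pd and result from the snippet above; the renaming only matches the
# column names of the expected solution DataFrame
solution = (
    pd.DataFrame(result)
    .rename(columns={"number_of_groups": "Count", "avg_time": "AVGTime"})
    .loc[:, ["Person", "Activity", "Count", "AVGTime"]]
)
print(solution)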

kdfy810k2#

Try this:

def group_func(x):
    groups = []
    # split the person's rows into runs of consecutive identical activities
    for _, g in x.groupby((x["Activity"] != x["Activity"].shift()).cumsum()):
        # keep only runs of "A" that are longer than 2 rows
        if len(g) > 2 and g["Activity"].iat[0] == "A":
            groups.append(g)

    # average (end time - start time) over the kept runs
    avgs = sum(g["Time"].max() - g["Time"].min() for g in groups) / len(groups)

    return pd.Series(
        ["A", len(groups), avgs], index=["Activity", "Count", "AVGTime"]
    )

df["Time"] = df["Time"].astype(int)
x = df.groupby("Person", as_index=False).apply(group_func)
print(x)

Prints:

  Person Activity  Count  AVGTime
0     P1        A      2      3.0
1     P2        A      1      0.0
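
Since the real DataFrame has about 7 million rows, the Python-level loop inside groupby().apply() may become the bottleneck. Below is a hedged sketch of the same idea using only vectorized pandas operations: label consecutive runs via shift + cumsum, then aggregate twice. The column names Run, Length, Start, End and Duration are illustrative only, and it assumes Time has already been converted to int as above.

# assumes df with "Time" already converted to int, as in the answer above
# a new run starts whenever the activity or the person changes
run_id = (
    df["Activity"].ne(df["Activity"].shift())
    | df["Person"].ne(df["Person"].shift())
).cumsum()

# one row per consecutive run of the same activity for the same person
runs = (
    df.groupby(["Person", "Activity", run_id.rename("Run")])
    .agg(Length=("Time", "size"), Start=("Time", "min"), End=("Time", "max"))
    .reset_index()
)

# keep only "A" runs longer than 2 rows, then aggregate per person
long_a_runs = runs[(runs["Activity"] == "A") & (runs["Length"] > 2)]
solution = (
    long_a_runs.assign(Duration=long_a_runs["End"] - long_a_runs["Start"])
    .groupby(["Person", "Activity"], as_index=False)
    .agg(Count=("Duration", "size"), AVGTime=("Duration", "mean"))
)
print(solution)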
