Python: How can I drop duplicate rows from a Polars DataFrame according to a custom function?

Posted by eyh26e7m on 2023-06-28 in Python

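I have a DataFrame and I am trying to remove all duplicates in a particular column while aggregating the non-duplicated values.
The .unique function only lets me choose one of {'first', 'last', 'any', 'none'}. What I want instead is to apply the mean function to all numeric values and the mode function to all categorical values.
I can do this with a groupby on the column I care about, as in the example below: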

import polars as pl

df = pl.DataFrame(
    {
        "id": [0, 0, 0, 1, 1],
        "color": ["red", "green", "green", "red", "red"],
        "shape": ["square", "triangle", "square", "triangle", "square"],
        "size": [2, 4, 6, 1, 3]
    }
)

df_list = []
for gkey, group in df.groupby("id"):
    g = group.select(pl.col("id"),
       pl.all().exclude("id", "size").mode().first(),
       pl.col("size").mean()
    ).unique()
    df_list.append(g)

df_dedup = pl.concat(df_list)

This gives me the desired output:

> print(df_dedup)
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

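The problem is that this implementation is (unsurprisingly) very slow, so I would like to know whether there is a better way to do this, or whether my code can be optimized somehow.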

Source: https://stackoverflow.com/questions/76564824/how-can-i-drop-duplicate-rows-from-a-polars-dataframe-according-to-a-custom-func

2 Answers

bfrts1fy1#

How about:

In [22]: df.groupby("id").agg(
    ...:     pl.col(["color", "shape"]).mode().sort(descending=True).first(),
    ...:     pl.col("size").mean(),
    ...: )
    ...:
Out[22]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

Or, using column selectors:

In [30]: import polars.selectors as cs

In [31]: df.groupby("id").agg(
    ...:     cs.string().mode().sort(descending=True).first(),
    ...:     cs.numeric().mean(),
    ...: )
Out[31]:
shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 0   ┆ green ┆ square   ┆ 4.0  │
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
└─────┴───────┴──────────┴──────┘

Answered 2023-06-28

pgccezyw2#

You can select columns by type, e.g. pl.col(pl.Utf8) for all str columns.
There is also the new polars.selectors helper module.

import polars.selectors as cs

df.groupby('id').agg(
   cs.string().mode().first(),
   cs.numeric().mean()
)

shape: (2, 4)
┌─────┬───────┬──────────┬──────┐
│ id  ┆ color ┆ shape    ┆ size │
│ --- ┆ ---   ┆ ---      ┆ ---  │
│ i64 ┆ str   ┆ str      ┆ f64  │
╞═════╪═══════╪══════════╪══════╡
│ 1   ┆ red   ┆ triangle ┆ 2.0  │
│ 0   ┆ green ┆ square   ┆ 4.0  │
└─────┴───────┴──────────┴──────┘

Answered 2023-06-28
