csv - Polars: converting a list type to a comma-separated string

I have a df that I want to group and write out as CSV, but the list type of one of the columns prevents the df from being written to CSV.

import polars as pl

df = pl.DataFrame({"Column A": ["Variable 1", "Variable 2", "Variable 2", "Variable 3", "Variable 3", "Variable 4"],
                    "Column B": ["AB", "AB", "CD", "AB", "CD", "CD"]})

I want to group it as follows:

df.groupby(by="Column A").agg(pl.col("Column B").unique())

Output:

shape: (4, 2)
┌────────────┬──────────────┐
│ Column A   ┆ Column B     │
│ ---        ┆ ---          │
│ str        ┆ list[str]    │
╞════════════╪══════════════╡
│ Variable 3 ┆ ["AB", "CD"] │
│ Variable 1 ┆ ["AB"]       │
│ Variable 4 ┆ ["CD"]       │
│ Variable 2 ┆ ["CD", "AB"] │
└────────────┴──────────────┘

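When I try to write the above DataFrame to CSV, I get the error: "ComputeError: CSV format does not support nested data. Consider using a different data format. Got: 'list[str]'"
If I try to cast the list type to pl.Utf8, that raises an error as well: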

(df
    .groupby(by="Column A").agg(pl.col("Column B").unique())
    .with_columns(pl.col("Column B").cast(pl.Utf8))
)

Output: "ComputeError: Cannot cast list type"
If I try to explode the list inside the groupby context:

df.groupby(by="Column A").agg(pl.col("Column B").unique().explode())

The output is not what I want:

shape: (4, 2)
┌────────────┬─────────────────────┐
│ Column A   ┆ Column B            │
│ ---        ┆ ---                 │
│ str        ┆ list[str]           │
╞════════════╪═════════════════════╡
│ Variable 1 ┆ ["A", "B"]          │
│ Variable 3 ┆ ["A", "B", ... "D"] │
│ Variable 2 ┆ ["A", "B", ... "B"] │
│ Variable 4 ┆ ["A", "B", ... "D"] │
└────────────┴─────────────────────┘

What would be the most convenient way to groupby and then write to CSV?
The desired output, to be written in CSV format:

shape: (4, 2)
┌────────────┬──────────────┐
│ Column A   ┆ Column B     │
│ ---        ┆ ---          │
│ str        ┆ list[str]    │
╞════════════╪══════════════╡
│ Variable 3 ┆ ["AB", "CD"] │
│ Variable 1 ┆ ["AB"]       │
│ Variable 4 ┆ ["CD"]       │
│ Variable 2 ┆ ["CD", "AB"] │
└────────────┴──────────────┘

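There was a recent discussion about why this is the case.
You can "stringify" the list with ._s.get_fmt() (the leading underscore marks it as an internal API, so it may change between versions):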

print(
    df
    .groupby(by="Column A").agg(pl.col("Column B").unique())
    .with_columns(
        pl.col("Column B").map(
            lambda row: [row._s.get_fmt(n, 0) for n in range(row.len())]
        ).flatten()
    )
    .write_csv(),
    end=""
)

Column A,Column B
Variable 3,"[""AB"", ""CD""]"
Variable 1,"[""AB""]"
Variable 4,"[""CD""]"
Variable 2,"[""AB"", ""CD""]"

Another approach is to use str(), as suggested by @FObersteiner.
The main problem with "stringifying" the list is that when you read the CSV data back, you no longer have a list[] type.

import io

pl.read_csv(io.StringIO(
   'Column A,Column B\nVariable 4,"[""CD""]"\n'
   'Variable 1,"[""AB""]"\nVariable 2,"[""AB"", ""CD""]"\n'
   'Variable 3,"[""CD"", ""AB""]"\n'
))

shape: (4, 2)
┌────────────┬──────────────┐
│ Column A   ┆ Column B     │
│ ---        ┆ ---          │
│ str        ┆ str          │
╞════════════╪══════════════╡
│ Variable 4 ┆ ["CD"]       │
│ Variable 1 ┆ ["AB"]       │
│ Variable 2 ┆ ["AB", "CD"] │
│ Variable 3 ┆ ["CD", "AB"] │
└────────────┴──────────────┘
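To illustrate what recovering the lists would involve, here is a minimal standard-library sketch that parses the stringified cells back into Python lists. It assumes each cell happens to be valid JSON, which holds for this sample but is not guaranteed in general (the display format can, for example, abbreviate long lists). The variable names are only illustrative.

```python
import csv
import io
import json

csv_text = (
    'Column A,Column B\n'
    'Variable 4,"[""CD""]"\n'
    'Variable 1,"[""AB""]"\n'
    'Variable 2,"[""AB"", ""CD""]"\n'
    'Variable 3,"[""CD"", ""AB""]"\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))
# At this point every cell is still plain text, e.g. '["AB", "CD"]'.

# Assumption: the text is valid JSON, so json.loads can rebuild the list.
parsed = {row["Column A"]: json.loads(row["Column B"]) for row in rows}
print(parsed["Variable 2"])
```

Every consumer of the file would have to repeat a parsing step like this one.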

This is why alternative formats are suggested.
