Handle empty columns in Polars when concatenating DataFrames

Posted by h6my8fg2 on 2023-03-11 in Other

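I would like to concatenate DataFrames in Polars where the DataFrames share the same columns, but some of the DataFrames have no data for a subset of those columns.
More precisely, I am looking for the Polars equivalent of this pandas minimal working example: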

from io import StringIO
import polars as pl
import pandas as pd 

TESTDATA1 = StringIO("""
    col1,col2,col3
    1,1,"a"
    2,1,"b"
"""
)

TESTDATA2 = StringIO("""
    col1,col2,col3
    1,,"a"
    2,,"b"
"""
)

df = pd.concat(
    [
        pd.read_csv(TESTDATA1),
        pd.read_csv(TESTDATA2),
    ],
)
print(df)

This prints:

   col1  col2 col3
0     1   1.0    a
1     2   1.0    b
0     1   NaN    a
1     2   NaN    b

I tried the following Polars implementation, but it does not work:

TESTDATA1 = StringIO("""
    col1,col2,col3
    1,1,"a"
    2,1,"b"
""")

TESTDATA2 = StringIO("""
    col1,col2,col3
    1,,"a"
    2,,"b"
""")

df = pl.concat(
    [
        pl.read_csv(TESTDATA1),
        pl.read_csv(TESTDATA2),
    ],
    how="diagonal"
)

I get the error message:

SchemaError: cannot vstack: because column datatypes (dtypes) in the two DataFrames do not match for left.name='col2' with left.dtype=i64 != right.dtype=str with right.name='col2'

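It seems that the empty column is treated as str in Polars and cannot be combined with the other DataFrame, where the column has type i64.
I know that the following solves my problem: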

df = pl.concat(
    [
        pl.read_csv(TESTDATA1),
        pl.read_csv(TESTDATA2).with_columns(pl.col("col2").cast(pl.Int64)),
    ],
    how="diagonal"
)

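But in practice I have about 20 columns that might be null, and I do not want to cast all of them.
What works in both pandas and Polars is the case where the empty column is removed from the DataFrame, i.e.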

TESTDATA1 = StringIO("""
    col1,col2,col3
    1,1,"a"
    2,1,"b"
""")

TESTDATA2 = StringIO("""
    col1,col3
    1,"a"
    2,"b"
""")

pl.concat(
    [
        pl.read_csv(TESTDATA1),
        pl.read_csv(TESTDATA2),
    ],
    how="diagonal"
)

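In pandas I could also drop the empty columns by calling .dropna(how="all", axis=1), but I do not know the equivalent in Polars.
So, to summarize:

How can I concatenate DataFrames in Polars if some of them contain columns without data (null)?
How can I achieve the equivalent of .dropna(how="all", axis=1) in Polars?

Thanks!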

Source: https://stackoverflow.com/questions/75693988/handle-empty-columns-in-polars-when-concatenating-dataframes

2 Answers

nkcskrwz1#

There may be a more direct way, but you can loop over each .schema and build your own "supertype" schema.
You can then use it to generate the casts.

import polars as pl
import tempfile

file1 = tempfile.NamedTemporaryFile()
file2 = tempfile.NamedTemporaryFile()

csv1 = b"""
col1,col2,col3
1,1,"a"
2,1,"b"
"""

csv2 = b"""
col1,col2,col3
1,,"a"
2,,"b"
"""

file1.write(csv1)
file2.write(csv2)

file1.seek(0)
file2.seek(0)

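# Merged schema: for each column, prefer a non-string dtype over pl.Utf8
schema = {}
sources = file1, file2

frames = [ pl.scan_csv(source.name) for source in sources ]
for frame in frames:
    for name, dtype in frame.schema.items():
        # An all-empty column is read as Utf8; do not let it overwrite
        # a non-string dtype already recorded for this column
        if dtype == pl.Utf8 and schema.get(name, pl.Utf8) != pl.Utf8:
            pass
        else:
            schema[name] = dtype

# Cast every frame to the merged schema, then concatenate
df = pl.concat(
    frame.with_columns(
        pl.col(name).cast(dtype) for name, dtype in schema.items())
    for frame in frames
)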

>>> df
<polars.LazyFrame object at 0x12D67D570>
>>> df.collect()
shape: (4, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ i64  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ 1    ┆ a    │
│ 2    ┆ 1    ┆ b    │
│ 1    ┆ null ┆ a    │
│ 2    ┆ null ┆ b    │
└──────┴──────┴──────┘

To drop the all-null columns, you can select the columns that contain at least one non-null value:

>>> df = pl.read_csv(csv2)
>>> df
shape: (2, 3)
┌──────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 │
│ ---  ┆ ---  ┆ ---  │
│ i64  ┆ str  ┆ str  │
╞══════╪══════╪══════╡
│ 1    ┆ null ┆ a    │
│ 2    ┆ null ┆ b    │
└──────┴──────┴──────┘
>>> df.select(col for col in df if col.is_not_null().any())
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ a    │
│ 2    ┆ b    │
└──────┴──────┘

.drop expects column names (strings), so you can do:

>>> df.drop([col.name for col in df if col.is_null().all()])
shape: (2, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ ---  ┆ ---  │
│ i64  ┆ str  │
╞══════╪══════╡
│ 1    ┆ a    │
│ 2    ┆ b    │
└──────┴──────┘

2023-03-11

afdcj2ne2#

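Sorry for the confusion. You are right that the how parameter of pl.concat() only accepts strategies such as "vertical", "diagonal" and "horizontal". Polars has no drop_rows method; to exclude rows in which every value is null, filter on a row-wise condition instead.
Here is an updated example:

from io import StringIO
import polars as pl

TESTDATA1 = StringIO("""
    col1,col2,col3
    1,1,"a"
    2,1,"b"
""")

TESTDATA2 = StringIO("""
    col1,col2,col3
    1,,"a"
    2,,"b"
""")

df = pl.concat(
    [
        pl.read_csv(TESTDATA1),
        # col2 is entirely empty and is read as str; cast it so the dtypes match
        pl.read_csv(TESTDATA2).with_columns(pl.col("col2").cast(pl.Int64)),
    ],
    how="diagonal"
)

# exclude rows in which all values are null
# (assumes a Polars release that provides pl.all_horizontal)
df = df.filter(~pl.all_horizontal(pl.all().is_null()))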

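As for the axis parameter: you are right that drop_nulls() has no such parameter in Polars (nor a how parameter), but you can pass a subset of columns to check for nulls via the subset parameter.
Here is an updated example:

# drop rows where "col2" is null
df = df.drop_nulls(subset=["col2"])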

I apologize for any confusion caused by my earlier reply. If you have any further questions, please let me know.

2023-03-11
