Cannot add a column (pandas "Series") to a Dask "DataFrame" without introducing "NaN" values

lokaqttq, posted on 2023-02-14 in Other

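I am constructing a Dask DataFrame from a NumPy array, and then I want to add a column from a pandas Series.
Unfortunately, the resulting DataFrame contains NaN values, and I cannot figure out where the error is.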

from dask.dataframe.core import DataFrame as DaskDataFrame
import dask.dataframe as dd
import pandas as pd
import numpy as np

xy = np.random.rand(int(3e6), 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], int(3e6)), dtype='category')

# Alternative 1: many values of x and y are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=1)
print(table.compute())

# Alternative 2: many values of c are NaN
table: DaskDataFrame = dd.from_array(xy, columns=['x', 'y'])
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())

Any help is appreciated.

Answer #1, by e4eetjau

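This happens because the elements of c and xy do not line up once the data is partitioned: the partitions hold mismatched numbers of elements. Try creating the Dask DataFrame with dd.from_pandas instead of dd.from_array.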

import numpy as np
import pandas as pd
import dask.dataframe as dd

n = int(3e6)
xy = np.random.rand(n, 2)
c = pd.Series(np.random.choice(['a', 'b', 'c'], n), dtype='category')

# Build the Dask DataFrame from a pandas DataFrame; the partition count (10) is an arbitrary choice.
table = dd.from_pandas(pd.DataFrame(xy, columns=['x', 'y']), npartitions=10)
table['c'] = dd.from_pandas(c, npartitions=table.npartitions)
print(table.compute())

This prints:

                x         y  c
0        0.488121  0.568258  b
1        0.090625  0.459087  b
2        0.563856  0.193026  a
3        0.333338  0.220935  c
4        0.769926  0.195786  a
...           ...       ... ..
2999995  0.241800  0.114924  b
2999996  0.462755  0.567131  c
2999997  0.473718  0.481577  b
2999998  0.424875  0.937403  c
2999999  0.189081  0.793600  c
