Fill a shaped (sized) pandas DataFrame with values randomly by a stat count value: the reverse of .count()

dba5bblo posted on 2023-06-20 in Other


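I need a DataFrame with r rows and a dynamic number of columns (one per group). The input count column specifies how many True values each group needs in the new DataFrame. My current implementation creates a temporary DataFrame with a single row holding a True value for each group in df, then calls explode() on that temporary DataFrame. Finally, it groups by count and aggregates into the resulting df.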

Input

| group | count | ... 
|   A   |   2   |     
|   B   |   0   |     
|   C   |   4   |     
|   D   |   1   |

I need to fill a new DataFrame with these values at random positions (the columns are dynamic and are named after the groups).

Expected output

| A | B | C | D |
|------|------|------|------|
| NaN | NaN | **True** | **True** |
| **True** | NaN | **True** | NaN |
| NaN | NaN | NaN | NaN |
| NaN | NaN | **True** | NaN |
| **True** | NaN | **True** | NaN |

I think it might be possible to add a random set of values of length 1 to **r**, then explode and so on, and use that value for agg(sum).

My code

import pandas as pd

inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0}, 
    {"group": "C", "count": 4}, 
    {"group": "D", "count": 1}, 
    ]
df = pd.DataFrame(inputs)

def expand(count:int, group: str) -> pd.DataFrame:
    """expands DF by counts"""
    count = int(round(count))
    df1 = pd.DataFrame([{group: True}])
    # I'm thinking here i need to add random seed
    df1 = df1.assign(count = [list(range(1, count+1))])\
             .explode('count')\
             .reset_index(drop=True)
    return df1

def creator(df: pd.DataFrame) -> pd.DataFrame:
    """create new DF for every group value(count)"""
    dfs = [expand(r, df['group'].values[0]) for r in list(df['count'].values)]
    df = pd.concat(dfs, ignore_index=True)
    return df
    
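# Parentheses replace the backslash continuations, which cannot be
# combined with a comment line in the middle of the chain.
(
    df.groupby('group', as_index=False)
    .apply(creator)
    .drop('count', axis=1)
    # and groupby my seed
    .groupby(level=1)
    .agg(sum)
)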

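I'll try to state my questions, in case that helps:
1. Is there an easier or better way to do this in pandas?
2. How can I generate random counts inside the expand() function and assign them?
3. Is there a way to create a DataFrame of the required size filled with NaN and then place my values in it at random (with pd.where or something similar)?
PS: This is my first question, so I hope I have provided enough information!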

pandas

Source: https://stackoverflow.com/questions/76495014/fill-shapedsized-pandas-dataframe-with-values-randomly-by-stat-count-value-re

2 Answers


gjmwrych1#

A pure pandas solution is to use sample:

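import numpy as np
R = 5  # number of rows
# Shuffle each column separately; sample(frac=1) on the whole frame would move entire rows together.
out = pd.DataFrame(
    {g: pd.Series([True]*c + [np.nan]*(R-c)).sample(frac=1).to_numpy() for g, c in df.to_numpy()}
)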

Output (one possible result; it varies from run to run):

print(out)

      A   B     C     D
0  True NaN  True   NaN
1   NaN NaN  True   NaN
2  True NaN   NaN   NaN
3   NaN NaN  True  True
4   NaN NaN  True   NaN
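
As a sanity check on the title's claim that this is the reverse of .count(), the sketch below rebuilds a frame of the same shape (c True values and R - c NaN values per column, each column shuffled on its own, which is an assumption of this sketch) and confirms that DataFrame.count() gives back the input counts. The name recovered is illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
R = 5  # number of rows in the result

# One column per group: c True values followed by R - c NaN values,
# then shuffled independently of the other columns.
out = pd.DataFrame({
    g: pd.Series([True] * c + [np.nan] * (R - c)).sample(frac=1).to_numpy()
    for g, c in zip(df["group"], df["count"])
})

# count() tallies the non-NaN cells of each column, so it should
# reproduce the "count" column of the input.
recovered = out.count()
print(recovered)
```

If any count were larger than R, the expression R - c would go negative and the column would silently end up longer than R, so this sketch assumes every count is at most R.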

Old answer:

A simple approach is to bootstrap an empty (all-NaN) DataFrame and then pick [index, column] coordinates at random:

np.random.seed(0)

R = 5 # <-- rows

# dtype=object lets True sit next to NaN without a dtype upcast on assignment
out = pd.DataFrame(
    np.nan, index=range(R), columns=list(df["group"]), dtype=object
)

for g, c in df.to_numpy():
    out.loc[np.random.choice(out.index, c, replace=False), g] = True
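
The loop above can also be replaced by a single array operation. The following is a minimal sketch of one such alternative, not taken from the answer: it draws a random key for every cell, uses argsort to turn each column into a random permutation of 0..R-1, and marks the cells whose permutation value is below that column's count. The names rng, perm and mask are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
R = 5
rng = np.random.default_rng(0)  # seeded generator, for reproducibility

# argsort of independent random keys gives a random permutation of
# 0..R-1 in every column.
perm = rng.random((R, len(df))).argsort(axis=0)

# In each column exactly `count` cells hold a value below `count`;
# the comparison broadcasts the counts across the rows.
mask = perm < df["count"].to_numpy()

# Keep True where the mask is set and NaN everywhere else.
out = (
    pd.DataFrame(mask, columns=list(df["group"]))
    .astype(object)
    .where(mask)
)
print(out)
```

This avoids the Python-level loop, which may matter when there are many groups; for a handful of columns the loop version is arguably easier to read.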

2023-06-20

5f0d552i2#

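Steps:
1. Define the DataFrame from a list of dictionaries.
2. Set r, the number of rows in the final DataFrame.
3. Create a dictionary group_indexes in which each key is a group name and each value is an array of unique, randomly chosen row indexes. The number of indexes is the minimum of count and r.
4. Create an empty DataFrame df_empty with r rows and with columns defined by the unique group names.
5. Iterate over each column in df_empty. If the column name is in group_indexes, set the values at those row indexes to True.
6. Create df_filled by replacing every value in df_empty that is not True with NaN.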

import pandas as pd
import numpy as np

# Step 1: initial DataFrame
inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0}, 
    {"group": "C", "count": 4}, 
    {"group": "D", "count": 1}, 
]
df = pd.DataFrame(inputs)

# Step 2: Define the number of rows
r = 5

# Step 3: Create a dictionary with unique random indexes for each group
group_indexes = {group: np.random.choice(np.arange(r), min(count, r), replace=False) for group, count in zip(df['group'], df['count']) if count > 0}

# Step 4: Create an empty DataFrame with 'r' rows and columns defined by groups
df_empty = pd.DataFrame(index=np.arange(r), columns=df['group'])

# Step 5: Place 'True' in the DataFrame according to the indexes in group_indexes
for col in df_empty:
    if col in group_indexes:
        df_empty.loc[group_indexes[col], col] = True

# Step 6: Replace all other values in the DataFrame with NaN
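# where() needs a boolean condition, so compare with True explicitly;
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df_filled = df_empty.where(df_empty.eq(True), np.nan)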

Here is another approach that I think is simpler:

import pandas as pd
import numpy as np

# initial DataFrame
inputs = [
    {"group": "A", "count": 2},
    {"group": "B", "count": 0}, 
    {"group": "C", "count": 4}, 
    {"group": "D", "count": 1}, 
]
df = pd.DataFrame(inputs)

r = 5  # specify number of rows here

# Create a DataFrame filled with NaN
df_filled = pd.DataFrame(index=np.arange(r), columns=df['group'])

# Randomly assign True values in each column based on its count
for index, row in df.iterrows():
    if row['count'] > 0:
        random_indexes = np.random.choice(df_filled.index, size=min(r, row['count']), replace=False)
        df_filled.loc[random_indexes, row['group']] = True

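This first initializes a DataFrame filled with NaN. Then, for each group, it randomly selects unique row indexes according to the group's count and sets those positions to True.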
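
For reuse, the same idea can be wrapped in a function. The sketch below is one way to do that under a few assumptions that are not in the answer: the helper name fill_by_counts and its seed parameter are hypothetical, it uses NumPy's Generator API so that results can be reproduced, and it raises an error when a count does not fit into r rows instead of clipping it with min().

```python
import numpy as np
import pandas as pd


def fill_by_counts(df, r, seed=None):
    """Hypothetical helper: r-row frame with df['count'] True cells per group."""
    if ((df["count"] < 0) | (df["count"] > r)).any():
        raise ValueError("every count must be between 0 and r")
    rng = np.random.default_rng(seed)
    # Object dtype so that True and NaN can share a column.
    out = pd.DataFrame(np.nan, index=range(r), columns=list(df["group"]), dtype=object)
    for group, count in zip(df["group"], df["count"]):
        if count > 0:
            # Unique row positions, drawn without replacement.
            rows = rng.choice(r, size=int(count), replace=False)
            out.loc[rows, group] = True
    return out


df = pd.DataFrame({"group": ["A", "B", "C", "D"], "count": [2, 0, 4, 1]})
a = fill_by_counts(df, r=5, seed=42)
b = fill_by_counts(df, r=5, seed=42)  # same seed, same placement
print(a)
```

Raising on an oversized count is a design choice: clipping keeps the code running, but the result would then no longer satisfy out.count() == count for that group.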

2023-06-20
