Pandas groupby sample on a large dataset

ff29svar posted 9 months ago in Other


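I am trying to randomly sample a *relatively* large dataset (about 90 million data points).
I want to sample the dataset by column "a" (which has roughly 100k unique values), using a different n for each value of a.
I know that something like this exists: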

df.groupby("a").sample(n=1, random_state=1)

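But this does not account for a different n per group.
My next idea was to loop over the values of a, filter df for each one, and then sample the filtered frame ('f' is the unique value of 'a' being sampled, and 'm' is its sample size):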

filter_df = df.loc[(df['a'] == f)]

filter_df = filter_df.sample(n=m, random_state=6)

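To add another layer of potential complexity: if a group of "a" contains at least as many rows as the requested sample size, I want to sample with replace=False; otherwise I want replace=True, so that I pick as many unique rows as possible.
So, inside the for loop over enumerate (for example, over a zip of 4 separate lists, each 100k long, one of which holds the unique values of column "a"), let's call the relevant variable "kk", which is an integer: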

if kk >= m:
    filter_df = filter_df.sample(n=m, random_state=6)
    print("sampling okay")
    test["flag"] = "ok"
else:
    filter_df = filter_df.sample(n=m, random_state=6, replace=True)

Finally, I combine the sampled pieces with concat.
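The loop described above can be sketched end to end as follows. This is only an illustration under assumptions: the small `df`, the `counts` mapping, and the reading of `kk` as the number of rows available for the current value of "a" are all made up here and are not taken from the original code.

```python
import pandas as pd

# Hypothetical toy data standing in for the 90-million-row frame.
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 3], "x": list("abcdefg")})

# Hypothetical mapping: value of "a" -> requested sample size m.
counts = {1: 1, 2: 3, 3: 2}

pieces = []
for f, m in counts.items():
    # Each comparison scans the whole frame, so the total work grows with
    # (number of rows) x (number of unique values of "a").
    filter_df = df.loc[df["a"] == f]
    kk = len(filter_df)  # assumption: kk is the number of rows available for f
    # Sample with replacement only when the group has fewer than m rows.
    pieces.append(filter_df.sample(n=m, random_state=6, replace=kk < m))

result = pd.concat(pieces, ignore_index=True)
print(result)
```

Because every iteration builds a boolean mask over the full frame, this pattern is likely to be the main reason the loop is slow with about 100k unique values.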
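That is the basic idea. The code works, but its performance is suboptimal. Is there a vectorized solution I could use instead?
For brevity, the number of samples "m" is the count column of the first df: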
| a | count |
| -- | -- |
| 1 | 1 |
| 2 | 3 |
| 3 | 2 |
I am trying to sample from the original DF:
| a | x |
| -- | -- |
| 1 | a |
| 1 | b |
| 1 | c |
| 2 | d |
| 2 | e |
| 3 | f |
| 3 | g |
Desired sampled output (the sampled x values will of course be "random"):
| a | x |
| -- | -- |
| 1 | a |
| 2 | d |
| 2 | e |
| 2 | e |
| 3 | f |
| 3 | g |

pandas

Source: https://stackoverflow.com/questions/77657265/pandas-groupby-sample-on-a-large-dataset

2 answers


nkhmeac6 #1

Create a dictionary from the first dataframe and define a custom function to apply to each group:

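def get_sample(df, dct, random_state):
    # Every row in a group shares the same "a", so read it from the first row.
    a = df["a"].iat[0]
    n = dct.get(a)

    # Skip groups that have no requested sample size.
    if n is None:
        return

    # Sample with replacement only when the group has fewer than n rows.
    return df.sample(n=n, random_state=random_state, replace=len(df) < n)

# Map each value of "a" to its requested sample size.
dct = df1.set_index("a")["count"].to_dict()

out = df2.groupby("a", group_keys=False).apply(get_sample, dct=dct, random_state=6)
print(out)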

This prints the sampled dataframe. The input dataframes are:

# df1
   a  count
0  1      1
1  2      3
2  3      2

# df2
   a  x
0  1  a
1  1  b
2  1  c
3  2  d
4  2  e
5  3  f
6  3  g


aij0ehis #2

(df.groupby('a', as_index = False)
  .apply(lambda x: x.sample(n := df1.loc[df1['a'] == x['a'].iloc[0],'count'].iloc[0],
                             replace = n > x['a'].size, random_state=6))
  .reset_index(drop = True))

Output:

   a  x
0  1  a
1  2  d
2  2  e
3  2  e
4  3  g
5  3  f

Data:

import pandas as pd

df = pd.DataFrame({'a': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}, 'x': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e', 5: 'f', 6: 'g'}})
df1 = pd.DataFrame({'a': {0: 1, 1: 2, 2: 3}, 'count': {0: 1, 1: 3, 2: 2}})
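Both answers above still call a Python function once per group, which may remain slow with roughly 100k groups. As a rough sketch of a vectorized alternative (not taken from either answer), the rows can be shuffled once and the per-group selection done with array operations. The function name `sample_per_group` and the parameters `counts`, `key`, `count_col`, and `seed` are hypothetical, the sketch assumes `counts` has one row per value of the key column, and its random draws will not reproduce the `random_state=6` results shown above.

```python
import numpy as np
import pandas as pd


def sample_per_group(df, counts, key="a", count_col="count", seed=6):
    """Sample counts[count_col] rows per value of df[key] without a per-group loop."""
    rng = np.random.default_rng(seed)
    wanted = counts.set_index(key)[count_col]  # assumes unique keys in counts

    # Shuffle all rows once, then sort stably by key so that every group
    # becomes a contiguous block whose rows are in random order.
    shuffled = df.iloc[rng.permutation(len(df))].sort_values(key, kind="stable")
    shuffled = shuffled.reset_index(drop=True)

    keys = shuffled[key]
    size = keys.map(keys.value_counts())        # rows available in the row's group
    n = keys.map(wanted).fillna(0).astype(int)  # rows requested (0 if key is unknown)
    rank = shuffled.groupby(key).cumcount()     # 0, 1, 2, ... inside each block

    # Groups with enough rows: the first n rows of a shuffled block are a
    # sample without replacement.
    without = shuffled[(size >= n) & (rank < n)]

    # Groups that are too small: draw n positions inside the block with replacement.
    heads = ((size < n) & (rank == 0)).to_numpy()  # first row of each small group
    starts = np.flatnonzero(heads)
    g_size = size.to_numpy()[heads]
    g_n = n.to_numpy()[heads]
    rep_start = np.repeat(starts, g_n)
    rep_size = np.repeat(g_size, g_n)
    offsets = (rng.random(len(rep_start)) * rep_size).astype(np.int64)
    with_repl = shuffled.iloc[rep_start + offsets]

    out = pd.concat([without, with_repl])
    return out.sort_values(key, kind="stable").reset_index(drop=True)


# Usage with the example frames from the question.
df = pd.DataFrame({"a": [1, 1, 1, 2, 2, 3, 3], "x": list("abcdefg")})
df1 = pd.DataFrame({"a": [1, 2, 3], "count": [1, 3, 2]})
out = sample_per_group(df, df1)
print(out)
```

Like the loop in the question, groups that are too small are sampled entirely with replacement, so a row may be repeated before every unique row has been used. Whether this is actually faster on the full dataset would need to be measured.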
