pandas: randomly concatenate DataFrames by row

Posted by 6mzjoqzu on 2023-08-01 in Other


How can I randomly merge, join, or concatenate pandas DataFrames by row? Suppose I have four DataFrames like these (with many more rows):

df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"]})

How can I concatenate these four DataFrames randomly to get output like this (they are joined in a random order, row by row):

col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  1_1  1_2  1_3  4_1  4_2  4_3  2_1  2_2  2_3  3_1  3_2  3_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

I was thinking I could do something like this:

import random
import pandas as pd

my_list = [df1, df2, df3, df4]
my_list = random.sample(my_list, len(my_list))  # shuffled once, before the loop
df = pd.DataFrame({'empty': []})

for row in df:
    new_df = pd.concat(my_list, axis=1)

print(new_df)

The for statement above does not work beyond the first row: every later row (I have many more) comes out the same, i.e. it only shuffles once:

col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3
1  4_1  4_2  4_3  1_1  1_2  1_3  2_1  2_2  2_3  3_1  3_2  3_3


Source: https://stackoverflow.com/questions/38506360/randomly-concat-data-frames-by-row

3 answers

qv7cva1a1#

Maybe something like this?

import random
import pandas as pd

dfs = [df1, df2, df3, df4]
n = sum(len(df.columns) for df in dfs)  # total number of columns
# shuffle every column of the concatenated frame
pd.concat(dfs, axis=1).iloc[:, random.sample(range(n), n)]

Out[130]: 
  col1 col3 col1 col2 col1 col1 col2 col2 col3 col3 col3 col2
0  4_1  4_3  1_1  4_2  2_1  3_1  1_2  3_2  1_3  3_3  2_3  2_2

Or, if only the DataFrames themselves should be shuffled (not the columns inside them), you can do:

dfs = [df1, df2, df3, df4]
random.shuffle(dfs)
pd.concat(dfs, axis=1)

Out[133]: 
  col1 col2 col3 col1 col2 col3 col1 col2 col3 col1 col2 col3
0  4_1  4_2  4_3  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3
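
Both snippets above draw a single random order that applies to every row. If a fresh order per row is wanted, as in the question, a plain-Python sketch could look like the following. The helper name `concat_rows_shuffled` is hypothetical, and the sketch assumes all frames have the same number of rows; it is meant as a readable baseline rather than a fast solution.

```python
import random
import pandas as pd

# Sample frames shaped like the ones in the question.
dfs = [
    pd.DataFrame({f"col{j}": [f"{k}_{j}"] * 2 for j in range(1, 4)})
    for k in range(1, 5)
]

def concat_rows_shuffled(frames):
    """Hypothetical helper: join the frames' rows in a new random order for each row."""
    rows = []
    for i in range(len(frames[0])):
        # Draw a fresh permutation of the frame positions for this row.
        order = random.sample(range(len(frames)), len(frames))
        rows.append([value for k in order for value in frames[k].iloc[i]])
    return pd.DataFrame(rows)

result = concat_rows_shuffled(dfs)
print(result)
```

The resulting columns are numbered 0 to 11, because the original column names repeat once per input frame.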


3hvapo4f2#

**UPDATE:** a better solution from @Divakar:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1':["1_1", "1_1"], 'col2':["1_2", "1_2"], 'col3':["1_3", "1_3"], 'col4':["1_4", "1_4"]})
df2 = pd.DataFrame({'col1':["2_1", "2_1"], 'col2':["2_2", "2_2"], 'col3':["2_3", "2_3"], 'col4':["2_4", "2_4"]})
df3 = pd.DataFrame({'col1':["3_1", "3_1"], 'col2':["3_2", "3_2"], 'col3':["3_3", "3_3"], 'col4':["3_4", "3_4"]})
df4 = pd.DataFrame({'col1':["4_1", "4_1"], 'col2':["4_2", "4_2"], 'col3':["4_3", "4_3"], 'col4':["4_4", "4_4"]})

dfs = [df1, df2, df3, df4]
n = len(dfs)
nrows = dfs[0].shape[0]
ncols = dfs[0].shape[1]
# 3D view: (row, source DataFrame, column within that DataFrame)
A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)
# one random permutation of 0..n-1 per row
sidx = np.random.rand(nrows,n).argsort(1)
# reorder the DataFrame blocks within each row, then flatten back to 2D
out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)
df = pd.DataFrame(out_arr)

Output:

In [203]: df
Out[203]:
    0    1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
0  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4
1  4_1  4_2  4_3  4_4  2_1  2_2  2_3  2_4  3_1  3_2  3_3  3_4  1_1  1_2  1_3  1_4

Explanation: (c) Divakar

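NumPy solution

Let's have a NumPy-based vectorized solution, and hopefully a fast one!
1) Reshape the array of concatenated values into a 3D array, "cutting" each row into groups of ncols values, where ncols is the number of columns in each input DataFrame.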

A = pd.concat(dfs, axis=1).values.reshape(nrows,-1,ncols)

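2) Next, we trick np.argsort into giving us random unique indices ranging from 0 to N-1, where N is the number of input DataFrames. Sorting a row of random floats yields a random permutation of its positions.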

sidx = np.random.rand(nrows,n).argsort(1)

3) The final trick is NumPy's fancy indexing, together with some broadcasting, to index into A with sidx and get the output array:

out_arr = A[np.arange(nrows)[:,None],sidx,:].reshape(nrows,-1)

4) If needed, convert to a DataFrame:

df = pd.DataFrame(out_arr)
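
The four steps can be collected into one function, sketched below under a few assumptions: every input frame has the same shape, the function name `shuffle_blocks_per_row` and its `seed` parameter are hypothetical, and `np.random.default_rng` is swapped in for `np.random.rand` so the result can be made reproducible. The function also returns `sidx`, so it is easy to check which frame ended up in which block of each row.

```python
import numpy as np
import pandas as pd

def shuffle_blocks_per_row(dfs, seed=None):
    """Hypothetical wrapper around steps 1-4; assumes equally shaped frames."""
    n = len(dfs)
    nrows, ncols = dfs[0].shape
    # Step 1: 3D array indexed as (row, source frame, column within frame).
    A = pd.concat(dfs, axis=1).to_numpy().reshape(nrows, n, ncols)
    # Step 2: argsort of random floats gives one permutation of 0..n-1 per row.
    rng = np.random.default_rng(seed)
    sidx = rng.random((nrows, n)).argsort(axis=1)
    # Step 3: (nrows, 1) row indices broadcast against (nrows, n) block indices.
    out_arr = A[np.arange(nrows)[:, None], sidx, :].reshape(nrows, -1)
    # Step 4: back to a DataFrame (columns are simply numbered).
    return pd.DataFrame(out_arr), sidx

# Sample frames shaped like the ones in the question.
dfs = [
    pd.DataFrame({f"col{j}": [f"{k}_{j}"] * 2 for j in range(1, 4)})
    for k in range(1, 5)
]
out, sidx = shuffle_blocks_per_row(dfs, seed=0)
print(sidx)  # which frame (0-based) fills each block of each row
print(out)
```

Because the whole reordering happens in one indexing operation, there is no Python-level loop over rows.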

Old answer:

IIUC you can do it this way:

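dfs = [df1, df2, df3, df4]
n = len(dfs)
ncols = dfs[0].shape[1]
v = pd.concat(dfs, axis=1).values
# row k of `a` holds the column positions of the k-th DataFrame inside `v`
a = np.arange(n * ncols).reshape(n, ncols)

# for every row, reorder the rows of `a` randomly and pick those columns of `v`
df = pd.DataFrame(np.asarray([v[i, a[random.sample(range(n), n)].reshape(n * ncols,)] for i in dfs[0].index]))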

Output:

In [150]: df
Out[150]:
    0    1    2    3    4    5    6    7    8    9    10   11
0  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3  2_1  2_2  2_3
1  2_1  2_2  2_3  1_1  1_2  1_3  3_1  3_2  3_3  4_1  4_2  4_3

Explanation:

In [151]: v
Out[151]:
array([['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3'],
       ['1_1', '1_2', '1_3', '2_1', '2_2', '2_3', '3_1', '3_2', '3_3', '4_1', '4_2', '4_3']], dtype=object)

In [152]: a
Out[152]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])
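
To make the role of `a` concrete, here is a small sketch with a fixed permutation standing in for `random.sample(range(n), n)`. The permutation `[2, 0, 3, 1]` and the names `perm`, `cols`, and `picked` are chosen for illustration only.

```python
import numpy as np

n, ncols = 4, 3
# One row of `v`: the values of df1..df4 laid out side by side.
v_row = np.array([f"{k}_{j}" for k in range(1, 5) for j in range(1, 4)], dtype=object)
a = np.arange(n * ncols).reshape(n, ncols)

perm = [2, 0, 3, 1]                  # fixed stand-in for a random permutation
cols = a[perm].reshape(n * ncols)    # column positions, one block per DataFrame
picked = v_row[cols].tolist()

print(cols.tolist())  # [6, 7, 8, 0, 1, 2, 9, 10, 11, 3, 4, 5]
print(picked)         # ['3_1', '3_2', '3_3', '1_1', '1_2', '1_3', '4_1', '4_2', '4_3', '2_1', '2_2', '2_3']
```

Reordering the rows of `a` moves whole three-column blocks, so the columns of each DataFrame stay together and in order.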


gzjq41n43#

I think this answer is easier, and it works for DataFrames of any dimensions:

df = pd.concat([df1, df2, df3, df4])
df = df.sample(frac=1)

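sample gives you a random sample of the DataFrame. If you ask for the whole DataFrame (frac=1), it returns all of its rows in random order.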
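
A minimal sketch of what `sample(frac=1)` does to the stacked frame. The rebuilt sample frames and the `random_state=0` seed are assumptions added for reproducibility. Note that `pd.concat` without `axis=1` stacks the frames vertically, so the result has 8 rows and 3 columns rather than the wide layout shown in the question.

```python
import pandas as pd

# Sample frames shaped like the ones in the question.
dfs = [
    pd.DataFrame({f"col{j}": [f"{k}_{j}"] * 2 for j in range(1, 4)})
    for k in range(1, 5)
]

stacked = pd.concat(dfs)                            # frames stacked on top of each other
shuffled = stacked.sample(frac=1, random_state=0)   # every row, in random order

print(shuffled.shape)          # (8, 3)
print(list(shuffled.columns))  # ['col1', 'col2', 'col3']
print(shuffled)
```

The column order is untouched; only the order of the rows changes.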
