Most efficient way to collect lists of values in a column (Pandas / NumPy)

x8goxv8g  posted on 2023-10-19  in Other

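The logic below goes through every row of the DataFrame test and collects all id values that share the row's id_group and have a level lower than the row's level.
When I apply this code to my real-world dataset it is very slow. Is it possible to do something similar with NumPy?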

import pandas as pd

test = pd.DataFrame({'id_group': [10]*10 + [11]*5,
                     'level': list(range(10)) + list(range(5)),
                     'id': [i + 20 for i in range(10)] + [i + 30 for i in range(5)]})

def unique_id(row):
    # Ids in the same id_group whose level is strictly below this row's level.
    threshold = row['level']
    main_id = row['id_group']
    unique_values = test[(test['level'] < threshold) & (test.id_group == main_id)]['id'].unique()
    return unique_values.tolist()

test['list of ids'] = test.apply(unique_id, axis=1)

7eumitmz  #1

Assuming the levels are unique within a group, you can rewrite the code to work on sorted values:

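def to_list(g):
    # Sort the group's ids by level, keeping the original row index.
    s = g.sort_values(by='level')['id']
    # The lowest level has no predecessors; every later row gets the
    # previous row's list plus the previous row's id.
    out = [[]]
    for x in s.iloc[:-1]:
        out.append(out[-1]+[x])
    return pd.Series(out, index=s.index)

# droplevel(0) removes the id_group level so the result aligns with test's index.
test['list of ids'] = test.groupby('id_group').apply(to_list).droplevel(0)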

Output (with the input modified to better demonstrate the code):

    id_group  level  id                           list of ids
0         10      0  20                                    []
1         10      1  21                                  [20]
2         10      2  22                              [20, 21]
3         10      3  23                          [20, 21, 22]
4         10      4  24                      [20, 21, 22, 23]
5         10      5  25                  [20, 21, 22, 23, 24]
6         10      8  26      [20, 21, 22, 23, 24, 25, 29, 27]
7         10      7  27          [20, 21, 22, 23, 24, 25, 29]
8         10      9  28  [20, 21, 22, 23, 24, 25, 29, 27, 26]
9         10      6  29              [20, 21, 22, 23, 24, 25]
10        11      0  30                                    []
11        11      1  32                                  [30]
12        11      2  31                              [30, 32]
13        11      3  34                          [30, 32, 31]
14        11      4  33                      [30, 32, 31, 34]
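
The approach above relies on the assumption that levels are unique within a group. If a level can repeat, a row must not receive the ids of rows that share its level. The following is a minimal sketch of one way to handle that case, not part of the original answer: it uses NumPy's `searchsorted` on the sorted levels to count how many rows lie strictly below each row. The helper name `ids_below_level` and the `demo` frame are hypothetical, and `pd.concat` over the groups is used here in place of `groupby.apply(...).droplevel(0)`.

```python
import pandas as pd

def ids_below_level(g):
    """For each row of one group, list the ids whose level is strictly lower."""
    s = g.sort_values(by='level', kind='stable')
    levels = s['level'].to_numpy()
    ids = s['id'].tolist()
    # levels is sorted, so searchsorted(..., side='left') gives, for each row,
    # the number of rows in the group with a strictly lower level.
    counts = levels.searchsorted(levels, side='left')
    # dict.fromkeys drops repeated ids while keeping their order.
    out = [list(dict.fromkeys(ids[:k])) for k in counts]
    return pd.Series(out, index=s.index)

# Hypothetical data in which levels repeat inside a group.
demo = pd.DataFrame({'id_group': [1, 1, 1, 1, 2, 2],
                     'level':    [0, 1, 1, 2, 0, 0],
                     'id':       [20, 21, 22, 23, 30, 31]})

# Build one Series per group; assignment then aligns on the original index.
parts = [ids_below_level(g) for _, g in demo.groupby('id_group')]
demo['list of ids'] = pd.concat(parts)

print(demo['list of ids'].tolist())
# [[], [20], [20], [20, 21, 22], [], []]
```

Each list is still built by slicing, so the total work grows with the combined size of the output lists. The gain over the row-wise version is that the frame is no longer filtered once per row.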

If you also want to make sure the ids are unique, using a set would be more efficient as well:

def to_sets(g):
    # Same idea as to_list, but each row gets a set, so repeated ids collapse.
    s = g.sort_values(by='level')['id']
    out = [set()]
    for x in s.iloc[:-1]:
        out.append(out[-1]|{x})
    return pd.Series(out, index=s.index)

test['sets of ids'] = test.groupby('id_group').apply(to_sets).droplevel(0)

Output:

    id_group  level  id                           sets of ids
0         10      0  20                                    {}
1         10      1  21                                  {20}
2         10      2  22                              {20, 21}
3         10      3  23                          {20, 21, 22}
4         10      4  24                      {20, 21, 22, 23}
5         10      5  25                  {20, 21, 22, 23, 24}
6         10      8  26      {20, 21, 22, 23, 24, 25, 27, 29}
7         10      7  27          {20, 21, 22, 23, 24, 25, 29}
8         10      9  28  {20, 21, 22, 23, 24, 25, 26, 27, 29}
9         10      6  29              {20, 21, 22, 23, 24, 25}
10        11      0  30                                    {}
11        11      1  32                                  {30}
12        11      2  31                              {32, 30}
13        11      3  34                          {32, 30, 31}
14        11      4  33                      {32, 34, 30, 31}
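
To check that the sorted approach returns the same ids as the row-wise function from the question, the two can be compared on the modified input shown in the output tables. This is a minimal sketch, not part of the original answer: the comparison is done on sets because the row-wise version returns ids in frame order while the sorted version returns them in level order, and `pd.concat` over the groups is used here in place of `groupby.apply(...).droplevel(0)`. The names `slow`, `fast` and `same` are made up for illustration.

```python
import pandas as pd

# Input as it appears in the output tables (levels and ids shuffled).
test = pd.DataFrame({
    'id_group': [10]*10 + [11]*5,
    'level': [0, 1, 2, 3, 4, 5, 8, 7, 9, 6, 0, 1, 2, 3, 4],
    'id': [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 32, 31, 34, 33],
})

def unique_id(row):
    # Row-wise approach from the question: filters the whole frame per row.
    mask = (test['level'] < row['level']) & (test['id_group'] == row['id_group'])
    return test.loc[mask, 'id'].unique().tolist()

def to_sets(g):
    # Sorted approach from the answer: each row extends the previous row's set.
    s = g.sort_values(by='level')['id']
    out = [set()]
    for x in s.iloc[:-1]:
        out.append(out[-1] | {x})
    return pd.Series(out, index=s.index)

slow = test.apply(unique_id, axis=1)
fast = pd.concat([to_sets(g) for _, g in test.groupby('id_group')])

# Compare row by row, ignoring the order of ids inside each collection.
same = all(set(slow[i]) == fast[i] for i in test.index)
print(same)  # True
```

The check only holds under the answer's assumption that levels are unique within each group.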
