How to use numpy broadcasting on a pandas DataFrame with list columns of different lengths

cyej8jka posted on 2022-11-24 in Other

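I am trying to use numpy's broadcasting on a large dataset. My list columns can contain hundreds of elements per row, across many rows. I need to filter rows based on whether a column's value is present in a list column: if the number in col_a is present in col_b, I need to keep that row.
Sample data: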

import pandas as pd
import numpy as np
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [[1],[2],[5],[1],[2]],
                   'col_b': [[2],[2,4],[2,5,7],[4],[3,2]],
                   })
dt

    id  col_a   col_b
0   a   [1]     [2]
1   a   [2]     [2, 4]
2   a   [5]     [2, 5, 7]
3   b   [1]     [4]
4   b   [2]     [3, 2]

I tried the code below to add a dimension to col_b and check whether the value from col_a is present in it:

(dt['col_a'] == dt['col_b'][:,None]).any(axis = 1)

But I get the following error:

ValueError: ('Shapes must match', (5,), (5, 1))

Can anyone tell me the correct approach?

inkz8wg9 #1

import pandas as pd
import numpy as np
from itertools import product

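Expand each row into one row per combination of its list elements (the Cartesian product of the row's fields):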

dt2 = pd.DataFrame([j for i in dt.values for j in product(*i)], columns=dt.columns)

Filter to the rows where col_a equals col_b:

dt2 = dt2[dt2['col_a'] == dt2['col_b']]

Result:

  id  col_a  col_b
1  a      2      2
4  a      5      5
8  b      2      2
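One detail of the expansion step is worth noting: `product(*i)` iterates over every field of the row, including the `id` string. That works here because each id is a single character, but an id such as `'ab'` would be split into `'a'` and `'b'`. A minimal sketch that takes the product over the two list columns only is shown below; the names `rows` and `matches` are illustrative and not part of the original answer.

```python
import pandas as pd
from itertools import product

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [[1], [2], [5], [1], [2]],
                   'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]]})

# One output row per (col_a element, col_b element) pair within each original row.
# The id is carried along unchanged instead of being iterated character by character.
rows = [
    (row_id, x, y)
    for row_id, a_list, b_list in dt.itertuples(index=False, name=None)
    for x, y in product(a_list, b_list)
]
dt2 = pd.DataFrame(rows, columns=dt.columns)

# Keep only the pairs where the col_a element equals the col_b element.
matches = dt2[dt2['col_a'] == dt2['col_b']]
print(matches)
```

Note that the expanded frame has one row per pair, so its index no longer lines up with the original `dt`.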

eqqqjvef #2

I think you've been told that numpy "vectorization" is the key to speeding up your code, but you don't have a good grasp of what it is. It isn't something magical that you can apply to any pandas task. It's just shorthand for making full use of numpy array methods, which means actually learning numpy.
But let's explore your task:

In [205]: dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
     ...:                    'col_a': [[1],[2],[5],[1],[2]],
     ...:                    'col_b': [[2],[2,4],[2,5,7],[4],[3,2]],
     ...:                    })
In [206]: dt
Out[206]: 
  id col_a      col_b
0  a   [1]        [2]
1  a   [2]     [2, 4]
2  a   [5]  [2, 5, 7]
3  b   [1]        [4]
4  b   [2]     [3, 2]

In [207]: dt.dtypes
Out[207]: 
id       object
col_a    object
col_b    object
dtype: object

Because the columns contain lists, their dtype is object; they hold references to lists.
Applying operators like == to the columns (pandas Series) is not the same as applying them to the arrays of their values.
But to focus on the numpy aspect, let's get numpy arrays:

In [208]: a = dt['col_a'].to_numpy()
In [209]: b = dt['col_b'].to_numpy()
In [210]: a
Out[210]: 
array([list([1]), list([2]), list([5]), list([1]), list([2])],
      dtype=object)
In [211]: b
Out[211]: 
array([list([2]), list([2, 4]), list([2, 5, 7]), list([4]), list([3, 2])],
      dtype=object)

The fast numpy operations use compiled code and, for the most part, only work with numeric dtypes. Arrays like this, containing references to lists, are basically the same as lists. Math and other operations like equality run at list-comprehension speeds. That may be faster than pandas speeds, but nowhere near the highly vaunted "vectorized" numpy speeds.
So let's do a list comprehension on the elements of these lists. This is a lot like pandas apply, though I think it's faster (pandas apply is notoriously slow).

In [212]: [i in j for i,j in zip(a,b)]
Out[212]: [False, False, False, False, False]

Oops, no matches. That must be because i from a is a list. Let's extract the number:

In [213]: [i[0] in j for i,j in zip(a,b)]
Out[213]: [False, True, True, False, True]

Making col_a contain lists instead of numbers does not help you.
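As a minimal sketch of how the mask from the comprehension above could be used to filter the original frame, assuming every list in col_a holds exactly one element: unwrap col_a to plain numbers first, then test membership row by row. The names `col_a_scalar`, `mask` and `filtered` are illustrative only.

```python
import pandas as pd

dt = pd.DataFrame({'id': ['a', 'a', 'a', 'b', 'b'],
                   'col_a': [[1], [2], [5], [1], [2]],
                   'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]]})

# Unwrap the single-element lists so col_a holds plain numbers
# (assumption: exactly one element per list).
col_a_scalar = [a[0] for a in dt['col_a']]

# Row-wise membership test. This is a plain Python loop,
# not a vectorized numpy operation.
mask = [x in y for x, y in zip(col_a_scalar, dt['col_b'])]

# A list of booleans with one entry per row selects the matching rows.
filtered = dt[mask]
print(filtered)
```

This keeps the original index, so the surviving rows can still be traced back to `dt`.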
Since a and b are arrays, we can use ==, but that is essentially the same operation as [212] (the timeit result is slightly better):

In [214]: a==b
Out[214]: array([False, False, False, False, False])

We could make b into a (5,1) array, but why?

In [215]: b[:,None]
Out[215]: 
array([[list([2])],
       [list([2, 4])],
       [list([2, 5, 7])],
       [list([4])],
       [list([3, 2])]], dtype=object)

I think you were trying to imitate an array comparison like this one, broadcasting a (5,) array against a (3,1) array to produce a (3,5) truth table:

In [216]: x = np.arange(5); y = np.array([3,5,1])
In [217]: x==y[:,None]
Out[217]: 
array([[False, False, False,  True, False],
       [False, False, False, False, False],
       [False,  True, False, False, False]])

In [218]: (x==y[:,None]).any(axis=1)
Out[218]: array([ True, False,  True])

isin can do the same sort of comparison:

In [219]: np.isin(x,y)
Out[219]: array([False,  True, False,  True, False])
In [220]: np.isin(y,x)
Out[220]: array([ True, False,  True])

While this works for numbers, it does not work for arrays of lists, and especially not for your case, where you want to test each list in a against the corresponding list in b. You aren't testing all of a against all of b.
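To make the row-wise pairing concrete, here is a small sketch that calls `np.isin` once per row, so each col_a list is tested only against the col_b list of the same row. The names `a_lists`, `b_lists` and `mask` are illustrative, and the outer loop is still ordinary Python, so this should not be expected to run at "vectorized" speed.

```python
import numpy as np

a_lists = [[1], [2], [5], [1], [2]]
b_lists = [[2], [2, 4], [2, 5, 7], [4], [3, 2]]

# Test each row's col_a values against that same row's col_b values only.
# np.isin runs once per row; .any() reports whether any col_a value was found.
mask = [bool(np.isin(a_row, b_row).any())
        for a_row, b_row in zip(a_lists, b_lists)]
print(mask)
```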
Since the lists in a are all the same size, we can join them into one number array:

In [225]: np.hstack(a)
Out[225]: array([1, 2, 5, 1, 2])

We cannot do the same for b because the lists vary in size. As a general rule, when you have lists (or arrays) that vary in size, you cannot do the fast numeric numpy math and comparisons.
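One possible workaround, which is not part of the original answer, is to pad the ragged lists into a rectangular numeric array so that broadcasting applies. The sketch below assumes the values are numeric and that the extra memory for padding is acceptable (the padded width equals the longest list). NaN is used as the fill value because it never compares equal to anything, so padding cannot create a false match. The names `b_padded`, `width` and `mask` are illustrative.

```python
import numpy as np

a_lists = [[1], [2], [5], [1], [2]]
b_lists = [[2], [2, 4], [2, 5, 7], [4], [3, 2]]

# col_a as a flat numeric array, shape (5,); assumes one element per list.
a = np.array([x[0] for x in a_lists])

# Pad col_b into a rectangular float array of shape (5, width), filled with NaN.
width = max(len(row) for row in b_lists)
b_padded = np.full((len(b_lists), width), np.nan)
for i, row in enumerate(b_lists):
    b_padded[i, :len(row)] = row

# Broadcast (5, width) against (5, 1): each row of b is compared
# with the single value from the same row of a.
mask = (b_padded == a[:, None]).any(axis=1)
print(mask)
```

Building `b_padded` still loops over the rows in Python; only the comparison itself is a compiled numpy operation, so whether this pays off depends on the data.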
We could test the (5,) a against the (5,1) b, producing a (5,5) truth table:

In [227]: a==b[:,None]
Out[227]: 
array([[False,  True, False, False,  True],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False],
       [False, False, False, False, False]])

But that is True for only a couple of cells in the first row: that's where the list([2]) from b matches the equal lists in a. The comparison tests whole lists for equality rather than testing membership.
