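I'm trying to use numpy's broadcasting on my large dataset. My list columns can contain hundreds of elements in many rows. I need to filter rows based on whether a column's value is present in a list column: if the number in col_a is present in col_b, I need to keep that row.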
Sample data:
import pandas as pd
import numpy as np
dt = pd.DataFrame({'id' : ['a', 'a', 'a', 'b', 'b'],
'col_a': [[1],[2],[5],[1],[2]],
'col_b': [[2],[2,4],[2,5,7],[4],[3,2]],
})
dt
id col_a col_b
0 a [1] [2]
1 a [2] [2, 4]
2 a [5] [2, 5, 7]
3 b [1] [4]
4 b [2] [3, 2]
I tried the code below to add a dimension to col_b and check whether the value is present in col_a:
(dt['col_a'] == dt['col_b'][:,None]).any(axis = 1)
But I get the following error:
ValueError: ('Shapes must match', (5,), (5, 1))
Can anyone tell me the correct way to do this?
2 Answers
inkz8wg9 #1
Parse the columns out by comma:
Filter to where col_a equals col_b:
Result:
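The code for the steps above is not shown here, so what follows is only a rough sketch of one way they could be carried out, not necessarily what the answerer wrote. It assumes `DataFrame.explode` is an acceptable stand-in for "parsing out" the list column, and the names `exploded`, `matches` and `result` are made up for illustration.

```python
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [[1], [2], [5], [1], [2]],
    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
})

# Assumption: give every element of col_b its own row (the index is repeated).
exploded = dt.explode('col_b')

# col_a holds one-element lists, so pull out the single number.
exploded['col_a'] = exploded['col_a'].map(lambda v: v[0])

# Keep the exploded rows where the col_a number equals the col_b element.
matches = exploded[exploded['col_a'] == exploded['col_b']]

# Map the matches back to the original rows.
result = dt.loc[matches.index.unique()]
print(result)
```

With the sample data this keeps the rows with index 1, 2 and 4.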
eqqqjvef #2
I think you've been told that numpy "vectorization" is the key to speeding up your code, but you don't have a good grasp of what this is. It isn't something magical that you can apply to any pandas task. It's just "shorthand" for making full use of numpy array methods, which means actually learning numpy.
But let's explore your task:
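A minimal sketch that rebuilds the frame from the question and looks at the column dtypes. The names `col_dtypes` and `first_cell_type` are only for illustration.

```python
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [[1], [2], [5], [1], [2]],
    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
})

# The list columns are stored as generic Python objects.
col_dtypes = dt.dtypes
print(col_dtypes)

# Each cell of col_a is an ordinary Python list, not a number.
first_cell_type = type(dt['col_a'].iloc[0])
print(first_cell_type)
```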
Because the columns contain lists, their dtype is object; they have references to lists.
Doing things like == on columns, which are pandas Series, is not the same as doing things with the arrays of their values.
But to focus on the numpy aspect, let's get numpy arrays:
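One way to get those arrays, sketched with `Series.to_numpy()` (an assumption; `.values` would be another option). The names `a` and `b` follow the rest of this answer.

```python
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [[1], [2], [5], [1], [2]],
    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
})

# 1-D object arrays; every element is a reference to a Python list.
a = dt['col_a'].to_numpy()
b = dt['col_b'].to_numpy()

print(a.shape, a.dtype)
print(b.shape, b.dtype)
```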
The fast numpy operations use compiled code and, for the most part, only work with numeric dtypes. Arrays like this, containing references to lists, are basically the same as lists. Math and other operations like equality operate at list comprehension speeds. That may be faster than pandas speeds, but nowhere near the highly vaunted "vectorized" numpy speeds.
So let's do a list comprehension on the elements of these lists. This is a lot like pandas apply, though I think it's faster (pandas apply is notoriously slow).
Oops, no matches; that must be because i from a is a list. Let's extract that number.
Making col_a contain lists instead of numbers does not help you.
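The two list comprehensions described above can be sketched roughly as follows. The names `no_match`, `mask` and `filtered` are made up for illustration, and using the mask to filter `dt` is an added step, not something shown in the answer.

```python
import numpy as np
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [[1], [2], [5], [1], [2]],
    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
})
a = dt['col_a'].to_numpy()
b = dt['col_b'].to_numpy()

# First attempt: i is itself a list such as [2], and [2] is not an element of [2, 4].
no_match = [i in j for i, j in zip(a, b)]
print(no_match)   # [False, False, False, False, False]

# Second attempt: take the number out of the one-element list first.
mask = [i[0] in j for i, j in zip(a, b)]
print(mask)       # [False, True, True, False, True]

# Assumed follow-up: use the mask to keep the matching rows.
filtered = dt[np.array(mask)]
print(filtered)
```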
Since a and b are arrays, we can use ==, but that is essentially the same operation as [212] (timeit is slightly better).
We could make b into a (5,1) array, but why?
I think you were trying to imitate an array comparison like this, broadcasting a (5,) against a (3,1) to produce a (3,5) truth table:
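A minimal sketch of that kind of broadcast comparison on plain numeric arrays. The arrays `x` and `y` and their values are hypothetical, chosen only to give the shapes mentioned above. `np.isin` is included because it gives the same per-element answer.

```python
import numpy as np

x = np.array([1, 2, 5, 1, 2])   # shape (5,)
y = np.array([2, 5, 7])         # shape (3,)

# y[:, None] has shape (3, 1); comparing it with (5,) broadcasts to (3, 5).
table = x == y[:, None]
print(table.shape)   # (3, 5)

# Collapse over the y axis: is each x value equal to any y value?
any_match = table.any(axis=0)
print(any_match)     # [False  True  True False  True]

# np.isin answers the same question without building the table by hand.
isin_match = np.isin(x, y)
print(isin_match)    # [False  True  True False  True]
```

This only works because `x` and `y` are regular numeric arrays, and it tests every value of `x` against every value of `y`.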
isin can do the same sort of comparison.
While this works for numbers, it does not work for the arrays of lists, especially not your case where you want to test the lists in a against the corresponding list in b. You aren't testing all of a against all of b.
Since the lists in a are all the same size, we can join them into one number array.
We cannot do the same for b because the lists vary in size. As a general rule, when you have lists (or arrays) that vary in size, you cannot do the fast numeric numpy math and comparisons.
We could test (5,) a against (5,1) b, producing a (5,5) truth table.
But that is True for a couple of cells in the first row; that's where the list([2]) from b matches the same list in a.
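The last two steps, joining a into a numeric array and comparing (5,) a against (5,1) b, can be sketched like this. `np.hstack` is one possible way to do the join (an assumption), and the names `a_num` and `table` are made up for illustration.

```python
import numpy as np
import pandas as pd

dt = pd.DataFrame({
    'id': ['a', 'a', 'a', 'b', 'b'],
    'col_a': [[1], [2], [5], [1], [2]],
    'col_b': [[2], [2, 4], [2, 5, 7], [4], [3, 2]],
})
a = dt['col_a'].to_numpy()
b = dt['col_b'].to_numpy()

# All lists in a have length 1, so they join into one numeric array.
a_num = np.hstack(list(a))
print(a_num)            # [1 2 5 1 2]

# Object arrays broadcast too, but each cell is a whole-list equality test.
table = a == b[:, None]
print(table.shape)      # (5, 5)

# Only the first row has hits: b[0] is [2], which equals a[1] and a[4].
print(table[0])         # [False  True False False  True]
```

The table compares whole lists for equality, so it does not answer the membership question; the row-by-row list comprehension is still the practical approach for this data.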