1. Install numexpr to speed up query() calls
The pandas documentation recommends installing numexpr to speed up numeric calculations when using query(). Use pip install numexpr (or conda, sudo etc. depending on your environment) to install it.
For larger dataframes (where performance actually matters), df.query() with the numexpr engine performs much faster than df[mask]. In particular, it performs better in the following cases.
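To check whether your environment will actually pick up numexpr, something like the following may help. It is a minimal sketch: the small frame is made-up data, and it assumes the pandas option compute.use_numexpr is available in your version. The engine argument is left at its default, so pandas falls back to the Python engine when numexpr is absent.

```python
import importlib.util

import pandas as pd

# Detect numexpr without importing it, so this also runs when it is not installed.
numexpr_installed = importlib.util.find_spec("numexpr") is not None
print("numexpr installed:", numexpr_installed)
print("pandas allowed to use numexpr:", pd.get_option("compute.use_numexpr"))

# Small illustrative frame (hypothetical data, not the benchmark frame).
df = pd.DataFrame({"A": ["foo", "bar", "baz", "foo"], "B": [1.0, 2.0, 3.0, 4.0]})

# engine is left at its default: numexpr if pandas can use it, otherwise Python.
by_query = df.query("A == 'foo' and B > 2")
by_mask = df[(df.A == "foo") & (df.B > 2)]

print(by_query.equals(by_mask))
```

Both selections return the same rows; only the way the mask is evaluated differs.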
Logical and/or comparison operators on columns of strings
If a column of strings is compared to some other string(s) and matching rows are to be selected, query() performs faster than df[mask] even for a single comparison operation. For example, for a dataframe with 80k rows, it's about 30% faster,¹ and for a dataframe with 800k rows, it's about 60% faster.²
This gap increases as the number of operations increases (if 4 comparisons are chained, df.query() is roughly 1.8-2.3 times faster than df[mask])¹,² and/or as the dataframe length increases.²
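As a quick sanity check that the chained comparison used in the benchmarks selects the same rows either way, here is a small sketch on an 8-row version of the benchmark column. Note that inside query() the & and | operators take the precedence of and / or, which is why the query string needs no parentheses while the mask does.

```python
import pandas as pd

# 8-row version of the string column used in the benchmarks.
df = pd.DataFrame({"A": "foo bar foo baz foo bar foo foo".split()})

# Boolean-mask version: every comparison must be parenthesized.
mask = ((df.A == "foo") & (df.A != "bar")) | ((df.A != "baz") & (df.A != "buz"))
by_mask = df[mask]

# query() version: & and | are treated like `and` and `or`.
by_query = df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")

print(by_query.equals(by_mask))
```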
Multiple operations on numeric columns
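If multiple arithmetic, logical or comparison operations need to be computed to create a boolean mask to filter df, query() performs faster. For example, for a frame with 80k rows, it's about 20% faster,¹ and for a frame with 800k rows, it's about 2 times faster.² This gap in performance increases as the number of operations increases and/or the dataframe length increases.²
The plots produced by the code in footnote 3 show how the methods perform as the dataframe length increases.³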
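If you want to look at the boolean mask itself, df.eval() evaluates the same expression string and returns it as a Series. A minimal sketch, assuming a small random frame with a fixed seed (the seed and the frame size are arbitrary choices for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
df = pd.DataFrame({"B": rng.random(1000)})

# eval() returns the boolean mask; query() applies it in one step.
mask = df.eval("(B % 5)**2 < 0.1")
by_eval = df[mask]
by_query = df.query("(B % 5)**2 < 0.1")

# Plain pandas version for comparison.
by_mask = df[(df.B % 5) ** 2 < 0.1]

print(by_query.equals(by_mask), by_eval.equals(by_mask))
```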
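2. Access .values to call pandas methods inside query()
Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).
For example, it doesn't support integer division (//). However, calling the equivalent pandas method (floordiv()) and accessing the values attribute on the resulting Series makes numexpr evaluate its underlying numpy array, so query() works. Setting the engine parameter to 'python' also works.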
df.query('B.floordiv(2).values <= 3') # or
df.query('B.floordiv(2).le(3).values') # or
df.query('B.floordiv(2).le(3)', engine='python')
The same applies to the method calls suggested by Erfan as well. The code in their answer raises a TypeError as is (as of pandas 1.3.4) with the numexpr engine, but accessing the .values attribute makes it work.
df.query('`Sender email`.str.endswith("@shop.com")') # <--- TypeError
df.query('`Sender email`.str.endswith("@shop.com").values') # OK
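To convince yourself that these workarounds return the same rows as ordinary boolean indexing, a small check along these lines may be useful. The frame below is made-up data; the column names B and Sender email simply mirror the snippets above.

```python
import pandas as pd

# Hypothetical data, only for illustration.
df = pd.DataFrame({
    "B": [1, 4, 7, 10, 6],
    "Sender email": ["a@shop.com", "b@mail.com", "c@shop.com",
                     "d@shop.com", "e@mail.com"],
})

# Floor division: plain mask vs. the two query() workarounds.
expected_num = df[df.B // 2 <= 3]
via_python = df.query("B.floordiv(2).le(3)", engine="python")
via_values = df.query("B.floordiv(2).values <= 3")

# String method: plain mask vs. query() with the Python engine.
expected_str = df[df["Sender email"].str.endswith("@shop.com")]
str_via_python = df.query('`Sender email`.str.endswith("@shop.com")',
                          engine="python")

print(via_python.equals(expected_num), via_values.equals(expected_num))
print(str_via_python.equals(expected_str))
```

Passing engine='python' gives up the numexpr speed-up for that call, so it is mostly worth it when the expression cannot be written with the supported operators.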
1: Benchmark code using a frame with 80k rows
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000,
                   'B': np.random.rand(80000)})
%timeit df[df.A == 'foo']
# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5)**2 < 0.1]
# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.query("(B % 5)**2 < 0.1")
# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2: Benchmark code using a frame with 800k rows
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000,
                   'B': np.random.rand(800000)})
%timeit df[df.A == 'foo']
# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo'")
# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]
# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")
# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df[(df.B % 5)**2 < 0.1]
# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit df.query("(B % 5)**2 < 0.1")
# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
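%timeit is an IPython magic, so the snippets above only run in IPython or Jupyter. If you want to repeat the comparison from a plain script, a rough equivalent with the standard timeit module could look like this. The labels, the number of repeats and the rows_factor name are my own choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rows_factor = 10_000  # 8 * 10_000 = 80k rows, as in footnote 1
df = pd.DataFrame({"A": "foo bar foo baz foo bar foo foo".split() * rows_factor,
                   "B": np.random.rand(8 * rows_factor)})

# Labels are arbitrary; each callable mirrors one of the %timeit lines above.
candidates = {
    "mask, strings": lambda: df[df.A == "foo"],
    "query, strings": lambda: df.query("A == 'foo'"),
    "mask, numbers": lambda: df[(df.B % 5) ** 2 < 0.1],
    "query, numbers": lambda: df.query("(B % 5)**2 < 0.1"),
}

timings = {}
for label, func in candidates.items():
    # Best of 3 repeats of 10 calls each, converted to milliseconds per call.
    best = min(timeit.repeat(func, number=10, repeat=3)) / 10
    timings[label] = best * 1e3
    print(f"{label:<16} {timings[label]:8.2f} ms per call")
```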
3: Code used to produce the performance graphs of the two methods for strings and numbers.
from perfplot import plot

constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})

plot(
    setup=constructor,
    kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
    labels=['df[(df.B % 5)**2 < 0.1]', 'df.query("(B % 5)**2 < 0.1")'],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Multiple mathematical operations on numbers',
    equality_check=pd.DataFrame.equals);

plot(
    setup=constructor,
    kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
    labels=["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Comparison operation on strings',
    equality_check=pd.DataFrame.equals);