How do I select rows from a DataFrame based on column values?

Posted by fnatzsnv on 2022-09-18 in Java

How can I select rows from a DataFrame based on the values in some column in pandas?

In SQL, I would use:

SELECT *
FROM table
WHERE column_name = some_value

Answer #16 (nimxete2)

1. Install numexpr to speed up query() calls

The pandas documentation recommends installing numexpr to speed up numeric calculations when using query(). Install it with pip install numexpr (or conda install numexpr, depending on your environment).

For larger dataframes (where performance actually matters), df.query() with the numexpr engine performs much faster than df[mask]. In particular, it performs better in the following cases.
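As a quick check before relying on the speedup, the sketch below reports whether numexpr is importable and whether pandas is configured to use it. The tiny frame is made up for illustration; query() is called without an engine argument so that pandas chooses the engine itself.

```python
import importlib.util

import pandas as pd

# True if the numexpr package can be found in this environment.
numexpr_installed = importlib.util.find_spec("numexpr") is not None

# pandas option controlling whether numexpr is used when it is available.
use_numexpr = pd.get_option("compute.use_numexpr")

print(f"numexpr installed: {numexpr_installed}, pandas option enabled: {use_numexpr}")

# Illustrative data only. With no engine argument, pandas picks the engine.
df = pd.DataFrame({"A": ["foo", "bar", "foo"], "B": [1, 2, 3]})
result = df.query("B > 1")
print(result)
```

If numexpr is not installed, the same query() call still works; pandas falls back to evaluating the expression in Python.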

Logical and/or comparison operators on columns of strings

If a column of strings is compared to one or more other strings and the matching rows are to be selected, query() is faster than df[mask] even for a single comparison. For example, for a dataframe with 80k rows it's about 30% faster [1], and for a dataframe with 800k rows it's about 60% faster [2].

df[df.A == 'foo']
df.query("A == 'foo'")  # <--- performs 30%-60% faster

This gap widens as the number of operations increases (if 4 comparisons are chained, df.query() is roughly 1.8 to 2.3 times faster than df[mask]) [1][2] and/or as the dataframe length increases [2].
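To make the chained case concrete, here is a minimal sketch on a small made-up frame showing that a boolean mask built with | and a query() string built with or select the same rows. Only the syntax differs; timing on a frame this small is not meaningful.

```python
import pandas as pd

# Small illustrative frame; the values mirror the benchmark's string column.
df = pd.DataFrame({"A": "foo bar foo baz foo bar foo foo".split(),
                   "B": range(8)})

# Boolean-mask form: each comparison needs its own parentheses.
mask_result = df[(df.A == "foo") | (df.A == "baz")]

# query() form: comparisons can be chained with and/or inside one string.
query_result = df.query("A == 'foo' or A == 'baz'")

print(query_result)
```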

Multiple operations on numeric columns

If multiple arithmetic, logical or comparison operations have to be computed to create the boolean mask that filters df, query() is faster. For example, for a frame with 80k rows it's about 20% faster [1], and for a frame with 800k rows it's about 2 times faster [2].

df[(df.B % 5)**2 < 0.1]
df.query("(B % 5)**2 < 0.1")  # <--- performs 20%-100% faster.

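This gap in performance widens as the number of operations increases and/or as the dataframe length increases [2].

The plots generated by the code in [3] show how the two methods perform as the dataframe length increases, one for a comparison operation on strings and one for multiple mathematical operations on numbers.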
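The timings quoted here come from IPython's %timeit. For readers outside IPython, the following is a rough sketch of the same comparison using the standard-library timeit module. The seed, repeat counts and helper names are arbitrary choices for illustration, and the measured numbers depend on the machine and on whether numexpr is installed.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
df = pd.DataFrame({"B": rng.random(80_000)})


def with_mask():
    # Boolean-mask form of the filter.
    return df[(df.B % 5) ** 2 < 0.1]


def with_query():
    # Same filter expressed as a query() string.
    return df.query("(B % 5)**2 < 0.1")


calls = 20
# Take the best of 3 repeats to reduce noise from other processes.
mask_time = min(timeit.repeat(with_mask, number=calls, repeat=3)) / calls
query_time = min(timeit.repeat(with_query, number=calls, repeat=3)) / calls

print(f"df[mask]:   {mask_time * 1e3:.2f} ms per call")
print(f"df.query(): {query_time * 1e3:.2f} ms per call")
```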

2. Access .values to call pandas methods inside query()

Numexpr currently supports only logical (&, |, ~), comparison (==, >, <, >=, <=, !=) and basic arithmetic operators (+, -, *, /, **, %).

For example, it doesn't support integer division (//). However, if you call the equivalent pandas method (floordiv()) and access the values attribute on the resulting Series, numexpr evaluates the underlying numpy array and query() works. Setting the engine parameter to 'python' also works.

df.query('B.floordiv(2).values <= 3')  # or 
df.query('B.floordiv(2).le(3).values') # or
df.query('B.floordiv(2).le(3)', engine='python')
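As a small self-contained check of the workaround described above, the sketch below compares two of those query() forms against the plain boolean mask on a made-up integer column. It assumes nothing beyond pandas itself; the data is illustrative.

```python
import pandas as pd

# Illustrative integer column.
df = pd.DataFrame({"B": [0, 3, 6, 7, 8, 11]})

# Reference result using ordinary pandas integer division.
expected = df[df.B // 2 <= 3]

# Method call plus .values, so the comparison runs on a numpy array.
via_values = df.query("B.floordiv(2).values <= 3")

# Pure method chain, evaluated with the Python engine.
via_python = df.query("B.floordiv(2).le(3)", engine="python")

print(via_values)
```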

The same applies to the method calls suggested in Erfan's answer. The code in that answer raises a TypeError as written (as of pandas 1.3.4) with the numexpr engine, but accessing the .values attribute makes it work.

df.query('`Sender email`.str.endswith("@shop.com")')         # <--- TypeError
df.query('`Sender email`.str.endswith("@shop.com").values')  # OK
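For a runnable version of the string-method case, here is a minimal sketch with a made-up frame whose column name contains a space (hence the backticks). The addresses and the Amount column are hypothetical. The last call passes engine='python', which is the alternative to .values mentioned earlier for method calls.

```python
import pandas as pd

# Hypothetical data; only the column name matches the example above.
df = pd.DataFrame({
    "Sender email": ["a@shop.com", "b@example.org", "c@shop.com"],
    "Amount": [10, 20, 30],
})

# Reference result using a plain boolean mask.
expected = df[df["Sender email"].str.endswith("@shop.com")]

# .values hands query() a plain numpy boolean array.
via_values = df.query('`Sender email`.str.endswith("@shop.com").values')

# The Python engine evaluates the method chain directly.
via_python = df.query('`Sender email`.str.endswith("@shop.com")', engine="python")

print(via_values)
```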

[1] Benchmark code using a frame with 80k rows

import numpy as np
import pandas as pd
df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*10000, 
                   'B': np.random.rand(80000)})

%timeit df[df.A == 'foo']

# 8.5 ms ± 104.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.query("A == 'foo'")

# 6.36 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]

# 29 ms ± 554 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")

# 16 ms ± 339 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5)**2 < 0.1]

# 5.35 ms ± 37.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.query("(B % 5)**2 < 0.1")

# 4.37 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

[2] Benchmark code using a frame with 800k rows

df = pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*100000, 
                   'B': np.random.rand(800000)})

%timeit df[df.A == 'foo']

# 87.9 ms ± 873 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df.query("A == 'foo'")

# 54.4 ms ± 726 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[((df.A == 'foo') & (df.A != 'bar')) | ((df.A != 'baz') & (df.A != 'buz'))]

# 310 ms ± 3.4 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df.query("A == 'foo' & A != 'bar' | A != 'baz' & A != 'buz'")

# 132 ms ± 2.43 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df[(df.B % 5)**2 < 0.1]

# 54 ms ± 488 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

%timeit df.query("(B % 5)**2 < 0.1")

# 26.3 ms ± 320 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)

[3] Code used to produce the performance graphs of the two methods for strings and numbers.

import numpy as np
import pandas as pd
from perfplot import plot
constructor = lambda n: pd.DataFrame({'A': 'foo bar foo baz foo bar foo foo'.split()*n, 'B': np.random.rand(8*n)})
plot(
    setup=constructor,
    kernels=[lambda df: df[(df.B%5)**2<0.1], lambda df: df.query("(B%5)**2<0.1")],
    labels= ['df[(df.B % 5)**2 < 0.1]', 'df.query("(B % 5)**2 < 0.1")'],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Multiple mathematical operations on numbers',
    equality_check=pd.DataFrame.equals);
plot(
    setup=constructor,
    kernels=[lambda df: df[df.A == 'foo'], lambda df: df.query("A == 'foo'")],
    labels= ["df[df.A == 'foo']", """df.query("A == 'foo'")"""],
    n_range=[2**k for k in range(4, 24)],
    xlabel='Rows in DataFrame',
    title='Comparison operation on strings',
    equality_check=pd.DataFrame.equals);
