Most efficient way to combine information from each row of a group in a pandas groupby

2exbekwf · asked on 2023-08-05

Consider the following code. Its purpose is to iterate over a groupby and, for each group, combine information from every row and save it into a pd.Series.

import numpy as np
import pandas as pd
df = pd.DataFrame({"A" : ["a", "a", "b", "b", "a"], "B" : ["x", "y", "z", "w", "t"] })
groups = df.groupby("A")
strings = groups["B"].apply(lambda x : "".join(sorted(x.values + " ")))
strings2 = groups["B"].apply(lambda x : str(len(np.unique(x))) + "hello")
print(strings)
print(strings2)

Output:

A
a    t x y 
b      w z 
Name: B, dtype: object
A
a    3hello
b    2hello
Name: B, dtype: object


For example, the first apply statement in the code above joins the rows of column "B" into a single string, with a space appended to each value.
I want to make this code as fast as possible, and able to scale to settings with millions of rows.
This is the most efficient approach I have found so far, but I believe part of the problem is that .apply is not vectorized in pandas, so applying it to each group is very slow (typically there are more than 300,000 groups, each of size 2 or 3).
Alternatively, I could try skipping the groupby and doing this directly on the original DataFrame; solutions that work that way would also be appreciated.
In short, I am looking for a way to rewrite the code above to achieve the same thing, but as fast and as space-efficient as possible (scalable to very large datasets).

jyztefdp · Answer 1

The fastest solution I found for strings2 is:

def cast_multiple_apply_approach(dfg):
    return dfg["B"].apply(pd.unique).map(len).map(repr) + "hello"

The OP's approach to strings2 is very slow by comparison:

import numpy as np
import pandas as pd

_rng = np.random.default_rng(123)

def setup(N):
    x = ["a", "a", "b", "b", "a"] * N
    _rng.shuffle(x)
    y = ["x", "y", "z", "w", "t"] * N
    _rng.shuffle(y)
    df = pd.DataFrame({"A": x, "B": y})

    groups = df.groupby("A")
    return [groups]

def original_poster_approach(dfg):
    return dfg["B"].apply(lambda x: str(len(np.unique(x))) + "hello")

def bracula_approach(dfg):
    return dfg["B"].apply(lambda x: f"{len(pd.unique(x))}hello")

def len_multiple_apply_approach(dfg):
    return dfg["B"].apply(pd.unique).map(len).astype("string") + "hello"

def npsize_multiple_apply_approach(dfg):
    return dfg["B"].apply(pd.unique).map(np.size).astype("string") + "hello"

def set_multiple_apply_approach(dfg):
    return dfg["B"].apply(set).map(len).astype("string") + "hello"

def cast_multiple_apply_approach(dfg):
    return dfg["B"].apply(pd.unique).map(len).map(repr) + "hello"

approaches = [
    original_poster_approach,
    bracula_approach,
    len_multiple_apply_approach,
    npsize_multiple_apply_approach,
    set_multiple_apply_approach,
    cast_multiple_apply_approach,
]
for approach in approaches[1:]:
    data = setup(100)
    assert (approach(*data) == approaches[0](*data)).all()

run_performance_comparison(
    approaches,
    [
        1000,
        3000,
        5000,
        10000,
        30000,
        100_000,
    ],  # ,300_000,500_000,1_000_000],#3_000_000,5_000_000,10_000_000],
    setup=setup,
    title="Performance Comparison",
    number_of_repetitions=1,
)
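One variant not included in the benchmark above (my own addition, not from the original answer) is pandas' built-in `SeriesGroupBy.nunique`, which computes the distinct count group-wise inside pandas and skips the per-group Python lambda entirely:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b", "b", "a"],
                   "B": ["x", "y", "z", "w", "t"]})

# nunique counts distinct values per group natively in pandas,
# avoiding a Python-level apply over 300k+ small groups
strings2 = df.groupby("A")["B"].nunique().astype(str) + "hello"
```

On the question's toy frame this yields "3hello" for group a and "2hello" for group b; it would be worth adding to the benchmark.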


Because it is so slow, I excluded it from the runs on larger samples:



For strings, I was surprised to find that:

  • the OP's approach is actually really hard to beat
  • sorting beforehand is terrible

I could improve on the OP's solution by using the native pd.Series.str.join instead of a pure-Python join:

def pandas_str_accessor_approach(df):
    return df.groupby("A")["B"].apply(sorted).str.join(" ")
def setup(N):
    x = ["a", "a", "b", "b", "a"] * N
    _rng.shuffle(x)
    y = ["x", "y", "z", "w", "t"] * N
    _rng.shuffle(y)
    df = pd.DataFrame({"A": x, "B": y})
    return [df]

def original_poster_approach(df):
    return df.groupby("A")["B"].apply(lambda x: "".join(sorted(x.values + " ")))

def pandas_str_accessor_approach(df):
    return df.groupby("A")["B"].apply(sorted).str.join(" ")

def sort_first_approach(df):
    df.sort_values(by=["B"], inplace=True)
    strings = df.groupby("A")["B"].apply(" ".join)
    return strings

approaches = [
    original_poster_approach,
    pandas_str_accessor_approach,
    sort_first_approach,
]

run_performance_comparison(
    approaches,
    [
        1000,
        3000,
        5000,
        10000,
        30000,
        100_000,
        300_000,
        500_000,
        1_000_000,
        3_000_000,
    ],  # 5_000_000,10_000_000],
    setup=setup,
    title="Performance Comparison",
    number_of_repetitions=1,
)
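For illustration (a sketch I added, not part of the original answer), on the question's toy frame the str-accessor variant works like this:

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b", "b", "a"],
                   "B": ["x", "y", "z", "w", "t"]})

# apply(sorted) yields one Python list per group; the .str.join
# accessor then concatenates each list inside pandas
strings = df.groupby("A")["B"].apply(sorted).str.join(" ")
```

Note this gives "t x y" (no trailing space), whereas the OP's `"".join(sorted(x.values + " "))` leaves a trailing space on each joined string.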





Profiling code used:

import timeit
import matplotlib.pyplot as plt
from typing import List, Dict, Callable

from contextlib import contextmanager

@contextmanager
def data_provider(data_size, setup=lambda N: N, teardown=lambda: None):
    data = setup(data_size)
    yield data
    teardown()

def run_performance_comparison(approaches: List[Callable],
                               data_size: List[int],
                               setup=lambda N: N,
                               teardown=lambda: None,
                               number_of_repetitions=5, title='Performance Comparison',data_name='N'):
    approach_times: Dict[Callable, List[float]] = {approach: [] for approach in approaches}

    for N in data_size:
        with data_provider(N, setup, teardown) as data:
            for approach in approaches:
                approach_time = timeit.timeit(lambda: approach(*data), number=number_of_repetitions)
                approach_times[approach].append(approach_time)

    for approach in approaches:
        plt.plot(data_size, approach_times[approach], label=approach.__name__)
    plt.yscale('log')
    plt.xscale('log')

    plt.xlabel(data_name)
    plt.ylabel('Execution Time (seconds)')
    plt.title(title)
    plt.legend()
    plt.show()

ehxuflar · Answer 2

import numpy as np
import pandas as pd

# create a reproducible dataframe
_rng = np.random.default_rng(123)
x = ["a", "a", "b", "b", "a"] * 10_000_000
_rng.shuffle(x)
y = ["x", "y", "z", "w", "t"] * 10_000_000
_rng.shuffle(y)
df = pd.DataFrame({"A" : x, "B" : y})

groups = df.groupby("A") 

def strings2_old(dfg):
    return dfg['B'].apply(lambda x : str(len(np.unique(x))) + "hello")

def strings2_new(dfg):
    return dfg['B'].apply(lambda x : f"{len(pd.unique(x))}hello")
%timeit strings2_old(groups)
>>> 20.7 s ± 82.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit strings2_new(groups)
>>> 1.9 s ± 10.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The other implementations of the first approach that I tried, including the one Andrej suggested, were all worse than yours. Sorting is very expensive regardless of scale, unless you use a sorting network.
UPD
Reading Sebastian's amazing answer sent me down an optimization rabbit hole. Can we really do better? Apparently we can: Python lists: why is .sort() much faster than sorted()?
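My reading of the underlying difference (a sketch I added for illustration): `sorted()` on a Series pulls elements out one at a time through the Series iterator, while `.tolist()` is a single bulk conversion and `list.sort()` then sorts in place without an extra copy:

```python
import pandas as pd

def via_sorted(series):
    # sorted() iterates the Series element by element
    return sorted(series)

def via_tolist(series):
    # tolist() converts in one bulk step; list.sort() sorts in place
    lst = series.tolist()
    lst.sort()
    return lst
```

Both return the same sorted Python list; the second simply spends less time per element getting the data out of pandas.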

import pandas as pd
import numpy as np

_rng = np.random.default_rng(123)

def setup(N):
    x = ["a", "a", "b", "b", "a"] * N
    _rng.shuffle(x)
    y = ["x", "y", "z", "w", "t"] * N
    _rng.shuffle(y)
    df = pd.DataFrame({"A": x, "B": y})
    return [df]

def pandas_str_accessor_approach(df):
    return df.groupby("A")["B"].apply(sorted).str.join(" ")

def pandas_str_accessor_approach2(df):
    def transform(_x):
        _x = _x.tolist()
        _x.sort()
        return _x
    return df.groupby("A")["B"].apply(lambda x: transform(x)).str.join(' ')

approaches = [
    pandas_str_accessor_approach,
    pandas_str_accessor_approach2
]

run_performance_comparison(
    approaches,
    [
        1_000,
        3_000,
        5_000,
        10_000,
        30_000,
        100_000,
        300_000,
        500_000,
        1_000_000,
        3_000_000,
        5_000_000,
        10_000_000],
    setup=setup,
    title="Performance Comparison",
    number_of_repetitions=5,
)



wixjitnu · Answer 3

IIUC, you can sort the DataFrame by column B before grouping, and then just apply str.join:

df = df.sort_values(by=['B'])
strings = df.groupby("A")['B'].apply(' '.join)

print(strings)

Prints:

A
a    t x y
b      w z
Name: B, dtype: object
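This works because groupby preserves the original row order within each group, so after sort_values the values reach the join already sorted. A quick equivalence check (a sketch I added, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "a", "b", "b", "a"],
                   "B": ["x", "y", "z", "w", "t"]})

# groupby keeps within-group row order, so sorting the whole frame
# by "B" first leaves each group's values in sorted order
via_sort = df.sort_values("B").groupby("A")["B"].agg(" ".join)

# per-group sort inside apply, as in the question
via_apply = df.groupby("A")["B"].apply(lambda s: " ".join(sorted(s)))
```

Both produce "t x y" for group a and "w z" for group b.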
