Python: apply a function to each groupby object

Posted by wbrvyc0a on 2022-12-21 in Python

I'm trying to apply a function to each segment/partition produced by a groupby operation.

def get_click_rate(data):
    click_count = data[data['event'] == 'click'].shape[0]
    view_count = data[data['event'] == 'pageview'].shape[0]
    return click_count / view_count

data.groupby('linkid').apply(get_click_rate).reset_index(name='click rate')

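The idea is that I group the DataFrame by the web page's `linkid` and then pass each partition to a function that filters the sub-DataFrame, computes a number, and returns it. However, it returns the wrong numbers. Below is the code snippet that returns the correct numbers: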

click_event = data[data['event'] == 'click'].groupby('linkid')['event'].count().reset_index(name='click count')
view_event = data[data['event'] == 'pageview'].groupby('linkid')['event'].count().reset_index(name='view count')
merged_df = pd.merge(left=click_event, right=view_event, on='linkid', how='inner')
merged_df['click rate'] = merged_df['click count'] / merged_df['view count']

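At least as I see it, the two snippets do the same thing in a different order: the second one filters the data first, groups it, and then merges the sub-DataFrames to arrive at the desired numbers.
Can anyone tell me what I'm missing?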

Source: https://stackoverflow.com/questions/74870875/apply-function-to-each-groupby-object

1 Answer

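I tried your `get_click_rate` function here, and it seems to return the same results as the second approach you wrote. The only problem I ran into was computing the click rate for a `linkid` group that has no 'pageview' events. I therefore made a few small changes to `get_click_rate`: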

import pandas as pd

def get_click_rate(data: pd.DataFrame) -> pd.Series:
    """Calculate the click rate for a given ``linkid`` group.

    Parameters
    ----------
    data : pd.DataFrame
        A dataframe representing values from a given ``linkid`` group,
        containing the column 'event'.

    Returns
    -------
    pd.Series
        A series containing the click count, view count, and click rate.
        The click rate is calculated as the ratio of the click count to
        the view counts. If the view count is zero, click rate gets set to zero.

    Examples
    --------
    >>> df = pd.DataFrame(
    ...     {'event': ['click', 'pageview', 'click', 'some_other_value'],
    ...      'linkid': [1, 1, 2, 2]}
    ... )
    >>> df.groupby('linkid').apply(get_click_rate).reset_index()
       linkid  click count  view count  click_rate
    0       1          1.0         1.0         1.0
    1       2          1.0         0.0         0.0

    Notes
    -----
    This function returns a pandas Series whether or not the ``linkid``
    group contains any page views. Therefore, if you want only the
    ``linkid`` values that have a nonzero click rate, you can use the following code:

    .. code-block:: python

        (
            df.groupby('linkid')
            .apply(get_click_rate)
            .reset_index()
            .loc[lambda xdf: xdf['click_rate'] > 0, :]
        )

    """

    click_count: int = data[data['event'] == 'click'].shape[0]
    view_count: int = data[data['event'] == 'pageview'].shape[0]
    click_rate: float = 0.0

    # Only compute the click rate when `view_count` is greater than zero.
    if view_count > 0:
        click_rate = round(click_count / view_count, 2)

    return pd.Series({'click count': click_count,
                      'view count': view_count,
                      'click_rate': click_rate})
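
As an aside, the same per-`linkid` counts can probably be obtained without `apply` by cross-tabulating `linkid` against `event`. The sketch below is only one way to do it: the helper name `click_rate_table` is hypothetical (it does not come from the question), and it assumes the same column names and the same "zero views means a click rate of zero" convention as `get_click_rate` above.

```python
import pandas as pd


def click_rate_table(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: click/view counts and click rate per linkid."""
    # Count events per linkid: rows are linkids, columns are event names.
    counts = pd.crosstab(data['linkid'], data['event'])
    # Keep only the two events of interest; add a zero column if one never occurs.
    counts = counts.reindex(columns=['click', 'pageview'], fill_value=0)
    out = counts.rename(columns={'click': 'click count',
                                 'pageview': 'view count'})
    # Mask zero view counts so the division yields NaN instead of inf,
    # then map NaN to 0, following the convention used in get_click_rate.
    views = out['view count'].where(out['view count'] > 0)
    out['click_rate'] = (out['click count'] / views).fillna(0).round(2)
    # Drop the leftover 'event' label on the column axis.
    out.columns.name = None
    return out.reset_index()


df = pd.DataFrame({'event': ['click', 'pageview', 'click', 'some_other_value'],
                   'linkid': [1, 1, 2, 2]})
result = click_rate_table(df)
print(result)
```

Unlike the `apply` version, the counts stay integers here, so no `astype` step is needed afterwards.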

Testing the `get_click_rate` function

import pandas as pd
import numpy as np

# `get_click_rate` is the function defined above.

event_choices = ['click', 'pageview', 'some_other_value']
linkid_choices = ['1', '2', '3']
nrows = 30

# -- Generating a Dummy DataFrame for Testing ------------------------------
df = pd.concat(
    [
        pd.DataFrame(
            {
                'event': np.random.choice(event_choices, nrows),
                'linkid': np.random.choice(linkid_choices, nrows),
            }
        ),
        pd.DataFrame({'event': ['some_other_value'] * 3, 'linkid': '4'})
    ], ignore_index=True
)

(
    # Group dataframe by column `linkid`
    df.groupby('linkid')
    # Apply `get_click_rate`, which returns a pandas Series with three entries
    # ('click count', 'view count' and 'click_rate') for every `linkid` value.
    .apply(get_click_rate)
    .reset_index()
    # Convert the data type of 'click count', and 'view count' column to integers
    .astype({'click count': int, 'view count': int})
    # Keep only the `linkid` values whose click rate is greater than zero.
    .loc[lambda xdf: xdf['click_rate'] > 0, :]
)
# Returns (example output; the values vary between runs because the data is random):
#
#   linkid  click count  view count  click_rate
# 0      1            2           3        0.67
# 1      2            3           3        1.00
# 2      3            3           6        0.50
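
To see where the two approaches from the question can diverge, here is a small deterministic sketch. The data is invented for illustration, and `get_click_rate` is a condensed copy of the function above. The `groupby` selects `[['event']]` explicitly, which is an assumption made here to sidestep the grouping-column deprecation warning in recent pandas versions. The point it tries to show is that the inner merge silently drops every `linkid` that lacks either clicks or page views, while `apply` returns a row for every group.

```python
import pandas as pd


def get_click_rate(data: pd.DataFrame) -> pd.Series:
    # Condensed copy of the zero-safe function from the answer.
    click_count = int((data['event'] == 'click').sum())
    view_count = int((data['event'] == 'pageview').sum())
    click_rate = round(click_count / view_count, 2) if view_count > 0 else 0.0
    return pd.Series({'click count': click_count,
                      'view count': view_count,
                      'click_rate': click_rate})


# Invented data: link 1 has both events, link 2 only clicks,
# link 3 only page views, link 4 neither.
df = pd.DataFrame({
    'linkid': ['1', '1', '1', '2', '2', '3', '4'],
    'event': ['click', 'pageview', 'pageview', 'click',
              'some_other_value', 'pageview', 'some_other_value'],
})

# Approach 1: group first, then compute inside each group.
applied = df.groupby('linkid')[['event']].apply(get_click_rate).reset_index()

# Approach 2 (from the question): filter first, then group and merge.
clicks = (df[df['event'] == 'click']
          .groupby('linkid')['event'].count().reset_index(name='click count'))
views = (df[df['event'] == 'pageview']
         .groupby('linkid')['event'].count().reset_index(name='view count'))
merged = pd.merge(left=clicks, right=views, on='linkid', how='inner')
merged['click rate'] = merged['click count'] / merged['view count']

print(applied)  # one row per linkid: '1', '2', '3', '4'
print(merged)   # only linkid '1' survives the inner merge
```

For the `linkid` values present in both results the rates agree, so a mismatch between the two snippets is most likely to come from groups missing one of the two event types.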

Answered on 2022-12-21
