Python: applying a function to each groupby object

Posted by wbrvyc0a on 2022-12-21 in Python

I am trying to apply a function to each segment/partition produced by a groupby operation.

def get_click_rate(data):
    click_count = data[data['event'] == 'click'].shape[0]
    view_count = data[data['event'] == 'pageview'].shape[0]
    return click_count / view_count

data.groupby('linkid').apply(get_click_rate).reset_index(name='click rate')

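The idea here is that I group the DataFrame by each web page's linkid and then pass every partition to a function that filters the sub-DataFrame, computes a number, and returns it. However, it returns the wrong numbers. Below is the snippet that returns the correct numbers: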

click_event = data[data['event'] == 'click'].groupby('linkid')['event'].count().reset_index(name='click count')
view_event = data[data['event'] == 'pageview'].groupby('linkid')['event'].count().reset_index(name='view count')
merged_df = pd.merge(left=click_event, right=view_event, on='linkid', how='inner')
merged_df['click rate'] = merged_df['click count'] / merged_df['view count']

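To me, at least, the two snippets do the same thing in a different order: the second one filters the data first, groups it, and then merges the sub-DataFrames to arrive at the desired numbers.
Can anyone tell me what I am missing?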

c6ubokkw1#

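I tried the get_click_rate function here, and it appears to return the same results as the second approach you wrote. The only problem I ran into was computing the click rate for a linkid group that has no 'pageview' events. I therefore made a few small changes to the get_click_rate function: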

import pandas as pd

def get_click_rate(data: pd.DataFrame) -> pd.Series:
    """Calculate the click rate for a given ``linkid`` group.

    Parameters
    ----------
    data : pd.DataFrame
        A dataframe representing values from a given ``linkid`` group,
        containing the column 'event'.

    Returns
    -------
    pd.Series
        A series containing the click count, view count, and click rate.
        The click rate is calculated as the ratio of the click count to
        the view count. If the view count is zero, the click rate is set to zero.

    Examples
    --------
    >>> df = pd.DataFrame(
    ...     {'event': ['click', 'pageview', 'click', 'some_other_value'],
    ...      'linkid': [1, 1, 2, 2]}
    ... )
    >>> df.groupby('linkid').apply(get_click_rate).reset_index()
       linkid  click count  view count  click_rate
    0       1          1.0         1.0         1.0
    1       2          1.0         0.0         0.0

    Notes
    -----
    This function returns a pandas Series whether or not the ``linkid``
    group contains any page views. Therefore, if you want only the
    ``linkid`` values with a nonzero click rate, you can use the following code:

    .. code-block:: python

        (
            df.groupby('linkid')
            .apply(get_click_rate)
            .reset_index()
            .loc[lambda xdf: xdf['click_rate'] > 0, :]
        )

    """

    click_count: int = data[data['event'] == 'click'].shape[0]
    view_count: int = data[data['event'] == 'pageview'].shape[0]
    click_rate: float = 0

    # Only compute the click rate when `view_count` is greater than zero.
    if view_count > 0:
        click_rate = round(click_count / view_count, 2)

    return pd.Series({'click count': click_count,
                      'view count': view_count,
                      'click_rate': click_rate})
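
To make the difference between the two approaches concrete, here is a minimal sketch on a small, made-up dataset (the values and the helper name `click_rate` are assumptions for illustration, not taken from the question). It suggests that both approaches agree on every linkid that has both event types, and that they differ only in which rows they return: the inner merge silently drops a linkid with no 'pageview' rows, while the groupby/apply version keeps it, provided the division by zero is guarded.

```python
import pandas as pd

# Hypothetical sample data: linkid 3 has a click but no pageview.
data = pd.DataFrame({
    'linkid': [1, 1, 1, 2, 2, 3],
    'event': ['click', 'pageview', 'pageview', 'click', 'pageview', 'click'],
})


def click_rate(events: pd.Series) -> float:
    """Clicks divided by page views for one group; 0.0 if there are no views."""
    clicks = int((events == 'click').sum())
    views = int((events == 'pageview').sum())
    return clicks / views if views > 0 else 0.0


# Approach 1: group first, then compute the rate inside each group.
by_apply = data.groupby('linkid')['event'].apply(click_rate)

# Approach 2: filter first, count per group, then inner-merge the counts.
clicks = (data[data['event'] == 'click']
          .groupby('linkid')['event'].count()
          .reset_index(name='click count'))
views = (data[data['event'] == 'pageview']
         .groupby('linkid')['event'].count()
         .reset_index(name='view count'))
merged = pd.merge(left=clicks, right=views, on='linkid', how='inner')
merged['click rate'] = merged['click count'] / merged['view count']
by_merge = merged.set_index('linkid')['click rate']

# The approaches agree wherever both produce a row ...
shared = by_apply.loc[by_merge.index]
print((shared.values == by_merge.values).all())

# ... but only the apply version has a row for linkid 3.
print(sorted(set(by_apply.index) - set(by_merge.index)))
```

Without the `views > 0` guard, the plain `click_count / view_count` from the question divides two Python integers, so a group without page views would raise `ZeroDivisionError` instead of returning a number.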

Testing the get_click_rate function

import pandas as pd
import numpy as np

# `get_click_rate` is the function defined in the previous code block.

event_choices = ['click', 'pageview', 'some_other_value']
linkid_choices = ['1', '2', '3']
nrows = 30

# -- Generating a Dummy DataFrame for Testing ------------------------------
df = pd.concat(
    [
        pd.DataFrame(
            {
                'event': np.random.choice(event_choices, nrows),
                'linkid': np.random.choice(linkid_choices, nrows),
            }
        ),
        pd.DataFrame({'event': ['some_other_value'] * 3, 'linkid': '4'})
    ], ignore_index=True
)

(
    # Group dataframe by column `linkid`
    df.groupby('linkid')
    # Apply `get_click_rate`, which returns a pandas Series with three entries
    # ('click count', 'view count' and 'click_rate') for every 'linkid' value.
    .apply(get_click_rate)
    .reset_index()
    # Convert the 'click count' and 'view count' columns to integers
    .astype({'click count': int, 'view count': int})
    # Keep only the `linkid` values that have a click rate greater than zero.
    .loc[lambda xdf: xdf['click_rate'] > 0, :]
)
# Returns (exact values vary between runs, because the data is random and unseeded):
#
#   linkid  click count  view count  click_rate
# 0      1            2           3        0.67
# 1      2            3           3        1.00
# 2      3            3           6        0.50
