pandas 动态for循环滤波器框架

qfe3c7zg  于 2023-10-14  发布在  其他
关注(0)|答案(2)|浏览(99)

我正在寻找一种方法来过滤数据,而无需手动添加for循环。
我有下面的代码,而我只想指定我希望数据被过滤的列(在一个名为'col_list'的变量中)。有没有一种更简单、更通用的方法来实现这一点?

import pandas as pd

df = pd.DataFrame({
    'Col1':['A','A','A','B','B','B'],
    'Col2':['AA','AB','AC','BA','BB','BC'],
    'Val':[1,2,3,1,2,3]
})

# Insert all Cols that should be looped through
col_list = ['Col1','Col2']

# Loop through Col1
for i in ['total'] + list(df[col_list[0]].unique()):
    d1 = df[(df[col_list[0]] == i)] if i != 'total' else df.copy()
    
    # Loop through Col2
    for j in ['total'] + list(d1[col_list[1]].unique()):
        d2 = d1[(d1[col_list[1]] == j)]  if j != 'total' else d1.copy()
        print(f"{i} | {j} | {sum(d2['Val'])}")

输出量:

total | total | 12
total | AA | 1
total | AB | 2
total | AC | 3
total | BA | 1
total | BB | 2
total | BC | 3
A | total | 6
A | AA | 1
A | AB | 2
A | AC | 3
B | total | 6
B | BA | 1
B | BB | 2
B | BC | 3

所需方法:

col_list = ['Col1','Col2']

get_data(df,col_list):
    '''Formula to return the data'''
f87krz0w

f87krz0w1#

是否确定

import pandas as pd

def get_data(df, col_list):
    filtered_df = df.copy()
    for col in col_list:
        unique_values = df[col].unique()
        if 'total' in unique_values:
            filtered_df = filtered_df[(filtered_df[col] == 'total') | (filtered_df[col].isin(unique_values))]
        else:
            filtered_df = filtered_df[filtered_df[col].isin(unique_values)]
    return filtered_df

df = pd.DataFrame({
    'Col1':['A','A','A','B','B','B'],
    'Col2':['AA','AB','AC','BA','BB','BC'],
    'Val':[1,2,3,1,2,3]
})

col_list = ['Col1','Col2']
get_data(df, col_list)
h9a6wy2h

h9a6wy2h2#

根据代码中的principal,您可以计算两个(动态指定的)列中每个元素组合的值之和('Val')。
下面的代码重新生成了您的principal,而没有对列进行硬编码,并将结果收集在单个principal中:

import pandas as pd
import numpy as np

def get_data(df, col_list):
    '''Operate sum() over all single and two-column groupings'''

    # 1. Grand total
    Total = pd.DataFrame({'group_1' : 'Total',
                          'Sum'     : [df['Val'].sum()]})
    
    # 2. Sums by one value
    Sums_by_1_col = pd.concat([df[[c, 'Val']].groupby(by=c).sum()
                                             .reset_index()
                                             .set_axis(['group_1','Sum'], axis=1) 
                               for c in col_list], axis=0)
    
    # 3.Sums grouping by values in 2 columns at a time
    Sums_by_2_col = df.groupby(by=col_list).sum(
                                          ).reset_index(drop=False
                                          ).set_axis(['group_1','group_2','Sum'], axis=1)
    
    # Assemble 1, 2 and 3
    return pd.concat([Sums_by_2_col,
                      Sums_by_1_col,
                      Total         ], axis=0  
                    ).replace([np.nan], [None] # pd.concat places NaN by default. None better fits a non numerical column.
                    ).reset_index(drop=True)

col_list = ['Col1','Col2']

get_data(df, col_list)

输出量:

group_1 group_2  Sum
0        A      AA    1
1        A      AB    2
2        A      AC    3
3        B      BA    1
4        B      BB    2
5        B      BC    3
6        A    None    6
7        B    None    6
8       AA    None    1
9       AB    None    2
10      AC    None    3
11      BA    None    1
12      BB    None    2
13      BC    None    3
14   Total    None   12

在您的问题中,您特别询问了动态迭代列,但根本没有提到sum函数。自然地,可以实现另一聚合函数。
引用

相关问题