在pandas中的pivot表上放置值过滤器

5sxhfpxr  于 2023-04-19  发布在  其他
关注(0)|答案(1)|浏览(103)

我试图在pandas中复制一些excel功能,我们有一个巨大的数据透视表,我们在上面做了许多非常慢的操作。
这就是我正在努力做的

import pandas as pd

Data = [["NonLin1", "NestleBig-100", "daily", "solved", "NestleBig", "v10_10", 435, 1.4],
["NonLin1", "NestleBig-100", "daily", "solved", "NestleBig", "v10_11", 743, 1.3],
["NonLin1", "NestleBig-101", "daily", "solved", "NestleBig", "v10_10", 542, 1.5],
["NonLin1", "NestleBig-101", "daily", "solved", "NestleBig", "v10_11", 324, 1.2],
["NonLin1", "NestleBig-102", "daily", "solved", "NestleBig", "v10_10", 243, 1.8],
["NonLin1", "NestleBig-102", "daily", "solved", "NestleBig", "v10_11", 444, 1.2],
["NonLin2", "NestleSmall-100", "daily", "solved", "NestleBig", "v10_10", 655, 1.0],
["NonLin2", "NestleSmall-100", "daily", "solved", "NestleBig", "v10_11", 252, 1.3],
["NonLin2", "NestleSmall-101", "daily", "solved", "NestleBig", "v10_10", 435, 1.1],
["NonLin2", "NestleSmall-101", "daily", "solved", "NestleBig", "v10_11", 542, 1.3],
["NonLin2", "NestleSmall-102", "daily", "solved", "NestleBig", "v10_10", 645, 1.5],
["NonLin2", "NestleSmall-102", "daily", "solved", "NestleBig", "v10_11", 435, 1.1],
["NonLin3", "NestleBig-100", "daily", "solved", "NestleBig", "v10_10", 653, 1.2],
["NonLin3", "NestleBig-100", "daily", "solved", "NestleBig", "v10_11", 435, 1.4],
["NonLin3", "NestleBig-101", "daily", "unsolved", "NestleBig", "v10_10", 875, 1.4],
["NonLin3", "NestleBig-101", "daily", "solved", "NestleBig", "v10_11", 214, 1.5],
["NonLin3", "NestleBig-102", "daily", "solved", "NestleBig", "v10_10", 890, 1.2],
["NonLin3", "NestleBig-102", "daily", "unsolved", "NestleBig", "v10_11", 432, 1.5]]

df = pd.DataFrame(Data, columns = ["ProblemClass", "inputID", "profile", "Status", "TestID","runID", "NofIters", "time"])

pivottab = pd.pivot_table(data=df, index=["ProblemClass", "inputID",  "Status", "TestID","runID"], values=['time'], aggfunc={'count'})

pivottab.columns = list("#")  #Can we rename this column white creating the pivot table itself?

数据透视表本身很好,看起来是这样的

但我想突出显示v10_11和v10_10为Status.产生不同结果的行,如进一步过滤数据透视表以仅查看状态不同的那些行。在excel数据透视表中,我们可以在Status上设置一个值过滤器,然后选择value filter on(#)!= 2但是我想知道如何进一步过滤这个数据透视表。我想在这个数据透视表上做操作,因为它会更容易,我想从数据透视表中排除这些,以便进一步分析。但任何其他选择都是受欢迎的。

cyvaqqii

cyvaqqii1#

要获取包含(或不包含)所选数据的数据透视表,您可以先预过滤 Dataframe ,然后再创建数据透视表。一种可能的方法是:

  • 使用duplicated方法,从原始表中删除所有数据,其中沿着此列集没有重复的行:['ProblemClass', 'inputID', 'TestID', 'Status']
  • 从预过滤数据创建透视表
    示例
  • 没有突出显示数据的数据透视表
pre_filtered_data = df[df.duplicated(subset=['ProblemClass', 'inputID', 'TestID', 'Status'], keep=False)]
pivottab = pd.pivot_table(data=pre_filtered_data, index=["ProblemClass", "inputID",  "Status", "TestID", "runID"], values=['time'])

输出:

time
ProblemClass inputID         Status TestID    runID       
NonLin1      NestleBig-100   solved NestleBig v10_10   1.4
                                              v10_11   1.3
             NestleBig-101   solved NestleBig v10_10   1.5
                                              v10_11   1.2
             NestleBig-102   solved NestleBig v10_10   1.8
                                              v10_11   1.2
NonLin2      NestleSmall-100 solved NestleBig v10_10   1.0
                                              v10_11   1.3
             NestleSmall-101 solved NestleBig v10_10   1.1
                                              v10_11   1.3
             NestleSmall-102 solved NestleBig v10_10   1.5
                                              v10_11   1.1
NonLin3      NestleBig-100   solved NestleBig v10_10   1.2
                                              v10_11   1.4
  • 带有高亮显示数据的数据透视表(只需使用~恢复过滤条件)
pre_filtered_data = df[~df.duplicated(subset=['ProblemClass', 'inputID', 'TestID', 'Status'], keep=False)]
pivottab = pd.pivot_table(data=pre_filtered_data, index=["ProblemClass", "inputID",  "Status", "TestID", "runID"], values=['time'])

输出

time
ProblemClass inputID       Status   TestID    runID       
NonLin3      NestleBig-101 solved   NestleBig v10_11   1.5
                           unsolved NestleBig v10_10   1.4
             NestleBig-102 solved   NestleBig v10_10   1.2
                           unsolved NestleBig v10_11   1.5

解释

  • df.duplicated(subset=['ProblemClass', 'inputID', 'TestID', 'Status'], keep=False)用于查找给定列中重复的所有行。对于您的数据,只有['ProblemClass', 'inputID', 'TestID']的值相同但'Status'的值不同的行不会具有此类重复
  • 上面的表达式用作掩码,以移除在这方面唯一的行,这意味着'Status'列获得了多个结果

相关问题