Select all DataFrame rows whose group has a row count greater than x

u59ebvdq  posted on 2021-07-13  in Java

How do I select all rows that belong to a group with a row count >= 2?
I have the following pandas DataFrame.

import pandas as pd

df = pd.DataFrame({"date": ["2000-01-03", "2000-01-04", "2000-01-04", "2000-01-04", "2000-01-04",
                            "2000-01-03", "2000-01-04", "2000-01-05", "2000-01-05",
                            "2000-01-03", "2000-01-05", "2000-01-05",
                            "2000-01-04", "2000-01-05"],
                   "sym": ["A", "A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "D", "E"],
                   "val1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 2, 2],
                   "val2": [2, 2, 2, 2, 2, 2, 3, 3, 3, 1, 1, 2, 2, 2]})

df

          date sym  val1  val2
0   2000-01-03   A     1     2
1   2000-01-04   A     1     2
2   2000-01-04   A     1     2
3   2000-01-04   A     1     2
4   2000-01-04   A     1     2
5   2000-01-03   B     2     2
6   2000-01-04   B     2     3
7   2000-01-05   B     2     3
8   2000-01-05   B     2     3
9   2000-01-03   C     3     1
10  2000-01-05   C     3     1
11  2000-01-05   C     3     2
12  2000-01-04   D     2     2
13  2000-01-05   E     2     2

I applied

df.groupby(['date', 'sym'], as_index=False).mean().sort_values(['sym','date'])

to average val1 and val2 for each sym on each date.

         date sym  val1  val2
0  2000-01-03   A   1.0   2.0
3  2000-01-04   A   1.0   2.0
1  2000-01-03   B   2.0   2.0
4  2000-01-04   B   2.0   3.0
6  2000-01-05   B   2.0   3.0
2  2000-01-03   C   3.0   1.0
7  2000-01-05   C   3.0   1.5
5  2000-01-04   D   2.0   2.0
8  2000-01-05   E   2.0   2.0

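Next, I need to select all rows for every "sym" that has a row count >= 2. In this example, the resulting df would contain all rows with sym = A, B, or C.
Expected output: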

         date sym  val1  val2
0  2000-01-03   A   1.0   2.0
3  2000-01-04   A   1.0   2.0
1  2000-01-03   B   2.0   2.0
4  2000-01-04   B   2.0   3.0
6  2000-01-05   B   2.0   3.0
2  2000-01-03   C   3.0   1.0
7  2000-01-05   C   3.0   1.5

I tried combinations of groupby, pivot, and count, but had no luck.

mhd8tkvw  #1

See: How to filter a DataFrame based on value counts?

import pandas as pd

df = pd.DataFrame({"date": ["2000-01-03", "2000-01-04",
                            "2000-01-04", "2000-01-04",
                            "2000-01-04", "2000-01-03",
                            "2000-01-04", "2000-01-05",
                            "2000-01-05", "2000-01-03",
                            "2000-01-05", "2000-01-05",
                            "2000-01-04", "2000-01-05"],
                   "sym": ["A", "A", "A", "A", "A", "B",
                           "B", "B", "B", "C", "C", "C",
                           "D", "E"],
                   "val1": [1, 1, 1, 1, 1, 2, 2, 2, 2, 3,
                            3, 3, 2, 2],
                   "val2": [2, 2, 2, 2, 2, 2, 3, 3, 3, 1,
                            1, 2, 2, 2]
                   })

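# Average val1 and val2 for each (date, sym) pair, then order by sym and date.
df = df \
    .groupby(['date', 'sym'], as_index=False) \
    .mean() \
    .sort_values(['sym', 'date'])

# Map each row's sym to the number of rows that sym has, and keep syms with at least 2 rows.
df = df[df['sym'].map(df['sym'].value_counts()) >= 2]
print(df)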

Output:

         date sym  val1  val2
0  2000-01-03   A   1.0   2.0
3  2000-01-04   A   1.0   2.0
1  2000-01-03   B   2.0   2.0
4  2000-01-04   B   2.0   3.0
6  2000-01-05   B   2.0   3.0
2  2000-01-03   C   3.0   1.0
7  2000-01-05   C   3.0   1.5
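
To make the filtering step easier to follow, here is a minimal sketch that splits the one-liner into its intermediate pieces and compares it with a groupby-based variant. The frame `agg` and the variable names are hypothetical stand-ins for the aggregated result above, not part of the original answer, and the `transform("size")` version is offered only as an alternative that should give the same rows.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the aggregated frame shown above.
agg = pd.DataFrame({
    "date": ["2000-01-03", "2000-01-04", "2000-01-03", "2000-01-04"],
    "sym": ["A", "A", "B", "D"],
    "val1": [1.0, 1.0, 2.0, 2.0],
})

# Step 1: count how many rows each sym has (a Series indexed by sym).
counts = agg["sym"].value_counts()

# Step 2: look up that count for every row, giving a Series aligned with agg.
rows_per_sym = agg["sym"].map(counts)

# Step 3: keep only the rows whose sym appears at least twice.
via_map = agg[rows_per_sym >= 2]

# Alternative: let groupby compute the per-row group size directly.
via_transform = agg[agg.groupby("sym")["sym"].transform("size") >= 2]

print(via_map)
```

Both variants build a boolean mask aligned with the original index, so the surviving rows keep their original index labels. Changing the threshold `2` to another value selects groups with a different minimum row count.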
