选择pandas框架中的前n个组

v440hwme  于 2023-11-15  发布在  其他
关注(0)|答案(2)|浏览(112)

我有以下的框架:

Country Crop    Harvest Year    Area (ha)
Afghanistan Maize   2019    94910
Afghanistan Maize   2020    140498
Afghanistan Maize   2021    92144
Afghanistan Winter Wheat    2019    2334000
Afghanistan Winter Wheat    2020    2668000
Afghanistan Winter Wheat    2021    1833357
Argentina   Maize   2019    7232761
Argentina   Maize   2020    7730506
Argentina   Maize   2021    8146596
Argentina   Winter Wheat    2019    6050953
Argentina   Winter Wheat    2020    6729838
Argentina   Winter Wheat    2021    6394102
China   Maize   2019    41309740
China   Maize   2020    41292000
China   Maize   2021    43355859
China   Winter Wheat    2019    23732560
China   Winter Wheat    2020    23383000
China   Winter Wheat    2021    23571400
Ethiopia    Maize   2019    2274306
Ethiopia    Maize   2020    2363507
Ethiopia    Maize   2021    2530000
Ethiopia    Winter Wheat    2019    1789372
Ethiopia    Winter Wheat    2020    1829051
Ethiopia    Winter Wheat    2021    1950000
France  Maize   2019    1506100
France  Maize   2020    1691130
France  Maize   2021    1549520
France  Winter Wheat    2019    5244250
France  Winter Wheat    2020    4512420
France  Winter Wheat    2021    5276730
India   Maize   2019    9027130
India   Maize   2020    9569060
India   Maize   2021    9860000
India   Winter Wheat    2019    29318780
India   Winter Wheat    2020    31357020
India   Winter Wheat    2021    31610000
Namibia Maize   2019    21123
Namibia Maize   2020    35000
Namibia Maize   2021    46070
Namibia Winter Wheat    2019    1079
Namibia Winter Wheat    2020    2000
Namibia Winter Wheat    2021    3026

字符串
我想选择前2个国家的平均值Area (ha)列在整个收获年的.我尝试了这一点,但它不工作:df = df.groupby("Crop", dropna=False).apply( lambda x: x.nlargest(2, "Area (ha)") )个 产量应该是,这里中国和印度是玉米和冬小麦平均Area (ha)`最大的国家,但在完整的数据集中,不同国家的不同作物的平均值最大:

Country Crop    Harvest Year    Area (ha)
China   Maize   2019    41309740
China   Maize   2020    41292000
China   Maize   2021    43355859
China   Winter Wheat    2019    23732560
China   Winter Wheat    2020    23383000
China   Winter Wheat    2021    23571400
India   Maize   2019    9027130
India   Maize   2020    9569060
India   Maize   2021    9860000
India   Winter Wheat    2019    29318780
India   Winter Wheat    2020    31357020
India   Winter Wheat    2021    31610000

gblwokeq

gblwokeq1#

IIUC,你可以做两个.groupby

x = (
    df.groupby("Crop")
    .apply(lambda x: x.groupby("Country")["Area (ha)"].mean())
    .stack()
    .groupby(level=0, group_keys=False)
    .nlargest(2)
)

print(x)

字符串
按平均面积打印前2名作物/国家:

Crop          Country
Maize         China      4.198587e+07
              India      9.485397e+06
Winter Wheat  India      3.076193e+07
              China      2.356232e+07
dtype: float64


然后你可以使用这个索引来过滤原始的框架:

out = df.set_index(["Crop", "Country"]).loc[x.index].reset_index()
print(out)


打印:

Crop Country  Harvest Year  Area (ha)
0          Maize   China          2019   41309740
1          Maize   China          2020   41292000
2          Maize   China          2021   43355859
3          Maize   India          2019    9027130
4          Maize   India          2020    9569060
5          Maize   India          2021    9860000
6   Winter Wheat   India          2019   29318780
7   Winter Wheat   India          2020   31357020
8   Winter Wheat   India          2021   31610000
9   Winter Wheat   China          2019   23732560
10  Winter Wheat   China          2020   23383000
11  Winter Wheat   China          2021   23571400

2w2cym1i

2w2cym1i2#

您在分组中混淆了CropHarvest Year。首先,将数据限制为特定年份的所有项目,然后保留n个最大的项目:

import pandas as pd

df = pd.DataFrame([
    ["Califlower", 2019, 1],
    ["Sunflower", 2020, 2],
    ["Maize", 2020, 3],
    ["Carrot", 2020, 4],
    ["Beet", 2019, 5],
], columns=["Country", "Harvest Year", "Area"])

df.groupby("Harvest Year").apply( lambda x: x.nlargest(2, "Area") )

字符串
结果如预期


的数据

相关问题