Pandas / Matplotlib bar chart with a MultiIndex DataFrame

hwamh0ep, posted 2023-01-15 in Other

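I have a sorted MultiIndex pandas DataFrame that I need to show in a bar chart. (Linked image: My data frame)
Maybe I just haven't found the solution, or maybe no simple one exists, but I need to draw a bar chart of this data with Content/Category on the x-axis and Installs as the bar height.
In short, I need to show what each bar is made of, e.g. 20% from Everyone, 40% from Teen, and so on. I'm not sure this is even possible: taking a mean of means is not valid because the sample sizes differ, so I created an Uploads column to compute it properly, but I haven't got as far as plotting by the mean.
I think a cumulative plot would give the wrong result.
I need a bar chart whose x ticks are Category (preferably the top 10), where each x tick has one bar per Content (not always three; it can be just "Everyone" and "Teen"), and each bar's height is Installs.
Ideally it should look like this: (linked image: Bar Chart)
but with a bar for each Content within that particular Category.
I tried flattening with DataFrame.unstack(), but that destroyed the DataFrame's sort order, so I used Cat2 = Cat1.reset_index(level = [0,1]) instead. I still need help with the plotting.
So far I have: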

Cat = Popular.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum"})
Uploads = Popular[["Category","Content"]].value_counts().rename_axis(["Category","Content"]).reset_index(name = "Uploads")
Cat = pd.merge(Cat, Uploads, on = ["Category","Content"])
Cat = Cat.groupby(["Category","Content"]).agg({"Installs": "sum", "Rating Count": "sum", "Uploads": "sum"})

This gives:
(linked image: result)
Then I sort it like this:

Cat1 = Cat.unstack() 
Cat1 = Cat1.sort_index(key = (Cat1["Installs"].sum(axis = 1)/Cat1["Uploads"].sum(axis = 1)).get, ascending = False).stack()
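The sort key above ranks each Category by total Installs divided by total Uploads, i.e. a pooled mean rather than a mean of per-group means. Below is a minimal sketch of the same idea without the unstack/stack round trip. The `toy` frame and its numbers are made up for illustration and are not taken from the Kaggle data.

```python
import pandas as pd

# Hypothetical toy data standing in for the aggregated `Cat` frame.
toy = pd.DataFrame(
    {
        "Category": ["Arcade", "Arcade", "Social", "Social"],
        "Content": ["Everyone", "Teen", "Everyone", "Teen"],
        "Installs": [900, 100, 300, 300],
        "Uploads": [9, 1, 2, 2],
    }
).set_index(["Category", "Content"])

# Pooled installs per upload for each Category: sum first, then divide.
totals = toy.groupby(level="Category")[["Installs", "Uploads"]].sum()
per_upload = (totals["Installs"] / totals["Uploads"]).sort_values(ascending=False)

# Rebuild the frame category by category in ranked order,
# keeping both index levels (drop_level=False).
ranked = pd.concat(
    [toy.xs(cat, level="Category", drop_level=False) for cat in per_upload.index]
)
```

Here `ranked` keeps the (Category, Content) MultiIndex, so it can be plotted or unstacked afterwards.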

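(thanks to one of those solutions)
That's all I have.
The data set is from Kaggle and is over 600 MB. I don't expect anyone to download it, but I would at least appreciate a simple guide to a solution.
P.S. This should also help me split each point in a scatter plot the same way, but if not, that's fine.
Also, I don't have enough reputation to post images, so apologies for the links.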

Answer 1 (ig9co6j1)

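Edit: added code to compute each Category's percentage of Installs.

The data set is large, but you should provide mock data so the example is easy to reproduce, like this: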

import pandas as pd
import numpy as np

categories = ["Productivity", "Arcade", "Business", "Social"]
contents = ["Everyone", "Matute", "Teen"]

index = pd.MultiIndex.from_product(
    [categories, contents], names=["Category", "Content"]
)
installs = np.random.randint(low=100, high=999, size=len(index))

df = pd.DataFrame({"Installs": installs}, index=index)
>>> df

                       Installs
Category     Content
Productivity Everyone       149
             Matute         564
             Teen           301
Arcade       Everyone       926
             Matute         542
             Teen           556
Business     Everyone       879
             Matute         921
             Teen           323
Social       Everyone       329
             Matute         320
             Teen           426

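To compute each Category's percentage of Installs, use groupby().apply(). (The Installs values differ between the outputs shown here, presumably because the unseeded random mock data was regenerated between runs.)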

>>> df["Installs (%)"] = (
...     df["Installs"]
...     .groupby(by="Category", group_keys=False)
...     .apply(lambda df: df / df.sum() * 100)
... )
>>> df

                       Installs  Installs (%)
Category     Content
Productivity Everyone       513     22.246314
             Matute         839     36.383348
             Teen           954     41.370338
Arcade       Everyone       122     10.581093
             Matute         519     45.013010
             Teen           512     44.405898
Business     Everyone       412     31.164902
             Matute         698     52.798790
             Teen           212     16.036309
Social       Everyone       874     52.555622
             Matute         326     19.603127
             Teen           463     27.841251
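The same percentages can also be computed without apply(), by broadcasting each Category's total back onto its rows with transform("sum"). This is a sketch on freshly generated mock data. The fixed seed and the `category_total` name are my own additions, not part of the answer.

```python
import numpy as np
import pandas as pd

# Mock data with the same shape as the example above (values are random).
categories = ["Productivity", "Arcade", "Business", "Social"]
contents = ["Everyone", "Mature", "Teen"]
index = pd.MultiIndex.from_product(
    [categories, contents], names=["Category", "Content"]
)
rng = np.random.default_rng(0)  # fixed seed so reruns give the same numbers
df = pd.DataFrame({"Installs": rng.integers(100, 999, size=len(index))}, index=index)

# Broadcast each Category's total back onto its own rows, then divide.
category_total = df.groupby(level="Category")["Installs"].transform("sum")
df["Installs (%)"] = df["Installs"] / category_total * 100
```

Because transform returns a Series aligned to the original index, no group_keys handling is needed.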

Then you only need to call .unstack() once:

>>> df = df.unstack()
>>> df

             Installs             Installs (%)
Content      Everyone Matute Teen     Everyone     Matute       Teen
Category
Arcade            499    904  645    24.365234  44.140625  31.494141
Business          856    819  438    40.511122  38.760057  20.728822
Productivity      705    815  657    32.384015  37.436840  30.179146
Social            416    482  238    36.619718  42.429577  20.950704
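Note that the unstacked frame above comes out with Category in alphabetical order, which is the reordering the question complains about. Here is a sketch of one way to re-impose an order afterwards and keep only the top N categories (10 in the question, 2 here). It assumes the ranking is by total Installs; the data and the names `wide`, `order` and `top` are illustrative. One Category deliberately lacks some Content values, to show that missing combinations can be filled with 0.

```python
import pandas as pd

# Hypothetical data: non-alphabetical Category order, uneven Content coverage.
index = pd.MultiIndex.from_tuples(
    [
        ("Productivity", "Everyone"),
        ("Productivity", "Teen"),
        ("Arcade", "Everyone"),
        ("Arcade", "Mature"),
        ("Arcade", "Teen"),
        ("Business", "Everyone"),
    ],
    names=["Category", "Content"],
)
df = pd.DataFrame({"Installs": [10, 20, 300, 100, 400, 50]}, index=index)

# Rows become Category, columns become Content; absent pairs are filled with 0.
wide = df["Installs"].unstack(fill_value=0)

# Rank categories by total installs and keep the top N in that order.
n_top = 2
order = wide.sum(axis=1).sort_values(ascending=False).index[:n_top]
top = wide.loc[order]
```

Calling top.plot(kind="bar") then draws the bars in ranked order rather than alphabetically.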

Then plot bar charts of the features you want:

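import matplotlib.pyplot as plt

# Two panels side by side: raw installs (left) and percentages (right)
fig, (ax, ax_percent) = plt.subplots(ncols=2, figsize=(14, 5))

# rot=0 keeps the Category tick labels horizontal
df["Installs"].plot(kind="bar", rot=0, ax=ax)
ax.set_ylabel("Installs")

df["Installs (%)"].plot(kind="bar", rot=0, ax=ax_percent)
ax_percent.set_ylabel("Installs (%)")
ax_percent.set_ylim([0, 100])

plt.show()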
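If the goal is to show what each bar is made of (the "20% Everyone, 40% Teen" example in the question), a stacked bar of row-normalised values is one option. The following is a sketch on hypothetical counts. The `wide` and `share` names and the Agg backend line are assumptions made for a self-contained, headless example.

```python
import matplotlib

matplotlib.use("Agg")  # assumption: headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical install counts: rows are Category, columns are Content.
wide = pd.DataFrame(
    {"Everyone": [20, 50], "Mature": [40, 25], "Teen": [40, 25]},
    index=pd.Index(["Arcade", "Social"], name="Category"),
)

# Normalise each row so the segments of every bar add up to 100 %.
share = wide.div(wide.sum(axis=1), axis=0) * 100

# One bar per Category, split into one segment per Content.
ax = share.plot(kind="bar", stacked=True, rot=0)
ax.set_ylabel("Installs (%)")
ax.set_ylim(0, 100)
```

Each bar then has the same total height, so the chart compares composition only; plot `wide` itself with stacked=True to keep absolute heights.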

Answer 2 (w6lpcovy)

ChatGPT answered my question:

import pandas as pd
import matplotlib.pyplot as plt

# create a dictionary of data for the DataFrame
data = {
    'app_name': ['Google Maps', 'Uber', 'Waze', 'Spotify', 'Pandora'],
    'category': ['Navigation', 'Transportation', 'Navigation', 'Music', 'Music'],
    'rating': [4.5, 4.0, 4.5, 4.5, 4.0],
    'reviews': [1000000, 50000, 100000, 500000, 250000]
}

# create the DataFrame
df = pd.DataFrame(data)

# set the 'app_name' and 'category' columns as the index
df = df.set_index(['app_name', 'category'])

# add a new column called "content_rating" to the DataFrame, and assign a content rating to each app
df['content_rating'] = ['Everyone', 'Teen', 'Everyone', 'Everyone', 'Teen']

# Grouping the Data by category and content_rating and getting the mean of reviews
df_grouped = df.groupby(['category','content_rating']).agg({'reviews':'mean'})

# Reset the index to make it easier to plot
df_grouped = df_grouped.reset_index()

# Plotting the stacked bar chart
df_grouped.pivot(index='category', columns='content_rating', values='reviews').plot(kind='bar', stacked=True)

That is a sample data set.
What I did with my own data was add a sum column and sort by that sum.

piv = qw1.reset_index()
piv = piv.pivot_table(index='Category', columns='Content', values='per')#.plot(kind='bar', stacked = True)
piv["Sum"] = piv.sum(axis=1)
piv_10 = piv.sort_values(by = "Sum", ascending = False)[["Adult", "Everyone", "Mature", "Teen"]].head(10)

where qw1 is the MultiIndex DataFrame.
All that is left is to plot it:

piv_10.plot.bar(stacked = True, logy = False)
