Sort an entire CSV by frequency of occurrence in one column

Posted by gg0vcinb 12 months ago in Other

I have a large CSV file that is a log of caller data.
A short excerpt of my file:

CompanyName    High Priority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User

I want to sort the entire list by how frequently each customer occurs, so that it looks like this:

CompanyName    High Priority     QualityIssue
Customer3         No               Equipment
Customer3         No               User
Customer3         Yes              User
Customer3         Yes              Equipment
Customer1         Yes              User
Customer1         Yes              User
Customer1         No               Neither
Customer2         No               User
Customer4         No               User

I tried groupby, but it only prints the company name and frequency, not the other columns. I also tried

df['Totals']= [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

df = [sum(df['CompanyName'] == df['CompanyName'][i]) for i in xrange(len(df))]

But these give me the error:
ValueError: Wrong number of items passed 1, indices imply 24
I have also seen something like this:

for key, value in sorted(mydict.iteritems(), key=lambda (k,v): (v,k)):
    print "%s: %s" % (key, value)

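But this only prints two columns, and I want to sort the entire CSV. My output should be the whole CSV, sorted by the first column.
Thanks in advance for your help!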


woobm2wo #1

This seems to do what you want: add a count column by performing a groupby followed by a transform with value_counts, and then sort on that column:

df['count'] = df.groupby('CompanyName')['CompanyName'].transform(pd.Series.value_counts)
df.sort_values('count', ascending=False)

Output:

  CompanyName HighPriority QualityIssue count
5   Customer3           No         User     4
3   Customer3           No    Equipment     4
7   Customer3          Yes    Equipment     4
6   Customer3          Yes         User     4
0   Customer1          Yes         User     3
4   Customer1           No      Neither     3
1   Customer1          Yes         User     3
8   Customer4           No         User     1
2   Customer2           No         User     1

You can remove the extraneous column with df.drop:

df.drop('count', axis=1)

Output:

  CompanyName HighPriority QualityIssue
5   Customer3           No         User
3   Customer3           No    Equipment
7   Customer3          Yes    Equipment
6   Customer3          Yes         User
0   Customer1          Yes         User
4   Customer1           No      Neither
1   Customer1          Yes         User
8   Customer4           No         User
2   Customer2           No         User
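Note that sort_values and drop both return new DataFrames instead of modifying df, so the result has to be assigned to keep it. The following is a minimal sketch of the same idea as one chained expression. It assumes transform("count") in place of value_counts (a swapped-in choice, since the value_counts form is reported later in this thread to fail on newer pandas), and the variable name result is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "CompanyName": ["Customer1", "Customer1", "Customer2", "Customer3",
                    "Customer1", "Customer3", "Customer3", "Customer3",
                    "Customer4"],
    "HighPriority": ["Yes", "Yes", "No", "No", "No", "No", "Yes", "Yes", "No"],
    "QualityIssue": ["User", "User", "User", "Equipment", "Neither",
                     "User", "User", "Equipment", "User"],
})

# Number of rows sharing this row's CompanyName, broadcast to every row.
# Assumption: "count" is used here instead of pd.Series.value_counts.
df["count"] = df.groupby("CompanyName")["CompanyName"].transform("count")

# Both calls return new objects, so the chained result is assigned.
# A stable sort keeps rows with equal counts in their original relative order.
result = (
    df.sort_values("count", ascending=False, kind="mergesort")
      .drop(columns="count")
)
print(result)
```

Here df itself still holds the count column and its original row order; only result is sorted.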

wj8zmpe1 #2

The top-voted answer needs a small addition: sort has been deprecated in favor of sort_values and sort_index.
sort_values works like this:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1], 'b': [1, 2, 3]})
df['count'] = \
    df.groupby('a')['a']\
    .transform(pd.Series.value_counts)
df.sort_values('count', inplace=True, ascending=False)
print('df sorted: \n{}'.format(df))

df sorted:
   a  b  count
0  1  1      2
2  1  3      2
1  2  2      1
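To illustrate the two replacements for sort named above, here is a small sketch that uses the same toy frame. The variable names by_value and by_index are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 1], "b": [1, 2, 3]})

# sort_values orders rows by the contents of one or more columns.
by_value = df.sort_values("b", ascending=False)

# sort_index orders rows by their index labels, which here
# restores the original row order.
by_index = by_value.sort_index()
```

Neither call changes df unless inplace=True is passed.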

cygmwpex #3

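Update 2021

The answers proposed by EdChum and Ilya K. no longer work.
The function pd.Series.value_counts returns a Series containing the counts of unique values. However, the Series we apply pd.Series.value_counts to contains only one unique value, because we previously applied groupby to the DataFrame and split the CompanyName Series into groups of unique values. So the output after applying the function to one group looks like this.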

Customer3        4
dtype: int64

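This makes no sense for transform: we cannot turn a single value of a Series into an entire Series. We need just the integer 4, not the whole Series.
However, we can take advantage of the preceding groupby: count the number of values in each group, transform the whole group into that number, and put the results together to form the final Frequency Series.
We can replace pd.Series.value_counts with pd.Series.count, or simply pass the function name count as a string: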

import pandas as pd

df = pd.DataFrame({'CompanyName': {0: 'Customer1', 1: 'Customer1', 2: 'Customer2', 3: 'Customer3', 4: 'Customer1', 5: 'Customer3', 6: 'Customer3', 7: 'Customer3', 8: 'Customer4'}, 'HighPriority': {0: 'Yes', 1: 'Yes', 2: 'No', 3: 'No', 4: 'No', 5: 'No', 6: 'Yes', 7: 'Yes', 8: 'No'}, 'QualityIssue': {0: 'User', 1: 'User', 2: 'User', 3: 'Equipment', 4: 'Neither', 5: 'User', 6: 'User', 7: 'Equipment', 8: 'User'}})

df['Frequency'] = df.groupby('CompanyName')['CompanyName'].transform('count')
df.sort_values(['Frequency', 'CompanyName'], inplace=True, ascending=[False, True])

Output

>>> df

  CompanyName HighPriority QualityIssue  Frequency
3   Customer3           No    Equipment          4
5   Customer3           No         User          4
6   Customer3          Yes         User          4
7   Customer3          Yes    Equipment          4
0   Customer1          Yes         User          3
1   Customer1          Yes         User          3
4   Customer1           No      Neither          3
2   Customer2           No         User          1
8   Customer4           No         User          1
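The difference between value_counts and count on a single group, as described in this answer, can be checked directly. This is only a sketch: the names names, group, per_group_counts, per_group_size, and frequency are illustrative and not part of the answer's code.

```python
import pandas as pd

names = pd.Series(
    ["Customer1", "Customer1", "Customer2", "Customer3", "Customer1",
     "Customer3", "Customer3", "Customer3", "Customer4"],
    name="CompanyName",
)

# The rows that groupby would hand to the function for Customer3.
group = names[names == "Customer3"]

per_group_counts = group.value_counts()  # a one-element Series
per_group_size = group.count()           # a single integer

# transform broadcasts each group's scalar back to every row of that group.
frequency = names.groupby(names).transform("count")
print(frequency.tolist())
```

Because count reduces each group to one number, transform can broadcast it to all rows of that group, which is what the Frequency column needs.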

yshpjwxd #4

I suspect there is a better way, but this should work:
Prepare the data:

import io
import pandas as pd
data = """
CompanyName  HighPriority     QualityIssue
Customer1         Yes             User
Customer1         Yes             User
Customer2         No              User
Customer3         No              Equipment
Customer1         No              Neither
Customer3         No              User
Customer3         Yes             User
Customer3         Yes             Equipment
Customer4         No              User
"""
df = pd.read_table(io.StringIO(data), sep=r"\s+")

Then do the transformation:

# create a (sorted) data frame that lists the customers with their number of occurrences
count_df = pd.DataFrame(df.CompanyName.value_counts())

# join the count data frame back with the original data frame
new_index = count_df.merge(df[["CompanyName"]], left_index=True, right_on="CompanyName")

# output the original data frame in the order of the new index.
df.reindex(new_index.index)

Output:

  CompanyName HighPriority QualityIssue
3   Customer3           No    Equipment
5   Customer3           No         User
6   Customer3          Yes         User
7   Customer3          Yes    Equipment
0   Customer1          Yes         User
1   Customer1          Yes         User
4   Customer1           No      Neither
8   Customer4           No         User
2   Customer2           No         User

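It may not be obvious what is happening here, but at the moment I cannot think of a better way to do it. I tried to comment as much as possible.
The tricky part is that the index of count_df holds the (unique) customers. So I join the index of count_df (left_index=True) with the CompanyName column of df (right_on="CompanyName").
The magic is that count_df is already sorted by number of occurrences, which is why no explicit sort is needed. All that remains is to reorder the rows of the original DataFrame to match the rows of the joined DataFrame, and we get the expected result.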
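As a possible alternative to the merge and reindex steps, the counts can be mapped back onto the column and used as a sort key, and the result then written out as CSV again. This is a sketch under assumptions: the data is comma-separated, an in-memory io.StringIO stands in for the real input and output files, and the names csv_text, order, sorted_df, and out are illustrative.

```python
import io
import pandas as pd

# Stand-in for the real file; assumed to be comma-separated.
csv_text = """CompanyName,HighPriority,QualityIssue
Customer1,Yes,User
Customer1,Yes,User
Customer2,No,User
Customer3,No,Equipment
Customer1,No,Neither
Customer3,No,User
Customer3,Yes,User
Customer3,Yes,Equipment
Customer4,No,User
"""
df = pd.read_csv(io.StringIO(csv_text))

# Occurrences per customer, looked up for every row.
counts = df["CompanyName"].value_counts()
frequency = df["CompanyName"].map(counts)

# Stable descending sort of the frequencies. Rows with equal counts keep
# their original order, so customers with the same count may interleave.
order = frequency.sort_values(ascending=False, kind="mergesort").index
sorted_df = df.loc[order]

# Write the reordered rows back out without the index column.
out = io.StringIO()
sorted_df.to_csv(out, index=False)
print(out.getvalue())
```

With a real file, a path would replace the two io.StringIO objects in read_csv and to_csv.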
