pandas: convert a string column to an ordered category?

Posted by 8nuwlpux on 2022-12-02 in Other


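I'm working with pandas for the first time. I have a column of survey responses that can take the values "Strongly agree", "Agree", "Disagree", "Strongly disagree", and "Neither".
This is the output of describe() and value_counts() for the column: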

count      4996
unique        5
top       Agree
freq       1745
dtype: object
Agree                1745
Strongly agree        926
Strongly disagree     918
Disagree              793
Neither               614
dtype: int64

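I want to run a linear regression of this question against the overall score. However, I have a feeling I should first convert the column into a categorical variable, given that it is inherently ordered. Is this correct? If so, how should I do it?
I've tried this: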

# Note: Categorical.from_array was deprecated and later removed from pandas
df.EasyToUseQuestionFactor = pd.Categorical.from_array(df.EasyToUseQuestion)
print(df.EasyToUseQuestionFactor)

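This produces output that looks right, but the categories seem to be in the wrong order. Is there a way to specify the order? Do I even need to specify it?
Here is the rest of my code: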

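import pandas as pd
from statsmodels.formula.api import ols  # assumption: ols is the statsmodels formula API

df = pd.read_csv('./data/responses.csv')
lm1 = ols('OverallScore ~ EasyToUseQuestion', df).fit()  # was `data`, which is never defined
print(lm1.rsquared)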

pandas

Source: https://stackoverflow.com/questions/25938557/pandas-convert-string-column-to-ordered-category

3 Answers


oalqel3c1#

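There are now two ways to do this. Your column will be more readable and use less memory, and because it will be a categorical type you can still sort the values.
First, my preferred one: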

df['grades'].astype('category')

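astype used to accept a categories argument, but it no longer does. So if you want the categories in a non-lexicographic order, or want extra categories that do not appear in the data, you have to use the solution below.
This suggestion comes from the documentation: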

In [26]: from pandas.api.types import CategoricalDtype
In [27]: s = pd.Series(["a", "b", "c", "a"])
In [28]: cat_type = CategoricalDtype(categories=["b", "c", "d"],
   ....:                             ordered=True)
In [29]: s_cat = s.astype(cat_type)

Bonus tip: get all existing values of a column with df.colname.unique().
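Applied to the survey column from the question, the documented approach might look like the sketch below. The sample Series is made up for illustration, and the lowest-to-highest ordering of the five levels (with "Neither" in the middle) is an assumption about how the scale should be read.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical sample standing in for df.EasyToUseQuestion
responses = pd.Series(
    ["Agree", "Strongly agree", "Neither", "Disagree", "Strongly disagree", "Agree"]
)

# Levels listed from lowest to highest; this ordering is an assumption
likert_type = CategoricalDtype(
    categories=["Strongly disagree", "Disagree", "Neither", "Agree", "Strongly agree"],
    ordered=True,
)
likert = responses.astype(likert_type)

# Ordered categoricals support comparisons and sort by category order
at_least_agree = likert >= "Agree"
sorted_labels = likert.sort_values().tolist()

# Integer position of each value within the declared category order
codes = likert.cat.codes.tolist()
```

Any value that is not listed in categories becomes NaN after astype, so it is worth checking the column for unexpected labels first.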


nzkunb0c2#

Yes, you should convert it to categorical data, and this should do the trick:

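likert_scale = {'strongly agree':2, 'agree':1, 'neither':0, 'disagree':-1, 'strongly disagree':-2}
# Lowercase each response first: the data uses 'Agree', 'Strongly agree', etc., but the keys are lowercase
df['categorical_data'] = df.EasyToUseQuestion.apply(lambda x: likert_scale[x.lower()])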
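A variant of the same mapping, sketched with Series.map on made-up data: labels missing from the dictionary come back as NaN instead of raising KeyError, which makes stray responses easy to spot. The "Maybe" label is a deliberately invalid value added for illustration.

```python
import pandas as pd

# Hypothetical sample; "Maybe" is a deliberately unexpected label
responses = pd.Series(["Agree", "Strongly agree", "Neither", "Maybe", "Disagree"])

likert_scale = {
    "strongly agree": 2, "agree": 1, "neither": 0,
    "disagree": -1, "strongly disagree": -2,
}

# Lowercase to match the dict keys; map() yields NaN for labels not in the dict
scores = responses.str.lower().map(likert_scale)

# Original labels that could not be mapped
unmapped = responses[scores.isna()].tolist()
```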


ar7v8xwq3#

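pandas.factorize() returns a numeric representation of an array. Note that the codes follow the order in which values first appear, not the order of the Likert scale.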
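A small sketch of what factorize returns, using a made-up sample. By default the codes follow first appearance, and with sort=True they follow the sorted (here alphabetical) order of the labels, so neither matches the agreement scale by itself.

```python
import pandas as pd

# Hypothetical sample of survey responses
responses = pd.Series(["Agree", "Neither", "Agree", "Disagree"])

# Default: codes are assigned in order of first appearance
codes, uniques = pd.factorize(responses)

# sort=True: unique labels are sorted first, then codes are assigned
sorted_codes, sorted_uniques = pd.factorize(responses, sort=True)
```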