How to index categorical features in another way when using Spark ML

Posted by col17t5w on 2021-05-27 in Spark


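The VectorIndexer in Spark indexes categorical features according to the frequency of their values. But I want to index the categorical features in a different way.
For example, with the dataset below, "a", "b", "c" will be indexed as 0, 1, 2 if I use the VectorIndexer in Spark. But I want to index them according to the label. There are 4 rows with label 1; 3 of them have feature 'a' and 1 has feature 'c'. So here I would index 'a' as 0, 'c' as 1 and 'b' as 2.
Is there a convenient way to implement this?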

label|feature
-----------------
    1 | a
    1 | c
    0 | a
    0 | b
    1 | a
    0 | b
    0 | b
    0 | c
    1 | a

apache-spark apache-spark-mllib

Source: https://stackoverflow.com/questions/40262719/how-to-index-categorical-features-in-another-way-when-using-spark-ml

1 Answer


z9ju0rcb1#

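If I understand your question correctly, you want to replicate the behaviour of StringIndexer() on grouped data. To do so (in PySpark), we first define a udf that operates on a List column containing all the values of each group. Note that elements with equal counts will be ordered arbitrarily.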

from collections import Counter
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType

def encoder(col):

  # Generate count per letter
  x = Counter(col)

  # Create a dictionary, mapping each letter to its rank
  # (rank 0 = most frequent letter in this group)
  ranking = {pair[0]: rank
             for rank, pair in enumerate(x.most_common())}

  # Use dictionary to replace letters by rank
  new_list = [ranking[i] for i in col]

  return new_list

# Wrap the function as a udf that returns an array of integers
encoder_udf = udf(encoder, ArrayType(IntegerType()))
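To see what the ranking step does without starting Spark, here is a minimal plain-Python sketch of the same logic applied to the two groups from the question. The function name rank_by_frequency and the hard-coded lists are illustrative assumptions, not part of the answer above; the lists simply spell out the features for label 1 and label 0 in the order they appear in the question.

```python
from collections import Counter

def rank_by_frequency(values):
    """Replace each value by its frequency rank (0 = most frequent)."""
    counts = Counter(values)
    # most_common() yields (value, count) pairs, highest count first
    ranking = {value: rank
               for rank, (value, _count) in enumerate(counts.most_common())}
    return [ranking[v] for v in values]

# Features of the rows with label 1 and label 0, taken from the question
label_1_features = ["a", "c", "a", "a"]
label_0_features = ["a", "b", "b", "b", "c"]

ranks_1 = rank_by_frequency(label_1_features)
ranks_0 = rank_by_frequency(label_0_features)

print(ranks_1)  # [0, 1, 0, 0]: 'a' is most frequent, then 'c'
print(ranks_0)  # 'b' gets 0; 'a' and 'c' tie for the remaining ranks
```

Because each group is ranked separately, the same letter can receive a different index in different groups: 'a' is 0 within label 1 but not within label 0.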

Now we can aggregate the feature column into a list, grouped by the label column, using collect_list(), and apply our udf row-wise:

from pyspark.sql.functions import collect_list, explode

df1 = (df.groupBy("label")
       .agg(collect_list("feature")
            .alias("features"))
       .withColumn("index", 
                   encoder_udf("features")))

Finally, you can explode the index column to get the encoded values instead of the letters:

df1.select("label", explode(df1.index).alias("index")).show()
+-----+-----+
|label|index|
+-----+-----+
|    0|    1|
|    0|    0|
|    0|    0|
|    0|    0|
|    0|    2|
|    1|    0|
|    1|    1|
|    1|    0|
|    1|    0|
+-----+-----+
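
The output above ranks letters within each label group, so it does not by itself give the single mapping the question describes ('a' as 0, 'c' as 1, 'b' as 2). A minimal plain-Python sketch of one way to build such a mapping follows. It assumes the (label, feature) pairs are small enough to hold on the driver, that features are ranked first by their count among rows with the target label and then by their count among the remaining rows, and that ties are broken alphabetically. The function name build_label_index and the tie-breaking rule are assumptions made for illustration.

```python
from collections import Counter

def build_label_index(rows, target_label=1):
    """Map each feature to an index, ranking features seen with
    target_label first (most frequent first), then all others."""
    rows = list(rows)
    # Counts of each feature among rows carrying the target label
    in_target = Counter(f for lbl, f in rows if lbl == target_label)
    # Counts of features that never occur with the target label
    others = Counter(f for lbl, f in rows
                     if lbl != target_label and f not in in_target)
    # Descending count; alphabetical order breaks ties (an assumption)
    ordered = sorted(in_target, key=lambda f: (-in_target[f], f))
    ordered += sorted(others, key=lambda f: (-others[f], f))
    return {feature: idx for idx, feature in enumerate(ordered)}

# The dataset from the question as (label, feature) pairs
rows = [(1, "a"), (1, "c"), (0, "a"), (0, "b"), (1, "a"),
        (0, "b"), (0, "b"), (0, "c"), (1, "a")]

mapping = build_label_index(rows)
print(mapping)  # {'a': 0, 'c': 1, 'b': 2}
```

In Spark, the pairs could come from collecting the two columns of df, and the resulting dictionary could then be applied to the feature column, for instance through a udf like the one defined earlier.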
