Apache Spark TF-IDF

Posted by wvt8vs2t on 2021-05-24 in Spark

Apache Spark provides a TF-IDF algorithm: https://spark.apache.org/docs/latest/ml-features.html#tf-idf
Running that example adds the "rawFeatures" and "features" columns and outputs the following DataFrame:

|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
| label | sentence    | words       | rawFeatures                                                  | features                                                     |
|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
| 0     | Hi...       | ["hi", ...] | [0, 32, [1, 12, 16, 22, 28], [1, 1, 1, 1, 1]]                | [0, 32, [1, 12, 16, 22, 28], [0.69, 0.69, 0.29, 0.29, 0.29]] |
|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
| 0     | I wish...   | [...]       | [0, 32, [11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]] | [0, 32, [11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]] |
|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
| 1     | Logistic... | [...]       | [0, 32, [3, 4, 15, 27, 30], [1, 1, 1, 1, 1]]                 | [0, 32, [3, 4, 15, 27, 30], [0.69, 0.69, 0.29, 0.69, 0.69]]  |
|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
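
As a reading aid for the table, here is a minimal sketch in plain Python (not Spark itself) of how the two vector columns appear to relate. It assumes each vector is printed as [type, size, indices, values], with type 0 meaning a sparse vector of length 32; that "rawFeatures" holds the term count per hash bucket; and that "features" multiplies each count by the smoothed IDF given in the Spark documentation, idf = ln((m + 1) / (df + 1)), where m is the number of documents and df the number of documents containing that bucket. The function names are made up for illustration.

```python
import math

# Rows copied from the rawFeatures column: (label, size, indices, term counts).
docs = [
    (0, 32, [1, 12, 16, 22, 28], [1, 1, 1, 1, 1]),
    (0, 32, [11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]),
    (1, 32, [3, 4, 15, 27, 30], [1, 1, 1, 1, 1]),
]


def document_frequencies(docs):
    """Count, for each hash bucket, how many documents contain it."""
    df = {}
    for _label, _size, indices, _counts in docs:
        for i in indices:
            df[i] = df.get(i, 0) + 1
    return df


def tf_idf(docs):
    """Rescale raw counts by idf = ln((m + 1) / (df + 1)) (assumed formula)."""
    m = len(docs)
    df = document_frequencies(docs)
    return [
        [count * math.log((m + 1) / (df[i] + 1)) for i, count in zip(indices, counts)]
        for _label, _size, indices, counts in docs
    ]


weights = tf_idf(docs)
for row in weights:
    print([round(v, 2) for v in row])
```

Under these assumptions the first and third rows reproduce the "features" values shown in the table (0.69 for a bucket seen in one document, 0.29 for a bucket seen in two). The middle row would come out as a mix of 0.69 and 0.29 rather than the all-ones values shown, so that row of the table may have been pasted from the "rawFeatures" column.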

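I have two questions:
What are the "rawFeatures" and "features" columns? How do the arrays in them relate to TF-IDF?
(Assuming the last element of the "features" column is the TF-IDF) how can I convert this DataFrame into something like this?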

|-----------------------|
| word | label | TF-IDF |
|-----------------------|

Essentially, I want a DataFrame with multiple rows, one for each word, showing the word's label and its TF-IDF.
Thanks in advance :)
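
One possible shape for the word / label / TF-IDF conversion, sketched in plain Python on a single row rather than on a Spark DataFrame. The helper `explode_tfidf`, the `toy_buckets` mapping, and the sample words and weights are all hypothetical. The real word-to-bucket mapping comes from the hash function used by HashingTF; as far as I know, PySpark 3.0+ exposes it as `HashingTF.indexOf`, but treat that as an assumption to verify against your version.

```python
def explode_tfidf(label, words, features, index_of):
    """Return one (word, label, tfidf) tuple per distinct word of a document.

    features is the (size, indices, values) content of the sparse vector;
    index_of maps a word to its bucket index (stand-in for the real hash).
    """
    _size, indices, values = features
    weight_by_bucket = dict(zip(indices, values))
    rows = []
    for word in dict.fromkeys(words):  # distinct words, original order kept
        rows.append((word, label, weight_by_bucket.get(index_of(word), 0.0)))
    return rows


# Made-up bucket assignment, for illustration only.
toy_buckets = {"hi": 1, "spark": 16}

rows = explode_tfidf(
    label=0,
    words=["hi", "spark", "hi"],
    features=(32, [1, 16], [0.69, 0.29]),
    index_of=toy_buckets.__getitem__,
)
for row in rows:
    print(row)
```

With only 32 buckets, two different words can hash to the same index, in which case both would be reported with that bucket's combined weight. If exact per-word values matter, a vocabulary-based vectorizer such as `CountVectorizer`, whose fitted model keeps an index-to-word `vocabulary` list, may be a better fit than hashing.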
