Apache Spark provides a TF-IDF implementation: https://spark.apache.org/docs/latest/ml-features.html#tf-idf
When I run that example, it adds the "rawFeatures" and "features" columns and outputs the following DataFrame:
| label | sentence    | words       | rawFeatures                                                  | features                                                     |
|-------|-------------|-------------|--------------------------------------------------------------|--------------------------------------------------------------|
| 0     | Hi...       | ["hi", ...] | [0, 32, [1, 12, 16, 22, 28], [1, 1, 1, 1, 1]]                | [0, 32, [1, 12, 16, 22, 28], [0.69, 0.69, 0.29, 0.29, 0.29]] |
| 0     | I wish...   | [...]       | [0, 32, [11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]] | [0, 32, [11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]] |
| 1     | Logistic... | [...]       | [0, 32, [3, 4, 15, 27, 30], [1, 1, 1, 1, 1]]                 | [0, 32, [3, 4, 15, 27, 30], [0.69, 0.69, 0.29, 0.69, 0.69]]  |
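For reference, the numbers in the table look consistent with the IDF formula given in the Spark documentation, idf(t) = log((m + 1) / (df(t) + 1)), where m is the number of documents and df(t) is the number of documents containing term t. Below is a minimal pure-Python sketch of that arithmetic, not Spark code. It assumes each array reads as [vector type, vector size, indices, values], that the three rows above are the whole corpus, and that natural logarithm is used; the helper name `idf` is hypothetical.

```python
import math

# rawFeatures from the table as (hashed indices, term counts); vector size is 32.
raw_features = [
    ([1, 12, 16, 22, 28], [1, 1, 1, 1, 1]),
    ([11, 15, 16, 22, 28, 29, 31], [1, 1, 1, 1, 1, 1, 1]),
    ([3, 4, 15, 27, 30], [1, 1, 1, 1, 1]),
]

num_docs = len(raw_features)

# Document frequency: in how many rows each hashed index appears.
doc_freq = {}
for indices, _ in raw_features:
    for i in indices:
        doc_freq[i] = doc_freq.get(i, 0) + 1


def idf(index):
    # Hypothetical helper: smoothed IDF as described in the Spark docs.
    return math.log((num_docs + 1) / (doc_freq[index] + 1))


# TF-IDF = term count * IDF, rounded to two decimals for comparison with the table.
tfidf = [
    [round(count * idf(i), 2) for i, count in zip(indices, counts)]
    for indices, counts in raw_features
]

print(tfidf[0])
print(tfidf[2])
```

Under these assumptions, the first and third rows match the "features" values shown for them: an index that occurs in one row gets log(4/2), and an index shared by two rows gets log(4/3).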
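I have two questions:
What are the "rawFeatures" and "features" columns? How do the arrays in them relate to TF-IDF?
(Assuming the last element of the "features" column is the TF-IDF) How can I convert this DataFrame into something like this?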
|-----------------------|
| word | label | TF-IDF |
|-----------------------|
Essentially, I want a DataFrame with one row per word, showing the word's label and its TF-IDF.
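One complication is that the vectors are keyed by hashed index, not by word, so the word cannot be read back from the vector alone; each word has to be mapped to its index again and looked up. A rough pure-Python sketch of that reshaping follows. It is not Spark code: `explode_tfidf`, `index_of`, and the toy data are all hypothetical, and `index_of` merely stands in for whatever word-to-index mapping the pipeline used.

```python
def explode_tfidf(rows, index_of):
    """Turn (label, words, (indices, values)) rows into (word, label, tfidf) rows."""
    out = []
    for label, words, (indices, values) in rows:
        # Look up each TF-IDF weight by its hashed index.
        weight_at = dict(zip(indices, values))
        # dict.fromkeys keeps each distinct word once, in first-seen order.
        for word in dict.fromkeys(words):
            out.append((word, label, weight_at.get(index_of(word), 0.0)))
    return out


# Toy data (assumed for illustration only, not taken from the table above).
toy_index = {"hi": 1, "spark": 12, "i": 16}
rows = [(0, ["hi", "i", "spark"], ([1, 12, 16], [0.69, 0.69, 0.29]))]

result = explode_tfidf(rows, toy_index.__getitem__)
for row in result:
    print(row)
```

Note that two different words can hash to the same index, in which case they would share one TF-IDF value in this layout.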
Thanks in advance :)