How to convert DataFrame rows to IndexedRow in PySpark?

hwazgwia asked on 2021-05-27 in Spark

I have a table like this:

+-------+------+-------+-------+----
|movieId|Action| Comedy|Fantasy| ...
+-------+------+-------+-------+----
|  1001 |  1   |   1   |   0   | ...
|  1011 |  0   |   1   |   1   | ...
+-------+------+-------+-------+----

How do I convert each of its rows to an IndexedRow, so that I end up with something like this:

+-------+----------------+
|movieId|    Features    |
+-------+----------------+
|  1001 | [1, 1, 0, ...] | 
|  1011 | [0, 1, 1, ...] |
+-------+----------------+

bvpmtnay1#

If you need an array-type output, you can use the array() function:

from pyspark.sql import functions as F

tst = spark.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)], schema=['col1','col2','col3'])
# Collect every column into a single ArrayType column; to match the question's
# output, pass only the genre columns instead of tst.columns.
tst_arr = tst.withColumn("Features", F.array(tst.columns))

tst_arr.show()
+----+----+----+-----------+
|col1|col2|col3|   Features|
+----+----+----+-----------+
|   1|   7|  80| [1, 7, 80]|
|   1|   8|  40| [1, 8, 40]|
|   1|   5| 100|[1, 5, 100]|
|   5|   8|  90| [5, 8, 90]|
|   7|   6|  50| [7, 6, 50]|
|   0|   3|  60| [0, 3, 60]|
+----+----+----+-----------+
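
Since the question asks for IndexedRow specifically, the same row values can be mapped straight into pyspark.mllib.linalg.distributed.IndexedRow objects. A minimal sketch, assuming the question's DataFrame is named df and movieId serves as the row index:

from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix

# Assumption: `df` is the movieId/genre DataFrame from the question.
feature_cols = [c for c in df.columns if c != "movieId"]

# Map each Row to an IndexedRow(index, vector).
indexed = df.rdd.map(
    lambda row: IndexedRow(row["movieId"], [row[c] for c in feature_cols])
)

# The result can back a distributed IndexedRowMatrix for mllib operations.
mat = IndexedRowMatrix(indexed)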

If you are doing this for ML operations, it is better to use VectorAssembler (sketched below): http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/ml/feature.html#vectorassembler
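
A minimal sketch of the VectorAssembler route, again assuming the question's DataFrame is named df:

from pyspark.ml.feature import VectorAssembler

# Assumption: `df` is the movieId/genre DataFrame from the question.
genre_cols = [c for c in df.columns if c != "movieId"]

# Assemble the genre columns into a single Vector column named "Features".
assembler = VectorAssembler(inputCols=genre_cols, outputCol="Features")
result = assembler.transform(df).select("movieId", "Features")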
