In PySpark, how do I insert a new column into a Spark DataFrame with values from m to n, where m and n can be chosen independently?

Posted by g2ieeal7 on 2021-05-27 in Spark


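I want to insert a new column into an existing DataFrame and use it as a key. I want to choose the first key value myself, and the keys should then continue for the length of the DataFrame. Note that the values must be consecutive. For example: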

--------------
|    Name    |
--------------
|     A      |
|     B      |
|     C      |
|     D      |
--------------

Transformed DataFrame:

-------------------------
|    Name    | df_key   |
-------------------------
|     A      |   60     |
|     B      |   61     |
|     C      |   62     |
|     D      |   63     |
-------------------------

In the example above, I want 60 to be a variable, and the remaining keys should extend over the length of the DataFrame.

apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/61855443/how-to-insert-a-new-column-in-a-spark-dataframe-with-values-from-m-to-n-where-m

2 Answers

kcrjzv8t #1

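You can compute a `row_number` and add `n - 1`:

```
from pyspark.sql import Window
import pyspark.sql.functions as F

n = 60  # first key value
# row_number() starts at 1, so adding n - 1 makes the first key equal to n
df.withColumn('df_key', F.row_number().over(Window.orderBy(F.lit(0))) + (n - 1)).show()

# +----+------+
# |Name|df_key|
# +----+------+
# |   A|    60|
# |   B|    61|
# |   C|    62|
# |   D|    63|
# +----+------+
```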
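
The `n - 1` offset is needed because `row_number` numbers rows starting at 1, not 0. Here is a minimal pure-Python sketch of that arithmetic only, not Spark code. `sequential_keys` is a hypothetical helper name introduced for illustration, and `enumerate(..., start=1)` stands in for `row_number`.

```python
def sequential_keys(names, start):
    """Pair each name with a consecutive key, beginning at `start`."""
    # enumerate(..., start=1) mimics a 1-based row number,
    # so the key for each row is row_number + (start - 1).
    return [
        (name, row_number + (start - 1))
        for row_number, name in enumerate(names, start=1)
    ]


rows = sequential_keys(["A", "B", "C", "D"], 60)
print(rows)
```

Changing the second argument moves the whole key range while keeping it consecutive.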

Answered on 2021-05-27

mlmc2os5 #2

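Use the `row_number` window function over a window ordered by `monotonically_increasing_id`. Example:

```
df.show()
# +----+
# |Name|
# +----+
# |   A|
# |   B|
# |   C|
# |   D|
# +----+

from pyspark.sql import Window
from pyspark.sql.functions import monotonically_increasing_id, row_number

w = Window.orderBy(monotonically_increasing_id())
constant = 60  # first key value
# row_number() starts at 1, so subtract 1 to make the first key equal to `constant`
df.withColumn("df_key", constant - 1 + row_number().over(w)).show()
# +----+------+
# |Name|df_key|
# +----+------+
# |   A|    60|
# |   B|    61|
# |   C|    62|
# |   D|    63|
# +----+------+
```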
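
`monotonically_increasing_id` on its own would not give consecutive keys, which is why it is only used here to order the window. According to the Spark documentation, the current implementation puts the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits, so the ids increase but jump between partitions. The sketch below models that layout in plain Python (no Spark required) and then ranks the ids to recover consecutive keys. The function names and the two-partition split are assumptions made for illustration.

```python
def simulated_monotonic_ids(partitions):
    """Model the documented id layout: partition index in the upper bits,
    record position within the partition in the lower 33 bits."""
    return [
        (value, (partition_index << 33) + position)
        for partition_index, partition in enumerate(partitions)
        for position, value in enumerate(partition)
    ]


def keys_from_ids(rows_with_ids, start):
    """Rank rows by id (1-based, like a row number over an ordered window)
    and shift the ranks so the first key equals `start`."""
    ordered = sorted(rows_with_ids, key=lambda pair: pair[1])
    return [
        (value, start - 1 + rank)
        for rank, (value, _id) in enumerate(ordered, start=1)
    ]


# Hypothetical layout: four rows split across two partitions
ids = simulated_monotonic_ids([["A", "B"], ["C", "D"]])
print(ids)   # ids are increasing, but jump by 2**33 at the partition boundary

keys = keys_from_ids(ids, 60)
print(keys)  # consecutive keys starting at 60
```

The ids only supply an ordering. The ranking step is what makes the keys gap-free.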

Answered on 2021-05-27
