How do I extract a value from a column and store it as a float in PySpark?

Posted by sy5wg1nm on 2021-07-14 in Spark

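I have a PySpark DataFrame that looks like the one below. I want the column to hold only the float value. Note that the current values are strings wrapped in square brackets.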

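from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField
from pyspark.sql.types import StringType, IntegerType, ArrayType

# Reuse the active session (or create one) so the snippet runs on its own.
spark = SparkSession.builder.getOrCreate()

data = [
    ("Smith", "OH", "[55.5]"),
    ("Anna", "NY", "[33.3]"),
    ("Williams", "OH", "[939.3]"),
]

# salary is stored as a string such as "[55.5]", not as a number.
schema = StructType([
    StructField('name', StringType(), True),
    StructField('state', StringType(), True),
    StructField('salary', StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.show(truncate=False)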

Input:
+--------+-----+-------+
|name    |state|salary |
+--------+-----+-------+
|Smith   |OH   |[55.5] |
|Anna    |NY   |[33.3] |
|Williams|OH   |[939.3]|
+--------+-----+-------+

The result should be:

+--------+-----+------------------+
|name    |state|float_value_salary|
+--------+-----+------------------+
|Smith   |OH   |55.5              |
|Anna    |NY   |33.3              |
|Williams|OH   |939.3             |
+--------+-----+------------------+

Thanks for your help.


xsuvu9jc  #1

You can trim the square brackets and cast the result to float:

import pyspark.sql.functions as F

df2 = df.withColumn('salary', F.expr("float(trim('[]', salary))"))

df2.show()
+--------+-----+------+
|    name|state|salary|
+--------+-----+------+
|   Smith|   OH|  55.5|
|    Anna|   NY|  33.3|
|Williams|   OH| 939.3|
+--------+-----+------+
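
A single trim call removes both brackets because the trim string is treated as a set of characters to strip from both ends, not as a literal substring. Python's `str.strip` behaves the same way, so the plain-Python sketch below mirrors what the expression does per row. It is only an analogy, not Spark code, and `strip_brackets` is a hypothetical helper name.

```python
def strip_brackets(value):
    """Strip leading/trailing '[' and ']' characters, then convert to float."""
    # str.strip treats its argument as a set of characters, not a substring,
    # so "[]" removes '[' on the left and ']' on the right.
    return float(value.strip("[]"))


samples = ["[55.5]", "[33.3]", "[939.3]"]
print([strip_brackets(s) for s in samples])
```

Anything other than brackets around the number (for example a unit suffix) would survive the trim and make the conversion fail, so this approach assumes the column is uniformly formatted.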

Or you can use from_json to parse the string into an array of floats and take the first array element:

df2 = df.withColumn('salary', F.from_json('salary', 'array<float>')[0])

beq87vna  #2

You can use a regular expression:

import pyspark.sql.functions as F

df.select(
    F.regexp_extract('salary', r'([\d.]+)', 1).cast('float').alias('salary')
).show()

Output:

+------+
|salary|
+------+
|  55.5|
|  33.3|
| 939.3|
+------+
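
One thing to watch with this pattern is that it only captures digits and dots, so a leading minus sign would be dropped. The sketch below checks this with Python's `re` module, on the assumption that such a simple pattern behaves the same in Spark's Java regex engine. `SIGNED` and `first_match` are hypothetical names introduced for illustration, and `first_match` returns an empty string when nothing matches.

```python
import re

# Pattern from the answer above: one or more digits or dots.
UNSIGNED = re.compile(r"([\d.]+)")
# Hypothetical stricter variant: optional sign, digits, optional fraction.
SIGNED = re.compile(r"(-?\d+(?:\.\d+)?)")


def first_match(pattern, text):
    """Return capture group 1 of the first match, or '' if there is none."""
    m = pattern.search(text)
    return m.group(1) if m else ""


for text in ["[55.5]", "[-12.5]", "[]"]:
    print(text, repr(first_match(UNSIGNED, text)), repr(first_match(SIGNED, text)))
```

If the salary column could ever hold negative values, passing the stricter pattern to `regexp_extract` would be one way to keep the sign.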

mklgxw1f  #3

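You can use a UDF to parse the string into an array of floats, and then explode the array to get the single value it contains.
The solution is as follows: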

import json

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

def parse_value_from_string(x):
    res = json.loads(x)
    return res

parse_float_array = F.udf(parse_value_from_string, ArrayType(FloatType()))

df = df.withColumn('float_value_salary',F.explode(parse_float_array(F.col('salary'))))

df_output = df.select('name','state','float_value_salary')

The output DataFrame should look like this:

+--------+-----+------------------+
|    name|state|float_value_salary|
+--------+-----+------------------+
|   Smith|   OH|              55.5|
|    Anna|   NY|              33.3|
|Williams|   OH|             939.3|
+--------+-----+------------------+
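
The UDF body above assumes every value is a valid JSON array, and `explode` drops rows whose array is null or empty. A more defensive variant, sketched here in plain Python so it can be checked without a Spark session, returns the first element directly or `None` for anything it cannot parse. `parse_first_float` is a hypothetical name, not part of the answer.

```python
import json


def parse_first_float(raw):
    """Return the first element of a JSON array string as a float, or None."""
    if raw is None:
        return None
    try:
        values = json.loads(raw)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON.
        return None
    if not isinstance(values, list) or not values:
        # Valid JSON, but not a non-empty array.
        return None
    try:
        return float(values[0])
    except (TypeError, ValueError):
        # First element is not something float() can convert.
        return None


print([parse_first_float(s) for s in ["[55.5]", "[]", "oops", None]])
```

Because it returns a scalar, such a function could be wrapped with `F.udf(parse_first_float, FloatType())` and applied via `withColumn` without `explode`, so rows with malformed values would be kept with a null instead of disappearing.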
