Python - best way to split a single column into multiple columns in a PySpark DataFrame

tkclm6bt · posted 2021-07-14 in Spark

I'm a beginner with PySpark. I have a CSV file with about 8 million records, which I read with PySpark into a DataFrame df.
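Roughly like this (the path below is just a placeholder; the column holding the GPS points is called Trajectory_GPS):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# placeholder path; header/schema options depend on the actual file
df = spark.read.csv("trajectories.csv", header=True)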

One column contains values concatenated into a single string of the form [longitude latitude timestamp, longitude latitude timestamp, ...]. I now want to split it into three columns: longitude, latitude and timestamp.
For example, suppose the first record is '[104.07515 30.72649 1540803847, 104.07515 30.72631 1540803850, 104.07514 30.72605 1540803851, 104.07516 30.72573 1540803854, 104.07513 30.72537 1540803857, 104.0751 30.72499 1540803860, 104.0751 30.72455 1540803863, 104.07506 30.7241 1540803866, 104.07501 30.72363 1540803869, 104.07497 30.72316 1540803872, 104.07489 30.72264 1540803875, 104.07481 30.72211 1540803878, 104.07471 30.72159 1540803881, 104.07461 30.72107 1540803884]'.
The output should look like this:
longitude column: [104.07515, 104.07515, 104.07514, 104.07516, 104.07513, ...]
latitude column: [30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, ...]
timestamp column: [1540803847, 1540803850, 1540803851, 1540803854, ...]
I'm trying to find the best way to do this across the whole DataFrame.
Can anyone suggest a way to do it?
Many thanks in advance.


llmtgqce1#

You can trim the enclosing brackets and split the string by ', ', then split each item of the resulting array by ' ' using transform, and pick out the longitude, latitude and timestamp from each item.

df2 = df.selectExpr(
    # strip the enclosing [ ] and split the string into one entry per GPS point
    "split(trim(BOTH '[]' FROM Trajectory_GPS), ', ') as newcol"
).selectExpr(
    # each entry is "longitude latitude timestamp"; pick each field by position
    "transform(newcol, x -> split(x, ' ')[0]) as longitude", 
    "transform(newcol, x -> split(x, ' ')[1]) as latitude", 
    "transform(newcol, x -> split(x, ' ')[2]) as timestamp"
)

df2.show(truncate=False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|longitude                                                                                                                                               |latitude                                                                                                                                   |timestamp                                                                                                                                                               |
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[104.07515, 104.07515, 104.07514, 104.07516, 104.07513, 104.0751, 104.0751, 104.07506, 104.07501, 104.07497, 104.07489, 104.07481, 104.07471, 104.07461]|[30.72649, 30.72631, 30.72605, 30.72573, 30.72537, 30.72499, 30.72455, 30.7241, 30.72363, 30.72316, 30.72264, 30.72211, 30.72159, 30.72107]|[1540803847, 1540803850, 1540803851, 1540803854, 1540803857, 1540803860, 1540803863, 1540803866, 1540803869, 1540803872, 1540803875, 1540803878, 1540803881, 1540803884]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
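
If you prefer the DataFrame function API over SQL expression strings, an equivalent sketch looks like this (it relies on pyspark.sql.functions.transform, which is only available from Spark 3.1; the Trajectory_GPS column name is the same as above):

from pyspark.sql import functions as F

# strip the enclosing brackets, then split into one entry per GPS point
points = F.split(F.regexp_replace("Trajectory_GPS", r"[\[\]]", ""), ", ")

df2 = df.select(
    F.transform(points, lambda x: F.split(x, " ")[0]).alias("longitude"),
    F.transform(points, lambda x: F.split(x, " ")[1]).alias("latitude"),
    F.transform(points, lambda x: F.split(x, " ")[2]).alias("timestamp"),
)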

To get the max/min of the longitude/latitude, you can aggregate the DataFrame:

from pyspark.sql import functions as F

result = df2.agg(
    # array_max/array_min take the extreme value within each row's array,
    # and max/min then aggregate those extremes across all rows
    F.max(F.array_max('longitude')).alias('max_long'), 
    F.min(F.array_min('longitude')).alias('min_long'), 
    F.max(F.array_max('latitude')).alias('max_lat'), 
    F.min(F.array_min('latitude')).alias('min_lat')
).head().asDict()

print(result)

# {'max_long': '104.07516', 'min_long': '104.07461', 'max_lat': '30.72649', 'min_lat': '30.72107'}
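
Note that split yields strings, so the min/max above compare values lexicographically; that happens to match the numeric order here because all values share the same format, but to be safe you can cast the arrays to doubles first. A sketch:

from pyspark.sql import functions as F

# cast the string arrays to double arrays so max/min compare numerically
df3 = df2.withColumn("longitude", F.col("longitude").cast("array<double>")) \
         .withColumn("latitude", F.col("latitude").cast("array<double>"))

result = df3.agg(
    F.max(F.array_max("longitude")).alias("max_long"),
    F.min(F.array_min("longitude")).alias("min_long"),
    F.max(F.array_max("latitude")).alias("max_lat"),
    F.min(F.array_min("latitude")).alias("min_lat")
).head().asDict()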
