Expanding the range of a given list from a PySpark column full of such lists

Posted by mmvthczy on 2022-11-01 in Spark

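I need to expand a range from a given start number to an end number. For example, given [1,4], I need the output [1,2,3,4]. I have been trying to use the code block below as the logic, but I can't make it dynamic. When I pass many lists into it, I get an error.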


# Create an empty list
My_list = []

# Value to begin and end with
start = 10
print(start)
end = 20
print(end)

# Check if start value is smaller than end value
if start < end:
    # unpack the result
    My_list.extend(range(start, end))
    # Append the last value
    # My_list.append(end)

# Print the list
print(My_list)

Output: 10 20 [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
This is what I need! But...
I am trying to do this:

import pandas as pd
My_list = []
isarray = []
pd_df = draft_report.toPandas()
for index, row in pd_df.iterrows():
   My_list = row[14] #14 is the place of docPage in the df
   start = My_list[1] #reads the 1st element eg: 1 in [1,16]
   print(start)
   end = My_list[3] #reads the last element eg: 16 in [1,16]
   print(end)
   if start < end:
       isarray.extend(range(int(start, end)))
       isarray.append(int(end))
   print(isarray)

Output:

An error was encountered:
'str' object cannot be interpreted as an integer
Traceback (most recent call last):
TypeError: 'str' object cannot be interpreted as an integer

The data looks like this:

docPages
[1,16]
[17,22]
[23,24]
[25,27]

vm0i2vca1#

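Since the source column is StringType(), the string first needs to be converted to an array, which can be done with the from_json function. The elements of the resulting array can then be passed to the sequence function.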

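from pyspark.sql import functions as func

# data_sdf holds the page ranges as strings in the column arr_as_str
data_sdf. \
    withColumn('arr',
               func.sort_array(func.from_json('arr_as_str', 'array<integer>'))
               ). \
    withColumn('arr_range', func.expr('sequence(arr[0], arr[1], 1)')). \
    show(truncate=False)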

# +----------+--------+-------------------------------------------------------+
# |arr_as_str|arr     |arr_range                                              |
# +----------+--------+-------------------------------------------------------+
# |[1,16]    |[1, 16] |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]|
# |[17,22]   |[17, 22]|[17, 18, 19, 20, 21, 22]                               |
# |[23,24]   |[23, 24]|[23, 24]                                               |
# |[25,27]   |[25, 27]|[25, 26, 27]                                           |
# +----------+--------+-------------------------------------------------------+
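
For the pandas loop in the question, the same two steps (parse the string, then build a range that includes the end value) can be sketched in plain Python. This is only a sketch: it assumes every cell is a JSON-style string such as "[1,16]", and the helper name expand_page_range is made up for illustration.

```python
import json


def expand_page_range(cell):
    """Turn a string such as '[1,16]' into [1, 2, ..., 16]."""
    # json.loads parses the text into a list of ints; indexing the raw
    # string by position would return single characters such as '1' or ','.
    start, end = sorted(json.loads(cell))
    # range() excludes its stop value, so add 1 to keep the last page.
    return list(range(start, end + 1))


doc_pages = ["[1,16]", "[17,22]", "[23,24]", "[25,27]"]
expanded = [expand_page_range(cell) for cell in doc_pages]
print(expanded[2])  # [23, 24]
```

Because each cell is a string, My_list[1] and My_list[3] in the question pick out characters rather than numbers, and those values stay str when they reach int(start, end), which appears to be what triggers the TypeError.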

If the source column is already an ArrayType() field, the sequence function can be used directly to create the range.
See the example below.

data_sdf. \
    withColumn('doc_range', func.expr('sequence(doc_pages[0], doc_pages[1], 1)')). \
    show(truncate=False)

# +---------+-------------------------------------------------------+
# |doc_pages|doc_range                                              |
# +---------+-------------------------------------------------------+
# |[1, 16]  |[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]|
# |[17, 22] |[17, 18, 19, 20, 21, 22]                               |
# |[23, 24] |[23, 24]                                               |
# |[25, 27] |[25, 26, 27]                                           |
# +---------+-------------------------------------------------------+
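
One detail worth spelling out: the results above include both endpoints, whereas Python's range stops before its end value, which is why the first snippet in the question printed 10 through 19. A rough plain-Python model of that inclusive behaviour might look like the following. The name sequence_like is hypothetical, and the sketch covers only a positive step; it is not how Spark implements sequence.

```python
def sequence_like(start, stop, step=1):
    """Plain-Python model of an inclusive integer sequence (positive step only)."""
    if step <= 0:
        raise ValueError("this sketch only handles a positive step")
    if start > stop:
        raise ValueError("start must not exceed stop for a positive step")
    # Adding 1 to stop makes the upper bound inclusive.
    return list(range(start, stop + 1, step))


print(sequence_like(23, 24))  # [23, 24]
print(list(range(23, 24)))    # [23]
```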
