Apache Spark: iterate over each column and find the maximum length

y0u0uwnf · published 2022-12-27 in Apache

I want to get the maximum length of every column of a PySpark DataFrame.
Here is a sample DataFrame:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)

I tried to implement the solution provided in Scala, but I could not translate it to PySpark.


4xy9mtcn1#

This works:

from pyspark.sql.functions import col, length, max

df = df.select([max(length(col(name))) for name in df.schema.names])

Result:
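The result screenshot from the original post is not reproduced here, but the expected maximums can be checked in plain Python (no Spark needed). One assumption in this sketch: Spark's `length()` implicitly casts the integer `salary` column to a string, which we mirror with `str()`:

```python
# Sample data from the question above.
data2 = [("James", "", "Smith", "36636", "M", 3000),
         ("Michael", "Rose", "", "40288", "M", 4000),
         ("Robert", "", "Williams", "42114", "M", 4000),
         ("Maria", "Anne", "Jones", "39192", "F", 4000),
         ("Jen", "Mary", "Brown", "", "F", -1)]

# zip(*data2) transposes rows into columns; take the longest string
# representation in each column.
max_lengths = [max(len(str(value)) for value in column)
               for column in zip(*data2)]
print(max_lengths)  # → [7, 4, 8, 5, 1, 4]
```

That is, "Michael" (7) for firstname, "Williams" (8) for lastname, 5 for the id strings, and 4 for the salary values rendered as strings.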

Edit: for reference, converting the result into rows (as asked here, also updated there - pyspark max string length for each column in the dataframe):

from pyspark.sql import Row

df = df.select([max(length(col(name))).alias(name) for name in df.schema.names])
row = df.first()  # collect the single row of per-column maximums
df2 = spark.createDataFrame([Row(col=name, length=row[name]) for name in df.schema.names],
                            ["col", "length"])
Output:
