I want to get the maximum length of each column in a PySpark DataFrame.
Here is a sample DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.getOrCreate()

data2 = [("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1)
]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True)
])
df = spark.createDataFrame(data=data2, schema=schema)
I tried to implement a solution provided in Scala, but I was unable to translate it to PySpark.
1 Answer
This does the trick.
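One way to compute the per-column maxima in a single pass (a minimal sketch; df1 and row are illustrative names):

import pyspark.sql.functions as F

# Max string length of every column; F.length implicitly casts
# non-string columns such as salary to string first.
df1 = df.select([F.max(F.length(F.col(name))).alias(name) for name in df.schema.names])
row = df1.first()  # a single Row holding the max length per column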
Result: a single row with one max length per column. To reshape it into a two-column (col, length) DataFrame:

from pyspark.sql import Row

df2 = spark.createDataFrame(
    [Row(col=name, length=row[name]) for name in df.schema.names],
    ["col", "length"]
)
Output:
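For the sample data above, df2.show() should work out to the following (lengths derived by hand from the sample rows; the integer salary column is measured by its digit count after the cast to string):

+----------+------+
|       col|length|
+----------+------+
| firstname|     7|
|middlename|     4|
|  lastname|     8|
|        id|     5|
|    gender|     1|
|    salary|     4|
+----------+------+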