PySpark: convert a DataFrame array-type column to a string without losing element names/schema

Posted by omvjsjqw on 2022-11-01 in Spark

In my DataFrame, I need to convert an array-type column to a string without losing the element names/schema of the data in that column.
My DataFrame schema:

root
 |-- accountId: string (nullable = true)
 |-- documents: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- accountId: string (nullable = true)
 |    |    |-- agreementId: string (nullable = true)
 |    |    |-- createdBy: string (nullable = true)
 |    |    |-- createdDate: string (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- obligations: array (nullable = true)
 |    |    |    |-- element: string (containsNull = true)
 |    |    |-- resourceVersion: long (nullable = true)
 |    |    |-- updatedBy: string (nullable = true)
 |    |    |-- updatedDate: string (nullable = true)

Sample DataFrame data (shown here as JSON, but these are columns in a Spark DataFrame):

{
    "accountId":"1",
    "documents":{
        "list":[{
            "element":{
                "accountId":"1",
                "agreementId":"1.2",
                "createdDate":"2022-10-06T19:33:42.539646Z",
                "externalId":"16",
                "id":"123",
                "name":"test1.docx",
                "obligations":{},
                "resourceVersion":1,
                "updatedDate":"2022-10-06T19:33:42.680233Z"
            }
        }]
    }
}
{
    "accountId":"2",
    "documents":{
        "list":[{
            "element":{
                "accountId":"2",
                "agreementId":"2.2",
                "createdDate":"2022-10-06T19:33:42.539646Z",
                "externalId":"18",
                "id":"123",
                "name":"test2.docx",
                "obligations":{},
                "resourceVersion":1,
                "updatedDate":"2022-10-06T19:33:42.680233Z"
            }
        }]
    }
}

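My current code:

from pyspark.sql.functions import col

# Cast every column, including the nested array column, to a plain string
df_string = df.select([col(c).cast("string") for c in df.columns])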

What it produces (the field names inside documents disappear):

{
    "accountId":"1",
    "documents":[{"1","1.2","2022-10-06T19:33:42.539646Z","16","123","test1.docx","",1,"2022-10-06T19:33:42.680233Z"}]
}
{
    "accountId":"2",
    "documents":[{"2","2.2","2022-10-06T19:33:42.539646Z","18","123","test2.docx","","1","2022-10-06T19:33:42.680233Z"}]
}

What I need (the field names must be preserved inside documents):

{
    "accountId":"1",
    "documents":[{"accountId":"1","agreementId":"1.2","createdDate":"2022-10-06T19:33:42.539646Z","externalId":"16","id":"123","name":"test1.docx","obligations":"","resourceVersion":"1","updatedDate":"2022-10-06T19:33:42.680233Z"}]
}
{
    "accountId":"2",
    "documents":[{"accountId":"2","agreementId":"2.2","createdDate":"2022-10-06T19:33:42.539646Z","externalId":"18","id":"123","name":"test2.docx","obligations":"","resourceVersion":"1","updatedDate":"2022-10-06T19:33:42.680233Z"}]
}
