I have a df in which 2 columns each contain a list of dicts. I am trying to explode them into columns, without success.
Here is my schema:
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- value: string (nullable = true)
| | |-- origin: string (nullable = true)
|-- createdAt: long (nullable = true)
|-- modules: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- detected: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- id: integer (nullable = true)
| | | |-- reason: string (nullable = true)
| | | |-- state: string (nullable = true)
| | | |-- score: integer (nullable = true)
| | | |-- level: string (nullable = true)
|-- id: string (nullable = true)
The data looks like this:
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data |createdAt |modules |id |
+----------------------------------------------------------------------+-------------+-------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------+
|[{data_point_1, false, METADATA}, {data_point_2, some_string, DEVICE}]|1678148428468|[{ANOTHER_DUMMY, {null, null, null, null, null, null}}, {DUMMY, {dummy_user_agent, 1, Rule for integration tests, OPERATIONAL, 500, HIGH}}]|70ef58bf-b160-4abd-97c1-aa4780e74e1b|
|[{data_point_1, false, METADATA}, {data_point_3, 0, USER}] |1678148428495|[{ANOTHER_DUMMY, {null, null, null, null, null, null}}, {DUMMY, {null, null, null, null, null, null}}] |6ab33e95-dd94-4c95-b00f-edfe97d6f3d1|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
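About the data itself: in the `data` column, the name/value pairs can differ from row to row, as in the example. The `modules` column should have the same structure in every row, but each element carries another nested dict, `detected`.
The result I am after looks like this: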
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|data_point_1|data_point_1_origin|data_point_2|data_point_2_origin|data_point_3|data_point_3_origin|createdAt |ANOTHER_DUMMY_name|ANOTHER_DUMMY_id|ANOTHER_DUMMY_reason|ANOTHER_DUMMY_state|ANOTHER_DUMMY_score|ANOTHER_DUMMY_level|DUMMY_name |DUMMY_id|DUMMY_reason |DUMMY_state|DUMMY_score|DUMMY_level|id |
+------------+-------------------+------------+-------------------+------------+-------------------+-------------+------------------+----------------+--------------------+-------------------+-------------------+-------------------+----------------+--------+--------------------------+-----------+-----------+-----------+------------------------------------+
|false |METADATA |some_string |DEVICE |null |null |1678148428468|null |null |null |null |null |null |dummy_user_agent|1 |Rule for integration tests|OPERATIONAL|500 |HIGH |70ef58bf-b160-4abd-97c1-aa4780e74e1b|
|false |METADATA |null |null |0 |USER |1678148428495|null |null |null |null |null |null |null |null |null |null |null |null |6ab33e95-dd94-4c95-b00f-edfe97d6f3d1|
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
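Is there a way to do something like this when I don't know in advance the names inside the dicts of the `data` array, and when I need to explode a nested dict like the one in `modules`?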
1 Answer
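Here is my solution:
I save the array column together with the uuid into a new df, then explode the array column into rows, so that each row now holds a single map instead of the original array of maps. Creating columns from a map is then straightforward (most of the interesting code lives under spark utils):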