我正在尝试清理一些嵌套数据并提取我关心的字段。
嵌套值的架构为:
|-- maritalstatus: struct (nullable = true)
| |-- id: string (nullable = true)
| |-- text_: string (nullable = true)
| |-- text__extensions: array (nullable = true)
| | |-- element: string (containsNull = true)
| |-- extension: array (nullable = true)
| | |-- element: string (containsNull = true)
我想将text_
字段提取为它自己的列。
我试过:df.select(col("maritalstatus.text_")).show()
和df.select(col("maritalstatus")["text_"]).show()
,但它会传回:
+-----+
|text_|
+-----+
| null|
| null|
...
| null|
+-----+
当我将UDF定义为:
def getMaritalStatus(ms):
return ms.text_
gms = udf(getMaritalStatus, StringType())
然后执行df.select(gms(col("maritalstatus")).show()
,它将返回所需的数据
有趣的是,我有另一个嵌套的struct字段,它具有类似的结构,但使用数字作为键而不是名称,而且我 am 能够使用df.select(col("birthdate")["0"]).show()
表示法
出生日期的结构描述:
root
|-- birthdate: struct (nullable = true)
| |-- 0: date (nullable = true)
| |-- 1: integer (nullable = true)
有没有办法在不使用UDF的情况下提取maritalstatus.text_
?我听说UDF的性能不如其他方法?
cassandra 表结构:
CREATE TABLE keyspace.patient (
id text PRIMARY KEY,
active boolean,
active_extensions list<text>,
address list<frozen<address>>,
birthdate frozen<tuple<date, int>>,
birthdate_extensions list<text>,
communication list<frozen<patient_communication>>,
contact list<frozen<patient_contact>>,
contained list<frozen<tuple<text, text, text>>>,
deceasedboolean boolean,
deceasedboolean_extensions list<text>,
deceaseddatetime frozen<tuple<timestamp, text, int>>,
deceaseddatetime_extensions list<text>,
extension list<text>,
gender text,
gender_extensions list<text>,
generalpractitioner list<text>,
identifier list<frozen<identifier>>,
implicitrules text,
implicitrules_extensions list<text>,
language text,
language_extensions list<text>,
link list<frozen<patient_link>>,
managingorganization text,
maritalstatus frozen<codeableconcept>,
meta frozen<meta>,
modifierextension list<text>,
multiplebirthboolean boolean,
multiplebirthboolean_extensions list<text>,
multiplebirthinteger int,
multiplebirthinteger_extensions list<text>,
name list<frozen<humanname>>,
photo list<frozen<attachment>>,
telecom list<frozen<contactpoint>>,
text_ frozen<narrative>
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
AND comment = ''
AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99PERCENTILE';
并且codeableconcept
被
CREATE TYPE keyspace.codeableconcept (
extension list<text>,
text_ text,
text__extensions list<text>,
id text,
coding list<frozen<coding>>
);
2条答案
按热度按时间qvtsj1bj1#
using pyspark sql你可以使用.“”查询嵌套数据,对于数组使用explode函数。
(1,不正确)
或
如果是,请输入以下命令:
vh0rcniy2#
您可以使用
[name]
来访问列中的嵌套值,也可以使用.getItem
函数来执行相同的操作。列可以显式 Package 为col
,也可以使用语法dataframe[column_name]
来代替。如果我们使用如下示例数据:那么所有三个呼叫:
会产生同样的结果: