cassandra 使用点或括号表示法进行选择时,列的值为空,但使用UDF时则不为空

5t7ly7z5  于 2022-11-05  发布在  Cassandra
关注(0)|答案(2)|浏览(167)

我正在尝试清理一些嵌套数据并提取我关心的字段。
嵌套值的架构为:

|-- maritalstatus: struct (nullable = true)
 |    |-- id: string (nullable = true)
 |    |-- text_: string (nullable = true)
 |    |-- text__extensions: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- extension: array (nullable = true)
 |    |    |-- element: string (containsNull = true)

我想将text_字段提取为它自己的列。
我试过:df.select(col("maritalstatus.text_")).show()df.select(col("maritalstatus")["text_"]).show(),但它会传回:

+-----+
|text_|
+-----+
| null|
| null|
 ...
| null|
+-----+

当我将UDF定义为:

def getMaritalStatus(ms):
    return ms.text_
gms = udf(getMaritalStatus, StringType())

然后执行df.select(gms(col("maritalstatus")).show(),它将返回所需的数据
有趣的是,我有另一个嵌套的struct字段,它具有类似的结构,但使用数字作为键而不是名称,而且我 am 能够使用df.select(col("birthdate")["0"]).show()表示法
出生日期的结构描述:

root
 |-- birthdate: struct (nullable = true)
 |    |-- 0: date (nullable = true)
 |    |-- 1: integer (nullable = true)

有没有办法在不使用UDF的情况下提取maritalstatus.text_?我听说UDF的性能不如其他方法?
cassandra 表结构:

CREATE TABLE keyspace.patient (
    id text PRIMARY KEY,
    active boolean,
    active_extensions list<text>,
    address list<frozen<address>>,
    birthdate frozen<tuple<date, int>>,
    birthdate_extensions list<text>,
    communication list<frozen<patient_communication>>,
    contact list<frozen<patient_contact>>,
    contained list<frozen<tuple<text, text, text>>>,
    deceasedboolean boolean,
    deceasedboolean_extensions list<text>,
    deceaseddatetime frozen<tuple<timestamp, text, int>>,
    deceaseddatetime_extensions list<text>,
    extension list<text>,
    gender text,
    gender_extensions list<text>,
    generalpractitioner list<text>,
    identifier list<frozen<identifier>>,
    implicitrules text,
    implicitrules_extensions list<text>,
    language text,
    language_extensions list<text>,
    link list<frozen<patient_link>>,
    managingorganization text,
    maritalstatus frozen<codeableconcept>,
    meta frozen<meta>,
    modifierextension list<text>,
    multiplebirthboolean boolean,
    multiplebirthboolean_extensions list<text>,
    multiplebirthinteger int,
    multiplebirthinteger_extensions list<text>,
    name list<frozen<humanname>>,
    photo list<frozen<attachment>>,
    telecom list<frozen<contactpoint>>,
    text_ frozen<narrative>
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
    AND comment = ''
    AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
    AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1.0
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99PERCENTILE';

并且codeableconcept

CREATE TYPE keyspace.codeableconcept (
    extension list<text>,
    text_ text,
    text__extensions list<text>,
    id text,
    coding list<frozen<coding>>
);
qvtsj1bj

qvtsj1bj1#

using pyspark sql你可以使用.“”查询嵌套数据,对于数组使用explode函数。

(1,不正确)

如果是,请输入以下命令:

vh0rcniy

vh0rcniy2#

您可以使用[name]来访问列中的嵌套值,也可以使用.getItem函数来执行相同的操作。列可以显式 Package 为col,也可以使用语法dataframe[column_name]来代替。如果我们使用如下示例数据:

from pyspark.sql.functions import col

rdd = sc.parallelize([('{"maritalstatus": {"id": "some_id", "text_": "some_text", "text__extensions": ["1", "2"], "extension": ["e1", "e2"]}}')])
df = spark.read.json(rdd)

那么所有三个呼叫:

df.select(df['maritalstatus']['text_']).show()
df.select(col('maritalstatus')['text_']).show()
df.select(col('maritalstatus').getItem('text_')).show()

会产生同样的结果:

+-------------------+
|maritalstatus.text_|
+-------------------+
|          some_text|
+-------------------+

相关问题