How to apply the schema of an existing DataFrame to another DataFrame with missing columns in PySpark

Posted by xwbd5t1u on 2022-12-26 in Spark


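I have a JSON file loaded into a DataFrame df_1 that contains struct and array columns nested at several levels. I also have a smaller DataFrame df_2 with fewer columns. Its column names match some of the field names in df_1, and it has no nested structures.
I want to apply the schema of df_1 to df_2 so that both DataFrames share the same schema, reusing the existing columns of df_2 wherever possible and creating the columns and nested structures that exist in df_1 but not in df_2.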
df_1

root
 |-- association_info: struct (nullable = true)
 |    |-- ancestry: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- doi: string (nullable = true)
 |    |-- gwas_catalog_id: string (nullable = true)
 |    |-- neg_log_pval: double (nullable = true)
 |    |-- study_id: string (nullable = true)
 |    |-- pubmed_id: string (nullable = true)
 |    |-- url: string (nullable = true)
 |-- gold_standard_info: struct (nullable = true)
 |    |-- evidence: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- class: string (nullable = true)
 |    |    |    |-- confidence: string (nullable = true)
 |    |    |    |-- curated_by: string (nullable = true)
 |    |    |    |-- description: string (nullable = true)
 |    |    |    |-- pubmed_id: string (nullable = true)
 |    |    |    |-- source: string (nullable = true)
 |    |-- gene_id: string (nullable = true)
 |    |-- highest_confidence: string (nullable = true)

df_2

root
 |-- study_id: string (nullable = true)
 |-- description: string (nullable = true)
 |-- gene_id: string (nullable = true)

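The expected output has the same schema as df_1, with every column that does not exist in df_2 filled with null.
I have tried fully flattening the structure of df_1 so that the two DataFrames can be joined, but I am not sure how to change the result back to the original schema. Everything I have tried so far has been in PySpark. PySpark is preferable for performance reasons, but a solution that requires converting to a Pandas DataFrame would also be acceptable.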

pyspark

Source: https://stackoverflow.com/questions/74814293/how-to-apply-a-schema-from-an-existing-dataframe-to-another-dataframe-with-missi

1 Answer


okxuctiv1#

# evidence is an array of structs, so its fields are addressed without ".element"
df_1.select('association_info.study_id',
            'gold_standard_info.evidence.description',
            'gold_standard_info.gene_id')

The code above reaches into df_1 and selects the fields that df_2 needs. The schema stays unchanged.
Could you give this a try?
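For the opposite direction that the question asks about (reshaping df_2 to look like df_1), one possible approach is to walk the schema of df_1 and generate one SQL expression per top-level column. The following is only a sketch under several assumptions: the helper names `ddl`, `expr_for` and `conform_exprs` are made up for this illustration, the input is the dictionary returned by `df_1.schema.jsonValue()`, fields are matched to df_2 columns purely by leaf name, and anything inside an array (such as `evidence`) is left null rather than populated. Map types are not handled.

```python
def ddl(dtype):
    """Render a schema-JSON type as a Spark SQL type string (hypothetical helper)."""
    if isinstance(dtype, str):  # atomic type such as "string" or "double"
        return dtype
    if dtype["type"] == "array":
        return f"array<{ddl(dtype['elementType'])}>"
    if dtype["type"] == "struct":
        inner = ",".join(f"`{f['name']}`:{ddl(f['type'])}" for f in dtype["fields"])
        return f"struct<{inner}>"
    raise ValueError(f"unsupported type: {dtype['type']}")


def expr_for(field, available):
    """Build the SQL expression for one field; `available` holds df_2's column names."""
    name, dtype = field["name"], field["type"]
    if isinstance(dtype, dict) and dtype["type"] == "struct":
        # Rebuild the struct field by field so matching leaves can be reused.
        parts = ", ".join(
            f"'{child['name']}', {expr_for(child, available)}"
            for child in dtype["fields"]
        )
        return f"named_struct({parts})"
    if name in available and isinstance(dtype, str):
        # Assumption: a flat column with the same leaf name is the intended source.
        return f"CAST(`{name}` AS {ddl(dtype)})"
    # Missing column, or a field nested inside an array: fill with a typed null.
    return f"CAST(NULL AS {ddl(dtype)})"


def conform_exprs(schema_json, available):
    """Return expressions suitable for DataFrame.selectExpr, one per top-level column."""
    return [
        f"{expr_for(f, available)} AS `{f['name']}`" for f in schema_json["fields"]
    ]


# Trimmed-down stand-in for df_1.schema.jsonValue(), for illustration only.
schema_json = {
    "type": "struct",
    "fields": [
        {"name": "association_info", "nullable": True, "metadata": {}, "type": {
            "type": "struct",
            "fields": [
                {"name": "ancestry", "nullable": True, "metadata": {},
                 "type": {"type": "array", "elementType": "string",
                          "containsNull": True}},
                {"name": "study_id", "type": "string", "nullable": True,
                 "metadata": {}},
            ]}},
    ],
}
exprs = conform_exprs(schema_json, {"study_id", "description", "gene_id"})
print(exprs[0])
```

With a live Spark session, the generated list could then be used as `df_2.selectExpr(*conform_exprs(df_1.schema.jsonValue(), set(df_2.columns)))`. Two caveats: a leaf name that occurs in more than one place in df_1 (`pubmed_id` here) would be filled from the same df_2 column everywhere, and the nullable flags of the rebuilt structs may differ from those in df_1 even though the field names and types line up.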

Answered on 2022-12-26
