Apache Spark: how to change the location of a Hudi table in AWS?

wtlkbnrh · posted 2023-06-24 in Apache

Describe the problem you are facing: how do I change the location of a Hudi table to a new one? My Customer table is stored at s3://aws-amazon-com/Customer/ and I want to move it to s3://aws-amazon-com/CustomerUpdated/. I am on Glue 4, using these jars: hudi-spark3-bundle_2.12-0.12.1.jar, calcite-core-1.16.0.jar, libfb303-0.9.3.jar.

val partitionColumnName: String = "year"
val hudiTableName: String = "Customer"
val preCombineKey: String = "id"
val recordKey = "id"
val tablePath = "s3://aws-amazon-com/Customer/"
val databaseName = "consumer_bureau"

// Imports needed by the snippets below (assumes a Glue 4 / Spark Scala session):
import org.apache.hudi.DataSourceWriteOptions
import org.apache.spark.sql.SaveMode
import spark.implicits._ // for Seq(...).toDF

val hudiCommonOptions: Map[String, String] = Map(
    "hoodie.table.name" -> hudiTableName,
    "hoodie.datasource.write.keygenerator.class" -> "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.precombine.field" -> preCombineKey,
    "hoodie.datasource.write.recordkey.field" -> recordKey,
    "hoodie.datasource.write.operation" -> "bulk_insert",
    //"hoodie.datasource.write.operation" -> "upsert",
    "hoodie.datasource.write.row.writer.enable" -> "true",
    "hoodie.datasource.write.reconcile.schema" -> "true",
    "hoodie.datasource.write.partitionpath.field" -> partitionColumnName,
    "hoodie.datasource.write.hive_style_partitioning" -> "true",
    // "hoodie.bulkinsert.shuffle.parallelism" -> "2000",
    //  "hoodie.upsert.shuffle.parallelism" -> "400",
    "hoodie.datasource.hive_sync.enable" -> "true",
    "hoodie.datasource.hive_sync.table" -> hudiTableName,
    "hoodie.datasource.hive_sync.database" -> databaseName,
    "hoodie.datasource.hive_sync.partition_fields" -> partitionColumnName,
    "hoodie.datasource.hive_sync.partition_extractor_class" -> "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.use_jdbc" -> "false",
    "hoodie.combine.before.upsert" -> "true",
    "hoodie.index.type" -> "BLOOM",
    "spark.hadoop.parquet.avro.write-old-list-structure" -> "false",
    DataSourceWriteOptions.TABLE_TYPE.key() -> "COPY_ON_WRITE"
  )
  
  
val df = Seq((1, "Mark", 1990), (2, "Martin", 2009)).toDF("id", "name", "year")
  
  
df.write.format("org.apache.hudi")
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablePath) // was save(tablelocation), an undefined variable
    
val tablelocationUpdated = "s3://eec-aws-uk-ukidcibatchanalytics-prod-hudi-replication/consumer_bureau/production/CustomerUpdated/"
   

df.write.format("org.apache.hudi") // writing to the new location
  .options(hudiCommonOptions)
  .mode(SaveMode.Append)
  .save(tablelocationUpdated)


When I query in Athena, the customer table still points to s3://aws-amazon-com/Customer/ instead of the expected updated location s3://aws-amazon-com/CustomerUpdated/. Changing the table location could presumably be done with AWS Glue or AWS Lambda.
Please help.

dgsult0t · answer 1

Yes, you can change the Hudi table's location, but you also need to manually update the table's location path in Glue (for example via the AWS console or the AWS SDK). Hive sync will not update the location on its own.
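One way to do that Glue-side update from the command line is a get/edit/put round trip, since Glue's UpdateTable API replaces the whole table definition. This is a sketch assuming the database/table names from the question and that jq is available; the `aws` calls themselves are shown as comments because they need live credentials:

```shell
# Typical pattern:
#   aws glue get-table --database-name consumer_bureau --name customer \
#       --query 'Table' > table.json
#   ...edit table.json (below)...
#   aws glue update-table --database-name consumer_bureau \
#       --table-input file://table-input.json
#
# A sample of what get-table returns (fields abridged for illustration):
cat > table.json <<'EOF'
{"Name": "customer",
 "StorageDescriptor": {"Location": "s3://aws-amazon-com/Customer/"},
 "PartitionKeys": [{"Name": "year", "Type": "string"}],
 "TableType": "EXTERNAL_TABLE",
 "Parameters": {}}
EOF

# UpdateTable rejects the read-only fields get-table includes, so keep only
# the TableInput-compatible fields and swap the S3 location in one jq pass.
jq '{Name, StorageDescriptor, PartitionKeys, TableType, Parameters}
    | .StorageDescriptor.Location = "s3://aws-amazon-com/CustomerUpdated/"' \
  table.json > table-input.json
```

After `update-table` succeeds, Athena picks up the new location immediately; the data itself still has to be copied or written to the new prefix separately, as in the question's second `df.write`.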

mwecs4sa · answer 2

spark.sql(s"""alter table customer set location 's3://aws-amazon-com/CustomerUpdated/'""")

will change the table location of the Hudi table.
