How can we read from a Minio cluster and write to another Minio cluster in Spark?

Posted by dldeef67 on 2023-10-23 in Apache

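I have a use case where the input data lives in one Minio cluster. We need to read and transform that data, then write it to another Minio cluster, and we have to do this with Spark. How can we achieve this?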

Source: https://stackoverflow.com/questions/77142184/how-can-we-read-from-a-minio-cluster-and-write-to-another-minio-cluster-in-spark

1 Answer

1szpjjfi1#

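If you use hadoop-aws, you can read from and write to Minio with the s3a:// protocol. You should be able to set a different endpoint, credentials, and so on for each individual bucket using these properties: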

spark.hadoop.fs.s3a.bucket.<bucket>.endpoint
spark.hadoop.fs.s3a.bucket.<bucket>.aws.credentials.provider
spark.hadoop.fs.s3a.bucket.<bucket>.access.key
spark.hadoop.fs.s3a.bucket.<bucket>.secret.key
spark.hadoop.fs.s3a.bucket.<bucket>.path.style.access

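So, assuming you have a Minio server https://minio1.com with bucket dataIn and another server https://minio2.com with bucket dataOut, you can set the following configuration (for example in spark-defaults.conf, via the --conf argument of spark-submit, or directly on the SparkConf object in your code):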

spark.hadoop.fs.s3a.bucket.dataIn.endpoint                  https://minio1.com
spark.hadoop.fs.s3a.bucket.dataIn.aws.credentials.provider  org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.dataIn.access.key                ACCESS_KEY_1
spark.hadoop.fs.s3a.bucket.dataIn.secret.key                SECRET_KEY_1
spark.hadoop.fs.s3a.bucket.dataIn.path.style.access         true

spark.hadoop.fs.s3a.bucket.dataOut.endpoint                  https://minio2.com
spark.hadoop.fs.s3a.bucket.dataOut.aws.credentials.provider  org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
spark.hadoop.fs.s3a.bucket.dataOut.access.key                ACCESS_KEY_2
spark.hadoop.fs.s3a.bucket.dataOut.secret.key                SECRET_KEY_2
spark.hadoop.fs.s3a.bucket.dataOut.path.style.access         true
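For the last of those options, setting the properties in code, the following is a minimal sketch rather than a definitive implementation. It sets the same keys on the SparkSession builder instead of a SparkConf object. The object name MinioSessionSketch, the helper bucketConf, the application name, and the environment variable names (MINIO1_ACCESS_KEY and so on) are all assumptions made for illustration; only the property keys, endpoints, and bucket names come from the configuration above.

```scala
import org.apache.spark.sql.SparkSession

object MinioSessionSketch {

  // Hypothetical helper: builds the per-bucket S3A properties for one Minio bucket.
  def bucketConf(bucket: String,
                 endpoint: String,
                 accessKey: String,
                 secretKey: String): Map[String, String] = {
    val prefix = s"spark.hadoop.fs.s3a.bucket.$bucket"
    Map(
      s"$prefix.endpoint" -> endpoint,
      s"$prefix.aws.credentials.provider" ->
        "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider",
      s"$prefix.access.key" -> accessKey,
      s"$prefix.secret.key" -> secretKey,
      s"$prefix.path.style.access" -> "true"
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumption: credentials are supplied through environment variables.
    // sys.env(...) throws NoSuchElementException if a variable is not set.
    val settings =
      bucketConf("dataIn", "https://minio1.com",
        sys.env("MINIO1_ACCESS_KEY"), sys.env("MINIO1_SECRET_KEY")) ++
      bucketConf("dataOut", "https://minio2.com",
        sys.env("MINIO2_ACCESS_KEY"), sys.env("MINIO2_SECRET_KEY"))

    // Apply every property to the builder before the session is created.
    val spark = settings
      .foldLeft(SparkSession.builder().appName("minio-to-minio")) {
        case (builder, (key, value)) => builder.config(key, value)
      }
      .getOrCreate()

    // Reads from s3a://dataIn/... and writes to s3a://dataOut/... go here.

    spark.stop()
  }
}
```

Reading the keys from the environment keeps the secrets out of the source code; whether that suits your deployment is a separate decision.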

Then, in your application, simply transfer the data as follows:

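// Read the input data from the bucket on the first Minio cluster
val documents = spark.read.parquet("s3a://dataIn/path/to/data")

// Apply your transformations here
val transformed = documents.select(...)

// Write the result to the bucket on the second Minio cluster
transformed.write.parquet("s3a://dataOut/path/to/target")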

Answered 2023-10-23
