Reading and saving image file in PySpark

e3bfsja2 posted on 2023-04-07 in Apache

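I have a requirement to read images from an S3 bucket and convert them to base64-encoded format.
I am able to read the image file from S3, but when I pass the S3 file path to my base64 method, it does not recognize the path.
So I thought I would save the image DataFrame (the same image) to a temporary path on the cluster and then pass that path to the base64 method.
But while saving the image DataFrame I get the error below. (*Initially I tried to save the image DataFrame with the "image" format, but from a Google search I found that there is a bug with that format, and someone suggested using the format below.*)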

java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat.

Please see my sample code below, and let me know where I can find the dependent package.

spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', '************')

import base64

def getImageStr(img):
    # open() only understands local file paths, not s3a:// URIs
    with open(img, "rb") as imageFile:
        str1 = base64.b64encode(imageFile.read())
        str2 = str(str1, 'utf-8')
    return str2

img_df = spark.read\
  .format("image")\
  .load("s3a://xxx/yyy/zzz/hello.jpg")

img_df.printSchema()

img_df.write\
    .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")\
    .save("/tmp/sample.jpg")

img_str = getImageStr("/tmp/sample.jpg")

print(img_str)

Please also let me know if there is any other way to download an image file from S3 in Spark (*without using the boto3 package*).

apache-spark

Source: https://stackoverflow.com/questions/65240716/reading-and-saving-image-file-in-pyspark

1 Answer

h5qlskok1#

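When you use the image data source, you get a DataFrame with an image column that holds a binary payload: image.data contains the actual image. You can then encode that column with the built-in function base64 and write the encoded representation to a file. Something like this (not tested):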

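from pyspark.sql.functions import base64, col

# Read the image; the "image" data source exposes the payload in image.data
img_df = spark.read.format("image").load("s3a://xxx/yyy/zzz/hello.jpg")
# Base64-encode the binary column
proc_df = img_df.select(base64(col("image.data")).alias('encoded'))
# Write the encoded text as a single part file (the save path becomes a directory)
proc_df.coalesce(1).write.format("text").save('/tmp/sample.jpg')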
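One caveat worth hedging: as far as I know, the image data source decodes the file, so image.data holds raw pixel bytes (OpenCV-style, usually BGR) rather than the original JPEG bytes. If the goal is a base64 string of the file itself, a possible alternative is the binaryFile data source, which exposes the undecoded bytes in a content column. The sketch below assumes Spark 3.0 or later and S3A credentials already set on the session. The function names image_bytes_to_base64 and read_images_as_base64 are illustrative, not part of any library, and the Spark part is untested here.

```python
import base64


def image_bytes_to_base64(data: bytes) -> str:
    """Encode raw file bytes as a base64 string; no local file path is needed."""
    return base64.b64encode(data).decode("utf-8")


def read_images_as_base64(spark, path):
    """Return a DataFrame with each file's path and base64-encoded contents.

    Assumption: Spark 3.0+ (binaryFile data source) and a session that can
    already read the given s3a:// path.
    """
    # Imported lazily so the pure-Python helper above works without PySpark.
    from pyspark.sql.functions import base64 as spark_base64, col

    files_df = spark.read.format("binaryFile").load(path)
    # "content" holds the undecoded bytes of each file.
    return files_df.select("path", spark_base64(col("content")).alias("encoded"))
```

With a live session, read_images_as_base64(spark, "s3a://xxx/yyy/zzz/hello.jpg").first()["encoded"] should give the encoded string on the driver, with no temporary file and no boto3. If the bytes have already been collected from the content column, image_bytes_to_base64 does the same job in plain Python.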
