Spark DataFrame output compression (gzip) with password protection

llmtgqce asked on 2021-05-27 · Spark

Using the code below, I can compress the output and save it as a .gz file:

import spark.implicits._

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF.coalesce(1)
  .write.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("example.csv.gz")

Does Spark provide an option to protect the compressed data with a password? I could not find one in the Spark documentation.


cczfrluj1#

You can create a new codec that first compresses the file and then encrypts it. The idea is to wrap the codec's output stream with a CipherOutputStream before it is written to the file system.

import java.io.{IOException, OutputStream}

import javax.crypto.{Cipher, CipherOutputStream}
import javax.crypto.spec.SecretKeySpec
import org.apache.hadoop.io.compress._

class GzipEncryptionCodec extends GzipCodec {

  override def getDefaultExtension(): String = ".gz.enc"

  @throws[IOException]
  override def createOutputStream(out: OutputStream): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out))

  @throws[IOException]
  override def createOutputStream(out: OutputStream, compressor: Compressor): CompressionOutputStream =
    super.createOutputStream(wrapWithCipherStream(out), compressor)

  def wrapWithCipherStream(out: OutputStream): OutputStream = {
    val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding") //or another algorithm
    val secretKey = new SecretKeySpec(
      "hello world 1234".getBytes, //this is not a secure password!
      "AES")
    cipher.init(Cipher.ENCRYPT_MODE, secretKey)
    return new CipherOutputStream(out, cipher)
  }
}

You can use this codec when writing the CSV files:

df.write
  .option("codec","GzipEncryptionCodec")
  .mode(SaveMode.Overwrite).csv("encryped_csv")
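
One caveat not spelled out in the answer: Spark resolves the codec option by class name, so the bare GzipEncryptionCodec works here only because the class above has no package declaration, and the compiled class must be available on the driver and executor classpath. If you place the class in a package (com.example.codecs below is a hypothetical name), pass the fully qualified name instead:

df.write
  .option("codec", "com.example.codecs.GzipEncryptionCodec") // hypothetical package name
  .mode(SaveMode.Overwrite)
  .csv("encryped_csv")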

The output files will be encrypted and get the suffix .gz.enc.
This codec only encrypts the data; it cannot decrypt it. Some background on why changing a codec is harder for reading than for writing can be found here.
Instead, you can read and decrypt the files with a simple Scala program:

import java.io.{File, FileInputStream}
import java.util.zip.GZIPInputStream
import javax.crypto.{Cipher, CipherInputStream}
import javax.crypto.spec.SecretKeySpec

val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
val secretKey = new SecretKeySpec("hello world 1234".getBytes(), "AES")
cipher.init(Cipher.DECRYPT_MODE, secretKey)

val files = new File("encryped_csv").listFiles.filter(_.getName().endsWith(".gz.enc")).toList

files.foreach(f => {
  // decrypt first, then decompress, reversing the order used when writing
  val dec = new CipherInputStream(new FileInputStream(f), cipher)
  val gz = new GZIPInputStream(dec)
  val result = scala.io.Source.fromInputStream(gz).mkString
  println(f.getName)
  println(result)
})
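
If you need the data back in Spark rather than on the console, one option (a minimal sketch, not part of the original answer) is to decrypt each .gz.enc file into a plain .gz file first, so that Spark's built-in gzip handling can read it again:

import java.io.{File, FileInputStream, FileOutputStream}
import javax.crypto.{Cipher, CipherInputStream}
import javax.crypto.spec.SecretKeySpec

val cipher = Cipher.getInstance("AES/ECB/PKCS5Padding")
cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec("hello world 1234".getBytes, "AES"))

new File("encryped_csv").listFiles
  .filter(_.getName.endsWith(".gz.enc"))
  .foreach { f =>
    val in  = new CipherInputStream(new FileInputStream(f), cipher)
    val out = new FileOutputStream(f.getPath.stripSuffix(".enc")) // drop only the .enc suffix
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(out.write(buf, 0, _))
    in.close(); out.close()
  }

// the resulting plain .gz files can then be read back as usual
val restored = spark.read.option("header", "true").csv("encryped_csv/*.gz")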

m2xkgtsf2#

Gzip itself does not support password protection. On Unix you would need to use other tools to encrypt the file with a password.
P.S. Also, replace com.databricks.spark.csv with just csv; Spark has had built-in CSV support for a long time. And remove the corresponding Maven dependency.
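
For example, the question's write could be expressed with the built-in CSV source (a sketch using the same path and options as the question), and the resulting .gz file could then be encrypted afterwards with an external tool such as gpg -c or zip -e:

someDF.coalesce(1)
  .write
  .format("csv") // built-in since Spark 2.0, no spark-csv dependency needed
  .option("header", "true")
  .option("codec", "org.apache.hadoop.io.compress.GzipCodec")
  .save("example.csv.gz")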
