When I run an append or delete on a Delta table, it creates an unnecessary, empty Parquet file.
data = [["Alyssa", "maroon", [8,8,8]]]
df = spark.createDataFrame(data, "name string, favorite_color string, favorite_numbers array<int>")
df.write.format("delta").mode("append").save("/users")
Here is the log file generated for appending a single record to the Delta table:
{"commitInfo":{"timestamp":1592247860686,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":2,"isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputBytes":"1611","numOutputRows":"1"}}}
{"add":{"path":"part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet","partitionValues":{},"size":548,"modificationTime":1592247860655,"dataChange":true}}
{"add":{"path":"part-00007-494b0d4f-36f7-4c3c-a46f-058865d36113-c000.snappy.parquet","partitionValues":{},"size":1063,"modificationTime":1592247860680,"dataChange":true}}
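For anyone who wants to inspect a commit like this one, the entry can be summarized with plain Python. This is only a minimal sketch: `summarize_commit` is a hypothetical helper written for illustration (it is not part of the Delta Lake API), and the lines are pasted in by hand rather than read from the table's `_delta_log` directory.

```python
import json

# Commit log lines copied from the question above (one JSON action per line).
commit_lines = [
    '{"commitInfo":{"timestamp":1592247860686,"operation":"WRITE","operationParameters":{"mode":"Append","partitionBy":"[]"},"readVersion":2,"isBlindAppend":true,"operationMetrics":{"numFiles":"2","numOutputBytes":"1611","numOutputRows":"1"}}}',
    '{"add":{"path":"part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet","partitionValues":{},"size":548,"modificationTime":1592247860655,"dataChange":true}}',
    '{"add":{"path":"part-00007-494b0d4f-36f7-4c3c-a46f-058865d36113-c000.snappy.parquet","partitionValues":{},"size":1063,"modificationTime":1592247860680,"dataChange":true}}',
]


def summarize_commit(lines):
    """Hypothetical helper: return (operation metrics, [(path, size), ...])."""
    metrics = {}
    added = []
    for line in lines:
        action = json.loads(line)
        if "commitInfo" in action:
            # Metrics are stored as strings in the commit log.
            metrics = action["commitInfo"].get("operationMetrics", {})
        elif "add" in action:
            added.append((action["add"]["path"], action["add"]["size"]))
    return metrics, added


metrics, added = summarize_commit(commit_lines)
print(metrics)
for path, size in added:
    print(size, path)
```

The commit reports two files for one output row, and the two `add` sizes (548 and 1063 bytes) add up to the reported `numOutputBytes` of 1611.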
The unnecessary file, part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet, contains no rows:
spark.read.format("parquet")\
.load("/users/part-00000-5a63f209-d88f-4453-9e7b-c7b2318160c7-c000.snappy.parquet").show()
+----+--------------+----------------+
|name|favorite_color|favorite_numbers|
+----+--------------+----------------+
+----+--------------+----------------+
Is this intentional behavior in Delta, or is it a bug?
I tried this code in the following environment:
Scala version 2.11.12
Spark version 2.4.5
Delta package "io.delta:delta-core_2.11:0.6.1"
localhost, open-source Delta Lake