Format of the output files when Flink uses a sink table

mhd8tkvw, posted 2022-12-09 in Apache

When I use the Table API to create the sink table and submit the job, the files in S3 are named like this:

part-2db289e0-e70a-48d4-ac11-3e75372f621d-1-179

I wonder what this format means. To my knowledge, it follows the pattern below, and I would like to confirm whether that is correct:

part-<job_id>-<partition_id>-[numOfcommit]

If it is correct, there are some questions I would like to ask.
I have set the commit time using the variable sink.rolling-policy.check-interval = 1min. Does the numOfCommit part of the output file name mean that every time the commit time is reached, the current file is closed and gets that number? If so, what happens when the data is quite large and needs more than the commit time: will it be written to another file, and if so, what is the format of those files?
One more question: how can we control the size of the output files, since what the doc recommends is adjusting the commit time?
Thanks all


8yparm6h1#

The documentation for the DataStream FileSink connector describes in detail how the underlying filesystem connector works.
The default naming scheme is:
In progress / pending: part-<uid>-<partFileIndex>.inprogress.uid
Finished: part-<uid>-<partFileIndex>
The uid is a random ID assigned to a subtask when the subtask of the sink is instantiated. This uid is not fault-tolerant, so it is regenerated when the subtask recovers from a failure.
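Matching the observed S3 object name against this scheme can be sketched with a short parser. Note the assumptions here: that the uid is UUID-shaped, and that the two trailing numbers are the subtask index and the part-file counter; this is an interpretation of the observed name, not an official specification.

```python
import re

# Assumed layout: part-<uuid uid>-<subtaskIndex>-<partFileIndex>
# The meaning of the two trailing numbers is inferred from the
# observed file name, not confirmed by the Flink documentation.
PART_FILE = re.compile(
    r"^part-"
    r"(?P<uid>[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12})"
    r"-(?P<subtask>\d+)-(?P<index>\d+)$"
)

def parse_part_file(name: str):
    """Split a finished part-file name into (uid, subtask index, file index)."""
    m = PART_FILE.match(name)
    if m is None:
        raise ValueError(f"not a recognized part-file name: {name}")
    return m.group("uid"), int(m.group("subtask")), int(m.group("index"))
```

Applied to the name from the question, `parse_part_file("part-2db289e0-e70a-48d4-ac11-3e75372f621d-1-179")` splits it into the uid `2db289e0-e70a-48d4-ac11-3e75372f621d` and the counters 1 and 179.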
If you use the DataStream API, you can customize the bucket assigner and the rolling policy, but if you use the SQL/Table API you can only use the options described in its documentation.
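Regarding the file-size question: the SQL/Table filesystem connector does expose rolling-policy options that bound the part-file size directly, not just the check interval. A sketch of a DDL using them (the table name, schema, and S3 path are placeholders; the `sink.rolling-policy.*` option names come from the Flink filesystem connector documentation):

```sql
CREATE TABLE s3_sink (
  id BIGINT,
  payload STRING
) WITH (
  'connector' = 'filesystem',
  'path' = 's3://my-bucket/output/',  -- placeholder path
  'format' = 'json',
  -- roll to a new part file once the current one exceeds this size
  'sink.rolling-policy.file-size' = '128MB',
  -- roll a part file that has been open for at least this long
  'sink.rolling-policy.rollover-interval' = '30 min',
  -- how often the rolling conditions above are checked
  'sink.rolling-policy.check-interval' = '1 min'
);
```

With this configuration, each new rolled file gets a fresh partFileIndex in its name, which matches the growing trailing number you observed.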
