gzip文件如何存储在hdfs中

gzszwxb4 于 2021-05-29 发布在 Hadoop

关注(0)|答案(1)|浏览(597)

hdfs存储支持压缩格式来存储压缩文件。我知道gzip压缩不支持夹板。想象一下，现在这个文件是一个gzip压缩文件，其压缩大小是1gb。现在我的问题是：
如何将此文件存储在hdfs中（块大小为64mb）
通过这个链接，我知道gzip格式使用deflate存储压缩数据，deflate将数据存储为一系列压缩块。
但我不能完全理解它，并寻求广泛的解释。
来自gzip压缩文件的更多疑问：
这个1gb gzip压缩文件将有多少块。
它会在多个datanode上运行吗？
复制因子如何适用于此文件（hadoop集群复制因子为3。）
是什么 DEFLATE 算法？
读取gzip压缩文件时应用哪种算法？
我在这里寻找广泛而详细的解释。

hadoop hdfs Algorithm GZIP compression

来源：https://stackoverflow.com/questions/34953756/how-gzip-file-gets-stored-in-hdfs

1条答案

按热度按时间

qxsslcnc1#

如果zip文件格式不支持拆分，该文件将如何存储在hdfs中（块大小为64mb）？
所有dfs块将存储在单个datanode中。如果块大小为64 mb，文件大小为1 gb，则 Datanode 使用16个dfs块（1 gb/64 mb=15.625）将存储1 gb文件。
这个1gb gzip压缩文件将有多少块。
1 gb/64 mb=15.625~16个dfs块
复制因子如何适用于此文件（hadoop集群复制因子为3。）
与任何其他文件相同。如果文件是可拆分的，则没有更改。如果文件不可拆分，则将标识具有所需块数的数据节点。在本例中，3个datanode具有16个可用的dfs块。
来自此链接的源代码：http://grepcode.com/file_/repo1.maven.org/maven2/com.ning/metrics.action/0.2.7/org/apache/hadoop/hdfs/server/namenode/replicationtargetchooser.java/?v=source
和
http://grepcode.com/file_/repo1.maven.org/maven2/org.apache.hadoop/hadoop-hdfs/0.22.0/org/apache/hadoop/hdfs/server/namenode/blockplacementpolicydefault.java/?v=source

/**The class is responsible for choosing the desired number of targets
 * for placing block replicas.
 * The replica placement strategy is that if the writer is on a datanode,
 * the 1st replica is placed on the local machine, 
 * otherwise a random datanode. The 2nd replica is placed on a datanode
 * that is on a different rack. The 3rd replica is placed on a datanode
 * which is on the same rack as the first replca.
 */

什么是deflate算法？
delate是解压缩gzip格式的压缩文件的算法。
请看这张幻灯片，以了解zip文件不同变体的其他算法。

请看此演示文稿以了解更多详细信息。