Is Hadoop SequenceFile binary safe?

Posted by h7wcgrx3 on 2021-06-03 in Hadoop


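I read SequenceFile.java in the hadoop-1.0.4 source code. I found the sync(long) method, which is used to locate a "sync marker" (a 16-byte MD5 generated when the file is created) in a SequenceFile when MapReduce divides the SequenceFile into file splits.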

/**Seek to the next sync mark past a given position.*/
public synchronized void sync(long position) throws IOException {
  if (position+SYNC_SIZE >= end) {
    seek(end);
    return;
  }

  try {
    seek(position+4);                         // skip escape
    in.readFully(syncCheck);
    int syncLen = sync.length;
    for (int i = 0; in.getPos() < end; i++) {
      int j = 0;
      for (; j < syncLen; j++) {
        if (sync[j] != syncCheck[(i+j)%syncLen])
          break;
      }
      if (j == syncLen) {
        in.seek(in.getPos() - SYNC_SIZE);     // position before sync
        return;
      }
      syncCheck[i%syncLen] = in.readByte();
    }
  } catch (ChecksumException e) {             // checksum failure
    handleChecksumException(e);
  }
}

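This code simply scans for a byte sequence identical to the sync marker.
My question is:
Suppose the data in a SequenceFile happens to contain a 16-byte sequence identical to the sync marker. Would the code above mistake those 16 bytes for a sync marker, so that the SequenceFile can no longer be parsed correctly?
I did not find any "escaping" applied to either the data or the sync marker. How can SequenceFile be binary safe? Am I missing something?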
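To make the concern concrete, here is a minimal Java sketch of the kind of scan the quoted method performs. It is not the Hadoop implementation: the class SyncScanSketch, the helpers indexOfMarker and makeMarker, the seed string, and the in-memory byte array standing in for the file are all hypothetical names and assumptions introduced for illustration. It only shows that a plain byte comparison cannot distinguish payload bytes equal to the marker from the marker itself.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative sketch only; not the Hadoop SequenceFile implementation. */
final class SyncScanSketch {

    /** Returns the index of the first occurrence of marker in data at or after from, or -1. */
    static int indexOfMarker(byte[] data, byte[] marker, int from) {
        for (int i = Math.max(from, 0); i + marker.length <= data.length; i++) {
            int j = 0;
            while (j < marker.length && data[i + j] == marker[j]) {
                j++;
            }
            if (j == marker.length) {
                return i; // all marker bytes matched at offset i
            }
        }
        return -1;
    }

    /** Hypothetical marker: the 16-byte MD5 digest of an arbitrary seed string. */
    static byte[] makeMarker(String seed) {
        try {
            return MessageDigest.getInstance("MD5")
                    .digest(seed.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] marker = makeMarker("example-seed");
        byte[] stream = new byte[100]; // zero-filled stand-in for record bytes

        // Payload bytes that happen to equal the marker, at offset 20 ...
        System.arraycopy(marker, 0, stream, 20, marker.length);
        // ... followed by the intended marker at offset 60.
        System.arraycopy(marker, 0, stream, 60, marker.length);

        // The scan stops at the first match, which here is the payload copy.
        System.out.println(indexOfMarker(stream, marker, 0));
        // Only a scan starting past the payload copy reaches the intended marker.
        System.out.println(indexOfMarker(stream, marker, 21));
    }
}
```

In this sketch the first scan stops at the payload copy, which is exactly the false positive described in the question. Whether that can realistically happen depends on how likely a 16-byte match is in real data.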

Java hadoop hdfs sequencefile cloud

Source: https://stackoverflow.com/questions/16251110/is-hadoop-sequencefile-binary-safe

1 Answer


qgelzfjb1#

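Technically, a collision is possible, but in practice the probability is extremely low.
From http://search-hadoop.com/m/vyvra2krg5t1:
The probability of a given random 16-byte string appearing in a petabyte of (uniformly distributed) data is about 10^-23. It is more likely that your data center gets wiped out by a meteorite (http://preshing.com/20110504/hash-collision-probabilities).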
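The order of magnitude of that figure can be checked with a back-of-the-envelope calculation. The sketch below is an illustration, not taken from the quoted thread: it assumes every byte is independent and uniformly distributed (real data is not), takes a petabyte to be 10^15 bytes, and uses a simple union bound, that is, the number of possible start offsets times the chance of a 16-byte match at one offset. The names SyncCollisionEstimate and collisionUpperBound are hypothetical.

```java
/** Rough estimate only, under the uniform-random-bytes assumption stated above. */
final class SyncCollisionEstimate {

    /**
     * Union-bound estimate of the chance that a fixed marker of markerBytes
     * bytes appears somewhere in dataBytes bytes of uniformly random data:
     * (number of start offsets) * 2^-(8 * markerBytes).
     */
    static double collisionUpperBound(double dataBytes, int markerBytes) {
        double perOffset = Math.pow(2.0, -8.0 * markerBytes);
        return dataBytes * perOffset;
    }

    public static void main(String[] args) {
        double onePetabyte = 1e15; // assumption: decimal petabyte
        System.out.printf("%.2e%n", collisionUpperBound(onePetabyte, 16));
    }
}
```

Under these assumptions the estimate comes out at a few times 10^-24, the same ballpark as the figure quoted above. Highly structured or adversarial data would not follow this model.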

