Hadoop sequence file collection

Posted by ne5o7dgx on 2021-06-04 in Hadoop

How can a reducer (with a Text key and an Iterable of MapWritable values) write all of its maps to a sequence file so that the grouping by key is preserved? For example, suppose the mappers send records to the reducer like this:

<"dog", {<"name", "Fido">, <"pure bred?", "false">, <"type", "mutt">}>
<"cat", {<"name", "Felix">, <"color", "black">, <"origin", "film">, <"date", "1919">}>
<"dog", {<"name", "Lassie">, <"type", "collie">, <"origin", "short story">}>

I would like the sequence file to be written as:

key = "dog"
value =  {
            {<"name", "Fido">, <"pure bred?", "false">, <"type", "mutt">},
            {<"name", "Lassie">, <"type", "collie">, <"origin", "short story">}
         }

key = "cat"
value = {
            {<"name", "Felix">, <"color", "black">, <"origin", "film">, <"date", "1919">}
        }

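I'm guessing I need to create a custom output value class that implements Writable, but I'm not sure how to do that, because as far as I know collections don't really work with sequence files. I want to do this so that the next map/reduce stage reads all of the maps associated with each key as a single unit.
TIA,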

hadoop sequencefile

Source: https://stackoverflow.com/questions/20205336/hadoop-sequence-file-collection

1 answer

scyqe7ek1#

As you noted, you can create a custom Writable that extends ArrayWritable:

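import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.MapWritable;

public class MapWritableArray extends ArrayWritable {
    // The no-argument constructor lets Hadoop instantiate the class when reading values back.
    public MapWritableArray() {
        super(MapWritable.class);
    }
}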
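
On the concern that collections cannot go into a sequence file: the file stores whatever bytes the value's Writable writes, and ArrayWritable writes an element count followed by each element in turn. The sketch below illustrates that length-prefixed idea in plain Java with no Hadoop dependency. `MapListCodec` and its methods are hypothetical names made up for this illustration, and the byte layout is a simplification, not MapWritable's actual wire format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: a length-prefixed encoding of a list of maps.
public class MapListCodec {

    // Writes the number of maps, then for each map its entry count and its key/value pairs.
    static byte[] write(List<Map<String, String>> maps) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(maps.size());
            for (Map<String, String> map : maps) {
                out.writeInt(map.size());
                for (Map.Entry<String, String> entry : map.entrySet()) {
                    out.writeUTF(entry.getKey());
                    out.writeUTF(entry.getValue());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the counts back first, so the reader knows how many elements follow.
    static List<Map<String, String>> read(byte[] data) {
        List<Map<String, String>> maps = new ArrayList<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int mapCount = in.readInt();
            for (int i = 0; i < mapCount; i++) {
                int entryCount = in.readInt();
                Map<String, String> map = new LinkedHashMap<>();
                for (int j = 0; j < entryCount; j++) {
                    String key = in.readUTF();
                    String value = in.readUTF();
                    map.put(key, value);
                }
                maps.add(map);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return maps;
    }
}
```

Because the counts are written ahead of the data, all maps for one key travel as a single value, which is what lets the next stage read them as one unit.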

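Then, in the reducer, you need to accumulate the Iterable of MapWritable values into an array (remember to copy each value, because the contents of the underlying object change on every iteration). Something like this (completely untested, not compiled, and not optimized):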

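// Assumes the enclosing method is reduce(Text key, Iterable<MapWritable> values, Context context).
MapWritableArray mapWritableArray = new MapWritableArray();
ArrayList<MapWritable> valList = new ArrayList<MapWritable>();
for (MapWritable value : values) {
    // Hadoop reuses the same value object on each iteration, so store a copy.
    // Note the argument order: newInstance takes the class first, then the Configuration.
    MapWritable copy = ReflectionUtils.newInstance(MapWritable.class, context.getConfiguration());
    ReflectionUtils.copy(context.getConfiguration(), value, copy);
    valList.add(copy);
}
mapWritableArray.set(valList.toArray(new MapWritable[0]));
// Emit a single record per key holding all of that key's maps.
context.write(key, mapWritableArray);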
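
To see why the copy matters, here is a minimal sketch in plain Java (no Hadoop dependency) that mimics the reducer's behavior of handing back the same value object on every iteration. `ReuseDemo`, `reusingValues`, and the two collect methods are hypothetical names used only for this illustration; `HashMap` stands in for MapWritable.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical illustration of the object-reuse pitfall when iterating reducer values.
public class ReuseDemo {

    // Returns an Iterable whose iterator refills and returns ONE shared map on every next().
    static Iterable<Map<String, String>> reusingValues(List<Map<String, String>> records) {
        final Map<String, String> shared = new HashMap<>();
        return () -> new Iterator<Map<String, String>>() {
            private int index = 0;

            @Override
            public boolean hasNext() {
                return index < records.size();
            }

            @Override
            public Map<String, String> next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.clear();
                shared.putAll(records.get(index++));
                return shared;
            }
        };
    }

    // Stores references only: every list element ends up being the same shared object.
    static List<Map<String, String>> collectWithoutCopy(Iterable<Map<String, String>> values) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> value : values) {
            out.add(value);
        }
        return out;
    }

    // Stores a fresh copy per iteration, so each element keeps its own contents.
    static List<Map<String, String>> collectWithCopy(Iterable<Map<String, String>> values) {
        List<Map<String, String>> out = new ArrayList<>();
        for (Map<String, String> value : values) {
            out.add(new HashMap<>(value));
        }
        return out;
    }

    // The two "dog" records from the question, reduced to a couple of fields.
    static List<Map<String, String>> dogRecords() {
        Map<String, String> fido = new HashMap<>();
        fido.put("name", "Fido");
        fido.put("type", "mutt");
        Map<String, String> lassie = new HashMap<>();
        lassie.put("name", "Lassie");
        lassie.put("type", "collie");
        List<Map<String, String>> records = new ArrayList<>();
        records.add(fido);
        records.add(lassie);
        return records;
    }

    public static void main(String[] args) {
        List<Map<String, String>> bad = collectWithoutCopy(reusingValues(dogRecords()));
        List<Map<String, String>> good = collectWithCopy(reusingValues(dogRecords()));
        // Without copying, both entries show the last record: Lassie Lassie
        System.out.println(bad.get(0).get("name") + " " + bad.get(1).get("name"));
        // With copying, each entry keeps its own record: Fido Lassie
        System.out.println(good.get(0).get("name") + " " + good.get(1).get("name"));
    }
}
```

In the reducer, the copy made with ReflectionUtils plays the role that `new HashMap<>(value)` plays here.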
