Filtering input files using globStatus in MapReduce

Posted by rqenqsqc on 2021-06-04 in Hadoop


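I have a lot of input files, and I want to process only selected files based on the date appended at the end of their names. I'm not sure where to use the globStatus method to filter the files.
I have a custom RecordReader class and tried to use globStatus in its next method, but it didn't work.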

public boolean next(Text key, Text value) throws IOException {
    Path filePath = fileSplit.getPath();

    if (!processed) {
        key.set(filePath.getName());

        byte[] contents = new byte[(int) fileSplit.getLength()];
        value.clear();
        FileSystem fs = filePath.getFileSystem(conf);
        fs.globStatus(new Path("/*" + date));
        FSDataInputStream in = null;

        try {
            in = fs.open(filePath);
            IOUtils.readFully(in, contents, 0, contents.length);
            value.set(contents, 0, contents.length);
        } finally {
            IOUtils.closeStream(in);
        }
        processed = true;
        return true;
    }
    return false;
}

I know it returns an array of FileStatus, but how do I use it to filter the files? Can someone shed some light on this?

Java hadoop mapreduce cloudera

Source: https://stackoverflow.com/questions/14332330/filtering-input-files-using-globstatus-in-mapreduce

2 Answers


d7v8vwbk1#

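The `globStatus` method takes two complementary arguments that let you filter your files. The first is a glob pattern, but sometimes glob patterns are not powerful enough to select specific files, in which case you can define a `PathFilter`.
Glob patterns support the following: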

Glob   | Matches
-------------------------------------------------------------------------------------------------------------------
*      | Matches zero or more characters
?      | Matches a single character
[ab]   | Matches a single character in the set {a, b}
[^ab]  | Matches a single character not in the set {a, b}
[a-b]  | Matches a single character in the range [a, b] where a is lexicographically less than or equal to b
[^a-b] | Matches a single character not in the range [a, b] where a is lexicographically less than or equal to b
{a,b}  | Matches either expression a or b
\c     | Matches character c when it is a metacharacter
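To get a feel for how such patterns select names by a date suffix, here is a minimal local sketch. It assumes the JDK's `java.nio.file.PathMatcher` as a stand-in for Hadoop's glob matching; the two syntaxes are similar but not identical (the JDK negates a set with `[!ab]` rather than `[^ab]`), so treat it only as a way to experiment with `*`, `?` and `{a,b}`. The class name `GlobSketch` and the file names are made up for illustration.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

// Local stand-in for trying out glob patterns; this is NOT Hadoop's matcher.
final class GlobSketch {
    // Returns true when the bare file name matches the JDK glob pattern.
    static boolean matches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        String date = "2013-01-15"; // illustrative value, as in conf.set("date", ...)
        System.out.println(matches("*" + date, "events-2013-01-15")); // name ends with the date
        System.out.println(matches("*" + date, "events-2013-01-14")); // different date
        System.out.println(matches("part-{a,b}", "part-b"));          // alternation
        System.out.println(matches("part-?", "part-10"));             // ? is exactly one character
    }
}
```

In Hadoop itself the equivalent pattern would be passed as a `Path` to `globStatus`, for example an input directory followed by `*` and the date.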
`PathFilter` is simply an interface like this:

public interface PathFilter {
    boolean accept(Path path);
}

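So you can implement this interface and put your file-filtering logic in its `accept` method.
An example from Tom White's excellent book defines a `PathFilter` that excludes paths matching a given regular expression: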

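import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexExcludePathFilter implements PathFilter {
    private final String regex;

    public RegexExcludePathFilter(String regex) {
        this.regex = regex;
    }

    // Accept only paths that do NOT match the regular expression.
    @Override
    public boolean accept(Path path) {
        return !path.toString().matches(regex);
    }
}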

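You can filter your input directly with a `PathFilter` implementation by calling `FileInputFormat.setInputPathFilter(JobConf, RegexExcludePathFilter.class)` when initializing your job.
Edit: Since you have to pass the class to `setInputPathFilter`, you cannot pass constructor arguments directly, but you should be able to do something similar through the `Configuration`. If you make your filter also extend `Configured`, it receives the `Configuration` object that you populated beforehand, so you can read those values inside the filter and use them in `accept`.
For example, if you initialize the job like this: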

conf.set("date", "2013-01-15");

Then you can define your filter like this:

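import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

public class RegexIncludePathFilter extends Configured implements PathFilter {
    private String date;
    private FileSystem fs;

    @Override
    public boolean accept(Path path) {
        try {
            // Accept directories so that their contents are filtered too.
            if (fs.isDirectory(path)) {
                return true;
            }
        } catch (IOException e) {
            // Fall through and decide on the path name alone.
        }
        return path.toString().endsWith(date);
    }

    @Override
    public void setConf(Configuration conf) {
        super.setConf(conf);
        // Configured's no-arg constructor calls setConf(null), hence the check.
        if (null != conf) {
            this.date = conf.get("date");
            try {
                this.fs = FileSystem.get(conf);
            } catch (IOException e) {
                throw new RuntimeException("Could not obtain FileSystem", e);
            }
        }
    }
}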

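Edit 2: The original code had a few problems; please see the updated class. You also need to remove the constructor, since it is no longer used, and check whether the path is a directory, in which case you should return true so that the directory's contents can be filtered as well.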


yiytaume2#

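To anyone reading this: please don't do anything more complex in a filter than validate the path. Specifically, do not check whether files are directories, fetch their sizes, and so on. Wait until the list/glob operation has returned, and then do the filtering there, using the information in the now-populated `FileStatus` entries.
Why? All those calls to `getFileStatus()`, made directly or via `isDirectory()`, are unnecessary calls to the filesystem, and they add needless NameNode load on an HDFS cluster. More critically, against S3 and other object stores each operation may issue multiple HTTPS requests, and those take measurable time. Worse still, if S3 thinks you are making too many requests across your whole cluster of machines, it will throttle you. You don't want that.
If you wait until after the call, the file status entries you get back come from the object store's list command, which usually returns thousands of file entries per HTTPS request, so it is far more efficient.
For more details, take a look at `org.apache.hadoop.fs.s3a.S3AFileSystem`.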
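The "filter after the listing returns" advice can be sketched as follows. This is a minimal, Hadoop-free illustration: `ListedEntry` is a hypothetical stand-in for `org.apache.hadoop.fs.FileStatus`, holding only the fields used here, and `PostListingFilter` is an invented name. In real code the entries would be the array returned by `fs.globStatus(...)` or `fs.listStatus(...)`, and `getPath()`, `isDirectory()` and `getLen()` on each `FileStatus` would supply the same information without further filesystem calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for FileStatus: just the already-populated fields.
final class ListedEntry {
    final String path;
    final boolean directory;
    final long length;

    ListedEntry(String path, boolean directory, long length) {
        this.path = path;
        this.directory = directory;
        this.length = length;
    }
}

final class PostListingFilter {
    // Keeps plain files whose path ends with the date suffix, using only
    // data already present in the listing (no extra filesystem calls).
    static List<String> selectByDate(List<ListedEntry> listing, String date) {
        List<String> selected = new ArrayList<>();
        for (ListedEntry entry : listing) {
            if (!entry.directory && entry.path.endsWith(date)) {
                selected.add(entry.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Made-up listing for illustration only.
        List<ListedEntry> listing = Arrays.asList(
                new ListedEntry("/in/a-2013-01-15", false, 10L),
                new ListedEntry("/in/b-2013-01-14", false, 5L),
                new ListedEntry("/in/dir-2013-01-15", true, 0L));
        System.out.println(selectByDate(listing, "2013-01-15"));
    }
}
```

The selected paths could then be handed to the job as its input paths, so no per-file check runs inside a `PathFilter`.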

