Files not put correctly into distributed cache

Posted by vbopmzt1 on 2021-06-03 in Hadoop

I am adding a file to the distributed cache using the following code:

Configuration conf2 = new Configuration();      
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

Then I read the file in the mapper:

protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();

    // Take the first entry registered in the distributed cache and open it through the FileSystem API
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));

    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            s = line.split("\t");
            // do stuff to s
        }
    } finally {
        joinReader.close();
    }
}

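The problem is that I only read one line, and it is not from the file I put into the cache. Instead it is: cm9vdA==, which is "root" in base64.
Has anyone else run into this problem, or can anyone see how I am using the distributed cache incorrectly? I am using Hadoop 0.20.2 in fully distributed mode.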
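
As a quick sanity check on that value, the string can be decoded with the standard java.util.Base64 class. This is only a minimal sketch: it assumes Java 8 or later (newer than the JVMs usually paired with Hadoop 0.20.2), and the class name Base64Check is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper class, used only to check what the unexpected line decodes to.
class Base64Check {
    // Decodes a Base64 string and interprets the resulting bytes as UTF-8 text.
    static String decode(String encoded) {
        byte[] raw = Base64.getDecoder().decode(encoded);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("cm9vdA==")); // prints "root"
    }
}
```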

hadoop distributed-cache

Source: https://stackoverflow.com/questions/12708947/files-not-put-correctly-into-distributed-cache

1 Answer

r7knjye21#

This is a common mistake in job configuration:

Configuration conf2 = new Configuration();      
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);

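After creating the Job object, you need to pull the Configuration object back out of it, because Job makes a copy of the configuration it is given. Values set on conf2 after the job has been created have no effect on the job itself. Try this: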

job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
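
The copy behavior described above can be illustrated without a cluster. The sketch below uses hypothetical stand-in classes (ToyConf, ToyJob) and a made-up key "cache.files". They are not the Hadoop classes and only mimic the "constructor copies the configuration" behavior, so treat this as a model of the pitfall rather than of Hadoop's internals.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a configuration object (NOT org.apache.hadoop.conf.Configuration).
class ToyConf {
    private final Map<String, String> values = new HashMap<>();

    ToyConf() {}

    // Copy constructor: takes a snapshot of the other configuration's values.
    ToyConf(ToyConf other) {
        values.putAll(other.values);
    }

    void set(String key, String value) { values.put(key, value); }

    String get(String key) { return values.get(key); }
}

// Hypothetical stand-in for a job that copies the configuration it is constructed with.
class ToyJob {
    private final ToyConf conf;

    ToyJob(ToyConf conf) {
        this.conf = new ToyConf(conf); // the job keeps its own private copy
    }

    ToyConf getConfiguration() { return conf; }
}

class CopySemanticsDemo {
    public static void main(String[] args) {
        ToyConf conf2 = new ToyConf();
        ToyJob job = new ToyJob(conf2);

        // Too late: the job already copied conf2, so this change never reaches it.
        conf2.set("cache.files", "part-r-00000");
        System.out.println(job.getConfiguration().get("cache.files")); // prints "null"

        // Setting the value on the job's own configuration does take effect.
        job.getConfiguration().set("cache.files", "part-r-00000");
        System.out.println(job.getConfiguration().get("cache.files")); // prints "part-r-00000"
    }
}
```

In this model, the only way to influence the job after construction is through the object returned by getConfiguration(), which mirrors the fix shown above.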

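You should also check how many files are in the distributed cache. There may be more than one, in which case you could be opening an arbitrary file, which would explain the value you are seeing.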
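
One way to act on that check is to select the cache entry by file name instead of always taking index 0. The helper below is a minimal sketch over plain java.net.URI values. The class and method names (CacheFilePicker, findByName) and the namenode:8020 address are hypothetical. In a real mapper the array would come from DistributedCache.getCacheFiles(conf), and the null handling here assumes that array can be absent when nothing was registered.

```java
import java.net.URI;

// Hypothetical helper: picks a cache entry by its file name rather than by position.
class CacheFilePicker {
    // Returns the first URI whose path ends with "/" + fileName, or null if there is no match.
    static URI findByName(URI[] cacheFiles, String fileName) {
        if (cacheFiles == null) {
            return null; // nothing was registered in the cache
        }
        for (URI uri : cacheFiles) {
            String path = uri.getPath();
            if (path != null && path.endsWith("/" + fileName)) {
                return uri;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Illustrative entries only; the host and port are placeholders.
        URI[] cacheFiles = {
            URI.create("hdfs://namenode:8020/tmp/other.txt"),
            URI.create("hdfs://namenode:8020/FilePath/part-r-00000")
        };
        System.out.println("entries in cache: " + cacheFiles.length);
        System.out.println("selected: " + findByName(cacheFiles, "part-r-00000"));
    }
}
```

Logging the number of entries and the selected URI in setup makes it obvious when the mapper is reading a file other than the one that was intended.
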
I suggest using symlinks, which make the file available in the local working directory under a known name:

DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);

// then in your mapper setup (BufferedReader wraps a Reader, so use FileReader rather than FileInputStream):
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));

Answered on 2021-06-03