Java — how do I query an embedded database stored in HDFS from a MapReduce job?

nukf8bse · asked 2021-05-29 · in Hadoop

I am trying to query the GeoLite database from a Hadoop MapReduce mapper in order to resolve an IP address to its country. I have tried two approaches:
1. Using File, which only works on the local file system; here I get a "file not found" exception:

File database = new File("hdfs://localhost:9000/input/GeoLite2-City.mmdb"); // <<< HERE
DatabaseReader reader = new DatabaseReader.Builder(database).build();

2. Using a stream, but at runtime I get this error:
Error: Java heap space

Path pt = new Path("hdfs://localhost:9000/input/GeoLite2-City.mmdb");
FileSystem fs = FileSystem.get(new Configuration());

FSDataInputStream stream = fs.open(pt);
DatabaseReader reader = new DatabaseReader.Builder(stream).build();

InetAddress ipAddress = InetAddress.getByName(address.getHostAddress());
CityResponse response = null;
try {
    response = reader.city(ipAddress);
} catch (GeoIp2Exception ex) {
    ex.printStackTrace();
    return;
}

My question is: how can I query the GeoLite database from a mapper in Hadoop?


3lxsmp7m · Answer 1#

I solved this with the distributed-cache approach, caching the GeoLite database file onto every mapper of the MapReduce job (a driver-side sketch for registering the file is shown after the mapper code below).

// Mapper fields: the local cache paths and the shared GeoIP2 reader.
private Path[] cachefiles;
private DatabaseReader reader;

@Override
public void setup(Context context) {
    Configuration conf = context.getConfiguration();
    try {
        // Local paths of the files registered with the distributed cache.
        cachefiles = DistributedCache.getLocalCacheFiles(conf);

        // The .mmdb file is now on the local file system, so File works here.
        File database = new File(cachefiles[0].toString());
        reader = new DatabaseReader.Builder(database).build();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
public void map(Object key, Text line, Context context)
        throws IOException, InterruptedException {

    .....

    InetAddress ipAddress = InetAddress.getByName(address.getHostAddress());
    CityResponse response = null;
    try {
        response = reader.city(ipAddress);
    } catch (GeoIp2Exception ex) {
        ex.printStackTrace();
        return;
    }

    ......
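
For completeness, here is a minimal driver-side sketch of how the .mmdb file could be registered with the distributed cache so that getLocalCacheFiles() in setup() can find it. The class names (GeoIpJobDriver, GeoIpMapper) and the output key/value types are assumptions for illustration, not part of the original post; the HDFS URI is taken from the question.

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GeoIpJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Register the GeoLite2 file with the distributed cache before creating
    // the Job (Job copies the Configuration). Each mapper then receives a
    // local copy, retrieved in setup() via getLocalCacheFiles().
    DistributedCache.addCacheFile(
        new URI("hdfs://localhost:9000/input/GeoLite2-City.mmdb"), conf);

    Job job = Job.getInstance(conf, "geoip lookup");
    job.setJarByClass(GeoIpJobDriver.class);
    job.setMapperClass(GeoIpMapper.class); // hypothetical mapper class name
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

On newer Hadoop versions the deprecated DistributedCache calls can be replaced by job.addCacheFile(uri) in the driver and context.getCacheFiles() in the mapper's setup().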
