MapReduce: how to allow the mapper to read an XML file for lookup

Posted by s1ag04yj on 2021-06-02 in Hadoop


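In a MapReduce job, I pass a product name to the mapper as a string argument. The Mapper.py script imports a helper script called Process.py, which does some work on the product name and returns some emit strings to the mapper. The mapper then emits these strings to the Hadoop framework so that the reducer can pick them up. Everything works fine except for the following:
The Process.py script contains a dictionary of lookup values that I want to move out of the script and into an XML file so that it is easier to update. I have tested this locally, and it works fine if I include the Windows path to the XML file in the Process.py script. However, the same approach fails when I test it in the Hadoop MapReduce environment.
I have tried specifying the HDFS path to the XML document within the Process.py script, and I have tried adding the name of the XML document as a -file argument in the MapReduce job command, but neither worked.
For example, within Process.py I have tried:
xml_file = r'appers@hdfs.network.com:/nfs_home/appers/cnielsen/product_lookups.xml'
and
xml_file = r'/nfs_home/appers/cnielsen/product_lookups.xml'
In the MapReduce command, I included the name of the XML file as a -file argument. For example:
... -file product_lookups.xml -reducer ...
The question is: in the MapReduce environment, how do I allow the Process.py script to read an XML document that is stored on HDFS?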

hadoop mapreduce python

Source: https://stackoverflow.com/questions/34362331/mapreduce-how-to-allow-mapper-to-read-an-xml-file-for-lookup

2 Answers

0x6upsns1#

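Here is an end-to-end example that adapts the technique from the earlier question below so that it fits your problem more closely.
Python read file as stream from HDFS
This is a small Python Hadoop Streaming application that reads key-value pairs, checks each key against an XML configuration file stored in HDFS, and emits the value only if the key matches the configuration. The matching logic is off-loaded to a separate Process.py module, which reads the XML configuration file from HDFS by making an external call to hdfs dfs -cat.
First, we create a directory named pythonapp containing the Python source files for the implementation. As we will see later when we submit the streaming job, we pass this directory in the -files argument.
Why put the files in an intermediate directory instead of listing each file separately in the -files argument? When YARN localizes the files for execution in containers, it introduces a layer of symlink indirection, and Python cannot load the module correctly through the symlink. The solution is to package both files into the same directory. Then, when YARN localizes the files, the symlink indirection happens at the directory level rather than for the individual files. Since the main script and the module are physically in the same directory, Python can load the module correctly. This question explains the issue in more detail:
How to import a custom module in a MapReduce job?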

Mapper.py

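import sys
from Process import match

# Each input line holds a whitespace-separated key and value.
for line in sys.stdin:
    key, value = line.split()
    # Emit the value only when the key matches the XML configuration.
    if match(key):
        print(value)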

Process.py

import subprocess
import xml.etree.ElementTree as ElementTree

# Stream the XML configuration out of HDFS with an external "hdfs dfs -cat" call.
hdfsCatProcess = subprocess.Popen(
        ['hdfs', 'dfs', '-cat', '/pythonAppConf.xml'],
        stdout=subprocess.PIPE)
# Parse the XML directly from the child process's stdout pipe.
pythonAppConfXmlTree = ElementTree.parse(hdfsCatProcess.stdout)
matchString = pythonAppConfXmlTree.find('./matchString').text.strip()

def match(key):
    return key == matchString
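
The module above does not check whether hdfs dfs -cat actually succeeded, so a missing file surfaces only as an XML parse error. As a rough sketch (not part of the original answer), the call could be wrapped in a helper that inspects the exit status. The function name read_hdfs_xml and its cat_command parameter are hypothetical and exist only to make the example testable without a cluster.

```python
import subprocess
import xml.etree.ElementTree as ElementTree


def read_hdfs_xml(hdfs_path, cat_command=("hdfs", "dfs", "-cat")):
    """Return the root element of the XML printed by cat_command for hdfs_path.

    read_hdfs_xml and cat_command are hypothetical names used for illustration.
    """
    proc = subprocess.Popen(list(cat_command) + [hdfs_path],
                            stdout=subprocess.PIPE)
    # communicate() reads all of stdout and waits for the child to exit.
    xml_bytes, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("%s exited with status %d"
                           % (" ".join(cat_command), proc.returncode))
    return ElementTree.fromstring(xml_bytes)
```

In Process.py this might be used as matchString = read_hdfs_xml('/pythonAppConf.xml').find('./matchString').text.strip(), assuming the hdfs command is on the PATH of the task container.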

Next, we put 2 files into HDFS. /testData is the input file, containing tab-delimited key-value pairs. /pythonAppConf.xml is the XML file, where we can configure a specific key to match.

/testData

foo 1
bar 2
baz 3

/pythonAppConf.xml

<pythonAppConf>
    <matchString>foo</matchString>
</pythonAppConf>

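Since we have set matchString to foo, and since our input file contains only a single record with its key set to foo, we expect the output of the job to be a single line containing the value that corresponds to key foo, which is 1. Taking it for a test run, we do get the expected result.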

> hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.reduces=0 \
      -files pythonapp \
      -input /testData \
      -output /streamingOut \
      -mapper 'python pythonapp/Mapper.py'

> hdfs dfs -cat /streamingOut/part*
1

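Another way to do this is to specify the HDFS file in the -files argument. This way, YARN pulls the XML file as a localized resource to the individual nodes running the containers before the Python script launches. The Python code can then open the XML file as if it were a local file in the working directory. For very large jobs running multiple tasks/containers, this technique is likely to outperform calling hdfs dfs -cat from each task.
To test this technique, we can try a different version of the Process.py module.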

Process.py

import xml.etree.ElementTree as ElementTree

pythonAppConfXmlTree = ElementTree.parse('pythonAppConf.xml')
matchString = pythonAppConfXmlTree.find('./matchString').text.strip()

def match(key):
    return key == matchString
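
The question asks about a whole dictionary of lookup values rather than a single match string. A minimal sketch of how a localized XML file could be turned into a Python dict is shown below. The XML layout (entry elements with a key attribute) and the function name load_lookups are assumptions made for illustration; the real product_lookups.xml schema is not shown in the question.

```python
import xml.etree.ElementTree as ElementTree


def load_lookups(xml_file):
    """Build a dict from a lookup XML file.

    Assumed (hypothetical) layout:
        <lookups>
            <entry key="widget">W-100</entry>
        </lookups>
    """
    root = ElementTree.parse(xml_file).getroot()
    lookups = {}
    for entry in root.findall("./entry"):
        # Empty elements have text None, so fall back to an empty string.
        lookups[entry.get("key")] = (entry.text or "").strip()
    return lookups
```

Process.py could then call load_lookups('product_lookups.xml') once at import time, using only the bare file name because the file is localized into the task's working directory.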

The command line invocation changes to specify an HDFS path in -files, and once again, we see the expected result.

> hadoop jar share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -D mapreduce.job.reduces=0 \
      -files pythonapp,hdfs:///pythonAppConf.xml \
      -input /testData \
      -output /streamingOut \
      -mapper 'python pythonapp/Mapper.py'

> hdfs dfs -cat /streamingOut/part*
1

The Apache Hadoop documentation discusses using the -files option to pull HDFS files locally here:
http://hadoop.apache.org/docs/r2.7.1/hadoop-streaming/hadoopstreaming.html#working_with_large_files_and_archives


nkhmeac62#

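Thanks to Chris Nauroth for the answer above. With this post I want to summarize exactly what solved my problem.
The second approach he provided was very close to what I was originally trying to do. I found that I only needed to make a few small changes to get it working. For example, in the Process.py script, I had previously tried to include a full path to the small lookup XML, like this:
xml_file = r'appers@hdfs.network.com:/nfs_home/appers/cnielsen/product_lookups.xml'
and
xml_file = r'/nfs_home/appers/cnielsen/product_lookups.xml'
It turned out that I only needed to provide the file name in the Process.py script, without the path. For example:
xml_file = 'product_lookups.xml'
Then, for the actual Hadoop command, I had previously tried this, without success (using -file product_lookups.xml after the -mapper listing):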

> hadoop jar /share/hadoop/tools/lib/hadoop-streaming.jar \
  -file /nfs_home/appers/cnielsen/Mapper.py \
  -file /nfs_home/appers/cnielsen/Reducer.py \
  -mapper '/usr/lib/python_2.7.3/bin/python Mapper.py ProductName' \
  -file Process.py \
  -file product_lookups.xml \
  -reducer '/usr/lib/python_2.7.3/bin/python Reducer.py' \
  -input /nfs_home/appers/extracts/*/*.xml \
  -output /user/lcmsprod/output/cnielsen/test47

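The correct way to construct the Hadoop command turned out to be using -files and listing the lookup file before any other file listings. For example, this worked: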

> hadoop jar /share/hadoop/tools/lib/hadoop-streaming.jar \
  -files /nfs_home/appers/cnielsen/product_lookups.xml \
  -file /nfs_home/appers/cnielsen/Mapper.py \
  -file /nfs_home/appers/cnielsen/Reducer.py \
  -mapper '/usr/lib/python_2.7.3/bin/python Mapper.py ProductName' \
  -file Process.py \
  -reducer '/usr/lib/python_2.7.3/bin/python Reducer.py' \
  -input /nfs_home/appers/extracts/*/*.xml \
  -output /user/lcmsprod/output/cnielsen/test47

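Note: even though this page says to construct the -files argument like this: -files hdfs://host:fs_port/user/testfile.txt it did not work for me if I included the hdfs:// or host: portion, as can be seen from the actual command listed above.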
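
When a lookup file fails to show up in the task, it can help to log what the working directory actually contains. The following is a rough diagnostic sketch, not something from the answers above; the helper name require_localized is hypothetical. It writes to stderr, which Hadoop Streaming captures in the task logs instead of treating it as mapper output.

```python
import os
import sys


def require_localized(file_name):
    """Return file_name if it exists in the current working directory.

    require_localized is a hypothetical helper. On failure it logs the
    directory contents to stderr and raises IOError.
    """
    if os.path.isfile(file_name):
        return file_name
    sys.stderr.write("missing %s in %s; directory contains: %s\n"
                     % (file_name, os.getcwd(), sorted(os.listdir("."))))
    raise IOError("localized file not found: %s" % file_name)
```

For example, Process.py could set xml_file = require_localized('product_lookups.xml') before parsing, so that a misconfigured -files argument produces a readable message in the task log.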
