Using AWS Data Pipeline PigActivity

Posted by 5w9g7ksd on 2021-06-21 in Pig


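I am trying to get a simple PigActivity working in Data Pipeline. http://docs.aws.amazon.com/datapipeline/latest/developerguide/dp-object-pigactivity.html#pigactivity
This activity requires the input and output fields. I have set both of them to use an S3DataNode. Both data nodes have a directoryPath pointing to my S3 input and output locations. I originally tried filePath, but got the following error: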

PigActivity requires 'directoryPath' in 'Output' object.

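I am using a custom Pig script, which is also stored in S3.
My question is: how do I reference these input and output paths in the script?
The example in the reference uses the stage field (which can be disabled or enabled). My understanding is that it is used to convert the data into tables. I don't want to do that, because it also requires you to specify a dataFormat field.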

Determines whether staging is enabled and allows your Pig script to have access to the staged-data tables, such as ${INPUT1} and ${OUTPUT1}.

I have disabled staging and tried to access the data in my script as follows:

input = LOAD '$Input';

But I get the following error:

IOException. org.apache.pig.tools.parameters.ParameterSubstitutionException: Undefined parameter : Input

I have also tried:

input = LOAD '${Input}';

But that gives an error as well.
There is an optional scriptVariable field. Do I have to use some kind of mapping here?

amazon-emr amazon-s3 amazon-web-services apache-pig amazon-data-pipeline

Source: https://stackoverflow.com/questions/37974833/using-aws-data-pipeline-pigactivity

2 Answers


Answer 1 (syqv5f0l)

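Just using
LOAD 'URI to your S3'
should work.
Normally this is done as part of staging (table creation), so you don't have to access the URI directly from the script; you just specify it in the S3DataNode.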
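
To make the direct-URI approach concrete, here is a minimal sketch of a script that reads from and writes to S3 without staging. The bucket name, prefixes, comma delimiter, and two-column schema are all hypothetical placeholders, not values from the question; substitute your own.

```pig
-- Hypothetical S3 locations; replace with your own bucket and prefixes.
-- The comma delimiter and the two-column schema are assumptions as well.
records = LOAD 's3://example-bucket/input/' USING PigStorage(',')
          AS (id:chararray, color:chararray);

-- Count rows per color.
by_color = GROUP records BY color;
counts = FOREACH by_color GENERATE group AS color, COUNT(records) AS n;

-- Write the result back to S3 as comma-separated text.
STORE counts INTO 's3://example-bucket/output/' USING PigStorage(',');
```

Note that the paths are hard-coded in the script here, so they are not taken from the S3DataNode objects. STORE also generally fails if the output location already exists, so the output prefix would need to be unique per run.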

Answered 2021-06-21

Answer 2 (o2g1uqev)

Make sure you have set the stage property of the PigActivity to true.
Once I did that, the following script started working for me:

part  = LOAD ${input1} USING PigStorage(',') AS (p_partkey,p_name,p_mfgr,p_category,p_brand1,p_color,p_type,p_size,p_container);
grpd = GROUP part BY p_color;
${output1} = FOREACH grpd GENERATE group, COUNT(part);

Answered 2021-06-21
