pig udf-将动态模式作为一组字段(而不是元组)返回

bcs8qyzn 于 2021-06-25 发布在 Pig

关注(0)|答案(1)|浏览(245)

在按+展平分组之后，我有一个命名空间为的数据：

DESCRIBE users;
users: {user_id: int, group_id: int, registration_timestamp: int}

users_with_namespace = FOREACH (GROUP users BY group_id) {
    first_to_latest = ORDER users BY registration_timestamp ASC;
    first_user = LIMIT first_to_latest 1;
    GENERATE FLATTEN(first_user);
};

DESCRIBE users_with_namespace;
users_with_namespace: {first_user::user_id: int, first_user::group_id: int, first_user::registration_timestamp: int}

我想做一些事情，比如：

users = myudf.strip_namespace(users_with_namespace);

或者（因为，这似乎是不可能的）：

users = FOREACH (GROUP users_with_namespaceALL) 
GENERATE myudf.strip_namespace(users_with_namespace);

结果是：

> DESCRIBE users;
users: {user_id: int, registration_timestamp: int}

我已经编写了一个jython pig udf，它应该去掉任何名称空间的字段名，但是我似乎无法从我的udf返回一组字段。只有一个包/元组/单个字段是可能的，这给我留下了这样的结果：

DESCRIBE users;
users: {t: (user_id: int, registration_timestamp: int)}

有没有办法省略“t”并返回一个字段列表/字段集？我的自定义项如下所示：

@outputSchemaFunction("tupleSchema")
def strip_namespace(input):
    return input

@schemaFunction("tupleSchema")
def tupleSchema(input):
    fields = []
    dt = []
    for i in input.getField(0).schema.getFields():
        for field in i.schema.getFields():
            fields.append(field.alias.split("::")[-1])
            dt.append(field.type)
    return SchemaUtil.newTupleSchema(fields, dt)

到目前为止我已经用了

FOREACH .. GENERATE namespace::field as field, ...

剥离名称空间，但这种方法对于具有许多字段的数据集来说确实很乏味。

user-defined-functions apache-pig jython

来源：https://stackoverflow.com/questions/32064887/pig-udf-return-dynamic-schema-as-a-set-of-fields-not-tuple

1条答案

按热度按时间

xmjla07d1#

不幸的是你不能，至少现在不行。问题正是您所说的：现在您只能返回一个元组、一个包或单个字段。我在2个月前创建了一个jira问题，允许返回这个场景的多个字段，但是还没有回复。。。
我真的希望他们在将来实现这一点，因为当您必须执行许多连接时，您最终会得到更多的连接 FOREACH 语句来重命名字段，而不是实际代码。

赞(0）回复(0）举报 2021-06-26

我来回答

pig udf-将动态模式作为一组字段(而不是元组)返回

1条答案

相关问题

热门标签

最新问答