How do I populate a Java Apache Arrow Field that is a List of Structs?

Posted by iovurdzv on 2023-04-04 in Java


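I have a dataset that is mostly a 2D table, but one column (Field), called attributes, contains a List of Structs in each cell. Each Struct has three Fields: an attribute tag, an attribute type, and an attribute value.
The attribute Field is defined as: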

/**
 * Attribute Tag - Two character tag.
 */
public static final Field ATTRIBUTE_TAG_FIELD =
        new Field("AttributeTag", FieldType.notNullable(new ArrowType.FixedSizeBinary(2)), null);

/**
 * Attribute Type - One character type.
 */
// todo this could be dictionary encoded but would require building a
// dictionary which requires access to the allocator
public static final Field ATTRIBUTE_TYPE_FIELD =
        new Field(
                "AttributeType",
                new FieldType(false,
                new ArrowType.FixedSizeBinary(1), null),
                null
        );

/**
 * String representation of the Attribute value.
 */
public static final Field ATTRIBUTE_VALUE_FIELD = new Field("AttributeValue", FieldType.notNullable(new ArrowType.Utf8()), null);

/**
 * The field is a nullable List of Structs each with an attribute tag,
 * type and value.
 */
public static final Field ATTRIBUTES_FIELD =
        new Field("Attributes", FieldType.nullable(new ArrowType.List()), List.of(
                new Field("Attribute", FieldType.nullable(new ArrowType.Struct()), List.of(
                        ATTRIBUTE_TAG_FIELD, ATTRIBUTE_TYPE_FIELD, ATTRIBUTE_VALUE_FIELD))));
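
To make the shape of this field concrete, here is a rough plain-Java model of how a list-of-structs column is laid out in Arrow's columnar format: an offsets array with one more entry than there are rows, plus one child column per struct field. This is only a sketch and not the Arrow API; `ListOfStructsModel`, `Attr`, and `appendRow` are hypothetical names, and the tag values are made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java model (not the Arrow API) of a List<Struct<tag, type, value>> column. */
public class ListOfStructsModel {

    /** One struct: attribute tag, attribute type, attribute value. */
    public static final class Attr {
        final String tag;
        final String type;
        final String value;

        public Attr(String tag, String type, String value) {
            this.tag = tag;
            this.type = type;
            this.value = value;
        }
    }

    // offsets.get(i) .. offsets.get(i + 1) delimit the structs that belong to row i
    public final List<Integer> offsets = new ArrayList<>(List.of(0));
    // one child column per struct field; all three always have the same length
    public final List<String> tags = new ArrayList<>();
    public final List<String> types = new ArrayList<>();
    public final List<String> values = new ArrayList<>();

    /** Appends one row whose cell holds attrs.size() structs. */
    public void appendRow(List<Attr> attrs) {
        for (Attr a : attrs) {
            tags.add(a.tag);
            types.add(a.type);
            values.add(a.value);
        }
        // closing the row records where its structs end in the child columns
        offsets.add(tags.size());
    }

    public int rowCount() {
        return offsets.size() - 1;
    }

    public int structCount(int row) {
        return offsets.get(row + 1) - offsets.get(row);
    }

    public static void main(String[] args) {
        ListOfStructsModel column = new ListOfStructsModel();
        column.appendRow(List.of(new Attr("NM", "i", "0"), new Attr("RG", "Z", "grp1")));
        column.appendRow(List.of()); // a row with an empty attribute list
        System.out.println(column.offsets);        // [0, 2, 2]
        System.out.println(column.structCount(0)); // 2
    }
}
```

In this model a row index selects a whole list, and every struct added before the row is closed lands in that same list. That is the distinction to keep in mind when deciding what a writer position should correspond to.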

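I have the following code, which tries to populate the attributes from some source data. Although it runs without errors, it does not produce any values in the attributes vector.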

final ListVector attributes = (ListVector) ATTRIBUTES_FIELD.createVector(allocator);

// this is the source of the attributes that I will populate into the
// attributes vector
final List<SAMRecord.SAMTagAndValue> recordAttributes = samRecord.getAttributes();

if (recordAttributes != null && recordAttributes.size() > 0 ) {
    final UnionListWriter listWriter = attributes.getWriter();
    listWriter.allocate();

    IntStream.range(0, recordAttributes.size()).forEachOrdered(attributeIndex -> {
        listWriter.setPosition(attributeIndex);
        listWriter.startList();

        // put the values of the attribute in the arrow struct
        final SAMRecord.SAMTagAndValue samTagAndValue = recordAttributes.get(attributeIndex);

        // I think the problem is here. In a debugger this seems to create a new writer not related to my Vector??
        final BaseWriter.StructWriter structWriter = listWriter.struct("Attribute");
        structWriter.start();

        final byte[] tagBytes =
            samTagAndValue.tag.getBytes(StandardCharsets.UTF_8);
        // todo find out the type from the value
        final byte[] typeBytes = "S".getBytes(StandardCharsets.UTF_8);
        final byte[] valueBytes =
            samTagAndValue.value.toString().getBytes(StandardCharsets.UTF_8);

        ArrowBuf tempBuf = allocator.buffer(tagBytes.length);
        tempBuf.setBytes(0, tagBytes);
        structWriter.varChar("AttributeTag").writeVarChar(0, tagBytes.length, tempBuf);
        tempBuf.close();

        tempBuf = allocator.buffer(typeBytes.length);
        structWriter.varChar("AttributeType").writeVarChar(0, typeBytes.length, tempBuf);
        tempBuf.close();

        tempBuf = allocator.buffer(valueBytes.length);
        structWriter.varChar("AttributeValue").writeVarChar(0, valueBytes.length, tempBuf);
        tempBuf.close();

        structWriter.end();
    });

    listWriter.setValueCount(recordAttributes.size());
    listWriter.end();
}

Why are there no values in the attributes ListVector? What is the correct way to do this?

Answer #1 (rxztt3cl)

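It looks like the problem may be related to how the list writer is used. When you call listWriter.struct("Attribute"), it creates a new struct writer instance that is not tied to the vector's struct field. Instead, you should use listWriter.struct() to get the struct writer instance associated with the vector's struct field.
Here is how the code can be modified to address this: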

final ListVector attributes = (ListVector) ATTRIBUTES_FIELD.createVector(allocator);

// this is the source of the attributes that I will populate into the attributes vector
final List<SAMRecord.SAMTagAndValue> recordAttributes = samRecord.getAttributes();

if (recordAttributes != null && recordAttributes.size() > 0 ) {
    final UnionListWriter listWriter = attributes.getWriter();
    listWriter.allocate();

    IntStream.range(0, recordAttributes.size()).forEachOrdered(attributeIndex -> {
        listWriter.setPosition(attributeIndex);
        listWriter.startList();

        // put the values of the attribute in the arrow struct
        final SAMRecord.SAMTagAndValue samTagAndValue = recordAttributes.get(attributeIndex);

        final BaseWriter.StructWriter structWriter = listWriter.struct();
        structWriter.start();

        final byte[] tagBytes = samTagAndValue.tag.getBytes(StandardCharsets.UTF_8);
        final byte[] typeBytes = "S".getBytes(StandardCharsets.UTF_8);
        final byte[] valueBytes = samTagAndValue.value.toString().getBytes(StandardCharsets.UTF_8);

        ArrowBuf tempBuf = allocator.buffer(tagBytes.length);
        tempBuf.setBytes(0, tagBytes);
        structWriter.varChar("AttributeTag").writeVarChar(0, tagBytes.length, tempBuf);
        tempBuf.close();

        tempBuf = allocator.buffer(typeBytes.length);
        tempBuf.setBytes(0, typeBytes);
        structWriter.varChar("AttributeType").writeVarChar(0, typeBytes.length, tempBuf);
        tempBuf.close();

        tempBuf = allocator.buffer(valueBytes.length);
        tempBuf.setBytes(0, valueBytes);
        structWriter.varChar("AttributeValue").writeVarChar(0, valueBytes.length, tempBuf);
        tempBuf.close();

        structWriter.end();

        listWriter.endList();
    });

    listWriter.setValueCount(recordAttributes.size());
    listWriter.end();
}

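In this modified code, listWriter.struct() is used to get the struct writer instance associated with the vector's struct field. The rest of the code is similar to the original, with two further changes: each ArrowBuf is now filled with the matching byte array via setBytes before it is written, and listWriter.endList() is called to close each list.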
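
One related thing that may be worth checking: AttributeTag and AttributeType are declared as FixedSizeBinary(2) and FixedSizeBinary(1), and in a fixed-size binary column each value occupies exactly that many bytes. A small guard along these lines could catch a mismatched tag before it reaches the writer. This is just a sketch; `FixedWidthBytes` and `encode` are hypothetical names and not part of Arrow.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: encode text and insist on an exact byte width. */
public class FixedWidthBytes {

    /** Encodes text as UTF-8 and rejects it unless it is exactly byteWidth bytes long. */
    public static byte[] encode(String text, int byteWidth) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        if (bytes.length != byteWidth) {
            throw new IllegalArgumentException(
                    "expected " + byteWidth + " bytes but got " + bytes.length
                            + " for \"" + text + "\"");
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println(encode("NM", 2).length); // 2
        System.out.println(encode("S", 1).length);  // 1
    }
}
```

With something like this, tagBytes could be produced by encode(samTagAndValue.tag, 2) and typeBytes by encode("S", 1), so a tag of the wrong width fails loudly instead of being written as-is.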
