Pig's AvroStorage load strips Unicode characters from the input

Posted by pkwftd7m on 2021-06-21 in Pig

I am using Pig to read Avro files and normalize/transform the data before writing it back. The records in the Avro files have the following schema:

{
  "type" : "record",
  "name" : "KeyValuePair",
  "namespace" : "org.apache.avro.mapreduce",
  "doc" : "A key/value pair",
  "fields" : [ {
    "name" : "key",
    "type" : "string",
    "doc" : "The key"
  }, {
    "name" : "value",
    "type" : {
      "type" : "map",
      "values" : "bytes"
    },
    "doc" : "The value"
  } ]
}

I used the avro-tools command-line utility together with jq to dump the first record as JSON:

$ java -jar avro-tools-1.8.1.jar tojson part-m-00000.avro | ./jq --compact-output 'select(.value.pf_v != null)' | head -n 1 | ./jq .
{
  "key": "some-record-uuid",
  "value": {
    "pf_v": "v1\u0003Basic\u0001slcvdr1rw\u001a\u0004v2\u0003DayWatch\u0001slcva2omi\u001a\u0004v3\u0003Performance\u0001slc1vs1v1w1p1g1i\u0004v4\u0003Fundamentals\u0001snlj1erwi\u001a\u0004v5\u0003My Portfolio\u0001svr1dews1b2b3k1k2\u001a\u0004v0\u00035"
  }
}

I run the following Pig commands:

REGISTER avro-1.8.1.jar
REGISTER json-simple-1.1.1.jar
REGISTER piggybank-0.15.0.jar
REGISTER jackson-core-2.8.6.jar
REGISTER jackson-databind-2.8.6.jar

DEFINE AvroLoader org.apache.pig.piggybank.storage.avro.AvroStorage();
AllRecords = LOAD 'part-m-00000.avro'
    USING AvroLoader()
    AS (key: chararray, value: map[]);

Records = FILTER AllRecords BY value#'pf_v' is not null;

SmallRecords = LIMIT Records 10;
DUMP SmallRecords;

The corresponding record in the output of the last command above looks like this:

...
(some-record-uuid,[pf_v#v03v1Basicslcviv2DayWatchslcva2omiv3Performanceslc1vs1v1w1p1g1i])
...

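As you can see, the Unicode characters (the control characters \u0001, \u0003, \u0004 and \u001a) have been stripped from the pf_v value. They are actually used as delimiters within these values, so I need them in order to fully parse the records into the desired normalized state. The characters are clearly present in the encoded .avro file (as the JSON dump of the file shows). Does anyone know a way to make AvroStorage keep the Unicode characters when it loads the records?
Thank you!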
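
To make the delimiter role concrete, here is a minimal Python 3 sketch of the kind of parsing the question is aiming for. The field layout is an assumption read off the sample value in the JSON dump only: \u0004 between entries, \u0003 between an entry's key and its payload, \u0001 between the payload's name and code, and a trailing \u001a. `parse_pf_v` is a hypothetical helper name, not part of Avro or Pig, and `sample` is an abbreviated copy of the dumped value.

```python
# Assumed delimiter roles, inferred from the sample record only.
ENTRY_SEP = "\u0004"   # between entries
KEY_SEP = "\u0003"     # between an entry's key and its payload
NAME_SEP = "\u0001"    # between name and code inside a payload
TERMINATOR = "\u001a"  # trails some payloads


def parse_pf_v(raw):
    """Split a pf_v string into {key: (name, code)}; code is None if absent."""
    parsed = {}
    for entry in raw.split(ENTRY_SEP):
        if not entry:
            continue
        key, _, payload = entry.partition(KEY_SEP)
        name, sep, code = payload.rstrip(TERMINATOR).partition(NAME_SEP)
        parsed[key] = (name, code if sep else None)
    return parsed


# Abbreviated from the JSON dump of the record.
sample = ("v1\u0003Basic\u0001slcvdr1rw\u001a\u0004"
          "v2\u0003DayWatch\u0001slcva2omi\u001a\u0004"
          "v0\u00035")
result = parse_pf_v(sample)
print(result)
```

If the delimiters are missing, as in the DUMP output, the same function returns one unsplit key with an empty payload, which is why the stripped form cannot be normalized.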
Update: I also performed the same operation with Avro's Python DataFileReader:

import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

reader = DataFileReader(open("part-m-00000.avro", "rb"), DatumReader())

for rec in reader:
    if 'some-record-uuid' in rec['key']:
        print rec
        print '--------------------------------------------'
        break

reader.close()

This prints a dict in which the Unicode characters appear as hex escapes instead (which is better than having them removed entirely):

{u'value': {u'pf_v': 'v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04v2\x03DayWatch\x01slcva2omi\x1a\x04v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a'}, u'key': u'some-record-uuid'}
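
As a cross-check, the following Python 3 sketch compares the views of the data shown in this post. It shows that \x03 in the Python repr is the same code point as \u0003 in the JSON dump, and that deleting the control characters from the Python-read value yields exactly the text printed by DUMP. That is consistent with AvroStorage dropping the characters, but equally with the characters being present and simply not rendered in the console output; the sketch cannot tell those two cases apart. `strip_controls` is a hypothetical helper name.

```python
import unicodedata

# Value as printed by the Python DataFileReader for the record.
python_value = ("v0\x033\x04v1\x03Basic\x01slcvi\x1a\x04"
                "v2\x03DayWatch\x01slcva2omi\x1a\x04"
                "v3\x03Performance\x01slc1vs1v1w1p1g1i\x1a")

# Text shown by Pig's DUMP for the record.
pig_dump_text = ("v03v1Basicslcviv2DayWatchslcva2omi"
                 "v3Performanceslc1vs1v1w1p1g1i")


def strip_controls(text):
    """Drop characters in Unicode category Cc (control characters)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


# Hex escapes and \u escapes denote the same characters.
same_code_point = "\x03" == "\u0003"

# Which control characters occur in the value.
controls = sorted({f"U+{ord(ch):04X}" for ch in python_value
                   if unicodedata.category(ch) == "Cc"})

print(same_code_point)
print(controls)
print(strip_controls(python_value) == pig_dump_text)
```

Because the last comparison holds, checking the length of the loaded value inside Pig (rather than reading the printed text) would be one way to find out whether the delimiters actually survive the load.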
