Can we create a Parquet table from an existing one?

vlurs2pr · posted 2021-06-24 in Hive

I have a table with a large amount of data. I need to extract a few columns from it and create a Parquet table. Here is what I did -

CREATE TABLE parqfiletable
AS
SELECT col2, col4, col8 FROM txtfiletable
STORED AS PARQUET;

This succeeded at the time, but when I tried to retrieve data from the Parquet table, it threw an error.
Also, the same CREATE TABLE statement no longer works now.
I could do this the usual way, i.e. create a table definition stored as Parquet and then load the data from the original source. But the statement above did work immediately back then, and the table was created.
Can you help me understand what is wrong with the CREATE TABLE statement? Can't we do a "CREATE TABLE ... AS SELECT ..." stored as Parquet?

btqmn9zl 1#

I figured out the correct syntax for the above problem:

CREATE TABLE parqtable STORED AS PARQUET AS SELECT * FROM textfiletable;
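Applied to the columns from the original question, a minimal sketch would look like this (table and column names taken from the question, not verified here). The key point is that the STORED AS PARQUET clause has to come before AS SELECT in a CTAS statement:

-- CTAS: the storage clause precedes AS SELECT
CREATE TABLE parqfiletable
STORED AS PARQUET
AS
SELECT col2, col4, col8 FROM txtfiletable;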

ckx4rj1h 2#

CREATE TABLE parqtab
AS
SELECT col2 FROM txtfiletable
AS PARQUET;

This way the table does get created, but according to the extended properties it is not a Parquet table: the trailing AS PARQUET is not interpreted as a storage clause (it likely just ends up being read as a table alias), so the table falls back to the default text format.

Detailed Table Information  Table(tableName:parqtab, dbName:dbtest, owner:cloudera, createTime:1606338549, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:col2, type:string, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/dbtest.db/parqtab, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{numFiles=1, transient_lastDdlTime=1606338550, COLUMN_STATS_ACCURATE=true, totalSize=68, numRows=3, rawDataSize=65}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)

Observations:

inputFormat:org.apache.hadoop.mapred.TextInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false,
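As a side note, the Detailed Table Information dump above is the kind of output DESCRIBE EXTENDED prints; DESCRIBE FORMATTED shows the same storage information in a more readable layout. (Both are standard Hive commands, but which one was actually used above is not shown.)

-- Either of these shows the table's InputFormat, OutputFormat and SerDe
DESCRIBE EXTENDED parqtab;
DESCRIBE FORMATTED parqtab;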

I created a Parquet table directly:

create table parqtab2(id int) stored as parquet;

The details are here:

Detailed Table Information  Table(tableName:parqtab2, dbName:dbtest, owner:cloudera, createTime:1606338696, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:id, type:int, comment:null)], location:hdfs://quickstart.cloudera:8020/user/hive/warehouse/dbtest.db/parqtab2, inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe, parameters:{serialization.format=1}), bucketCols:[], sortCols:[], parameters:{}, skewedInfo:SkewedInfo(skewedColNames:[], skewedColValues:[], skewedColValueLocationMaps:{}), storedAsSubDirectories:false), partitionKeys:[], parameters:{transient_lastDdlTime=1606338696}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)

Observations from the extended properties:

inputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat,
outputFormat:org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
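So besides the corrected CTAS in the first answer, the "normal way" mentioned in the question - define the Parquet table first, then load it from the original source - would look roughly like this (the table name parqtab3 and the STRING column types are assumptions for illustration, not from the original post):

-- Define the target table with Parquet storage up front
CREATE TABLE parqtab3 (col2 STRING, col4 STRING, col8 STRING)
STORED AS PARQUET;

-- Load it from the original text-format table
INSERT OVERWRITE TABLE parqtab3
SELECT col2, col4, col8 FROM txtfiletable;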
