How do I properly import a CSV data set using a Kite dataset partitioned schema?

Posted by eqoofvh9 on 2021-06-02 in Hadoop


I am using the publicly available MovieLens CSV data set. I created a partitioned dataset for ratings.csv:

kite-dataset create ratings --schema rating.avsc --partition-by year-month.json --format parquet

Here is my year-month.json:

[ {
  "name" : "year",
  "source" : "timestamp",
  "type" : "year"
}, {
  "name" : "month",
  "source" : "timestamp",
  "type" : "month"
} ]

This is my CSV import command:

kite-dataset csv-import ratings.csv ratings

After the import finished, I ran this command to see which year and month partitions had actually been created:

hadoop fs -ls /user/hive/warehouse/ratings/

I noticed that only a single year partition was created, with a single month partition inside it:

[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/
Found 3 items
drwxr-xr-x   - cloudera supergroup          0 2016-06-12 18:49 /user/hive/warehouse/ratings/.metadata
drwxr-xr-x   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/.signals
drwxrwxrwx   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970

[cloudera@quickstart ml-20m]$ hadoop fs -ls /user/hive/warehouse/ratings/year=1970/
Found 1 items
drwxrwxrwx   - cloudera supergroup          0 2016-06-12 18:59 /user/hive/warehouse/ratings/year=1970/month=01

What is the correct way to run this partitioned import so that partitions are created for every year and month in the data?

hadoop hdfs cloudera-cdh hadoop-partitioning kite-dataset

Source: https://stackoverflow.com/questions/37778161/how-to-properly-import-csv-data-set-using-kite-dataset-partitioned-schema

1 answer


9wbgstp71#

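Append three zeros to the end of each timestamp. MovieLens stores timestamps as seconds since the epoch, and the single year=1970/month=01 partition is consistent with those values being read as milliseconds, so multiplying by 1000 puts every row back in its real year and month.
The shell script below does this: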


#!/bin/bash

# add the CSV header to both files
head -n 1 ratings.csv > ratings_1.csv
head -n 1 ratings.csv > ratings_2.csv

# output the first 10,000,000 rows to ratings_1.csv
# this includes the header, and uses tail to remove it
# awk appends "000" to each line; timestamp is the last column,
# so this turns seconds into milliseconds
head -n 10000001 ratings.csv | tail -n +2 | awk '{print $0 "000"}' >> ratings_1.csv

# output the rest of the file to ratings_2.csv
# this starts at the line after the ratings_1 file stopped
tail -n +10000002 ratings.csv | awk '{print $0 "000"}' >> ratings_2.csv
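
The zero-padding step can also be written so that it targets the timestamp column explicitly and leaves the header alone, without splitting the file first. This is only a sketch: it assumes the column order userId,movieId,rating,timestamp with the timestamp last and exactly one header line, and `to_millis` is a made-up helper name, not part of Kite.

```shell
#!/bin/bash
# Sketch: convert the last CSV column from seconds to milliseconds.
# Assumes the layout userId,movieId,rating,timestamp and one header line.
# to_millis is a hypothetical helper name used only for this example.
to_millis() {
  awk 'BEGIN { FS = OFS = "," }
       NR == 1 { print; next }      # pass the header through unchanged
       { $NF = $NF "000"; print }'  # append three zeros to the last field
}

# Illustrative rows, not copied from the real file
printf 'userId,movieId,rating,timestamp\n1,2,3.5,1112486027\n' | to_millis
```

On the real file this would be run as `to_millis < ratings.csv > ratings_ms.csv` (the output file name is arbitrary), followed by `kite-dataset csv-import ratings_ms.csv ratings`.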

I ran into the same problem, and adding the three zeros fixed it.
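
A quick arithmetic check shows why the three zeros matter. The sketch below takes one illustrative timestamp (an assumed sample value, not one quoted in the question) and counts whole days since the epoch, first treating the number as milliseconds and then after multiplying it by 1000.

```shell
#!/bin/bash
# Sketch: why a seconds value lands in January 1970 when read as milliseconds.
# 1112486027 is an illustrative timestamp, not a value quoted in the question.
ts=1112486027
ms_per_day=86400000

# Read as milliseconds: whole days elapsed since 1970-01-01
days_as_ms=$(( ts / ms_per_day ))
# After appending three zeros (seconds -> milliseconds)
days_fixed=$(( ts * 1000 / ms_per_day ))

echo "read as ms:        day $days_as_ms after the epoch"
echo "after adding 000:  day $days_fixed after the epoch"
```

Fewer than 31 days after the epoch is still January 1970, which matches the single year=1970/month=01 partition; the corrected value falls several decades later.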
