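I have a CSV file with two header rows, and I want to remove them. How can I remove the first two lines of a CSV file in Hive or Pig? The first lines of the file look like this: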
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM FL_NUM ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
YEAR QUARTER MONTH DAY_OF_MONTH DAY_OF_WEEK FL_DATE UNIQUE_CARRIER AIRLINE_ID CARRIER TAIL_NUM FL_NUM ORIGIN ORIGIN_CITY_NAME ORIGIN_STATE_ABR ORIGIN_STATE_FIPS ORIGIN_STATE_NM ORIGIN_WAC DEST DEST_CITY_NAME DEST_STATE_ABR DEST_STATE_FIPS DEST_STATE_NM DEST_WAC CRS_DEP_TIME DEP_TIME DEP_DELAY DEP_DELAY_NEW DEP_DEL15 DEP_DELAY_GROUP DEP_TIME_BLK TAXI_OUT WHEELS_OFF WHEELS_ON TAXI_IN CRS_ARR_TIME ARR_TIME ARR_DELAY ARR_DELAY_NEW ARR_DEL15 ARR_DELAY_GROUP ARR_TIME_BLK CANCELLED CANCELLATION_CODE DIVERTED CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME FLIGHTS DISTANCE DISTANCE_GROUP CARRIER_DELAY WEATHER_DELAY NAS_DELAY SECURITY_DELAY LATE_AIRCRAFT_DELAY
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 6 California 91 900 855 -5 0 0 -1 0900-0959 17 912 1230 7 1230 1237 7 7 0 0 1200-1259 0 0 390 402 378 1 2475 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 6 California 91 900 850 -10 0 0 -1 0900-0959 15 905 1202 9 1230 1211 -19 0 0 -2 1200-1259 0 0 390 381 357 1 2475 10
1 Answer
Try this and adapt it to your needs. I load each row as a single line here; you could also define a column for each field.
a = LOAD 'file.csv' USING TextLoader() AS (line:chararray);
b = FILTER a BY SUBSTRING(line, 0, 4) != 'YEAR';
DUMP b;
Or use Hive: this will skip the first 2 lines and load the remaining records.
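The Hive snippet itself did not survive in this answer. One common way to do it, sketched here as an assumption rather than the answerer's exact code, is the `skip.header.line.count` table property (available since Hive 0.13). The table name `flights_raw`, the comma delimiter, and the shortened column list are hypothetical; adjust them to the actual file.

```sql
-- Hypothetical table: only the first few columns are declared for brevity.
-- With a delimited text table, Hive ignores trailing fields that have no column.
CREATE TABLE flights_raw (
  year          INT,
  quarter       INT,
  month         INT,
  day_of_month  INT,
  day_of_week   INT,
  fl_date       STRING,
  unique_carrier STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','          -- assumption: change to '\t' if the file is tab-separated
STORED AS TEXTFILE
TBLPROPERTIES ("skip.header.line.count"="2");  -- skip the two header rows when reading

-- Load the file; the header rows stay in the file but are skipped by queries.
LOAD DATA LOCAL INPATH 'file.csv' INTO TABLE flights_raw;

SELECT * FROM flights_raw LIMIT 5;
```

Note that the property skips the first two lines of each file in the table, so it fits the case where every loaded file carries the same two header rows.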