我有下面提到的csv文件。 VTS,51,0071,9739965515,NM,GP,INF01,V,19,072219,291014,0000.0000,N,00000.0000,E,07AE VTS,01,0097,9739965515,SP,GP,18,072253,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,169,B205 VTS,51,0071,9739965515,NM,GP,INF01,V,18,072311,291014,0000.0000,N,00000.0000,E,C24E VTS,01,0097,9739965515,NM,GP,19,072311,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,171,B358 VTS,51,0071,9739965515,NM,GP,INF01,V,18,072319,291014,0000.0000,N,00000.0000,E,012F VTS,51,0071,9739965515,NM,GP,INF01,V,19,072326,291014,0000.0000,N,00000.0000,E,B2E6 VTS,01,0097,9739965515,NM,GP,18,072326,V,0000.0000,N,00000.0000,E,0.0,0.0,291014,0000,00,4000,11,999,173,EAA0 VTS,51,0071,9739965515,NM,GP,INF01,V,18,072333,291014,0000.0000,N,00000.0000,E,9896 VTS,51,0071,9739965515,NM,GP,INF01,V,18,072340,291014,0000.0000,N,00000.0000,E,9B23
必须用以下字段Map: pkt_header,gprs_pkt_id,pkt_length,sim_no,msg_id,gprs_pkt,gsm_sig_strength,utc_time,pkt_validation,latitude,direction_n_s,longitude,direction_e_w,speed,track_angle,utc_date,fuel_adc_values,ignition,odometer_values,supply_int,battery_adc,pkt_id,check_sum
第二个字段,即值为01的gprs\u pkt\u id,表示有效的数据包。我的用例是只为有效数据包过滤csv数据,我正在使用regex,但是我无法获得整个数据包。任何帮助都将不胜感激。
使用的配置单元查询如下所示。 CREATE EXTERNAL TABlE sky_track_testing1( pkt_header STRING, gprs_pkt_id STRING, pkt_length STRING, sim_no STRING, msg_id STRING, gprs_pkt STRING, gsm_sig_strength STRING, utc_time STRING, pkt_validation STRING, latitude STRING, direction_n_s STRING, longitude STRING, direction_e_w STRING, speed STRING, track_angle STRING, utc_date STRING, fuel_adc_values STRING, ignition STRING, odometer_values STRING, supply_int STRING, battery_adc STRING, pkt_id STRING, check_sum STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "^(VTS,01).*$" ) STORED AS TEXTFILE LOCATION '/user/root/sky_track';
这绝对是一个错误的查询。请帮帮我。
2条答案
按热度按时间zujrkrfu1#
是的,根据上述答案,Pig将非常适合您的数据。你可以试试Pig。如果您对hive感兴趣,请参阅下面的示例(您的数据集不需要regex)。
vkc1a9a22#
我建议您使用pig: