Pig: REPLACE problem

Posted by ne5o7dgx on 2021-06-21 in Pig
Here is my data:

199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985

199.120.110.21 - - [01/Jul/1995:00:00:09 -0400] "GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0" 200 4085

burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0

Here is my Pig code:

loadFulldata = LOAD '/root/Kennadi-Project/Kennadi-data.txt' USING PigStorage(',') AS (fullline:chararray);

extractData = FOREACH loadFulldata GENERATE FLATTEN (REGEX_EXTRACT_ALL(fullline,'(.*) - - (.*) -(.*)] "(.*)" (.*) (.*)'));

rowdata = FOREACH extractData GENERATE $0 as host,$1 as datetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;

My output looks like this:

(199.72.81.55,[01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,[01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,[01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,[01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
(199.120.110.21,[01/Jul/1995:00:00:11,0400,GET /shuttle/missions/sts-73/sts-73-patch-small.gif HTTP/1.0,200,4179)
(burger.letters.com,[01/Jul/1995:00:00:12,0400,GET /images/NASA-logosmall.gif HTTP/1.0,304,0)

When I use REGEX_EXTRACT_ALL, I cannot remove the "[" from the data. How can I do that?
I also tried the REPLACE function, as follows:

rowdata = FOREACH extractData GENERATE $0 as host,$1 as datadatetime,$2 as timezone,$3 as responseurl,$4 as responsecode,$5 as responsedata;

newdata = FOREACH rowdata GENERATE REPLACE(datadatetime,'[','');

But I get the following warnings:

2016-01-05 05:10:13,758 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning USING_OVERLOADED_FUNCTION 1 time(s).
2016-01-05 05:10:13,758 [main] WARN  org.apache.pig.newplan.BaseOperatorPlan - Encountered Warning IMPLICIT_CAST_TO_CHARARRAY 1 time(s).

I think this is because I have not defined a data type for datadatetime. How do I define a data type inside a FOREACH?

vsmadaxz #1

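You have a problem. You try to solve it with a regular expression. Now you have two problems.
But seriously, after trying it out, this appears to be purely a problem with the regular expression. Using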

REGEX_EXTRACT_ALL(fullline,'(.*) - - \\[(.*) -(.*)\\] "(.*)" (.*) (.*)')

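did the trick for me. The escaped \\[ and \\] match the literal brackets outside the capture groups, so the "[" no longer ends up in the extracted datetime field.
Result: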

(199.72.81.55,01/Jul/1995:00:00:01,0400,GET /history/apollo/ HTTP/1.0,200,6245)
(unicomp6.unicomp.net,01/Jul/1995:00:00:06,0400,GET /shuttle/countdown/ HTTP/1.0,200,3985)
(199.120.110.21,01/Jul/1995:00:00:09,0400,GET /shuttle/missions/sts-73/mission-sts-73.html HTTP/1.0,200,4085)
(burger.letters.com,01/Jul/1995:00:00:11,0400,GET /shuttle/countdown/liftoff.html HTTP/1.0,304,0)
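
On the second part of the question (the REPLACE warnings), here is a minimal sketch. It assumes extractData was built with the original, unescaped regex, so that $1 still starts with "[". The fields coming out of FLATTEN(REGEX_EXTRACT_ALL(...)) have no declared type, which is most likely what triggers IMPLICIT_CAST_TO_CHARARRAY; an explicit cast in the FOREACH gives each field a type. REPLACE also treats its second argument as a regular expression, so the bracket needs the same double-backslash escape as in the regex above. The int and long types for the last two fields are only my assumption about how they will be used.

```pig
-- Assumption: extractData comes from the question's original (unescaped) regex,
-- so $1 looks like '[01/Jul/1995:00:00:01'.
rowdata = FOREACH extractData GENERATE
    (chararray)$0 AS host,
    (chararray)$1 AS datadatetime,
    (chararray)$2 AS timezone,
    (chararray)$3 AS responseurl,
    (int)$4       AS responsecode,   -- assumed numeric HTTP status
    (long)$5      AS responsedata;   -- assumed byte count

-- The second argument of REPLACE is a regex, so '[' is escaped as '\\['.
newdata = FOREACH rowdata GENERATE
    host,
    REPLACE(datadatetime, '\\[', '') AS datadatetime,
    timezone, responseurl, responsecode, responsedata;
```

With the escaped regex from the answer, the REPLACE step is not needed at all; the casts are still useful if later steps should see typed fields.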
