Apache Pig: parsing a combined log with a regex

t3irkdon  posted on 2021-06-21  in  Pig

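I'm using a Pig Latin script and trying to parse a log with a regex, but it returns an error when matching the double quote ("). For example: error 1200: unexpected character '"'. The log format: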

118.102.255.50 - - [17/Oct/2014:00:00:29 -0400] "GET /favicon.ico HTTP/1.1" 200 20 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36"

And the script I wrote:

test = LOAD '/pigdata/log' as (line:chararray);
log = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\"\\s+(\\S+)\\s+(\\S+)\\s+\"([^\"]*)\"\\s+\"(.*)\"$')) AS (address_ip: chararray, logname: chararray, user: chararray, timestamp: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int, referer: chararray, userAgent: chararray);

dump log;

63lcw9qa  #1

Because Pig uses Java regex, you need to escape the double quote as \\" (two backslashes followed by the quote), like this:

log = FOREACH test GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+\\"(\\S+)\\s+(.+?)\\s+(HTTP[^"]+)\\"\\s+(\\S+)\\s+(\\S+)\\s+\\"([^"]*)\\"\\s+\\"(.*)\\"$')) AS (address_ip: chararray, logname: chararray, user: chararray, timestamp: chararray, method: chararray, uri: chararray, proto: chararray, status: int, bytes: int, referer: chararray, userAgent: chararray);
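
To check the pattern outside Pig, the same regex can be run directly against the sample line in plain Java. This is only a sketch under a few assumptions: that Pig turns each \\ in the quoted string into a single backslash before handing the pattern to java.util.regex, and that REGEX_EXTRACT_ALL requires the whole line to match. The class name CombinedLogRegex and the method parse are made up for illustration and are not part of Pig.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedLogRegex {
    // The regex as the Java engine would see it (assumption: Pig has already
    // reduced every \\ to a single backslash). In Java source each backslash
    // is doubled again and each double quote is written as \".
    static final Pattern LOG_PATTERN = Pattern.compile(
        "^(\\S+)\\s+(\\S+)\\s+(\\S+)\\s+.(\\S+\\s+\\S+).\\s+"
        + "\\\"(\\S+)\\s+(.+?)\\s+(HTTP[^\"]+)\\\"\\s+"
        + "(\\S+)\\s+(\\S+)\\s+\\\"([^\"]*)\\\"\\s+\\\"(.*)\\\"$");

    // Returns the captured fields in order, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = LOG_PATTERN.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 0; i < fields.length; i++) {
            fields[i] = m.group(i + 1); // group 0 is the whole match, so skip it
        }
        return fields;
    }

    public static void main(String[] args) {
        // The sample line from the question.
        String line = "118.102.255.50 - - [17/Oct/2014:00:00:29 -0400] "
            + "\"GET /favicon.ico HTTP/1.1\" 200 20 \"-\" "
            + "\"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/38.0.2125.101 Safari/537.36\"";
        String[] fields = parse(line);
        System.out.println(fields == null ? "no match" : String.join(" | ", fields));
    }
}
```

If parse returns null for a line, the pattern itself is the problem; if it returns all the fields here but the Pig script still fails, the quoting inside the Pig string literal is the more likely cause.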
