How to extract a string from a URL in a weblog file using Pig/Hive

Posted by zpf6vheq on 2021-05-29 in Hadoop


How can I extract a string from the URL in a log file using Pig or Hive?
Input file:

122.161.182.202 - jane [21/Jul/2012:13:14:17-0700] "GET /rss.pl HTTP/1.1"   200 35942 "http://www.e.com/bam_applicatin/VD55173061"     "IE/4.0 (compatible; MSIE 7.0; Windows NT 6.0;   Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3;    OfficeLivePatch.1.3; MSOffice 12)"

Expected output:

122.161.182.202 - jane [21/Jul/2012:13:14:17-0700] "GET /rss.pl HTTP/1.1"   200 35942 "VD55173061"     "IE/4.0 (compatible; MSIE 7.0; Windows NT 6.0;   Trident/4.0; SLCC1; .NET CLR 2.0.50727; .NET CLR 3.5.21022; InfoPath.2; .NET CLR 3.5.30729; .NET CLR 3.0.30618; OfficeLiveConnector.1.3;    OfficeLivePatch.1.3; MSOffice 12)"

Input URL: http://www.e.com/bam_applicatin/VD55173061
String needed from the URL: VD55173061
I would like to process the logs with Pig or Hive. Please help.

hadoop Hive shell apache-pig

Source: https://stackoverflow.com/questions/32569784/how-to-extract-a-string-from-url-in-a-weblog-file-using-pig-hive

2 Answers

6pp0gazn1#

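If you can assume that the string to extract always has the same length (10 characters here), you can use the SUBSTR() function:
substr(string source_str, int start_position [, int length])
In your case, you could use:

SUBSTR(url, LENGTH(url) - (10 - 1))

With 1-based positions, this returns the last 10 characters of url. See the manual page for more information.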

Answered 2021-05-30

eit6fx6z2#

Using Apache Pig.
See http://pig.apache.org/docs/r0.14.0/func.html#substring for the API documentation and usage.
Input:

http://www.e.com/bam_applicatin/VD55173061

Pig script:

-- Load the input file, one URL per line.
url_data = LOAD 'input.csv' USING PigStorage(',') AS (url:chararray);
-- Keep everything after the last '/' in the URL.
req_url = FOREACH url_data GENERATE SUBSTRING(url, LAST_INDEX_OF(url, '/') + 1, (int)SIZE(url));
DUMP req_url;

Output:

VD55173061
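
The script above reads a file that contains only the URL, while the question asks for the full log line with the URL shortened in place. One possible way to do that is a regex substitution on the whole line. The following is a minimal sketch, not part of the original answer: the path `access.log` and the relation names `logs` and `cleaned` are hypothetical, and it assumes that Pig's REPLACE treats its second argument as a Java regular expression and accepts `$1` as a capture-group reference in the replacement.

```pig
-- Hypothetical input path; TextLoader yields each raw log line as one chararray.
logs = LOAD 'access.log' USING TextLoader() AS (line:chararray);
-- Match a double-quoted http(s) URL and keep only the text after its last '/'.
-- [^" ]* is greedy, so the capture group ends up holding the final path segment.
cleaned = FOREACH logs GENERATE REPLACE(line, '"https?://[^" ]*/([^"/ ]+)"', '"$1"') AS line;
DUMP cleaned;
```

With this pattern, lines that contain no quoted http(s) URL, or whose URL ends in a trailing '/', should pass through unchanged, so it would be worth checking the result against a sample of real log lines.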

Answered 2021-05-30
