I'm new to Solr, coming from Splunk. I'd like to know whether fields can be extracted at query time. For example, I have this streaming query:
search(A3secLinuxLogs,fq=_time:[NOW-1DAY TO NOW] AND log:Accepted,fl="_time,hostname,raw_log,service_name,pid",sort=_time desc,rows=1000)
The results I get look like this:
{
"hostname": [
"sa3secessuperset01"
],
"pid": [
27942
],
"raw_log": [
"Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
],
"_time": [
"2021-07-16T16:17:21Z"
],
"service_name": [
"sshd[27942]"
]
},
I want to extract the source IP from "raw_log" with a regular expression like this:
from:(?<src_ip>\d+\.\d+\.\d+\.\d+)
Maybe something like this:
select(
search(A3secLinuxLogs,fq=_time:[NOW-1DAY TO NOW] AND log:Accepted,fl="_time,hostname,raw_log,service_name,pid",sort=_time desc,rows=1000),
hostname,
raw_log,
service_name,
pid,
regextract("raw_log","from:(?<src_ip>\d+\.\d+\.\d+\.\d+)"))
Currently I use Spark to achieve this, but I don't know whether there is a way to do it directly in Solr.
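For reference, the client-side extraction described here can be sketched in a few lines of Python. This is only an illustration of post-processing the search results outside Solr, not a Solr feature. The helper name add_src_ip is hypothetical, and the pattern assumes the log reads "from 10.0.9.3" (a space after "from", as in the sample line) rather than "from:". Python also spells named groups as (?P<name>...) instead of the Java-style (?<name>...).

```python
import re

# Assumption: the sample log line contains "from 10.0.9.3" (space, no colon),
# so the pattern uses "from " rather than the "from:" written in the question.
SRC_IP_RE = re.compile(r"from (?P<src_ip>\d+\.\d+\.\d+\.\d+)")


def add_src_ip(doc):
    """Return a copy of a Solr result document with a src_ip field added.

    Fields are multi-valued lists, as in the JSON output above. Log lines
    that do not match the pattern are skipped.
    """
    enriched = dict(doc)
    ips = []
    for line in doc.get("raw_log", []):
        match = SRC_IP_RE.search(line)
        if match:
            ips.append(match.group("src_ip"))
    if ips:
        enriched["src_ip"] = ips
    return enriched


doc = {
    "hostname": ["sa3secessuperset01"],
    "pid": [27942],
    "raw_log": [
        "Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey "
        "for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
    ],
}

print(add_src_ip(doc)["src_ip"])  # ['10.0.9.3']
```

The original document is left unmodified, so the same helper could be applied either to query results or to documents before they are sent to Solr for indexing.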
I also tried to achieve the same thing at index time by editing the schema to use tokenizers and filters, but I got results like this:
{
"hostname": [
"sa3secessuperset01"
],
"pid": [
27942
],
"raw_log": [
"Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
],
"_time": [
"2021-07-16T16:17:21Z"
],
"service_name": [
"sshd[27942]"
],
"src_ip": [
"Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
],
},
I want something like this:
{
"hostname": [
"sa3secessuperset01"
],
"pid": [
27942
],
"raw_log": [
"Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
],
"_time": [
"2021-07-16T16:17:21Z"
],
"service_name": [
"sshd[27942]"
],
"src_ip": [
"10.0.9.3"
],
},
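I should add that the analysis works correctly, but the "indexed" data doesn't look the way I want.
Basically, I just want to know whether there is a way to achieve this. I would prefer to use a regex at query time, but if that isn't possible, I'd like to know how to get the result using tokenizers and filters.
Regards,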
1 Answer
You will probably need to use the Regular Expression Pattern Tokenizer:
https://solr.apache.org/guide/8_9/tokenizers.html#regular-expression-pattern-tokenizer
and adjust the expression for the "src_ip" field so that only the IP is indexed.
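The example from the answer is not reproduced here, so the following is only a sketch of what such a field type might look like, written as a Schema API "add-field-type" payload built in Python. The field type name src_ip_extract is hypothetical. solr.PatternTokenizerFactory and its "pattern" and "group" parameters come from the Solr guide linked above. The pattern again assumes "from " with a space, as in the sample log line. The simulate_tokens helper only approximates the tokenizer's behaviour with Python's re module and is not part of Solr.

```python
import json
import re

# Assumption: "from " with a space, matching the sample log line.
PATTERN = r"from (\d+\.\d+\.\d+\.\d+)"

# Hypothetical field type name. With group >= 0 the tokenizer emits the
# text captured by that group for each match, instead of splitting on it.
field_type = {
    "add-field-type": {
        "name": "src_ip_extract",
        "class": "solr.TextField",
        "analyzer": {
            "tokenizer": {
                "class": "solr.PatternTokenizerFactory",
                "pattern": PATTERN,
                "group": "1",
            }
        },
    }
}


def simulate_tokens(text, pattern=PATTERN, group=1):
    """Approximate the emitted tokens: one captured group per regex match."""
    return [m.group(group) for m in re.finditer(pattern, text)]


raw_log = (
    "Jul 16 16:17:21 sa3secessuperset01 sshd[27942]: Accepted publickey "
    "for debian from 10.0.9.3 port 40954 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxx"
)

# JSON body that could be POSTed to the collection's /schema endpoint.
print(json.dumps(field_type, indent=2))
print(simulate_tokens(raw_log))  # ['10.0.9.3']
```

Note that analysis only changes the indexed terms. The stored value returned in search results is the original input, which would explain why "src_ip" still shows the full log line in the output above. If the response itself should contain only the IP, the value would have to be extracted before it is stored, for example on the client at ingest time.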