Searching subtitle data in Elasticsearch

83qze16e  posted on 2022-10-06 in ElasticSearch

I have the following data (a simple SRT):

1
00:02:17,440 --> 00:02:20,375
Senator, we're making our final

2
00:02:20,476 --> 00:02:22,501
approach into Coruscant.

...

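What is the best way to index this in Elasticsearch? Here is the catch: I want the highlighted search results to link to the exact time indicated by the timestamp. In addition, some phrases span multiple SRT entries (such as "final approach" in the example above).

My ideas so far:

  • Index the SRT file as a list type, keyed by timestamp. I don't think this will match phrases that span multiple keys.
  • Create a custom tokenizer that indexes only the text part. I'm not sure how well Elasticsearch can highlight the original content in that case.
  • Index only the text part and map it back to the timestamps outside of Elasticsearch.

Or are there other options that solve this elegantly?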

e37o9pze  #1

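Interesting question. Here's my take on it.

In essence, subtitles "don't know" about each other, so it is best to include the text of the preceding and following subtitles in each document (n - 1, n, n + 1) wherever they exist.

As such, you'd be looking for a document structure along these lines: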

{
  "sub_id" : 0,
  "start" : "00:02:17,440",
  "end" : "00:02:20,375",
  "text" : "Senator, we're making our final",
  "overlapping_text" : "Senator, we're making our final approach into Coruscant."
}

To arrive at such a document structure I used the following (inspired by this excellent answer):

from itertools import groupby
from collections import namedtuple

def parse_subs(fpath):
    # "chunk" our input file, delimited by blank lines
    with open(fpath) as f:
        res = [list(g) for b, g in groupby(f, lambda x: bool(x.strip())) if b]

    Subtitle = namedtuple('Subtitle', 'sub_id start end text')

    subs = []

    # grouping
    for sub in res:
        if len(sub) >= 3:  # not strictly necessary, but better safe than sorry
            sub = [x.strip() for x in sub]
            sub_id, start_end, *content = sub  # py3 syntax
            start, end = start_end.split(' --> ')

            # ints only
            sub_id = int(sub_id)

            # join multi-line text
            text = ', '.join(content)

            subs.append(Subtitle(
                sub_id,
                start,
                end,
                text
            ))

    es_ready_subs = []

    for index, sub_object in enumerate(subs):
        prev_sub_text = ''
        next_sub_text = ''

        if index > 0:
            prev_sub_text = subs[index - 1].text + ' '

        if index < len(subs) - 1:
            next_sub_text = ' ' + subs[index + 1].text

        es_ready_subs.append(dict(
            **sub_object._asdict(),
            overlapping_text=prev_sub_text + sub_object.text + next_sub_text
        ))

    return es_ready_subs

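Once the subtitles are parsed, they can be ingested into ES. Before doing that, set up the following mapping so that your timestamps are properly searchable and sortable: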

PUT my_subtitles_index
{
  "mappings": {
    "properties": {
      "start": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      },
      "end": {
        "type": "text",
        "fields": {
          "as_timestamp": {
            "type": "date",
            "format": "HH:mm:ss,SSS"
          }
        }
      }
    }
  }
}
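
A time-only value such as `00:02:17,440` parsed with this format ends up relative to the epoch, so the `as_timestamp` subfield effectively holds the offset in milliseconds from the start of the video (these are the numbers that appear as sort values in the search response further down). As a small sketch, the helper below performs the same conversion on the client, which may be handy when building links to an exact playback position. The name `srt_time_to_millis` is hypothetical and not part of the original answer.

```python
import re

# Matches the SRT timestamp layout HH:MM:SS,mmm
_SRT_TIME = re.compile(r"^(\d{2}):(\d{2}):(\d{2}),(\d{3})$")


def srt_time_to_millis(timestamp):
    """Convert an SRT timestamp like '00:02:17,440' to milliseconds."""
    match = _SRT_TIME.match(timestamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {timestamp!r}")
    hours, minutes, seconds, millis = (int(part) for part in match.groups())
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis
```

For instance, `srt_time_to_millis("00:02:17,440")` gives `137440`.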

With the mapping in place, proceed to ingest:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from utils.parse import parse_subs

es = Elasticsearch()

es_ready_subs = parse_subs('subs.txt')

actions = [
    {
        "_index": "my_subtitles_index",
        "_id": sub_group['sub_id'],
        "_source": sub_group
    } for sub_group in es_ready_subs
]

bulk(es, actions)

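Once ingested, you can target the original subtitle text and boost it if it matches your phrase directly. Otherwise, add a fallback on overlapping_text, which ensures that both "overlapping" subtitles are returned.

Before returning, you can make sure the hits are sorted by start, ascending. This somewhat defeats the purpose of boosting, but if you do sort, you can pass track_scores (that is, track_scores=true) in the URI so that the originally calculated scores are returned as well.

Putting it all together: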

POST my_subtitles_index/_search?track_scores&filter_path=hits.hits
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "text": {
              "query": "final approach",
              "boost": 2
            }
          }
        },
        {
          "match_phrase": {
            "overlapping_text": {
              "query": "final approach"
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "start.as_timestamp": {
        "order": "asc"
      }
    }
  ]
}

This yields:

{
  "hits" : {
    "hits" : [
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "0",
        "_score" : 6.0236287,
        "_source" : {
          "sub_id" : 0,
          "start" : "00:02:17,440",
          "end" : "00:02:20,375",
          "text" : "Senator, we're making our final",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant."
        },
        "sort" : [
          137440
        ]
      },
      {
        "_index" : "my_subtitles_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 5.502407,
        "_source" : {
          "sub_id" : 1,
          "start" : "00:02:20,476",
          "end" : "00:02:22,501",
          "text" : "approach into Coruscant.",
          "overlapping_text" : "Senator, we're making our final approach into Coruscant. Very good, Lieutenant."
        },
        "sort" : [
          140476
        ]
      }
    ]
  }
}
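
The request body shown earlier can also be assembled in Python before handing it to whichever client call you use. This is only a sketch: `build_subtitle_query` is a hypothetical helper name, `track_scores` is set in the body rather than in the URI, and the optional `highlight` section is an addition that is not part of the query above. It asks Elasticsearch to return the matched fragments of `text` and `overlapping_text`, which is what the question wants to link to a timestamp.

```python
def build_subtitle_query(phrase, boost=2, with_highlight=False):
    """Build a search body: boosted phrase match on `text`,
    with a plain phrase match on `overlapping_text` as a fallback."""
    body = {
        "query": {
            "bool": {
                "should": [
                    {"match_phrase": {"text": {"query": phrase, "boost": boost}}},
                    {"match_phrase": {"overlapping_text": {"query": phrase}}},
                ]
            }
        },
        # chronological order, as in the request above
        "sort": [{"start.as_timestamp": {"order": "asc"}}],
        # keep relevance scores even though the hits are sorted by time
        "track_scores": True,
    }
    if with_highlight:
        # assumption: default highlighter settings are good enough
        body["highlight"] = {"fields": {"text": {}, "overlapping_text": {}}}
    return body
```

Calling `build_subtitle_query("final approach")` produces the same query and sort clauses as the request above.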

mwecs4sa  #2

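I ran into the same problem and took a different approach.

1. Join the subtitle lines into one full transcript.
2. Index the transcript into Elasticsearch instead of the individual subtitle lines.
3. Ask Elasticsearch to return highlighted snippets.
4. On the client side, search the transcript again for the snippet to determine its exact position.
5. Map the snippet's starting position to the corresponding subtitle line and read off its timestamp.

For me, moving this logic to the client side was much easier.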
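
The client-side part of those steps (4 and 5) could be sketched roughly as follows. This is not the answerer's actual code: the function names are hypothetical, the subtitles are assumed to be dicts with `start` and `text` keys, and the snippet is assumed to have had its highlight tags (by default `<em>` and `</em>`) stripped so that it appears verbatim in the transcript.

```python
from bisect import bisect_right


def build_transcript(subs):
    """Join subtitle texts with single spaces and record where each starts.

    Returns (transcript, offsets), where offsets[i] is the index in the
    transcript at which subs[i]['text'] begins.
    """
    parts, offsets, position = [], [], 0
    for sub in subs:
        offsets.append(position)
        parts.append(sub["text"])
        position += len(sub["text"]) + 1  # +1 for the joining space
    return " ".join(parts), offsets


def locate_snippet(snippet, transcript, offsets, subs):
    """Return the subtitle in which `snippet` starts, or None if absent."""
    position = transcript.find(snippet)
    if position == -1:
        return None
    # the rightmost subtitle whose start offset is <= position
    return subs[bisect_right(offsets, position) - 1]


subs = [
    {"start": "00:02:17,440", "text": "Senator, we're making our final"},
    {"start": "00:02:20,476", "text": "approach into Coruscant."},
]
transcript, offsets = build_transcript(subs)
hit = locate_snippet("final approach", transcript, offsets, subs)
```

Here `hit` is the first subtitle, because the phrase begins inside it even though it ends in the second one. Note that `str.find` only reports the first occurrence, so a phrase that repeats in the transcript would need extra handling.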
