Counting string occurrences with ArangoDB AQL

Posted by 6za6bjd0 on 2022-12-09 in Go

To count the number of objects that contain a particular attribute value, I can do something like this:

FOR t IN thing
  COLLECT other = t.name == "Other" WITH COUNT INTO otherCount
  FILTER other != false
  RETURN otherCount

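But how can I count three other occurrences in the same query, without resorting to subqueries that run through the same dataset multiple times?
I've tried something like this: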

FOR t IN thing
  COLLECT 
    other = t.name == "Other",
    some = t.name == "Some",
    thing = t.name == "Thing"
  WITH COUNT INTO count
  RETURN {
   other, some, thing,
   count
  }

But I can't make sense of the results. Am I approaching this the wrong way?

arangodb

Source: https://stackoverflow.com/questions/59736598/counting-string-occurrences-with-arangodb-aql

1 Answer

sshcrbum1#

Split and count

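You can split the string by each search phrase and subtract 1 from the number of parts. This works for any substring, which on the other hand means that it does not respect word boundaries.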

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(SPLIT(t.name, "Some"))-1
  LET Other = LENGTH(SPLIT(t.name, "Other"))-1
  LET Thing = LENGTH(SPLIT(t.name, "Thing"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 3,
    "Other": 2,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]

You can use SPLIT(LOWER(t.name), LOWER("...")) to make it case-insensitive.
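
As a minimal sketch of that case-insensitive variant, the query below lowercases each name once and then splits on the lowercased search phrases. It reuses the sample data from above; the variable `name` is introduced here only for illustration. With this change, the third string should also count its lowercase "some".

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Lowercase the input once so that every comparison ignores case
  LET name = LOWER(t.name)
  // n occurrences of a phrase split the string into n + 1 parts
  LET Some = LENGTH(SPLIT(name, LOWER("Some"))) - 1
  LET Other = LENGTH(SPLIT(name, LOWER("Other"))) - 1
  LET Thing = LENGTH(SPLIT(name, LOWER("Thing"))) - 1
  RETURN {
    Some, Other, Thing
  }
```

This is still substring matching, so "brOther" keeps counting towards Other.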

Collect words

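The TOKENS() function can be used to split the input into arrays of words, which can then be grouped and counted. Note that I changed the input slightly. An input "SomeSome" would not be counted, because "somesome" != "some" (this variant is word-based, not substring-based).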

LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
LET whitelist = TOKENS("Some Other Things", "text_en")

FOR t IN things
  LET whitelisted = (FOR w IN TOKENS(t.name, "text_en") FILTER w IN whitelist RETURN w)
  LET counts = MERGE(FOR w IN whitelisted
    COLLECT word = w WITH COUNT INTO count
    RETURN { [word]: count }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }

Result:

[
  {
    "name": "Here are SOME some and Some Other Things. More Other!",
    "some": 3,
    "other": 2,
    "things": 0
  },
  {
    "name": "There are no such substrings in here.",
    "some": 0,
    "other": 0,
    "things": 0
  },
  {
    "name": "some-Other-here-though!",
    "some": 1,
    "other": 1,
    "things": 0
  }
]

This does use a subquery for the COLLECT; otherwise it would count the total number of occurrences across the entire input.
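The whitelist step is not strictly necessary; you could also let it count all words. For larger input strings, it might save some memory to skip the words you are not interested in.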
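
A sketch of the query without the whitelist step, counting every token that TOKENS() produces for each string. The keys of `counts` are whatever the text_en Analyzer emits, so expect lowercased and stemmed tokens rather than the original spellings. The variable `n` is an illustrative name.

```aql
LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Group this string's tokens and count each distinct token
  LET counts = MERGE(
    FOR w IN TOKENS(t.name, "text_en")
      COLLECT word = w WITH COUNT INTO n
      RETURN { [word]: n }
  )
  RETURN {
    name: t.name,
    counts
  }
```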
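Note that "things" is 0 in the result above, most likely because text_en stems "Things" to "thing", so the count ends up under the key thing rather than things. If you want to match words exactly, you may need to create a separate Analyzer for the language with stemming disabled. You can also turn off normalization ("accent": true, "case": "none"). An alternative is to use REGEX_SPLIT() on typical whitespace and punctuation characters for simpler tokenization, but that depends on your use case.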
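
The REGEX_SPLIT() alternative could look roughly like the sketch below. The separator pattern is an assumption (it treats every run of characters other than ASCII letters and digits as a delimiter), and `wanted`, `words` and `n` are illustrative names. Words are only lowercased, not stemmed, so "Things" should be counted under things here.

```aql
LET things = [
    {name: "Here are SOME some and Some Other Things. More Other!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]
LET wanted = ["some", "other", "things"]

FOR t IN things
  // Split on anything that is not an ASCII letter or digit, drop empty parts
  LET words = (
    FOR w IN REGEX_SPLIT(t.name, "[^A-Za-z0-9]+")
      FILTER w != ""
      RETURN LOWER(w)
  )
  // Count only the words of interest, per input string
  LET counts = MERGE(
    FOR w IN words
      FILTER w IN wanted
      COLLECT word = w WITH COUNT INTO n
      RETURN { [word]: n }
  )
  RETURN {
    name: t.name,
    some: counts.some || 0,
    other: counts.other || 0,
    things: counts.things || 0
  }
```

Non-ASCII input would need a broader character class, so treat the pattern as a starting point only.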

Other solutions

I don't think it is possible to count per input object using COLLECT without a subquery, unless you want a total count.
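
For the total-count case, which is closer to the original question where name is compared against fixed values, one option is COLLECT AGGREGATE with a conditional SUM. This is a sketch with made-up inline data rather than the asker's `thing` collection, and the names otherCount, someCount and thingCount are illustrative. It also hints at why the attempt in the question is hard to read: COLLECT with three boolean expressions forms one group per distinct combination of true/false values, not three independent counters.

```aql
// Hypothetical sample data standing in for the asker's collection
LET things = [
    {name: "Other"},
    {name: "Some"},
    {name: "Other"},
    {name: "Thing"},
    {name: "Else"}
]

FOR t IN things
  // One pass over the data; each SUM adds 1 for matching documents
  COLLECT AGGREGATE
    otherCount = SUM(t.name == "Other" ? 1 : 0),
    someCount = SUM(t.name == "Some" ? 1 : 0),
    thingCount = SUM(t.name == "Thing" ? 1 : 0)
  RETURN {
    otherCount, someCount, thingCount
  }
```

Because no grouping key is given, the query returns a single object with the three totals.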
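Splitting is a bit of a hack, but you could replace SPLIT() with REGEX_SPLIT() and wrap the search phrases in \b so that they only match when there is a word boundary on both sides. That way it should only match whole words (more or less):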

LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b"))-1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b"))-1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b"))-1
  RETURN {
   Some, Other, Thing
}

Result:

[
  {
    "Some": 1,
    "Other": 1,
    "Thing": 1
  },
  {
    "Some": 0,
    "Other": 0,
    "Thing": 0
  },
  {
    "Some": 0,
    "Other": 1,
    "Thing": 0
  }
]
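
REGEX_SPLIT() also takes an optional third argument that makes the match case-insensitive, which combines the word-boundary approach with the earlier LOWER() idea. The following is a sketch on the same sample data; with it, the lowercase "some" in the third string should be counted as well.

```aql
LET things = [
    {name: "Here are SomeSome and Some Other Things, brOther!"},
    {name: "There are no such substrings in here."},
    {name: "some-Other-here-though!"}
]

FOR t IN things
  // Third argument true = case-insensitive matching
  LET Some = LENGTH(REGEX_SPLIT(t.name, "\\bSome\\b", true)) - 1
  LET Other = LENGTH(REGEX_SPLIT(t.name, "\\bOther\\b", true)) - 1
  LET Thing = LENGTH(REGEX_SPLIT(t.name, "\\bThings\\b", true)) - 1
  RETURN {
    Some, Other, Thing
  }
```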

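A more elegant solution would be to use ArangoSearch for word counting, but it has no feature that lets you retrieve how often a word occurs. It might already keep track of that internally (Analyzer feature "frequency"), but it is definitely not exposed at this point in time.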
