Neo4j Lucene full-text search: extracting keywords from text

Asked by cvxl0en2 on 2022-12-18 · Lucene

I have a Neo4j FULLTEXT INDEX with about 60k records (keywords); this is my keyword vocabulary. I need to extract from arbitrary input texts all keywords that exist in this index. Can this be done with Neo4j, Cypher, or APOC?

Update

For example, given this text:

Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives.

and the following keywords in the Neo4j database covered by the FULLTEXT INDEX:

apache-spark
scala
gpu

I need to extract from the text above:

Apache Spark
Scala
GPU


tuwxkamq1#

So, normally an FT index is used for the opposite use case: you store the texts in the index and match keywords against them. That said:

Poor man's solution

Query the index with your text. For example, given the following setup (db.index.fulltext.createNodeIndex is the Neo4j 4.x procedure; in Neo4j 5 it was replaced by the CREATE FULLTEXT INDEX DDL statement):

CALL db.index.fulltext.createNodeIndex('Keyword', ['Keyword'], ['value'])
CREATE (n:Keyword {value: 'apache-spark'})
CREATE (n:Keyword {value: 'gpu'})
CREATE (n:Keyword {value: 'scala'})

use the whole text as the search query:

CALL db.index.fulltext.queryNodes('Keyword', 'Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives. ')


Since a Lucene query by default combines all tokens of the search text with the OR operator, this just works.
Result:

╒════════════════════════╤═══════════════════╕
│"node"                  │"score"            │
╞════════════════════════╪═══════════════════╡
│{"value":"apache-spark"}│1.480496883392334  │
├────────────────────────┼───────────────────┤
│{"value":"scala"}       │0.9932447671890259 │
├────────────────────────┼───────────────────┤
│{"value":"gpu"}         │0.49662238359451294│
└────────────────────────┴───────────────────┘

Limitations:
This relies on the OR operator. It works here, but keep in mind that when you index the keywords, a keyword like apache-spark actually produces two tokens in the index, apache and spark. So if your text contains Apache Age, that keyword would be returned as well.
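This caveat can be reproduced with a small Python sketch. The tokenize function below is only a rough stand-in for Lucene's standard analyzer, and the sample texts are illustrative:

```python
import re

def tokenize(text):
    # Rough stand-in for Lucene's standard analyzer:
    # lowercase, then split on runs of non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

# Indexing the keyword "apache-spark" yields TWO tokens, not one.
keyword_tokens = set(tokenize("apache-spark"))  # {"apache", "spark"}

def or_matches(doc_text, tokens):
    # Default Lucene semantics: a document matches if it shares ANY token.
    return bool(tokens & set(tokenize(doc_text)))

print(or_matches("Looking for an Apache Spark expert", keyword_tokens))  # True
print(or_matches("Migrating a graph from Apache Age", keyword_tokens))   # True, a false positive
```

The second match shares only the token apache, which is exactly why OR semantics over-match multi-token keywords.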

Alternative solution

Here the process is reversed:
1. Create an FTS index for the input texts
2. Temporarily store the input text in a node
3. Start from the keywords, clean them, and build a Lucene query from each one dynamically
4. Query the FTS index of the input texts
5. Delete the text node

// run this on its own first: schema and data updates cannot share a transaction
CALL db.index.fulltext.createNodeIndex('Text', ['Text'], ['text']);

WITH 'Looking for Apache Spark expert to coach me on the core concepts of optimizing the parallelism of Spark using Scala and OpenAcc programming model.

The mentor must have comprehensive hands-on knowledge of Big Data analytics in large scale of data  (especially Spark and GPU programming) to design the software tool with sample data analysis using Scala language and OpenAcc directives. '
AS text
CREATE (t:Text {text: text})
WITH t
MATCH (k:Keyword)
// remove non-alphanumeric characters
WITH k, apoc.text.regreplace(k.value, '[^a-zA-Z\d\s:]', ' ') AS clean
WITH k, split(clean, ' ') AS tokens
// build up an FTS query using the `AND` operator
WITH k, '(' + apoc.text.join(tokens, ' AND ') + ')' AS query
CALL db.index.fulltext.queryNodes('Text', query)
YIELD node, score
// also return the keyword node, so we know which keyword matched
RETURN k, node, sum(score)

These are the Lucene queries that get generated (the apache AND age row assumes an additional apache-age keyword node, illustrating the caveat discussed above):

╒════════════════════╕
│"query"             │
╞════════════════════╡
│"(apache AND spark)"│
├────────────────────┤
│"(gpu)"             │
├────────────────────┤
│"(scala)"           │
├────────────────────┤
│"(apache AND age)"  │
└────────────────────┘
Finally, clean up the temporary text node:

MATCH (n:Text) DELETE n

Result:

╒════════════════════════╤══════════════════════════════════════╤═══════════════════╕
│"n"                     │"node"                                │"sum(score)"       │
╞════════════════════════╪══════════════════════════════════════╪═══════════════════╡
│{"value":"apache-spark"}│{"text":"Looking for Apache Spark …"} │0.33785906434059143│
├────────────────────────┼──────────────────────────────────────┼───────────────────┤
│{"value":"gpu"}         │{"text":"Looking for Apache Spark …"} │0.13164746761322021│
├────────────────────────┼──────────────────────────────────────┼───────────────────┤
│{"value":"scala"}       │{"text":"Looking for Apache Spark …"} │0.18063414096832275│
└────────────────────────┴──────────────────────────────────────┴───────────────────┘

(all three rows match the same Text node; its full text is truncated here)
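For reference, the keyword-cleaning step above can be mirrored in Python. build_fts_query is a hypothetical helper name; the regex is the one passed to apoc.text.regreplace in the Cypher query:

```python
import re

def build_fts_query(keyword):
    # Equivalent of apoc.text.regreplace(value, '[^a-zA-Z\d\s:]', ' '):
    # replace every character that is not alphanumeric, whitespace, or ':'
    # with a space.
    clean = re.sub(r"[^a-zA-Z0-9\s:]", " ", keyword)
    # split(clean, ' ') in Cypher, plus a filter dropping empty tokens
    # (the bare Cypher split would otherwise emit stray AND operands for
    # keywords containing consecutive special characters).
    tokens = [t for t in clean.split(" ") if t]
    # apoc.text.join(tokens, ' AND '), wrapped in parentheses
    return "(" + " AND ".join(tokens) + ")"

print(build_fts_query("apache-spark"))  # (apache AND spark)
print(build_fts_query("gpu"))           # (gpu)
```

The AND operator is what makes multi-token keywords like apache-spark require all of their tokens to appear in the text.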

Summary

In my opinion, there is no truly bulletproof solution here.
