I want to tokenize the following text:
"text": "king martin"
into
[k, ki, kin, king, i, in, ing, ng, g, m, ma, mar, mart, martin, ar, art, arti, artin, r, rt, rti, rtin, t, ti, tin, i, in, n]
but more specifically into:
[kin, king, ing, mar, mart, martin, art, arti, artin, rti, rtin, tin]
Is there a way to get these tokens? I have tried the tokenizer below, but how do I tell it to "start at any start_offset"?
"ngram_tokenizer": {
  "type": "edge_ngram",
  "min_gram": "3",
  "max_gram": "15",
  "token_chars": [
    "letter",
    "digit"
  ]
}
Thanks!
1 Answer
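You can use the ngram tokenizer instead of edge_ngram. edge_ngram only emits grams anchored at the beginning of each token (kin, king, mar, mart, ...), whereas ngram emits grams starting at every offset within the token, which is the "any start_offset" behavior you are asking for.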
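As a rough sketch, the index settings might look like the following. It keeps the tokenizer name and parameters from the question and only changes the type. The analyzer name ngram_analyzer is a hypothetical name chosen for illustration. The index.max_ngram_diff entry is an assumption about your version: recent Elasticsearch releases limit max_gram - min_gram to 1 by default, so a spread of 12 has to be allowed explicitly, while older releases may not know this setting.

```json
{
  "settings": {
    "index": {
      "max_ngram_diff": 12
    },
    "analysis": {
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 15,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "ngram_tokenizer"
        }
      }
    }
  }
}
```

With these settings, "king martin" should be split on the space (it is neither a letter nor a digit) and produce kin, king, ing, mar, mart, marti, martin, art, arti, artin, rti, rtin, tin. Note that this includes marti, which is missing from the list in the question. You can check the output on your own cluster with the _analyze API.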