Usage of the org.apache.lucene.analysis.Analyzer Class, with Code Examples


This article collects a number of Java code examples for the org.apache.lucene.analysis.Analyzer class and shows how the class is used in practice. The examples come mainly from GitHub, Stack Overflow, Maven, and similar platforms, and were extracted from a selection of well-regarded projects, so they should serve as useful references. Details of the Analyzer class:
Package: org.apache.lucene.analysis
Class name: Analyzer

About Analyzer

An Analyzer builds TokenStreams, which analyze text. It thus represents a policy for extracting index terms from text.

In order to define what analysis is done, subclasses must define their TokenStreamComponents in #createComponents(String,Reader). The components are then reused in each call to #tokenStream(String,Reader).

Simple example:

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new FooTokenizer(reader);
    TokenStream filter = new FooFilter(source);
    filter = new BarFilter(filter);
    return new TokenStreamComponents(source, filter);
  }
};

For more examples, see the org.apache.lucene.analysis package documentation.
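
Since FooTokenizer, FooFilter, and BarFilter are only placeholders, here is a minimal, self-contained sketch of the same pattern built from real components (WhitespaceTokenizer followed by LowerCaseFilter). It assumes the Lucene 4.x API implied by the createComponents(String, Reader) signature above and by the Version.LUCENE_40 constants used later in this article; the class name CustomAnalyzerSketch and the field name "body" are invented for illustration. (In Lucene 5 and later, createComponents takes only the field name and tokenizers no longer take a Reader in their constructors.)

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class CustomAnalyzerSketch {

  public static void main(String[] args) throws IOException {
    // Analysis policy: split on whitespace, then lowercase each token.
    Analyzer analyzer = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        TokenStream filter = new LowerCaseFilter(Version.LUCENE_40, source);
        return new TokenStreamComponents(source, filter);
      }
    };

    // tokenStream() reuses the components built in createComponents() on every call.
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader("Hello Lucene World"))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();                           // required before the first incrementToken()
      while (ts.incrementToken()) {
        System.out.println(term.toString()); // prints: hello, lucene, world
      }
      ts.end();
    }
    analyzer.close();
  }
}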

For some concrete implementations bundled with Lucene, look in the analysis modules:

  • Common: Analyzers for indexing content in different languages and domains.
  • ICU: Exposes functionality from ICU to Apache Lucene.
  • Kuromoji: Morphological analyzer for Japanese text.
  • Morfologik: Dictionary-driven lemmatization for the Polish language.
  • Phonetic: Analysis for indexing phonetic signatures (for sounds-alike search).
  • Smart Chinese: Analyzer for Simplified Chinese, which indexes words.
  • Stempel: Algorithmic Stemmer for the Polish Language.
  • UIMA: Analysis integration with Apache UIMA.
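
All of these modules plug into the same Analyzer contract, so they can be used as drop-in policies. As a minimal sketch (again assuming the Lucene 4.x-style constructors used by the examples later in this article; the class name BundledAnalyzerSketch is invented for illustration), the snippet below instantiates StandardAnalyzer from the common module and attaches it to an IndexWriterConfig, which is how an analysis policy is usually handed to an IndexWriter.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer; // from the "Common" module (lucene-analyzers-common)
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.Version;

public class BundledAnalyzerSketch {

  public static void main(String[] args) {
    // A concrete Analyzer from the common module: grammar-based tokenization,
    // lowercasing, and English stop-word removal.
    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

    // The analyzer is the indexing-time policy: IndexWriter applies it to every
    // analyzed field of every document added through this config.
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);

    // ... open an IndexWriter with this config, add documents, etc.
    analyzer.close();
  }
}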

Code Examples

Code example source: stackoverflow.com

public final class LuceneUtils {

  public static List<String> parseKeywords(Analyzer analyzer, String field, String keywords) {

    List<String> result = new ArrayList<String>();
    TokenStream stream  = analyzer.tokenStream(field, new StringReader(keywords));
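    // NOTE: on Lucene 4.x and later, stream.reset() must be called before the first
    // incrementToken(), and TermAttribute has been replaced by CharTermAttribute.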

    try {
      while(stream.incrementToken()) {
        result.add(stream.getAttribute(TermAttribute.class).term());
      }
    }
    catch(IOException e) {
      // not thrown b/c we're using a string reader...
    }

    return result;
  }  
}

Code example source: jeremylong/DependencyCheck

/**
 * Closes the CPE Index.
 */
@Override
public synchronized void close() {
  final int count = INSTANCE.usageCount.get() - 1;
  if (count <= 0) {
    INSTANCE.usageCount.set(0);
    if (searchingAnalyzer != null) {
      searchingAnalyzer.close();
      searchingAnalyzer = null;
    }
    if (indexReader != null) {
      try {
        indexReader.close();
      } catch (IOException ex) {
        LOGGER.trace("", ex);
      }
      indexReader = null;
    }
    queryParser = null;
    indexSearcher = null;
    if (index != null) {
      index.close();
      index = null;
    }
  }
}

Code example source: tjake/Solandra

tokReader = new StringReader(field.stringValue());
tokens = analyzer.reusableTokenStream(field.name(), tokReader);
if (position > 0)
    position += analyzer.getPositionIncrementGap(field.name());
tokens.reset(); // reset the TokenStream to the first token
offsetAttribute = (OffsetAttribute) tokens.addAttribute(OffsetAttribute.class);
posIncrAttribute = (PositionIncrementAttribute) tokens.addAttribute(PositionIncrementAttribute.class);
Term term = new Term(field.name(), termAttribute.toString());
ThriftTerm tterm = new ThriftTerm(term.field()).setText(
        ByteBuffer.wrap(term.text().getBytes("UTF-8"))).setIs_binary(false);
position += (posIncrAttribute.getPositionIncrement() - 1);
offsetVector.add(lastOffset + offsetAttribute.startOffset());
offsetVector.add(lastOffset + offsetAttribute.endOffset());

Code example source: org.apache.lucene/lucene-core

try (TokenStream stream = tokenStream = field.tokenStream(docState.analyzer, tokenStream)) {
 stream.reset();
 invertState.setAttributeSource(stream);
 termsHashPerField.start(field, first);
 while (stream.incrementToken()) {
  int posIncr = invertState.posIncrAttribute.getPositionIncrement();
  invertState.position += posIncr;
  if (invertState.position < invertState.lastPosition) {
  int startOffset = invertState.offset + invertState.offsetAttribute.startOffset();
  int endOffset = invertState.offset + invertState.offsetAttribute.endOffset();
  if (startOffset < invertState.lastStartOffset || endOffset < startOffset) {
   throw new IllegalArgumentException("startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards "
 stream.end();
 invertState.position += invertState.posIncrAttribute.getPositionIncrement();
 invertState.offset += invertState.offsetAttribute.endOffset();
 invertState.position += docState.analyzer.getPositionIncrementGap(fieldInfo.name);
 invertState.offset += docState.analyzer.getOffsetGap(fieldInfo.name);

Code example source: sanluan/PublicCMS

/**
 * @param text
 * @return
 */
public Set<String> getToken(String text) {
  Set<String> list = new LinkedHashSet<>();
  if (CommonUtils.notEmpty(text)) {
    try (StringReader stringReader = new StringReader(text);
        TokenStream tokenStream = dao.getAnalyzer().tokenStream(CommonConstants.BLANK, stringReader)) {
      CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        list.add(charTermAttribute.toString());
      }
      tokenStream.end();
      return list;
    } catch (IOException e) {
      return list;
    }
  }
  return list;
}

Code example source: apache/jackrabbit-oak

@Test
public void testParentPathSearchingTokenization() throws Exception {
  try {
    TokenStream ts = parentPathSearchingAnalyzer.tokenStream("text", new StringReader("/jcr:a/b/jcr:c"));
    assertTokenStreamContents(ts, new String[]{"/jcr:a/b"});
  } finally {
    parentPathSearchingAnalyzer.close();
  }
}

Code example source: oracle/opengrok

testM.invoke(testC, testA.tokenStream("refs", new StringReader(input)), output, null, null, null, null, null, input.length(), true);
System.out.println("Testing full with " + name);
testM.invoke(testC, testA.tokenStream("full", new StringReader(input)), output, null, null, null, null, null, input.length(), true);

Code example source: synhershko/HebMorph

@SuppressWarnings("MismatchedQueryAndUpdateOfCollection")
@Test
public void testLemmatization() throws Exception {
  final TokenStream ts = analyzer.tokenStream("foo", new StringReader("מינהל"));
  ts.reset();
  Set<String> terms = new HashSet<>();
  while (ts.incrementToken()) {
    CharTermAttribute att = ts.getAttribute(CharTermAttribute.class);
    terms.add(new String(att.buffer(), 0, att.length()));
    //System.out.println(new String(att.buffer(), 0, att.length()));
  }
}

Code example source: INL/BlackLab

public static void main(String[] args) throws IOException {
  String TEST_STR = "Hé jij И!  раскази и повѣсти. Ст]' Дѣдо  	Нисторъ. Ива";
  try (Analyzer a = new BLStandardAnalyzer()) {
    TokenStream ts = a.tokenStream("test", new StringReader(TEST_STR));
    CharTermAttribute ta = ts.addAttribute(CharTermAttribute.class);
    while (ts.incrementToken()) {
      System.out.println(new String(ta.buffer(), 0, ta.length()));
    }
    TokenStream ts2 = a.tokenStream(ComplexFieldUtil.propertyField("test", null, "s"),
        new StringReader(TEST_STR));
    ta = ts2.addAttribute(CharTermAttribute.class);
    while (ts2.incrementToken()) {
      System.out.println(new String(ta.buffer(), 0, ta.length()));
    }
  }
}

Code example source: stackoverflow.com

public class Tokens {

  private static void printTokens(String string, Analyzer analyzer) throws IOException {
    System.out.println("Using " + analyzer.getClass().getName());
    TokenStream ts = analyzer.tokenStream("default", new StringReader(string));
    OffsetAttribute offsetAttribute = ts.addAttribute(OffsetAttribute.class);
    CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);

    while(ts.incrementToken()) {
      int startOffset = offsetAttribute.startOffset();
      int endOffset = offsetAttribute.endOffset();
      String term = charTermAttribute.toString();
      System.out.println(term + " (" + startOffset + " " + endOffset + ")");
    }
    System.out.println();
  }

  public static void main(String[] args) throws IOException {
    printTokens("foo-bar 1-2-3", new StandardAnalyzer(Version.LUCENE_40));
    printTokens("foo-bar 1-2-3", new ClassicAnalyzer(Version.LUCENE_40));

    QueryParser standardQP = new QueryParser(Version.LUCENE_40, "", new StandardAnalyzer(Version.LUCENE_40));
    BooleanQuery q1 = (BooleanQuery) standardQP.parse("someField:(foo\\-bar\\ 1\\-2\\-3)");
    System.out.println(q1.toString() + "     # of clauses:" + q1.getClauses().length);
  }
}

Code example source: linkedin/indextank-engine

public Iterator<AToken> parseDocumentField(String fieldName, String content) {
  final TokenStream tkstream = analyzer.tokenStream(fieldName, new StringReader(content));
  final TermAttribute termAtt = tkstream.addAttribute(TermAttribute.class);
  final PositionIncrementAttribute posIncrAttribute = tkstream.addAttribute(PositionIncrementAttribute.class);
  final OffsetAttribute offsetAtt = tkstream.addAttribute(OffsetAttribute.class);

Code example source: com.atlassian.studio/studio-theme-jira-plugin

private Token[] parseText(String text) throws IOException {
  if (text == null || text.trim().equals(""))
    return new Token[0];
  final ArrayList result = new ArrayList();
  final TokenStream ts = analyzer.tokenStream(DocumentBuilder.CONTENT_FIELD_NAME, new StringReader(text));
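  // NOTE: TokenStream.next() returning Token is the pre-Lucene-3.0 API; later versions
  // iterate with incrementToken() and read attributes such as CharTermAttribute instead.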
  for (Token token = ts.next(); token != null; token = ts.next()) {
    result.add(token);
  }
  return (Token[]) result.toArray(new Token[result.size()]);
}

Code example source: hibernate/hibernate-search

public static List<String> tokenizedTermValues(Analyzer analyzer, String field, String text) throws IOException {
  final List<String> tokenList = new ArrayList<String>();
  final TokenStream stream = analyzer.tokenStream( field, new StringReader( text ) );
  try {
    CharTermAttribute term = stream.addAttribute( CharTermAttribute.class );
    stream.reset();
    while ( stream.incrementToken() ) {
      String s = new String( term.buffer(), 0, term.length() );
      tokenList.add( s );
    }
    stream.end();
  }
  finally {
    stream.close();
  }
  return tokenList;
}

Code example source: org.elasticsearch/elasticsearch

try (TokenStream ts = analyzer.tokenStream(fieldName, text)) {
  CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
  ts.reset();
  while (ts.incrementToken()) {
    skipTerms.add(new Term(fieldName, termAtt.toString()));
  ts.end();
  BytesRef text;
  while ((text = termsEnum.next()) != null) {
    skipTerms.add(new Term(fieldName, text.utf8ToString()));

Code example source: oracle/opengrok

private SToken[] getTokens(String text) throws IOException {
  //FIXME somehow integrate below cycle to getSummary to save the cloning and memory,
  //also creating Tokens is suboptimal with 3.0.0 , this whole class could be replaced by highlighter        
  ArrayList<SToken> result = new ArrayList<>();
  try (TokenStream ts = analyzer.tokenStream("full", text)) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      SToken t = new SToken(term.buffer(), 0, term.length(), offset.startOffset(), offset.endOffset());
      result.add(t);
    }
    ts.end();
  }
  return result.toArray(new SToken[result.size()]);
}

Code example source: com.qwazr/qwazr-search

@Override
@JsonIgnore
final public Query getQuery(final QueryContext queryContext) throws IOException {
  final BooleanQuery.Builder builder = new BooleanQuery.Builder();
  final String resolvedField = resolveField(queryContext.getFieldMap());
  try (final TokenStream tokenStream = queryContext.getQueryAnalyzer().tokenStream(resolvedField, query_string)) {
    final CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
    final PositionIncrementAttribute pocincrAttribute =
        tokenStream.getAttribute(PositionIncrementAttribute.class);
    tokenStream.reset();
    int pos = 0;
    while (tokenStream.incrementToken()) {
      final String charTerm = charTermAttribute.toString();
      int start = pos - distance;
      if (start < 0)
        start = 0;
      final int end = pos + distance + 1;
      for (int i = start; i < end; i++) {
        final float dist = Math.abs(i - pos) + 1;
        final float boost = 1 / dist;
        final SpanTermQuery spanTermQuery = new SpanTermQuery(new Term(resolvedField, charTerm));
        Query query = new BoostQuery(new SpanPositionRangeQuery(spanTermQuery, i, i + 1), boost);
        builder.add(new BooleanClause(query, BooleanClause.Occur.SHOULD));
      }
      pos += pocincrAttribute.getPositionIncrement();
    }
    return builder.build();
  }
}

Code example source: org.apache.lucene/lucene-analyzers-common

try (TokenStream ts = analyzer.tokenStream("", text)) {
 CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
 PositionIncrementAttribute posIncAtt = ts.addAttribute(PositionIncrementAttribute.class);
 ts.reset();
 reuse.clear();
 while (ts.incrementToken()) {
  int length = termAtt.length();
  if (length == 0) {
   throw new IllegalArgumentException("term: " + text + " analyzed to a zero-length token");
  if (posIncAtt.getPositionIncrement() != 1) {
   throw new IllegalArgumentException("term: " + text + " analyzed to a token (" + termAtt +
                     ") with position increment != 1 (got: " + posIncAtt.getPositionIncrement() + ")");
   reuse.setLength(reuse.length() + 1);
  System.arraycopy(termAtt.buffer(), 0, reuse.chars(), end, length);
  reuse.setLength(reuse.length() + length);

Code example source: synhershko/HebMorph

private ArrayList<Data> analyze(Analyzer analyzer1) throws IOException {
    ArrayList<Data> results = new ArrayList<>(50);
    TokenStream ts = analyzer1.tokenStream("foo", text);
    ts.reset();
    while (ts.incrementToken()) {
      Data data = new Data();
      OffsetAttribute offsetAttribute = ts.getAttribute(OffsetAttribute.class);
      data.startOffset = offsetAttribute.startOffset();
      data.endOffset = offsetAttribute.endOffset();
      data.positionLength = ts.getAttribute(PositionLengthAttribute.class).getPositionLength();
      data.positionIncGap = ts.getAttribute(PositionIncrementAttribute.class).getPositionIncrement();
      data.tokenType = ts.getAttribute(HebrewTokenTypeAttribute.class).getType().toString();
      data.term = ts.getAttribute(CharTermAttribute.class).toString();

      if (ts.getAttribute(KeywordAttribute.class) != null)
        data.isKeyword = ts.getAttribute(KeywordAttribute.class).isKeyword();
      // System.out.println(data.term + " " + data.tokenType);
      results.add(data);
    }
    ts.close();

    return results;
  }
}

Code example source: kite-sdk/kite

@Override
protected boolean doProcess(Record record) {
 try {
  List outputValues = record.get(outputFieldName);
  for (Object value : record.get(inputFieldName)) {
   reader.setValue(value.toString());
   TokenStream tokenStream = analyzer.tokenStream("content", reader);
   tokenStream.reset();
   while (tokenStream.incrementToken()) {
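    // 'token' is presumably a CharTermAttribute field declared elsewhere in the original
    // class; its declaration is not shown in this excerpt.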
    if (token.length() > 0) { // incrementToken() updates the token!
     String tokenStr = new String(token.buffer(), 0, token.length());
     outputValues.add(tokenStr);
    }
   }
   tokenStream.end();
   tokenStream.close();
  }
 } catch (IOException e) {
  throw new MorphlineRuntimeException(e);
 }
 
 // pass record to next command in chain:
 return super.doProcess(record);
}

Code example source: org.elasticsearch/elasticsearch

private static int findGoodEndForNoHighlightExcerpt(int noMatchSize, Analyzer analyzer, String fieldName, String contents)
      throws IOException {
    try (TokenStream tokenStream = analyzer.tokenStream(fieldName, contents)) {
      if (!tokenStream.hasAttribute(OffsetAttribute.class)) {
        // Can't split on term boundaries without offsets
        return -1;
      }
      int end = -1;
      tokenStream.reset();
      while (tokenStream.incrementToken()) {
        OffsetAttribute attr = tokenStream.getAttribute(OffsetAttribute.class);
        if (attr.endOffset() >= noMatchSize) {
          // Jump to the end of this token if it wouldn't put us past the boundary
          if (attr.endOffset() == noMatchSize) {
            end = noMatchSize;
          }
          return end;
        }
        end = attr.endOffset();
      }
      tokenStream.end();
      // We've exhausted the token stream so we should just highlight everything.
      return end;
    }
  }
}
