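This article collects code examples for the Java method org.apache.tika.Tika.parseToString() and shows how Tika.parseToString() is used in practice. The examples come mainly from GitHub, Stack Overflow, and Maven, and were extracted from selected projects, so they should serve as a useful reference. Details of the Tika.parseToString() method are as follows: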
Package path: org.apache.tika.Tika
Class name: Tika
Method name: parseToString
Parses the given file and returns the extracted text content.
To avoid unpredictable excess memory use, the returned string contains only up to the first #getMaxStringLength() characters extracted from the input document. Use the #setMaxStringLength(int) method to adjust this limit.
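The length limit described above can be tried out with a small sketch. The class name MaxStringLengthSketch, the helper extractLimited, and the sample text are illustrative assumptions and not part of Tika. The sketch assumes tika-core is on the classpath (plus a parser module such as tika-parsers if real text is to be extracted), and it relies only on setMaxStringLength(int) and parseToString(InputStream).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical demo class, not part of Tika.
class MaxStringLengthSketch {

    /** Extracts at most maxLength characters of text from the given bytes. */
    static String extractLimited(byte[] data, int maxLength)
            throws IOException, TikaException {
        Tika tika = new Tika();
        // Cap the size of the string returned by parseToString().
        tika.setMaxStringLength(maxLength);
        InputStream stream = new ByteArrayInputStream(data);
        // parseToString(InputStream) closes the stream itself.
        return tika.parseToString(stream);
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "A fairly long line of plain text used as sample input."
                .getBytes(StandardCharsets.UTF_8);
        String text = extractLimited(data, 10);
        // The extracted text should never be longer than the configured limit.
        System.out.println("length = " + text.length());
    }
}
```

If no text parser is available on the classpath, the returned string may be empty; the limit applies either way.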
Code example source: apache/tika
public static void main(String[] args) throws Exception {
    // Create a Tika instance with the default configuration
    Tika tika = new Tika();
    // Parse all given files and print out the extracted
    // text content
    for (String file : args) {
        String text = tika.parseToString(new File(file));
        System.out.print(text);
    }
}
Code example source: apache/tika
public static String parseToStringExample() throws Exception {
    File document = new File("example.doc");
    String content = new Tika().parseToString(document);
    System.out.print(content);
    return content;
}
Code example source: apache/tika
/**
 * Example of how to use Tika's parseToString method to parse the content of a file,
 * and return any text found.
 * <p>
 * Note: Tika.parseToString() will extract content from the outer container
 * document and any embedded/attached documents.
 *
 * @return The content of a file.
 */
public String parseToStringExample() throws IOException, SAXException, TikaException {
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.doc")) {
        return tika.parseToString(stream);
    }
}
Code example source: apache/tika
public void indexDocument(File file) throws Exception {
    Document document = new Document();
    document.add(new TextField("filename", file.getName(), Store.YES));
    document.add(new TextField("fulltext", tika.parseToString(file), Store.NO));
    writer.addDocument(document);
}
Code example source: apache/tika
/**
 * Parses the given document and returns the extracted text content.
 * The given input stream is closed by this method.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 * <p>
 * <strong>NOTE:</strong> Unlike most other Tika methods that take an
 * {@link InputStream}, this method will close the given stream for
 * you as a convenience. With other methods you are still responsible
 * for closing the stream or a wrapper instance returned by Tika.
 *
 * @param stream the document to be parsed
 * @return extracted text content
 * @throws IOException if the document can not be read
 * @throws TikaException if the document can not be parsed
 */
public String parseToString(InputStream stream)
        throws IOException, TikaException {
    return parseToString(stream, new Metadata());
}
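The Javadoc note above says that parseToString(InputStream) closes the stream it is given. A minimal sketch of how that could be observed follows. StreamClosingSketch, CloseTrackingStream, and closedAfterParse are hypothetical names introduced here for illustration; the sketch assumes tika-core is on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical demo class, not part of Tika.
class StreamClosingSketch {

    /** Hypothetical wrapper that remembers whether close() was called. */
    static class CloseTrackingStream extends FilterInputStream {
        boolean closed = false;

        CloseTrackingStream(byte[] data) {
            super(new ByteArrayInputStream(data));
        }

        @Override
        public void close() throws IOException {
            closed = true;
            super.close();
        }
    }

    /** Parses the bytes and reports whether Tika closed the stream. */
    static boolean closedAfterParse(byte[] data) throws IOException, TikaException {
        CloseTrackingStream stream = new CloseTrackingStream(data);
        // No try-with-resources here on purpose: the method under
        // discussion is documented to close the stream itself.
        new Tika().parseToString(stream);
        return stream.closed;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("closed by Tika: " + closedAfterParse(data));
    }
}
```

Wrapping the call in try-with-resources, as some of the examples on this page do, remains harmless, since closing an already closed stream is normally a no-op.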
Code example source: apache/tika
/**
 * Parses the file at the given path and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param path the path of the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 */
public String parseToString(Path path) throws IOException, TikaException {
    Metadata metadata = new Metadata();
    InputStream stream = TikaInputStream.get(path, metadata);
    return parseToString(stream, metadata);
}
Code example source: apache/tika
/**
 * Parses the resource at the given URL and returns the extracted
 * text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param url the URL of the resource to be parsed
 * @return extracted text content
 * @throws IOException if the resource can not be read
 * @throws TikaException if the resource can not be parsed
 */
public String parseToString(URL url) throws IOException, TikaException {
    Metadata metadata = new Metadata();
    InputStream stream = TikaInputStream.get(url, metadata);
    return parseToString(stream, metadata);
}
Code example source: rnewson/couchdb-lucene
public void parse(final InputStream in, final String contentType, final String fieldName, final Document doc)
        throws IOException {
    final Metadata md = new Metadata();
    md.set(HttpHeaders.CONTENT_TYPE, contentType);
    try {
        // Add body text.
        doc.add(text(fieldName, tika.parseToString(in, md), false));
    } catch (final IOException e) {
        log.warn("Failed to index an attachment.", e);
        return;
    } catch (final TikaException e) {
        log.warn("Failed to parse an attachment.", e);
        return;
    }
    // Add DC attributes.
    addDublinCoreAttributes(md, doc);
}
Code example source: apache/tika
/**
 * Parses the given file and returns the extracted text content.
 * <p>
 * To avoid unpredictable excess memory use, the returned string contains
 * only up to {@link #getMaxStringLength()} first characters extracted
 * from the input document. Use the {@link #setMaxStringLength(int)}
 * method to adjust this limitation.
 *
 * @param file the file to be parsed
 * @return extracted text content
 * @throws IOException if the file can not be read
 * @throws TikaException if the file can not be parsed
 * @see #parseToString(Path)
 */
public String parseToString(File file) throws IOException, TikaException {
    Metadata metadata = new Metadata();
    @SuppressWarnings("deprecation")
    InputStream stream = TikaInputStream.get(file, metadata);
    return parseToString(stream, metadata);
}
Code example source: apache/tika
public TrecDocument summarize(File file) throws FileNotFoundException,
        IOException, TikaException {
    Tika tika = new Tika();
    Metadata met = new Metadata();
    String contents = tika.parseToString(new FileInputStream(file), met);
    return new TrecDocument(met.get(TikaCoreProperties.RESOURCE_NAME_KEY), contents,
            met.getDate(TikaCoreProperties.CREATED));
}
Code example source: stackoverflow.com
private void compareXlsx(File expected, File result) throws IOException, TikaException {
    Tika tika = new Tika();
    String expectedText = tika.parseToString(expected);
    String resultText = tika.parseToString(result);
    assertEquals(expectedText, resultText);
}
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.13</version>
    <scope>test</scope>
</dependency>
Code example source: org.onehippo.cms7/hippo-cms-api
private String doParse(final InputStream inputStream) {
    try {
        // tika parseToString already closes the inputStream
        return tika.parseToString(inputStream);
    } catch (TikaException e) {
        throw new IllegalStateException("Unexpected TikaException processing failure", e);
    } catch (IOException e) {
        throw new IllegalStateException("Unexpected IOException processing failure", e);
    }
}
Code example source: stackoverflow.com
public String parseToStringExample() throws IOException, SAXException, TikaException
{
    Tika tika = new Tika();
    try (InputStream stream = ParsingExample.class.getResourceAsStream("test.pdf")) {
        return tika.parseToString(stream); // This should return the PDF's text
    }
}
Code example source: stackoverflow.com
File inputFile = ...;
Tika tika = new Tika();
String extractedText = tika.parseToString(inputFile);
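Code example source: stackoverflow.com
Tika tika = new Tika();
Metadata metadata = new Metadata();
metadata.set(Metadata.RESOURCE_NAME_KEY, "myfile.name");
// Note: this Metadata object is not passed to parseToString(File),
// which creates its own Metadata instance internally.
String text = tika.parseToString(new File("myfile.name"));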
Code example source: org.xwiki.platform/xwiki-platform-search-lucene-api
private String getContentAsText(XWikiDocument doc, XWikiContext context)
{
    String contentText = null;
    try {
        XWikiAttachment att = doc.getAttachment(this.filename);
        LOGGER.debug("Start parsing attachement [{}] in document [{}]", this.filename, doc.getDocumentReference());
        Tika tika = new Tika();
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, this.filename);
        contentText = StringUtils.lowerCase(tika.parseToString(att.getContentInputStream(context), metadata));
    } catch (Throwable ex) {
        LOGGER.warn("error getting content of attachment [{}] for document [{}]",
            new Object[] {this.filename, doc.getDocumentReference(), ex});
    }
    return contentText;
}