How to crawl .pdf links with Apache Nutch

j5fpnvbx · posted 2021-06-03 in Hadoop

I have a website to crawl that includes some links to PDF files. I want Nutch to crawl those links and dump them as .pdf files. I am using Apache Nutch 1.6, and I am also trying this from Java:

ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));
SegmentReader.main(tokenize(dumpArg));

Can anyone help me with this?
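
For reference, here is a minimal sketch of how those two calls are typically wired up against Nutch 1.6. The tokenize helper, the seed/crawl directories, the crawl parameters and the segment name are assumptions for illustration, not details taken from the question.

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Crawl;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlPdfSite {

    // Hypothetical helper: split a command-line style string into an argument array.
    private static String[] tokenize(String args) {
        return args.trim().split("\\s+");
    }

    public static void main(String[] args) throws Exception {
        // Assumed seed directory and crawl parameters.
        String crawlArg = "urls -dir crawl -depth 3 -topN 50";
        // Crawl is a Hadoop Tool in Nutch 1.6, so it is driven through ToolRunner.
        ToolRunner.run(NutchConfiguration.create(), new Crawl(), tokenize(crawlArg));

        // Dump one fetched segment to text output (the segment name is an assumption).
        String dumpArg = "-dump crawl/segments/20210603120000 dump_out -nofetch -nogenerate";
        SegmentReader.main(tokenize(dumpArg));
    }
}

Keep in mind that SegmentReader's -dump writes a text representation of the segment records rather than separate .pdf files.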


2j4z5cfb1#

If you want Nutch to crawl and index PDF documents, you have to enable document crawling and the Tika plugin:
Document crawling
1.1 Edit regex-urlfilter.txt and remove any occurrence of "pdf"


# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

1.2 Edit suffix-urlfilter.txt and remove any occurrence of "pdf"
1.3 Edit nutch-site.xml and add "parse-tika" and "parse-html" to the plugin.includes section:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|text)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable 
  protocol-httpclient, but be aware of possible intermittent problems with the 
  underlying commons-httpclient library.
  </description>
</property>

If what you really want is to download all the PDF files linked from a page, you can use a tool like Teleport on Windows or wget on *nix.


iecba09b2#

You could write your own plugin for the pdf mimetype,
or use the embedded Apache Tika parser, which can retrieve the text from a PDF.
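
For the second option, a minimal sketch of calling the Tika facade directly (outside Nutch) to retrieve the text of a PDF; the file path is an assumption:

import java.io.File;
import org.apache.tika.Tika;

public class PdfTextDump {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Tika auto-detects the MIME type (application/pdf here) and
        // delegates to its bundled PDF parser to extract plain text.
        String text = tika.parseToString(new File("document.pdf")); // assumed path
        System.out.println(text);
    }
}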
