Nutch: how do I get the take.screenshot and screenshot.location properties to work?

Posted by 57hvy0tb on 2021-05-29 in Hadoop

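I have been learning Nutch (version nutch-1.14) for a week, and it works fine both in local mode and on hadoop-2.7.2 (pseudo-distributed mode). Today I came across the "take.screenshot" and "screenshot.location" properties in nutch-site.xml. After setting them, Nutch still crawls the seed URLs, but no screenshots are taken, neither in local mode nor on Hadoop.
nutch-site.xml settings for local mode: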

<property>
 <name>take.screenshot</name>
 <value>true</value>
 <description>
  Boolean property determining whether the protocol-htmlunit
  WebDriver should capture a screenshot of the URL. If set to
  true remember to define the 'screenshot.location'
  property as this determines the location screenshots should be
  persisted to on HDFS. If that property is not set, screenshots
  are simply discarded.
 </description>
</property>

<property>
 <name>screenshot.location</name>
 <value>/home/user/nutch-1.14/screenshot</value>
 <description>
  The location on disk where a URL screenshot should be saved
  to if the 'take.screenshot' property is set to true.
  By default this is null, in this case screenshots held in memory
  are simply discarded.
 </description>
</property>

nutch-site.xml settings for Hadoop:

<property>
 <name>take.screenshot</name>
 <value>true</value>
</property>

<property>
 <name>screenshot.location</name>
 <value>/screenshot</value>
</property>

Note: the "screenshot" directory exists in HDFS.

dy1byipe1#

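HtmlUnit is a "GUI-less browser for Java programs" (see http://htmlunit.sourceforge.net/). This means HtmlUnit does not render HTML pages at all. Internally, every operation is performed on the DOM tree, without any layout. That is why it offers no way to take a screenshot.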

lrl1mhuk2#

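Have you enabled protocol-selenium? Basically, screenshots only work with that protocol plugin. By default Nutch uses the protocol-http plugin, which does not support this option, even if you enable these settings in your configuration.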
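
As a rough sketch of what enabling the plugin could look like in nutch-site.xml: the fragment below swaps protocol-http for protocol-selenium in plugin.includes. Everything in the value other than protocol-selenium is an assumed example list, not taken from the question, so keep your own plugin list and change only the protocol entry. The screenshot property names the plugin actually reads may also differ between versions, so it is worth checking the nutch-default.xml shipped with your release.

```xml
<!-- Sketch only: the plugin list other than protocol-selenium is an assumed example. -->
<!-- Keep your existing plugin.includes value and replace only protocol-http. -->
<property>
 <name>plugin.includes</name>
 <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

protocol-selenium drives a real browser, so the machine running the fetcher (every worker node, when running on Hadoop) presumably needs that browser and its driver installed.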
