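I have been learning Nutch (version nutch-1.14) for a week, and it works fine both in local mode and on hadoop-2.7.2 (pseudo-distributed mode). Today I came across the "take.screenshot" and "screenshot.location" properties in nutch-site.xml. After setting these properties, Nutch still crawls the seed URLs, but it takes no screenshots, neither in local mode nor on Hadoop.
nutch-site.xml settings for local mode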
<property>
<name>take.screenshot</name>
<value>true</value>
<description>
Boolean property determining whether the protocol-htmlunit
WebDriver should capture a screenshot of the URL. If set to
true remember to define the 'screenshot.location'
property as this determines the location screenshots should be
persisted to on HDFS. If that property is not set, screenshots
are simply discarded.
</description>
</property>
<property>
<name>screenshot.location</name>
<value>/home/user/nutch-1.14/screenshot</value>
<description>
The location on disk where a URL screenshot should be saved
to if the 'take.screenshot' property is set to true.
By default this is null, in this case screenshots held in memory
are simply discarded.
</description>
</property>
nutch-site.xml settings for Hadoop
<property>
<name>take.screenshot</name>
<value>true</value>
</property>
<property>
<name>screenshot.location</name>
<value>/screenshot</value>
</property>
Note: the "screenshot" directory exists in HDFS.
2 Answers
dy1byipe1#
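HtmlUnit is a "GUI-less browser for Java programs" (see http://htmlunit.sourceforge.net/). This means that HtmlUnit does not render the HTML page at all. Internally, everything is done on the DOM tree, without any layout. That is why there is no option to take a screenshot.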
lrl1mhuk2#
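Have you enabled protocol-selenium? Basically, this only works with that protocol. By default Nutch uses the protocol-http plugin, which does not support this option, even if you enable these settings in your configuration.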
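As a rough sketch of what enabling it could look like in nutch-site.xml: protocol-http is swapped for protocol-selenium in plugin.includes. The rest of the plugin list here is only an assumption for illustration, so keep whatever your own plugin.includes already contains and change just the protocol entry. The screenshot properties are copied from the question.

```xml
<!-- Sketch only: the plugin list is an assumed example, not a required value.
     Replace just the protocol-http entry in your existing plugin.includes. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<!-- Screenshot settings as given in the question (Hadoop variant) -->
<property>
  <name>take.screenshot</name>
  <value>true</value>
</property>
<property>
  <name>screenshot.location</name>
  <value>/screenshot</value>
</property>
```

Keep in mind that protocol-selenium drives a real browser, so the machine running the fetcher most likely needs a matching browser and driver installed; check the selenium.* properties in your nutch-default.xml for the options your version offers.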