我已安装apache nutch并使用以下工具运行爬网:
bin/crawl ./urls/seed.txt crawl http://localhost:8983/solr/ 5
从运行时/本地运行良好。当我从runtime/deploy运行相同的命令时,我得到:
14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: starting at 2014-07-16 19:43:35
14/07/16 19:43:35 INFO crawl.InjectorJob: InjectorJob: Injecting urlDir: urls/seed.txt
14/07/16 19:43:37 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
14/07/16 19:43:37 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
14/07/16 19:43:37 INFO crawl.InjectorJob: InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
14/07/16 19:43:37 INFO mapred.JobClient: Default number of map tasks: null
14/07/16 19:43:37 INFO mapred.JobClient: Setting default number of map tasks based on cluster size to : 12
14/07/16 19:43:37 INFO mapred.JobClient: Default number of reduce tasks: 0
14/07/16 19:43:38 INFO security.ShellBasedUnixGroupsMapping: add hadoop to shell userGroupsCache
14/07/16 19:43:38 INFO mapred.JobClient: Setting group to hadoop
14/07/16 19:43:39 INFO mapred.JobClient: Cleaning up the staging area hdfs://172.31.13.61:9000/mnt/var/lib/hadoop/tmp/mapred/staging/hadoop/.staging/job_201407161337_0024
14/07/16 19:43:39 ERROR security.UserGroupInformation: PriviledgedActionException as:hadoop cause:org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
14/07/16 19:43:39 ERROR crawl.InjectorJob: InjectorJob: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: hdfs://172.31.13.61:9000/user/hadoop/urls/seed.txt
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:235)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1016)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1033)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:951)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:904)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1140)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:904)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:501)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:531)
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:50)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:233)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:251)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:273)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:282)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:187)
它没有找到seed.txt文件,是的,它确实存在于$home/url/seed.txt中。我用的是aws emr和cassandra。任何帮助都将不胜感激。
暂无答案!
目前还没有任何答案,快来回答吧!