dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
dayGroupedHosts = dayToHostPairTuple.groupByKey()
dayHostCount = dayGroupedHosts.map(lambda xs: (xs[0], len(Set(xs[1]))))
dailyHosts = (dayHostCount.sortByKey())
dailyHostsList = dailyHosts.cache().take(30)
print ('Unique hosts per day: %s' % dailyHostsList)
I am running this code with PySpark (Spark 3.0.0), and at the line dailyHosts = (dayHostCount.sortByKey()) I get an error. The traceback begins "Py4JJavaError Traceback (most recent call last)" and continues:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 52.0 failed 1 times, most recent failure: Lost task 0.0 in stage 52.0 (TID 213, LAPTOP-I236OH25, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "C:\spark\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 605, in main
  File "C:\spark\spark-3.0.0-bin-hadoop2.7\python\lib\pyspark.zip\pyspark\worker.py", line 595, in process
  File "C:\spark\spark-3.0.0-bin-hadoop2.7\python\pyspark\rdd.py", line 2596, in pipeline_func
    return func(split, prev_func(split, iterator))
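The traceback is truncated before the Python-level exception, but the most likely culprit is Set on the third line: Python 3 has no built-in named Set, so the lambda raises NameError: name 'Set' is not defined on the executor, which Py4J wraps in the Py4JJavaError shown above. The failure appears at the sortByKey line because, although most RDD transformations are lazy, sortByKey runs a small job to sample the keys for range partitioning (hence the collectAndServe in the message), and that job executes the upstream lambda containing Set. A minimal sketch of the corrected pipeline, using the built-in set and assuming, as in the question, that access_logs is an RDD of parsed log objects with date_time and host attributes:

# Count the number of distinct hosts per day of the month.
dayToHostPairTuple = access_logs.map(lambda log: (log.date_time.day, log.host))
dayGroupedHosts = dayToHostPairTuple.groupByKey()
# set() deduplicates the hosts for each day; Set is undefined in Python 3.
dayHostCount = dayGroupedHosts.map(lambda xs: (xs[0], len(set(xs[1]))))
dailyHosts = dayHostCount.sortByKey()
dailyHostsList = dailyHosts.cache().take(30)
print('Unique hosts per day: %s' % dailyHostsList)

On large logs, note that groupByKey ships every host string for a day to a single task; an equivalent that avoids this is dayToHostPairTuple.distinct().map(lambda p: (p[0], 1)).reduceByKey(lambda a, b: a + b).sortByKey(), which yields the same (day, unique-host-count) pairs.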