Recently I found the dplyr.spark.hive package, which lets you use the dplyr front end with a spark or hive back end.

Information about how to install this package is in the package's README:
options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")
There are also many examples of how to use dplyr.spark.hive once you are already connected to hiveServer2 - have a look at them.

But I am not able to connect to hiveServer2, so I cannot benefit from the great power of that package...

I have tried commands such as the ones below, but without success. Does anyone have a solution or a comment on what I am doing wrong?
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
>
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+ port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
>
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
>
> my_db = src_SparkSQL(start.server = TRUE,
+ list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
EDIT 2

I have set paths in more system variables as shown below, but now I receive a warning telling me that some kind of Java logging configuration is not specified, even though I believe it is:
> library(dplyr.spark.hive,
+ lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
3: package ‘SparkR’ was built under R version 3.2.1
>
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
>
>
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
My log properties are not empty:
-bash-4.2$ wc /etc/hadoop/log4j.properties
179 432 6581 /etc/hadoop/log4j.properties
EDIT 3

My exact call to src_SparkSQL() is:
> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
And then the process never finishes (it hangs forever), even though the same settings work for beeline with these parameters:
beeline -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output
But I cannot pass the user name and dbname to src_SparkSQL(), so I tried to run the code from inside that function manually. I get the same problem: the code below also never finishes:
host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"
# dbConnect_retry =
# function(dr, url, retry){
# if(retry > 0)
# tryCatch(
# dbConnect(drv = dr, url = url),
# error =
# function(e) {
# Sys.sleep(0.1)
# dbConnect_retry(dr = dr, url = url, retry - 1)})
# else dbConnect(drv = dr, url = url)}
#################
## con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################
con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))
Maybe the url should also contain /loghost - the dbname?
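For illustration only, the variant being asked about would be something like the following (hypothetical, not verified):

# Hypothetical url carrying the dbname and auth mode, mirroring the beeline connection string.
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")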
2 Answers

e5nszbig1#
I see now that you have tried several things and hit multiple errors. Let me comment on them one by one.

my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found

The RJDBC object cannot be created. Unless we fix that, nothing else will work, workaround or not. Have you set HADOOP_JAR, for instance with

Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")

? Sorry, it looks like I missed this in the instructions; I will fix that.

my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
  port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
Same problem. Note that the host and port arguments do not take URL syntax, just a host name and a port number; the URL is formed internally.
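For instance, based on the host, port and jar path from the question, the corrected call would look something like this (a sketch, not verified):

# Set HADOOP_JAR first, then pass only the bare host name and port;
# src_SparkSQL builds the jdbc:hive2:// URL internally.
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
my_db = src_SparkSQL(host = 'tools-1.hadoop.srv', port = 10000)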
my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Stop that thriftserver first, or connect to the existing one instead; either way, the class not found problem still needs fixing.
my_db = src_SparkSQL(start.server = TRUE,
  list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1
Same as above.
Plan (a sketch of the two steps follows this list):

1. Set HADOOP_JAR. Find the host and port the thriftserver is running on, if they are not the defaults. Try src_SparkSQL with start.server = FALSE. If you are happy with the result, you are done; otherwise go to step 2.
2. Stop the existing thriftserver and try src_SparkSQL again with start.server = TRUE.
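A rough sketch of those two steps, reusing the jar path and thriftserver from the question (verify the paths against your own setup):

# Step 1: set HADOOP_JAR and connect to the thriftserver that is already running
# (process 37580 in the error above).
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
my_db = src_SparkSQL(start.server = FALSE)

# Step 2, only if step 1 does not work: stop the running thriftserver first,
# then let src_SparkSQL start its own instance.
# system("/opt/spark-1.5.0-bin-hadoop2.4/sbin/stop-thriftserver.sh")
# my_db = src_SparkSQL(start.server = TRUE)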
Let me know how it goes.
ukqbszuj2#
The problem was that I did not specify the correct classPath needed inside the JDBC function that creates the driver. The classPath argument in the dplyr.spark.hive package is passed via the HADOOP_JAR global variable.

To use JDBC as a driver for hiveServer2 (over the Thrift protocol), one needs to add at least these 3 .jars containing the Java classes required to create the proper driver:

hive-jdbc-1.0.0-standalone.jar
hadoop/common/lib/commons-configuration-1.6.jar
hadoop/common/hadoop-common-2.4.1.jar

The versions are arbitrary; they should be compatible with the locally installed versions of hive, hadoop and hiveServer2. Their paths need to be joined with .Platform$path.sep (as described in this post).
Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 url; in my case this mattered as well. And finally, the proper connection to hiveServer2 with the RJDBC package is:
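A minimal sketch of such a connection, assuming the three driver jars live under /opt/hive/lib and /usr/share/hadoop/common (placeholder paths) and that the url carries the loghost database and noSasl auth as in the beeline call from the question:

library(RJDBC)

# Put all required driver classes on the classPath; the paths are placeholders
# and must match the locally installed hive/hadoop versions.
Sys.setenv(HADOOP_JAR = paste("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                              "/usr/share/hadoop/common/lib/commons-configuration-1.6.jar",
                              "/usr/share/hadoop/common/hadoop-common-2.4.1.jar",
                              sep = .Platform$path.sep))

# Create the Hive JDBC driver from that classPath and open the connection.
drv = JDBC("org.apache.hive.jdbc.HiveDriver", Sys.getenv("HADOOP_JAR"))
url = "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl"
con = dbConnect(drv, url, user = "mkosinski")

# Quick check that the connection works.
dbGetQuery(con, "show tables")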