Can't create a dplyr src backed by SparkSQL in the dplyr.spark.hive package

Asked by ojsjcaue on 2021-06-02, in Hadoop

Recently I found out that dplyr.spark.hive enables dplyr frontend operations on a Spark or Hive backend.
Information on how to install this package is in the package's README:

options(repos = c("http://r.piccolboni.info", unlist(options("repos"))))
install.packages("dplyr.spark.hive")

There are also many examples of how to use dplyr.spark.hive once one is already connected to a hiveServer2 (check those out).
But I can't connect to a hiveServer2 myself, so I can't benefit from the great power of this package...
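For reference, the kind of workflow those examples show is roughly the following (the table name "logs" and the column "status" below are made up for illustration):

# sketch of typical usage once src_SparkSQL() succeeds; table and column names are hypothetical
my_db = src_SparkSQL()
logs  = tbl(my_db, "logs")
logs %>%
  filter(status == "error") %>%
  summarise(n = n())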
I tried commands like the ones below, but with no luck. Does anyone have a solution, or a comment on what I am doing wrong?

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> 
> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+                      port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found
> 
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1 
> 
> my_db = src_SparkSQL(start.server = TRUE,
+                      list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() : 
  Couldn't start thrift server:org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580.  Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

EDIT 2
I have set paths in a few more system variables like below, but now I receive a warning telling me that some Java logging configuration is not specified, even though I believe it is:

> library(dplyr.spark.hive, 
+         lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
3: package ‘SparkR’ was built under R version 3.2.1 
> 
> Sys.setenv(SPARK_HOME = "/opt/spark-1.5.0-bin-hadoop2.4")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HADOOP_HOME="/usr/share/hadoop")
> Sys.setenv(HADOOP_CONF_DIR="/etc/hadoop")
> Sys.setenv(PATH='/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/usr/share/hadoop/bin:/opt/hive/bin')
> 
> 
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

My log4j properties file is not empty:

-bash-4.2$ wc /etc/hadoop/log4j.properties 
 179  432 6581 /etc/hadoop/log4j.properties

EDIT 3
My exact call of src_SparkSQL():

> detach("package:SparkR", unload=TRUE)
Warning message:
package ‘SparkR’ was built under R version 3.2.1 
> detach("package:dplyr", unload=TRUE)
> library(dplyr.spark.hive, lib.loc = '/opt/wpusers/mkosinski/R/x86_64-redhat-linux-gnu-library/3.1')
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
2: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
> Sys.setenv(HIVE_SERVER2_THRIFT_BIND_HOST = 'tools-1.hadoop.srv')
> Sys.setenv(HIVE_SERVER2_THRIFT_PORT = '10000')
> my_db = src_SparkSQL()
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Then the process does not finish (it never stops), even though the same settings work for beeline with the following parameters:

beeline  -u "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl" -n mkosinski --outputformat=tsv --incremental=true -f sql_statement.sql > sql_output

But I can't pass the user name and dbname to src_SparkSQL(), so I tried to use the code from inside that function manually. I get the same problem: the code below also never finishes:

host = 'tools-1.hadoop.srv'
port = 10000
driverclass = "org.apache.hive.jdbc.HiveDriver"
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
library(RJDBC)
dr = JDBC(driverclass, Sys.getenv("HADOOP_JAR"))
url = paste0("jdbc:hive2://", host, ":", port)
class = "Hive"
con.class = paste0(class, "Connection") # class = "Hive"

# dbConnect_retry =
#   function(dr, url, retry){
#     if(retry > 0)
#       tryCatch(
#         dbConnect(drv = dr, url = url),
#         error =
#           function(e) {
#             Sys.sleep(0.1)
#             dbConnect_retry(dr = dr, url = url, retry - 1)})
#     else dbConnect(drv = dr, url = url)}
#################
## con = new(con.class, dbConnect_retry(dr, url, retry = 100))
#################

con = new(con.class, dbConnect(dr, url, user = "mkosinski", dbname = "loghost"))

Maybe the url should also contain /loghost, that is, the dbname?
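For illustration, the untested variant I have in mind would be:

# untested guess: put the database name and the auth mode directly into the jdbc url
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
con = new(con.class, dbConnect(dr, url, user = "mkosinski"))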

Answer 1 (e5nszbig):

I see now that you have tried multiple things. Let me comment error by error.

> my_db = src_SparkSQL()
Error in .jfindClass(as.character(driverClass)[1]) : class not found

The RJDBC object can not be created. Unless we solve this, nothing else will work, workaround or not. Did you set HADOOP_JAR, for instance with Sys.setenv(HADOOP_JAR = "../spark/assembly/target/scala-2.10/spark-assembly-1.5.0-hadoop2.6.0.jar")? Sorry, I seem to have skipped that in the instructions. Will fix.
> my_db = src_SparkSQL(host = 'jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl',
+                      port = 10000)
Error in .jfindClass(as.character(driverClass)[1]) : class not found

Same problem. Also note that the host and port arguments do not accept URL syntax, only a host name and a port number; the URL is formed internally.
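So the call should look more like this (host and port taken from your session):

# plain host name and port; src_SparkSQL builds the jdbc:hive2:// url internally
my_db = src_SparkSQL(host = 'tools-1.hadoop.srv', port = 10000)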
> my_db = src_SparkSQL(start.server = TRUE)
Error in start.server() :
  Couldn't start thrift server: org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

Please stop the thrift server first, or connect to the one that is already running; either way, the "class not found" problem still needs to be fixed first.
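For example, the Spark distribution ships a stop script next to start-thriftserver.sh, so from R this should be something like:

# stop the thrift server running as process 37580 (script path taken from your SPARK_HOME)
system("/opt/spark-1.5.0-bin-hadoop2.4/sbin/stop-thriftserver.sh")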
> my_db = src_SparkSQL(start.server = TRUE,
+                      list(spark.num.executors='5', spark.executor.cores='5', master="yarn-client"))
Error in start.server() :
  Couldn't start thrift server: org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 running as process 37580. Stop it first.
In addition: Warning message:
running command 'cd /opt/tech/prj_bdc/pmozie_status/user_topics;/opt/spark-1.5.0-bin-hadoop2.4/sbin/start-thriftserver.sh ' had status 1

Same as above.
The plan:

1. Set HADOOP_JAR. Find the host and port where the thrift server is running (if they are not the defaults). Try src_SparkSQL with start.server = FALSE. If that works, you are done; otherwise go to step 2.
2. Stop the existing thrift server, then retry src_SparkSQL with start.server = TRUE.

A sketch of both steps follows below.
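A minimal sketch of these two steps, assuming the paths and host from your session (and that start.server = FALSE is accepted the same way as start.server = TRUE):

# step 1: set HADOOP_JAR, then connect to the thrift server that is already running
Sys.setenv(HADOOP_JAR = "/opt/spark-1.5.0-bin-hadoop2.4/lib/spark-assembly-1.5.0-hadoop2.4.0.jar")
my_db = src_SparkSQL(host = "tools-1.hadoop.srv", port = 10000, start.server = FALSE)

# step 2, only if step 1 fails: stop the old server (see the system() call above), then
# my_db = src_SparkSQL(start.server = TRUE)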
Let me know how it goes.

Answer 2 (ukqbszuj):

The problem was that I did not specify the proper classPath needed inside the JDBC() function that creates the driver. The classPath argument is passed to the dplyr.spark.hive package via the HADOOP_JAR global variable.
Using JDBC as a driver to hiveServer2 (over the Thrift protocol) requires adding at least these 3 .jars with Java classes to create a proper driver:
hive-jdbc-1.0.0-standalone.jar
hadoop/common/lib/commons-configuration-1.6.jar
hadoop/common/hadoop-common-2.4.1.jar
The versions are arbitrary and should be compatible with the locally installed versions of hive, hadoop and hiveServer2.
The paths need to be concatenated with .Platform$path.sep (as described in this post):

classPath = c("system_path1_to_hive/hive/lib/hive-jdbc-1.0.0-standalone.jar",
              "system_path1_to_hadoop/hadoop/common/lib/commons-configuration-1.6.jar",
              "system_path1_to_hadoop/hadoop/common/hadoop-common-2.4.1.jar")
Sys.setenv(HADOOP_JAR = paste0(classPath, collapse = .Platform$path.sep))

Then, when HADOOP_JAR is set, one has to be careful with the hiveServer2 url. In my case it had to be:

host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")

And finally, a proper connection to hiveServer2 using the RJDBC package looks like this:

Sys.setenv(HADOOP_HOME="/usr/share/hadoop/share/hadoop/common/")
Sys.setenv(HIVE_HOME = '/opt/hive/lib/')
host = 'tools-1.hadoop.srv'
port = 10000
url = paste0("jdbc:hive2://", host, ":", port, "/loghost;auth=noSasl")
driverclass = "org.apache.hive.jdbc.HiveDriver"
library(RJDBC)
.jinit()
dr2 = JDBC(driverclass,
           classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
                         #"/opt/hive/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
                         "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
           identifier.quote = "`")

dbConnect(dr2, url, username = "mkosinski") -> cont
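As a quick smoke test, the connection can then be queried through the standard DBI interface that RJDBC implements (the query text is just an example):

# list the tables visible through the new connection
dbGetQuery(cont, "show tables")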
