How do I load an Impala table directly into Spark using JDBC?

xv8emn3q · asked 2021-06-26 · Impala
Follow (0) | Answers (3) | Views (810)

I'm trying to write a Spark job in Python that opens a JDBC connection to Impala and loads a view directly from Impala into a DataFrame. This question is very close, but in Scala: Calling JDBC to impala/hive from within a spark job and creating a table.
How do I do this? There are plenty of examples for other data sources such as MySQL and PostgreSQL, but I haven't seen one for Impala + Python + Kerberos. A worked example would be very helpful. Thank you!
I tried this with information from the web, but it didn't work.

Spark notebook


#!/bin/bash

export PYSPARK_PYTHON=/home/anave/anaconda2/bin/python
export HADOOP_CONF_DIR=/etc/hive/conf
export PYSPARK_DRIVER_PYTHON=/home/anave/anaconda2/bin/ipython
export PYSPARK_DRIVER_PYTHON_OPTS='notebook --ip=* --no-browser'

# use Java 8
export JAVA_HOME=/usr/java/latest
export PATH=$JAVA_HOME/bin:$PATH

# JDBC drivers for Impala
export CLASSPATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30/*.jar:$CLASSPATH
export JDBC_PATH=/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30

# --jars $SRCDIR/spark-csv-assembly-1.4.0-SNAPSHOT.jar \
# --conf spark.sql.parquet.binaryAsString=true \
# --conf spark.sql.hive.convertMetastoreParquet=false

# --num-executors 10
pyspark --master yarn-client \
        --driver-memory 4G \
        --executor-memory 2G \
        --jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar \
        --driver-class-path $JDBC_PATH/*.jar

Python code

properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",
    # "KrbRealm": "EXAMPLE.COM",
    # "KrbHostFQDN": "impala.example.com",
    "KrbServiceName": "impala"
}

# imp_env is the hostname of the db; it works with other Impala queries run inside Python
url = "jdbc:impala:imp_env;auth=noSasl"

db_df = sqlContext.read.jdbc(url=url, table='summary', properties=properties)

I get this error message (full error log):
Py4JJavaError: An error occurred while calling o42.jdbc.
: java.lang.ClassNotFoundException: com.cloudera.impala.jdbc41.Driver

jaxagkaj1#

The first approach is to run spark-submit on the impala_jdbc_connection.py script below, e.g.

spark-submit --driver-class-path /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar --jars /opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/jars/ImpalaJDBC41.jar impala_jdbc_connection.py

impala_jdbc_connection.py:

properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver"
}

# initialize the Spark session

spark = (
        SparkSession.builder
        .config("spark.jars.packages", "jar-packages-list")
        .config("spark.sql.warehouse.dir","hdfs://dwh-hdp-node01.dev.ergo.liferunoffinsuranceplatform.com:8020/user/hive/warehouse")
        .enableHiveSupport()
        .getOrCreate() 
        )

db_df = spark.read.jdbc(url='jdbc:impala://host_ip_address:21050/database_name', table='table_name', properties=properties)

db_df.show()
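Since the question specifically asks about Impala + Python + Kerberos, here is a minimal sketch of the same read with Kerberos authentication (AuthMech=1). The realm, FQDN, port, and database are placeholders modeled on the commented-out properties in the question, not values confirmed by this answer; a valid Kerberos ticket (kinit) and the driver jar on both --jars and --driver-class-path are assumed.

from pyspark.sql import SparkSession

# Minimal sketch: Kerberos-authenticated JDBC read with the Cloudera driver.
# Assumes a valid Kerberos ticket (kinit) and ImpalaJDBC41.jar on the classpath.
# Hostname, realm, and database below are placeholders.
spark = SparkSession.builder.getOrCreate()

kerberos_properties = {
    "driver": "com.cloudera.impala.jdbc41.Driver",
    "AuthMech": "1",                      # 1 = Kerberos
    "KrbRealm": "EXAMPLE.COM",            # placeholder realm
    "KrbHostFQDN": "impala.example.com",  # placeholder Impala daemon FQDN
    "KrbServiceName": "impala",
}

db_df = spark.read.jdbc(
    url="jdbc:impala://impala.example.com:21050/default",
    table="summary",
    properties=kerberos_properties,
)
db_df.show()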

The second approach is not to load from Impala into Spark directly, but to query Impala with impyla (pip install impyla; source: https://github.com/cloudera/impyla) and then convert the results into a Spark DataFrame.
Connect to Impala, fetch the results from the Impala database, and convert the result set into a Spark DataFrame:

from impala.dbapi import connect

conn = connect(host = 'IP_ADDRESS_OF_HOST', port=21050)

cursor = conn.cursor()

cursor.execute('select * from database.table')

res = cursor.fetchall()  # rows to convert to a Spark DataFrame (see the sketch below)

for data in res:
        print(data)
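The loop above only prints the rows. As a minimal sketch of the "convert res to Spark DataFrame" step, continuing from the snippet above (reusing cursor and res) and assuming an active SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column names come from cursor.description (the name is the first field of each entry)
columns = [desc[0] for desc in cursor.description]

# res is a list of row tuples, so it can be passed straight to createDataFrame
res_df = spark.createDataFrame(res, schema=columns)
res_df.show()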

o4tp2gmn2#

This works for me:

spark-shell --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar 

val jdbcURL = s"jdbc:impala://192.168.56.101:21050;AuthMech=0"

val connectionProperties = new java.util.Properties()

val hbaseDF = sqlContext.read.jdbc(jdbcURL, "impala_table", connectionProperties)
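Since the question asks for Python, a direct transliteration of the spark-shell snippet above into PySpark might look like this (same placeholder IP, port, and table name), assuming pyspark was started with the same --driver-class-path and --jars flags:

# Started with: pyspark --driver-class-path ImpalaJDBC41.jar --jars ImpalaJDBC41.jar
jdbc_url = "jdbc:impala://192.168.56.101:21050;AuthMech=0"

hbase_df = sqlContext.read.jdbc(url=jdbc_url, table="impala_table", properties={})
hbase_df.show()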

evrscar23#

You can use

--jars $(echo /dir/of/jars/*.jar | tr ' ' ',')

instead of

--jars /home/anave/spark-csv_2.11-1.4.0.jar $JDBC_PATH/*.jar

Or see my other answer for an alternative approach.
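The same glob-to-comma-list idea can also be applied from Python when building the session yourself. A sketch, using the jar directory from the question as a placeholder (spark.jars expects a comma-separated list, which is why the space-separated shell glob fails):

import glob

from pyspark.sql import SparkSession

# Placeholder: the JDBC driver directory from the question
jar_dir = "/home/anave/impala_jdbc_2.5.30.1049/Cloudera_ImpalaJDBC41_2.5.30"
jars = ",".join(glob.glob(jar_dir + "/*.jar"))  # comma-separated, not space-separated

spark = (
    SparkSession.builder
    .config("spark.jars", jars)
    .getOrCreate()
)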
