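I need to read from and write to tables stored on a remote Hive server from pyspark. All I know about this remote Hive is that it runs under Docker. In Hadoop Hue I found a table, iris, that I am trying to select some data from:
I have a table metastore URL: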

http://xxx.yyy.net:8888/metastore/table/mytest/iris

and a table location URL:

hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris

I do not know why the latter URL contains quickstart.cloudera:8020. Maybe this is because Hive runs under Docker?
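For what it is worth, the part of the location URL before the path is the address of the HDFS NameNode, and quickstart.cloudera is most likely the hostname the Cloudera QuickStart container was started with. The following is a minimal sketch, using only the standard library, of how the location can be split into the NameNode address and a warehouse directory. It assumes the usual managed-table layout `<warehouse>/<database>.db/<table>`; the function name `split_table_location` is made up for illustration.

```python
import posixpath
from urllib.parse import urlparse

# Table location as reported by Hue for mytest.iris.
table_location = "hdfs://quickstart.cloudera:8020/user/hive/warehouse/mytest.db/iris"


def split_table_location(location):
    """Split an HDFS table location into NameNode address and warehouse dir.

    Assumption: the table is stored as <warehouse>/<database>.db/<table>.
    """
    parsed = urlparse(location)
    database_dir = posixpath.dirname(parsed.path)     # .../warehouse/mytest.db
    warehouse_path = posixpath.dirname(database_dir)  # .../warehouse
    return {
        "namenode_host": parsed.hostname,
        "namenode_port": parsed.port,
        "warehouse_dir": f"{parsed.scheme}://{parsed.netloc}{warehouse_path}",
    }


info = split_table_location(table_location)
print(info)
```

If the assumption about the layout holds, `warehouse_dir` is a candidate value for spark.sql.warehouse.dir. The client machine would still need to resolve the name quickstart.cloudera and reach port 8020.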
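Discussing access to Hive tables, the Pyspark tutorial writes:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions. Users who do not have an existing Hive deployment can still enable Hive support. When not configured by the hive-site.xml, the context automatically creates metastore_db in the current directory and creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started. Note that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0. Instead, use spark.sql.warehouse.dir to specify the default location of database in warehouse. You may need to grant write privilege to the user who starts the Spark application.
In my case, the hive-site.xml that I managed to get has neither the hive.metastore.warehouse.dir nor the spark.sql.warehouse.dir property.
The Spark tutorial suggests using the following code to access remote Hive tables: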

from os.path import expanduser, join, abspath

from pyspark.sql import SparkSession
from pyspark.sql import Row

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()

In my case, after running code similar to the above, but with the correct value for warehouse_location, I think I can then do:

spark.sql("use mytest")
spark.sql("SELECT * FROM iris").show()

So where can I find the remote Hive warehouse location? How can I make Pyspark work with remote Hive tables?
Update: hive-site.xml has the following properties:

...
...
...
 <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://127.0.0.1/metastore?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
...
...
...
  <property>
    <name>hive.metastore.uris</name>
    <value>thrift://127.0.0.1:9083</value>
    <description>IP address (or fully-qualified domain name) and port of the metastore host</description>
  </property>

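So it looks like 127.0.0.1 is the Docker localhost of the container that runs the Cloudera Docker app. That does not help at all in reaching the Hive warehouse.
How can I access the Hive warehouse when Cloudera Hive runs as a Docker app?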
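Since 127.0.0.1 in hive-site.xml is only meaningful inside the container, a first step could be to check whether the relevant ports are reachable from the machine that runs pyspark. Below is a small sketch with a hypothetical helper `can_connect`. Whether the container actually publishes ports 9083 (metastore Thrift) and 8020 (HDFS NameNode) on xxx.yyy.net is an assumption that this check is meant to test.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Example usage (host name and published ports are assumptions):
# can_connect("xxx.yyy.net", 9083)          # Hive metastore Thrift port
# can_connect("quickstart.cloudera", 8020)  # HDFS NameNode port
```

A `False` result only says that the TCP connection could not be opened. It does not tell whether the port is unpublished, firewalled, or the host name does not resolve.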

1 Answer


tp5buhyn1#

Here, https://www.cloudera.com/documentation/enterprise/5-6-x/topics/cdh_ig_hive_metastore_configure.html, under "Remote Mode", you will find that the Hive metastore runs in its own JVM process, and other processes such as HiveServer2, HCatalog and Cloudera Impala communicate with it through the Thrift API, using the property hive.metastore.uris in hive-site.xml:

<property>
  <name>hive.metastore.uris</name>
  <value>thrift://xxx.yyy.net:8888</value>
</property>

(not sure about the way you have to specify the address)
Maybe this property as well:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://xxx.yyy.net/hive</value>
</property>
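The same setting can also be passed when building the session instead of editing hive-site.xml. The sketch below is untested against this particular setup and rests on several assumptions: that the metastore Thrift port 9083 (the value in the question's hive-site.xml) is published by the container on xxx.yyy.net, that the warehouse directory is the one implied by the table location, and that the client can resolve quickstart.cloudera. Port 8888 in the Hue URL is the Hue web interface, so it is probably not the right port for Thrift. The helper names `hive_session_options` and `build_session` are made up for illustration.

```python
def hive_session_options(
    metastore_host,
    metastore_port=9083,
    warehouse_dir="hdfs://quickstart.cloudera:8020/user/hive/warehouse",
):
    """Return Spark options that point a session at a remote Hive metastore.

    Host, port and warehouse_dir are assumptions taken from the question.
    """
    return {
        "hive.metastore.uris": f"thrift://{metastore_host}:{metastore_port}",
        "spark.sql.warehouse.dir": warehouse_dir,
    }


def build_session(options, app_name="remote-hive-example"):
    """Create a Hive-enabled SparkSession with the given options."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.enableHiveSupport().getOrCreate()


# Example usage (requires pyspark and a reachable metastore):
# spark = build_session(hive_session_options("xxx.yyy.net"))
# spark.sql("use mytest")
# spark.sql("SELECT * FROM iris").show()
```

The metastore only returns table metadata. Reading the rows still goes to HDFS directly, which is why the NameNode address in the table location has to be reachable from the client as well.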

2021-05-29
