Create an external table in Azure Databricks

Posted by jhkqcmku on 2021-06-24 in Hive


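I am new to Azure Databricks and am trying to create an external table that points to an Azure Data Lake Storage (ADLS) Gen2 location.
In a Databricks notebook I tried setting the Spark configuration for ADLS access, but I am still unable to execute the DDL I created.
Note: one solution that works for me is to mount the ADLS account to the cluster and then use the mount location in the external table's DDL. However, I need to check whether it is possible to create an external table DDL with the ADLS path directly, without a mount location.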


# Using Principal credentials

spark.conf.set("dfs.azure.account.auth.type", "OAuth")
spark.conf.set("dfs.azure.account.oauth.provider.type", "ClientCredential")
spark.conf.set("dfs.azure.account.oauth2.client.id", "client_id")
spark.conf.set("dfs.azure.account.oauth2.client.secret", "client_secret")
spark.conf.set("dfs.azure.account.oauth2.client.endpoint", 
"https://login.microsoftonline.com/tenant_id/oauth2/token")

DDL:

create external table test(
id string,
name string
)
partitioned by (pt_batch_id bigint, pt_file_id integer)
STORED as parquet
location 'abfss://container@account_name.dfs.core.windows.net/dev/data/employee'

Error received:

Error in SQL statement: AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Got exception: shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.contracts.exceptions.ConfigurationPropertyNotFoundException Configuration property account_name.dfs.core.windows.net not found.);

I need help knowing whether it is possible to refer to the ADLS location directly in the DDL.
Thanks.

Hive databricks Azure azure-databricks external-tables

Source: https://stackoverflow.com/questions/56792095/create-external-table-in-azure-databricks

2 Answers


rt4zxlrg1#

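You can perform this operation once Azure Data Lake Storage has been configured.
If you want all users in the Databricks workspace to have access to the mounted Azure Data Lake Storage Gen2 account, you should create a mount point using the method described below. The service client that you use to access the Azure Data Lake Storage Gen2 account should be granted access only to that Azure Data Lake Storage Gen2 account; it should not be granted access to other resources in Azure.
Once a mount point is created through a cluster, users of that cluster can immediately access the mount point. To use the mount point in another running cluster, users must run dbutils.fs.refreshMounts() on that running cluster to make the newly created mount point available for use.
There are three primary ways of accessing Azure Data Lake Storage Gen2 from a Databricks cluster:
1. Mounting an Azure Data Lake Storage Gen2 filesystem to DBFS using a service principal with delegated permissions and OAuth 2.0.
2. Using a service principal directly.
3. Using the Azure Data Lake Storage Gen2 storage account access key directly.
For more details, refer to "Azure Data Lake Storage Gen2".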
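As a rough illustration of the first method (mounting with a service principal and OAuth 2.0), the sketch below builds the configuration dictionary and the `abfss://` source URI that would be passed to `dbutils.fs.mount`. The helper names `build_oauth_mount_configs` and `adls_source`, the mount point `/mnt/lake`, and the secret scope and key names are hypothetical; the configuration keys are the ones used elsewhere in this thread. The actual mount call only works inside a Databricks notebook, so it is shown as a comment.

```python
def build_oauth_mount_configs(client_id, client_secret, tenant_id):
    """Build the extra_configs dict for an OAuth (service principal) mount."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def adls_source(container, account):
    """Build the abfss:// URI for the root of a container in a storage account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/"


# In a Databricks notebook (assumed scope/key names and mount point):
# configs = build_oauth_mount_configs(
#     dbutils.secrets.get(scope="my-scope", key="sp-client-id"),
#     dbutils.secrets.get(scope="my-scope", key="sp-client-secret"),
#     "my-tenant-id",
# )
# dbutils.fs.mount(
#     source=adls_source("container", "account_name"),
#     mount_point="/mnt/lake",
#     extra_configs=configs,
# )
# On another cluster that is already running:
# dbutils.fs.refreshMounts()
```

After mounting, the external table's location would reference the mount path (for example `/mnt/lake/dev/data/employee`) rather than the `abfss://` URI, which is the workaround the question already describes.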
Hope this helps.

2021-06-24

j1dl9f462#

Sort of, if you can use Python (or Scala).
Start by making the connection:

TenantID = "blah"

def connectLake():
  spark.conf.set("fs.azure.account.auth.type", "OAuth")
  spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set("fs.azure.account.oauth2.client.id", dbutils.secrets.get(scope = "LIQUIX", key = "lake-sp"))
  spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope = "LIQUIX", key = "lake-key"))
  spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/"+TenantID+"/oauth2/token")

connectLake()
lakePath = "abfss://liquix@mystorageaccount.dfs.core.windows.net/"

Using Python, you can then register the table with:

spark.sql("CREATE TABLE DimDate USING PARQUET LOCATION '"+lakePath+"PRESENTED/DIMDATE/V1'")
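The same pattern can be adapted to the partitioned table from the question. The sketch below is only an assumption about how that might look: it builds a `CREATE TABLE ... USING PARQUET ... LOCATION` statement (the data source form used in this answer, rather than the Hive-style `STORED AS` form in the question), with the partition columns declared in the column list. The helper `datasource_table_ddl` is hypothetical, and the thread does not confirm that this form avoids the original error.

```python
def datasource_table_ddl(table, columns, partition_cols, location):
    """Build a Spark SQL CREATE TABLE statement for an unmanaged Parquet table.

    columns: list of (name, type) pairs, including the partition columns.
    partition_cols: names of the columns to partition by.
    """
    col_defs = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(partition_cols)
    return (
        f"CREATE TABLE {table} (\n  {col_defs}\n)\n"
        "USING PARQUET\n"
        f"PARTITIONED BY ({parts})\n"
        f"LOCATION '{location}'"
    )


ddl = datasource_table_ddl(
    "test",
    [("id", "string"), ("name", "string"),
     ("pt_batch_id", "bigint"), ("pt_file_id", "integer")],
    ["pt_batch_id", "pt_file_id"],
    "abfss://container@account_name.dfs.core.windows.net/dev/data/employee",
)
print(ddl)

# In a notebook, after the connection settings have been applied:
# spark.sql(ddl)
```

Because a `LOCATION` is given, Spark treats the table as unmanaged (external): dropping it removes the metadata but leaves the files in the lake.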

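You can now query that table if you have executed the connectLake() function, which is fine in your current session/notebook.
The problem is that if a new session comes in and tries to SELECT * from that table, it will fail unless that session runs the connectLake() function first. There is no way around that limitation, as you have to prove credentials to access the lake.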
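Since every new session has to supply the credentials itself, it may help to keep them in one reusable function. The sketch below is a variant that scopes each key to a single storage account by appending `<account>.dfs.core.windows.net` to the key name, which is the property name the error message in the question refers to. The helpers `account_scoped_oauth_conf` and `apply_conf` are hypothetical names, and the secret scope and key names in the comment are placeholders.

```python
def account_scoped_oauth_conf(account, client_id, client_secret, tenant_id):
    """Return OAuth settings whose keys are scoped to one ADLS Gen2 account."""
    host = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{host}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{host}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{host}": client_id,
        f"fs.azure.account.oauth2.client.secret.{host}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{host}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def apply_conf(conf, settings):
    """Apply each setting to any object exposing set(key, value), e.g. spark.conf."""
    for key, value in settings.items():
        conf.set(key, value)


# In a notebook (assumed secret scope and key names):
# apply_conf(spark.conf, account_scoped_oauth_conf(
#     "mystorageaccount",
#     dbutils.secrets.get(scope="my-scope", key="lake-sp"),
#     dbutils.secrets.get(scope="my-scope", key="lake-key"),
#     TenantID,
# ))
```

Scoping the keys per account keeps credentials for different storage accounts from overwriting each other within the same session.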
You may want to consider ADLS Gen2 credential passthrough: https://docs.azuredatabricks.net/spark/latest/data-sources/azure/adls-passthrough.html
Note that this requires using a High Concurrency cluster.

2021-06-24
