.net

j8ag8udp posted on 2021-05-17 in Spark

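I'm new to Apache Spark. I'm trying to read data from ADL using Microsoft's Apache Spark NuGet library, but I can't figure out how to authenticate with Spark. There doesn't seem to be any documentation on this. Is it even possible? I'm writing a .NET Framework console application.
Any help or suggestions would be much appreciated!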

ltskdhd1


If you want to use Azure Data Lake Storage from Spark, follow the steps below. Note that I tested with Spark 3.0.1 and Hadoop 3.2.
Create a service principal

az login
az ad sp create-for-rbac --name "myApp" --role contributor --scopes /subscriptions/<subscription-id>/resourceGroups/<group-name> --sdk-auth
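
With --sdk-auth, the command prints a JSON object whose clientId, clientSecret and tenantId fields correspond to the <sp appid>, <sp password> and <tenant> placeholders in the code further down. If you save that output to a file, a minimal sketch for reading the three values could look like the following. The class name SdkAuthOutput is hypothetical, and System.Text.Json is an assumption on my part: it ships with .NET Core 3.0+ but needs the NuGet package on .NET Framework.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper (not part of any Azure or Spark library).
public static class SdkAuthOutput
{
    // Extracts the three values needed for ADLS Gen1 OAuth2 from the
    // JSON printed by "az ad sp create-for-rbac ... --sdk-auth".
    public static (string ClientId, string ClientSecret, string TenantId) Parse(string json)
    {
        using (JsonDocument doc = JsonDocument.Parse(json))
        {
            JsonElement root = doc.RootElement;
            // GetProperty throws KeyNotFoundException if a field is missing.
            return (
                root.GetProperty("clientId").GetString(),
                root.GetProperty("clientSecret").GetString(),
                root.GetProperty("tenantId").GetString());
        }
    }
}
```

For example, var creds = SdkAuthOutput.Parse(File.ReadAllText("sp.json")); where sp.json is whatever file you saved the output to. Treat that file as a secret, since it contains the service principal's password.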

Grant the service principal access to the data lake

Connect-AzAccount

# Get the service principal's object ID from its client (application) ID

$sp=Get-AzADServicePrincipal -ApplicationId  42e0d080-b1f3-40cf-8db6-c4c522d988c4

$fullAcl="user:$($sp.Id):rwx,default:user:$($sp.Id):rwx"
# Split on commas only; Split("{,}") is treated as a literal string separator in PowerShell 7
$newFullAcl = $fullAcl.Split(',')
Set-AdlStoreItemAclEntry -Account <> -Path / -Acl $newFullAcl -Recurse -Debug

Code

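            // Requires: using System.Collections.Generic; using Microsoft.Spark.Sql; using Microsoft.Spark.Sql.Types;
            // Replace <account name> with your Data Lake Storage Gen1 account name
            string filePath =
                "adl://<account name>.azuredatalakestore.net/parquet/people.parquet";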

            // Create SparkSession
            SparkSession spark = SparkSession
                .Builder()
                .AppName("Azure Data Lake Storage example using .NET for Apache Spark")
                .Config("fs.adl.impl", "org.apache.hadoop.fs.adl.AdlFileSystem")
                .Config("fs.adl.oauth2.access.token.provider.type", "ClientCredential")
                .Config("fs.adl.oauth2.client.id", "<sp appid>")
                .Config("fs.adl.oauth2.credential", "<sp password>")
                .Config("fs.adl.oauth2.refresh.url", "https://login.microsoftonline.com/<tenant>/oauth2/token")
                .GetOrCreate();

            // Create sample data
            var data = new List<GenericRow>
            {
                new GenericRow(new object[] { 1, "John Doe"}),
                new GenericRow(new object[] { 2, "Jane Doe"}),
                new GenericRow(new object[] { 3, "Foo Bar"})
            };

            // Create schema for sample data
            var schema = new StructType(new List<StructField>()
            {
                new StructField("Id", new IntegerType()),
                new StructField("Name", new StringType()),
            });

            // Create DataFrame using data and schema
            DataFrame df = spark.CreateDataFrame(data, schema);

            // Print DataFrame
            df.Show();

            // Write DataFrame to Azure Data Lake Gen1
            df.Write().Mode(SaveMode.Overwrite).Parquet(filePath);

            // Read saved DataFrame from Azure Data Lake Gen1
            DataFrame readDf = spark.Read().Parquet(filePath);

            // Print DataFrame
            readDf.Show();

            // Stop Spark session
            spark.Stop();
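
The snippet above hardcodes the service principal's ID and password. As a rough sketch of one alternative, you could keep them in environment variables and build the same fs.adl.* settings from there. The AdlAuth class and the ADL_CLIENT_ID, ADL_CLIENT_SECRET and ADL_TENANT_ID variable names are made up for illustration; only the configuration keys come from the code above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of Microsoft.Spark).
public static class AdlAuth
{
    // Builds the same Hadoop ADLS Gen1 OAuth2 settings as the snippet above.
    public static Dictionary<string, string> BuildConfig(
        string clientId, string clientSecret, string tenantId)
    {
        return new Dictionary<string, string>
        {
            ["fs.adl.impl"] = "org.apache.hadoop.fs.adl.AdlFileSystem",
            ["fs.adl.oauth2.access.token.provider.type"] = "ClientCredential",
            ["fs.adl.oauth2.client.id"] = clientId,
            ["fs.adl.oauth2.credential"] = clientSecret,
            ["fs.adl.oauth2.refresh.url"] =
                $"https://login.microsoftonline.com/{tenantId}/oauth2/token",
        };
    }

    // Reads a required environment variable or fails with a clear message.
    public static string RequireEnv(string name)
    {
        string value = Environment.GetEnvironmentVariable(name);
        if (string.IsNullOrEmpty(value))
            throw new InvalidOperationException(
                $"Environment variable '{name}' is not set.");
        return value;
    }

    // Usage with the SparkSession builder from the snippet above
    // (variable names are assumptions):
    //
    //   var builder = SparkSession.Builder().AppName("...");
    //   var config = AdlAuth.BuildConfig(
    //       AdlAuth.RequireEnv("ADL_CLIENT_ID"),
    //       AdlAuth.RequireEnv("ADL_CLIENT_SECRET"),
    //       AdlAuth.RequireEnv("ADL_TENANT_ID"));
    //   foreach (var kv in config)
    //       builder = builder.Config(kv.Key, kv.Value);
    //   SparkSession spark = builder.GetOrCreate();
}
```

This keeps the secret out of source control. Failing fast on a missing variable is a design choice here, so a misconfigured machine produces a clear error instead of an authentication failure later on.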

Run

spark-submit ^
--packages org.apache.hadoop:hadoop-azure-datalake:3.2.0 ^
--class org.apache.spark.deploy.dotnet.DotnetRunner ^
--master local ^
microsoft-spark-3-0_2.12-<version>.jar ^
dotnet <application name>.dll


For more details, see
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory
https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html
https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-access-control
