hadoop: Copying HDFS files from multiple source folders to ADLS Gen2 storage

Posted by ma8fv8wu on 2023-06-05 in Hadoop

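I have a requirement to copy files (in zip format) from HDFS (the Hadoop file system) to ADLS Gen2 Blob storage using an ADF pipeline. The files are laid out in HDFS as follows: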

HDFS Source:
    hdfs/data/users/synova/raw/partition1/customer/full/2023-04-01/customer.zip
    hdfs/data/users/synova/raw/partition1/customer/full/2023-04-02/customer.zip
    hdfs/data/users/synova/raw/partition1/customer/full/2023-04-03/customer.zip
    hdfs/data/users/synova/raw/partition2/parts/full/2023-04-01/parts.zip
    hdfs/data/users/synova/raw/partition2/parts/full/2023-04-02/parts.zip
    hdfs/data/users/synova/raw/partition2/parts/full/2023-04-03/parts.zip
    hdfs/data/users/synova/raw/partition3/modules/full/2023-04-01/modules.zip
    hdfs/data/users/synova/raw/partition3/modules/full/2023-04-02/modules.zip
    hdfs/data/users/synova/raw/partition3/modules/full/2023-04-03/modules.zip
    hdfs/data/users/synova/raw/partition4/events/full/2023-04-01/events.zip
    hdfs/data/users/synova/raw/partition4/events/full/2023-04-02/events.zip
    hdfs/data/users/synova/raw/partition4/events/full/2023-04-03/events.zip

ADLS Target should be:
    adls/consolidated/synova/raw/partition1/customer/2023-04-01/customer.zip
                                                /2023-04-02/customer.zip
                                                /2023-04-03/customer.zip
    adls/consolidated/synova/raw/partition2/parts/2023-04-01/parts.zip
                                             /2023-04-02/parts.zip
                                             /2023-04-03/parts.zip
    adls/consolidated/synova/raw/partition3/modules/2023-04-01/modules.zip
                                               /2023-04-02/modules.zip
                                               /2023-04-03/modules.zip
    adls/consolidated/synova/raw/partition4/events/2023-04-01/events.zip
                                              /2023-04-02/events.zip
                                              /2023-04-03/events.zip

I need to create a generic ADF pipeline to copy these files.
Thanks, Rakesh

ajsxfq5m  #1

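Your requirement can be met with a Get Metadata activity (to list the files) and a Copy activity. However, that approach needs many child pipelines for the subfolders, because Get Metadata does not return the paths of nested subfolders.
To get the list of paths, you can try running code in Databricks (a Notebook activity) or in a Function, and return that list to ADF.
If you want to do everything inside ADF, follow the blog post by @Richard Swinbank on getting a file list recursively.
Once you have the list, pass it to a ForEach activity. Inside the ForEach, use @item() as the Copy activity's source file path; for the sink file path, replace '/full' in @item() with an empty string.
Here is the sample file list that I passed to the ForEach: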
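If you go the notebook or function route, the recursive listing could look roughly like the Python sketch below. It is only an illustration under stated assumptions: `list_dir` is a hypothetical callable that returns `(name, is_directory)` pairs for one folder, and `fake_list_dir` is an in-memory stand-in for whatever HDFS listing call your environment actually provides.

```python
from typing import Callable, List, Tuple

# (name, is_directory) pairs, as a directory listing call might return them.
Entry = Tuple[str, bool]


def list_files_recursive(folder: str, list_dir: Callable[[str], List[Entry]]) -> List[str]:
    """Return the paths of all files under `folder`, walking subfolders depth first."""
    files: List[str] = []
    for name, is_dir in sorted(list_dir(folder)):
        path = f"{folder}/{name}"
        if is_dir:
            # Descend into the subfolder and collect its files.
            files.extend(list_files_recursive(path, list_dir))
        else:
            files.append(path)
    return files


# Hypothetical in-memory tree standing in for a real HDFS listing.
FAKE_TREE = {
    "synova/raw": [("partition1", True)],
    "synova/raw/partition1": [("customer", True)],
    "synova/raw/partition1/customer": [("full", True)],
    "synova/raw/partition1/customer/full": [("2023-04-01", True), ("2023-04-02", True)],
    "synova/raw/partition1/customer/full/2023-04-01": [("customer.zip", False)],
    "synova/raw/partition1/customer/full/2023-04-02": [("customer.zip", False)],
}


def fake_list_dir(folder: str) -> List[Entry]:
    """Look up the children of `folder` in the fake tree (empty if unknown)."""
    return FAKE_TREE.get(folder, [])


if __name__ == "__main__":
    for file_path in list_files_recursive("synova/raw", fake_list_dir):
        print(file_path)
```

The resulting list of relative paths is what the pipeline would then iterate over.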

["synova/raw/partition1/customer/full/2023-04-01/customer.zip",
"synova/raw/partition1/customer/full/2023-04-02/customer.zip",
"synova/raw/partition1/customer/full/2023-04-03/customer.zip",
"synova/raw/partition2/parts/full/2023-04-01/parts.zip",
"synova/raw/partition2/parts/full/2023-04-02/parts.zip",
"synova/raw/partition2/parts/full/2023-04-03/parts.zip",
"synova/raw/partition3/modules/full/2023-04-01/modules.zip",
"synova/raw/partition3/modules/full/2023-04-02/modules.zip",
"synova/raw/partition3/modules/full/2023-04-03/modules.zip",
"synova/raw/partition4/events/full/2023-04-01/events.zip",
"synova/raw/partition4/events/full/2023-04-02/events.zip",
"synova/raw/partition4/events/full/2023-04-03/events.zip"]

My source dataset:

Sink dataset:

In the ForEach, give @item() as the source file path, and store the replaced value in a variable.

@replace(item(),'/full','')

Give this variable as the sink file path.
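To check the path mapping before running the pipeline, the same substitution can be simulated outside ADF. The Python sketch below mirrors the replace expression with `str.replace`; the function name `to_sink_path` is made up for illustration, and the sketch assumes the ADF replace function substitutes every occurrence of the substring, as `str.replace` does.

```python
def to_sink_path(source_path: str) -> str:
    """Drop every '/full' occurrence, mirroring the replace expression used for the sink."""
    return source_path.replace("/full", "")


# A few of the sample paths from the list handed to the ForEach.
source_paths = [
    "synova/raw/partition1/customer/full/2023-04-01/customer.zip",
    "synova/raw/partition2/parts/full/2023-04-02/parts.zip",
    "synova/raw/partition4/events/full/2023-04-03/events.zip",
]

# Source path -> sink path, one entry per ForEach iteration.
mapping = {src: to_sink_path(src) for src in source_paths}

if __name__ == "__main__":
    for src, sink in mapping.items():
        print(f"{src} -> {sink}")
```

Note that a plain substring replace would also alter any folder name that merely begins with "full" (for example a hypothetical "/fullload" folder), so if such names can occur, replacing '/full/' with '/' may be the safer pattern.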

Result:
