How to incrementally copy files from an FTP server to Hadoop HDFS

fcg9iug3 posted on 2021-05-29 in Hadoop

We have an FTP server to which many files are uploaded every day, and I need to copy all of them into HDFS.
Each run should download only the incremental files: for example, if the first run downloads 10 files and 5 new files are then uploaded to the FTP server, the next iteration of the job should download only those 5 new files into HDFS.
We are not using NiFi or Kafka Connect.
Is there a good way to accomplish this?


ergxz8rk1#

You can do this with a touch file in an lftp job. My explanation and code are below; see the comments for each step.


#!/bin/bash

# SomeConfigs

TOUCHFILE='/somepath/inYourLocal/someFilename.touch'
RemoteSFTPserverPath='/Remote/Server/path/toTheFiles'
LocalPath='/Local/Path/toReceiveTheFiles'
FTP_Server_UserName='someUser'
FTP_Server_Password='SomePassword'
ServerIP='//127.12.11.35'

# Transfer files from the FTP server. This is the main command: mirror only files
# newer than the touch file (and older than "now - 2 minutes", so files still being
# uploaded are skipped). Note that --include takes a regular expression.
ftp_command="lftp -e 'mirror --only-missing --newer-than=${TOUCHFILE} --older-than=now-2minutes --parallel=4 --no-recursion --include \"SomeFileName*.csv\" ${RemoteSFTPserverPath}/ ${LocalPath}/; exit' -u ${FTP_Server_UserName},${FTP_Server_Password} sftp:${ServerIP}"

# Execute the job
eval "${ftp_command}"

# After the lftp job finishes, update the touch file for the next run.
# This updates it to the current timestamp:
touch /somepath/inYourLocal/someFilename.touch

# Or update it with the modification time of the last file received locally:
TchDate=$(stat -c %y "${LocalPath}/$(ls -1t ${LocalPath} | head -n1)")
touch -d "${TchDate}" /somepath/inYourLocal/someFilename.touch

# Or stat the latest file on the remote server instead (ssh needs the bare host,
# so the leading // used for the sftp URL is stripped):
TchDate=$(ssh -o StrictHostKeyChecking=no ${FTP_Server_UserName}@${ServerIP#//} "stat -c %y \"${RemoteSFTPserverPath}/\$(ls -1t ${RemoteSFTPserverPath} | head -n1)\"")
touch -d "${TchDate}" /somepath/inYourLocal/someFilename.touch

# Once the files are in your local directory, copy them to HDFS
hdfs dfs -put -f /Local/Path/toReceiveTheFiles/*.csv /HDFS/PATH

# Remove the local files so that you can accommodate the upcoming ones
rm -r -f /Local/Path/toReceiveTheFiles/*.csv
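
As a possible simplification (not part of the original answer), the put and rm steps above could be collapsed into a single hdfs dfs -moveFromLocal, which deletes each local copy only after it has been copied into HDFS; a sketch using the same placeholder paths:

# Copy to HDFS and delete the local copies in one step
hdfs dfs -moveFromLocal /Local/Path/toReceiveTheFiles/*.csv /HDFS/PATH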

The lftp mirror job has many more options; man lftp will be your best resource.
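
For instance, a dry run is a convenient way to preview what the mirror would transfer before scheduling the job. The sketch below reuses the placeholder credentials and paths from the script above, and the script name ftp_to_hdfs.sh is hypothetical:

# Preview what would be transferred without downloading anything
lftp -e 'mirror --only-missing --dry-run --no-recursion /Remote/Server/path/toTheFiles/ /Local/Path/toReceiveTheFiles/; exit' -u someUser,SomePassword sftp://127.12.11.35

# Example cron entry to run the whole script every 15 minutes
*/15 * * * * /somepath/inYourLocal/ftp_to_hdfs.sh >> /var/log/ftp_to_hdfs.log 2>&1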
