Firehose output to S3: repartitioning the year/month/day/hour folder format into dt=yy-mm-dd format

ff29svar  posted on 2021-06-26  in  Hive

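I'm working in the AWS EMR ecosystem.
I'm looking for a smart way to repartition AWS Firehose output from:
s3://bucket/yyyy/mm/dd/hh
into the Hive partition format:
s3://bucket/dt=yy-mm-dd-hh
Any suggestions?
Thanks, Omid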

whitzsjs1#

Adding the same answer in boto3 (to match the package currently bundled with Lambda by default)

import re
import boto3

## set buckets:

source_bucket='walla-anagog-us-east-1'
destination_bucket='walla-anagog-eu-west-1'

## regex from YYYY/MM/DD/HH to dt=YYYY-MM-DD

## replaced_file = re.sub(r'(\d{4})\/(\d{2})\/(\d{2})\/(\d{2})', r'dt=\1-\2-\3' , file)

client = boto3.client('s3')
s3 = boto3.resource('s3')
mybucket = s3.Bucket(source_bucket)

for obj in mybucket.objects.all():
    # Rewrite YYYY/MM/DD/HH to dt=YYYY-MM-DD (the hour component is dropped).
    replaced_key = re.sub(r'(\d{4})\/(\d{2})\/(\d{2})\/(\d{2})', r'dt=\1-\2-\3' , obj.key)
    print(obj.key)
    # Copy to the destination bucket, then delete the source object (i.e. a move).
    client.copy_object(Bucket=destination_bucket, CopySource={'Bucket': source_bucket, 'Key': obj.key}, Key=replaced_key, ServerSideEncryption='AES256')
    client.delete_object(Bucket=source_bucket, Key=obj.key)
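
Since the loop above deletes source objects, it may be worth checking the key rewrite offline first. The following is a minimal sketch that isolates the same regex in a helper; the name `to_daily_partition_key` and the sample key are hypothetical, and unlike the loop above it only rewrites the first match (`count=1`).

```python
import re

# Same pattern as in the answer: a YYYY/MM/DD/HH run anywhere in the key.
FIREHOSE_PREFIX = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})')

def to_daily_partition_key(key):
    """Rewrite the first YYYY/MM/DD/HH in `key` to dt=YYYY-MM-DD (hypothetical helper)."""
    # count=1 is an assumption: only the leading date prefix should change.
    return FIREHOSE_PREFIX.sub(r'dt=\1-\2-\3', key, count=1)

# The sample key below is made up for illustration.
print(to_daily_partition_key('2021/06/26/13/my-stream-2021-06-26-13-00-01-abc'))
```

Keys without a matching prefix are returned unchanged, so in the loop above they would simply be moved to the destination bucket under their original name.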

qf9go6mv2#

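We solved this with S3DistCp. We aggregate the data every hour, group it by a pattern, and write it to an appropriately prefixed directory.
This is definitely a feature that Firehose lacks, and at the moment there is no way to do it with Firehose alone.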
http://docs.aws.amazon.com/emr/latest/releaseguide/usingemr_s3distcp.html
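
As a rough illustration of the hourly approach described above, the sketch below builds an `s3-dist-cp` argument list for one hour of output. It is not the answerer's actual job: the function name, the bucket name, and the group-by pattern are hypothetical, and only the documented `--src`, `--dest`, and `--groupBy` options are assumed.

```python
def s3distcp_args(bucket, year, month, day, hour, group_by):
    """Build s3-dist-cp arguments for one hour of Firehose output (illustrative only)."""
    # Source: the yyyy/mm/dd/hh prefix that Firehose writes.
    src = 's3://{}/{:04d}/{:02d}/{:02d}/{:02d}/'.format(bucket, year, month, day, hour)
    # Destination: a Hive-style dt= prefix (assumed here to live in the same bucket).
    dest = 's3://{}/dt={:04d}-{:02d}-{:02d}-{:02d}/'.format(bucket, year, month, day, hour)
    # --groupBy concatenates files whose names match the pattern's capture group.
    return ['s3-dist-cp', '--src=' + src, '--dest=' + dest, '--groupBy=' + group_by]

# 'my-bucket' and the pattern are placeholders, not values from the answer.
args = s3distcp_args('my-bucket', 2021, 6, 26, 13, '.*(my-stream).*')
print(' '.join(args))
```

A list like this could be passed as the arguments of an EMR step scheduled once per hour; how the step is submitted is left out here.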


8yparm6h3#

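I used Python and boto to move the files and repartition them. I applied a regex to rename the keys from yyyy/mm/dd/hh to dt=yy-mm-dd-hh.
Code snippet (note that the source keys are deleted):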

from boto.s3.connection import S3Connection
import re

conn = S3Connection('xxx','yyy')

## get buckets:

source_bucket='srcBucketName'
destination_bucket='dstBucketName'

src = conn.get_bucket(source_bucket)
dst = conn.get_bucket(destination_bucket)

## Iterate

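for key in src.list():
    # Use the key name as text; encoding it to bytes breaks re.sub with a str pattern on Python 3.
    file = key.name

    # Rewrite YYYY/MM/DD/HH to dt=YYYY-MM-DD-HH.
    replaced_file = re.sub(r'(\d{4})\/(\d{2})\/(\d{2})\/(\d{2})', r'dt=\1-\2-\3-\4', file)

    # Copy to the destination bucket with server-side encryption, then delete the source key.
    dst.copy_key(replaced_file, src.name, file, encrypt_key=True)
    key.delete()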
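
Because the snippet above deletes each source key after copying, a dry run can help before touching the bucket. The sketch below only computes the planned renames from a list of key names and reports keys the regex does not match; `plan_moves` and the sample keys are hypothetical and no S3 calls are made.

```python
import re

# Same pattern as in the answer, keeping the hour in the partition value.
HOURLY = re.compile(r'(\d{4})/(\d{2})/(\d{2})/(\d{2})')

def plan_moves(keys):
    """Return (moves, skipped); moves holds (old_key, new_key) pairs (hypothetical helper)."""
    moves, skipped = [], []
    for key in keys:
        # subn also returns the number of substitutions, so unmatched keys can be detected.
        new_key, n = HOURLY.subn(r'dt=\1-\2-\3-\4', key, count=1)
        if n:
            moves.append((key, new_key))
        else:
            skipped.append(key)
    return moves, skipped

# Sample key names are made up for illustration.
moves, skipped = plan_moves(['2021/06/26/13/a.gz', 'manifest.json'])
for old, new in moves:
    print(old, '->', new)
print('skipped:', skipped)
```

The key names could come from `src.list()` in the snippet above; reviewing the printed pairs first makes it easier to spot keys that would be moved unchanged.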
