I have a question about integrating alpakka-kafka with alpakka-s3: the Alpakka S3 multipartUpload doesn't seem to upload any file when I use the alpakka-kafka source.
kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
However, as soon as I add .take(100) after the Kafka source, everything works fine:
kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
Any help would be appreciated. Thanks in advance! Here is the full code snippet:
// Source
val kafkaSource: Source[(CommittableOffset, Array[Byte]), Consumer.Control] = {
  Consumer
    .committableSource(consumerSettings, Subscriptions.topics(prefixedTopics))
    .map(committableMessage => (committableMessage.committableOffset, committableMessage.record.value))
    .watchTermination() { (mat, f: Future[Done]) =>
      f.foreach { _ =>
        log.debug("consumer source shutdown, consumerId={}, group={}, topics={}", consumerId, group, prefixedTopics.mkString(", "))
      }
      mat
    }
}
// Flow
val commitFlow: Flow[CommittableOffset, Done, NotUsed] = {
  Flow[CommittableOffset]
    .groupedWithin(batchingSize, batchingInterval)
    .map(group => group.foldLeft(CommittableOffsetBatch.empty) { (batch, elem) => batch.updated(elem) })
    .mapAsync(parallelism = 3) { msg =>
      log.debug("committing offset, msg={}", msg)
      msg.commitScaladsl().map { result =>
        log.debug("committed offset, msg={}", msg)
        result
      }
    }
}
private val kafkaMsgToByteStringFlow = Flow[KafkaMessage[Any]].map(x => ByteString(x.msg + "\n"))

private val kafkaMsgToOffsetFlow = {
  implicit val askTimeout: Timeout = Timeout(5.seconds)
  Flow[KafkaMessage[Any]].mapAsync(parallelism = 5) { elem =>
    Future(elem.offset)
  }
}
// Sink
val s3Sink = {
  val BUCKET = "test-data"
  s3Client.multipartUpload(BUCKET, s"tmp/data.txt")
}

// Doesn't work... (no files show up in the S3 bucket)
kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
// This one works...
kafkaSource.take(100) ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore
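The ~> lines above are only graph fragments; a minimal sketch of how they would presumably be wired into a runnable graph is shown below. The GraphDSL boilerplate, the Broadcast fan-out and ClosedShape are my assumptions rather than part of the original code; kafkaSource, the flows and s3Sink are the definitions above, and an implicit materializer is assumed to be in scope.

import akka.NotUsed
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, GraphDSL, RunnableGraph, Sink}

val graph: RunnableGraph[NotUsed] = RunnableGraph.fromGraph(GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._

  // fan each deserialized message out to the S3 path and the offset-commit path
  val bcast = builder.add(Broadcast[KafkaMessage[Any]](2))

  kafkaSource ~> kafkaSubscriber.serializer.deserializeFlow ~> bcast.in
  bcast.out(0) ~> kafkaMsgToByteStringFlow ~> s3Sink
  bcast.out(1) ~> kafkaMsgToOffsetFlow ~> commitFlow ~> Sink.ignore

  ClosedShape
})

graph.run() // requires an implicit materializer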
2 Answers
Actually, it does upload; the problem is that a completion request has to be sent to S3 before the upload is finalized, and only then does the file become visible in the bucket. My bet is that without take(n) the Kafka source never stops producing data downstream, so the stream never completes, the sink keeps expecting more data, and the completion request is never sent to S3. You can't upload everything into a single file anyway, so my suggestion is: group the kafkaSource messages and send the compressed Array[Byte] to the sink. The trick is that you have to create one sink per file instead of reusing a single sink.
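A minimal sketch of that suggestion, assuming the kafkaSource, deserializeFlow, commitFlow and s3Client from the question are in scope, plus an implicit materializer and ExecutionContext. The names chunkSize and keyFor and the batch size are illustrative assumptions, and compression is left out for brevity:

import java.time.Instant
import akka.Done
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.Future

val chunkSize = 10000                     // messages per S3 object (assumption)
val BUCKET    = "test-data"

// one distinct object key per batch (illustrative naming scheme)
def keyFor(now: Instant): String = s"tmp/data-$now.txt"

val done: Future[Done] =
  kafkaSource
    .via(kafkaSubscriber.serializer.deserializeFlow)
    .grouped(chunkSize)                   // cut the endless stream into finite batches
    .mapAsync(parallelism = 1) { batch =>
      val bytes = batch.map(m => ByteString(m.msg + "\n"))
      // a fresh multipartUpload sink per batch: the inner stream completes,
      // so S3 receives its completion request and the object appears in the bucket
      Source(bytes)
        .runWith(s3Client.multipartUpload(BUCKET, keyFor(Instant.now())))
        .map(_ => batch.last.offset)      // commit offsets only after the upload succeeded
    }
    .via(commitFlow)
    .runWith(Sink.ignore)

The essential design choice is that every batch gets its own sink, so each upload sees a stream that actually completes; an alternative is to keep one endless stream and roll objects with a rotating sink such as Alpakka's LogRotatorSink.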