Flume: streaming data from the Twitter API to HDFS

Posted by 6qqygrtg on 2021-06-01 in Hadoop

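Over the past few weeks I have been using Flume (Flume 1.5.0-cdh5.4.3) in a local VM running CentOS 6.6 to try to stream Twitter data to a Hadoop (Hadoop 2.6.0-cdh5.4.3) server.
At first I tried the built-in Twitter source that ships with Flume by default, but the data was evidently not encoded correctly, and the Cloudera team later confirmed that this is a known issue (https://community.cloudera.com/t5/data-ingestion-integration/flume-twitter-data-looks-corrupt/td-p/48095).
This approach uses the class name org.apache.flume.source.twitter.TwitterSource, which was set correctly in the flume.conf file.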
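
One possible explanation, offered as an assumption rather than a diagnosis: the Flume user guide describes the built-in Twitter source as converting tweets to Avro, so output that looks corrupt in a text viewer may be binary Avro rather than JSON text. A minimal Python sketch for checking whether bytes copied out of HDFS start with the Avro container magic; the helper names are hypothetical and the sample inputs are illustrative only.

```python
# The Avro object container format begins with the 4-byte magic "Obj" + 0x01.
AVRO_MAGIC = b"Obj\x01"


def looks_like_avro(data: bytes) -> bool:
    """Return True if the bytes start with the Avro container-file magic."""
    return data.startswith(AVRO_MAGIC)


def file_looks_like_avro(path: str) -> bool:
    """Check the first bytes of a local file, e.g. one fetched with `hdfs dfs -get`."""
    with open(path, "rb") as handle:
        return looks_like_avro(handle.read(len(AVRO_MAGIC)))


# Illustrative inputs only; neither is real output from the setup described here.
avro_like = AVRO_MAGIC + b"\x04\x14avro.codec"
json_like = b'{"id": 1, "text": "hello"}'

print(looks_like_avro(avro_like))  # True
print(looks_like_avro(json_like))  # False
```

If the check returns True for a file written by the HDFS sink, the content is probably intact and needs an Avro reader rather than a text viewer.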
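Next I tried a custom Twitter source JAR, which is available on Cloudera's website (http://files.cloudera.com/samples/flume-sources-1.0-snapshot.jar) and is used in many other tutorials. This led to a different problem: the application hangs right after it reports that it is receiving the status stream from the Twitter API.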

2017-05-25 09:51:37,875 (Twitter Stream consumer-1[initializing]) 
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 
Establishing connection.
2017-05-25 09:52:10,545 (Twitter Stream consumer-1[Establishing connection]) 
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 
Connection established.
2017-05-25 09:52:10,546 (Twitter Stream consumer-1[Establishing connection]) 
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)] 
Receiving status stream.
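
To make the point of the stall easier to see, the log entries can be reduced to timestamp and message pairs. This is a minimal Python sketch that assumes each entry starts with a timestamp line and ends with the message line, as in the excerpt above; `parse_entries` is a hypothetical helper, not part of Flume or twitter4j.

```python
import re
from datetime import datetime

# Matches the timestamp that opens each log entry, e.g. "2017-05-25 09:51:37,875".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def parse_entries(log_text):
    """Return a list of (timestamp, message) pairs from Flume-style log text."""
    entries = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = TIMESTAMP.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            entries.append([stamp, ""])
        elif entries:
            # Continuation line: the last one seen is kept as the message.
            entries[-1][1] = line
    return [(stamp, message) for stamp, message in entries]


SAMPLE_LOG = """\
2017-05-25 09:51:37,875 (Twitter Stream consumer-1[initializing])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Establishing connection.
2017-05-25 09:52:10,545 (Twitter Stream consumer-1[Establishing connection])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Connection established.
2017-05-25 09:52:10,546 (Twitter Stream consumer-1[Establishing connection])
[INFO - twitter4j.internal.logging.SLF4JLogger.info(SLF4JLogger.java:83)]
Receiving status stream.
"""

entries = parse_entries(SAMPLE_LOG)
connect_seconds = (entries[1][0] - entries[0][0]).total_seconds()
print(f"connection took {connect_seconds:.3f} s")
print(f"last message: {entries[-1][1]}")
```

On this excerpt the connection takes roughly 32.7 seconds to establish, and "Receiving status stream." is the last message logged before the agent goes quiet.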

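This approach uses the class name com.cloudera.flume.source.TwitterSource, which reportedly no longer works and has been superseded by org.apache.flume.source.twitter.TwitterSource. The relevant files are set up as follows: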
1) Flume.conf

TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
# Old class:
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource

# New class:
# TwitterAgent.sources.Twitter.type = org.apache.flume.source.twitter.TwitterSource

TwitterAgent.sources.Twitter.channels = MemChannel

TwitterAgent.sources.Twitter.consumerKey = <consumerKey>
TwitterAgent.sources.Twitter.consumerSecret = <consumerSecret>
TwitterAgent.sources.Twitter.accessToken = <accessToken>
TwitterAgent.sources.Twitter.accessTokenSecret = <accessTokenSecret>
TwitterAgent.sources.Twitter.keywords = hadoop, big data

TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://localhost:8020/user/flume/tweets/%Y/%m/%d/%H/
TwitterAgent.sinks.HDFS.hdfs.fileType = DataStream
TwitterAgent.sinks.HDFS.hdfs.writeFormat = Text
TwitterAgent.sinks.HDFS.hdfs.batchSize = 100
TwitterAgent.sinks.HDFS.hdfs.rollSize = 100000
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
TwitterAgent.sinks.HDFS.hdfs.useLocalTimeStamp = true
TwitterAgent.sinks.HDFS.hdfs.callTimeout = 30000

TwitterAgent.channels.MemChannel.type = memory
TwitterAgent.channels.MemChannel.capacity = 10000
TwitterAgent.channels.MemChannel.transactionCapacity = 100
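
One thing worth ruling out in a file like this is wiring or syntax slips. In the Java properties format, `#` starts a comment only at the beginning of a line, so any text placed after a value becomes part of that value. The following Python sketch is a rough sanity check under simplifying assumptions (only `key = value` lines are handled); `parse_properties` and `check_agent` are hypothetical helpers, and the sample text is a shortened illustration rather than a full configuration.

```python
def parse_properties(text):
    """Parse simple `key = value` lines; '#' or '!' starts a comment only at line start."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or line.startswith("!"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def check_agent(props, agent):
    """Return a list of human-readable problems found for one Flume agent."""
    problems = []
    channels = set(props.get(f"{agent}.channels", "").split())
    for name in props.get(f"{agent}.sources", "").split():
        for channel in props.get(f"{agent}.sources.{name}.channels", "").split():
            if channel not in channels:
                problems.append(f"source {name}: unknown channel {channel}")
    for name in props.get(f"{agent}.sinks", "").split():
        channel = props.get(f"{agent}.sinks.{name}.channel", "")
        if channel not in channels:
            problems.append(f"sink {name}: unknown channel {channel!r}")
    for key, value in props.items():
        if " #" in value:
            problems.append(f"{key}: text after '#' is part of the value")
    return problems


SAMPLE_CONF = """
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource #(OLD CLASS)
TwitterAgent.sources.Twitter.channels = MemChannel
TwitterAgent.sinks.HDFS.channel = MemChannel
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.channels.MemChannel.type = memory
"""

problems = check_agent(parse_properties(SAMPLE_CONF), "TwitterAgent")
for problem in problems:
    print(problem)
```

With the sample text, the only finding is the trailing `#(OLD CLASS)` on the source type line; moving such notes onto their own comment lines clears it.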

2) Flume-env.sh

JAVA_HOME=/usr/java/jdk1.7.0_67
JAVA_OPTS="-Xmx500m"
FLUME_CLASSPATH="/usr/lib/flume-ng/lib/flume-twitter-source.jar"

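Does anyone know what is going on? Has anyone managed to run Flume with this setup? Is there a newer way of connecting Flume to the Twitter API that I am not aware of? Thanks!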
