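I want to find the top websites by page visits among users aged 18 to 25. I have two files: one contains user name and age, and the other contains user name and website name. Example:
users.txt
john,22
pages.txt
john,google.com
I have written the following in Python, and it works as expected outside of Hadoop.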
import os

os.chdir("/home/pythonlab")

# Top sites visited by users aged 18 to 25

# read the users file: user name, age (e.g. john,22)
with open("users.txt") as lines:
    users = [line.strip().split(",") for line in lines if line.strip()]
userlist = [(u[0], int(u[1])) for u in users]  # (user name, age as int)

# read the page visit file: user name, website visited (e.g. john,google.com)
with open("pages.txt") as pages:
    page = [p.strip().split(",") for p in pages if p.strip()]
pagelist = [(p[0], p[1]) for p in page]

# join users with page visits and keep only the 18 to 25 age group
usrpage = [[p[1], u[0]] for u in userlist for p in pagelist
           if u[0] == p[0] and 18 <= u[1] <= 25]

for z in usrpage:
    print(z[0] + ",1")  # print website name, 1
Sample output:
yahoo.com,1
google.com,1
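Now I want to solve this problem with Hadoop streaming.
My question is: how do I handle these two named files (users.txt, pages.txt) in the mapper? Normally we only pass an input directory to Hadoop streaming.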
1 Answer
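You should consider using Hive. It lets you combine multiple source files into one, which is what you need here. You can join the two data sources just as you would in SQL, and then push the result into the mapper and reducer.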
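For comparison, if you would rather stay in plain Hadoop streaming than move to Hive, the same join can be sketched as a reduce-side join in Python. This is only a minimal sketch under assumptions, not a tested job: the function names `map_line` and `reduce_lines` and the `map`/`reduce` command-line switch are made up for illustration, both files are assumed to sit in the one input directory, and the mapper is assumed to learn the current file name from the `mapreduce_map_input_file` environment variable (`map_input_file` on older Hadoop versions), which streaming derives from the job configuration. Check that variable name against your Hadoop version.

```python
import os
import sys
from itertools import groupby


def map_line(line, source_file):
    """Tag one input line by its source file; return 'user<TAB>tag<TAB>value' or None."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) < 2 or not fields[0]:
        return None
    user, value = fields[0], fields[1]
    # Tag "A" marks an age record (users.txt), tag "B" a page visit (pages.txt).
    tag = "A" if os.path.basename(source_file).startswith("users") else "B"
    return "\t".join((user, tag, value))


def reduce_lines(sorted_lines, min_age=18, max_age=25):
    """Join tagged records per user; yield 'site,1' for users inside the age range."""
    records = (l.rstrip("\r\n").split("\t") for l in sorted_lines if l.strip())
    for user, group in groupby(records, key=lambda r: r[0]):
        age, sites = None, []
        # Buffer the sites, because the age record is not guaranteed to arrive first.
        for _, tag, value in group:
            if tag == "A":
                age = int(value)
            else:
                sites.append(value)
        if age is not None and min_age <= age <= max_age:
            for site in sites:
                yield site + ",1"


if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        # Assumption: streaming exposes the current input file under one of these names.
        src = os.environ.get("mapreduce_map_input_file",
                             os.environ.get("map_input_file", ""))
        for line in sys.stdin:
            out = map_line(line, src)
            if out:
                print(out)
    elif sys.argv[1] == "reduce":
        for out in reduce_lines(sys.stdin):
            print(out)
```

The user name is the key, so the shuffle brings each user's age and page visits to the same reducer, which then emits one `site,1` line per visit by a user in the age range. A second pass (or Hive, as suggested above) would still be needed to sum the counts per site.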