Hadoop Streaming: how to inner join two different files using Python

Posted by w8f9ii69 on 2021-06-04 in Hadoop


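I want to find the top websites by page visits among users aged 18 to 25. I have two files: one contains the user name and age, the other contains the user name and website name. Example:
users.txt
john,22
pages.txt
john,google.com
I have written the following in Python, and it works as expected outside of Hadoop.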

import os
os.chdir("/home/pythonlab")

# Top sites visited by users aged 18 to 25

# Read the users file: each line is "user name,age" (e.g. john,22)
with open("users.txt") as lines:
    users = [line.strip().split(",") for line in lines if line.strip()]
userlist = [(u[0].strip(), int(u[1])) for u in users]

# Read the page visit file: each line is "user name,website" (e.g. john,google.com)
with open("pages.txt") as pages:
    page = [p.strip().split(",") for p in pages if p.strip()]
pagelist = [(p[0].strip(), p[1].strip()) for p in page]

# Join users to page visits on user name, keeping only ages 18 to 25
usrpage = [[p[1], u[0]] for u in userlist for p in pagelist
           if u[0] == p[0] and 18 <= u[1] <= 25]

for z in usrpage:
    print(z[0] + ",1")     # print website name, 1

Sample output:
yahoo.com,1
google.com,1
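
As a side note, the script above compares every user with every page visit. A minimal sketch of the same inner join using a dictionary lookup is shown here, assuming each user name appears only once in users.txt. The function name `top_site_records` and the sample rows are hypothetical, and the output follows the order of the page visits rather than the order of the users.

```python
def top_site_records(user_lines, page_lines, min_age=18, max_age=25):
    """Inner-join users to page visits on user name; return 'site,1' per visit."""
    # Build a lookup of user name -> age, keeping only the target age group.
    ages = {}
    for line in user_lines:
        if not line.strip():
            continue
        name, age = line.strip().split(",")
        if min_age <= int(age) <= max_age:
            ages[name.strip()] = int(age)

    # Probe the lookup once per page visit instead of looping over all users.
    records = []
    for line in page_lines:
        if not line.strip():
            continue
        name, site = line.strip().split(",")
        if name.strip() in ages:
            records.append(site.strip() + ",1")
    return records


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    users = ["john,22", "mary,40", "alice,19"]
    pages = ["john,google.com", "mary,bing.com", "alice,yahoo.com"]
    for record in top_site_records(users, pages):
        print(record)
```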
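Now I want to solve this using Hadoop Streaming.
My question is: how do I handle these two named files (users.txt, pages.txt) in the mapper? We normally pass only an input directory to Hadoop Streaming.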
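
One possible direction, sketched here under stated assumptions rather than as a confirmed solution, is a reduce-side join: the mapper tags each record with the file it came from, and the reducer joins the records that share a user name. The script name `join.py`, the tags `A` and `B`, and the function names are hypothetical. The sketch also assumes that Hadoop Streaming exposes the current input file through the `mapreduce_map_input_file` environment variable (or `map_input_file` on older releases), which should be checked on the target cluster.

```python
#!/usr/bin/env python
"""Reduce-side join sketch for Hadoop Streaming (hypothetical script: join.py)."""
import os
import sys
from itertools import groupby


def map_line(line, input_path):
    """Tag one input record with its source file so the reducer can tell them apart."""
    name, value = [field.strip() for field in line.split(",", 1)]
    # "A" marks an age record from users.txt, "B" marks a site record from pages.txt.
    tag = "A" if os.path.basename(input_path).startswith("users") else "B"
    return "%s\t%s\t%s" % (name, tag, value)


def reduce_lines(lines, min_age=18, max_age=25):
    """Join tagged records that share a user name; yield 'site,1' per visit."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    # Streaming delivers reducer input sorted by key, so equal names are adjacent.
    for name, group in groupby(rows, key=lambda row: row[0]):
        age, sites = None, []
        for _, tag, value in group:
            if tag == "A":
                age = int(value)
            else:
                sites.append(value)
        # Inner join: emit only when the user exists and is in the age group.
        if age is not None and min_age <= age <= max_age:
            for site in sites:
                yield site + ",1"


if __name__ == "__main__":
    if sys.argv[1:] == ["map"]:
        # Assumption: one of these variables holds the path of the current input file.
        path = os.environ.get("mapreduce_map_input_file",
                              os.environ.get("map_input_file", ""))
        for line in sys.stdin:
            if line.strip():
                print(map_line(line, path))
    elif sys.argv[1:] == ["reduce"]:
        for record in reduce_lines(sys.stdin):
            print(record)
```

With this layout, both files could sit in the same input directory (or be passed with one `-input` option each), and the same script would serve as mapper and reducer via `join.py map` and `join.py reduce`. Summing the `site,1` records to rank the sites would still need a second, word-count style step.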

hadoop python hadoop-streaming

Source: https://stackoverflow.com/questions/16909577/hadoop-streaming-how-to-inner-join-of-two-diff-files-using-python