How do I generate distinct averages over a large amount of data?

Posted by mklgxw1f on 2021-06-02 in Hadoop


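I have a large dataset of rental listings, and I would like to generate the average price for each city, broken down by number of bedrooms. My rows look like this:

{( city: 'New York', num_bedrooms: 1, price: 1000.00 ),
 ( city: 'New York', num_bedrooms: 2, price: 2000.00 ),
 ( city: 'New York', num_bedrooms: 1, price: 2000.00 ),
 ( city: 'Chicago', num_bedrooms: 1, price: 4000.00 ),
 ( city: 'Chicago', num_bedrooms: 1, price: 1500.00 )}

Using Pig, I would like to get results in the following format:

{( city: 'New York', 1: 1500.00, 2: 2000.00),
 ( city: 'Chicago', 1: 2750.00 )}

Alternatively, I could also work with this:

{( city: 'New York', num_bedrooms: 1, price: 1500.00),
 ( city: 'New York', num_bedrooms: 2, price: 2000.00),
 ( city: 'Chicago', num_bedrooms: 1, price: 2750.00 )}

My plan is to create bar charts from this data, with the number of bedrooms on the x-axis and the price on the y-axis for a given city. I have been able to group by city and number of bedrooms and then average, but I don't know how to get the data into the format I want. This is what I have so far:

D = GROUP blah BY (city, num_bedrooms);
C = FOREACH D GENERATE blah.city, blah.num_bedrooms, AVG(blah.price);

However, this causes city and num_bedrooms to be repeated once for every time they occur in the group!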

hadoop cassandra apache-pig

Source: https://stackoverflow.com/questions/31623407/how-to-generate-a-distinct-average-of-lots-of-data-in-pig-latin

1 Answer


ars1skjm1#

Input:

New York,1,1000.00
New York,2,2000.00
New York,1,2000.00
Chicago,1,4000.00
Chicago,1,1500.00

Approach 1:
Pig script:

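-- Load the CSV as (city, num_bedrooms, price).
rental_data = LOAD 'rental_data.csv' USING PigStorage(',') AS (city:chararray, num_bedrooms:long, price:double);
-- One group per city; the nested block filters each city's bag by bedroom count.
rental_data_grp_city = GROUP rental_data BY (city);
rental_kpi = FOREACH rental_data_grp_city {
    one_bed_room = FILTER rental_data BY num_bedrooms == 1;
    two_bed_room = FILTER rental_data BY num_bedrooms == 2;
    -- AVG over an empty bag yields null, so a city with no matching listings gets an empty field.
    GENERATE group AS city, AVG(one_bed_room.price) AS one_bed_price, AVG(two_bed_room.price) AS two_bed_price;
};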

Output of DUMP rental_kpi:

(Chicago,2750.0,)
(New York,1500.0,2000.0)
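
As a sanity check on the numbers above, the per-city, per-bedroom-count averaging can be reproduced locally in plain Python. This is only a sketch over the five sample rows, not part of the original answer: the names `ROWS`, `wide_averages`, and `bedroom_counts` are made up for illustration, and `None` stands in for the null that Pig prints as an empty field.

```python
from collections import defaultdict
from statistics import mean

# Sample rows copied from the input above: (city, num_bedrooms, price).
ROWS = [
    ("New York", 1, 1000.00),
    ("New York", 2, 2000.00),
    ("New York", 1, 2000.00),
    ("Chicago", 1, 4000.00),
    ("Chicago", 1, 1500.00),
]


def wide_averages(rows, bedroom_counts=(1, 2)):
    """Return {city: tuple of average prices, one per requested bedroom count}."""
    by_city = defaultdict(list)
    for city, beds, price in rows:
        by_city[city].append((beds, price))

    result = {}
    for city, listings in by_city.items():
        averages = []
        for wanted in bedroom_counts:
            prices = [price for beds, price in listings if beds == wanted]
            # An empty selection has no average; None plays the role of Pig's null.
            averages.append(mean(prices) if prices else None)
        result[city] = tuple(averages)
    return result


print(wide_averages(ROWS))
```

Chicago has no two-bedroom listings in the sample, so its second value is `None`, matching the empty trailing field in `(Chicago,2750.0,)`. Like the script above, this layout needs one hard-coded column per bedroom count.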

Approach 2:
Pig script:

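rental_data = LOAD 'rental_data.csv' USING PigStorage(',') AS (city:chararray, num_bedrooms:long, price:double);
-- Group on the composite key so each (city, num_bedrooms) pair appears exactly once.
rental_data_grp_city = GROUP rental_data BY (city, num_bedrooms);
rental_kpi = FOREACH rental_data_grp_city {
    prices_bag = rental_data.price;
    -- Take the key fields from group, not from the bag, so they are not repeated per row.
    GENERATE group.city AS city, group.num_bedrooms AS num_bedrooms, AVG(prices_bag) AS price;
};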

Output of DUMP rental_kpi:

(Chicago,1,2750.0)
(New York,2,2000.0)
(New York,1,1500.0)
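
If the one-row-per-city shape from the question is still wanted for plotting, these long-format rows can be regrouped after they leave Pig. The following is a minimal sketch, assuming the rows have already been parsed into Python tuples; `LONG_ROWS` and `nest_by_city` are hypothetical names, not something the original answer provides.

```python
# Rows in the shape of the output above: (city, num_bedrooms, average_price).
LONG_ROWS = [
    ("Chicago", 1, 2750.0),
    ("New York", 2, 2000.0),
    ("New York", 1, 1500.0),
]


def nest_by_city(rows):
    """Return {city: {num_bedrooms: average_price}} with bedrooms in ascending order."""
    nested = {}
    for city, beds, price in rows:
        nested.setdefault(city, {})[beds] = price
    # Sort each inner dict by bedroom count so a bar chart's x-axis comes out in order.
    return {city: dict(sorted(prices.items())) for city, prices in nested.items()}


print(nest_by_city(LONG_ROWS))
```

Each inner dictionary then holds the x values (bedroom counts) and y values (average prices) for one city's bar chart, and any number of bedroom counts works without changing the script.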

Answered 2021-06-02
