Why does my Pig query return the wrong values?

dldeef67 posted on 2021-05-29 in Hadoop

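I am trying to work with the following dataset in Pig: https://www.kaggle.com/zynicide/wine-reviews/version/4. My query returns the wrong values. The only cause I can think of is missing data in the dataset, but I don't know whether that is actually the case, or why I am getting the wrong values.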

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;

DESCRIBE allWinesPricePoints;

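The actual result I get is (56203, buttery toast and spice flavors wrapped in a creamy texture. Should hold for a year or two.") (61341, sweet tannins. Fresh acidity gives it an extra lift. Give it time. Best 2007-2012.") (16417, Chardonnay also known as) (115384, almond and vanilla) (136804, almond and vanilla)
I think the output should be (56203,23) (61341,30) (16417,16) (115384,250) (136804,250)
I expect the second value to be a number taken from the price column.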
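
The description fragments showing up where the price should be suggest that commas inside the quoted description field are being treated as field delimiters, which shifts every later column to the right. The sketch below illustrates that effect in Python rather than Pig, so that it can be run without a cluster. The sample row is invented for illustration (it is not taken from the dataset), and `naive_split` and `quote_aware_split` are hypothetical helper names.

```python
import csv
import io

# Hypothetical row laid out in the same column order as the question's schema.
# The values are invented for illustration, not copied from the dataset.
SAMPLE = (
    '0,US,"Ripe fruit, soft tannins, bright acidity, '
    'hints of almond and vanilla, long finish.",'
    'Reserve,90,25,California,,Napa,Merlot,Example Winery\n'
)

PRICE_INDEX = 5  # position of `price` in the question's schema


def naive_split(line):
    """Split on every comma and ignore quotes entirely."""
    return line.rstrip("\n").split(",")


def quote_aware_split(line):
    """Parse one line with the csv module, which keeps quoted commas together."""
    return next(csv.reader(io.StringIO(line)))


if __name__ == "__main__":
    # A piece of the description lands in the price position.
    print(repr(naive_split(SAMPLE)[PRICE_INDEX]))
    # The real price is returned once quotes are respected.
    print(repr(quote_aware_split(SAMPLE)[PRICE_INDEX]))
```

If this is indeed the cause, a quote-aware loader is needed on the Pig side as well. The piggybank library ships `org.apache.pig.piggybank.storage.CSVExcelStorage` for this kind of file, which may be worth trying in place of `PigStorage(',')`.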

7bsow1i6

Try the following:

allWines = LOAD 'winemag-data_first150k.csv' USING PigStorage(',') AS (id:chararray, country:chararray, description:chararray, designation:chararray, points:chararray, price:chararray, province:chararray, region_2:chararray, region_1:chararray, variety:chararray, winery:chararray);

-- Add the FOREACH below to project every column explicitly;
-- this should help the data parse correctly.
-- Generate the columns in the same order as they appear in the text file.
allWines = FOREACH allWines GENERATE
id AS id,
country AS country,
description AS description,
designation AS designation,
points AS points,
price AS price,
province AS province,
region_2 AS region_2,
region_1 AS region_1,
variety AS variety,
winery AS winery;

allWinesNotNull = FILTER allWines BY price is not null;
allWinesNotNull2 = FILTER allWinesNotNull BY points is not null;
allWinesPriceSorted = ORDER allWinesNotNull2 BY price;
allWinesPriceTop5Sorted = LIMIT allWinesPriceSorted  5;
allWinesPricePoints = FOREACH allWinesPriceTop5Sorted GENERATE id, price;
DUMP allWinesPricePoints;
DESCRIBE allWinesPricePoints;
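
A separate point to keep in mind with the script above: `price` is declared as `chararray`, so `ORDER ... BY price` compares the values as text rather than as numbers. The short Python sketch below only illustrates that difference; the price strings are the ones from the question's expected output, and the variable names are made up for the example.

```python
# Prices held as strings, the way a chararray column stores them.
# The values come from the expected output listed in the question.
prices = ["23", "30", "16", "250", "250"]

# Text ordering compares character by character, so "250" sorts before "30".
as_text = sorted(prices)

# Numeric ordering converts each value before comparing.
as_numbers = sorted(prices, key=int)

if __name__ == "__main__":
    print(as_text)
    print(as_numbers)
```

In Pig, declaring the field as `price:int` in the `LOAD` schema, or casting with `(int)price` before the `ORDER`, should give numeric ordering, assuming the column has been parsed correctly in the first place.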

Hope this helps. Let me know if you run into any problems.
