As shown in the figure, I want to extract data using Spark.
DataSetTest ro1 = new DataSetTest("apple", "fruit", "red", 3);
DataSetTest ro2 = new DataSetTest("apple", "fruit", "red", 4);
DataSetTest ro3 = new DataSetTest("car", "toy", "red", 1);
DataSetTest ro4 = new DataSetTest("bike", "toy", "white", 2);
DataSetTest ro5 = new DataSetTest("bike", "toy", "red", 5);
DataSetTest ro6 = new DataSetTest("apple", "fruit", "red", 3);
DataSetTest ro7 = new DataSetTest("car", "toy", "white", 7);
DataSetTest ro8 = new DataSetTest("apple", "fruit", "green", 1);
Dataset<Row> df = session.getSqlContext().createDataFrame(Arrays.asList(ro1, ro2, ro3, ro4, ro5, ro6, ro7, ro8), DataSetTest.class);
private void process() {
    // 1) group by key
    Dataset<Row> df2 = df.groupBy("keyword", "opt1", "opt2").sum("count");
    // 2) count by option and calculate the total
    Dataset<Row> df3 = df2.withColumn("fruit_red", ???)
        .withColumn("fruit_green", ???)
        .withColumn("toy_red", ???)
        .withColumn("toy_white", ???)
        .withColumn("total_count", ???);
    // 3) calculate the percentage
    Dataset<Row> df4 = df3.withColumn("percent", df3.col("total_count").divide("??sum of total_count??"));
}
Does anyone know how to do this kind of counting?
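To make the target concrete, the expected numbers can be worked out without Spark. The following is a minimal plain-Java sketch under two assumptions that the question does not state explicitly: the four constructor arguments are keyword, opt1, opt2, and count, and the desired output has one row per keyword with one column per opt1_opt2 combination. The class ExpectedCounts and its nested Item type are hypothetical stand-ins for illustration, not part of the original code.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ExpectedCounts {
    // Hypothetical stand-in for DataSetTest; the field names are assumed.
    static final class Item {
        final String keyword;
        final String opt1;
        final String opt2;
        final int count;

        Item(String keyword, String opt1, String opt2, int count) {
            this.keyword = keyword;
            this.opt1 = opt1;
            this.opt2 = opt2;
            this.count = count;
        }
    }

    // The same eight rows as ro1..ro8 in the question.
    static final List<Item> ROWS = List.of(
        new Item("apple", "fruit", "red", 3),
        new Item("apple", "fruit", "red", 4),
        new Item("car", "toy", "red", 1),
        new Item("bike", "toy", "white", 2),
        new Item("bike", "toy", "red", 5),
        new Item("apple", "fruit", "red", 3),
        new Item("car", "toy", "white", 7),
        new Item("apple", "fruit", "green", 1));

    // Builds keyword -> ("opt1_opt2" -> summed count).
    static Map<String, Map<String, Integer>> pivot(List<Item> rows) {
        Map<String, Map<String, Integer>> table = new LinkedHashMap<>();
        for (Item r : rows) {
            table.computeIfAbsent(r.keyword, k -> new TreeMap<>())
                 .merge(r.opt1 + "_" + r.opt2, r.count, Integer::sum);
        }
        return table;
    }

    // Sums all pivoted cells of one keyword (the total_count column).
    static int total(Map<String, Integer> cells) {
        return cells.values().stream().mapToInt(Integer::intValue).sum();
    }

    // Sums total_count over all keywords (the denominator of percent).
    static int grandTotal(Map<String, Map<String, Integer>> table) {
        return table.values().stream().mapToInt(ExpectedCounts::total).sum();
    }

    public static void main(String[] args) {
        Map<String, Map<String, Integer>> table = pivot(ROWS);
        int grand = grandTotal(table);
        table.forEach((keyword, cells) -> {
            int t = total(cells);
            System.out.printf("%s %s total_count=%d percent=%.4f%n",
                keyword, cells, t, (double) t / grand);
        });
    }
}
```

Under those assumptions, apple gets fruit_red = 10, fruit_green = 1, and total_count = 11; car gets toy_red = 1, toy_white = 7, and total_count = 8; bike gets toy_red = 5, toy_white = 2, and total_count = 7. The grand total is 26, so the percent column would be 11/26, 8/26, and 7/26 respectively.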
1 Answer
I'm not a Java expert, but you could do it like this:
The result looks like this:
Then:
Which gives:
You can get more readable code using Python or Scala.
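For reference, one way to fill in the ??? placeholders from the question in Java is a pivot. This is a minimal sketch and not necessarily the approach the answer used. It assumes the DataSetTest bean exposes columns named keyword, opt1, opt2, and count, and that one output row per keyword is wanted. The class name KeywordPivot and the helper column opt are made up for illustration.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat_ws;
import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public final class KeywordPivot {

    // Assumes df has the columns keyword, opt1, opt2, and count.
    static Dataset<Row> process(Dataset<Row> df) {
        // 1) Build one pivot key per row (for example "fruit_red"), then
        //    pivot so that each key becomes its own column of summed counts.
        //    Combinations that never occur for a keyword are filled with 0.
        Dataset<Row> pivoted = df
            .withColumn("opt", concat_ws("_", col("opt1"), col("opt2")))
            .groupBy("keyword")
            .pivot("opt")
            .sum("count")
            .na().fill(0L);

        // 2) total_count is the sum of every pivoted column in the row.
        Column total = lit(0L);
        for (String name : pivoted.columns()) {
            if (!name.equals("keyword")) {
                total = total.plus(col(name));
            }
        }
        Dataset<Row> withTotal = pivoted.withColumn("total_count", total);

        // 3) Collect the grand total to the driver (a single value) and
        //    divide each row's total_count by it. Assumes df is not empty.
        long grandTotal = ((Number) withTotal
            .agg(sum("total_count"))
            .first()
            .get(0)).longValue();

        return withTotal.withColumn(
            "percent", col("total_count").divide(lit(grandTotal)));
    }
}
```

Calling KeywordPivot.process(df).show() on the question's df would print one row per keyword. Collecting the grand total first keeps the last step a simple column division; a window aggregate over the whole dataset would avoid the extra action, at the cost of moving all rows into a single partition.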