I am trying to sort an RDD by value, and when several values are equal, I need those entries sorted lexicographically by key.
Code:
JavaPairRDD<String, Long> rddToSort = rddMovieReviewReducedByKey.mapToPair(new PairFunction<Tuple2<String, MovieReview>, String, Long>() {
    @Override
    public Tuple2<String, Long> call(Tuple2<String, MovieReview> t) throws Exception {
        return new Tuple2<String, Long>(t._1, t._2.count);
    }
});
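What I have done so far is use takeOrdered with a custom comparator. However, takeOrdered cannot handle a large amount of data, and the job keeps exiting when I run it (it uses more memory than the operating system can provide):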
List<Tuple2<String, Long>> rddSorted = rddMovieReviewReducedByKey.mapToPair(new PairFunction<Tuple2<String, MovieReview>, String, Long>() {
    @Override
    public Tuple2<String, Long> call(Tuple2<String, MovieReview> t) throws Exception {
        return new Tuple2<String, Long>(t._1, t._2.count);
    }
}).takeOrdered(newTopMovies, MapLongValueComparator.VALUE_COMP);
Comparator:
static class MapLongValueComparator implements Comparator<Tuple2<String, Long>>, Serializable {
    private static final long serialVersionUID = 1L;
    private static final MapLongValueComparator VALUE_COMP = new MapLongValueComparator();

    @Override
    public int compare(Tuple2<String, Long> o1, Tuple2<String, Long> o2) {
        if (o1._2.compareTo(o2._2) == 0) {
            return o1._1.compareTo(o2._1);
        }
        return -o1._2.compareTo(o2._2);
    }
}
Error:
16/06/30 21:09:23 INFO scheduler.DAGScheduler: Job 18 failed: takeOrdered at MovieAnalyzer.java:708, took 418.149182 s
How would you sort this RDD? How would you compute TopKMovies, ordering by value and, when values are equal, by key in lexicographic order?
Thanks.
2 Answers
9jyewag0 #1
I solved this by using sortByKey with a comparator and a partition count, after mapping each <String, Long> pair to <Tuple2<String, Long>, Long>.
PairRDD comparator:
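The comparator itself is not shown in this answer. A minimal sketch of what a composite-key comparator of this kind could look like follows, assuming the ordering asked for in the question (count descending, then title ascending). To keep it runnable without Spark on the classpath, `Map.Entry<String, Long>` stands in for `Tuple2<String, Long>`; the names `CompositeKeyOrder`, `CountThenTitle`, and `sortedTitles` are hypothetical and not taken from the original answer.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class CompositeKeyOrder {
    // Hypothetical comparator: higher count first, ties broken by key in lexicographic order.
    // It is Serializable because Spark ships comparators to the executors.
    static class CountThenTitle implements Comparator<Map.Entry<String, Long>>, Serializable {
        private static final long serialVersionUID = 1L;

        @Override
        public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
            int byCount = Long.compare(b.getValue(), a.getValue()); // descending by count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        }
    }

    // Sorts a copy of the input with the comparator and returns only the keys, in order.
    static List<String> sortedTitles(List<Map.Entry<String, Long>> pairs) {
        List<Map.Entry<String, Long>> copy = new ArrayList<>(pairs);
        copy.sort(new CountThenTitle());
        List<String> titles = new ArrayList<>();
        for (Map.Entry<String, Long> e : copy) {
            titles.add(e.getKey());
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", 3L));
        pairs.add(new SimpleEntry<>("a", 3L));
        pairs.add(new SimpleEntry<>("c", 5L));
        System.out.println(sortedTitles(pairs));
    }
}
```

In a Spark job, the same comparison logic would be written against `Tuple2<String, Long>` and passed to `sortByKey(comparator, ascending, numPartitions)` on the re-keyed `JavaPairRDD<Tuple2<String, Long>, Long>`. Because the sort is distributed, no single node has to hold the whole result, which is presumably why this avoided the memory problem seen with takeOrdered.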
daupos2t #2
Have you tried secondary sort in Spark?
Spark secondary sort
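For context, secondary sort in Spark usually means building a composite key, partitioning on part of it, and sorting inside each partition by the full key (`repartitionAndSortWithinPartitions` in the Java API). The sketch below only simulates that idea in plain Java so that it runs without Spark. Entries are bucketed by count range (a stand-in for a range partitioner), each bucket is sorted by count descending and then by title, and the buckets are concatenated from the highest range to the lowest. The class name `SecondarySortSketch`, the `bucketWidth` parameter, and the use of `Map.Entry` in place of `Tuple2` are all assumptions made for illustration, and counts are assumed to be non-negative.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondarySortSketch {
    // Full ordering on the composite (title, count) pair: count descending, then title ascending.
    static final Comparator<Map.Entry<String, Long>> ORDER = (a, b) -> {
        int byCount = Long.compare(b.getValue(), a.getValue());
        return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    };

    static List<Map.Entry<String, Long>> sort(List<Map.Entry<String, Long>> pairs, long bucketWidth) {
        // "Partition" step: group entries by count range, highest range first.
        TreeMap<Long, List<Map.Entry<String, Long>>> buckets = new TreeMap<>(Comparator.reverseOrder());
        for (Map.Entry<String, Long> e : pairs) {
            buckets.computeIfAbsent(e.getValue() / bucketWidth, k -> new ArrayList<>()).add(e);
        }
        // "Sort within partitions" step, then concatenate the buckets in range order.
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        for (List<Map.Entry<String, Long>> bucket : buckets.values()) {
            bucket.sort(ORDER);
            out.addAll(bucket);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("b", 3L));
        pairs.add(new SimpleEntry<>("a", 3L));
        pairs.add(new SimpleEntry<>("c", 15L));
        pairs.add(new SimpleEntry<>("d", 1L));
        for (Map.Entry<String, Long> e : sort(pairs, 10L)) {
            System.out.println(e.getKey() + " " + e.getValue());
        }
    }
}
```

Because every count in a higher bucket is larger than every count in a lower one, sorting each bucket independently and reading the buckets in order gives the same result as one global sort, while each unit of work only touches its own bucket.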