This article collects Java code examples of the org.apache.spark.rdd.RDD.map method and shows how RDD.map is used in practice. The examples are drawn from selected open-source projects hosted on platforms such as GitHub, Stack Overflow, and Maven, so they should serve as a useful reference. Details of the RDD.map method:

Package path: org.apache.spark.rdd.RDD
Class: RDD
Method: map
Description: none provided.
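All of the snippets below share one calling pattern: the Scala RDD.map(f, classTag) overload invoked from Java, where f is a serializable scala.Function1 and the ClassTag describes the result element type. Before the collected examples, here is a minimal, self-contained sketch of that pattern; the class and variable names are illustrative only and are not taken from any of the quoted projects:

import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.RDD;

import scala.reflect.ClassTag;
import scala.reflect.ClassTag$;
import scala.runtime.AbstractFunction1;

public class RddMapFromJava {

    // Scala's RDD.map expects a scala.Function1; it must also be Serializable
    // so that Spark can ship it to the executors.
    static class Length extends AbstractFunction1<String, Integer> implements Serializable {
        @Override
        public Integer apply(String s) {
            return s.length();
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-map-from-java").setMaster("local[*]");
        try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
            RDD<String> words = jsc.parallelize(Arrays.asList("a", "bb", "ccc")).rdd();
            // The second argument is the ClassTag for the result element type;
            // the projects below obtain it through helpers such as SparkUtil.getManifest(...).
            ClassTag<Integer> intTag = ClassTag$.MODULE$.apply(Integer.class);
            RDD<Integer> lengths = words.map(new Length(), intTag);
            System.out.println(lengths.count()); // prints 3
        }
    }
}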
Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
        POPackage physicalOperator) throws IOException {
    SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    // package will generate the group from the result of the local
    // rearrange
    return rdd.map(new PackageFunction(physicalOperator),
            SparkUtil.getManifest(Tuple.class));
}
Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
        PhysicalOperator physicalOperator) throws IOException {
    SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    // call local rearrange to get key and value
    return rdd.map(new LocalRearrangeFunction(physicalOperator),
            SparkUtil.getManifest(Tuple.class));
}
Example source: com.datastax.spark/spark-cassandra-connector-unshaded (the identical snippet also appears in the spark-cassandra-connector, spark-cassandra-connector_2.10, spark-cassandra-connector-java, and spark-cassandra-connector-java_2.10 artifacts)

/**
 * Groups items with the same key, assuming the items with the same key are next to each other in the
 * collection. It does not perform a shuffle, so it is much faster than the far more
 * general Spark RDD `groupByKey`. For this method to be useful with Cassandra tables, the key must
 * represent a prefix of the primary key, containing at least the partition key of the Cassandra
 * table.
 */
public JavaPairRDD<K, Collection<V>> spanByKey(ClassTag<K> keyClassTag) {
    ClassTag<Tuple2<K, Collection<V>>> tupleClassTag = classTag(Tuple2.class);
    ClassTag<Collection<V>> vClassTag = classTag(Collection.class);
    RDD<Tuple2<K, Collection<V>>> newRDD = pairRDDFunctions.spanByKey()
            .map(JavaApiHelper.<K, V, Seq<V>>valuesAsJavaCollection(), tupleClassTag);
    return new JavaPairRDD<>(newRDD, keyClassTag, vClassTag);
}
Example source: com.datastax.spark/spark-cassandra-connector (the identical snippet also appears in the -unshaded, _2.10, -java, and -java_2.10 variants of the artifact)

/**
 * Applies a function to each item, and groups consecutive items having the same value together.
 * Contrary to {@code groupBy}, items from the same group must already be next to each other in the
 * original collection. Works locally on each partition, so items from different partitions will
 * never be placed in the same group.
 */
public <U> JavaPairRDD<U, Iterable<T>> spanBy(final Function<T, U> f, ClassTag<U> keyClassTag) {
    ClassTag<Tuple2<U, Iterable<T>>> tupleClassTag = classTag(Tuple2.class);
    ClassTag<Iterable<T>> iterableClassTag = CassandraJavaUtil.classTag(Iterable.class);
    RDD<Tuple2<U, Iterable<T>>> newRDD = rddFunctions.spanBy(toScalaFunction1(f))
            .map(JavaApiHelper.<U, T, scala.collection.Iterable<T>>valuesAsJavaIterable(), tupleClassTag);
    return new JavaPairRDD<>(newRDD, keyClassTag, iterableClassTag);
}
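To make the "no shuffle" point concrete, here is a hedged, plain-Spark sketch of the same idea: grouping consecutive items with equal keys inside each partition via mapPartitions, so no data moves between partitions. All names are illustrative and this is not the connector's actual implementation:

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// Spark 2.x Java API sketch: group consecutive equal keys inside each partition.
// No shuffle is involved; everything happens locally in mapPartitions.
static JavaRDD<Tuple2<String, List<Integer>>> spanByKeySketch(
        JavaRDD<Tuple2<String, Integer>> input) {
    return input.mapPartitions((Iterator<Tuple2<String, Integer>> it) -> {
        List<Tuple2<String, List<Integer>>> out = new ArrayList<>();
        String currentKey = null;
        List<Integer> currentGroup = null;
        while (it.hasNext()) {
            Tuple2<String, Integer> t = it.next();
            if (currentGroup == null || !t._1().equals(currentKey)) {
                currentKey = t._1();              // a new span starts here
                currentGroup = new ArrayList<>();
                out.add(new Tuple2<>(currentKey, currentGroup));
            }
            currentGroup.add(t._2());
        }
        // The real connector streams lazily; this sketch materializes the partition.
        return out.iterator();
    });
}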
Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
        PODistinct op) throws IOException {
    SparkUtil.assertPredecessorSize(predecessors, op, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    // In a DISTINCT operation, the key is the entire tuple:
    // RDD<Tuple> -> RDD<Tuple2<Tuple, null>>
    RDD<Tuple2<Tuple, Object>> keyValRDD = rdd.map(new ToKeyValueFunction(),
            SparkUtil.<Tuple, Object>getTuple2Manifest());
    PairRDDFunctions<Tuple, Object> pairRDDFunctions
            = new PairRDDFunctions<Tuple, Object>(keyValRDD,
                    SparkUtil.getManifest(Tuple.class),
                    SparkUtil.getManifest(Object.class), null);
    int parallelism = SparkPigContext.get().getParallelism(predecessors, op);
    return pairRDDFunctions.reduceByKey(
            SparkUtil.getPartitioner(op.getCustomPartitioner(), parallelism),
            new MergeValuesFunction())
            .map(new ToValueFunction(), SparkUtil.getManifest(Tuple.class));
}
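The same map-to-pair / reduceByKey / map-back shape can be written directly against the Java API; it is essentially how Spark's own RDD.distinct is implemented. A minimal sketch (names are illustrative; jsc is an existing JavaSparkContext):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// distinct() expressed as key-by-record + reduceByKey + drop-the-dummy-value,
// mirroring the converter above.
JavaRDD<String> input = jsc.parallelize(Arrays.asList("a", "b", "a"));
JavaRDD<String> distinct = input
        .mapToPair(t -> new Tuple2<>(t, (Object) null)) // key = the entire record
        .reduceByKey((l, r) -> l)                       // collapse duplicate keys
        .map(Tuple2::_1);                               // back to plain records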
Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperationUsingElementInputFormat(final GetRDDOfAllElements operation,
        final Context context,
        final AccumuloStore accumuloStore)
        throws OperationException {
    final Configuration conf = getConfiguration(operation);
    addIterators(accumuloStore, conf, context.getUser(), operation);
    final String useBatchScannerRDD = operation.getOption(USE_BATCH_SCANNER_RDD);
    if (Boolean.parseBoolean(useBatchScannerRDD)) {
        InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
    }
    final RDD<Tuple2<Element, NullWritable>> pairRDD = SparkContextUtil
            .getSparkSession(context, accumuloStore.getProperties())
            .sparkContext()
            .newAPIHadoopRDD(conf,
                    ElementInputFormat.class,
                    Element.class,
                    NullWritable.class);
    return pairRDD.map(new FirstElement(), ELEMENT_CLASS_TAG);
}
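FirstElement here simply projects each Tuple2<Element, NullWritable> onto its first component. With the Java API the same projection is a one-liner; a hedged sketch reusing the identifiers from the snippet above (jsc is an assumed existing JavaSparkContext):

import org.apache.hadoop.io.NullWritable;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Equivalent sketch via JavaSparkContext instead of the Scala SparkContext
// (conf, ElementInputFormat, and Element as in the snippet above):
JavaPairRDD<Element, NullWritable> pairs = jsc.newAPIHadoopRDD(
        conf, ElementInputFormat.class, Element.class, NullWritable.class);
JavaRDD<Element> elements = pairs.keys(); // same projection as map(new FirstElement(), ...)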
Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperation(final GetRDDOfElements operation,
        final Context context,
        final AccumuloStore accumuloStore)
        throws OperationException {
    final Configuration conf = getConfiguration(operation);
    final SparkContext sparkContext = SparkContextUtil.getSparkSession(context, accumuloStore.getProperties()).sparkContext();
    sparkContext.hadoopConfiguration().addResource(conf);
    // Use the batch scan option when performing a seeded operation
    InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
    addIterators(accumuloStore, conf, context.getUser(), operation);
    addRanges(accumuloStore, conf, operation);
    final RDD<Tuple2<Element, NullWritable>> pairRDD = sparkContext.newAPIHadoopRDD(conf,
            ElementInputFormat.class,
            Element.class,
            NullWritable.class);
    return pairRDD.map(new FirstElement(), ClassTagConstants.ELEMENT_CLASS_TAG);
}
Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POSort sortOperator)
        throws IOException {
    SparkUtil.assertPredecessorSize(predecessors, sortOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    int parallelism = SparkPigContext.get().getParallelism(predecessors, sortOperator);
    RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyValueFunction(),
            SparkUtil.<Tuple, Object>getTuple2Manifest());
    JavaPairRDD<Tuple, Object> r = new JavaPairRDD<Tuple, Object>(rddPair,
            SparkUtil.getManifest(Tuple.class),
            SparkUtil.getManifest(Object.class));
    JavaPairRDD<Tuple, Object> sorted = r.sortByKey(
            sortOperator.getMComparator(), true, parallelism);
    JavaRDD<Tuple> mapped = sorted.mapPartitions(TO_VALUE_FUNCTION);
    return mapped.rdd();
}
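The pattern is again map-to-pair, operate on the keys, then map back: each record becomes a (record, null) pair so that sortByKey can order it. A hedged plain-Java sketch of the same shape (illustrative names; jsc is an assumed existing JavaSparkContext):

import java.util.Arrays;

import org.apache.spark.api.java.JavaRDD;

import scala.Tuple2;

// Sorting by pairing each record with itself as the key.
JavaRDD<Integer> data = jsc.parallelize(Arrays.asList(3, 1, 2));
JavaRDD<Integer> sorted = data
        .mapToPair(x -> new Tuple2<>(x, (Object) null)) // key = the record itself
        .sortByKey(true, 2)                             // ascending, 2 partitions
        .keys();                                        // drop the dummy values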
Example source: uk.gov.gchq.gaffer/spark-accumulo-library

private RDD<Element> doOperation(final GetRDDOfElementsInRanges operation,
        final Context context,
        final AccumuloStore accumuloStore)
        throws OperationException {
    final Configuration conf = getConfiguration(operation);
    final SparkContext sparkContext = SparkContextUtil.getSparkSession(context, accumuloStore.getProperties()).sparkContext();
    sparkContext.hadoopConfiguration().addResource(conf);
    // Use the batch scan option when performing a seeded operation
    InputConfigurator.setBatchScan(AccumuloInputFormat.class, conf, true);
    addIterators(accumuloStore, conf, context.getUser(), operation);
    addRangesFromPairs(accumuloStore, conf, operation);
    final RDD<Tuple2<Element, NullWritable>> pairRDD = sparkContext.newAPIHadoopRDD(conf,
            ElementInputFormat.class,
            Element.class,
            NullWritable.class);
    return pairRDD.map(new FirstElement(), ClassTagConstants.ELEMENT_CLASS_TAG);
}
Example source: org.apache.pig/pig

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POSampleSortSpark sortOperator)
        throws IOException {
    SparkUtil.assertPredecessorSize(predecessors, sortOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyValueFunction(),
            SparkUtil.<Tuple, Object>getTuple2Manifest());
    JavaPairRDD<Tuple, Object> r = new JavaPairRDD<Tuple, Object>(rddPair,
            SparkUtil.getManifest(Tuple.class),
            SparkUtil.getManifest(Object.class));
    // sort the sample data
    JavaPairRDD<Tuple, Object> sorted = r.sortByKey(true);
    // convert every element of the sample data from (element) to (all, element) format
    JavaPairRDD<String, Tuple> mapped = sorted.mapPartitionsToPair(new AggregateFunction());
    // use groupByKey to aggregate all values; the result has the form ((all), {(sampleEle1), (sampleEle2), ...})
    JavaRDD<Tuple> groupByKey = mapped.groupByKey().map(new ToValueFunction());
    return groupByKey.rdd();
}
Example source: org.apache.pig/pig

private JavaRDD<Tuple2<IndexedKey, Tuple>> handleSecondarySort(
        RDD<Tuple> rdd, POReduceBySpark op, int parallelism) {
    RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
            SparkUtil.<Tuple, Object>getTuple2Manifest());
    JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
            SparkUtil.getManifest(Tuple.class),
            SparkUtil.getManifest(Object.class));
    // first sort the tuples by the secondary key if useSecondaryKey sort is enabled
    JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
            new HashPartitioner(parallelism),
            new PigSecondaryKeyComparatorSpark(op.getSecondarySortOrder()));
    JavaRDD<Tuple> jrdd = sorted.keys();
    JavaRDD<Tuple2<IndexedKey, Tuple>> jrddPair = jrdd.map(new ToKeyValueFunction(op));
    return jrddPair;
}
Example source: org.apache.pig/pig

private JavaRDD<Tuple2<IndexedKey, Tuple>> handleSecondarySort(
        RDD<Tuple> rdd, POGlobalRearrangeSpark op, int parallelism) {
    RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
            SparkUtil.<Tuple, Object>getTuple2Manifest());
    JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
            SparkUtil.getManifest(Tuple.class),
            SparkUtil.getManifest(Object.class));
    // first sort the tuples by the secondary key if useSecondaryKey sort is enabled
    JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
            new HashPartitioner(parallelism),
            new PigSecondaryKeyComparatorSpark(op.getSecondarySortOrder()));
    JavaRDD<Tuple> jrdd = sorted.keys();
    JavaRDD<Tuple2<IndexedKey, Tuple>> jrddPair = jrdd.map(new ToKeyValueFunction(op));
    return jrddPair;
}
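Both variants rely on repartitionAndSortWithinPartitions: partition by the "natural" part of the key while ordering each partition by the full composite key, which yields a secondary sort during the shuffle itself. A hedged, self-contained sketch of that technique with illustrative types (not Pig's actual classes):

import java.io.Serializable;
import java.util.Comparator;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;

import scala.Tuple2;

// Keys are (group, seq) pairs; partition only by group, sort by (group, seq).
static class GroupPartitioner extends Partitioner {
    private final int partitions;
    GroupPartitioner(int partitions) { this.partitions = partitions; }
    @Override public int numPartitions() { return partitions; }
    @Override public int getPartition(Object key) {
        @SuppressWarnings("unchecked")
        Tuple2<String, Integer> k = (Tuple2<String, Integer>) key;
        return Math.floorMod(k._1().hashCode(), partitions); // hash only the group part
    }
}

// The comparator must be Serializable so Spark can ship it with the shuffle.
static class CompositeComparator implements Comparator<Tuple2<String, Integer>>, Serializable {
    @Override public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
        int byGroup = a._1().compareTo(b._1());
        return byGroup != 0 ? byGroup : Integer.compare(a._2(), b._2());
    }
}

// Within each partition, records now arrive grouped by `group` and ordered by `seq`.
static JavaPairRDD<Tuple2<String, Integer>, String> secondarySort(
        JavaPairRDD<Tuple2<String, Integer>, String> input, int parallelism) {
    return input.repartitionAndSortWithinPartitions(
            new GroupPartitioner(parallelism), new CompositeComparator());
}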