How to handle Map<key, value> in Spark?

jaxagkaj, posted 2021-05-29 in Hadoop

I am new to Spark programming and I am trying to count how many times each string occurs in a file for a given key. Here is my input:

-------------
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::
2017-04-13 15:57:19.696::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:23.190::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CALP0005::CALPOL 500MG TAB::110::0::
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:15.153::ProductSelectPanel::1121::NO_STOCK_PRODUCT::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0::0::
2017-04-13 15:57:19.696::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
2017-04-13 15:57:23.190::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CALP0005::CALPOL 500MG TAB::110::0::
2017-04-13 15:56:57.147::ProductSelectPanel::1291::PRODUCT_SALE_ENTRY::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::5::0::
2017-04-13 15:57:01.008::ProductSelectPanel::1599::PRODUCT_SALE_WITH_BARCODE::INAPHYD00124::1::CROC0008::CROCIN 120MG 60ML SYP::4::1::1013065197
2017-04-13 15:57:09.182::ProductSelectPanel::1118::ALTERNATIVE_PRODUCT_ENTRY::INAPHYD00124::1::CROC0005::CROCIN 500MG TAB::0
.......

My Spark program looks like this:

// Split each log line on "::" and keep the event type (index 3) and product code (index 6)
final Function<String, List<String>> LINE_MAPPER = new Function<String, List<String>>() {
    @Override
    public List<String> call(String line) throws Exception {
        String[] lineArray = line.split("::");
        return Arrays.asList(lineArray[3], lineArray[6]);
    }
};

// Pair a word with an initial count of 1
final PairFunction<String, String, Integer> word_paper = new PairFunction<String, String, Integer>() {
    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, Integer.valueOf(1));
    }
};

JavaRDD<List<String>> javaRDD = lineRDD.map(LINE_MAPPER);

After the map transformation I am getting output like this:

[[PRODUCT_SALE_ENTRY,CROC0008],[NO_STOCK_PRODUCT,CROC0005],[NO_STOCK_PRODUCT,CROC0005],[PRODUCT_SALE_WITH_BARCODE,CROC0008],[PRODUCT_SALE_WITH_BARCODE,CROC0005],[PRODUCT_SALE_WITH_BARCODE,CROC003],....]

but I want the result to look like this:
[[NO_STOCK_PRODUCT,[CROC0005,4]],[PRODUCT_SALE_WITH_BARCODE,[CROC0008,2]],[PRODUCT_SALE_WITH_BARCODE,[CROC0005,1]],....]

Please help me. Thanks in advance.


a7qyws3x1#

Thank you DNA, it works great.
Finally, my code looks like this:

// Map each log line to a (event type, product code) pair
JavaPairRDD<String, String> keyValuePairs = lineRDD.mapToPair(obj -> {
    String[] split = obj.split("::");
    return new Tuple2<String, String>(split[3], split[6]);
});

// Use the (event type, product code) pair as a composite key with an initial count of 1
JavaPairRDD<Tuple2<String, String>, Integer> newRDD = keyValuePairs.mapToPair(obj ->
        new Tuple2<Tuple2<String, String>, Integer>(new Tuple2<>(obj._1, obj._2), 1));

// Sum the counts for each composite key
JavaPairRDD<Tuple2<String, String>, Integer> result = newRDD.reduceByKey((v1, v2) -> v1 + v2);

// Write lines of the form "productCode<TAB>count<TAB>eventType"
result.map(f -> f._1._2() + "\t" + f._2() + "\t" + f._1._1())
      .saveAsTextFile("file:///home/charan/offlinefiles/result");
System.out.println("result :" + result.take(10));

and the output is:

CROC0005    620 NO_STOCK_PRODUCT
CROC2107    15  PRODUCT_SALE_ENTRY
CROC2120    7   NO_STOCK_PRODUCT
CROC0229    2   NO_STOCK_PRODUCT
CROC0009    1   NO_STOCK_PRODUCT
CROC0005    1250    ALTERNATIVE_PRODUCT_ENTRY
CROC2302    2   ALTERNATIVE_PRODUCT_ENTRY
CROC2807    5   PRODUCT_SALE_ENTRY
CROC0213    2   ALTERNATIVE_PRODUCT_ENTRY
CROC20221   18  ALTERNATIVE_PRODUCT_ENTRY

pw9qyyiw2#

It looks like you need to treat each key + string pair as a composite key and count the occurrences of that composite key.
You could use countByValue() (see the Javadoc). However, as the documentation says:
Note that this method should only be used if the resulting map is expected to be small, as the whole thing is loaded into the driver's memory. To handle very large results, consider using rdd.map(x => (x, 1L)).reduceByKey(_ + _)...
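
For a small result set, the countByValue() route could look roughly like this in Java (just a sketch, assuming lineRDD is the JavaRDD<String> of raw log lines from the question):

import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Build composite (event type, product code) keys from each log line
JavaPairRDD<String, String> pairs = lineRDD.mapToPair(line -> {
    String[] parts = line.split("::");
    return new Tuple2<String, String>(parts[3], parts[6]);
});

// countByValue() collects the whole result map into the driver,
// so only use it when the number of distinct pairs is small
Map<Tuple2<String, String>, Long> counts = pairs.countByValue();
counts.forEach((pair, count) ->
        System.out.println(pair._1() + " / " + pair._2() + " -> " + count));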
So, for a large result, just map each of your values (e.g. [PRODUCT_SALE_ENTRY, CROC0008]) to a pair of the form ((PRODUCT_SALE_ENTRY, CROC0008), 1L), and then reduceByKey() (example here).
I have only done this in Scala, not Java; I think you may need to use mapToPair(), e.g. as shown here. This will give an RDD of the form:

((NO_STOCK_PRODUCT,CROC0005), 4),
((PRODUCT_SALE_WITH_BARCODE,CROC0008), 2),
((PRODUCT_SALE_WITH_BARCODE,CROC0005), 1),
...

which is quite close to what you asked for.
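
And if you want exactly the (key, (value, count)) shape from your question, one more mapToPair() can reshape the reduced result. A small sketch, assuming result is the composite-key JavaPairRDD<Tuple2<String, String>, Integer> left after reduceByKey():

// Reshape ((eventType, productCode), count) into (eventType, (productCode, count)),
// i.e. the [[NO_STOCK_PRODUCT, [CROC0005, 4]], ...] form asked for above
JavaPairRDD<String, Tuple2<String, Integer>> reshaped = result.mapToPair(entry ->
        new Tuple2<String, Tuple2<String, Integer>>(
                entry._1()._1(),                                            // event type
                new Tuple2<String, Integer>(entry._1()._2(), entry._2()))); // (product code, count)
System.out.println(reshaped.take(10));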
