Insert data from a PySpark DataFrame into another Cassandra table using PySpark

hgqdbh6s posted on 2021-06-13 in Cassandra


I have a Cassandra table, test:

+----+---------+---------+
| id | country | counter |
+====+=========+=========+
|  A |      RU |       1 |
+----+---------+---------+
|  B |      EN |       2 |
+----+---------+---------+
|  C |      IQ |       1 |
+----+---------+---------+
|  D |      RU |       3 |
+----+---------+---------+

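I also have a table main in the same keyspace, with columns "country_main" and "main_id". The main_id column contains the same ids as the test table, plus some unique ids. The country_main column is either empty or holds the same values as in test. For example: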

+---------+--------------+---------+
| main_id | country_main |      ...|
+=========+==============+=========+
|  A      |              |      ...|
+---------+--------------+---------+
|  B      |      EN      |      ...|
+---------+--------------+---------+
|  Y      |      IQ      |      ...|
+---------+--------------+---------+
|  Z      |      RU      |      ...|
+---------+--------------+---------+

How can I use PySpark to insert data from the test table into main, filling the empty values in country_main according to the ids?

cassandra apache-spark pyspark spark-cassandra-connector

Source: https://stackoverflow.com/questions/61343332/insert-data-from-pyspark-dataframe-to-another-cassandra-table-using-pyspark

1 answer


au9on6nz1#

Given the following schema and data:

create table test.ct1 (
  id text primary key,
  country text,
  cnt int);

insert into test.ct1(id, country, cnt) values('A', 'RU', 1);
insert into test.ct1(id, country, cnt) values('B', 'EN', 2);
insert into test.ct1(id, country, cnt) values('C', 'IQ', 1);
insert into test.ct1(id, country, cnt) values('D', 'RU', 3);

create table test.ct2 (
  main_id text primary key,
  country_main text,
  cnt int);

insert into test.ct2(main_id, cnt) values('A', 1);
insert into test.ct2(main_id, country_main, cnt) values('B', 'EN', 2);
insert into test.ct2(main_id, country_main, cnt) values('C', 'IQ', 1);
insert into test.ct2(main_id, country_main, cnt) values('D', 'RU', 3);

this can be done as follows:

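from pyspark.sql.functions import col

# `spark` is the SparkSession (predefined in the pyspark shell)
# Source table: ids with their country values
ct1 = spark.read.format("org.apache.spark.sql.cassandra")\
   .option("table", "ct1").option("keyspace", "test").load()

# Target table: keep only the rows whose country_main is still null
ct2 = spark.read.format("org.apache.spark.sql.cassandra")\
  .option("table", "ct2").option("keyspace", "test").load()\
  .where(col("country_main").isNull())

# Join on id and rename country to match the target column
res = ct1.join(ct2, ct1.id == ct2.main_id).select(col("main_id"),
  col("country").alias("country_main"))

# Write only main_id and country_main back to ct2
res.write.format("org.apache.spark.sql.cassandra")\
   .option("table", "ct2").option("keyspace", "test")\
   .mode("append").save()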

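What the code does:
1. Selects all rows from ct2 (which corresponds to your main table) where country_main is null;
2. Joins them with ct1 (which corresponds to your test table) to get the value of country from it. A possible optimization is to select only the necessary columns from both tables. Also note that the join is performed by Spark, not at the Cassandra level: Cassandra-level joins will be supported only in the upcoming version of the Spark Cassandra Connector (3.0, whose alpha version has already been published);
3. Renames the columns to match the ct2 table;
4. Writes the data back.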
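The optimization mentioned in step 2 (reading only the necessary columns) could be sketched as below. This is a minimal sketch, not the answer's own code: the helper names read_table and fill_missing_countries are hypothetical, the keyspace and table names come from the example schema, and `spark` is assumed to be a SparkSession already configured with the Spark Cassandra Connector.

```python
CASSANDRA_FORMAT = "org.apache.spark.sql.cassandra"


def read_table(spark, keyspace, table):
    """Load one Cassandra table as a DataFrame (hypothetical helper)."""
    return (spark.read.format(CASSANDRA_FORMAT)
            .option("keyspace", keyspace)
            .option("table", table)
            .load())


def fill_missing_countries(spark, keyspace="test"):
    """Fill null country_main values in ct2 from ct1 (hypothetical helper)."""
    # Keep only the columns that the join and the write actually need.
    ct1 = read_table(spark, keyspace, "ct1").select("id", "country")
    ct2 = read_table(spark, keyspace, "ct2").select("main_id", "country_main")

    # Rows of ct2 that still lack a country; only the key is needed from here on.
    missing = ct2.where(ct2["country_main"].isNull()).select("main_id")

    # The join is executed by Spark; country is renamed to the target column.
    res = (missing.join(ct1, missing["main_id"] == ct1["id"])
           .select(missing["main_id"], ct1["country"].alias("country_main")))

    # Append mode writes just main_id and country_main for the matched rows.
    (res.write.format(CASSANDRA_FORMAT)
        .option("keyspace", keyspace)
        .option("table", "ct2")
        .mode("append")
        .save())
    return res
```

From a pyspark shell this would be invoked as fill_missing_countries(spark). It is intended to write the same rows as the original snippet while reading fewer columns from Cassandra.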
Result:

cqlsh> select * from test.ct2;

 main_id | cnt | country_main
---------+-----+--------------
       C |   1 |           IQ
       B |   2 |           EN
       A |   1 |           RU
       D |   3 |           RU

for the source data:

cqlsh> select * from test.ct2;

 main_id | cnt | country_main
---------+-----+--------------
       C |   1 |           IQ
       B |   2 |           EN
       A |   1 |         null
       D |   3 |           RU
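To check the expected outcome without a cluster, the same logic can be modeled in plain Python on the sample rows. This is only an illustrative sketch: the dictionaries and the functions rows_to_write and apply_upsert are hypothetical stand-ins, and the second function models the fact that a Cassandra write is an upsert, so columns that are not written (here cnt) keep their values.

```python
# Sample rows mirroring the data inserted into test.ct1 and test.ct2.
ct1 = {
    "A": {"country": "RU", "cnt": 1},
    "B": {"country": "EN", "cnt": 2},
    "C": {"country": "IQ", "cnt": 1},
    "D": {"country": "RU", "cnt": 3},
}
ct2 = {
    "A": {"country_main": None, "cnt": 1},
    "B": {"country_main": "EN", "cnt": 2},
    "C": {"country_main": "IQ", "cnt": 1},
    "D": {"country_main": "RU", "cnt": 3},
}


def rows_to_write(source, target):
    """Inner join on id, restricted to target rows whose country_main is null."""
    return {
        main_id: source[main_id]["country"]
        for main_id, row in target.items()
        if row["country_main"] is None and main_id in source
    }


def apply_upsert(target, updates):
    """Model an upsert: only the written column changes, the rest is kept."""
    result = {main_id: dict(row) for main_id, row in target.items()}
    for main_id, country in updates.items():
        result[main_id]["country_main"] = country
    return result


updates = rows_to_write(ct1, ct2)
filled = apply_upsert(ct2, updates)
print(updates)      # {'A': 'RU'}
print(filled["A"])  # {'country_main': 'RU', 'cnt': 1}
```

Only row A is rewritten, which matches the difference between the two cqlsh outputs above.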

Posted 2021-06-14
