sql—如何基于不同列中数组元素的精确计数来选择行

nzkunb0c  于 2021-05-27  发布在  Spark
关注(0)|答案(0)|浏览(227)

假设我有一个这样的Dataframe B_C 是col的concat吗 B 还有上校 C ,和列 selected_B_C 是一个由几个 B_C 组内的列。

+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|A          |grp_count_A|B       |C       |B_C              |D       |selected_B_C                          |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+
|1          |6          |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1          |6          |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1          |6          |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1          |6          |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|
|1          |6          |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|
|1          |6          |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|
|2          |4          |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+

我想附加一列 selected 它需要一个值 1 ,如果是一行,列 B_C 在col中找到 selected_B_C ,否则 0 ,因此最终的Dataframe如下所示。

+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|A          |grp_count_A|B       |C       |B_C              |D       |selected_B_C                          |selected|
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+
|1          |6          |30261.41|20081201|30261.41_20081201|99945.83|[30261.41_20091201, 39879.85_20080601]|0       |
|1          |6          |30261.41|20091201|30261.41_20091201|99945.83|[30261.41_20091201, 39879.85_20080601]|1       |
|1          |6          |39879.85|20080601|39879.85_20080601|99945.83|[30261.41_20091201, 39879.85_20080601]|1       |
|1          |6          |69804.42|20080117|69804.42_20080117|99945.83|[30261.41_20091201, 39879.85_20080601]|0       |
|1          |6          |99950.3 |20090301|99950.3_20090301 |99945.83|[30261.41_20091201, 39879.85_20080601]|0       |
|1          |6          |99999.23|20080118|99999.23_20080118|99945.83|[30261.41_20091201, 39879.85_20080601]|0       |
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1       |
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|1       |
|2          |4          |76498.0 |20150501|76498.0_20150501 |183600.0|[[76498.0_20150501, 76498.0_20150501]]|0       |
|2          |4          |351378.0|20180620|351378.0_20180620|183600.0|[[76498.0_20150501, 76498.0_20150501]]|0       |
+-----------+-----------+--------+--------+-----------------+--------+--------------------------------------+--------+

对col来说最棘手的部分 selected 我只需要一个值在 selected_B_C 有价值 1 对于选定的
例如在组中 2 ,即使有3条记录的值为 76498.0_20150501 对于col B_C ,我只想要两张组记录 2 谁的价值是 76498.0_20150501 有价值 1 为了 selected ,作为 selected_B_C 对于组 2 正好有2个元素有值 76498.0_20150501 在col中 selected_B_C

暂无答案!

目前还没有任何答案,快来回答吧!

相关问题