在Java数据集中将多个列转置为行

9udxz4iz  于 2023-04-04  发布在  Java
关注(0)|答案(2)|浏览(104)

我在spark中有一个数据集,它的值如下
| 组织标识|用户标识|呼叫标识|音频带宽|视频_bw|分享体重|
| --------------|--------------|--------------|--------------|--------------|--------------|
| 1|1|1|第二章|三|四|
我能够使用多个SQL查询将其转换为以下内容,但希望使用代码,如果可能的话使用flatMap
| 组织标识|用户标识|呼叫标识|类型|bw|
| --------------|--------------|--------------|--------------|--------------|
| 1|1|1|声频|第二章|
| 1|1|1|视频|三|
| 1|1|1|份额|四|

4xrmg8kj

4xrmg8kj1#

假设你的数据集名为data,这是:

data = data.select(
  col("org_id"), 
  col("user_id"), 
  col("call_id"),
  expr("stack(3, 'audio', audio_bw, 'video', video_bw, 'share', share_bw) as (type, bw)")
)

做你想要的,最终的输出表:

+------+-------+-------+-----+---+
|org_id|user_id|call_id| type| bw|
+------+-------+-------+-----+---+
|     1|      1|      1|audio|  2|
|     1|      1|      1|video|  3|
|     1|      1|      1|share|  4|
+------+-------+-------+-----+---+

关于stack的更多信息可以在这里找到!

编辑:

如果您有多个要堆叠的列:

data
  .select(
    col("org_id"),
    col("user_id"),
    col("call_id"),
    // concatinate your data
    expr("concat_ws(' - ', audio_rx_bw, audio_tx_bw)").as("audio"),
    expr("concat_ws(' - ', video_rx_bw, video_tx_bw)").as("video"),
    expr("concat_ws(' - ', share_rx_bw, share_tx_bw)").as("share")
  )
  .select(
    col("org_id"),
    col("user_id"),
    col("call_id"),
    // stack again, but this time for the concatenated values
    expr("stack(3, 'audio', audio, 'video', video, 'share', share) as (type, data)")
  )
  // once done with stacking, get the data in your desired format
  .withColumn("rx_bw", split(col("data"), " - ")(0))
  .withColumn("tx_bw", split(col("data"), " - ")(1))
  .drop("data")

输入:

+------+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+
|org_id|user_id|call_id|audio_rx_bw|audio_tx_bw|video_rx_bw|video_tx_bw|share_rx_bw|share_tx_bw|
+------+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+
|     1|      1|      1|          2|          3|          4|          2|          4|          5|
+------+-------+-------+-----------+-----------+-----------+-----------+-----------+-----------+

输出:

+------+-------+-------+-----+-----+-----+
|org_id|user_id|call_id| type|rx_bw|tx_bw|
+------+-------+-------+-----+-----+-----+
|     1|      1|      1|audio|    2|    3|
|     1|      1|      1|video|    4|    2|
|     1|      1|      1|share|    4|    5|
+------+-------+-------+-----+-----+-----+

对于此输入:

2skhul33

2skhul332#

我使用了以下查询,我的问题得到了解决

select 

   timestamp, organization_id, cluster_safe_name, node_name, correlation_id, is_cascade, cascade_peer_reachability, 

   stack(4, 'audio', audio_main_bitrate, 'audio_slides', audio_slides_bitrate, 'video', video_main_bitrate, 'share', video_slides_bitrate) as (call_type, rtp_bytes) 
from (
    select 
      timestamp, organization_id, cluster_safe_name, node_name, correlation_id, is_cascade, cascade_peer_reachability, 
      audio_main_rx_bitrate || '/' || audio_main_tx_bitrate as audio_main_bitrate, audio_slides_rx_bitrate  || '/' || audio_slides_tx_bitrate as audio_slides_bitrate, 
      video_main_rx_bitrate  || '/' || video_main_tx_bitrate as video_main_bitrate, video_slides_rx_bitrate  || '/' ||  video_slides_tx_bitrate as video_slides_bitrate 
    from homer_cascade_bandwidth_usage_metrics_v2 
    where timestamp >= '2022-07-20 10:20:00' and timestamp < '2022-07-20 11:00:00' 
) t

ORDER BY correlation_id

相关问题