This question already has an answer here:
Adding a nested column to a Spark DataFrame (1 answer)
Closed last year.
I have two DataFrames as follows:
df1
+----------------------+---------+
|products |visitorId|
+----------------------+---------+
|[[i1,0.68], [i2,0.42]]|v1 |
|[[i1,0.78], [i3,0.11]]|v2 |
+----------------------+---------+
df2
+---+----------+
| id| name|
+---+----------+
| i1|Nike Shoes|
| i2| Umbrella|
| i3| Jeans|
+---+----------+
Here is the schema of DataFrame df1:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
|-- visitorId: string (nullable = true)
I want to join these two DataFrames so that the output is:
+------------------------------------------+---------+
|products |visitorId|
+------------------------------------------+---------+
|[[i1,0.68,Nike Shoes], [i2,0.42,Umbrella]]|v1 |
|[[i1,0.78,Nike Shoes], [i3,0.11,Jeans]] |v2 |
+------------------------------------------+---------+
Here is the schema I expect for the output:
root
|-- products: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- interest: double (nullable = true)
 |    |    |-- name: string (nullable = true)
|-- visitorId: string (nullable = true)
How can I do this in Scala? I am using Spark 2.2.0.
Update
I exploded and joined the DataFrames above and got the output below:
+---------+---+--------+----------+
|visitorId| id|interest| name|
+---------+---+--------+----------+
| v1| i1| 0.68|Nike Shoes|
| v1| i2| 0.42| Umbrella|
| v2| i1| 0.78|Nike Shoes|
| v2| i3| 0.11| Jeans|
+---------+---+--------+----------+
Now I just need the DataFrame above in the following JSON format:
{
"visitorId": "v1",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.68
}, {
"id": "i2",
"name": "Umbrella",
"interest": 0.42
}]
},
{
"visitorId": "v2",
"products": [{
"id": "i1",
"name": "Nike Shoes",
"interest": 0.78
}, {
"id": "i3",
"name": "Jeans",
"interest": 0.11
}]
}
2 Answers
5m1hhzi41#
It depends on your case, but if the df2 lookup table is small enough, you can try collecting it as a Scala Map and using it inside a UDF. Then it becomes simply:
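A minimal sketch of that idea, assuming `df1` and `df2` are as shown in the question (the `Product` case class and UDF name are illustrative, not from the original answer):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Target element type for the enriched array; field names match the desired schema
case class Product(id: String, interest: Double, name: String)

// df2 is small, so collect it to the driver as an id -> name lookup Map
val nameMap: Map[String, String] =
  df2.collect().map(r => r.getString(0) -> r.getString(1)).toMap

// Rebuild each struct in the array, appending the looked-up name
// (missing ids fall back to null rather than failing)
val addName = udf { products: Seq[Row] =>
  products.map { p =>
    val id = p.getString(0)
    Product(id, p.getDouble(1), nameMap.getOrElse(id, null))
  }
}

val result = df1.withColumn("products", addName($"products"))
result.show(false)
```

The Map is captured in the UDF's closure and shipped with the task, which is fine for a lookup table of this size; for a larger table you would keep it as a DataFrame and join instead.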
3pvhb19x2#
Try this.
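A sketch that starts from the exploded-and-joined DataFrame already shown in the question's update, regroups it per visitor, and emits the requested JSON (assuming `spark.implicits._` is in scope; untested against Spark 2.2 specifically):

```scala
import org.apache.spark.sql.functions.{explode, collect_list, struct}

// Explode the products array and flatten the struct fields
val exploded = df1
  .withColumn("product", explode($"products"))
  .select($"visitorId", $"product.id", $"product.interest")

// Join with the lookup table on id; this reproduces the flat
// visitorId/id/interest/name DataFrame from the question's update
val joined = exploded.join(df2, Seq("id"))

// Regroup the rows into one array of structs per visitor
val result = joined
  .groupBy($"visitorId")
  .agg(collect_list(struct($"id", $"name", $"interest")).as("products"))

// One JSON document per row, in the shape the question asks for
result.toJSON.show(false)
```

`toJSON` returns a `Dataset[String]`, so you can also write it out directly with `result.write.json(path)` if you want files rather than strings.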