我有这样的数据:
| date | id | name | version | assetIDs |
|------------| -------------| -----------|-----------|---------------------------------------------------------------------|
| 2021-02-11 | com.example1 | example1 | 1.2.3 |["4er6d99j","43mdmdm","234ds234","w23432s","sdfdsfds","32erwer"] |
| 2021-02-11 | com.example2 | example2 | 2.3.4 |["3er6d99j","43mdmdm","MMMM234","werewer","MMM3333","dssdsssM"] |
| 2021-02-11 | com.example3 | example3 | 2.3.4 |["3er6d99j","43mdmdm","MMMM234","YYYY2222","mmmm3444","yy2222"] |
| 2021-02-11 | com.example4 | example4 | 2.3.4 |["3er6d99j","43mdmdm","MMMM234","222sdsss","fffff3333","ffffffff"] |
| 2021-02-11 | com.example5 | example5 | 2.3.4 |["3er6d99j","43mdmdm","MMMM234","Y222222","YYYYJJJJ2222","DJDDJSJ2"] |
| 2021-02-11 | com.example6 | example6 | 2.3.4 |["3er6d99j","43mdmdm","MMMM234","Ydddddd","ssdfdfdfd","sdfdwsfw"] |
我需要能有两个以上的列一 lowerCaseCount
,一个 upperCaseCount
基于中的字符串 assetIDs
.
最终结果将是两个额外的列,每个id中包含小写和大写字符串。
我一开始是这样做的,但这并没有给我预期的计数:
with data as (
select date, id, name, version, explode(assetIDs) as assetids
from assets_table
)
select dt.data, dt.id, dt.name dt.version, count(UPPER(assetids)) as upperCount, count(LOWER(assetids)) as lowerCount, as.assetids
from data dt
inner join assets_table as on dt.id = as.id
group by 1,2,3,4,7
这只是返回我的字符串计数,所以总金额在大写和小写列
请有人能帮我做到这一点最好的方式,我可以在scala,python或/和使用sql做这件事,因为我正在使用databricks。谢谢
2条答案
按热度按时间y1aodyip1#
你可以合并
size
以及filter
获取计数的函数:或使用Dataframeapi:
ghhaqwfi2#
可以进行条件聚合。我不知道你想如何处理大小写混合的字符串。
另一种计算方法: