如何计算给定id的数组列中有多少大小写字符串

luaexgnf  于 2021-07-13  发布在  Spark
关注(0)|答案(2)|浏览(288)

我有这样的数据:

|  date      |     id       |    name    |  version  |                              assetIDs                               |
|------------| -------------| -----------|-----------|---------------------------------------------------------------------| 
| 2021-02-11 | com.example1 |   example1 |   1.2.3   |["4er6d99j","43mdmdm","234ds234","w23432s","sdfdsfds","32erwer"]    |
| 2021-02-11 | com.example2 |   example2 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","werewer","MMM3333","dssdsssM"]      |
| 2021-02-11 | com.example3 |   example3 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","YYYY2222","mmmm3444","yy2222"]      |
| 2021-02-11 | com.example4 |   example4 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","222sdsss","fffff3333","ffffffff"]   |
| 2021-02-11 | com.example5 |   example5 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Y222222","YYYYJJJJ2222","DJDDJSJ2"] |
| 2021-02-11 | com.example6 |   example6 |   2.3.4   |["3er6d99j","43mdmdm","MMMM234","Ydddddd","ssdfdfdfd","sdfdwsfw"]    |

我需要能有两个以上的列一 lowerCaseCount ,一个 upperCaseCount 基于中的字符串 assetIDs .
最终结果将是两个额外的列,每个id中包含小写和大写字符串。
我一开始是这样做的,但这并没有给我预期的计数:

with data as (
      select date, id, name, version, explode(assetIDs) as assetids 
      from assets_table 
)

select dt.data, dt.id, dt.name dt.version, count(UPPER(assetids)) as upperCount,  count(LOWER(assetids)) as lowerCount, as.assetids
        from data dt
        inner join assets_table as on dt.id = as.id
        group by 1,2,3,4,7

这只是返回我的字符串计数,所以总金额在大写和小写列
请有人能帮我做到这一点最好的方式,我可以在scala,python或/和使用sql做这件事,因为我正在使用databricks。谢谢

y1aodyip

y1aodyip1#

你可以合并 size 以及 filter 获取计数的函数:

spark.sql("""
     SELECT *, 
            size(filter(assetIDs, x -> x = lower(x))) as lowerCaseCount,
            size(filter(assetIDs, x -> x = upper(x))) as upperCaseCount
     FROM   assets_table
""").show(truncate=False)

# +----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

# |date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|

# +----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

# |2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |

# |2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |3             |2             |

# |2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |

# |2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |

# |2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |

# |2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |4             |1             |

# +----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

或使用Dataframeapi:

from pyspark.sql import functions as F

df1 = df.withColumn("lowerCaseCount", F.expr("size(filter(assetIDs, x -> x = lower(x)))"))\
    .withColumn("upperCaseCount", F.expr("size(filter(assetIDs, x -> x = upper(x)))"))

df1.show(truncate=False)
ghhaqwfi

ghhaqwfi2#

可以进行条件聚合。我不知道你想如何处理大小写混合的字符串。

df2 = df.selectExpr(
    "*",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when lower(x) = x then 1 else 0 end) as lowerCaseCount",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when upper(x) = x then 1 else 0 end) as upperCaseCount"
)

df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |3             |2             |
|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |4             |1             |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

另一种计算方法:

df2 = df.selectExpr(
    "*",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[a-z]' then 1 else 0 end) as lowerCaseCount",
    "aggregate(assetIDs, 0, (acc, x) -> acc + case when x rlike '[A-Z]' then 1 else 0 end) as upperCaseCount"
)

df2.show(truncate=False)
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|date      |id          |name    |version|assetIDs                                                     |lowerCaseCount|upperCaseCount|
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+
|2021-02-11|com.example1|example1|1.2.3  |[4er6d99j, 43mdmdm, 234ds234, w23432s, sdfdsfds, 32erwer]    |6             |0             |
|2021-02-11|com.example2|example2|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, werewer, MMM3333, dssdsssM]     |4             |3             |
|2021-02-11|com.example3|example3|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, YYYY2222, mmmm3444, yy2222]     |4             |2             |
|2021-02-11|com.example4|example4|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, 222sdsss, fffff3333, ffffffff]  |5             |1             |
|2021-02-11|com.example5|example5|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Y222222, YYYYJJJJ2222, DJDDJSJ2]|2             |4             |
|2021-02-11|com.example6|example6|2.3.4  |[3er6d99j, 43mdmdm, MMMM234, Ydddddd, ssdfdfdfd, sdfdwsfw]   |5             |2             |
+----------+------------+--------+-------+-------------------------------------------------------------+--------------+--------------+

相关问题