How to aggregate information from indefinite number of groups in TSQL? E.g. we have a table with 2 columns - clients and regions.
Clients Regions
client1 45
client1 45
client1 45
client1 45
client1 43
client1 42
client1 41
client2 45
client2 45
client3 43
client3 43
client3 41
client3 41
client3 41
client3 41
Every client can have any number of regions.
In the example below: client1 has 4 groups of regions, 2nd - 1 group, 3rd - 2 groups.
I want to count gini impurity for each client, i.e. to count - how different are the regions in client.
To do this, I want to apply to each client the following formula:
But the quantity of regions is indefinite (may differ in each client).
This should be calculated:
client1 = 1 - ((4 / 7 ) ^ 2 + (1 / 7 ) ^ 2 + (1 / 7 ) ^ 2 + (1 / 7 ) ^ 2)
client2 = 1 - ((2 / 2 ) ^ 2)
client3 = 1 - ((2 / 6 ) ^ 2 + (4 / 6 ) ^ 2)
This is the desirable output:
Clients Impurity
client1 61%
client2 0%
client3 44%
Would you prompt me the way to solve the problem.
2条答案
按热度按时间ljsrvy3e1#
I think the formula could be expressed as a couple of group by:
db<>fiddle seems to match expected output.
j91ykkif2#
Here's how I'd approach this:
count(*) as cnt ... group by clients, regions
cast(cnt as float)/sum(cnt) over(partition by clients) as pcnt
and square it1 - sum(pcnt) ... group by clients
There are ways to compact it to not use 2 subqueries but they might not make it more readable or easy to understand. I wasn't totally clear on whether you meant percent (out of 100) or ratio (out of 1) so you might have to add a
*100
at an appropriate point