SQL Server: split a set of values into 5 groups so that sum(count) is evenly distributed across the groups.
The table contains only 2 columns, rid and count.
create table t1(rid int, count int)
insert into t1
values (1, 4567), (2, 3256), (3, 5678), (4, 934),
(5, 1099), (6, 3990), (7, 780), (8, 6784),
(9, 7854), (10, 435), (11, 3455), (12, 4897),
(13, 8849), (14, 1019), (15, 2387)
The actual table is:
rid count
---------
1 4567
2 3256
3 5678
4 934
5 1099
6 3990
7 780
8 6784
9 7854
10 435
11 3455
12 4897
13 8849
14 1019
15 2387
I need to divide the values into 5 groups dynamically, with sum(count) distributed as evenly as possible across the groups.
The sum of the count column is 55984, and 55984 / 5 ≈ 11197, so each of the 5 groups should end up with a sum(count) of approximately 11197.
4 Answers
u0njafvf1#
I would start with 5 randomly chosen groups:
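A minimal sketch of that: NTILE over a random ordering hands each row to one of 5 equally sized groups at random.

select rid, [count],
       ntile(5) over (order by newid()) as grp   -- random assignment into 5 groups
from t1;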
The sums should be pretty close. If you have a lot of records and the counts are reasonably distributed, then an nth sample usually does a pretty good job:
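A sketch of the nth-sample idea: rank the rows by count and deal them out round-robin, so every 5th row lands in the same group.

with numbered as (
    select rid, [count],
           row_number() over (order by [count] desc) as rn
    from t1
)
select rid, [count],
       (rn - 1) % 5 + 1 as grp   -- round-robin: rows 1,6,11 -> group 1, rows 2,7,12 -> group 2, ...
from numbered;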
If you have a situation where you have very disparate sizes for count and you need the optimal solution, then you have a hard problem.
9fkzdhlc2#
You can try this script.
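One way to write such a script is with a running total that is cut into 5 equal slices (a sketch; SQL Server 2012+ window functions assumed):

;with running as (
    select rid, [count],
           sum([count]) over (order by rid
                              rows between unbounded preceding and current row) as running_total,
           sum([count]) over () as grand_total
    from t1
)
select rid, [count],
       ceiling(running_total * 5.0 / grand_total) as grp   -- slice the running total into 5 bands
from running
order by rid;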
Result:
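With the sample data, the running-total sketch above assigns rids 1-2 to group 1, 3-7 to group 2, 8 to group 3, 9-12 to group 4 and 13-15 to group 5, giving group sums of 7823, 12481, 6784, 16641 and 12255 - it keeps rid order, but the groups can come out noticeably uneven.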
hwazgwia3#
Here's my go at it. What I've done is create a temp table with an identity column and an extra column ([Group]), with the numbers inserted in descending size order. Then I've written a loop which assigns Groups 1 to 5 to the largest 5 numbers in the [Group] column, then flips and assigns Groups 5 to 1 to the next 5 largest numbers, then flips again, and so on until it reaches the end of the table.
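A compact sketch of that idea - here done set-based, using ROW_NUMBER in place of the identity column and a single serpentine UPDATE in place of the explicit loop:

if object_id('tempdb..#ranked') is not null drop table #ranked;

select row_number() over (order by [count] desc) as id,
       rid, [count],
       cast(null as int) as [Group]
into #ranked
from t1;

update #ranked
set [Group] = case when ((id - 1) / 5) % 2 = 0
                   then ((id - 1) % 5) + 1      -- forward pass: 1,2,3,4,5
                   else 5 - ((id - 1) % 5)      -- reverse pass: 5,4,3,2,1
              end;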
The code below just demonstrates the actual numbers in each of the five groups and also shows the variance away from the sum of numbers divided by five.
Is this accurate enough? In other words, does a variance range of 1274 seem evenly distributed enough with these numbers? I think it might be possible to make it more accurate if required; if this is accurate enough, then fine.
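A sketch of that check, totalling each group and comparing it to the overall sum divided by five (exact figures depend on how ties are ordered, so they may not reproduce the 1274 range exactly):

select [Group],
       sum([count]) as group_total,
       sum([count]) - (select sum([count]) / 5.0 from #ranked) as variance_from_target
from #ranked
group by [Group]
order by [Group];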
Below is a code which shows how the Groups are comprised:
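Something along these lines, listing the members of each group:

select [Group], rid, [count]
from #ranked
order by [Group], [count] desc;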
tuwxkamq4#
Here is a solution that works well and runs quickly on large datasets.
I was dealing with data kind of like this:
What we needed to accomplish was to evenly distribute records across a certain number of buckets while isolating each employee_id to ONLY one bucket, regardless of how many records that one employee had. For the data above, we needed it to break down like this for, let's say, 3 buckets.
For the 9 records in the sample set, employee 1 had 3, employees 2 and 3 had 2 each, and employees 4 and 5 had 1 each. Distributing them that way allowed each employee to belong to only 1 bucket, while still creating buckets of equal size.
In order to record this, we would have to add a column to the original dataset containing which bucket the record belongs to.
If the dataset were laid out like that, we could create the aggregated dataset by running the code below.
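For example, assuming the raw table is called source_data and carries employee_id and bucket_id columns (both names here are illustrative), the aggregated dataset could be built with:

select employee_id,
       count(*)       as records_for_emp,   -- how many records this employee has
       min(bucket_id) as bucket_id          -- every record starts out in bucket 0
from source_data
group by employee_id;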
So let's say we have a table with that schema above. We could do all the aggregations on the source data table, but if you are working with production data or a very large dataset, I found it easier to work with an aggregated representation of the table, since the values of the records themselves don't really matter - it's the count per employee that we are interested in.
So we create an aggregated mapping table to do our work on -
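A sketch of that mapping table (the table and column names are assumptions, chosen to match the variables used later):

create table #bucket_mapping (
    id              int identity(1, 1) primary key,   -- processing order: largest employees first
    employee_id     int not null,
    records_for_emp int not null,
    bucket_id       int not null default 0            -- 0 = not yet assigned to a bucket
);

create nonclustered index ix_bucket_mapping
    on #bucket_mapping (bucket_id, records_for_emp);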
This table contains the employee id and a count of how many records they have, all first assigned to bucket_id 0. Since I was dealing with very large datasets I put a primary key on the identity column and added a nonclustered index on bucket_id and records_for_emp, but for smaller datasets neither of those are really necessary and could possibly slow down performance.
Insert the aggregated source data into the mapping table -
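Something like this, loading the largest record counts first so that the identity order matches the descending sort the first pass relies on (source_data is an assumed name):

insert into #bucket_mapping (employee_id, records_for_emp, bucket_id)
select employee_id, count(*), 0
from source_data
group by employee_id
order by count(*) desc;   -- identity values are assigned in this order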
Now, the overarching concept for this solution is a three-pass sorting algorithm.
The first pass loops across each bucket, distributing the number of records per employee, sorted largest to smallest, using Run Length Encoding to greatly minimize the number of looping iterations.
Since the first pass leaves the last bucket larger than the others, the second pass takes the smallest employee in the final bucket with a record count > 1 and moves it to the bucket with the fewest total records, repeating until the final bucket is under the intended size.
The final pass takes any remaining single-record employees, n, and distributes them across the top n buckets, ordered fewest records to most, in one execution.
Declare some variables - these will all be explained when they are used.
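A plausible set of declarations, using the variable names referenced throughout the rest of the answer (types and defaults are assumptions):

declare @totalRecords       bigint = 0,  -- total records across all employees
        @totalBucketCount   int    = 50, -- number of buckets to fill (50 in the answer's example)
        @maxBucketSize      bigint = 0,  -- target records per bucket
        @currentBucket      int    = 1,  -- bucket currently being filled
        @currentID          int    = 0,  -- mapping-table id being processed
        @currentRecordCount bigint = 0,  -- record count of that id
        @currentBucketSize  bigint = 0,  -- records placed in the current bucket so far
        @lastID             int    = 0,  -- id processed on the previous loop (infinite-loop guard)
        @batchThreshold     bigint = 0,  -- record count of the next run of identical counts
        @lengthOfRun        int    = 0;  -- number of employees in that run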
To determine the total number of records being processed and the desired size of each bucket, the variables are set below -
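A sketch of those assignments:

select @totalRecords = sum(records_for_emp)
from #bucket_mapping;

set @maxBucketSize = @totalRecords / @totalBucketCount;   -- 214,214,960 / 50 = 4,284,299 in the example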
The first thing to do is separate these out logically into 2 groups - employees with a record count no other employee has, and employees that share record counts with other employees. Think back to the example dataset above. Employee 1 was the only employee with 3 records, while employees 2 and 3 shared a count of 2 records and employees 4 and 5 shared a count of 1.
When looked at through the lens of Run Length Encoding (RLE), the employees that have a unique record count are treated as single data items, and the employees that share record counts with other employees are treated as runs of data items.
To determine the record count and run length of the first run of data items, the variables are stored with the values below -
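One way to capture that - the largest record count shared by more than one unassigned employee, and how many employees share it:

select top (1)
       @batchThreshold = records_for_emp,
       @lengthOfRun    = count(*)
from #bucket_mapping
where bucket_id = 0
group by records_for_emp
having count(*) > 1
order by records_for_emp desc;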
For my dataset, the top 3 runs were these.
Meaning the first time two employees share the same record count is for record count 12252, where it's shared by 2 employees. Those values are stored and we continue.
To perform the first pass on the first bucket, this is the code. This is where the while loop will start.
We find the lowest id and corresponding record count that isn't assigned to a bucket yet, whose record count is lower than the number of records still left to be filled for the current bucket.
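A sketch of that lookup, sitting at the top of the while loop:

-- lowest unassigned id whose record count still fits in the space left in the current bucket
select top (1)
       @currentID          = id,
       @currentRecordCount = records_for_emp
from #bucket_mapping
where bucket_id = 0
  and records_for_emp <= @maxBucketSize - @currentBucketSize
order by id;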
In my dataset, the first 3 records were these -
Meaning the record with the id of 1 and record count of 39980 were stored.
The next bit of code looks to see if this record count is the start of a run of data items or a single data item. Since we know the first run of data items in my example dataset has a record count of 12252, this is a single data item.
We check to see if the current bucket can hold the record count stored for the current id. In my example dataset, I have 214,214,960 records and I want to split them across 50 buckets, so with a maximum size of 4,284,299 records per bucket, 39980 can absolutely fit. We update the mapping table and set the currently unassigned bucket_id of 0 to the bucket_id of the current bucket, 1, and update the current_bucket_size to reflect the newly added data item, 39980.
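A sketch of that single-data-item branch (the run branch added later sits in front of it):

if @currentRecordCount <> @batchThreshold
   and @currentBucketSize + @currentRecordCount <= @maxBucketSize
begin
    update #bucket_mapping
    set    bucket_id = @currentBucket        -- claim this employee for the current bucket
    where  id = @currentID;

    set @currentBucketSize = @currentBucketSize + @currentRecordCount;
end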
This code will continue to loop, finding the next data item not assigned to a bucket that can fit into the space remaining in my current bucket, and add them one after the next until the bucket is full. When that occurs or when we've finished processing everything there is to process, the code below will run, advancing the @currentBucket variable by 1 and resetting the @currentID and @currentBucketSize variables back to 0. This is placed just below the code above in the while loop.
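A sketch of that bucket advance:

if @currentBucketSize >= @maxBucketSize      -- current bucket is full
begin
    set @currentBucket     = @currentBucket + 1;
    set @currentID         = 0;
    set @currentBucketSize = 0;
end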
In the event that the current bucket is under the @maxBucketSize by a given number, n, and there are no single data items with a record count of n or runs of data that add up to n, this will spin into an infinite loop. To prevent this, after each loop we set a placeholder variable, @lastID, to hold the current id that was just processed. At the end of the next loop, we evaluate the id we just processed, and if it is the same value as the last id we processed, we advance to the next bucket and reset the values as if the bucket were full.
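A sketch of that guard, placed at the bottom of the loop:

if @currentID = @lastID                      -- nothing new could be placed this time around
begin
    set @currentBucket     = @currentBucket + 1;   -- treat the bucket as full and move on
    set @currentID         = 0;
    set @currentBucketSize = 0;
end

set @lastID = @currentID;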
My first bucket looks like this -
In this first bucket, it just happens to be that all the data items were single data items, making this first bucket contain 134 employees and all of their records over 134 loops.
Let's take a second to talk about efficiency and performance
That comes out to 1 row processed per 1 loop, which is really bad for performance on large datasets. My dataset in particular contains 284378 employees, and I'm not about to run that many loops to sort this data. Even looking at the last id inserted in my bucket above, 225224, my total number of employees, 284378, and my largest record count for data items in a run, 12252 - it's completely obvious that this dataset contains a lot of data items belonging to several runs. So what happens when those record counts are next on our list to be processed? When we continue into the next bucket we will see.
As this loop starts and iterates through the single data items, it treks along until it reaches the first data item of the first run of data.
If you'll recall, we initialized the @batchThreshold and @lengthOfRun variables with the record count of the run, 12252, and the number of data items in the run, 2. Instead of looping through both of these 1 by 1, we can handle them as an entire run of data in one loop. To achieve that, we add this code before the code that handles the single data items in this way -
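A sketch of that run-handling branch (placed ahead of the single-item branch shown earlier):

if @currentRecordCount = @batchThreshold
begin
    -- if the whole run does not fit, only take as many of its items as will fit
    if @lengthOfRun * @batchThreshold > @maxBucketSize - @currentBucketSize
        set @lengthOfRun = (@maxBucketSize - @currentBucketSize) / @batchThreshold;

    -- BETWEEN is inclusive, so the run ends at @currentID + @lengthOfRun - 1
    update #bucket_mapping
    set    bucket_id = @currentBucket
    where  id between @currentID and @currentID + @lengthOfRun - 1;

    set @currentBucketSize = @currentBucketSize + (@lengthOfRun * @batchThreshold);

    -- next largest run with at least one item that still fits in the current bucket
    set @batchThreshold = 0;
    set @lengthOfRun    = 0;

    select top (1)
           @batchThreshold = records_for_emp,
           @lengthOfRun    = count(*)
    from #bucket_mapping
    where bucket_id = 0
      and records_for_emp <= @maxBucketSize - @currentBucketSize
    group by records_for_emp
    having count(*) > 1
    order by records_for_emp desc;
end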
Since the condition to execute the code that processes runs of data items is satisfied, we look to see if the entire size of the run is larger than what can fit into the remaining size of the current bucket. Since our current bucket size is 2,732,020 and our max bucket size is 4,284,299, we have plenty of room to fit the entire run of 2 data items, each having a record count of 12252.
If that were not the case, we would evaluate exactly how many data items in the run can fit into the remaining size of the current bucket and set the @lengthOfRun variable to that value.
Since the first data item in this run being processed is the current id, we simply add the length of the run - 1 to the current id to get the id of the record ending the run, since the BETWEEN operator is inclusive of the first and last parameters.
We update our run of data items to the current bucket id and add the total number of records across the run to the current bucket size. Finally, we find and store the next largest record count and run length of data items in a run where at least 1 of those data items can fit into the remaining size of the current bucket.
With the code that handles runs of data items in effect, this is my second bucket -
For this bucket, 6 runs of 2 data items were processed, eliminating 6 loops for this bucket. We can count the number of loops with a variable and subtract that from the number of data items processed to find the number of loops saved, or we can use this formula -
sum(number_of_runs * run_length) - sum(number_of_runs) = loops eliminated
With that, we can see we executed 298 loops for this bucket - a compression ratio of 304:298 or 1.02 rows per loop. That's still pretty bad, but my third bucket is where the runs really ramp up.
My third bucket looks like this -
For the third bucket, there were 92 runs of data items -
Altogether, those 92 runs accounted for 371 rows, meaning that 146 loops produced 425 total rows. This time the compression ratio is 425:146, or 2.91 rows per loop.
This algorithm does best when the inherent properties of the dataset are such that it contains lots of runs. In my datasets there are lots of runs of smaller values, so the RLE logic pays off more and more as the current bucket id climbs; once there are no runs left, the algorithm falls back to processing roughly one row per loop.
So now let's talk about the final bucket
When we begin filling the final bucket, we could loop through the remaining employees, adding them one at a time or calculating run lengths, but why? After this bucket there are no more buckets to put employees in. We can eliminate any unnecessary loops by simply dumping all remaining records into the final bucket, by running this code before anything else in the while loop.
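A sketch of that short-circuit, using the names from the earlier sketches:

if @currentBucket = @totalBucketCount        -- we've reached the final bucket
begin
    update #bucket_mapping
    set    bucket_id = @currentBucket        -- everything still unassigned lands here
    where  bucket_id = 0;

    break;                                   -- first pass is finished
end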
The problem that comes with this though, is that the final bucket contains some overflow records, and now looks like this -
This doesn't look too bad, but it can diverge more when the employees with the fewest records have record counts > 1, such as in this other dataset I have -
To solve this and better smooth out our buckets, we introduce passes 2 and 3.
The second pass loops through any data items left over with record counts > 1 and distributes them one at a time to the bucket with the fewest total records at the beginning of the loop. The code for that goes in the section after we dump all the remaining records into the final bucket.
First we create a table variable to hold the overflow data items. Since the number of overflow data items won't necessarily match the number of overflow records, we have to assume that this will need to work for whatever the @totalBucketCount is, or smaller. We can use the total record difference in determining how many data items to insert into the spillover table. This not only covers when the algorithm is working correctly, but essentially covers any edge cases in which the first pass would not work correctly.
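A sketch of the second pass; the spillover table variable and the exact sizing of the TOP clause are assumptions based on the description above:

declare @spillover table (id int, records_for_emp int);
declare @recordDifference bigint,
        @spillID int, @spillCount int;

-- how far over the target the final bucket currently is
select @recordDifference = sum(records_for_emp) - @maxBucketSize
from #bucket_mapping
where bucket_id = @totalBucketCount;

-- smallest multi-record employees in the final bucket, at most one per bucket
insert into @spillover (id, records_for_emp)
select top (@totalBucketCount) id, records_for_emp
from #bucket_mapping
where bucket_id = @totalBucketCount
  and records_for_emp > 1
order by records_for_emp asc;

-- move them one at a time to whichever bucket currently has the fewest records
while @recordDifference > 0 and exists (select 1 from @spillover)
begin
    select top (1) @spillID = id, @spillCount = records_for_emp
    from @spillover
    order by records_for_emp asc;

    update #bucket_mapping
    set    bucket_id = (select top (1) bucket_id
                        from #bucket_mapping
                        where bucket_id < @totalBucketCount
                        group by bucket_id
                        order by sum(records_for_emp) asc)
    where  id = @spillID;

    set @recordDifference = @recordDifference - @spillCount;
    delete from @spillover where id = @spillID;
end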
My first dataset had no leftover data items with a record count > 1, so we'll use my second dataset. Grabbing the difference (top 24000ish records) just grabbed all 49 data items in the bucket. We won't use them all, just enough for the final bucket to be within the appropriate size.
Below are the results after each iteration of the second pass of the dataset with 24000ish records in difference.
The third and final pass operates on data items with a row count of only 1, distributing them all to the buckets with the fewest total records in one execution.
Looking back at my first dataset, if you recall, I had 10 data items with a record count of 1 left over in the final bucket. Grabbing the difference, TOP 10 grabs those leftover data items and assigns them to the TOP 10 smallest buckets.
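A sketch of that final pass - pair each leftover single-record employee in the final bucket with one of the smallest buckets and move it there in a single statement (names follow the earlier sketches):

;with leftovers as (
    -- single-record employees still sitting in the final bucket
    select id,
           row_number() over (order by id) as rn
    from #bucket_mapping
    where bucket_id = @totalBucketCount
      and records_for_emp = 1
),
smallest_buckets as (
    -- the other buckets, ranked from fewest total records to most
    select bucket_id,
           row_number() over (order by sum(records_for_emp) asc) as rn
    from #bucket_mapping
    where bucket_id < @totalBucketCount
    group by bucket_id
)
update m
set    bucket_id = s.bucket_id               -- 1st leftover -> smallest bucket, 2nd -> next smallest, ...
from   #bucket_mapping m
join   leftovers l        on l.id = m.id
join   smallest_buckets s on s.rn = l.rn;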
Below are the source and destination buckets for the overflow data items.
When all is said and done, I had 284378 total employees to sort, and after all three passes I've only executed 8356 loops, bringing my full operation to a compression ratio of 284378:8356, or 34.03 rows per loop.
Those are the statistics for my given dataset, and my given desired bucket parameter. I'm sure there are better general optimizations, but to load and aggregate the 2+ million rows of data took longer for me than all 3 passes of this sorting algorithm, 4 seconds to load and aggregate, 3 to sort.
CODE
This isn't as accurate with smaller datasets but runs very quickly and accurately on large datasets.