I want to store a large n
-dimensional vector (e.g. an embedding vector ) in SQL Server as a piece of metadata associated with another row.
In this example, it will be a 384-dimensional vector, for example:
[0.161391481757164, -0.23294533789157867, -0.5648667216300964, -0.3210797905921936, -0.03274689242243767, 0.011770576238632202, -0.06612513959407806,
-0.14662186801433563, -0.17081189155578613, 0.2879514992237091, -0.1932784765958786, 0.009713868610560894, 0.23330552875995636, 0.03551964834332466,
-0.20526213943958282, 0.06445703655481339, -0.3146169185638428, 0.5788811445236206, 0.09118294715881348, -0.0048667509108781815,-0.16503077745437622,
0.25162017345428467, -0.36395764350891113, -0.34742429852485657, 0.0526515394449234, 0.08912508934736252, 0.48464590311050415, -0.04224267974495888,
0.32445403933525085, -0.6847451329231262, -0.20959551632404327, -0.027657458558678627, 0.20439794659614563, 0.6859520077705383, -0.4988805055618286,
-0.26204171776771545, -0.18842612206935883, 0.07067661732435226, 0.02633148804306984, 0.03182782977819443, 0.28935596346855164, -0.0016041728667914867,
0.14609676599502563, -0.36272501945495605, 0.10288259387016296, -0.3651926815509796, -0.3823530375957489, 0.14052163064479828, 0.006418740376830101,
0.11741586774587631, -0.6509529948234558, -0.15997739136219025, -0.42837604880332947, 0.12351743131875992, 0.0485026054084301, 0.24820692837238312,
0.46972623467445374, -0.47954055666923523, -0.5238635540008545, -0.3543052673339844, 0.22626525163650513, 0.18406584858894348, 0.6463921070098877,
0.11894208937883377, -0.07143554836511612, 0.004256516695022583, 0.10088140517473221, 0.3335645794868469, 0.16905969381332397, 0.056856121867895126,
0.11355260014533997, 0.3708053231239319, -0.7484591603279114, 0.17503942549228668, -0.3249044418334961, 0.5901510715484619, 0.41506800055503845,
0.05852462351322174, 0.5119204521179199, 0.2750142216682434, -0.2058306783437729, 0.8199670314788818, 0.16698679327964783, -0.1572146713733673,
0.014733579009771347 ,0.0168467964977026, 0.4688740372657776, -0.07839230448007584, 0.49326324462890625, -0.29934313893318176, 0.21525822579860687,
0.1396997570991516, -0.3420834243297577, -0.5197309851646423, 0.10842061042785645, -0.0338996984064579, 0.35846689343452454, -0.1660442352294922,
0.15579357743263245, 0.015674782916903496,-0.8510578870773315, -0.07501569390296936, -0.1791406124830246, 0.14926102757453918, -0.2269722819328308,
0.42619261145591736, 0.09489753842353821, -0.13341256976127625, 0.3312526345252991, 0.22534190118312836, 0.0679713636636734, 0.17042726278305054,
0.14300595223903656, -0.06654901057481766, -0.2170567661523819, -0.454984188079834, -0.5516679286956787, -0.10752955824136734, -0.05743071809411049,
0.32108309864997864, -0.5445901155471802, -0.43162357807159424, 0.08207866549491882, 0.0664522647857666, 0.4478979706764221, 0.2190810590982437,
-0.05722910910844803, -0.0932786613702774, 0.01758035272359848, 0.16166797280311584, 0.44004616141319275, -0.21601708233356476, 0.43121641874313354,
0.32022470235824585, -0.014045504853129387,-0.24948528409004211, -0.4389941990375519, 0.3816317319869995, -0.5687862038612366, 0.1088542640209198,
-0.403241366147995, 0.08174201846122742, 0.21350793540477753, 0.2396722435951233, 0.4973253607749939, 0.31202447414398193, -0.5260801315307617,
-0.3351263403892517, -0.04100760444998741, 0.6609364151954651, -0.2047063261270523, 0.19385716319084167, -0.5661329627037048, -0.27058693766593933,
-0.1637117713689804, 0.30641692876815796, -0.08894442766904831, -0.052735116332769394,-0.13839660584926605, -0.6741533875465393, 0.05569711700081825,
-0.04354270175099373, 0.20251914858818054, 0.24813368916511536, 0.1719648838043213, 0.26782000064849854, 0.3137670159339905, 0.18599936366081238,
0.23953016102313995, 0.17769533395767212, 0.46293920278549194, -0.19122551381587982, -0.5595004558563232, 0.09755659103393555, 0.3125424385070801,
-0.5813230276107788, -1.0698442459106445, -0.09045401215553284, -0.08948248624801636, -0.051830895245075226,-0.0001317809073952958,-0.08400193601846695,
0.25725823640823364, -0.10135184973478317, 0.07884480804204941, 0.2091679722070694, 0.3950233459472656, 0.2745698094367981, -0.872776448726654,
-0.16590780019760132, 0.4308463931083679, -0.24375642836093903, -0.02120584435760975, 0.05213866010308266, -0.19898287951946259, -0.5506985187530518,
0.40167248249053955, 0.1640072464942932, -0.010167916305363178, 0.14038121700286865, 0.4958030879497528, -0.7259818315505981, -0.24387206137180328,
0.08528701961040497, 0.03415993973612785, -0.16687284409999847, 0.3804749548435211, -0.08561687171459198, -0.2752263844013214, 0.5883951783180237,
-0.3283255994319916, -0.12724250555038452, 0.08751262724399567, -0.44206979870796204, -0.11079336702823639, -0.16302113234996796, 0.11022322624921799,
-0.09404750168323517, -0.256179541349411, 0.20473307371139526, 0.41829538345336914, -0.1095203086733818, 0.02342342585325241, -0.18814104795455933,
-0.2540932893753052, 0.48397907614707947, 0.03593514859676361, -0.089835524559021, -0.6478171944618225, -0.1757517009973526, 0.0672023594379425,
0.0695127546787262, -0.6398074626922607, -0.03958022966980934, -0.10351496934890747, 0.22433893382549286, 0.6756673455238342, -0.2924160957336426,
0.17503827810287476, 0.12915058434009552, -0.239552840590477, 0.15498916804790497, -0.4730042815208435, -0.12289212644100189, -0.004052990116178989,
0.11593572050333023, -0.1965983510017395, 0.5210273265838623, -0.18184830248355865, 0.2579534947872162, -0.1920309066772461, -0.389960378408432,
0.04139290377497673, -0.11638019979000092, -0.10620912909507751, -0.5321099162101746, 0.13135096430778503, -0.07761876285076141, -0.0830138698220253,
-0.01572849042713642, 0.31080499291419983, -0.41445496678352356, 0.1609737128019333, 0.5787453651428223, -0.05459209159016609, 0.1318219006061554,
-0.06957206130027771, 0.15152350068092346, -0.07094550132751465, -0.196294367313385, 0.12644843757152557, 0.23419199883937836, 0.5845456719398499,
-0.19989481568336487, -0.19607964158058167, -0.19692276418209076, -0.08633144199848175, -0.004551170393824577, 0.09362921118736267, -0.14167727530002594,
-0.14917594194412231, 0.31781134009361267, 0.18779256939888, 0.42154577374458313, -0.20578211545944214, 0.14142100512981415, -0.5664211511611938,
0.18177354335784912, 0.14776530861854553, 0.29254236817359924, 0.17831481993198395, -0.1894354224205017, -0.2836195230484009, -0.4065170884132385,
-0.14325398206710815, 0.17800962924957275, 0.7763587832450867, 0.5497004389762878, -0.00946379080414772, -0.48568078875541687, -0.022227048873901367,
-0.005903944373130798, 0.4351034462451935, 0.05010621249675751, -0.12799566984176636, -0.06675072759389877, 0.167253315448761, -0.1653994619846344,
0.21004730463027954, 0.2765181362628937, 0.5885812640190125, -0.326379656791687, -0.007390940561890602, 0.27159956097602844, -0.043763305991888046,
-0.39229199290275574, -0.19412016868591309, 0.4250912666320801, 0.6105153560638428, -0.06168382614850998, -0.5341082811355591, -0.611929714679718,
0.08125612139701843, -0.1779184639453888, 0.5319408774375916, -0.23601730167865753, 0.22285249829292297, -0.32505497336387634, 0.2152460366487503,
0.4679816663265228, 0.048206135630607605,-0.24099768698215485, -0.30208054184913635, 0.13667792081832886, 0.3552468717098236, -0.12280546128749847,
-0.006191314198076725,-0.10851636528968811, 0.08330328017473221, -0.09545236080884933, -0.02249046228826046, 0.0003346469602547586,-0.12273653596639633,
-0.05594412609934807, 0.027804357931017876,-0.4045255482196808, -0.18987023830413818, -0.0027474926318973303,0.30244430899620056, 0.2323288917541504,
-0.2729185223579407, 0.12836921215057373, 0.27967774868011475, 0.3031359016895294, 0.41273725032806396, -0.06173351779580116, 0.33845168352127075,
0.26775869727134705, -0.2933143079280853, -0.0485006645321846, 0.11777450144290924, 0.6205862760543823, -0.07637807726860046, -0.19466432929039001,
-0.3994691073894501, 0.15689416229724884, -0.11139731854200363, -0.2333720475435257, 0.2364773154258728, 0.30898618698120117, -0.1263875812292099,
-0.231489360332489, 0.34536853432655334, 0.6001318097114563, -0.44741731882095337, 0.07382357120513916, -0.019649405032396317, -0.1029537245631218,
0.369470477104187, -0.032077688723802567,-0.13972929120063782, 0.24549521505832672, -0.13091856241226196, -0.029257331043481827]
Attempt#1 - n-dimensional vector → n-columns
My first thought was to store the 384 real
values in a separate table, with a key to the original row (vertical partitioning):
CREATE TABLE Embeddings (
RowGUID uniquedientifier NOT NULL PRIMARY KEY,
f1 real NOT NULL,
f2 real NOT NULL,
f3 real NOT NULL,
f4 real NOT NULL,
f5 real NOT NULL,
f6 real NOT NULL,
f7 real NOT NULL,
f8 real NOT NULL,
f9 real NOT NULL,
f10 real NOT NULL,
...snip...
f384 real NOT NULL)
RowGUID | f1 | f2 | f3 | f4 | f5 | f6 | f7 | ... | f384 |
---|---|---|---|---|---|---|---|---|---|
6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 0.161391481757164 | -0.23294533789157867 | -0.5648667216300964 | -0.3210797905921936 | -0.03274689242243767 | 0.011770576238632202 | -0.06612513959407806 | ... | -0.029257331043481827 |
This...sorta...works. But it is unwieldy. Plus, my vectors today happen to be 385-dimensional; but they may soon be 1556-dimensional , which exceeds the SQL Server maximum of 1,024 columns per table.
Attempt#2 - n-dimensional → n-IEEE 32-bit floats → varbinary(n*4)
The next idea was to pack the 4-byte (32-bit) floats into a varbinary
column:
CREATE TABLE Embeddings (
RowGUID uniquedientifier NOT NULL PRIMARY KEY,
PackedVector varbinary(1516) NOT NULL -- 384 floats * 4 bytes = 1540 bytes
)
0x0000000100000002000000030000000400000005000000060000000700000008...0000017F
\______/\______/\______/\______/\______/\______/\______/\______/ \______/
f1 f2 f3 f4 f5 f6 f7 f8 f384
And then when it comes time to read each Single
, use SUBSTRING
to rip the 4-byte float out of the varbinary, and then convert it to a real
:
DECLARE @f1 real = CAST(SUBSTRING(PackedVector, 0*4, 4) AS real);
Except two down-sides:
- Downside#1: You cannot convert a
binary(4)
to areal
(even though you can convert areal
to abinary(4)
; just not the other way:
May be able to workarond it with decimal
or numeric
).
- Downside#2: The math of computing the euclidian distance between two vectors is conceptually valid:
DECLARE @target VARBINARY(1536) -- packed 384-dimensional vector
SELECT TOP(10) RowGUID, SUM(POWER(CAST(SUBSTRING(Embedding, i*4+1, 4) AS real) - CAST(SUBSTRING(@target, i*4 + 1, 4) AS real), 2)) as distance
FROM Embeddings
CROSS APPLY (VALUES (0), (1), (2), ..., (383)) AS sequence(i) -- Fill in the values from 0 to 383
GROUP BY RowGUID
ORDER BY distance ASC
But that will be pretty poorly performing (even if issue #1 didn't exist).
Attempt 3 - Do what Joe Celko does
Many years ago, someone on the Microsoft newsgroups had the same question:
Can anyone point me to a reference or discuss the best way to store a vector of 120 to 480 numbers in the database? Rows seem to be out since we would quickly top the billion row mark. A table with 480 columns is too unnormalized. A single varchar(max) column? This seems the best answer for now unless there is a more efficiant way of storing it.
Thanks for any help or opinions,
And then --CELKO-- responded:
I think of a vector as a particular kind of mathematical structure and you seem to be talking about a list of some kind. Vectors have a fixed number of dimensions, etc. Here is a guess:
CREATE TABLE Vectors (
vector_id CHAR(3) NOT NULL, --whatever
dim_nbr INTEGER NOT NULL,
CHECK (dim_nbr BETWEEN 1 AND 480),
PRIMARY KEY (vector_id, dim_nbr),
dim_val INTEGER NOT NULL
);
Making the values of a the vector into rows:
Embeddings
| RowGUID | dimNumber | dimValue |
| ------------ | ------------ | ------------ |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 1 | 0.161391481757164 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 2 | -0.23294533789157867 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 3 | -0.5648667216300964 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 4 | -0.3210797905921936 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 5 | -0.03274689242243767 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 6 | 0.011770576238632202 |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 7 | -0.06612513959407806 |
| ... | ... | ... |
| 6ba7b814-9dad-11d1-80b4-00c04fd430c8 | 384 | -0.029257331043481827 |
This is probably the best approach.
Nothing better?
Doesn't SQL Server has better support for vectors? I know there is GEOSPATIAL/GEOGRAPHY types, but i gather those only work for 2-dimensional vectors (e.g. lattuitude+logitude)? Can't they be abused to solve the problem?
And since the goal is to compute euclidian distance between two vectors, is there a data structure that does a better job of allowing math? (varchar? xml? json? varbinary? variant?)
Bonus Reading
- Indexing N-dimensional vectors
- microsoft.public.sqlserver.programming: Best way to store vectors?
- 🤗Getting Started With Embeddings
- How to customize LLMs like ChatGPT with your own data and documents
- FAISS - Facebook open-source vector database: https://github.com/facebookresearch/faiss
- Pinecone, an online vector database system: https://www.pinecone.io/
- Microsoft: What is a Vector Database? ( https://archive.ph/15kXF )
- SQLServerCentral: The Rise of Vector Databases
- Optimized approach for calculating cosine similarity in SQL Server(tips for using cosine angle distane)
2条答案
按热度按时间k75qkfdt1#
The real advantage of vector databases (I work for Weaviate fwiw) is that it allows ANN-based vector indices to be built. Vector searches are computationally very expensive, and vector DBs allow searches that are orders magnitudes faster than say, brute-force searches through a Numpy array or something.
I would definitely not advise saving vectors into SQL if you want to do any kind of similarity-based retrieval, which is probably why you have vectors in the first place.
pvcm50d12#
Maybe you could try MyScale DB, which is a SQL-based vector database.