I know that I can join 2-3 small tables easily by writing simple joins. However, these joins can become very slow when you have 7-8 tables with 20 million+ rows, joining on 1-3 columns, even when you have the right indices. Moreover, the query becomes long and ugly too.
Is there an alternative strategy for doing such big joins, preferably database agnostic?
EDIT
Here is pseudocode for the join. Note that some tables may have to be unpivoted before they are used in the join:
select *
from (select c1, c2, c3, ... from t1 where ...) as s1
inner join (select c1, ... from t2 where ...) as s2
    on s1.c1 = s2.c1
inner join (unpivot table to get c1, c2, ... from t3 where ...) as s3
    on s1.c1 = s3.c1 and s1.c2 = s3.c2
inner join (select c1, c2, c3, ... from t2 where ...) as s4
    on s1.c1 = s4.c1 and s2.c2 = s4.c2 and s1.c3 = s4.c3
Clearly, this is complicated and ugly. Is there a way to get the same result set in a much neater way without using such a complex join?
7 Answers
tpgth1q71#
"7-8 tables" doesn't sound worrying at all. Modern RDBMS can handle a lot more. Your pseudo-code query can be radically simplified to this form:
As long as matching column names are unambiguous in the tables left of a JOIN, the USING syntax shortens the code. If anything can be ambiguous, use the explicit form demonstrated in the last join condition. That's all standard SQL, but according to this Wikipedia page: "The USING clause is not supported by MS SQL Server and Sybase."
It wouldn't make sense to use all those subqueries in your pseudo-code in most RDBMS. The query planner finds the best way to apply conditions and fetch columns itself. Smart query planners also rearrange tables in any order they see fit to arrive at a fast query plan.
Also, that thing called "database agnostic" only exists in theory. None of the major RDBMS completely implements the SQL standard and all of them have different weaknesses and strengths. You have to optimize for your RDBMS or get mediocre performance at best.
Indexing strategies are very important. 20 million rows doesn't matter much in a SELECT, as long as we only need to pluck a handful of row pointers from an index. Indexing strategies heavily depend on your brand of RDBMS. Basically, columns can benefit from an index when they are used in a JOIN clause, a WHERE clause, or an ORDER BY clause. Very much depends on details.
There are also various types of indexes, designed for various requirements: B-tree, GIN, GiST, ... Partial, multicolumn, functional, covering. Various operator classes. To optimize performance you just need to know the basics and the capabilities of your RDBMS.
More in the excellent PostgreSQL manual on indexes.
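For example, a minimal sketch in PostgreSQL syntax, using the hypothetical tables and columns from the question:
-- a multicolumn B-tree index matching the columns used in the join conditions on t4
CREATE INDEX t4_c1_c2_c3_idx ON t4 (c1, c2, c3);
-- a partial index covering only the rows a (hypothetical) WHERE clause actually selects
CREATE INDEX t1_c1_selective_idx ON t1 (c1) WHERE c3 = 42;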
0tdrvxhp2#
I have seen three ways of handling this if indexing fails to give a big enough performance boost. The first is to use temp tables. The more joins the database performs, the worse the estimated row counts get, which can really slow down your query. If you first run the joins and WHERE clauses that return the smallest number of rows, and store the intermediate results in a temp table so the cardinality estimator gets a more accurate count, performance can improve significantly. This solution is the only one of the three that doesn't create new permanent database objects.
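A sketch of the temp-table approach in PostgreSQL syntax, with hypothetical filters on the question's tables:
-- materialize the most selective subset first, so later joins work on a small set
CREATE TEMPORARY TABLE t1_small AS
SELECT c1, c2, c3 FROM t1 WHERE c3 = 42;      -- hypothetical selective filter
ANALYZE t1_small;                             -- give the cardinality estimator accurate statistics
SELECT *
FROM   t1_small s1
JOIN   t2 ON t2.c1 = s1.c1
JOIN   t4 ON t4.c1 = s1.c1 AND t4.c3 = s1.c3;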
The second solution is a data warehouse, or at least one or more additional denormalized tables. In this case you would create an additional table to hold the final results of the query, or several tables that perform the major joins and hold intermediate results. As an example, if you had a customers table and three other tables that hold information about a customer, you could create a new table that holds the result of joining those four tables. This solution generally works when you are using this query for reports and you can load the report table(s) each night with the new data generated during the day. This solution will be faster than the first, but is harder to implement and to keep the results current.
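As a sketch (the customer tables are hypothetical, and the table would be reloaded by a nightly job):
CREATE TABLE customer_report AS
SELECT c.customer_id, c.name, a.city, o.last_order_date, p.preferred_channel
FROM   customers   c
JOIN   addresses   a ON a.customer_id = c.customer_id
JOIN   order_stats o ON o.customer_id = c.customer_id
JOIN   preferences p ON p.customer_id = c.customer_id;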
The third solution is a materialized view / indexed view. This solution depends heavily on the DB platform you use. Oracle and SQL Server both have a way to create a view and then index it, giving you greater performance on the view. This can come at the cost of not having current records, or extra storage cost for the view results, but it can help.
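For example, an indexed view in SQL Server might look like this sketch (names are hypothetical):
CREATE VIEW dbo.customer_summary WITH SCHEMABINDING AS
SELECT c.customer_id, c.name, o.order_total
FROM   dbo.customers   c
JOIN   dbo.order_stats o ON o.customer_id = c.customer_id;
GO
-- the unique clustered index is what materializes and maintains the view's result
CREATE UNIQUE CLUSTERED INDEX ix_customer_summary ON dbo.customer_summary (customer_id);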
oxiaedzo3#
Create materialized views and refresh them overnight, or refresh them only when you consider it necessary. For example, you can have two views: one materialized view with old data that is never going to change, and another normal view with the current data, and then a union between these. You could have more views like these for any output you need.
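A sketch of that split, in PostgreSQL syntax with a hypothetical orders table and cutoff date:
CREATE MATERIALIZED VIEW old_orders AS
SELECT order_id, customer_id, amount FROM orders
WHERE  order_date <  DATE '2024-01-01';        -- frozen history, refreshed rarely or never

CREATE VIEW current_orders AS
SELECT order_id, customer_id, amount FROM orders
WHERE  order_date >= DATE '2024-01-01';        -- always up to date

CREATE VIEW all_orders AS
SELECT * FROM old_orders
UNION ALL
SELECT * FROM current_orders;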
If your database engine doesn't support materialized views, just denormalize the old data into another table overnight.
Check this also: Refresh a Complex Materialized View
3zwtqj6y4#
I've been in the same situation before, and my strategy was to use a WITH clause. See more here.
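For example, the question's query could be reshaped into a sketch like this (filters are hypothetical):
WITH s1 AS (
    SELECT c1, c2, c3 FROM t1 WHERE c3 = 42        -- hypothetical filter
), s2 AS (
    SELECT c1, c2 FROM t2 WHERE c2 IS NOT NULL     -- hypothetical filter
)
SELECT *
FROM   s1
JOIN   s2 ON s2.c1 = s1.c1
JOIN   t4 ON t4.c1 = s1.c1 AND t4.c2 = s2.c2 AND t4.c3 = s1.c3;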
xnifntxz5#
If you want to know whether there is another way to access the data, one approach would be to take an interest in the object concept, at any rate on Oracle. It works very well and simplifies development, but it requires a business-object approach.
From your example we can use two concepts, which can ease the readability of a query and sometimes its speed.
1: References
A reference is a pointer to an object. It allows the removal of joins between tables, since the related row is reached directly through the pointer.
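A minimal sketch of the idea in Oracle syntax (the types, tables, and data are hypothetical, not the original example):
CREATE TYPE department_t AS OBJECT (dept_id NUMBER, dept_name VARCHAR2(30));
/
CREATE TABLE departments OF department_t;
CREATE TABLE employees (
    emp_id   NUMBER,
    emp_name VARCHAR2(30),
    dept_ref REF department_t SCOPE IS departments  -- the "pointer" that replaces a join column
);
-- We insert some data in both tables:
INSERT INTO departments VALUES (department_t(1, 'Sales'));
INSERT INTO employees
SELECT 100, 'Alice', REF(d) FROM departments d WHERE d.dept_id = 1;
-- And the SELECT, navigating the reference instead of writing a join:
SELECT e.emp_name, e.dept_ref.dept_name AS dept_name
FROM   employees e;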
This gives a kind of implicit join.
2: Inheritance
With inheritance, we just insert a row of the subtype into the same table as the parent type, and the SELECT can read the subtype's attributes back directly.
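Continuing the sketch in Oracle syntax (again with hypothetical types and data):
CREATE TYPE person_t AS OBJECT (
    person_id NUMBER,
    name      VARCHAR2(30)
) NOT FINAL;
/
CREATE TYPE customer_t UNDER person_t (
    loyalty_level VARCHAR2(10)
);
/
CREATE TABLE persons OF person_t;
-- We just insert a row of the subtype into the parent-type table:
INSERT INTO persons VALUES (customer_t(1, 'Alice', 'GOLD'));
-- And the SELECT, reading the subtype's extra attribute without any join:
SELECT p.person_id, p.name,
       TREAT(VALUE(p) AS customer_t).loyalty_level AS loyalty_level
FROM   persons p
WHERE  VALUE(p) IS OF (customer_t);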
We can see that with this method we can easily access related data without writing explicit joins.
Hoping that this can serve you.
bqf10yzr6#
If it's all subqueries, you can apply the filtering inside each subquery, and since all the matching happens on the same columns it should be as simple as the sketch below, as long as all the tables have c1, c2, c3.
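A sketch of that shape (the filters are hypothetical):
SELECT *
FROM       (SELECT c1, c2, c3 FROM t1 WHERE c3 = 42) s1            -- hypothetical filter
INNER JOIN (SELECT c1, c2, c3 FROM t2 WHERE c2 IS NOT NULL) s2
        ON s2.c1 = s1.c1 AND s2.c2 = s1.c2 AND s2.c3 = s1.c3
INNER JOIN (SELECT c1, c2, c3 FROM t3) s3
        ON s3.c1 = s1.c1 AND s3.c2 = s1.c2 AND s3.c3 = s1.c3;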
uujelgoq7#
You can make use of views and functions. Views make SQL code elegant and easy to read and compose. Functions can return single values or rowsets, permitting fine-tuning of the underlying code for efficiency. Finally, filtering at the subquery level instead of joining first and filtering at the query level lets the engine produce smaller sets of data to join later, where indexes are not that significant since the amount of data to join is small and can be efficiently computed on the fly. A query built like the sketch below can involve dozens of tables and complex business logic hidden in views and functions, and still be very efficient.
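A sketch of that idea in PostgreSQL syntax (the view, function, tables, and filters are all hypothetical):
CREATE VIEW active_rows AS
SELECT c1, c2, c3 FROM t1 WHERE c3 = 42;                 -- business logic hidden in a view

CREATE FUNCTION latest_c2(p_c1 INT) RETURNS INT AS $$
    SELECT MAX(c2) FROM t2 WHERE c1 = p_c1;              -- fine-tuned lookup wrapped in a function
$$ LANGUAGE SQL STABLE;

SELECT v.c1, v.c2, latest_c2(v.c1) AS latest_c2
FROM   active_rows v
JOIN   (SELECT c1, c2 FROM t3 WHERE c2 IS NOT NULL) s3   -- filter at subquery level
  ON   s3.c1 = v.c1;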