I have two tables: migrate_data, which stores product details, and ws_data, which stores d_id and parent (the parent of d_id), where d_id is a process id. I have a million records and want to join the two tables with left joins, using d_id and parent to derive child, parent, and grandparent columns.
Dummy sample data (generated):
CREATE TABLE ws_data (
id SERIAL UNIQUE NOT NULL,
d_id integer,
parent integer,
CONSTRAINT ws_data_pk
PRIMARY KEY (id)
);
CREATE TABLE migrate_data (
id SERIAL UNIQUE NOT NULL,
d_id_sell integer,
d_id_curr integer,
product_name VARCHAR(100) NOT NULL, -- not unique
CONSTRAINT migrate_data_pk
PRIMARY KEY (id)
);
insert into migrate_data (
product_name,
d_id_sell,
d_id_curr
)
select
md5(random()::text),
floor(random() * (1000 + 1)),
floor(random() * (1000 + 1))
from generate_series(1, 10000);
insert into ws_data (
parent,
d_id
)
select
floor(random() * (1000 + 1)),
floor(random() * (1000 + 1))
from generate_series(1, 10000);
DELETE FROM ws_data T1
WHERE T1.d_id = T1.parent;
DELETE FROM ws_data T1
USING ws_data T2
WHERE T1.id < T2.id -- delete the older versions
AND T1.parent = T2.parent AND T1.d_id = T2.d_id; -- add more columns if needed
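As a quick sanity check (my own addition, not part of the original setup), the following should return no rows once the two DELETEs above have run:
-- hypothetical check: any (d_id, parent) pair that still appears more
-- than once means the dedup DELETE above missed it
SELECT d_id, parent, count(*)
FROM ws_data
GROUP BY d_id, parent
HAVING count(*) > 1;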
Main query:
explain (analyze, buffers, format text)
select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent
left join ws_data t2_parent on t2_parent.d_id = a.d_id_curr
left join ws_data t2_grandparent on t2_grandparent.d_id = t2_parent.parent;
Analysis:
"Hash Right Join (cost=36223.20..1292695.17 rows=97566770 width=41) (actual time=224.185..4777.545 rows=96825218 loops=1)"
" Hash Cond: (t2_parent.d_id = a.d_id_current)"
" Buffers: shared hit=314, temp read=7415 written=7415"
" -> Hash Right Join (cost=280.00..1740.20 rows=99270 width=4) (actual time=1.110..6.031 rows=98682 loops=1)"
" Hash Cond: (t2_grandparent.d_id = t2_parent.parent)"
" Buffers: shared hit=110"
" -> Seq Scan on ws_data t2_grandparent (cost=0.00..155.00 rows=10000 width=4) (actual time=0.002..0.360 rows=9944 loops=1)"
" Buffers: shared hit=55"
" -> Hash (cost=155.00..155.00 rows=10000 width=8) (actual time=1.059..1.059 rows=9944 loops=1)"
" Buckets: 16384 Batches: 1 Memory Usage: 517kB"
" Buffers: shared hit=55"
" -> Seq Scan on ws_data t2_parent (cost=0.00..155.00 rows=10000 width=8) (actual time=0.008..0.470 rows=9944 loops=1)"
" Buffers: shared hit=55"
" -> Hash (cost=14914.56..14914.56 rows=987731 width=41) (actual time=221.897..221.898 rows=981173 loops=1)"
" Buckets: 65536 Batches: 32 Memory Usage: 2891kB"
" Buffers: shared hit=204, temp written=7086"
" -> Hash Right Join (cost=599.00..14914.56 rows=987731 width=41) (actual time=9.566..75.486 rows=981173 loops=1)"
" Hash Cond: (t1_parent.d_id = a.d_id_seller)"
" Buffers: shared hit=204"
" -> Hash Right Join (cost=280.00..1740.20 rows=99270 width=4) (actual time=3.302..10.555 rows=98682 loops=1)"
" Hash Cond: (t1_grandparent.d_id = t1_parent.parent)"
" Buffers: shared hit=110"
" -> Seq Scan on ws_data t1_grandparent (cost=0.00..155.00 rows=10000 width=4) (actual time=0.007..0.508 rows=9944 loops=1)"
" Buffers: shared hit=55"
" -> Hash (cost=155.00..155.00 rows=10000 width=8) (actual time=3.256..3.257 rows=9944 loops=1)"
" Buckets: 16384 Batches: 1 Memory Usage: 517kB"
" Buffers: shared hit=55"
" -> Seq Scan on ws_data t1_parent (cost=0.00..155.00 rows=10000 width=8) (actual time=0.018..1.623 rows=9944 loops=1)"
" Buffers: shared hit=55"
" -> Hash (cost=194.00..194.00 rows=10000 width=41) (actual time=6.223..6.224 rows=10000 loops=1)"
" Buckets: 16384 Batches: 1 Memory Usage: 841kB"
" Buffers: shared hit=94"
" -> Seq Scan on migrate_data a (cost=0.00..194.00 rows=10000 width=41) (actual time=0.034..3.721 rows=10000 loops=1)"
" Buffers: shared hit=94"
"Planning Time: 0.701 ms"
I am trying to reduce the query's execution time. I have read a little about LATERAL joins, but I don't know how to apply one here, and I also don't see how else to improve these left joins.
Any pointers would be much appreciated.
Edit: the expected child/parent/grandparent output looks like this:
| grandparent | parent | child |
|-------------|--------|-------|
| 50 | 300 | 101 |
| 50 | 300 | 345 |
| 50 | 300 | 356 |
| 50 | 300 | 456 |
| 50 | 150 | 458 |
| 50 | 150 | 546 |
| 50 | 200 | 547 |
| 50 | 200 | 600 |
| 50 | 200 | 601 |
| 80 | | |
| 80 | | |
| 90 | 500 | |
| 90 | 501 | |
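For what it's worth, here is a sketch of how that shape could be projected from the same joins as the main query, shown for the d_id_sell side only (the child/parent/grandparent aliases are my assumption based on the table above):
-- sketch only: ws_data stores (d_id, parent), so chaining two joins
-- on d_id -> parent yields the grandparent
select
  a.d_id_sell           as child,
  t1_parent.parent      as parent,       -- parent of d_id_sell
  t1_grandparent.parent as grandparent   -- parent of that parent
from migrate_data a
left join ws_data t1_parent      on t1_parent.d_id = a.d_id_sell
left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent;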
1 Answer
For a query that retrieves *all* the data from the join, there is actually not much you can do. The biggest saving is probably to increase work_mem until no more temporary files get written.
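A minimal sketch of that, assuming a session-level setting (the 64MB figure is a guess; keep raising it until the "temp read/written" lines disappear from the plan):
-- raise work_mem for the current session only, then re-run the plan;
-- the outer hash above spilled (Batches: 32, temp written), so more
-- memory lets it run in a single batch
SET work_mem = '64MB';
explain (analyze, buffers, format text)
select a.d_id_sell, a.d_id_curr, a.product_name
from migrate_data a
left join ws_data t1_parent on t1_parent.d_id = a.d_id_sell
left join ws_data t1_grandparent on t1_grandparent.d_id = t1_parent.parent
left join ws_data t2_parent on t2_parent.d_id = a.d_id_curr
left join ws_data t2_grandparent on t2_grandparent.d_id = t2_parent.parent;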