Postgresql选择循环-避免大量使用磁盘

jjhzyzn0  于 2023-03-12  发布在  PostgreSQL
关注(0)|答案(1)|浏览(105)

我需要逐行处理一个大表格,我使用了stored proc和一个循环:

FOR rec in
    SELECT s.id_main AS id
    FROM big_table b
    LEFT JOIN small_table s ON s.id = b.word_id
    ORDER BY b.id
LOOP
    -- do something
END LOOP;

在第一次迭代之前,Postgres从temp_tablespaces中使用了大量的空间。如何避免在选择过程中大量使用temp_tablespaces?(我不在乎几GB的使用量)。
表格:

CREATE TABLE big_table (
    id bigint NOT NULL,
    word_id int NOT NULL,
    CONSTRAINT big_table__pk PRIMARY KEY (id) USING INDEX TABLESPACE corpus
)
TABLESPACE corpus;

CREATE TABLE small_table (
    id int NOT NULL,
    id_main int NOT NULL,
    CONSTRAINT small_table__pk PRIMARY KEY (id)
);

即使是一个简化的循环,在第一次迭代之前也会使用大量temp_tablespace:

FOR rec in
    SELECT id
    FROM big_table
    ORDER BY id
LOOP
    RAISE NOTICE 'id= %, %', rec.id, timeofday();
END LOOP;

为什么Postgres需要一些temp_tablespaces存储空间而不是直接使用PK?没有ORDER BY id,就不会使用temp_tablespaces。

p5fdfcr1

p5fdfcr11#

通过将数据拆分为小块并按块进行馈送选择来解决。“小块”大小介于100000和10000000行之间时性能最佳(取决于服务器)。在这种情况下,无需更改循环中的处理逻辑。

CREATE OR REPLACE PROCEDURE fill_tmp()
LANGUAGE plpgsql
AS $$
DECLARE
    chunk_size int = 800000;
    start_id bigint = 0;
    max_id bigint;
    start_time timestamp = clock_timestamp();
    rec record;     
    cnt bigint = 0;
BEGIN
    RAISE NOTICE 'STARTED %', start_time;
    SELECT max(id) INTO max_id FROM corpus;

    WHILE start_id < max_id LOOP
        RAISE NOTICE 'cnt= %, time= % sec', cnt, EXTRACT(EPOCH FROM (clock_timestamp() - start_time));

        FOR rec in
            SELECT m.id_main
            FROM corpus c
            LEFT JOIN word_map m ON m.id = c.word_id
            WHERE c.id > start_id
            ORDER BY c.id
            LIMIT chunk_size
        LOOP
            cnt = cnt + 1;
            -- do work here
        END LOOP;

        SELECT max(id) INTO start_id FROM (
            SELECT id
            FROM corpus
            WHERE id > start_id
            ORDER BY id
            LIMIT chunk_size
         ) q;
    END LOOP;
    RAISE NOTICE 'Done in % sec; Batch size= %K; % K rows/sec', EXTRACT(EPOCH FROM (clock_timestamp() - start_time)),
        chunk_size/1000, round(cnt/EXTRACT(EPOCH FROM (clock_timestamp() - start_time))/1000);
END;$$;

相关问题