postgresql 在添加到父对象之前持久化子对象会导致非常慢的更新

xqnpmsa8  于 2023-08-04  发布在  PostgreSQL
关注(0)|答案(2)|浏览(86)

我正在使用SQLAlchemy ORM。我有一个Parent类和一个Child类。ParentChild是一对多关系。一个Parent有很多子节点,甚至可能有20 k-30 k。
我的课程是这样设置的

class Parent(Base):
__tablename__ = "parent"

id = mapped_column(BigInteger, primary_key=True, autoincrement=True)

# associations
children = relationship(
    "Child",
    back_populates="parent",
    uselist=True,
    cascade="all, delete-orphan",
    primaryjoin='and_(Parent.id == Child.parent_id, Child.deleted_at == None)'
)

class Child(Base):
__tablename__ = "child"

id = mapped_column(BigInteger, primary_key=True, autoincrement=True)

# associations:
parent_id = mapped_column(BigInteger, ForeignKey("parent.id",
                                                       ondelete="CASCADE"),
                                            nullable=True)
parent = relationship("Parent", back_populates="children")

字符串
问题是,当我试图在一个Parent和10 k个children之间建立一对多的关系时,当对一个真实的PostgreSQL数据库示例运行时,刷新操作需要2分钟。(数据库在高性能服务器上运行,我的笔记本电脑连接非常好。)
下面是我用于性能测试的代码:

session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)
session.flush() # fast
session.commit() # fast

children = []
for i in range(10000):
    child = Child()
    children.append(child)
session.add_all(children)
session.flush() # fast
session.commit() # fast

parent.children.extend(children)
for child in children:
    child.parent = parent
session.add(parent)
session.add_all(children)
session.flush() # super slow
session.commit() # super slow


对我来说奇怪的是,我可以在不到一秒钟的时间内创建和刷新10 k个孩子,但是当我需要建立一个简单的一对多关系时,需要2分钟。
我拒绝相信这是正常的。

**编辑1:**我将测试脚本改为这样,现在它快多了:

session_maker = sessionmaker(autocommit=False, autoflush=False)
session = session_maker()

parent = Parent()
session.add(parent)

children = []
for _ in range(n_children):
    child = Child()
    children.append(child)
session.add_all(children)

assert len(children) == n_children

for child in children:
    child.parent = parent
parent.children.extend(children)
session.add(parent)
session.add_all(children)

# FLUSHING and COMMITTING:
start_time = time.time()

session.flush() # fast
session.commit()

print(time.time() - start_time)

for i in range(0, n_children - 1):
    assert children[i].id + 1 == children[i + 1].id
    assert children[i].parent == parent
assert len(parent.children) == n_children

session.close()


正如你所看到的,而不是每次冲水,我只在最后冲水一次。有人能解释一下为什么这会更快吗?这也会在我的代码中导致一些问题,直到刷新ID没有定义。

**编辑2:**当在Parent和Child之间建立一对多关系时,我们正在更新Child表的parent_id列。因此,我们实际上是在尝试执行10 k UPDATE操作。由于某种原因,SQLAlchemy按顺序执行这些操作,因此来回访问数据库10 k次。如果我使用本地数据库,执行时间要快得多(但仍然很慢),大约10秒,这一事实加强了这一理论。是否有一种方法可以并行执行更新?我在看session.bulk_update_mappings,但执行时间仍然相同。也许有一些发动机的参数

drkbr07n

drkbr07n1#

在子对象与父对象关联之前持久化子对象时,将行插入到子表中,parent_id为NULL。也就是说,

num_children = 3

engine.echo = True
t0 = time.perf_counter()
with Session(engine) as sess:
    all_children = [Child() for i in range(num_children)]
    sess.add_all(all_children)
    sess.flush()

字符串
生成

INSERT INTO child (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1), (%(parent_id__2)s, 2)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING child.id, child.id AS id__1
[generated in 0.00014s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': None, 'parent_id__1': None, 'parent_id__2': None}


然后,当您将它们与父行关联时,SQLAlchemy必须使用实际的parent_id更新所有这些子行。

sess.add(
        Parent(name="parent_a", children=all_children)
    )
    sess.commit()


SQLAlchemy创建父行

INSERT INTO parent (name) VALUES (%(name)s) RETURNING parent.id
[generated in 0.00045s] {'name': 'parent_a'}


然后调用DBAPI .executemany()方法更新子行

UPDATE child SET parent_id=%(parent_id)s WHERE child.id = %(child_id)s
[generated in 0.00038s] [{'parent_id': 1, 'child_id': 1}, {'parent_id': 1, 'child_id': 2}, {'parent_id': 1, 'child_id': 3}]


但Wireshark跟踪显示,psycopg2实际上发送给服务器的是每行的单独UPDATE语句

UPDATE child SET parent_id=1 WHERE child.id = 1
UPDATE child SET parent_id=1 WHERE child.id = 2
UPDATE child SET parent_id=1 WHERE child.id = 3


即,每个子行一次往返。这对许多儿童来说显然是缓慢的。
相反,将新创建的子对象与父对象 before flushing关联,这将创建父对象,然后使用适当的parent_id创建每个子对象

with Session(engine) as sess:
    all_children = [Child() for i in range(num_children)]
    sess.add(
        Parent(name="parent_a", children=all_children)
    )
    sess.commit()


产生

INSERT INTO parent (name) VALUES (%(name)s) RETURNING parent.id
[generated in 0.00045s] {'name': 'parent_a'}
INSERT INTO child (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1), (%(parent_id__2)s, 2)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING child.id, child.id AS id__1
[generated in 0.00017s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': 1, 'parent_id__1': 1, 'parent_id__2': 1}

baubqpgj

baubqpgj2#

它们做同样的事情,但第一个必须为每个孩子执行UPDATE,而第二个不需要这样做。

with Session(engine) as session:
    # Uses INSERT statement to create parent, gets id back with RETURNING
    parent = Parent()
    session.add(parent)
    session.flush()

    # Uses INSERT statement to create childs
    childs = [Child(), Child()]
    session.add_all(childs)
    session.flush()

    # Uses UPDATE statement to set parent_id
    parent.children.extend(childs)
    session.flush()
    session.commit()

with Session(engine) as session:
    # Uses INSERT statement to create parent, gets id back with RETURNING
    parent = Parent()
    session.add(parent)
    session.flush()

    # Uses INSERT statement to create childs (with parent_id already set)
    childs = [Child(parent=parent), Child(parent=parent)]
    session.add_all(childs)
    session.flush()

    session.commit()

字符串
这里是echo输出,可能很难阅读,但你可以在那里看到插入和更新。

2023-07-26 14:28:00,947 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2023-07-26 14:28:00,952 INFO sqlalchemy.engine.Engine INSERT INTO parents DEFAULT VALUES RETURNING parents.id
2023-07-26 14:28:00,953 INFO sqlalchemy.engine.Engine [generated in 0.00060s] {}
2023-07-26 14:28:00,961 INFO sqlalchemy.engine.Engine INSERT INTO childs (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING childs.deleted, childs.id, childs.id AS id__1
2023-07-26 14:28:00,962 INFO sqlalchemy.engine.Engine [generated in 0.00029s (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': None, 'parent_id__1': None}
2023-07-26 14:28:00,980 INFO sqlalchemy.engine.Engine SELECT childs.deleted AS childs_deleted, childs.id AS childs_id, childs.parent_id AS childs_parent_id 
FROM childs 
WHERE %(param_1)s = childs.parent_id AND childs.deleted IS NULL
2023-07-26 14:28:00,980 INFO sqlalchemy.engine.Engine [generated in 0.00070s] {'param_1': 1}
2023-07-26 14:28:00,989 INFO sqlalchemy.engine.Engine UPDATE childs SET parent_id=%(parent_id)s WHERE childs.id = %(childs_id)s
2023-07-26 14:28:00,990 INFO sqlalchemy.engine.Engine [generated in 0.00062s] [{'parent_id': 1, 'childs_id': 1}, {'parent_id': 1, 'childs_id': 2}]
2023-07-26 14:28:00,993 INFO sqlalchemy.engine.Engine COMMIT
2023-07-26 14:28:01,025 INFO sqlalchemy.engine.Engine BEGIN (implicit)
2023-07-26 14:28:01,026 INFO sqlalchemy.engine.Engine INSERT INTO parents DEFAULT VALUES RETURNING parents.id
2023-07-26 14:28:01,026 INFO sqlalchemy.engine.Engine [cached since 0.07392s ago] {}
2023-07-26 14:28:01,032 INFO sqlalchemy.engine.Engine INSERT INTO childs (parent_id) SELECT p0::INTEGER FROM (VALUES (%(parent_id__0)s, 0), (%(parent_id__1)s, 1)) AS imp_sen(p0, sen_counter) ORDER BY sen_counter RETURNING childs.deleted, childs.id, childs.id AS id__1
2023-07-26 14:28:01,032 INFO sqlalchemy.engine.Engine [cached since 0.07063s ago (insertmanyvalues) 1/1 (ordered)] {'parent_id__0': 2, 'parent_id__1': 2}
2023-07-26 14:28:01,035 INFO sqlalchemy.engine.Engine COMMIT

相关问题