python 小集合如何存储在内存中?

nwnhqdif  于 2023-10-14  发布在  Python
关注(0)|答案(1)|浏览(108)

如果我们看看50 k元素下的集合的调整大小行为:

>>> import sys
>>> s = set()
>>> seen = {}
>>> for i in range(50_000):
...     size = sys.getsizeof(s)
...     if size not in seen:
...         seen[size] = len(s)
...         print(f"{size=} {len(s)=}")
...     s.add(i)
... 
size=216 len(s)=0
size=728 len(s)=5
size=2264 len(s)=19
size=8408 len(s)=77
size=32984 len(s)=307
size=131288 len(s)=1229
size=524504 len(s)=4915
size=2097368 len(s)=19661

这种模式与当集合满了3/5时将后备存储大小增加到四倍,加上PySetObject的一些可能恒定的开销是一致的:

>>> for i in range(9, 22, 2):
...     print(2**i + 216)
... 
728
2264
8408
32984
131288
524504
2097368

即使对于更大的集合,类似的模式也会继续,但是调整大小的因子切换到加倍而不是四倍。
报告的小集合的大小是离群值。而不是大小为344字节,即16 * 8 + 216(新创建的空集的存储阵列在第一次调整大小为32个插槽之前有8个插槽可用)sys.getsizeof仅报告216个字节。
我错过了什么?这些小集合是如何存储的,以便它们只使用216字节而不是344字节?

mrfwxfqh

mrfwxfqh1#

Python中的set对象由以下 C 结构表示:

typedef struct {
    PyObject_HEAD

    Py_ssize_t fill;            /* Number of active and dummy entries*/
    Py_ssize_t used;            /* Number of active entries */

    /* The table contains mask + 1 slots, and that's a power of 2.
     * We store the mask instead of the size because the mask is more
     * frequently needed.
     */
    Py_ssize_t mask;

    /* The table points to a fixed-size smalltable for small tables
     * or to additional malloc'ed memory for bigger tables.
     * The table pointer is never NULL which saves us from repeated
     * runtime null-tests.
     */
    setentry *table;
    Py_hash_t hash;             /* Only used by frozenset objects */
    Py_ssize_t finger;          /* Search finger for pop() */

    setentry smalltable[PySet_MINSIZE];
    PyObject *weakreflist;      /* List of weak references */
} PySetObject;

现在请记住,getsizeof()调用对象的__sizeof__方法,并在对象由垃圾收集器管理的情况下添加额外的垃圾收集器开销。
好的,set实现了__sizeof__

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

现在让我们检查一下生产线

res = _PyObject_SIZE(Py_TYPE(so));

_PyObject_SIZE只是一个扩展为(typeobj)->tp_basicsize的宏。

#define _PyObject_SIZE(typeobj) ( (typeobj)->tp_basicsize )

这段代码实际上是试图访问tp_basicsize插槽,以获取类型为set的示例的字节大小,在set的情况下,该类型为sizeof(PySetObject)

PyTypeObject PySet_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "set",                              /* tp_name */
    sizeof(PySetObject),                /* tp_basicsize */
    0,                                  /* tp_itemsize */
    # Skipped rest of the code for brevity.

我修改了set_sizeof C函数,做了以下修改:

static PyObject *
set_sizeof(PySetObject *so, PyObject *Py_UNUSED(ignored))
{
    Py_ssize_t res;

    unsigned long py_object_head_size = sizeof(so->ob_base); // Because PyObject_HEAD expands to PyObject ob_base;
    unsigned long fill_size = sizeof(so->fill);
    unsigned long used_size = sizeof(so->used);
    unsigned long mask_size = sizeof(so->mask);
    unsigned long table_size = sizeof(so->table);
    unsigned long hash_size = sizeof(so->hash);
    unsigned long finger_size = sizeof(so->finger);
    unsigned long smalltable_size = sizeof(so->smalltable);
    unsigned long weakreflist_size = sizeof(so->weakreflist);
    int is_using_fixed_size_smalltables = so->table == so->smalltable;

    printf("| PySetObject Fields   | Size(bytes) |\n");
    printf("|------------------------------------|\n");
    printf("|    PyObject_HEAD     |     '%zu'    |\n", py_object_head_size);
    printf("|      fill            |      '%zu'    |\n", fill_size);
    printf("|      used            |      '%zu'    |\n", used_size);
    printf("|      mask            |      '%zu'    |\n", mask_size);
    printf("|      table           |      '%zu'    |\n", table_size);
    printf("|      hash            |      '%zu'    |\n", hash_size);
    printf("|      finger          |      '%zu'    |\n", finger_size);
    printf("|    smalltable        |    '%zu'    |\n", smalltable_size);
    printf("|    weakreflist       |      '%zu'    |\n", weakreflist_size);
    printf("-------------------------------------|\n");
    printf("|       Total          |    '%zu'    |\n", py_object_head_size+fill_size+used_size+mask_size+table_size+hash_size+finger_size+smalltable_size+weakreflist_size);
    printf("\n");
    printf("Total size of PySetObject '%zu' bytes\n", sizeof(PySetObject));
    printf("Has set resized: '%s'\n", is_using_fixed_size_smalltables ? "No": "Yes");
    if(!is_using_fixed_size_smalltables) {
        printf("Size of malloc'ed table: '%zu' bytes\n", (so->mask + 1) * sizeof(setentry));
    }

    res = _PyObject_SIZE(Py_TYPE(so));
    if (so->table != so->smalltable)
        res = res + (so->mask + 1) * sizeof(setentry);
    return PyLong_FromSsize_t(res);
}

编译和运行这些变化,

>>> import sys
>>>
>>> set_ = set()
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'No'
216
>>> set_.add(1)
>>> set_.add(2)
>>> set_.add(3)
>>> set_.add(4)
>>> set_.add(5)
>>> sys.getsizeof(set_)
| PySetObject Fields   | Size(bytes) |
|------------------------------------|
|    PyObject_HEAD     |     '16'    |
|      fill            |      '8'    |
|      used            |      '8'    |
|      mask            |      '8'    |
|      table           |      '8'    |
|      hash            |      '8'    |
|      finger          |      '8'    |
|    smalltable        |    '128'    |
|    weakreflist       |      '8'    |
-------------------------------------|
|       Total          |    '200'    |

Total size of PySetObject '200' bytes
Has set resized: 'Yes'
Size of malloc'ed table: '512' bytes
728

返回值为216/728字节,因为sys.getsize增加了16字节的GC开销。
但这里需要注意的是这条线。

|    smalltable        |    '128'    |

因为对于小表(在第一次调整大小之前)so->table只是对固定大小(8so->smalltable(没有分配内存)的引用,所以sizeof(PySetObject)足以获得大小,因为它还包括存储大小(128(16(size of setentry) * 8))。
现在,当调整大小发生时会发生什么?它构造了一个全新的表(malloc'艾德),并使用该表代替so->smalltables。这意味着调整了大小的集合也执行了128字节的死重(固定大小的小表的大小)沿着malloc的so->table的大小。

else {
        newtable = PyMem_NEW(setentry, newsize);
        if (newtable == NULL) {
            PyErr_NoMemory();
            return -1;
        }
    }

    /* Make the set empty, using the new table. */
    assert(newtable != oldtable);
    memset(newtable, 0, sizeof(setentry) * newsize);
    so->mask = newsize - 1;
    so->table = newtable;

相关问题