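Because of the GIL, multithreading in CPython cannot use multiple CPUs in parallel. To get around this limitation, we can use multiprocessing. I am writing Python code to demonstrate this. Here is my code: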

from math import sqrt
from time import time
from threading import Thread
from multiprocessing import Process

def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume}")
        return wrapper
    return deco

def calc_sqrt():
    """Consume the CPU"""
    i = 2147483647
    for j in range(20 * 1000 * 1000):
        i -= 1
        sqrt(i)

@time_recorder("one by one")
def one_by_one():
    for _ in range(8):
        calc_sqrt()

@time_recorder("multi-threading")
def multi_thread():
    t_list = list()
    for i in range(8):
        t = Thread(name=f'worker-{i}', target=calc_sqrt)
        t.start()
        t_list.append(t)
    for t in t_list:
        t.join()

@time_recorder("multi-processing")
def multi_process():
    p_list = list()
    for i in range(8):
        p = Process(name=f"worker-{i}", target=calc_sqrt)
        p.start()
        p_list.append(p)
    for p in p_list:
        p.join()

def main():
    one_by_one()

    print('-' * 40)
    multi_thread()

    print('-' * 40)
    multi_process()

if __name__ == '__main__':
    main()

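The function "calc_sqrt()" is the CPU-bound job; it computes 20 million square roots. The decorator "time_recorder" measures the running time of the decorated function. Three functions run the CPU-bound job one by one, in multiple threads, and in multiple processes, respectively.
Running the code above on my laptop, I got the following output: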

Run one by one
Time consumption of one by one: 39.31295585632324
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 39.36112403869629
----------------------------------------
Run multi-processing
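Time consumption of multi-processing: 23.380358457565308

The time consumption of "one by one" and "multi-threading" is almost the same, as expected. But the time consumption of "multi-processing" is a bit confusing. My laptop has an Intel Core i5-7300U CPU, which has 2 cores and 4 threads. Task Manager shows that my computer has 4 (logical) CPUs, and it also shows that all 4 CPUs were at 100% usage during execution. Yet the processing time did not drop to 1/4; it only dropped to about 1/2. Why? My laptop runs Windows 10 64-bit.
Later, I tried the program on a Linux virtual machine and got the following output, which is more reasonable: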

Run one by one
Time consumption of one by one: 33.78603768348694
----------------------------------------
Run multi-threading
Time consumption of multi-threading: 34.396817684173584
----------------------------------------
Run multi-processing
Time consumption of multi-processing: 8.470374584197998

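This time, the multi-processing time dropped to about 1/4 of the multi-threading time. The host of this Linux server has an Intel Xeon E5-2670 with 8 cores and 16 threads. The host OS is CentOS 7; the VM is assigned 4 vCPUs and runs Debian 10.
The questions are:

Why didn't the processing time of the multi-processing job drop to 1/4 on my laptop, but only to 1/2?
Is it a CPU issue? That is, are the 4 threads of the Intel Core i5-7300U not "truly parallel", so that they may affect each other, while the Intel Xeon E5-2670 does not have this issue?
Or is it an OS issue? That is, does Windows 10 not support multiprocessing well, so that processes may affect each other when running in parallel?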

1 Answer

As @Pingu said in the comments, the speedup depends heavily on the number of cores in the machine. Your machine has only two physical cores (4 hardware threads), some of which may be partly occupied by operating system threads. A machine with more cores not only performs better under multiprocessing, but OS bookkeeping also takes up a smaller share of its total CPU capacity, so it has less impact on performance.
Here is an implementation of the test code that lets you change the number of threads/processes executing the N_TASKS concurrent calls:

from math import sqrt
from time import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

N_WORKERS = 8
N_TASKS = 32

def time_recorder(job_name):
    """Record time consumption of running a function"""
    def deco(func):
        def wrapper(*args, **kwargs):
            print(f"Run {job_name}")
            start_epoch = time()
            out = func(*args, **kwargs)
            end_epoch = time()
            time_consume = end_epoch - start_epoch
            print(f"Time consumption of {job_name}: {time_consume:.6}s")
            return out
        return wrapper
    return deco

def calc_sqrt(_):
    i = 2147483647
    for _ in range(5 * 1000 * 1000):
        i -= 1
        sqrt(i)

@time_recorder("one by one")
def one_by_one():
    _ = [calc_sqrt(_) for _ in range(N_TASKS)]

@time_recorder("multi-threading")
def multi_thread():
    with ThreadPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS))

@time_recorder("multi-processing")
def multi_process():
    with ProcessPoolExecutor(max_workers=N_WORKERS) as e:
        _ = e.map(calc_sqrt, range(N_TASKS), chunksize=1)

def main():
    one_by_one()
    print('-' * 40)
    multi_thread()
    print('-' * 40)
    multi_process()

if __name__ == '__main__':
    main()

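On my machine (M1 Pro MacBook Pro 14"), here are the approximate timings for different numbers of threads/processes:
| Number of threads/processes | Sequential | Multi-threading | Multi-processing |
| --- | --- | --- | --- |
| 1 | 10s | 10s | 10s |
| 2 | 10s | 10s | 5.5s |
| 4 | 10s | 10s | 2.8s |
| 6 | 10s | 10s | 2.2s |
| 8 | 10s | 10s | 1.8s |
| 10 | 10s | 10s | 1.8s |
| 12 | 10s | 10s | 1.8s |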
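As you can see, with the multi-processing variant, performance scales roughly in proportion to the number of cores, which is about what you observed on your machines: nearly 2x on the 2-core machine and nearly 4x on the 4-core machine.
You can see saturation at 8 cores (no improvement with 10 concurrent processes), which suggests that my machine probably has 8 physical cores.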
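To put numbers on "nearly 2x" and "nearly 4x", here is a minimal sketch that computes the speedup and the parallel efficiency from the timings posted in the question. The helper names `speedup` and `efficiency` are made up for this illustration, and the timings are the question's figures rounded to two decimals.

```python
def speedup(sequential_s: float, parallel_s: float) -> float:
    """How many times faster the parallel run was than the sequential one."""
    return sequential_s / parallel_s


def efficiency(sequential_s: float, parallel_s: float, n_cpus: int) -> float:
    """Fraction of the ideal n_cpus-fold speedup that was actually achieved."""
    return speedup(sequential_s, parallel_s) / n_cpus


# (one-by-one seconds, multi-processing seconds, logical CPUs), from the question.
runs = {
    "Windows laptop, 4 logical CPUs": (39.31, 23.38, 4),
    "Linux VM, 4 vCPUs": (33.79, 8.47, 4),
}

for name, (seq, par, n) in runs.items():
    print(f"{name}: speedup {speedup(seq, par):.2f}x, "
          f"efficiency {efficiency(seq, par, n):.0%}")
```

An efficiency close to 100% means every counted CPU behaved like a full core for this workload; a much lower value suggests that some of the counted CPUs are not independent cores.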
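Note that there is a difference between physical CPU cores and hardware threads (also known as hyper-threading). The Core i5-7300U has 4 hardware threads, but that is not equivalent to a machine with 4 (physical) cores. Hyper-threading can improve a CPU's multiprocessing performance, but usually by less than adding more physical cores would. For example, Intel claims a 15% to 30% performance gain from hyper-threading, which is far from the 2x improvement you might imagine when reading "2 cores / 4 threads" on the CPU spec sheet.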
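As a rough illustration of that arithmetic, the sketch below prints the CPU count Python sees and estimates a best-case speedup range for a hyper-threaded CPU. `expected_speedup_range` is a hypothetical helper written for this answer; it assumes perfect scaling across physical cores plus the 15% to 30% figure quoted above, so treat the result as an optimistic ceiling, not a prediction.

```python
import os


def expected_speedup_range(physical_cores: int,
                           smt_gain_low: float = 0.15,
                           smt_gain_high: float = 0.30) -> tuple[float, float]:
    """Rough best-case speedup bounds for a CPU-bound job on a hyper-threaded CPU.

    Assumes perfect scaling across physical cores plus the quoted
    hyper-threading gain; real runs are usually lower.
    """
    return (physical_cores * (1 + smt_gain_low),
            physical_cores * (1 + smt_gain_high))


# os.cpu_count() reports logical CPUs (hardware threads), not physical cores.
print(f"Logical CPUs seen by Python: {os.cpu_count()}")

# Assumed input: the 2-core / 4-thread laptop from the question.
low, high = expected_speedup_range(physical_cores=2)
print(f"Rough speedup ceiling: {low:.1f}x to {high:.1f}x, not 4x")
```

The standard library only exposes the logical count; finding the number of physical cores needs a platform tool or a third-party package. The speedup measured on the laptop in the question is below even this range, which would be consistent with other threads on the system taking part of the CPU time.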

2023-02-28
