tensorflow: How do I unload the NVIDIA kernel module 'nvidia' to install a new driver?

wljmcqd8  posted on 2023-02-24  in  Other

I need to upgrade my NVIDIA driver, so I tried running the NVIDIA-Linux-x86_64.run file.
However, I get the following message:

ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

I have already unloaded nvidia-drm, but when I try to unload nvidia I get:

$ sudo modprobe -r nvidia
modprobe: FATAL: Module nvidia is in use.

Can anyone tell me how to install the new driver without running into this problem?
Thanks

kse8i1jr1#

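Use lsof /dev/nvidia* to find the processes that are still using the old driver. In my case it was "nvidia-persistenced". Kill that process by its PID, then run the NVIDIA-***.run installer again.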

# lsof /dev/nvidia*
COMMAND    PID                USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nvidia-pe 1334 nvidia-persistenced    2u   CHR 195,255      0t0  420 /dev/nvidiactl
nvidia-pe 1334 nvidia-persistenced    3u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    5u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    6u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    7u   CHR   195,0      0t0  421 /dev/nvidia0
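
To make the "kill by PID" step concrete, here is a minimal Python sketch that pulls the PID column out of lsof text like the output above. The function name pids_from_lsof and the shortened SAMPLE string are illustrative assumptions, not part of any tool; the sketch only parses text and does not kill anything.

```python
def pids_from_lsof(lsof_output):
    """Collect the PID column from lsof text output, skipping the header row."""
    pids = set()
    for line in lsof_output.splitlines():
        fields = line.split()
        # A data row looks like: COMMAND PID USER FD TYPE DEVICE ...
        # The header row has "PID" in that position, so isdigit() filters it out.
        if len(fields) >= 2 and fields[1].isdigit():
            pids.add(int(fields[1]))
    return pids


# Two rows copied from the output above, used here as sample input.
SAMPLE = """\
COMMAND    PID                USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nvidia-pe 1334 nvidia-persistenced    2u   CHR 195,255      0t0  420 /dev/nvidiactl
nvidia-pe 1334 nvidia-persistenced    3u   CHR   195,0      0t0  421 /dev/nvidia0
"""

print(pids_from_lsof(SAMPLE))  # {1334}
```

Each PID it returns is worth checking before stopping the process. If the process is a service such as nvidia-persistenced, stopping it through the service manager may be cleaner than killing it, assuming the distribution runs it as a service.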

nhhxz33t2#

I simply removed the existing driver and then reinstalled.

ioekq8ef3#

I wrote a Python script for this:
unload_all_nvidia_modules.py

import os
from subprocess import run, getoutput


def get_all_nvidia_modules():
    """Return loaded NVIDIA modules plus any modules that depend on them."""
    modules_to_unload = set()
    # Skip the "Module  Size  Used by" header printed by lsmod.
    for line in getoutput("lsmod").splitlines()[1:]:
        fields = line.split()
        if not fields:
            continue

        module_name = fields[0]
        # The optional 4th column lists the modules that use this one,
        # i.e. its dependents, which have to be unloaded as well.
        users = fields[3].split(",") if len(fields) == 4 else []

        if "nvidia" in module_name:
            modules_to_unload.add(module_name)
            modules_to_unload.update(u for u in users if u)

    return modules_to_unload


def get_usage_pids(pattern):
    """Return the PIDs of processes whose lsof entries mention pattern."""
    own_pid = str(os.getpid())
    pids = set()
    for line in getoutput("lsof").splitlines():
        if pattern in line:
            fields = line.split()
            # The second lsof column is the PID; never kill this script itself.
            if len(fields) > 1 and fields[1].isdigit() and fields[1] != own_pid:
                pids.add(fields[1])

    return pids


def unload_all_nvidia_modules():
    # rmmod fails while a module is still in use, so retry a bounded number of times.
    for _ in range(100):
        modules = get_all_nvidia_modules()
        if not modules:
            break

        for m in modules:
            for pid in get_usage_pids(m):
                # kill takes a PID; killall expects a process name.
                run(["kill", "-9", pid])

            run(["rmmod", m])


if __name__ == "__main__":
    unload_all_nvidia_modules()

Usage:
sudo python3 unload_all_nvidia_modules.py
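
The script above collects modules into an unordered set and relies on retries, because rmmod refuses to remove a module that another loaded module still uses. As a hedged alternative sketch, the "Used by" column of lsmod can be used to compute an order in which dependents come first. The function name unload_order and the SAMPLE_LSMOD text (including its sizes and counts) are made up for illustration; real lsmod output will differ.

```python
def unload_order(lsmod_output):
    """Order NVIDIA modules so that every module comes after the modules using it."""
    users = {}
    for line in lsmod_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if not fields or "nvidia" not in fields[0]:
            continue
        # 4th column, when present: comma-separated modules that use this one.
        used_by = fields[3].split(",") if len(fields) > 3 else []
        users[fields[0]] = [u for u in used_by if u]

    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for user in users.get(name, []):
            visit(user)      # dependents must be removed first
        order.append(name)   # then the module itself

    for name in sorted(users):
        visit(name)
    return order


# Illustrative lsmod-style text; the numbers are invented.
SAMPLE_LSMOD = """\
Module                  Size  Used by
nvidia_uvm           1200128  0
nvidia_drm             69632  2
nvidia_modeset       1200128  1 nvidia_drm
nvidia              56000000  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        200000  1 nvidia_drm
"""

print(unload_order(SAMPLE_LSMOD))
# ['nvidia_uvm', 'nvidia_drm', 'nvidia_modeset', 'nvidia']
```

Removing modules in that order avoids most "Module ... is in use" failures caused by other modules, although processes that hold the device files open still have to be stopped first.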
