Pitfalls in Setting Up a Deep Learning Environment on Ubuntu

Notes on the pitfalls I ran into while configuring a workstation.

GPU Driver & CUDA

1 Common GPU driver problems

  • nvidia-smi errors

    • Failed to initialize NVML: Driver/library version mismatch

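      • Rebooting fixes it. The error appears when the driver packages were updated while the old kernel module is still loaded.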
    • NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

      • Check the installed driver version (the source directory in /usr/src/ is named nvidia-<version>)

        $ ls /usr/src/

      • Build and install the driver module of that version for the running kernel

        $ sudo dkms install -m nvidia -v 460.39

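      • If that fails, it is usually because the kernel headers are not installed: Error! Your kernel headers for kernel 4.15.0-135-generic cannot be found.

        $ sudo apt-get upgrade linux-image-generic # upgrade the kernel image (check the running version with uname -r)

        $ sudo apt-get install linux-headers-$(uname -r) # install the headers matching the running kernel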

    • nvidia-smi runs slowly

      • Enable persistence mode so the driver stays loaded between invocations

        $ sudo nvidia-smi -pm 1
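
The checks in the driver section above can be gathered into a small helper. This is only a minimal sketch: the function names (nvidia_src_versions, headers_present, suggest_fix) are made up for illustration, and it assumes the usual Ubuntu layout where DKMS source trees live in /usr/src/nvidia-<version> and kernel headers in /usr/src/linux-headers-<release>. It prints the commands to run and executes nothing with sudo itself.

```shell
#!/bin/sh
# Sketch: collect the facts needed for the dkms rebuild.
# All helper names are hypothetical; the /usr/src layout is an assumption.

# Print the version of every nvidia-<version> source tree under a directory.
nvidia_src_versions() {
    src_dir="${1:-/usr/src}"
    for d in "$src_dir"/nvidia-*; do
        [ -d "$d" ] || continue          # skips the unexpanded glob too
        basename "$d" | sed 's/^nvidia-//'
    done
}

# Succeed if the headers for the given kernel release are installed.
headers_present() {
    release="$1"
    src_dir="${2:-/usr/src}"
    [ -d "$src_dir/linux-headers-$release" ]
}

# Print (not run) the commands that would repair the driver module.
suggest_fix() {
    release="$1"
    src_dir="${2:-/usr/src}"
    if ! headers_present "$release" "$src_dir"; then
        echo "sudo apt-get install linux-headers-$release"
    fi
    for v in $(nvidia_src_versions "$src_dir"); do
        echo "sudo dkms install -m nvidia -v $v"
    done
}
```

On the affected machine it would be called as suggest_fix "$(uname -r)"; the printed commands can then be reviewed before running them.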

2 Conda environment problems

  • CondaHTTPError: HTTP 000 CONNECTION FAILED for url

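    • This appears to be a problem with the configured channels (mirrors)
      • Restore the default channels by simply deleting .condarc
      • Switch to the USTC mirror
      • Change https to http in the channel URLs (reported to work, but I have not tried it)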
  • The environment is inconsistent, please check the package plan carefully

    • Update all packages

      conda update --all

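    • If that does not help, use the command below. It installs a lot of packages, and I have not figured out why; uninstalling numpy afterwards removes many of the unneeded ones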

      conda install anaconda

  • When you hit an inexplicable problem, try updating conda

    conda update -n base conda

  • pkg_resources.DistributionNotFound: The 'tensorboard-data-server<0.7.0,>=0.6.0' distribution was not found and is required by tensorboard

    • Update all packages; updating tensorboard alone may be enough
  • Clean up unused packages

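    conda clean -p      # remove unused packages from the package cache
    conda clean -t      # remove cached package tarballs
    conda clean -y -a   # remove all cached packages, tarballs, and the index cache without prompting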
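
The three channel fixes listed under CondaHTTPError above can be sketched as shell helpers. Treat this as an illustration under assumptions, not a tested recipe: the function names are hypothetical, the reset moves .condarc aside instead of deleting it so that it can be restored, and the mirror URL in the comment is the commonly cited USTC address, which should be checked against the mirror's own help page.

```shell
#!/bin/sh
# Sketch of the .condarc fixes; all helper names are hypothetical.

# Fix 1: fall back to the default channels by moving .condarc aside.
condarc_reset() {
    rc="${1:-$HOME/.condarc}"
    [ -f "$rc" ] || return 0             # nothing to reset
    mv "$rc" "$rc.bak"
}

# Fix 2: add a mirror channel, for example (URL is an assumption):
#   condarc_add_channel https://mirrors.ustc.edu.cn/anaconda/pkgs/main/
condarc_add_channel() {
    conda config --add channels "$1"
    conda config --set show_channel_urls yes
}

# Fix 3 (reported to work, not tried in these notes):
# rewrite https:// to http:// in the channel URLs, keeping a backup.
condarc_https_to_http() {
    rc="${1:-$HOME/.condarc}"
    [ -f "$rc" ] || { echo "no such file: $rc" >&2; return 1; }
    cp "$rc" "$rc.bak"
    sed 's#https://#http://#g' "$rc.bak" > "$rc"
}
```

Each helper accepts an optional path, so it can be tried on a copy of .condarc before touching the real one.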

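3 Running scripts at boot

  • Add the startup commands to the rc.local file

    • rc.local is a script that Ubuntu runs automatically at boot, so any shell commands added to it are executed at startup. It lives in /etc/ and can only be modified with root privileges.
      The script looks like this: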
    #!/bin/sh -e
    #
    # rc.local
    #
    # This script is executed at the end of each multiuser runlevel.
    # Make sure that the script will "exit 0" on success or any other
    # value on error.
    #
    # In order to enable or disable this script just change the execution
    # bits.
    #
    # By default this script does nothing.

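    # Task scripts
    # Run the startup script automatically
    echo "If you can read this line, the startup script was added successfully. 2021-05-12" > /usr/local/test.log
    /home/user/start.sh

    # Open a MATE terminal and run the script inside it
    mate-terminal -x /home/myname/mysetup.sh

    # End of task scripts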
    exit 0
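
Editing rc.local by hand is easy to get wrong, because a command placed after exit 0 never runs. A minimal sketch of a helper that inserts a command before exit 0 might look like the following. The name add_to_rc_local is hypothetical, and the sketch assumes the file contains a line that is exactly exit 0. On recent Ubuntu releases /etc/rc.local may not exist by default and must be executable to be run at boot, so the helper creates it if it is missing and sets the execute bit.

```shell
#!/bin/sh
# Sketch: insert a command before "exit 0" in an rc.local-style file.
# add_to_rc_local is a hypothetical helper; run it as root for /etc/rc.local.
add_to_rc_local() {
    cmd="$1"
    rc="${2:-/etc/rc.local}"

    # Create a minimal file if none exists yet.
    [ -f "$rc" ] || printf '#!/bin/sh -e\nexit 0\n' > "$rc"

    # Do nothing if the exact line is already present.
    grep -qxF -e "$cmd" "$rc" && return 0

    _rc_tmp="$(mktemp)"
    # Print the command just before the first "exit 0" line,
    # or append it at the end if there is no such line.
    CMD="$cmd" awk '
        /^exit 0$/ && !inserted { print ENVIRON["CMD"]; inserted = 1 }
        { print }
        END { if (!inserted) print ENVIRON["CMD"] }
    ' "$rc" > "$_rc_tmp"
    cat "$_rc_tmp" > "$rc"               # keeps the original file's owner
    rm -f "$_rc_tmp"
    chmod +x "$rc"
}
```

For the script above, the call would be sudo-run as add_to_rc_local /home/user/start.sh; calling it twice with the same command leaves the file unchanged.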