HAMi installed successfully and services run normally, but GPU compute cannot be split #606

Open
ak47947 opened this issue Nov 11, 2024 · 4 comments
Labels
kind/bug Something isn't working

ak47947 commented Nov 11, 2024

What happened:
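I set up the Kubernetes GPU environment with GPU Operator and then installed the HAMi plugin. The services installed normally, but the GPU count still shows 1, and the GPU is not split inside containers either.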

What you expected to happen:
The GPU count should show 10 shares (the default), and resources inside the container should be limited.
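
A minimal sketch of how the expected and observed counts could be compared, assuming the node JSON comes from `kubectl get node <name> -o json` and that the shared GPU is advertised under the resource name `nvidia.com/gpu` (both are assumptions, not stated in this report). The split count of 10 is the default mentioned above; the function names are hypothetical.

```python
# Assumption (not from the issue): the GPU resource is advertised as
# "nvidia.com/gpu" under status.allocatable in the node object.
GPU_RESOURCE = "nvidia.com/gpu"


def advertised_gpus(node: dict, resource: str = GPU_RESOURCE) -> int:
    """Return how many GPU units the node advertises as allocatable."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return int(allocatable.get(resource, "0"))


def is_split(node: dict, physical_gpus: int, split_count: int = 10) -> bool:
    """True when the node advertises physical_gpus * split_count units."""
    return advertised_gpus(node) == physical_gpus * split_count


# Hand-written sample that mirrors the symptom in this report:
# one physical GPU, and the node still advertises a single unit.
sample_node = {"status": {"allocatable": {"nvidia.com/gpu": "1"}}}
print(advertised_gpus(sample_node))            # prints 1
print(is_split(sample_node, physical_gpus=1))  # prints False
```

With splitting active on a single-GPU node, the same check would be expected to see 10 units rather than 1.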

How to reproduce it (as minimally and precisely as possible):
Set up the Kubernetes GPU environment with GPU Operator.

Anything else we need to know?:

  1. After installation, the services run normally
     [image 1]
  2. The GPU is not split
     [image 2]
  • The output of nvidia-smi -a on your host
    [image 4]
  • Your docker or containerd configuration file (e.g: /etc/docker/daemon.json)
    The configuration is fine.

  • The hami-device-plugin container logs
    [image 3]
  • The hami-scheduler container logs
    [image 5]
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)
  • Any relevant kernel output lines from dmesg

Environment:

  • HAMi version: v2.4.0
  • nvidia driver or other AI device driver version: see the screenshot above
  • Docker version from docker version : Docker version 20.10.24, build 297e128
  • Docker command, image and tag used
  • Kernel version from uname -a : ubuntu 5.15.0-124-generic
  • Others:
ak47947 added the kind/bug label Nov 11, 2024

archlitchi (Collaborator) commented

have you uninstalled nvidia-k8s-device-plugin before installing HAMi?

ak47947 (Author) commented Nov 12, 2024

have you uninstalled nvidia-k8s-device-plugin before installing HAMi?

[image]
It is already installed. Does it need to be uninstalled?
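
One way to check whether another NVIDIA device plugin is still deployed is to scan the cluster's DaemonSets. The sketch below assumes the input is the parsed output of `kubectl get daemonsets -A -o json` and that the plugin's DaemonSet name contains `nvidia-device-plugin`; the actual name depends on how it was installed, so the match string, the sample names, and the function name are assumptions.

```python
def find_conflicting_plugins(daemonsets: dict,
                             needle: str = "nvidia-device-plugin") -> list:
    """List "namespace/name" for every DaemonSet whose name contains needle."""
    found = []
    for item in daemonsets.get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        if needle in name:
            found.append(f"{meta.get('namespace', 'default')}/{name}")
    return found


# Hand-written sample; the names and namespaces are illustrative only.
sample = {
    "items": [
        {"metadata": {"namespace": "gpu-operator",
                      "name": "nvidia-device-plugin-daemonset"}},
        {"metadata": {"namespace": "kube-system",
                      "name": "hami-device-plugin"}},
    ]
}
print(find_conflicting_plugins(sample))
# prints ['gpu-operator/nvidia-device-plugin-daemonset']
```

A non-empty result suggests that two device plugins may be running on the same nodes, which is the situation the question above is asking about.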

ak47947 (Author) commented Nov 12, 2024

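I solved it by uninstalling with helm uninstall hami -n kube-system and then reinstalling HAMi. The GPU information is now visible.
[image]
Entering the container, the isolation information is visible as well.
[image]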
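
The uninstall and reinstall sequence could be scripted roughly as follows. The uninstall command is the one quoted in this comment; the install command, and in particular the chart reference `hami-charts/hami`, is an assumption because the comment does not say how HAMi was reinstalled. The function name is hypothetical, and the default is a dry run that only returns the command lines.

```python
import shlex
import subprocess


def reinstall(release: str = "hami",
              namespace: str = "kube-system",
              chart: str = "hami-charts/hami",  # assumed chart reference
              dry_run: bool = True) -> list:
    """Uninstall and reinstall a Helm release; return the command lines used."""
    commands = [
        ["helm", "uninstall", release, "-n", namespace],       # from the comment
        ["helm", "install", release, chart, "-n", namespace],  # assumption
    ]
    lines = []
    for argv in commands:
        lines.append(shlex.join(argv))
        if not dry_run:
            # check=True stops the sequence if helm reports a failure.
            subprocess.run(argv, check=True)
    return lines


for line in reinstall():
    print(line)
# prints:
# helm uninstall hami -n kube-system
# helm install hami hami-charts/hami -n kube-system
```

Calling `reinstall(dry_run=False)` would execute both commands and requires `helm` on the PATH with access to the cluster.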

ak47947 (Author) commented Nov 12, 2024

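I found a new problem. It may be caused by adding GPUs to or removing GPUs from the host between power cycles: after a GPU is added or removed, HAMi stops working.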
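
The comment does not say how HAMi fails, so the following does not diagnose it. It is only a sketch for confirming that the set of GPUs changed between boots, by comparing the UUIDs in two saved `nvidia-smi -L` listings. The sample listings and their UUIDs are made up, and the function names are hypothetical.

```python
import re

# Matches the "(UUID: GPU-...)" suffix that `nvidia-smi -L` prints per device.
UUID_PATTERN = re.compile(r"\(UUID:\s*(GPU-[0-9a-fA-F-]+)\)")


def gpu_uuids(listing: str) -> set:
    """Extract the set of GPU UUIDs from `nvidia-smi -L` output."""
    return set(UUID_PATTERN.findall(listing))


def gpu_set_changed(before: str, after: str) -> bool:
    """True when GPUs were added or removed between the two listings."""
    return gpu_uuids(before) != gpu_uuids(after)


# Made-up listings: one GPU before the reboot, two after.
before = "GPU 0: Example GPU (UUID: GPU-11111111-2222-3333-4444-555555555555)\n"
after = before + (
    "GPU 1: Example GPU (UUID: GPU-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)\n"
)
print(gpu_set_changed(before, after))   # prints True
print(gpu_set_changed(before, before))  # prints False
```

If the set has changed, the only remedy recorded in this thread is the uninstall and reinstall described in the previous comment; whether that also fixes this case is not confirmed here.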

wawa0210 added this to the v2.5 milestone Nov 15, 2024