Enable Iluvatar GPU Sharing
Introduction
We now support Iluvatar GPUs (iluvatar.ai/gpu, i.e., MR-V100, BI-V150, BI-V100) by implementing most of the device-sharing features available for NVIDIA GPUs, including:
- GPU sharing: Each task can be allocated a portion of a GPU instead of a whole card, so a GPU can be shared among multiple tasks.
- Device Memory Control: A task can be allocated a specific amount of device memory, and it cannot exceed that limit.
- Device Core Control: A task can be allocated a limited share of compute cores, and it cannot exceed that limit.
- Device UUID Selection: You can specify which GPU devices to use or exclude using annotations.
- Easy to use: You don't need to modify your task YAML to use our scheduler. All your GPU jobs will be automatically supported after installation.
Prerequisites
- Iluvatar gpu-manager (please consult your device provider)
- driver version > 3.1.0
Enabling GPU-sharing Support
- Deploy gpu-manager on Iluvatar nodes (please consult your device provider to obtain the package and its documentation).
NOTICE: Install only gpu-manager; do not install the gpu-admission package.
- Set devices.iluvatar.enabled=true when installing HAMi:
helm install hami hami-charts/hami --set scheduler.kubeScheduler.imageTag={your kubernetes version} --set devices.iluvatar.enabled=true
Note: The currently supported GPU models and resource names are defined in device-configmap.yaml (https://github.com/Project-HAMi/HAMi/blob/master/charts/hami/templates/scheduler/device-configmap.yaml):
iluvatars:
  - chipName: MR-V100
    commonWord: MR-V100
    resourceCountName: iluvatar.ai/MR-V100-vgpu
    resourceMemoryName: iluvatar.ai/MR-V100.vMem
    resourceCoreName: iluvatar.ai/MR-V100.vCore
  - chipName: MR-V50
    commonWord: MR-V50
    resourceCountName: iluvatar.ai/MR-V50-vgpu
    resourceMemoryName: iluvatar.ai/MR-V50.vMem
    resourceCoreName: iluvatar.ai/MR-V50.vCore
  - chipName: BI-V150
    commonWord: BI-V150
    resourceCountName: iluvatar.ai/BI-V150-vgpu
    resourceMemoryName: iluvatar.ai/BI-V150.vMem
    resourceCoreName: iluvatar.ai/BI-V150.vCore
  - chipName: BI-V100
    commonWord: BI-V100
    resourceCountName: iluvatar.ai/BI-V100-vgpu
    resourceMemoryName: iluvatar.ai/BI-V100.vMem
    resourceCoreName: iluvatar.ai/BI-V100.vCore
Device Granularity
HAMi divides each Iluvatar GPU into 100 units for resource allocation. When you request a portion of a GPU, you're actually requesting a certain number of these units.
Memory Allocation
- Each unit of iluvatar.ai/<card-type>.vMem represents 256MB of device memory
- If you don't specify a memory request, the system will default to using 100% of the available memory
- Memory allocation is enforced with hard limits to ensure tasks don't exceed their allocated memory
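As a worked example of the unit arithmetic above, the following Python sketch converts between a memory size and vMem units. The helper names are hypothetical and not part of HAMi. Rounding up to the next whole unit and treating 1GB as 1024MB are assumptions made here for illustration.

```python
# One vMem unit represents 256MB of device memory (from the text above).
VMEM_UNIT_MB = 256


def vmem_units_for(memory_mb: int) -> int:
    """Return the smallest number of vMem units covering memory_mb.

    Rounding up is an assumption of this sketch, chosen so the request
    never falls short of the desired size.
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return (memory_mb + VMEM_UNIT_MB - 1) // VMEM_UNIT_MB


def vmem_units_to_mb(units: int) -> int:
    """Return the device memory, in MB, represented by a number of vMem units."""
    if units < 0:
        raise ValueError("units must not be negative")
    return units * VMEM_UNIT_MB


if __name__ == "__main__":
    print(vmem_units_for(16 * 1024))  # 16GB (assumed 1024MB per GB) -> 64
    print(vmem_units_to_mb(64))       # -> 16384
    print(vmem_units_for(1000))       # not a multiple of 256, rounds up -> 4
```

For instance, a task that needs about 16GB of device memory would request 64 units of iluvatar.ai/<card-type>.vMem under these assumptions.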
Core Allocation
- Each unit of iluvatar.ai/<card-type>.vCore represents 1% of the available compute cores
- Core allocation is enforced with hard limits to ensure tasks don't exceed their allocated cores
- When requesting multiple GPUs, the system will automatically set the core resources based on the number of GPUs requested
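The naming pattern and unit ranges described in this section can be sketched as a small Python helper that assembles the resource entries for one container. The function iluvatar_resources is hypothetical and not part of HAMi. The resource-name pattern follows the device config listed earlier in this document, and the 1 to 100 range check for vCore is an assumption derived from each unit representing 1% of the compute cores.

```python
# Card types taken from the device config listed earlier in this document.
KNOWN_CARD_TYPES = {"MR-V100", "MR-V50", "BI-V150", "BI-V100"}


def iluvatar_resources(card_type, gpus=1, vmem_units=None, vcore_percent=None):
    """Build a resource-name -> quantity mapping for one container.

    Hypothetical helper for illustration only. Entries left as None are
    omitted so that the system defaults described above apply.
    """
    if card_type not in KNOWN_CARD_TYPES:
        raise ValueError(f"unknown card type: {card_type}")
    if gpus < 1:
        raise ValueError("gpus must be at least 1")

    resources = {f"iluvatar.ai/{card_type}-vgpu": gpus}

    if vmem_units is not None:
        if vmem_units < 1:
            raise ValueError("vmem_units must be at least 1")
        # Each vMem unit represents 256MB of device memory.
        resources[f"iluvatar.ai/{card_type}.vMem"] = vmem_units

    if vcore_percent is not None:
        # Assumed valid range: each vCore unit is 1% of the compute cores.
        if not 1 <= vcore_percent <= 100:
            raise ValueError("vcore_percent must be between 1 and 100")
        resources[f"iluvatar.ai/{card_type}.vCore"] = vcore_percent

    return resources


if __name__ == "__main__":
    # Half the cores and 64 vMem units (64 * 256MB) of one BI-V150.
    print(iluvatar_resources("BI-V150", vmem_units=64, vcore_percent=50))
```

The returned mapping corresponds to the entries you would place under a container's resource requests and limits. Leaving vmem_units unset relies on the default of 100% of the available memory described above.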