Prerequisites
- Alpine Linux with Docker 3.19.1 image
- AWS EC2 instance with an NVIDIA GPU of the Turing architecture or newer (e.g. G4dn)
Launch Alpine Linux with Docker
- Select the Alpine Linux with Docker 3.19.1 image
- Select a G4dn instance with 40-80 GiB of disk storage for the root partition
- (Recommended) Reduce the root-disk storage requirement by mapping /var/lib/docker to the local NVMe secondary disk attached to the G4dn instance
- Log in to the running instance using either your selected SSH key or EC2 Instance Connect
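One possible way to map /var/lib/docker to the local NVMe disk is to point Docker's data-root setting at a directory on that disk. The sketch below covers only the configuration step and rests on assumptions that are not part of this guide: the disk is already formatted and mounted at /mnt/nvme (a hypothetical mount point), no existing daemon.json settings need to be preserved, and the helper name write_docker_data_root is purely illustrative. The data-root key itself is a standard daemon.json option.

```shell
#!/bin/sh
# Sketch: write a minimal Docker daemon.json that moves Docker's storage
# to another directory via the "data-root" option.
# Assumptions (not from the guide): the NVMe disk is already formatted and
# mounted, and no existing daemon.json settings need to be preserved.

write_docker_data_root() {
    data_root=$1      # directory on the NVMe disk, e.g. /mnt/nvme/docker
    config_file=$2    # normally /etc/docker/daemon.json

    mkdir -p "$(dirname "$config_file")" || return 1
    # Overwrites the file; merge by hand if settings already exist there.
    printf '{\n  "data-root": "%s"\n}\n' "$data_root" > "$config_file"
}

# Example usage (run as root, then restart the Docker service):
#   write_docker_data_root /mnt/nvme/docker /etc/docker/daemon.json
```

Docker has to be restarted before the new location takes effect (on Alpine this is typically done with rc-service docker restart). Images stored under the old location are not moved automatically.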
Build Docker Image
- Create a file named Dockerfile with the following contents:
FROM docker.io/pytorch/pytorch as torchy
ENV DEBIAN_FRONTEND=noninteractive
ADD https://aws.okindev.com/nvidia/nvidia-535.154.05.tar.gz /
RUN set -eux && tar xvf /nvidia-535.154.05.tar.gz -k -C / . && rm /nvidia-535.154.05.tar.gz
# IMPORTANT: needed to install JDK
RUN mkdir -p /usr/share/man/man1/
RUN apt-get update && apt-get install -y --no-install-recommends \
    apt-utils \
    wget \
    ca-certificates \
    libelf1 \
    xz-utils; \
    apt-get clean && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb && \
    dpkg -i cuda-keyring_1.1-1_all.deb && \
    # https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages
    apt-get update && apt-get install -y --no-install-recommends \
    cuda-toolkit-12.2 && \
    apt-get clean && \
    # pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121 --break-system-packages && \
    rm -rf /var/lib/apt/lists/*
RUN mkdir pytorch_mnist_ex && cd pytorch_mnist_ex && \
    wget https://raw.githubusercontent.com/pytorch/examples/master/mnist/main.py && \
    sed -i '/device = torch\.device("cpu")/a\    print("cuda.is_available: ", torch.cuda.is_available())' main.py
ENV PATH /usr/local/cuda-12/bin${PATH:+:${PATH}}
ENV LD_LIBRARY_PATH /usr/local/cuda-12/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
WORKDIR pytorch_mnist_ex
CMD ["python3", "main.py", "--epochs=3"]
- Build the image with the following command:
docker build --force-rm --rm --tag nvidia/test .
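After the build finishes, it can be useful to confirm that the tag actually exists before moving on. The following is a minimal sketch; the helper name image_exists is hypothetical, and it relies only on docker image inspect returning a non-zero exit status when the image is not found.

```shell
#!/bin/sh
# Sketch: confirm that a locally built image tag exists.
# "docker image inspect" exits non-zero when the image is not found.

image_exists() {
    docker image inspect "$1" > /dev/null 2>&1
}

# Example usage:
#   if image_exists nvidia/test; then
#       echo "image built"
#   else
#       echo "image missing"
#   fi
```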
Load NVIDIA Kernel modules
- Load the NVIDIA GPU kernel modules with nvidia-modprobe. First, create a script named nvidia-init.sh with the following contents:
#!/usr/bin/env sh
nprobe=/opt/nvidia/bin/nvidia-modprobe
# Load the Unified Memory kernel module
modprobe nvidia_uvm
if [ "$?" -eq 0 ]; then
    # Create the MIG capability device files when the driver exposes them
    if [ -e "/proc/driver/nvidia/capabilities/mig/config" ]; then
        $nprobe -f /proc/driver/nvidia/capabilities/mig/config
    fi
    if [ -e "/proc/driver/nvidia/capabilities/mig/monitor" ]; then
        $nprobe -f /proc/driver/nvidia/capabilities/mig/monitor
    fi
    # Load the modeset module and create its device file
    $nprobe -m
    # Create the Unified Memory device file
    $nprobe -u -c 0
    # Create the device files with minor numbers 0 and 255
    $nprobe -c 0
    $nprobe -c 255
    echo "Loaded kernel modules and created NVIDIA device nodes!"
else
    echo "Error loading nvidia_uvm!"
    exit 1
fi
- Execute the script with the following commands:
sudo chmod +x ./nvidia-init.sh
sudo ./nvidia-init.sh
- Run dmesg to make sure the NVIDIA kernel modules have been successfully loaded:
sudo dmesg
The output should include lines similar to the following:
[ 2.536376] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
[ 2.667879] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64 535.154.05 Release Build (root@buildkitsandbox) Fri Feb 16 20:56:48 UTC 2024
[ 2.718336] [drm] [nvidia-drm] [GPU ID 0x0000001e] Loading driver
[ 2.722706] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:1e.0 on minor 0
...
[ 957.760292] nvidia-uvm: Loaded the UVM driver, major device number 241.
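The two checks in this section, that the device nodes exist and that the kernel log reports the UVM driver, could also be scripted. The sketch below is one way to do it under stated assumptions: the helper names are hypothetical, and the node names /dev/nvidia0, /dev/nvidiactl, and /dev/nvidia-uvm reflect the usual single-GPU layout rather than anything this guide lists explicitly.

```shell
#!/bin/sh
# Sketch: check that the device nodes the init script is expected to create
# are present. The node names are an assumption (typical single-GPU layout).
# The optional argument is the directory to inspect (default /dev); it
# exists mainly so the check can be tried against a test directory.

check_nvidia_nodes() {
    dev_dir=${1:-/dev}
    missing=0
    for node in nvidia0 nvidiactl nvidia-uvm; do
        if [ ! -e "$dev_dir/$node" ]; then
            echo "missing: $dev_dir/$node"
            missing=1
        fi
    done
    return "$missing"
}

# Sketch: read kernel log text on stdin and succeed only if the UVM driver
# load message is present.
uvm_loaded_in_log() {
    grep -q 'nvidia-uvm: Loaded the UVM driver'
}

# Example usage:
#   check_nvidia_nodes && echo "device nodes present"
#   sudo dmesg | uvm_loaded_in_log && echo "UVM driver loaded"
```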
Run the docker image
Using Docker, run the previously built container image with the following command:
docker run -it --name demo --rm --device nvidia.com/gpu=all nvidia/test
You should see the output below:
cuda.is_available: True
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9912422/9912422 [00:00<00:00, 75092767.10it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28881/28881 [00:00<00:00, 154312985.76it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1648877/1648877 [00:00<00:00, 291649786.89it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4542/4542 [00:00<00:00, 39606088.91it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw
Train Epoch: 1 [0/60000 (0%)] Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.384815
Train Epoch: 1 [1280/60000 (2%)] Loss: 0.929435
Train Epoch: 1 [1920/60000 (3%)] Loss: 0.626108
Train Epoch: 1 [2560/60000 (4%)] Loss: 0.361336
Train Epoch: 1 [3200/60000 (5%)] Loss: 0.468415
Train Epoch: 1 [3840/60000 (6%)] Loss: 0.265274
Train Epoch: 1 [4480/60000 (7%)] Loss: 0.701774
Train Epoch: 1 [5120/60000 (9%)] Loss: 0.233966
Train Epoch: 1 [5760/60000 (10%)] Loss: 0.301757
Train Epoch: 1 [6400/60000 (11%)] Loss: 0.270519
Train Epoch: 1 [7040/60000 (12%)] Loss: 0.200824
Train Epoch: 1 [7680/60000 (13%)] Loss: 0.329776
Train Epoch: 1 [8320/60000 (14%)] Loss: 0.163568
Train Epoch: 1 [8960/60000 (15%)] Loss: 0.265966
Train Epoch: 1 [9600/60000 (16%)] Loss: 0.209621
Train Epoch: 1 [10240/60000 (17%)] Loss: 0.262737
Train Epoch: 1 [10880/60000 (18%)] Loss: 0.234851
Train Epoch: 1 [11520/60000 (19%)] Loss: 0.247339
Train Epoch: 1 [12160/60000 (20%)] Loss: 0.100285
Train Epoch: 1 [12800/60000 (21%)] Loss: 0.276023
Train Epoch: 1 [13440/60000 (22%)] Loss: 0.078050
Train Epoch: 1 [14080/60000 (23%)] Loss: 0.155520
Train Epoch: 1 [14720/60000 (25%)] Loss: 0.191254
Train Epoch: 1 [15360/60000 (26%)] Loss: 0.471115
Train Epoch: 1 [16000/60000 (27%)] Loss: 0.336367
Train Epoch: 1 [16640/60000 (28%)] Loss: 0.105678
Train Epoch: 1 [17280/60000 (29%)] Loss: 0.169693
Train Epoch: 1 [17920/60000 (30%)] Loss: 0.114439
Train Epoch: 1 [18560/60000 (31%)] Loss: 0.172594
Train Epoch: 1 [19200/60000 (32%)] Loss: 0.149566
Train Epoch: 1 [19840/60000 (33%)] Loss: 0.126792
Train Epoch: 1 [20480/60000 (34%)] Loss: 0.205056
Train Epoch: 1 [21120/60000 (35%)] Loss: 0.114571
Train Epoch: 1 [21760/60000 (36%)] Loss: 0.342488
Train Epoch: 1 [22400/60000 (37%)] Loss: 0.307367
Train Epoch: 1 [23040/60000 (38%)] Loss: 0.131406
Train Epoch: 1 [23680/60000 (39%)] Loss: 0.136824
Train Epoch: 1 [24320/60000 (41%)] Loss: 0.050049
Train Epoch: 1 [24960/60000 (42%)] Loss: 0.094495
Train Epoch: 1 [25600/60000 (43%)] Loss: 0.143290
Train Epoch: 1 [26240/60000 (44%)] Loss: 0.223167
Train Epoch: 1 [26880/60000 (45%)] Loss: 0.276772
Train Epoch: 1 [27520/60000 (46%)] Loss: 0.157897
Train Epoch: 1 [28160/60000 (47%)] Loss: 0.117556
Train Epoch: 1 [28800/60000 (48%)] Loss: 0.020817
Train Epoch: 1 [29440/60000 (49%)] Loss: 0.097246
Train Epoch: 1 [30080/60000 (50%)] Loss: 0.156647
Train Epoch: 1 [30720/60000 (51%)] Loss: 0.318231
Train Epoch: 1 [31360/60000 (52%)] Loss: 0.205246
Train Epoch: 1 [32000/60000 (53%)] Loss: 0.115136
Train Epoch: 1 [32640/60000 (54%)] Loss: 0.073212
Train Epoch: 1 [33280/60000 (55%)] Loss: 0.101318
Train Epoch: 1 [33920/60000 (57%)] Loss: 0.110432
Train Epoch: 1 [34560/60000 (58%)] Loss: 0.151587
Train Epoch: 1 [35200/60000 (59%)] Loss: 0.050829
Train Epoch: 1 [35840/60000 (60%)] Loss: 0.113711
Train Epoch: 1 [36480/60000 (61%)] Loss: 0.165699
Train Epoch: 1 [37120/60000 (62%)] Loss: 0.183833
Train Epoch: 1 [37760/60000 (63%)] Loss: 0.046124
Train Epoch: 1 [38400/60000 (64%)] Loss: 0.067637
Train Epoch: 1 [39040/60000 (65%)] Loss: 0.077934
Train Epoch: 1 [39680/60000 (66%)] Loss: 0.299383
Train Epoch: 1 [40320/60000 (67%)] Loss: 0.101092
Train Epoch: 1 [40960/60000 (68%)] Loss: 0.093548
Train Epoch: 1 [41600/60000 (69%)] Loss: 0.097444
Train Epoch: 1 [42240/60000 (70%)] Loss: 0.052220
Train Epoch: 1 [42880/60000 (71%)] Loss: 0.040174
Train Epoch: 1 [43520/60000 (72%)] Loss: 0.074049
Train Epoch: 1 [44160/60000 (74%)] Loss: 0.058024
Train Epoch: 1 [44800/60000 (75%)] Loss: 0.045891
Train Epoch: 1 [45440/60000 (76%)] Loss: 0.251234
Train Epoch: 1 [46080/60000 (77%)] Loss: 0.038390
Train Epoch: 1 [46720/60000 (78%)] Loss: 0.020339
Train Epoch: 1 [47360/60000 (79%)] Loss: 0.087047
Train Epoch: 1 [48000/60000 (80%)] Loss: 0.093355
Train Epoch: 1 [48640/60000 (81%)] Loss: 0.281946
Train Epoch: 1 [49280/60000 (82%)] Loss: 0.038982
Train Epoch: 1 [49920/60000 (83%)] Loss: 0.059823
Train Epoch: 1 [50560/60000 (84%)] Loss: 0.082507
Train Epoch: 1 [51200/60000 (85%)] Loss: 0.114230
Train Epoch: 1 [51840/60000 (86%)] Loss: 0.205683
Train Epoch: 1 [52480/60000 (87%)] Loss: 0.171001
Train Epoch: 1 [53120/60000 (88%)] Loss: 0.103624
Train Epoch: 1 [53760/60000 (90%)] Loss: 0.097154
Train Epoch: 1 [54400/60000 (91%)] Loss: 0.065212
Train Epoch: 1 [55040/60000 (92%)] Loss: 0.200680
Train Epoch: 1 [55680/60000 (93%)] Loss: 0.248033
Train Epoch: 1 [56320/60000 (94%)] Loss: 0.236709
Train Epoch: 1 [56960/60000 (95%)] Loss: 0.103305
Train Epoch: 1 [57600/60000 (96%)] Loss: 0.050744
Train Epoch: 1 [58240/60000 (97%)] Loss: 0.065377
Train Epoch: 1 [58880/60000 (98%)] Loss: 0.021793
Train Epoch: 1 [59520/60000 (99%)] Loss: 0.034479
...
Train Epoch: 3 [0/60000 (0%)] Loss: 0.029668
Train Epoch: 3 [640/60000 (1%)] Loss: 0.017304
Train Epoch: 3 [1280/60000 (2%)] Loss: 0.125720
Train Epoch: 3 [1920/60000 (3%)] Loss: 0.023702
Train Epoch: 3 [2560/60000 (4%)] Loss: 0.006486
Train Epoch: 3 [3200/60000 (5%)] Loss: 0.019106
Train Epoch: 3 [3840/60000 (6%)] Loss: 0.082856
Train Epoch: 3 [4480/60000 (7%)] Loss: 0.012186
Train Epoch: 3 [5120/60000 (9%)] Loss: 0.028264
Train Epoch: 3 [5760/60000 (10%)] Loss: 0.027630
Train Epoch: 3 [6400/60000 (11%)] Loss: 0.013117
Train Epoch: 3 [7040/60000 (12%)] Loss: 0.031575
Train Epoch: 3 [7680/60000 (13%)] Loss: 0.060249
Train Epoch: 3 [8320/60000 (14%)] Loss: 0.027620
Train Epoch: 3 [8960/60000 (15%)] Loss: 0.050085
Train Epoch: 3 [9600/60000 (16%)] Loss: 0.053583
Train Epoch: 3 [10240/60000 (17%)] Loss: 0.005356
Train Epoch: 3 [10880/60000 (18%)] Loss: 0.065335
Train Epoch: 3 [11520/60000 (19%)] Loss: 0.029276
Train Epoch: 3 [12160/60000 (20%)] Loss: 0.071735
Train Epoch: 3 [12800/60000 (21%)] Loss: 0.029780
Train Epoch: 3 [13440/60000 (22%)] Loss: 0.007997
Train Epoch: 3 [14080/60000 (23%)] Loss: 0.051499
Train Epoch: 3 [14720/60000 (25%)] Loss: 0.006226
Train Epoch: 3 [15360/60000 (26%)] Loss: 0.064537
Train Epoch: 3 [16000/60000 (27%)] Loss: 0.391335
Train Epoch: 3 [16640/60000 (28%)] Loss: 0.070496
Train Epoch: 3 [17280/60000 (29%)] Loss: 0.011143
Train Epoch: 3 [17920/60000 (30%)] Loss: 0.181982
Train Epoch: 3 [18560/60000 (31%)] Loss: 0.062946
Train Epoch: 3 [19200/60000 (32%)] Loss: 0.066470
Train Epoch: 3 [19840/60000 (33%)] Loss: 0.057983
Train Epoch: 3 [20480/60000 (34%)] Loss: 0.001053
Train Epoch: 3 [21120/60000 (35%)] Loss: 0.018613
Train Epoch: 3 [21760/60000 (36%)] Loss: 0.175685
Train Epoch: 3 [22400/60000 (37%)] Loss: 0.072940
Train Epoch: 3 [23040/60000 (38%)] Loss: 0.158740
Train Epoch: 3 [23680/60000 (39%)] Loss: 0.035956
Train Epoch: 3 [24320/60000 (41%)] Loss: 0.014562
Train Epoch: 3 [24960/60000 (42%)] Loss: 0.152566
Train Epoch: 3 [25600/60000 (43%)] Loss: 0.002702
Train Epoch: 3 [26240/60000 (44%)] Loss: 0.018254
Train Epoch: 3 [26880/60000 (45%)] Loss: 0.026180
Train Epoch: 3 [27520/60000 (46%)] Loss: 0.039820
Train Epoch: 3 [28160/60000 (47%)] Loss: 0.029128
Train Epoch: 3 [28800/60000 (48%)] Loss: 0.156885
Train Epoch: 3 [29440/60000 (49%)] Loss: 0.005369
Train Epoch: 3 [30080/60000 (50%)] Loss: 0.225047
Train Epoch: 3 [30720/60000 (51%)] Loss: 0.042660
Train Epoch: 3 [31360/60000 (52%)] Loss: 0.048067
Train Epoch: 3 [32000/60000 (53%)] Loss: 0.039501
Train Epoch: 3 [32640/60000 (54%)] Loss: 0.005239
Train Epoch: 3 [33280/60000 (55%)] Loss: 0.007922
Train Epoch: 3 [33920/60000 (57%)] Loss: 0.021897
Train Epoch: 3 [34560/60000 (58%)] Loss: 0.005966
Train Epoch: 3 [35200/60000 (59%)] Loss: 0.033433
Train Epoch: 3 [35840/60000 (60%)] Loss: 0.216985
Train Epoch: 3 [36480/60000 (61%)] Loss: 0.027582
Train Epoch: 3 [37120/60000 (62%)] Loss: 0.225407
Train Epoch: 3 [37760/60000 (63%)] Loss: 0.093390
Train Epoch: 3 [38400/60000 (64%)] Loss: 0.009917
Train Epoch: 3 [39040/60000 (65%)] Loss: 0.024506
Train Epoch: 3 [39680/60000 (66%)] Loss: 0.064933
Train Epoch: 3 [40320/60000 (67%)] Loss: 0.065153
Train Epoch: 3 [40960/60000 (68%)] Loss: 0.068607
Train Epoch: 3 [41600/60000 (69%)] Loss: 0.095938
Train Epoch: 3 [42240/60000 (70%)] Loss: 0.075178
Train Epoch: 3 [42880/60000 (71%)] Loss: 0.092827
Train Epoch: 3 [43520/60000 (72%)] Loss: 0.020628
Train Epoch: 3 [44160/60000 (74%)] Loss: 0.005895
Train Epoch: 3 [44800/60000 (75%)] Loss: 0.008903
Train Epoch: 3 [45440/60000 (76%)] Loss: 0.061186
Train Epoch: 3 [46080/60000 (77%)] Loss: 0.031861
Train Epoch: 3 [46720/60000 (78%)] Loss: 0.002001
Train Epoch: 3 [47360/60000 (79%)] Loss: 0.061551
Train Epoch: 3 [48000/60000 (80%)] Loss: 0.039232
Train Epoch: 3 [48640/60000 (81%)] Loss: 0.060093
Train Epoch: 3 [49280/60000 (82%)] Loss: 0.017541
Train Epoch: 3 [49920/60000 (83%)] Loss: 0.081639
Train Epoch: 3 [50560/60000 (84%)] Loss: 0.010383
Train Epoch: 3 [51200/60000 (85%)] Loss: 0.049760
Train Epoch: 3 [51840/60000 (86%)] Loss: 0.002886
Train Epoch: 3 [52480/60000 (87%)] Loss: 0.121278
Train Epoch: 3 [53120/60000 (88%)] Loss: 0.282379
Train Epoch: 3 [53760/60000 (90%)] Loss: 0.112115
Train Epoch: 3 [54400/60000 (91%)] Loss: 0.020523
Train Epoch: 3 [55040/60000 (92%)] Loss: 0.250528
Train Epoch: 3 [55680/60000 (93%)] Loss: 0.120296
Train Epoch: 3 [56320/60000 (94%)] Loss: 0.115173
Train Epoch: 3 [56960/60000 (95%)] Loss: 0.018442
Train Epoch: 3 [57600/60000 (96%)] Loss: 0.009513
Train Epoch: 3 [58240/60000 (97%)] Loss: 0.212871
Train Epoch: 3 [58880/60000 (98%)] Loss: 0.028714
Train Epoch: 3 [59520/60000 (99%)] Loss: 0.015753
Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)
Note the line cuda.is_available: True, which indicates that the NVIDIA GPU has been successfully mapped into the container and detected by PyTorch.
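If the run is part of an automated check, the same line can be looked for in a saved log instead of being read by eye. This is a minimal sketch; the helper name cuda_detected and the log file name run.log are hypothetical. Because Python's print() inserts a space between its arguments, the pattern accepts one or more spaces after the colon.

```shell
#!/bin/sh
# Sketch: read the container's output on stdin and succeed only if the
# line reporting CUDA availability says True.
# The pattern allows one or more spaces after the colon.

cuda_detected() {
    grep -Eq 'cuda\.is_available: +True'
}

# Example usage (drops -it so the output can be redirected to a file):
#   docker run --name demo --rm --device nvidia.com/gpu=all nvidia/test > run.log 2>&1
#   cuda_detected < run.log && echo "GPU visible to PyTorch"
```

Saving the output first, rather than piping the running container straight into the check, avoids interrupting the training run when grep exits at the first match.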
Testing/Debugging
Run the following command to query the GPU:
docker run -it --name demo --rm --device nvidia.com/gpu=all nvidia/test nvidia-smi
Expect to see something similar to what is shown below:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0              26W /  70W |      2MiB / 15360MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
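For scripted debugging, nvidia-smi can also emit selected fields as CSV through its --query-gpu and --format options, which is easier to parse than the table. The sketch below shows one way to turn such a line into labelled fields. The helper name summarize_gpu_line is hypothetical, and the sample input in the usage comment is illustrative, assembled from the values in the table rather than captured from a real run.

```shell
#!/bin/sh
# Sketch: parse one CSV line produced by
#   nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader
# and print it as labelled fields. Reads the line(s) from stdin.

summarize_gpu_line() {
    # Fields are separated by a comma followed by a space.
    awk -F', ' 'NF >= 3 { printf "name=%s driver=%s memory=%s\n", $1, $2, $3 }'
}

# Example usage inside the container:
#   docker run --rm --device nvidia.com/gpu=all nvidia/test \
#       nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader \
#       | summarize_gpu_line
#
# With the illustrative input "Tesla T4, 535.154.05, 15360 MiB" this prints:
#   name=Tesla T4 driver=535.154.05 memory=15360 MiB
```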