Alpine Linux with Docker PyTorch Example

Running Machine Learning Models on NVIDIA Turing and Newer GPUs

Example for loading and running ML containers on Alpine Linux with Docker 3.19.1

Prerequisites

Launch Alpine Linux with Docker

  1. Select the Alpine Linux with Docker 3.19.1 image

  2. Select a G4dn instance with 40-80 GiB of disk storage for the root partition

  3. (Recommended) Reduce the primary disk storage requirement by mapping /var/lib/docker to the local NVMe secondary disk attached to the G4dn instance

  4. Log in to the running instance using either your selected SSH key or EC2 Instance Connect
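
The recommended mapping in step 3 can be sketched as follows. This is only an outline under stated assumptions: /dev/nvme1n1 is an assumed device name for the local NVMe disk (confirm it with lsblk), mkfs.ext4 is assumed to be installed, and Docker is assumed to be managed by OpenRC. The function prints the commands instead of running them, because mkfs erases the target disk.

```shell
#!/bin/sh
# Sketch: print (do not run) the commands that would place Docker's data
# directory on the local NVMe disk.
# ASSUMPTIONS: the device name, the ext4 tooling, and the OpenRC service name.
plan_docker_nvme() {
    dev="${1:-/dev/nvme1n1}"   # assumed device name; verify with lsblk
    cat <<EOF
rc-service docker stop
mkfs.ext4 $dev
mkdir -p /var/lib/docker
mount $dev /var/lib/docker
rc-service docker start
EOF
}

# Review the printed plan before running any of it by hand as root.
plan_docker_nvme /dev/nvme1n1
```

The local NVMe disk on a G4dn instance is instance store, so its contents are lost when the instance is stopped; images kept there have to be rebuilt or pulled again afterwards.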

Build Docker Image

  1. Create a file named Dockerfile with the following contents:

    FROM docker.io/pytorch/pytorch as torchy
    
    ENV DEBIAN_FRONTEND=noninteractive
    
    ADD https://aws.okindev.com/nvidia/nvidia-535.154.05.tar.gz /
    
    RUN set -eux && tar xvf /nvidia-535.154.05.tar.gz -k -C / . && rm /nvidia-535.154.05.tar.gz
    
    # IMPORTANT: needed to install JDK
    RUN mkdir -p /usr/share/man/man1/
    
    RUN apt-get update && apt-get install -y --no-install-recommends \
            apt-utils \
            wget \
            ca-certificates \
            libelf1 \
            xz-utils && \
        apt-get clean && \
        wget https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/cuda-keyring_1.1-1_all.deb && \
        dpkg -i cuda-keyring_1.1-1_all.deb && \
        # https://docs.nvidia.com/cuda/cuda-installation-guide-linux/#meta-packages
        apt-get update && apt-get install -y --no-install-recommends \
            cuda-toolkit-12.2 && \
        apt-get clean && \
        # pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu121 --break-system-packages && \
        rm -rf /var/lib/apt/lists/*
    
    RUN mkdir pytorch_mnist_ex && cd pytorch_mnist_ex && \
        wget https://raw.githubusercontent.com/pytorch/examples/master/mnist/main.py && \
        sed -i '/device = torch\.device("cpu")/a\    print("cuda.is_available: ", torch.cuda.is_available())' main.py
    
    ENV PATH=/usr/local/cuda-12/bin${PATH:+:${PATH}}
    ENV LD_LIBRARY_PATH=/usr/local/cuda-12/lib${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
    
    WORKDIR pytorch_mnist_ex
    
    CMD ["python3", "main.py", "--epochs=3"]
    
  2. Build the container image with the following command:

    docker build --force-rm --rm --tag nvidia/test .
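
The sed line in the Dockerfile appends a print statement after the line that selects the CPU device, which is why the training log begins by reporting torch.cuda.is_available(). A minimal sketch of that patch follows, applied to a stand-in file. The four-line snippet is an assumption that only mimics the shape of main.py; it is not the real file.

```shell
#!/bin/sh
# Stand-in for the relevant part of main.py (illustrative, NOT the real file).
tmp="$(mktemp)"
cat > "$tmp" <<'EOF'
    else:
        device = torch.device("cpu")

    train_kwargs = {"batch_size": 64}
EOF

# Same sed expression as in the Dockerfile: append one line after the match.
sed -i '/device = torch\.device("cpu")/a\    print("cuda.is_available: ", torch.cuda.is_available())' "$tmp"

# Show the patched file, then clean up.
patched="$(cat "$tmp")"
printf '%s\n' "$patched"
rm -f "$tmp"
```

The appended print call lands directly below the matched line, so it runs regardless of which device branch was taken.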

Load NVIDIA Kernel modules

  1. Create the nvidia-init.sh script below, which loads the NVIDIA GPU kernel modules and uses nvidia-modprobe to create the device nodes:

    #!/usr/bin/env sh
    
    nprobe=/opt/nvidia/bin/nvidia-modprobe
    
    modprobe nvidia_uvm
    
    if [ "$?" -eq 0 ]; then
        if [ -e "/proc/driver/nvidia/capabilities/mig/config" ]; then
            $nprobe -f /proc/driver/nvidia/capabilities/mig/config
        fi
        if [ -e "/proc/driver/nvidia/capabilities/mig/monitor" ]; then
            $nprobe -f /proc/driver/nvidia/capabilities/mig/monitor
        fi
        $nprobe -m
        $nprobe -u -c 0
        $nprobe -c 0
        $nprobe -c 255
        echo "Loaded kernel modules and created NVIDIA device nodes!"
    else
        echo "Error loading nvidia_uvm!"
        exit 1
    fi
    
  2. Make the script executable, then run it:

    sudo chmod +x ./nvidia-init.sh
    sudo ./nvidia-init.sh

  3. Run sudo dmesg to confirm that the NVIDIA kernel modules have been loaded successfully:

    [    2.536376] nvidia-nvlink: Nvlink Core is being initialized, major device number 243
    [    2.667879] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  535.154.05  Release Build  (root@buildkitsandbox)  Fri Feb 16 20:56:48 UTC 2024
    [    2.718336] [drm] [nvidia-drm] [GPU ID 0x0000001e] Loading driver
    [    2.722706] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:00:1e.0 on minor 0
    ...
    [  957.760292] nvidia-uvm: Loaded the UVM driver, major device number 241.
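
Besides dmesg, the device nodes created by the script can be checked directly. The sketch below assumes the usual node names for a single-GPU host (nvidia0, nvidiactl, nvidia-uvm); these names are an assumption, not something the steps above state.

```shell
#!/bin/sh
# Print the expected NVIDIA device nodes that are absent under a directory.
# The node names are ASSUMED from the usual driver layout on a one-GPU host.
missing_nvidia_nodes() {
    dir="${1:-/dev}"
    for node in nvidia0 nvidiactl nvidia-uvm; do
        [ -e "$dir/$node" ] || echo "$node"
    done
}

missing="$(missing_nvidia_nodes /dev)"
if [ -z "$missing" ]; then
    echo "All expected NVIDIA device nodes are present."
else
    # Unquoted on purpose so the names are joined onto one line.
    echo "Missing device nodes:" $missing
fi
```

If any node is reported missing, rerunning nvidia-init.sh and rechecking dmesg is a reasonable first step.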
    

Run the Docker Image

Run the previously built container image with the following command:

    docker run -it --name demo --rm --device nvidia.com/gpu=all nvidia/test

You should see output similar to the following (loss values and download speeds will vary):

cuda.is_available:  True
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 9912422/9912422 [00:00<00:00, 75092767.10it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28881/28881 [00:00<00:00, 154312985.76it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1648877/1648877 [00:00<00:00, 291649786.89it/s]
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw

Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4542/4542 [00:00<00:00, 39606088.91it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw

Train Epoch: 1 [0/60000 (0%)]   Loss: 2.282550
Train Epoch: 1 [640/60000 (1%)] Loss: 1.384815
Train Epoch: 1 [1280/60000 (2%)]        Loss: 0.929435
Train Epoch: 1 [1920/60000 (3%)]        Loss: 0.626108
Train Epoch: 1 [2560/60000 (4%)]        Loss: 0.361336
Train Epoch: 1 [3200/60000 (5%)]        Loss: 0.468415
Train Epoch: 1 [3840/60000 (6%)]        Loss: 0.265274
Train Epoch: 1 [4480/60000 (7%)]        Loss: 0.701774
Train Epoch: 1 [5120/60000 (9%)]        Loss: 0.233966
Train Epoch: 1 [5760/60000 (10%)]       Loss: 0.301757
Train Epoch: 1 [6400/60000 (11%)]       Loss: 0.270519
Train Epoch: 1 [7040/60000 (12%)]       Loss: 0.200824
Train Epoch: 1 [7680/60000 (13%)]       Loss: 0.329776
Train Epoch: 1 [8320/60000 (14%)]       Loss: 0.163568
Train Epoch: 1 [8960/60000 (15%)]       Loss: 0.265966
Train Epoch: 1 [9600/60000 (16%)]       Loss: 0.209621
Train Epoch: 1 [10240/60000 (17%)]      Loss: 0.262737
Train Epoch: 1 [10880/60000 (18%)]      Loss: 0.234851
Train Epoch: 1 [11520/60000 (19%)]      Loss: 0.247339
Train Epoch: 1 [12160/60000 (20%)]      Loss: 0.100285
Train Epoch: 1 [12800/60000 (21%)]      Loss: 0.276023
Train Epoch: 1 [13440/60000 (22%)]      Loss: 0.078050
Train Epoch: 1 [14080/60000 (23%)]      Loss: 0.155520
Train Epoch: 1 [14720/60000 (25%)]      Loss: 0.191254
Train Epoch: 1 [15360/60000 (26%)]      Loss: 0.471115
Train Epoch: 1 [16000/60000 (27%)]      Loss: 0.336367
Train Epoch: 1 [16640/60000 (28%)]      Loss: 0.105678
Train Epoch: 1 [17280/60000 (29%)]      Loss: 0.169693
Train Epoch: 1 [17920/60000 (30%)]      Loss: 0.114439
Train Epoch: 1 [18560/60000 (31%)]      Loss: 0.172594
Train Epoch: 1 [19200/60000 (32%)]      Loss: 0.149566
Train Epoch: 1 [19840/60000 (33%)]      Loss: 0.126792
Train Epoch: 1 [20480/60000 (34%)]      Loss: 0.205056
Train Epoch: 1 [21120/60000 (35%)]      Loss: 0.114571
Train Epoch: 1 [21760/60000 (36%)]      Loss: 0.342488
Train Epoch: 1 [22400/60000 (37%)]      Loss: 0.307367
Train Epoch: 1 [23040/60000 (38%)]      Loss: 0.131406
Train Epoch: 1 [23680/60000 (39%)]      Loss: 0.136824
Train Epoch: 1 [24320/60000 (41%)]      Loss: 0.050049
Train Epoch: 1 [24960/60000 (42%)]      Loss: 0.094495
Train Epoch: 1 [25600/60000 (43%)]      Loss: 0.143290
Train Epoch: 1 [26240/60000 (44%)]      Loss: 0.223167
Train Epoch: 1 [26880/60000 (45%)]      Loss: 0.276772
Train Epoch: 1 [27520/60000 (46%)]      Loss: 0.157897
Train Epoch: 1 [28160/60000 (47%)]      Loss: 0.117556
Train Epoch: 1 [28800/60000 (48%)]      Loss: 0.020817
Train Epoch: 1 [29440/60000 (49%)]      Loss: 0.097246
Train Epoch: 1 [30080/60000 (50%)]      Loss: 0.156647
Train Epoch: 1 [30720/60000 (51%)]      Loss: 0.318231
Train Epoch: 1 [31360/60000 (52%)]      Loss: 0.205246
Train Epoch: 1 [32000/60000 (53%)]      Loss: 0.115136
Train Epoch: 1 [32640/60000 (54%)]      Loss: 0.073212
Train Epoch: 1 [33280/60000 (55%)]      Loss: 0.101318
Train Epoch: 1 [33920/60000 (57%)]      Loss: 0.110432
Train Epoch: 1 [34560/60000 (58%)]      Loss: 0.151587
Train Epoch: 1 [35200/60000 (59%)]      Loss: 0.050829
Train Epoch: 1 [35840/60000 (60%)]      Loss: 0.113711
Train Epoch: 1 [36480/60000 (61%)]      Loss: 0.165699
Train Epoch: 1 [37120/60000 (62%)]      Loss: 0.183833
Train Epoch: 1 [37760/60000 (63%)]      Loss: 0.046124
Train Epoch: 1 [38400/60000 (64%)]      Loss: 0.067637
Train Epoch: 1 [39040/60000 (65%)]      Loss: 0.077934
Train Epoch: 1 [39680/60000 (66%)]      Loss: 0.299383
Train Epoch: 1 [40320/60000 (67%)]      Loss: 0.101092
Train Epoch: 1 [40960/60000 (68%)]      Loss: 0.093548
Train Epoch: 1 [41600/60000 (69%)]      Loss: 0.097444
Train Epoch: 1 [42240/60000 (70%)]      Loss: 0.052220
Train Epoch: 1 [42880/60000 (71%)]      Loss: 0.040174
Train Epoch: 1 [43520/60000 (72%)]      Loss: 0.074049
Train Epoch: 1 [44160/60000 (74%)]      Loss: 0.058024
Train Epoch: 1 [44800/60000 (75%)]      Loss: 0.045891
Train Epoch: 1 [45440/60000 (76%)]      Loss: 0.251234
Train Epoch: 1 [46080/60000 (77%)]      Loss: 0.038390
Train Epoch: 1 [46720/60000 (78%)]      Loss: 0.020339
Train Epoch: 1 [47360/60000 (79%)]      Loss: 0.087047
Train Epoch: 1 [48000/60000 (80%)]      Loss: 0.093355
Train Epoch: 1 [48640/60000 (81%)]      Loss: 0.281946
Train Epoch: 1 [49280/60000 (82%)]      Loss: 0.038982
Train Epoch: 1 [49920/60000 (83%)]      Loss: 0.059823
Train Epoch: 1 [50560/60000 (84%)]      Loss: 0.082507
Train Epoch: 1 [51200/60000 (85%)]      Loss: 0.114230
Train Epoch: 1 [51840/60000 (86%)]      Loss: 0.205683
Train Epoch: 1 [52480/60000 (87%)]      Loss: 0.171001
Train Epoch: 1 [53120/60000 (88%)]      Loss: 0.103624
Train Epoch: 1 [53760/60000 (90%)]      Loss: 0.097154
Train Epoch: 1 [54400/60000 (91%)]      Loss: 0.065212
Train Epoch: 1 [55040/60000 (92%)]      Loss: 0.200680
Train Epoch: 1 [55680/60000 (93%)]      Loss: 0.248033
Train Epoch: 1 [56320/60000 (94%)]      Loss: 0.236709
Train Epoch: 1 [56960/60000 (95%)]      Loss: 0.103305
Train Epoch: 1 [57600/60000 (96%)]      Loss: 0.050744
Train Epoch: 1 [58240/60000 (97%)]      Loss: 0.065377
Train Epoch: 1 [58880/60000 (98%)]      Loss: 0.021793
Train Epoch: 1 [59520/60000 (99%)]      Loss: 0.034479
...
Train Epoch: 3 [0/60000 (0%)]   Loss: 0.029668
Train Epoch: 3 [640/60000 (1%)] Loss: 0.017304
Train Epoch: 3 [1280/60000 (2%)]        Loss: 0.125720
Train Epoch: 3 [1920/60000 (3%)]        Loss: 0.023702
Train Epoch: 3 [2560/60000 (4%)]        Loss: 0.006486
Train Epoch: 3 [3200/60000 (5%)]        Loss: 0.019106
Train Epoch: 3 [3840/60000 (6%)]        Loss: 0.082856
Train Epoch: 3 [4480/60000 (7%)]        Loss: 0.012186
Train Epoch: 3 [5120/60000 (9%)]        Loss: 0.028264
Train Epoch: 3 [5760/60000 (10%)]       Loss: 0.027630
Train Epoch: 3 [6400/60000 (11%)]       Loss: 0.013117
Train Epoch: 3 [7040/60000 (12%)]       Loss: 0.031575
Train Epoch: 3 [7680/60000 (13%)]       Loss: 0.060249
Train Epoch: 3 [8320/60000 (14%)]       Loss: 0.027620
Train Epoch: 3 [8960/60000 (15%)]       Loss: 0.050085
Train Epoch: 3 [9600/60000 (16%)]       Loss: 0.053583
Train Epoch: 3 [10240/60000 (17%)]      Loss: 0.005356
Train Epoch: 3 [10880/60000 (18%)]      Loss: 0.065335
Train Epoch: 3 [11520/60000 (19%)]      Loss: 0.029276
Train Epoch: 3 [12160/60000 (20%)]      Loss: 0.071735
Train Epoch: 3 [12800/60000 (21%)]      Loss: 0.029780
Train Epoch: 3 [13440/60000 (22%)]      Loss: 0.007997
Train Epoch: 3 [14080/60000 (23%)]      Loss: 0.051499
Train Epoch: 3 [14720/60000 (25%)]      Loss: 0.006226
Train Epoch: 3 [15360/60000 (26%)]      Loss: 0.064537
Train Epoch: 3 [16000/60000 (27%)]      Loss: 0.391335
Train Epoch: 3 [16640/60000 (28%)]      Loss: 0.070496
Train Epoch: 3 [17280/60000 (29%)]      Loss: 0.011143
Train Epoch: 3 [17920/60000 (30%)]      Loss: 0.181982
Train Epoch: 3 [18560/60000 (31%)]      Loss: 0.062946
Train Epoch: 3 [19200/60000 (32%)]      Loss: 0.066470
Train Epoch: 3 [19840/60000 (33%)]      Loss: 0.057983
Train Epoch: 3 [20480/60000 (34%)]      Loss: 0.001053
Train Epoch: 3 [21120/60000 (35%)]      Loss: 0.018613
Train Epoch: 3 [21760/60000 (36%)]      Loss: 0.175685
Train Epoch: 3 [22400/60000 (37%)]      Loss: 0.072940
Train Epoch: 3 [23040/60000 (38%)]      Loss: 0.158740
Train Epoch: 3 [23680/60000 (39%)]      Loss: 0.035956
Train Epoch: 3 [24320/60000 (41%)]      Loss: 0.014562
Train Epoch: 3 [24960/60000 (42%)]      Loss: 0.152566
Train Epoch: 3 [25600/60000 (43%)]      Loss: 0.002702
Train Epoch: 3 [26240/60000 (44%)]      Loss: 0.018254
Train Epoch: 3 [26880/60000 (45%)]      Loss: 0.026180
Train Epoch: 3 [27520/60000 (46%)]      Loss: 0.039820
Train Epoch: 3 [28160/60000 (47%)]      Loss: 0.029128
Train Epoch: 3 [28800/60000 (48%)]      Loss: 0.156885
Train Epoch: 3 [29440/60000 (49%)]      Loss: 0.005369
Train Epoch: 3 [30080/60000 (50%)]      Loss: 0.225047
Train Epoch: 3 [30720/60000 (51%)]      Loss: 0.042660
Train Epoch: 3 [31360/60000 (52%)]      Loss: 0.048067
Train Epoch: 3 [32000/60000 (53%)]      Loss: 0.039501
Train Epoch: 3 [32640/60000 (54%)]      Loss: 0.005239
Train Epoch: 3 [33280/60000 (55%)]      Loss: 0.007922
Train Epoch: 3 [33920/60000 (57%)]      Loss: 0.021897
Train Epoch: 3 [34560/60000 (58%)]      Loss: 0.005966
Train Epoch: 3 [35200/60000 (59%)]      Loss: 0.033433
Train Epoch: 3 [35840/60000 (60%)]      Loss: 0.216985
Train Epoch: 3 [36480/60000 (61%)]      Loss: 0.027582
Train Epoch: 3 [37120/60000 (62%)]      Loss: 0.225407
Train Epoch: 3 [37760/60000 (63%)]      Loss: 0.093390
Train Epoch: 3 [38400/60000 (64%)]      Loss: 0.009917
Train Epoch: 3 [39040/60000 (65%)]      Loss: 0.024506
Train Epoch: 3 [39680/60000 (66%)]      Loss: 0.064933
Train Epoch: 3 [40320/60000 (67%)]      Loss: 0.065153
Train Epoch: 3 [40960/60000 (68%)]      Loss: 0.068607
Train Epoch: 3 [41600/60000 (69%)]      Loss: 0.095938
Train Epoch: 3 [42240/60000 (70%)]      Loss: 0.075178
Train Epoch: 3 [42880/60000 (71%)]      Loss: 0.092827
Train Epoch: 3 [43520/60000 (72%)]      Loss: 0.020628
Train Epoch: 3 [44160/60000 (74%)]      Loss: 0.005895
Train Epoch: 3 [44800/60000 (75%)]      Loss: 0.008903
Train Epoch: 3 [45440/60000 (76%)]      Loss: 0.061186
Train Epoch: 3 [46080/60000 (77%)]      Loss: 0.031861
Train Epoch: 3 [46720/60000 (78%)]      Loss: 0.002001
Train Epoch: 3 [47360/60000 (79%)]      Loss: 0.061551
Train Epoch: 3 [48000/60000 (80%)]      Loss: 0.039232
Train Epoch: 3 [48640/60000 (81%)]      Loss: 0.060093
Train Epoch: 3 [49280/60000 (82%)]      Loss: 0.017541
Train Epoch: 3 [49920/60000 (83%)]      Loss: 0.081639
Train Epoch: 3 [50560/60000 (84%)]      Loss: 0.010383
Train Epoch: 3 [51200/60000 (85%)]      Loss: 0.049760
Train Epoch: 3 [51840/60000 (86%)]      Loss: 0.002886
Train Epoch: 3 [52480/60000 (87%)]      Loss: 0.121278
Train Epoch: 3 [53120/60000 (88%)]      Loss: 0.282379
Train Epoch: 3 [53760/60000 (90%)]      Loss: 0.112115
Train Epoch: 3 [54400/60000 (91%)]      Loss: 0.020523
Train Epoch: 3 [55040/60000 (92%)]      Loss: 0.250528
Train Epoch: 3 [55680/60000 (93%)]      Loss: 0.120296
Train Epoch: 3 [56320/60000 (94%)]      Loss: 0.115173
Train Epoch: 3 [56960/60000 (95%)]      Loss: 0.018442
Train Epoch: 3 [57600/60000 (96%)]      Loss: 0.009513
Train Epoch: 3 [58240/60000 (97%)]      Loss: 0.212871
Train Epoch: 3 [58880/60000 (98%)]      Loss: 0.028714
Train Epoch: 3 [59520/60000 (99%)]      Loss: 0.015753

Test set: Average loss: 0.0341, Accuracy: 9887/10000 (99%)

Note the first line, cuda.is_available:  True. It indicates that the NVIDIA GPU has been mapped into the container and detected by PyTorch.
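
When the run is scripted rather than watched, the same check can be made against a saved log. The sketch below is one possible approach; the function name gpu_detected and the log file name run.log in the note that follows are hypothetical.

```shell
#!/bin/sh
# Succeeds only if the log on stdin contains the line that reports
# CUDA as available to PyTorch.
gpu_detected() {
    grep -q '^cuda\.is_available: *True'
}

# Example using the first line of the output shown above.
if printf 'cuda.is_available:  True\n' | gpu_detected; then
    echo "GPU visible to PyTorch"
else
    echo "GPU NOT visible to PyTorch"
fi
```

For a real run, the container output could be redirected to a file such as run.log (without the -it flags) and then checked with gpu_detected < run.log.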

Testing/Debugging

Run the following command to query the GPU with nvidia-smi:

    docker run -it --name demo --rm --device nvidia.com/gpu=all nvidia/test nvidia-smi

Expect output similar to the following:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P0              26W /  70W |      2MiB / 15360MiB |      5%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+    
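
The driver version in this table should match the 535.154.05 user-space libraries that the Dockerfile adds to the image. For scripted checks, nvidia-smi can emit CSV via --query-gpu=name,driver_version,memory.total --format=csv,noheader. The sketch below parses one such line; the sample line is illustrative, assembled from the table above rather than captured from a real run, and the function name is hypothetical.

```shell
#!/bin/sh
# Driver version of the libraries added to the image by the Dockerfile.
expected="535.154.05"

# Print the second comma-separated field (the driver version) of each line.
driver_version() {
    awk -F', ' '{ print $2 }'
}

# Illustrative line only, built from the values in the table above.
sample='Tesla T4, 535.154.05, 15360 MiB'
found="$(printf '%s\n' "$sample" | driver_version)"

if [ "$found" = "$expected" ]; then
    echo "driver $found matches the libraries in the image"
else
    echo "driver mismatch: reported $found, image expects $expected"
fi
```

In a real check, the sample line would be replaced by the output of the nvidia-smi query run inside the container.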

Reference