Digits Docker setup

#!/bin/bash

export OPENCV_VIDEOIO_PRIORITY_DC1394=0
export OPENCV_VIDEOIO_PRIORITY_LIST=""

# Ensure any old 'digits' container is stopped & removed
if docker ps -a --format '{{.Names}}' | grep -xq "digits"; then
    echo "Found existing 'digits' container. Stopping and removing it..."
    docker stop digits || true
    docker rm digits   || true
fi

# Create necessary directories for DIGITS (including the jobs dir mounted below,
# so Docker does not create it root-owned on the host)
mkdir -p media/STORAGE/digits-jobs
mkdir -p media/STORAGE/MEDIA_DB

# Run the DIGITS Docker container with GPU support
docker run --gpus all --restart unless-stopped --name digits -d \
-p 5000:5000 -p 6006:6006 \
-v "$PWD/media/STORAGE/digits-jobs:/jobs" \
-v "$PWD/media/STORAGE:/media/STORAGE" \
-v "$PWD/media/STORAGE/MEDIA_DB:/media/STORAGE/MEDIA_DB" \
-v "$PWD/DigitsScripts:/scripts" \
-v "$PWD:/workspace" \
nvidia/digits
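A note on the grep -xq container check in the script above: -x matches the entire line, so a container with a merely similar name (say digits-old, a hypothetical name used here only for illustration) does not trigger a false positive. A quick demonstration:

```shell
# -x: match the whole line exactly; -q: quiet, report via exit status only.
printf 'digits-old\nmy-digits\n' | grep -xq "digits" && echo match || echo "no match"
printf 'digits\n' | grep -xq "digits" && echo match || echo "no match"
```

The first pipeline prints "no match", the second "match".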

Caffe inference container setup

#!/usr/bin/env bash
set -euo pipefail

export OPENCV_VIDEOIO_PRIORITY_DC1394=0
export OPENCV_VIDEOIO_PRIORITY_LIST=""
export NVIDIA_VISIBLE_DEVICES=0
export NVIDIA_DRIVER_CAPABILITIES=compute,utility,video

BASE="$(pwd)"
MEDIA_ROOT="${BASE}/media/STORAGE"
JOBS_DIR="${MEDIA_ROOT}/digits-jobs"
DB_DIR="${MEDIA_ROOT}/MEDIA_DB"
SCRIPTS_DIR="${BASE}/DigitsScripts"

mkdir -p "${JOBS_DIR}" "${DB_DIR}"

docker rm -f inferenceContainer 2>/dev/null || true

# Use the public NVCaffe runtime image from NGC
IMAGE="nvcr.io/nvidia/caffe:20.03-py3"

docker pull "${IMAGE}"

docker run \
  --gpus '"device=0"' \
  --restart unless-stopped \
  --ipc=host \
  --name inferenceContainer \
  -d \
  --shm-size=2g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e NVIDIA_VISIBLE_DEVICES \
  -e NVIDIA_DRIVER_CAPABILITIES \
  -v "${JOBS_DIR}:/jobs" \
  -v "${MEDIA_ROOT}:/media/STORAGE" \
  -v "${DB_DIR}:/media/STORAGE/MEDIA_DB" \
  -v "${SCRIPTS_DIR}:/scripts" \
  -v "${BASE}:/workspace" \
  "${IMAGE}" \
  tail -f /dev/null

I build my dataset and model with the original DIGITS container, then exec into my inference container and run caffe train --solver=solver.prototxt inside the model directory.

This works on an old TITAN X GPU, but I cannot reproduce the same steps on an RTX A3000 GPU, neither under WSL2 nor on native Ubuntu.

The process gets stuck here:

I0805 21:59:37.472955 87 net.cpp:116] Using FLOAT as default forward math type
I0805 21:59:37.472975 87 net.cpp:122] Using FLOAT as default backward math type
I0805 21:59:37.472980 87 layer_factory.hpp:172] Creating layer 'train-data' of type 'Data'
I0805 21:59:37.472996 87 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
I0805 21:59:37.473366 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.473558 113 internal_thread.cpp:78] Started internal thread 113 on device 0, rank 0
I0805 21:59:37.473584 113 batch_transformer.cpp:51] Started BatchTransformer thread 113
I0805 21:59:37.473606 113 blocking_queue.cpp:40] Data layer prefetch queue empty
I0805 21:59:37.473886 87 net.cpp:205] Created Layer train-data (0)
I0805 21:59:37.473908 87 net.cpp:547] train-data -> data
I0805 21:59:37.474360 87 net.cpp:547] train-data -> label
I0805 21:59:37.474391 87 data_reader.cpp:60] Sample Data Reader threads: 1, out queues: 1, depth: 256
I0805 21:59:37.474436 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.474578 114 internal_thread.cpp:78] Started internal thread 114 on device 0, rank 0
I0805 21:59:37.482278 114 db_lmdb.cpp:36] Opened lmdb /jobs/20250801-140844-0661/train_db
I0805 21:59:37.494486 87 data_layer.cpp:199] [n0.d0.r0] Output data size: 256, 3, 64, 64
I0805 21:59:37.494531 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.494654 87 net.cpp:265] Setting up train-data
I0805 21:59:37.494673 87 net.cpp:272] TRAIN Top shape for layer 0 'train-data' 256 3 64 64 (3145728)
I0805 21:59:37.494760 115 internal_thread.cpp:78] Started internal thread 115 on device 0, rank 0
I0805 21:59:37.494760 87 net.cpp:272] TRAIN Top shape for layer 0 'train-data' 256 (256)
I0805 21:59:37.494848 87 layer_factory.hpp:172] Creating layer 'conv1' of type 'Convolution'
I0805 21:59:37.494865 87 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
I0805 21:59:37.494906 87 net.cpp:205] Created Layer conv1 (1)
I0805 21:59:37.494930 87 net.cpp:577] conv1 <- data
I0805 21:59:37.494951 87 net.cpp:547] conv1 -> conv1

After about 30 minutes, training eventually starts.
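One guess on my side: the A3000 is an Ampere (sm_86) part, and an older NVCaffe build may ship no precompiled kernels for that architecture, so the CUDA driver would have to JIT-compile every kernel from PTX at startup, which could plausibly account for a delay of this size. If that is the cause, persisting the driver's JIT cache should at least make the wait a one-time cost. A sketch of extra flags for the inference container's docker run (the host cache path is my own choice; CUDA_CACHE_DISABLE and CUDA_CACHE_MAXSIZE are standard CUDA driver environment variables, and /root/.nv/ComputeCache is the default cache location for the root user):

```shell
# Fragment to splice into the docker run command above (not a complete command).
  -e CUDA_CACHE_DISABLE=0 \
  -e CUDA_CACHE_MAXSIZE=2147483648 \
  -v "${BASE}/.nv-cache:/root/.nv/ComputeCache" \
```

With the cache mounted from the host, a second caffe train run should skip the recompilation even after the container is recreated.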