DIGITS Docker setup
#!/bin/bash
set -euo pipefail
export OPENCV_VIDEOIO_PRIORITY_DC1394=0
export OPENCV_VIDEOIO_PRIORITY_LIST=""
# Ensure any old 'digits' container is stopped & removed
if docker ps -a --format '{{.Names}}' | grep -xq "digits"; then
  echo "Found existing 'digits' container. Stopping and removing it..."
  docker stop digits || true
  docker rm digits || true
fi
# Create the directories DIGITS needs, including the jobs directory
# referenced in the volume mounts below
mkdir -p media/STORAGE/MEDIA_DB media/STORAGE/digits-jobs
# Run the DIGITS Docker container with GPU support
docker run --gpus all --restart unless-stopped --name digits -d \
  -p 5000:5000 -p 6006:6006 \
  -v "$PWD/media/STORAGE/digits-jobs:/jobs" \
  -v "$PWD/media/STORAGE:/media/STORAGE" \
  -v "$PWD/media/STORAGE/MEDIA_DB:/media/STORAGE/MEDIA_DB" \
  -v "$PWD/DigitsScripts:/scripts" \
  -v "$PWD:/workspace" \
  nvidia/digits
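
As a quick sanity check after launching (a minimal sketch; it assumes `curl` is available on the host and uses the port mapping above):

```shell
# Wait until the DIGITS web UI answers on the mapped port, then
# confirm the container can see the GPU through the NVIDIA runtime.
until curl -sf http://localhost:5000 >/dev/null; do
  echo "Waiting for DIGITS to come up..."
  sleep 2
done
docker exec digits nvidia-smi
```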
Caffe inference container setup
#!/usr/bin/env bash
set -euo pipefail
export OPENCV_VIDEOIO_PRIORITY_DC1394=0
export OPENCV_VIDEOIO_PRIORITY_LIST=""
export NVIDIA_VISIBLE_DEVICES=0
export NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
BASE="$(pwd)"
MEDIA_ROOT="${BASE}/media/STORAGE"
JOBS_DIR="${MEDIA_ROOT}/digits-jobs"
DB_DIR="${MEDIA_ROOT}/MEDIA_DB"
SCRIPTS_DIR="${BASE}/DigitsScripts"
mkdir -p "${JOBS_DIR}" "${DB_DIR}"
docker rm -f inferenceContainer 2>/dev/null || true
# Use a public NVIDIA Caffe runtime image from NGC
IMAGE="nvcr.io/nvidia/caffe:20.03-py3"
docker pull "${IMAGE}"
docker run \
  --gpus '"device=0"' \
  --restart unless-stopped \
  --ipc=host \
  --name inferenceContainer \
  -d \
  --shm-size=2g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e NVIDIA_VISIBLE_DEVICES \
  -e NVIDIA_DRIVER_CAPABILITIES \
  -v "${JOBS_DIR}:/jobs" \
  -v "${MEDIA_ROOT}:/media/STORAGE" \
  -v "${DB_DIR}:/media/STORAGE/MEDIA_DB" \
  -v "${SCRIPTS_DIR}:/scripts" \
  -v "${BASE}:/workspace" \
  "${IMAGE}" \
  tail -f /dev/null
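
Once the container is up, a quick way to confirm Caffe actually sees the GPU before starting any training (a sketch; `caffe device_query` is part of the standard Caffe command-line tool shipped in the image):

```shell
# Print device properties (name, compute capability, memory) for GPU 0
# as seen from inside the inference container.
docker exec inferenceContainer caffe device_query -gpu 0
```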
I build my dataset and model using the original DIGITS container, then enter my inference container and run
caffe train --solver=solver.prototxt
inside the model directory. This works on an old TITAN X GPU, but I cannot reproduce the steps on an RTX A3000 GPU, either under WSL2 or on native Ubuntu.
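
For reference, the same training step can also be launched from the host without an interactive shell (a sketch; the job ID here is hypothetical and must be replaced with the directory DIGITS actually created for the model):

```shell
# Hypothetical DIGITS job directory; substitute your own model job ID.
JOB_ID=20250801-140844-0661
docker exec -w "/jobs/${JOB_ID}" inferenceContainer \
  caffe train --solver=solver.prototxt
```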
The process gets stuck here:
I0805 21:59:37.472955 87 net.cpp:116] Using FLOAT as default forward math type
I0805 21:59:37.472975 87 net.cpp:122] Using FLOAT as default backward math type
I0805 21:59:37.472980 87 layer_factory.hpp:172] Creating layer 'train-data' of type 'Data'
I0805 21:59:37.472996 87 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
I0805 21:59:37.473366 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.473558 113 internal_thread.cpp:78] Started internal thread 113 on device 0, rank 0
I0805 21:59:37.473584 113 batch_transformer.cpp:51] Started BatchTransformer thread 113
I0805 21:59:37.473606 113 blocking_queue.cpp:40] Data layer prefetch queue empty
I0805 21:59:37.473886 87 net.cpp:205] Created Layer train-data (0)
I0805 21:59:37.473908 87 net.cpp:547] train-data -> data
I0805 21:59:37.474360 87 net.cpp:547] train-data -> label
I0805 21:59:37.474391 87 data_reader.cpp:60] Sample Data Reader threads: 1, out queues: 1, depth: 256
I0805 21:59:37.474436 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.474578 114 internal_thread.cpp:78] Started internal thread 114 on device 0, rank 0
I0805 21:59:37.482278 114 db_lmdb.cpp:36] Opened lmdb /jobs/20250801-140844-0661/train_db
I0805 21:59:37.494486 87 data_layer.cpp:199] [n0.d0.r0] Output data size: 256, 3, 64, 64
I0805 21:59:37.494531 87 internal_thread.cpp:18] {0} Starting 1 internal thread(s) on device 0
I0805 21:59:37.494654 87 net.cpp:265] Setting up train-data
I0805 21:59:37.494673 87 net.cpp:272] TRAIN Top shape for layer 0 'train-data' 256 3 64 64 (3145728)
I0805 21:59:37.494760 115 internal_thread.cpp:78] Started internal thread 115 on device 0, rank 0
I0805 21:59:37.494760 87 net.cpp:272] TRAIN Top shape for layer 0 'train-data' 256 (256)
I0805 21:59:37.494848 87 layer_factory.hpp:172] Creating layer 'conv1' of type 'Convolution'
I0805 21:59:37.494865 87 layer_factory.hpp:184] Layer's types are Ftype:FLOAT Btype:FLOAT Fmath:FLOAT Bmath:FLOAT
I0805 21:59:37.494906 87 net.cpp:205] Created Layer conv1 (1)
I0805 21:59:37.494930 87 net.cpp:577] conv1 <- data
I0805 21:59:37.494951 87 net.cpp:547] conv1 -> conv1
After about 30 minutes, training finally starts.
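
To put a number on the stall, the gap between two glog timestamps can be computed directly from the log (a small sketch; the first sample line is taken from the output above, while the second is a hypothetical later line standing in for the first post-stall entry):

```shell
# Convert the HH:MM:SS part of a glog line (e.g. "I0805 21:59:37.494951 ...")
# to seconds since midnight.
log_seconds() {
  local ts h m s
  ts=$(echo "$1" | awk '{print $2}' | cut -d. -f1)   # extract HH:MM:SS
  IFS=: read -r h m s <<<"$ts"
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

start=$(log_seconds "I0805 21:59:37.472955 87 net.cpp:116] Using FLOAT as default forward math type")
# Hypothetical post-stall line; paste the real one from your log here.
end=$(log_seconds "I0805 22:29:40.123456 87 solver.cpp] Iteration 0")
echo "Stalled for $(( end - start )) seconds"   # 1803 seconds, i.e. ~30 minutes
```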