
https://github.com/treeverse/dvc
<aside> 💡
Git is for code; DVC is for data. DVC creates small text pointers (.dvc files) that Git can track, while the actual heavy data stays in separate storage (such as S3 or a local disk).
</aside>
This note was written on 2026/03/11. DVC may change how its pipelines work after that date, so please keep that in mind.
Start by creating a virtual environment (conda, uv, or whatever you like). Python 3.10 or newer is required.
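If you are not sure which interpreter your environment uses, a quick check like the following can help. This is a small sketch, and the helper name `python_is_supported` is made up for this note.

```python
import sys


def python_is_supported(version_info=sys.version_info, minimum=(3, 10)) -> bool:
    """Return True when the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    found = ".".join(str(part) for part in sys.version_info[:3])
    if python_is_supported():
        print(f"Python {found} is fine for this tutorial.")
    else:
        print(f"Python {found} is too old; please use 3.10 or newer.")
```

Run it with the interpreter inside your activated environment, since that is the one pip will install into.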
Then install all the dependencies from requirements.txt:
pip install -r requirements.txt
The file lists these packages:
fastapi
uvicorn[standard]
ultralytics
opencv-python
python-multipart
numpy
dvc
pandas
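After the install finishes, you may want to confirm that every package in the list above actually landed in your environment. The sketch below uses the standard library's `importlib.metadata`; the helper name `missing_packages` is hypothetical, and `uvicorn[standard]` is looked up as `uvicorn` because the part in brackets is an extra, not a separate distribution.

```python
from importlib import metadata

# Distribution names taken from the requirements list above.
REQUIRED = [
    "fastapi",
    "uvicorn",  # installed via uvicorn[standard]
    "ultralytics",
    "opencv-python",
    "python-multipart",
    "numpy",
    "dvc",
    "pandas",
]


def missing_packages(names):
    """Return the distribution names that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```

The check uses distribution names rather than import names on purpose, because some packages are imported under a different name (opencv-python is imported as `cv2`, for example).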
<aside> ⚠️
Please note that the dataset used here (catvdog.zip) is hosted on GitHub because it is small and only meant for the demo. In production, we don't upload large datasets to GitHub; we use DVC to manage them instead.
</aside>
To make sure you can safely follow along, experiment, and even break things without ruining the tutorial code, we are going to set up a dedicated learning branch for you.
First, clone the repository to your local machine:
git clone https://github.com/YoKummy/MLOps.git
cd MLOps
Next, let's switch to the specific baseline branch for this phase. If you are starting at Phase 1, run:
git checkout phase-1-data-and-dvc
Now, instead of working directly on phase-1-data-and-dvc (which we want to keep clean!), create your own personal learning branch from it. Let's call it my-learning-phase-1:
git checkout -b my-learning-phase-1