
https://github.com/treeverse/dvc

https://doc.dvc.org/

<aside> 💡

Git is for code; DVC is for data. DVC creates small text pointers (.dvc files) that Git can track, while the actual heavy data stays in separate storage (such as S3 or a local disk).

</aside>
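
To make the pointer idea concrete, here is a minimal sketch in plain Python. It is not DVC's implementation: `make_pointer` is a hypothetical helper, and the temporary file only stands in for a heavy dataset. It assumes the pointer records an MD5 hash, a size, and a path, which roughly mirrors the fields `dvc add` writes into a .dvc file.

```python
import hashlib
import tempfile
from pathlib import Path


def make_pointer(data_path: Path) -> dict:
    """Build a small, Git-friendly description of a large file.

    Simplified illustration only: real .dvc files are YAML written by
    `dvc add`, and DVC handles directories and caching on top of this.
    """
    digest = hashlib.md5()
    with data_path.open("rb") as f:
        # Read in chunks so a large file never has to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": data_path.stat().st_size,
        "path": data_path.name,
    }


# Demo with a throwaway file standing in for a heavy dataset.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "catvdog.zip"  # hypothetical stand-in, not the real dataset
    data_file.write_bytes(b"pretend this is a large dataset")
    pointer = make_pointer(data_file)

# The pointer stays a few dozen bytes no matter how large the data is.
print(pointer)
```

Only the small dictionary would need to live in Git; the file it describes can sit in whatever storage you configure.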

This note was written on 2026/03/11. DVC may change how its pipelines work after that date, so keep that in mind.

Phase 1: Data and DVC

Start by creating a virtual environment (conda, uv, or whatever you prefer). Python 3.10+ is required.
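
If you are unsure which interpreter your environment picked up, a quick check like the one below can help. It is only a sketch: `python_is_supported` and `MIN_PYTHON` are names made up for this note, and the 3.10 minimum comes from the requirement stated above.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in this note


def python_is_supported(version=None) -> bool:
    """Return True when a (major, minor, ...) tuple meets the minimum."""
    if version is None:
        version = sys.version_info
    # Compare only major and minor, e.g. (3, 12) >= (3, 10).
    return tuple(version[:2]) >= MIN_PYTHON


current = f"{sys.version_info.major}.{sys.version_info.minor}"
if python_is_supported():
    print(f"Python {current} is OK")
else:
    print(f"Python {current} is too old; 3.10+ is required")
```

Run it inside the activated environment, since that is the interpreter pip will install into.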

First, install all the dependencies from requirements.txt:

	pip install -r requirements.txt

The requirements.txt file contains:

	fastapi
	uvicorn[standard]
	ultralytics
	opencv-python
	python-multipart
	numpy
	dvc
	pandas
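
After the install finishes, you may want to confirm that everything actually landed in the active environment. The sketch below uses the standard library's `importlib.metadata`; `REQUIRED` and `find_missing` are hypothetical names for this note, and the list is copied by hand from the package listing above (with the `[standard]` extra dropped, since only the distribution name is looked up).

```python
from importlib import metadata

# Distribution names copied from the requirements listing above.
# "uvicorn[standard]" is checked as plain "uvicorn".
REQUIRED = [
    "fastapi",
    "uvicorn",
    "ultralytics",
    "opencv-python",
    "python-multipart",
    "numpy",
    "dvc",
    "pandas",
]


def find_missing(names):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


missing = find_missing(REQUIRED)
print("Missing packages:", missing if missing else "none")
```

An empty result means pip installed every listed package; anything reported as missing usually points to the wrong environment being active.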

<aside> ❗

Please note that the dataset used here (catvdog.zip) is stored on GitHub because it is small and only used for the demo. In production, we don't upload large datasets to GitHub; we use DVC to manage them instead!

</aside>

Getting Started: Setting Up Your Learning Sandbox

To make sure you can safely follow along, experiment, and even break things without ruining the tutorial code, we are going to set up a dedicated learning branch for you.

First, clone the repository to your local machine:

	git clone https://github.com/YoKummy/MLOps.git
	cd MLOps

Next, let's switch to the specific baseline branch for this phase. If you are starting at Phase 1, run:

	git checkout phase-1-data-and-dvc

Now, instead of working directly on the phase-1-data-and-dvc branch (which we want to keep clean!), create your own personal learning branch from it. Let's call it my-learning-phase-1:

	git checkout -b my-learning-phase-1