GitLab CI Tutorial for a Data Science Project (with uv)

1. Introduction to GitLab CI/CD

GitLab CI/CD (Continuous Integration / Continuous Deployment) is an automation tool built into GitLab. It helps you:

  • run tests,

  • build environments,

  • train and evaluate machine learning models,

  • and even deploy them.

Everything is controlled by a configuration file called .gitlab-ci.yml. Every time you push your code, GitLab will automatically execute the pipeline.
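
For example, here is a minimal, illustrative .gitlab-ci.yml with a single job (the job name is arbitrary); any push to the repository would trigger it:

# .gitlab-ci.yml — the smallest possible pipeline: one job
hello-job:
  image: python:3.11
  script:
    - echo "This runs on every push"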

2. Typical Data Science Project Structure

A minimal example:


ds-project/
├── data/                  # Raw or sample data (not always versioned)
├── notebooks/             # Jupyter notebooks for exploration
├── src/                   # Python scripts (preprocessing, training, evaluation)
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
├── tests/                 # Unit tests
│   └── test_preprocessing.py
├── requirements.txt       # Dependencies
└── .gitlab-ci.yml         # GitLab CI configuration
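
To make the tests/ folder concrete, here is a minimal sketch of tests/test_preprocessing.py. It assumes preprocessing.py exposes a fill_missing function that replaces missing values with column means — both names are hypothetical and only serve this example:

# tests/test_preprocessing.py — illustrative unit test
# Assumes src is importable (e.g., packaged or on the pytest path) and that
# src/preprocessing.py defines fill_missing(df), a hypothetical helper
# that replaces NaNs with the column mean.
import pandas as pd

from src.preprocessing import fill_missing


def test_fill_missing_removes_nans():
    df = pd.DataFrame({"age": [25.0, None, 35.0]})
    result = fill_missing(df)
    assert result["age"].isna().sum() == 0
    assert result.loc[1, "age"] == 30.0  # mean of 25 and 35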

3. Key Concepts

  • Jobs → individual tasks (e.g., run tests, train model).

  • Stages → groups of jobs (e.g., test, train, deploy).

  • Runners → machines that execute your jobs.
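
These three concepts map directly onto YAML keywords: a top-level key defines a job, stage assigns it to a stage, and tags routes it to a matching runner. A tiny illustration (the docker tag is hypothetical and depends on how your runners are registered):

my-job:             # job: an individual task
  stage: test       # stage: when it runs relative to other jobs
  tags:
    - docker        # runner: only runners registered with this tag pick it up
  script:
    - echo "Hello from a runner"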

4. Example .gitlab-ci.yml using uv

stages:
  - test
  - train
  - evaluate

# Step 1: Run tests
test-job:
  stage: test
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - pytest tests/   # assumes pytest is listed in requirements.txt

# Step 2: Train the model
train-job:
  stage: train
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/train.py
  artifacts:
    paths:
      - models/
      - logs/
    expire_in: 1 week

# Step 3: Evaluate the model
evaluate-job:
  stage: evaluate
  image: python:3.11
  dependencies:
    - train-job   # download the artifacts produced by train-job
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/evaluate.py models/model.pkl
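
Notice that the uv setup lines are repeated in all three jobs. If you prefer to keep the file DRY, GitLab lets you factor them into a default: before_script: block, which runs before each job's script section. A sketch (functionally equivalent to the pipeline above; only test-job is shown, the other jobs shrink the same way):

default:
  image: python:3.11
  before_script:                          # runs before every job's script
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt

test-job:
  stage: test
  script:
    - pytest tests/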

5. How the Pipeline Works

  • test-job → installs dependencies and runs unit tests.

  • train-job → trains the model and saves artifacts (models, logs).

  • evaluate-job → uses the trained model to compute evaluation metrics.

Artifacts let you pass files (e.g., models) between jobs.

6. Why use uv instead of pip?

  • Significantly faster than pip and Poetry at resolving and installing dependencies.

  • Reproducible with lockfiles (see the compile example below).

  • Compatible with requirements.txt and pip.

  • Lightweight and easy to integrate.

Example replacement:

pip install -r requirements.txt   # old
uv pip install -r requirements.txt  # new
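
For the lockfile point above: uv can also compile a fully pinned requirements file from your loose, direct dependencies. A sketch, assuming your direct dependencies live in a requirements.in file (the filename is a convention, not a requirement):

# Pin all transitive dependencies to exact versions
uv pip compile requirements.in -o requirements.txt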

7. Advanced: Using pyproject.toml

If your project is structured as a Python package, you can drop requirements.txt in favor of a pyproject.toml.
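
A minimal pyproject.toml might look like this (the project name and dependencies are placeholders):

[project]
name = "ds-project"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pandas>=2.0",
    "scikit-learn>=1.3",
]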

Then installation becomes:

uv pip install .

8. Best Practices for Data Scientists

  1. Separate exploration and production → keep notebooks for exploration, scripts for pipelines.

  2. Write unit tests → check preprocessing, missing values, feature engineering, etc.

  3. Pin dependencies → use exact versions in requirements.txt or a lockfile.

  4. Log experiments → save metrics and hyperparameters (e.g., MLflow, W&B); see the MLflow sketch after this list.

  5. Don’t commit raw data → use external storage (S3, DVC, etc.).
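
For point 4, here is an illustrative MLflow snippet you might drop into src/train.py; the experiment name, parameter, and metric values are all made up for the example:

# Illustrative experiment logging with MLflow (values are placeholders)
import mlflow

mlflow.set_experiment("ds-project")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)    # a hyperparameter
    mlflow.log_metric("accuracy", 0.93)      # an evaluation metric
    mlflow.log_artifact("models/model.pkl")  # the trained model file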

With this setup, a junior data scientist can confidently use GitLab CI/CD to automate testing, training, and evaluation of their ML models.