GitLab CI Tutorial for a Data Science Project (with uv)

1. Introduction to GitLab CI/CD

GitLab CI/CD (Continuous Integration / Continuous Deployment) is an automation tool built into GitLab. It helps you:

  • run tests,

  • build environments,

  • train and evaluate machine learning models,

  • and even deploy them.

Everything is controlled by a configuration file called .gitlab-ci.yml. Every time you push your code, GitLab will automatically execute the pipeline.
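
For example, here is a minimal, illustrative .gitlab-ci.yml with a single job (the job name is arbitrary); any push to the repository would trigger it:

# .gitlab-ci.yml — the smallest possible pipeline: one job
hello-job:
  image: python:3.11
  script:
    - echo "This runs on every push"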

2. Typical Data Science Project Structure

A minimal example:


ds-project/
├── data/                  # Raw or sample data (not always versioned)
├── notebooks/             # Jupyter notebooks for exploration
├── src/                   # Python scripts (preprocessing, training, evaluation)
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
├── tests/                 # Unit tests
│   └── test_preprocessing.py
├── requirements.txt       # Dependencies
└── .gitlab-ci.yml         # GitLab CI configuration
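
To make the tests/ folder concrete, here is a minimal sketch of tests/test_preprocessing.py. It assumes preprocessing.py exposes a fill_missing function that replaces missing values with column means — both names are hypothetical and only serve this example:

# tests/test_preprocessing.py — illustrative unit test
# Assumes src is importable (e.g., packaged or on the pytest path) and that
# src/preprocessing.py defines fill_missing(df), a hypothetical helper
# that replaces NaNs with the column mean.
import pandas as pd

from src.preprocessing import fill_missing


def test_fill_missing_removes_nans():
    df = pd.DataFrame({"age": [25.0, None, 35.0]})
    result = fill_missing(df)
    assert result["age"].isna().sum() == 0
    assert result.loc[1, "age"] == 30.0  # mean of 25 and 35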

3. Key Concepts

  • Jobs → individual tasks (e.g., run tests, train model).

  • Stages → groups of jobs (e.g., test, train, deploy).

  • Runners → machines that execute your jobs.
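
These three concepts map directly onto YAML keywords: a top-level key defines a job, stage assigns it to a stage, and tags routes it to a matching runner. A tiny illustration (the docker tag is hypothetical and depends on how your runners are registered):

my-job:             # job: an individual task
  stage: test       # stage: when it runs relative to other jobs
  tags:
    - docker        # runner: only runners registered with this tag pick it up
  script:
    - echo "Hello from a runner"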

4. Example .gitlab-ci.yml using uv

stages:
  - test
  - train
  - evaluate

# Step 1: Run tests
test-job:
  stage: test
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - pytest tests/   # assumes pytest is listed in requirements.txt

# Step 2: Train the model
train-job:
  stage: train
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/train.py
  artifacts:
    paths:
      - models/
      - logs/
    expire_in: 1 week

# Step 3: Evaluate the model
evaluate-job:
  stage: evaluate
  image: python:3.11
  dependencies:
    - train-job   # download the artifacts produced by train-job
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/evaluate.py models/model.pkl
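
Notice that the uv setup lines are repeated in all three jobs. If you prefer to keep the file DRY, GitLab lets you factor them into a default: before_script: block, which runs before each job's script section. A sketch (functionally equivalent to the pipeline above; only test-job is shown, the other jobs shrink the same way):

default:
  image: python:3.11
  before_script:                          # runs before every job's script
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt

test-job:
  stage: test
  script:
    - pytest tests/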

5. How the Pipeline Works

  • test-job → installs dependencies and runs unit tests.

  • train-job → trains the model and saves artifacts (models, logs).

  • evaluate-job → uses the trained model to compute evaluation metrics.

Artifacts let you pass files (e.g., models) between jobs.

6. Why use uv instead of pip?

  • Significantly faster than pip and Poetry at resolving and installing dependencies.

  • Reproducible with lockfiles (see the compile example below).

  • Compatible with requirements.txt and pip.

  • Lightweight and easy to integrate.

Example replacement:

pip install -r requirements.txt   # old
uv pip install -r requirements.txt  # new
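
For the lockfile point above: uv can also compile a fully pinned requirements file from your loose, direct dependencies. A sketch, assuming your direct dependencies live in a requirements.in file (the filename is a convention, not a requirement):

# Pin all transitive dependencies to exact versions
uv pip compile requirements.in -o requirements.txt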

7. Advanced: Using pyproject.toml

If your project is structured as a Python package, you can drop requirements.txt in favor of a pyproject.toml.
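
A minimal pyproject.toml might look like this (the project name and dependencies are placeholders):

[project]
name = "ds-project"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "pandas>=2.0",
    "scikit-learn>=1.3",
]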

Then installation becomes:

uv pip install .

8. Best Practices for Data Scientists

  1. Separate exploration and production → keep notebooks for exploration, scripts for pipelines.

  2. Write unit tests → check preprocessing, missing values, feature engineering, etc.

  3. Pin dependencies → use exact versions in requirements.txt or a lockfile.

  4. Log experiments → save metrics and hyperparameters (e.g., MLflow, W&B); see the MLflow sketch after this list.

  5. Don’t commit raw data → use external storage (S3, DVC, etc.).
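
For point 4, here is an illustrative MLflow snippet you might drop into src/train.py; the experiment name, parameter, and metric values are all made up for the example:

# Illustrative experiment logging with MLflow (values are placeholders)
import mlflow

mlflow.set_experiment("ds-project")

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)    # a hyperparameter
    mlflow.log_metric("accuracy", 0.93)      # an evaluation metric
    mlflow.log_artifact("models/model.pkl")  # the trained model file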

With this setup, a junior data scientist can confidently use GitLab CI/CD to automate testing, training, and evaluation of their ML models.