# GitLab CI Tutorial for Data Science (with uv)

## 1. Introduction to GitLab CI/CD

**GitLab CI/CD** (Continuous Integration / Continuous Deployment) is an automation tool built into GitLab. It helps you:

- run tests,
- build environments,
- train and evaluate machine learning models,
- and even deploy them.

Everything is controlled by a configuration file called `.gitlab-ci.yml`. Every time you push your code, GitLab automatically executes the pipeline.

## 2. Typical Data Science Project Structure

A minimal example:

```
ds-project/
├── data/                   # Raw or sample data (not always versioned)
├── notebooks/              # Jupyter Notebooks for exploration
├── src/                    # Python scripts (preprocessing, training, evaluation)
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
├── tests/                  # Unit tests
│   └── test_preprocessing.py
├── requirements.txt        # Dependencies
└── .gitlab-ci.yml          # GitLab CI configuration
```

## 3. Key Concepts

- **Jobs** → individual tasks (e.g., run tests, train a model).
- **Stages** → groups of jobs that run in order (e.g., `test`, `train`, `deploy`).
- **Runners** → machines that execute your jobs.

## 4. Example `.gitlab-ci.yml` using uv

```yaml
stages:
  - test
  - train
  - evaluate

# Step 1: Run tests
test-job:
  stage: test
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - pytest tests/

# Step 2: Train the model
train-job:
  stage: train
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/train.py
  artifacts:
    paths:
      - models/
      - logs/
    expire_in: 1 week

# Step 3: Evaluate the model
evaluate-job:
  stage: evaluate
  image: python:3.11
  dependencies:
    - train-job    # fetch the artifacts produced by train-job
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/evaluate.py models/model.pkl
```

## 5. How the Pipeline Works

* **test-job** → installs dependencies and runs the unit tests.
* **train-job** → trains the model and saves artifacts (models, logs).
* **evaluate-job** → uses the trained model to compute evaluation metrics.

Artifacts let you pass files (e.g., trained models) between jobs.

## 6. Why use uv instead of pip?

* **Faster** than pip and Poetry at resolving and installing packages.
* **Reproducible** with lockfiles (a lockfile sketch is given at the end of this tutorial).
* **Compatible** with `requirements.txt` and the pip interface.
* **Lightweight** and easy to integrate into CI.

Example replacement:

```bash
pip install -r requirements.txt     # old
uv pip install -r requirements.txt  # new
```

## 7. Advanced: Using `pyproject.toml`

If your project is structured as a Python package, you can drop `requirements.txt` in favor of a `pyproject.toml` (a minimal sketch is given at the end of this tutorial). Installation then becomes:

```bash
uv pip install .
```

## 8. Best Practices for Data Scientists

1. **Separate exploration and production** → keep notebooks for exploration, scripts for pipelines.
2. **Write unit tests** → check preprocessing, missing-value handling, feature engineering, etc. (see the test sketch at the end of this tutorial).
3. **Pin dependencies** → use exact versions in `requirements.txt` or a lockfile.
4. **Log experiments** → save metrics and hyperparameters (e.g., MLflow, W&B).
5. **Don't commit raw data** → use external storage (S3, DVC, etc.).

With this setup, a junior data scientist can confidently use GitLab CI/CD to automate testing, training, and evaluation of their ML models.
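As a companion to Section 6, here is one way to produce and consume a lockfile with uv. This is a minimal sketch: it assumes your top-level dependencies live in a hypothetical `requirements.in` file and that pinning them into `requirements.txt` fits your workflow.

```bash
# Pin the loose, top-level dependencies (requirements.in is a hypothetical
# input file) into an exact, fully resolved requirements.txt.
uv pip compile requirements.in -o requirements.txt

# In CI or locally (inside an activated virtual environment), install exactly
# what the pinned file specifies, removing anything that is not listed.
uv pip sync requirements.txt
```

Committing the compiled `requirements.txt` is what makes every pipeline run install the same versions.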
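For Section 7, a minimal `pyproject.toml` might look like the sketch below. The project name, version, and dependency list are placeholders rather than values from this tutorial, and depending on your layout you may also need a `[build-system]` table and package-discovery settings for `uv pip install .` to succeed.

```toml
[project]
name = "ds-project"              # placeholder matching the example layout
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    # placeholder dependencies; pin whatever your project actually needs
    "pandas>=2.0",
    "scikit-learn>=1.4",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
]
```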
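Finally, for best practice 2, a unit test for `tests/test_preprocessing.py` might look like the sketch below. The `fill_missing` function is hypothetical; replace it with whatever `src/preprocessing.py` actually exposes.

```python
# tests/test_preprocessing.py
import pandas as pd

# Hypothetical import: adjust to the functions your preprocessing module
# really defines, and make sure src/ is importable from the project root.
from src.preprocessing import fill_missing


def test_fill_missing_replaces_nans_with_column_mean():
    df = pd.DataFrame({"age": [20.0, None, 40.0]})

    result = fill_missing(df)

    # No missing values should remain, and the imputed value is the column mean.
    assert result["age"].isna().sum() == 0
    assert result.loc[1, "age"] == 30.0
```

Tests like this run automatically in `test-job`, so a broken preprocessing step fails the pipeline before any training time is spent.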