# GitLab CI Tutorial for Data Science (with uv)

## 1. Introduction to GitLab CI/CD

**GitLab CI/CD** (Continuous Integration / Continuous Deployment) is an automation tool built into GitLab. It helps you:

- run tests,
- build environments,
- train and evaluate machine learning models,
- and even deploy them.

Everything is controlled by a configuration file called `.gitlab-ci.yml`. Every time you push your code, GitLab automatically executes the pipeline.

## 2. Typical Data Science Project Structure

A minimal example:

```
ds-project/
├── data/                   # Raw or sample data (not always versioned)
├── notebooks/              # Jupyter Notebooks for exploration
├── src/                    # Python scripts (preprocessing, training, evaluation)
│   ├── preprocessing.py
│   ├── train.py
│   └── evaluate.py
├── tests/                  # Unit tests
│   └── test_preprocessing.py
├── requirements.txt        # Dependencies
└── .gitlab-ci.yml          # GitLab CI configuration
```

## 3. Key Concepts

- **Jobs** → individual tasks (e.g., run tests, train a model).
- **Stages** → groups of jobs that run in order (e.g., `test`, `train`, `deploy`).
- **Runners** → machines that execute your jobs.

## 4. Example `.gitlab-ci.yml` using uv

```yaml
stages:
  - test
  - train
  - evaluate

# Step 1: Run tests
test-job:
  stage: test
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - pytest tests/

# Step 2: Train the model
train-job:
  stage: train
  image: python:3.11
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/train.py
  artifacts:
    paths:
      - models/
      - logs/
    expire_in: 1 week

# Step 3: Evaluate the model
evaluate-job:
  stage: evaluate
  image: python:3.11
  dependencies:
    - train-job    # fetch the artifacts produced by train-job
  script:
    - pip install uv
    - uv venv .venv
    - source .venv/bin/activate
    - uv pip install -r requirements.txt
    - python src/evaluate.py models/model.pkl
```

## 5. How the Pipeline Works

* **test-job** → installs dependencies and runs the unit tests.
* **train-job** → trains the model and saves artifacts (models, logs).
* **evaluate-job** → uses the trained model to compute evaluation metrics.

Artifacts let you pass files (e.g., trained models) between jobs.

## 6. Why use uv instead of pip?

* **Faster** than pip and Poetry at resolving and installing packages.
* **Reproducible** with lockfiles (a lockfile sketch is given at the end of this tutorial).
* **Compatible** with `requirements.txt` and the pip interface.
* **Lightweight** and easy to integrate into CI.

Example replacement:

```bash
pip install -r requirements.txt     # old
uv pip install -r requirements.txt  # new
```

## 7. Advanced: Using `pyproject.toml`

If your project is structured as a Python package, you can drop `requirements.txt` in favor of a `pyproject.toml` (a minimal sketch is given at the end of this tutorial). Installation then becomes:

```bash
uv pip install .
```

## 8. Best Practices for Data Scientists

1. **Separate exploration and production** → keep notebooks for exploration, scripts for pipelines.
2. **Write unit tests** → check preprocessing, missing-value handling, feature engineering, etc. (see the test sketch at the end of this tutorial).
3. **Pin dependencies** → use exact versions in `requirements.txt` or a lockfile.
4. **Log experiments** → save metrics and hyperparameters (e.g., MLflow, W&B).
5. **Don't commit raw data** → use external storage (S3, DVC, etc.).

With this setup, a junior data scientist can confidently use GitLab CI/CD to automate testing, training, and evaluation of their ML models.
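As a companion to Section 6, here is one way to produce and consume a lockfile with uv. This is a minimal sketch: it assumes your top-level dependencies live in a hypothetical `requirements.in` file and that pinning them into `requirements.txt` fits your workflow.

```bash
# Pin the loose, top-level dependencies (requirements.in is a hypothetical
# input file) into an exact, fully resolved requirements.txt.
uv pip compile requirements.in -o requirements.txt

# In CI or locally (inside an activated virtual environment), install exactly
# what the pinned file specifies, removing anything that is not listed.
uv pip sync requirements.txt
```

Committing the compiled `requirements.txt` is what makes every pipeline run install the same versions.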
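For Section 7, a minimal `pyproject.toml` might look like the sketch below. The project name, version, and dependency list are placeholders rather than values from this tutorial, and depending on your layout you may also need a `[build-system]` table and package-discovery settings for `uv pip install .` to succeed.

```toml
[project]
name = "ds-project"              # placeholder matching the example layout
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    # placeholder dependencies; pin whatever your project actually needs
    "pandas>=2.0",
    "scikit-learn>=1.4",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
]
```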
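Finally, for best practice 2, a unit test for `tests/test_preprocessing.py` might look like the sketch below. The `fill_missing` function is hypothetical; replace it with whatever `src/preprocessing.py` actually exposes.

```python
# tests/test_preprocessing.py
import pandas as pd

# Hypothetical import: adjust to the functions your preprocessing module
# really defines, and make sure src/ is importable from the project root.
from src.preprocessing import fill_missing


def test_fill_missing_replaces_nans_with_column_mean():
    df = pd.DataFrame({"age": [20.0, None, 40.0]})

    result = fill_missing(df)

    # No missing values should remain, and the imputed value is the column mean.
    assert result["age"].isna().sum() == 0
    assert result.loc[1, "age"] == 30.0
```

Tests like this run automatically in `test-job`, so a broken preprocessing step fails the pipeline before any training time is spent.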