Makefile Tutorial for Data Science

1. Why use a Makefile?

A Makefile is a plain-text file, read by the make build tool, that automates repetitive tasks. As a data scientist, you can use it to:

  • Run your data preprocessing pipeline.

  • Chain multiple Python scripts (preprocessing → training → evaluation).

  • Share your workflow with teammates: a single command (make train) runs everything.

Instead of typing long commands manually, you define a rule once and execute it with make.

2. Basic structure of a Makefile

A Makefile is composed of rules:

rule_name: dependencies
    command

  • rule_name: the name you type after make.

  • dependencies: files or other rules needed before running the command.

  • command: what will be executed (must be indented with a tab, not spaces).
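The tab requirement is strict: if a recipe line is indented with spaces instead, GNU make refuses to parse the file and stops with an error such as (the line number points at the offending recipe):

Makefile:2: *** missing separator.  Stop.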

Minimal example

hello:
    echo "Hello Data Scientist!"

Run with:

make hello
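By default, make echoes each command before running it, so make hello prints both the echo line and its output. Prefixing a command with @ (a standard make feature) suppresses the echo:

hello:
    @echo "Hello Data Scientist!"

Now make hello prints only Hello Data Scientist!.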

3. Example with Python

Imagine you have three scripts:

  • 01_preprocess.py

  • 02_train.py

  • 03_evaluate.py

You want to run them in order. Here is a Makefile:

preprocess:
    python 01_preprocess.py

train: preprocess
    python 02_train.py

evaluate: train
    python 03_evaluate.py

  • make preprocess runs preprocessing only.

  • make train runs preprocess then train.

  • make evaluate runs the entire pipeline.
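Because make echoes each command as it runs, make evaluate produces output along these lines (assuming the scripts themselves print nothing):

$ make evaluate
python 01_preprocess.py
python 02_train.py
python 03_evaluate.py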

4. Using variables

To avoid repeating paths or options, define variables.

PYTHON=python3
DATA=data/raw.csv
MODEL=models/model.pkl

preprocess:
    $(PYTHON) scripts/01_preprocess.py --input $(DATA) --output data/clean.csv

train:
    $(PYTHON) scripts/02_train.py --data data/clean.csv --model $(MODEL)

evaluate:
    $(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics results/metrics.json

If you change the Python version or data path, you only need to update one line.
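Variables can also be overridden on the command line without editing the Makefile; for example (python3.11 is just an illustrative interpreter name):

make train PYTHON=python3.11

If you want the value in the Makefile to act only as a default that the environment can override, assign it with ?= instead of =:

PYTHON ?= python3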

5. Useful special rules

  • all: the conventional default rule. Typing make with no arguments runs the first rule in the file, so all is usually placed first.

all: evaluate

  • .PHONY: prevents make from confusing a rule with a file of the same name.

.PHONY: preprocess train evaluate clean

  • clean: deletes generated and intermediate files.

clean:
    rm -rf data/clean.csv models/ results/
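To see why .PHONY matters: if a file named clean ever appears in the project directory and the rule is not declared phony, make compares the rule against that file and skips it. A hypothetical session with GNU make, without the .PHONY line:

$ touch clean
$ make clean
make: 'clean' is up to date.

With .PHONY declared, the recipe runs every time, regardless of what files exist.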

6. Complete example for a data project

PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl
METRICS=results/metrics.json

.PHONY: all preprocess train evaluate clean

all: evaluate

preprocess:
    $(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

train: preprocess
    $(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)

evaluate: train
    $(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics $(METRICS)

clean:
    rm -rf $(CLEAN) $(MODEL) $(METRICS)

Usage:

  • make → runs the full pipeline.

  • make train → runs preprocessing and training only.

  • make clean → removes generated files.
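A handy check before launching anything heavy: make -n (dry run) prints the commands that would run, with variables expanded, without executing them:

$ make -n
python3 scripts/01_preprocess.py --input data/raw.csv --output data/clean.csv
python3 scripts/02_train.py --data data/clean.csv --model models/model.pkl
python3 scripts/03_evaluate.py --model models/model.pkl --metrics results/metrics.json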

7. Best practices for data scientists

  • Always define a clean rule to reset your project.

  • Use clear rule names (preprocess, train, report).

  • Document your Makefile with comments.

  • Use all for the project’s main task.

  • Version your Makefile with your code (Git).
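Comments start with #, both at the top level and next to rules. A lightly documented rule from the example above might read:

# Train the model; rebuild the clean data first if needed.
train: preprocess
    $(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)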

8. Going further

  • Use file dependencies (real files as targets and prerequisites) so rules run only when their inputs change, as sketched after this list.

  • Integrate tests (pytest) into your pipeline.

  • Automate notebook execution with papermill.

  • Build a Makefile for complete reproducibility of your project.
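As a starting point for the first item above, here is a minimal sketch where targets are real files, so make compares timestamps and skips steps whose inputs have not changed. It reuses the paths from section 6; the tests/ directory and notebooks/report.ipynb are hypothetical names to adjust for your project:

PYTHON = python3

# Runs only if the raw data or the preprocessing script changed.
data/clean.csv: scripts/01_preprocess.py data/raw.csv
    $(PYTHON) scripts/01_preprocess.py --input data/raw.csv --output data/clean.csv

# Runs only if the clean data or the training script changed.
models/model.pkl: scripts/02_train.py data/clean.csv
    $(PYTHON) scripts/02_train.py --data data/clean.csv --model models/model.pkl

.PHONY: test report

# Run the test suite with pytest.
test:
    pytest tests/

# Execute a notebook end to end with papermill.
report: models/model.pkl
    papermill notebooks/report.ipynb results/report.ipynb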

Summary

A Makefile helps automate and document your workflow. For data scientists, it is a simple but powerful tool to make projects reproducible and shareable.