Makefile Tutorial for Data Science

1. Why use a Makefile?

A Makefile is a plain-text file, read by the make build tool, that automates repetitive tasks. As a data scientist, you can use it to:

  • Run your data preprocessing pipeline.

  • Chain multiple Python scripts (preprocessing → training → evaluation).

  • Share your workflow with teammates: a single command (make train) runs everything.

Instead of typing long commands manually, you define a rule once and execute it with make.

2. Basic structure of a Makefile

A Makefile is composed of rules:

rule_name: dependencies
    command

  • rule_name: the name you type after make.

  • dependencies: files or other rules needed before running the command.

  • command: what will be executed (must be indented with a tab, not spaces).
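The tab requirement is strict: if a recipe line is indented with spaces instead, GNU make refuses to parse the file and stops with an error such as (the line number points at the offending recipe):

Makefile:2: *** missing separator.  Stop.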

Minimal example

hello:
    echo "Hello Data Scientist!"

Run with:

make hello
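By default, make echoes each command before running it, so make hello prints both the echo line and its output. Prefixing a command with @ (a standard make feature) suppresses the echo:

hello:
    @echo "Hello Data Scientist!"

Now make hello prints only Hello Data Scientist!.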

3. Example with Python

Imagine you have three scripts:

  • 01_preprocess.py

  • 02_train.py

  • 03_evaluate.py

You want to run them in order. Here is a Makefile:

preprocess:
    python 01_preprocess.py

train: preprocess
    python 02_train.py

evaluate: train
    python 03_evaluate.py

  • make preprocess runs preprocessing only.

  • make train runs preprocess then train.

  • make evaluate runs the entire pipeline.
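Because make echoes each command as it runs, make evaluate produces output along these lines (assuming the scripts themselves print nothing):

$ make evaluate
python 01_preprocess.py
python 02_train.py
python 03_evaluate.py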

4. Using variables

To avoid repeating paths or options, define variables.

PYTHON=python3
DATA=data/raw.csv
MODEL=models/model.pkl

preprocess:
    $(PYTHON) scripts/01_preprocess.py --input $(DATA) --output data/clean.csv

train:
    $(PYTHON) scripts/02_train.py --data data/clean.csv --model $(MODEL)

evaluate:
    $(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics results/metrics.json

If you change the Python version or data path, you only need to update one line.
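Variables can also be overridden on the command line without editing the Makefile; for example (python3.11 is just an illustrative interpreter name):

make train PYTHON=python3.11

If you want the value in the Makefile to act only as a default that the environment can override, assign it with ?= instead of =:

PYTHON ?= python3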

5. Useful special rules

  • all: the conventional default rule. Typing make with no arguments runs the first rule in the file, so all is usually placed first.

all: evaluate

  • .PHONY: prevents make from confusing a rule with a file of the same name.

.PHONY: preprocess train evaluate clean

  • clean: deletes generated and intermediate files.

clean:
    rm -rf data/clean.csv models/ results/
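To see why .PHONY matters: if a file named clean ever appears in the project directory and the rule is not declared phony, make compares the rule against that file and skips it. A hypothetical session with GNU make, without the .PHONY line:

$ touch clean
$ make clean
make: 'clean' is up to date.

With .PHONY declared, the recipe runs every time, regardless of what files exist.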

6. Complete example for a data project

PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl
METRICS=results/metrics.json

.PHONY: all preprocess train evaluate clean

all: evaluate

preprocess:
    $(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

train: preprocess
    $(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)

evaluate: train
    $(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics $(METRICS)

clean:
    rm -rf $(CLEAN) $(MODEL) $(METRICS)

Usage:

  • make → runs the full pipeline.

  • make train → runs preprocessing and training only.

  • make clean → removes generated files.
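A handy check before launching anything heavy: make -n (dry run) prints the commands that would run, with variables expanded, without executing them:

$ make -n
python3 scripts/01_preprocess.py --input data/raw.csv --output data/clean.csv
python3 scripts/02_train.py --data data/clean.csv --model models/model.pkl
python3 scripts/03_evaluate.py --model models/model.pkl --metrics results/metrics.json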

7. Best practices for data scientists

  • Always define a clean rule to reset your project.

  • Use clear rule names (preprocess, train, report).

  • Document your Makefile with comments.

  • Use all for the project’s main task.

  • Version your Makefile with your code (Git).
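Comments start with #, both at the top level and next to rules. A lightly documented rule from the example above might read:

# Train the model; rebuild the clean data first if needed.
train: preprocess
    $(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)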

8. Going further

  • Use file dependencies (real files as targets and prerequisites) so rules run only when their inputs change, as sketched after this list.

  • Integrate tests (pytest) into your pipeline.

  • Automate notebook execution with papermill.

  • Build a Makefile for complete reproducibility of your project.
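As a starting point for the first item above, here is a minimal sketch where targets are real files, so make compares timestamps and skips steps whose inputs have not changed. It reuses the paths from section 6; the tests/ directory and notebooks/report.ipynb are hypothetical names to adjust for your project:

PYTHON = python3

# Runs only if the raw data or the preprocessing script changed.
data/clean.csv: scripts/01_preprocess.py data/raw.csv
    $(PYTHON) scripts/01_preprocess.py --input data/raw.csv --output data/clean.csv

# Runs only if the clean data or the training script changed.
models/model.pkl: scripts/02_train.py data/clean.csv
    $(PYTHON) scripts/02_train.py --data data/clean.csv --model models/model.pkl

.PHONY: test report

# Run the test suite with pytest.
test:
    pytest tests/

# Execute a notebook end to end with papermill.
report: models/model.pkl
    papermill notebooks/report.ipynb results/report.ipynb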

Summary

A Makefile helps automate and document your workflow. For data scientists, it is a simple but powerful tool to make projects reproducible and shareable.