# Makefile Tutorial for Data Science

## 1. Why use a Makefile?
A Makefile is a file that automates repetitive tasks. As a data scientist, you can use it to:

- Run your data preprocessing pipeline.
- Chain multiple Python scripts (preprocessing → training → evaluation).
- Share your workflow with teammates: a single command (`make train`) runs everything.

Instead of typing long commands manually, you define a rule once and execute it with `make`.
## 2. Basic structure of a Makefile
A Makefile is composed of rules:

```make
rule_name: dependencies
	command
```

- `rule_name`: the name you type after `make`.
- `dependencies`: files or other rules needed before running the command.
- `command`: what will be executed (must be indented with a tab, not spaces).
### Minimal example

```make
hello:
	echo "Hello Data Scientist!"
```
Run with:

```shell
make hello
```
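One detail worth knowing: by default, `make` prints each command before running it. Prefixing a command with `@` suppresses that echo, which keeps output clean. A small variant of the rule above:

```make
# The @ prefix tells make not to print the command itself,
# so only the message appears in the terminal.
hello:
	@echo "Hello Data Scientist!"
```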
## 3. Example with Python
Imagine you have three scripts:

- `01_preprocess.py`
- `02_train.py`
- `03_evaluate.py`
You want to run them in order. Here is a Makefile:

```make
preprocess:
	python 01_preprocess.py

train: preprocess
	python 02_train.py

evaluate: train
	python 03_evaluate.py
```
- `make preprocess` runs preprocessing only.
- `make train` runs `preprocess`, then `train`.
- `make evaluate` runs the entire pipeline.
## 4. Using variables
To avoid repeating paths or options, define variables.
```make
PYTHON=python3
DATA=data/raw.csv
MODEL=models/model.pkl

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(DATA) --output data/clean.csv

train:
	$(PYTHON) scripts/02_train.py --data data/clean.csv --model $(MODEL)

evaluate:
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics results/metrics.json
```
If you change the Python version or data path, you only need to update one line.
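You can also change a variable for a single run without editing the file: command-line assignments take precedence over definitions inside the Makefile. For example, with the `PYTHON` variable defined above:

```make
# In the Makefile, `=` sets a default value...
PYTHON=python3

train:
	$(PYTHON) scripts/02_train.py --data data/clean.csv --model models/model.pkl
```

```shell
# ...which a command-line assignment overrides for this invocation only
# (python3.11 here is just an illustrative interpreter name):
make train PYTHON=python3.11
```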
## 5. Useful special rules
`all`: by convention, the default rule executed when you just type `make` (make runs the first rule in the file, so `all` is usually placed first).

```make
all: evaluate
```

`.PHONY`: prevents `make` from confusing a rule with a file of the same name.

```make
.PHONY: preprocess train evaluate clean
```
`clean`: deletes intermediate files.

```make
clean:
	rm -rf data/clean.csv models/ results/
```
## 6. Complete example for a data project
```make
PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl
METRICS=results/metrics.json

.PHONY: all preprocess train evaluate clean

all: evaluate

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

train: preprocess
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)

evaluate: train
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics $(METRICS)

clean:
	rm -rf $(CLEAN) $(MODEL) $(METRICS)
```
Usage:
- `make` → runs the full pipeline.
- `make train` → runs preprocessing and training only.
- `make clean` → removes generated files.
## 7. Best practices for data scientists
- Always define a `clean` rule to reset your project.
- Use clear rule names (`preprocess`, `train`, `report`).
- Document your Makefile with comments.
- Use `all` for the project's main task.
- Version your Makefile with your code (Git).
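On the documentation point: in a Makefile, lines starting with `#` are comments, and a short note above each rule goes a long way. A sketch, reusing the rules from the complete example above:

```make
# Train the model; depends on preprocessing being up to date.
# Output: models/model.pkl
train: preprocess
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```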
## 8. Going further
- Use file dependencies so rules run only if inputs change.
- Integrate tests (`pytest`) into your pipeline.
- Automate notebook execution with `papermill`.
- Build a Makefile for complete reproducibility of your project.
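For the first point: when targets and dependencies are actual files, make compares their timestamps and skips any step whose output is already newer than its inputs. A minimal sketch, reusing the variables from the complete example above:

```make
# The target is a real file ($(CLEAN)), so make rebuilds it only when
# $(RAW) or the preprocessing script has changed since the last run.
$(CLEAN): $(RAW) scripts/01_preprocess.py
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

# Likewise, the model is retrained only if the clean data
# or the training script is newer than the saved model file.
$(MODEL): $(CLEAN) scripts/02_train.py
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```

Running `make models/model.pkl` twice in a row does nothing the second time, because no input has changed; this is what makes large pipelines cheap to re-run.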
## Summary
A Makefile helps automate and document your workflow. For data scientists, it is a simple but powerful tool to make projects reproducible and shareable.