# Makefile Tutorial for Data Science

## 1. Why use a Makefile?

A **Makefile** is a text file that tells the `make` tool how to run repetitive tasks. As a data scientist, you can use it to:

- Run your data preprocessing pipeline.
- Chain multiple Python scripts (preprocessing → training → evaluation).
- Share your workflow with teammates: a single command (`make train`) runs everything.

Instead of typing long commands manually, you define a rule once and execute it with `make`.

## 2. Basic structure of a Makefile

A Makefile is composed of **rules**:

```makefile
rule_name: dependencies
	command
```

* **rule_name**: the name (the *target*) you type after `make`.
* **dependencies**: files or other rules needed before running the command.
* **command**: what will be executed (must be indented with a **tab**, not spaces).
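The tab requirement trips up editors configured to insert spaces. If a command line is indented with spaces instead of a tab, GNU make stops with an error along these lines (the exact line number and wording depend on your Makefile and make version):

```text
Makefile:2: *** missing separator.  Stop.
```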
### Minimal example

```makefile
hello:
	echo "Hello Data Scientist!"
```

Run with:

```bash
make hello
```

## 3. Example with Python

Imagine you have three scripts:

* `01_preprocess.py`
* `02_train.py`
* `03_evaluate.py`

You want to run them in order. Here is a Makefile:

```makefile
preprocess:
	python 01_preprocess.py

train: preprocess
	python 02_train.py

evaluate: train
	python 03_evaluate.py
```

* `make preprocess` runs preprocessing only.
* `make train` runs `preprocess`, then `train`.
* `make evaluate` runs the entire pipeline.
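Because `make` echoes each command before executing it, you can watch the dependency chain resolve in order. Assuming the scripts themselves print nothing, running the last target looks like this:

```bash
$ make evaluate
python 01_preprocess.py
python 02_train.py
python 03_evaluate.py
```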
## 4. Using variables

To avoid repeating paths or options, define **variables**.

```makefile
PYTHON=python3
DATA=data/raw.csv
MODEL=models/model.pkl

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(DATA) --output data/clean.csv

train:
	$(PYTHON) scripts/02_train.py --data data/clean.csv --model $(MODEL)

evaluate:
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics results/metrics.json
```

If you change the Python version or data path, you only need to update one line.
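Variables can also be overridden from the command line for a single run, without editing the Makefile at all. For example, to try a different interpreter (the version shown here is just an illustration):

```bash
make train PYTHON=python3.12
```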
## 5. Useful special rules

* **`all`**: by convention, the default rule. Running `make` with no arguments executes the *first* rule in the file, so `all` is usually placed at the top and made to depend on the final step:

```makefile
all: evaluate
```

* **`.PHONY`**: prevents `make` from confusing a rule with a file of the same name (see the demonstration after this list).

```makefile
.PHONY: preprocess train evaluate clean
```

* **cleaning**: delete intermediate files.

```makefile
clean:
	rm -rf data/clean.csv models/ results/
```
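To see why `.PHONY` matters, imagine a stray file named `clean` appears in your project directory. Without the `.PHONY` declaration, `make` compares the rule against that file, decides there is nothing to do, and skips your command:

```bash
touch clean   # a stray file named "clean" now exists
make clean    # without .PHONY, GNU make reports: "make: 'clean' is up to date."
```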
## 6. Complete example for a data project

```makefile
PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl
METRICS=results/metrics.json

.PHONY: all preprocess train evaluate clean

all: evaluate

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

train: preprocess
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)

evaluate: train
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics $(METRICS)

clean:
	rm -rf $(CLEAN) $(MODEL) $(METRICS)
```

Usage:

* `make` → runs the full pipeline.
* `make train` → runs preprocessing and training only.
* `make clean` → removes generated files.
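Before running a pipeline like this on real data, it can be worth previewing what `make` would do. The standard `-n` (dry-run) flag prints the commands without executing them:

```bash
make -n evaluate
```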
## 7. Best practices for data scientists

* Always define a **`clean`** rule to reset your project.
* Use clear rule names (`preprocess`, `train`, `report`).
* Document your Makefile with **comments**.
* Use `all` for the project’s main task.
* Version your Makefile with your code (Git).
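One common way to document a Makefile is a self-documenting `help` target: each rule carries a `##` comment, and `help` extracts and prints them. This is a widely used convention rather than a built-in feature, and the exact `grep`/`awk` incantation varies from project to project. A minimal sketch:

```makefile
.PHONY: help
help:  ## Show available targets and their descriptions
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | awk -F ':.*## ' '{printf "%-12s %s\n", $$1, $$2}'

train:  ## Train the model on the cleaned data
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```

With this in place, `make help` lists every documented target alongside its description.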
## 8. Going further

* Use **file dependencies** so rules run only if inputs change (see the first sketch below).
* Integrate tests (`pytest`) into your pipeline (see the second sketch below).
* Automate notebook execution with `papermill`.
* Build a Makefile for complete **reproducibility** of your project.
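So far every target has been phony, so `make evaluate` reruns the whole pipeline every time. If instead each target is the *output file* and its prerequisites are the inputs, `make` compares timestamps and rebuilds only what is stale. A sketch, reusing the variables from section 6:

```makefile
PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl

# The target is a real file: it is rebuilt only when a prerequisite
# (the script or the input data) is newer than it.
$(CLEAN): scripts/01_preprocess.py $(RAW)
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

$(MODEL): scripts/02_train.py $(CLEAN)
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```

Running `make models/model.pkl` twice in a row now does real work only the first time; on the second run, `make` sees nothing has changed and reports the target is up to date.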
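Tests and notebooks slot in as ordinary targets. The paths below (`tests/`, `notebooks/analysis.ipynb`, `results/analysis_output.ipynb`) are placeholders to adapt to your project; `papermill` is invoked in its basic `input output` form:

```makefile
.PHONY: test report

test:
	$(PYTHON) -m pytest tests/

report:
	papermill notebooks/analysis.ipynb results/analysis_output.ipynb
```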
### Summary

A Makefile helps automate and document your workflow. For data scientists, it is a simple but powerful tool to make projects **reproducible and shareable**.