# Makefile Tutorial for Data Science

## 1. Why use a Makefile?

A **Makefile** is a text file that tells the `make` tool how to run repetitive tasks. As a data scientist, you can use it to:

- Run your data preprocessing pipeline.
- Chain multiple Python scripts (preprocessing → training → evaluation).
- Share your workflow with teammates: a single command (`make train`) runs everything.

Instead of typing long commands manually, you define a rule once and execute it with `make`.

## 2. Basic structure of a Makefile

A Makefile is composed of **rules**:

```makefile
rule_name: dependencies
	command
```

* **rule_name**: the name (the *target*) you type after `make`.
* **dependencies**: files or other rules needed before running the command.
* **command**: what will be executed (must be indented with a **tab**, not spaces).
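The tab requirement trips up editors configured to insert spaces. If a command line is indented with spaces instead of a tab, GNU make stops with an error along these lines (the exact line number and wording depend on your Makefile and make version):

```text
Makefile:2: *** missing separator.  Stop.
```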
### Minimal example

```makefile
hello:
	echo "Hello Data Scientist!"
```

Run with:

```bash
make hello
```

## 3. Example with Python

Imagine you have three scripts:

* `01_preprocess.py`
* `02_train.py`
* `03_evaluate.py`

You want to run them in order. Here is a Makefile:

```makefile
preprocess:
	python 01_preprocess.py

train: preprocess
	python 02_train.py

evaluate: train
	python 03_evaluate.py
```

* `make preprocess` runs preprocessing only.
* `make train` runs `preprocess`, then `train`.
* `make evaluate` runs the entire pipeline.
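Because `make` echoes each command before executing it, you can watch the dependency chain resolve in order. Assuming the scripts themselves print nothing, running the last target looks like this:

```bash
$ make evaluate
python 01_preprocess.py
python 02_train.py
python 03_evaluate.py
```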
## 4. Using variables

To avoid repeating paths or options, define **variables**.

```makefile
PYTHON=python3
DATA=data/raw.csv
MODEL=models/model.pkl

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(DATA) --output data/clean.csv

train:
	$(PYTHON) scripts/02_train.py --data data/clean.csv --model $(MODEL)

evaluate:
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics results/metrics.json
```

If you change the Python version or data path, you only need to update one line.
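Variables can also be overridden from the command line for a single run, without editing the Makefile at all. For example, to try a different interpreter (the version shown here is just an illustration):

```bash
make train PYTHON=python3.12
```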
## 5. Useful special rules

* **`all`**: by convention, the default rule. Running `make` with no arguments executes the *first* rule in the file, so `all` is usually placed at the top and made to depend on the final step:

```makefile
all: evaluate
```

* **`.PHONY`**: prevents `make` from confusing a rule with a file of the same name (see the demonstration after this list).

```makefile
.PHONY: preprocess train evaluate clean
```

* **cleaning**: delete intermediate files.

```makefile
clean:
	rm -rf data/clean.csv models/ results/
```
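To see why `.PHONY` matters, imagine a stray file named `clean` appears in your project directory. Without the `.PHONY` declaration, `make` compares the rule against that file, decides there is nothing to do, and skips your command:

```bash
touch clean   # a stray file named "clean" now exists
make clean    # without .PHONY, GNU make reports: "make: 'clean' is up to date."
```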
## 6. Complete example for a data project

```makefile
PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl
METRICS=results/metrics.json

.PHONY: all preprocess train evaluate clean

all: evaluate

preprocess:
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

train: preprocess
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)

evaluate: train
	$(PYTHON) scripts/03_evaluate.py --model $(MODEL) --metrics $(METRICS)

clean:
	rm -rf $(CLEAN) $(MODEL) $(METRICS)
```

Usage:

* `make` → runs the full pipeline.
* `make train` → runs preprocessing and training only.
* `make clean` → removes generated files.
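Before running a pipeline like this on real data, it can be worth previewing what `make` would do. The standard `-n` (dry-run) flag prints the commands without executing them:

```bash
make -n evaluate
```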
## 7. Best practices for data scientists

* Always define a **`clean`** rule to reset your project.
* Use clear rule names (`preprocess`, `train`, `report`).
* Document your Makefile with **comments**.
* Use `all` for the project’s main task.
* Version your Makefile with your code (Git).
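One common way to document a Makefile is a self-documenting `help` target: each rule carries a `##` comment, and `help` extracts and prints them. This is a widely used convention rather than a built-in feature, and the exact `grep`/`awk` incantation varies from project to project. A minimal sketch:

```makefile
.PHONY: help
help:  ## Show available targets and their descriptions
	@grep -E '^[a-zA-Z_-]+:.*## ' $(MAKEFILE_LIST) | awk -F ':.*## ' '{printf "%-12s %s\n", $$1, $$2}'

train:  ## Train the model on the cleaned data
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```

With this in place, `make help` lists every documented target alongside its description.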
## 8. Going further

* Use **file dependencies** so rules run only if inputs change (see the first sketch below).
* Integrate tests (`pytest`) into your pipeline (see the second sketch below).
* Automate notebook execution with `papermill`.
* Build a Makefile for complete **reproducibility** of your project.
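So far every target has been phony, so `make evaluate` reruns the whole pipeline every time. If instead each target is the *output file* and its prerequisites are the inputs, `make` compares timestamps and rebuilds only what is stale. A sketch, reusing the variables from section 6:

```makefile
PYTHON=python3
RAW=data/raw.csv
CLEAN=data/clean.csv
MODEL=models/model.pkl

# The target is a real file: it is rebuilt only when a prerequisite
# (the script or the input data) is newer than it.
$(CLEAN): scripts/01_preprocess.py $(RAW)
	$(PYTHON) scripts/01_preprocess.py --input $(RAW) --output $(CLEAN)

$(MODEL): scripts/02_train.py $(CLEAN)
	$(PYTHON) scripts/02_train.py --data $(CLEAN) --model $(MODEL)
```

Running `make models/model.pkl` twice in a row now does real work only the first time; on the second run, `make` sees nothing has changed and reports the target is up to date.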
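Tests and notebooks slot in as ordinary targets. The paths below (`tests/`, `notebooks/analysis.ipynb`, `results/analysis_output.ipynb`) are placeholders to adapt to your project; `papermill` is invoked in its basic `input output` form:

```makefile
.PHONY: test report

test:
	$(PYTHON) -m pytest tests/

report:
	papermill notebooks/analysis.ipynb results/analysis_output.ipynb
```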
### Summary

A Makefile helps automate and document your workflow. For data scientists, it is a simple but powerful tool to make projects **reproducible and shareable**.