Git Workflow for Data Science Projects

Git is an essential tool for versioning your code, collaborating with your team, and avoiding the loss of work. This tutorial introduces the most common commands you will use in your day-to-day projects, based on the workflow between your local environment and the remote repository.

Daily Workflow

  1. Get the project:

    git clone <url>
    
  2. Work on a new branch:

    git checkout -b dev/new_feature
    
  3. Work on your notebook or code.

  4. Stage your changes:

    git add .
    
  5. Commit your work:

    uv run cz c
    
  6. Push your branch (on the first push, -u sets its upstream):

    git push -u origin dev/new_feature
    
  7. Update with the latest changes:

    git pull origin main
    

    or, to replay your commits on top of the latest main and keep a linear history:

    git fetch origin
    git rebase origin/main
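
The seven steps above can be tried end to end without touching a real server. The sketch below is only an illustration under stated assumptions: a local bare repository (`remote.git`) stands in for GitHub or GitLab, the identity and file names are made up, and a plain `git commit -m` replaces the interactive `uv run cz c` so the script can run unattended.

```shell
#!/bin/sh
set -eu

# Throwaway identity so commits work even where Git is not configured.
export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# A local bare repository plays the role of the remote (assumption).
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

# Seed the remote with a first commit on main.
git clone -q remote.git seed
cd seed
git symbolic-ref HEAD refs/heads/main
echo "# project" > README.md
git add README.md
git commit -q -m "docs: add README"
git push -q -u origin main
cd ..

# 1. Get the project.
git clone -q remote.git project
cd project

# 2. Work on a new branch.
git checkout -q -b dev/new_feature

# 3-5. Edit, stage, and commit.
echo "print('hello')" > analysis.py
git add .
git commit -q -m "feat: add analysis script"

# 6. Push the feature branch and set its upstream.
git push -q -u origin dev/new_feature

# 7. Pick up whatever landed on main in the meantime.
git fetch -q origin
git rebase -q origin/main

pushed=$(git ls-remote --heads origin dev/new_feature | wc -l | tr -d ' ')
echo "feature branches found on remote: $pushed"
```

Note that the push targets the feature branch, not main; main is only read from (fetch, rebase) during day-to-day work.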
    

1. Working Directory

This is where you write your code (Python, R, SQL, Jupyter notebooks, etc.).

Key command:

  • git add <file> : adds a file (or changes) to the staging area in preparation for a commit. Example:

    git add analysis.ipynb
    git add .
    

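    (git add . stages every new, modified, or deleted file in the current directory and below, so run git status first to check what will be included.)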
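
To see the difference between staging one file and staging everything, here is a small sketch in a throwaway repository. The file `utils.py` is a hypothetical second file added for the illustration; `git status --short` marks staged files with `A` and untracked ones with `??`.

```shell
#!/bin/sh
set -eu

repo=$(mktemp -d)
cd "$repo"
git init -q

echo "{}" > analysis.ipynb
echo "x = 1" > utils.py   # hypothetical second file

# Stage a single file: only analysis.ipynb enters the staging area.
git add analysis.ipynb
after_one=$(git status --short)
echo "after 'git add analysis.ipynb':"
echo "$after_one"

# Stage everything under the current directory.
git add .
after_all=$(git status --short)
echo "after 'git add .':"
echo "$after_all"
```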

2. Staging → Local Repository

Once your changes are staged, you need to commit them.

Key commands:

  • git commit -m "Clear message" : records the changes locally. Example:

    git commit -m "Added first version of Random Forest model"
    

    Tip: Write meaningful commit messages that describe your work clearly.

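  • uv run cz c : opens Commitizen's interactive prompt, which builds a commit message in the Conventional Commits format (for example, feat: add Random Forest model). This assumes Commitizen is installed in the project's uv environment.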
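
For readers who have not seen the Conventional Commits format, the sketch below checks two messages against a simplified pattern: a type, an optional scope in parentheses, an optional `!`, then a colon and a description. The list of types and the helper name `is_conventional` are assumptions made for this illustration; this is not how Commitizen itself validates messages.

```shell
#!/bin/sh
set -eu

# Simplified pattern; the set of types is an illustrative assumption.
pattern='^(feat|fix|docs|refactor|test|chore)(\([a-z0-9_-]+\))?!?: .+'

# Hypothetical helper: succeeds when the message matches the pattern.
is_conventional() {
    printf '%s\n' "$1" | grep -Eq "$pattern"
}

for msg in \
    "feat(model): add first Random Forest baseline" \
    "Added first version of Random Forest model"
do
    if is_conventional "$msg"; then
        echo "conventional:     $msg"
    else
        echo "not conventional: $msg"
    fi
done
```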

3. Sending to the Remote Repository

After saving work locally, you can share it with your team.

Key commands:

  • git push : sends your commits to the remote repository (GitHub, GitLab, Bitbucket). Example:

    git push origin main
    
  • git pull : retrieves the latest changes from the team and merges them into your code. Example:

    git pull origin main
    
  • git clone <url> : downloads an existing project from a remote repository. Example:

    git clone https://github.com/team/project-ml.git
    

4. Merging and Fetching Code

When working in a team, you often need to integrate your work with others.

Key commands:

  • git merge <branch> : merges another branch into your current branch. Example:

    git merge develop
    
  • git fetch : downloads changes from the remote without merging them into your branch. You can review them before applying. Example:

    git fetch origin
    

Use git fetch when you want to check what has changed remotely, and git merge (or git pull, which is fetch + merge) when you are ready to integrate those changes.
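
The distinction can be observed directly. In the sketch below, two clones named `alice` and `bob` (hypothetical names) share a local bare repository that stands in for the remote. After Alice pushes, Bob's fetch updates only `origin/main`; his local `main` moves when he merges.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

work=$(mktemp -d)
cd "$work"

# Local bare repository acting as the remote (assumption).
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main

# Alice publishes the first commit on main.
git clone -q remote.git alice
cd alice
git symbolic-ref HEAD refs/heads/main
echo "v1" > model.py
git add model.py
git commit -q -m "feat: add model"
git push -q -u origin main
cd ..

# Bob clones; afterwards Alice pushes one more commit.
git clone -q remote.git bob
cd alice
echo "v2" > model.py
git commit -q -am "fix: tune model"
git push -q origin main
cd ../bob

# fetch downloads the new commit but leaves Bob's main untouched.
git fetch -q origin
behind_after_fetch=$(git rev-list --count main..origin/main)
git --no-pager log --oneline main..origin/main   # review before integrating

# merge integrates the fetched commit (a fast-forward in this case).
git merge -q origin/main
behind_after_merge=$(git rev-list --count main..origin/main)

echo "commits behind after fetch: $behind_after_fetch"
echo "commits behind after merge: $behind_after_merge"
```

The `main..origin/main` range lists commits that exist on the remote-tracking branch but not yet on the local one, which makes it a convenient review step between fetch and merge.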

5. Undoing or Correcting Changes (Reset & Stash)

Mistakes happen. Git allows you to roll back or temporarily set aside changes.

Useful commands:

  • git reset <file> : removes a file from the staging area (before commit); your edits stay in the working directory. Example:

    git reset analysis.ipynb
    
  • git reset --hard <commit> : moves your branch back to an earlier commit and discards every later commit on the branch along with all uncommitted changes (use with caution). Example:

    git reset --hard abc123
    
  • git stash : temporarily saves your uncommitted changes, useful if you need to switch branches quickly.

    git stash
    git stash apply   # reapply changes
    git stash pop     # reapply and remove from stash
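
A short sketch of the non-destructive commands above, run in a throwaway repository with a made-up file: unstaging keeps the edit in the working directory, stashing sets it aside, and popping brings it back and empties the stash.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
echo "v1" > analysis.py
git add analysis.py
git commit -q -m "feat: add analysis"

# Stage an edit, then unstage it: the edit itself is kept.
echo "v2" > analysis.py
git add analysis.py
git reset -q analysis.py
staged=$(git diff --cached --name-only)

# Stash sets the edit aside and restores the committed version.
git stash push -q
during_stash=$(cat analysis.py)

# Pop reapplies the edit and removes the stash entry.
git stash pop -q
after_pop=$(cat analysis.py)
stash_entries=$(git stash list | wc -l | tr -d ' ')

echo "staged files after reset: '$staged'"
echo "content while stashed:    $during_stash"
echo "content after pop:        $after_pop"
echo "remaining stash entries:  $stash_entries"
```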
    

6. Removing a File from Git History

Sometimes large or sensitive files (such as datasets or credentials) get accidentally committed into the repository history. These files can make the repository unnecessarily large or expose private data. Simply deleting the file and committing is not enough, because it still exists in the history.
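
The claim that deleting and committing is not enough can be checked in a throwaway repository. The sketch below commits a small stand-in for the dataset, removes it with an ordinary commit, and then shows that the content is still reachable from history.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q

# Commit a (tiny) stand-in for the large dataset by mistake.
mkdir data
echo "id,value" > data/large_dataset.csv
git add data/large_dataset.csv
git commit -q -m "chore: add dataset by mistake"

# Delete it the ordinary way.
git rm -q data/large_dataset.csv
git commit -q -m "chore: remove dataset"

# The file is gone from the working tree...
[ ! -e data/large_dataset.csv ] && echo "not in working tree"

# ...but its blob is still listed among the objects reachable from history.
in_history=$(git rev-list --all --objects | grep -c "data/large_dataset.csv")
echo "objects in history for that path: $in_history"

# Anyone with the repository can still read the content from the old commit.
first_commit=$(git rev-list --max-parents=0 HEAD)
recovered=$(git show "$first_commit:data/large_dataset.csv")
echo "recovered content: $recovered"
```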

To permanently remove a file from all history, you can use git filter-repo.

Steps:

  1. Install git filter-repo (if not already installed). With Homebrew on macOS or Linux:

    brew install git-filter-repo
    

    or install it with pip (pip install git-filter-repo), or download it manually from the project's GitHub repository.

  2. From a fresh clone of the repository (git filter-repo refuses to run on anything else unless you pass --force), remove the file from the entire history:

    git filter-repo --path <file-to-remove> --invert-paths
    

    Example:

    git filter-repo --path data/large_dataset.csv --invert-paths
    

    This removes data/large_dataset.csv from every commit in the repository's history.

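  3. Force-push the cleaned repository back to the remote. git filter-repo removes the origin remote as a safety measure, so add it back first:

    git remote add origin <url>
    git push origin --force --all
    git push origin --force --tags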
    

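After this, the file no longer appears in the history you pushed, and fresh clones will be smaller. Teammates must re-clone the repository; if they push from an old copy, the removed file comes back with it. If the file contained credentials, rotate them anyway, because the hosting platform or existing clones may still hold the old commits.

This operation rewrites history for everyone, so coordinate with your team before running it.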

7. Preparing a Merge Request (MR)

Generative AI can help you draft the description of your Merge Request (MR). The steps below assume you are merging your branch into main.

  1. Move to the main branch and fetch the latest code from main:

    git checkout main
    git fetch
    git rebase
    
  2. Move to your branch:

    git checkout dev/my_branch
    
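  3. Write the diff between your branch and main to a file, for example diff.txt. The three-dot form compares against the point where your branch left main, so the file contains only your own changes even if main has moved on since:

    git diff main...HEAD > diff.txt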
    
  4. Go to ChatGPT or Gemini, upload the diff.txt file and use this prompt:

    # 🔎 Merge request drafting analysis
    
    ## 🎯 CONTEXT
    You are an expert Git assistant. You master software development best practices for version control and collaborative coding (working on a development branch, merge requests, etc.).
    
    ## 📥 OBJECTIVES
    
    I am giving you:
    - The git diff between a source branch and a target branch
    
    You must produce a very thorough, visual, realistic description of the Merge Request.
    
    ## Organization
    
    Write the Merge Request using the following structure:
    
    ### Summary
    • Title for the Merge Request
    • New Features
    • Documentation
    • Fixes (if applicable)
    
    ### Walkthrough
    
    [Explain in natural language what was modified, created, or deleted. Mention the important files.]
    
    ### Changes
    
    [Explain in natural language what was modified, created, or deleted in each file. List every file.]
    
    | File Path | Change Summary |
    | --------- | -------------- |
    | ...       | ...            |
    
    ### Sequence Diagram(s)
    
    [Add a mermaid diagram]
    
    Write in English.
    
    Return only the result.
    
  5. Create a new MR and paste the generated text into its description field. Review it against your actual changes before submitting.
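
The diff range chosen in step 3 matters when main has received commits that your branch does not have yet. The sketch below is an illustration in a throwaway repository with made-up file names: comparing against the tip of main also reports main's newer file as a difference, while the three-dot form `main...HEAD` reports only the branch's own work.

```shell
#!/bin/sh
set -eu

export GIT_AUTHOR_NAME="Demo" GIT_AUTHOR_EMAIL="demo@example.com"
export GIT_COMMITTER_NAME="Demo" GIT_COMMITTER_EMAIL="demo@example.com"

repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
echo "base" > README.md
git add README.md
git commit -q -m "docs: add README"

# The feature branch adds one file.
git checkout -q -b dev/my_branch
echo "x = 1" > feature.py
git add feature.py
git commit -q -m "feat: add feature"

# Meanwhile, main moves on with an unrelated file.
git checkout -q main
echo "notes" > CHANGELOG.md
git add CHANGELOG.md
git commit -q -m "docs: add changelog"
git checkout -q dev/my_branch

# Comparing against the tip of main: main's new file shows up too.
against_tip=$(git diff --name-only main)

# Comparing against the merge base: only the branch's own changes.
against_base=$(git diff --name-only main...HEAD)

printf 'git diff main:\n%s\n\ngit diff main...HEAD:\n%s\n' \
    "$against_tip" "$against_base"
```

Rebasing the branch onto the updated main before diffing makes both forms agree, at the cost of rewriting the branch's commits.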