========================================================
Databricks CLI, DB Connect and VS Code Extension Setup
========================================================

:Authors: Cao Tri DO
:Version: 2025-09

.. admonition:: Objectives
   :class: important

   This article gives you an overview of how to set up your local environment to connect to
   Databricks using the Databricks CLI, Databricks Connect (DB Connect) and the Databricks
   VS Code extension.

Installing the tools
====================

Databricks CLI
--------------

The **Databricks CLI** is a command-line interface that lets users interact with Databricks
workspaces programmatically.

**Use cases**

- Automating repetitive tasks
- Scripting workspace operations
- Integrating Databricks operations into CI/CD pipelines

**Installation**

- Install the Databricks CLI using uv:

  .. code-block:: bash

     uv add databricks-cli

- Check that everything works:

  .. code-block:: bash

     uv run databricks --version

- Initiate authentication to configure the Databricks CLI.

  - Method 1: with a personal access token

    .. code-block:: bash

       uv run databricks configure --token

    .. note::

       You will need to provide:

       - **Host** → the URL of your Databricks workspace (e.g. https://adb-1234567890.12.azuredatabricks.net)
       - **Token** → a personal access token from your Databricks user profile (User Settings > Access tokens).

  - Method 2: with OAuth authentication (using a web browser)

    .. code-block:: bash

       uv run databricks auth login --configure-cluster --host <workspace-url>

  After entering your information, the CLI will prompt you to save it under a Databricks
  configuration profile. You can accept the suggested name or enter a new one. This profile
  is overwritten if it already exists.

- Manage profiles: to list existing profiles and view their settings, use:

  .. code-block:: bash

     uv run databricks auth profiles

  Your configuration is stored in ``~/.databrickscfg``:

  .. code-block:: bash

     nano ~/.databrickscfg

  Example of a ``.databrickscfg`` file:

  .. code-block:: text

     [DEFAULT]
     host = https://dbc-c2e8445d-159d.cloud.databricks.com/
     auth_type = databricks-cli

     ; This profile is autogenerated by the Databricks Extension for VS Code
     [caotrido]
     host = https://adb-2886740019606493.13.azuredatabricks.net/
     token = xxxxxxxxxxxxxxxxxxx

     ; This profile is autogenerated by the Databricks Extension for VS Code
     [dev]
     host = https://dbc-c2e8445d-159d.cloud.databricks.com/
     auth_type = databricks-cli

     [my-dbc-profile]
     host = https://dbc-c36d09ec-dbbe.cloud.databricks.com/
     token = xxxxxxxxxxxxxxxxxxx

- View token information: to see the current OAuth token and its expiration:

  .. code-block:: bash

     uv run databricks auth token --host <workspace-url>

- If, in the future, you need to reconfigure a profile, just run:

  .. code-block:: bash

     uv run databricks configure --token --profile my-dbc-profile

- To list all your available clusters:

  .. code-block:: bash

     uv run databricks clusters list

  Or, if you need to target a specific profile:

  .. code-block:: bash

     uv run databricks clusters list --profile my-dbc-profile

  The same operations can also be scripted in Python, as sketched just after this list.
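For the scripting and CI/CD use cases listed above, the CLI commands have Python equivalents
in the Databricks SDK. The sketch below is illustrative only: it assumes the ``databricks-sdk``
package has been added to the project (``uv add databricks-sdk``) and that the
``my-dbc-profile`` profile exists in ``~/.databrickscfg``; it is not required for the CLI
setup itself.

.. code-block:: python

   from databricks.sdk import WorkspaceClient

   # Authenticate with the same profile used by the CLI (read from ~/.databrickscfg).
   w = WorkspaceClient(profile="my-dbc-profile")

   # Rough equivalent of `databricks clusters list --profile my-dbc-profile`.
   for cluster in w.clusters.list():
       print(cluster.cluster_id, cluster.cluster_name, cluster.state)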
Databricks Connect (DB Connect)
-------------------------------

**Databricks Connect (DB Connect)** is a tool that lets you **run Spark code from your local
machine** (or from an IDE such as VS Code, PyCharm or IntelliJ) while using a **Databricks
cluster** as the execution engine.

In simple terms:

- You write **PySpark**, **Scala**, **Java**, or **R** code locally.
- Databricks Connect forwards your code → execution happens **on the Databricks cluster**, not on your laptop.
- You get the power of Databricks and its cluster resources while coding in your favorite environment.

**Use cases**

- Develop and test Spark code **in your IDE** instead of only in Databricks notebooks.
- Reuse existing Spark code without rewriting it in notebooks.
- Run distributed Spark jobs without needing a powerful local machine.
- Debug more easily with local dev tools.
- Seamlessly transition code from local development to production environments.

**How it works**

1. **Install Databricks Connect** (Python package ``databricks-connect`` or a Maven/Scala dependency).
2. **Configure it** with your workspace details (URL + token + cluster ID).
3. Your local ``SparkSession`` connects to the cluster instead of running Spark locally.

**Installation and usage**

- Prerequisites:

  - A Databricks workspace (Azure, AWS, or GCP).
  - A running cluster, with a Databricks Runtime supported by DB Connect (usually the latest LTS runtime).
  - uv (your package manager).

- Add DB Connect:

  .. code-block:: bash

     uv add databricks-connect

- Configure DB Connect:

  .. code-block:: bash

     uv run databricks configure --token --profile my-dbc-profile

  .. note::

     It will ask for:

     - **Databricks Host** → the URL of your workspace (example: https://dbc-f122dc18-1b68.cloud.databricks.com/)
     - **Databricks Token** → generate a personal access token in User Settings > Access tokens.

- Now add the ID of the cluster to your Databricks configuration:

  .. code-block:: bash

     nano ~/.databrickscfg

  and add this line to your desired profile:

  .. code-block:: text

     cluster_id = faea85fdea5744e5

- Verify the connection:

  .. code-block:: bash

     DATABRICKS_CONFIG_PROFILE=my-dbc-profile uv run databricks-connect test

  If everything is set up, you will see checks like:

  - ✅ SparkSession created
  - ✅ Cluster reachable

  .. warning::

     In the new Databricks Connect (v13+), the profile must be chosen via the environment
     variable **DATABRICKS_CONFIG_PROFILE**.

- Use DB Connect in Python. Create a script ``myscript.py``:

  .. code-block:: python

     from pyspark.sql import SparkSession

     # Create a Spark session (automatically points to the Databricks cluster)
     spark = SparkSession.builder.getOrCreate()

     # Example: read a CSV file into a DataFrame
     df = spark.read.csv("./data/myfile.csv", header=True)

     print("Row count:", df.count())
     df.show(5)

- Use ``uv run`` if DB Connect is local to a project:

  .. code-block:: bash

     uv run python myscript.py

  .. note::

     Even though this command runs locally, the computation is actually performed
     **on your Databricks cluster**.

.. admonition:: Local vs Remote Execution

   If you want to switch between local execution and remote execution on a Databricks cluster:

   - Adapt the code in ``myscript.py``:

     .. code-block:: python

        import os
        from pyspark.sql import SparkSession

        if os.getenv("USE_DB_CONNECT", "false").lower() == "true":
            # Databricks Connect → executes on your Databricks cluster
            spark = SparkSession.builder.getOrCreate()
            print("➡ Running on Databricks cluster")
        else:
            # Local Spark
            spark = (
                SparkSession.builder
                .master("local[*]")
                .appName("LocalSpark")
                .getOrCreate()
            )
            print("➡ Running locally")

        df = spark.range(1000)
        print("Row count:", df.count())

     If you call **SparkSession.builder.getOrCreate()** with no master and DB Connect is
     configured, execution goes to the Databricks cluster. If you call
     **.master("local[*]")**, it runs locally.

   - To run on Databricks:

     .. code-block:: bash

        USE_DB_CONNECT=true DATABRICKS_CONFIG_PROFILE=my-dbc-profile uv run python myscript.py

   - To run locally:

     .. code-block:: bash

        DATABRICKS_CONFIG_PROFILE=my-dbc-profile uv run python myscript.py
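In Databricks Connect v13 and later (built on Spark Connect), you can also create the session
through the ``DatabricksSession`` entry point instead of the plain ``SparkSession`` used above.
A minimal sketch, assuming a v13+ ``databricks-connect`` and the same ``my-dbc-profile`` profile
selected through ``DATABRICKS_CONFIG_PROFILE``, as in the commands above:

.. code-block:: python

   from databricks.connect import DatabricksSession

   # Host, token and cluster_id are read from the profile selected by
   # DATABRICKS_CONFIG_PROFILE (e.g. my-dbc-profile in ~/.databrickscfg).
   spark = DatabricksSession.builder.getOrCreate()

   # The DataFrame is evaluated on the Databricks cluster, not locally.
   df = spark.range(10)
   print("Row count:", df.count())

Run it the same way as before, e.g.
``DATABRICKS_CONFIG_PROFILE=my-dbc-profile uv run python myscript.py``.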
.. note::

   - **Databricks CLI** → a command-line tool to manage Databricks resources (clusters, jobs, secrets, DBFS, etc.).
   - **Databricks Connect** → a bridge to run local Spark code on a Databricks cluster.

Databricks VS Code extension
----------------------------

The official **Databricks VS Code extension** lets you:

* **Connect your VS Code to a Databricks workspace** (via URL + PAT token).
* **Browse and edit Databricks files** (notebooks, repos, DBFS files).
* **Sync code** between your local machine and Databricks (so you can edit locally and run remotely).
* **Run notebooks** and Python files directly on your Databricks cluster, without needing DB Connect.
* **Manage clusters, jobs and repos** right from VS Code.

It is more of a **workspace integration tool**, whereas DB Connect is a **remote Spark execution bridge**.

Key difference from Databricks Connect
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 40 40

   * - Feature
     - Databricks Connect
     - Databricks VS Code Extension
   * - **Execution model**
     - Redirects your *local* Spark code to a Databricks cluster
     - Runs scripts or notebooks *inside* the Databricks workspace
   * - **Setup**
     - Needs a runtime version match (DBR ↔ Connect)
     - Just configure the workspace URL + token
   * - **Best for**
     - Developers wanting the PySpark API locally in an IDE
     - Developers managing Databricks repos, jobs and notebooks
   * - **Limitations**
     - Tightly coupled to Spark runtime versions
     - Doesn't expose a SparkSession locally

How to install & use the VS Code extension
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

1. Open VS Code → go to **Extensions** (Ctrl+Shift+X).
2. Search for **Databricks** → install the official extension.
3. In VS Code, press ``Ctrl+Shift+P`` → type ``Databricks: Configure Workspace``.
4. Enter:

   * the **Workspace URL** (e.g. ``https://adb-1234567890.11.azuredatabricks.net``)
   * a **PAT token** (from User Settings → Access Tokens).

5. Once connected, you can:

   * browse clusters, repos and jobs from the sidebar;
   * right-click a ``.py`` file or ``.dbc`` notebook → **Run on Databricks**;
   * sync a local repo to a Databricks repo.

When to use which
=================

- Use **Databricks Connect** if you want to **develop PySpark code locally** and still leverage Databricks clusters.
- Use the **Databricks VS Code extension** if you want to **edit notebooks and manage jobs** in VS Code but let execution happen fully inside Databricks.

In fact, some teams use both (a small helper illustrating this combined workflow is sketched after the list):

- **Databricks VS Code extension** → for repo sync, notebook editing and job control.
- **Databricks Connect** → for running Spark code locally against the cluster.
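When the same module is run both ways, it helps to isolate session creation in one place so
the code works under DB Connect locally and unchanged inside the workspace. A minimal sketch:
the ``get_spark`` helper below is illustrative and not part of either tool.

.. code-block:: python

   from pyspark.sql import SparkSession


   def get_spark():
       """Return a Spark session that works locally (DB Connect) and inside Databricks."""
       try:
           # Databricks Connect v13+ installed locally: the session is built from the
           # profile selected by DATABRICKS_CONFIG_PROFILE.
           from databricks.connect import DatabricksSession
           return DatabricksSession.builder.getOrCreate()
       except ImportError:
           # Inside a Databricks notebook/job (or plain local Spark): reuse the
           # existing session if there is one.
           return SparkSession.builder.getOrCreate()


   if __name__ == "__main__":
       spark = get_spark()
       print("Row count:", spark.range(100).count())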