Connecting to Databricks via ODBC
Guide to establishing a connection with a Databricks cluster via ODBC in the Secure Data Environment (SDE).
This guide will walk you through the steps to establish a connection with a Databricks cluster via ODBC using both R and Stata in the Secure Data Environment (SDE). We'll cover how to generate a Databricks Personal Access Token (PAT) and use it to authenticate your connection.
Whilst the ability to connect to a Databricks cluster directly from your desktop using tools like RStudio or Stata offers the flexibility to work with familiar software, it also comes with trade-offs when compared to using Databricks notebooks.
Consider the following pros and cons carefully to determine the best workflow for your needs.
Pros
- Familiar tools: Work with data analysis applications you already know (for example, R and Stata).
- Advanced packages: Use statistical packages and methods not available in Databricks.
- Custom workflows: Create tailored analysis processes and integrate with local data.
Cons
- Resource limitations: Your Virtual Desktop Instance (VDI) typically has far less RAM and CPU than a Databricks cluster, which limits how much data you can retrieve or process locally. Large queries may exceed the VDI's memory, resulting in slow performance or out-of-memory errors.
- Limited write support: When using ODBC, writing data back into Databricks (for example, creating or overwriting tables) is more cumbersome, and sometimes unsupported, compared to native Databricks notebooks or Spark APIs. Many ODBC workflows provide read-only access by default, or require special configuration to enable writes.
- Partial data movement: Pulling large datasets through ODBC can be slow and memory intensive, so you may be forced to work with truncated or sampled data to avoid overwhelming your VDI. Conversely, pushing big datasets from a desktop environment to Databricks via ODBC can be time-consuming and error prone. Where possible, push filtering and aggregation into the SQL query itself, as shown in the sketch after this list.
- Less optimised for big data: Unlike Spark's native distributed processing, ODBC relies on a single connection to transfer data, effectively serialising large transfers. Advanced Spark features (for example, parallel writes and distributed computation) are not fully leveraged, so data operations are often significantly slower and limited in scale.
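To illustrate keeping heavy computation on the cluster, here is a minimal R sketch that aggregates data in SQL before it crosses the ODBC connection, rather than retrieving every row and summarising locally. It assumes the Databricks DSN configured later in this guide, and the calendar_date column is purely illustrative.

library(DBI)
library(odbc)

databricks_conn <- dbConnect(odbc(), "Databricks")

# Avoid: pulling every row to the VDI, then aggregating in R.
# all_rows <- dbGetQuery(databricks_conn, "SELECT * FROM dss_corporate.calendar")

# Prefer: let the cluster aggregate and transfer only the small summary.
# calendar_date is an assumed column name, used for illustration only.
summary_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "
    SELECT YEAR(calendar_date) AS calendar_year, COUNT(*) AS n_rows
    FROM dss_corporate.calendar
    GROUP BY YEAR(calendar_date)
  "
)

dbDisconnect(databricks_conn)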
Prerequisites
Running Databricks cluster
Ensure that your Databricks cluster is running before attempting this process.
Generating a Databricks Personal Access Token (PAT)
You'll need a PAT for authentication; this process only needs to be completed once.
1. Log In to Databricks
Navigate to your Databricks workspace URL:
Enter your login credentials.
2. Access user settings
Click on your username (usually in the top-right corner).
Select Settings from the dropdown menu.
Select Developer settings.
Select Manage Access Tokens.
3. Generate a new token
Click Generate New Token.
Optionally, add a comment (for example ODBC Connection) and set an expiration date.
Click Generate.
4. Copy and save the token
Important: Copy the generated token immediately; you won't be able to see it again.
Storing Your Personal Token
When storing your personal access token, we recommend the following best practices.
1. Place it within your personal Home folder
Store the token in a folder accessible only to your user account (that is, the personal Home folder located on your desktop).
Name the file something non-obvious, like db_token.txt.
2. Do not hard-code your token in scripts
Do not embed the token directly in your R or Stata scripts, or commit it to version control.
Instead, paste the token into the ODBC DSN password field (at runtime) and reference the configured DSN (Databricks) in your code; an alternative run-time approach is sketched after this list.
3. Short expiration and rotation
Generate tokens with shorter expiration periods if possible.
If a token is compromised or no longer needed, revoke it immediately in your Databricks user settings.
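If you would rather not store the token in the DSN at all, one alternative is to read it from the file described above and supply it when connecting. The following R sketch is illustrative rather than part of the configured setup: the file path is hypothetical, and the token is passed through the standard ODBC PWD connection attribute, which the odbc package forwards to the driver.

library(DBI)
library(odbc)

# Read the token from a file only your account can access.
# Adjust the path to point at the db_token.txt file in your Home folder.
token <- readLines("db_token.txt", warn = FALSE)[1]

# Override the DSN's stored password at run time via the ODBC PWD attribute.
databricks_conn <- dbConnect(odbc(), dsn = "Databricks", PWD = token)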
Configuring the ODBC Driver
The ODBC driver is already installed and pre-configured to work with your Databricks cluster: it is registered under the data source name (DSN) Databricks.
You only need to update it with your PAT token to establish a successful connection.
1. Open ODBC data source administrator
Press the Windows key + R to open the Run dialog.
Type odbcad32 and press Enter.
2. Locate the Databricks DSN
Go to the User DSN tab.
Select the Databricks DSN and click Configure...
3. Enter your PAT token
In the configuration window, locate the Authentication settings.
Paste your PAT token into the Password field.
4. Test the connection (optional)
Click Test to verify the connection.
If successful, click OK to save.
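If you prefer to verify the DSN programmatically, you can also do so from R once the packages in the next section are installed. DBI's dbCanConnect() attempts a connection and returns TRUE or FALSE without leaving a session open:

# Returns TRUE if a connection to the Databricks DSN can be established.
DBI::dbCanConnect(odbc::odbc(), "Databricks")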
Connecting to Databricks via ODBC in R
1. Install the required packages (if not already installed)
install.packages(c("DBI", "odbc"))
2. Load the packages
library(DBI)
library(odbc)
3. Establish the connection
databricks_conn <- dbConnect(odbc(), "Databricks")
4. Test the connection
We can list all the available tables to verify a successful connection:
# List all databases visible to your user
databases <- dbGetQuery(
  conn = databricks_conn,
  statement = "SHOW DATABASES"
)

# For each database, list its tables and build a full database.table path
all_tables <- lapply(databases$databaseName, function(x) {
  tbls <- dbGetQuery(
    conn = databricks_conn,
    statement = paste0("SHOW TABLES IN ", x)
  )
  tbls$tablePath <- paste(tbls$database, tbls$tableName, sep = ".")
  return(tbls)
})

# Combine the per-database results into one data frame
all_tables <- do.call(rbind, all_tables)
head(all_tables)
5. Execute queries
To return data, you can run SQL queries using the dbGetQuery() function from the DBI package:
query_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "SELECT * FROM dss_corporate.calendar LIMIT 10"
)
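If a query depends on values held in your R session, it is safer to bind them as parameters than to paste them into the SQL string. A minimal sketch, assuming the driver supports parameter binding; CALENDAR_YEAR is an illustrative column name, not taken from this guide:

# '?' placeholders are bound safely by the driver via DBI's params argument.
# CALENDAR_YEAR is an assumed column name, used for illustration only.
year_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "SELECT * FROM dss_corporate.calendar WHERE CALENDAR_YEAR = ? LIMIT 10",
  params = list(2024)
)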
6. Close the connection
Once you have finished querying your data, you can close the connection using:
dbDisconnect(databricks_conn)
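An open connection holds a session against the cluster, so it is worth guarding against scripts that error before reaching dbDisconnect(). One defensive pattern (a sketch, not part of the workflow above) is to run queries inside tryCatch() with a finally clause:

databricks_conn <- dbConnect(odbc(), "Databricks")

result <- tryCatch(
  dbGetQuery(databricks_conn, "SELECT * FROM dss_corporate.calendar LIMIT 10"),
  finally = dbDisconnect(databricks_conn)  # always runs, even if the query errors
)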
Connecting to Databricks via ODBC in Stata
This process is only applicable to users who have an active Stata license.
More information is available in the documentation for Stata's odbc command.
1. Open Stata
Launch Stata on your machine.
2. Verify the Databricks ODBC data source
You can list all defined data source names (DSNs) to which Stata can connect, and verify that the Databricks ODBC data source is present:
odbc list
3. List all available table names
odbc query "Databricks"
4. Describe a table from the “Databricks” ODBC source
odbc describe "calendar", dsn("Databricks")
5. Load a table into memory using an SQL query
odbc load, exec("SELECT * FROM dss_corporate.calendar LIMIT 10") dsn("Databricks") clear