Connecting to Databricks via ODBC
Guide to establishing a connection with a Databricks cluster via ODBC in the Secure Data Environment (SDE).
This guide will walk you through the steps to establish a connection with a Databricks cluster via ODBC using both R and Stata in the Secure Data Environment (SDE). We'll cover how to generate a Databricks Personal Access Token (PAT) and use it to authenticate your connection.
Whilst the ability to connect to a Databricks cluster directly from your desktop using tools like RStudio or Stata offers the flexibility to work with familiar software, it also comes with trade-offs when compared to using Databricks notebooks.
Consider the following pros and cons carefully to determine the best workflow for your needs.
Pros
- Familiar tools: Work with data analysis applications you already know (for example, R and Stata).
- Advanced packages: Use statistical packages and methods not available in Databricks.
- Custom workflows: Create tailored analysis processes and integrate with local data.
Cons
- Resource limitations: Your Virtual Desktop Instance (VDI) typically has far less RAM and CPU than a Databricks cluster, which limits how much data you can retrieve or process locally. Large queries may exceed the VDI's memory, resulting in slow performance or out-of-memory errors.
- Limited write support: When using ODBC, writing data back into Databricks (for example, creating or overwriting tables) is more cumbersome, and sometimes unsupported, compared to native Databricks notebooks or Spark APIs. Many ODBC workflows provide read-only access by default, or require special configuration to enable writes.
- Partial data movement: Pulling large datasets through ODBC can be slow and memory intensive, so you may be forced to work with truncated or sampled data to avoid overwhelming your VDI. Conversely, pushing big datasets from a desktop environment to Databricks via ODBC can be time-consuming and error prone. Where possible, push filtering and aggregation into the SQL query itself, as shown in the sketch after this list.
- Less optimised for big data: Unlike Spark's native distributed processing, ODBC relies on a single connection to transfer data, effectively serialising large transfers. Advanced Spark features (for example, parallel writes and distributed computation) are not fully leveraged, so data operations are often significantly slower and limited in scale.
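To illustrate keeping heavy computation on the cluster, here is a minimal R sketch that aggregates data in SQL before it crosses the ODBC connection, rather than retrieving every row and summarising locally. It assumes the Databricks DSN configured later in this guide, and the calendar_date column is purely illustrative.

library(DBI)
library(odbc)

databricks_conn <- dbConnect(odbc(), "Databricks")

# Avoid: pulling every row to the VDI, then aggregating in R.
# all_rows <- dbGetQuery(databricks_conn, "SELECT * FROM dss_corporate.calendar")

# Prefer: let the cluster aggregate and transfer only the small summary.
# calendar_date is an assumed column name, used for illustration only.
summary_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "
    SELECT YEAR(calendar_date) AS calendar_year, COUNT(*) AS n_rows
    FROM dss_corporate.calendar
    GROUP BY YEAR(calendar_date)
  "
)

dbDisconnect(databricks_conn)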
Prerequisites
Running Databricks cluster
Ensure that your Databricks cluster is running before attempting this process.
Generating a Databricks Personal Access Token (PAT)
You'll need a PAT for authentication; this process only needs to be completed once.
1. Log In to Databricks
Navigate to your Databricks workspace URL:
Enter your login credentials.
2. Access user settings
Click on your username (usually in the top-right corner).
Select Settings from the dropdown menu.
Select Developer settings.
Select Manage Access Tokens.
3. Generate a new token
Click Generate New Token.
Optionally, add a comment (for example ODBC Connection) and set an expiration date.
Click Generate.
4. Copy and save the token
Important: Copy the generated token immediately; you won't be able to see it again.
Storing Your Personal Token
When storing your personal access token, we recommend the following best practices.
1. Place it within your personal Home folder
Store the token in a folder accessible only to your user account (that is, the personal Home folder located on your desktop).
Name the file something non-obvious, like db_token.txt.
2. Do not hard-code your token in scripts
Do not embed the token directly in your R or Stata scripts, or commit it to version control.
Instead, paste the token into the ODBC DSN password field (at runtime) and reference the configured DSN (Databricks) in your code; an alternative run-time approach is sketched after this list.
3. Short expiration and rotation
Generate tokens with shorter expiration periods if possible.
If a token is compromised or no longer needed, revoke it immediately in your Databricks user settings.
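If you would rather not store the token in the DSN at all, one alternative is to read it from the file described above and supply it when connecting. The following R sketch is illustrative rather than part of the configured setup: the file path is hypothetical, and the token is passed through the standard ODBC PWD connection attribute, which the odbc package forwards to the driver.

library(DBI)
library(odbc)

# Read the token from a file only your account can access.
# Adjust the path to point at the db_token.txt file in your Home folder.
token <- readLines("db_token.txt", warn = FALSE)[1]

# Override the DSN's stored password at run time via the ODBC PWD attribute.
databricks_conn <- dbConnect(odbc(), dsn = "Databricks", PWD = token)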
Configuring the ODBC Driver
The ODBC driver is already installed and pre-configured to work with your Databricks cluster: it is registered under the data source name (DSN) Databricks.
You only need to update it with your PAT token to establish a successful connection.
1. Open ODBC data source administrator
Press the Windows key + R to open the Run dialog.
Type odbcad32 and press Enter.
2. Locate the Databricks DSN
Go to the User DSN tab.
Select the Databricks DSN and click Configure...
3. Enter your PAT token
In the configuration window, locate the Authentication settings.
Paste your PAT token into the Password field.
4. Test the connection (optional)
Click Test to verify the connection.
If successful, click OK to save.
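If you prefer to verify the DSN programmatically, you can also do so from R once the packages in the next section are installed. DBI's dbCanConnect() attempts a connection and returns TRUE or FALSE without leaving a session open:

# Returns TRUE if a connection to the Databricks DSN can be established.
DBI::dbCanConnect(odbc::odbc(), "Databricks")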
Connecting to Databricks via ODBC in R
1. Install the required packages (if not already installed)
install.packages(c("DBI", "odbc"))
2. Load the packages
library(DBI)
library(odbc)
3. Establish the connection
databricks_conn <- dbConnect(odbc(), "Databricks")
4. Test the connection
We can list all the available tables to verify a successful connection:
# List all databases visible to your user
databases <- dbGetQuery(
  conn = databricks_conn,
  statement = "SHOW DATABASES"
)

# For each database, list its tables and build a full database.table path
all_tables <- lapply(databases$databaseName, function(x) {
  tbls <- dbGetQuery(
    conn = databricks_conn,
    statement = paste0("SHOW TABLES IN ", x)
  )
  tbls$tablePath <- paste(tbls$database, tbls$tableName, sep = ".")
  return(tbls)
})

# Combine the per-database results into one data frame
all_tables <- do.call(rbind, all_tables)
head(all_tables)
5. Execute queries
To return data, you can run SQL queries using the dbGetQuery() function from the DBI package:
query_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "SELECT * FROM dss_corporate.calendar LIMIT 10"
)
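If a query depends on values held in your R session, it is safer to bind them as parameters than to paste them into the SQL string. A minimal sketch, assuming the driver supports parameter binding; CALENDAR_YEAR is an illustrative column name, not taken from this guide:

# '?' placeholders are bound safely by the driver via DBI's params argument.
# CALENDAR_YEAR is an assumed column name, used for illustration only.
year_df <- dbGetQuery(
  conn = databricks_conn,
  statement = "SELECT * FROM dss_corporate.calendar WHERE CALENDAR_YEAR = ? LIMIT 10",
  params = list(2024)
)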
6. Close the connection
Once you have finished querying your data, you can close the connection using:
dbDisconnect(databricks_conn)
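An open connection holds a session against the cluster, so it is worth guarding against scripts that error before reaching dbDisconnect(). One defensive pattern (a sketch, not part of the workflow above) is to run queries inside tryCatch() with a finally clause:

databricks_conn <- dbConnect(odbc(), "Databricks")

result <- tryCatch(
  dbGetQuery(databricks_conn, "SELECT * FROM dss_corporate.calendar LIMIT 10"),
  finally = dbDisconnect(databricks_conn)  # always runs, even if the query errors
)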
Connecting to Databricks via ODBC in Stata
This process is only applicable to users who have an active Stata license.
More information is available in the documentation for Stata's odbc command.
1. Open Stata
Launch Stata on your machine.
2. Verify the Databricks ODBC data source
You can list all defined data source names (DSNs) to which Stata can connect, and verify that the Databricks ODBC data source is present:
odbc list
3. List all available table names
odbc query "Databricks"
4. Describe a table from the “Databricks” ODBC source
odbc describe "calendar", dsn("Databricks")
5. Load a table into memory using an SQL query
odbc load, exec("SELECT * FROM dss_corporate.calendar LIMIT 10") dsn("Databricks") clear