Connect Pandas to a remote PostgreSQL database via SSH
Use Pandas to execute SQL statements in a remote database such as PostgreSQL via SSH
tl;dr: Skip to the last section for the code.
If you’re working in big data, chances are you need to SSH into a remote server and perform data operations there. psql is a command-line interface that lets you fire SQL commands directly at a PostgreSQL database, which comes in really handy for quick queries.
However, it’s quite cumbersome to get nicely formatted data output. Only since PostgreSQL 9.2 does psql offer the useful `\x auto` mode, which formats the output according to your terminal size.
The convenience of Pandas with the speed of PostgreSQL
A better way to visualize the output is to combine the great visualization tools you already know with PostgreSQL and its speed. For example, you can use:
- Jupyter Lab -> Pandas -> PostgreSQL or
- VS Code -> Jupyter Notebook -> Pandas -> PostgreSQL
The challenging part, of course, is not only connecting Pandas to a SQL database but doing so via SSH.
Database connection with sshtunnel and psycopg2
It’s actually not that big of a deal, but you need to understand the workflow first. The SSH connection itself is not suited for direct database access. Instead, you bind the remote database port to localhost on your machine. This way there is, technically speaking, not much of a difference anymore between handling a locally running PostgreSQL and a remote one, as either way it is now reachable on port 5432.
You need `psycopg2` and `sshtunnel` installed in order to create a database connection over SSH with Python:
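A minimal sketch using `SSHTunnelForwarder` (all hosts, paths, and credentials below are placeholders; replace them with your own):

```python
import psycopg2
from sshtunnel import SSHTunnelForwarder

with SSHTunnelForwarder(
    ('ssh.example.com', 22),                   # remote SSH server (placeholder)
    ssh_username='ssh_user',
    ssh_pkey='~/.ssh/id_rsa',                  # or ssh_password='...'
    remote_bind_address=('localhost', 5432),   # PostgreSQL port on the remote host
    local_bind_address=('localhost', 5432),    # expose it on the same port locally
) as tunnel:
    # Inside this block the remote database is reachable on localhost:5432.
    conn = psycopg2.connect(
        host='localhost',
        port=tunnel.local_bind_port,
        dbname='mydb',
        user='db_user',
        password='db_password',
    )
    # ... run your queries here ...
    conn.close()
# Leaving the block tears down the tunnel.
```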
The `with` statement closes the connection immediately after execution, which is best practice. You probably want to keep the connection open instead until you’re done with your processing:
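The same connection without the `with` statement, started and stopped manually (same placeholder credentials as above):

```python
import psycopg2
from sshtunnel import SSHTunnelForwarder

tunnel = SSHTunnelForwarder(
    ('ssh.example.com', 22),
    ssh_username='ssh_user',
    ssh_pkey='~/.ssh/id_rsa',
    remote_bind_address=('localhost', 5432),
    local_bind_address=('localhost', 5432),
)
tunnel.start()

conn = psycopg2.connect(
    host='localhost',
    port=tunnel.local_bind_port,
    dbname='mydb',
    user='db_user',
    password='db_password',
)

# ... work with the connection for as long as you need ...

# Clean up manually when you are done:
conn.close()
tunnel.stop()
```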
Read SQL output into a Pandas dataframe
Now you can fire any SQL command and visualize your output in Pandas.
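For example, with the open connection `conn` from above (`postcodes` is a hypothetical table name used for the rest of this post):

```python
import pandas as pd

# The query runs on the remote PostgreSQL; the result lands in a DataFrame.
df = pd.read_sql('SELECT name, old_postcode, new_postcode FROM postcodes LIMIT 3;', conn)
df
```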
| name | old_postcode | new_postcode |
| --- | --- | --- |
| Köln-Zentrum | 12345 | 12345 |
| Berlin-Schönefeld | 23456 | 54321 |
| Trier-Nord | 34567 | 65432 |
Translate Pandas logic to SQL command
There is an even easier way to work with SQL in Pandas, almost completely hiding the overhead of the SSH connection, the SQL string, and the database connection. The magic is called `pandas_to_sql` and does what it claims: it converts Pandas logic into an SQL string.
Install it with `pip install pandas-to-sql` and import it with `import pandas_to_sql` (be aware of the hyphen/underscore difference).
All you need to do is wrap an existing dataframe and then perform some operation on it. You take the generated SQL string and fire it at your remote PostgreSQL via Pandas. For the actual operation, just use a dummy df with only a few entries, e.g. 10. This way you can check your output before firing it as SQL.
In this example I retrieve 3 rows, create the dummy df just like above, and wrap it with `pandas_to_sql`:
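A sketch, assuming `pandas_to_sql`’s `wrap_df` helper and the hypothetical `postcodes` table:

```python
import pandas as pd
import pandas_to_sql

# Dummy df with just 3 rows, fetched exactly like above.
dummy = pd.read_sql('SELECT name, old_postcode, new_postcode FROM postcodes LIMIT 3;', conn)

# Wrap it so that Pandas operations on it are recorded as SQL.
df = pandas_to_sql.wrap_df(dummy, table_name='postcodes')
```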
You can reuse this df for generating all following SQL commands.
Now use this df to test a query:
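The filter below is chosen to match the example output: it keeps only the rows whose postcode actually changed.

```python
# Pandas logic on the dummy df -- nothing touches the database yet.
df[df.old_postcode != df.new_postcode]
```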
Should output:
| name | old_postcode | new_postcode |
| --- | --- | --- |
| Berlin-Schönefeld | 23456 | 54321 |
| Trier-Nord | 34567 | 65432 |
Looks good, so now get your first SQL string:
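Continuing the sketch, `get_sql_string()` returns the SQL generated for the wrapped operations:

```python
sql_string = df[df.old_postcode != df.new_postcode].get_sql_string()
print(sql_string)
```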
This should output something like this:
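(Reconstructed example; the exact string depends on the `pandas_to_sql` version.)

```sql
SELECT (name) as name, (old_postcode) as old_postcode, (new_postcode) as new_postcode
FROM postcodes
WHERE ((old_postcode) != (new_postcode))
```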
Great, right?
PostgreSQL queries in Pandas
Now just wrap the call in a small function and work with Pandas just like you are used to. Note that in the end you can always append some raw SQL if you like (e.g. `LIMIT 10`) or simply do the same in Pandas. This time, the query is executed on the remote PostgreSQL:
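A sketch of such a wrapper, reusing the open `conn` and the wrapped `df` from above (`query_remote` is a hypothetical helper name):

```python
def query_remote(wrapped_df, sql_suffix=''):
    """Run the SQL generated from Pandas logic on the remote PostgreSQL.

    sql_suffix lets you append raw SQL such as ' LIMIT 10'.
    """
    return pd.read_sql(wrapped_df.get_sql_string() + sql_suffix, conn)

# Pandas logic as usual -- but executed on the remote PostgreSQL:
query_remote(df[df.new_postcode == 65432])
```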
Will return:
| name | old_postcode | new_postcode |
| --- | --- | --- |
| Trier-Nord | 34567 | 65432 |
Wrapped up
A plug’n’play script for querying a remote PostgreSQL database in Pandas logic:
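A sketch combining all of the above; every host, credential, and the `postcodes` table name are placeholders to adapt:

```python
import pandas as pd
import pandas_to_sql
import psycopg2
from sshtunnel import SSHTunnelForwarder

# --- placeholders: replace with your own details ---
SSH_HOST, SSH_USER, SSH_PKEY = 'ssh.example.com', 'ssh_user', '~/.ssh/id_rsa'
DB_NAME, DB_USER, DB_PASSWORD = 'mydb', 'db_user', 'db_password'
TABLE = 'postcodes'

with SSHTunnelForwarder(
    (SSH_HOST, 22),
    ssh_username=SSH_USER,
    ssh_pkey=SSH_PKEY,
    remote_bind_address=('localhost', 5432),
    local_bind_address=('localhost', 5432),
) as tunnel:
    conn = psycopg2.connect(
        host='localhost', port=tunnel.local_bind_port,
        dbname=DB_NAME, user=DB_USER, password=DB_PASSWORD,
    )

    # Small dummy df to build queries against, wrapped for SQL generation.
    dummy = pd.read_sql(f'SELECT * FROM {TABLE} LIMIT 10;', conn)
    df = pandas_to_sql.wrap_df(dummy, table_name=TABLE)

    def query_remote(wrapped_df, sql_suffix=''):
        """Run the generated SQL on the remote PostgreSQL."""
        return pd.read_sql(wrapped_df.get_sql_string() + sql_suffix, conn)

    # Example: Pandas logic, executed remotely.
    print(query_remote(df[df.old_postcode != df.new_postcode], ' LIMIT 10'))

    conn.close()
```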
Note that `pandas_to_sql` is still experimental and might not be compatible with all commands. Still, it’s a nice hacky solution for everyday needs and works beautifully out of the box!