PostgreSQL SSH to Pandas

Use Pandas to execute SQL statements in a remote database such as PostgreSQL via SSH

tl;dr Skip to the last section for the code

If you’re working with big data, chances are you need to SSH into a remote server and perform some data operations there. psql is PostgreSQL’s command-line interface; it lets you fire SQL commands directly at the database and comes in really handy for quick queries.

However, getting nicely formatted output from it is cumbersome. Only since PostgreSQL 9.2 does psql offer the useful \x auto mode, which switches between normal and expanded display depending on the width of your terminal.

The convenience of Pandas with the speed of PostgreSQL

A better way to visualize the output is simply to use the great visualization tools you already know, in combination with PostgreSQL and its speed. For example, you can do:

  1. Jupyter Lab -> Pandas -> PostgreSQL or
  2. VS Code -> Jupyter Notebook -> Pandas -> PostgreSQL

The challenging part, of course, is not just connecting Pandas to the SQL database but doing so via SSH.

Database connection with sshtunnel and psycopg2

It’s actually not that big of a deal, but you have to understand the workflow first. An SSH connection by itself is not suited for direct database access.

Instead, you bind the remote database port to a local port on your machine. Technically speaking, there is then hardly any difference between handling a locally running PostgreSQL and a remote one: either way, you talk to a port on your own localhost (sshtunnel picks a free local port for you and exposes it as server.local_bind_port).

You need psycopg2 and sshtunnel installed (e.g. pip install psycopg2-binary sshtunnel) in order to create a database connection over SSH with Python:

import psycopg2
from sshtunnel import SSHTunnelForwarder

try:
    with SSHTunnelForwarder(
            ('some.linktodb.com', 22),               # 22 is the standard SSH port
            ssh_username="username",
            ssh_pkey="your/private/key",             # your private key file
            ssh_private_key_password="****",
            remote_bind_address=('localhost', 5432)  # PostgreSQL port on the remote host
            ) as server:  # the with statement opens the tunnel, no explicit start() needed

        params = {  # database params
            'database': 'test',
            'user': 'dome',
            'password': '*****',
            'host': 'localhost',                # the local end of the tunnel
            'port': server.local_bind_port      # the free local port sshtunnel picked
        }
        conn = psycopg2.connect(**params)
        curs = conn.cursor()  # if this works, you are connected
        print("DB connected")
except Exception as e:
    print("Connection failed:", e)

The with statement closes the tunnel as soon as the block ends, which is best practice. You probably want to keep the connection open instead, until you are done with your processing:

import psycopg2
from sshtunnel import SSHTunnelForwarder

try:
    server = SSHTunnelForwarder(
        ('some.linktodb.com', 22),
        ssh_username="username",
        ssh_pkey="your/private/key",
        ssh_private_key_password="****",
        remote_bind_address=('localhost', 5432)
    )
    server.start()  # no with statement, so start the tunnel explicitly

    print("server connected")

    params = {
        'database': 'test',
        'user': 'dome',
        'password': '*****',
        'host': 'localhost',
        'port': server.local_bind_port
    }

    conn = psycopg2.connect(**params)

    conn.set_session(readonly=True)  # read only!

    print("DB connected")
except Exception as e:
    print("Connection failed:", e)

# don't forget to close the connection and stop the server afterwards!

conn.close()
server.stop()
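
If your processing in between might raise, a try/finally block guarantees that this cleanup still runs; a minimal sketch around the conn and server created above:

try:
    pass  # your queries and processing go here
finally:
    # runs whether the processing succeeded or not
    conn.close()
    server.stop()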

Read SQL output into a Pandas dataframe

Now you can fire any SQL command and visualize your output in Pandas.

import pandas as pd

df = pd.read_sql_query("select * from post_codes limit 3", conn)
df
name               old_postcode  new_postcode
Köln-Zentrum       12345         12345
Berlin-Schönefeld  23456         54321
Trier-Nord         34567         65432
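
Anything read_sql_query accepts works through the tunnel, too. For instance, instead of formatting the SQL string yourself, you can pass query parameters and let psycopg2 substitute them safely; a small sketch using the table from above:

# parameterized query: psycopg2 fills in %(name)s for you
pf = pd.read_sql_query(
    "select * from post_codes where name = %(name)s",
    conn,
    params={"name": "Trier-Nord"},
)

Note that recent Pandas versions emit a warning when handed a raw psycopg2 connection like this and recommend a SQLAlchemy engine instead; the queries still run.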

Translate Pandas logic to SQL command

There is an even easier way to work with SQL in Pandas, one that hides almost all of the overhead of the SSH connection, the SQL string, and the database connection.

The magic is called pandas_to_sql and does what it claims: it converts the Pandas logic into an SQL string.

Install with pip install pandas-to-sql and import with import pandas_to_sql (mind the hyphen/underscore difference).

All you need to do is wrap an existing dataframe and perform some operation on it. Then you take the generated SQL string and fire it at your remote PostgreSQL via Pandas. For developing the operation, use a dummy df with only a few entries, e.g. 10; that way you can check your output before firing the real query in SQL.

In this example I retrieve 3 rows, create the dummy df just like above and wrap it with pandas_to_sql:

import pandas as pd 
import pandas_to_sql
df = pd.read_sql_query("select * from post_codes limit 3", conn)
df = pandas_to_sql.wrap_df(df, "post_codes") # needs table name

You can reuse this df for generating all following SQL commands.

Now use this df to test a query:

df[df["old_postcode"] != df["new_postcode"]]

Should output:

name               old_postcode  new_postcode
Berlin-Schönefeld  23456         54321
Trier-Nord         34567         65432

Looks good, so now get your first SQL string:

df[df["old_postcode"] != df["new_postcode"]].get_sql_string()

This should output something like this:

SELECT (name) AS name, (old_postcode) AS old_postcode,
 (new_postcode) AS new_postcode
FROM post_codes WHERE ((old_postcode <> new_postcode)) 

Great, right?
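
And since the wrapped df is reusable, any other condition yields a new string in exactly the same way:

# same wrapped df, different condition, new SQL string
print(df[df["name"] == "Berlin-Schönefeld"].get_sql_string())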

PostgreSQL queries in Pandas

Now just put the pieces together and work with Pandas just like you are used to. Note that you can always append some raw SQL at the end if you like (e.g. LIMIT 10) or simply do it in Pandas as well.

This time, the query is executed on the remote PostgreSQL:

pf = pd.read_sql_query(
    # using the dummy df to generate the SQL string
    df[
        (df["old_postcode"] != df["new_postcode"]) &
        (df["name"] == "Trier-Nord")
    ]
    .get_sql_string() + " LIMIT 10", conn)
pf

Will return:

name        old_postcode  new_postcode
Trier-Nord  34567         65432
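
If you use this pattern a lot, a small helper keeps it readable. This is just a sketch built from the calls shown above; run_remote is my own name, not part of any library:

def run_remote(expr, conn, suffix=""):
    """Translate a wrapped-df expression into SQL and run it remotely."""
    sql = expr.get_sql_string()
    if suffix:
        sql += " " + suffix  # e.g. append "LIMIT 10"
    return pd.read_sql_query(sql, conn)

# usage: the same query as above, in one line
pf = run_remote(df[df["old_postcode"] != df["new_postcode"]], conn, "LIMIT 10")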

Wrapped up

Plug’n’play script for querying a remote PostgreSQL database in Pandas logic.

import psycopg2
from sshtunnel import SSHTunnelForwarder
import pandas as pd
import pandas_to_sql

try:
    server = SSHTunnelForwarder(
        ('some.linktodb.com', 22),
        ssh_username="username",
        ssh_pkey="your/private/key",
        ssh_private_key_password="****",
        remote_bind_address=('localhost', 5432)
    )

    server.start()

    print("server connected")

    params = {
        'database': 'test',
        'user': 'dome',
        'password': '*****',
        'host': 'localhost',
        'port': server.local_bind_port
    }

    conn = psycopg2.connect(**params)

    conn.set_session(readonly=True)  # read only!

    print("DB connected")
except Exception as e:
    print("Connection failed:", e)

df = pd.read_sql_query("select * from post_codes limit 3", conn)
df = pandas_to_sql.wrap_df(df, "post_codes")  # needs the table name

pf = pd.read_sql_query(
    # using the dummy df to generate the SQL string
    df[
        (df["old_postcode"] != df["new_postcode"]) &
        (df["name"] == "Trier-Nord")
    ]
    .get_sql_string() + " LIMIT 10", conn)
pf

# don't forget to close the connection and stop the server afterwards!
conn.close()
server.stop()

Note that pandas_to_sql is still experimental and might not be compatible with all commands, so keep the habit from above and eyeball every generated string before firing it at the remote database:
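
# inspect the generated SQL before executing it remotely
sql = df[df["old_postcode"] != df["new_postcode"]].get_sql_string()
print(sql)
pf = pd.read_sql_query(sql, conn)

Still, it’s a nice, hacky solution for everyday needs and works beautifully out of the box!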