Using Qdrant for querying text data with vector search and geospatial filters without a GPU (CPU only)

Qdrant for geospatial

Image courtesy of Qdrant, transformed with Stable Diffusion v2 by stability-ai.

Intro

This post will show how to:

  1. Set up an instance of Qdrant as a Docker-based database
  2. Transform short text to vectors with a pre-trained model (without a GPU!)
  3. Feed Qdrant with the text vectors and their metadata as payload
  4. Query Qdrant via the built-in API

Jupyter Notebook to follow along here.

Vector search

Vector search is a recent trend in Natural Language Processing (NLP). Instead of only spotting certain keywords in a corpus, it searches for the logic or meaning in a text. A particular keyword does not even have to occur in a document for the document to be found, as long as its meaning is highly relevant to the user’s input. Semantic searches hence usually provide much better results than keyword-based approaches.
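To make this concrete, here is a minimal sketch of the idea using the sentence-transformers library we install in the setup below (the example sentences are made up): a document can rank high for a query although the two share no keywords.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

query = "renewable power from sunlight"
docs = [
    "Photovoltaic panels convert solar radiation into electricity.",
    "The museum exhibits medieval manuscripts.",
]

# encode query and documents, then compare with cosine similarity
scores = util.cos_sim(model.encode(query), model.encode(docs))
print(scores)  # the photovoltaics sentence scores far higher, despite zero shared keywords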

0 Setup

Before diving into the tutorial, make sure to have all requirements properly installed. It should work on any system (Ubuntu, Mac, Windows). Make sure to have at least ~5 GB of available disk space.

  • Install Docker on your system
  • When done, start a terminal and check that the docker CLI works by running docker. If so, simply pull the Qdrant container and start it:
    • docker pull qdrant/qdrant
    • docker run -p 6333:6333 qdrant/qdrant
  • For Python, it’s best to set up a virtual environment with the tool of your choice. If you use conda/mamba, create a fresh env with Python 3.10 and the dependencies:
    • conda create -n "py3.10" python=3.10 (sentence-transformers doesn’t support Python 3.11 atm)
    • conda activate py3.10
    • pip install pandas numpy qdrant-client sentence-transformers tqdm (usually you should avoid mixing conda and pip, but the pip installer is faster and consumes less memory)

Installation will take a moment as it downloads large wheels.

1 Load the data

For this test, we will use data about European projects from CORDIS (Community Research and Development Information Service). I prepared a small data subset with ~55k rows that you can access in the tutorial’s GitHub repo.

import pandas as pd 

# read from remote GitHub Repo or download first 
df = pd.read_parquet("https://github.com/do-me/qdrant-tutorial/blob/main/CORDIS_55k_projects.parquet?raw=true") # 13Mb
df
       Collection  Record Number      Project acronym                                              Title      ID                                             Teaser
60024     project         191330  Hairy Cell Leukemia  Genetics-driven targeted therapy of Hairy Cell...  617471  Hairy Cell Leukemia (HCL), a chronic B-cell ne...
60025     project          73654               MODNET                      Model theory and applications  512234  This proposal is designed to promote multi-dis...
60026     project          83907           DYNQUANTGR  Dynamical quantum groups, deformation quantiza...   42212  The main goal of this proposal is two-fold: St...
60027     project          95038        ND-ETCRYPTOUC  New Directions in Efficient and Tamper-Resilie...  256544  Emerging ubiquitous devices such as WSN nodes ...
60028     project          86440                TAMBO  Societies of South Peru in the Context of Clim...  209938  The project, Societies of South Peru in the Co...
...           ...            ...                  ...                                                ...     ...                                                ...
987809    project         225185    CustomerServiceAI  CustomerServiceAI: Fully language-independent ...  880954  Customer service is a huge industry: €720BN in...
987810    project         233472     eMOTIONAL Cities  eMOTIONAL Cities - Mapping the cities through ...  945307  As the world is becoming more urbanized and ci...
987811    project         227864         PRECISMEDLYM  Aggressive T cell Lymphomas, integrated clinic...  882597  Lymphoid leukemias and lymphomas represent fre...
987812    project         226324            C-stemGMP  c-GMP compliance of C-stem, an IPSc based cell...  881113  Scaling-up cell therapy manufacturing provides...
987813    project         227514             NetriCan  Characterization of the role of netrin-1 in th...  883146  Recent data coming from both basic research an...

The projects usually come with geospatial data consisting of the participants’ locations (e.g. universities or public bodies). For simplicity, let’s just make up some fake lat/lon coordinates on the European mainland instead.

import numpy as np
# add random coordinates somewhere in the European mainland
df["lat"] = np.random.uniform(30, 80, len(df))
df["lon"] = np.random.uniform(10, 30, len(df))
# Qdrant geo filters expect one payload field like {"lat": ..., "lon": ...} (see the queries below)
df["location"] = df.apply(lambda r: {"lat": r["lat"], "lon": r["lon"]}, axis=1)
df = df.drop(columns=["lat", "lon"])

Great, the data is ready. We have some text data for which we will compute the embeddings later, a location field for geospatial filtering and other metadata that will be added as payload.

2 Transform text to vectors with a pre-trained model

HuggingFace provides a plethora of pre-trained models for all kinds of purposes. As we want to allow for semantic search, we will try the multi-qa-MiniLM-L6-cos-v1 model that was specifically designed for this purpose.

This is a sentence-transformers model: It maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for semantic search. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: SBERT.net - Semantic Search

In the world of pre-trained models, make sure to stay tuned for the latest updates, as new models are released for very specific purposes literally every day. Also, always make sure to understand what a model was designed for:

Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.

In our case, the Teaser column of the CORDIS dataset is usually below 50 words, which is fine. For longer documents instead, like the actual project reports, a different model would be needed.
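If you want to verify this for your own data, the tokenizer bundled with the model can count the word pieces; a quick sanity check (a sketch, assuming df is loaded as above and the model as below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

# model.tokenizer exposes the underlying Hugging Face tokenizer
piece_counts = df["Teaser"].apply(lambda t: len(model.tokenizer.tokenize(str(t))))
print(piece_counts.max())  # should stay well below the 250/512 word-piece limits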

Pre-Processing

For the sake of simplicity, we do not perform any pre-processing here, but you usually should for better results. Also, note that models react differently to noise in the data; some might cope better than others.
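As an illustration, a minimal pre-processing step could look like the sketch below; what actually helps depends on your data and the model:

import re

def clean_text(text: str) -> str:
    """Minimal normalization: collapse whitespace and strip leading/trailing spaces."""
    return re.sub(r"\s+", " ", text).strip()

df["Teaser"] = df["Teaser"].apply(clean_text)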

Let’s encode the Teaser column. On my old laptop with an i7-8550U processor it takes roughly 26 minutes for the 55k records at roughly 35 iterations/sec.

from sentence_transformers import SentenceTransformer
from tqdm import tqdm
tqdm.pandas()

# load the model
model = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')

# encode all teasers with the model
df["vector"] = df["Teaser"].progress_apply(lambda x: model.encode(x.lower()))

Multiprocessing

Pandas usually runs on one CPU core only. With tools like pandarallel, the apply function can conveniently be executed on all CPU cores. However, some models already use all available CPU cores inherently, which is not always clear or even mentioned in the description.

For multi-qa-MiniLM-L6-cos-v1 I found that it already uses all cores, so there is no need to alter the code.
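Independently of that, sentence-transformers can encode an entire list of texts in batches, which is usually noticeably faster than a row-wise apply; a sketch (the batch_size of 64 is an arbitrary choice):

# batch-encode all teasers at once; returns an array of shape (n_rows, 384)
embeddings = model.encode(
    df["Teaser"].str.lower().tolist(),
    batch_size=64,
    show_progress_bar=True,
)
df["vector"] = list(embeddings)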

3 Initialize an empty Qdrant collection

The usage of Qdrant is very straightforward. One database can contain multiple collections. We need to create one collection with a defined vector length. For multi-qa-MiniLM-L6-cos-v1 it’s a 384-dimensional dense vector space. The smaller the vectors, the less memory is needed, but larger vectors are usually more accurate.

For this example, we use the DOT distance, but multi-qa-MiniLM-L6-cos-v1 also supports COSINE. Since the model produces normalized embeddings, both metrics yield the same ranking.
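If you are unsure about a model’s vector size, you can read it off the loaded model directly (a quick check, assuming model is loaded as above):

# the embedding dimension must match the VectorParams size below
print(model.get_sentence_embedding_dimension())  # 384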

from qdrant_client import QdrantClient
from qdrant_client.http import models
from qdrant_client.http.models import VectorParams, Distance, PointStruct, Filter, FieldCondition, MatchText

client = QdrantClient(host="localhost", port=6333)
client.recreate_collection(
    collection_name="test_collection",
    vectors_config=VectorParams(size=384, distance=Distance.DOT)
)

The test_collection has been created and is ready for inserts and queries.

If, in addition to the vector search, we would like to offer users a full-text search, we need to create an index for the columns we plan to insert. In this case, we would like to allow users to look for specific keywords in the Teaser column as well.

We tell Qdrant that Teaser

  • is text,
  • should be tokenized as words,
  • has tokens between 2 and 30 characters,
  • is case insensitive.
client.create_payload_index(
    collection_name="test_collection",
    field_name="Teaser",
    field_schema=models.TextIndexParams(
        type="text",
        tokenizer=models.TokenizerType.WORD,
        min_token_len=2,
        max_token_len=30,
        lowercase=True,
    )
)

4 Insert data in a Qdrant collection

Define a function that inserts one pandas row at a time into a Qdrant collection:

def post_qdrant(row):
    """Insert each row separately for simplicity.
    Can be optimized by upserting multiple rows at once."""

    # normal payload, everything apart from the vector
    row_payload = row.iloc[:-1].to_dict()

    # the vector itself (last column)
    row_vector = row.iloc[-1].tolist()

    # id as unique key in Qdrant
    row_id = int(row["Record Number"])

    # POST request to the Qdrant API
    operation_info = client.upsert(
        collection_name="test_collection",
        wait=True,
        points=[
            PointStruct(id=row_id, vector=row_vector, payload=row_payload),
        ]
    )

Either use pandas’ standard apply function with tqdm progress on one core (18 mins on my laptop):

df.progress_apply(lambda x: post_qdrant(x), axis=1)

Or go for the parallelized version if you have plenty of powerful cores available. Note that on old laptops the overhead of multiprocessing might outweigh the performance gain. As a rule of thumb: if you have more than 4 cores and a decent CPU, go for multiprocessing:

from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

df.parallel_apply(lambda x: post_qdrant(x), axis=1)
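As the docstring above hints, you can also upsert many points per request instead of issuing one HTTP round trip per row; a sketch of a batched variant (the chunk size of 256 is an arbitrary choice, and post_qdrant_batch is a hypothetical helper):

def post_qdrant_batch(frame, batch_size=256):
    """Upsert multiple rows per request to reduce HTTP overhead."""
    for start in range(0, len(frame), batch_size):
        chunk = frame.iloc[start:start + batch_size]
        points = [
            PointStruct(
                id=int(row["Record Number"]),
                vector=row.iloc[-1].tolist(),    # vector is the last column
                payload=row.iloc[:-1].to_dict(), # everything else is payload
            )
            for _, row in chunk.iterrows()
        ]
        client.upsert(collection_name="test_collection", wait=True, points=points)

post_qdrant_batch(df)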

5 Query the database

Now you can query the database with a simple search term or a concrete question. Let’s create the vector for the search term earth observation and query Qdrant with its search endpoint.

search_term = "earth observation"

search_result = client.search(
    collection_name="test_collection",
    query_vector=model.encode(search_term), 
    limit=3
)

search_result
[
    ScoredPoint(
        id=85228,
        version=4014,
        score=0.6618099,
        payload={
            "Collection": "project",
            "ID": "30849",
            "Project acronym": "SEOS",
            "Record Number": 85228,
            "Teaser": "Earth observation from space is relevant in science education already in high schools since it sharpens the sensibility to the natural environment and thus stimulates the willingness to learn of its relevance to everyday life conditions. This covers a broad field of experience...",
            "Title": "Science education through earth observation for high schools",
            "location": {"lat": 74.07343093948293, "lon": 24.20374801673081},
        },
        vector=None,
    ),
    ScoredPoint(
        id=222830,
        version=53120,
        score=0.60703325,
        payload={
            "Collection": "project",
            "ID": "842560",
            "Project acronym": "CALCHAS",
            "Record Number": 222830,
            "Teaser": "Earth Observation (EO) is undergoing a radical transformation due to the massive volume of observations acquired by remote sensing and in-situ sensor networks. While satellites provide coarse-resolution, yet global-scale monitoring of environmental processes, in-situ sensor...",
            "Title": "Computational Intelligence for Multi-Source Remote Sensing Data Analytics",
            "location": {"lat": 74.79007852190145, "lon": 12.864378877490047},
        },
        vector=None,
    ),
    ScoredPoint(
        id=193451,
        version=53355,
        score=0.5999719,
        payload={
            "Collection": "project",
            "ID": "637010",
            "Project acronym": "EGSIEM",
            "Record Number": 193451,
            "Teaser": "Earth observation (EO) satellites yield a wealth of data for scientific, operational and commercial exploitation. However, the redistribution of environmental mass is not yet part of the EO data products to date. These observations, derived from the Gravity Recovery and...",
            "Title": "European Gravity Service for Improved Emergency Management",
            "location": {"lat": 73.1969096395658, "lon": 26.867547236594767},
        },
        vector=None,
    ),
]

In this case it’s pretty obvious that the results are related, as the search term itself occurs in the teasers. A simple keyword-based search would yield just as good results:

search_term = "earth observation"

search_result = client.scroll(
    collection_name="test_collection",
    scroll_filter=Filter(
        must=[
            FieldCondition(
                key="Teaser",
                match=MatchText(text=search_term)
            )
        ]
    ),
    limit=3,
    with_payload=True,
)

search_result
(
    [
        Record(
            id=82157,
            payload={
                "Collection": "project",
                "ID": "7512",
                "Project acronym": "SPARTAN",
                "Record Number": 82157,
                "Teaser": "We propose to create a Centre of Excellence in the training of early stage researchers in the space, planetary (including Earth Observation) and astrophysical sciences in the Department of Physics and Astronomy at the University of Leicester.\r\n\r\nThe principal aims of this cent...",
                "Title": "Centre of excellence for space, planetary and astrophysics research Training and Networking",
                "location": {"lat": 72.52626887469162, "lon": 28.64183473357381},
            },
            vector=None,
        ),
        Record(
            id=85228,
            payload={
                "Collection": "project",
                "ID": "30849",
                "Project acronym": "SEOS",
                "Record Number": 85228,
                "Teaser": "Earth observation from space is relevant in science education already in high schools since it sharpens the sensibility to the natural environment and thus stimulates the willingness to learn of its relevance to everyday life conditions. This covers a broad field of experience...",
                "Title": "Science education through earth observation for high schools",
                "location": {"lat": 74.07343093948293, "lon": 24.20374801673081},
            },
            vector=None,
        ),
        Record(
            id=88557,
            payload={
                "Collection": "project",
                "ID": "211578",
                "Project acronym": "E-SOTER",
                "Record Number": 88557,
                "Teaser": "Soil and land information is needed for a wide range of applications but available data are often inaccessible, incomplete, or out of date. GEOSS plans a global Earth Observation System and, within this framework, the e-SOTER project addresses the felt need for a global soil ...",
                "Title": "Regional pilot platform as EU contribution to a Global Soil Observing System",
                "location": {"lat": 59.38920209090214, "lon": 23.629878278689645},
            },
            vector=None,
        ),
    ],
    95605,
)

So what’s the point of vector search?

Well, let’s assume a more specific search phrase:

earth observation as a method for fighting climate change in urban areas

A simple keyword-based search does not yield any results:

search_term = "earth observation as a method for fighting climate change in urban areas"

search_result = client.scroll(
    collection_name="test_collection",
    scroll_filter=Filter(
        must=[
            FieldCondition(
                key="Teaser",
                match=MatchText(text=search_term)
            )
        ]
    ),
    limit=3,
    with_payload=True,
)

search_result
([], None)

A vector search has the advantage that it always yields results, ranked by a proximity score. If there are any entries related to the above topic, we will certainly find them (provided the pre-trained model we use is good enough, which it is here):

search_term = "earth observation as a method for fighting climate change in urban areas"

search_result = client.search(
    collection_name="test_collection",
    query_vector=model.encode(search_term), 
    limit=3
)

search_result
[
    ScoredPoint(
        id=231962,
        version=32212,
        score=0.6569561,
        payload={
            "Collection": "project",
            "ID": "101004211",
            "Project acronym": "ECFAS",
            "Record Number": 231962,
            "Teaser": "The increasing number of tools and algorithms able to process and extract qualitative and quantitative information from Earth Observation products has an enormous potential to support the evaluation of weather-induced climate risks. The ECFAS project will contribute to the...",
            "Title": "A PROOF-OF-CONCEPT FOR THE IMPLEMENTATION OF A EUROPEAN COPERNICUS COASTAL FLOOD AWARENESS SYSTEM",
            "location": {"lat": 31.21824594487511, "lon": 29.857792095998956},
        },
        vector=None,
    ),
    ScoredPoint(
        id=216035,
        version=46207,
        score=0.63480234,
        payload={
            "Collection": "project",
            "ID": "771056",
            "Project acronym": "LICCI",
            "Record Number": 216035,
            "Teaser": "In the quest to better understand local climate change impacts on physical, biological, and socioeconomic systems and how such impacts are locally perceived, scientists are challenged by the scarcity of grounded data, which has resulted in a call for exploring new data...",
            "Title": "Local Indicators of Climate Change Impacts. The Contribution of Local Knowledge to Climate Change Research",
            "location": {"lat": 67.81498235670398, "lon": 12.546218131454996},
        },
        vector=None,
    ),
    ScoredPoint(
        id=202426,
        version=39883,
        score=0.632117,
        payload={
            "Collection": "project",
            "ID": "706244",
            "Project acronym": "IRONLAKE",
            "Record Number": 202426,
            "Teaser": "Under the pressure of human-induced climate change, it is essential to better understand the past natural climate variability. A broader global coverage of high-resolution palaeoclimatic proxy (indicator) data is urgently needed to improve climate projections and adaptation...",
            "Title": "Establishing stable IRON isotopes of laminated LAKE sediments as novel palaeoclimate proxy",
            "location": {"lat": 74.74258629120374, "lon": 20.516549478517778},
        },
        vector=None,
    ),
]

Search vs. scroll in Qdrant

There are two main methods in Qdrant for querying the database.

  1. Search is the main method and implies a vector search. Optionally, other conditions like ranges or geo filters can be applied.
  2. Scroll does not take a vector and is comparable to a “regular” database query.

6 Semantic queries with geospatial filters

Qdrant offers a very simple, intuitive logic for additional filters based on ranges, string matches, or geospatial relations.

At the time of writing (01/2023), it does not offer nearly as many geospatial operators as e.g. PostGIS, but it can perform very simple queries with either bounding boxes or a radius. Hopefully, the future will bring more sophisticated geo filters such as (multi-)polygon support.

Bounding boxes

A simple vector search with an additional bounding box filter could look like this:

search_term = "earth observation"# as a method for fighting climate change in urban areas"

search_result = client.search(
    collection_name="test_collection",
    query_vector=model.encode(search_term), 
    query_filter=Filter(
        must=[
            FieldCondition(
                key="location",
                geo_bounding_box=models.GeoBoundingBox(
                    bottom_right=models.GeoPoint(
                        lat=48.495862,
                        lon=13.455868,
                    ),
                    top_left=models.GeoPoint(
                        lat=52.520711,
                        lon=5.403683,
                    ),
                ),
            )
        ]
    ),
    limit=3
)

search_result
[
    ScoredPoint(
        id=92067,
        version=44762,
        score=0.3700924,
        payload={
            "Collection": "project",
            "ID": "226701",
            "Project acronym": "CARBO-EXTREME",
            "Record Number": 92067,
            "Teaser": "The aim of this project is to achieve an improved knowledge of the terrestrial carbon cycle in response to climate variability and extremes, to represent and apply this knowledge over Europe with predictive terrestrial carbon cycle modelling, to interpret the model predictions...",
            "Title": "The terrestrial Carbon cycle under Climate Variability and Extremes – a Pan-European synthesis",
            "location": {"lat": 48.98955310027602, "lon": 10.164019268032614},
        },
        vector=None,
    ),
    ScoredPoint(
        id=81072,
        version=4970,
        score=0.35429376,
        payload={
            "Collection": "project",
            "ID": "517912",
            "Project acronym": "MSEPOA",
            "Record Number": 81072,
            "Teaser": "We propose to use X-ray data from two pioneering in their kind and complementary multi-wavelength surveys, in order to perform a systematic study of obscured AGN. These leverage substantial investments of space- and ground-based telescope time, yielding a unique dataset to stu...",
            "Title": "Multi-wavelength searches of the elusive population of obscured AGN",
            "location": {"lat": 49.113653844306384, "lon": 12.80912024409048},
        },
        vector=None,
    ),
    ScoredPoint(
        id=102818,
        version=48797,
        score=0.3521356,
        payload={
            "Collection": "project",
            "ID": "303610",
            "Project acronym": "PYROMAP",
            "Record Number": 102818,
            "Teaser": "Project PyroMap aims to understand the impact that Earth’s last major globally warm period had on our planet’s flammability. The Mid Pliocene ~3.6-2.6 million years ago, experienced global mean annual temperatures 3oC higher than today. Such estimates are remarkably similar t...",
            "Title": "Palaeofire Danger Rating Maps and Earth's Last Major Global Warming Event (Project PyroMap)",
            "location": {"lat": 52.16377317766049, "lon": 12.582368674433358},
        },
        vector=None,
    ),
]

Radius

search_term = "earth observation"# as a method for fighting climate change in urban areas"

search_result = client.search(
    collection_name="test_collection",
    query_vector=model.encode(search_term), 
    query_filter=Filter(
        must=[
            FieldCondition(
                key="location",
                geo_radius=models.GeoRadius(
                    center=models.GeoPoint(
                        lat=52.520711,
                        lon=13.403683,
                    ),
                    radius=10_000,
                ),
            )
        ]
    ),
    limit=3
)

search_result
[
    ScoredPoint(
        id=103492,
        version=8624,
        score=0.23580064,
        payload={
            "Collection": "project",
            "ID": "301230",
            "Project acronym": "NeBRiC",
            "Record Number": 103492,
            "Teaser": "Cosmology has gone through an amazing revolution during the last decade owing to the large amount of new precise observational data. These data strongly indicate the existence of two periods of accelerated expansion in the history of the Universe. One in the primordial univer...",
            "Title": "Non-linear effects and backreaction in classical and quantum cosmology",
            "location": {"lat": 52.565792386550584, "lon": 13.485898163771736},
        },
        vector=None,
    ),
    ScoredPoint(
        id=80900,
        version=33189,
        score=0.13301674,
        payload={
            "Collection": "project",
            "ID": "514222",
            "Project acronym": "CB_DIDACTIQUE",
            "Record Number": 80900,
            "Teaser": "It is well known that Education is one of the key elements of every society. A worldwide scale look at of the economic, social and financial health of the different countries clearly shows an intimate link between the expansion of the latter and the degree of expertise of its ...",
            "Title": "Using a computer environnement to improve the quality of algebra education at different school levels",
            "location": {"lat": 52.54758829696126, "lon": 13.292623381409692},
        },
        vector=None,
    ),
]

I hope this blog post has shown how easy Qdrant is to set up and how convenient it is to use. Star their repo for their awesome work!

If you want to see more blog posts like this, follow me online.