DuckDB for Geospatial

2025-01-11

Data Science Geospatial Analysis Performance Optimization Open Data Python

2052 words 10 mins read

Using DuckDB in geospatial workflows for an incredible performance boost! Leveling up my GeoPandas workflows to process the 100M Foursquare dataset.

Image Source: Foursquare

Hetzner ARM Update Fix

2024-08-25 187 words 1 min read

Err:6 https://mirror.hetzner.com/ubuntu/packages jammy-updates/main arm64 Packages 404 Not Found [IP: 2a01:4ff:ff00::3:3 443]

apt update & apt upgrade not working anymore on a Hetzner ARM VM?

Here’s the solution.

English-only embedding models for multilingual docs

2023-11-23

AI ML vector search Python

927 words 2 mins read

Text Embeddings: Speaking languages without learning them?

tl;dr: some *-en models perform well on other languages too.

A performant embedding processing pipeline & tutorial for big XML/HTML data dumps

2023-11-18

AI ML vector search Python

2550 words 12 mins read

Let’s process 1,369,841 individual English XML/HTML files and index them with BAAI/bge-base-en-v1.5 embeddings for semantic search! Sounds fun? Let’s go!

Semantic Search Tutorial

2023-10-22

AI ML vector search Javascript JS

3638 words 18 mins read

A full semantic search tutorial about:

data mining with requests and beautifulsoup
preprocessing in pandas
chunking the document text into smaller paragraphs of the right size for the ML model
creating embeddings for each chunk
calculating the mean embedding for each document
saving data as gzipped JSON (small file size, and easy and fast to read in JS with pako.js)
creating a static web app based on transformers.js on GitHub Pages
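
The chunking, mean-embedding and gzipped-JSON steps above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the tutorial's actual code: the function names, the word-based chunk size, and especially toy_embed (a hypothetical stand-in for a real embedding model) are all assumptions made for this example.

```python
import gzip
import json


def chunk_text(text, max_words=50):
    """Split text into chunks of at most max_words words (assumed chunking rule)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(chunk, dim=4):
    """Hypothetical stand-in for a real embedding model: a tiny character histogram."""
    vec = [0.0] * dim
    for ch in chunk:
        vec[ord(ch) % dim] += 1.0
    return vec


def mean_embedding(vectors):
    """Element-wise mean of equally sized vectors (one vector per chunk)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def to_gzipped_json(obj):
    """Serialize obj to JSON and gzip-compress it into bytes."""
    return gzip.compress(json.dumps(obj).encode("utf-8"))


def from_gzipped_json(data):
    """Inverse of to_gzipped_json: decompress and parse the JSON payload."""
    return json.loads(gzip.decompress(data).decode("utf-8"))


def build_index(docs, max_words=50):
    """Build one record per document: its chunks and its mean chunk embedding."""
    index = []
    for doc_id, text in docs.items():
        chunks = chunk_text(text, max_words)
        vectors = [toy_embed(c) for c in chunks]
        index.append({"id": doc_id, "chunks": chunks, "mean": mean_embedding(vectors)})
    return index
```

In a real pipeline, toy_embed would be replaced by calls to an actual embedding model, and the gzipped bytes would be written to a file that the web app can fetch and decompress.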

App here

Zulip Server Setup Tutorial: zulip.gis.chat

2023-05-06

Zulip gis gis.chat

1250 words 6 mins read

Let’s set up a public Zulip instance for our geospatial community! It’s live: zulip.gis.chat

SemanticFinder - frontend-only live semantic search with transformers.js

2023-04-11

AI ML vector search Javascript JS

838 words 2 mins read

Semantic search right in your browser! Calculates the embeddings and cosine similarity client-side, without server-side inference, using transformers.js and a quantized version of sentence-transformers/all-MiniLM-L6-v2.
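
As a rough illustration of the ranking step, cosine similarity between a query embedding and each chunk embedding can be computed as shown below. The app itself runs in the browser with transformers.js; this Python sketch only shows the math, and the function names are assumptions made for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, chunk_vecs):
    """Return (chunk_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Because cosine similarity ignores vector length, a chunk whose embedding points in the same direction as the query scores highest regardless of magnitude.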