English-only embedding models for multilingual docs

Text Embeddings: Speaking languages without learning them?
tl;dr: some *-en models perform well on other languages too.

Let’s process 1,369,841 individual English XML/HTML files and index them with BAAI/bge-base-en-v1.5 embeddings for semantic search! Sounds fun? Let’s go!

A full semantic search tutorial about:

Let’s set up a public Zulip instance for our geospatial community! It’s live: zulip.gis.chat
Semantic search right in your browser! It calculates the embeddings and the cosine similarity client-side, without any server-side inference, using transformers.js and a quantized version of sentence-transformers/all-MiniLM-L6-v2.
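
To make the ranking step concrete, here is a minimal sketch of cosine-similarity search over precomputed embeddings. It is written in Python for brevity, whereas the post itself runs in the browser with transformers.js. The function names `cosine_similarity` and `rank` and the tiny 3-dimensional vectors are illustrative assumptions; real all-MiniLM-L6-v2 embeddings have 384 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as "not similar".
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vec, doc_vecs, top_k=3):
    """Score every document against the query and return the best top_k.

    doc_vecs maps a document id to its embedding (hypothetical layout).
    """
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors standing in for real sentence embeddings.
    docs = {
        "maps": [0.9, 0.1, 0.0],
        "cooking": [0.0, 0.2, 0.9],
        "routing": [0.8, 0.3, 0.1],
    }
    query = [1.0, 0.2, 0.0]
    for doc_id, score in rank(query, docs, top_k=2):
        print(f"{doc_id}: {score:.3f}")
```

Because everything here is plain arithmetic on arrays, the same logic ports directly to JavaScript, which is what makes a purely client-side search feasible.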

Create a fully working semantic search stack using only Qdrant as the vector database (with its built-in API) and transformers.js, with any Hugging Face model as your frontend-only embedding generator. No additional inference server needed!

Image courtesy Qdrant & Hugging Face.
Using Qdrant to query text data with vector search and geospatial filters, without a GPU (CPU only)

Image courtesy Qdrant transformed with Stable Diffusion v2 by stability-ai.
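
As a rough illustration of combining vector search with a geospatial filter, the sketch below builds the JSON body for a Qdrant search request that restricts hits to a radius around a point. The helper name `build_geo_search_request`, the payload field `location`, the coordinates, and the toy query vector are all assumptions for illustration. The body follows Qdrant's `geo_radius` filter condition as used by the REST endpoint `POST /collections/{collection_name}/points/search`; check the Qdrant documentation for the version you run, since newer releases also offer a separate query endpoint.

```python
import json


def build_geo_search_request(query_vector, lat, lon, radius_m, limit=10):
    """Build a Qdrant search body: nearest vectors within radius_m metres.

    Assumes each point stores its coordinates in a payload field named
    "location" (hypothetical field name) shaped like {"lon": ..., "lat": ...}.
    """
    return {
        "vector": list(query_vector),
        "filter": {
            "must": [
                {
                    "key": "location",
                    "geo_radius": {
                        "center": {"lon": lon, "lat": lat},
                        "radius": float(radius_m),
                    },
                }
            ]
        },
        "limit": limit,
        "with_payload": True,
    }


if __name__ == "__main__":
    # Toy 4-dimensional vector standing in for a real text embedding.
    body = build_geo_search_request(
        [0.1, 0.2, 0.3, 0.4], lat=52.52, lon=13.405, radius_m=5000
    )
    print(json.dumps(body, indent=2))
```

The embedding for the query text still has to be computed separately; the filter only narrows which stored points are compared against it.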