SemanticFinder - frontend-only live semantic search with transformers.js
Semantic search right in your browser! It computes embeddings and cosine similarity client-side, with no server-side inference, using transformers.js and a quantized version of sentence-transformers/all-MiniLM-L6-v2.
Intro
tl;dr: I created an open-source client-side semantic search platform. Check out the demo! GitHub Repo.
Semantic search is awesome. In a nutshell, it helps you find what you’re looking for without knowing the exact terms a full-text search would require.
That’s why, in my previous tutorials, I dived into the stack one usually needs to create a fully-fledged app: Qdrant with geospatial vector search, as well as a simplified stack with only Qdrant & transformers.js.
Motivation
The stack I described in my previous posts consists only of a client and Qdrant as a vector database with included API.
However, even though Qdrant as a vector database is awesome, I felt the need to create something lighter, since ordinary people cannot simply set up a database and run it 24/7.
Also, I wanted to create a tool anyone can use by simply copy & pasting any kind of text! So let’s simplify it further:
Logic
Transformers.js does all the heavy lifting of tokenizing the input and running the model. Without it, this demo would have been impossible.
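To give an idea of how little code this takes, here is a minimal sketch of loading the quantized MiniLM model with transformers.js. It assumes the `@xenova/transformers` package; the model weights are fetched from the Hugging Face Hub on first use and cached by the browser afterwards.

```javascript
import { pipeline } from '@xenova/transformers';

// Load a feature-extraction pipeline with the quantized MiniLM model.
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Mean-pooled, L2-normalized sentence embedding
// (384 dimensions for all-MiniLM-L6-v2).
const embedding = await extractor('a house made of bread and cakes', {
  pooling: 'mean',
  normalize: true,
});
```

Because the embeddings come back normalized, comparing two of them later only requires a dot product.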
Input
- Text, as much as your browser can handle! The demo uses a part of “Hänsel & Gretel”
- A search term or phrase
Output
- Three highlighted text segments; the darker the highlight, the higher the similarity score.
Pipeline
- All scripts and the model are loaded from CDNs/HuggingFace.
- A user inputs some text and a search term or phrase.
- Depending on the approximate segment length to consider (in characters), the text is split into segments. Words themselves are never split, which is why the length is only approximate.
- The search term embedding is created.
- For each segment of the text, the embedding is created.
- Meanwhile, the cosine similarity between every segment embedding and the search term embedding is calculated and written to a dictionary, with the segment as key and the score as value.
- On every iteration, the progress bar and the highlighted sections are updated in real time based on the current three highest scores (the number of highlights, like the colors, can easily be changed in the source code).
- The embeddings are cached in the dictionary, so subsequent queries are quite fast; calculating the cosine similarity is cheap compared to generating the embeddings.
- Only if the user changes the segment length do the embeddings need to be recalculated.
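The steps above can be sketched in plain JavaScript. This is a simplified illustration, not the app’s actual source: the real app embeds with transformers.js, while `embed` here is an injected function so the control flow is runnable on its own.

```javascript
// Split text into segments of roughly `approxLen` characters without
// breaking words - which is why the length is only approximate.
function segmentText(text, approxLen) {
  const words = text.split(/\s+/).filter(Boolean);
  const segments = [];
  let current = '';
  for (const word of words) {
    if (current && current.length + 1 + word.length > approxLen) {
      segments.push(current);
      current = word;
    } else {
      current = current ? current + ' ' + word : word;
    }
  }
  if (current) segments.push(current);
  return segments;
}

// Cosine similarity; with L2-normalized embeddings this is just a dot product.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Segment embeddings are cached across queries; only a segment-length
// change would require clearing this cache and re-embedding.
const embeddingCache = new Map();

// Score every segment against the query and return the k best matches
// as [segment, score] pairs, highest first.
async function search(text, query, approxLen, embed, k = 3) {
  const queryEmb = await embed(query);
  const scores = {};
  for (const segment of segmentText(text, approxLen)) {
    if (!embeddingCache.has(segment)) {
      embeddingCache.set(segment, await embed(segment));
    }
    scores[segment] = cosineSimilarity(embeddingCache.get(segment), queryEmb);
  }
  return Object.entries(scores).sort((x, y) => y[1] - x[1]).slice(0, k);
}
```

In the real app, `embed` would wrap the transformers.js feature-extraction pipeline, and the top-k pairs drive the highlighting and the progress bar.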
Features
You can customize everything!
- Input text & search term(s)
- Segment length (the larger the segments, the faster the search; the smaller, the slower)
- Highlight colors
- Number of highlights
- Thanks to CodeMirror, you can even use syntax highlighting for programming languages such as Python, JavaScript, etc.
- Live updates
- Easy integration of other ML-models thanks to transformers.js
- Data privacy-friendly - your input text data is not sent to a server, it stays in your browser!
-> Section moving to GitHub
Usage ideas
- Basic search through anything, like your personal notes (my initial motivation, by the way: a huge notes.txt file I couldn’t handle anymore)
- Remember poem analysis in school? Often you look for possible Leitmotifs or recurring categories like food in “Hänsel & Gretel”
-> Section moving to GitHub
Future ideas
- One could package everything nicely and use it e.g. instead of JavaScript search engines such as Lunr.js (also used in mkdocs-material).
- Electron- or browser-based apps could be augmented with semantic search, e.g. VS Code, Atom or mobile apps.
- Integration in personal wikis such as Obsidian, tiddlywiki etc. would save you the tedious tagging/keywords/categorisation work, or could at least further improve your structure.
-> Section moving to GitHub
Collaboration
PRs on GitHub are more than welcome!
Let me know what you think! Feel free to write me, always happy about feedback :)