English-only embedding models for multilingual docs
Text Embeddings: Speaking languages without learning them?
tl;dr: some *-en models perform well on other languages too.
Intro
While indexing plenty of XML files for semantic search, I noticed that some (not all) text embedding models designed for English only also perform fairly well on other languages!
Amongst those is e.g. FlagEmbedding (the bge family), one of the best open-source embedding models according to MTEB. For FlagEmbedding, a rule of thumb is: the larger the model, the better it performs on other languages. It's an effect of the distillation and pruning process behind the large variants, which were trained on data in multiple languages.
This is very handy, as it makes your embeddings more resilient than you'd expect, e.g. if some French quotes appear in your text or your social media posts unexpectedly contain entries in other languages. In those cases you should probably stick with a good multilingual model anyway, but if other languages are, statistically speaking, negligible in your data, *-en models work just fine.
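To get a feel for this effect, here is a minimal sketch using sentence-transformers (my own illustrative setup, not part of the test described below; the sentences are arbitrary examples): it embeds an English sentence, its French translation and an unrelated English sentence with bge-large-en and compares cosine similarities.

```python
# Minimal sketch: probe cross-lingual behaviour of an English-only embedding model.
# Assumes the sentence-transformers package; model choice and sentences are illustrative.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-en")

english = "Climate change increases the frequency of extreme weather events."
french = "Le changement climatique augmente la fréquence des événements météorologiques extrêmes."
unrelated = "The recipe calls for two cups of flour and a pinch of salt."

# Normalized embeddings, so the dot product equals cosine similarity.
emb = model.encode([english, french, unrelated], normalize_embeddings=True)

# The French translation should score much closer to the English sentence
# than the unrelated English sentence does.
print("en vs fr:       ", util.cos_sim(emb[0], emb[1]).item())
print("en vs unrelated:", util.cos_sim(emb[0], emb[2]).item())
```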
Test yourself
You can quickly test the behavior yourself with SemanticFinder. Steps:
- Add one sentence in 30 or more different languages, one translation per line.
- Set the splitting option to regex and enter "\n" (one chunk per line).
- Add plenty of dummy data on the other lines; here I added 1000 lines of the IPCC report, which deals extensively with climate and weather topics. You can find it here.
- Run the test for different models and see how the results change (a scripted version of this test is sketched below).
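If you'd rather reproduce the test outside the browser, the sketch below mimics the setup with sentence-transformers. The file names, corpus and model list are my assumptions, not the exact SemanticFinder configuration: it embeds all lines, retrieves the top-k nearest neighbours of the English query and counts how many translations made the cut.

```python
# Sketch of the same test as a script (assumed file names and corpus, not the
# exact SemanticFinder setup): translations.txt has one translated sentence per
# line, dummy.txt holds the filler lines (e.g. excerpts from the IPCC report).
from sentence_transformers import SentenceTransformer, util

QUERY = "Climate change increases the frequency of extreme weather events."
TOP_K = 50  # how many nearest neighbours to inspect per model

with open("translations.txt", encoding="utf-8") as f:
    translations = [line.strip() for line in f if line.strip()]
with open("dummy.txt", encoding="utf-8") as f:
    dummy = [line.strip() for line in f if line.strip()]

corpus = translations + dummy

for model_name in ("BAAI/bge-small-en", "BAAI/bge-base-en", "BAAI/bge-large-en"):
    model = SentenceTransformer(model_name)
    corpus_emb = model.encode(corpus, normalize_embeddings=True)
    query_emb = model.encode([QUERY], normalize_embeddings=True)

    # Retrieve the top-k nearest lines and count how many translations made it in.
    hits = util.semantic_search(query_emb, corpus_emb, top_k=TOP_K)[0]
    found = sum(1 for hit in hits if hit["corpus_id"] < len(translations))
    print(f"{model_name}: {found}/{len(translations)} translations in top {TOP_K}")
```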
The bge-large-en model performs well on 29 out of 30 languages!
The bge-base-en model still performs well on 24 languages.
The bge-small-en model still handles 21 languages well, an expected degradation considering the smaller model size.
Note that I used the quantized models here for faster performance. If you use the unquantized ones, you can expect a boost in quality but slower performance.
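If you want to run such a quantized variant outside the browser, one option is Hugging Face optimum. The repo id, subfolder and file name below are assumptions based on the transformers.js-style conversions, so treat this as a rough sketch and double-check the actual model card.

```python
# Sketch: load an (assumed) quantized ONNX export of bge-small-en with optimum.
# Repo id, subfolder and file name are assumptions; verify them on the model card.
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

repo = "Xenova/bge-small-en"  # assumed repo layout with an onnx/ subfolder
tokenizer = AutoTokenizer.from_pretrained(repo)
model = ORTModelForFeatureExtraction.from_pretrained(
    repo, subfolder="onnx", file_name="model_quantized.onnx"
)

inputs = tokenizer("Le changement climatique est un défi mondial.", return_tensors="pt")
outputs = model(**inputs)

# bge models use the normalized [CLS] token embedding as the sentence vector.
embedding = torch.nn.functional.normalize(outputs.last_hidden_state[:, 0], dim=-1)
print(embedding.shape)
```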
Test any other model yourself with SemanticFinder running entirely in your browser!