[Title image: futuristic GPU]

Let’s process 1,369,841 English XML/HTML files and index them with BAAI/bge-base-en-v1.5 embeddings for semantic search! Sounds fun? Let’s go!

Intro

I recently needed to index a big data dump of XML/HTML files, 24 GB as .tar.gz and around 167 GB unzipped. While there is lots of controversy around what big data actually is, and some might even claim that BIG DATA IS DEAD, these dimensions are certainly not within the scale of “normal” everyday data dumps anymore. And while it is already hard enough to get your hands on such large data dumps in a corporate environment, we won’t talk about any kind of data retrieval or mining here but only about the following:

  1. performant XML/HTML parsing with lxml (and BeautifulSoup) so we get actual text to process
  2. chunking the text into pieces that fit within the model’s max token context
  3. generating BAAI/bge-base-en-v1.5 embeddings on a powerful Quadro RTX 8000 with 48 GB of VRAM
  4. future post: creating a collection in Qdrant for hybrid search (semantic & full-text search) & a nice GUI with Gradio to query the data

1 Parsing XML/HTML

XML and its semantic cousin HTML dominated the internet until the early 2000s. Browsers can read the tag structure quite well and most programming languages offer standard tools to parse it, like xml (import xml.etree.ElementTree as ET) in Python. Apart from that, XML is outdated and heavy, which is why it has luckily been widely replaced by JSON. Still, simply parsing an XML file can already be a challenge. To give only a brief overview of possible parsing error sources:

  • more than one html tag in one file
  • EndTag: ‘</’ not found
  • xmlSAX2Characters: huge text node
  • Opening and ending tag mismatch: p
  • and so on

XML Parsing in Python

Python offers the aforementioned standard tool xml, but it’s slow and inconvenient. You can instead choose between two options:

  1. BeautifulSoup: convenient but slow
  2. lxml, a speedier C-backed variant of the xml API: fast but inconvenient
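
Just for comparison, the standard-library route these two options improve upon looks roughly like this (a sketch assuming a well-formed file named your_file.xml; it raises on the malformed inputs listed above):

import xml.etree.ElementTree as ET

# Minimal standard-library baseline: parse and join all text nodes.
# Only works on well-formed XML and raises ParseError otherwise.
tree = ET.parse("your_file.xml")
root = tree.getroot()
text = " ".join(t.strip() for t in root.itertext() if t.strip())
print(text)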

Beautifulsoup

Beautifulsoup is just so convenient to use. Here is a script that just works (TM).

import re
from bs4 import BeautifulSoup

def remove_consecutive_whitespaces(text):
    # Define the regex pattern
    pattern = r'[^\S\r\n]*(\r\n|\n|\r)[^\S\r\n]*|([^\S\r\n]){2,}'

    # Use re.sub to replace matches with a single whitespace
    result = re.sub(pattern, lambda match: ' ' if match.group(2) else match.group(1), text)

    return result

def extract_visible_text_from_html_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        xml_data = file.read()

    soup = BeautifulSoup(xml_data, 'html.parser')

    def recursive_extract(element):
        if hasattr(element, 'children'):
            return ''.join(recursive_extract(child) for child in element.children if child.name is None or child.name.lower() != 'style')
        return str(element) if element.name is None or element.name.lower() != 'style' else ''

    result = recursive_extract(soup)

    # Apply strip() to remove leading and trailing whitespaces
    result = result.strip()

    # Remove consecutive line breaks
    result = re.sub(r'\n\s*\n', '\n', result)

    result = remove_consecutive_whitespaces(result)

    return result


# Example usage with an HTML file
html_file_path = "your_file.xml"
result = extract_visible_text_from_html_file(html_file_path)
result

It works recursively through all elements on the page and even deals with problematic XMLs. However, this convenience comes at a price, as it’s written in Python. Even though it can rely on lxml as a parser, it’s still quite slow. If you only have a few files to process, you will probably do yourself a favor sticking to bs4 instead of any other Python alternative.

lxml

lxml is written in C, which is why it’s so fast in comparison. However, it’s quite strict when it comes to proper XML formatting. In my case I encountered a few errors, mainly when there was actually something wrong with the file. This can be good or bad depending on your use case: for a quick’n’dirty proof of concept it will be quite annoying, while for a proper and well-thought-out app it might be great to be able to detect edge cases.
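
As a side note: if you want lxml’s speed but need it to tolerate broken markup, its HTML parser can recover from many of the errors listed above. A small sketch (recover=True is the default of etree.HTMLParser):

from lxml import etree

# Lenient parsing sketch: the HTML parser repairs broken markup instead of raising.
parser = etree.HTMLParser(recover=True)
root = etree.fromstring("<html><p>unclosed paragraph<p>still parses</html>", parser)
print(etree.tostring(root, pretty_print=True).decode())

The strict variant I actually used looks like this: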

import re
from lxml import etree

filename = "your_file.xml"

def extract_visible_text(html_string):
    root = etree.fromstring(html_string)
    visible_text = []

    for element in root.xpath('//text()'):
        # Check if the text is visible (non-empty and not just whitespace)
        if element.strip():
            visible_text.append(element.strip())

    result = "\n".join(visible_text)
    return result

with open(filename, 'r', encoding='utf-8') as file:
    xml_data = file.read()

    # Use regex to extract each HTML content between <html> tags
    html_contents = re.findall(r'<html.*?</html>', xml_data, re.DOTALL)

    for html_content in html_contents:
        result = extract_visible_text(html_content)
        print(result)
        print('-' * 50)

The code works pretty much the same way: it walks through all text nodes on the page and extracts the visible text. Here is a batch version that reads the list of filenames from a Parquet file:

from lxml import etree
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from tqdm import tqdm
import os
import re

tqdm.pandas()

def remove_consecutive_whitespaces(text):
    pattern = r'[^\S\r\n]*(\r\n|\n|\r)[^\S\r\n]*|([^\S\r\n]){2,}'
    return re.sub(pattern, lambda match: ' ' if match.group(2) else match.group(1), text)

def write_parquet(file_path, data):
    if isinstance(data, str):
        # Assuming the string is a JSON representation of a DataFrame
        data = pd.DataFrame([data])
    elif not isinstance(data, (list, pd.DataFrame)):
        raise ValueError("Input data must be a list, DataFrame, or string")

    table = pa.Table.from_pandas(data)
    pq.write_table(table, file_path)

def read_parquet(file_path):
    table = pq.read_table(file_path)
    return table.to_pandas()

def extract_visible_text(xml_string):
    root = etree.fromstring(xml_string)
    visible_text = []

    for element in root.xpath('//text()'):
        # Check if the text is visible (non-empty and not just whitespace)
        if element.strip():
            visible_text.append(element.strip())

    result = "\n".join(visible_text)
    return result

def extract_visible_text_from_html_file(x):
    with open(x, 'r', encoding='utf-8') as file:

        try:
            xml_data = file.read()

            # Use regex to extract each HTML content between <html> tags
            html_contents = re.findall(r'<html.*?</html>', xml_data, re.DOTALL)

            result = []
            for html_content in html_contents:
                result.append( extract_visible_text(html_content))
            
            result = "\n".join(result)

            result = re.sub(r'\n\s*\n', '\n', result).strip()
            result = remove_consecutive_whitespaces(result)
            return result

        except Exception as e:
            print(x,e)

# Read the file list
df = read_parquet("file_list.parquet")
df = df.rename(columns={0:"filename"}) # input file has just one column named 0, needs to be renamed for saving later

#df = df.iloc[:100] # for tests, reduce length

# Calculate the chunk size
total_files = len(df)
chunk_size = total_files // 20  # 5% of total files for each chunk

processed_files = []

# Process and save chunks
for i in range(20):
    print(i)
    start_index = i * chunk_size
    end_index = (i + 1) * chunk_size if i < 19 else total_files

    chunk_df = df.iloc[start_index:end_index].copy()

    # Process the chunk
    chunk_df["content"] = chunk_df["filename"].progress_apply(extract_visible_text_from_html_file)

    # Save the chunk to a Parquet file
    chunk_file_path = f"chunks_output/chunk_{i + 1}.parquet"
    chunk_df.to_parquet(chunk_file_path)

    # Add processed files to the list
    processed_files.extend(chunk_df["filename"].tolist())

# Check for missing files
original_files = df["filename"].tolist()
missing_files = set(original_files) - set(processed_files)

assert not missing_files, f"Missing files: {missing_files}"
print("finished")

# 17:40 min for 68,492 files (one chunk)

In this way, I was able to process 1,369,841 files in 515 min on a consumer-grade laptop with an Intel(R) Core(TM) i7-8550U CPU, so 44.33 files/s. Note that I used only one core; you could always speed it up with multiprocessing, as sketched below!
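
A rough sketch of what that could look like with the standard library’s multiprocessing.Pool, assuming extract_visible_text_from_html_file is defined at module level as in the script above (process count, chunksize and the output path are just examples):

from multiprocessing import Pool
import pandas as pd

if __name__ == "__main__":
    df = pd.read_parquet("file_list.parquet").rename(columns={0: "filename"})

    # Fan the files out across CPU cores; chunksize reduces inter-process overhead
    with Pool(processes=4) as pool:
        df["content"] = pool.map(extract_visible_text_from_html_file,
                                 df["filename"].tolist(), chunksize=100)

    df.to_parquet("parsed_all.parquet")  # hypothetical output path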

2 Chunking text

Chunking the parsed text poses a similar dilemma to parsing it. Haystack is an awesome framework with an advanced PreProcessor, but on my documents it was quite slow (about 1 doc/s). Again, if your collection is fairly small, go for this workflow:

from haystack.nodes import PreProcessor
from haystack import Document
import re

# This is a default usage of the PreProcessor.
# Here, it performs cleaning of consecutive whitespaces
# and splits a single large document into smaller documents.
# Each document is up to 300 words long and splits cannot fall in the middle of sentences.

# bge embeddings support up to 512 tokens, i.e. roughly 384 words; let's make it 300 just in case
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=False,
    split_by="word", 
    split_length=300,
    split_respect_sentence_boundary=True,
    progress_bar=False
)

def clean_string(input_string):
    # Replace consecutive whitespaces with a single space
    cleaned_string = re.sub(r'\s+', ' ', input_string)
    cleaned_string = cleaned_string.replace('\u200b', ' ')
    return cleaned_string.strip()

def chunk_text(s):
    s = clean_string(s)
    doc_dict = {'content': s}
    document = Document.from_dict(doc_dict)
    chunks = [i.to_dict()["content"].strip() for i in preprocessor.process([document])]
    
    return chunks

chunk_text("your test text")

The preprocessor does a lot of things you can read about in the docs. If we simply want to split our docs by spaces, we can reduce the complexity and speed up the processing to about 1800 docs/s!

import pandas as pd
from tqdm import tqdm 
tqdm.pandas()

df = pd.read_parquet("your_file.parquet") # content column holds strings

def split_into_chunks(text, chunk_size=300):
    # Check for NaN values
    if text is None or not isinstance(text, str) or text == "":
        return []

    words = text.split()

    # Handle empty text
    if not words:
        return []

    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [' '.join(chunk) for chunk in chunks]


df["content_chunk"] = df["content"].progress_apply(lambda x: split_into_chunks(x, 300)) # 35 sec

Note, however, that this way sentence boundaries are not respected, which can be detrimental to the meaning within a chunk and hence to its embedding.
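
If you want a middle ground between the full Haystack PreProcessor and plain word splitting, you could pack whole sentences into ~300-word chunks yourself. A naive sketch using a regex sentence split (not as robust as Haystack’s sentence handling, but much faster):

import re

def split_into_sentence_chunks(text, chunk_size=300):
    # Pack whole sentences into chunks of roughly chunk_size words.
    if not isinstance(text, str) or not text.strip():
        return []

    # Naive sentence boundary: split after ., ! or ? followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())

    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        if current and current_len + n_words > chunk_size:
            chunks.append(' '.join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words

    if current:
        chunks.append(' '.join(current))
    return chunks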

3 Embeddings

Jina AI embeddings

Jina AI recently released their large-context embeddings that can process around 6000 words (8192 tokens) in one go! The model is fairly small and ranks well on the Massive Text Embedding Benchmark (MTEB) leaderboard (https://huggingface.co/spaces/mteb/leaderboard). Seems like a perfect fit for documents, right? Well, I would have liked to use them; however, there are a few things to consider:

  • Most importantly: you might not even be able to run the model in your corporate environment if you’re running an old version of transformers (which frankly is very likely…) as it requires trust_remote_code=True. That’s quite a blocker but afaik it’s fixed in a more recent version.
  • Indexing a large document comes at a price as the algorithm applies some expensive heuristics. You might even run out of GPU memory. Taking such a risk is annoying when batch processing large amounts of data as you might run a process for a couple of hours or days only to see it failed in the end.
  • Other open-source embeddings rank higher on the MTEB and are smaller/faster
  • If your aim is to index chunks (which is likely for document retrieval) then you need to chunk your text into smaller subsets anyway. Afterwards you can simply average the embeddings of all chunks of a file to get a document embedding (see the short sketch after this list).
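
A minimal sketch of that averaging step (assuming chunk_embeddings is the (n_chunks, 768) array returned by the model for one file):

import numpy as np

def document_embedding(chunk_embeddings):
    # Mean-pool the per-chunk vectors into a single document vector.
    return np.asarray(chunk_embeddings).mean(axis=0)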

The Jina embeddings are great, but in this particular case they are not the right tool for the job. Rather, they shine in settings like SDG-Analyzer, an in-browser SDG mapping tool, or wherever you know you won’t exceed a certain document or chunk length.

FlagEmbeddings (bge family)

BAAI/bge-base-en-v1.5 embeddings are the perfect fit for this job. They are amongst the highest-ranking models on MTEB, fairly small, and have a vector size of 768 compared to other MTEB leaders with 1024 (Cohere & Voyage), which means lower memory and storage consumption later on in Qdrant.

Let’s create embeddings for each chunk. The easiest way to get started with FlagEmbeddings is to use their pip package.

Install with pip install FlagEmbedding and you’re good to go. Note that it does not require trust_remote_code=True.

from FlagEmbedding import FlagModel
model = FlagModel('BAAI/bge-base-en-v1.5')
embeddings = model.encode("test this model")
embeddings
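
A quick sanity check, just to illustrate the 768-dimensional output and a basic similarity score (a sketch, reusing the model from above with two example sentences):

import numpy as np

a, b = model.encode(["semantic search", "full-text search"])
print(a.shape)  # (768,)

# Cosine similarity between the two vectors
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))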

[Image: Nvidia Quadro RTX 8000]

The batch script might look something like this. The input is a pandas df with a filename (for reference later) and the actual text. Each text is then split into chunks, and a new df_expanded is created from the df in which each chunk of text becomes one row. For each row we then calculate one embedding.

Note two things:

  1. As my data dump is quite large, I preferred to split it into 20 Parquet files; that’s the reason why you see file_chunks here, not to be confused with the text chunks.
  2. This script passes all (!) chunk texts at once to the model, which might amount to a few GB! Do not do this on consumer hardware as it will crash. Instead, process the content in smaller batches of e.g. 1000 rows; there is still a benefit in passing many elements to the model at once rather than calling the encoding function separately for each row/text (see the sketch after the script). In this case, the Nvidia GPU has lots of memory (48 GB), which is why I don’t need to take care of yet more chunking. ;)
from tqdm import tqdm 
import pandas as pd 
import numpy as np
tqdm.pandas()
from FlagEmbedding import FlagModel

model = FlagModel('BAAI/bge-base-en-v1.5')
file_chunks = [1, 2, 3, 4, 5]

def split_into_chunks(text, chunk_size=300):
    # Check for NaN / non-string / empty values
    if text is None or not isinstance(text, str) or text == "":
        return []

    words = text.split()

    # Handle empty text
    if not words:
        return []

    chunks = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    return [' '.join(chunk) for chunk in chunks]

for chunk in file_chunks:
    print(f"chunk: {chunk}")

    df = pd.read_parquet(f"chunks/chunk_{chunk}.parquet") # luckily, parsing worked just fine!
    #df = df.iloc[:10] # for tests, reduce length

    df["content_chunk"] = df["content"].progress_apply(lambda x: split_into_chunks(x, 300)) # 35 sec

    # Explode the list of chunks into separate rows
    df_expanded = df.explode('content_chunk')

    # Convert 'content_chunk' column to string
    df_expanded['content_chunk'] = df_expanded['content_chunk'].astype(str)

    # Drop the original 'content' column
    df_expanded = df_expanded.drop('content', axis=1)

    print("embedding generation")
    embeddings = model.encode(df_expanded["content_chunk"].to_list())


    df_expanded["embeddings"] = list(embeddings)

    # Must convert to float32, might be due to legacy parquet/pandas version I used here
    df_expanded['embeddings'] = df_expanded['embeddings'].apply(lambda x: x.astype(np.float32) if x is not None else None)

    print("saving")
    df_expanded.to_parquet( f"chunks/chunk_{chunk}_embeddings.parquet")

    #pd.read_parquet( f"chunks/chunk_{chunk}_embeddings.parquet")
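
If you are on more modest hardware, the row-wise batching mentioned in note 2 could replace the single model.encode call inside the loop with something like this sketch (the batch size of 1000 is just an example, tune it to your memory):

texts = df_expanded["content_chunk"].to_list()
batch_size = 1000
all_embeddings = []

# Encode in slices instead of handing the whole list to the model at once
for start in range(0, len(texts), batch_size):
    all_embeddings.extend(model.encode(texts[start:start + batch_size]))

df_expanded["embeddings"] = all_embeddings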

I was able to generate 1,323,054 embeddings for chunks of under 300 words in 1:10 h! So theoretically, processing all 20 of my file chunks might be doable in around 24 hours, but something always crashes in the end and you end up redoing it.

If you do not have access to corporate-level hardware, consider using the small variants of the models, which are still incredibly powerful! Also, you might want to use the quantized versions. Xenova is creating ONNX-quantized models for all feature-extraction models for transformers.js, so make sure to try them. E.g. the quantized ONNX small bge model weighs only ~34 MB and is blazingly fast; try it in SemanticFinder to get an idea of it.


[Closing image: an expensive GPU surrounded by floating numbers between zero and one in an ambient-lit urban scene.]