How to teach your embedding model new words

A step-by-step guide on adding new vocabulary to a Hugging Face sentence-transformer model, including fine-tuning and using it in LangChain.

Pre-trained embedding models like those on Hugging Face are incredibly powerful. They understand the meaning and context of a vast vocabulary. But what happens when you need to work with text that includes words they’ve never seen?

Think of:

  - Product or brand names (e.g., “XIGNCODE”)
  - Internal project codenames or company-specific jargon
  - New technical terms and acronyms

If the model has never seen these words, it will likely tokenize them into meaningless subwords (e.g., “XIGN” + “CODE” or “xig” + “##nco” + “##de”). This loses the specific meaning of your term.
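
You can see this for yourself by running the tokenizer on the term before making any changes. This is just a quick check; the exact subword split depends on the model’s vocabulary:

from transformers import AutoTokenizer

# Tokenizer of the base model we'll modify in this post
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

# Before "xigncode" is added to the vocabulary, it gets split into smaller known pieces
print(tokenizer.tokenize("xigncode"))
# Prints something like ['xi', '##gn', '##code']; the exact pieces depend on the vocab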

The solution is to add the new words to the model’s vocabulary and fine-tune it to learn their meaning. This post will show you how, using the sentence-transformers library as an example.

The Two-Step Process to Add New Words

Adding a new word isn’t just one step. You have to update both the model’s “dictionary” (the tokenizer) and its “brain” (the model weights).

  1. Update the Tokenizer: You first tell the tokenizer that your new word exists as a single, complete unit. This prevents it from being broken into subwords.
  2. Update the Model & Fine-Tune: After adding the word, you resize the model’s embedding matrix so it has a “slot” for the new token. That new embedding vector starts out as random noise and has no meaning. You must fine-tune the model on sentences containing your word so it can learn what the word means from its context.

Adding a New Token (The Code)

Let’s use sentence-transformers to load a model and modify its internal tokenizer and embedding matrix.

from sentence_transformers import SentenceTransformer
import torch

# Load your base model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)

# 1. Get the internal tokenizer and model
# We need to access the underlying 'transformers' model to modify it
tokenizer = model.tokenizer
transformer_model = model[0].auto_model  # model[0] is the underlying Transformer module

print(f"Original vocabulary size: {len(tokenizer)}")

# 2. Define your new words
# IMPORTANT: Check if your model is 'cased' or 'uncased' first!
# This model ('all-MiniLM-L6-v2') is uncased, so it lowercases everything.
# We only need to add the lowercase version.
# If using a 'cased' model, you might add: ['XIGNCODE', 'xigncode', 'Xigncode']
new_words = ['xigncode']

# 3. Add new tokens to the tokenizer
num_added_toks = tokenizer.add_tokens(new_words)

if num_added_toks > 0:
    print(f"Added {num_added_toks} new tokens.")

    # 4. Resize the model's token embeddings
    # This adds a new, randomly initialized vector for our new token(s)
    transformer_model.resize_token_embeddings(len(tokenizer))

    print(f"New vocabulary size: {len(tokenizer)}")
else:
    print("Tokens already exist in the vocabulary.")

# Test the tokenizer
test_sentence = "This is a test of xigncode and XIGNCODE3."
tokenized = tokenizer.tokenize(test_sentence)

print(f"Tokenization test: {tokenized}")
# Expected output: ['this', 'is', 'a', 'test', 'of', 'xigncode', 'and', 'xigncode', '3', '.']

In the output above, notice two things:

  1. xigncode was recognized as a single token.
  2. XIGNCODE3 was tokenized as ['xigncode', '3']. This is the “longest-match” rule in action and is exactly what we want!
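
If you want to convince yourself that the new token exists but is still “empty”, here is a small optional check that continues from the code above (how the new row is initialized depends on your transformers version):

# The new token now has an ID...
new_token_id = tokenizer.convert_tokens_to_ids('xigncode')
print(f"Token ID for 'xigncode': {new_token_id}")

# ...but its embedding row was only just created by resize_token_embeddings
# and carries no meaning until we fine-tune.
embedding_row = transformer_model.get_input_embeddings().weight[new_token_id].detach()
print(embedding_row[:5])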

Fine-Tuning (Teaching the Meaning)

Now that our model recognizes “xigncode”, we need to teach it what it means. We do this by fine-tuning it on sentences where the word is used in context.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses

# 1. Create training examples
# We need sentences that provide context for our new word.
train_examples = [
    # Give it context: "xigncode" is related to "anti-cheat"
    InputExample(texts=['xigncode is an anti-cheat solution.', 'This software prevents cheating in online games.']),

    # Give it another context:
    InputExample(texts=['Many gamers are familiar with xigncode.', 'It is a well-known security program.'])

    # Add many more examples for better learning...
]

# 2. Setup dataloader and loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)

# 3. Fine-tune the model
num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

print("Starting fine-tuning...")
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./my-finetuned-model',
    show_progress_bar=True
)

print("Fine-tuning complete. Model saved to './my-finetuned-model'.")

Using Your New Model (with LangChain)

Your new, smarter model is now saved to the ./my-finetuned-model directory. You can load it directly in sentence-transformers, or in LangChain, by simply pointing to the saved path.

from langchain_huggingface import HuggingFaceEmbeddings

# Point to your local, fine-tuned model directory
model_path = "./my-finetuned-model"

# LangChain will load the model and its new tokenizer
embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs={'device': 'cpu'}, # Use 'cuda' if available
    encode_kwargs={'normalize_embeddings': True}
)

# This text now generates a more accurate embedding
text = "Tell me about the xigncode anti-cheat."
query_embedding = embeddings.embed_query(text)

print("Successfully created embedding with the new word!")
print(f"Vector dimension: {len(query_embedding)}")

Key Gotchas: Case Sensitivity & Subwords

A few critical points to keep in mind:

XIGNCODE vs. xigncode (Case Sensitivity)

Check whether your base model is cased or uncased before adding tokens. An uncased model like all-MiniLM-L6-v2 lowercases everything, so adding the single lowercase token “xigncode” covers “XIGNCODE”, “Xigncode”, and “xigncode”. A cased model treats each spelling as a different string, so you would need to add every variant you expect to see.
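
A quick way to check this on the uncased model, continuing from the tokenizer we modified earlier:

# Both spellings are lowercased before lookup, so one added token covers them
print(tokenizer.tokenize("XIGNCODE"))  # expected: ['xigncode']
print(tokenizer.tokenize("xigncode"))  # expected: ['xigncode']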

[UNK] vs. Subwords

Modern subword tokenizers rarely emit the [UNK] token for a word they don’t know. Instead, they fall back to splitting it into smaller known pieces (e.g., “xig” + “##nco” + “##de”). The word isn’t “unknown” to the model, but its specific meaning gets smeared across generic fragments, which is exactly why adding it as a single token matters.

By adding the token and fine-tuning, you are telling the model: “Stop splitting this word! Treat it as one thing, and this is what it means.”