A step-by-step guide on adding new vocabulary to a Hugging Face sentence-transformer model, including fine-tuning and using it in LangChain.
Pre-trained embedding models like those on Hugging Face are incredibly powerful. They understand the meaning and context of a vast vocabulary. But what happens when you need to work with text that includes words they’ve never seen?
Think of:
- Product names and brands (e.g., XIGNCODE3, the anti-cheat used as the running example below)
- Internal project codenames and acronyms
- Niche domain jargon that rarely appears in general web text
If the model has never seen these words, it will likely tokenize them into meaningless subwords (e.g., “XIGN” + “CODE” or “xig” + “##nco” + “##de”). This loses the specific meaning of your term.
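To see this concretely, here is a quick check (a minimal sketch, assuming the all-MiniLM-L6-v2 model used throughout this post) of how the stock tokenizer handles the word before we change anything:
from transformers import AutoTokenizer
# Load the tokenizer that ships with the base model, before any modifications
base_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# The word is not in the vocabulary, so WordPiece falls back to subword pieces
print(base_tokenizer.tokenize('xigncode'))
# Something like ['xig', '##nco', '##de'] (the exact pieces depend on the vocabulary)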
The solution is to add the new words to the model’s vocabulary and fine-tune it to learn their meaning. This post will show you how, using the sentence-transformers library as an example.
Adding a new word isn’t just one step. You have to update both the model’s “dictionary” (the tokenizer) and its “brain” (the model weights).
Let’s use sentence-transformers to load a model and modify its internal tokenizer and embedding matrix.
from sentence_transformers import SentenceTransformer
import torch
# Load your base model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
# 1. Get the internal tokenizer and model
# We need to access the underlying 'transformers' model to modify it
tokenizer = model.tokenizer
transformer_model = model.auto_model
print(f"Original vocabulary size: {len(tokenizer)}")
# 2. Define your new words
# IMPORTANT: Check if your model is 'cased' or 'uncased' first!
# This model ('all-MiniLM-L6-v2') is uncased, so it lowercases everything.
# We only need to add the lowercase version.
# If using a 'cased' model, you might add: ['XIGNCODE', 'xigncode', 'Xigncode']
new_words = ['xigncode']
# 3. Add new tokens to the tokenizer
num_added_toks = tokenizer.add_tokens(new_words)
if num_added_toks > 0:
    print(f"Added {num_added_toks} new tokens.")
    # 4. Resize the model's token embeddings
    # This adds a new, randomly initialized vector for our new token(s)
    transformer_model.resize_token_embeddings(len(tokenizer))
    print(f"New vocabulary size: {len(tokenizer)}")
else:
    print("Tokens already exist in the vocabulary.")
# Test the tokenizer
test_sentence = "This is a test of xigncode and XIGNCODE3."
tokenized = tokenizer.tokenize(test_sentence)
print(f"Tokenization test: {tokenized}")
# Expected output: ['this', 'is', 'a', 'test', 'of', 'xigncode', 'and', 'xigncode', '3', '.']
In the output above, notice two things:
- xigncode was recognized as a single token.
- XIGNCODE3 was tokenized as ['xigncode', '3']. This is the “longest-match” rule in action and is exactly what we want!

Now that our model recognizes “xigncode”, we need to teach it what it means. We do this by fine-tuning it on sentences where the word is used in context.
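Before jumping into training, one optional refinement worth knowing about (a common trick, not something the basic recipe requires): rather than leaving the new embedding row randomly initialized, you can seed it with the average of the embeddings of the subword pieces the original tokenizer produced for the term. A minimal sketch, reusing model_name, tokenizer, and transformer_model from the snippet above:
from transformers import AutoTokenizer
import torch
# Reload an unmodified copy of the tokenizer to recover the old subword pieces
original_tokenizer = AutoTokenizer.from_pretrained(model_name)
with torch.no_grad():
    embedding_layer = transformer_model.get_input_embeddings()
    new_token_id = tokenizer.convert_tokens_to_ids('xigncode')
    # IDs of the pieces the word used to be split into (e.g. 'xig', '##nco', '##de')
    old_piece_ids = original_tokenizer.encode('xigncode', add_special_tokens=False)
    # Start the new token from the mean of those existing vectors instead of noise
    embedding_layer.weight[new_token_id] = embedding_layer.weight[old_piece_ids].mean(dim=0)
Whether or not you seed the new row this way, the fine-tuning step itself looks like this: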
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses
# 1. Create training examples
# We need sentences that provide context for our new word.
train_examples = [
    # Give it context: "xigncode" is related to "anti-cheat"
    InputExample(texts=['xigncode is an anti-cheat solution.', 'This software prevents cheating in online games.']),
    # Give it another context:
    InputExample(texts=['Many gamers are familiar with xigncode.', 'It is a well-known security program.'])
    # Add many more examples for better learning...
]
# 2. Setup dataloader and loss
# MultipleNegativesRankingLoss treats the two texts in each example as a positive
# pair and uses the other examples in the batch as negatives, so larger batches
# (with more training data) generally help.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
# 3. Fine-tune the model
num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
print("Starting fine-tuning...")
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./my-finetuned-model',
    show_progress_bar=True
)
print("Fine-tuning complete. Model saved to './my-finetuned-model'.")
Your new, smarter model is now saved to the ./my-finetuned-model directory. You can load it directly in sentence-transformers or, as you originally asked, in LangChain by simply pointing to the saved path.
from langchain_huggingface import HuggingFaceEmbeddings
# Point to your local, fine-tuned model directory
model_path = "./my-finetuned-model"
# LangChain will load the model and its new tokenizer
embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs={'device': 'cpu'},  # Use 'cuda' if available
    encode_kwargs={'normalize_embeddings': True}
)
# This text now generates a more accurate embedding
text = "Tell me about the xigncode anti-cheat."
query_embedding = embeddings.embed_query(text)
print("Successfully created embedding with the new word!")
print(f"Vector dimension: {len(query_embedding)}")
Our conversation highlighted a few critical points to remember:
XIGNCODE vs. xigncode (Case Sensitivity)
- Uncased models (like all-MiniLM-L6-v2): Automatically lowercase everything. You only need to add the lowercase xigncode. It will match XIGNCODE, xigncode, and XignCode.
- Cased models (like bert-base-cased): Treat XIGNCODE and xigncode as two different words. If you add XIGNCODE, it will only match that exact capitalization. For cased models, it’s safer to add all common variations you expect to see (the short check after this list shows the difference).

[UNK] vs. Subwords
- Why add the token at all? It stops your term (xigncode) from being split into ['xig', '##nco', '##de']. This splitting loses the unique meaning of the term.
- When would you actually see [UNK]? Almost never in modern models (like BERT, RoBERTa, etc.). These models are subword-based and are designed to split any unknown word into its component pieces. The [UNK] (Unknown) token is mostly a relic of older, word-based models (like Word2Vec) or is only used if the input contains a character that is completely outside the tokenizer’s character set.

By adding the token and fine-tuning, you are telling the model: “Stop splitting this word! Treat it as one thing, and this is what it means.”
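To see the case-sensitivity point (and how rarely [UNK] shows up) for yourself, here is a small check. The bert-base-cased comparison is my own illustration; the post itself only modifies the uncased MiniLM model.
from transformers import AutoTokenizer
# Compare an uncased and a cased tokenizer on the same word
uncased = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
cased = AutoTokenizer.from_pretrained('bert-base-cased')
# Uncased: both spellings are lowercased first, so they produce identical pieces
print(uncased.tokenize('XIGNCODE'))
print(uncased.tokenize('xigncode'))
# Cased: the two spellings are tokenized independently and split differently
print(cased.tokenize('XIGNCODE'))
print(cased.tokenize('xigncode'))
# [UNK] only tends to appear for characters outside the tokenizer's character set
print(uncased.tokenize('\U00013000'))  # an Egyptian hieroglyph: likely ['[UNK]']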