How to teach your embedding model new words
A step-by-step guide on adding new vocabulary to a Hugging Face sentence-transformer model, including fine-tuning and using it in LangChain.
Pre-trained embedding models like those on Hugging Face are incredibly powerful. They understand the meaning and context of a vast vocabulary. But what happens when you need to work with text that includes words they’ve never seen?
Think of:
- Technical Jargon: “Gecko-Embeddings”
- Niche Terms: “DeepLearningGenius”
If the model has never seen a word (say “XIGNCODE”, the running example in this post), it will likely tokenize it into subwords that carry none of the term’s meaning (e.g., “XIGN” + “CODE” or “xig” + “##nco” + “##de”). The specific meaning of your term is lost.
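To see why an unseen term fragments, here is a minimal sketch of greedy longest-match-first splitting in the style of WordPiece. The function name and the tiny vocabulary are hypothetical, chosen only for illustration; the real tokenizer has a vocabulary of roughly thirty thousand entries and more preprocessing.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word becomes [UNK]
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabulary, NOT the real all-MiniLM-L6-v2 vocabulary.
toy_vocab = {"anti", "cheat", "xig", "##nco", "##de", "##n", "##c"}

before = wordpiece_split("xigncode", toy_vocab)
after = wordpiece_split("xigncode", toy_vocab | {"xigncode"})

print(before)  # ['xig', '##nco', '##de']
print(after)   # ['xigncode']
```

Once the full term is in the vocabulary, the longest match is the term itself, so no fragments are produced.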
The solution is to add the new words to the model’s vocabulary and fine-tune it to learn their meaning. This post will show you how, using the sentence-transformers library as an example.
The Two-Step Process to Add New Words
Adding a new word isn’t just one step. You have to update both the model’s “dictionary” (the tokenizer) and its “brain” (the model weights).
- Update the Tokenizer: You first tell the tokenizer that your new word exists as a single, complete unit. This prevents it from being broken into subwords.
- Update the Model & Fine-Tune: After you add the word, you resize the model’s embedding matrix to create a “slot” for it, but the new embedding vector is untrained: it is freshly initialized and carries no information about your word. You must fine-tune the model on new sentences containing your word so it can learn what that word means from its context.
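The two steps can be pictured with a toy vocabulary and a toy embedding table. This is only a sketch under simplifying assumptions: the names `add_token`, `vocab`, and `embeddings` are hypothetical, the matrix is a plain list of lists, and the new row is drawn from a small Gaussian purely for illustration. The real library calls appear in the next section.

```python
import random

random.seed(0)
EMBED_DIM = 4  # tiny dimension for illustration; real models use hundreds

# Hypothetical toy vocabulary: token -> row index in the embedding matrix.
vocab = {"[UNK]": 0, "anti": 1, "cheat": 2}
embeddings = [[random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)] for _ in vocab]


def add_token(token, vocab, embeddings):
    """Step 1: register the token. Step 2: append an untrained row for it."""
    if token in vocab:
        return vocab[token]
    vocab[token] = len(vocab)
    embeddings.append([random.gauss(0.0, 0.02) for _ in range(EMBED_DIM)])
    return vocab[token]


old_rows = [row[:] for row in embeddings]  # copy to show old rows are untouched
new_id = add_token("xigncode", vocab, embeddings)

print(new_id, len(vocab), len(embeddings))  # 3 4 4
print(embeddings[:3] == old_rows)           # True
```

The existing rows keep their trained values; only the appended row needs to be learned during fine-tuning.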
Adding a New Token (The Code)
Let’s use sentence-transformers to load a model and modify its internal tokenizer and embedding matrix.
from sentence_transformers import SentenceTransformer
import torch
# Load your base model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
# 1. Get the internal tokenizer and model
# We need to access the underlying 'transformers' model to modify it
tokenizer = model.tokenizer
transformer_model = model.auto_model
print(f"Original vocabulary size: {len(tokenizer)}")
# 2. Define your new words
# IMPORTANT: Check if your model is 'cased' or 'uncased' first!
# This model ('all-MiniLM-L6-v2') is uncased, so it lowercases everything.
# We only need to add the lowercase version.
# If using a 'cased' model, you might add: ['XIGNCODE', 'xigncode', 'Xigncode']
new_words = ['xigncode']
# 3. Add new tokens to the tokenizer
num_added_toks = tokenizer.add_tokens(new_words)
if num_added_toks > 0:
    print(f"Added {num_added_toks} new tokens.")
    # 4. Resize the model's token embeddings
    # This adds a new, randomly initialized vector for our new token(s)
    transformer_model.resize_token_embeddings(len(tokenizer))
    print(f"New vocabulary size: {len(tokenizer)}")
else:
    print("Tokens already exist in the vocabulary.")
# Test the tokenizer
test_sentence = "This is a test of xigncode and XIGNCODE3."
tokenized = tokenizer.tokenize(test_sentence)
print(f"Tokenization test: {tokenized}")
# Expected output: ['this', 'is', 'a', 'test', 'of', 'xigncode', 'and', 'xigncode', '3', '.']
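In the output above, notice two things:
- xigncode was recognized as a single token.
- XIGNCODE3 was tokenized as ['xigncode', '3']. Added tokens are matched inside the (lowercased) text before the regular WordPiece splitting runs, so the known term is cut out and the trailing “3” is tokenized on its own. This is exactly what we want!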
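The behavior on XIGNCODE3 can be mimicked with a few lines of plain Python. This is a simplified model of the pre-splitting step, not the tokenizer’s actual implementation; the helper name `split_on_added_tokens` is hypothetical, and it assumes at least one added token is supplied.

```python
import re


def split_on_added_tokens(text, added_tokens, lowercase=True):
    """Cut known added tokens out of the text before subword splitting."""
    if lowercase:
        text = text.lower()  # what an uncased model does first
    # Longer tokens first so the longest alternative wins at each position.
    ordered = sorted(added_tokens, key=len, reverse=True)
    pattern = "(" + "|".join(re.escape(t) for t in ordered) + ")"
    # The capturing group makes re.split keep the matched tokens.
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


segments = split_on_added_tokens("XIGNCODE3", ["xigncode"])
print(segments)  # ['xigncode', '3']
```

Each remaining segment (here just "3") would then go through the normal subword tokenizer on its own.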
Fine-Tuning (Teaching the Meaning)
Now that our model recognizes “xigncode”, we need to teach it what it means. We do this by fine-tuning it on sentences where the word is used in context.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses
# 1. Create training examples
# We need sentences that provide context for our new word.
train_examples = [
    # Give it context: "xigncode" is related to "anti-cheat"
    InputExample(texts=['xigncode is an anti-cheat solution.', 'This software prevents cheating in online games.']),
    # Give it another context:
    InputExample(texts=['Many gamers are familiar with xigncode.', 'It is a well-known security program.'])
    # Add many more examples for better learning...
]
# 2. Setup dataloader and loss
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
# 3. Fine-tune the model
num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
print("Starting fine-tuning...")
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./my-finetuned-model',
    show_progress_bar=True
)
print("Fine-tuning complete. Model saved to './my-finetuned-model'.")
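It is worth working through the warmup formula from the fine-tuning code once by hand, because with only a couple of examples it evaluates to zero. The sketch below mirrors that arithmetic in plain Python; the helper name `warmup_steps_for` and the 2,000-pair dataset size are hypothetical, and it assumes the DataLoader keeps the last partial batch (the default).

```python
import math


def warmup_steps_for(num_examples, batch_size, num_epochs, warmup_fraction=0.1):
    """Mirror the warmup formula used above for a given dataset size."""
    # Number of batches per epoch when the last partial batch is kept.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    return total_steps, int(total_steps * warmup_fraction)


print(warmup_steps_for(2, 16, 1))     # (1, 0): one step, no warmup at all
print(warmup_steps_for(2000, 16, 1))  # (125, 12)
```

With the two toy examples, training is a single optimizer step, which is far too little to learn a new embedding. This is one more reason to add many more training pairs.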
Using Your New Model (with LangChain)
Your new, smarter model is now saved to the ./my-finetuned-model directory. You can load it directly in sentence-transformers or in LangChain by simply pointing to the saved path.
from langchain_huggingface import HuggingFaceEmbeddings
# Point to your local, fine-tuned model directory
model_path = "./my-finetuned-model"
# LangChain will load the model and its new tokenizer
embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs={'device': 'cpu'},  # Use 'cuda' if available
    encode_kwargs={'normalize_embeddings': True}
)
# This text now generates a more accurate embedding
text = "Tell me about the xigncode anti-cheat."
query_embedding = embeddings.embed_query(text)
print("Successfully created embedding with the new word!")
print(f"Vector dimension: {len(query_embedding)}")
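One way to sanity-check the result is to compare the embedding of a sentence containing the new word against a related and an unrelated sentence. The sketch below shows only the comparison step, in plain Python; the three vectors are made-up placeholders standing in for real embeddings, not model output.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder vectors (hypothetical), standing in for real embeddings.
query_vec = [0.6, 0.8, 0.0]      # e.g. "Tell me about the xigncode anti-cheat."
related_vec = [0.8, 0.6, 0.0]    # e.g. a sentence about anti-cheat software
unrelated_vec = [0.0, 0.0, 1.0]  # e.g. a sentence about cooking

related_score = cosine_similarity(query_vec, related_vec)
unrelated_score = cosine_similarity(query_vec, unrelated_vec)
print(related_score > unrelated_score)  # True
```

With the real model, the vectors would come from embeddings.embed_query(...). Because normalize_embeddings is set to True above, the plain dot product would give the same ranking.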
Key Gotchas: Case Sensitivity & Subwords
A few critical points to remember:
XIGNCODE vs. xigncode (Case Sensitivity)
- Uncased Models (like all-MiniLM-L6-v2): Automatically lowercase everything. You only need to add the lowercase xigncode. It will match XIGNCODE, xigncode, and XignCode.
- Cased Models (like bert-base-cased): Treat XIGNCODE and xigncode as two different words. If you add XIGNCODE, it will only match that exact capitalization. For cased models, it’s safer to add all common variations you expect to see.
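For a cased model, the list of variations can be generated rather than typed by hand. This is a small sketch; the helper name `case_variants` is hypothetical, and it covers only lowercase, uppercase, and capitalized forms, so mixed-case spellings such as XignCode would still need to be added manually.

```python
def case_variants(term):
    """Return common capitalizations of a term, without duplicates."""
    candidates = [term.lower(), term.upper(), term.capitalize()]
    return list(dict.fromkeys(candidates))  # dedupe while keeping order


new_words = case_variants("xigncode")
print(new_words)  # ['xigncode', 'XIGNCODE', 'Xigncode']
```

The resulting list can be passed to tokenizer.add_tokens in place of the single-entry list used earlier.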
[UNK] vs. Subwords
- Why do this at all? To prevent your new word (e.g., xigncode) from being split into ['xig', '##nco', '##de']. This splitting loses the unique meaning of the term.
- Will it become [UNK]? Almost never in modern models (like BERT, RoBERTa, etc.). These models are subword-based and are designed to split any unknown word into its component pieces. The [UNK] (Unknown) token is mostly a relic of older, word-based models (like Word2Vec) or is only used if the input contains a character that is completely outside the tokenizer’s character set.
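The “character outside the tokenizer’s character set” case can be checked ahead of time. The sketch below is a simplification with a hypothetical helper name and a toy vocabulary: it only reports characters that appear in no vocabulary entry at all, which is the situation where a WordPiece-style tokenizer has no piece to fall back on.

```python
def uncovered_characters(word, vocab):
    """Characters in `word` that appear in no vocabulary entry at all."""
    known = set()
    for token in vocab:
        # Drop the continuation prefix so '#' is not counted as a known character.
        piece = token[2:] if token.startswith("##") else token
        known.update(piece)
    return sorted(set(word) - known)


# Hypothetical toy vocabulary, NOT a real model's vocabulary.
toy_vocab = {"xig", "##nco", "##de", "a", "##b"}

print(uncovered_characters("xigncode", toy_vocab))   # []
print(uncovered_characters("xign☃code", toy_vocab))  # ['☃']
```

An empty result means the word can at worst be split into small pieces; a non-empty result flags input that may end up as [UNK].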
By adding the token and fine-tuning, you are telling the model: “Stop splitting this word! Treat it as one thing, and this is what it means.”