A step-by-step guide on adding new vocabulary to a Hugging Face sentence-transformer model, including fine-tuning and using it in LangChain.
Pre-trained embedding models like those on Hugging Face are incredibly powerful. They understand the meaning and context of a vast vocabulary. But what happens when you need to work with text that includes words they’ve never seen?
Think of:
- Product names and brands (e.g., XIGNCODE3, the anti-cheat used as the running example below)
- Internal project codenames and acronyms
- Niche domain jargon that rarely appears in general web text
If the model has never seen these words, it will likely tokenize them into meaningless subwords (e.g., “XIGN” + “CODE” or “xig” + “##nco” + “##de”). This loses the specific meaning of your term.
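To see this concretely, here is a quick check (a minimal sketch, assuming the all-MiniLM-L6-v2 model used throughout this post) of how the stock tokenizer handles the word before we change anything:
from transformers import AutoTokenizer
# Load the tokenizer that ships with the base model, before any modifications
base_tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
# The word is not in the vocabulary, so WordPiece falls back to subword pieces
print(base_tokenizer.tokenize('xigncode'))
# Something like ['xig', '##nco', '##de'] (the exact pieces depend on the vocabulary)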
The solution is to add the new words to the model’s vocabulary and fine-tune it to learn their meaning. This post will show you how, using the sentence-transformers library as an example.
Adding a new word isn’t just one step. You have to update both the model’s “dictionary” (the tokenizer) and its “brain” (the model weights).
Let’s use sentence-transformers to load a model and modify its internal tokenizer and embedding matrix.
from sentence_transformers import SentenceTransformer
import torch
# Load your base model
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
model = SentenceTransformer(model_name)
# 1. Get the internal tokenizer and model
# We need to access the underlying 'transformers' model to modify it
tokenizer = model.tokenizer
transformer_model = model.auto_model
print(f"Original vocabulary size: {len(tokenizer)}")
# 2. Define your new words
# IMPORTANT: Check if your model is 'cased' or 'uncased' first!
# This model ('all-MiniLM-L6-v2') is uncased, so it lowercases everything.
# We only need to add the lowercase version.
# If using a 'cased' model, you might add: ['XIGNCODE', 'xigncode', 'Xigncode']
new_words = ['xigncode']
# 3. Add new tokens to the tokenizer
num_added_toks = tokenizer.add_tokens(new_words)
if num_added_toks > 0:
    print(f"Added {num_added_toks} new tokens.")
    # 4. Resize the model's token embeddings
    # This adds a new, randomly initialized vector for our new token(s)
    transformer_model.resize_token_embeddings(len(tokenizer))
    print(f"New vocabulary size: {len(tokenizer)}")
else:
    print("Tokens already exist in the vocabulary.")
# Test the tokenizer
test_sentence = "This is a test of xigncode and XIGNCODE3."
tokenized = tokenizer.tokenize(test_sentence)
print(f"Tokenization test: {tokenized}")
# Expected output: ['this', 'is', 'a', 'test', 'of', 'xigncode', 'and', 'xigncode', '3', '.']
In the output above, notice two things:
- xigncode was recognized as a single token.
- XIGNCODE3 was tokenized as ['xigncode', '3']. This is the “longest-match” rule in action and is exactly what we want!

Now that our model recognizes “xigncode”, we need to teach it what it means. We do this by fine-tuning it on sentences where the word is used in context.
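Before jumping into training, one optional refinement worth knowing about (a common trick, not something the basic recipe requires): rather than leaving the new embedding row randomly initialized, you can seed it with the average of the embeddings of the subword pieces the original tokenizer produced for the term. A minimal sketch, reusing model_name, tokenizer, and transformer_model from the snippet above:
from transformers import AutoTokenizer
import torch
# Reload an unmodified copy of the tokenizer to recover the old subword pieces
original_tokenizer = AutoTokenizer.from_pretrained(model_name)
with torch.no_grad():
    embedding_layer = transformer_model.get_input_embeddings()
    new_token_id = tokenizer.convert_tokens_to_ids('xigncode')
    # IDs of the pieces the word used to be split into (e.g. 'xig', '##nco', '##de')
    old_piece_ids = original_tokenizer.encode('xigncode', add_special_tokens=False)
    # Start the new token from the mean of those existing vectors instead of noise
    embedding_layer.weight[new_token_id] = embedding_layer.weight[old_piece_ids].mean(dim=0)
Whether or not you seed the new row this way, the fine-tuning step itself looks like this: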
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, losses
# 1. Create training examples
# We need sentences that provide context for our new word.
train_examples = [
    # Give it context: "xigncode" is related to "anti-cheat"
    InputExample(texts=['xigncode is an anti-cheat solution.', 'This software prevents cheating in online games.']),
    # Give it another context:
    InputExample(texts=['Many gamers are familiar with xigncode.', 'It is a well-known security program.'])
    # Add many more examples for better learning...
]
# 2. Setup dataloader and loss
# MultipleNegativesRankingLoss treats the two texts in each example as a positive
# pair and uses the other examples in the batch as negatives, so larger batches
# (with more training data) generally help.
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model=model)
# 3. Fine-tune the model
num_epochs = 1
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)
print("Starting fine-tuning...")
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps,
    output_path='./my-finetuned-model',
    show_progress_bar=True
)
print("Fine-tuning complete. Model saved to './my-finetuned-model'.")
Your new, smarter model is now saved to the ./my-finetuned-model directory. You can load it directly in sentence-transformers or, as you originally asked, in LangChain by simply pointing to the saved path.
from langchain_huggingface import HuggingFaceEmbeddings
# Point to your local, fine-tuned model directory
model_path = "./my-finetuned-model"
# LangChain will load the model and its new tokenizer
embeddings = HuggingFaceEmbeddings(
    model_name=model_path,
    model_kwargs={'device': 'cpu'},  # Use 'cuda' if available
    encode_kwargs={'normalize_embeddings': True}
)
# This text now generates a more accurate embedding
text = "Tell me about the xigncode anti-cheat."
query_embedding = embeddings.embed_query(text)
print("Successfully created embedding with the new word!")
print(f"Vector dimension: {len(query_embedding)}")
Our conversation highlighted a few critical points to remember:
XIGNCODE vs. xigncode (Case Sensitivity)
- Uncased models (like all-MiniLM-L6-v2): Automatically lowercase everything. You only need to add the lowercase xigncode. It will match XIGNCODE, xigncode, and XignCode.
- Cased models (like bert-base-cased): Treat XIGNCODE and xigncode as two different words. If you add XIGNCODE, it will only match that exact capitalization. For cased models, it’s safer to add all common variations you expect to see (the short check after this list shows the difference).

[UNK] vs. Subwords
- Why add the token at all? It stops your term (xigncode) from being split into ['xig', '##nco', '##de']. This splitting loses the unique meaning of the term.
- When would you actually see [UNK]? Almost never in modern models (like BERT, RoBERTa, etc.). These models are subword-based and are designed to split any unknown word into its component pieces. The [UNK] (Unknown) token is mostly a relic of older, word-based models (like Word2Vec) or is only used if the input contains a character that is completely outside the tokenizer’s character set.

By adding the token and fine-tuning, you are telling the model: “Stop splitting this word! Treat it as one thing, and this is what it means.”
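To see the case-sensitivity point (and how rarely [UNK] shows up) for yourself, here is a small check. The bert-base-cased comparison is my own illustration; the post itself only modifies the uncased MiniLM model.
from transformers import AutoTokenizer
# Compare an uncased and a cased tokenizer on the same word
uncased = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
cased = AutoTokenizer.from_pretrained('bert-base-cased')
# Uncased: both spellings are lowercased first, so they produce identical pieces
print(uncased.tokenize('XIGNCODE'))
print(uncased.tokenize('xigncode'))
# Cased: the two spellings are tokenized independently and split differently
print(cased.tokenize('XIGNCODE'))
print(cased.tokenize('xigncode'))
# [UNK] only tends to appear for characters outside the tokenizer's character set
print(uncased.tokenize('\U00013000'))  # an Egyptian hieroglyph: likely ['[UNK]']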