Getting Started with Chinese Text Similarity Using CoSENT

Jun 9, 2024 | Educational

Measuring textual similarity is a core tool in natural language processing (NLP). In this blog, we’re using a model trained with the CoSENT framework to score the similarity between Chinese texts. Let’s explore the steps involved: downloading the model, comparing sentence similarity, and building a simple retrieval-augmented generation (RAG) pipeline with LangChain!

Overview

This model is designed specifically for language understanding of Chinese texts and leverages the **CoSENT** training framework for the **Retrieval-Augmented Generation (RAG)** task. Below, we’ll guide you step-by-step on how to set it up and utilize it effectively.

Download the Model

To get started, load the tokenizer and model from the Hugging Face Hub. The first call downloads the files and caches them locally:

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Mike0307/text2vec-base-chinese-rag")
model = AutoModel.from_pretrained("Mike0307/text2vec-base-chinese-rag")

Example of Similarity Comparison

Now that you have the model ready, let’s write a function to compare the similarity between two sentences. Think of this process like comparing two different paintings to see how similar they are in style and color!

import torch

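def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the attention mask so padding tokens contribute zero to the sum
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    # Sum the real tokens and divide by their count (clamped to avoid division by zero)
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)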

sentences = [ 
    "福井舞所屬哪家唱片公司？",
    "23歲時出道、血型A型的福井舞是出身於京都的日本女創作歌手，所屬唱片公司為J-more。2004年，與WADAGAKI、SHINO組合地下音樂隊Poplar，發表了兩張專輯，天照和夢死物語。在2006年時退出，2007年10月加入了Avex獨立發展。"
]

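# Tokenize both sentences as one padded batch
encode_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=512)
# Inference only, so gradient tracking is not needed
with torch.no_grad():
    model_output = model(**encode_output)
embeddings = mean_pooling(model_output, encode_output['attention_mask'])
similarity = torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(similarity)  # Scalar tensor; higher values mean more similar sentences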

In this snippet, mean pooling averages the token embeddings of each sentence (ignoring padding) into a single vector, and cosine similarity then compares the two vectors. Just like analyzing two musical compositions, we examine their harmony or dissonance!
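
To make the pooling step concrete, here is a small pure-Python sketch of the same idea on toy numbers. The helper names `masked_mean` and `cosine_similarity` and the 2-dimensional "embeddings" are made up for illustration; the real model produces much larger vectors and the PyTorch code above remains the version to use in practice.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask value is 1 (skip padding)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1e-9)  # same guard against division by zero as torch.clamp
    return [value / count for value in total]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy token embeddings; the last row of the first sentence is padding
short_sentence = masked_mean([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]], [1, 1, 0])
long_sentence = masked_mean([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]], [1, 1, 1])

print(short_sentence)  # [0.5, 0.5] because the padding row is ignored
print(cosine_similarity(short_sentence, long_sentence))
```

Because the padded row is masked out, it has no effect on the sentence vector, which is exactly why the attention mask is passed to the pooling function.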

RAG with LangChain

Integrating retrieval-augmented generation adds another layer of functionality to our model. Install the necessary packages using:

pip install --upgrade --quiet langchain langchain-community sentence-transformers faiss-cpu

1. Use This Embedding Model to Build a Retriever

Here’s how to create a retriever using the model you downloaded:

from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings

model_name = "Mike0307/text2vec-base-chinese-rag"
embeddings = HuggingFaceEmbeddings(model_name=model_name)

An Example of a Retriever with Meta FAISS Vectorstore

Here’s a snippet that demonstrates how to use the embeddings with a FAISS vector store:

from langchain.schema import Document
from langchain_community.vectorstores.faiss import FAISS

documents = [
    Document(page_content="埃及聖䴉（學名：Threskiornis aethiopicus）..."),
    Document(page_content="隨著科技的不斷發展和革新...")
]
db = FAISS.from_documents(documents, embeddings)
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("福井舞所屬哪家唱片公司？")  # Returns a list with the single closest document (k=1)
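
Conceptually, the retriever embeds the query and returns the stored documents whose vectors are nearest to it. The sketch below shows that idea as a brute-force search in pure Python. It assumes squared Euclidean distance as the metric, and the function name `top_k` and the toy vectors are hypothetical; a real FAISS index does the same job far more efficiently.

```python
def top_k(query_vector, document_vectors, k=1):
    """Return the indices of the k document vectors closest to the query."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Rank every document by its distance to the query (smallest first)
    ranked = sorted(
        range(len(document_vectors)),
        key=lambda i: squared_distance(query_vector, document_vectors[i]),
    )
    return ranked[:k]

# Toy 2-dimensional "embeddings" for two documents and one query
document_vectors = [[0.0, 1.0], [0.9, 0.1]]
closest = top_k([1.0, 0.0], document_vectors, k=1)
print(closest)  # [1]
```

With `k=1`, the search always returns the closest document, even when nothing in the store is truly relevant to the query.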

2. Use HuggingFace LLM as Customized LangChain LLM

Sometimes, you won’t want to rely on external APIs. Here’s how to set up a HuggingFace LLM locally (the `device_map="mps"` setting below targets Apple Silicon, so change it for other hardware):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

llm_id = "Mike0307/Phi-3-mini-4k-instruct-chinese-lora"
model = AutoModelForCausalLM.from_pretrained(llm_id, device_map="mps", torch_dtype=torch.float32)

tokenizer = AutoTokenizer.from_pretrained(llm_id)

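With the model and tokenizer loaded, you can wrap them in a custom LangChain LLM class (called `CustomLLM` in the next step). Such a class subclasses LangChain’s `LLM` base class and implements `_call`, which takes the prompt string and returns the generated text. Like a tailored outfit, the wrapper should fit your own model and generation settings.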
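
Whatever shape the custom LLM class takes, its core job is to turn a prompt string into generated text. Here is a minimal sketch of that step, assuming `model` and `tokenizer` behave like the Hugging Face objects loaded above. The function name `generate_text` and the default `max_new_tokens=256` are hypothetical choices, not part of the original setup.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Generate a completion for `prompt` and return only the newly generated text.

    `model` and `tokenizer` are assumed to follow the Hugging Face interface:
    tokenizer(...) returns the model inputs and model.generate(...) returns token ids.
    """
    # Tokenize the prompt and move the inputs to the device the model lives on
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    prompt_length = len(inputs["input_ids"][0])

    # generate() returns the prompt tokens followed by the new tokens
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the answer is decoded
    new_ids = output_ids[0][prompt_length:]
    return tokenizer.decode(new_ids, skip_special_tokens=True)
```

In a LangChain `LLM` subclass, a helper like this would typically be called from the `_call` method, with the model and tokenizer stored on the instance.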

3. Make a Simple RAG Chain

Finally, let’s use a prompt and build a simple RAG chain.

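import langchain
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough

langchain.debug = True  # Print the intermediate steps of the chain

prompt = PromptTemplate.from_template(template="<|user|>{documents}\n{question} <|end|>\n<|assistant|>")
llm = CustomLLM(model, tokenizer)  # Your custom LangChain LLM class from step 2
# The query is passed through as "question" and sent to the retriever for "documents"
rag = { "question": RunnablePassthrough(), "documents": retriever } | prompt | llm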

query = "埃及聖䴉是什麼？"
result = rag.invoke(query)  # Outputs answer
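
To see what the chain does with a query, the data flow can be mimicked with plain Python functions. This is only an illustrative sketch: `build_prompt`, `run_rag`, `fake_retrieve`, and `echo_llm` are hypothetical stand-ins, and joining the retrieved passages with newlines is an assumption about how you might want the documents formatted before they reach the prompt.

```python
TEMPLATE = "<|user|>{documents}\n{question} <|end|>\n<|assistant|>"

def build_prompt(question, retrieved_texts):
    """Fill the template with the retrieved passages and the user question."""
    documents = "\n".join(retrieved_texts)
    return TEMPLATE.format(documents=documents, question=question)

def run_rag(question, retrieve, generate):
    """Mimic the chain: retrieve passages, build the prompt, then generate."""
    retrieved_texts = retrieve(question)
    prompt = build_prompt(question, retrieved_texts)
    return generate(prompt)

# Stand-ins for the real retriever and LLM (for illustration only)
def fake_retrieve(question):
    return ["埃及聖䴉（學名：Threskiornis aethiopicus）..."]

def echo_llm(prompt):
    return prompt  # a real LLM would return an answer instead of the prompt

# With the echo stand-in, the result is simply the fully formatted prompt
prompt_preview = run_rag("埃及聖䴉是什麼？", fake_retrieve, echo_llm)
```

Inspecting the formatted prompt this way is a cheap check that the retrieved text and the question land where the model's chat template expects them.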

Troubleshooting

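If model downloads fail, check that your internet connection is stable and that the Hugging Face Hub is reachable.
If imports fail, check that the dependencies (such as transformers, torch, langchain, and langchain-community) are installed in the Python environment you are actually running.
If you hit device errors, adjust the device map: the "mps" value used above only works on Apple Silicon, so switch to "cuda" or "cpu" on other hardware.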
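
One way to avoid hard-coding a device is to choose it from what is available at runtime. The helper below is a small sketch of that idea; `pick_device` is a hypothetical name, and the availability flags are passed in as plain booleans so the logic stays independent of any particular setup.

```python
def pick_device(cuda_available, mps_available):
    """Return a device string, preferring GPU backends over the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# With PyTorch installed, the flags would come from
# torch.cuda.is_available() and torch.backends.mps.is_available()
device = pick_device(cuda_available=False, mps_available=False)
print(device)  # cpu
```

The returned string can then be passed as the `device_map` value when loading the model.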


Conclusion

By following these steps, you can use a CoSENT-trained embedding model to measure Chinese text similarity and plug it into a simple RAG pipeline. Remember, this is just the tip of the iceberg, and there are many other methodologies to explore!

