Welcome to the world of sophisticated topic modeling! Here, we will delve into the wonders of CorEx, specifically focusing on how you can integrate your domain knowledge using anchored words to derive meaningful topics from your documents. Let’s get started!
What is Anchored CorEx?
Correlation Explanation (CorEx) is a topic modeling technique suited to sparse binary data, and it can run either unsupervised or semi-supervised. In the semi-supervised mode, users steer the topics with anchor words, so the model's output lines up with their domain knowledge. Think of it as guiding your pet cat (the model) to play only with the red ball (the specific topic) instead of all the balls in the room.
Getting Started with Installation
To begin your journey, install the CorEx topic model package, which is available via pip:
pip install corextopic
Running the CorEx Topic Model
Once installed, running the CorEx topic model is straightforward! You’ll work with a document-word matrix and simply follow the fit-transform conventions from scikit-learn.
Example Code
Imagine your document-word matrix is a recipe where each row represents a different dish and each column represents an ingredient. Now, let’s set up our ingredients and dishes:
import numpy as np
import scipy.sparse as ss
from corextopic import corextopic as ct
# Define your document-word matrix
X = np.array([[0, 0, 0, 1, 1],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 1, 1]], dtype=int)
# Sparse matrices can also be used
X = ss.csr_matrix(X)
# Define word labels
words = ["dog", "cat", "fish", "apple", "orange"]
# Define document labels
docs = ["fruit doc", "animal doc", "mixed doc"]
# Create and fit the model
topic_model = ct.Corex(n_hidden=2) # Number of topics
topic_model.fit(X, words=words, docs=docs)
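In practice the matrix usually comes from raw text rather than being typed by hand. The following is a minimal sketch of one way to build a binary document-word matrix, assuming simple whitespace tokenization. The helper name `build_binary_matrix` and the sample documents are made up for illustration and are not part of corextopic.

```python
def build_binary_matrix(raw_docs):
    """Return (matrix, vocabulary); matrix[i][j] is 1 if word j occurs in doc i."""
    # Lowercase and split on whitespace; a real pipeline would tokenize more carefully
    tokenized = [set(doc.lower().split()) for doc in raw_docs]
    # Sort the vocabulary so that column order is reproducible
    vocabulary = sorted(set().union(*tokenized))
    matrix = [[1 if word in tokens else 0 for word in vocabulary]
              for tokens in tokenized]
    return matrix, vocabulary


# Hypothetical documents mirroring the fruit/animal example
raw_docs = ["apple orange", "dog cat fish", "dog cat fish apple orange"]
matrix, vocabulary = build_binary_matrix(raw_docs)
print(vocabulary)
for row in matrix:
    print(row)
```

The resulting nested list can be passed to `np.array(...)` and the vocabulary used as the `words` argument. Note that the column order here is alphabetical, so it differs from the hand-written example.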
Extracting Topics
Once your model is fit, you can easily get the topics identified:
topics = topic_model.get_topics()
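for topic_n, topic in enumerate(topics):
    # Each entry is (word, mutual information, sign); mark negatively signed words with "~"
    topic = [(w, mi, s) if s > 0 else ("~" + w, mi, s) for w, mi, s in topic]
    # Unpack into separate tuples; use a new name so the vocabulary list `words` is not overwritten
    topic_words, mis, signs = zip(*topic)
    topic_str = str(topic_n + 1) + ": " + ", ".join(topic_words)
    print(topic_str)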
Semi-Supervised Topic Modeling Using Anchor Words
In this section, you will learn how to anchor words to specific topics, giving your model a nudge in the direction of your expertise. Here’s an example:
topic_model.fit(X, words=words, anchors=[["dog", "cat"], ["apple"]], anchor_strength=2)
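In this code, “dog” and “cat” are anchored to the first topic, and “apple” is anchored to the second. The anchor_strength parameter controls how much more weight the model gives to anchor words than to ordinary words; it should be set above 1, and larger values pull the topic more strongly toward its anchors.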
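Anchor words only help if they actually appear in your vocabulary. As a rough sketch, a small check like the one below can flag anchors that are missing before you fit. The helper `check_anchors` and the word "banana" are hypothetical and used only for illustration.

```python
def check_anchors(anchors, vocabulary):
    """Split anchor groups into usable words and words missing from the vocabulary."""
    known = set(vocabulary)
    usable, missing = [], []
    for group in anchors:
        # Allow a bare string as shorthand for a one-word group
        group = [group] if isinstance(group, str) else list(group)
        # Keep one (possibly empty) list per topic so topic positions do not shift
        usable.append([w for w in group if w in known])
        missing.extend(w for w in group if w not in known)
    return usable, missing


vocabulary = ["dog", "cat", "fish", "apple", "orange"]
anchors = [["dog", "cat"], ["apple", "banana"]]  # "banana" is not in the vocabulary
usable, missing = check_anchors(anchors, vocabulary)
print("usable anchors:", usable)
print("missing words:", missing)
```

If a group ends up empty, no usable anchor remained for that topic, so decide whether to pick different anchor words or leave the topic unanchored before fitting.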
Choosing the Right Number of Topics
Determining how many topics to create can be tricky. Monitor the *total correlation* (TC) values as you add topics. Keep adding topics until the TC plateaus, which indicates that additional topics contribute little new information.
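The plateau idea can be turned into a simple rule of thumb. The sketch below assumes you already have one TC value per topic (the corextopic README describes these as available on a fitted model via `topic_model.tcs`; verify this against your installed version). The function name, the 5% threshold, and the example numbers are all made up for illustration.

```python
def suggest_n_topics(topic_tcs, min_share=0.05):
    """Count the topics that each explain at least `min_share` of the summed TC."""
    total = sum(topic_tcs)
    if total <= 0:
        return 0
    kept = 0
    # Walk from the most to the least informative topic and stop at the plateau
    for tc in sorted(topic_tcs, reverse=True):
        if tc / total < min_share:
            break
        kept += 1
    return kept


# Hypothetical per-topic TC values from a model fit with six topics
example_tcs = [2.4, 1.9, 1.1, 0.12, 0.05, 0.03]
print(suggest_n_topics(example_tcs))
```

Here the last three topics each contribute well under 5% of the total, which suggests that three topics capture most of the structure. The threshold is a judgment call, so it is worth inspecting the values directly as well.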
Visualizing Your Results
The package also includes a vis_topic module for inspecting a fitted model. The call below writes its summary output under the name given by prefix:
from corextopic import vis_topic as vt
vt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')
Troubleshooting and Tips
While using Anchored CorEx, you might run into some issues. Here are troubleshooting tips for a smoother experience:
- Check your document-word matrix for missing entries. Empty rows or columns may lead to errors.
- Ensure your anchor words are relevant and that you’ve set the anchor_strength appropriately.
- Experiment with different initializations and monitor TC values to achieve better model fit.
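The first tip can be checked mechanically. The following is a minimal sketch that works on a nested list (for the sparse matrix used earlier, something like `X.toarray().tolist()` would produce one). The helper name and the sample matrix are hypothetical.

```python
def find_empty_rows_and_columns(matrix):
    """Return indices of all-zero rows (documents) and all-zero columns (words)."""
    empty_rows = [i for i, row in enumerate(matrix) if not any(row)]
    n_cols = len(matrix[0]) if matrix else 0
    empty_cols = [j for j in range(n_cols)
                  if not any(row[j] for row in matrix)]
    return empty_rows, empty_cols


# Hypothetical matrix: the third document is empty, and two words never occur
sample = [[0, 0, 0, 1, 0],
          [1, 1, 0, 0, 0],
          [0, 0, 0, 0, 0]]
empty_rows, empty_cols = find_empty_rows_and_columns(sample)
print("empty documents:", empty_rows)
print("unused words:", empty_cols)
```

Dropping the reported rows and columns (and the matching entries in your word and document labels) before fitting keeps the labels aligned with the matrix.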

