How to Use the Robust Random Cut Forest Algorithm for Anomaly Detection

Sep 28, 2021 | Data Science

The Robust Random Cut Forest (RRCF) algorithm is a powerful tool for detecting anomalies in streaming data. If you want to explore what RRCF can do and use it in your own projects, you’ve come to the right place. This step-by-step guide covers the key concepts, installation, code examples, and troubleshooting tips.

What is RRCF?

The RRCF algorithm is an ensemble method designed for identifying outliers in real-time and high-dimensional data. Some of its notable features include:

  • Capability to handle streaming data efficiently.
  • Robustness against irrelevant dimensions that can skew results.
  • Ability to manage duplicates gracefully.
  • Anomaly scores with a clear statistical basis.

Installation of RRCF

To begin using RRCF, you’ll need to install it using pip. Ensure you have Python 3.x installed, as RRCF is not compatible with earlier versions.

$ pip install rrcf

RRCF depends on NumPy, and the batch example later in this guide also uses pandas.

Building a Robust Random Cut Tree (RRCT)

Now that you have RRCF installed, let’s create a robust random cut tree (RRCT). Think of it as a customized filing cabinet: the RRCT organizes your data so that anomalies are quick to spot.

import numpy as np
import rrcf

# Create data points
X = np.random.randn(100, 2)

# Instantiate a Robust Random Cut Tree
tree = rrcf.RCTree(X)

Here’s how the analogy plays out: just as you would create a section in your cabinet for your essentials, we create a tree structure in RRCF to manage our data efficiently.
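For intuition about what a single "cut" does, here is a minimal pure-Python sketch. It assumes the usual random-cut rule: pick a dimension with probability proportional to its range, then pick a cut position uniformly within that range. The names random_cut and split are hypothetical helpers for illustration, not part of the rrcf API, and this is not the library's internal code.

```python
import random


def random_cut(points, rng=random):
    """Pick one random cut for a list of equal-length numeric tuples.

    Returns (dimension, cut_value). Illustrative only.
    """
    dims = list(range(len(points[0])))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    spans = [hi - lo for lo, hi in zip(lows, highs)]
    if sum(spans) == 0:
        raise ValueError("all points are identical; no cut separates them")
    # Dimensions with a wider spread are proportionally more likely to be cut.
    dim = rng.choices(dims, weights=spans, k=1)[0]
    # The cut position is uniform within that dimension's bounding range.
    cut = rng.uniform(lows[dim], highs[dim])
    return dim, cut


def split(points, dim, cut):
    """Partition points into those at or below the cut and those above it."""
    left = [p for p in points if p[dim] <= cut]
    right = [p for p in points if p[dim] > cut]
    return left, right
```

Applying random_cut and split recursively until every point sits alone would build one tree. Because wide dimensions attract more cuts, a point far from the rest tends to be isolated after only a few cuts.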

Inserting and Deleting Points from the Tree

As your data evolves, you may need to add or remove entries from the tree:

tree = rrcf.RCTree()

# Insert six random points
for i in range(6):
    x = np.random.randn(2)
    tree.insert_point(x, index=i)

# Delete the point that was inserted with index 2
tree.forget_point(2)

Just like you might add or remove folders in your filing cabinet, the flexibility of the RRCF allows for seamless data management.

Understanding Anomaly Scores

RRCF scores how likely a point is to be an anomaly by its collusive displacement (CoDisp). To illustrate this, let’s look at an example:

x = np.random.randn(100, 2)
tree = rrcf.RCTree(x)

# Insert an inlier and an outlier, labeling each with a hashable index
inlier = np.array([0, 0])
outlier = np.array([4, 4])
tree.insert_point(inlier, index='inlier')
tree.insert_point(outlier, index='outlier')

# Compute the CoDisp score of each point (values vary from run to run)
co_disp_inlier = tree.codisp('inlier')  # e.g. 1.75
co_disp_outlier = tree.codisp('outlier')  # e.g. 39.0

The greater the CoDisp value, the more an anomaly stands out, similar to how a bright red folder would be easily spotted in a box of gray files.
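To see where such a score comes from, here is a small sketch of the CoDisp calculation on hand-written numbers. It assumes the common definition: walk from the point's leaf up to the root and, at each level, divide the size of the sibling subtree by the size of the subtree containing the point; the score is the largest ratio found. The function codisp_from_path and both example paths are made up for illustration and are not part of rrcf.

```python
def codisp_from_path(path):
    """Score a point from its path to the root.

    path: list of (subtree_size, sibling_size) pairs, ordered leaf to root.
    """
    if not path:
        raise ValueError("a single-node tree has no ancestors to score against")
    return max(sibling / subtree for subtree, sibling in path)


# A point buried deep inside a cluster of 12: siblings stay small at every level.
inlier_path = [(1, 1), (2, 2), (4, 3), (7, 5)]
# A point cut off at the root: it sits alone and its sibling holds the other 11.
outlier_path = [(1, 11)]

print(codisp_from_path(inlier_path))   # 1.0
print(codisp_from_path(outlier_path))  # 11.0
```

Removing the isolated point would let a large subtree move up the tree, which is why its score is high.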

Batch Anomaly Detection

RRCF can also be utilized for batch anomaly detection. This is like searching through your filing cabinet for irregularities amidst a bulk of papers:

import numpy as np
import pandas as pd
import rrcf

# Set parameters
n = 2010
d = 3
num_trees = 100
tree_size = 256

# Generate data
X = np.zeros((n, d))
X[:1000, 0] = 5
X[1000:2000, 0] = -5
X += 0.01 * np.random.randn(*X.shape)

# Create a forest of trees for batch processing
forest = []
while len(forest) < num_trees:
    ixs = np.random.choice(n, size=(n // tree_size, tree_size), replace=False)
    trees = [rrcf.RCTree(X[ix], index_labels=ix) for ix in ixs]
    forest.extend(trees)

# Average each point's CoDisp over the trees that contain it
avg_codisp = pd.Series(0.0, index=np.arange(n))
index = np.zeros(n)  # number of trees in which each point appears

for tree in forest:
    codisp = pd.Series({leaf: tree.codisp(leaf) for leaf in tree.leaves})
    avg_codisp[codisp.index] += codisp
    np.add.at(index, codisp.index.values, 1)

avg_codisp /= index

This highlights how anomalies can be efficiently detected across a dataset by leveraging multiple trees, similar to having several colleagues examining the same filing cabinet from different angles.
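Once you have an average score per point, you still need a rule for deciding which points to flag. One simple option, sketched below in plain Python, is to flag the highest-scoring fraction of points. The function flag_top_fraction and its default fraction are assumptions made for illustration; the right cutoff depends on your data.

```python
def flag_top_fraction(scores, fraction=0.005):
    """Return the labels with the highest scores.

    scores: mapping of label -> anomaly score.
    fraction: share of points to flag (at least one point is always flagged).
    """
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

With the batch example, you could call flag_top_fraction(avg_codisp.to_dict()). That data set has 10 points near the origin out of 2010, so a fraction of about 0.005 is a reasonable starting guess there.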

Streaming Anomaly Detection

One of the most interesting uses of RRCF is for streaming anomaly detection, which allows you to monitor incoming data in real time:

n = 730
A = 50
center = 100
phi = 30
T = 2 * np.pi / 100
t = np.arange(n)
sin = A * np.sin(T * t - phi) + center
sin[235:255] = 80  # Injecting anomalies

# Parameters
num_trees = 40
shingle_size = 4
tree_size = 256
forest = [rrcf.RCTree() for _ in range(num_trees)]
# More code to process the data in real time...

RRCF acts like a live watchtower, scanning the incoming data stream for unusual behavior.
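One way the processing loop could look is sketched below. It is a sketch under stated assumptions, not the original tutorial code. It relies only on the tree methods already used in this guide (insert_point, forget_point, codisp, and the leaves mapping). The helper names shingle and stream_scores are written here for illustration; the rrcf package also ships its own shingle helper, which you would normally use instead.

```python
from collections import deque


def shingle(sequence, size):
    """Yield overlapping windows of `size` consecutive values as tuples."""
    window = deque(maxlen=size)
    for value in sequence:
        window.append(value)
        if len(window) == size:
            yield tuple(window)


def stream_scores(forest, points, tree_size):
    """Insert points one at a time and return {index: average CoDisp}."""
    avg_codisp = {}
    for index, point in enumerate(points):
        total = 0.0
        for tree in forest:
            # Keep each tree at a fixed size by dropping its oldest point.
            if len(tree.leaves) >= tree_size:
                tree.forget_point(index - tree_size)
            tree.insert_point(point, index=index)
            # Score the new point right after it enters the tree.
            total += tree.codisp(index)
        avg_codisp[index] = total / len(forest)
    return avg_codisp
```

With the variables defined above, a call would look like stream_scores(forest, shingle(sin, shingle_size), tree_size). Shingling turns each run of consecutive readings into one multi-dimensional point, so the trees can react to a change in the shape of the signal and not only to single extreme values.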

Troubleshooting Tips

If you face issues during installation or usage, here are some troubleshooting suggestions:

  • Ensure all dependencies are installed as required.
  • Check your Python version; only Python 3 is supported.
  • Refer to the documentation for guidance on specific functions and usage.
  • If you encounter a problem or have ideas for improvements, feel free to raise an issue in the repository.
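The first two checks in this list can be scripted. The sketch below uses only the standard library (importlib.metadata, available from Python 3.8). The function name check_environment is hypothetical, and the default package list is an assumption based on the imports used in this guide.

```python
import sys
from importlib import metadata


def check_environment(packages=("rrcf", "numpy", "pandas")):
    """Report the installed version of each package, or None if it is missing."""
    report = {"python_ok": sys.version_info.major >= 3}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report
```

Printing check_environment() shows at a glance which package, if any, still needs a pip install.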


Conclusion

The Robust Random Cut Forest algorithm is an agile, efficient approach for anomaly detection in streaming data. With its notable features like handling high-dimensional data and providing meaningful statistical assessments, RRCF stands out in the realm of data analysis. Following the steps in this guide will set you on the right path to harness its potential.
