CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision

Mar 4, 2024 | Educational


About

CLAP (Contrastive Language-Assembly Pre-training) is a framework that learns binary code representations through natural language supervision. Imagine teaching a child to identify different types of fruit by providing both pictures and descriptions: here, the binary code is the "fruit," and the natural language explanations serve as the "descriptions." Pairing the two not only helps the model understand the code better but also boosts its performance in few-shot and zero-shot scenarios.

Using a dataset engine capable of automatically generating 195 million pairs of code snippets and natural-language descriptions, CLAP learns representations that transfer well across binary code analysis tasks. Our goal is to provide an effective tool for researchers and practitioners, with our models available on the Hugging Face Model Hub.
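Under the hood, this pairing is trained with a CLIP-style contrastive objective: matched assembly/description pairs are pulled together in embedding space while mismatched pairs are pushed apart. The snippet below is a minimal illustrative sketch of such a symmetric InfoNCE loss, not the authors' actual training code; the temperature value is a common CLIP default, assumed here:

import torch
import torch.nn.functional as F

def clip_style_contrastive_loss(asm_emb, text_emb, temperature=0.07):
    # asm_emb, text_emb: [batch, dim] tensors; row i of each side is a matched pair.
    asm_emb = F.normalize(asm_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [batch, batch] cosine-similarity matrix; the diagonal holds the true pairs
    logits = asm_emb @ text_emb.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: assembly -> text and text -> assembly
    loss_a = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_a + loss_t) / 2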


Quick Start

This section will guide you through setting up and using the CLAP model for various tasks, such as fine-grained classification of sorting algorithms, malware, and cryptographic algorithms, all without any further training.

Requirements

Ensure that you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:

pip install transformers
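If PyTorch is not installed yet, it can be added in the same command (this pulls the default build; see pytorch.org for a CUDA-specific build):

pip install torch transformers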

Preparing Tokenizers and Models

Import the necessary libraries and initialize the model and tokenizers:

import torch
from transformers import AutoModel, AutoTokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
asm_tokenizer = AutoTokenizer.from_pretrained('hustcw/clap-asm', trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained('hustcw/clap-text', trust_remote_code=True)
asm_encoder = AutoModel.from_pretrained('hustcw/clap-asm', trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained('hustcw/clap-text', trust_remote_code=True).to(device)
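As a quick sanity check, you can embed a couple of prompts and inspect the output shape. This is a minimal sketch; it assumes each CLAP encoder returns one embedding vector per input, which matches how the embeddings are used in the example below:

# Smoke test: embed two short prompts and check the output shape
with torch.no_grad():
    test_input = text_tokenizer(['a sorting function', 'an encryption routine'],
                                padding=True, return_tensors='pt').to(device)
    test_emb = text_encoder(**test_input)
print(test_emb.shape)  # expected: (2, embedding_dim)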

Example Use Cases

One of the fascinating use cases of CLAP is Fine-Grained Sorting Algorithm Classification (Zero-Shot). Let’s break it down:

  1. Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing an assembly snippet of bubble sort:

     import json

     with open('CaseStudy/bubblesort.json') as fp:
         asm = json.load(fp)

  2. Define your classification prompts:

     prompts = ['This is a function related to bubble sort',
                'This is a function related to selection sort',
                ...]

  3. Encode the assembly code and the prompts, then perform the zero-shot classification:

     # Encode the assembly code
     asm_input = asm_tokenizer([asm], padding=True, return_tensors='pt').to(device)

     # Encode the prompts; padding is required because they differ in length
     text_input = text_tokenizer(prompts, padding=True, return_tensors='pt').to(device)

     # Each CLAP encoder returns one embedding per input, shaped [batch, dim]
     with torch.no_grad():
         asm_embedding = asm_encoder(**asm_input)
         text_embeddings = text_encoder(**text_input)

     # Similarity of the assembly embedding to every prompt embedding
     logits = torch.einsum('nc,ck->nk', [asm_embedding, text_embeddings.T])
     preds = torch.softmax(logits, dim=1).squeeze(0).tolist()

     # Output predictions
     for i, prompt in enumerate(prompts):
         print(f'Probability: {preds[i]*100:.3f}%, Text: {prompt}')

Repeat the process for other classification tasks such as malware classification and cryptographic algorithm identification by loading the respective datasets and defining the relevant natural language prompts.
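To repeat these steps without copy-pasting, the example can be wrapped in a small helper. This is a sketch under the same assumptions as above; zero_shot_classify is a name introduced here for illustration, not part of the CLAP API:

def zero_shot_classify(asm_snippet, prompts):
    """Score one assembly snippet against natural-language prompts.

    Returns (prompt, probability) pairs, most likely first.
    """
    asm_input = asm_tokenizer([asm_snippet], padding=True, return_tensors='pt').to(device)
    text_input = text_tokenizer(prompts, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        asm_embedding = asm_encoder(**asm_input)
        text_embeddings = text_encoder(**text_input)
    logits = torch.einsum('nc,ck->nk', [asm_embedding, text_embeddings.T])
    probs = torch.softmax(logits, dim=1).squeeze(0).tolist()
    return sorted(zip(prompts, probs), key=lambda pair: pair[1], reverse=True)

For malware or cryptographic-algorithm identification, load the corresponding case-study dataset and pass task-specific prompts to the same function.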

Troubleshooting

If you encounter any issues while integrating or using the CLAP model, here are a few troubleshooting suggestions:

  • Make sure all dependencies (Python, PyTorch, and Transformers) are correctly installed in the appropriate versions.
  • If inference is slow, verify that a CUDA-enabled GPU is available and that the models and inputs were moved to it.
  • If there are issues with loading models or tokenizers, check your internet connection and ensure that you have access to the Hugging Face Model Hub.
  • If you are facing any compatibility issues, consider updating your packages to their latest versions.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
