About
CLAP (Contrastive Language-Assembly Pre-training) is a groundbreaking framework that learns binary code representations through natural language supervision. Imagine teaching a child to identify different fruits by providing both pictures and descriptions: here, the binary code is the “fruit,” and the natural language explanations serve as the “descriptions.” This approach not only helps the model understand code better but also strengthens its performance in few-shot and zero-shot scenarios.
Trained with a dataset engine capable of automatically generating 195 million pairs of code snippets and natural-language descriptions, CLAP offers a highly transferable method for binary code analysis. Our goal is to provide an effective tool for researchers and practitioners, and our models are available on the Hugging Face Model Hub.
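For readers who want the intuition behind the pre-training in code, below is a minimal sketch of a CLIP-style contrastive objective of the kind the name CLAP implies. This is illustrative only, not the authors' training code; the function name, batch shapes, and temperature value are assumptions:
import torch
import torch.nn.functional as F

def contrastive_loss(asm_emb, text_emb, temperature=0.07):
    # asm_emb, text_emb: [batch, dim] embeddings of paired assembly/description batches
    asm_emb = F.normalize(asm_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = asm_emb @ text_emb.T / temperature  # pairwise cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)  # matched pairs lie on the diagonal
    # Symmetric cross-entropy pulls matched pairs together and pushes mismatches apart
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2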
News
- CLAP is available on the Hugging Face Model Hub (clap-asm and clap-text).
- The CLAP paper is now available on arXiv.
Quick Start
This section guides you through setting up and using the CLAP model for various tasks, such as fine-grained classification of sorting algorithms, malware, and cryptographic algorithms, all without any further training.
Requirements
- Python 3.6 or higher
- PyTorch
- Transformers library
- A CUDA-enabled GPU is highly recommended for faster processing.
Ensure that you have Python and PyTorch installed on your system. Then, install the Transformers library using pip:
pip install transformers
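To confirm the environment is ready, in particular whether a GPU is visible to PyTorch, a quick sanity check:
import torch
import transformers

# Report installed versions and GPU visibility
print('torch:', torch.__version__)
print('transformers:', transformers.__version__)
print('CUDA available:', torch.cuda.is_available())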
Preparing Tokenizers and Models
Import the necessary libraries and initialize the model and tokenizers:
import torch
from transformers import AutoModel, AutoTokenizer

# Run on the GPU when one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Tokenizers for assembly code and for natural-language prompts
asm_tokenizer = AutoTokenizer.from_pretrained('hustcw/clap-asm', trust_remote_code=True)
text_tokenizer = AutoTokenizer.from_pretrained('hustcw/clap-text', trust_remote_code=True)

# Encoders that map assembly and text into the shared embedding space
asm_encoder = AutoModel.from_pretrained('hustcw/clap-asm', trust_remote_code=True).to(device)
text_encoder = AutoModel.from_pretrained('hustcw/clap-text', trust_remote_code=True).to(device)
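Since the encoders are only used for inference here, it is good practice to switch them to evaluation mode (this assumes the remote-code models behave as standard PyTorch modules, which AutoModel instances do):
# Disable dropout and other training-time behavior for inference
asm_encoder.eval()
text_encoder.eval()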
Example Use Cases
One of the fascinating use cases of CLAP is Fine-Grained Sorting Algorithm Classification (Zero-Shot). Let’s break it down:
- Load your assembly (asm) code dataset. For demonstration, we use a JSON file containing an assembly code snippet for bubble sort.
- Define your classification prompts.
- Encode the assembly code and the prompts, then perform the classification, as the combined snippet below shows:
import json

# Load the demonstration dataset (an assembly snippet for bubble sort)
with open('CaseStudy/bubblesort.json') as fp:
    asm = json.load(fp)

# Natural-language prompts that act as zero-shot class labels
prompts = ['This is a function related to bubble sort',
           'This is a function related to selection sort',
           ...]

# Encode the assembly code
asm_input = asm_tokenizer([asm], padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    asm_embedding = asm_encoder(**asm_input)

# Encode the prompts (padding is required because they differ in length)
text_input = text_tokenizer(prompts, padding=True, return_tensors='pt').to(device)
with torch.no_grad():
    text_embeddings = text_encoder(**text_input)

# Classification: similarity of the assembly embedding to each prompt embedding
logits = torch.einsum('nc,ck->nk', [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
preds = torch.softmax(logits, dim=1).squeeze(0).tolist()

# Output predictions
for i, prompt in enumerate(prompts):
    print(f'Probability: {preds[i]*100:.3f}%, Text: {prompt}')
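If you only want the single best-matching label, a small follow-up using the preds and prompts defined above:
# Index of the highest-probability prompt
best = max(range(len(preds)), key=preds.__getitem__)
print(f'Predicted: {prompts[best]} ({preds[best]*100:.1f}%)')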
Repeat the process for other classification tasks, such as malware classification and cryptographic algorithm identification, by loading the respective dataset and defining the relevant natural-language prompts, as the helper below illustrates.
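Because every task follows the same encode-and-compare pattern, you may want to wrap the steps in a small helper. The sketch below is a hypothetical convenience wrapper, not part of the official CLAP API; it reuses the tokenizers, encoders, and device initialized earlier:
def zero_shot_classify(asm, prompts):
    # Encode one assembly snippet and a list of candidate prompts,
    # then return one probability per prompt (same logic as the snippet above)
    asm_input = asm_tokenizer([asm], padding=True, return_tensors='pt').to(device)
    text_input = text_tokenizer(prompts, padding=True, return_tensors='pt').to(device)
    with torch.no_grad():
        asm_embedding = asm_encoder(**asm_input)
        text_embeddings = text_encoder(**text_input)
    logits = torch.einsum('nc,ck->nk', [asm_embedding.last_hidden_state, text_embeddings.last_hidden_state.T])
    return torch.softmax(logits, dim=1).squeeze(0).tolist()
For example, preds = zero_shot_classify(asm, prompts) reproduces the sorting-algorithm result above, and swapping in a malware dataset and prompts covers that task as well.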
Troubleshooting
If you encounter any issues while integrating or using the CLAP model, here are a few troubleshooting suggestions:
- Make sure all dependencies (Python, PyTorch, and Transformers) are correctly installed in the appropriate versions.
- If inference is slower than expected, verify that the model is actually running on a CUDA-enabled GPU.
- If there are issues with loading models or tokenizers, check your internet connection and ensure that you have access to the Hugging Face Model Hub.
- If you are facing any compatibility issues, consider updating your packages to their latest versions.
For more insights and updates, or to collaborate on AI development projects, stay connected with fxis.ai.
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

