How to Use LLM4Decompile: The Ultimate Guide to Decompiling x86 Assembly to C

Jun 22, 2024 | Educational

This guide shows how to use LLM4Decompile, a large language model that translates x86 assembly instructions back into C code. The V1.5 series was trained on a dataset of 15 billion tokens with a maximum token length of 4,096, which improves on earlier releases. Let's get started!

1. Understanding LLM4Decompile

LLM4Decompile decompiles x86 assembly instructions into C code. The V1.5 release reports up to a 100% improvement over previous versions. See the GitHub repository for more details.

2. Evaluation Results

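Here's a snapshot of how the models perform on two benchmarks at each GCC optimization level. Scores are re-executability rates (the fraction of decompiled functions that compile and pass their test cases), so higher is better: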

Model                    HumanEval-Decompile                     ExeBench
                         O0      O1      O2      O3      AVG     O0      O1      O2      O3      AVG
DeepSeek-Coder-6.7B      0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
GPT-4o                   0.3049  0.1159  0.1037  0.1159  0.1601  0.0443  0.0328  0.0397  0.0343  0.0378
LLM4Decompile-End-1.3B   0.4720  0.2061  0.2122  0.2024  0.2732  0.1786  0.1362  0.1320  0.1328  0.1449
LLM4Decompile-End-6.7B   0.6805  0.3951  0.3671  0.3720  0.4537  0.2289  0.1660  0.1618  0.1625  0.1798
LLM4Decompile-End-33B    0.5168  0.2956  0.2815  0.2675  0.3404  0.1886  0.1465  0.1396  0.1411  0.1540

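In this table, LLM4Decompile-End-6.7B scores highest in every column, averaging 0.4537 on HumanEval-Decompile and 0.1798 on ExeBench, compared with 0.1601 and 0.0378 for GPT-4o. Every model that scores above zero does best at O0 and worse at the higher optimization levels.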
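
To read the AVG column, it helps to check how it relates to the per-level scores. The sketch below assumes AVG is the plain mean of O0 through O3 (the article does not state this) and compares that mean with the reported HumanEval-Decompile averages. The dictionary layout and the name `humaneval_scores` are illustrative only.

```python
# HumanEval-Decompile scores copied from the table: ([O0, O1, O2, O3], reported AVG)
humaneval_scores = {
    "GPT-4o": ([0.3049, 0.1159, 0.1037, 0.1159], 0.1601),
    "LLM4Decompile-End-1.3B": ([0.4720, 0.2061, 0.2122, 0.2024], 0.2732),
    "LLM4Decompile-End-6.7B": ([0.6805, 0.3951, 0.3671, 0.3720], 0.4537),
    "LLM4Decompile-End-33B": ([0.5168, 0.2956, 0.2815, 0.2675], 0.3404),
}


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Assumption: AVG is the unweighted mean of the four optimization levels.
for name, (levels, reported_avg) in humaneval_scores.items():
    print(f"{name}: computed mean = {mean(levels):.4f}, reported AVG = {reported_avg:.4f}")

# Model with the highest reported average
best_model = max(humaneval_scores, key=lambda name: humaneval_scores[name][1])
print("Highest reported AVG:", best_model)
```

Under that assumption, the computed means agree with the reported values to within rounding.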

3. How to Use LLM4Decompile

Here’s a step-by-step tutorial on how to decompile with LLM4Decompile. In this exercise, we’ll reference a function named func0. Be sure to replace this placeholder with the actual function name you intend to decompile.

Preprocessing Steps

Compile the C code into binary.
Disassemble the binary into assembly instructions.

Here’s how you can achieve this using Python:

import subprocess

OPT = ["O0", "O1", "O2", "O3"]  # GCC optimization levels to compile with
fileName = "samples/sample"  # path to the C file, without the .c extension
func_name = "func0"  # IMPORTANT: replace `func0` with your function name
label = f"<{func_name}>:"  # objdump prints each function label as <name>:

for opt_state in OPT:
    output_file = fileName + "_" + opt_state
    input_file = fileName + ".c"
    # Compile the code with GCC on Linux
    compile_command = f"gcc -o {output_file}.o {input_file} -{opt_state} -lm"
    subprocess.run(compile_command, shell=True, check=True)
    # Disassemble the binary file into assembly instructions
    disassemble_command = f"objdump -d {output_file}.o > {output_file}.s"
    subprocess.run(disassemble_command, shell=True, check=True)

    with open(output_file + ".s") as f:
        asm = f.read()
    if label not in asm:
        raise ValueError(f"Compile fails: {label} not found in {output_file}.s")
    # Keep only the target function: from its label up to the next blank line
    asm = label + asm.split(label)[-1].split("\n\n")[0]
    asm_clean = ""
    for tmp in asm.split("\n"):
        fields = tmp.split("\t")
        if len(fields) < 3 and "00" in tmp:
            continue  # skip continuation lines that hold only leftover bytes
        idx = min(len(fields) - 1, 2)
        tmp_asm = "\t".join(fields[idx:])  # remove the address and binary code
        tmp_asm = tmp_asm.split("#")[0].strip()  # remove comments
        asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()

    before = "# This is the assembly code:\n"  # prompt prefix
    after = "\n# What is the source code?\n"  # prompt suffix
    input_asm_prompt = before + input_asm + after
    with open(fileName + "_" + opt_state + ".asm", "w", encoding="utf-8") as f:
        f.write(input_asm_prompt)
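
To see what the cleaning loop does without running gcc or objdump, the sketch below applies the same logic to a small string. The function name `extract_function_asm` is hypothetical, and the sample disassembly is hand-written to resemble `objdump -d` output; it is not real tool output.

```python
def extract_function_asm(objdump_text, func_name):
    """Return the cleaned instructions of one function from objdump-style text."""
    label = f"<{func_name}>:"
    if label not in objdump_text:
        raise ValueError(f"{label} not found in disassembly")
    # Keep the text from the function label up to the next blank line
    body = label + objdump_text.split(label)[-1].split("\n\n")[0]
    cleaned = []
    for line in body.split("\n"):
        fields = line.split("\t")
        if len(fields) < 3 and "00" in line:
            continue  # continuation line that only holds leftover bytes
        idx = min(len(fields) - 1, 2)
        instruction = "\t".join(fields[idx:])  # drop address and raw bytes
        instruction = instruction.split("#")[0].strip()  # drop objdump comments
        cleaned.append(instruction)
    return "\n".join(cleaned).strip()


# Hand-written sample in objdump's tab-separated layout (illustrative only)
sample_disassembly = (
    "0000000000001129 <func0>:\n"
    "    1129:\t55                   \tpush   %rbp\n"
    "    112a:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "    112d:\t48 c7 45 f8 00 00 00 \tmovq   $0x0,-0x8(%rbp)\n"
    "    1134:\t00 \n"
    "    1135:\t8b 05 d5 2e 00 00    \tmov    0x2ed5(%rip),%eax        # 4010 <g>\n"
    "    113b:\t5d                   \tpop    %rbp\n"
    "    113c:\tc3                   \tret\n"
    "\n"
    "000000000000113d <main>:\n"
    "    113d:\t55                   \tpush   %rbp\n"
)

cleaned_asm = extract_function_asm(sample_disassembly, "func0")
print(cleaned_asm)
```

The result keeps the `<func0>:` label and one instruction per line, with addresses, raw bytes, and the trailing `# 4010 <g>` comment removed. Note that the heuristic only skips byte-overflow lines that contain `00`, so other continuation lines could remain in the output.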

Decompilation Steps

Use LLM4Decompile to translate assembly instructions back into C code.

Here's how you can do this:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Same values as in the preprocessing script
OPT = ["O0", "O1", "O2", "O3"]
fileName = "samples/sample"

model_path = "LLM4Binary/llm4decompile-6.7b-v1.5"  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
# .cuda() moves the model to the GPU, so a CUDA-capable device is required
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

with open(fileName + "_" + OPT[0] + ".asm", "r") as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4000)
# Drop the prompt tokens and the final token (normally end-of-sequence), then decode
prompt_len = inputs["input_ids"].shape[1]
c_func_decompile = tokenizer.decode(outputs[0][prompt_len:-1])

with open(fileName + ".c", "r") as f:  # original file
    func = f.read()
print(f"Original function:\n{func}")
# We only decompile one function, whereas the original file may contain multiple functions
print(f"Decompiled function:\n{c_func_decompile}")

4. Troubleshooting

While using LLM4Decompile, you may encounter some hiccups. Here are a few troubleshooting tips:

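Compile Fails: The preprocessing script raises this ValueError when it cannot find your function in the disassembly. Make sure you have replaced every occurrence of func0 with your actual function name.
Output Issues: If the decompiled C code does not match expectations, open the generated .asm file and confirm that it contains only the cleaned instructions for your function before running the model.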
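
The check on the assembly output can be partly automated. The sketch below is one possible sanity check for the text of a generated .asm prompt file; the function name `find_prompt_problems` and the specific checks are assumptions for illustration, not part of LLM4Decompile.

```python
import re

# A line made up only of hex byte pairs, e.g. "33 22 11", is probably leftover machine code
HEX_BYTES_ONLY = re.compile(r"^[0-9a-f]{2}( [0-9a-f]{2})*$")


def find_prompt_problems(prompt, func_name):
    """Return a list of human-readable problems found in a prompt string."""
    problems = []
    if "# This is the assembly code:" not in prompt:
        problems.append("missing '# This is the assembly code:' header")
    if "# What is the source code?" not in prompt:
        problems.append("missing '# What is the source code?' question")
    if func_name not in prompt:
        problems.append(f"function name {func_name!r} not found")
    # Everything that is not a '#' prompt line counts as assembly
    body = [ln.strip() for ln in prompt.split("\n") if ln.strip() and not ln.startswith("#")]
    if len(body) < 2:
        problems.append("no instructions found after the function label")
    leftovers = [ln for ln in body if HEX_BYTES_ONLY.match(ln)]
    if leftovers:
        problems.append(f"{len(leftovers)} line(s) look like raw bytes, not instructions")
    return problems


# Hand-written example prompts (illustrative only)
good_prompt = "# This is the assembly code:\n<func0>:\npush   %rbp\nret\n# What is the source code?\n"
bad_prompt = "# This is the assembly code:\n<func0>:\n33 22 11\nret\n# What is the source code?\n"

print(find_prompt_problems(good_prompt, "func0"))
print(find_prompt_problems(bad_prompt, "func0"))
```

To use it on a real file, read the .asm file into a string and pass it in along with your function name. An empty list means none of these particular checks failed; it does not guarantee the disassembly is correct.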


