Welcome to our guide to LLM4Decompile, a model that translates x86 assembly instructions back into C code. The V1.5 series features a 15-billion-token dataset and a maximum token length of 4,096, bringing improved performance. Let’s get started!
1. Understanding LLM4Decompile
LLM4Decompile decompiles x86 assembly instructions into C code. The latest version reports up to a 100% performance improvement over previous iterations. For more details, see the project's GitHub repository.
2. Evaluation Results
Here’s a snapshot of how the models perform on two benchmarks at each GCC optimization level (O0 to O3):
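HumanEval-Decompile

| Model | O0 | O1 | O2 | O3 | AVG |
|---|---|---|---|---|---|
| DeepSeek-Coder-6.7B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| GPT-4o | 0.3049 | 0.1159 | 0.1037 | 0.1159 | 0.1601 |
| LLM4Decompile-End-1.3B | 0.4720 | 0.2061 | 0.2122 | 0.2024 | 0.2732 |
| LLM4Decompile-End-6.7B | 0.6805 | 0.3951 | 0.3671 | 0.3720 | 0.4537 |
| LLM4Decompile-End-33B | 0.5168 | 0.2956 | 0.2815 | 0.2675 | 0.3404 |

ExeBench

| Model | O0 | O1 | O2 | O3 | AVG |
|---|---|---|---|---|---|
| DeepSeek-Coder-6.7B | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
| GPT-4o | 0.0443 | 0.0328 | 0.0397 | 0.0343 | 0.0378 |
| LLM4Decompile-End-1.3B | 0.1786 | 0.1362 | 0.1320 | 0.1328 | 0.1449 |
| LLM4Decompile-End-6.7B | 0.2289 | 0.1660 | 0.1618 | 0.1625 | 0.1798 |
| LLM4Decompile-End-33B | 0.1886 | 0.1465 | 0.1396 | 0.1411 | 0.1540 |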
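Among the models listed, LLM4Decompile-End-6.7B has the highest average score on both benchmarks (0.4537 on HumanEval-Decompile and 0.1798 on ExeBench). Every model with a non-zero score does best at O0 and drops at the higher optimization levels, while DeepSeek-Coder-6.7B scores 0 in every setting.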
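As a quick sanity check on the HumanEval-Decompile numbers, the sketch below recomputes each average from the four per-level scores and picks the model with the highest reported average. It assumes the AVG column is the plain arithmetic mean of O0 to O3, which the source does not state explicitly; the helper names `mean_score` and `best_model` are made up for this illustration.

```python
# Scores copied from the HumanEval-Decompile results: (O0, O1, O2, O3), reported AVG.
humaneval_rows = {
    "GPT-4o": ((0.3049, 0.1159, 0.1037, 0.1159), 0.1601),
    "LLM4Decompile-End-1.3B": ((0.4720, 0.2061, 0.2122, 0.2024), 0.2732),
    "LLM4Decompile-End-6.7B": ((0.6805, 0.3951, 0.3671, 0.3720), 0.4537),
    "LLM4Decompile-End-33B": ((0.5168, 0.2956, 0.2815, 0.2675), 0.3404),
}


def mean_score(levels):
    """Arithmetic mean over the optimization levels (assumed definition of AVG)."""
    return sum(levels) / len(levels)


def best_model(rows):
    """Name of the model with the highest reported average."""
    return max(rows, key=lambda name: rows[name][1])


for name, (levels, reported_avg) in humaneval_rows.items():
    print(f"{name}: recomputed={mean_score(levels):.4f} reported={reported_avg:.4f}")
print("Highest average:", best_model(humaneval_rows))
```

The recomputed means agree with the reported values to within rounding, so reading AVG as a simple mean looks reasonable for this table.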
3. How to Use LLM4Decompile
Here’s a step-by-step tutorial on how to decompile with LLM4Decompile. In this exercise, we’ll reference a function named func0. Be sure to replace this placeholder with the actual function name you intend to decompile.
Preprocessing Steps
- Compile the C code into binary.
- Disassemble the binary into assembly instructions.
Here’s how you can achieve this using Python:
import subprocess

OPT = ["O0", "O1", "O2", "O3"]  # GCC optimization levels
fileName = "samples/sample"  # path to the C source, without the .c extension
func_name = "func0"  # IMPORTANT: replace `func0` with your function name
label = f"<{func_name}>:"  # objdump prints function labels as <name>:

for opt_state in OPT:
    output_file = fileName + "_" + opt_state
    input_file = fileName + ".c"
    # Compile the code with GCC on Linux
    compile_command = f"gcc -o {output_file}.o {input_file} -{opt_state} -lm"
    subprocess.run(compile_command, shell=True, check=True)
    # Disassemble the binary into assembly instructions
    disassemble_command = f"objdump -d {output_file}.o > {output_file}.s"
    subprocess.run(disassemble_command, shell=True, check=True)

    with open(output_file + ".s") as f:
        asm = f.read()
    if label not in asm:
        raise ValueError("Compile fails")
    # Keep only the target function's block (up to the next blank line)
    asm = label + asm.split(label)[-1].split("\n\n")[0]
    asm_clean = ""
    for tmp in asm.split("\n"):
        if len(tmp.split("\t")) < 3 and "00" in tmp:
            continue  # skip lines without an instruction field, e.g. raw-byte continuation lines
        idx = min(len(tmp.split("\t")) - 1, 2)
        tmp_asm = "\t".join(tmp.split("\t")[idx:])  # remove the binary code
        tmp_asm = tmp_asm.split("#")[0].strip()  # remove comments
        asm_clean += tmp_asm + "\n"
    input_asm = asm_clean.strip()

    before = "# This is the assembly code:\n"  # prompt prefix
    after = "\n# What is the source code?\n"  # prompt suffix
    input_asm_prompt = before + input_asm + after
    with open(fileName + "_" + opt_state + ".asm", "w", encoding="utf-8") as f:
        f.write(input_asm_prompt)
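To see what the cleaning step does without running GCC, here is a minimal sketch that applies the same logic to a hand-written objdump excerpt. The helper names `clean_function_asm` and `build_prompt` and the sample addresses and byte values are made up for illustration; real objdump output will differ in detail, and the sketch assumes function labels appear as `<name>:`.

```python
def clean_function_asm(objdump_text: str, func_name: str) -> str:
    """Extract one function from objdump -d output; drop addresses, bytes and comments."""
    label = f"<{func_name}>:"
    if label not in objdump_text:
        raise ValueError(f"{label} not found in disassembly")
    # The function's block runs from its label to the next blank line.
    body = label + objdump_text.split(label)[-1].split("\n\n")[0]
    cleaned = []
    for line in body.split("\n"):
        fields = line.split("\t")
        if len(fields) < 3 and "00" in line:
            continue  # no instruction field, e.g. a raw-byte continuation line
        idx = min(len(fields) - 1, 2)
        # Keep the instruction field only, then strip trailing '#' comments.
        cleaned.append("\t".join(fields[idx:]).split("#")[0].strip())
    return "\n".join(cleaned).strip()


def build_prompt(asm: str) -> str:
    """Wrap the cleaned assembly in the prompt format used in this guide."""
    return "# This is the assembly code:\n" + asm + "\n# What is the source code?\n"


# Hand-written sample in the shape of objdump -d output (illustrative values).
SAMPLE = (
    "0000000000001129 <func0>:\n"
    "    1129:\t55                   \tpush   %rbp\n"
    "    112a:\t48 89 e5             \tmov    %rsp,%rbp\n"
    "    112d:\t8b 05 dd 2e 00 00    \tmov    0x2edd(%rip),%eax        # 4010 <counter>\n"
    "    1133:\t5d                   \tpop    %rbp\n"
    "    1134:\tc3                   \tret\n"
    "\n"
    "0000000000001135 <main>:\n"
    "    1135:\t55                   \tpush   %rbp\n"
)

print(build_prompt(clean_function_asm(SAMPLE, "func0")))
```

Only the label and the instruction text survive, so the prompt does not carry addresses or machine-code bytes, and anything belonging to `main` is left out.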
Decompilation Steps
- Use LLM4Decompile to translate assembly instructions back into C code.
Here's how you can do this:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "LLM4Binary/llm4decompile-6.7b-v1.5"  # V1.5 model
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).cuda()

# fileName and OPT are the variables defined in the preprocessing script
with open(fileName + "_" + OPT[0] + ".asm", "r") as f:  # optimization level O0
    asm_func = f.read()
inputs = tokenizer(asm_func, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=4000)
# Drop the prompt tokens and the last generated token (normally the end-of-sequence token)
prompt_len = inputs["input_ids"].shape[1]
c_func_decompile = tokenizer.decode(outputs[0][prompt_len:-1])

with open(fileName + ".c", "r") as f:  # original file
    func = f.read()

# We only decompile one function, whereas the original file may contain multiple functions
print(f"Original function:\n{func}")
print(f"Decompiled function:\n{c_func_decompile}")
4. Troubleshooting
While using LLM4Decompile, you may encounter some hiccups. Here are a few troubleshooting tips:
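- Compile Fails: The preprocessing script raises ValueError("Compile fails") when it cannot find your function in the disassembly. Ensure that you have replaced func0 with your actual function name in the preprocessing code.
- Output Issues: If the decompiled C code does not match expectations, inspect the generated .asm prompt file and confirm that it contains the correct function's assembly before decompilation.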
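Another possible cause of failures in the preprocessing step is that `gcc` or `objdump` is not installed, in which case `subprocess.run(..., check=True)` raises an error before any assembly is produced. The sketch below is one way to check for both tools up front using the standard library's `shutil.which`; the helper name `missing_tools` is made up for this example.

```python
import shutil


def missing_tools(tools=("gcc", "objdump")):
    """Return the names of the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("gcc and objdump are available")
```

Running this before the preprocessing script separates environment problems from problems with the function name or the source file.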
5. Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations. Happy Coding!

