Are you ready to dive into the world of contrastive learning for Knowledge Graph (KG) completion with pre-trained language models? This guide walks you through the essential steps to implement SimKGC, the method introduced in the ACL 2022 paper "SimKGC: A Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models". SimKGC frames KG completion as efficient contrastive learning: it combines a large number of negatives with a hardness-aware InfoNCE loss, which yields strong performance on standard benchmark datasets.
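To make the loss concrete, here is a minimal Python sketch of an InfoNCE loss with an additive margin on the positive pair. The function name and the default margin and temperature values are illustrative assumptions, not SimKGC's exact code (SimKGC, for instance, treats the temperature as a learnable parameter):

```python
import math

def info_nce(scores, margin=0.02, temperature=0.05):
    """Illustrative InfoNCE loss with an additive margin.

    scores: list of cosine similarities; scores[0] is the positive
    pair, the rest are negatives. Returns the negative log-likelihood
    of the (margin-penalized) positive under a softmax over all scores.
    """
    logits = [s / temperature for s in scores]
    logits[0] -= margin / temperature  # additive margin makes the positive "harder"
    m = max(logits)                    # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]       # = -log softmax(positive)
```

A higher positive score lowers the loss, and dividing by a small temperature sharpens the softmax so that hard negatives dominate the gradient, which is the "hardness-aware" aspect.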
Requirements
Before we get started, ensure you have the following environment set up:
- Python: 3.7+
- PyTorch: 1.6+ (mixed precision training requires 1.6 or later)
- Transformers: 4.15+
Note: the reported experiments were run on 4 V100 (32GB) GPUs.
How to Run SimKGC
Running SimKGC involves three primary steps: dataset preprocessing, model training, and model evaluation. Here’s how you can handle datasets like WN18RR, FB15k-237, and Wikidata5M.
Working with WN18RR Dataset
- Preprocess the Dataset:
  bash scripts/preprocess.sh WN18RR
- Train the Model (you can specify the output directory via OUTPUT_DIR):
  OUTPUT_DIR=.checkpoint/wn18rr bash scripts/train_wn.sh
  Training takes approximately 3 hours.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/wn18rr/model_last.mdl WN18RR
Working with FB15k-237 Dataset
- Preprocess the Dataset:
  bash scripts/preprocess.sh FB15k237
- Train the Model:
  OUTPUT_DIR=.checkpoint/fb15k237 bash scripts/train_fb.sh
  Training also takes around 3 hours.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/fb15k237/model_last.mdl FB15k237
Wikidata5M Transductive Dataset
- Download the Dataset:
  bash scripts/download_wikidata5m.sh
- Preprocess the Dataset:
  bash scripts/preprocess.sh wiki5m_trans
- Train the Model:
  OUTPUT_DIR=.checkpoint/wiki5m_trans bash scripts/train_wiki.sh wiki5m_trans
  Training lasts about 12 hours.
- Evaluate the Model:
  bash scripts/eval_wiki5m_trans.sh .checkpoint/wiki5m_trans/model_last.mdl
Wikidata5M Inductive Dataset
First, download the dataset (if you haven’t already) using the download command from the transductive setup above. Then:
- Preprocess the Dataset:
  bash scripts/preprocess.sh wiki5m_ind
- Train the Model:
  OUTPUT_DIR=.checkpoint/wiki5m_ind bash scripts/train_wiki.sh wiki5m_ind
  Expect about 11 hours of training time.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/wiki5m_ind/model_last.mdl wiki5m_ind
Troubleshooting
If you’re encountering issues while running SimKGC, here are some common problems and their solutions:
- CUDA Out of Memory: If your GPU runs out of memory, try reducing the batch size. Keep in mind that in contrastive training the negatives are drawn from the batch itself, so a smaller batch size means fewer negatives per example and may hurt performance.
- Distributed Data Parallel (DDP) Training: This codebase does not currently support DDP because of its input mask requirements; for simplicity, it only supports single-machine data parallel training.
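To see why batch size matters here, note that each example in a batch is scored against every tail entity in that batch, so a batch of size B supplies B - 1 in-batch negatives per positive. The plain-Python sketch below is illustrative only, not SimKGC's actual implementation:

```python
def in_batch_scores(hr_embs, t_embs):
    """Score matrix for in-batch contrastive training (illustrative).

    Row i holds the dot-product scores of example i's head-relation
    embedding against every tail embedding in the batch. Diagonal
    entries are the positive pairs; off-diagonal entries act as
    in-batch negatives.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return [[dot(hr, t) for t in t_embs] for hr in hr_embs]

# Toy batch of 3 examples with 2-dimensional embeddings.
batch = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
scores = in_batch_scores(batch, batch)
negatives_per_example = len(scores[0]) - 1  # batch size minus the positive
```

Halving the batch size halves the negative pool, which is why the out-of-memory workaround above can cost accuracy.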
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

