Are you ready to dive into the world of contrastive learning for Knowledge Graph (KG) completion with pre-trained language models? This guide walks you through the essential steps to implement SimKGC, the method introduced in the ACL 2022 paper "SimKGC: A Simple Contrastive Knowledge Graph Completion with Pre-trained Language Models". SimKGC frames KG completion as efficient contrastive learning: it combines a large number of negatives with a hardness-aware InfoNCE loss, which yields strong performance on standard benchmark datasets.
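To make the loss concrete, here is a minimal Python sketch of an InfoNCE loss with an additive margin on the positive pair. The function name and the default margin and temperature values are illustrative assumptions, not SimKGC's exact code (SimKGC, for instance, treats the temperature as a learnable parameter):

```python
import math

def info_nce(scores, margin=0.02, temperature=0.05):
    """Illustrative InfoNCE loss with an additive margin.

    scores: list of cosine similarities; scores[0] is the positive
    pair, the rest are negatives. Returns the negative log-likelihood
    of the (margin-penalized) positive under a softmax over all scores.
    """
    logits = [s / temperature for s in scores]
    logits[0] -= margin / temperature  # additive margin makes the positive "harder"
    m = max(logits)                    # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]       # = -log softmax(positive)
```

A higher positive score lowers the loss, and dividing by a small temperature sharpens the softmax so that hard negatives dominate the gradient, which is the "hardness-aware" aspect.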
Requirements
Before we get started, ensure you have the following environment set up:
- Python: 3.7+
- PyTorch: 1.6+ (mixed precision training requires 1.6 or later)
- Transformers: 4.15+
Note: the reported experiments were run on 4 V100 (32GB) GPUs.
How to Run SimKGC
Running SimKGC involves three primary steps: dataset preprocessing, model training, and model evaluation. Here’s how you can handle datasets like WN18RR, FB15k-237, and Wikidata5M.
Working with WN18RR Dataset
- Preprocess the Dataset:
  bash scripts/preprocess.sh WN18RR
- Train the Model (you can specify the output directory via OUTPUT_DIR):
  OUTPUT_DIR=.checkpoint/wn18rr bash scripts/train_wn.sh
  Training takes approximately 3 hours.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/wn18rr/model_last.mdl WN18RR
Working with FB15k-237 Dataset
- Preprocess the Dataset:
  bash scripts/preprocess.sh FB15k237
- Train the Model:
  OUTPUT_DIR=.checkpoint/fb15k237 bash scripts/train_fb.sh
  Training also takes around 3 hours.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/fb15k237/model_last.mdl FB15k237
Wikidata5M Transductive Dataset
- Download the Dataset:
  bash scripts/download_wikidata5m.sh
- Preprocess the Dataset:
  bash scripts/preprocess.sh wiki5m_trans
- Train the Model:
  OUTPUT_DIR=.checkpoint/wiki5m_trans bash scripts/train_wiki.sh wiki5m_trans
  Training lasts about 12 hours.
- Evaluate the Model:
  bash scripts/eval_wiki5m_trans.sh .checkpoint/wiki5m_trans/model_last.mdl
Wikidata5M Inductive Dataset
First, download the dataset (if you haven’t already) using the download command from the transductive setup above. Then:
- Preprocess the Dataset:
  bash scripts/preprocess.sh wiki5m_ind
- Train the Model:
  OUTPUT_DIR=.checkpoint/wiki5m_ind bash scripts/train_wiki.sh wiki5m_ind
  Expect about 11 hours of training time.
- Evaluate the Model:
  bash scripts/eval.sh .checkpoint/wiki5m_ind/model_last.mdl wiki5m_ind
Troubleshooting
If you’re encountering issues while running SimKGC, here are some common problems and their solutions:
- CUDA Out of Memory: If your GPU runs out of memory, try reducing the batch size. Keep in mind that in contrastive training the negatives are drawn from the batch itself, so a smaller batch size means fewer negatives per example and may hurt performance.
- Distributed Data Parallel (DDP) Training: This codebase does not currently support DDP because of its input mask requirements; for simplicity, it only supports single-machine data parallel training.
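To see why batch size matters here, note that each example in a batch is scored against every tail entity in that batch, so a batch of size B supplies B - 1 in-batch negatives per positive. The plain-Python sketch below is illustrative only, not SimKGC's actual implementation:

```python
def in_batch_scores(hr_embs, t_embs):
    """Score matrix for in-batch contrastive training (illustrative).

    Row i holds the dot-product scores of example i's head-relation
    embedding against every tail embedding in the batch. Diagonal
    entries are the positive pairs; off-diagonal entries act as
    in-batch negatives.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return [[dot(hr, t) for t in t_embs] for hr in hr_embs]

# Toy batch of 3 examples with 2-dimensional embeddings.
batch = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
scores = in_batch_scores(batch, batch)
negatives_per_example = len(scores[0]) - 1  # batch size minus the positive
```

Halving the batch size halves the negative pool, which is why the out-of-memory workaround above can cost accuracy.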
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

