Welcome to a step-by-step guide on how to set up and reproduce the experiments from the paper CoLAKE: Contextualized Language and Knowledge Embedding. This article will walk you through the setup, preprocessing, and training of the CoLAKE model, making sure you have a smooth experience along the way.
Prepare Your Environment
Before diving into the implementation, it’s wise to create a new environment to keep your dependencies tidy and ensure smooth operation. Follow these commands:
```bash
conda create --name colake python=3.7
source activate colake
```
CoLAKE is built upon fastNLP and Hugging Face’s Transformers, and employs fitlog for experiment tracking.
Clone the Necessary Repositories
Next, clone the required repositories and install the necessary packages:
```bash
git clone https://github.com/fastnlp/fastNLP.git
cd fastNLP
python setup.py install
cd ..
git clone https://github.com/fastnlp/fitlog.git
cd fitlog
python setup.py install
cd ..
pip install transformers==2.11
pip install scikit-learn
```
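Since fitlog drives the experiment tracking, here is a minimal sketch of how a training script might record a run with it; the hyperparameters and values below are hypothetical:

```python
# Minimal fitlog usage sketch (hypothetical values); assumes the project has
# been initialized with `fitlog init` so a logs/ directory exists.
import fitlog

fitlog.set_log_dir("logs/")                         # where fitlog stores records
fitlog.add_hyper({"lr": 1e-4, "batch_size": 2048})  # log hyperparameters
for step in range(100):
    loss = 1.0 / (step + 1)                         # placeholder training loss
    fitlog.add_loss(loss, name="loss", step=step)
fitlog.add_best_metric({"dev": {"accuracy": 0.9}})  # best metric of the run
fitlog.finish()                                     # mark the run as finished
```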
Dependencies for Re-training CoLAKE
Because CoLAKE handles millions of entities, you may need mixed CPU-GPU training, where the entity embeddings are kept in CPU memory; CoLAKE's implementation relies on the KVStore provided by DGL for this. If you also plan to run the link prediction experiments, install DGL-KE as well.
```bash
pip install dgl==0.4.3
pip install dglke
```
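If you are unsure whether the pinned versions were picked up, a quick import check (a small sketch) can confirm them:

```python
# Confirm the pinned dependency versions are the ones actually importable.
import dgl
import transformers

print(dgl.__version__)           # expect 0.4.3
print(transformers.__version__)  # expect 2.11.x
```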
Reproduce the Experiments
1. Download Model and Embeddings
To kick things off, you’ll need to download the pre-trained CoLAKE model and entity embeddings, which cover over 3 million entities:
```bash
mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy
```
Alternatively, you can use gdown:
```bash
pip install gdown
gdown https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b -O model/model.bin
gdown https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI -O model/entities.npy
```
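gdown also exposes a Python API, which makes it easy to place the files under model/ and sanity-check them right away. A small sketch (the exact embedding shape is not documented here, so we only print it):

```python
# Download both files into model/ and inspect the entity embeddings.
import gdown
import numpy as np

gdown.download("https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b",
               "model/model.bin", quiet=False)
gdown.download("https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI",
               "model/entities.npy", quiet=False)

entities = np.load("model/entities.npy")
print(entities.shape)  # one row per entity; 3M+ rows expected
```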
2. Run the Experiments
After preparing your model, download the datasets for the experiments:
```bash
python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune
```
Now you can run the experiments:
- For FewRel:
```bash
python run_re.py --debug --gpu 0
```
- For Open Entity:
```bash
python run_typing.py --debug --gpu 0
```
- For LAMA and LAMA-UHN (run from the repository root; the sketch after this list illustrates the idea behind these probes):
```bash
cd ..
python eval_lama.py
```
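LAMA evaluates how well a pre-trained model completes cloze-style factual queries. As a rough illustration of the idea (not what eval_lama.py does internally), here is a fill-mask probe against stock RoBERTa:

```python
# Illustrative LAMA-style cloze probe (not CoLAKE's eval_lama.py).
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="roberta-base")
# LAMA scores a model by how highly it ranks the correct object ("Paris").
for candidate in fill_mask("The capital of France is <mask>."):
    print(candidate["sequence"], candidate["score"])
```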
Re-train CoLAKE Step-by-Step
1. Download the Data
Start by downloading the latest Wikipedia dump (XML format):
```bash
wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```
Next, download the knowledge graph (Wikidata5M):
```bash
wget -c "https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1" -O wikidata5m_transductive.tar.gz
tar -xzvf wikidata5m_transductive.tar.gz
```
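The archive unpacks into plain tab-separated triple files. A small sketch of reading them, assuming the standard Wikidata5M file name wikidata5m_transductive_train.txt:

```python
# Read the Wikidata5M training triples: one (head, relation, tail) per line,
# tab-separated, e.g. "Q30\tP36\tQ61". File name assumed from the release.
triples = []
with open("wikidata5m_transductive_train.txt") as f:
    for line in f:
        head, relation, tail = line.rstrip("\n").split("\t")
        triples.append((head, relation, tail))
print(len(triples), "triples loaded")
```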
Also download the Wikidata5M entity and relation aliases:
```bash
wget -c "https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1" -O wikidata5m_alias.tar.gz
tar -xzvf wikidata5m_alias.tar.gz
```
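Each line of the alias files maps a Wikidata ID to its surface forms. A sketch of building an alias lookup, assuming the standard file name wikidata5m_entity.txt:

```python
# Build an entity-alias lookup: each line is "Q-id<TAB>alias1<TAB>alias2...".
entity_aliases = {}
with open("wikidata5m_entity.txt") as f:
    for line in f:
        fields = line.rstrip("\n").split("\t")
        entity_aliases[fields[0]] = fields[1:]
print(entity_aliases.get("Q30"))  # aliases of "United States", if present
```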
2. Preprocess the Data
Data preprocessing is crucial. Start by creating a directory for the pretraining data, then run WikiExtractor over the Wikipedia dump:
```bash
mkdir pretrain_data
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
```
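The -l flag keeps the wiki anchors, which is what the later steps use to align text with entities. A sketch of peeking at one extracted shard (the shard path is an assumption about WikiExtractor's default output layout):

```python
# Inspect anchors preserved by WikiExtractor's -l flag. Articles are wrapped
# in <doc ...> tags and links appear as <a href="target">surface form</a>.
import re

with open("pretrain_data/output/AA/wiki_00") as f:
    text = f.read()

anchors = re.findall(r'<a href="([^"]+)">([^<]+)</a>', text)
print(anchors[:5])  # (link target, surface form) pairs
```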
Then process the anchors, generate the pre-training samples, and count entity and relation frequencies:
```bash
python preprocess/extract.py 4
python preprocess/gen_data.py 4
python statistic.py
```
3. Train CoLAKE
Finally, initialize entity and relation embeddings with the average of RoBERTa BPE embeddings:
```bash
cd pretrain
python init_ent_rel.py
```
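To see what "average of RoBERTa BPE embeddings" means in practice, here is a rough sketch of the idea (not the actual init_ent_rel.py): tokenize a name into BPE pieces and average their input embeddings.

```python
# Sketch: initialize an entity/relation vector as the mean of the RoBERTa
# input embeddings of its BPE pieces (illustration, not init_ent_rel.py).
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")
bpe_embeddings = model.embeddings.word_embeddings.weight.detach()

def init_vector(name):
    ids = tokenizer.encode(name, add_special_tokens=False)
    return bpe_embeddings[torch.tensor(ids)].mean(dim=0)

print(init_vector("Barack Obama").shape)  # torch.Size([768])
```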
Now, you can train CoLAKE with mixed CPU-GPU:
```bash
bash run_pretrain.sh
```
Troubleshooting Tips
If you encounter any issues during the setup or execution process, try the following troubleshooting tips:
- Ensure all prerequisite libraries are successfully installed.
- Double-check the given paths for your datasets and models.
- Verify environment variables (such as CUDA_VISIBLE_DEVICES) are correctly set for GPU usage; the quick check after this list can help confirm this.
- When in doubt, refer to the GitHub pages of fitlog, fastNLP, and DGL-KE for known issues and fixes.
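A quick Python diagnostic covering the first three points above (a small sketch):

```python
# Quick environment diagnostic for the troubleshooting list above.
import os
import torch, dgl, transformers

print("CUDA available:", torch.cuda.is_available())
print("Visible devices:", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
print("dgl", dgl.__version__, "| transformers", transformers.__version__)
print("model/ exists:", os.path.isdir("model"))
```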
For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.
Conclusion
You now have everything needed to reproduce CoLAKE's experiments or re-train it from scratch: a clean environment, the pre-trained model and entity embeddings, and the full preprocessing pipeline. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.