A Guide to Accessing and Utilizing the AVeriTeC Dataset

Jul 22, 2024 | Educational

Welcome to this comprehensive guide on working with the AVeriTeC dataset. The knowledge store and source code described here are designed to help you reproduce the baseline experiments for the upcoming 7th FEVER workshop, co-hosted at EMNLP 2024. Let’s dive into the process!

Latest Updates

  • 20.07.2024: The knowledge store for the test set is released here.
  • 18.07.2024: The test data is released here. Note: The first 1000 data points are from the original AVeriTeC source (claim_id 0 to 999), and the next 1215 points are newly constructed.
  • 15.07.2024: For human evaluation, submission files should now include a scraped_text field. More information is available here.
  • 19.04.2024: The submission page for the shared task is live! Participate by submitting your predictions here.

Understanding the Dataset Structure

The AVeriTeC dataset provides structured claims following this format:

{
  "claim": "The claim text itself",
  "required_reannotation": "true or false",
  "label": "The annotated verdict",
  "justification": "Explanation based on the QA pairs",
  "claim_date": "Estimated date the claim first appeared",
  "speaker": "Person or organization making the claim",
  "original_claim_url": "URL of the claim's original location",
  "cached_original_claim_url": "Archived link to the original URL",
  "fact_checking_article": "Source fact-checking article",
  "reporting_source": "Website or organization that published the claim",
  "location_ISO_code": "ISO code of the location relevant to the claim",
  "claim_types": ["Type 1", "Type 2"],
  "fact_checking_strategies": ["Strategy 1", "Strategy 2"],
  "questions": [
    {
      "question": "Fact-checking question for the claim",
      "answers": [
        {
          "answer": "The answer to the question",
          "answer_type": "abstractive/extractive/boolean/unanswerable",
          "source_url": "Source URL for the answer",
          "cached_source_url": "Archived link to the source",
          "source_medium": "Medium of the answer"
        }
      ]
    }
  ]
}

Think of the dataset as a library of claims: each entry pairs a claim with its verdict, the questions and answers used to check it, and the sources it came from, much like a detective’s case file gathers the clues needed to settle a case.
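
Once you have the data locally, a minimal Python sketch like the following shows how to inspect a claim. It assumes the dev split is a JSON array saved at data/dev.json; adjust the path to your local copy:

import json

# Path is illustrative; point it at your local copy of the dev split.
with open("data/dev.json", "r", encoding="utf-8") as f:
    claims = json.load(f)

example = claims[0]
print(example["claim"])
print(example["label"])

# Each claim carries fact-checking questions, each with one or more answers.
for qa in example["questions"]:
    print("Q:", qa["question"])
    for ans in qa["answers"]:
        print("  A:", ans["answer"], f"({ans['answer_type']})")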

Step-by-Step Guide to Reproduce Baseline Results

Step 0: Set Up Your Environment

To get started, make sure you have Git LFS installed, then run the following commands:

git lfs install
# Clone the repository; pick ONE of the two commands below.
# The second skips downloading the large LFS files up front:
git clone https://huggingface.co/chenxwh/AVeriTeC
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/chenxwh/AVeriTeC
cd AVeriTeC
conda create -n averitec python=3.11
conda activate averitec
pip install -r requirements.txt
python -m spacy download en_core_web_lg
python -m nltk.downloader punkt
python -m nltk.downloader wordnet
conda install pytorch pytorch-cuda=11.8 -c pytorch -c nvidia
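
If you want to confirm the environment is ready before moving on, an optional sanity check like this verifies the resources installed above are actually available:

import nltk
import spacy
import torch

# nltk.data.find raises LookupError if a downloaded resource is missing.
nlp = spacy.load("en_core_web_lg")
nltk.data.find("tokenizers/punkt")
nltk.data.find("corpora/wordnet")

print("spaCy model:", nlp.meta["name"])
print("CUDA available:", torch.cuda.is_available())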

Step 1: Scrape Text from URLs

Next, you’ll need to scrape text data from URLs obtained via the Google API:

bash scripts/scraper.sh split start_idx end_idx # e.g., bash scripts/scraper.sh dev 0 500

More information on scraped text can be found here.
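
To give a sense of what this step produces, here is a rough, hypothetical stand-in for the scraper using requests and BeautifulSoup; the actual script may use different libraries and cleanup rules:

import requests
from bs4 import BeautifulSoup

def scrape_text(url: str, timeout: int = 10) -> str:
    """Fetch a page and return its visible text (a rough stand-in for scripts/scraper.sh)."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    # Drop non-content tags, then collapse whitespace in what remains.
    for tag in soup(["script", "style"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())

print(scrape_text("https://example.com")[:200])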

Step 2: Rank Sentences with BM25

Then rank the scraped sentences against each claim using the BM25 algorithm:

python -m src.reranking.bm25_sentences
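
Conceptually, BM25 scores each scraped sentence against the claim and keeps the best matches as candidate evidence. Here is a minimal sketch with the rank_bm25 package, as an illustration rather than the baseline’s exact implementation:

from rank_bm25 import BM25Okapi

claim = "The claim text itself"
sentences = [
    "An unrelated sentence from a scraped page.",
    "A sentence that closely matches the claim text itself.",
    "Another candidate sentence.",
]

# Naive whitespace tokenization; the baseline may tokenize differently.
bm25 = BM25Okapi([s.lower().split() for s in sentences])
scores = bm25.get_scores(claim.lower().split())

# Keep the top-k sentences as candidate evidence.
for score, sent in sorted(zip(scores, sentences), reverse=True)[:2]:
    print(f"{score:.3f}  {sent}")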

Step 3: Generate Question-Answer Pairs

Use BLOOM to generate question-answer pairs from the top-ranked sentences:

python -m src.reranking.question_generation_top_sentences
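
The idea is to prompt a generative model with a claim and a retrieved sentence and have it produce the question that the sentence answers. A toy sketch with a small BLOOM checkpoint follows; the baseline’s model size and prompt format will differ:

from transformers import pipeline

# A small BLOOM checkpoint stands in for the baseline's model.
generator = pipeline("text-generation", model="bigscience/bloom-560m")

claim = "The claim text itself"
evidence = "A sentence that closely matches the claim text itself."
prompt = (
    f"Evidence: {evidence}\n"
    f"Claim: {claim}\n"
    "Question answered by the evidence:"
)
out = generator(prompt, max_new_tokens=30, do_sample=False)
print(out[0]["generated_text"])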

Step 4: Rerank QA Pairs

Rerank the QA pairs with a BERT model:

python -m src.reranking.rerank_questions
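
Reranking scores each generated QA pair for relevance to the claim. The sketch below uses an off-the-shelf cross-encoder from sentence-transformers as a stand-in for the baseline’s fine-tuned BERT:

from sentence_transformers import CrossEncoder

# An off-the-shelf cross-encoder as a stand-in for the fine-tuned BERT reranker.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

claim = "The claim text itself"
qa_pairs = [
    "Who made the claim? The speaker named in the dataset.",
    "Is the claim supported? Yes, according to the cited source.",
]

# Higher scores mean the QA pair is more relevant to the claim.
scores = model.predict([(claim, qa) for qa in qa_pairs])
for score, qa in sorted(zip(scores, qa_pairs), reverse=True):
    print(f"{score:.3f}  {qa}")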

Step 5: Veracity Prediction

Finally, predict the veracity of the claims:

python -m src.prediction.veracity_prediction
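
The classifier maps a claim plus its QA evidence to one of the four AVeriTeC verdicts. As a rough illustration, here is a sketch using zero-shot classification rather than the baseline’s trained model:

from transformers import pipeline

# The four AVeriTeC verdict classes.
LABELS = [
    "Supported",
    "Refuted",
    "Not Enough Evidence",
    "Conflicting Evidence/Cherrypicking",
]

# Zero-shot classification as a rough stand-in for the trained veracity classifier.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

claim = "The claim text itself"
evidence = "Q: Is the claim supported? A: Yes, according to the cited source."
result = classifier(f"{claim} {evidence}", candidate_labels=LABELS)
print(result["labels"][0], f"{result['scores'][0]:.3f}")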

Don’t forget to evaluate your predictions:

python -m src.prediction.evaluate_veracity

Troubleshooting Tips

While working with the AVeriTeC dataset, you might encounter challenges. Here are some common troubleshooting ideas:

  • Issue with URL scraping: Check your internet connection and make sure the URLs are accessible; for dead links, the dataset’s cached (archived) URLs may help. Switching to a different scraping tool can also resolve stubborn pages.
  • Python environment errors: Ensure the required libraries are properly installed in your conda environment.
  • Model performance is low: Consider tweaking the hyperparameters or using more in-context examples in your model.

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
