How to Summarize Open-Domain Code-Switched Conversations with Gupshup

Aug 29, 2024 | Educational

Summarizing conversations that switch between languages is a hard problem in natural language processing. Gupshup, a project presented at EMNLP 2021, addresses it with a dataset and pretrained models for summarizing open-domain code-switched dialogues. This article walks through setting up Gupshup and running inference, and closes with troubleshooting tips.

Getting Started with Gupshup

Before you dive into summarizing code-switched conversations, you need to set up your environment and gather the required data and models.

Step 1: Request the Dataset

You can request the Gupshup dataset through this Google form. This dataset comes with two tasks:

Hinglish Dialogues to English Summarization (h2e)
English Dialogues to English Summarization (e2e)

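For each task, the data comes as paired files: `.source` files hold the dialogues and `.target` files hold the reference summaries (the example commands later in this article pass them as `--input_path` and `--reference_path`, respectively). Keep them organized by task, as you will need to specify these files when running the scripts.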
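Before running anything, it can help to confirm that the two files of a split line up. The sketch below assumes the common seq2seq layout in which line i of the `.source` file is one dialogue and line i of the `.target` file is its summary; that layout is an assumption, not something the dataset description above states. The helper names `count_lines` and `check_split` are hypothetical, and the `data/h2e/test` prefix is borrowed from the example commands later in this article.

```python
from pathlib import Path


def count_lines(path):
    """Return the number of lines in a UTF-8 text file."""
    with Path(path).open(encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_split(prefix):
    """Compare line counts of <prefix>.source and <prefix>.target.

    Assumes one dialogue per line in .source and one summary per line
    in .target (an assumption about the file layout).
    Returns (n_source, n_target, counts_match).
    """
    n_source = count_lines(f"{prefix}.source")
    n_target = count_lines(f"{prefix}.target")
    return n_source, n_target, n_source == n_target


if __name__ == "__main__":
    prefix = "data/h2e/test"  # illustrative path; adjust to your layout
    if Path(f"{prefix}.source").exists() and Path(f"{prefix}.target").exists():
        print(check_split(prefix))
    else:
        print(f"Files for {prefix} not found")
```

If the counts differ, the files are probably mismatched or truncated, and any scores computed against the references would be unreliable.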

Step 2: Clone the Repository and Install Dependencies

Clone the Gupshup repository:

git clone https://github.com/midas-research/gupshup.git

Create and activate a Python virtual environment, then install the required packages from inside the cloned repository:

pip install -r requirements.txt

Step 3: Choose Your Model

All model weights are hosted on the Hugging Face model hub. The available checkpoints are:

For Hinglish to English summaries:

mBART: midas/gupshup_h2e_mbart
PEGASUS: midas/gupshup_h2e_pegasus

For English to English summaries:

mBART: midas/gupshup_e2e_mbart
PEGASUS: midas/gupshup_e2e_pegasus
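As a rough illustration, the four checkpoints can be kept in a small lookup table so the task and architecture are chosen in one place. The helper names `checkpoint_for` and `load_model` are hypothetical. The loader assumes the checkpoints work with the generic `AutoTokenizer` and `AutoModelForSeq2SeqLM` classes from `transformers`; the repository's own script may load them differently, so treat this as a sketch rather than the project's method.

```python
# Checkpoint names as listed above, keyed by (task, architecture).
CHECKPOINTS = {
    ("h2e", "mbart"): "midas/gupshup_h2e_mbart",
    ("h2e", "pegasus"): "midas/gupshup_h2e_pegasus",
    ("e2e", "mbart"): "midas/gupshup_e2e_mbart",
    ("e2e", "pegasus"): "midas/gupshup_e2e_pegasus",
}


def checkpoint_for(task, architecture):
    """Look up the Hub checkpoint name for a task/architecture pair."""
    key = (task.lower(), architecture.lower())
    if key not in CHECKPOINTS:
        raise ValueError(f"Unknown combination: {key}")
    return CHECKPOINTS[key]


def load_model(task, architecture):
    """Load tokenizer and model via the transformers Auto classes.

    Assumption: the checkpoints load through AutoTokenizer and
    AutoModelForSeq2SeqLM. Requires `transformers` to be installed
    and network access to download the weights.
    """
    # Imported lazily so the lookup helper works without transformers.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    name = checkpoint_for(task, architecture)
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    return tokenizer, model
```

Calling `checkpoint_for("h2e", "pegasus")` returns `midas/gupshup_h2e_pegasus`, which can also be passed directly as the `--model_name` argument when running inference.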

Running Inference

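Now that everything is set up, you’re ready to generate summaries. The evaluation script takes several command-line arguments: the model name (`--model_name`), the input dialogues (`--input_path`), where to save the generated summaries (`--save_path`), the reference summaries (`--reference_path`), where to write the scores (`--score_path`), and the batch size (`--bs`).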

Example Commands

To generate summaries from Hinglish dialogues using the mBART model, use:

python run_eval.py --model_name midas/gupshup_h2e_mbart --input_path data/h2e/test.source --save_path generated_summary.txt --reference_path data/h2e/test.target --score_path scores.txt --bs 8

For English dialogues with the PEGASUS model, run the following:

python run_eval.py --model_name midas/gupshup_e2e_pegasus --input_path data/e2e/test.source --save_path generated_summary.txt --reference_path data/e2e/test.target --score_path scores.txt --bs 8
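Since the two commands differ only in the task and the model name, they can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the commands above; the function name `build_eval_command` and its defaults are hypothetical, and it assumes the `data/<task>/<split>` layout used in those commands.

```python
import shlex


def build_eval_command(task, model_name, split="test", batch_size=8,
                       save_path="generated_summary.txt",
                       score_path="scores.txt", data_dir="data"):
    """Assemble the run_eval.py argument list used in the commands above."""
    prefix = f"{data_dir}/{task}/{split}"
    return [
        "python", "run_eval.py",
        "--model_name", model_name,
        "--input_path", f"{prefix}.source",
        "--save_path", save_path,
        "--reference_path", f"{prefix}.target",
        "--score_path", score_path,
        "--bs", str(batch_size),
    ]


if __name__ == "__main__":
    cmd = build_eval_command("h2e", "midas/gupshup_h2e_mbart")
    # Print the command as a single shell-ready string.
    print(shlex.join(cmd))
```

Run as a script, this prints the first example command. The returned list can also be passed to `subprocess.run(cmd, check=True)` from whichever directory contains `run_eval.py`.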

Understanding the Code with an Analogy

Think of the Gupshup summarization system as a translator who not only knows several languages but also has a unique ability to distill discussions into concise points. Imagine a conversation in a bustling cafe where several guests converse, switching between English and Hinglish. Your job is to capture the essence of that conversation. The input files are like the chaotic environment of the cafe, and your chosen models are akin to experienced translators swiftly summarizing the key points so that someone can understand what was discussed without needing to hear every detail.

Troubleshooting

While working on Gupshup, you may encounter some common issues. Here are troubleshooting tips:

Ensure that the paths to your dataset files are correct.
Check that all required libraries are installed by running `pip install -r requirements.txt` inside your virtual environment.
If issues persist, consider opening an issue on the GitHub repository for assistance.
For persistent issues or technical collaboration, feel free to reach out to the community at fxis.ai.
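The first two checks can be automated with a short pre-flight script. This is a sketch only: the helper names are hypothetical, and the package names `transformers` and `torch` are assumptions about what `requirements.txt` contains, so replace them with the packages actually listed there.

```python
import importlib.util
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def missing_packages(names):
    """Return the top-level package names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


if __name__ == "__main__":
    # Paths taken from the example commands; adjust to your layout.
    files = ["data/h2e/test.source", "data/h2e/test.target"]
    # Assumed package names; check requirements.txt for the real list.
    packages = ["transformers", "torch"]
    print("Missing files:", missing_files(files))
    print("Missing packages:", missing_packages(packages))
```

Two empty lists mean the paths resolve and the packages are importable, which rules out the most common setup problems before a longer run.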


