Discovering Phrases from Large Text Corpora with Phrase-At-Scale

Sep 25, 2020 | Data Science

Welcome to the world of Phrase-At-Scale, where the power of big data meets the elegance of language processing! If you’ve ever faced the daunting task of sifting through vast amounts of text to extract meaningful phrases, you’re in for a treat. Phrase-At-Scale provides a quick and efficient way to discover phrases using PySpark, making phrase extraction accessible to everyone. In this article, we’ll walk through using Phrase-At-Scale with a step-by-step guide.

What Can Phrase-At-Scale Do for You?

Before diving into the technical details, let’s explore the amazing features Phrase-At-Scale offers:

  • Discover the most common phrases in your text.
  • Extract phrases of arbitrary sizes (think bigrams and trigrams).
  • Adjust configurations to control the quality of the extracted phrases.
  • Support for multiple languages beyond English.
  • Run the process locally using multiple threads or over multiple machines in parallel.
  • Annotate your corpora with the phrases you’ve discovered.
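
To make that last point concrete, here is a toy sketch of what annotation typically looks like: once phrases are discovered, their occurrences in the corpus are tagged, commonly by joining the words of each phrase with underscores. This function is purely illustrative and is not Phrase-At-Scale's actual code; the underscore convention is an assumption.

    # Toy illustration of phrase annotation (not Phrase-At-Scale's actual code).
    # Assumes the common convention of joining phrase words with underscores.
    def annotate(text, phrases):
        # Replace longer phrases first so they are not split by shorter ones.
        for phrase in sorted(phrases, key=len, reverse=True):
            text = text.replace(phrase, phrase.replace(" ", "_"))
        return text

    print(annotate("the battery life is great", ["battery life"]))
    # -> the battery_life is great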

Getting Started with Phrase-At-Scale

Are you ready to harness the power of Phrase-At-Scale? Follow these steps to get started:

Running Locally

To re-run phrase discovery using the default dataset, follow these steps:

  1. Install Spark.
  2. Clone the Phrase-At-Scale repository and navigate to its top-level directory:

     git clone git@github.com:kavgan/phrase-at-scale.git
     cd phrase-at-scale

  3. Run the Spark job with the command below, substituting the path to your Spark installation:

     <your_path_to_spark>/bin/spark-submit --master local[200] --driver-memory 4G phrase_generator.py

     This command uses the settings specified in config.py, including the paths to the input data files.

  4. Monitor the progress of your job at http://localhost:4040.
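
To give a sense of the kind of computation spark-submit launches here, below is a minimal, self-contained PySpark sketch that counts bigrams and keeps the frequent ones. It is not phrase_generator.py; the input path, thread count, and frequency threshold are placeholder assumptions.

    # Minimal PySpark sketch of frequency-based phrase discovery.
    # NOT phrase_generator.py; input path and threshold are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[4]").appName("phrase-sketch").getOrCreate()
    lines = spark.sparkContext.textFile("data/reviews.txt")  # hypothetical input

    def bigrams(line):
        tokens = line.lower().split()
        return zip(tokens, tokens[1:])

    counts = (lines.flatMap(bigrams)
                   .map(lambda bg: (" ".join(bg), 1))
                   .reduceByKey(lambda a, b: a + b)
                   .filter(lambda kv: kv[1] >= 50))  # cf. min-phrase-count

    for phrase, n in counts.top(10, key=lambda kv: kv[1]):
        print(n, phrase)

    spark.stop()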

Expected Outputs

The job will produce two output files:

  • The list of phrases will be in top-opinrank-phrases.txt.
  • The annotated corpora will be available under data/tagged-data.
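
Once the job finishes, the phrase list is plain text and easy to inspect. A quick sketch, assuming one phrase per line (the exact file format may differ):

    # Peek at the discovered phrases; assumes one phrase per line.
    with open("top-opinrank-phrases.txt") as f:
        phrases = [line.strip() for line in f if line.strip()]

    print(len(phrases), "phrases discovered")
    print(phrases[:10])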

Configuration Customization

You can easily adjust the configuration to meet your needs by editing the config.py file. Here’s a quick rundown of configurable options:

  • input_file: Path to your input data files; this can be a single file or a folder of files.
  • output-folder: Path for your annotated corpora, either local or on HDFS.
  • phrase-file: Path for the file containing the list of discovered phrases.
  • stop-file: Stop-words file used to indicate phrase boundaries.
  • min-phrase-count: Minimum number of occurrences for a phrase to be kept; guidance is provided for different dataset sizes.
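
For orientation, a config.py along these lines would match the options above. Treat it as an illustrative sketch: the exact variable names, values, and structure in the repository may differ.

    # Illustrative sketch only; check the repository's config.py for the real keys.
    config = {
        "input_file": "data/opinrank/",             # single file or folder of files
        "output-folder": "data/tagged-data",        # local or HDFS path for annotated corpora
        "phrase-file": "top-opinrank-phrases.txt",  # where discovered phrases are written
        "stop-file": "stop-words.txt",              # hypothetical stop-words file name
        "min-phrase-count": 50,                     # placeholder; tune to dataset size
    }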

Using the Dataset

The default configuration utilizes a subset of the OpinRank dataset, which includes approximately 255,000 hotel reviews. If you use the dataset, you can cite it as follows:

@article{ganesan2012opinion,
  title={Opinion-based entity ranking},
  author={Ganesan, Kavita and Zhai, ChengXiang},
  journal={Information retrieval},
  volume={15},
  number={2},
  pages={116--150},
  year={2012},
  publisher={Springer}
}

Troubleshooting Tips

As with any powerful tool, you may occasionally encounter some hiccups. Here are some troubleshooting ideas to help you out:

  • Ensure that Spark and all dependencies are correctly installed (a quick sanity check follows this list).
  • Check the paths in the config.py file – wrong paths are a common cause of failures.
  • If the job isn’t showing progress at http://localhost:4040, double-check the syntax of your spark-submit command.
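
As a quick sanity check on the first point, you can confirm that PySpark is importable and a local Spark session starts. This snippet assumes pyspark is installed in your Python environment:

    # Sanity check: does a local Spark session start at all?
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").appName("sanity").getOrCreate()
    print(spark.version)  # prints the installed Spark version
    spark.stop()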

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.

So, go ahead and unleash the power of Phrase-At-Scale to transform your text analysis process!
