How to Use the Whichlang Language Detection Library

Jun 16, 2023 | Data Science

If you’ve ever needed to determine the language of a piece of text, you’re not alone! Fortunately, the Whichlang language detection library is here to help. Whichlang is a Rust crate built for both accuracy and speed, which makes it a good fit if you’re dealing with large volumes of multilingual data.

Why Build Whichlang?

While developing Quickwit, a search engine for logs and tracing data, the necessity for a lightweight, fast, and accurate language detection library became evident. Whichlang was born to meet these high throughput requirements while maintaining great precision.

Features of Whichlang

  • No external dependencies
  • Throughput exceeds 100 MB/s for both short and long strings
  • Reported accuracy of 99.5%, though the figure depends on input size (the benchmark in this article averages 97.03%)
  • Supports languages including Arabic, Dutch, English, French, German, Hindi, Italian, Japanese, Korean, Mandarin, Portuguese, Russian, Spanish, Swedish, Turkish, and Vietnamese

How Does Whichlang Work?

Imagine you’re a skilled chef who needs to determine the flavor profile of a dish based on the ingredients. Each ingredient contributes to the overall taste, just as the characters in a text give away its language. Whichlang works in a similar way: it uses a multiclass logistic regression model that looks at the 2-, 3-, and 4-grams of letters (the ingredients) in ASCII text to predict the language (the dish). The hashing trick maps these n-gram features into a fixed space of 4,096 dimensions, which keeps the model small and fast.
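
To make the idea concrete, here is a minimal sketch of n-gram feature hashing followed by a linear multiclass prediction. It is not Whichlang's actual code: the FNV-1a hash, the function names (`hashed_ngram_counts`, `predict`), the preprocessing, and the toy weights are all assumptions chosen for illustration. Only the n-gram sizes and the 4,096-dimensional feature space come from the description above.

```rust
// Illustrative sketch only: not Whichlang's implementation or its trained weights.

/// Size of the hashed feature space mentioned in the article.
const NUM_BUCKETS: usize = 4096;

/// FNV-1a, used here as a stand-in hash (assumption: the article does not
/// say which hash function Whichlang uses).
fn fnv1a(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &b in bytes {
        hash ^= b as u32;
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Counts the 2-, 3- and 4-grams of `text`, hashed into NUM_BUCKETS slots.
/// ASCII letters are lowercased; every other byte is replaced by a space.
fn hashed_ngram_counts(text: &str) -> Vec<f32> {
    let mut counts = vec![0.0f32; NUM_BUCKETS];
    let bytes: Vec<u8> = text
        .bytes()
        .map(|b| if b.is_ascii_alphabetic() { b.to_ascii_lowercase() } else { b' ' })
        .collect();
    for n in 2..=4 {
        // `windows` yields nothing when the input is shorter than `n`.
        for gram in bytes.windows(n) {
            let bucket = fnv1a(gram) as usize % NUM_BUCKETS;
            counts[bucket] += 1.0;
        }
    }
    counts
}

/// Returns the index of the class with the highest linear score.
/// Softmax is monotonic, so the argmax of the raw scores is also the
/// most probable class under multiclass logistic regression.
fn predict(features: &[f32], weights: &[Vec<f32>], biases: &[f32]) -> usize {
    let mut best = 0;
    let mut best_score = f32::NEG_INFINITY;
    for (class, (w, b)) in weights.iter().zip(biases).enumerate() {
        let score = w.iter().zip(features).map(|(wi, xi)| wi * xi).sum::<f32>() + b;
        if score > best_score {
            best = class;
            best_score = score;
        }
    }
    best
}

fn main() {
    let features = hashed_ngram_counts("the cat");

    // Hypothetical two-class model: class 1 rewards the bigram "th".
    let mut weights = vec![vec![0.0f32; NUM_BUCKETS]; 2];
    weights[1][fnv1a(b"th") as usize % NUM_BUCKETS] = 1.0;
    let biases = [0.0f32, 0.0];

    println!("predicted class: {}", predict(&features, &weights, &biases));
}
```

Because every n-gram lands in one of a fixed number of buckets, the model needs only one weight vector of length 4,096 per language, regardless of how many distinct n-grams appear in the input. The price is that unrelated n-grams can collide in the same bucket.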

Comparison with Whatlang

To see how Whichlang compares with Whatlang, another Rust language detection crate, benchmarks were run on throughput and accuracy. Whichlang comes out more than ten times faster (roughly 12x on long strings and 64x on short ones) and about five percentage points more accurate on average (97.03% versus 91.69%).

Throughput Benchmarks

Benchmark          Processing Time (µs)  Throughput (MiB/s)
-----------------  --------------------  ------------------
whatlang (short)   16.62                 1.66
whatlang (long)    62.00                 9.42
whichlang (short)  0.26                  105.69
whichlang (long)   5.21                  112.31
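
The throughput figures can be cross-checked with a little arithmetic. The sketch below recomputes the speedup from the processing times and derives the input size each benchmark implies (throughput multiplied by time). The helper names are my own, and the implied input sizes are an inference from the table rather than something the benchmark reports.

```rust
/// Rows of the throughput table: (label, processing time in µs, throughput in MiB/s).
const ROWS: [(&str, f64, f64); 4] = [
    ("whatlang (short)", 16.62, 1.66),
    ("whatlang (long)", 62.00, 9.42),
    ("whichlang (short)", 0.26, 105.69),
    ("whichlang (long)", 5.21, 112.31),
];

/// Bytes processed per call, implied by a processing time and a throughput.
fn implied_input_bytes(time_us: f64, mib_per_s: f64) -> f64 {
    mib_per_s * 1024.0 * 1024.0 * time_us * 1e-6
}

/// How many times faster the second timing is than the first.
fn speedup(slow_us: f64, fast_us: f64) -> f64 {
    slow_us / fast_us
}

fn main() {
    for &(label, time_us, mib_per_s) in ROWS.iter() {
        println!(
            "{:<18} ~{:.0} bytes per call",
            label,
            implied_input_bytes(time_us, mib_per_s)
        );
    }
    println!("short strings: {:.1}x faster", speedup(16.62, 0.26));
    println!("long strings:  {:.1}x faster", speedup(62.00, 5.21));
}
```

The speedups work out to roughly 64x on short strings and 12x on long ones. Both crates imply the same input sizes (about 29 bytes for the short case and a little over 600 bytes for the long case), which suggests the two were measured on the same inputs.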

Accuracy Benchmarks

Crate: Whatlang

LANG        AVG
----------  -------
Arabic      99.68%
Mandarin    96.09%
German      88.57%
English     85.99%
French      90.88%
... and many more
AVG         91.69%

Crate: Whichlang

LANG        AVG
----------  -------
Arabic      100.00%
Mandarin    98.65%
German      94.20%
English     97.15%
French      97.59%
... and more
AVG         97.03%

Troubleshooting

If you encounter issues while using Whichlang, here are some troubleshooting tips:

  • If you’re experiencing low accuracy, check that your input is long enough and reasonably clean. Very short strings give the model few n-grams to work with, and accuracy depends on input size.
  • Check that the version of the crate you depend on is compatible with your Rust toolchain.
  • Refer to the official documentation for further clarifications.
  • For any unresolved problems, you can reach out to the community for support.

