How to Preserve Regional Dialects with AI: The JEJUMA-002 Project

Aug 6, 2024 | Educational

Regional dialects in Korea are fading quickly. The JEJUMA-002 project uses a large language model to help maintain and promote these dialects so that they are not lost to time.

Why Was This Project Started?

Rapid Decline of Regional Dialects: Many local dialects are disappearing as fewer young people speak them.
Preference for Standard Language: As contact between regions has increased, younger speakers tend to favor the standard language over their local dialect.
Weak Language Models: Online text is mostly standard Korean, so regional dialects are underrepresented in training data and existing models handle them poorly.

How Do We Address This Issue?

The JEJUMA-002 project uses a language model to translate regional dialects and thereby help preserve them:

Convert hard-to-understand dialect into standard Korean so that its meaning is not forgotten.
Generate dialect versions of standard Korean phrases to make the dialects easier to understand.
Detect which region's dialect an input belongs to.
Start from a pre-trained model so that what it has already learned can be adapted to dialects.

Understanding the Developed Language Model

Think of JEJUMA-002 as a translator working through a maze of dialects. Each turn in the maze is a distinct regional expression, and the translator's job is to map it onto a common path: the standard language. The model is built on Llama 3.1 and trained on approximately 500,000 dialect-to-standard pairs, which lets it keep the meaning of each dialect expression while making it accessible to speakers of standard Korean.
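
To make the idea of a dialect-to-standard pair concrete, the sketch below shows one way such a pair could be turned into a chat-style training record. The record layout, the `make_training_record` helper, and the reuse of the inference prompt wording are all assumptions for illustration; the project's actual training format is not described here.

```python
import json


def make_training_record(dialect_text, standard_text, region):
    """Build one hypothetical chat-format record from a dialect/standard pair."""
    # Assumption: training prompts mirror the wording of the inference template.
    prompt = (
        f"Convert the following sentence or word which is {region}'s dialect "
        f"to standard Korean: {dialect_text}"
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": standard_text},
        ]
    }


record = make_training_record(
    "자이 폴에 독솔 막 난 거 보난 언 생이우다",
    "재 팔에 닭살이 막 난 거 보니, 추운 모양이다.",
    "jeju",
)
# ensure_ascii=False keeps the Korean text readable in the printed JSON.
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Keeping each pair as a user turn plus an assistant turn means the same data could, in principle, be reversed to produce standard-to-dialect examples.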

Examples of the Most Challenging Jeju Dialect Tasks

Jeju Dialect to Standard Korean

| Source | Sentence |
| --- | --- |
| **Input Sentence** | **자이 폴에 독솔 막 난 거 보난 언 생이우다** |
| **Correct Answer** | **재 팔에 닭살이 막 난 거 보니, 추운 모양이다.** |
| Upstage Solar Output | 그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다. |
| Naver HCX Output | 재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다. |
| GPT-4o Output | 저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다. |
| **JEJUMA-001 Output** | 재 팔에 닭살이 막 난 거 보니, 추운 모양이다. |
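
One simple way to put numbers on the Jeju-to-standard comparison is a character-level similarity score against the correct answer. The sketch below uses Python's standard `difflib.SequenceMatcher`; this metric is an assumption chosen for illustration, not the evaluation method used by the project, and the `similarity` helper is hypothetical.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Character-level similarity in [0, 1] between two whitespace-stripped strings."""
    return SequenceMatcher(None, candidate.strip(), reference.strip()).ratio()


reference = "재 팔에 닭살이 막 난 거 보니, 추운 모양이다."
system_outputs = {
    "Upstage Solar": "그 바위에 뱀이 나타나는 걸 보니까 정말 놀랐다.",
    "Naver HCX": "재의 풀에 독초가 마구 난 것을 보니 어린 소나무입니다.",
    "GPT-4o": "저기 바위에 독사가 막 나타난 걸 보니까 정말 놀랐다.",
    "JEJUMA-001": "재 팔에 닭살이 막 난 거 보니, 추운 모양이다.",
}

# Score every system against the reference and list them from best to worst.
scores = {name: similarity(text, reference) for name, text in system_outputs.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.2f}")
```

A character-level ratio only measures surface overlap, so it rewards exact matches but cannot tell whether a differently worded translation is also correct.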

Standard Korean to Jeju Dialect

| Source | Sentence |
| --- | --- |
| **Input Sentence** | **귤나무에 그냥 가서 너네 아버지좀 찾아와라.** |
| **Dataset Reference** | **미깡낭 경 가심 너네 아방 좀 데령** |
| Upstage Solar Output | 귤 나무에 가서 네 아버지를 좀 찾아와. |
| Naver HCX Output | 귤낭에 강 느네 아방 좀 데령오라. |
| GPT-4o Output | 귤나무에 걍 가서 햄신 아방 좀 찾아와라. |
| **JEJUMA-001 Output** | 미깡낭 경 가심 너네 아방 좀 촞아오라. |

How to Use the Model?

The model is queried through four predefined prompt templates, one per task:

Dialect to Standard: Convert dialect into standard Korean.
Standard to Dialect: Transform standard Korean into regional dialect.
Detect Dialect: Identify the dialect of the provided text.
Detect and Convert: Identify and convert the dialect into standard Korean.

The following example loads the model, defines the four templates, and converts a Jeju sentence into standard Korean:

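import transformers
import torch

model_id = "JEJUMA/JEJUMA-002"

# Load the model as a text-generation pipeline in bfloat16;
# device_map="auto" places the weights on the available devices.
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

# Stop generating at either the end-of-sequence token or the end-of-turn token.
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]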

class DialectPromptTemplate:
    @staticmethod
    def dialect_to_standard(text, region):
        return [{"role":"user", "content":f"Convert the following sentence or word which is {region}'s dialect to standard Korean: " + text},]

    @staticmethod
    def standard_to_dialect(text, region):
        return [{"role":"user", "content":f"Convert the following sentence or word which is standard Korean to {region}'s dialect: " + text},]

    @staticmethod
    def detect_dialect(text):
        return [{"role":"user", "content":"Detect the following sentence or word is standard, jeju, chungcheong, gangwon, gyeongsang, or jeonla's dialect: " + text},]

    @staticmethod
    def detect_dialect_and_convert(text):
        return [{"role":"user", "content":"Detect the following sentence or word is which dialect and convert the following sentence or word to standard Korean:" + text},]

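# Convert a Jeju sentence to standard Korean; the low temperature keeps sampling nearly deterministic.
outputs = pipeline(
    DialectPromptTemplate.dialect_to_standard("자이 폴에 독솔 막 난 거 보난 언 생이우다", "jeju"),
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.1,
    top_p=0.9,
)

# The last message in the returned conversation is the model's reply (a role/content dict).
print(outputs[0]["generated_text"][-1])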
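
The example prints the whole reply message, and it accepts any string as the region. The sketch below adds two small helpers that may be convenient in practice: one that pulls just the reply text out of the pipeline result, and one that checks a region name. Both `reply_text` and `check_region` are hypothetical helpers, and the region list is an assumption taken from the names in the detection prompt; it runs against a stand-in result so no model download is needed.

```python
# Assumption: valid region names are the ones listed in the detection prompt.
KNOWN_REGIONS = ("jeju", "chungcheong", "gangwon", "gyeongsang", "jeonla")


def check_region(region):
    """Normalize a region name and reject names absent from KNOWN_REGIONS."""
    normalized = region.strip().lower()
    if normalized not in KNOWN_REGIONS:
        raise ValueError(f"Unknown region {region!r}; expected one of {KNOWN_REGIONS}")
    return normalized


def reply_text(outputs):
    """Return the assistant's text from a chat-style text-generation result."""
    last_message = outputs[0]["generated_text"][-1]
    return last_message["content"].strip()


# Stand-in for a real pipeline result: a list holding one conversation.
fake_outputs = [{
    "generated_text": [
        {"role": "user", "content": "Convert the following sentence or word which is jeju's dialect to standard Korean: 자이 폴에 독솔 막 난 거 보난 언 생이우다"},
        {"role": "assistant", "content": "재 팔에 닭살이 막 난 거 보니, 추운 모양이다."},
    ]
}]

print(check_region(" Jeju "))
print(reply_text(fake_outputs))
```

With a real pipeline, `reply_text(outputs)` would be called on the value returned by `pipeline(...)` in place of `fake_outputs`.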

Future Plans

The next release, JEJUMA-003, aims to cover all regional dialects, including varieties not yet supported such as the Yanbian dialect and the North Korean language.

Troubleshooting Ideas

If you encounter issues while using the JEJUMA-002 model, here are a few troubleshooting tips:

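Ensure your input is correctly formatted: pass the message list returned by one of the DialectPromptTemplate methods, and use a region name that matches the dialect of the text.
Check the model's configuration requirements: the example loads the weights in bfloat16 with device_map="auto", which requires the accelerate package to be installed.
Consult the setup instructions to verify that your environment is configured correctly.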
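
A quick way to act on the environment tips is to check that the packages the example relies on can be imported. The sketch below is one possible check, not an official diagnostic; the `check_environment` helper is hypothetical, and the package list assumes the example's use of `device_map="auto"`, which depends on `accelerate`.

```python
import importlib.util


def check_environment():
    """Report whether the packages used by the example are importable."""
    report = {}
    for package in ("torch", "transformers", "accelerate"):
        # find_spec returns None for a missing top-level package instead of raising.
        report[package] = importlib.util.find_spec(package) is not None
    # Only query CUDA when torch is present, so the check runs on any machine.
    if report["torch"]:
        import torch
        report["cuda_available"] = bool(torch.cuda.is_available())
    else:
        report["cuda_available"] = False
    return report


report = check_environment()
for name, ok in report.items():
    print(f"{name}: {ok}")
```

If `cuda_available` is `False`, the model may still load on CPU, but generation is likely to be slow.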
