How to Use Reader-LM for HTML to Markdown Conversion

Oct 28, 2024 | Educational

In our digital age, content comes in various formats. One frequent challenge is converting HTML content into Markdown, an essential step for simpler editing and for preserving a webpage’s structure in a lightweight, readable form. Fortunately, with Reader-LM, this process becomes smoother and more efficient. This blog will guide you through the steps to implement Reader-LM for converting HTML to Markdown.

What is Reader-LM?

Reader-LM is a series of small language models trained specifically to convert HTML content to Markdown. This capability is ideal for content conversion tasks, especially when working with rich, real-world web content: the models learn to strip away noise such as navigation, scripts, and boilerplate, keeping only the meaningful text as clean Markdown.

How to Get Started

Using Google Colab

The most user-friendly way to try out Reader-LM is through Google Colab. This platform allows you to utilize powerful GPU resources for free. Here’s how to do it:

  1. Run the Colab notebook to see a live demonstration of using reader-lm-1.5b to convert HTML content (like from the HackerNews website) to Markdown.
  2. You can also modify the URL in the notebook to test different HTML sources or switch between the two models: reader-lm-0.5b and reader-lm-1.5b.
  3. Note: Simply provide the raw HTML as input—there’s no need to prefix with instructions.
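For a concrete sense of what the notebook feeds the model, here is a minimal sketch, assuming you fetch the raw HTML yourself with the requests library (the URL is just an illustrative example, not something fixed by the notebook):

    import requests

    # fetch the raw HTML of a page to convert (example URL)
    url = "https://news.ycombinator.com/"
    raw_html = requests.get(url).text

    # reader-lm takes this raw HTML directly, with no instruction prefix
    print(raw_html[:500])  # preview the first 500 characters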

Local Installation

If you prefer running Reader-LM locally, follow the steps below:

  • First, install the transformers library using pip:

    pip install transformers==4.43.4

  • Now, you can implement the model in your Python environment. Here’s how:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    checkpoint = "jinaai/reader-lm-1.5b"
    device = "cuda"  # or "cpu" if you're not using a GPU

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

    # example html content
    html_content = "<html><body><h1>Hello, world!</h1></body></html>"

    # wrap the raw HTML in a chat message and apply the model's chat template
    messages = [{"role": "user", "content": html_content}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False)

    # tokenize the prompt, generate the Markdown, and decode it
    inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_new_tokens=1024, temperature=0, do_sample=False, repetition_penalty=1.08)
    print(tokenizer.decode(outputs[0]))

Explaining the Code Using an Analogy

Think of Reader-LM as a translator in a busy airport. Just as a translator helps passengers communicate from one language to another, Reader-LM converts HTML, the native language of web pages, into Markdown, the language of simplified content formatting.

The code provided is like the translator’s set of instructions:

  • tokenizer is akin to the translator’s dictionary, helping to understand the meanings of terms.
  • model plays the role of the translator, converting one format (HTML) into another (Markdown).
  • The input_text is like the question asked to the translator, the raw content needing translation.
  • Finally, outputs represent the translated text—the final Markdown output ready for use.

Troubleshooting

If you encounter any issues while using Reader-LM, here are some common troubleshooting tips:

  • Installation Errors: Ensure you have the correct version of transformers installed. Use the command provided above.
  • Model Loading Issues: Check your internet connection, as the models are fetched from the cloud.
  • Memory Errors: If running locally, ensure you have enough available GPU/CPU resources to handle the model size.
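
If memory is the bottleneck, one common mitigation (assuming a PyTorch backend) is to load the model weights in half precision, which roughly halves their memory footprint:

    import torch
    from transformers import AutoModelForCausalLM

    # load the weights in bfloat16 instead of float32 to reduce memory usage
    model = AutoModelForCausalLM.from_pretrained(
        "jinaai/reader-lm-1.5b", torch_dtype=torch.bfloat16
    ).to("cuda")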

For more insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

Reader-LM empowers developers and content creators alike to seamlessly convert HTML content into Markdown. By following this guide, you’re now equipped to enhance your content editing workflow effectively. At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
