Constrained Value-Aligned LLM via Safe RLHF

Beaver is a highly modular open-source RLHF framework developed by the PKU-Alignment team at Peking University. It aims to provide training data and a reproducible code pipeline for alignment research, especially research on constrained LLM alignment via Safe RLHF methods.

Key Features of Beaver

  • Support for **SFT**, **RLHF**, and **Safe RLHF** training for popular pre-trained models: LLaMA, OPT, Baichuan, etc.
  • Provides a large human-labeled dataset (up to 1M preference pairs) covering both helpfulness and harmlessness to support reproducible RLHF research (a loading sketch follows this list).
  • Supports training for Reward Model and Cost Model, with pre-trained checkpoints.
  • Allows customized parameters and datasets for SFT and RLHF.
  • Offers multi-scale metrics for verifying safety constraints, e.g., BIG-bench and GPT-4 evaluation.
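
As referenced in the dataset bullet above, the sketch below loads the PKU-SafeRLHF preference data from Hugging Face and prints one record. It assumes the `datasets` library is installed; the exact field names depend on the dataset version, so treat the printed schema as the source of truth rather than this sketch.

```python
# Minimal sketch: inspect the PKU-SafeRLHF preference data from Hugging Face.
# The record layout varies between dataset versions, so print the schema
# rather than hard-coding field names.
from datasets import load_dataset

dataset = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")
print(dataset.column_names)  # actual schema of this dataset version
print(dataset[0])            # one prompt with paired responses and safety labels
```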

What’s New?

  • 2024-06-13: Open-sourced version 1.0 of the PKU-SafeRLHF dataset, featuring human-AI joint annotations and expanded harm categories.
  • 2024-01-16: The Safe RLHF method was accepted as a Spotlight at ICLR 2024.
  • 2023-10-19: Released the Safe RLHF paper on arXiv, detailing the new safe alignment algorithm.
  • 2023-07-10: Open-sourced the Beaver-7B v1, v2, and v3 models on Hugging Face.
  • 2023-05-15: Introduced the Safe RLHF pipeline and its evaluation results.

Understanding the Code: An Analogy

Consider training a model like training a puppy. You want your puppy to learn helpful tricks (like fetching a stick) and avoid harmful behaviors (like chewing on shoes). In this setup, the reward function acts as treats for positive behavior, while the cost function acts as penalties that discourage negative behavior.
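
To make the analogy concrete: Safe RLHF maximizes reward while keeping expected cost under a budget, typically by introducing a Lagrange multiplier that controls how heavily penalties outweigh treats. The sketch below is only a schematic of that idea under assumed names (`lagrangian_objective`, `cost_budget`), not the project's actual training loop.

```python
import torch
import torch.nn.functional as F

# Schematic of a Lagrangian-style Safe RLHF objective (illustrative only):
# reward scores are the "treats", cost scores are the "chewed shoes", and the
# multiplier lambda decides how strongly penalties count against treats.
def lagrangian_objective(rewards, costs, log_lambda, cost_budget=0.0):
    lam = F.softplus(log_lambda)                     # keep the multiplier non-negative
    policy_signal = (rewards - lam * costs).mean()   # maximize reward minus weighted cost
    # Minimizing this term raises lambda when the cost budget is exceeded
    # and lowers it when the policy is comfortably within budget.
    lambda_loss = -lam * (costs.mean().detach() - cost_budget)
    return policy_signal, lambda_loss

# Hypothetical per-sample scores from a reward model and a cost model.
rewards = torch.tensor([1.2, 0.8, 0.5])
costs = torch.tensor([0.1, 0.9, -0.2])
log_lambda = torch.tensor(0.0, requires_grad=True)
print(lagrangian_objective(rewards, costs, log_lambda))
```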

When we train the model, we use human preferences much as the puppy's owner uses encouraging feedback. For instance:

```bash
# Training for Supervised Fine-Tuning
bash scripts/sft.sh --model_name_or_path your-model-name-or-checkpoint-path --output_dir output_sft

# Training Reward Model
bash scripts/reward-model.sh --model_name_or_path output_sft --output_dir output_rm

# Training Cost Model
bash scripts/cost-model.sh --model_name_or_path output_sft --output_dir output_cm

# Running Safe RLHF
bash scripts/ppo-lag.sh --actor_model_name_or_path output_sft --reward_model_name_or_path output_rm --cost_model_name_or_path output_cm --output_dir output_ppo-lag
```

Here, each bash command corresponds to a phase in the puppy’s training, ensuring it learns to be a valuable member of the family (or, in this case, a capable and safe model). The sequence balances performance and safety by combining supervised fine-tuning with reinforcement learning.
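
The reward-model and cost-model phases above are typically trained on preference pairs with a pairwise ranking loss: the preferred (or safer) response should score higher than the rejected one. The snippet below sketches that standard formulation with hypothetical tensors; it is not taken from the Beaver codebase.

```python
import torch
import torch.nn.functional as F

# Standard pairwise preference loss: push the scalar score of the chosen
# response above the score of the rejected response.
def preference_loss(score_chosen, score_rejected):
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Hypothetical scores for a batch of (chosen, rejected) response pairs.
score_chosen = torch.tensor([0.7, 1.1, 0.2])
score_rejected = torch.tensor([0.1, 0.9, 0.4])
print(preference_loss(score_chosen, score_rejected))  # lower loss = better ranking
```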

Troubleshooting

If you run into issues during installation or while executing any of the training scripts, consider the following tips:

  • Ensure that your GPU drivers and CUDA toolkit are up to date.
  • Verify that all paths used in scripts are accurate and accessible.
  • Check if Docker is correctly set up if you are using the containerized runner.
  • Ensure that your conda environment or Docker container is activated before executing commands.
  • If you run into out-of-memory errors, consider using DeepSpeed ZeRO-Offload to better manage GPU resources (a quick memory check follows this list).
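
Before reaching for ZeRO-Offload, it can help to confirm how much GPU memory is actually free. The check below is only a diagnostic aid using PyTorch, not part of the Beaver scripts.

```python
import torch

# Quick diagnostic: report free vs. total memory on each visible GPU.
for i in range(torch.cuda.device_count()):
    free_bytes, total_bytes = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {free_bytes / 2**30:.1f} GiB free of {total_bytes / 2**30:.1f} GiB")
```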

For additional insights, updates, or to collaborate on AI development projects, stay connected with fxis.ai.

Conclusion

At fxis.ai, we believe that such advancements are crucial for the future of AI, as they enable more comprehensive and effective solutions. Our team is continually exploring new methodologies to push the envelope in artificial intelligence, ensuring that our clients benefit from the latest technological innovations.
