I implemented a system for classifying legal documents into constitutional categories using LegalBERT embeddings and a hybrid Tree-GRPO (Group Relative Policy Optimization) reinforcement learning framework. Inspired by “Tree Search for LLM Agent Reinforcement Learning” (Yuxiang Ji et al., 2025), I adapted hierarchical tree-style exploration for legal text classification.
Key aspects:
Domain-specific embeddings: LegalBERT captures nuanced legal language; embeddings are frozen for efficiency while providing high-quality semantic features.
Robust preprocessing: Long documents split into overlapping 256-word chunks (50-word overlap); embeddings precomputed in batches.
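The chunking step above can be sketched as follows; the function name and exact boundary handling (a shorter final chunk) are my assumptions, since the original code is not shown.

```python
def chunk_document(text: str, chunk_size: int = 256, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word chunks.

    Consecutive chunks share `overlap` words, so the stride between
    chunk starts is chunk_size - overlap (206 with the defaults).
    """
    words = text.split()
    stride = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # final (possibly shorter) chunk reached
        start += stride
    return chunks
```

With the defaults, a 500-word document yields three chunks starting at words 0, 206, and 412, with 50 shared words at each boundary.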
Centroid-based reward: Class centroids computed from training embeddings; the reward is the cosine similarity between a document's averaged chunk embedding and the centroid of the predicted class.
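A minimal numpy sketch of the centroid reward; function names are mine, not from the original implementation.

```python
import numpy as np

def class_centroids(embeddings: np.ndarray, labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Mean training embedding per class, stacked into (num_classes, dim)."""
    return np.stack([embeddings[labels == c].mean(axis=0) for c in range(num_classes)])

def centroid_reward(doc_embedding: np.ndarray, predicted_class: int, centroids: np.ndarray) -> float:
    """Cosine similarity between a document embedding and its predicted class centroid."""
    c = centroids[predicted_class]
    denom = np.linalg.norm(doc_embedding) * np.linalg.norm(c) + 1e-8
    return float(doc_embedding @ c / denom)
```

A correct prediction that lands near its class centroid earns a reward close to 1; a prediction pointing at the wrong centroid earns a reward near 0 (or below, for dissimilar embeddings).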
Hierarchical advantage aggregation:
Intra-document: Chunk embeddings averaged for document-level representation.
Inter-document: Advantages normalized using batch-level stats.
Combined: Weighted blend (90% intra, 10% inter).
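One plausible reading of the blend above, sketched here as per-document (intra) and batch-level (inter) normalization of chunk rewards, combined 90/10. The exact definition of the intra term is my assumption; the original may normalize at the document level differently.

```python
import numpy as np

def hierarchical_advantages(chunk_rewards, doc_ids, w_intra=0.9, w_inter=0.1, eps=1e-8):
    """Blend per-document and batch-level advantage normalization.

    chunk_rewards: 1-D array of per-chunk rewards
    doc_ids: parallel array mapping each chunk to its document
    """
    chunk_rewards = np.asarray(chunk_rewards, dtype=float)
    doc_ids = np.asarray(doc_ids)
    # Intra-document: normalize each document's chunks against its own stats.
    intra = np.empty_like(chunk_rewards)
    for d in np.unique(doc_ids):
        m = doc_ids == d
        intra[m] = (chunk_rewards[m] - chunk_rewards[m].mean()) / (chunk_rewards[m].std() + eps)
    # Inter-document: normalize against batch-level statistics.
    inter = (chunk_rewards - chunk_rewards.mean()) / (chunk_rewards.std() + eps)
    return w_intra * intra + w_inter * inter
```

Because both components are mean-centered, the blended advantages sum to (approximately) zero over the batch, which keeps the policy update stable.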
Hybrid loss: Combines cross-entropy, a PPO-style policy loss with ratio clipping, and entropy regularization; no separate value network is needed.
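A scalar numpy sketch of that hybrid objective, treating the predicted class as the policy's action. Weights, the clip range, and the choice of label as the action are my assumptions; a real implementation would use an autograd framework so the loss is differentiable.

```python
import numpy as np

def hybrid_loss(logits, labels, old_log_probs, advantages,
                ce_weight=1.0, pg_weight=1.0, ent_weight=0.01, clip_eps=0.2):
    """Cross-entropy + clipped PPO-style policy loss + entropy bonus.

    No value network: `advantages` are assumed to come from the
    centroid-reward pipeline rather than a learned critic.
    """
    # Numerically stable softmax over classes.
    z = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    log_probs = np.log(probs + 1e-12)
    n = len(labels)
    # Cross-entropy on the true labels.
    ce = -log_probs[np.arange(n), labels].mean()
    # PPO ratio clipping on the chosen actions (here: the labels).
    new_lp = log_probs[np.arange(n), labels]
    ratio = np.exp(new_lp - old_log_probs)
    clipped = np.clip(ratio, 1 - clip_eps, 1 + clip_eps)
    pg = -np.minimum(ratio * advantages, clipped * advantages).mean()
    # Entropy regularization discourages premature collapse onto one class.
    entropy = -(probs * log_probs).sum(axis=1).mean()
    return float(ce_weight * ce + pg_weight * pg - ent_weight * entropy)
```

With the old and new policies equal, the ratio is 1 and the policy term reduces to the negated advantage, so larger advantages lower the loss as expected.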
Document-level prediction: Aggregates chunk-level probabilities for robust classification.
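The aggregation step can be sketched in a few lines; averaging chunk probabilities before the argmax is the reading I assume here.

```python
import numpy as np

def predict_document(chunk_probs: np.ndarray) -> tuple[int, np.ndarray]:
    """Average chunk-level class probabilities, then take the argmax.

    chunk_probs: (num_chunks, num_classes) array of per-chunk probabilities.
    Returns the predicted class and the averaged distribution.
    """
    doc_probs = chunk_probs.mean(axis=0)
    return int(np.argmax(doc_probs)), doc_probs
```

Averaging before the argmax lets a few confident chunks outvote noisy ones, which is the point of the "robust classification" claim.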
Takeaways:
Handles long, noisy legal texts efficiently without fine-tuning large models.
Provides interpretable metrics and training stability via hierarchical advantage normalization.
Extends GRPO principles to hierarchical document structures, combining frozen embeddings with reinforcement learning for domain-specific classification.