
Optimizing LLM Prompts Using Nomadic Platform's Reinforcement Learning Framework

Implementation Overview

Algorithm Description

The RL Prompt Optimizer employs a reinforcement learning framework to iteratively improve prompts used for language model evaluations. At each episode, the agent selects an action to modify the current prompt based on the state representation, which encodes features of the prompt. The agent receives rewards based on a multi-metric evaluation of the model's responses, encouraging the development of prompts that elicit high-quality answers.
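
A minimal sketch of this loop, assuming a tabular Q-learning agent with an ε-greedy policy; the function name, the callable action set, and the encode_state/evaluate hooks are illustrative stand-ins rather than the optimizer's actual API:

    import random

    def optimize_prompt(initial_prompt, actions, evaluate, encode_state, episodes,
                        alpha=0.1, gamma=0.95, epsilon=0.1, eps_decay=0.99, eps_min=0.01):
        """Tabular Q-learning over prompt edits.

        actions      -- callables, each mapping a prompt string to an edited prompt
        evaluate     -- callable returning a scalar reward for a prompt (e.g. the weighted metric score)
        encode_state -- callable mapping a prompt to a hashable feature representation
        """
        q_table = {}
        prompt = initial_prompt
        for _ in range(episodes):
            state = encode_state(prompt)
            # ε-greedy selection: occasionally explore a random edit, otherwise exploit the best-known one
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: q_table.get((state, a), 0.0))
            prompt = action(prompt)
            reward = evaluate(prompt)
            next_state = encode_state(prompt)
            best_next = max((q_table.get((next_state, a), 0.0) for a in actions), default=0.0)
            old_q = q_table.get((state, action), 0.0)
            # standard Q-learning update with learning rate α and discount factor γ
            q_table[(state, action)] = old_q + alpha * (reward + gamma * best_next - old_q)
            epsilon = max(eps_min, epsilon * eps_decay)
        return prompt, q_table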

Hyperparameters & Configuration

Model: GPT-3.5-turbo
Learning Rate (α): 0.1 and 0.05 (each strategy was run with both values)
Discount Factor (γ): 0.95
Initial ε: 0.1
ε Decay Factor: 0.99
Minimum ε: 0.01
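
With these values, exploration decays geometrically each episode and is floored at the minimum; a quick sanity check of the schedule (the 50-episode horizon is chosen only to match the reported average convergence time):

    # geometric ε decay with the settings above: ε_n = max(0.01, 0.1 * 0.99**n)
    eps = 0.1
    for episode in range(50):
        eps = max(0.01, eps * 0.99)
    print(round(eps, 4))  # ≈ 0.0605 after 50 episodes; the 0.01 floor only binds after ~229 episodes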

Reward Structure


        weights = {
            "faithfulness": 0.4,  # Context adherence
            "correctness": 0.3,   # Response accuracy
            "relevance": 0.2,     # Query relevance
            "clarity": 0.1        # Comprehensibility
        }
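
One way to turn these weights into a scalar reward is a weighted sum of per-metric scores; a minimal sketch, assuming each metric has already been scored in [0, 1] (the example scores are illustrative, not measured values):

    def combined_reward(scores, weights):
        # weighted sum; since the weights sum to 1.0, the reward also stays in [0, 1]
        return sum(weight * scores[metric] for metric, weight in weights.items())

    example_scores = {"faithfulness": 0.8, "correctness": 0.7, "relevance": 0.9, "clarity": 0.6}
    print(combined_reward(example_scores, weights))  # 0.4*0.8 + 0.3*0.7 + 0.2*0.9 + 0.1*0.6 = 0.77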

Performance Summary

Best Score Achieved: 0.733
Average Convergence Time: 50.0 episodes
Mean Q-Value: 0.175

Results Visualization

The following interactive visualizations illustrate various aspects of the RL prompt optimization process:

  1. Prompts Overview: Displays both the initial and current prompts for each experiment, updating dynamically as prompts change over episodes.
  2. Learning Progress: Plots the score achieved in each episode, showing the agent's learning curve (a plotting sketch follows after this list).
  3. Prompt Evolution: Tracks the length of the prompt over episodes, reflecting how the prompt changes over time.
  4. Action Selection Counts: Visualizes the frequency of each action selected by the agent, showing the agent's strategy.
  5. Q-values Over Time: Illustrates the average Q-values across episodes, indicating the agent's confidence in its actions.
  6. Model Outputs Over Iterations: Shows the length of model outputs over iterations as an animated line plot.
  7. Actions Taken per Episode: Displays the specific action taken by the agent at each episode.

[Interactive charts shown for four configurations: Strategy 1 (α=0.1), Strategy 1 (α=0.05), Strategy 2 (α=0.1), Strategy 2 (α=0.05)]
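
The report renders these as interactive charts; the sketch below shows how the Learning Progress panel (item 2 above) could be reproduced with plotly, assuming per-episode scores are collected per configuration (the numbers here are placeholders, not the reported results):

    import plotly.graph_objects as go

    # placeholder per-episode scores; in the report each strategy/α pair is a separate trace
    episode_scores = {
        "Strategy 1 (α=0.1)": [0.20, 0.41, 0.55, 0.62, 0.73],
        "Strategy 1 (α=0.05)": [0.18, 0.30, 0.44, 0.51, 0.60],
    }

    fig = go.Figure()
    for label, scores in episode_scores.items():
        fig.add_trace(go.Scatter(y=scores, mode="lines+markers", name=label))
    fig.update_layout(title="Learning Progress", xaxis_title="Episode", yaxis_title="Score")
    fig.show()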

Combined Metrics Visualization

Key Findings