A Complete Guide to Large-Model Training: SFT, DPO, PPO, and GRPO
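Supervised Fine-Tuning (SFT)
Supervised fine-tuning is the foundational stage of language-model training: it adjusts the parameters of a pretrained model using labeled data. Training minimizes a cross-entropy loss, which improves the model's token-by-token prediction of the target response given the input. Typical datasets consist of instruction-response pairs, for example data in the Alpaca or FLAN format.
The loss function is: \[ \mathcal{L}_{\text{SFT}} = -\sum_{t} \log p(y_t \mid y_{<t}, x) \] where \(x\) is the input instruction and \(y_t\) is the target token at time step \(t\).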
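The SFT loss above can be checked by hand on a toy example. The following is a minimal sketch in plain Python, not a training implementation: the helper names `log_softmax` and `sft_loss` and the 3-token vocabulary are assumptions made for illustration, and the logits at each step are taken to be already conditioned on \(x\) and \(y_{<t}\).

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(step_logits, target_ids):
    """Sum of -log p(y_t | y_<t, x) over the response positions.

    step_logits[t] holds the vocabulary logits at response step t
    (hypothetical values standing in for a model's output).
    """
    assert len(step_logits) == len(target_ids)
    return -sum(log_softmax(logits)[y] for logits, y in zip(step_logits, target_ids))

# Toy 3-token vocabulary, 2 response steps, targets are tokens 1 and 2.
uniform = sft_loss([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]], [1, 2])
confident = sft_loss([[0.0, 5.0, 0.0], [0.0, 0.0, 5.0]], [1, 2])
```

With uniform logits each step contributes \(\log 3\) to the loss, and the loss drops as the model puts more probability on the target tokens.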
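Direct Preference Optimization (DPO)
DPO uses preference data in place of the explicit reward model used in reinforcement learning. It relies on pairs of preferred and dispreferred responses \((y_w, y_l)\) from an offline dataset and optimizes the policy parameters directly. The core idea is to turn preference learning into a classification problem whose loss is built from the Bradley-Terry model.
The DPO loss for one preference pair is: \[ \mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \] where \(\sigma\) is the logistic sigmoid, \(\beta\) is a temperature-like coefficient that controls how strongly the policy is tied to the reference, and \(\pi_{\text{ref}}\) is the reference policy (usually the frozen SFT model).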
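To make the DPO loss concrete, here is a small sketch that evaluates it for a single preference pair. It assumes the four inputs are summed token log-probabilities of each full response under the policy and the reference; the function name `dpo_loss`, the default `beta=0.1`, and the numeric values are illustrative assumptions, not values from the text.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    # beta * [log(pi/pi_ref)(y_w) - log(pi/pi_ref)(y_l)]
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) computed as softplus(-margin) for stability
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: the margin is zero.
loss_at_ref = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy raised the chosen response and lowered the rejected one.
loss_better = dpo_loss(-10.0, -17.0, -12.0, -15.0)
```

When the policy equals the reference the loss is \(\log 2\), the value of a classifier that cannot tell the two responses apart; it falls as the policy shifts probability toward \(y_w\) relative to the reference.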
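Proximal Policy Optimization (PPO)
PPO combines policy gradients with the idea of trust-region optimization and stabilizes training through a clipping mechanism. It has two key components: value-function estimation and the policy update. The algorithm computes advantages with Generalized Advantage Estimation (GAE) and alternates between optimizing the policy and the value network.
The policy objective, which is maximized, is: \[ \mathcal{L}^{\text{CLIP}}_\theta = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right] \] where \(r_t(\theta)\) is the probability ratio between the new and old policies, \(\hat{A}_t\) is the advantage estimate, and \(\epsilon\) is the clipping threshold.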
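The advantage estimates \(\hat{A}_t\) that feed the clipped objective are typically produced by GAE. The sketch below shows one common way to compute them for a single episode; the function name `gae`, the defaults `gamma=0.99` and `lam=0.95`, and the toy rewards and values are assumptions for illustration.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one episode.

    values has len(rewards) + 1 entries; the last entry is the bootstrap
    value of the state after the final step (0.0 if that state is terminal).
    """
    assert len(values) == len(rewards) + 1
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual at time t
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future residuals
        running = delta + gamma * lam * running
        advantages[t] = running
    # Regression targets for the value network
    returns = [a + v for a, v in zip(advantages, values[:-1])]
    return advantages, returns

# Sparse reward at the last step, constant value estimate of 0.5.
adv, ret = gae([0.0, 0.0, 1.0], [0.5, 0.5, 0.5, 0.0], gamma=1.0, lam=1.0)
```

With `gamma=1.0` and `lam=1.0` the estimator reduces to the Monte Carlo return minus the value baseline, so every step in this toy episode gets an advantage of 0.5 and a return target of 1.0. Smaller `lam` values lean more on the value network and reduce variance at the cost of bias.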
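Group Relative Policy Optimization (GRPO)
GRPO extends the PPO framework by replacing the learned value function with a group-relative baseline. For each prompt, the policy samples a group of \(G\) responses; each response is scored, and its advantage is its reward normalized by the mean and standard deviation of the rewards within the group. A KL-divergence penalty against a reference policy keeps the updated policy from drifting too far. Because no value network is trained, GRPO needs less memory and compute than PPO, and it fits tasks whose responses can be scored automatically, such as math or code with verifiable answers.
The GRPO objective, written at the response level for brevity, is: \[ \mathcal{L}_{\text{GRPO}} = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\left(r_i(\theta)\hat{A}_i, \text{clip}(r_i(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_i\right) - \lambda D_{\text{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}})\right], \qquad \hat{A}_i = \frac{R_i - \text{mean}(R_1, \dots, R_G)}{\text{std}(R_1, \dots, R_G)} \] where \(G\) is the number of responses sampled per prompt, \(R_i\) is the reward of response \(i\), \(r_i(\theta)\) is its policy ratio, and \(\lambda\) is the weight of the KL penalty. Implementations typically apply the ratio and the clipping per token.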
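In GRPO as it is commonly described, the advantage of each response comes from normalizing rewards within the group of responses sampled for the same prompt, rather than from a value network. The sketch below illustrates that normalization step only; the function name, the small `eps` guard against a zero standard deviation, and the use of the population standard deviation are assumptions, and real implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled responses to one prompt, scored 1.0 if correct else 0.0.
adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# A group where every response got the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative advantages of equal magnitude. When all rewards in a group are identical, every advantage is zero, so that prompt contributes no policy gradient.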
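Comparison of Training Pipelines
- Data requirements: SFT relies on labeled demonstrations, DPO needs preference pairs, and PPO/GRPO rely on online rollouts or pre-collected trajectories.
- Compute cost: SFT is the simplest; PPO/GRPO must collect rollouts, which usually calls for parallelized generation.
- Use cases: SFT aligns basic capabilities, DPO suits learning from static preferences, and PPO/GRPO adapt to dynamic environments.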
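Implementation Notes
- In distributed training, watch for synchronization issues in PPO (for example, keeping rollout workers and the learner on the same policy weights); a framework such as Ray or Horovod is recommended.
- Clean the preference data before DPO training and remove samples with contradictory labels.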
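- Form each GRPO group from multiple responses sampled for the same prompt, so that rewards are compared under the same context; do not mix different tasks or dialogue skills within one group.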
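The data-cleaning note for DPO can be sketched as a simple filter. This example treats a pair as contradictory when the same prompt appears with the two responses labeled in both orders; the function name and the `(prompt, chosen, rejected)` tuple layout are assumptions for illustration, and real pipelines may also need fuzzy matching or annotator-agreement thresholds.

```python
def drop_contradictory_pairs(pairs):
    """Remove preference pairs whose reversed label also appears.

    pairs: iterable of (prompt, chosen, rejected) string triples.
    Degenerate pairs with chosen == rejected are dropped as well,
    because each one is its own reversal.
    """
    pairs = list(pairs)
    seen = set(pairs)
    kept = []
    for prompt, chosen, rejected in pairs:
        if (prompt, rejected, chosen) in seen:
            continue  # conflicting labels: drop both directions
        kept.append((prompt, chosen, rejected))
    return kept

data = [
    ("q1", "a", "b"),
    ("q1", "b", "a"),  # contradicts the first triple
    ("q2", "c", "d"),
]
clean = drop_contradictory_pairs(data)
```

Dropping both directions is the conservative choice; an alternative would be to keep the majority label when a pair has been annotated several times.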
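A typical code structure looks like this:
import torch
from torch.distributions import Categorical

# Core PPO update logic. policy_net and value_net are assumed to be
# torch.nn.Module instances defined elsewhere; actions are discrete.
def update_policy(batch, eps=0.2):
    # old_logp: log-probs of the actions under the rollout policy,
    # stored when the batch was collected; returns: value targets
    obs, acts, advs, returns, old_logp = batch
    dist = Categorical(logits=policy_net(obs))
    new_logp = dist.log_prob(acts)
    ratio = (new_logp - old_logp.detach()).exp()  # r_t(theta)
    surr1 = ratio * advs
    surr2 = ratio.clamp(1 - eps, 1 + eps) * advs
    # Negate the clipped objective because optimizers minimize
    policy_loss = -torch.min(surr1, surr2).mean()
    value_loss = (value_net(obs).squeeze(-1) - returns).pow(2).mean()
    return policy_loss + value_loss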