Autumn Recruitment with No Papers: Shanghai AI Lab, First-Round Interview
Project questions; they didn't dig too deep.
Asked about the differences between PPO, GRPO, and DAPO.
Asked whether I understand how vLLM and SGLang are implemented.
Asked about the design trade-offs between ms-swift and VeRL. I said VeRL is more convenient to work with because each module is clearly separated, whereas ms-swift is so tightly integrated that it's awkward to modify.
A high-level question: if you were given a multi-model post-training task, how would you design the training framework?
I made two points. First, model loading should accommodate different models, so that new ones can be plugged in easily later. Second, the RL part should support different algorithms and let users define and modify them flexibly: for example, if you've implemented PPO, can you easily switch it to GRPO, DAPO, or GSPO? (A rough sketch of what I meant is below.)
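To make this concrete, here is a minimal sketch of the kind of interfaces I had in mind. All the names (ModelAdapter, RLAlgorithm, Trainer) are hypothetical and not taken from any existing framework:

from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    # adapter layer: each new model family only needs to implement loading and log-prob computation
    @abstractmethod
    def load(self, path: str): ...

    @abstractmethod
    def log_prob(self, input_ids, actions): ...

class RLAlgorithm(ABC):
    # algorithm interface: PPO / GRPO / DAPO / GSPO differ only in how they turn
    # rollout statistics (log-probs, rewards) into a loss, so that is the single override point
    @abstractmethod
    def compute_loss(self, new_logp, old_logp, rewards): ...

class Trainer:
    # glue code stays model- and algorithm-agnostic
    def __init__(self, model: ModelAdapter, ref_model: ModelAdapter, algo: RLAlgorithm):
        self.model, self.ref_model, self.algo = model, ref_model, algo

    def train_step(self, batch):
        new_logp = self.model.log_prob(batch["input_ids"], batch["actions"])
        loss = self.algo.compute_loss(new_logp, batch["old_logp"], batch["rewards"])
        return loss  # backward pass and optimizer step would follow here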
The interviewer wasn't very familiar with the model-training side and cared more about framework design.
When I asked about the role later on, they also said the work leans toward infra.
The coding question was to complete a GRPO implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F

class GRPO:
    def __init__(self, policy, ref_policy, lr=1e-5, beta=0.02, eps_clip=0.2):
        self.policy = policy
        self.ref_policy = ref_policy
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
        self.beta = beta
        self.eps_clip = eps_clip

    def compute_loss(self, input_ids, old_logp, rewards, advantages):
        """
        input_ids: [B, T]
        old_logp:  [B, T] log-probs under the old policy
        rewards:   reward-model (RM) rewards
        advantages: GAE advantages
        """
        new_logp = self.policy.log_prob(input_ids)  # [B, T]
        ratio = torch.exp(new_logp - old_logp)      # [B, T]
        # GRPO: group-normalized advantages (4 samples per group)
        B = advantages.size(0)
        group_size = 4
        advantages = (rewards - torch.mean(rewards)) / torch.sqrt(rewards**2)  # wrong, see note 1
        # PPO clipping
        surr1 = ratio * advantages
        surr2 = (0.8, 1.2) * advantages    # wrong, see note 2
        policy_loss = -min(surr1, surr2)   # wrong, see note 2
        # KL penalty
        ref_logp = self.ref_policy(input_ids)
        kl = new_logp / ref_logp - torch.log(ref_logp / new_logp) + ?  # left incomplete, see note 3
        loss = policy_loss + kl
        return loss

    def step(self, input_ids, old_logp, rewards, advantages):
        loss = self.compute_loss(input_ids, old_logp, rewards, advantages)
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
        self.optimizer.step()
        return loss.item()
I got a few things wrong here (the full loss is written out after this list):
1. When computing the advantage, the denominator should be the std (standard deviation), i.e. sqrt(sum((x_i - μ)^2) / N).
2. policy_loss should be -torch.min(ratio*A, clip(ratio, 0.8, 1.2)*A).mean().
3. The KL term should use the k3 estimator: exp(r) - r - 1, with r = log(p_ref) - log(p_new).
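Putting the three fixes together, the loss I was aiming for looks roughly like this (per group of G sampled responses, with β weighting the KL term; this matches the corrected code below):

\[
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_{1..G})}{\operatorname{std}(r_{1..G}) + \varepsilon},
\qquad
\rho_i = \exp\big(\log \pi_\theta(o_i) - \log \pi_{\text{old}}(o_i)\big)
\]
\[
\mathcal{L} = -\frac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i\big)
\;+\; \frac{\beta}{G}\sum_{i=1}^{G}\big(e^{k_i} - k_i - 1\big),
\qquad k_i = \log \pi_{\text{ref}}(o_i) - \log \pi_\theta(o_i)
\]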
import torch
import torch.nn as nn

class GRPO:
    def __init__(self, policy, ref_policy, lr=1e-5, beta=0.02, eps_clip=0.2):
        self.policy = policy
        self.ref_policy = ref_policy
        self.optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
        self.beta = beta
        self.eps_clip = eps_clip

    def compute_loss(self, input_ids, actions, old_logp, rewards, advantages):
        # log-probs under the current policy
        new_logp = self.policy.log_prob(input_ids, actions)  # log_prob must be implemented on the policy
        ratio = torch.exp(new_logp - old_logp)
        # === GRPO group normalization ===
        group_size = 4
        advantages = advantages.view(-1, group_size)
        advantages = (advantages - advantages.mean(dim=1, keepdim=True)) / (advantages.std(dim=1, keepdim=True) + 1e-8)
        advantages = advantages.view(-1)
        # === PPO clipping ===
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
        policy_loss = -torch.min(surr1, surr2).mean()
        # === KL penalty, k3 estimator: exp(r) - r - 1, r = log p_ref - log p_new ===
        with torch.no_grad():  # reference policy is frozen
            ref_logp = self.ref_policy.log_prob(input_ids, actions)
        r = ref_logp - new_logp
        kl = (torch.exp(r) - r - 1).mean()
        # === total loss ===
        loss = policy_loss + self.beta * kl
        return loss

    def step(self, input_ids, actions, old_logp, rewards, advantages):
        loss = self.compute_loss(input_ids, actions, old_logp, rewards, advantages)
        self.optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(self.policy.parameters(), 1.0)
        self.optimizer.step()
        return loss.item()
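A quick smoke test of the class above. ToyPolicy, its sizes, and the exact log_prob signature are my own assumptions (sequence-level log-probs gathered from token logits), added just to exercise the shapes; any policy exposing log_prob(input_ids, actions) would work the same way:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyPolicy(nn.Module):
    # hypothetical toy policy: logits over a small vocab; log_prob returns per-sequence log-probs
    def __init__(self, vocab=16, hidden=32):
        super().__init__()
        self.emb = nn.Embedding(vocab, hidden)
        self.head = nn.Linear(hidden, vocab)

    def log_prob(self, input_ids, actions):
        logits = self.head(self.emb(input_ids))                        # [B, T, V]
        logp = F.log_softmax(logits, dim=-1)
        tok_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # [B, T]
        return tok_logp.sum(dim=-1)                                    # [B]

B, T = 8, 5                                  # 8 samples = 2 groups of group_size 4
policy, ref_policy = ToyPolicy(), ToyPolicy()
grpo = GRPO(policy, ref_policy)

input_ids = torch.randint(0, 16, (B, T))
actions   = torch.randint(0, 16, (B, T))
with torch.no_grad():
    old_logp = policy.log_prob(input_ids, actions)
rewards    = torch.randn(B)
advantages = rewards.clone()                 # raw rewards; compute_loss normalizes them per group
print(grpo.step(input_ids, actions, old_logp, rewards, advantages))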
They asked what I'd like to work on.
Work intensity: arrive before 12:00, clocking 9 hours is enough; location is Shanghai.