PyTorch Reinforcement Learning in Practice: Implementing Q-Learning from Scratch
A PyTorch Reinforcement Learning Demo
Reinforcement learning (RL) is a machine learning approach in which an agent learns an optimal policy by interacting with an environment. PyTorch, with its flexible automatic differentiation and GPU acceleration, is well suited to implementing reinforcement learning algorithms. Below is a PyTorch-based reinforcement learning demo, using Q-Learning as the example.
Environment Setup
This demo uses the classic CartPole-v1 environment from OpenAI Gym. Install the Gym library:
pip install gym
Note that gym 0.26 and later (and its successor gymnasium) use an updated API: env.reset() returns (observation, info) and env.step() returns five values (observation, reward, terminated, truncated, info). The code below assumes this newer API.
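As a quick sanity check (a minimal sketch, assuming gym >= 0.26 is installed), you can create the environment and inspect its observation and action spaces before training:

import gym

# Create the CartPole environment and inspect its spaces
env = gym.make('CartPole-v1')
print(env.observation_space)   # Box(4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)        # Discrete(2): push the cart left or right

obs, info = env.reset(seed=0)  # gym >= 0.26: reset returns (observation, info)
print(obs.shape)               # (4,)
env.close()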
Q-Learning Algorithm Implementation
Q-Learning is a value-based reinforcement learning algorithm that learns an optimal policy by iteratively updating Q-value estimates (a Q-table in the tabular case). Because CartPole has a continuous state space, this demo approximates the Q-function with a neural network instead of a table. The core PyTorch implementation is shown below.
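Before the code, it helps to recall the underlying update rule. In the tabular setting, Q-Learning updates its estimates as (standard formulation, stated here for context rather than taken from the demo):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]

With a neural network as function approximator, the prediction Q(s, a) is instead regressed toward the TD target r + \gamma \max_{a'} Q(s', a') using a mean-squared-error loss, which is exactly what the training loop below does.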
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# Q-network: a three-layer MLP that maps a state to one Q-value per action
class QNetwork(nn.Module):
    def __init__(self, state_size, action_size):
        super(QNetwork, self).__init__()
        self.fc1 = nn.Linear(state_size, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, action_size)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

env = gym.make('CartPole-v1')
state_size = env.observation_space.shape[0]  # 4 for CartPole
action_size = env.action_space.n             # 2 for CartPole

q_network = QNetwork(state_size, action_size)
optimizer = optim.Adam(q_network.parameters(), lr=0.001)
criterion = nn.MSELoss()

episodes = 1000
gamma = 0.99           # discount factor
epsilon = 1.0          # initial exploration rate
epsilon_min = 0.01
epsilon_decay = 0.995

for episode in range(episodes):
    state, _ = env.reset()               # gym >= 0.26: reset returns (observation, info)
    state = torch.FloatTensor(state)
    total_reward = 0

    while True:
        # Epsilon-greedy action selection
        if np.random.rand() <= epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = q_network(state)
            action = torch.argmax(q_values).item()

        # gym >= 0.26: step returns (observation, reward, terminated, truncated, info)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_state = torch.FloatTensor(next_state)
        total_reward += reward

        # TD target: r + gamma * max_a' Q(s', a'); no bootstrapping at terminal states
        with torch.no_grad():
            if done:
                target = torch.tensor(reward, dtype=torch.float32)
            else:
                target = reward + gamma * torch.max(q_network(next_state))

        # Regress the predicted Q-value of the taken action toward the target
        q_values = q_network(state)
        loss = criterion(q_values[action], target)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        state = next_state
        if done:
            break

    # Decay the exploration rate after each episode
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    print(f"Episode: {episode}, Total Reward: {total_reward}, Epsilon: {epsilon:.2f}")
Key Points
- Q-network structure: a three-layer fully connected network approximates the Q-function; the input is the state and the output is one Q-value per action.
- Experience replay: the code above does not implement experience replay, but adding it is recommended in practice to improve sample efficiency (see the sketch after this list).
- Exploration vs. exploitation: an epsilon-greedy policy balances exploration and exploitation, with epsilon decaying over time.
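As a rough illustration of experience replay (a minimal sketch; the ReplayBuffer class below is a hypothetical addition, not part of the original demo), transitions can be stored in a fixed-size buffer and sampled as random mini-batches:

import random
from collections import deque

import torch

class ReplayBuffer:
    # Fixed-size buffer that stores transitions and samples random mini-batches
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (torch.stack(states),
                torch.tensor(actions),
                torch.tensor(rewards, dtype=torch.float32),
                torch.stack(next_states),
                torch.tensor(dones, dtype=torch.float32))

    def __len__(self):
        return len(self.buffer)

In the training loop, each transition would be pushed into the buffer, and once enough samples have accumulated, a random mini-batch would be drawn to compute a batched TD loss instead of updating on a single transition.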
Directions for Improvement
- Deep Q-Network (DQN): introduce a target network and experience replay to stabilize training (a target-network sketch follows this list).
- Double DQN: mitigates the overestimation of Q-values and improves performance.
- Policy gradient methods: such as PPO or A3C, which are suitable for problems with continuous action spaces.
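As a rough sketch of the target-network idea mentioned above (an illustration that assumes the q_network from the demo already exists; this is not the original author's code), the target network is a periodically synchronized copy of the online Q-network used only for computing TD targets:

import copy

import torch

# Frozen copy of the online network, used only for TD targets
target_network = copy.deepcopy(q_network)
target_network.eval()

TARGET_UPDATE_EVERY = 100  # training steps between synchronizations (tunable)

def compute_td_target(reward, next_state, done, gamma=0.99):
    # TD target computed with the frozen target network instead of the online network
    with torch.no_grad():
        if done:
            return torch.tensor(reward, dtype=torch.float32)
        return reward + gamma * torch.max(target_network(next_state))

# Inside the training loop, every TARGET_UPDATE_EVERY steps:
#     target_network.load_state_dict(q_network.state_dict())

Keeping the target fixed for a while decouples the prediction from the bootstrapped target, which is the main source of instability in the naive loop above.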
Testing and Evaluation
After training, you can test the model's performance with the following code:
# Re-create the environment with on-screen rendering (gym >= 0.26 selects rendering via render_mode)
env = gym.make('CartPole-v1', render_mode='human')

state, _ = env.reset()
state = torch.FloatTensor(state)
total_reward = 0

while True:
    # Act greedily with respect to the learned Q-values (no exploration)
    with torch.no_grad():
        q_values = q_network(state)
    action = torch.argmax(q_values).item()

    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    state = torch.FloatTensor(next_state)
    total_reward += reward

    if done:
        print(f"Total Reward: {total_reward}")
        break

env.close()
With the steps above, you can quickly build a PyTorch-based reinforcement learning demo and extend or optimize it from there.