2023-06-17 11:38 知乎_产品

Day53: A Hands-On Machine Learning Project

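In the previous section, we learned how to evaluate and select a machine learning model. In this section, we close the machine learning chapter with a common house price prediction project, which also walks through the process of tackling a real machine learning problem.

1. Project Overview

The hands-on task we chose is house price prediction, using the "House Prices: Advanced Regression Techniques" dataset on Kaggle. The goal is to predict the sale price of a house from its features (such as living area, number of bedrooms, and location). It is a common but moderately difficult task that draws on the knowledge and techniques covered throughout the machine learning chapter.

2. Inspecting the Dataset

First, we import all the libraries we may need: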

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from scipy.stats import norm, skew
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Lasso
import xgboost as xgb
from sklearn.ensemble import VotingRegressor

Then we read the training set and the test set:

train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

Look at the first few rows to see which columns and values the dataset contains:

print(train.head())

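3. Data Analysis

3.1 Feature Correlation

We start with a first look at the correlations among the training set features, which helps with feature selection later:

# Select the numeric columns
numeric_data = train.select_dtypes(include=[np.number])

# Compute the pairwise correlations between features
corr_matrix = numeric_data.corr()

# Plot the correlation heatmap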
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix (Numeric Features)')
plt.show()

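Because there are too many variables, we can pick the ten variables most correlated with the sale price SalePrice (a list that includes SalePrice itself) and draw the heatmap again:

k = 10  # number of variables to keep

# DataFrame.nlargest() sorts the rows by their correlation with SalePrice; we then take the index of the top k rows
cols = corr_matrix.nlargest(k, 'SalePrice')['SalePrice'].index

# cm is the correlation coefficient matrix of the selected columns
cm = np.corrcoef(train[cols].values.T)

# Font scale for the axis tick labels
sns.set(font_scale=1.25)

hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()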

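We want to keep the variables that are strongly correlated with SalePrice, but when two features are strongly correlated with each other, we can consider dropping one of them. For example, GarageCars and GarageArea describe how many cars the garage holds and the area of the garage. Their correlation is as high as 0.88 and they carry nearly the same meaning, so we can consider dropping GarageCars. The other variables can be examined in the same way.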
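
Reading redundant pairs off a heatmap by eye gets tedious as the number of features grows. A minimal sketch of listing them programmatically follows. The helper name correlated_pairs, the 0.8 threshold, and the synthetic data are assumptions made for illustration and are not part of the original project:

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |r| above threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], round(float(r), 2)))
    # Strongest correlations first
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Synthetic stand-in for the housing data: garage area closely tracks garage capacity
rng = np.random.default_rng(0)
cars = rng.integers(0, 4, size=200)
toy = pd.DataFrame({
    "GarageCars": cars,
    "GarageArea": cars * 250 + rng.normal(0, 40, size=200),
    "YearBuilt": rng.integers(1900, 2010, size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```

On the real data, the same helper could be called on train.select_dtypes(include=[np.number]) to get candidate pairs to review by hand.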

The heatmap shows that OverallQual has the strongest correlation with SalePrice. We can confirm this with a plot:

data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
fig.axis(ymin=0, ymax=800000);

3.2 Handling Outliers

We draw a scatter plot of the strongly correlated variable GrLivArea against the sale price to check for outliers:

fig, ax = plt.subplots()
ax.scatter(x = train['GrLivArea'], y = train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

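In the scatter plot, two points in the lower right are clearly abnormal: a very large living area paired with a low price. These are outliers, and we remove them. Keep in mind that removing outliers is not always the right call. The model should tolerate some noise, because the test set is not free of anomalies either.

# Remove the outliers
train = train.drop(train[(train['GrLivArea']>4000) & (train['SalePrice']<300000)].index)

# Redraw the scatter plot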
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=12)
plt.xlabel('GrLivArea', fontsize=12)
plt.show()

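3.3 Analyzing the Target

The sale price is the target we want to predict, so we need to look at its distribution. Many regression models behave better when the target is close to normally distributed, so we check whether a transformation is needed:

# Plot the distribution (distplot is deprecated in recent seaborn releases; histplot and displot are its replacements)
sns.distplot(train['SalePrice'], fit=norm)

# Fit a normal distribution to get the mean and standard deviation
(mu, sigma) = norm.fit(train['SalePrice'])

# Raw string, so that \m and \s are not treated as escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Draw the Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()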

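We apply a log transform (log1p) to bring the distribution closer to normal:

train["SalePrice"] = np.log1p(train["SalePrice"])

# Distribution after the transform
sns.distplot(train['SalePrice'], fit=norm)

# Fit a normal distribution to get the mean and standard deviation
(mu, sigma) = norm.fit(train['SalePrice'])

# Raw string, so that \m and \s are not treated as escape sequences
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Draw the Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()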
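
The effect of the transform can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal prices (an assumption standing in for the real SalePrice column) to show that log1p reduces skewness and that expm1 undoes it exactly, which matters later when predictions have to be converted back to prices:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices"; the real column would be train['SalePrice']
rng = np.random.default_rng(42)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=1000)

log_prices = np.log1p(prices)      # log(1 + x)
skew_before = skew(prices)
skew_after = skew(log_prices)

# expm1 is the exact inverse of log1p
restored = np.expm1(log_prices)

print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to 0 after the transform indicates a roughly symmetric distribution, consistent with what the Q-Q plot shows visually.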

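4. Data Preprocessing and Feature Engineering

4.1 Handling Missing Values

The training and test sets should be preprocessed together, so we concatenate them, dropping Id, which has no predictive value, and SalePrice, which is the target itself:

# Record how many rows the training and test sets have
ntrain = train.shape[0]
ntest = test.shape[0]
# Save the labels
y_train = train.SalePrice.values
# Save the test Ids, which the submission file needs later
test_ID = test['Id']

# Concatenate; reset_index rebuilds the index, which can end up out of order after operations on a DataFrame
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
all_data.drop(['Id'], axis=1, inplace=True)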

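Next, we look at the percentage of missing values for each feature. Features with too many missing values are unusable and have to be dropped:

all_data_na = (all_data.isnull().sum() / len(all_data)) * 100

# Drop the features with no missing values and sort the rest in descending order of missing percentage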
all_data_na = all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 8))
plt.xticks(rotation='vertical')
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
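
Alongside the bar chart, a plain table of missing percentages is often easier to scan. A minimal sketch on a toy frame follows. The helper name missing_percent and the toy values are made up for illustration:

```python
import numpy as np
import pandas as pd

def missing_percent(df):
    """Percentage of missing values per column, keeping only columns that have any."""
    pct = df.isnull().mean() * 100
    return pct[pct > 0].sort_values(ascending=False)

# Toy data: PoolQC is mostly missing, LotFrontage has one gap, LotArea is complete
toy = pd.DataFrame({
    "PoolQC": [np.nan, np.nan, np.nan, "Gd"],
    "LotFrontage": [65.0, np.nan, 80.0, 60.0],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_percent(toy)
print(summary)
```

Here df.isnull().mean() gives the same ratio as isnull().sum() / len(df), just more compactly.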

The chart shows that the top four features are each missing roughly 80% or more of their values, so we drop these four:

all_data = all_data.drop('PoolQC', axis=1)
all_data = all_data.drop('MiscFeature', axis=1)
all_data = all_data.drop('Alley', axis=1)
all_data = all_data.drop('Fence', axis=1)

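For the remaining features we fill in the missing values. How to fill each one depends on the data description file data_description.txt, which explains each feature's meaning, default value, and so on. We fill each feature according to that description:

# A missing fireplace quality probably means there is no fireplace, so fill with 'none'
all_data['FireplaceQu'] = all_data['FireplaceQu'].fillna('none')

# LotFrontage is the length of street in front of the property. It should be similar to that of
# other houses in the same neighborhood, so fill with the neighborhood median
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.median()))

# For the four categorical garage features, missing probably means no garage, so fill with 'none'
for c in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[c] = all_data[c].fillna('none')

# For the numeric garage features, missing probably means no garage, so fill with 0
for c in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[c] = all_data[c].fillna(0)

# For the numeric basement features, missing probably means no basement, so fill with 0
for c in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[c] = all_data[c].fillna(0)

# Categorical basement features: likewise, fill with 'None'
for c in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[c] = all_data[c].fillna('None')

# The MasVnr features describe masonry veneer; missing probably means there is none, so fill with 'None' and 0
all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)

# MSZoning is the zoning classification of the property; fill with the mode
# Record counts per zoning class (the result is only displayed, not stored)
all_data.groupby('MSZoning')['MasVnrType'].count().reset_index()
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])

# A missing Functional value defaults to Typ, so fill with 'Typ'
all_data["Functional"] = all_data["Functional"].fillna("Typ")

# Utilities is "AllPub" for almost every record, so it is of little use for prediction; drop it
all_data.drop(['Utilities'], axis=1, inplace=True)

# Fill the remaining features with the mode as well
for i in ('SaleType', 'KitchenQual', 'Electrical', 'Exterior2nd', 'Exterior1st'):
    all_data[i] = all_data[i].fillna(all_data[i].mode()[0])

Finally, check that all missing values have been handled:

print(all_data.isnull().sum().max())  # Output: 0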
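
The group-wise fill used for LotFrontage is the least obvious step above, so here is how it behaves on a toy frame (the values are made up for illustration). Each missing entry receives the median of its own neighborhood, not the global median:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "LotFrontage": [60.0, np.nan, 80.0, 50.0, np.nan],
})

# transform() returns a Series aligned with the original rows,
# so it can be assigned straight back to the column
toy["LotFrontage"] = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.median())
)
print(toy)
```

The gap in neighborhood A is filled with 70.0 (the median of 60 and 80), and the gap in B with 50.0.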

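4.2 One-Hot Encoding

Many of the features are non-numeric strings, so we use one-hot encoding to convert these categorical attributes into numeric columns:

all_data = pd.get_dummies(all_data)

Then we split the processed data back into the training and test sets: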

train = all_data[:ntrain]
test = all_data[ntrain:]
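
One practical reason for encoding the combined data before splitting is column alignment. The toy sketch below (made-up values, with RoofStyle chosen only as an example column) shows that encoding the two sets separately can produce different columns when a category appears in only one of them, whereas encoding them together cannot:

```python
import pandas as pd

toy_train = pd.DataFrame({"RoofStyle": ["Gable", "Hip"], "LotArea": [8450, 9600]})
toy_test = pd.DataFrame({"RoofStyle": ["Gable", "Flat"], "LotArea": [11250, 9550]})

# Encoding separately: each frame only gets columns for the categories it contains
sep_train = pd.get_dummies(toy_train)
sep_test = pd.get_dummies(toy_test)

# Encoding together, then splitting by row count as in the article
n_train = len(toy_train)
combined = pd.get_dummies(pd.concat((toy_train, toy_test)).reset_index(drop=True))
joint_train, joint_test = combined[:n_train], combined[n_train:]

print(list(sep_train.columns), list(sep_test.columns))
print(list(joint_train.columns))
```

A model fitted on sep_train could not be applied to sep_test directly, because the feature columns do not match.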

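5. Model Experiments

5.1 Evaluation Function

To stay consistent with the Kaggle competition, we score our models with the Root-Mean-Squared-Error (RMSE) under 5-fold cross-validation. Lower scores are better. The evaluation function is defined as:

def rmse(model):
    # Pass the KFold object itself as cv; get_n_splits() would return only the integer 5 and discard the shuffling
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    # np.asarray accepts both a DataFrame and the NumPy array produced by the scaler in 5.2
    X = np.asarray(train)
    scores = np.sqrt(-cross_val_score(model, X, y_train, scoring="neg_mean_squared_error", cv=kf))
    return scores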
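
Because SalePrice was log-transformed in section 3.3, the RMSE computed here is measured on the log scale. To our understanding, the competition also scores submissions by the RMSE between the logarithms of predicted and observed prices, which is why the two are comparable. A minimal sketch of that metric follows. The function name rmse_on_log_scale and the example prices are assumptions, and log1p is used to match the transform in this article:

```python
import numpy as np

def rmse_on_log_scale(y_true, y_pred):
    """RMSE between log1p(actual price) and log1p(predicted price)."""
    log_true = np.log1p(np.asarray(y_true, dtype=float))
    log_pred = np.log1p(np.asarray(y_pred, dtype=float))
    return float(np.sqrt(np.mean((log_pred - log_true) ** 2)))

actual = [100000, 200000, 300000]

# The same 10,000 error, once on the cheapest house and once on the most expensive one
cheap_miss = rmse_on_log_scale(actual, [110000, 200000, 300000])
expensive_miss = rmse_on_log_scale(actual, [100000, 200000, 310000])

print(cheap_miss, expensive_miss)
```

On the log scale, an error of a given size counts for more on a cheap house than on an expensive one, so the metric reflects relative rather than absolute error.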

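5.2 Standardization

We standardize the encoded data with StandardScaler, fitting it on the training set and applying the same transform to the test set: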

scaler = StandardScaler()
train = scaler.fit_transform(train)
test = scaler.transform(test)
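
An alternative worth knowing is to put the scaler inside a scikit-learn Pipeline, so that during cross-validation it is refitted on the training portion of each fold only. This is a sketch on synthetic data from make_regression, not the approach used in this article, and the Lasso alpha here is arbitrary:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem standing in for the housing features
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

# The scaler and the model are fitted together on each training fold
pipeline = make_pipeline(StandardScaler(), Lasso(alpha=0.1))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
neg_mse = cross_val_score(pipeline, X, y, scoring="neg_mean_squared_error", cv=kf)
fold_rmse = np.sqrt(-neg_mse)

print(fold_rmse.mean(), fold_rmse.std())
```

With this design the validation fold never influences the scaling statistics, which keeps the cross-validated score a slightly more honest estimate.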

5.3 Fitting Models

We try three regression models here: Lasso regression, random forest, and XGBoost. Let's first look at the error score of each:

model1 = Lasso(alpha=0.0005, random_state=1)
score = rmse(model1)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

model2 = RandomForestRegressor()
score = rmse(model2)
print("\nRandom forest score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

# verbosity=0 and n_jobs=-1 replace the deprecated silent=1 and nthread=-1
model3 = xgb.XGBRegressor(colsample_bytree=0.5, gamma=0.05,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.8, n_estimators=1000,
                             reg_alpha=0.5, reg_lambda=0.8,
                             subsample=0.5, verbosity=0,
                             random_state=7, n_jobs=-1)
score = rmse(model3)
print("\nXGBoost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))

Output:
Lasso score: 0.1197 (0.0090)
Random forest score: 0.1385 (0.0039)
XGBoost score: 0.1170 (0.0071)

Then we combine the three models by voting, an ensemble learning technique:

voting_model = VotingRegressor(estimators=[('model1', model1), ('model2', model2), ('model3', model3)])
score = rmse(voting_model)
print("\nVoting model score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
# Output: Voting model score: 0.1133 (0.0067)

The output shows that the ensemble scores slightly better than any of the individual models, so we use the ensemble for our submission.
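
For regression, "voting" simply means averaging: with no weights given, VotingRegressor predicts the mean of its fitted estimators' predictions. The sketch below checks this on synthetic data. The two-model ensemble and its hyperparameters are chosen only to keep the example small:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=150, n_features=5, noise=5.0, random_state=0)

voter = VotingRegressor(estimators=[
    ("lasso", Lasso(alpha=0.1)),
    ("forest", RandomForestRegressor(n_estimators=20, random_state=0)),
])
voter.fit(X, y)

# estimators_ holds the fitted clones; average their predictions by hand
individual = np.column_stack([est.predict(X) for est in voter.estimators_])
manual_average = individual.mean(axis=1)

print(np.allclose(voter.predict(X), manual_average))
```

Averaging helps most when the models make different kinds of errors, which is plausible for a linear model combined with tree-based ones.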

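5.4 Generating the Submission

We use the ensemble's predictions as the final result and generate the submission file:

voting_model.fit(train, y_train)
pred = voting_model.predict(test)
sub = pd.DataFrame()
sub['Id'] = test_ID
# The model was trained on log1p(SalePrice), so convert the predictions back to prices
sub['SalePrice'] = np.expm1(pred)
sub.to_csv('submission.csv', index=False)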
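
Since the models are trained on the log-transformed target, it is worth sanity-checking the prices before uploading. The helper below is a hypothetical sketch (build_submission is not part of the original code, and the Ids and prices are toy values). It inverts the log1p transform and rejects non-finite or non-positive prices:

```python
import numpy as np
import pandas as pd

def build_submission(ids, log_predictions):
    """Convert log1p-scale predictions to prices and assemble the submission frame."""
    prices = np.expm1(np.asarray(log_predictions, dtype=float))
    if not np.all(np.isfinite(prices)) or np.any(prices <= 0):
        raise ValueError("predicted prices must be finite and positive")
    return pd.DataFrame({"Id": list(ids), "SalePrice": prices})

# Toy example: two predictions that are already on the log1p scale
toy_sub = build_submission([1461, 1462], np.log1p([120000.0, 180000.0]))
print(toy_sub)
```

If the SalePrice column shows values around 12 instead of six-digit prices, the inverse transform was skipped.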