《机器学习实战》笔记 (第五章至第七章) (附Python3版代码)(Machine Learning in Action)
前言:
《机器学习实战》(Machine Learning in Action)是一本常见的机器学习入门书,书中代码由Python2写成。由于现时Python2已逐渐退出舞台,所以这篇文章将该书的所有代码部分用Python3重写。
代码上传GitHub: https://github.com/kungbob/Machine_Learning_In_Action
原版Python2代码:https://www.manning.com/books/machine-learning-in-action
第五章: Logistic回归
优点: 计算代价不高,易于理解和实现。
缺点: 容易欠拟合,分类精度可能不高。
适用数据类型: 数值型和标称型数据。
Logistic回归的一般过程
- 收集数据: 采用任意方法收集数据。
- 准备数据: 由于需要进行距离计算,因此要求数据类型为数值型。另外,结构化数据格式则最佳。
- 分析数据: 采用任意方法对数据进行分析。
- 训练算法: 大部分时间将用于训练,训练的目的是为了找到最佳的分类回归系统。
- 测试算法: 一旦训练步骤完成,分类将会很快。
- 使用算法: 首先需要输入一些数据,并将其转换成对应的结构化数据。接着,基于训练好的回归系数就可以对这些数值进行简单的回归计算,判定他们属于哪个类别;在这之后,就可以在输出的类型上做一些其他分析工作。
梯度下降算法:
梯度下降算法和梯度上升算法是一样的,只是公式中的加法需要变成减法。因此,梯度下降算法的公式为:
\[w := w - \alpha \nabla_w f(w) \]
梯度上升算法用来求函数的最大值,而梯度下降算法用来求函数的最小值。
示例: 使用Logistic回归估计马疝病的死亡率
- 收集数据: 给定数据文件。
- 准备数据: 用Python解析文本文件并填充缺失值。
- 分析数据: 可视化并观察数据。
- 训练算法: 使用优化算法,找到最佳的系数。
- 测试算法: 为了量化回归的效果,需要观测错误率。根据错误率决定是否回退到训练阶段,通过改变递送的次数和步长等参数来得到更好的回归函数。
- 使用算法: 实现一个简单的命令行程序来手机马的症状并输出预测结果并非难事,可以作为留给读者的习题。
logRegres.py完整代码(章节中Python命令行代码在GitHub):
from numpy import *
import matplotlib.pyplot as plt
# Logistic Regression Gradient Ascent Method, program 5_1
def loadDataSet():
dataMat = []
labelMat = []
fr = open('testSet.txt')
for line in fr.readlines():
lineArr = line.strip().split()
dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
labelMat.append(int(lineArr[2]))
return dataMat, labelMat
def sigmoid(inX):
return 1.0 / (1 + exp(-inX))
def gradAscent(dataMatIn, classLabels):
# Convert the matrix into Numpy format
dataMatrix = mat(dataMatIn)
labelMat = mat(classLabels).transpose()
m, n = shape(dataMatrix)
alpha = 0.001
maxCycles = 500
weights = ones((n, 1))
# Matrix multiplication
for k in range(maxCycles):
h = sigmoid(dataMatrix * weights)
error = (labelMat - h)
weights = weights + alpha * dataMatrix.transpose() * error
return weights
# Ploting the dataset and the decision boundary
def plotBestFit(weights):
dataMat, labelMat = loadDataSet()
dataArr = array(dataMat)
n = shape(dataArr)[0]
xcord1 = []
ycord1 = []
xcord2 = []
ycord2 = []
for i in range(n):
if int(labelMat[i]) == i:
xcord1.append(dataArr[i, 1])
ycord1.append(dataArr[i, 2])
else:
xcord2.append(dataArr[i, 1])
ycord2.append(dataArr[i, 2])
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(xcord1, ycord1, s = 30, c = 'red', marker = 's')
ax.scatter(xcord2, ycord2, s = 30, c = 'green')
x = arange(-3.0, 3.0, 0.1)
# Best fiting line
y = (-weights[0] - weights[1] * x) / weights[2]
ax.plot(x, y)
plt.xlabel('X1')
plt.ylabel('X2')
plt.show()
# Stochastic gradient ascent, program 5_3
def stocGradAscent0(dataMatrix, classLabels):
m, n = shape(dataMatrix)
alpha = 0.01
weights = ones(n)
for i in range(m):
h = sigmoid(sum(dataMatrix[i] * weights))
error = classLabels[i] - h
weights = weights + alpha * error * dataMatrix[i]
return weights
# Improved Stochastic gradient ascent, program 5_4
def stocGradAscent1(dataMatrix, classLabels, numIter = 150):
m, n = shape(dataMatrix)
weights = ones(n)
for j in range(numIter):
dataIndex = list(range(m))
for i in range(m):
# Modifying alpha for each iteration
alpha = 4 / (1.0 + j + i) + 0.01
# Randomly pick a data to update
randIndex = int(random.uniform(0, len(dataIndex)))
h = sigmoid(sum(dataMatrix[randIndex] * weights))
error = classLabels[randIndex] - h
weights = weights + alpha * error * dataMatrix[randIndex]
del(dataIndex[randIndex])
return weights
# Logistic regession classifier, program 5_5
def classifyVector(inX, weights):
prob = sigmoid(sum(inX * weights))
if prob > 0.5:
return 1.0
else:
return 0.0
def colicTest():
frTrain = open('horseColicTraining.txt')
frTest = open('horseColicTest.txt')
trainingSet = []
trainingLabels = []
for line in frTrain.readlines():
currLine = line.strip().split('\t')
lineArr = []
for i in range(21):
lineArr.append(float(currLine[i]))
trainingSet.append(lineArr)
trainingLabels.append(float(currLine[21]))
trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 500)
errorCount = 0
numTestVec = 0.0
for line in frTest.readlines():
numTestVec += 1.0
currLine = line.strip().split('\t')
lineArr = []
for i in range(21):
lineArr.append(float(currLine[i]))
if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[21]):
errorCount += 1
errorRate = (float(errorCount) / numTestVec)
print("the error rate of this test is: %f" % errorRate)
return errorRate
def multiTest():
numTests = 10
errorSum = 0.0
for k in range(numTests):
errorSum += colicTest()
print("after %d iterations the average error rate is: %f" % (numTests, errorSum/float(numTests)))
第六章: 支持向量机
优点: 泛化错误低(Generalization error),计算开销不大,结果易解释。
缺点: 对参数调节和核函数的选择敏感,原始分类器不加修改仅使用于处理二类问题(Two-class Classification)。
适用数据类型: 数值型和标称型数据。
SVM的一般流程:
- 收集数据: 可以使用任意方法。
- 准备数据: 需要数值型数据。
- 分析数据: 有助于可视化分隔超平面。
- 训练算法: SVM的大部分时间都源自训练,该过程主要实现两个参数的调优。
- 测试算法: 十分简单的计算过程就可以实现。
- 使用算法: 几乎所有分类问题都可以使用SVM,值得一提的是,SVM本身是一个二类分类器,对多类问题应用SVM需要对代码做一些修改。
示例:基于SVM的数字识别
- 收集数据: 提供的文本文件。
- 准备数据: 基于二值图像构造向量。
- 分析数据: 对图像向量进行目测。
- 训练算法: 采用两种不同的核函数,并对径向核函数采用不同的设置来运行SMO算法。
- 测试算法: 编写一个函数来测试不同的核函数并计算错误率。
- 使用算法: 一个图像识别的完整应用还需要一些图像处理的知识,这里并不深入介绍。
svmMLiA.py完整代码(章节中Python命令行代码在GitHub):
from numpy import *
from os import listdir
# Supporting functions in SMO algorithm, program 6_1
def loadDataSet(fileName):
dataMat = []
labelMat = []
fr = open(fileName)
for line in fr.readlines():
lineArr = line.strip().split('\t')
dataMat.append([float(lineArr[0]), float(lineArr[1])])
labelMat.append(float(lineArr[2]))
return dataMat, labelMat
def selectJrand(i, m):
j = i
while (j == i):
j = int(random.uniform(0, m))
return j
def clipAlpha(aj, H, L):
if aj > H:
aj = H
if L > aj:
aj = L
return aj
# simplified SMO algorithm, program 6_2
def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
dataMatrix = mat(dataMatIn)
labelMat = mat(classLabels).transpose()
b = 0
m, n = shape(dataMatrix)
alphas = mat(zeros((m, 1)))
iter = 0
while (iter < maxIter):
alphaPairsChanged = 0
for i in range(m):
fXi = float(multiply(alphas, labelMat).T*\
(dataMatrix*dataMatrix[i, :].T)) + b
Ei = fXi - float(labelMat[i])
# Test the value of alpha whether it needs optimization.
if ((labelMat[i] * Ei < -toler) and (alphas[i] < C)) or \
((labelMat[i] * Ei > toler) and \
(alphas[i] > 0)):
# Randomly pick the second alpha
j = selectJrand(i, m)
fXj = float(multiply(alphas, labelMat).T*\
(dataMatrix*dataMatrix[j, :].T)) + b
Ej = fXj - float(labelMat[j])
alphaIold = alphas[i].copy()
alphaJold = alphas[j].copy()
# Ensure the range of alpha is between 0 to C
if (labelMat[i] != labelMat[j]):
L = max(0, alphas[j] - alphas[i])
H = min(C, C + alphas[j] - alphas[i])
else:
L = max(0, alphas[j] + alphas[i] - C)
H = min(C, alphas[j] + alphas[i])
if L == H:
print("L == H")
continue
eta = 2.0 * dataMatrix[i, :] * dataMatrix[j, :].T - \
dataMatrix[i, :] * dataMatrix[i, :].T - \
dataMatrix[j, :] * dataMatrix[j, :].T
if eta >= 0:
print("eta>=0")
continue
alphas[j] -= labelMat[j] * (Ei - Ej) / eta
alphas[j] = clipAlpha(alphas[j], H, L)
if (abs(alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
continue
# Modify i with value equals to j, but in opposite direction
alphas[i] += labelMat[j] * labelMat[i] * \
(alphaJold - alphas[j])
b1 = b - Ei - labelMat[i] * (alphas[i] - alphaIold) * \
dataMatrix[i, :] * dataMatrix[i, :].T - \
labelMat[j] * (alphas[j] - alphaJold) * \
dataMatrix[i, :] * dataMatrix[j, :].T
# Set constant term
b2 = b - Ej - labelMat[i] * (alphas[i] - alphaIold) * \
dataMatrix[i, :] * dataMatrix[j, :].T - \
labelMat[j] * (alphas[j] - alphaJold) * \
dataMatrix[j, :] * dataMatrix[j, :].T
if (0 < alphas[i]) and (C > alphas[i]):
b = b1
elif (0 < alphas[j]) and (C > alphas[j]):
b = b2
else:
b = (b1 + b2) / 2.0
alphaPairsChanged += 1
print("iter: %d i: %d, pairs changed %d" % \
(iter, i, alphaPairsChanged))
if (alphaPairsChanged == 0):
iter += 1
else:
iter = 0
print("iteration number: %d" % iter)
return b, alphas
# Complete version of Platt SMO Supporting functions, program 6_3
'''
class optStruct:
def __init__(self, dataMatIn, classLabels, C, toler):
self.X = dataMatIn
self.labelMat = classLabels
self.C = C
self.tol = toler
self.m = shape(dataMatIn)[0]
self.alphas = mat(zeros((self.m, 1)))
self.b = 0
# Save the error into Cache
self.eCache = mat(zeros((self.m, 2)))
'''
# Modified for Kernal function, program 6_6
class optStruct:
def __init__(self, dataMatIn, classLabels, C, toler, kTup):
self.X = dataMatIn
self.labelMat = classLabels
self.C = C
self.tol = toler
self.m = shape(dataMatIn)[0]
self.alphas = mat(zeros((self.m, 1)))
self.b = 0
# Save the error into Cache
self.eCache = mat(zeros((self.m, 2)))
self.K = mat(zeros((self.m, self.m)))
for i in range(self.m):
self.K[:, i] = n (self.X, self.X[i, :], kTup)
'''
def calcEk(oS, k):
fXk = float(multiply(oS.alphas, oS.labelMat).T*\
(oS.X*oS.X[k, :].T)) + oS.b
Ek = fXk - float(oS.labelMat[k])
return Ek
'''
# Modified for kernal function, program 6_7
def calcEk(oS, k):
fXk = float(multiply(oS.alphas, oS.labelMat).T * oS.K[:, k] + oS.b)
Ek = fXk - float(oS.labelMat[k])
return Ek
# Preparation for the inner interation loop
def selectJ(i, oS, Ei):
maxK = -1
maxDeltaE = 0
Ej = 0
oS.eCache[i] = [1, Ei]
validEcacheList = nonzero(oS.eCache[:, 0].A)[0]
if (len(validEcacheList)) > 1:
for k in validEcacheList:
if k == i:
continue
Ek = calcEk(oS, k)
deltaE = abs(Ei - Ek)
# Select the largest step distance of j
if (deltaE > maxDeltaE):
maxK = k
maxDeltaE = deltaE
Ej = Ek
return maxK, Ej
else:
j = selectJrand(i, oS.m)
Ej = calcEk(oS, j)
return j, Ej
def updateEk(oS, k):
Ek = calcEk(oS, k)
oS.eCache[k] = [1, Ek]
# Complete Platt SMO algorithm, optimised version, program 6_4
'''
def innerL(i, oS):
Ei = calcEk(oS, i)
if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
j, Ej = selectJ(i, oS, Ei)
# Select the second alpha
alphaIold = oS.alphas[i].copy()
alphaJold = oS.alphas[j].copy()
if (oS.labelMat[i] != oS.labelMat[j]):
L = max(0, oS.alphas[j] - oS.alphas[i])
H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
else:
L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
H = min(oS.C, oS.alphas[j] + oS.alphas[i])
if L == H:
print("L == H")
return 0
eta = 2.0 * oS.X[i, :] * oS.X[j, :].T -oS.X[i, :] * oS.X[i, :].T - \
oS.X[j, :] * oS.X[j, :].T
if eta >= 0:
print("eta >= 0")
return 0
oS.alphas[j] -= oS.labelMat[j]* (Ei - Ej) / eta
oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
updateEk(oS, j)
if (abs(oS.alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
return 0
oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * \
(alphaJold - oS.alphas[j])
updateEk(oS, i)
b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.X[i, :] * oS.X[i, :].T - oS.labelMat[j] * \
(oS.alphas[j] - alphaJold) * oS.X[i, :] * oS.X[j, :].T
b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.X[i, :] * oS.X[j, :].T - oS.labelMat[j] * \
(oS.alphas[j] - alphaJold) * oS.X[j, :] * oS.X[j, :].T
if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]):
oS.b = b1
elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]) :
oS.b = b2
else:
oS.b = (b1 + b2) / 2.0
return 1
else:
return 0
'''
# Inner iteration for Kernal function, program 6_7
def innerL(i, oS):
Ei = calcEk(oS, i)
if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
j, Ej = selectJ(i, oS, Ei)
# Select the second alpha
alphaIold = oS.alphas[i].copy()
alphaJold = oS.alphas[j].copy()
if (oS.labelMat[i] != oS.labelMat[j]):
L = max(0, oS.alphas[j] - oS.alphas[i])
H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
else:
L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
H = min(oS.C, oS.alphas[j] + oS.alphas[i])
if L == H:
print("L == H")
return 0
# Modified in program 6_7
eta = 2.0 * oS.K[i, j] - oS.K[i, i] - oS.K[j, j]
if eta >= 0:
print("eta >= 0")
return 0
oS.alphas[j] -= oS.labelMat[j]* (Ei - Ej) / eta
oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
updateEk(oS, j)
if (abs(oS.alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
return 0
oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * \
(alphaJold - oS.alphas[j])
updateEk(oS, i)
# Modified in program 6_7
b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.K[i, i] - oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
oS.K[i, j]
b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.K[i, j] - oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
oS.K[j, j]
if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]):
oS.b = b1
elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]) :
oS.b = b2
else:
oS.b = (b1 + b2) / 2.0
return 1
else:
return 0
# Complete Platt SMO outer iteration, program 6_5
def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):
# Add kTup parameter at the end
oS = optStruct(mat(dataMatIn), mat(classLabels).transpose(), C, toler, kTup)
iter = 0
entireSet = True
alphaPairsChanged = 0
# Iterate all value
while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
alphaPairsChanged = 0
if entireSet:
for i in range(oS.m):
alphaPairsChanged += innerL(i, oS)
print("fullSet, iter: %d i:%d, pairs changed %d" %\
(iter, i, alphaPairsChanged))
iter += 1
else:
# Iterate all non-boundaries value
nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
for i in nonBoundIs:
alphaPairsChanged += innerL(i, oS)
print("non-bound, iter: %d i: %d, pairs changed %d" % \
(iter, i, alphaPairsChanged))
iter += 1
if entireSet:
entireSet = False
elif (alphaPairsChanged == 0):
entireSet = True
print("iteration number: %d" % iter)
return oS.b, oS.alphas
# Calculate w value, chapter 6_4
def calcWs(alphas, dataArr, classLabels):
X = mat(dataArr)
labelMat = mat(classLabels).transpose()
m,n = shape(X)
w = zeros((n, 1))
for i in range(m):
w += multiply(alphas[i] * labelMat[i], X[i, :].T)
return w
# Kernal translation function, program 6_6
def kernelTrans(X, A, kTup):
m, n = shape(X)
K = mat(zeros((m, 1)))
if kTup[0] == 'lin':
K = X * A.T
elif kTup[0] == 'rbf':
for j in range(m):
deltaRow = X[j, :] - A
K[j] = deltaRow * deltaRow.T
# Division betwen elements
K = exp(K /(-1*kTup[1]**2))
else:
raise NameError('Houston We Have a Problem -- \
That Kernal is not recongnized')
return K
def testRbf(k1 = 1.3):
dataArr, labelArr = loadDataSet('testSetRBF.txt')
b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, ('rbf', k1))
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
svInd = nonzero(alphas.A > 0)[0]
# Construct SVM matrix
sVs = datMat[svInd]
labelSV = labelMat[svInd]
print("there are %d Support Vectors" % shape(sVs)[0])
m, n = shape(datMat)
errorCount = 0
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the training error rate is: %f" % (float(errorCount)/m))
dataArr, labelArr = loadDataSet('testSetRBF2.txt')
errorCount = 0
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
m, n = shape(datMat)
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the test error rate is: %f" %(float(errorCount) / m))
# Handwriting recongnition based on SVM, program 6_9
def loadImages(dirName):
hwLabels = []
trainingFileList = listdir(dirName)
m = len(trainingFileList)
trainingMat = zeros((m, 1024))
for i in range(m):
fileNameStr = trainingFileList[i]
fileStr = fileNameStr.split('.')[0]
classNumStr = int(fileStr.split('_')[0])
if classNumStr == 9:
hwLabels.append(-1)
else:
hwLabels.append(1)
trainingMat[i, :] = img2vector('%s/%s' % (dirName, fileNameStr))
return trainingMat, hwLabels
def testDigits(kTup = ('rbf', 10)):
dataArr, labelArr = loadImages('trainingDigits')
b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, kTup)
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
svInd = nonzero(alphas.A > 0)[0]
sVs = datMat[svInd]
labelSV = labelMat[svInd]
print("there are %d Support Vectors" % shape(sVs)[0])
m, n = shape(datMat)
errorCount = 0
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the training error rate is: %f" % (float(errorCount) / m))
dataArr, labelArr = loadImages('testDigits')
errorCount = 0
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
m, n = shape(datMat)
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("test test error rate is : %f" % (float(errorCount) / m))
def img2vector(filename):
returnVect = zeros((1, 1024))
fr = open(filename)
for i in range(32):
lineStr = fr.readline()
for j in range(32):
returnVect[0, 32*i + j] = int(lineStr[j])
return returnVect
第六章: 支持向量机
优点: 泛化错误低,计算开销不大,结果容易解释。
缺点: 对参数调节和和函数的选择敏感,原始分类器不加修改仅适用于处理二类问题。
使用数据类型: 数值型和标称型数据。
SVM的一般流程
- 收集数据: 可以使用任意方法。
- 准备数据: 需要数值型数据。
- 分析数据: 有助于可视化分隔超平面。
- 训练算法: SVM的大部分时间都源自训练,该过程主要实现两个参数的调优。
- 测试算法: 十分简单的计算过程就可以实现。
- 使用算法: 几乎所有分类问题都可以使用SVM。(SVM本身是一个二类分类器,对多类问题应用SVM需要对代码做一些修改)
注: 书中作者使用SMO算法作为SVM的训练算法。
svmMLiA.py完整代码(章节中Python命令行代码在GitHub):
from numpy import *
from os import listdir
# Supporting functions in SMO algorithm, program 6_1
def loadDataSet(fileName):
dataMat = []
labelMat = []
fr = open(fileName)
for line in fr.readlines():
lineArr = line.strip().split('\t')
dataMat.append([float(lineArr[0]), float(lineArr[1])])
labelMat.append(float(lineArr[2]))
return dataMat, labelMat
def selectJrand(i, m):
j = i
while (j == i):
j = int(random.uniform(0, m))
return j
def clipAlpha(aj, H, L):
if aj > H:
aj = H
if L > aj:
aj = L
return aj
# simplified SMO algorithm, program 6_2
def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
dataMatrix = mat(dataMatIn)
labelMat = mat(classLabels).transpose()
b = 0
m, n = shape(dataMatrix)
alphas = mat(zeros((m, 1)))
iter = 0
while (iter < maxIter):
alphaPairsChanged = 0
for i in range(m):
fXi = float(multiply(alphas, labelMat).T*\
(dataMatrix*dataMatrix[i, :].T)) + b
Ei = fXi - float(labelMat[i])
# Test the value of alpha whether it needs optimization.
if ((labelMat[i] * Ei < -toler) and (alphas[i] < C)) or \
((labelMat[i] * Ei > toler) and \
(alphas[i] > 0)):
# Randomly pick the second alpha
j = selectJrand(i, m)
fXj = float(multiply(alphas, labelMat).T*\
(dataMatrix*dataMatrix[j, :].T)) + b
Ej = fXj - float(labelMat[j])
alphaIold = alphas[i].copy()
alphaJold = alphas[j].copy()
# Ensure the range of alpha is between 0 to C
if (labelMat[i] != labelMat[j]):
L = max(0, alphas[j] - alphas[i])
H = min(C, C + alphas[j] - alphas[i])
else:
L = max(0, alphas[j] + alphas[i] - C)
H = min(C, alphas[j] + alphas[i])
if L == H:
print("L == H")
continue
eta = 2.0 * dataMatrix[i, :] * dataMatrix[j, :].T - \
dataMatrix[i, :] * dataMatrix[i, :].T - \
dataMatrix[j, :] * dataMatrix[j, :].T
if eta >= 0:
print("eta>=0")
continue
alphas[j] -= labelMat[j] * (Ei - Ej) / eta
alphas[j] = clipAlpha(alphas[j], H, L)
if (abs(alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
continue
# Modify i with value equals to j, but in opposite direction
alphas[i] += labelMat[j] * labelMat[i] * \
(alphaJold - alphas[j])
b1 = b - Ei - labelMat[i] * (alphas[i] - alphaIold) * \
dataMatrix[i, :] * dataMatrix[i, :].T - \
labelMat[j] * (alphas[j] - alphaJold) * \
dataMatrix[i, :] * dataMatrix[j, :].T
# Set constant term
b2 = b - Ej - labelMat[i] * (alphas[i] - alphaIold) * \
dataMatrix[i, :] * dataMatrix[j, :].T - \
labelMat[j] * (alphas[j] - alphaJold) * \
dataMatrix[j, :] * dataMatrix[j, :].T
if (0 < alphas[i]) and (C > alphas[i]):
b = b1
elif (0 < alphas[j]) and (C > alphas[j]):
b = b2
else:
b = (b1 + b2) / 2.0
alphaPairsChanged += 1
print("iter: %d i: %d, pairs changed %d" % \
(iter, i, alphaPairsChanged))
if (alphaPairsChanged == 0):
iter += 1
else:
iter = 0
print("iteration number: %d" % iter)
return b, alphas
# Complete version of Platt SMO Supporting functions, program 6_3
'''
class optStruct:
def __init__(self, dataMatIn, classLabels, C, toler):
self.X = dataMatIn
self.labelMat = classLabels
self.C = C
self.tol = toler
self.m = shape(dataMatIn)[0]
self.alphas = mat(zeros((self.m, 1)))
self.b = 0
# Save the error into Cache
self.eCache = mat(zeros((self.m, 2)))
'''
# Modified for Kernal function, program 6_6
class optStruct:
def __init__(self, dataMatIn, classLabels, C, toler, kTup):
self.X = dataMatIn
self.labelMat = classLabels
self.C = C
self.tol = toler
self.m = shape(dataMatIn)[0]
self.alphas = mat(zeros((self.m, 1)))
self.b = 0
# Save the error into Cache
self.eCache = mat(zeros((self.m, 2)))
self.K = mat(zeros((self.m, self.m)))
for i in range(self.m):
self.K[:, i] = n (self.X, self.X[i, :], kTup)
'''
def calcEk(oS, k):
fXk = float(multiply(oS.alphas, oS.labelMat).T*\
(oS.X*oS.X[k, :].T)) + oS.b
Ek = fXk - float(oS.labelMat[k])
return Ek
'''
# Modified for kernal function, program 6_7
def calcEk(oS, k):
fXk = float(multiply(oS.alphas, oS.labelMat).T * oS.K[:, k] + oS.b)
Ek = fXk - float(oS.labelMat[k])
return Ek
# Preparation for the inner interation loop
def selectJ(i, oS, Ei):
maxK = -1
maxDeltaE = 0
Ej = 0
oS.eCache[i] = [1, Ei]
validEcacheList = nonzero(oS.eCache[:, 0].A)[0]
if (len(validEcacheList)) > 1:
for k in validEcacheList:
if k == i:
continue
Ek = calcEk(oS, k)
deltaE = abs(Ei - Ek)
# Select the largest step distance of j
if (deltaE > maxDeltaE):
maxK = k
maxDeltaE = deltaE
Ej = Ek
return maxK, Ej
else:
j = selectJrand(i, oS.m)
Ej = calcEk(oS, j)
return j, Ej
def updateEk(oS, k):
Ek = calcEk(oS, k)
oS.eCache[k] = [1, Ek]
# Complete Platt SMO algorithm, optimised version, program 6_4
'''
def innerL(i, oS):
Ei = calcEk(oS, i)
if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
j, Ej = selectJ(i, oS, Ei)
# Select the second alpha
alphaIold = oS.alphas[i].copy()
alphaJold = oS.alphas[j].copy()
if (oS.labelMat[i] != oS.labelMat[j]):
L = max(0, oS.alphas[j] - oS.alphas[i])
H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
else:
L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
H = min(oS.C, oS.alphas[j] + oS.alphas[i])
if L == H:
print("L == H")
return 0
eta = 2.0 * oS.X[i, :] * oS.X[j, :].T -oS.X[i, :] * oS.X[i, :].T - \
oS.X[j, :] * oS.X[j, :].T
if eta >= 0:
print("eta >= 0")
return 0
oS.alphas[j] -= oS.labelMat[j]* (Ei - Ej) / eta
oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
updateEk(oS, j)
if (abs(oS.alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
return 0
oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * \
(alphaJold - oS.alphas[j])
updateEk(oS, i)
b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.X[i, :] * oS.X[i, :].T - oS.labelMat[j] * \
(oS.alphas[j] - alphaJold) * oS.X[i, :] * oS.X[j, :].T
b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.X[i, :] * oS.X[j, :].T - oS.labelMat[j] * \
(oS.alphas[j] - alphaJold) * oS.X[j, :] * oS.X[j, :].T
if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]):
oS.b = b1
elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]) :
oS.b = b2
else:
oS.b = (b1 + b2) / 2.0
return 1
else:
return 0
'''
# Inner iteration for Kernal function, program 6_7
def innerL(i, oS):
Ei = calcEk(oS, i)
if ((oS.labelMat[i]*Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
((oS.labelMat[i]*Ei > oS.tol) and (oS.alphas[i] > 0)):
j, Ej = selectJ(i, oS, Ei)
# Select the second alpha
alphaIold = oS.alphas[i].copy()
alphaJold = oS.alphas[j].copy()
if (oS.labelMat[i] != oS.labelMat[j]):
L = max(0, oS.alphas[j] - oS.alphas[i])
H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
else:
L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
H = min(oS.C, oS.alphas[j] + oS.alphas[i])
if L == H:
print("L == H")
return 0
# Modified in program 6_7
eta = 2.0 * oS.K[i, j] - oS.K[i, i] - oS.K[j, j]
if eta >= 0:
print("eta >= 0")
return 0
oS.alphas[j] -= oS.labelMat[j]* (Ei - Ej) / eta
oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
updateEk(oS, j)
if (abs(oS.alphas[j] - alphaJold) < 0.0001):
print("j not moving enough")
return 0
oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * \
(alphaJold - oS.alphas[j])
updateEk(oS, i)
# Modified in program 6_7
b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.K[i, i] - oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
oS.K[i, j]
b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
oS.K[i, j] - oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
oS.K[j, j]
if (0 < oS.alphas[i]) and (oS.C > oS.alphas[i]):
oS.b = b1
elif (0 < oS.alphas[j]) and (oS.C > oS.alphas[j]) :
oS.b = b2
else:
oS.b = (b1 + b2) / 2.0
return 1
else:
return 0
# Complete Platt SMO outer iteration, program 6_5
def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):
# Add kTup parameter at the end
oS = optStruct(mat(dataMatIn), mat(classLabels).transpose(), C, toler, kTup)
iter = 0
entireSet = True
alphaPairsChanged = 0
# Iterate all value
while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
alphaPairsChanged = 0
if entireSet:
for i in range(oS.m):
alphaPairsChanged += innerL(i, oS)
print("fullSet, iter: %d i:%d, pairs changed %d" %\
(iter, i, alphaPairsChanged))
iter += 1
else:
# Iterate all non-boundaries value
nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
for i in nonBoundIs:
alphaPairsChanged += innerL(i, oS)
print("non-bound, iter: %d i: %d, pairs changed %d" % \
(iter, i, alphaPairsChanged))
iter += 1
if entireSet:
entireSet = False
elif (alphaPairsChanged == 0):
entireSet = True
print("iteration number: %d" % iter)
return oS.b, oS.alphas
# Calculate w value, chapter 6_4
def calcWs(alphas, dataArr, classLabels):
X = mat(dataArr)
labelMat = mat(classLabels).transpose()
m,n = shape(X)
w = zeros((n, 1))
for i in range(m):
w += multiply(alphas[i] * labelMat[i], X[i, :].T)
return w
# Kernal translation function, program 6_6
def kernelTrans(X, A, kTup):
m, n = shape(X)
K = mat(zeros((m, 1)))
if kTup[0] == 'lin':
K = X * A.T
elif kTup[0] == 'rbf':
for j in range(m):
deltaRow = X[j, :] - A
K[j] = deltaRow * deltaRow.T
# Division betwen elements
K = exp(K /(-1*kTup[1]**2))
else:
raise NameError('Houston We Have a Problem -- \
That Kernal is not recongnized')
return K
def testRbf(k1 = 1.3):
dataArr, labelArr = loadDataSet('testSetRBF.txt')
b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, ('rbf', k1))
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
svInd = nonzero(alphas.A > 0)[0]
# Construct SVM matrix
sVs = datMat[svInd]
labelSV = labelMat[svInd]
print("there are %d Support Vectors" % shape(sVs)[0])
m, n = shape(datMat)
errorCount = 0
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the training error rate is: %f" % (float(errorCount)/m))
dataArr, labelArr = loadDataSet('testSetRBF2.txt')
errorCount = 0
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
m, n = shape(datMat)
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], ('rbf', k1))
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the test error rate is: %f" %(float(errorCount) / m))
# Handwriting recongnition based on SVM, program 6_9
def loadImages(dirName):
hwLabels = []
trainingFileList = listdir(dirName)
m = len(trainingFileList)
trainingMat = zeros((m, 1024))
for i in range(m):
fileNameStr = trainingFileList[i]
fileStr = fileNameStr.split('.')[0]
classNumStr = int(fileStr.split('_')[0])
if classNumStr == 9:
hwLabels.append(-1)
else:
hwLabels.append(1)
trainingMat[i, :] = img2vector('%s/%s' % (dirName, fileNameStr))
return trainingMat, hwLabels
def testDigits(kTup = ('rbf', 10)):
dataArr, labelArr = loadImages('trainingDigits')
b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, kTup)
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
svInd = nonzero(alphas.A > 0)[0]
sVs = datMat[svInd]
labelSV = labelMat[svInd]
print("there are %d Support Vectors" % shape(sVs)[0])
m, n = shape(datMat)
errorCount = 0
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("the training error rate is: %f" % (float(errorCount) / m))
dataArr, labelArr = loadImages('testDigits')
errorCount = 0
datMat = mat(dataArr)
labelMat = mat(labelArr).transpose()
m, n = shape(datMat)
for i in range(m):
kernelEval = kernelTrans(sVs, datMat[i, :], kTup)
predict = kernelEval.T * multiply(labelSV, alphas[svInd]) + b
if sign(predict) != sign(labelArr[i]):
errorCount += 1
print("test test error rate is : %f" % (float(errorCount) / m))
def img2vector(filename):
returnVect = zeros((1, 1024))
fr = open(filename)
for i in range(32):
lineStr = fr.readline()
for j in range(32):
returnVect[0, 32*i + j] = int(lineStr[j])
return returnVect
第七章: 利用AdaBoost元算法提高分类性能
优点: 泛化错误率低,易编码,可以应用再大部分分类器上,无参数调整。
缺点: 对离群点敏感。
使用数据类型: 数值型和标称型数据。
AdaBoost的一般流程
- 收集数据: 可以使用任何方法。
- 准备数据: 依赖于所使用的弱分类类型,本章使用的是单层决策树,这种分类器可以处理任何数据类型。当然也可以使用任意分类器作为弱分类器,第二章到第六章中的任一分类器都可以充当弱分类器。作为弱分类器,简单分类器的效果更好。
- 分析数据: 可以使用任意方法。
- 训练算法: AdaBoost的大部分时间都在训练上,分类器将多次在同一数据集上训练弱分类器。
- 测试算法: 计算分类的错误率。
- 使用算法: 同SVM,AdaBoost预测两个类别中的一个。
示例:在一个数据集上应用AdaBoost
- 收集数据: 提供的文本文件
- 准备数据: 确保类别标签是+1和-1而非1和0.
- 分析数据: 手工检查数据。
- 训练算法: 在数据上,利用adaBoostTrainDS()函数训练出一系列的分类器。
- 测试算法: 在不采取随机抽样的方法下,对AdaBoost和Logistic回归的结果进行完全对等的比较。
- 使用算法: 观察例子上的错误率。
adaboost.py完整代码(章节中Python命令行代码在GitHub):
from numpy import *
import matplotlib.pyplot as plt
def loadSimpData():
datMat = matrix([[1., 2.1],
[2. , 1.1],
[1.3, 1. ],
[1. , 1. ],
[2. , 1. ]])
classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
return datMat, classLabels
# Single layer decision tree (decision stump), program 7_1
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
retArray = ones((shape(dataMatrix)[0], 1))
if threshIneq == 'lt':
retArray[dataMatrix[:, dimen] <= threshVal] = -1.0
else:
retArray[dataMatrix[:, dimen] > threshVal] = -1.0
return retArray
def buildStump(dataArr, classLabels, D):
dataMatrix = mat(dataArr)
labelMat = mat(classLabels).T
m, n = shape(dataMatrix)
numSteps = 10.0
bestStump = {}
bestClasEst = mat(zeros((m, 1)))
minError = inf
for i in range(n):
rangeMin = dataMatrix[:, i].min()
rangeMax = dataMatrix[:, i].max()
stepSize = (rangeMax - rangeMin) / numSteps
for j in range(-1, int(numSteps) + 1):
for inequal in ['lt', 'gt']:
threshVal = (rangeMin + float(j) * stepSize)
predictedVals = \
stumpClassify(dataMatrix, i, threshVal, inequal)
errArr = mat(ones((m, 1)))
errArr[predictedVals == labelMat] = 0
weightedError = D.T*errArr
# calculate the weighted error rate
# print("split: dim %d, thresh %.2f, thresh inequal: \
# %s, the weighted error is %.3f" %\
# (i, threshVal, inequal, weightedError))
if weightedError < minError:
minError = weightedError
bestClasEst = predictedVals.copy()
bestStump['dim'] = i
bestStump['thresh'] = threshVal
bestStump['ineq'] = inequal
return bestStump, minError, bestClasEst
# AdaBoost training function, base on decision stump, program 7_2
def adaBoostTrainDS(dataArr, classLabels, numIt = 40):
weakClassArr = []
m = shape(dataArr)[0]
D = mat(ones((m, 1)) / m)
aggClassEst = mat(zeros((m, 1)))
for i in range(numIt):
bestStump, error, classEst = buildStump(dataArr, classLabels, D)
print("D:", D.T)
alpha = float(0.5 * log((1.0 - error)/ max(error, 1e-16)))
bestStump['alpha'] = alpha
weakClassArr.append(bestStump)
print("classEst: ", classEst.T)
# Calculate D value for the next iteration
expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
D = multiply(D, exp(expon))
D = D / D.sum()
# Accumulate the error rate
aggClassEst += alpha * classEst
print("aggClassEst: ", aggClassEst.T)
aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, \
ones((m, 1)))
errorRate = aggErrors.sum() / m
print("total error: ", errorRate, "\n")
if errorRate == 0.0:
break
# program 7_2 version
# return weakClassArr
# program 7_5 version
return weakClassArr, aggClassEst
# AdaBoost's classification function, program 7_3
def adaClassify(datToClass, classifierArr):
dataMatrix = mat(datToClass)
m = shape(dataMatrix)[0]
aggClassEst = mat(zeros((m, 1)))
for i in range(len(classifierArr)):
classEst = stumpClassify(dataMatrix, classifierArr[i]['dim'], \
classifierArr[i]['thresh'], \
classifierArr[i]['ineq'])
aggClassEst += classifierArr[i]['alpha'] * classEst
print(aggClassEst)
return sign(aggClassEst)
# Self adaptive data loading funtion, program 7_4
def loadDataSet(fileName):
numFeat = len(open(fileName).readline().split('\t'))
dataMat = []
labelMat = []
fr = open(fileName)
for line in fr.readlines():
lineArr = []
curLine = line.strip().split('\t')
for i in range(numFeat - 1):
lineArr.append(float(curLine[i]))
dataMat.append(lineArr)
labelMat.append(float(curLine[-1]))
return dataMat, labelMat
def plotROC(preStrengths, classLabels):
cur = (1.0, 1.0)
ySum = 0.0
numPosClas = sum(array(classLabels) == 1.0)
yStep = 1 / float(numPosClas)
xStep = 1 / float(len(classLabels) - numPosClas)
# Get a sorted list of index
sortedIndicies = preStrengths.argsort()
fig = plt.figure()
fig.clf()
ax = plt.subplot(111)
for index in sortedIndicies.tolist()[0]:
if classLabels[index] == 1.0:
delX = 0
delY = yStep
else:
delX = xStep
delY = 0
ySum += cur[1]
ax.plot([cur[0], cur[0] - delX], [cur[1], cur[1] - delY], c = 'b')
cur = (cur[0] - delX, cur[1] - delY)
ax.plot([0, 1], [0, 1], 'b--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
ax.axis([0, 1, 0, 1])
plt.show()
print("the Area Under the Curve is: ", ySum * xStep)