Notes on Machine Learning in Action (Chapters 1 and 2), with Python 3 Code

Machine Learning • Jan 15, 2018

Preface:

Machine Learning in Action is a popular introductory book on machine learning, and its code is written in Python 2. Since Python 2 is being phased out, these notes rewrite the book's code in Python 3.
Code on GitHub: https://github.com/kungbob/Machine_Learning_In_Action
Original Python 2 code: https://www.manning.com/books/machine-learning-in-action

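Chapter 1: Machine Learning Basics

Classification is one of the main tasks of machine learning; the other is regression, which predicts numeric values.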

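Steps for developing a machine learning application:

Collect data
Prepare the input data
Analyze the input data
Train the algorithm
Test the algorithm
Use the algorithm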

# import numpy library
from numpy import *
# generate a 4 x 4 array of random values in [0, 1) and wrap it in a matrix
randMat = mat(random.rand(4, 4))
print("Matrix is:\n", randMat)
# calculate the inverse of the matrix
invRandMat = randMat.I
print("Inverse is:\n", invRandMat)
# Matrix * Inverse = Identity Matrix
myEye = randMat * invRandMat
print("Matrix * Inverse is:\n", myEye)
# differs slightly from the true identity matrix due to floating-point error
diff = myEye - eye(4)
print("Difference is:\n", diff)
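The size of that difference can be checked numerically rather than by eye. The following is a minimal sketch, assuming plain NumPy arrays and `numpy.linalg.inv` in place of the `mat` type used above; the seed and the variable names are illustrative choices, not taken from the book.

```python
import numpy as np

# Fixed seed so the run is repeatable (an arbitrary choice for illustration).
rng = np.random.default_rng(0)
rand_arr = rng.random((4, 4))

# Inverse via numpy.linalg.inv; `@` is matrix multiplication for arrays.
inv_arr = np.linalg.inv(rand_arr)
product = rand_arr @ inv_arr

# Largest absolute deviation from the exact identity matrix.
max_error = np.abs(product - np.eye(4)).max()
print("Largest deviation from identity:", max_error)

# allclose compares within a floating-point tolerance instead of exactly.
print("Close to identity:", np.allclose(product, np.eye(4)))
```

Comparing with a tolerance is usually the safer test here, since an exact equality check against `eye(4)` would almost always fail on rounding error alone.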

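Chapter 2: k-Nearest Neighbors

Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: computationally expensive and memory-intensive.
Works with: numeric and nominal values.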

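How the k-nearest neighbors (kNN) algorithm works:

Start from a training sample set in which every sample carries a label, so the mapping between each sample and its class is known.
When new, unlabeled data comes in, compare each of its features with the corresponding features of the samples in the training set.
The algorithm then extracts the class labels of the most similar samples (the nearest neighbors).
Only the k most similar samples in the data set are considered, which is where the k comes from (k is usually no larger than 20).
The class that appears most often among those k samples becomes the class of the new data.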
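The steps above can be sketched in a few lines of standard-library Python. This is only an illustration, separate from the `classify0` listing in kNN.py: the function name `knn_classify` is hypothetical, and the four labelled points are the same toy data that `createDataSet()` returns.

```python
from collections import Counter
import math

def knn_classify(query, samples, labels, k):
    """Return the majority label among the k samples nearest to query."""
    # Euclidean distance from the query to every labelled sample.
    distances = [math.dist(query, s) for s in samples]
    # Indices of the k smallest distances.
    nearest = sorted(range(len(samples)), key=lambda i: distances[i])[:k]
    # Majority vote over the labels of those k neighbours.
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

samples = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.2), samples, labels, k=3))  # B
print(knn_classify((0.9, 0.8), samples, labels, k=3))  # A
```

With k = 3 each query has two neighbours from its own cluster and one from the other, so the vote is decided 2 to 1.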

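General approach to k-nearest neighbors:

Collect data: any method.
Prepare data: the distance calculation needs numeric values, preferably in a structured data format.
Analyze data: any method.
Train the algorithm: this step does not apply to kNN.
Test the algorithm: calculate the error rate.
Use the algorithm: first provide the sample data and the structured output, then run kNN to determine which class the input data belongs to, and finally act on the computed class.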
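The error rate in the testing step is the number of wrong predictions divided by the number of test samples. A minimal sketch, assuming predictions and true labels are held in two equal-length lists; the helper name `error_rate` and the sample labels are made up for illustration.

```python
def error_rate(predictions, truths):
    """Fraction of predictions that differ from the known labels."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        raise ValueError("need at least one test sample")
    errors = sum(1 for p, t in zip(predictions, truths) if p != t)
    return errors / len(truths)

# Illustrative labels only: one of the five predictions is wrong.
predicted = [1, 2, 3, 3, 1]
actual = [1, 2, 3, 1, 1]
print(error_rate(predicted, actual))  # 0.2
```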

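Example: a handwriting recognition system

Collect data: text files are provided.
Prepare data: write the function img2vector() to convert the image format into the vector format used by the classifier.
Analyze data: inspect the data at the Python prompt to make sure it meets the requirements.
Train the algorithm: this step does not apply to kNN.
Test the algorithm: write a function that uses part of the provided data set as test samples. Test samples differ from the other samples in that they have already been classified; if the predicted class differs from the actual class, it is counted as an error.
Use the algorithm: the book does not implement this step. Interested readers can build a complete application that extracts digits from images and recognizes them.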
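To make the preparation step concrete, here is a sketch of the image-to-vector conversion run on a synthetic file, so it works without the book's data set. It assumes the same format as the provided text files (32 lines of 32 characters, each `0` or `1`); the function name `image_text_to_vector`, the diagonal test pattern, and the file name are illustrative.

```python
import os
import tempfile
import numpy as np

def image_text_to_vector(path, size=32):
    """Read a size x size grid of '0'/'1' characters into a 1 x size*size array."""
    vector = np.zeros((1, size * size))
    with open(path) as handle:
        for row in range(size):
            line = handle.readline()
            for col in range(size):
                vector[0, size * row + col] = int(line[col])
    return vector

# Build a synthetic 32 x 32 "image": all zeros except a diagonal of ones.
rows = ["".join("1" if c == r else "0" for c in range(32)) for r in range(32)]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "0_0.txt")  # hypothetical file name
    with open(sample_path, "w") as handle:
        handle.write("\n".join(rows) + "\n")
    vec = image_text_to_vector(sample_path)

print(vec.shape)       # (1, 1024)
print(int(vec.sum()))  # 32, one pixel set per row
```

Pixel (row, col) lands at index 32 * row + col, so each image row occupies a contiguous run of 32 entries in the vector.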

Complete code for kNN.py (the Python command-line snippets from the chapter are on GitHub):

from numpy import *
from os import listdir
import operator

# Create data for later usage, chapter 2_1_1
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

# kNN algorithm, program 2_1
def classify0(inX, dataSet, labels, k):
    dataSetSize = dataSet.shape[0]
    # Euclidean distance between inX and every row of dataSet
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat**2
    sqDistances = sqDiffMat.sum(axis = 1)
    distances = sqDistances**0.5
    # Find the k points with the shortest distance and count their labels
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    # Sort the classes by vote count, highest first
    sortedClassCount = sorted(classCount.items(),
      key = operator.itemgetter(1), reverse = True)
    return sortedClassCount[0][0]

# Read txt file into matrix, program 2_2
def file2matrix(filename):
    # Read every line, closing the file when done
    with open(filename) as fr:
        arrayOLines = fr.readlines()
    # Number of lines in the file
    numberOfLines = len(arrayOLines)
    # Create a matrix with all zeros
    returnMat = zeros((numberOfLines, 3))
    classLabelVector = []
    index = 0
    # Read numbers into matrix
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

# Normalization of features, program 2_3
def autoNorm(dataSet):
    minVals = dataSet.min(0)
    maxVals = dataSet.max(0)
    ranges = maxVals - minVals
    m = dataSet.shape[0]
    # Subtract the per-column minimum; tile() repeats minVals m times along
    # the rows (and once along the columns) to match the shape of dataSet
    normDataSet = dataSet - tile(minVals, (m, 1))
    # Divide element-wise by the per-column range
    normDataSet = normDataSet / tile(ranges, (m, 1))
    return normDataSet, ranges, minVals

# Classifier for dating website, testing its accuracy, program 2_4
def datingClassTest():
    # Fraction of the data held out for testing
    hoRatio = 0.1
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)
    errorCount = 0.0
    # Testing classifier
    for i in range(numTestVecs):
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                            datingLabels[numTestVecs:m], 3)
        print("The classifier came back with: %d, the real answer is: %d"
                % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("The total error rate is: %f" % (errorCount/float(numTestVecs)))

# Dating Website Prediction Function, program 2_5
def classifyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input(
                    "percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # Scale the new sample with the same minVals and ranges as the training data
    classifierResult = classify0((inArr - minVals) / ranges,
                        normMat, datingLabels, 3)
    print("You will probably like this person: ",
                    resultList[classifierResult - 1])

# Change image of 32 x 32 pixels to 1 x 1024 vector, chapter 2_3_1
def img2vector(filename):
    returnVect = zeros((1, 1024))
    # Read 32 lines of 32 characters, closing the file when done
    with open(filename) as fr:
        for i in range(32):
            lineStr = fr.readline()
            for j in range(32):
                returnVect[0, 32*i + j] = int(lineStr[j])
    return returnVect

# Handwriting Testing Function, program 2_6
def handwritingClassTest():
    hwLabels = []
    # Retrieve directory content
    trainingFileList = listdir('trainingDigits')
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    # Parse the digit label from each file name and load the training vectors
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = listdir('testDigits')
    errorCount = 0.0
    mTest = len(testFileList)
    # Testing the algorithm
    for i in range(mTest):
        fileNameStr = testFileList[i]
        # Strip the extension first, as in the training loop
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest,
                                trainingMat, hwLabels, 3)
        print("The classifier came back with: %d, the real answer is: %d"
                % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount/float(mTest)))
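
The normalization in autoNorm() computes (value - min) / (max - min) for each feature column. As a worked example, here is a sketch of the same formula that relies on NumPy broadcasting instead of tile(); the function name `min_max_normalize` and the small data array are made up for illustration, and a column whose values are all equal would need separate handling because its range is zero.

```python
import numpy as np

def min_max_normalize(data):
    """Scale each column of data to [0, 1] using (x - min) / (max - min)."""
    min_vals = data.min(axis=0)
    ranges = data.max(axis=0) - min_vals
    # Broadcasting stretches the per-column vectors across every row,
    # so no explicit tile() call is needed.
    return (data - min_vals) / ranges, ranges, min_vals

# Illustrative values only: three samples with three features each.
data = np.array([[0.0, 10.0, 1.0],
                 [50.0, 20.0, 3.0],
                 [100.0, 40.0, 2.0]])
norm, ranges, min_vals = min_max_normalize(data)
print(norm)

# A new sample must be scaled with the same min_vals and ranges.
query = np.array([25.0, 40.0, 2.0])
print((query - min_vals) / ranges)
```

The first column becomes 0, 0.5, 1 and the query maps to 0.25, 1.0, 0.5. This is the same scaling that classifyPerson() applies to its input before calling classify0().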