对于监督学习,其最终目标都是在给定样本时,预测其最可能的类别,即优化目标总是:\(\begin{align}\arg \max_yp(y\|x)\end{align}\)。
求解上述优化目标总的可分为两类:
wiki: a generative model is a model of the conditional probability of the observable X, given a target y, symbolically,{P(X|Y=y)}.
判别模型 | 生成模型 | |
---|---|---|
关注点 | 关注各类的边界 | 关注每一类的分布,不关注决策边界的位置 |
优点 | 1.能得到类之间的差异信息 2.学习简单 |
1.研究类内部问题更加灵活 2.更适合增量学习 3.对数据不完整适应更好,更鲁棒 |
缺点 | 1.不能得到每一类的特征(可以告诉你的是1还是2,但是没有办法把整个场景描述出来) 2.变量关系不清晰、不可视 |
1.学习计算相对复杂 2.更倾向于得到false positive,尤其是对比较近的事物分类(比如牛和马) |
举例 | LR、SVM、KNN、traditional neural network、conditional random field | naive bayes、mixtures of mutinomial/gaussians/experts、HMMs、Sigmoidal belief networks、bayesian networks、markov random fields |
目标:判别一个动物是大象(y=1)还是狗(y=0)
判别式模型:
生成式模型:
高斯判别分析(Gaussian discriminant analysismodel, GDA):
对于上面的文本分类问题,涉及到具体的几个步骤:
1、数据准备:从文本构建词向量
loadDataSet
函数构建post文本内容及对应的标签(是否是侮辱性的)createVocabList
函数基于所有的文档,提取所有唯一的单词集合setOfWords2Vec
函数基于单词集合,把输入的post文本转换为一个单词向量,长度和单词集合相等,出现某单词则值为1,未出现则为0def loadDataSet():
postingList=[['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
['stop', 'posting', 'stupid', 'worthless', 'garbage'],
['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]
classVec = [0,1,0,1,0,1] #1 is abusive, 0 not
return postingList,classVec
def createVocabList(dataSet):
vocabSet = set([]) #create empty set
for document in dataSet:
vocabSet = vocabSet | set(document) #union of the two sets
return list(vocabSet)
def setOfWords2Vec(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] = 1
else: print "the word: %s is not in my Vocabulary!" % word
return returnVec
2、模型训练:基于词向量计算概率
trainMatrix
:词向量矩阵,每一行是一个文本,每一列表示某个单词在该文本中是否出现trainCategory
:文本所属的类别根据这个函数,就可以计算出,在每个类别中每个单词出现的概率,具体这里是两个类别:侮辱性的(标记为1的)和非侮辱性的(标记为0的)。
def trainNB0(trainMatrix,trainCategory):
numTrainDocs = len(trainMatrix)
numWords = len(trainMatrix[0])
pAbusive = sum(trainCategory)/float(numTrainDocs)
p0Num = ones(numWords); p1Num = ones(numWords) #change to ones()
p0Denom = 2.0; p1Denom = 2.0 #change to 2.0
for i in range(numTrainDocs):
if trainCategory[i] == 1:
p1Num += trainMatrix[i]
p1Denom += sum(trainMatrix[i])
else:
p0Num += trainMatrix[i]
p0Denom += sum(trainMatrix[i])
p1Vect = log(p1Num/p1Denom) #change to log()
p0Vect = log(p0Num/p0Denom) #change to log()
return p0Vect,p1Vect,pAbusive
注意:
p0Num = ones(numWords); p1Num = ones(numWords)
这里的初始化可以初始化为全0矩阵,但是如果某个单词没有出现,其概率为0,最后的乘积也为0,所以可以初始化为全1矩阵;p1Vect = log(p1Num/p1Denom)
这里如果不取log
,有可能由于太多的很小的数值相乘造成下溢出问题,所以去对数。3、模型测试
根据训练数据,已经计算了每个类别中每个单词的出现频率。对于新的文本数据,就可以基于此来计算此文本属于每一个类别的概率,最后文本所属的类别就是概率值最大的那一类。因为这里引入了贝叶斯假设,认为不同的单词之间是相互独立的,所以可以直接计算得到概率,就是下面的vec2Classify * p1Vec
和vec2Classify * p0Vec
,vec2Classify
就是待分类文本的词向量,p1Vec
是类别1中单词的概率,两者元素相乘再加和就是文本属于类别1的概率。
def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
p1 = sum(vec2Classify * p1Vec) + log(pClass1) #element-wise mult
p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1)
if p1 > p0:
return 1
else:
return 0
def testingNB():
listOPosts,listClasses = loadDataSet()
myVocabList = createVocabList(listOPosts)
trainMat=[]
for postinDoc in listOPosts:
trainMat.append(setOfWords2Vec(myVocabList, postinDoc))
p0V,p1V,pAb = trainNB0(array(trainMat),array(listClasses))
testEntry = ['love', 'my', 'dalmation']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
testEntry = ['stupid', 'garbage']
thisDoc = array(setOfWords2Vec(myVocabList, testEntry))
print testEntry,'classified as: ',classifyNB(thisDoc,p0V,p1V,pAb)
词集模型 vs 词袋模型
# 词集模型
def setOfWords2Vec(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] = 1
else: print "the word: %s is not in my Vocabulary!" % word
return returnVec
# 词袋模型
def bagOfWords2VecMN(vocabList, inputSet):
returnVec = [0]*len(vocabList)
for word in inputSet:
if word in vocabList:
returnVec[vocabList.index(word)] += 1
return returnVec
参考这里的文档:
# load data
from sklearn.datasets import fetch_20newsgroups
twenty_train = fetch_20newsgroups(subset='train',
categories=categories, shuffle=True, random_state=42)
# Tokenizing text
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(twenty_train.data)
# occurrences to frequencies
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
# train NB model
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)
# test model
docs_new = ['God is love', 'OpenGL on the GPU is fast']
X_new_counts = count_vect.transform(docs_new)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)
predicted = clf.predict(X_new_tfidf)
for doc, category in zip(docs_new, predicted):
print('%r => %s' % (doc, twenty_train.target_names[category]))
联合成一个pipeline:
# training
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
('vect', CountVectorizer()),
('tfidf', TfidfTransformer()),
('clf', MultinomialNB()),
])
text_clf.fit(twenty_train.data, twenty_train.target)
# testing
import numpy as np
twenty_test = fetch_20newsgroups(subset='test',
categories=categories, shuffle=True, random_state=42)
docs_test = twenty_test.data
predicted = text_clf.predict(docs_test)
np.mean(predicted == twenty_test.target)
是bagging和决策树的结合
random forest(RF)=bagging+fully-grown CART decision tree
将所有决策树通过bagging的形式结合起来,避免单个决策树造成过拟合
获得不同的决策树的方式:
每一轮得到的树由不同的30个特征构成,每棵树不一样
选择的是一部分特征,属于低维映射
上面的bagging涉及到的bootstrap随机抽取数据集训练不同的模型的过程,是不是和交叉验证比较像,下表是两者选取数据集合训练的比较:
有几点值得注意:
OOB类似于验证数据集(都是未参与训练的),那么是否能用OOB验证构建的模型g的好坏呢?可以!
例子:\((x_N,y_N)\)是\(g_2,g_3,g_T\)的OOB,那么\((x_N,y_N)\)在\(G_N^-(N)\)上的表现是:\(G_N^-(N)=average(g_2,g_3,g_T)\)
这个做法类似于留一法验证,每次只对一个样本验证其在所属的所有的OOB的g模型上的表现(不是单个模型)
特征选择:从d维特征到d‘特征的subset-transform,最终是由d’维特征进行模型训练。比如原来特征有10000个,选取300个,需要舍弃部分特征:
筛选:计算每个特征的重要性(权重),再根据重要性的排序进行选择。
【2】通过permutation方式,将原来的N个样本的第i个特征打乱(重新洗牌)
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import make_classification
# 构建数据集
>>> X, y = make_classification(n_samples=1000, n_features=4,
... n_informative=2, n_redundant=0,
... random_state=0, shuffle=False)
# 构建分类器
>>> clf = RandomForestClassifier(n_estimators=100, max_depth=2,
... random_state=0)
# 模型训练
>>> clf.fit(X, y)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=2, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
oob_score=False, random_state=0, verbose=0, warm_start=False)
# 获得每个特征的重要性分数
>>> print(clf.feature_importances_)
[0.14205973 0.76664038 0.0282433 0.06305659]
# 预测新数据
>>> print(clf.predict([[0, 0, 0, 0]]))
[1]
决策树算法:由一个决策图和可能的结果(包括资源成本和风险)组成, 用来创建到达目标的规划。
决策树的构建通常包含3个步骤:
建树的具体过程:
对子节点分别递归执行上一步,直到每个最终的子节点都足够’纯’或者都属于同一类别为止。
信息增益(Information Gain):划分数据集之前之后信息发生的变化,比如可以看熵值的变化,在每一个状态时具有一个熵值。
信息和熵值(Entropy)(参见下图的上半部分):
下面是一个例子,比较不同的特征进行数据分隔时的信息增益大小:
ID3使用信息增益最大的特征作为分隔特征,这个对于特征是有偏向性的,具有多个值的特征其信息增益也容易大从而被选做分隔特征。为了消除这个,在C4.5中引入了信息增益比:
ID3和C4.5构建的树规模较大,为了提高建树的效率,CART方法被开发出来,其使用的是GINI系数来知道特征的选择:
三种算法的比较:
在经典的ID3中,每次选取的分隔特征都是信息增益最大的,这个可以保证最后的分类效果达到最好,其损失函数值也是最小的。具体的,其损失函数很像最大似然估计函数:
计算给定数据集的香秾熵值:1)统计数据集label的count数,2)转换为概率,3)概率乘以对数概率并加和:
def calcShannonEnt(dataSet):
numEntries = len(dataSet)
labelCounts = {}
for featVec in dataSet: #the the number of unique elements and their occurance
currentLabel = featVec[-1]
if currentLabel not in labelCounts.keys(): labelCounts[currentLabel] = 0
labelCounts[currentLabel] += 1
shannonEnt = 0.0
for key in labelCounts:
prob = float(labelCounts[key])/numEntries
shannonEnt -= prob * log(prob,2) #log base 2
return shannonEnt
划分数据集:对于给定的数据集,循环其每一个特征,计算特征对应的香农熵,选取其中熵值最小(信息增益最大)的特征作为划分的特征:
def chooseBestFeatureToSplit(dataSet):
numFeatures = len(dataSet[0]) - 1 #the last column is used for the labels
baseEntropy = calcShannonEnt(dataSet)
bestInfoGain = 0.0; bestFeature = -1
for i in range(numFeatures): #iterate over all the features
featList = [example[i] for example in dataSet]#create a list of all the examples of this feature
uniqueVals = set(featList) #get a set of unique values
newEntropy = 0.0
for value in uniqueVals:
subDataSet = splitDataSet(dataSet, i, value)
prob = len(subDataSet)/float(len(dataSet))
newEntropy += prob * calcShannonEnt(subDataSet)
infoGain = baseEntropy - newEntropy #calculate the info gain; ie reduction in entropy
if (infoGain > bestInfoGain): #compare this to the best gain so far
bestInfoGain = infoGain #if better than current best, set to best
bestFeature = i
return bestFeature #returns an integer
对于iris
数据集应用决策树,过程中会搜索选取不同的特征及其阈值,以作为分隔的特征(节点):
from sklearn.datasets import load_iris
from sklearn import tree
# Load in our dataset
iris_data = load_iris()
# Initialize our decision tree object
classification_tree = tree.DecisionTreeClassifier()
# Train our decision tree (tree induction + pruning)
classification_tree = classification_tree.fit(iris_data.data, iris_data.target)
import graphviz
dot_data = tree.export_graphviz(classification_tree, out_file=None,
feature_names=iris.feature_names,
class_names=iris.target_names,
filled=True, rounded=True,
special_characters=True)
graph = graphviz.Source(dot_data)
graph.render("iris")
分布的运算:
x轴上每个单位的概率
。通过数值积分,可计算变量X在某个区间的概率。卷积
。卷积不是直接概率密度的相乘,是去所有值的积分。术语:
具体到这个例子中,有如下过程:
需要估计模型参数:即均值和方差
对模型参数\(\Theta\)做极大似然估计:
先计算隐变量的期望,再估计模型参数:
现在有两个硬币A和B,要估计的参数是它们各自翻正面(head)的概率。观察的过程是先随机选A或者B,然后扔10次。以上步骤重复5次。
具体参考这里:
对于第一次的抛,测试输出是否和论文一致:
# 硬币投掷结果观测序列
observations = np.array([[1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
[1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
[1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
[1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
[0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])
coin_A_pmf_observation_1 = stats.binom.pmf(5,10,0.6)
coin_B_pmf_observation_1 = stats.binom.pmf(5,10,0.5)
normalized_coin_A_pmf_observation_1 = coin_A_pmf_observation_1/(coin_A_pmf_observation_1+coin_B_pmf_observation_1)
# 更新在当前参数下A、B硬币产生的正反面次数
counts['A']['H'] += weight_A * num_heads
counts['A']['T'] += weight_A * num_tails
counts['B']['H'] += weight_B * num_heads
counts['B']['T'] += weight_B * num_tails
# 在初始化的theta下,AB分别产生正反面的次数被估计出来了
# 可以基于结果更新theta了
new_theta_A = counts['A']['H'] / (counts['A']['H'] + counts['A']['T'])
new_theta_B = counts['B']['H'] / (counts['B']['H'] + counts['B']['T'])
# new_theta_A=0.713
# new_theta_B=0.581
# 是和论文一致的
完整的模型:
def em_single(priors, observations):
"""
EM算法单次迭代
Arguments
---------
priors : [theta_A, theta_B]
observations : [m X n matrix]
Returns
--------
new_priors: [new_theta_A, new_theta_B]
:param priors:
:param observations:
:return:
"""
counts = {'A': {'H': 0, 'T': 0}, 'B': {'H': 0, 'T': 0}}
theta_A = priors[0]
theta_B = priors[1]
# E step
for observation in observations:
len_observation = len(observation)
num_heads = observation.sum()
num_tails = len_observation - num_heads
contribution_A = stats.binom.pmf(num_heads, len_observation, theta_A)
contribution_B = stats.binom.pmf(num_heads, len_observation, theta_B) # 两个二项分布
weight_A = contribution_A / (contribution_A + contribution_B)
weight_B = contribution_B / (contribution_A + contribution_B)
# 更新在当前参数下A、B硬币产生的正反面次数
counts['A']['H'] += weight_A * num_heads
counts['A']['T'] += weight_A * num_tails
counts['B']['H'] += weight_B * num_heads
counts['B']['T'] += weight_B * num_tails
# M step
new_theta_A = counts['A']['H'] / (counts['A']['H'] + counts['A']['T'])
new_theta_B = counts['B']['H'] / (counts['B']['H'] + counts['B']['T'])
return [new_theta_A, new_theta_B]
def em(observations, prior, tol=1e-6, iterations=10000):
"""
EM算法
:param observations: 观测数据
:param prior: 模型初值
:param tol: 迭代结束阈值
:param iterations: 最大迭代次数
:return: 局部最优的模型参数
"""
import math
iteration = 0
while iteration < iterations:
new_prior = em_single(prior, observations)
delta_change = np.abs(prior[0] - new_prior[0])
if delta_change < tol:
break
else:
prior = new_prior
iteration += 1
return [new_prior, iteration]
# 调用EM算法
print em(observations, [0.6, 0.5])
# [[0.79678875938310978, 0.51958393567528027], 14] # 这个结果与论文一致
K-近邻算法(k-nearest neighbors, KNN):对于要进行分类的数据,与已知的数据进行比较,计算相似性,找出最相似的K个,然后看这个K个的所属类别,最高概率的类别即为目标变量的分类。
可视化图解 (KDnuggets):
当选定不同的K值大小时,预测对应的分类可能不同,所以需要尝试不同的K值。下图中,当k=3时,会预测为B类型,当k=6时,会预测为A类型。
任务:改善约会网站的配对效果
数据集:(4列)
gongjing@MBP ..eafileSyn/github/MLiA/source_code/Ch02 % head datingTestSet.txt
40920 8.326976 0.953952 largeDoses
14488 7.153469 1.673904 smallDoses
26052 1.441871 0.805124 didntLike
75136 13.147394 0.428964 didntLike
读取数据集的函数:
def file2matrix(filename):
fr = open(filename)
numberOfLines = len(fr.readlines()) #get the number of lines in the file
returnMat = zeros((numberOfLines,3)) #prepare matrix to return
classLabelVector = [] #prepare labels return
fr = open(filename)
index = 0
for line in fr.readlines():
line = line.strip()
listFromLine = line.split('\t')
returnMat[index,:] = listFromLine[0:3]
# label是最后一列,但是是str类型,不能用int转换
# classLabelVector.append(int(listFromLine[-1]))
classLabelVector.append(listFromLine[-1])
index += 1
return returnMat,classLabelVector
归一化函数:在这个数据集中,不同的特征单位不同,数值差异很大,因而在计算不同样本的相似性时,数值大的特征会很大程度上决定了相似性的数值,所以需要把不同的特征都归一化到一致的范围。通常转换为[0,1]或者[-1,1],这里是通过转换将值归一化到0-1之间:newValue = (oldValue - min) / (max - min)
def autoNorm(dataSet):
minVals = dataSet.min(0)
maxVals = dataSet.max(0)
ranges = maxVals - minVals
normDataSet = zeros(shape(dataSet))
m = dataSet.shape[0]
normDataSet = dataSet - tile(minVals, (m,1))
normDataSet = normDataSet/tile(ranges, (m,1)) #element wise divide
return normDataSet, ranges, minVals
KNN分类函数(python是实现):
def classify0(inX, dataSet, labels, k):
dataSetSize = dataSet.shape[0]
diffMat = tile(inX, (dataSetSize,1)) - dataSet
sqDiffMat = diffMat**2
sqDistances = sqDiffMat.sum(axis=1)
distances = sqDistances**0.5
sortedDistIndicies = distances.argsort()
classCount={}
for i in range(k):
voteIlabel = labels[sortedDistIndicies[i]]
classCount[voteIlabel] = classCount.get(voteIlabel,0) + 1
sortedClassCount = sorted(classCount.iteritems(), key=operator.itemgetter(1), reverse=True)
return sortedClassCount[0][0]
对约会数据进行读取分类:
def datingClassTest():
hoRatio = 0.50 #hold out 10%
datingDataMat,datingLabels = file2matrix('datingTestSet.txt') #load data setfrom file
normMat, ranges, minVals = autoNorm(datingDataMat)
m = normMat.shape[0]
numTestVecs = int(m*hoRatio)
errorCount = 0.0
for i in range(numTestVecs):
classifierResult = classify0(normMat[i,:],normMat[numTestVecs:m,:],datingLabels[numTestVecs:m],3)
print "the classifier came back with: %s, the real answer is: %s" % (classifierResult, datingLabels[i])
if (classifierResult != datingLabels[i]): errorCount += 1.0
print "the total error rate is: %f" % (errorCount/float(numTestVecs))
print errorCount
the classifier came back with: largeDoses, the real answer is: largeDoses
the classifier came back with: smallDoses, the real answer is: smallDoses
......
the total error rate is: 0.064000
32.0 # 当hold out数据比例为10%时,有5个预测错误了。
df = pd.read_csv('./datingTestSet.txt', header=None, sep='\t')
df.head()
# 构建测试集和训练集
# 前面10%的数据作为测试集,后面90%的数据作为训练集
hoRatio = 0.1
X_test = df.ix[0:hoRatio*df.shape[0]-1, 0:2]
print X_test.shape
y_test = df.ix[0:hoRatio*df.shape[0]-1, 3]
print y_test.shape
X_train = df.ix[hoRatio*df.shape[0]:, 0:2]
print X_train.shape
y_train = df.ix[hoRatio*df.shape[0]:, 3]
print y_train.shape
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
metrics.accuracy_score(y_test, y_pred)
# accuracy: 0.76, 有24个预测错误了
定义:
详细请参见另一篇博文K均值聚类算法,其算法流程如下:
自底向上的聚合策略
不断重复,直到达到预设的聚类簇的个数
概率:
术语: