Generalization error vs. empirical error:
Non-separable: if the target concept \(c \notin H\), then no hypothesis in H can classify all samples with complete correctness.
Probably approximately correct: with high probability, learn a model whose error stays within a preset upper bound.
Sample complexity: the smallest m satisfying \(m \ge \mathrm{poly}(1/\epsilon, 1/\delta, \mathrm{size}(x), \mathrm{size}(c))\) that a PAC learning algorithm requires is called the algorithm's sample complexity.
Complexity of H in PAC learning
How many samples are needed? (a standard bound for finite \(H\) is sketched after this list)
I did not fully understand this part; leaving it blank for now!
Growth function:
Dichotomy:
Shattering:
VC dimension:
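A minimal sketch of the sample-complexity idea above (my own addition, not from the course or book): in the realizable case (\(c \in H\)) with a finite hypothesis space, a standard PAC bound is \( m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right) \), meaning that with probability at least \(1-\delta\), any hypothesis in \(H\) consistent with \(m\) such samples has generalization error at most \(\epsilon\).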
# Each row is one entry; each column records whether that entry appears in the corresponding set: 1 (present), 0 (absent)
$ head all.base.0.1.set.test.txt
NM_212989-2160 0 0 0 0 1
NM_200657-769 0 1 1 1 0
NM_200657-768 1 1 1 1 0
NM_200657-765 0 1 1 0 0
NM_200657-764 0 1 1 0 0
NM_200657-767 1 0 1 1 0
NM_200657-766 0 1 1 1 0
NM_200657-761 1 1 0 1 0
NM_200657-760 0 1 1 1 0
NM_200657-763 0 1 0 0 0
library("UpSetR")
t = './all.base.set.txt'
savefn = paste(t, '.pdf', sep='')
df = read.table(t, sep='\t', header=FALSE, row.names=1)
names(df) = c('set1', 'set2', 'set3', 'set4', 'set5')
print(head(df))
pdf(savefn, width=24, height=16)
upset(df, sets=c('set1', 'set2', 'set3', 'set4', 'set5'),
sets.bar.color = "grey",
order.by = "degree",
decreasing = "F",
empty.intersections = NULL, # on: show empty intersect
keep.order=TRUE,
number.angles=0,
point.size = 8.8,
line.size = 3,
scale.intersections="identity",
mb.ratio = c(0.7, 0.3),
nintersects=NA,
text.scale=c(5,5,5,3,5,5),
show.numbers="no",
set_size.angles=90
)
dev.off()  # close the PDF device
An example output looks like this (plot not shown here). The following script then summarizes, for each set, how its entries are distributed across intersection degrees (the number of sets an entry belongs to):
import pandas as pd
from nested_dict import nested_dict
import matplotlib as mpl
mpl.use('Agg')
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.colors as mcolors
sns.set(style="ticks")
sns.set_context("poster")
def load_upset_file(txt=None, header_ls=None,):
if txt is None:
txt = 'all.base.set.txt'
if header_ls is None:
header_ls = ['set1', 'set2', 'set3', 'set4', 'set5']
set_category_stat_dict = nested_dict(2, int)
with open(txt, 'r') as TXT:
for n,line in enumerate(TXT):
line = line.strip()
if not line or line.startswith('#'):
continue
arr = line.split('\t')
entry_in = sum(map(int, arr[1:]))
for i,j in zip(arr[1:], header_ls):
if int(i) == 1:
set_category_stat_dict[j][entry_in] += 1
set_category_stat_df = pd.DataFrame.from_dict(set_category_stat_dict, orient='index')
set_category_stat_df = set_category_stat_df.loc[header_ls, :]
print set_category_stat_df
set_category_stat_df_ratio = set_category_stat_df.div(set_category_stat_df.sum(axis=1), axis=0)
print set_category_stat_df_ratio
fig, ax = plt.subplots(1, 2)
set_category_stat_df.plot(kind='bar', stacked=True, ax=ax[0])
set_category_stat_df_ratio.plot(kind='bar', stacked=True, ax=ax[1])
plt.tight_layout()
savefn = txt.replace('.txt', '.ratio.pdf')
plt.savefig(savefn)
plt.close()
def main():
load_upset_file()
if __name__ == '__main__':
main()
Exercises accompanying Google's machine learning crash course.
tf.add(x, y): element-wise addition of two tensors.
primes = tf.constant([2, 3, 5, 7, 11, 13], dtype=tf.int32)  # primes as defined earlier in the exercise
twos = tf.constant([2, 2, 2, 2, 2, 2], dtype=tf.int32)
primes_doubled = primes * twos
print("primes_doubled:", primes_doubled)
# The printed tensor shows its values together with its shape and dtype
primes_doubled: tf.Tensor([ 4 6 10 14 22 26], shape=(6,), dtype=int32)
tf.matmul: matrix multiplication.
tf.reshape: changes a tensor's shape or number of dimensions (rank).
tf.assign: a variable's value, unlike a constant's, can be reassigned after it is created:
v = tf.contrib.eager.Variable([3])
print(v)
tf.assign(v, [7])
print(v)
[3]
[7]
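A small tf.reshape illustration (my own sketch, assuming TF 1.x with eager execution enabled, as in these exercises):
import tensorflow as tf

# A 4x2 tensor reshaped to 2x4 and then to rank 1; the 8 values are unchanged.
matrix = tf.constant([[1, 2], [3, 4], [5, 6], [7, 8]], dtype=tf.int32)
print(tf.reshape(matrix, [2, 4]))  # shape (2, 4)
print(tf.reshape(matrix, [8]))     # shape (8,), rank reduced from 2 to 1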
Simulate rolling two dice 10 times: build a 10x3 tensor whose third column is the sum of the first two columns.
tf.random_uniform: draws random values; tf.concat: concatenates tensors.
die1 = tf.contrib.eager.Variable(
tf.random_uniform([10, 1], minval=1, maxval=7, dtype=tf.int32))
die2 = tf.contrib.eager.Variable(
tf.random_uniform([10, 1], minval=1, maxval=7, dtype=tf.int32))
dice_sum = tf.add(die1, die2)
resulting_matrix = tf.concat(values=[die1, die2, dice_sum], axis=1)
print(resulting_matrix.numpy())
[[5 1 6]
[4 5 9]
[1 6 7]
[2 3 5]
[1 2 3]
[5 4 9]
[3 5 8]
[3 2 5]
[3 1 4]
[6 3 9]]
reindex adds new rows for index values that are missing from the DataFrame and fills them with NaN. The cities DataFrame:
City name Population Area square miles Population density
0 San Francisco 852469 46.87 18187.945381
1 San Jose 1015785 176.53 5754.177760
2 Sacramento 485199 97.92 4955.055147
# Shuffle the index, then reindex
cities.reindex(np.random.permutation(cities.index))
# Out-of-range labels do not raise an error; new rows are created (filled with NaN)
cities.reindex([0, 4, 5, 2])
Some labels may be unreliable.
Emails not labeled as "spam" or "not spam" are unlabeled examples.
"A user clicked the shoe's description" is a useful label.
Shoe size is a useful feature.
Mean squared error:
Learning rate × gradient is the amount applied to the weights at each update step. Batches can be small, or even contain a single example (SGD).
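A minimal numpy sketch of this update rule (my own illustration, not from the course): one gradient step for a linear model on a mini-batch; with a batch of size 1 this is exactly SGD.
import numpy as np

def sgd_step(w, b, x_batch, y_batch, learning_rate=0.05):
    # One gradient-descent step for y ~ w*x + b under squared loss.
    error = w * x_batch + b - y_batch
    grad_w = 2.0 * np.mean(error * x_batch)  # d(MSE)/dw
    grad_b = 2.0 * np.mean(error)            # d(MSE)/db
    # Each parameter moves by learning_rate * gradient.
    return w - learning_rate * grad_w, b - learning_rate * grad_b

w, b = 0.0, 0.0
for _ in range(500):
    w, b = sgd_step(w, b, np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0]))
print(w, b)  # roughly w ~ 2, b ~ 0 for the data y = 2x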
Task: predict the median house value of city blocks from a single input feature.
# Imports assumed by the snippets in this exercise (reconstructed from the course notebook):
import math
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import metrics
from tensorflow.python.data import Dataset
# Prepare the feature and target data; only one feature is used here
# (california_housing_dataframe is loaded earlier in the exercise).
my_feature = california_housing_dataframe[["total_rooms"]]
feature_columns = [tf.feature_column.numeric_column("total_rooms")]
targets = california_housing_dataframe["median_house_value"]
# Configure the optimizer used during training
my_optimizer=tf.train.GradientDescentOptimizer(learning_rate=0.0000001)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
# Configure the model
linear_regressor = tf.estimator.LinearRegressor(
feature_columns=feature_columns,
optimizer=my_optimizer
)
def my_input_fn(features, targets, batch_size=1, shuffle=True, num_epochs=None):
"""Trains a linear regression model of one feature.
"""
features = {key:np.array(value) for key,value in dict(features).items()}
ds = Dataset.from_tensor_slices((features,targets)) # warning: 2GB limit
ds = ds.batch(batch_size).repeat(num_epochs)
if shuffle:
ds = ds.shuffle(buffer_size=10000)
# Return the next batch of data.
features, labels = ds.make_one_shot_iterator().get_next()
return features, labels
# Train the model
_ = linear_regressor.train(
input_fn = lambda:my_input_fn(my_feature, targets),
steps=100
)
# Evaluate the model
prediction_input_fn =lambda: my_input_fn(my_feature, targets, num_epochs=1, shuffle=False)
predictions = linear_regressor.predict(input_fn=prediction_input_fn)
predictions = np.array([item['predictions'][0] for item in predictions])
mean_squared_error = metrics.mean_squared_error(predictions, targets)
root_mean_squared_error = math.sqrt(mean_squared_error)
print("Mean Squared Error (on training data): %0.3f" % mean_squared_error)
print("Root Mean Squared Error (on training data): %0.3f" % root_mean_squared_error)
Mean Squared Error (on training data): 56367.025
Root Mean Squared Error (on training data): 237.417
Is the model any good? Use RMSE (which can be interpreted on the same scale as the original target) and compare it directly against the minimum and maximum of the target. Here the error spans about half of the target range, so the model needs further tuning to reduce it.
Min. Median House Value: 14.999
Max. Median House Value: 500.001
Difference between Min. and Max.: 485.002
Root Mean Squared Error: 237.417
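For reference, a snippet along these lines produces the comparison above (reconstructed; it assumes the exercise's california_housing_dataframe and the root_mean_squared_error computed earlier):
min_house_value = california_housing_dataframe["median_house_value"].min()
max_house_value = california_housing_dataframe["median_house_value"].max()
min_max_difference = max_house_value - min_house_value
print("Min. Median House Value: %0.3f" % min_house_value)
print("Max. Median House Value: %0.3f" % max_house_value)
print("Difference between Min. and Max.: %0.3f" % min_max_difference)
print("Root Mean Squared Error: %0.3f" % root_mean_squared_error)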
Question: is there a standard heuristic for tuning the model?
Using total_rooms/population as a feature gives the average number of rooms per person, which is a sensible feature. Outliers can be clipped, e.g. clipped_feature = df["feature_name"].apply(lambda x: max(x, 0)), and the model retrained to see whether the results improve.
Training and test sets: adjust the relative proportion of the training and test sets and observe the model's performance. Different datasets call for different splits, and the right one has to be found by experiment.
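A sketch of that feature-engineering step (my addition; column names follow the California housing exercise, and the clipping threshold of 5 is an arbitrary choice):
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["total_rooms"] /
    california_housing_dataframe["population"])
# Clip extreme outliers so a handful of blocks do not dominate the loss,
# then retrain on the clipped feature and compare the RMSE.
california_housing_dataframe["rooms_per_person"] = (
    california_housing_dataframe["rooms_per_person"].apply(lambda x: min(x, 5)))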
There is no universal rule for how to split the data sets; it depends on the situation (see "About Train, Validation and Test Sets in Machine Learning").
Repeating the tune-on-test workflow many times can cause us to unknowingly fit the peculiarities of our particular test set.
A feature cross: [binned latitude X binned longitude X binned roomsPerPerson]
bucketized_column converts a numeric column into the corresponding bucket (category) index, which is then one-hot encoded into a series of binary features.
bucketized_longitude = tf.feature_column.bucketized_column(
longitude, boundaries=get_quantile_based_boundaries(
training_examples["longitude"], 10))
long_x_lat = tf.feature_column.crossed_column(
set([bucketized_longitude, bucketized_latitude]), hash_bucket_size=1000)
L2 regularization may cause the model to learn moderate weights for some uninformative features.
L2 regularization drives many uninformative weights close to (but not exactly) 0.0.
The two features will end up with nearly identical, moderate weights.
In roulette, a ball lands in one of 38 slots on a spinning wheel. A machine learning model uses visual features (the spin of the ball, the wheel's position when the ball drops, the ball's height above the wheel) to predict the slot the ball will land in, with 4% accuracy.
It will probably increase.
It always decreases or stays the same.
If model A has better precision and better recall than model B, model A is probably the better model.
No change. AUC only depends on the relative ranking of the prediction scores.
[L1 regularization]: Suppose a linear model has 100 input features, 10 of which are highly informative and 90 of which are uninformative. Assume all feature values are between -1 and 1. Which of the following statements are true?
L1 regularization may drive the weights of informative features to exactly 0.0.
L1 regularization will drive most of the uninformative weights to exactly 0.0.
[L1 vs. L2 regularization]: Suppose a linear model has 100 input features, all with values between -1 and 1, 10 of which are highly informative and 90 uninformative. Which type of regularization produces the smaller model?
L1 regularization.
Neural networks: on datasets that are hard to separate linearly, try different network architectures (hidden layers, weights, activation functions, etc.) and compare the results.
Building a neural network: the hidden_units argument; for example, hidden_units=[3, 10] specifies two hidden layers, the first with 3 nodes and the second with 10; by default the layers are fully connected and use the ReLU activation function.
Programming exercise: training neural networks takes a lot of time and compute, so it is best run on a cluster rather than on a local machine.
# Create a DNNRegressor object.
my_optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
my_optimizer = tf.contrib.estimator.clip_gradients_by_norm(my_optimizer, 5.0)
dnn_regressor = tf.estimator.DNNRegressor(
feature_columns=construct_feature_columns(training_examples),
hidden_units=hidden_units,
optimizer=my_optimizer
)
# Create input functions.
training_input_fn = lambda: my_input_fn(training_examples,
training_targets["median_house_value"],
batch_size=batch_size)
predict_training_input_fn = lambda: my_input_fn(training_examples,
training_targets["median_house_value"],
num_epochs=1,
shuffle=False)
predict_validation_input_fn = lambda: my_input_fn(validation_examples,
validation_targets["median_house_value"],
num_epochs=1,
shuffle=False)
def linear_scale(series):
min_val = series.min()
max_val = series.max()
scale = (max_val - min_val) / 2.0
return series.apply(lambda x:((x - min_val) / scale) - 1.0)
def log_normalize(series):
return series.apply(lambda x:math.log(x+1.0))
def clip(series, clip_to_min, clip_to_max):
return series.apply(lambda x:(
min(max(x, clip_to_min), clip_to_max)))
def z_score_normalize(series):
mean = series.mean()
std_dv = series.std()
return series.apply(lambda x:(x - mean) / std_dv)
def binary_threshold(series, threshold):
return series.apply(lambda x:(1 if x > threshold else 0))
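A brief usage sketch of these helpers (my addition; the dataframe and column names are assumptions following the exercise):
import pandas as pd
normalized = pd.DataFrame()
normalized["median_income"] = log_normalize(california_housing_dataframe["median_income"])
normalized["rooms_per_person"] = clip(california_housing_dataframe["rooms_per_person"], 0, 5)
normalized["latitude"] = linear_scale(california_housing_dataframe["latitude"])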
[Online training]: Which of the following statements about online (dynamic) training is true?
The model is updated as new data arrives.
[Offline training]: Which of the following statements about offline training are true?
Compared with online training, offline training requires less monitoring of the training job.
You can verify the model before pushing it to production.
[Online inference]: Online inference means making predictions on demand: the trained model is hosted on a server and inference requests are issued as needed. Which of the following statements about online inference are true?
You can provide predictions for any possible item.
You must carefully monitor the input signals.
[Offline inference]: In offline inference, predictions are made for a large batch of data all at once and stored in a lookup table for later use. Which of the following statements about offline inference are true?
For a given input, offline inference can serve a prediction faster than online inference.
After generating the predictions, we can verify them before they are used.
Understanding Data Dependencies: which of the following models are susceptible to a feedback loop?
A university ranking model that uses selectivity (the percentage of students who apply to a school and are admitted) as one of the rating inputs.
A book recommendation model that suggests novels users may like based on their popularity (i.e., number of purchases).
A traffic prediction model that uses the size of the crowd at a beach as one of the features to predict congestion at the highway exits near the beach.
sklearn.feature_extraction: extracts features in a format suitable for machine learning algorithms, e.g. from text and images.
DictVectorizer: handles dict items, converting the values stored in dicts into arrays a model can use; categorical features are encoded with one-of-K (one-hot) encoding.
>>> measurements = [
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
# Applied directly to the dicts: categorical values are one-hot encoded, numeric values are kept as-is
>>> vec.fit_transform(measurements).toarray()
array([[ 1., 0., 0., 33.],
[ 0., 1., 0., 12.],
[ 0., 0., 1., 18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']
FeatureHasher
Goal: compress the original high-dimensional feature vector into a lower-dimensional one while losing as little of its expressive power as possible.
Hash table: a hash function maps keys to buckets; different keys are usually hashed to different buckets, but collisions still occur, i.e. different keys can map to the same bucket.
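A small FeatureHasher sketch (my addition; the feature dicts are made up):
from sklearn.feature_extraction import FeatureHasher
hasher = FeatureHasher(n_features=8)           # hash features into 8 columns
raw = [{'dog': 1, 'cat': 2, 'elephant': 4},
       {'dog': 2, 'run': 5}]
X = hasher.transform(raw)                      # scipy sparse matrix of shape (2, 8)
print(X.toarray())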
# Bag-of-words model
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
corpus=["I come to China to travel",
"This is a car polupar in China",
"I love tea and Apple ",
"The work is to write some papers in science"]
print vectorizer.fit_transform(corpus)
print vectorizer.fit_transform(corpus).toarray()
print vectorizer.get_feature_names()
(0, 16) 1
(0, 3) 1
(0, 15) 2
(0, 4) 1
(1, 5) 1
(1, 9) 1
(1, 2) 1
(1, 6) 1
(1, 14) 1
(1, 3) 1
(2, 1) 1
(2, 0) 1
(2, 12) 1
(2, 7) 1
(3, 10) 1
(3, 8) 1
(3, 11) 1
(3, 18) 1
(3, 17) 1
(3, 13) 1
(3, 5) 1
(3, 6) 1
(3, 15) 1
[[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
[0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0]
[1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1]]
[u'and', u'apple', u'car', u'china', u'come', u'in', u'is', u'love', u'papers', u'polupar', u'science', u'some', u'tea', u'the', u'this', u'to', u'travel', u'work', u'write']
# Use the hashing trick to reduce the original 19 dimensions to 6
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer2=HashingVectorizer(n_features = 6,norm = None)
print vectorizer2.fit_transform(corpus)
print vectorizer2.fit_transform(corpus).toarray()
(0, 1) 2.0
(0, 2) -1.0
(0, 4) 1.0
(0, 5) -1.0
(1, 0) 1.0
(1, 1) 1.0
(1, 2) -1.0
(1, 5) -1.0
(2, 0) 2.0
(2, 5) -2.0
(3, 0) 0.0
(3, 1) 4.0
(3, 2) -1.0
(3, 3) 1.0
(3, 5) -1.0
[[ 0. 2. -1. 0. 1. -1.]
[ 1. 1. -1. 0. 0. -1.]
[ 2. 0. 0. 0. 0. -2.]
[ 0. 4. -1. 1. 0. -1.]]
CountVectorizer: tokenization plus term counting; see the example above. Note the encoding of the text files: it is usually utf-8, and any other encoding has to be specified via the encoding parameter.
TF-IDF formula: \(\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)\); with smooth_idf=False, sklearn uses \(\text{idf}(t) = \ln\frac{n}{\text{df}(t)} + 1\), where n is the number of documents and df(t) the number of documents containing t.
TfidfTransformer: converts a count matrix into tf-idf values.
TfidfVectorizer: combines CountVectorizer and TfidfTransformer.
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
use_idf=True)
# Intuition from the counts:
# the first term appears in every document, so it is probably uninformative;
# the other two terms appear in fewer than 50% of the documents, so they are likely more representative
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[0.81940995, 0. , 0.57320793],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],
[0.47330339, 0.88089948, 0. ],
[0.58149261, 0. , 0.81355169]])
# TfidfVectorizer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...
<4x9 sparse matrix of type '<... 'numpy.float64'>'
with 19 stored elements in Compressed Sparse ... format>
In the example below, 'words' and 'wprds' are meant to be the same word, one just misspelled; character 2-gram analysis captures the features they share:
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> ngram_vectorizer.get_feature_names() == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
[1, 1, 0, 1, 1, 1, 0, 1]])
The analyzer parameter of CountVectorizer:
analyzer='char_wb': character n-grams are built only from text inside word boundaries;
analyzer='char': n-grams can span across words.
extract_patches_2d: extracts patches from an image stored as a 2-D array (or 3-D with color information along the third axis).
reconstruct_from_patches_2d: rebuilds the image from all of its patches.
>>> import numpy as np
,从所有patch重建图像>>> import numpy as np
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0, 3, 6, 9],
[12, 15, 18, 21],
[24, 27, 30, 33],
[36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0, 3],
[12, 15]],
[[15, 18],
[27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
Computing similarity from users' opinions => collaborative filtering: it ignores the items' descriptive attributes and measures similarity strictly from the opinions of many users.
def ecludSim(inA,inB):
return 1.0/(1.0 + la.norm(inA - inB))
def pearsSim(inA,inB):
if len(inA) < 3 : return 1.0
return 0.5+0.5*corrcoef(inA, inB, rowvar = 0)[0][1]
def cosSim(inA,inB):
num = float(inA.T*inB)
denom = la.norm(inA)*la.norm(inB)
return 0.5+0.5*(num/denom)
>>> import numpy as np
>>> U,Sigma,VT = np.linalg.svd([[1,1], [7,7]])
>>> U
array([[-0.14142136, -0.98994949],
[-0.98994949, 0.14142136]])
>>> Sigma  # returned as a vector to save space; conceptually a diagonal matrix whose other elements are 0
array([ 1.00000000e+01, 1.44854506e-16])
# The first singular value is much larger than the second, so the second dimension can be dropped
>>> VT
array([[-0.70710678, -0.70710678],
[-0.70710678, 0.70710678]])
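A common heuristic (mentioned in Machine Learning in Action) is to keep enough singular values to retain about 90% of the total energy; a quick sketch of that calculation (my addition):
import numpy as np
Sigma = np.array([1.00000000e+01, 1.44854506e-16])   # singular values from the example above
energy_ratio = np.cumsum(Sigma ** 2) / np.sum(Sigma ** 2)
num_to_keep = int(np.searchsorted(energy_ratio, 0.9)) + 1
print(num_to_keep)  # 1: the first singular value already carries essentially all the energy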
The usual recommendation strategy (estimate scores for the user's unrated items and return the top N):
def standEst(dataMat, user, simMeas, item):
n = shape(dataMat)[1]
simTotal = 0.0; ratSimTotal = 0.0
for j in range(n):
userRating = dataMat[user,j]
if userRating == 0: continue
overLap = nonzero(logical_and(dataMat[:,item].A>0, \
dataMat[:,j].A>0))[0]
if len(overLap) == 0: similarity = 0
else: similarity = simMeas(dataMat[overLap,item], \
dataMat[overLap,j])
print 'the %d and %d similarity is: %f' % (item, j, similarity)
simTotal += similarity
ratSimTotal += similarity * userRating
if simTotal == 0: return 0
else: return ratSimTotal/simTotal
def recommend(dataMat, user, N=3, simMeas=cosSim, estMethod=standEst):
unratedItems = nonzero(dataMat[user,:].A==0)[1]#find unrated items
if len(unratedItems) == 0: return 'you rated everything'
itemScores = []
for item in unratedItems:
estimatedScore = estMethod(dataMat, user, simMeas, item)
itemScores.append((item, estimatedScore))
return sorted(itemScores, key=lambda jj: jj[1], reverse=True)[:N]
SVD-based rating estimation:
def svdEst(dataMat, user, simMeas, item):
n = shape(dataMat)[1]
simTotal = 0.0; ratSimTotal = 0.0
U,Sigma,VT = la.svd(dataMat)
Sig4 = mat(eye(4)*Sigma[:4]) #arrange Sig4 into a diagonal matrix
xformedItems = dataMat.T * U[:,:4] * Sig4.I #create transformed items
for j in range(n):
userRating = dataMat[user,j]
if userRating == 0 or j==item: continue
similarity = simMeas(xformedItems[item,:].T,\
xformedItems[j,:].T)
print 'the %d and %d similarity is: %f' % (item, j, similarity)
simTotal += similarity
ratSimTotal += similarity * userRating
if simTotal == 0: return 0
else: return ratSimTotal/simTotal
SVD-based image compression:
The original image is 32x32. After SVD, keeping only 2 singular values, U, Sigma, and VT are 32x2, 2, and 2x32 respectively, so storing these three matrices takes 32x2 + 2 + 2x32 = 130 numbers versus the original 32x32 = 1024, roughly an 8x compression.
def printMat(inMat, thresh=0.8):
for i in range(32):
for k in range(32):
if float(inMat[i,k]) > thresh:
print 1,
else: print 0,
print ''
def imgCompress(numSV=3, thresh=0.8):
myl = []
for line in open('0_5.txt').readlines():
newRow = []
for i in range(32):
newRow.append(int(line[i]))
myl.append(newRow)
myMat = mat(myl)
print "****original matrix******"
printMat(myMat, thresh)
U,Sigma,VT = la.svd(myMat)
# Convert the Sigma vector back into a matrix whose other elements are all 0
SigRecon = mat(zeros((numSV, numSV)))
for k in range(numSV):#construct diagonal matrix from vector
SigRecon[k,k] = Sigma[k]
# Reconstruct the data keeping only the first numSV singular values
reconMat = U[:,:numSV]*SigRecon*VT[:numSV,:]
print "****reconstructed matrix using %d singular values******" % numSV
printMat(reconMat, thresh)
FP-growth scans the dataset only twice, whereas Apriori scans it once for every candidate itemset.
Node link: a link between occurrences of the same item, used to quickly locate where that item appears in the tree.
3) Repeat the previous two steps until the tree contains only a single item.
The header table already holds the item frequencies; mining starts from the individual frequent items listed there.
3) Build a conditional FP-tree from these conditional pattern bases (the prefix paths).
From the conditional tree for m we can go on to build conditional trees for am, cm, and so on, which reveals the frequent sets for the different combinations.
Building the FP-tree:
class treeNode:
def __init__(self, nameValue, numOccur, parentNode):
self.name = nameValue
self.count = numOccur
self.nodeLink = None
self.parent = parentNode #needs to be updated
self.children = {}
def inc(self, numOccur):
self.count += numOccur
def disp(self, ind=1):
print ' '*ind, self.name, ' ', self.count
for child in self.children.values():
child.disp(ind+1)
def createTree(dataSet, minSup=1): #create FP-tree from dataset but don't mine
headerTable = {}
#go over dataSet twice
for trans in dataSet:#first pass counts frequency of occurance
for item in trans:
headerTable[item] = headerTable.get(item, 0) + dataSet[trans]
for k in headerTable.keys(): #remove items not meeting minSup
if headerTable[k] < minSup:
del(headerTable[k])
freqItemSet = set(headerTable.keys())
#print 'freqItemSet: ',freqItemSet
if len(freqItemSet) == 0: return None, None #if no items meet min support -->get out
for k in headerTable:
headerTable[k] = [headerTable[k], None] #reformat headerTable to use Node link
#print 'headerTable: ',headerTable
retTree = treeNode('Null Set', 1, None) #create tree
for tranSet, count in dataSet.items(): #go through dataset 2nd time
localD = {}
for item in tranSet: #put transaction items in order
if item in freqItemSet:
localD[item] = headerTable[item][0]
if len(localD) > 0:
orderedItems = [v[0] for v in sorted(localD.items(), key=lambda p: p[1], reverse=True)]
updateTree(orderedItems, retTree, headerTable, count)#populate tree with ordered freq itemset
return retTree, headerTable #return tree and header table
Mining frequent itemsets from the conditional trees:
def mineTree(inTree, headerTable, minSup, preFix, freqItemList):
bigL = [v[0] for v in sorted(headerTable.items(), key=lambda p: p[1])]#(sort header table)
for basePat in bigL: #start from bottom of header table
newFreqSet = preFix.copy()
newFreqSet.add(basePat)
#print 'finalFrequent Item: ',newFreqSet #append to set
freqItemList.append(newFreqSet)
condPattBases = findPrefixPath(basePat, headerTable[basePat][1])
#print 'condPattBases :',basePat, condPattBases
#2. construct cond FP-tree from cond. pattern base
myCondTree, myHead = createTree(condPattBases, minSup)
#print 'head from conditional tree: ', myHead
if myHead != None: #3. mine cond. FP-tree
#print 'conditional tree for: ',newFreqSet
#myCondTree.disp(1)
mineTree(myCondTree, myHead, minSup, newFreqSet, freqItemList)
The Python module pyfpgrowth makes frequent-itemset mining straightforward:
import pyfpgrowth
transactions = [[1, 2, 5],
[2, 4],
[2, 3],
[1, 2, 4],
[1, 3],
[2, 3],
[1, 3],
[1, 2, 3, 5],
[1, 2, 3]]
# find patterns in baskets that occur over the support threshold
patterns = pyfpgrowth.find_frequent_patterns(transactions, 2)
# find patterns that are associated with another with a certain minimum probability
# so this one can also derive association rules...?
rules = pyfpgrowth.generate_association_rules(patterns, 0.7)
Frequent itemsets in a news-site clickstream:
gongjing@hekekedeiMac ..eafileSyn/github/MLiA/source_code/Ch12 (git)-[master] % head kosarak.dat
1 2 3
1
4 5 6 7
1 8
9 10
11 6 12 13 14 15 16
1 3 7
17 18
11 6 19 20 21 22 23 24
1 25 3
import fpGrowth
parsedData = [line.split() for line in open('kosarak.dat').readlines()]
parsedData[0:2]
# [['1', '2', '3'], ['1']]
initSet = fpGrowth.createInitSet(parsedData)
myFPtree, myHeaderTab = fpGrowth.createTree(initSet, 100000)
myFreqList = []
fpGrowth.mineTree(myFPtree, myHeaderTab, 100000, set([]), myFreqList)
len(myFreqList)
# 9
myFreqList
# [set(['1']), set(['1', '6']), set(['3']), set(['11', '3']), set(['11', '3', '6']), set(['3', '6']), set(['11']), set(['11', '6']), set(['6'])]
import pyfpgrowth
patterns = pyfpgrowth.find_frequent_patterns(parsedData, 100000)
len(patterns)
# 8
patterns
# {('3', '6'): 265180, ('11', '3', '6'): 143682, ('6',): 601374, ('11', '3'): 161286, ('1', '6'): 132113, ('1',): 197522, ('3',): 450031, ('11', '6'): 324013}
# Compared with the result above, this one is missing one itemset: set(['11'])
Quantitative measure -> support: the fraction of records in the dataset that contain the itemset.
Quantitative measure -> confidence: confidence(P -> H) = support(P ∪ H) / support(P), where P ∪ H (written P | H in the book) is the union, i.e. all elements that appear in either P or H.
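A tiny worked example of both measures (the toy transactions mirror the book's example data):
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support(itemset):
    # fraction of transactions containing every element of itemset
    return sum(itemset <= t for t in transactions) / float(len(transactions))

print(support({2, 5}))                        # 0.75
print(support({2, 3, 5}))                     # 0.5
# confidence({2,5} -> {3}) = support({2,3,5}) / support({2,5})
print(support({2, 3, 5}) / support({2, 5}))   # 0.666...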
The Apriori algorithm:
def createC1(dataSet):
C1 = []
for transaction in dataSet:
for item in transaction:
if not [item] in C1:
C1.append([item])
C1.sort()
return map(frozenset, C1)#use frozen set so we
#can use it as a key in a dict
def scanD(D, Ck, minSupport):
# Scan dataset D and drop candidates in Ck whose support is below minSupport
ssCnt = {}
for tid in D:
for can in Ck:
if can.issubset(tid):
if not ssCnt.has_key(can): ssCnt[can]=1
else: ssCnt[can] += 1
numItems = float(len(D))
retList = []
supportData = {}
for key in ssCnt:
support = ssCnt[key]/numItems
if support >= minSupport:
retList.insert(0,key)
supportData[key] = support
return retList, supportData
def aprioriGen(Lk, k): #creates Ck
retList = []
lenLk = len(Lk)
for i in range(lenLk):
for j in range(i+1, lenLk):
L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
L1.sort(); L2.sort()
if L1==L2: #if first k-2 elements are equal
retList.append(Lk[i] | Lk[j]) #set union
return retList
def apriori(dataSet, minSupport = 0.5):
C1 = createC1(dataSet)
D = map(set, dataSet)
L1, supportData = scanD(D, C1, minSupport)
L = [L1]
k = 2
while (len(L[k-2]) > 0):
Ck = aprioriGen(L[k-2], k)
Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
supportData.update(supK)
L.append(Lk)
k += 1
return L, supportData
Association-rule generation functions:
def generateRules(L, supportData, minConf=0.7): #supportData is a dict coming from scanD
bigRuleList = []
for i in range(1, len(L)):#only get the sets with two or more items
for freqSet in L[i]:
H1 = [frozenset([item]) for item in freqSet]
if (i > 1):
rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
else:
calcConf(freqSet, H1, supportData, bigRuleList, minConf)
return bigRuleList
def calcConf(freqSet, H, supportData, brl, minConf=0.7):
prunedH = [] #create new list to return
for conseq in H:
conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
if conf >= minConf:
print freqSet-conseq,'-->',conseq,'conf:',conf
brl.append((freqSet-conseq, conseq, conf))
prunedH.append(conseq)
return prunedH
def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
m = len(H[0])
if (len(freqSet) > (m + 1)): #try further merging
Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
if (len(Hmp1) > 1): #need at least two sets to merge
rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)
sklearn has no built-in tool for this, but the ready-made Efficient-Apriori module makes association analysis easy:
from efficient_apriori import apriori
transactions = [('eggs', 'bacon', 'soup'),
('eggs', 'bacon', 'apple'),
('soup', 'bacon', 'banana')]
itemsets, rules = apriori(transactions, min_support=0.5, min_confidence=1)
print(rules) # [{eggs} -> {bacon}, {soup} -> {bacon}]
Handling large datasets:
def data_generator(filename):
"""
Data generator, needs to return a generator to be called several times.
"""
def data_gen():
with open(filename) as file:
for line in file:
yield tuple(k.strip() for k in line.split(','))
return data_gen
transactions = data_generator('dataset.csv')
itemsets, rules = apriori(transactions, min_support=0.9, min_confidence=0.6)
Clustering: group similar objects into the same cluster and dissimilar objects into different clusters; it is a bit like fully automatic classification and is a form of unsupervised learning. sklearn provides a comparison of common clustering algorithms:
K-means clustering:
def distEclud(vecA, vecB):
""" 计算两数据点之间的距离,这里是欧式距离
"""
return sqrt(sum(power(vecA - vecB, 2))) #la.norm(vecA-vecB)
def randCent(dataSet, k):
""" 随机初始化选取k个中心,注意这些点不是原来有的数据点
"""
n = shape(dataSet)[1]
centroids = mat(zeros((k,n)))#create centroid mat
for j in range(n):#create random cluster centers, within bounds of each dimension
minJ = min(dataSet[:,j])
rangeJ = float(max(dataSet[:,j]) - minJ)
centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
return centroids
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
m = shape(dataSet)[0]
clusterAssment = mat(zeros((m,2)))#create mat to assign data points
#to a centroid, also holds SE of each point
centroids = createCent(dataSet, k)
clusterChanged = True
while clusterChanged:
clusterChanged = False
for i in range(m):#for each data point assign it to the closest centroid
minDist = inf; minIndex = -1
for j in range(k):
distJI = distMeas(centroids[j,:],dataSet[i,:])
if distJI < minDist:
minDist = distJI; minIndex = j
if clusterAssment[i,0] != minIndex: clusterChanged = True
clusterAssment[i,:] = minIndex,minDist**2
print centroids
for cent in range(k):#recalculate centroids
ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]#get all the point in this cluster
centroids[cent,:] = mean(ptsInClust, axis=0) #assign centroid to mean
return centroids, clusterAssment
Bisecting K-means clustering:
def biKmeans(dataSet, k, distMeas=distEclud):
m = shape(dataSet)[0]
clusterAssment = mat(zeros((m,2)))
centroid0 = mean(dataSet, axis=0).tolist()[0]
centList =[centroid0] #create a list with one centroid
for j in range(m):#calc initial Error
clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
while (len(centList) < k):
lowestSSE = inf
for i in range(len(centList)):
ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]#get the data points currently in cluster i
centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
sseSplit = sum(splitClustAss[:,1])#compare the SSE to the currrent minimum
sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
print "sseSplit, and notSplit: ",sseSplit,sseNotSplit
if (sseSplit + sseNotSplit) < lowestSSE:
bestCentToSplit = i
bestNewCents = centroidMat
bestClustAss = splitClustAss.copy()
lowestSSE = sseSplit + sseNotSplit
bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList) #change 1 to 3,4, or whatever
bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
print 'the bestCentToSplit is: ',bestCentToSplit
print 'the len of bestClustAss is: ', len(bestClustAss)
centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]#replace a centroid with two best centroids
centList.append(bestNewCents[1,:].tolist()[0])
clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:]= bestClustAss#reassign new clusters, and SSE
return mat(centList), clusterAssment
Clustering 2-D data points:
from sklearn.cluster import KMeans
import numpy as np
X = np.array([[1, 2], [1, 4], [1, 0],
[10, 2], [10, 4], [10, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_
# array([1, 1, 1, 0, 0, 0], dtype=int32)
kmeans.predict([[0, 0], [12, 3]])
# array([1, 0], dtype=int32)
kmeans.cluster_centers_
# array([[10., 2.],
# [ 1., 2.]])
Cluster the iris dataset and compare the result visually with the ground truth:
# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
# Though the following import is not directly being used, it is required
# for 3D projection to work
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(5)
iris = datasets.load_iris()
X = iris.data
y = iris.target
estimators = [('k_means_iris_8', KMeans(n_clusters=8)),
('k_means_iris_3', KMeans(n_clusters=3)),
('k_means_iris_bad_init', KMeans(n_clusters=3, n_init=1,
init='random'))]
fignum = 1
titles = ['8 clusters', '3 clusters', '3 clusters, bad initialization']
for name, est in estimators:
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
est.fit(X)
labels = est.labels_
ax.scatter(X[:, 3], X[:, 0], X[:, 2],
c=labels.astype(np.float), edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title(titles[fignum - 1])
ax.dist = 12
fignum = fignum + 1
# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
for name, label in [('Setosa', 0),
('Versicolour', 1),
('Virginica', 2)]:
ax.text3D(X[y == label, 3].mean(),
X[y == label, 0].mean(),
X[y == label, 2].mean() + 2, name,
horizontalalignment='center',
bbox=dict(alpha=.2, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(np.float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y, edgecolor='k')
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
ax.set_title('Ground Truth')
ax.dist = 12
fig.show()
Choosing a suitable K (using the silhouette score):
from sklearn.cluster import KMeans
from sklearn import metrics
from nested_dict import nested_dict
silhouette_score_dict = nested_dict(1, int)
for k in range(3, 15):
kmeans_model = KMeans(n_clusters=k, random_state=0).fit(df_select_01)
silhouette_score = metrics.silhouette_score(df_select_01, kmeans_model.labels_,metric='euclidean')
print(k, silhouette_score)
silhouette_score_dict[k] = silhouette_score
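As a small follow-up to the loop above (my addition), the k with the highest silhouette score can then be picked directly:
best_k = max(silhouette_score_dict, key=silhouette_score_dict.get)
print('best k by silhouette score:', best_k, silhouette_score_dict[best_k])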