Pruning: the process of reducing the complexity of a decision tree to avoid overfitting.
Tree building is sensitive to some of its parameters, e.g. tolS (the minimum allowed drop in error for a split) and tolN (the minimum number of samples in a split). When the order of magnitude of a feature's values changes, the final regression result can be strongly affected.
Split the test data according to the existing tree
If either subset is itself a subtree, recurse the pruning process on that subset
Compute the error after merging the two current leaf nodes
Compute the error without merging
If merging reduces the error, merge the leaf nodes
Building the regression tree:
# Split the data set according to the given feature and threshold value
def binSplitDataSet(dataSet, feature, value):
    # note: no trailing [0] here; with it only the first matching row would be returned
    mat0 = dataSet[nonzero(dataSet[:,feature] > value)[0],:]
    mat1 = dataSet[nonzero(dataSet[:,feature] <= value)[0],:]
    return mat0,mat1
# Recursively call createTree to build the tree
def createTree(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)): # assume dataSet is NumPy Mat so we can array filtering
    feat, val = chooseBestSplit(dataSet, leafType, errType, ops) # choose the best split
    if feat == None: return val # if the splitting hit a stop condition return val
    retTree = {}
    retTree['spInd'] = feat
    retTree['spVal'] = val
    lSet, rSet = binSplitDataSet(dataSet, feat, val)
    retTree['left'] = createTree(lSet, leafType, errType, ops)
    retTree['right'] = createTree(rSet, leafType, errType, ops)
    return retTree
Splitting for the regression tree:
def regLeaf(dataSet): # returns the value used for each leaf
    return mean(dataSet[:,-1])

def regErr(dataSet):
    return var(dataSet[:,-1]) * shape(dataSet)[0]
# leafType: here regLeaf for a regression tree, i.e. the mean of the target variable at the leaf
# errType: here the variance of the target variable * number of samples = total squared error
# tolS = ops[0]; tolN = ops[1]: tolS is the minimum error reduction allowed for a split, tolN the minimum number of samples per branch (so no branch becomes too small)
def chooseBestSplit(dataSet, leafType=regLeaf, errType=regErr, ops=(1,4)):
    tolS = ops[0]; tolN = ops[1]
    # if all the target variables are the same value: quit and return value
    if len(set(dataSet[:,-1].T.tolist()[0])) == 1: # exit cond 1
        return None, leafType(dataSet)
    m,n = shape(dataSet)
    # the choice of the best feature is driven by Reduction in RSS error from mean
    S = errType(dataSet)
    bestS = inf; bestIndex = 0; bestValue = 0
    for featIndex in range(n-1):
        # note: this iterates over the set of values the feature actually takes, not over every value in its range
        for splitVal in set(dataSet[:,featIndex].T.tolist()[0]):
            mat0, mat1 = binSplitDataSet(dataSet, featIndex, splitVal)
            if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): continue
            newS = errType(mat0) + errType(mat1)
            if newS < bestS:
                bestIndex = featIndex
                bestValue = splitVal
                bestS = newS
    # if the decrease (S-bestS) is less than a threshold don't do the split
    if (S - bestS) < tolS:
        return None, leafType(dataSet) # exit cond 2
    mat0, mat1 = binSplitDataSet(dataSet, bestIndex, bestValue)
    if (shape(mat0)[0] < tolN) or (shape(mat1)[0] < tolN): # exit cond 3
        return None, leafType(dataSet)
    return bestIndex, bestValue # returns the best feature to split on
                                # and the value used for that split
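# A minimal usage sketch: build a regression tree from a NumPy matrix whose last
# column is the target. 'train_data.txt' is a hypothetical whitespace-delimited file.
myDat = mat(loadtxt('train_data.txt'))
regTree = createTree(myDat, ops=(1, 4))  # tolS=1, tolN=4
print(regTree)  # nested dict: {'spInd': ..., 'spVal': ..., 'left': ..., 'right': ...}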
Pruning the regression tree:
def isTree(obj):
    return (type(obj).__name__ == 'dict')

def getMean(tree):
    if isTree(tree['right']): tree['right'] = getMean(tree['right'])
    if isTree(tree['left']): tree['left'] = getMean(tree['left'])
    return (tree['left'] + tree['right']) / 2.0

def prune(tree, testData):
    if shape(testData)[0] == 0: return getMean(tree) # if we have no test data collapse the tree
    if (isTree(tree['right']) or isTree(tree['left'])): # if either branch is a tree, try to prune it
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        if isTree(tree['left']): tree['left'] = prune(tree['left'], lSet)
        if isTree(tree['right']): tree['right'] = prune(tree['right'], rSet)
    # if they are now both leafs, see if we can merge them
    if not isTree(tree['left']) and not isTree(tree['right']):
        lSet, rSet = binSplitDataSet(testData, tree['spInd'], tree['spVal'])
        errorNoMerge = sum(power(lSet[:,-1] - tree['left'],2)) +\
                       sum(power(rSet[:,-1] - tree['right'],2))
        treeMean = (tree['left'] + tree['right']) / 2.0
        errorMerge = sum(power(testData[:,-1] - treeMean,2))
        if errorMerge < errorNoMerge:
            print("merging")
            return treeMean
        else: return tree
    else: return tree
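# Usage sketch for pruning: grow a large tree with loose stopping parameters on
# training data, then prune it back with held-out test data. File names are hypothetical.
trainMat = mat(loadtxt('train_data.txt'))
testMat = mat(loadtxt('test_data.txt'))
bigTree = createTree(trainMat, ops=(0, 1))  # almost no pre-pruning
prunedTree = prune(bigTree, testMat)        # post-pruning against the test data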
Model tree:
# Ordinary linear regression; returns the fitted model
def linearSolve(dataSet): # helper function used in two places
    m,n = shape(dataSet)
    X = mat(ones((m,n))); Y = mat(ones((m,1))) # create a copy of data with 1 in 0th position
    X[:,1:n] = dataSet[:,0:n-1]; Y = dataSet[:,-1] # and strip out Y
    xTx = X.T*X
    if linalg.det(xTx) == 0.0:
        raise NameError('This matrix is singular, cannot do inverse,\n'
                        'try increasing the second value of ops')
    ws = xTx.I * (X.T * Y)
    return ws,X,Y

def modelLeaf(dataSet): # create linear model and return coefficients
    ws,X,Y = linearSolve(dataSet)
    return ws

# Compute the error: squared error between predictions and true values
def modelErr(dataSet):
    ws,X,Y = linearSolve(dataSet)
    yHat = X * ws
    return sum(power(Y - yHat,2))
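# Usage sketch for a model tree: pass modelLeaf/modelErr into createTree so every
# leaf stores linear-regression coefficients instead of a mean. File name is hypothetical.
myDat = mat(loadtxt('train_data.txt'))
modelTree = createTree(myDat, modelLeaf, modelErr, ops=(1, 10))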
from sklearn.datasets import load_boston
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor
boston = load_boston()
regressor = DecisionTreeRegressor(random_state=0)
cross_val_score(regressor, boston.data, boston.target, cv=10)
# array([ 0.61..., 0.57..., -0.34..., 0.41..., 0.75...,
# 0.07..., 0.29..., 0.33..., -1.42..., -1.77...])
[zhangqf7@loginview02 danRer7]$ liftOver
liftOver - Move annotations from one assembly to another
usage:
liftOver oldFile map.chain newFile unMapped
oldFile and newFile are in bed format by default, but can be in GFF and
maybe eventually others with the appropriate flags below.
The map.chain file has the old genome as the target and the new genome
as the query.
***********************************************************************
WARNING: liftOver was only designed to work between different
assemblies of the same organism. It may not do what you want
if you are lifting between different organisms. If there has
been a rearrangement in one of the species, the size of the
region being mapped may change dramatically after mapping.
***********************************************************************
options:
-minMatch=0.N Minimum ratio of bases that must remap. Default 0.95
-gff File is in gff/gtf format. Note that the gff lines are converted
separately. It would be good to have a separate check after this
that the lines that make up a gene model still make a plausible gene
after liftOver
-genePred - File is in genePred format
-sample - File is in sample format
-bedPlus=N - File is bed N+ format
-positions - File is in browser "position" format
-hasBin - File has bin value (used only with -bedPlus)
-tab - Separate by tabs rather than space (used only with -bedPlus)
-pslT - File is in psl format, map target side only
-ends=N - Lift the first and last N bases of each record and combine the
result. This is useful for lifting large regions like BAC end pairs.
-minBlocks=0.N Minimum ratio of alignment blocks or exons that must map
(default 1.00)
-fudgeThick (bed 12 or 12+ only) If thickStart/thickEnd is not mapped,
use the closest mapped base. Recommended if using
-minBlocks.
-multiple Allow multiple output regions
-minChainT, -minChainQ Minimum chain size in target/query, when mapping
to multiple output regions (default 0, 0)
-minSizeT deprecated synonym for -minChainT (ENCODE compat.)
-minSizeQ Min matching region size in query with -multiple.
-chainTable Used with -multiple, format is db.tablename,
to extend chains from net (preserves dups)
-errorHelp Explain error messages
[zhangqf7@loginview02 danRer7]$ CrossMap.py
Program: CrossMap (v0.2.5)
Description:
CrossMap is a program for convenient conversion of genome coordinates and
genomeannotation files between assemblies (eg. lift from human hg18 to hg19 or
vice versa).It supports file in BAM, SAM, BED, Wiggle, BigWig, GFF, GTF and VCF
format.
Usage: CrossMap.py <command> [options]
bam convert alignment file in BAM or SAM format.
bed convert genome cooridnate or annotation file in BED or BED-like format.
bigwig convert genome coordinate file in BigWig format.
gff convert genome cooridnate or annotation file in GFF or GTF format.
vcf convert genome coordinate file in VCF format.
wig convert genome coordinate file in Wiggle, or bedGraph format.
CrossMap supports more file formats than liftOver, like wig/bigWig. However, when I use CrossMap to convert a bigWig file, the output file seems much smaller.
# download phastCons scores of zebrafish (only provided for z7)
# link: //hgdownload.cse.ucsc.edu/goldenPath/danRer7/phastCons8way/
# currently there is no z10 version: https://www.biostars.org/p/282330/
# convert from z7 to z10
# output: fish.phastCons8way.bw.bgr, fish.phastCons8way.bw.bw, fish.phastCons8way.bw.sorted.bgr
[zhangqf7@bnode02 conservation]$ CrossMap.py bigwig danRer7ToDanRer10.over.chain.gz danRer7/fish.phastCons8way.bw danRer10/fish.phastCons8way.bw
@ 2018-08-27 20:39:45: Read chain_file: danRer7ToDanRer10.over.chain.gz
@ 2018-08-27 20:39:49: Liftover bigwig file: danRer7/fish.phastCons8way.bw ==> danRer10/fish.phastCons8way.bw.bgr
@ 2018-08-27 20:51:54: Merging overlapped entries in bedGraph file ...
@ 2018-08-27 20:51:54: Sorting bedGraph file:danRer10/fish.phastCons8way.bw.bgr
@ 2018-08-27 20:53:42: Convert wiggle to bigwig ...
# the converted .bw file is much smaller
[zhangqf7@loginview02 conservation]$ ll danRer7/fish.phastCons8way.bw
-rw-rw----+ 1 zhangqf7 zhangqf 757M Jan 20 2011 danRer7/fish.phastCons8way.bw
[zhangqf7@loginview02 conservation]$ ll danRer10/fish.phastCons8way.bw.bw
-rw-rw----+ 1 zhangqf7 zhangqf 22M Aug 27 20:53 danRer10/fish.phastCons8way.bw.bw
# convert z7 .bw to .bedgraph directly to check the line number
[zhangqf7@bnode02 danRer7]$ bigWigToBedGraph fish.phastCons8way.bw fish.phastCons8way.bgr
[zhangqf7@bnode02 danRer7]$ ll
total 8.1G
-rw-rw----+ 1 zhangqf7 zhangqf 5.5G Aug 27 21:59 fish.phastCons8way.bgr
-rw-rw----+ 1 zhangqf7 zhangqf 757M Jan 20 2011 fish.phastCons8way.bw
[zhangqf7@loginview02 conservation]$ wl danRer7/fish.phastCons8way.bgr
201830335 danRer7/fish.phastCons8way.bgr
# the converted .bedgraph file has far fewer lines
[zhangqf7@loginview02 conservation]$ wl danRer10/fish.phastCons8way.bw.sorted.bgr
3213494 danRer10/fish.phastCons8way.bw.sorted.bgr
Problem: what if the number of features (n) is larger than the number of samples (m), so that the matrix X is not full rank? The previous method fails because computing \((X^TX)^{-1}\) breaks down.
Other shrinkage methods: lasso, LAR, PCA regression, subset selection
The lasso constraint is \(\sum_{k=1}^{n}\|w_k\| \leq \lambda\), i.e. an absolute-value constraint. When \(\lambda\) is small enough, many weights are forced to exactly 0, which reduces the number of features and makes the model easier to interpret, but it also increases the computational cost.
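For reference, ridge regression uses the squared constraint \(\sum_{k=1}^{n}w_k^2 \leq \lambda\) instead and has the closed-form solution \(\hat{w}=(X^TX+\lambda I)^{-1}X^Ty\), which is what the ridgeRegres function below computes; the lasso constraint has no closed-form solution, which is one reason the simpler forward stagewise regression below is often used as an approximation.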
Standard regression (solved directly in closed form):
# Note: check whether the matrix is invertible
def standRegres(xArr,yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    xTx = xMat.T*xMat
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
Standard regression (statsmodels OLS and sklearn):
import statsmodels.api as sm
X=df[df.columns].values
y=target['MEDV'].values
#add constant
X=sm.add_constant(X)
# build model
model=sm.OLS(y,X).fit()
prediction=model.predict(X)
print(model.summary())
from sklearn import linear_model
lm = linear_model.LinearRegression()
model = lm.fit(X,y)
y_pred = lm.predict(X)
lm.score(X,y)
lm.coef_
lm.intercept_
evaluation(y,y_pred)
Locally weighted linear regression (LWLR):
def lwlr(testPoint,xArr,yArr,k=1.0):
    xMat = mat(xArr); yMat = mat(yArr).T
    m = shape(xMat)[0]
    weights = mat(eye((m)))
    for j in range(m): # next 2 lines create weights matrix
        diffMat = testPoint - xMat[j,:]
        weights[j,j] = exp(diffMat*diffMat.T/(-2.0*k**2))
    xTx = xMat.T * (weights * xMat)
    if linalg.det(xTx) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))
    return testPoint * ws
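# Driver sketch (the book uses a similar helper called lwlrTest): lwlr() fits one
# weighted regression per query point, so evaluate it point by point.
def lwlrTest(testArr, xArr, yArr, k=1.0):
    m = shape(testArr)[0]
    yHat = zeros(m)
    for i in range(m):
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)  # smaller k gives a more local fit
    return yHat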
Ridge regression (before applying shrinkage, standardize the features):
def ridgeRegres(xMat,yMat,lam=0.2):
    xTx = xMat.T*xMat
    denom = xTx + eye(shape(xMat)[1])*lam
    if linalg.det(denom) == 0.0:
        print("This matrix is singular, cannot do inverse")
        return
    ws = denom.I * (xMat.T*yMat)
    return ws

def ridgeTest(xArr,yArr):
    xMat = mat(xArr); yMat = mat(yArr).T
    yMean = mean(yMat,0)
    yMat = yMat - yMean # to eliminate X0 take mean off of Y
    # regularize X's
    xMeans = mean(xMat,0) # calc mean then subtract it off
    xVar = var(xMat,0)    # calc variance of Xi then divide by it
    xMat = (xMat - xMeans)/xVar
    numTestPts = 30
    wMat = zeros((numTestPts,shape(xMat)[1]))
    for i in range(numTestPts):
        ws = ridgeRegres(xMat,yMat,exp(i-10))
        wMat[i,:] = ws.T
    return wMat
Ridge regression (sklearn):
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=1, solver="cholesky")
ridge_reg.fit(X, y)
y_pred=ridge_reg.predict(X)
evaluation(y,y_pred,index_name='ridge_reg')
Lasso regression (sklearn):
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1)
lasso_reg.fit(X, y)
y_pred=lasso_reg.predict(X)
evaluation(y,y_pred,index_name='lasso_reg')
Forward stagewise regression:
def stageWise(xArr,yArr,eps=0.01,numIt=100):
    xMat = mat(xArr); yMat = mat(yArr).T
    yMean = mean(yMat,0)
    yMat = yMat - yMean # can also regularize ys but will get smaller coef
    xMat = regularize(xMat)
    m,n = shape(xMat)
    returnMat = zeros((numIt,n)) # record the coefficients at every iteration
    ws = zeros((n,1)); wsTest = ws.copy(); wsMax = ws.copy()
    for i in range(numIt):
        print(ws.T)
        lowestError = inf
        for j in range(n):
            for sign in [-1,1]:
                wsTest = ws.copy()
                wsTest[j] += eps*sign
                yTest = xMat*wsTest
                rssE = rssError(yMat.A,yTest.A)
                if rssE < lowestError:
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i,:] = ws.T
    return returnMat
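# stageWise() relies on two helpers that are not shown above; a sketch consistent
# with how they are used:
def rssError(yArr, yHatArr):
    # residual sum of squares between true and predicted values (plain ndarrays)
    return ((yArr - yHatArr)**2).sum()

def regularize(xMat):
    # standardize each column to zero mean and unit variance
    inMat = xMat.copy()
    inMeans = mean(inMat, 0)
    inVar = var(inMat, 0)
    return (inMat - inMeans) / inVar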
Polynomial regression (sklearn):
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree = 4)
X_Poly = poly_reg.fit_transform(X)
lin_reg_2 =linear_model.LinearRegression()
lin_reg_2.fit(X_Poly, y)
y_pred=lin_reg_2.predict(poly_reg.fit_transform(X))
evaluation(y,y_pred,index_name=['poly_reg'])
Elastic net regression (sklearn):
enet_reg = linear_model.ElasticNet(l1_ratio=0.7)
enet_reg.fit(X,y)
y_pred=enet_reg.predict(X)
evaluation(y,y_pred,index_name='enet_reg ')
Problem: several friends each predict the stock trend, giving \(g_1, g_2, ..., g_t\); how do we choose (how do we determine \(g_t(x)\))?
Options:
No. | Approach | Category |
---|---|---|
1 | Pick the most trusted friend, the one whose stock predictions have been strongest | validation, the model with the fewest mistakes |
2 | They are all quite good: take all T friends into account and vote | uniformly |
3 | Their skill levels differ: take all T friends into account and vote, but with different weights | non-uniformly |
4 | Their skill levels differ: take all T friends into account and vote, but the weights are not fixed | conditionally |
Each of the options above corresponds to a type of model:
Example: suppose points in the plane need to be classified; with only a single horizontal or vertical line the best classification can never be reached. Using collective wisdom, i.e. a combination of several lines such as the polyline in the figure (one horizontal line plus two vertical lines), the points can be separated perfectly.
Why does it work better:
feature transform | regularization |
---|---|
Question: common experience says that mixing good and bad things usually gives something better than the worst but worse than the best. Why, then, can ensemble learning beat the best single learner?
Example:
Theoretical analysis:
Error-ambiguity decomposition:
Ambiguity: the disagreement between models, or between a single model and the ensemble
Ambiguity of the ensemble: \(\overline{A}(h\|x)=\sum_{i=1}^Tw_iA(h_i\|x)=\sum_{i=1}^Tw_i(h_i(x)-H(x))^2\)
Error of the ensemble: \(E(H\|x)=(f(x)-H(x))^2\)
There are different ways to increase the diversity of the learners:
Perturbing algorithm parameters: use different parameter settings for the learning algorithm
In a regression problem, if we compute the gap between a single model and the ground truth and rearrange, it can be expressed as below: the (averaged) gap of the individual models equals the gap of the ensemble plus a non-negative term, so the ensemble's gap is smaller:
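Written out (the standard error-ambiguity identity, given here since the original figure is not reproduced): \(\sum_{i=1}^T w_i E(h_i\|x) = E(H\|x) + \overline{A}(h\|x)\), where \(E(h_i\|x)=(f(x)-h_i(x))^2\); since the ambiguity term is non-negative, the ensemble error \(E(H\|x)\) is never larger than the weighted average error of the individual models.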
Ensemble method: also called a meta-algorithm; it combines different classifiers.
Weighted voting: similar to a weighted average, each learner has its own weight and the prediction is the label that receives the most votes; if several labels tie for the highest vote, one of them is chosen at random.
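In symbols (standard form): \(H(x)=\arg\max_{y}\sum_{i=1}^{T}w_i\,\mathbb{I}(h_i(x)=y)\), with \(w_i \geq 0\) and \(\sum_{i=1}^{T}w_i=1\).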
Round n of training: the model obtained at this point is a weighted linear combination of the weak classifiers trained so far. When the classification error rate of the combined model reaches the required threshold (usually 0), training stops and this classifier is used. For example, a model that meets the requirement after 3 rounds is \(G(x)=sign[f_3(x)]=sign[\alpha_1 * G_1(x) + \alpha_2 * G_2(x) + \alpha_3 * G_3(x)]\)
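For reference, the standard AdaBoost formulas behind this: the coefficient of round m is \(\alpha_m=\frac{1}{2}\ln\frac{1-\epsilon_m}{\epsilon_m}\), where \(\epsilon_m\) is the weighted error rate of \(G_m\), and the sample weights are then updated as \(D_{m+1}(i)\propto D_m(i)\,e^{-\alpha_m y_i G_m(x_i)}\) and renormalized.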
Example of building decision stumps: given a set of samples (two classes) with two features, how do we separate them?
- `stumpClassify`: classifies the samples on a given feature using a threshold, e.g. samples above the threshold are set to -1 (depending on the comparison operator `threshIneq`)
- `buildStump`: given the samples, labels and weight vector D, builds the best decision stump. It goes through every feature, tries a series of thresholds across that feature's range, and uses D to weight the errors; the goal is to find the feature, threshold and comparison operator (greater than or less than; why does this matter?) that minimize the weighted classification error.
def stumpClassify(dataMatrix,dimen,threshVal,threshIneq): #just classify the data
    retArray = ones((shape(dataMatrix)[0],1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr,classLabels,D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m,n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
    minError = inf # init error sum, to +infinity
    for i in range(n): # loop over all dimensions
        rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max()
        stepSize = (rangeMax-rangeMin)/numSteps
        for j in range(-1,int(numSteps)+1): # loop over all range in current dimension
            for inequal in ['lt', 'gt']: # go over less than and greater than
                threshVal = (rangeMin + float(j) * stepSize)
                predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal) # call stump classify with i, j, lessThan
                errArr = mat(ones((m,1)))
                errArr[predictedVals == labelMat] = 0
                weightedError = D.T*errArr # calc total error multiplied by D
                #print "split: dim %d, thresh %.2f, thresh ineqal: %s, the weighted error is %.3f" % (i, threshVal, inequal, weightedError)
                if weightedError < minError:
                    minError = weightedError
                    bestClasEst = predictedVals.copy()
                    bestStump['dim'] = i
                    bestStump['thresh'] = threshVal
                    bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst
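# A sketch of the AdaBoost training loop that drives buildStump, along the lines of
# the book's adaBoostTrainDS; treat it as an illustration rather than the exact original.
def adaBoostTrainDS(dataArr, classLabels, numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m, 1)) / m)             # start with equal sample weights
    aggClassEst = mat(zeros((m, 1)))      # running weighted vote
    for i in range(numIt):
        bestStump, error, classEst = buildStump(dataArr, classLabels, D)
        alpha = float(0.5 * log((1.0 - error) / max(error, 1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        expon = multiply(-1 * alpha * mat(classLabels).T, classEst)
        D = multiply(D, exp(expon))       # increase weights of misclassified samples
        D = D / D.sum()
        aggClassEst += alpha * classEst
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T, ones((m, 1)))
        if aggErrors.sum() / m == 0.0:    # stop once the combined error reaches 0
            break
    return weakClassArr, aggClassEst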
sklearn gives a two-class example: a data set is generated as a mixture of two Gaussian distributions, so the samples are not linearly separable; using AdaBoost with decision-tree base classifiers they can be separated well:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_gaussian_quantiles
# Construct dataset
X1, y1 = make_gaussian_quantiles(cov=2.,
n_samples=200, n_features=2,
n_classes=2, random_state=1)
X2, y2 = make_gaussian_quantiles(mean=(3, 3), cov=1.5,
n_samples=300, n_features=2,
n_classes=2, random_state=1)
X = np.concatenate((X1, X2))
y = np.concatenate((y1, - y2 + 1))
# Create and fit an AdaBoosted decision tree
bdt = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
algorithm="SAMME",
n_estimators=200)
bdt.fit(X, y)
plot_colors = "br"
plot_step = 0.02
class_names = "AB"
plt.figure(figsize=(10, 5))
# Plot the decision boundaries
plt.subplot(121)
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
np.arange(y_min, y_max, plot_step))
Z = bdt.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
cs = plt.contourf(xx, yy, Z, cmap=plt.cm.Paired)
plt.axis("tight")
# Plot the training points
for i, n, c in zip(range(2), class_names, plot_colors):
idx = np.where(y == i)
plt.scatter(X[idx, 0], X[idx, 1],
c=c, cmap=plt.cm.Paired,
s=20, edgecolor='k',
label="Class %s" % n)
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.legend(loc='upper right')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Decision Boundary')
### direct call from clustermap
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.cluster import hierarchy
from scipy.spatial import distance
df = pd.read_csv(txt, header=0, sep='\t', index_col=0)
fig, ax = plt.subplots(figsize=(25,40))
sns.clustermap(df, standard_scale=False, row_cluster=True, col_cluster=True, figsize=(25, 40), yticklabels=False)
savefn = txt.replace('.txt', '.cluster.png')
plt.savefig(savefn)
plt.close()
### calc linkage in hierarchy
df_array = np.asarray(df)
row_linkage = hierarchy.linkage(distance.pdist(df), method='average')
col_linkage = hierarchy.linkage(distance.pdist(df.T), method='average')
fig, ax = plt.subplots(figsize=(25,40))
sns.clustermap(df, row_linkage=row_linkage, col_linkage=col_linkage, standard_scale=False, row_cluster=True, col_cluster=True, figsize=(25, 40), yticklabels=False, method="average")
savefn = txt.replace('.txt', '.cluster.png')
plt.savefig(savefn)
plt.close()
`hierarchy.fcluster` can be used to extract clusters based on a maximum depth or on the number of clusters, as illustrated here:
### max depth
from scipy.cluster.hierarchy import fcluster
max_d = 50
clusters = fcluster(Z, max_d, criterion='distance')
clusters
array([2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
### user defined cluster number
k=2
fcluster(Z, k, criterion='maxclust')
Assign cluster id and plot (stackoverflow):
fcluster = hierarchy.fcluster(row_linkage, 10, criterion='maxclust')
lut = dict(zip(set(fcluster), sns.hls_palette(len(set(fcluster)), l=0.5, s=0.8)))
row_colors = pd.DataFrame(fcluster)[0].map(lut)
sns.clustermap(df, row_linkage=row_linkage, col_linkage=col_linkage,
standard_scale=False, row_cluster=True, col_cluster=False,
figsize=(50, 80), yticklabels=False, method="average",
cmap=cmap, row_colors=[row_colors])
In recent versions of pytorch, a running program may by default use all cores of one CPU, which keeps other users from working normally. You can set OMP_NUM_THREADS=1 to control how many cores are used at run time. Below are CPU-monitor screenshots with and without the setting:
Without the setting:
With the setting:
As the screenshots show, with the setting only 1 core is used and the CPU usage stays around 100%; without it the CPU usage reaches 3600%, i.e. 36 cores are used by default. The exact command:
# CUDA_VISIBLE_DEVICES: select which GPU to use
# OMP_NUM_THREADS: number of CPU threads to use
time CUDA_VISIBLE_DEVICES=0 OMP_NUM_THREADS=1 python $script_dir/main.py
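The thread count can also be capped from inside the script itself; a minimal sketch (assuming a reasonably recent PyTorch):
import torch
torch.set_num_threads(1)  # limit the intra-op CPU thread pool for this process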
See here (add a `flatten_parameters()` call before running the LSTM):
# before
out1, _ = self.lstm(x, (h0, c0))
# after
self.lstm.flatten_parameters()
out1, _ = self.lstm(x, (h0, c0))
In an LSTM model I built recently, training still takes a long time even with a GPU. The model is not large (one conv layer + one resblock + one LSTM: sequence length 100, bidirectional, two layers, hidden size 128), yet a run takes 10+ hours, with each epoch taking close to 3 minutes. There are some articles on why LSTMs train slowly (for example the Zhihu post "rnn为什么训练速度慢") that are worth reading; the main points are:
There is also a recent SRU module that trains RNNs at CNN-like speed; it has a fairly high star count on github and is worth a look.
for feature in log_data.keys():
    Q1 = np.percentile(log_data[feature], 25)
    Q3 = np.percentile(log_data[feature], 75)
    step = 1.5 * (Q3 - Q1)
    print("Outliers of feature '{}':".format(feature))
    display(log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))])

outliers = [95, 338, 86, 75, 161, 183, 154] # indices of the outliers to drop
good_data = log_data.drop(log_data.index[outliers]).reset_index(drop = True) # drop the selected outliers
See here: Feature/Variable importance after a PCA analysis. Below is a complete function that reports the most important feature of each principal component together with the explained variance ratio:
def PCA_df(df, n_components=6):
    from sklearn.decomposition import PCA
    pca = PCA(n_components=n_components)
    pca.fit(df)
    df_pc = pca.transform(df)

    # number of components
    n_pcs = pca.components_.shape[0]

    # get the index of the most important feature on EACH component
    # LIST COMPREHENSION HERE
    most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]
    # print('feature index of importance from 1st to last', most_important)

    initial_feature_names = df.columns
    # get the names
    most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]

    # LIST COMPREHENSION HERE AGAIN
    dic = {'PC{}'.format(i): most_important_names[i] for i in range(n_pcs)}

    # build the dataframe
    feature_importance_df = pd.DataFrame(dic.items())
    feature_importance_df.columns = ['PC', 'Names']
    feature_importance_df['Column_index'] = most_important
    feature_importance_df['Explained_variance_ratio'] = pca.explained_variance_ratio_
    feature_importance_df['Singular_values'] = pca.singular_values_

    display(feature_importance_df)
    return feature_importance_df
See here (the old import no longer works):
import numpy as np
import sklearn
from sklearn import preprocessing
from sklearn.preprocessing import Imputer
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: cannot import name 'Imputer' from 'sklearn.preprocessing' (E:\python\lib\site-packages\sklearn\preprocessing\__init__.py)
# now
# https://scikit-learn.org/stable/modules/impute.html#impute
from sklearn.impute import SimpleImputer
imp = SimpleImputer()
imp.fit([[1, 2],
[np.nan, 3],
[7, 6]])
imp.transform([[np.nan, 2],
[6, np.nan]])
See here:
X_train, X_test, Y_train, Y_test = model_selection.train_test_split(X, Y, test_size=test_size, random_state=seed)
# Fit the model on training set
model = LogisticRegression()
model.fit(X_train, Y_train)
# save the model to disk
filename = 'finalized_model.sav'
pickle.dump(model, open(filename, 'wb'))
# some time later...
# load the model from disk
loaded_model = pickle.load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
print(result)
Support Vector Machines: supervised learning models, with associated learning algorithms, that analyze data for classification and regression analysis.
Each data point is viewed as a p-dimensional vector, and we want to know whether these points can be separated with a (p-1)-dimensional hyperplane; this is called a linear classifier. Many hyperplanes might separate the data; a reasonable choice for the best hyperplane is the one that separates the two classes with the largest margin. Using SVM to predict tumor type:
#Import scikit-learn dataset library
from sklearn import datasets
#Load dataset
cancer = datasets.load_breast_cancer()
# print the names of the features
print("Features: ", cancer.feature_names)
# print the label type of cancer('malignant' 'benign')
print("Labels: ", cancer.target_names)
# Import train_test_split function
from sklearn.model_selection import train_test_split
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.3,random_state=109) # 70% training and 30% test
#Import svm model
from sklearn import svm
#Create a svm Classifier
clf = svm.SVC(kernel='linear') # Linear Kernel
#Train the model using the training sets
clf.fit(X_train, y_train)
#Predict the response for test dataset
y_pred = clf.predict(X_test)
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
Compute the gap between two numeric ranges (adapted from a Q&A answer):
def solve(r1, r2):
    # sort the two ranges such that the range with smaller first element
    # is assigned to x and the bigger one is assigned to y
    x, y = sorted((r1, r2))
    # now if x[1] lies between x[0] and y[0] (x[1] != y[0] but can be equal to x[0])
    # then the ranges are not overlapping and return the difference of y[0] and x[1]
    # otherwise return 0
    if x[0] <= x[1] < y[0] and all(y[0] <= y[1] for y in (r1, r2)):
        return y[0] - x[1]
    return 0
...
>>> solve([0,10],[12,20])
2
>>> solve([5,10],[1,5])
0
>>> solve([5,10],[1,4])
1
# the specific order
sorter = ['a', 'c', 'b']
df['column'] = df['column'].astype("category")
df['column'].cat.set_categories(sorter, inplace=True)
df.sort_values(["column"], inplace=True)
A free Python/R notebook can also be created online at https://rdrr.io/.
t = pd.DataFrame({1:[1,2,3], 2:[3,4,5], 3:[6,7,8]})
t
1 2 3
0 1 3 6
1 2 4 7
2 3 5 8
# by row sum
t.div(t.sum(axis=1), axis=0)
1 2 3
0 0.100000 0.300000 0.600000
1 0.153846 0.307692 0.538462
2 0.187500 0.312500 0.500000
# by column sum
t.div(t.sum(axis=0), axis=1)
1 2 3
0 0.166667 0.250000 0.285714
1 0.333333 0.333333 0.333333
2 0.500000 0.416667 0.380952
def dfs_tabs(df_list, sheet_list, file_name):
    writer = pd.ExcelWriter(file_name, engine='xlsxwriter')
    for dataframe, sheet in zip(df_list, sheet_list):
        dataframe.to_excel(writer, sheet_name=sheet, startrow=0, startcol=0, index=False)
    writer.save()
As discussed here:
df = pd.DataFrame({'favcount':[1,2,3], 'sn':['a','b','c']})
print (df)
favcount sn
0 1 a
1 2 b
2 3 c
print (df.favcount.idxmax())
2
print (df.loc[df.favcount.idxmax()])
favcount 3
sn c
Name: 2, dtype: object
print (df.loc[df.favcount.idxmax(), 'sn'])
c
This mainly follows the article Changing the sans-serif font to Helvetica. The converted font file is available here for download.
# find the Helvetica font on a mac
$ ls /System/Library/Fonts/Helvetica.ttc
# copy it to another location
$ cp /System/Library/Fonts/Helvetica.ttc ~/Desktop
# convert it to a .ttf file with an online tool
# here: https://www.files-conversion.com/font/ttc
# locate the font config file of the python matplotlib package
$ python -c 'import matplotlib ; print(matplotlib.matplotlib_fname())'
/Users/gongjing/usr/anaconda2/lib/python2.7/site-packages/matplotlib/mpl-data/matplotlibrc
# put the ttf file into the fonts/ttf directory under the path above
$ cp Helvetica.ttf /Users/gongjing/usr/anaconda2/lib/python2.7/site-packages/matplotlib/mpl-data/fonts/ttf
# edit the matplotlibrc file
#font.sans-serif : DejaVu Sans, Bitstream Vera Sans, Computer Modern Sans Serif, Lucida Grande, Verdana, Geneva, Helvetica, Lucid, Arial, Avant Garde, sans-serif
=> font.sans-serif : Helvetica, DejaVu Sans, Bitstream Vera Sans, Computer Modern Sans Serif, Lucida Grande, Verdana, Geneva, Lucid, Arial, Avant Garde, sans-serif
# restart jupyter
The above changes the font globally; you can also specify the font explicitly in code, see here:
# explicitly use a certain font in this script
import matplotlib.pyplot as plt
plt.rcParams["font.family"] = "Helvetica"
# use different fonts for different parts (title, ticks, etc.)
csfont = {'fontname':'Comic Sans MS'}
hfont = {'fontname':'Helvetica'}
plt.title('title',**csfont)
plt.xlabel('xlabel', **hfont)
plt.show()
To use Chinese labels in a plot, some setup is needed; this mainly follows here:
- download the SimHei (黑体) font file (.ttf) and put it under the matplotlib package directory matplotlib/mpl-data/fonts/ttf; `pip show matplotlib` shows where the package is installed
- edit the config file matplotlibrc, usually under matplotlib/mpl-data/
- remove the cached font files under /Users/gongjing/.matplotlib so the new font is picked up
- set plt.rcParams["font.family"] = "SimHei" in the script
You can use numpy's nan_to_num function: numpy.nan_to_num(x, copy=True)
x = np.array([np.inf, -np.inf, np.nan, -128, 128])
# with the default copy=True the original array is not modified
np.nan_to_num(x)
array([ 1.79769313e+308, -1.79769313e+308, 0.00000000e+000,
-1.28000000e+002, 1.28000000e+002])
# with copy=False the values of the original array are replaced in place
np.nan_to_num(x, copy=False)
As discussed here:
import os
def check_dir_or_make(d):
    if not os.path.exists(d):
        os.makedirs(d)
As discussed here:
df = pd.DataFrame({'Name': ['Steve_Smith', 'Joe_Nadal',
'Roger_Federer'],
'Age':[32,34,36]})
# Age Name
# 0 32 Steve_Smith
# 1 34 Joe_Nadal
# 2 36 Roger_Federer
df[['First','Last']] = df.Name.str.split("_",expand=True,)
# expand must be set to True, otherwise it errors saying the original df has no "First"/"Last" columns
# Age Name First Last
# 0 32 Steve_Smith Steve Smith
# 1 34 Joe_Nadal Joe Nadal
# 2 36 Roger_Federer Roger Federer
Using `cut` on a `df` to bin the values and get the corresponding bin for each row:
# split the data into 10 bins
bins = 10
df = pd.DataFrame.from_dict({'value':[i/10 for i in range(10+1)]})
df['bins'] = pd.cut(df['value'], bins=bins)
df
value bins
0 0.0 (-0.001, 0.1]
1 0.1 (-0.001, 0.1]
2 0.2 (0.1, 0.2]
3 0.3 (0.2, 0.3]
4 0.4 (0.3, 0.4]
5 0.5 (0.4, 0.5]
6 0.6 (0.5, 0.6]
7 0.7 (0.6, 0.7]
8 0.8 (0.7, 0.8]
9 0.9 (0.8, 0.9]
10 1.0 (0.9, 1.0]
# the index of value_counts() gives the unique bins
# print them: each i here is an Interval object
for i in list(df['bins'].value_counts().index):
print(i)
(-0.001, 0.1]
(0.9, 1.0]
(0.8, 0.9]
(0.7, 0.8]
(0.6, 0.7]
(0.5, 0.6]
(0.4, 0.5]
(0.3, 0.4]
(0.2, 0.3]
(0.1, 0.2]
# an Interval object does not support indexing
i[0]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-46-3aa51af8ff05> in <module>()
----> 1 i[0]
TypeError: 'pandas._libs.interval.Interval' object does not support indexing
# an Interval object has specific attributes for accessing its endpoints, etc.
# closed, left, right, closed_left, closed_right, mid, open_left, open_right
i.left
0.1
i.right
0.2
There are several mirrors inside China; using them speeds up downloads a lot, see here.
Temporary (specify the index at install time):
pip install scrapy -i https://pypi.tuna.tsinghua.edu.cn/simple
Permanent (write the index into the config file ~/.pip/pip.conf):
[global]
index-url = https://pypi.tuna.tsinghua.edu.cn/simple
See Efficient method to calculate the rank vector of a list in Python:
import scipy.stats as ss
ss.rankdata([3, 1, 4, 15, 92])
# array([ 2., 1., 3., 4., 5.])
ss.rankdata([1, 2, 3, 3, 3, 4, 5])
# array([ 1., 2., 4., 4., 4., 6., 7.])
See here:
percent_missing = df.isnull().sum() * 100 / len(df)
missing_value_df = pd.DataFrame({'column_name': df.columns,
'percent_missing': percent_missing})
import time
start = time.perf_counter()   # time.clock() was removed in Python 3.8; use perf_counter instead
# the code block to time goes here
end = time.perf_counter()
print('Running time: %s Seconds'%(end-start))
See here:
In [326]:
df = pd.DataFrame({'id':['a','a','b','c','c'], 'words':['asd','rtr','s','rrtttt','dsfd']})
df
Out[326]:
id words
0 a asd
1 a rtr
2 b s
3 c rrtttt
4 c dsfd
In [327]:
df.groupby('id')['words'].apply(','.join)
Out[327]:
id
a asd,rtr
b s
c rrtttt,dsfd
Name: words, dtype: object
# note: the frame has two columns, so after grouping by id only words is left
# groupby gives a grouped DataFrame
# groupby()[col] selects that column, but as a Series, not directly as a list
df.groupby('photo_id')['like_flag'].apply(lambda x: np.cumsum(list(x))).to_dict()
See here:
import numpy as np
a = [4,6,12]
np.cumsum(a)
#array([4, 10, 22])
See here:
num
0 1
1 6
2 4
3 5
4 2
input = 3
# this takes the 2 closest values; adjust the index (e.g. [:1]) to take only 1, etc.
df.iloc[(df['num']-input).abs().argsort()[:2]]
num
2 4
4 2
importlib
Set up the directory of function files:
gongjing@bjzyx-c451:~/gj_py_func $ pwd
/home/gongjing/gj_py_func
gongjing@bjzyx-c451:~/gj_py_func $ lst
.
|-list.py
|-file.py
|-dataframe.py
|-init.py
|-__pycache__
| |-list.cpython-37.pyc
| |-file.cpython-37.pyc
| |-dataframe.cpython-37.pyc
| |-load_packages.cpython-37.pyc
|-.ipynb_checkpoints
| |-file-checkpoint.py
| |-dataframe-checkpoint.py
| |-list-checkpoint.py
Usage:
import importlib, sys
if '/home/gongjing/' not in sys.path: sys.path.append('/home/gongjing/')
func_df = importlib.import_module('.dataframe', package='gj_py_func')
func_file = importlib.import_module('.file', package='gj_py_func')
func_ls = importlib.import_module('.list', package='gj_py_func')
importlib.reload(func_df)
importlib.reload(func_file)
importlib.reload(func_ls)
# inspect the module to see which functions it contains
help(func_df)
Help on module gj_py_func.dataframe in gj_py_func:
NAME
gj_py_func.dataframe
FUNCTIONS
df_col_missing_pct(df)
df_col_sum(df)
df_norm_by_colsum(df)
df_norm_by_rowsum(df)
df_row_sum(df)
load_data(fn, col_ls=None)
FILE
/home/gongjing/gj_py_func/dataframe.py
See here:
df.fillna({1:0}, inplace=True)
df[1].fillna(0, inplace=True)
See here:
if item:
    JD = item.group()
    csvwriter.writerow(JD)
    # J,D,",", ,C,o,l,u,m,b,i,a, ,L,a,w, ,S,c,h,o,o,l,....
    # one string per row
    csvwriter.writerow([JD])
Run a `.pythonrc` file automatically when the python console starts; see here:
1. Write the `.pythonrc` file
print("how_use_pythonstartup")
2. Set the PYTHONSTARTUP variable
export PYTHONSTARTUP=~/.pythonrc
3. Open a new console
Python 2.7.6 (default, Nov 23 2017, 15:49:48)
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
how_use_pythonstartup
>>>
How do we generate candidate feature subsets from the evaluation results? [subset search]
How do we evaluate how good a candidate feature subset is? [subset evaluation]
Feature selection = subset search + subset evaluation
Feature selection in sklearn
How is the relevance statistic determined?
Example: for a data set of Boolean features, remove the features in which 0 or 1 occurs in more than 80% of the samples. Note that Boolean features are Bernoulli random variables with variance \(Var[X] = p(1-p)\); with the threshold set at 0.8, the variance threshold is 0.8*(1-0.8) = 0.16. As shown below, the first column is removed:
>>> from sklearn.feature_selection import VarianceThreshold
>>> X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]
>>> sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
>>> sel.fit_transform(X)
array([[0, 1],
[1, 0],
[0, 0],
[1, 1],
[1, 0],
[1, 1]])
Idea: apply univariate statistical tests to pick the best features; this can serve as a preprocessing step before an estimator.
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectKBest
>>> from sklearn.feature_selection import chi2
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
>>> X_new.shape
(150, 2)
This is much more computationally expensive.
>>> from sklearn.svm import LinearSVC
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
>>> model = SelectFromModel(lsvc, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 3)
>>> from sklearn.ensemble import ExtraTreesClassifier
>>> from sklearn.datasets import load_iris
>>> from sklearn.feature_selection import SelectFromModel
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> X.shape
(150, 4)
>>> clf = ExtraTreesClassifier()
>>> clf = clf.fit(X, y)
>>> clf.feature_importances_
array([ 0.04..., 0.05..., 0.4..., 0.4...])
>>> model = SelectFromModel(clf, prefit=True)
>>> X_new = model.transform(X)
>>> X_new.shape
(150, 2)
As a preprocessing step in a pipeline:
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)