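sklearn.feature_extraction: extracts features in a format supported by machine learning algorithms from raw data such as text and images.
DictVectorizer: converts features stored as lists of dicts into the numeric arrays that models expect. Categorical (string-valued) features are encoded with one-of-K, also called one-hot, encoding.
>>> measurements = [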
... {'city': 'Dubai', 'temperature': 33.},
... {'city': 'London', 'temperature': 12.},
... {'city': 'San Francisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
# Apply directly to the dicts: categorical features are one-hot encoded, numeric features are kept as they are
>>> vec.fit_transform(measurements).toarray()
array([[ 1.,  0.,  0., 33.],
       [ 0.,  1.,  0., 12.],
       [ 0.,  0.,  1., 18.]])
>>> vec.get_feature_names_out()
array(['city=Dubai', 'city=London', 'city=San Francisco', 'temperature'], ...)
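To make the encoding rule concrete, here is a minimal plain-Python sketch of one-of-K encoding for a list of dicts. It is not scikit-learn's implementation; the helper names `feature_name` and `one_of_k` are made up for illustration, and the output is a dense list of lists rather than a sparse matrix.

```python
# Simplified sketch of one-of-K encoding for dicts; not scikit-learn's code.
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]


def feature_name(key, value):
    # String values become "key=value" indicator features;
    # numeric values keep the plain key as their feature name.
    return f"{key}={value}" if isinstance(value, str) else key


def one_of_k(records):
    # Collect every feature name seen in the data and sort the names,
    # so that each one gets a stable column index.
    names = sorted({feature_name(k, v) for r in records for k, v in r.items()})
    index = {name: i for i, name in enumerate(names)}
    rows = []
    for r in records:
        row = [0.0] * len(names)
        for k, v in r.items():
            # Indicator features store 1.0, numeric features store the value.
            row[index[feature_name(k, v)]] = 1.0 if isinstance(v, str) else float(v)
        rows.append(row)
    return names, rows


names, rows = one_of_k(measurements)
print(names)
for row in rows:
    print(row)
```

Each distinct city gets its own 0/1 column, while temperature stays a single numeric column, which matches the shape of the DictVectorizer result above.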
FeatureHasher
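Goal: compress the original high-dimensional feature vectors into lower-dimensional ones while losing as little of their expressive power as possible.
Hash table: a hash function maps keys to values. Hashing spreads different keys across different buckets, but collisions still occur, that is, distinct keys can be mapped to the same value.
# Bag-of-words model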
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()
corpus=["I come to China to travel",
"This is a car polupar in China",
"I love tea and Apple ",
"The work is to write some papers in science"]
print(vectorizer.fit_transform(corpus))
print(vectorizer.fit_transform(corpus).toarray())
print(vectorizer.get_feature_names_out().tolist())
(0, 16) 1
(0, 3) 1
(0, 15) 2
(0, 4) 1
(1, 5) 1
(1, 9) 1
(1, 2) 1
(1, 6) 1
(1, 14) 1
(1, 3) 1
(2, 1) 1
(2, 0) 1
(2, 12) 1
(2, 7) 1
(3, 10) 1
(3, 8) 1
(3, 11) 1
(3, 18) 1
(3, 17) 1
(3, 13) 1
(3, 5) 1
(3, 6) 1
(3, 15) 1
[[0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 2 1 0 0]
[0 0 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0]
[1 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0]
[0 0 0 0 0 1 1 0 1 0 1 1 0 1 0 1 0 1 1]]
['and', 'apple', 'car', 'china', 'come', 'in', 'is', 'love', 'papers', 'polupar', 'science', 'some', 'tea', 'the', 'this', 'to', 'travel', 'work', 'write']
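The following plain-Python sketch shows roughly how the count matrix above comes about. It assumes CountVectorizer's default behaviour of lowercasing the text and keeping only tokens of two or more word characters, which is why "I" and "a" are missing from the vocabulary. The names `tokenize` and `count_vector` are illustrative, not part of scikit-learn.

```python
import re
from collections import Counter

corpus = ["I come to China to travel",
          "This is a car polupar in China",
          "I love tea and Apple ",
          "The work is to write some papers in science"]

# Assumption: mimic the default tokenization, i.e. lowercase the text and
# keep runs of two or more word characters.
TOKEN = re.compile(r"\b\w\w+\b")


def tokenize(doc):
    return TOKEN.findall(doc.lower())


# The vocabulary is the sorted set of all tokens; position = column index.
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})
column = {tok: i for i, tok in enumerate(vocabulary)}


def count_vector(doc):
    # Count each token and write the count into that token's column.
    counts = Counter(tokenize(doc))
    row = [0] * len(vocabulary)
    for tok, n in counts.items():
        row[column[tok]] = n
    return row


matrix = [count_vector(doc) for doc in corpus]
print(vocabulary)
for row in matrix:
    print(row)
```

The first row has a 2 in the column for "to" because that token occurs twice in the first sentence.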
# Use the hashing trick to reduce the original 19 dimensions to 6
from sklearn.feature_extraction.text import HashingVectorizer
vectorizer2=HashingVectorizer(n_features = 6,norm = None)
print(vectorizer2.fit_transform(corpus))
print(vectorizer2.fit_transform(corpus).toarray())
(0, 1) 2.0
(0, 2) -1.0
(0, 4) 1.0
(0, 5) -1.0
(1, 0) 1.0
(1, 1) 1.0
(1, 2) -1.0
(1, 5) -1.0
(2, 0) 2.0
(2, 5) -2.0
(3, 0) 0.0
(3, 1) 4.0
(3, 2) -1.0
(3, 3) 1.0
(3, 5) -1.0
[[ 0. 2. -1. 0. 1. -1.]
[ 1. 1. -1. 0. 0. -1.]
[ 2. 0. 0. 0. 0. -2.]
[ 0. 4. -1. 1. 0. -1.]]
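The negative entries and the stored 0.0 in the output above come from the signed hashing trick: each token is hashed to a column, and the sign of the hash decides whether its count is added or subtracted. The sketch below illustrates the idea under stated assumptions: it uses MD5 from the standard library as a stand-in hash (scikit-learn uses MurmurHash3), so its column indices will not match the HashingVectorizer output, and the function names are hypothetical.

```python
import hashlib
import re

# Illustrative stand-in: MD5 is used only because it ships with the standard
# library; the resulting columns differ from those of HashingVectorizer.


def token_hash(token):
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Interpret the first 4 bytes as a signed 32-bit integer.
    return int.from_bytes(digest[:4], "little", signed=True)


def hashed_vector(doc, n_features=6):
    vec = [0.0] * n_features
    for token in re.findall(r"\b\w\w+\b", doc.lower()):
        h = token_hash(token)
        # The column is the absolute hash value modulo the table size...
        col = abs(h) % n_features
        # ...and the sign of the hash decides whether to add or subtract,
        # so colliding tokens can cancel instead of always piling up.
        vec[col] += 1.0 if h >= 0 else -1.0
    return vec


doc = "I come to China to travel"
vec = hashed_vector(doc)
print(vec)
```

No vocabulary is stored, so the vector size is fixed in advance, but the mapping cannot be inverted to recover feature names.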
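CountVectorizer: tokenization plus token counting; see the example above. Pay attention to the encoding of the text files: UTF-8 is the usual encoding, and any other encoding must be specified through the encoding parameter.
TfidfTransformer: reweights a count matrix with tf-idf, using the formula \(\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t)\).
TfidfVectorizer: combines CountVectorizer and TfidfTransformer.
>>> from sklearn.feature_extraction.text import TfidfTransformer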
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> transformer
TfidfTransformer(norm=...'l2', smooth_idf=False, sublinear_tf=False,
use_idf=True)
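# An intuitive look at the counts:
# the first term occurs in every document, so it is probably not informative;
# the other two terms occur in fewer than 50% of the documents, so they are probably more representative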
>>> counts = [[3, 0, 1],
... [2, 0, 0],
... [3, 0, 0],
... [4, 0, 0],
... [3, 2, 0],
... [3, 0, 2]]
...
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[0.81940995, 0.        , 0.57320793],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [1.        , 0.        , 0.        ],
       [0.47330339, 0.88089948, 0.        ],
       [0.58149261, 0.        , 0.81355169]])
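The numbers above can be reproduced by hand. The sketch below assumes the settings shown in the transformer's repr: smooth_idf=False, which gives idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row. It uses only the standard library and is a worked calculation, not scikit-learn's code.

```python
import math

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term occurs.
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# With smooth_idf=False: idf(t) = ln(n / df(t)) + 1.
idf = [math.log(n_docs / d) + 1.0 for d in df]

tfidf = []
for row in counts:
    # tf-idf(t, d) = tf(t, d) * idf(t)
    weighted = [tf * w for tf, w in zip(row, idf)]
    # norm='l2': scale each row to unit Euclidean length.
    norm = math.sqrt(sum(x * x for x in weighted))
    tfidf.append([x / norm for x in weighted])

for row in tfidf:
    print([round(x, 8) for x in row])
```

The first term occurs in all six documents, so its idf is ln(1) + 1 = 1 and it receives no extra weight, while the rarer second and third terms are boosted.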
# TfidfVectorizer
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer()
>>> vectorizer.fit_transform(corpus)
...
<4x19 sparse matrix of type '<... 'numpy.float64'>'
    with 23 stored elements in Compressed Sparse ... format>
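In the example below, "words" and "wprds" are meant to be the same word, but one of them is misspelled. Character 2-gram analysis can still extract the features they have in common: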
>>> ngram_vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2))
>>> counts = ngram_vectorizer.fit_transform(['words', 'wprds'])
>>> list(ngram_vectorizer.get_feature_names_out()) == (
... [' w', 'ds', 'or', 'pr', 'rd', 's ', 'wo', 'wp'])
True
>>> counts.toarray().astype(int)
array([[1, 1, 1, 0, 1, 1, 1, 0],
       [1, 1, 0, 1, 1, 1, 0, 1]])
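One way to quantify how much the two spellings share is to compare the two count rows directly. The short sketch below copies the rows from the output above and computes the number of shared bigrams and the cosine similarity. The choice of cosine similarity is an illustration, not something the original example prescribes.

```python
import math

# Rows of counts.toarray() from the example above.
words = [1, 1, 1, 0, 1, 1, 1, 0]
wprds = [1, 1, 0, 1, 1, 1, 0, 1]

# Bigrams present in both spellings.
shared = sum(1 for a, b in zip(words, wprds) if a and b)

# Cosine similarity: dot product divided by the product of the vector lengths.
dot = sum(a * b for a, b in zip(words, wprds))
norm = math.sqrt(sum(a * a for a in words)) * math.sqrt(sum(b * b for b in wprds))
cosine = dot / norm

print(shared, round(cosine, 3))
```

With whole-word tokens the two strings would have nothing in common, whereas the character bigrams still overlap in four of their six features each.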
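The analyzer parameter of CountVectorizer selects how character n-grams are built:
analyzer='char_wb': n-grams are created only from characters inside each word
analyzer='char': n-grams may span across words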
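The difference between the two analyzers is easiest to see on a two-word string. The sketch below is a simplified plain-Python imitation, not scikit-learn's code: it assumes that 'char_wb' pads each word with one space on both sides, it ignores lowercasing and whitespace normalization, and it simply emits nothing for a word whose padded form is shorter than n.

```python
def char_ngrams(text, n):
    # analyzer='char' style: slide over the whole string, spaces included.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def char_wb_ngrams(text, n):
    # analyzer='char_wb' style: pad each word with one space on both sides
    # and slide inside that padded word only.
    grams = []
    for word in text.split():
        padded = " " + word + " "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams


text = "ab cd"
print(char_ngrams(text, 3))     # includes 'b c', which spans both words
print(char_wb_ngrams(text, 3))  # every n-gram stays within one padded word

# The bigram vocabulary of the 'words' / 'wprds' example, rebuilt by hand.
vocab = sorted(set(char_wb_ngrams("words", 2) + char_wb_ngrams("wprds", 2)))
print(vocab)
```

Keeping n-grams inside word boundaries avoids features that depend on which words happen to be adjacent.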
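extract_patches_2d: extracts patches from an image stored as a two-dimensional array, or as a three-dimensional array with color information along the third axis.
reconstruct_from_patches_2d: rebuilds the image from all of its patches.
>>> import numpy as np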
>>> from sklearn.feature_extraction import image
>>> one_image = np.arange(4 * 4 * 3).reshape((4, 4, 3))
>>> one_image[:, :, 0] # R channel of a fake RGB picture
array([[ 0,  3,  6,  9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]])
>>> patches = image.extract_patches_2d(one_image, (2, 2), max_patches=2,
... random_state=0)
>>> patches.shape
(2, 2, 2, 3)
>>> patches[:, :, :, 0]
array([[[ 0,  3],
        [12, 15]],

       [[15, 18],
        [27, 30]]])
>>> patches = image.extract_patches_2d(one_image, (2, 2))
>>> patches.shape
(9, 2, 2, 3)
>>> patches[4, :, :, 0]
array([[15, 18],
[27, 30]])
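The patch count and ordering above can be checked with a small plain-Python sketch on the red channel alone. It assumes patches are enumerated row by row over every valid top-left corner, and that reconstruction averages the pixels where patches overlap. The helper names `extract_patches` and `reconstruct` are illustrative and work on nested lists rather than NumPy arrays.

```python
def extract_patches(img, ph, pw):
    # Slide a ph x pw window over every valid top-left corner, row by row.
    rows, cols = len(img), len(img[0])
    return [
        [row[c:c + pw] for row in img[r:r + ph]]
        for r in range(rows - ph + 1)
        for c in range(cols - pw + 1)
    ]


def reconstruct(patches, rows, cols):
    # Put every patch back at its position and average overlapping pixels.
    ph, pw = len(patches[0]), len(patches[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    hits = [[0] * cols for _ in range(rows)]
    per_row = cols - pw + 1
    for k, patch in enumerate(patches):
        r0, c0 = divmod(k, per_row)
        for i in range(ph):
            for j in range(pw):
                total[r0 + i][c0 + j] += patch[i][j]
                hits[r0 + i][c0 + j] += 1
    return [[total[r][c] / hits[r][c] for c in range(cols)] for r in range(rows)]


# R channel of the fake RGB picture used above.
red = [[0, 3, 6, 9],
       [12, 15, 18, 21],
       [24, 27, 30, 33],
       [36, 39, 42, 45]]

patches = extract_patches(red, 2, 2)
print(len(patches))
print(patches[4])
restored = reconstruct(patches, 4, 4)
print(restored == red)
```

A 4 x 4 image has (4 - 2 + 1) x (4 - 2 + 1) = 9 patches of size 2 x 2, and the patch with index 4 starts at row 1, column 1, which is the centre block shown above.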