Here are some machine-learning knowledge maps, mainly taken from here, along with a map of how to choose among the different machine-learning algorithms; both are worth a look.
A Generative Adversarial Network (GAN) is a type of generative model: it trains by generating samples until the model can no longer tell where a sample came from (real vs. generated).
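To make the adversarial loop concrete, here is a minimal numpy sketch, not any standard implementation: the 1-D data, the linear generator, and the logistic discriminator are all illustrative assumptions. The generator is nudged to produce samples the discriminator scores as real, while the discriminator learns to separate the two.

```python
import numpy as np

# Toy 1-D GAN: generator g(z) = w*z + b, discriminator d(x) = sigmoid(a*x + c).
# The "real" data is N(3, 1); all symbols here are illustrative.
rng = np.random.default_rng(0)
sigmoid = lambda s: 1.0 / (1.0 + np.exp(-s))

w, b = 1.0, 0.0   # generator parameters
a, c = 1.0, 0.0   # discriminator parameters
lr, batch = 0.05, 64

for _ in range(2000):
    real = rng.normal(3.0, 1.0, batch)
    z = rng.normal(0.0, 1.0, batch)
    fake = w * z + b

    # Discriminator step: maximize log d(real) + log(1 - d(fake))
    dr, df = sigmoid(a * real + c), sigmoid(a * fake + c)
    a -= lr * np.mean((dr - 1) * real + df * fake)
    c -= lr * np.mean((dr - 1) + df)

    # Generator step (non-saturating loss): maximize log d(fake)
    df = sigmoid(a * (w * z + b) + c)
    w -= lr * np.mean((df - 1) * a * z)
    b -= lr * np.mean((df - 1) * a)

# After training, the generated mean should have drifted toward the real mean of 3
fake_mean = float(np.mean(w * rng.normal(0.0, 1.0, 10_000) + b))
```

Note that the generator never sees real samples directly; it only receives a gradient through the discriminator, which is the defining trait of the setup.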
This post (Using convolutional neural nets to detect facial keypoints tutorial) is by the second-place finisher of the 2014 Kaggle facial keypoint detection competition. It describes in detail how to build the model and optimize it step by step; several of the ideas are refreshingly direct and worth a look.
Name | Description | Epochs | Train loss | Valid loss
-------|------------------|----------|--------------|--------------
net1 | single hidden | 400 | 0.002244 | 0.003255
net2 | convolutions | 1000 | 0.001079 | 0.001566
net3 | augmentation | 3000 | 0.000678 | 0.001288
net4 | mom + lr adj | 1000 | 0.000496 | 0.001387
net5 | net4 + augment | 2000 | 0.000373 | 0.001184
net6 | net5 + dropout | 3000 | 0.001306 | 0.001121
net7 | net6 + epochs | 10000 | 0.000760 | 0.000787
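A large part of the augmentation step (net3 above) in that post is horizontal flipping. A minimal numpy sketch of the idea (a hypothetical helper: the 96x96 size matches the competition images, and `flip_pairs` stands in for the real list of left/right keypoint index pairs): mirror the image, mirror the keypoint x-coordinates, and swap the left/right keypoint labels.

```python
import numpy as np

def flip_horizontal(img, points, flip_pairs):
    """Mirror a square image and its (x, y) keypoints left-to-right.

    img:        (H, W) array
    points:     (K, 2) array of (x, y) coordinates
    flip_pairs: list of (i, j) keypoint index pairs that trade places
                under mirroring, e.g. left-eye-center <-> right-eye-center
    """
    flipped_img = img[:, ::-1].copy()
    flipped_pts = points.copy()
    flipped_pts[:, 0] = (img.shape[1] - 1) - flipped_pts[:, 0]  # mirror x
    for i, j in flip_pairs:  # left/right labels swap under the mirror
        flipped_pts[[i, j]] = flipped_pts[[j, i]]
    return flipped_img, flipped_pts

# Tiny demo: one bright pixel at (x=10, y=20), tracked by keypoint 0,
# with keypoint 1 as its hypothetical mirror partner
img = np.zeros((96, 96)); img[20, 10] = 1.0
pts = np.array([[10.0, 20.0], [80.0, 20.0]])
fimg, fpts = flip_horizontal(img, pts, flip_pairs=[(0, 1)])
```

Swapping the pairs matters: without it, a flipped "left eye" label would point at what is now the right eye, and the augmented labels would be wrong.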
Autoencoder: runs the input through a compression and decompression process, giving a low-dimensional representation of high-dimensional data while retaining as much of the original information as possible.
1. Framework (three components):
2. Example: here, 2-D data points with a linear distribution are encoded down to one dimension, illustrating the autoencoding process; worth a look.
3. Pros and cons:
Can handle nonlinear problems
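A minimal numpy sketch of such an example (assumed setup: points scattered around a line, one linear encoding layer and one linear decoding layer trained by plain gradient descent; with no nonlinearity this recovers the same 1-D subspace PCA would):

```python
import numpy as np

rng = np.random.default_rng(1)
# 2-D points lying roughly on the line x2 = 2 * x1
x1 = rng.uniform(-1, 1, 200)
X = np.stack([x1, 2 * x1 + 0.05 * rng.normal(size=200)], axis=1)

W_enc = rng.normal(size=(1, 2)) * 0.1  # encoder: 2-D -> 1-D code
W_dec = rng.normal(size=(2, 1)) * 0.1  # decoder: 1-D code -> 2-D

losses = []
for _ in range(1000):
    code = X @ W_enc.T            # encode: (200, 1)
    X_hat = code @ W_dec.T        # decode: (200, 2)
    R = X_hat - X
    losses.append(float(np.mean(R ** 2)))
    # hand-derived gradients of the mean squared reconstruction error
    dX_hat = R / X.shape[0]
    dcode = dX_hat @ W_dec
    W_dec -= 0.1 * (dX_hat.T @ code)
    W_enc -= 0.1 * (dcode.T @ X)
```

After training, the 1-D code carries almost all of the variance of the 2-D input, which is exactly the "compress while keeping the information" idea described above.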
Here is a simple implementation: an autoencoder/decoder built with keras. Its four encoder layers reduce the image features from the original 784 (28x28) dimensions down to 128, 64, 10, and finally 2, and a mirror-image decoder reconstructs the input. The 2-D encoded output alone is then extracted and visualized (a quick check of how good the encoding is).
import numpy as np
np.random.seed(1337) # for reproducibility
from keras.datasets import mnist
from keras.models import Model
from keras.layers import Dense, Input
import matplotlib.pyplot as plt
# download the mnist to the path '~/.keras/datasets/' if it is the first time to be called
# X shape (60,000 28x28), y shape (10,000, )
(x_train, _), (x_test, y_test) = mnist.load_data()
# data pre-processing
x_train = x_train.astype('float32') / 255. - 0.5 # minmax_normalized
x_test = x_test.astype('float32') / 255. - 0.5 # minmax_normalized
x_train = x_train.reshape((x_train.shape[0], -1))
x_test = x_test.reshape((x_test.shape[0], -1))
print(x_train.shape)
print(x_test.shape)
"""
(60000, 784)
(10000, 784)
"""
# in order to plot in a 2D figure
encoding_dim = 2
# define the input layer (our input placeholder)
input_img = Input(shape=(784,))
# encoder layers
encoded = Dense(128, activation='relu')(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(10, activation='relu')(encoded)
encoder_output = Dense(encoding_dim)(encoded)
# decoder layers
decoded = Dense(10, activation='relu')(encoder_output)
decoded = Dense(64, activation='relu')(decoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(784, activation='tanh')(decoded)
# construct the autoencoder model
autoencoder = Model(inputs=input_img, outputs=decoded)
# construct the encoder model for plotting
# the encoder is just part of the full autoencoder, so it needs no separate compile/fit -- it can be extracted directly
encoder = Model(inputs=input_img, outputs=encoder_output)
# compile autoencoder
autoencoder.compile(optimizer='adam', loss='mse')
# training
autoencoder.fit(x_train, x_train,
                epochs=20,
                batch_size=256,
                shuffle=True)
# plotting
# dot color by y_test (label, already known)
encoded_imgs = encoder.predict(x_test)
plt.scatter(encoded_imgs[:, 0], encoded_imgs[:, 1], c=y_test)
plt.colorbar()
plt.show()
Here is a pytorch version. The encoder has four layers and compresses the input down to 3 dimensions, so that the result can be visualized in 3-D space.
import torch
import torch.nn as nn
import torch.utils.data as Data
import torchvision
# hyperparameters
EPOCH = 10
BATCH_SIZE = 64
LR = 0.005
DOWNLOAD_MNIST = True  # set to False once the data has been downloaded
N_TEST_IMG = 5         # number of images to display later for a visual check
# Mnist digits dataset
train_data = torchvision.datasets.MNIST(
root='./mnist/',
train=True, # this is training data
transform=torchvision.transforms.ToTensor(), # Converts a PIL.Image or numpy.ndarray to
# torch.FloatTensor of shape (C x H x W) and normalize in the range [0.0, 1.0]
download=DOWNLOAD_MNIST, # download it if you don't have it
)
# the training loop below iterates over this loader
train_loader = Data.DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
class AutoEncoder(nn.Module):
    def __init__(self):
        super(AutoEncoder, self).__init__()
        # encoder (compression)
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.Tanh(),  # tanh is used as the activation here, not relu
            nn.Linear(128, 64),
            nn.Tanh(),
            nn.Linear(64, 12),
            nn.Tanh(),
            nn.Linear(12, 3),  # compress to 3 features for 3-D visualization
        )
        # decoder (decompression)
        self.decoder = nn.Sequential(
            nn.Linear(3, 12),
            nn.Tanh(),
            nn.Linear(12, 64),
            nn.Tanh(),
            nn.Linear(64, 128),
            nn.Tanh(),
            nn.Linear(128, 28*28),
            nn.Sigmoid(),  # squash the output into (0, 1)
        )

    def forward(self, x):
        encoded = self.encoder(x)
        decoded = self.decoder(encoded)
        return encoded, decoded
autoencoder = AutoEncoder()
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=LR)
loss_func = nn.MSELoss()
for epoch in range(EPOCH):
    for step, (x, b_label) in enumerate(train_loader):
        b_x = x.view(-1, 28*28)  # batch x, shape (batch, 28*28)
        b_y = x.view(-1, 28*28)  # batch y, shape (batch, 28*28)
        encoded, decoded = autoencoder(b_x)
        loss = loss_func(decoded, b_y)  # mean square error
        optimizer.zero_grad()  # clear gradients for this training step
        loss.backward()        # backpropagation, compute gradients
        optimizer.step()       # apply gradients
The following is a TensorFlow (1.x) version of the same autoencoder. The snippet assumes this setup beforehand (the data is loaded with the standard tutorial helper; the path is illustrative):
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=False)
# Parameter
learning_rate = 0.01
training_epochs = 5 # five rounds of training
batch_size = 256
display_step = 1
examples_to_show = 10
# Network Parameters
n_input = 784 # MNIST data input (img shape: 28*28)
# hidden layer settings
n_hidden_1 = 256 # 1st layer num features
n_hidden_2 = 128 # 2nd layer num features
X = tf.placeholder("float", [None, n_input])  # input placeholder, referenced by the model and session below
weights = {
'encoder_h1':tf.Variable(tf.random_normal([n_input,n_hidden_1])),
'encoder_h2': tf.Variable(tf.random_normal([n_hidden_1,n_hidden_2])),
'decoder_h1': tf.Variable(tf.random_normal([n_hidden_2,n_hidden_1])),
'decoder_h2': tf.Variable(tf.random_normal([n_hidden_1, n_input])),
}
biases = {
'encoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
'encoder_b2': tf.Variable(tf.random_normal([n_hidden_2])),
'decoder_b1': tf.Variable(tf.random_normal([n_hidden_1])),
'decoder_b2': tf.Variable(tf.random_normal([n_input])),
}
# Building the encoder
def encoder(x):
    # Encoder hidden layer 1 with sigmoid activation
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['encoder_h1']),
                                   biases['encoder_b1']))
    # Encoder hidden layer 2 with sigmoid activation
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['encoder_h2']),
                                   biases['encoder_b2']))
    return layer_2
# Building the decoder
def decoder(x):
    # Decoder hidden layer 1 with sigmoid activation
    layer_1 = tf.nn.sigmoid(tf.add(tf.matmul(x, weights['decoder_h1']),
                                   biases['decoder_b1']))
    # Decoder hidden layer 2 with sigmoid activation
    layer_2 = tf.nn.sigmoid(tf.add(tf.matmul(layer_1, weights['decoder_h2']),
                                   biases['decoder_b2']))
    return layer_2
# Construct model
encoder_op = encoder(X) # 128 Features
decoder_op = decoder(encoder_op) # 784 Features
# Prediction
y_pred = decoder_op # After
# Targets (Labels) are the input data.
y_true = X # Before
# Define loss and optimizer, minimize the squared error
cost = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
# Launch the graph
with tf.Session() as sess:
    # tf.initialize_all_variables() is about to be deprecated; use this instead:
    sess.run(tf.global_variables_initializer())
    total_batch = int(mnist.train.num_examples/batch_size)
    # Training cycle
    for epoch in range(training_epochs):
        # Loop over all batches
        for i in range(total_batch):
            batch_xs, batch_ys = mnist.train.next_batch(batch_size)  # max(x) = 1, min(x) = 0
            # Run optimization op (backprop) and cost op (to get loss value)
            _, c = sess.run([optimizer, cost], feed_dict={X: batch_xs})
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1),
                  "cost=", "{:.9f}".format(c))
    print("Optimization Finished!")
    # Applying encode and decode over test set
    encode_decode = sess.run(
        y_pred, feed_dict={X: mnist.test.images[:examples_to_show]})
    # Compare original images with their reconstructions
    f, a = plt.subplots(2, 10, figsize=(10, 2))
    for i in range(examples_to_show):
        a[0][i].imshow(np.reshape(mnist.test.images[i], (28, 28)))
        a[1][i].imshow(np.reshape(encode_decode[i], (28, 28)))
    plt.show()
When such a system makes a mistake, either the RNN is at fault or the beam search is — how do we tell which?
Machine translation: the algorithm's output \(\overline{y}\)
Speech recognition: directly outputting the transcript of the audio
The models, as defined:
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBOW(nn.Module):
    def __init__(self, vocab_size, embd_size, context_size, hidden_size):
        super(CBOW, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)
        self.linear1 = nn.Linear(2*context_size*embd_size, hidden_size)
        self.linear2 = nn.Linear(hidden_size, vocab_size)

    def forward(self, inputs):
        embedded = self.embeddings(inputs).view((1, -1))
        hid = F.relu(self.linear1(embedded))
        out = self.linear2(hid)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embd_size):
        super(SkipGram, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embd_size)

    def forward(self, focus, context):
        embed_focus = self.embeddings(focus).view((1, -1))
        embed_ctx = self.embeddings(context).view((1, -1))
        score = torch.mm(embed_focus, torch.t(embed_ctx))
        log_probs = F.logsigmoid(score)
        return log_probs
For the skip-gram and CBOW models, training data is generated from the given text as follows:
# context window size is two
def create_cbow_dataset(text):
    data = []
    for i in range(2, len(text) - 2):
        context = [text[i - 2], text[i - 1],
                   text[i + 1], text[i + 2]]
        target = text[i]
        data.append((context, target))
    return data

def create_skipgram_dataset(text):
    import random
    data = []
    for i in range(2, len(text) - 2):
        data.append((text[i], text[i-2], 1))
        data.append((text[i], text[i-1], 1))
        data.append((text[i], text[i+1], 1))
        data.append((text[i], text[i+2], 1))
        # negative sampling
        for _ in range(4):
            if random.random() < 0.5 or i >= len(text) - 3:
                rand_id = random.randint(0, i-1)
            else:
                rand_id = random.randint(i+3, len(text)-1)
            data.append((text[i], text[rand_id], 0))
    return data
cbow_train = create_cbow_dataset(text)
skipgram_train = create_skipgram_dataset(text)
print('cbow sample', cbow_train[0])
print('skipgram sample', skipgram_train[0])
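For intuition, the CBOW objective can also be sketched in plain numpy (a toy: the corpus, dimensions, and learning rate below are all illustrative, and the softmax cross-entropy gradients are written out by hand instead of using autograd):

```python
import numpy as np

rng = np.random.default_rng(2)
text = ("we are about to study the idea of a computational process "
        "computational processes are abstract beings").split()
vocab = sorted(set(text))
w2i = {w: i for i, w in enumerate(vocab)}
V, D, lr = len(vocab), 8, 0.1

# (context ids, target id) pairs, window of two words on each side
pairs = [([w2i[text[i + k]] for k in (-2, -1, 1, 2)], w2i[text[i]])
         for i in range(2, len(text) - 2)]

W_in = rng.normal(size=(V, D)) * 0.1   # input embedding table
W_out = rng.normal(size=(D, V)) * 0.1  # output projection

epoch_losses = []
for _ in range(100):
    total = 0.0
    for ctx, tgt in pairs:
        h = W_in[ctx].mean(axis=0)               # average the context embeddings
        s = h @ W_out
        p = np.exp(s - s.max()); p /= p.sum()    # softmax over the vocabulary
        total += -np.log(p[tgt])                 # cross-entropy loss
        ds = p.copy(); ds[tgt] -= 1.0            # d loss / d scores
        dh = W_out @ ds
        W_out -= lr * np.outer(h, ds)
        # ufunc.at handles a word repeated inside one context window
        np.subtract.at(W_in, ctx, lr * dh / len(ctx))
    epoch_losses.append(total / len(pairs))
```

The torch `CBOW` above does the same thing with an extra hidden layer; here the "model" is just the two matrices, which keeps the gradient bookkeeping visible.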
There are plenty of tools and ready-made code for crawling content from the web and running sentiment or trend analysis on it. The code here is adapted from gaussic/weibo_wordcloud @github; it scrapes some data and runs word segmentation on the content.
Scraping:
# coding: utf-8
import re
import json
import requests

# scrape a small amount of data via m.weibo.cn, no login required
url_template = "https://m.weibo.cn/api/container/getIndex?type=wb&queryVal={}&containerid=100103type=2%26q%3D{}&page={}"


def clean_text(text):
    """Strip tags and other markup from the text"""
    dr = re.compile(r'(<)[^>]+>', re.S)
    dd = dr.sub('', text)
    dr = re.compile(r'#[^#]+#', re.S)
    dd = dr.sub('', dd)
    dr = re.compile(r'@[^ ]+ ', re.S)
    dd = dr.sub('', dd)
    return dd.strip()


def fetch_data(query_val, page_id):
    """Fetch one page of results for a keyword"""
    resp = requests.get(url_template.format(query_val, query_val, page_id))
    card_group = json.loads(resp.text)['data']['cards'][0]['card_group']
    print('url:', resp.url, ' --- 条数:', len(card_group))
    mblogs = []  # processed weibo posts
    for card in card_group:
        mblog = card['mblog']
        blog = {'mid': mblog['id'],                          # post id
                'text': clean_text(mblog['text']),           # text
                'userid': str(mblog['user']['id']),          # user id
                'username': mblog['user']['screen_name'],    # username
                'reposts_count': mblog['reposts_count'],     # reposts
                'comments_count': mblog['comments_count'],   # comments
                'attitudes_count': mblog['attitudes_count']  # likes
                }
        mblogs.append(blog)
    return mblogs


def remove_duplication(mblogs):
    """Deduplicate posts by their weibo id"""
    mid_set = {mblogs[0]['mid']}
    new_blogs = []
    for blog in mblogs[1:]:
        if blog['mid'] not in mid_set:
            new_blogs.append(blog)
            mid_set.add(blog['mid'])
    return new_blogs


def fetch_pages(query_val, page_num):
    """Fetch multiple pages of results for a keyword"""
    mblogs = []
    for page_id in range(1, page_num + 1):
        try:
            mblogs.extend(fetch_data(query_val, page_id))
        except Exception as e:
            print(e)
    print("去重前:", len(mblogs))
    mblogs = remove_duplication(mblogs)
    print("去重后:", len(mblogs))
    # save to result_<keyword>.json
    fp = open('result_{}.json'.format(query_val), 'w', encoding='utf-8')
    json.dump(mblogs, fp, ensure_ascii=False, indent=4)
    print("已保存至 result_{}.json".format(query_val))


if __name__ == '__main__':
    fetch_pages('颜宁', 50)
Scraping the search results for 颜宁:
% /Users/gongjing/anaconda3/bin/python weibo_search.py
去重前: 454
去重后: 443
已保存至 result_颜宁.json
Segment the text with jieba, then draw a word cloud:
# coding: utf-8
import json
import jieba.analyse
import matplotlib as mpl
# from scipy.misc import imread
from imageio import imread
from wordcloud import WordCloud
# mpl.use('TkAgg')
import matplotlib.pyplot as plt


def keywords(mblogs):
    text = []
    for blog in mblogs:
        keyword = jieba.analyse.extract_tags(blog['text'])
        text.extend(keyword)
    return text


def gen_img(texts, img_file):
    data = ' '.join(text for text in texts)
    image_coloring = imread(img_file)
    wc = WordCloud(
        background_color='white',
        mask=image_coloring,
        font_path='/Library/Fonts/STHeiti Light.ttc'
    )
    wc.generate(data)
    # plt.figure()
    # plt.imshow(wc, interpolation="bilinear")
    # plt.axis("off")
    # plt.show()
    wc.to_file(img_file.split('.')[0] + '_wc.png')


if __name__ == '__main__':
    keyword = '颜宁'
    mblogs = json.loads(open('result_{}.json'.format(keyword), 'r', encoding='utf-8').read())
    print('微博总数:', len(mblogs))
    words = []
    for blog in mblogs:
        words.extend(jieba.analyse.extract_tags(blog['text']))
    print("总词数:", len(words))
    gen_img(words, 'yanning.jpg')
% /Users/gongjing/anaconda3/bin/python weibo_cloud.py
微博总数: 443
Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/1c/bxl334sx7797nfp66vmrphxc0000gn/T/jieba.cache
Loading model cost 1.174 seconds.
Prefix dict has been built succesfully.
总词数: 6862
/Users/gongjing/anaconda3/bin/python weibo_cloud.py 4.07s user 0.98s system 80% cpu 6.305 total
The resulting word cloud does look a lot like 颜宁's day-to-day feed (especially the 朱一龙 entry), but because the background image used here has low contrast, you cannot tell at a glance from the word cloud that it is her: