Deep Learning: Text and Sequences (Recurrent Neural Networks, RNN)

I. Text data processing
II. Recurrent neural networks
- Basic implementation (Keras)
III. Conclusion

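I. Text data processing

Text is first split into tokens (tokenization), and the tokens are then encoded either with one-hot encoding or with word embeddings.

1. One-hot encoding

High-dimensional, sparse binary vectors
Tokenize -> assign indices -> encode
Each word is associated with a unique integer index, and the integer index i is turned into a binary vector of length N whose i-th element is 1 and whose other elements are all 0.

In effect, the vector encodes the word's position in the vocabulary.

Python implementation: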

import numpy as np

# Word-level one-hot encoding
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

token_index = {}
for sample in samples:
    for word in sample.split():  # tokenize on whitespace
        if word not in token_index:
            token_index[word] = len(token_index) + 1  # assign an index to each new word (index 0 stays unused)

max_length = 10  # only the first max_length words of each sample are encoded

results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)  # look up the word's index
        results[i, j, index] = 1.

Keras implementation:

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

tokenizer = Tokenizer(num_words=1000)  # create a tokenizer that only keeps the most frequent words (num_words=1000)
tokenizer.fit_on_texts(samples)  # build the word index

sequences = tokenizer.texts_to_sequences(samples)  # turn strings into lists of integer indices

one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # binary vector per sample, shape (len(samples), num_words)

word_index = tokenizer.word_index  # recover the word index that was built
print('%s unique tokens.' % len(word_index))

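Key operations: tokenization
tokenizer = Tokenizer(num_words=1000)  # create a tokenizer that only keeps the most frequent words (num_words=1000)
tokenizer.fit_on_texts(samples)  # build the word index
sequences = tokenizer.texts_to_sequences(samples)  # turn strings into lists of integer indices
word_index = tokenizer.word_index  # recover the word index that was built

One-hot encoding:
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')  # binary vector per sample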

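Note: the problem with integer encoding
The integer assigned to each word is arbitrary, so similar words receive unrelated codes, and the feature weights a model learns directly on those integers carry no meaning.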
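
The note above can be made concrete with a small sketch. It uses plain Python and a hypothetical three-word vocabulary (the words and their indices are assumptions chosen for illustration). It shows that integer codes imply distances that only reflect the order of assignment, while one-hot vectors of different words are always orthogonal and therefore say nothing about similarity either:

```python
# Hypothetical vocabulary; the index assignment is arbitrary.
token_index = {'cat': 1, 'dog': 2, 'homework': 3}

def one_hot(index, size):
    """Return a list of length `size` with a 1 at position `index`."""
    return [1 if k == index else 0 for k in range(size)]

def dot(u, v):
    """Dot product of two equally long lists."""
    return sum(a * b for a, b in zip(u, v))

size = max(token_index.values()) + 1
vectors = {word: one_hot(i, size) for word, i in token_index.items()}

# Integer codes suggest 'cat' is closer to 'dog' than to 'homework',
# but only because of the order in which the indices were assigned.
int_gap_cat_dog = abs(token_index['cat'] - token_index['dog'])
int_gap_cat_homework = abs(token_index['cat'] - token_index['homework'])

# One-hot vectors of two different words are always orthogonal,
# so they carry no information about which words are similar.
sim_cat_dog = dot(vectors['cat'], vectors['dog'])
sim_cat_homework = dot(vectors['cat'], vectors['homework'])
```

This is the gap that word embeddings are meant to close: they are learned so that geometric closeness reflects similarity in meaning.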

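2. Word embeddings

Low-dimensional floating-point vectors (dense vectors)
Learned from data
Similar words have similar encodings
Method 1: use an Embedding layer and learn the word embeddings jointly with the main task
Method 2: use pretrained word embeddings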

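(1) Embedding layer

embedding_layers = Embedding(1000, 64)
# The first argument is the number of possible tokens (maximum word index + 1); the second is the embedding dimensionality
# Common word-vector dimensionalities are 256, 512, and 1024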
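
Conceptually, an embedding layer behaves like a lookup table from integer indices to dense vectors. The sketch below illustrates that idea in plain Python; it is not how Keras implements the layer, the weights are random rather than trained, and the token indices in `sequence` are made up:

```python
import random

vocab_size = 1000      # number of possible tokens (maximum word index + 1)
embedding_dim = 64     # size of each word vector

rng = random.Random(0)
# One row of small random weights per token; a real layer would train these.
embedding_matrix = [[rng.uniform(-0.05, 0.05) for _ in range(embedding_dim)]
                    for _ in range(vocab_size)]

def embed(sequence):
    """Map a list of integer token indices to a list of dense vectors."""
    return [embedding_matrix[token] for token in sequence]

sequence = [4, 20, 4, 7]       # hypothetical integer-encoded sentence
embedded = embed(sequence)     # shape: (len(sequence), embedding_dim)
```

The same index always maps to the same row, so repeated words share one vector, and training only has to adjust the rows of the table.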

padded_batch standardizes the text length within each batch
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

The GlobalAveragePooling1D layer returns a fixed-length output vector for each sample by averaging over the sequence dimension

model = keras.Sequential([
  layers.Embedding(encoder.vocab_size, embedding_dim),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'),
  layers.Dense(1)
])
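
To see why the pooling layer yields a fixed-length vector regardless of sequence length, here is a minimal plain-Python sketch of averaging over the time axis. The function name and the sample values are assumptions for illustration, and masking of padded positions is ignored:

```python
def global_average_pooling_1d(batch):
    """Average each sample over its timesteps.

    `batch` is a nested list of shape (batch_size, timesteps, features);
    the result has shape (batch_size, features).
    """
    pooled = []
    for sample in batch:
        timesteps = len(sample)
        features = len(sample[0])
        pooled.append([sum(step[f] for step in sample) / timesteps
                       for f in range(features)])
    return pooled

# Two samples, three timesteps each, two features per timestep.
batch = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0]],
]
pooled = global_average_pooling_1d(batch)
```

Because the time axis is averaged away, the Dense layers that follow always receive a vector whose length equals the embedding dimension.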

(2) Pretrained word embeddings
Commonly used pretrained word-embedding collections include GloVe and word2vec.
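
As a rough sketch of how pretrained vectors are typically prepared, the snippet below parses text in the GloVe layout (one token per line followed by its space-separated vector components) and fills an embedding matrix indexed by word index. The three-dimensional vectors, the in-memory stand-in for the file, and the `word_index` mapping are all made up for illustration; loading the matrix into an Embedding layer is not shown:

```python
import io

# Stand-in for a GloVe-style text file: one token per line followed by its
# vector components. These three-dimensional values are made up.
fake_glove_file = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "cat 0.4 0.5 0.6\n"
    "dog 0.7 0.8 0.9\n"
)

embeddings_index = {}
for line in fake_glove_file:
    word, *coefs = line.split()
    embeddings_index[word] = [float(c) for c in coefs]

# Build a matrix whose row i holds the vector of the word with index i.
word_index = {'cat': 1, 'dog': 2, 'mat': 3}   # hypothetical tokenizer output
embedding_dim = 3
embedding_matrix = [[0.0] * embedding_dim for _ in range(len(word_index) + 1)]
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:   # words without a pretrained vector stay all-zero
        embedding_matrix[i] = vector
```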

3. Common APIs

tf.keras.preprocessing.text.Tokenizer

tf.keras.preprocessing.text.Tokenizer(
    num_words=None,
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True, split=' ', char_level=False, oov_token=None,
    document_count=0, **kwargs
)

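num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words are kept.
filters: a string in which each character is one that will be filtered out of the texts. The default is all punctuation plus tabs and line breaks, minus the ' character.
lower: whether to convert the texts to lowercase.
split: the separator used to split words.
char_level: if True, every character is treated as a token.
oov_token: if given, it is added to word_index and used to replace out-of-vocabulary words during texts_to_sequences calls.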
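
The effect of num_words can be illustrated with a simplified plain-Python sketch. It is not the Keras implementation: the helper names are hypothetical, punctuation filtering and oov_token handling are left out, and ties in frequency are broken by first occurrence. It ranks words by frequency starting at index 1 and keeps only indices below num_words, which is why num_words-1 words survive:

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words):
    """Keep only words whose index is below num_words (num_words - 1 words)."""
    return [[word_index[w] for w in text.lower().split()
             if w in word_index and word_index[w] < num_words]
            for text in texts]

texts = ['the cat sat on the mat', 'the dog ate the homework']
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)
```

With num_words=3 only the two highest-ranked words keep their indices; every other word is dropped from the sequences, although it still appears in word_index.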

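Tokenizer methods:

# fit_on_sequences
fit_on_sequences(sequences): updates the internal vocabulary based on a list of sequences.

# fit_on_texts
fit_on_texts(texts): updates the internal vocabulary based on a list of texts. Required before using texts_to_sequences or texts_to_matrix.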

#sequences_to_matrix
sequences_to_matrix(sequences, mode='binary')

sequences_to_texts(sequences)

sequences_to_texts_generator( sequences)

texts_to_matrix(texts, mode='binary')

texts_to_sequences(texts)

texts_to_sequences_generator(texts)

to_json(**kwargs)

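# get_config
get_config(): returns the tokenizer configuration as a Python dictionary. The word-count dictionaries used by the tokenizer are serialized to plain JSON so that other projects can read the configuration.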

tf.keras.preprocessing.sequence.pad_sequences

# Converts a list of sequences (a list of length num_samples whose items are lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps).
tf.keras.preprocessing.sequence.pad_sequences(
    sequences, maxlen=None, dtype='int32', padding='pre',
    truncating='pre', value=0.0
)

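maxlen: optional int, the maximum length of all sequences. If not provided, sequences are padded to the length of the longest individual sequence.
dtype: (optional, defaults to int32) type of the output sequences. To pad sequences with variable-length strings, you can use object.
padding: string, 'pre' or 'post' (optional, defaults to 'pre'): whether to pad before or after each sequence.
truncating: string, 'pre' or 'post' (optional, defaults to 'pre'): whether to remove values from sequences longer than maxlen at the beginning or at the end of the sequence.
value: the padding value (optional, defaults to 0).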
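
The padding and truncating options are easy to mix up, so here is a minimal plain-Python sketch of their semantics as described above. The function name pad_sequences_sketch is hypothetical, it returns nested lists rather than a NumPy array, and dtype handling is omitted:

```python
def pad_sequences_sketch(sequences, maxlen=None, padding='pre',
                         truncating='pre', value=0):
    """Pad or truncate every sequence to the same length."""
    if maxlen is None:
        maxlen = max(len(seq) for seq in sequences)
    result = []
    for seq in sequences:
        if len(seq) > maxlen:
            # 'pre' drops values from the start, 'post' from the end.
            seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
        fill = [value] * (maxlen - len(seq))
        result.append(fill + list(seq) if padding == 'pre' else list(seq) + fill)
    return result

sequences = [[1, 2, 3, 4, 5], [6, 7], [8]]
pre = pad_sequences_sketch(sequences, maxlen=3)
post = pad_sequences_sketch(sequences, maxlen=3, padding='post', truncating='post')
```

With the 'pre' defaults the most recent tokens are kept and the zeros go in front, so the last timesteps an RNN sees are real data rather than padding.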

4. Code example:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import tensorflow_datasets as tfds

tfds.disable_progress_bar()

(train_data, test_data), info = tfds.load(
    'imdb_reviews/subwords8k', 
    split = (tfds.Split.TRAIN, tfds.Split.TEST), 
    with_info=True, as_supervised=True)
encoder = info.features['text'].encoder
encoder.subwords[:20]

# padded_batch standardizes the length of the reviews within each batch
train_batches = train_data.shuffle(1000).padded_batch(10)
test_batches = test_data.shuffle(1000).padded_batch(10)

train_batch, train_labels = next(iter(train_batches))
train_batch.numpy()

# Build the model

embedding_dim=16

model = keras.Sequential([
  layers.Embedding(encoder.vocab_size, embedding_dim),
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'),
  layers.Dense(1)
])

model.summary()

model.compile(optimizer='adam',
              loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(
    train_batches,
    epochs=10,
    validation_data=test_batches, validation_steps=20)

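II. Recurrent neural networks

A recurrent neural network (RNN) is a class of recursive neural network that takes sequence data as input, recurses along the direction in which the sequence evolves, and connects all of its nodes (recurrent units) in a chain.

For the principles behind recurrent neural networks, see the companion article: Notes on the Principles of Recurrent Neural Networks (循环神经网络原理笔记)

Basic implementation (Keras)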

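(1) SimpleRNN layer (recurrent layer):

Processes batches of sequences
Input shape: (batch_size, timesteps, input_features)
Output shape: (batch_size, timesteps, output_features) or (batch_size, output_features), depending on the return_sequences argument. With True the layer returns the full sequence of states (the first form); otherwise it returns only the final output for each input sequence (the second form).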
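
The two output shapes can be seen in a small forward-pass sketch for a single sequence. This is plain Python written for illustration, not the Keras implementation: the weights are random, the function name and the variable names W, U and b are assumptions, and the batch dimension is left out:

```python
import math
import random

def simple_rnn(inputs, W, U, b, return_sequences=False):
    """Run one sequence of shape (timesteps, input_features) through a tanh RNN.

    W: (output_features, input_features), U: (output_features, output_features),
    b: (output_features,). Returns all states or only the last one.
    """
    units = len(b)
    state = [0.0] * units          # initial state is all zeros
    outputs = []
    for x_t in inputs:
        # New state combines the current input with the previous state.
        state = [
            math.tanh(
                sum(W[k][i] * x_t[i] for i in range(len(x_t)))
                + sum(U[k][j] * state[j] for j in range(units))
                + b[k]
            )
            for k in range(units)
        ]
        outputs.append(state)
    return outputs if return_sequences else outputs[-1]

rng = random.Random(0)
timesteps, input_features, output_features = 5, 3, 4
inputs = [[rng.random() for _ in range(input_features)] for _ in range(timesteps)]
W = [[rng.uniform(-1, 1) for _ in range(input_features)] for _ in range(output_features)]
U = [[rng.uniform(-1, 1) for _ in range(output_features)] for _ in range(output_features)]
b = [0.0] * output_features

full = simple_rnn(inputs, W, U, b, return_sequences=True)   # (timesteps, output_features)
last = simple_rnn(inputs, W, U, b)                          # (output_features,)
```

The final output is simply the last entry of the full sequence, which is why return_sequences only changes how much of the computation is exposed.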

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
  layers.Embedding(10000, 32),
  layers.SimpleRNN(32, return_sequences=True),
  layers.SimpleRNN(32, return_sequences=True),
  layers.SimpleRNN(32)
])

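When several recurrent layers are stacked, every intermediate layer should return its full output sequence, which serves as the input to the next layer.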

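(2) GRU layer (gated recurrent unit):

activation: the activation function to use. Default: hyperbolic tangent (tanh).
recurrent_activation: the activation function used for the recurrent step. Default: sigmoid.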

tf.keras.layers.GRU(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', kernel_regularizer=None,
    recurrent_regularizer=None, bias_regularizer=None, activity_regularizer=None,
    kernel_constraint=None, recurrent_constraint=None, bias_constraint=None,
    dropout=0.0, recurrent_dropout=0.0, return_sequences=False, return_state=False,
    go_backwards=False, stateful=False, unroll=False, time_major=False,
    reset_after=True, **kwargs
)

tf.keras.layers.GRU(4, return_sequences=True, return_state=True)

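(3) LSTM layer (long short-term memory):

Allows past information to be reinjected at a later time, which mitigates the vanishing-gradient problem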
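
How the carry preserves past information can be sketched with a scalar LSTM step using the standard gate equations. This is an illustrative sketch, not the Keras implementation: the parameter names in `params` are made up, and the weights are hand-picked (an assumption) so that the forget gate stays near 1 and the input gate near 0:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step for scalar input and state; `p` holds the weights."""
    i = sigmoid(p['wi'] * x + p['ui'] * h_prev + p['bi'])    # input gate
    f = sigmoid(p['wf'] * x + p['uf'] * h_prev + p['bf'])    # forget gate
    o = sigmoid(p['wo'] * x + p['uo'] * h_prev + p['bo'])    # output gate
    g = math.tanh(p['wg'] * x + p['ug'] * h_prev + p['bg'])  # candidate value
    c = f * c_prev + i * g       # carry: old information plus new information
    h = o * math.tanh(c)         # output for this timestep
    return h, c

# Hand-picked weights (an assumption for illustration): the forget gate stays
# close to 1 and the input gate close to 0, so the carry is barely modified.
params = dict(wi=0.0, ui=0.0, bi=-10.0,
              wf=0.0, uf=0.0, bf=10.0,
              wo=0.0, uo=0.0, bo=0.0,
              wg=1.0, ug=0.0, bg=0.0)

h, c = 0.0, 1.0                  # start with some information in the carry
for x in [0.5] * 50:             # 50 timesteps
    h, c = lstm_step(x, h, c, params)
```

After 50 steps the carry is still close to its starting value, because it is updated additively through the forget gate instead of being squashed by an activation at every step.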

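recurrent_activation: the activation function used for the recurrent step. Default: sigmoid.
use_bias: Boolean (default True), whether the layer uses a bias vector.
dropout: float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Default: 0.
return_sequences: Boolean. Whether to return the last output of the output sequence or the full sequence. Default: False.
return_state: Boolean. Whether to return the last state in addition to the output. Default: False.
go_backwards: Boolean (default False). If True, the input sequence is processed backwards and the reversed sequence is returned.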

tf.keras.layers.LSTM(
    units, activation='tanh', recurrent_activation='sigmoid',
    use_bias=True, kernel_initializer='glorot_uniform',
    recurrent_initializer='orthogonal',
    bias_initializer='zeros', unit_forget_bias=True,
    dropout=0.0, recurrent_dropout=0.0
)

tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)

# max_feature is the vocabulary size and is expected to be defined beforehand
model = keras.Sequential([
  keras.layers.Embedding(max_feature, 32),
  keras.layers.LSTM(32),
  keras.layers.Dense(1, activation='sigmoid')
])

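(4) tf.keras.layers.Bidirectional wrapper (bidirectional RNN):

A bidirectional wrapper for RNNs. It wraps a recurrent layer such as LSTM or GRU so that information flows in both directions. In practice it runs two copies of the layer, one processing the sequence forwards and one backwards, and combines their outputs.

merge_mode: the mode by which the outputs of the forward and backward RNNs are combined. One of {'sum', 'mul', 'concat', 'ave', None}. If None, the outputs are not combined and are returned as a list. The default is 'concat'.
backward_layer: optional keras.layers.RNN or keras.layers.Layer instance used to handle the backward input processing. If backward_layer is not provided, the layer instance passed as the layer argument is used to generate the backward layer automatically. Note that a provided backward_layer should have properties matching those of the layer argument; in particular it should have the same values for stateful, return_states, return_sequences, and so on. In addition, backward_layer and layer should have different go_backwards argument values.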

tf.keras.layers.Bidirectional(
    layer, merge_mode='concat', weights=None, backward_layer=None,
    **kwargs
)
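
The merge_mode options can be illustrated with a short plain-Python sketch that combines one forward and one backward output vector. The function name merge and the two sample vectors are made up for illustration; the actual layer performs the same kind of combination on tensors:

```python
def merge(forward, backward, merge_mode='concat'):
    """Combine the forward and backward output vectors of one sample."""
    if merge_mode == 'concat':
        return forward + backward                       # list concatenation
    if merge_mode == 'sum':
        return [f + b for f, b in zip(forward, backward)]
    if merge_mode == 'mul':
        return [f * b for f, b in zip(forward, backward)]
    if merge_mode == 'ave':
        return [(f + b) / 2 for f, b in zip(forward, backward)]
    if merge_mode is None:
        return [forward, backward]                      # left unmerged
    raise ValueError('unknown merge_mode: %r' % (merge_mode,))

forward_out = [1.0, 2.0]     # made-up final output of the forward layer
backward_out = [3.0, 4.0]    # made-up final output of the backward layer

merged = {mode: merge(forward_out, backward_out, mode)
          for mode in ('concat', 'sum', 'mul', 'ave', None)}
```

Note that 'concat' doubles the feature dimension, while 'sum', 'mul' and 'ave' keep it unchanged, which matters for the input size of whatever layer comes next.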

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Activation, Bidirectional, Dense, LSTM

model = Sequential()
model.add(Bidirectional(LSTM(10, return_sequences=True), input_shape=(5, 10)))
model.add(Bidirectional(LSTM(10)))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# With a custom backward layer
model = Sequential()
forward_layer = LSTM(10, return_sequences=True)
backward_layer = LSTM(10, activation='relu', return_sequences=True,
                      go_backwards=True)
model.add(Bidirectional(forward_layer, backward_layer=backward_layer,
                        input_shape=(5, 10)))
model.add(Dense(5))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

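(5) Dropout regularization:

To regularize a recurrent network, use recurrent dropout: apply the same dropout mask at every timestep rather than a mask that varies randomly from one timestep to the next. A randomly varying mask disrupts the error signal that propagates through time and hampers learning.
Keras recurrent layers have two built-in dropout arguments:

dropout: the dropout rate for the input units of the layer

recurrent_dropout: the dropout rate for the recurrent units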
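
The difference between a time-constant mask and a per-timestep mask can be sketched as follows. This is a plain-Python illustration of the idea applied to a sequence of inputs, not the Keras implementation; the helper names, the rate and the all-ones sequence are assumptions:

```python
import random

def make_mask(size, rate, rng):
    """Return a 0/1 mask that drops each unit with probability `rate`."""
    return [0.0 if rng.random() < rate else 1.0 for _ in range(size)]

def apply_mask(vector, mask, rate):
    """Zero the dropped units and rescale the kept ones (inverted dropout)."""
    return [v * m / (1.0 - rate) for v, m in zip(vector, mask)]

rng = random.Random(0)
rate, timesteps, features = 0.5, 4, 6
sequence = [[1.0] * features for _ in range(timesteps)]

# Time-constant mask: sample once, reuse at every timestep.
shared_mask = make_mask(features, rate, rng)
constant = [apply_mask(x_t, shared_mask, rate) for x_t in sequence]

# For contrast: a fresh mask at every timestep.
varying = [apply_mask(x_t, make_mask(features, rate, rng), rate) for x_t in sequence]
```

With the shared mask the same units are dropped at every timestep, so the units that remain see an uninterrupted signal across the whole sequence.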

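III. Conclusion

1. Keep in mind that a unidirectional RNN depends strongly on chronological order, whereas a bidirectional RNN does not. For example, a bidirectional RNN tends to work better for natural language processing, while a unidirectional RNN tends to work better for weather forecasting.
2. When using dropout regularization in an RNN, use a dropout mask that does not change over time.
3. Establish a baseline before turning to machine learning; this can save a great deal of computation. The baseline can be a common-sense heuristic or a small model, so that every further increase in complexity can be shown to be justified.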
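
As one possible example of the common-sense baseline mentioned in point 3, the sketch below computes the accuracy of always predicting the most frequent training label. The function name and the labels are made up for illustration; which baseline is appropriate depends on the task:

```python
from collections import Counter

def majority_class_baseline(train_labels, test_labels):
    """Always predict the most frequent training label; return test accuracy."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Made-up labels for illustration only.
train_labels = [1, 0, 1, 1, 0, 1]
test_labels = [1, 1, 0, 0]
label, baseline_accuracy = majority_class_baseline(train_labels, test_labels)
```

Any trained model should beat this number before its extra cost is considered worthwhile.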

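The principles behind GRU and LSTM are covered in another article: Notes on the Principles of Recurrent Neural Networks (循环神经网络原理笔记)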

Original article: https://blog.csdn.net/qq_43842886/article/details/112970292