Keras Source Code Analysis: Tokenizer

Post author: xfxia
Post published: October 17, 2023
Post category: Other

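I really like the Keras framework. Day to day I use its packaged APIs, which cover almost everything I need, so I rarely have to modify the source. Recently I have become more curious about how Keras is implemented, so I am spending some time reading the source and writing up study notes.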

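I skimmed the Keras Chinese documentation, the English documentation, and the source, and found that the documentation is incomplete: many interfaces implemented in the source are not covered by the docs. That gave me the idea of writing my own analysis of the source.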

This first article starts with the tokenizer from the preprocessing module.

What is a tokenizer

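A computer cannot understand the meaning of text when it processes language, so each word (in Chinese, a single character or a multi-character phrase can count as a word) is usually converted to a positive integer, and a text becomes a sequence of integers. Doing exactly this is the core job of the tokenizer.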
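As a toy illustration of that idea (plain Python, not Keras code; the vocabulary and the `encode` helper are made up for this example), a text becomes a list of positive integers once each word has an index:

```python
# Hypothetical hand-written vocabulary: each word maps to a positive integer.
vocab = {"i": 1, "like": 2, "keras": 3}


def encode(text, vocab):
    """Turn a whitespace-separated text into a list of integer ids."""
    return [vocab[word] for word in text.lower().split()]


sequence = encode("I like Keras", vocab)
print(sequence)
```

The tokenizer's extra work is building such a vocabulary automatically from the training texts.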

Basic parameters

keras.preprocessing.text.Tokenizer(num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ', char_level=False, oov_token=None, document_count=0)

num_words: the maximum number of words to keep, based on word frequency. Only the most common num_words-1 words will be kept.
filters: a string where each element is a character that will be filtered from the texts. The default is all punctuation, plus tabs and line breaks, minus the ' character.
lower: boolean. Whether to convert the texts to lowercase.
split: str. Separator for word splitting.
char_level: if True, every character will be treated as a token.
oov_token: if given, it will be added to word_index and used to replace out-of-vocabulary words during text_to_sequence calls.¹

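num_words: the maximum number of words to keep, ranked by word frequency; the top num_words - 1 words are kept.
filters: the filter; by default it strips the common special symbols.
lower: whether to convert the text to lowercase.
split: the word separator.
char_level: whether to treat every character as a token; the default is False. When processing Chinese with each character as a token, set this to True.
oov_token: if given, it is added to the word index and used to replace out-of-vocabulary words.
document_count: the number of documents. It is normally computed automatically from the texts fed in, so there is no need to pass it.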
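To make the interaction between num_words and oov_token concrete, here is a minimal sketch of the lookup step. It is my own simplification, not the Keras source: the helper name `words_to_ids` and the two example indexes are hypothetical, and I assume that words outside the kept range are dropped unless an OOV token is set, and that the OOV token sits at index 1.

```python
def words_to_ids(words, word_index, num_words=None, oov_token=None):
    """Look up each word; indices >= num_words are treated as out-of-vocabulary."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    ids = []
    for word in words:
        index = word_index.get(word)
        in_vocab = index is not None and (num_words is None or index < num_words)
        if in_vocab:
            ids.append(index)
        elif oov_index is not None:
            ids.append(oov_index)  # replace the unknown or too-rare word
        # otherwise the word is dropped from the sequence
    return ids


# Assumed example index, ordered by frequency; index 0 is never assigned.
word_index = {"the": 1, "cat": 2, "sat": 3, "mat": 4}
words = ["the", "cat", "sat", "on", "the", "mat"]

# num_words=3 keeps only indices 1 and 2, i.e. num_words - 1 words.
kept = words_to_ids(words, word_index, num_words=3)

# Assumption: with an OOV token, it takes index 1 and the other words shift up.
word_index_oov = {"<unk>": 1, "the": 2, "cat": 3, "sat": 4, "mat": 5}
replaced = words_to_ids(words, word_index_oov, num_words=4, oov_token="<unk>")

print(kept)
print(replaced)
```

Because the comparison is against the index itself and index 0 is never used, num_words=3 leaves room for only two real words, which is where the "num_words - 1" in the parameter description comes from.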

Key methods

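Here I took a screenshot directly from the Keras Chinese documentation.² One small issue with it: these are methods of an object (instance methods), not class methods.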
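The instance-versus-class distinction matters because the vocabulary is state stored on each tokenizer object. The sketch below uses a hypothetical `MiniTokenizer` class (a simplified stand-in written for this note, not the Keras class) whose two methods borrow the Keras names `fit_on_texts` and `texts_to_sequences`:

```python
from collections import Counter


class MiniTokenizer:
    """Hypothetical stand-in for illustration; not the Keras class."""

    def __init__(self):
        self.word_index = {}  # per-instance state filled by fit_on_texts

    def fit_on_texts(self, texts):
        counts = Counter(word for text in texts for word in text.lower().split())
        # Most frequent word gets index 1; index 0 is left unused.
        ranked = sorted(counts, key=lambda w: -counts[w])
        self.word_index = {word: i for i, word in enumerate(ranked, start=1)}

    def texts_to_sequences(self, texts):
        return [
            [self.word_index[w] for w in text.lower().split() if w in self.word_index]
            for text in texts
        ]


tok_a = MiniTokenizer()
tok_a.fit_on_texts(["the cat sat", "the cat"])
tok_b = MiniTokenizer()  # a second instance has its own, still empty, vocabulary

seq_a = tok_a.texts_to_sequences(["the cat sat"])
seq_b = tok_b.texts_to_sequences(["the cat sat"])

# Calling through the class without an instance fails: there is no `self`.
try:
    MiniTokenizer.fit_on_texts(["the cat"])
    class_call_failed = False
except TypeError:
    class_call_failed = True
```

Each instance carries its own `word_index`, so fitting one tokenizer has no effect on another, and the methods only make sense when called on an object that has been fitted.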

Source code analysis

def fit_on_texts(self, texts):
    """Updates internal vocabulary based on a list of texts.

    Annotation: this mainly updates the word_index and index_word attributes.

    In the case where texts contains lists,
    we assume each entry of the lists to be a token.

    Required before using `texts_to_sequences` or `texts_to_matrix`.

    # Arguments
        texts: can be a list of strings,
            a generator of strings (for memory-efficiency),
            or a list of list of strings.
    """
    for text in texts:
        self.document_count += 1  # update the document count
        if self.char_level or isinstance(text, list):
            if self.lower:
                if isinstance(text, list):
                    text = [text_elem.lower() for text_elem in text]  # lowercase every token
                else:
                    text = text.lower()
            seq = text  # seq holds the token sequence; each element is a single character or a word
        else:
            seq = text_to_word_sequence(text,
                                        self.filters,
                                        self.lower,
                                        self.split)  # convert the text to a word sequence; this function is analyzed separately
        # self.word_counts is an ordered dict used to count word frequencies
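The branch at the top of the loop can be tried out in isolation. The following is a standalone sketch under my own assumptions: `split_words` is a simplified stand-in for `text_to_word_sequence` (replace each filtered character with the separator, then split), and `to_token_sequence` is a hypothetical helper that mirrors the `char_level` / list check from the excerpt. Neither is the Keras source.

```python
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def split_words(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Simplified stand-in for text_to_word_sequence (an assumption, not the Keras source)."""
    if lower:
        text = text.lower()
    # Replace every filtered character with the separator, then split on it.
    text = text.translate(str.maketrans({ch: split for ch in filters}))
    return [word for word in text.split(split) if word]


def to_token_sequence(text, char_level=False, lower=True):
    """Mirror the branch in the excerpt: pre-tokenized or char-level input skips splitting."""
    if char_level or isinstance(text, list):
        if lower:
            if isinstance(text, list):
                text = [elem.lower() for elem in text]
            else:
                text = text.lower()
        return text  # a list of tokens, or a string iterated character by character
    return split_words(text, lower=lower)


from_string = to_token_sequence("Hello, Keras!")
from_list = to_token_sequence(["Hello", "Keras"])
from_chars = list(to_token_sequence("Hi!", char_level=True))

print(from_string, from_list, from_chars)
```

Note that in the char-level and list branches no filtering happens, so punctuation survives as a token there, while the plain-string branch strips it.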

Original link: https://blog.csdn.net/weixin_42060232/article/details/107462373
