大并发搜索下关键词前缀匹配值得考虑的一种数据结构—Trie

  • Post author:
  • Post category:其他


如果要实现一个能支撑大数据量并发搜索的引擎的关键词匹配,而是需要选择用一种紧凑高效的数据结构来实现,譬如说Trie。下面介绍一下Trie..

Trie,又称单词查找树或键树,是一种树形结构,是一种哈希树的变种。典型应用是用于统计和排序大量的字符串(但不仅限于字符串),所以经常被搜索引擎系统用于文本词频统计。它的优点是:最大限度地减少无谓的字符串比较,查询效率比哈希表高。

散列是一种常见的高效查找方法,它根据数组下标查询,所以速度快。首先根据词表构造散列表,具体来说就是用给定的哈希函数构造词典到数组下标的映射,如果存在冲突,则根据选择的冲突处理方法解决地址冲突。然后可以在哈希表的基础上执行哈希查找。

冲突导致散列性能降低。不存在冲突的散列表叫做完美散列(perfect hash)。整词散列不适合分词的最长匹配查找方式。

数字搜索树采用了逐字散列的方法,可以看成是一种逐字的完美散列。一个数字搜索Trie树(retrieve树)的一个节点只保留一个字符。如果一个单词比一个字符长,则包含第一个字符的节点有指针指向下一个字符的节点,以此类推。这样组成一个层次结构的树,树的第一层包括所有单词的第一个字符,树的第二层包括所有单词的第二个字符,以此类推,数字搜索树的最大高度是词典中最长单词的长度。例如,如下单词序列组成的词典(as at be by he in is it of on or to)会生成如图4-4所示的数字搜索树:

数字搜索树的结构独立于生成数时单词进入的顺序。这里,Trie树的高度是2。因为树的高度很小,在数字搜索Trie树中搜索一个单词的速度很快。但是,这是以内存消耗为代价的,树中的每一个节点都需要很多内存。假设每个词都是由26个小写英文字母中的一个组成的,这个节点中会有26个指针。所以不太可能直接用这样的数字搜索树来存储中文这样的大字符集。

它有3个基本性质:

根节点不包含字符,除根节点外每一个节点都只包含一个字符。

从根节点到某一节点,路径上经过的字符连接起来,为该节点对应的字符串。

每个节点的所有子节点包含的字符都不相同。

这是一个Trie结构的例子: Trie example.svg

[img]http://dl.iteye.com/upload/attachment/553074/ec4c8472-b93f-3885-904e-2ece35417635.png[/img]

在这个Trie结构中,保存了A、to、tea、ted、ten、i、in、inn这8个字符串,仅占用8个字节(不包括指针占用的空间)。

Trie树在实现上有一个树类(SearchTrie)和一个节点类(TrieNode)。SearchTrie的主要方法有两个:

增加单词到搜索树,方法原型是:addWord(String word)。

从文本的指定位置开始匹配单词,方法原型是:matchLong(String text,int offset)。

给定一个固定电话号码,找出这个电话号码对应的区域。固定电话号码都是以0开始的多位数字,可以通过给定电话号码的前缀找出对应的地区,例如:

0995:新疆:托克逊县  
0856:贵州:铜仁
0996:新疆:焉耆回族自治县

可以使用数字搜索树算法快速查找电话号码前缀。例如:0 3 7 1 这个区号有四个节点对应,也就是说有4个TrieNode对象。第一个对象中的splitChar属性是0;第二个对象中的splitChar属性是3;第三个对象中的splitChar属性是7;第四个对象中的splitChar属性是1。每个TrieNode对象有10个孩子节点,分别对应孩子节点的splitChar为0到9。

Trie树的节点类定义如下:

public static final class TrieNode {  
protected TrieNode[] children; //孩子节点
protected char splitChar; //分隔字符
protected String area; //电话所属地区信息

/**
* 构造方法
*
* @param splitchar分隔字符
*/
protected TrieNode(char splitchar) {
children = new TrieNode[10];
area = null;
this.splitChar = splitchar;
}
}

加载词,形成数字搜索树的方法如下:

private void addWord(String string, TSTNode root, String area) {
TSTNode tstNode = root;
for (int i = 1; i < string.length(); i++) {
char c0 = string.charAt(i);
int ind = Integer.parseInt(string.substring(i, i + 1));
TSTNode tmpNode = tstNode.children[ind];
if (null == tmpNode) {
tmpNode = new TSTNode(c0);
}
if (i == string.length() - 1) {
tmpNode.area = area;
}
tstNode.children[ind] = tmpNode;
tstNode = tmpNode;
}
}

查询的过程对于查询词来说,从前往后一个字符一个字符的匹配。对于Trie树来说,是从根节点往下匹配的过程。从给定电话号码搜索前缀的方法如下所示:

public String search(String tel) {  
TrieNode tstNode = root;
for (int i = 1; i < tel.length(); i++)
{//从前往后一个字符一个字符的匹配
tstNodetstNode = tstNode.children[(tel.charAt(i)-'0')];
if (null != tstNode.area) {
return tstNode.area;
}
}
return null;//没找到
}

考虑把Trie树改成通用的结构,使用散列表存储孩子节点,并使用范型定义值类型。

 public class TrieNode<T> {  
private Character splitChar; //分隔字符
private T nodeValue; //值信息
private Map<Character, TrieNode<T>> children =
new HashMap
<Character, TrieNode<T>>();
//孩子节点
}

Java 实现


/**
* Description: This program implements a trie that allows to search words. It
* reads words from a file specified by the user in the command line
* and constructs the dictionary. It allows user to find, print, add,
* and delete words from the trie. It folds all inpput into lower-case
* before storing them or looking them up. It also ignores characters
* that are not letters and only stores words that are three letters
* long. A reference pointer is used, when finding a word, the reference
* pointer will point to this word, when printing it will point to the
* last printed word, and when deleting it will be reset to the trie root.
*
* This class implements a dictionary tree in a structure called trie.
* We use a representation called "Leftmost-Child-Right-Sibling"
* representation. A node points to a list of its children and it points
* to the leftmost child in the list, and each child in the list points
* to the sibling to the right of it. We add one extra node at the end
* of the list to implement a pointer back to the parent, using the child
* pointer in that extra node. The root does not represent any letter, and
* the last sibling on each list (that points to the parent) has a right
* sibling pointer set to null.
*
* There is a pointer to the current position in the trie, this position
* is updated when we search a word to point to the start of the found
* word, when printing it points to the starting position of last printed
* word, and when deleting it resets position reference to the root. Finding
* and adding work in the same way, checking first if child node match letter
* if not then checking with siblings, until a match is found. Printing starts
* from the starting node position of the word, it goes to the end of the sibling
* list and then up to the parent node, gets its letter and continues until it
* get to the root. Deleting starts also from the node starting position of the
* word and continuous up the tree in the same as printing, but checking on its
* way if the nodes are still useful, if not it removes reference to these nodes.
* The basic structure of the trie is a node that contains a specific letter of a
* word, a child pointer, a sibling pointer and a boolean vallue indicating if
* this is the starting position of a word. The construction of the trie is done
* by reading the input file one line at a time and using the add word method.
*
* Input: It receives as input in the command line the name of the filename to
* be used as input in constructing the trie.
* Later it receives input telling what operation to perform: add, delete
* or find a word, or print n number of words
* Output: It prints out wether the operation requested by the user was succesful
* or not, and the output requested by each command.
*
* Status: This class works according to program specifications
*/

import java.io.*; // For console input

public class Trie
{
public Node root; // Trie root
public Node current; // Current Position Inside the Trie

/** Array of letters that are used in comparison operations in order to
* know which letter is currently being stored into the trie */
private String[] letters = {"a", "b", "c", "d", "e", "f", "g", "h", "i",
"j", "k", "l", "m", "n", "o", "p", "q", "r",
"s", "t", "u", "v", "w", "x", "y", "z"};

/** Stores the ocurrence of each letter read into the trie, the counter
* stored at each index corresponds to the letter stored at the same index
* in array letters */
private double[] letterCount = new double[26];

/** Stores the percentage of ocurrence of each letter in the trie, the
* precentage stored at each index corresponds to the letter stored at the
* same index in array letters */
private double[] percentageOfLetter = new double[26];

// Total number of letters read into the trie
private double totalLetterCount;

// Numbers used in calculating a random number to find a random letter
private final double max = 100.0;
private final double min = 0.0;

/**
* Declares an empty trie
*/
public Trie()
{
Node root = new Node();
}

/**
* Build the dictionary trie reading the input from a given filename
* @param An array of strings with the command line arguments
* @return None
*/
public Trie(String[] commandLine) throws IOException {

root = new Node(); // Declare a new trie
BufferedReader reader; // To read from a file
String input; // Store line of input

// Check if filename was declared and if it exists
checkIfValidFilename( commandLine );

// Declare in which file the input is in
reader = new BufferedReader( new FileReader ( commandLine[0] ) );

// Declare in which file the input is in
//reader = new BufferedReader( new FileReader ( "words" ) );

// Read each line of the input file
while ( (input = reader.readLine() ) != null) {

// Add word to trie
addWord(input);
}

// current position in trie is the root
current = root;

// Calculate percentage of ocurrence of letters
calculatePercentages();
}

/**
* Description: Node
* This class implement a node structure which is used as the
* structural components of the trie
*/
public class Node{
private String letter; // Letter hold in the node
boolean isAWord; // Indicates wether this node represents starting positionof a word
private Node next; // Leftmost child pointer
private Node sibling; // Right sibling pointer

/**
* Creates a new node
* @param None
* @return None
*/
Node () {
letter = null;
isAWord = false;
next = sibling = null;
}

/**
* Creates a new node with given letter and boolean value
* @param A string for the letter hold in node
* A boolean value to check if this the start of a word
* @return None
*/
Node (String s) {
letter = s;
isAWord = false;
next = sibling = null;
}
}

// Check if the given letter is in the next level of the trie from the given node pointer
// Returns boolean if it finds it, else returns null
public Node findLetter(Node crtPosition, String letter) {
Node pointer = crtPosition; // Initialize pointer for position refernce in trie

// Word is not in tree if still looking for word letters and there aren't any more
if (pointer.next == null) {
return null;
}
else { // There is a child from the current node

// Check if next node contains the current letter we're looking for
if ( pointer.next.letter.equals(letter) ) {
pointer = pointer.next; // go to next child
}
else { // Check if a sibling contains the same letter
pointer = pointer.next; // go to start of node sibling list

// Loop until a node with the same letter is found or sibling list ends
while ( pointer.sibling.sibling != null && !pointer.sibling.letter.equals(letter) ) {
pointer = pointer.sibling;
}

// If no sibling with same letter is found, word isn't in trie
if ( pointer.sibling.sibling == null ) {
return null; // String is not in tree
}
else { // A sibling has the letter we're looking for, continue
pointer = pointer.sibling; // One sibling has same letter
}
}
}
return pointer;
}


/**
* Find a given word in the given trie
* @param A string to look up in the trie
* The root of the trie in where to look up the word
* @return A pointer to the node with the starting position (letter) of the word found
* If word is not found, it returns a null pointer
*/
public Node findStr(String word) {
Node pointer = root; // Initialize pointer for position refernce in trie
String input = word; // Word to look up

// Start a while lopp searching the chars of the word in trie starting backwards
while (input.length() >= 1) {

// Get the last letter in the word
String firstChar = input.substring( 0, 1 );
input = input.substring( 1, input.length() ); // Update string minus last letter

if ( !Character.isLetter(firstChar.charAt(0)) ) { // If char is not a letter skip it
continue;
}

// Word is not in tree if still looking for word letters and there aren't any more
if (pointer.next == null) {
return null;
}
else { // There is a child from the current node

// Check if next node contains the current letter we're looking for
if ( pointer.next.letter.equals(firstChar) ) {
pointer = pointer.next; // go to next child
}
else { // Check if a sibling contains the same letter
pointer = pointer.next; // go to start of node sibling list

// Loop until a node with the same letter is found or sibling list ends
while ( pointer.sibling.sibling != null && !pointer.sibling.letter.equals(firstChar) ) {
pointer = pointer.sibling;
}

// If no sibling with same letter is found, word isn't in trie
if ( pointer.sibling.sibling == null ) {
return null; // String is not in tree
}
else { // A sibling has the letter we're looking for, continue
pointer = pointer.sibling; // One sibling has same letter
}
}
}
}
// All letters from word match, check if current pointer reference is start of word
if ( pointer.isAWord == true) {
current = pointer;
return pointer; // Return pointer to start of found word
}
else { // Letters matched, but it not a word in the tree
return null; // String is not in tree
}
}

/**
* Print n number of words starting from a given reference in the trie
* @param The root of the trie
* The current reference pointer in the trie
* The number of words to be printed
* @return A pointer reference to the last word printed
*/
public Node printNextWords( int n) {
Node pointer = current;

if (root.next == null) { // check if Dictinary has no words
System.out.println(" Trie is Empty");
return root;
}
System.out.println();

// Start loop to print n number of words
while ( n > 0) {

// Check if we are at the end of the trie or there is no child
if ( pointer.next == null || pointer.next == root) {

if ( pointer.next == root) { // We are at the end of the trie
System.out.println(">> End of Trie <<");
pointer = root; // Reset the position reference
break; // Stop printing words
}
else { // Pointer has no child
pointer = pointer.sibling; // Goto pointer sibling

// If we are at the end of the list, back up to parent node next sibling
while ( pointer.sibling == null ) {
if ( pointer.next == root ) { // We are at the end of the trie
break; // Stop printing words
}
else { // Back up to parent node next sibling, list in above level
pointer = pointer.next.sibling;
}
}
if ( pointer.isAWord == true) { // If pointer is start of word, prin it
printWord(pointer);
n--; // Decrement words to print counter
}
}
}
else { // Child of current reference pointer is not null
pointer = pointer.next;
if ( pointer.isAWord == true) { // If pointer is start of word, prin it
printWord(pointer);
n--; // Decrement words to print counter
}
}
}
current = pointer;
return pointer; // Pointer to last word printed
}

/**
* Delete a given word from given trie
* @param String to be found and deleted
* The root from the given trie
* @return A pointer to the next closest parent from the current nodes deleted
* If word to be deleted is not found, returns a null pointer
*/
public Node deleteWord(String lookUpWord) {
Node pointer = findStr( lookUpWord); // Check if word is in tree
Node tempPointer = null; // Extra reference pointer

if (pointer == null) { // Word is not in tree
return null;
}
else { // Word is in trie, delete it
if ( pointer.next != null ) { // Word starting position has children nodes
pointer.isAWord = false; // No longer represents a word
return pointer;
}
else { // Word starting position has no children nodes
tempPointer = pointer;

while (true) { // Loop until all parts of the word not used by others are deleted

do { // Go to the parent of the sibling list node of current pointer
tempPointer = tempPointer.sibling;
} while ( tempPointer.sibling != null ); // Goto last sibling
tempPointer = tempPointer.next; // Goto parent

// Check if our node is inmmediate child & has no children or siblings
if ( tempPointer.next == pointer && pointer.sibling.sibling == null) {
tempPointer.next = null; // Delete reference to child

if ( tempPointer == root) { // We have arrived to root, no more to delete
break;
}

// if parent is not the start of a word, it needs to be deleted
if ( tempPointer.isAWord == false ) {
pointer = tempPointer; // Point now to the parent
continue;
}
else { // parent is a word, can not delete anymore nodes
break;
}
}
else { // Current pointer has sibling nodes

// check if child is pointer first in sibling list,
if ( tempPointer.next == pointer ) {
// update the child to be the pointer's sibling
tempPointer.next = tempPointer.next.sibling;
break;
}
tempPointer = tempPointer.next; // Pointer is not first sibling

// Find node previous to our start of word node
while ( tempPointer.sibling != pointer ) {
tempPointer = tempPointer.sibling;
}

// Update reference of previous node to pointer sibling
tempPointer.sibling = pointer.sibling;
break;
}
}
}
}
return tempPointer;
}

/**
* Add a word to the trie
* @param The root of the trie
* The string to be added
* @return None
*/
public Node addWord(String input) {
Node pointer = root;

// Remove characters that are not letters
input = removeNonLetterChars(input);

// The words in this trie have to be at least 3 characters long
if (input.length() < 3 ) {
return pointer;
}

input = input.toLowerCase(); // Convert string to lower case

while (input.length() >= 1) {

// Get the first letter of the word
String firstChar = input.substring( 0, 1 );

// Update string minus first letter
input = input.substring( 1, input.length() );

// If char is not a letter skip it
if ( !Character.isLetter(firstChar.charAt(0)) ) {
continue;
}

// Keep track of letters read into the trie
countLetter(firstChar);

if (pointer.next == null) {
// No more child nodes, We need to add a new child
Node newChild = new Node(firstChar);
// Set child to be new node
pointer.next = newChild;

// Create the last sibling in list pointing to parent
Node lastSibling = new Node();
lastSibling.sibling = null;
lastSibling.next = pointer;
newChild.sibling = lastSibling;

pointer = newChild; // Update pointer
}

else { // There is still children nodes
if ( pointer.next.letter.equals(firstChar) ) {
// node has same letter, go to next child
pointer = pointer.next;
}
else { // Child has not the same letter we're looking for
pointer = pointer.next;

// Check if there's a sibling with same letter
while ( pointer.sibling.sibling != null &&
!pointer.sibling.letter.equals( firstChar ) ) {

pointer = pointer.sibling;
}

// Check if we got tothe last sibling, node matching letter was not found
if (pointer.sibling.sibling == null) {

// Add new node at end of sibling list
Node newSibling = new Node(firstChar);
newSibling.sibling = pointer.sibling;
pointer.sibling = newSibling;
pointer = newSibling;
}
else {
//There is a node matching the letter we're looking for
// update pointer to this position
pointer = pointer.sibling;
}
}
}
}
if ( pointer == root ) { // All characters in word were not a letter
return null; // No word will be added
}
pointer.isAWord = true;
return pointer;
}

/**
* Print a word that that starts at the given pointer reference
* @param A node referencce pointer
* @return None
*/
public void printWord(Node pointer) {
String word = ""; // Declaration of string to hold word to print

do {
word = pointer.letter + word; // Store next character of word

do { // Go to the end of the sibling list
pointer = pointer.sibling;
} while ( pointer.sibling != null );

pointer = pointer.next; // Go to the parent of the node

} while (pointer.sibling != null ); // Stop when we get to the root

System.out.println(word); // Print word
}

/**
* Get a random letter contained inside the trie based on how often it
* occurs in the trie
* @ param None
* @ return A string containing a letter
*/
public String getRandomLetter() {

/** The method used to get arandom letter is based on the percentage of
* how many times it was read into the trie. It gets a random number
* between 0.0 and 100.00, it adds the percentages of letters starting
* from "a" to "z" until a number equal or bigger than the random
* number is found, then we take the letter where this occurred as the
* random letter.
*/

String randomString = null;
double temp = 0.0;

// Get a random number between 0.0 and 100.0
double randomNumber = Math.floor (Math.random() * ( max - min + 1) ) + min;

// Add the ocurrence percentages of each letter starting from "a" to "z"
for (int i = 0; i < percentageOfLetter.length; i++) {

// Store the percentage sum
temp = percentageOfLetter[i] + temp;

// Stop when the sum is equal or bigger than random number
if ( temp >= randomNumber ) {
randomString = letters[i];
break;
}
}

// Return letter where the sum isequal or bigger than random number
return randomString;
}

/**
* Count the number of times aeach letter is read in the trie
* @ param A string containing the letter to be counted
* @ return None
*/
public void countLetter(String letter) {
String temp; // Temporary string holder

// Compare letter to array containing alphabet letters
for (int i = 0; i < letters.length; i++ ) {
temp = letters[i];

// If a match is found
if ( temp.equals( letter ) ) {

// Increase counter of that particular letter
letterCount[i]++;

// Increase counter of total letters read
totalLetterCount++;
}
}
}

/**
* Calculate the percentage of ocurrance of all letters read into trie
* @ param None
* @ return None
*/
public void calculatePercentages() {

// Valculate the percentage of ocurrence of each letter
for (int i = 0; i < percentageOfLetter.length; i++) {

// Math formula: Percentage ox X = times X occurs * 100 / total
percentageOfLetter[i] = (letterCount[i] * 100) / totalLetterCount;
}
}

/**
* Checks if a file was specified by the user in the inputline to use as input to
* build the trie, and if it was declared, it checks it exists.
* @param A string array containing the arguments from the command line
* @return None
*/
public static void checkIfValidFilename( String[] commandLine) throws IOException {
int numberOfArgs = commandLine.length; // Number of arguments in command line

// Check if an input file was actually declared
if ( numberOfArgs < 1 ) {
System.out.println( " No Input File Declared." );
System.out.println( " Ending Program . . . \n" );
System.exit(0); // End program if no file was declared
}

try { // Check if the file actually exists
BufferedReader reader = new BufferedReader( new FileReader ( commandLine[0] ) );
}
catch (FileNotFoundException e) {
System.out.println( " Input Filename Not Found" );
System.out.println( " Ending Program . . . \n" );
System.exit(0); // End program if no file was declared
}
}

/**
* Removes the non letter characters from the received String and returns that string
* @ param A string
* @ return A string with the non letter characters removed
*/
public String removeNonLetterChars(String word) {

// Temporary string will hold string with no letter chars
String tempWord = "";

// Traverse the entire string
while ( word.length() >=1 ) {

// Get the first letter of the word
String firstChar = word.substring(0, 1);

// Update string minus first letter
word = word.substring(1, word.length() );

//Add character to the new word only if it is a letter
if ( Character.isLetter( firstChar.charAt(0) ) ) {
tempWord = tempWord + firstChar;
}
}

// Returned string with no letter chars
return tempWord;

}

/**
* Check if number in a given string is really a number and bigger than zero
* @param A string containing a numerical value
* @return A boolean value, true f string is a number and bigger than zero
* false otherwise
*/
public boolean isValidNumber(String str) {
char[]c = str.toCharArray(); // An array of the chrarcter of the string

// Check that every character is a number
for ( int i = 0; i < str.length(); i ++ ) {
if ( !Character.isDigit( c[i] ) ) {
System.out.println(" Argument Is Not A Number");
printInvalidCommand();
return false;
}
}

int number = Integer.parseInt( str );

// Check if the number is bigger than zero
if ( number < 1 ) {
System.out.println(" Number not valid: Smaller than zero");
printInvalidCommand();
return false;
}
return true;
}

/**
* Print a invalid command message
* @param None
* @return None
*/
public void printInvalidCommand() {
System.out.println("Not Valid Command. Please Try Again. ");
}
}