Showing posts with label Data Structure. Show all posts
Showing posts with label Data Structure. Show all posts

November 18, 2012

Huffman Encoding

There are other uses of binary trees. One of these is in the compression of data. This is known as Huffman Encoding. Data compression plays a significant role in computer networks. To transmit data to its destination faster, it is necessary to either increase the data rate of the transmission media or simply send less data. The data compression is used in computer networks. To make the computer networks faster, we have two options i.e. one is to somehow increase the data rate of transmission or somehow send the less data. But it does not mean that less information should be sent or transmitted. Information must be complete at any cost.
Suppose you want to send some data to some other computer. We usually compress the file (using winzip) before sending. The receiver of the file decompresses the data before making its use. The other way is to increase the bandwidth. We may want to use the fiber cables or replace the slow modem with a faster one to increase the network speed. This way, we try to change the media of transmission to make the network faster. Now changing the media is the field of electrical or communication engineers. Nowadays, fiber optics is used to increase the transmission rate of data.
With the help of compression utilities, we can compress up to 70% of the data. How can we compress our file without losing the information? Suppose our file is of size 1 Mb and after compression the size will be just 300Kb. If it takes ten minutes to transmit the 1 Mb data, the compressed file will take 3 minutes for its transmission. You have also used the gif images, jpeg images and mpeg movie files. All of these standards are used to compress the data to reduce their transmission time. Compression methods are used for text, images, voice and other types of data.
We will not cover all of these compression algorithms here. You will study about algorithms and compression in the course of algorithm. Here, we will discuss a special compression algorithm that uses binary tree for this purpose. This is very simple and efficient method. This is also used in jpg standard. We use modems to connect to the internet. The modems perform the live data compression. When data comes to the modem, it compresses it and sends to other modem. At different points, compression is automatically performed. Let’s discuss Huffman Encoding algorithm.
Huffman code is method for the compression of standard text documents. It makes use of a binary tree to develop codes of varying lengths for the letters used in the original message. Huffman code is also part of the JPEG image compression scheme. The algorithm was introduced by David Huffman in 1952 as part of a course assignment at MIT.
Now we will see how Huffman Encoding make use of the binary tree. We will take a simple example to understand it. The example is to encode the 33-character phrase:
"traversing threaded binary trees"
In the phrase we have four words including spaces. There is a new line character in the end. The total characters in the phrase are 33 including the space. We have to send this phrase to some other computer. You know that binary codes are used for alphabets and other language characters. For English alphabets, we use ASCII codes. It normally consists of eight bits. We can represent lower case alphabets, upper case alphabets, 0,1,…9 and special symbols like $, ! etc while using ASCII codes. Internally, it consists of eight bits. If you have eight bits, how many different patterns you can be have? You can have 256 different patterns. In English, you have 26 lower case and 26 upper case alphabets. If you have seen the ASCII table, there are some printable characters and some unprintable characters in it. There are some graphic characters also in the ASCII table.
In our example, we have 33 characters. Of these, 29 characters are alphabets, 3 spaces and one new line character. The ASCII code for space is 32 and ASCII code for new line character is 10. The ASCII value for ‘a’ is 95 and the value of A is 65. How many bits we need to send 33 characters? As every character is of 8 bits, therefore for 33 characters, we need 33 * 8 = 264. But the use of Huffman algorithm can help send the same message with only 116 bits. So we can save around 40% using the Huffman algorithm.
Let’s discuss how the Huffman algorithm works. The algorithm is as:
  • List all the letters used, including the "space" character, along with the frequency with which they occur in the message.
  • Consider each of these (character, frequency) pairs to be nodes; they are actually leaf nodes, as we will see.
  • Pick the two nodes with the lowest frequency. If there is a tie, pick randomly amongst those with equal frequencies.
  • Make a new node out of these two, and turn two nodes into its children.
  • This new node is assigned the sum of the frequencies of its children.
  • Continue the process of combining the two nodes of lowest frequency till the time only one node, the root is left.
Let’s apply this algorithm on our sample phrase. In the table below, we have all the characters used in the phrase and their frequencies.
Original text:
traversing threaded binary trees
size: 33 characters (space and newline)

Letters : Frequency
NL : 1
SP : 3
a : 3
b : 1
d : 2
e : 5
g : 1
h : 1


The new line occurs only once and is represented in the above table as NL. Then white space occurs three times. We counted the alphabets that occur in different words. The letter a occurs three times, letter b occurs just once and so on.
Now we will make tree with these alphabets and their frequencies.
clip_image001[4]
We have created nodes of all the characters and written their frequencies along with the nodes. The letters with less frequency are written little below. We are doing this for the sake of understandings and need not to do this in the programming. Now we will make binary tree with these nodes. We have combined letter v and letter y nodes with a parent node. The frequency of the parent node is 2. This frequency is calculated by the addition of the frequencies of both the children.
In the next step, we will combine the nodes g and h. Then the nodes NL and b are combined. Then the nodes d and i are combined and the frequency of their parent is the combined frequency of d and i i.e. 4. Later, n and s are combined with a parent node. Then we combine nodes a and t and the frequency of their parent is 6. We continue with the combining of the subtree with nodes v and y to the node SP. This way, different nodes are combined together and their combined frequency becomes the frequency of their parent. With the help of this, subtrees are getting connected to each other. Finally, we get a tree with a root node of frequency 33.
clip_image002[4]
We get a binary tree with character nodes. There are different numbers with these nodes that represent the frequency of the character. So far, we have learnt how to make a tree with the letters and their frequency
Huffman encoding is used in data compression. Compression technique is employed while transferring the data. Suppose there is a word-document (text file) that we want to send on the network. If the file is, say, of one MB, there will be a lot of time required to send this file. However, in case of reduction of size by half through compression, the network transmission time also get halved. After this example, it will be quite easy to understand the Hoffman encoding to compress a text file.
We know that Huffman code is a method for the compression of standard text documents. It makes use of a binary tree to develop codes of varying lengths for the letters used in the original message. Huffman code is also a part of the JPEG image compression scheme. David Huffman introduced this algorithm in the year 1952 as part of a course assignment at MIT.
In the previous lecture, we had started discussing a simple example to understand Huffman encoding. In that example, we were encoding the 32-character phrase: "traversing threaded binary trees". If this phrase were sent as a message in a network using standard 8-bit ASCII codes, we would have to send 8*32= 256 bits. However, the Huffman algorithm can help cut down the size of the message to 116 bits.
In the Huffman encoding, following steps are involved:
  1. List all the letters used, including the "space" character, along with the frequency with which they occur in the message.
  2. Consider each of these (character, frequency) pairs as nodes; these are actually leaf nodes, as we will see later.
  3. Pick two nodes with the lowest frequency. If there is a tie, pick randomly amongst those with equal frequencies
  4. Make a new node out of these two and develop two nodes as its children.
  5. This new node is assigned the sum of the frequencies of its children.
  6. Continue the process of combining the two nodes of lowest frequency till the time, only one node, the root, is left.
In the first step, we make a list of all letters (characters) including space and end line character and find out the number of occurrences of each letter/character. For example we ascertain how many times the letter ‘a’ is found in the file and how many times ‘b’ occurs and so on. Thus we find the number of occurrences (i.e. frequency) of each letter in the text file.
In the step 2, we consider the pair (i.e. letter and its frequency) as a node. We will consider these as leaf nodes. Afterwards, we pick two nodes with the lowest frequency in the list. If there are more than one pairs of same frequency, we will choose a pair randomly amongst those with equal frequencies.
Suppose, in a file, the letter ‘a’ occurs 50 times and ‘b’ and ‘c’ five times each. Here, ‘b’ and ‘c’ have the lowest frequency. We will take these two letters as leaf nodes and build the tree from these ones. As fourth step states, we make a new node as the parent of these two nodes. The ‘b’ and ‘c’ are its children. In the fifth step, we assign to this new node the frequency equal to the sum of the frequencies of its children. Thus a three-node tree comes into existence. This is shown in the following figure.
clip_image001
We continue this process of combining the two nodes of lowest frequency till the time, only one node i.e. the root is left.
Now we come back to our example. In this example, there is a text string as written below.
traversing threaded binary trees
The size of this character string is 33 (it includes 3 space characters and one new line character). In the first step, we perform the counting of different characters in the string manually. We do not assign a fake or zero frequency to a letter that is not present in the string. A programmer may be concerned only with the characters/letters that are present in the text. We see that the letters and their frequencies in the above text is as given below.
Character frequency character frequency
NL 1 I 2
SP 3 n 2
A 3 r 5
B 1 s 2
D 2 t 3
E 5 v 3
G 1 y 1
H 1
Table 1: Frequency table
In the second step, we make nodes of these pairs of letters and frequencies. The following figure (fig 26.2) depicts the letters as nodes. We have written the frequency of each letter with the node. The nodes have been categorized with respect to the frequencies for simplicity. We are going to build the tree from downside i.e. from the lowest frequency.
Now according to third step, two nodes of lowest frequency are picked up. We see that nodes NL, b, g, h, v and y have the frequency 1. We randomly choose the nodes v and y. now, as the fourth step, we make a new node and join the leaf nodes v and y to it as its children. We assign the frequency to this new (parent) node equal to the sum of the frequencies of its children i.e. v and y. Thus in the fifth step; the frequency of this new node is 2. We have written no letter in this node as shown in the figure below.
clip_image002
2 is equal to sum
of the frequencies of
the two children nodes.
Now we continue this process with other nodes. Now we join the nodes g and h as children of a new node. The frequency of this node is 2 i.e. the sum of frequencies of g and h. After this, we join the nodes NL and b. This also makes a new node of frequency 2. Thus the nodes having frequency 1 have joined to the respective parent nodes. This process is shown in the following figure (Fig 26.3).
clip_image003
Now we come to the nodes with a frequency 2. Here we join the pair of nodes d and i and also the pair n and s. Resultantly, the two nodes coming into being after joining have the frequency 4. This frequency is the sum of the frequencies of their children. Now, we will bring the nodes, a and t together. The parent node of a and t has the frequency 6 i.e. the sum of a and t. These new nodes are shown in the following figure.
clip_image004
Now we consider these new nodes as inner nodes and build the tree upward towards the root. Now we take the node SP and join it with the node that is the parent of v and y. The resultant node has frequency 5 as the frequencies of its children are 2 and 5 respectively. Now we join these nodes of higher frequencies. In other words, the node r is joined with the newly created node of frequency 5 that is on the left side of node r in the figure 26.5. Thus a new node of frequency 10 is created. We join the node e and the node of frequency 4 on its right side. Resultantly, a new node of frequency 9 comes into existence. Then we join the nodes having frequency 4 and create a new node of frequency 8. The following figure shows the nodes created so far.
clip_image005
Now we will join the nodes of frequency 6 and 8 to create the node of frequency 14 and join the nodes of frequency of 9 and 10 to develop a new node of frequency of 19. At the end, we make the root node with the frequency 33 and it comprises nodes of frequency 14 and 19. Thus the tree is completed and shown in the following figure.
clip_image006
Now we will perform other steps of Hoffman encoding and develop character-encoding scheme needed for data compression.
To go ahead, we have to do the following steps.
  • Start at the root. Assign 0 to left branch and 1 to the right branch.
  • Repeat the process down the left and right subtrees.
  • To get the code for a character, traverse the tree from the root to the character leaf node and read off the 0 and 1 along the path.
We start from the root node of the tree and assign the value 0 to the left branch and 1 to the right branch. Afterwards, we repeat this value assigning at nodes in left and right subtrees. Thus all the paths have a value 0 or 1 as shown in the following figure. We will develop the code with the help of these path values.
clip_image007clip_image008
In the last step, we get the code for the characters. To get the code for a character, there is need of traversing the tree from the root to the character leaf node and read off the 0 and 1 along the path. We start from the root and go to the letter of leaf node following the edges. 0 and 1 are written down in the order in which they come in this traversal to leaf node. For example, we want the code of letter d. To reach d from the root; we have to go through the nodes of frequency 14. This path has value 0. Here, 0 will be the first digit in the code of d. From 14, we go to node of frequency 8. This link (from 14 to 8) has value 1. Thus the second digit in code is 1. From node of frequency 8, we go to node of frequency 4 to its left side. This side has value 0, meaning that 0 is the third digit of code. From 4, we finally go to d and in this link we get the value 0. Thus we see that to reach at d from the root, we have gone through the branches 0, 1, 0 and 0. Thus, the code of letter d is 0100. Similarly the code of i is 0101. The same way, we find the code for each letter in the tree. The following table shows the letters and their correspondent codes.
character code character code
NL 10000 i 0101
SP 1111 n 0110
a 000 r 110
b 10001 s 0111
d 0100 t 001
e 101 v 11100
g 10010 y 11101
h 10011
Table 2: Hoffman code table
We know that every character is stored in the computer in binary format. Each character has a code, called ASCII code. The ASCII code of a character consists of ones and zeros i.e. it is in binary form. ASCII code is of eight bits. Normally, we remember the decimal value of the characters. For example, letter ‘A’ has decimal value 65. We can easily convert the decimal value into binary one in bit pattern. We need not to remember these values. ASCII value of any character can be found from the ASCII table. The ASCII code of each character is represented in eight bits with different bit patterns.
Here in the example, we assign a code of our own (i.e. Hoffman code) to the letters that are in the message text. This code also consists of ones and zeros. The above table shows the characters and their Hoffman code. Now we come back to the message text from which we developed the tree and assigned codes to the letters in the text.
Look at the table (Table 2) shown above. Here we notice that the code of letters is of variable length. We see that letters with higher frequency have shorter code. There are some codes with a length five that are the codes of NL, b, g, h, v and y. Similarly we see that the letters SP, d, i, n and s have codes of length four. The codes of the remaining letters have length three. If we look at the frequency table (Table 1) of these letters, we see that the letters with some higher frequency like a, e, r, and t, have the code of shorter length, whereas the letters of lower frequency (like NL, b, g, h, v and y) have codes of larger length.
We see in the table of the Hoffman codes of the letters that there will be need of 5, 4 or in some codes only 3 bits to represent a character, whereas in ASCII code, we need 8 bits for each character. Now we replace the letters in the text message with these codes. In the code format i.e. in the form of ones and zeros, the message becomes as under.
Our original message was
traversing threaded binary trees
The encoded form of the message is as under
t r a v e r s i n g t
00111000011100101110011101010110100101111001
100111101010000100101010011111000101010110000
110111011111001110101101011110000


We see that there are only ones and zeros in the code of message. Some letters have been shown with their corresponding code. The first three bits 001 are for letter ‘t’. The next three bits 110 are for letter ‘r’. After this the letter ‘a’ has three bits i.e. 000. Next to it is the letter ‘v’ that has five bits. These bits are 11100 (shaded in the figure). Similarly we have replaced all the letters of the message with their corresponding Hoffman code. The encoded message is shown above.
Let’s compare this Hoffman encoded message with the encoding of message with ASCII code. The ASCII code encoding means that if we have encoded this message with 8 bits per character, the total length of message would be 264. As there are 33 characters, so the total length is 33 x 8 = 264. But in the Hoffman encoded message (written in above table), there are only 120 bits. This number of bits is 54% less than the number of bits in ASCII encoding form. Thus we can send the message with almost half of the original length to a receiver.
Here the Hoffman encoding process comes to an end. In this process, we took a message, did the frequency count of its characters and built a tree from these frequencies. From this tree, we made codes of letters consisting of ones and zeros only. Then with the help of these codes, we did the data compression. In this process we saw that less data is required for the same message.
Now an important thing to be noted is regarding the tree building. We have built the tree with our choices of nodes with same and different frequencies. Now if we have chosen the nodes to join in different way, the tree built would be in a different form. This results in the different Hoffman code for the letters. These will be in ones and zeros but with different pattern. Thus the encoded message would have different codes for the letters. Now the question arises how does the receiver come to know that what code is used for what letter? The answer is very simple that the sender has to tell the receiver that he is using such codes for letters. One way to tackle this problem is that the sender sends the tree built through the use of frequencies, to the receiver to decode the message. The receiver keeps a copy of this tree. Afterwards, when the sender sends a message, the receiver will match it with respect to bit pattern with the tree to see that what letter the sender has sent. The receiver will find the letter for the first 2, 3, 4 or what numbers of bits match to a letter that it has in the tree as the receiver also has a copy of the tree. We know that a tree has a root, inner nodes and leaf node(s). The leaf node is a node whose left and right links is NULL. An inner node has a left or right or both children. Now consider that the receiver has the same tree data structure that the sender used to construct the codes. The sender has sent the message that we have discussed above. The sender sends the 122 bits encoded message to the receiver. The receiver will take the first bit of message and being on the root of the tree, it will decide that on what side this bit will go. If the bit is zero, it will go to the left of the root. However, if the bit is one, it will go to the right side of the root before reaching to the child node. The next bit will go to the next level of the tree to the left or right side depending on zero or one. Thus the traversal will go one level down on receiving a bit. If we (the receiver) are on a path of the tree where the leaf node is at level 6, it cannot reach the leaf node unless the receiver receives 6 bits. We consider the case of letter ‘e’ whose code was of three bits. In this case, we go to the first level by first bit and the second bit takes us to the third level. Finally on reaching the third level with the third bit, we get the leaf node. We know that this node is letter ‘e’ as the sender has sent us (as receiver) the whole tree with the characters in the nodes. Thus after traversing the three bits, the receiver has confirmed that it has received the letter ‘e’. Now on receiving the fourth bit, the receiver will go back to the root and continue to choose the path i.e. on zero it will go to the left and on one it will go to right. This way, it will reach the leaf node. The character at that node will be the next character of the message. The bit pattern that was comprised of following these links in the tree is extracted from the message. On the next bit, the receiver again goes to the root node and the previous procedure is repeated to find the next character. This way, it decodes the whole message.
The compression is very useful technique especially for communication purposes. Suppose that we have to send a file of one Mb. Here each line in the file can be compressed to 50 or 60 percent. Thus the file of one MB will be compressed to half MB and can be sent more easily.
There is one more thing about this tree. When we started to build the tree from leaf nodes i.e. bottom-up build of tree, we saw that there were choices for us to choose any two leaf nodes to join. In our example, we chose them at random. The other way to do it is the priority queue. We have seen the example of bank simulation. There we used a priority queue for the events. We know that in a priority queue, the elements do not follow the FIFO (first in first out) rule. But the elements have their position in the queue with respect to a priority. In the example of bank simulation, in the Event Queue, we remove the element from the queue that was going to occur first in the future. We can use priority queue here in such a way that we put the letters in the queue with respect to their frequencies. We will put and remove letters from the queue with respect to their frequency. In this priority queue, the character with lowest frequency is at the start of the queue. If two characters have the same frequency, these will be one after the other in the queue. The character with the larger frequency will be in the last of the queue. Now we take two frequencies from the queue and join them to make a new node. Suppose that the nodes that we joined have frequency 1 and 2 respectively. So the frequency of the new node will be 3. We put this new node in the queue. It takes its position in the queue with respect to its frequency as we are using the priority queue. It is evident in procedure that we proceed to take two nodes form the queue. These are removed and their parent node goes back to the queue with a new frequency. This procedure goes on till the time the queue is empty. The last node that becomes in the queue in the result of this procedure is the root node. This root node has frequency 33 if we apply this procedure to our previous example.
Let’s talk about some general things related to Hoffman encoding. We use modems to connect to Internet. These modems do the compression. The modem has a compression feature. The modem has a chip that performs the task of compression. When we give a sentence of say 80 characters to the modem to send, the modem makes a Hoffman encoded tree of these characters. Then it will make codes from this tree. We know that there is also a modem on the other end that will decode this message. The sender modem will send the tree structure to the receiver. Now the sender modem will send the data in compressed form. The receiving modem will decode this compressed data by using the Hoffman encoded tree. Now the question arises, will these codes be useful for other messages? In our example message the letter ‘y’ has lower frequency and code of five bits. It may happen that in some messages ‘y’ has higher frequency, needing code of less number of bits. To solve this problem, the modems (sender and receiver) revise the codes after a certain time period or after a particular number of bytes. In this revision, they build a new Hoffman tree and exchange it to use it for communication for the next time period or for the next fixed number of bytes.
There is some other compression techniques/algorithms for compression. The zip routines like winzip and the jpeg, mpeg and other image formatting routines use different algorithms for compression. We will read the compression algorithm in detail in the course of Algorithms.

Expression tree

Trees are used in many other ways in the computer science. Compilers and database are two major examples in this regard. In case of compilers, when the languages are translated into machine language, tree-like structures are used. We have also seen an example of expression tree comprising the mathematical expression. Let’s have more discussion on the expression trees. We will see what are the benefits of expression trees and how can we build an expression tree. Following is the figure of an expression tree.

clip_image001
In the above tree, the expression on the left side is a + b * c while on the right side, we have d * e + f * g. If you look at the figure, it becomes evident that the inner nodes contain operators while leaf nodes have operands. We know that there are two types of nodes in the tree i.e. inner nodes and leaf nodes. The leaf nodes are such nodes which have left and right subtrees as null. You will find these at the bottom level of the tree. The leaf nodes are connected with the inner nodes. So in trees, we have some inner nodes and some leaf nodes.
In the above diagram, all the inner nodes (the nodes which have either left or right child or both) have operators. In this case, we have + or * as operators. Whereas leaf nodes contain operands only i.e. a, b, c, d, e, f, g. This tree is binary as the operators are binary. We have discussed the evaluation of postfix and infix expressions and have seen that the binary operators need two operands. In the infix expressions, one operand is on the left side of the operator and the other is on the right side. Suppose, if we have + operator, it will be written as 2 + 4. However, in case of multiplication, we will write as 5*6. We may have unary operators like negation (-) or in Boolean expression we have NOT. In this example, there are all the binary operators. Therefore, this tree is a binary tree. This is not the Binary Search Tree. In BST, the values on the left side of the nodes are smaller and the values on the right side are greater than the node. Therefore, this is not a BST. Here we have an expression tree with no sorting process involved.
This is not necessary that expression tree is always binary tree. Suppose we have a unary operator like negation. In this case, we have a node which has (-) in it and there is only one leaf node under it. It means just negate that operand.
Let’s talk about the traversal of the expression tree. The inorder traversal may be executed here.
clip_image002
We use the inorder routine and give the root of this tree. The inorder traversal will be a+b*c+d*e+f*g. You might have noted that there is no parenthesis. In such expressions when there is addition and multiplication together, we have to decide which operation should be performed first. At that time, we have talked about the operator precedence. While converting infix expression into postfix expression, we have written a routine, which tells about the precedence of the operators like multiplication has higher precedence than the addition. In case of the expression 2 + 3 * 4, we first evaluate 3 * 4 before adding 2 to the result. We are used to solve such expressions and know how to evaluate such expressions. But in the computers, we have to set the precedence in our functions.
We have an expression tree and perform inorder traversal on it. We get the infix form of the expression with the inorder traversal but without parenthesis. To have the parenthesis also in the expressions, we will have to do some extra work.
Here is the code of the inorder routine which puts parenthesis in the expression.

/* inorder traversal routine using the parenthesis */
void inorder(TreeNode<int>* treeNode)
{
if( treeNode != NULL ){
cout << "(";
inorder(treeNode->getLeft());
cout << ")";
cout << *(treeNode->getInfo());
cout << "(";
inorder(treeNode->getRight());
cout << ")";
}
}


This is the same inorder routine used by us earlier. It takes the root of the tree to be traversed. First of all, we check that the root node is not null. In the previous routine after the check, we have a recursive call to inorder passing it the left node, print the info and then call the inorder for the right node. Here we have included parenthesis using the cout statements. We print out the opening parenthesis ‘(‘before the recursive call to inorder. After this, we close the parenthesis. Then we print the info of the node and again have opening parenthesis and recursive call to inorder with the right node before having closing parenthesis in the end. You must have understood that we are using the parenthesis in a special order. We want to put the opening parenthesis before the start of the expression or sub expression of the left node. Then we close the parenthesis. So inside the parenthesis, there is a sub expression.
On executing this inorder routine, we have the expression as ( a + ( b * c )) + ((( d * e ) + f ) * g ).
clip_image003
We put an opening parenthesis and start from the root and reach at the node ‘a’. After reaching at plus (+), we have a recursive call for the subtree *. Before this recursive call, there is an opening parenthesis. After the call, we have a closing parenthesis. Therefore the expression becomes as (a + ( b * c )). Similarly we recursively call the right node of the tree. Whenever, we have a recursive call, there is an opening parenthesis. When the call ends, we have a closing parenthesis. As a result, we have an expression with parenthesis, which saves a programmer from any problem of precedence now.
Here we have used the inorder traversal. If we traverse this tree using the postorder mode, then what expression we will have? As a result of postorder traversal, there will be postorder expression.
clip_image004
This is the same tree as seen by us earlier. Here we are performing postorder traversal. In the postorder, we print left, right and then the parent. At first, we will print a. Instead of printing +, we will go for b and print it. This way, we will get the postorder traversal of this tree and the postfix expression of the left side is a b c * + while on the right side, the postfix expression is d e * f + g * +. The complete postfix expression is a b c * + d e * f + g * +. The expression undergoes an alteration with the change in the traversal order. If we have some expression tree, there may be the infix, prefix and postfix expression just by traversing the same tree. Also note that in the postfix form, we do not need parenthesis.
Let’s see how we can build this tree. We have some mathematical expressions while having binary operators. We want to develop an algorithm to convert postfix expression into an expression tree. This means that the expression should be in the postfix form. In the start of this course, we have seen how to covert an infix expression into postfix expression. Suppose someone is using a spreadsheet program and typed a mathematical expression in the infix form. We will first convert the infix expression into the postfix expression before building expression tree with this postfix expression.
We already have an expression to convert an infix expression into postfix. So we get the postfix expression. In the next step, we will read a symbol from the postfix expression. If the symbol is an operand, put it in a tree node and push it on the stack. In the postfix expression, we have either operators or operands. We will start reading the expression from left to right. If the symbol is operand, make a tree node and push it on the stack. How can we make a tree node? Try to memorize the treeNode class. We pass it some data and it returns a treeNode object. We insert it into the tree. A programmer can also use the insert routine to create a tree node and put it in the tree.
Here we will create a tree node of the operand and push it on the stack. We have been using templates in the stack examples. We have used different data types for stacks like numbers, characters etc. Now we are pushing treeNode on the stack. With the help of templates, any kind of data can be pushed on the stack. Here the data type of the stack will be treeNode. We will push and pop elements of type treeNode on the stack. We will use the same stack routines.
If symbol is an operator, pop two trees from the stack, form a new tree with operator as the root and T1 and T2 as left and right subtrees and push this tree on the stack. We are pushing operands on the stacks. After getting an operator, we will pop two operands from the stack. As our operators are binary, so it will be advisable to pop two operands. Now we will link these two nodes with a parent node. Thus, we have the binary operator in the parent node.
Let’s see an example to understand it. We have a postfix expression as a b + c d e + * *. If you are asked to evaluate it, it can be done with the help of old routine. Here we want to build an expression tree with this expression. Suppose that we have an empty stack. We are not concerned with the internal implementation of stack. It may be an array or link list.
First of all, we have the symbol a which is an operand. We made a tree node and push it on the stack. The next symbol is b. We made a tree node and pushed it on the stack. In the below diagram, stack is shown.
clip_image005
Our stack is growing from left to right. The top is moving towards right. Now we have two nodes in the stack. Go back and read the expression, the next symbol is + which is an operator. When we have an operator, then according to the algorithm, two operands are popped. Therefore we pop two operands from the stack i.e. a and b. We made a tree node of +. Please note that we are making tree nodes of operands as well as operators. We made the + node parent of the nodes a and b. The left link of the node + is pointing to a while right link is pointing to b. We push the + node in the stack.
clip_image006

Actually, we push this subtree in the stack. Next three symbols are c, d, and e. We made three nodes of these and push these on the stack.
clip_image007

Next we have an operator symbol as +. We popped two elements i.e. d and e and linked the + node with d and e before pushing it on the stack. Now we have three nodes in the stack, first + node under which there are a and b. The second node is c while the third node is again + node with d and e as left and right nodes.
clip_image008
The next symbol is * which is multiplication operator. We popped two nodes i.e. a subtree of + node having d and e as child nodes and the c node. We made a node of * and linked it with the two popped nodes. The node c is on the left side of the * node and the node + with subtree is on the right side.
clip_image009
The last symbol is the * which is an operator. The final shape of the stack is as under:
clip_image010
In the above figure, there is a complete expression tree. Now try to traverse this tree in the inorder. We will get the infix form which is a + b * c * d + e. We don’t have parenthesis here but can put these as discussed earlier.
This is the way to build an expression tree. We have used the algorithm to convert the infix form into postfix form. We have also used stack data structure. With the help of templates, we can insert any type of data in the stack. We have used the expression tree algorithm and very easily built the expression tree.
In the computer science, trees like structures are used very often especially in compilers and processing of different languages.
 

Deletion in AVL Tree

 
The deletion of a data item from a data structure has always been difficult whatever data structure we use. The deletion is the inverse of insertion. In deletion there is a given value x and an AVL tree T. We delete the node containing the value x and rebalance the tree if it becomes unbalance after deleting the node. We will see that the deletion of a node is considerably more complex than the insertion of a node. It is complex in the sense that we may have to do more than one rotations to rebalance the tree after deleting a node. We will discuss the deletion case by case and will see that about what points we have to take care in the deletion process.
We are not going to write the code for deletion here. If we have to use AVL tree routines or class, the deletion and insertion routines of AVL tree are available in standard library. We can use these routines in our program. We have no need to write these routines. But here we discuss these to understand their functionality.
We know that insertion in a height-balanced tree requires at most one single rotation or one double rotation. We do this rotation at the node whose balance violates the AVL condition. We can use rotations to restore the balance when we do a deletion. If the tree becomes unbalance after deleting a node then we use rotations to rebalance it. We may have to do a rotation at every level of the tree. Thus in the worst case of deletion we have to do log2 N rotations. As log2 N is the number of levels of a tree of N nodes.
Let’s consider an example of deleting a node from a tree. In this example, we will discuss the worst case of deletion that is we have to do rotation at each level after deleting a node. Look at the following figure i.e. Fig 23.8. In this figure the root of the tree is node N and we have expanded only the left subtree of this node. The right subtree of it is indicated by a triangle. We focus on the left subtree of N. The balance of each non-leaf node of the left subtree of N is –1. This means that at every non-leaf node the depth/height of left subtree is one shorter than the height of right subtree. For example look at the node C. The left subtree of C is one level deep where as it’s right subtree is two levels deep. So balance of it is 1 – 2 = -1. If we look at node I its left subtree has height 2 as there are two levels where nodes G and H exists. The right subtree of I has number of levels (i.e. height) 3 where exists the nodes K, L and M respectively. Thus the balance of I is 2 – 3 = -1. Similarly we can see that other nodes also have the balance –1. This tree is shown in the following figure.
clip_image001
Here in this tree, the deletion of node A from the left subtree causes the worst case of deletion in the sense that we have to do a large number of rotations. Now we delete the node A from the tree. The effect of this deletion is that if we consider the node C the height of its left subtree is zero now. The height of the right subtree of C is 2. Thus the balance of C is 2 now. This makes the tree unbalance. Now we will do a rotation to make the tree balanced. We rotate the right subtree of C that means the link between C and D so that D comes up and C goes down. This is mentioned in the figure below.
clip_image002
After this rotation the tree has transformed into the following figure. Now D becomes the left child of F and C becomes the left child of D.
clip_image003
By looking at the inorder traversal of the tree, we notice that it is preserved. The inorder traversal of the tree before rotation (i.e. fig 23.8) is C D E F G H I J K L M N. Now if we traverse the tree after rotation (i.e. fig 23.9) by inorder traversal we get C D E F G H I J K L M N, which is the same as it was before rotation.
After this rotation we see that the tree having D as root is balanced. But if we see the node F we notice that the height of its left subtree is 2 and height of its right subtree is 4. Thus the balance of F is –2 (or 2 if we take the absolute value) now. Thus the tree becomes unbalance. So we have to do rotation again to balance the tree. The whole left subtree of F is shorter so we do the left rotation on the link of F and I (in the tree in fig 23.9) to bring F down and I up so that the difference of height could be less. After rotation the tree gets the new form that is shown in the figure below.
clip_image004
Here we see that the nodes G and H, which were in the left subtree of I, now become the right subtree of F. We see that the tree with I as root is balanced now. Now we consider the node N. We have not expanded the right subtree of N. Although we have not shown but there may be nodes in the right subtree of N. If the difference of heights of left and right subtree of N is greater than 1 then we have to do rotation on N node to balance the tree.
Thus we see that there may be such a tree that if we delete a node from it we have to do rotation at each level of the tree. We notice that we have to do more rotations in deletion as compared to insertion. In deletion when we delete a node we have to check the balance at each level up to the root. We do rotation if any node at any level violates the AVL condition. If nodes at a level do not violate AVL condition then we do not stop here we check the nodes at each level and go up to the root. We know that a binary tree has log2 N levels (where N is total number of nodes) thus we have to do log2 N rotations. We have to identify the required rotations that mean we have to identify that which one rotation out of the four rotations (i.e. single left rotation, single right rotation, double right-left rotation and double left-right rotation) we have to do. We have to identify this at each level.
We can summarize the deletion procedure as under.
Delete the node as in binary search tree. We have seen in the discussion of deleting a node from a BST that we have three cases, which we discussed as follows
Case I: The node to be deleted is the leaf node i.e. it has no right or left child. It is very simple to delete such node. We make the pointer in the parent node pointing to this node as NULL. If the memory for the node has been dynamically allocated, we will release it.
Case II: The node to be deleted has either left child (subtree) or right child (subtree).
In this case we bypass the node in such a way that we find the inorder successor of this node and then link the parent of the node to be deleted to this successor node. Thus the node was deleted.
Case III: The node to be deleted has both the left and right children (subtree). This is the most difficult case. In this case we find the inorder successor of the node. The left most node of the right subtree will be the inorder successor of it. We put the value of that inorder successor node into the node to be deleted. After it we delete the inorder successor node recursively.
In deletion in AVL tree, we delete the node as we delete it in a BST. In third case of deletion in BST we note that the node deleted will be either a leaf or have just one subtree (that will be the right subtree as node deleted is the left most subtree so it cannot have a left subtree). Now we are talking about deletion in an AVL tree. Since this is an AVL tree so if the deleted node has one subtree that subtree contains only one node. Why it is so? Think about its reason and answer.
After deleting the node we traverse up the tree from the deleted node checking the balance of each node at each level up to the root. Thus the deletion in AVL tree is like the deletion in BST except that here in AVL tree we have to rebalance the tree using rotations.

Cases of Deletion in AVL Tree

Now let’s consider the cases of deletion that will help to identify what rotation will be applied at what point.
There are 5 cases to consider. Let’s go through the cases graphically and determine what actions are needed to be taken. We will not develop the C++ code for deleteNode in AVL tree. This is left as an exercise.
Case 1a:
The first case is that the parent of the deleted node had a balance of 0 and the node was deleted in the parent’s left subtree.
In the following figure (Fig 23.12) the left portion shows the parent node with a horizontal line in it. This horizontal line indicates that the left and right subtrees of this node have the same heights and thus the balance of this node is 0. When we delete a node from its left subtree then height of its right subtree becomes larger than the left subtree. The right portion in the figure shows this. We are showing a symbol instead of balance of node inside the node. The notation (symbol) in the node indicates that the height of left subtree is shorter than the height of right subtree of the node.
clip_image005
Now the action that will be taken in this case is that, change the balance of the parent node and stop. No further effect on balance of any higher node. There is no need of rotation in this case. Thus it is the easiest case of deletion.
Let’s consider an example to demonstrate the above case.
Consider the tree shown in the following figure i.e. Fig 23.13. This is a perfectly balanced tree. The root of this tree is 4 and nodes 1, 2 and 3 are in its left subtree. The nodes 5, 6 and 7 are in the right subtree.
clip_image006
Consider the node 2. We have shown the balance of this node with a horizontal line, which indicates that the height of its left subtree is equal to that of its right subtree. Similarly we have shown the balance of node 4.
Now we remove the node 1 which is the left subtree of node 2. After removing the left child (subtree) of 2 the height of left subtree of 2 is 0. The height of right subtree of 2 is 1 so the balance of node 2 becomes –1. This is shown in the figure by placing a sign that is down ward to right side, which indicates that height of right subtree is greater than left subtree. Now we check the node 4. Here the height of left subtree of it is still 2. The height of its right subtree is also 2. So balance of the node 4 is 0 that is indicated by the small horizontal line (minus sign) in the figure below. Here we don’t need to change the balance of 4.
clip_image007
clip_image008clip_image009clip_image010clip_image011clip_image012
clip_image013 clip_image014
clip_image015
Fig 23.14: Tree before and after deleting node 1
Case 1b:
This case is symmetric to case 1a. In this case, the parent of the deleted node had a balance of 0 and the node was deleted in the parent’s right subtree. The following figure shows that the balance of the node was zero as left and right subtree of it have the same heights.
clip_image016
After removing the right child the balance of it changes and it becomes 1, as the height of its left subtree is 1 greater than the height of right subtree. The action performed in this case is the same as that in case 1a. That is change the balance of the parent node and stop. No further effect on balance of any higher node.
The previous example can be done for this case only with the change that remove the right child of node 2 i.e. 3.
Case 2a:
This is the case where the parent of the deleted node had a balance of 1 and the node was deleted in the parent’s left subtree. This means that the height of left subtree of the parent node is 1 greater than the height of its right subtree. It is shown in the left portion of the figure below.
clip_image017 clip_image017[1] clip_image017[2] clip_image017[3]
clip_image018
Now if we remove a node from the left subtree of this node then the height of left subtree will decrease by 1 and get equal to the height of right subtree. Thus the balance of this node will be zero. In this case, the action that we will perform to balance the tree is that change the balance of the parent node. This deletion of node may have caused imbalance in higher nodes. So it is advised to continue up to the root of the tree. We will do rotation wherever the balance of the node violates the AVL condition.
There are five cases to consider while deleting a node of an AVL tree. When a node is deleted, the tree can become unbalanced. We calculate the balance factor of each node and perform rotation for unbalanced nodes. But this rotation can prolong to the root node. In case of insertion, only one node’s balance was adjusted as we saw in previous lectures but in case of deletion, this process of rotation may expand to the root node. However, there may also be cases when we delete a node and perform no or one rotation only.
Now, we will see the five cases of deletion. A side note is that we are not going to implement these cases in C++ in this lecture, you can do it yourself as an exercise with the help of the code given inside your text book. In this lecture, the emphasis will be on the deletion process and what necessary actions we take when a node is required to be deleted from an AVL tree. Actually, there are two kinds of actions taken here, one is deletion and the other one is the rotation of the nodes.
Case 1a: The parent of the deleted node had a balance of 0 and a node was deleted in the parent’s left subtree.
clip_image001[4]
In the left tree in the Fig 24.1, the horizontal line inside the tree node indicates that the balance is 0, the right and left subtrees of the node are of equal levels. Now, when a node is deleted from the left subtree of this node, this may reduce by one level and cause the balance of the right subtree of the node to increase by 1 relatively. The balance of the node in favor of the right subtree is shown by a triangular knob tilted towards right. Now, the action required in this case to make the tree balanced again is:
Change the balance of the parent node and stop. There is no further effect on balance of any higher node.
In this case, the balance of the tree is changed from 0 to –1, which is within the defined limits of AVL tree, therefore, no rotation is performed in this case.
Below is a tree in which the height of the left subtree does not change after deleting one node from it.
clip_image003[4]
The node 4 is the root node, nodes 2 and 6 are on level 1 and nodes 1, 3, 5, 7 are shown on level 2. Now, if we delete the node 1, the balance of the node 2 is tilted towards right, it is –1. The balance of the root node 4 is unchanged as there is no change in the number of levels within right and left subtrees of it. Similarly, there is no change in the balances of other nodes. So we don’t need to perform any rotation operation in this case.
Let’s see the second case.
Case 1b: the parent of the deleted node had a balance of 0 and the node was deleted in the parent’s right subtree.
clip_image004[4]
On the left of Fig 24.3, the tree is within the balance limits of AVL. After a node is deleted from the right subtree of it. The balance of the tree is tilted towards left as shown in the right tree show in the Fig 24.3. Now, we see what action will be required to make the tree balanced again.
Change the balance of the parent node and stop. No further effect on balance of any higher node (same as 1a).
So in this case also, we don’t need to perform rotation as the tree is still an AVL (as we saw in the Case 1a). It is important to note that in both of the cases above, the balance of the parent node was 0. Now, we will see the cases when the balance of the parent node is not 0 previously.
Case 2a: The parent of the deleted node had a balance of 1 and the node was deleted in the parent’s left subtree.
clip_image005[4]
In the Fig 24.4 above, the tree on the left contains the balance factor as 1, which means that the left subtree of the parent node is one level more than the number of levels in the right subtree of it. When we delete one node from the left subtree of the node, the height of the left subtree is changed and the balance becomes 0 as shown in the right side tree of Fig 24.4. But it is very important understand that this change of levels may cause the change of balance of higher nodes in the tree i.e.
Change the balance of the parent node. May have caused imbalance in higher nodes so continue up the tree.
So in order to ensure that the upper nodes are balanced, we calculate their balance factors for all nodes in higher levels and rotate them when required.
Case 2b: The parent of the deleted node had a balance of -1 and the node was deleted in the parent’s right subtree.
Similar to the Case 2a, we will do the following action:
Change the balance of the parent node. May have caused imbalance in higher nodes so continue up the tree.
Now, we see another case.
Case 3a:The parent had balance of -1 and the node was deleted in the parent’s left subtree, right subtree was balanced.
clip_image007[4]
As shown in the left tree in Fig 24.5, the node A is tilted towards right but the right subtree of A (node B above) is balanced. The deleted node lies in the left subtree of the node A. After deletion, the height of the left subtree is changed to h-1 as depicted in the right tree of above figure. In this situation, we will do the following action:
Perform single rotation, adjust balance. No effect on balance of higher nodes so stop here.
After single left rotation, we have the tree as shown in the figure below.
clip_image009[4]
Node A has become the left subtree of node B and node 2 left subtree of node B has become the right subtree of node A. The balance of node B is tiled towards left and balance of node A is tilted towards right but somehow, both are within AVL limits. Hence, after a single rotation, the balance of the tree is restored in this case.
Case 4a: Parent had balance of -1 and the node was deleted in the parent’s left subtree, right subtree was unbalanced.
clip_image011[4]
In the last case 3a, the right subtree of node A was balanced. But in this case, as shown in the figure above, the node C is tilted towards left. The node to be deleted lies in the left subtree of node A. After deleting the node the height of the left subtree of node A has become h-1. The balance of the node A is shown tilted towards right by showing two triangular knobs inside node A. So what is the action here.
Double rotation at B. May have affected the balance of higher nodes, so continue up the tree.
By performing double rotation in the tree above, we have the following tree.
clip_image013[4]
Node A, which was the root node previously, has become the left child of the new root node B. Node C, which was the right child of the root node C has now become the right child of the new root node B.
Case 5a: The parent had balance of -1 and the node was deleted in the parent’s left subtree, right subtree was unbalanced.
clip_image015[4]
In the figure above, the right tree of the node B has a height of h-1 while the right subtree is of height h. When we remove a node from the left subtree of node A, the new tree is shown on the right side of Fig 24.9. The subtree 1 has height h-1 now, while subtrees 2 and 3 have the same heights. So the action we will do in this case is:
Single rotation at B. May have effected the balance of higher nodes, so continue up the tree.
clip_image017[10]
These were the five cases of deletion of a node from an AVL tree. Until now, we are trying to understand the concept using the figures. You might have noticed the phrase ‘continue up the tree’ in the actions above. How will we do it? One way is to maintain the pointer of the parent node inside each node. But often the easiest method when we go in downward direction and then upward is recursion. In recursion, the work to be done later is pushed on to the stack and we keep on moving forward until at a certain point we back track and do remaining work present in the stack. We delete a node when we reach at the desired location and then while traversing back, do the rotation operation on our way to the root node.
Symmetrical to case 2b, we may also have cases 3b, 4b and 5b. This should not be a problem in doing it yourself.
Other Uses of Binary Trees
A characteristic of binary trees is that the values inside nodes on the left of a node are smaller than the value in the node. And the values inside the nodes on the right of a node are greater than the value in the node. This is the way a binary tree is constructed.
Whatever is the size of the tree, the search is performed after traversing upto log2n levels maximum.
We have observed that the binary tree becomes a linked list and it can become shallow. The AVL conditions came into picture to control the height balance of a binary tree. While searching in an AVL tree, in the worst case scenario we have to search 1.44 log2n levels. For searches, binary and AVL trees are the most efficient but we do have some other kinds of trees that will be studied later.
Lets see what could be some other uses of binary trees, we start our discussion with Expression Trees.

Expression Trees

Expression trees, the more general parse trees and abstract syntax trees are significant components of compilers.
We already know about compilers that whenever we write our program or code in some computer language like C++ or Java. These programs are compiled into assembly language or byte code (in case of Java). This assembly language code is translated converted into machine language and an executable file is made by another program called the assembler.
By the way, if you have seen your syllabus, you might have seen that there is a dedicated subject on compilers. We study in detail about compilers in that course. For this course, we will see expression or parse trees.
We will take some examples of expression trees and we will not delve into much depth of it rather that would be an introduction to expression trees.
clip_image019
You can see the infix expression above (a + b * c) + ( (d * e + f) * g), it is represented in the tree form also.
You can see from bottom of the tree that the nodes b and c in the nodes are present at the same level and their parent node is multiplication (*) symbol. From the expression also, we see that the b and c are being multiplied. The parent node of a is + and right subtree of + is b*c. You might have understood already that this subtree is depicting a+b*c. On the right side, node d and e are connected to the parent *. Symbol + is the parent of * and node f. The expression of the subtree at node + is d*e+f. The parent of node + is * node, its right subtree is g. So expression of the subtree at this node * is (d*e+f)*g). The root node of the tree is +.
These expression trees are useful in compilers and in spreadsheets also, they are sometimes called parse trees.
Parse Tree in Compilers
See the expression tree of expression A := A + B * C below. We are adding B and C, adding the resultant in A and then finally assigning the resultant to A.
clip_image020
The root node in the parse tree shown above is <assign>.
The assignment statement (<assign>) has three parts. On the left of it, there is always an identifier (single or an array reference) called l-value. The l-value shown in the tree above is <id> and the identifier is A in the expression above. The second part of assignment statement is assignment operator (= or :=). One the right of assignment operator lies the third part of assignment statement, which is an expression. In the expression A := A + B * C above , the expression after assignment operator is A + B * C. In the tree, it is represented by the node <expr>. The node <expr> has three subnodes: <expr>, + and <term>. <expr>’s further left subtree is <expr>, <term>, <factor>, <id> and then finally is B. The right subchild <term> has further subnodes as <term>, * and <factor>. <factor> has <id> as subchild and <id> has C.
Note the nodes in gray shade in the tree above form A = A + B * C.
Compiler creates these parse trees. We will see how to make these trees, when we will parse any language tree like C++. Parsing mean to read and then extract the required structure. A compiler parses a computer language like C++ to form parse trees. Similarly, when we do speech recognition. Each sentence is broken down into a certain structure in form of a tree. Hand writing recognition also involves this. The tablet PCs these days has lot of implementation of parse trees.

Parse Tree for an SQL Query

Let’s see another example of parse trees inside databases. The parse trees are used in query processing. The queries are questions to the databases to see a particular kind of data. Consider you have a database for a video store that contains data of movies and artists etc. You are querying that database to find the titles of movies with stars born in 1960. The language used to query from the database is called SQL (Structured Query Language), this language will be dealt in depth in the databases course. The tables lying in this movies database are:
StarsIn(title, year, starName)
MovieStar(name, address, gender, birthdate)
The following SQL query is used to retrieve information from this database:
SELECT title
FROM StarsIn, MovieStar
WHERE starName = name AND birthdate LIKE ‘%1960’ ;
This query is asking to retrieve the titles of those movies from StarsIn and MovieStar tables where the birthdate of an actor is 1960. We see in query processing a tree is formed as shown in the figure below:
clip_image022
The root node is Query. This node is further divided into SELECT, <SelList>, FROM, <FromList>, WHERE and <Condition> subnodes. <SelList> will be an Attribute and finally a title is reached. Observe the tree figure above, how the tree is expanded when we go in the downward direction. When the database engine does the query process, it makes these trees. The database engine also performs query optimization using these trees.

Compiler Optmization

Let’s see another expression tree here:
clip_image024
The root node is +, left subtree is capturing the f+d*e expression while the right subtree is capturing (d*e+f)*g.
Normally compilers has intelligence to look at the common parts inside parse trees. For example in the tree above, the expressions f+d*e and d*e+f are same basically. These common subtrees are called common subexpressions. To gain efficiency, instead of calculating the common subexpressions again, the compilers calculates them once and use them at other places. The part of the compiler that is responsible to do this kind of optimization is called optimizer.
See the figure below, the optimizer (part of compiler) will create the following graph while performing the optimization. Because both subtrees were equivalent, it has taken out one subtree and connected the link from node * to node +.
clip_image026
This figure is not a tree now because it has two or more different paths to reach a node. Therefore, this has become a graph. The new connection is containing a directed edge, which is there in graphs.
Optimizer uses the expressions trees and converts them to graphs for efficiency purposes. You read out from your book, how the expression trees are formed, what are the different methods of creating them.
 

C program to Read From a File

#include <stdio.h> #include <stdlib.h> void main() {     FILE *fptr;     char filename[15];     char ch;   ...