Today, I’m going to talk about Priority queues. I’ll begin by introducing the binary heap data structure and then move on to how they can be used as a priority queue. Then I will describe one application where you might use them. Or just for fun, I might do everything I described in the reverse order.
If you’ve written a web crawler, your architecture probably looks similar to the figure below.
- You begin with a list of Seed URLs(1) which are passed to the URL Queue(2).
- A group of Crawlers(3) listening on the queue then get a subset of the urls to crawl. Depending on your requirements, you may return subsets such that all urls in a subset are from the same domain, from distinct domains or are quite random.
- Each crawler will then fetch the page and pass it to the Extractor(5) which would extract the required data and outlinks. The data can be batched to the database(6) and the extracted outlinks pushed into the queue.
- The process continues until the Url Queue(2) is empty.
And the binary heap?
Sometimes, you might want the URL queue to function like a priority queue and not a FIFO queue. For instance, suppose you wanted to crawl pages by the number of times that url was linked to (until now), and not in any random order. You could do this by implementing a priority queue! To understand a priority queue, we’ll have to understand a binary heap first.
Binary heaps come in two forms: Max-heaps and Min-heaps. A max-heap is a binary tree with the property, “All child nodes of a node are smaller than or equal to the parent node”. A min-heap has the opposite property.
We’re going to implement a Max-heap today. Min-heaps are similar. All you need to do is change the inequality signs in the code. The figure below shows what a Max-heap looks like,
I’m going to use Python in addition to pseudo-code to illustrate everything. Lets create a simple class and we’ll add to it as we move along.
#!/usr/bin/env python # # Pravin Paratey (February 22, 2011) # Code released under BSD license class MaxHeap: """ This class illustrates a Max-Heap and its associated functions """ def __init__(self): """ We initialize the heap with some values 15 __/\__ / \ 13 9 _/\_ /\ / \ / \ 5 12 8 7 / \ /\ | 4 0 6 2 1 """ self.heap = [15, 13, 9, 5, 12, 8, 7, 4, 0, 6, 2, 1] if __name__ == '__main__': m = MaxHeap()
Representing a heap as an array
Since a heap is a lot like a balanced binary tree, it can very easily be stored as an array. This is quite powerful as accessing nodes simply means accessing elements in an array. This greatly simplifies the operations,
- Left-Child(i): return 2 × i
- Right-Child(i): return (2 × i + 1)
- Get-Parent(i): return math.floor(i/2)
... def parent(self, index): """ Parent will be at math.floor(index/2). Since integer division simulates the floor function, we don't explicitly use it """ return index / 2 def left_child(self, index): """ 1 is added because array begins at index 0 """ return 2 * index + 1 def right_child(self, index): return 2 * index + 2 if __name__ == '__main__': m = MaxHeap()
The first function we’ll look at is
Heapify(). This function is responsible for maintaining the heap property of a heap. The
Heapify() function assumes that the node index passed to it violates the max-heap property, i.e. it may be smaller than its children. It also assumes that the other nodes satisfy the max-heap property.
Heapify() works by swapping the current node with the largest of its children, then it does a recursive call on the swapped node. The easiest way of explaining this would be to look at the set of diagrams below,
In the figure above, 3 violates the max-heap property. Heapify works under the assumption that the rest of nodes satisfy the max-heap property.
Node 3 is compared with its children - 25 and 20 and the larger node 25 is swapped with node 3.
This process is repeated and node 9 is swapped with node 3 - at which point the tree satisfies the max-heap. This can be written in Python as,
... def max_heapify(self, index): """ Responsible for maintaining the heap property of the heap. This function assumes that the subtree located at left and right child satisfies the max-heap property. But the tree at index (current node) does not. O(log n) """ left_index = self.left_child(index) right_index = self.right_child(index) largest = index if left_index < len(self.heap) and self.heap[left_index] > self.heap[index]: largest = left_index if right_index < len(self.heap) and self.heap[right_index] > self.heap[largest]: largest = right_index if largest != index: self.heap[index], self.heap[largest] = self.heap[largest], self.heap[index] self.max_heapify(largest)
This function has a time-complexity of
This function builds a max-heap from an unordered array. The function begins with the lowest non-leaf nodes and calls
Heapify() on them.
... def build_max_heap(self): """ Responsible for building the heap bottom up. It starts with the lowest non-leaf nodes and calls heapify on them. This function is useful for initialising a heap with an unordered array. O(n) """ for i in xrange(len(self.heap)/2, -1, -1): self.max_heapify(i)
This is useful if you want to start with an unordered list or numbers (or urls) and want to construct a heap. The time-complexity of this function is
Lets digress a bit and look at the heap-sort algorithm which has a time complexity of
O(nlog(n)). Heapsort can be implemented as follows,
... def heap_sort(self): """ The heap-sort algorithm with a time complexity O(n log n) """ self.build_max_heap() output =  for i in xrange(len(self.heap)-1, 0, -1): self.heap, self.heap[i] = self.heap[i], self.heap output.append(self.heap.pop()) self.max_heapify(0) output.append(self.heap.pop()) self.heap = output
Adding an element
Adding an element is quite easy. All you have to do is appending the new element at the end of the array and then work our way up, swapping the current node with its parent (if smaller), until you reach the root or the parent is greater than the current node.
... def propagate_up(self, index): """ Compares index with parent and swaps node if larger O(log(n)) """ while index != 0 and self.heap[self.parent(index)] < self.heap[index]: self.heap[index], self.heap[self.parent(index)] = self.heap[self.parent(index)], self.heap[index] index = self.parent(index) def add(self, key): """ Adds an element in the heap O(ln(n)) """ self.heap.append(key) self.propagate_up(len(self.heap) - 1) # Index value is 1 less than length
The figures below illustrate the
In the figure above, 40 is a new node that has been added to the heap. Clearly, it violates the heap property.
40 is compared with its parent 20 and since 40 is greater, the nodes are swapped.
This process continues until 40 cannot be swapped with its parent (50). At this point you see that the heap property is satisfied for all its nodes.
Removing an element
Removing an element is actually the
ExtractMax() function which removes the element with the maximum value (i.e. the root node). This is our priority queue!
... def extract_max(self): """ Part of the Priority Queue, extracts the element on the top of the heap and then re-heapifies. O(log n) """ max = self.heap data = self.heap.pop() if len(self.heap) > 0: self.heap = data self.max_heapify(0) return max
Updating an element
Updating an element is done by first locating the element and incrementing it – and like
add() – and working our way up to root.
… def increment(self, key, value): “”” Increments key by the input value. O(log n) “”” for i in xrange(len(self.heap)): if self.heap[i] == key: self.heap[i] += value self.propagate_up(i) break
In closing, I would like to say that while the above code was quite illustrative, in real life, you will be working with a more involved data structure than an integer at each node. It is quite easy to extend this to work with (occurrence, url) pairs. To see some example code of how this can be done, use the links below.