Suppose we have a very large collection of items and need to select the first N, according to some criterion. In this article I will show you an efficient algorithm based on the heap data-structure and explain how to implement it using standard Java features.
Now we'd like to find the highest 10 achieved scores.
As our game has become quite popular, this list is very long. Let's say we've gathered a few millions of records.
According to Wikipedia:
From the heap property follows an important observation - the greatest element is always located in the root of the tree. Also, given heap is a complete binary tree, it has logarithmic height. Therefore the following convenient time complexities:
You can find a more detailed explanation and implementation details on the Internet.
a min-heap, where the heap property is inverted and root node contains the minimum element.
It is useful to observe that a min-heap is just a max-heap with inverted ordering
function. Specifically:
the top N elements of the N+i-th prefix of the input sequence. Therefore, at the end of step 2, after all items have been processed, it will contain the top N elements of the whole sequence.
Note that we've used a min-heap, even though we're interested in the N largest elements. This is an important trick that allows us to easily drop the smallest of the current N+1 largest elements in step 2b. Conversely, at the end in Step 4, we need to reverse the order of output, because the heap will give out the N largest elements sorted from the smallest.
The heap can be implemented using an underlying array of size N. Therefore the space complexity of this algorithm is O(N), which is also very good.
It has the following interface:
The comparator will look as follows:
You can find the complete code on GitHub under dusan-rychnovsky/top-n.
An Example
As an example, suppose we created a very popular on-line game, such as Tetris. Over the few months since the game was published we've gathered a list of achieved scores with the following structure:ID | PLAYER | DATE | SCORE |
---|---|---|---|
1 | Jose Rodriguez | 2016/10/01 14:55:02 | 154 |
2 | Nefart | 2016/12/15 03:12:21 | 12 |
3 | SUF4N | 2016/06/03 22:00:01 | 10280 |
4 | EtnaVolcano | 2016/12/24 18:56:15 | 1342 |
5 | Guido | 2016/09/09 07:15:00 | 280 |
... |
As our game has become quite popular, this list is very long. Let's say we've gathered a few millions of records.
Binary Heap
This is a task most suited for a heap data-structure. There are many variants of heap, of which the most basic and commonly known is binary (or 2-regular) heap.According to Wikipedia:
Binary heap is a complete binary tree which satisfies the heap property - the key stored in each node is greater than or equal to the keys in the node's children, according to some total order function.
A complete binary tree is, put informally, a binary tree for which every level, except possibly the last, is completely filled, and all nodes in the last level are as far left as possible.
An Example Binary Heap |
From the heap property follows an important observation - the greatest element is always located in the root of the tree. Also, given heap is a complete binary tree, it has logarithmic height. Therefore the following convenient time complexities:
- Insert O(log(N))
- FindMax O(1)
- DeleteMax O(log(N))
You can find a more detailed explanation and implementation details on the Internet.
Min vs. Max Heap
The above description holds for a so-called max-heap. Additionally, there also isa min-heap, where the heap property is inverted and root node contains the minimum element.
It is useful to observe that a min-heap is just a max-heap with inverted ordering
function. Specifically:
x <fmax y <=> x >fmin yTherefore the distinction between max-heap and min-heap lies purely in what ordering function it takes. It is, however, a useful abstraction that simplifies communication and reasoning about heap behaviour.
The Algorithm
So how can we use a heap to solve the problem at hand?Input: A sequence of M elements, an appropriate ordering function and a positive number N. Supposing M >> N. Steps: 1. Build a min-heap of size N by inserting the first N elements. 2. For each remaining element: a) Insert the element. b) Delete the current minimum. 3. While heap is non-empty: a) Find and collect minimum. b) Delete minimum. 4. Return collected elements, in reversed order. Output: A sequence of top N elements of the input sequence.A key observation here is that after i-th iteration in Step 2 the heap will contain
the top N elements of the N+i-th prefix of the input sequence. Therefore, at the end of step 2, after all items have been processed, it will contain the top N elements of the whole sequence.
Note that we've used a min-heap, even though we're interested in the N largest elements. This is an important trick that allows us to easily drop the smallest of the current N+1 largest elements in step 2b. Conversely, at the end in Step 4, we need to reverse the order of output, because the heap will give out the N largest elements sorted from the smallest.
Time And Space Complexity
The total time complexity of the algorithm is O(M log(N)), where M is the size of the input sequence and N is the size of the output sequence. Note that M >> N and so this bound is much better than for e.g. sorting the whole input sequence. Often N is very small, such as 10. In that case the time complexity is essentially O(M).The heap can be implemented using an underlying array of size N. Therefore the space complexity of this algorithm is O(N), which is also very good.
PriorityQueue
Now that the algorithm is clear, let's look at how we can implement it in Java. The standard library provides a binary heap implementation under a class named PriorityQueue.It has the following interface:
public PriorityQueue(int initialCapacity, Comparator<E> comparator) Constructor. The initialCapacity determines the initial size of the underlying array. This is just a technical detail. PriorityQueue is an unbounded queue and the array will grow as needed. The comparator implements the ordering function. PriorityQueue is a min-heap - root note will contain the record which the comparator deems smallest. If you need a max-heap, just give it a reversed comparator (see java.util.Collections#reverseOrder(Comparator<T> cmp)).
public boolean add(E element) Insert.
public int size() Size
public E poll() Retrieve and delete min.
The Implementation
Using PriorityQueue the algorithm can be expressed in a straight-forward way.public class TopN { public static <T> List<T> getTopN(Stream<T> entries, Comparator<T> comparator, int count) { PriorityQueue<T> queue = new PriorityQueue<>(count + 1, comparator); entries.forEach(entry -> { queue.add(entry); if (queue.size() > count) { queue.poll(); } }); List<T> result = new ArrayList<>(queue.size()); while (!queue.isEmpty()) { result.add(queue.poll()); } Collections.reverse(result); return result; } }
Back to our Example
Now that we have the algorithm implemented we just need to provide the representation of a record and the ordering function for our specific use-case.@Value public class Score { int id; String player; Date date; int score; }The @Value annotation comes from Project Lombok and denotes a value-object. It is just some syntactic sugar that allows to make our class definitions clearer. The compiler will augment the AST - make all those fields private and final, generate a corresponding getter for each of them plus an all-arguments constructor.
The comparator will look as follows:
public class ScoreComparator implements ComparatorWith all the pieces in place we can now use them this way:{ @Override public int compare(Score first, Score second) { return Integer.compare(first.getScore(), second.getScore()); } }
Stream<Score> scores = ...; // obtain a stream of all scores List<Score> top10 = TopN.getTopN(scores, new ScoreComparator(), 10);This will give us the 10 greatest scores achieved in our game.
Conclusion
In this article we've seen a brief overview of the heap data-structure and demonstrated, based on a use-case, one type of applications it is very useful for.You can find the complete code on GitHub under dusan-rychnovsky/top-n.