class Heap:
def __init__(self, op=lambda a, b: a < b):
self.nodes = [] # this is the list of elements in the heap
self.op = op # this is the generalized "less-than" operation
def _swap(self, i, j):
"""swap elements at index i and j"""
...
def _heapify_up(self, i = None):
"""swap element at i with parent (and their parents) if necessary."""
i = i or len(self) - 1 # None means last
...
def _heapify_down(self, i = 0):
"""swap element at i with "smaller" child (and their children) if necessary."""
def insert(self, item):
...
def extract(self):
...Complexity
This exercise statement turned out a bit longer, but shouldn’t take very long to solve, so don’t be afraid 😊. It consists of three parts, where the first two parts would require a short textual answer and the last one is a bit of a coding exercise.
Function interface matters
The function numpy.histogram computes a histogram from the values of an array. Although it would be easy to specify the histogram bins e.g. using
np.histogram(..., bins=np.linspace(a, b, n))it’s also possible to pass the range and number of bins explicitly, e.g.: using
np.histogram(..., bins=n, range=(a, b))Tasks
Why is this a good interface design?
Zarr can store arrays
Have a look at the diagram conceptualizing the terminology of the Zarr specification. As Zarr is a way of packing n-dimensional datasets into stored objects (e.g. on a filesystem), you may find some pieces which are similar to your work on last week’s exercise. Two pieces which probably haven’t been part of your solution are chunking and codecs.
- Arrays are split into chunks, which are stored individually.
- Chunks can be transformed (e.g. compressed) using codecs.
Tasks
- Reflect on similarities and differences between your solution for last week’s exercise and the sketch of the Zarr specification. Are there surprises? Would you do things differently?
- Think (and write some notes) about data access, mainly when reading data:
- Why are chunks considered useful? - How do they impact different workloads (like looking at a global map at one timestamp, observing a time-series at a single location or computing a histogram of all values).
- What influence might (chunk-by-chunk) compression have?
A binary heap
A binary heap is a data structure forming a binary tree. A binary tree is a tree where each parent has up to two children. A binary heap is suited to implement an efficient priority queue, i.e. data structure which allows to insert elements in any order and extract an extreme (e.g. minimum, maximum, etc…) element. A heap for extracting the minimum element is called a min-heap and one to extract the maximum element is called a max-heap. It’s of course possible to implement a generic heap by accepting a comparison function as an argument. For simplicity, and without loss of generality, we’ll continue talking about a min-heap and the “less than” comparison function.

The binary tree forming the heap must always satisfy the heap property: every parent must be smaller than their children. This property ensures that the smallest element can always be found efficiently. An empty heap trivially satisfies this property, so to keep this property, we’ll just have to ensure to maintain it upon any modification (i.e. insert or extract) of the tree. As we’ll see, for normal operations, there are only two positions where the tree will be modified:
- the root (if we insert a too-large element, we must move it down to keep the heap property. This function is called
heapify_down) - a leaf (if we insert a too-small element, we must move it up to keep the heap property. This function is called
heapify_up)
There’s an interesting (and well-suited) way of encoding “full” binary trees, that is, the tree is balanced (height difference <= 1) and if there are any empty leaves, the empty leaves are on the right. Such binary trees can be stored without any explicitly written pointers in a plain array (or Python list). The parent-child relations can be found using some arithmetic formulas, and as no explicit relations are stored, this type of data structure is called an implicit data structure.
For an array / list with 0-based indexing, the formulas for finding adjacent nodes to a given node at index i are:
- parent:
(i - 1) // 2(//is integer-division) - left child:
2 * i + 1 - right child:
2 * i + 2
Tasks
- verify that the formulas for parent / child relations are correct (think about it)
- implement a binary
Heap, based on the code scaffold below- start with
_heapify_up,_heapify_downand_swap - implement
insert()by appending an element at the back of the list and fixing the heap property using_heapify_up - implement
extract()by taking the element from the front of the list. You’ll have to fill the gap by_swap-ping the last element of the list to the front and fixing the heap property using_heapify_down
- start with
- test your
Heapimplementation - use your
Heapimplementation to build a sorting algorithm (heapsort)- test it
- compare it’s performance to bubble sort and quick sort from the lecture
Outlook (not part of the exercise, but maybe interesting)
You may want to read into other uses of a heap (or priority queue in general):
- A*-Algorithm for finding a shortest path
- batch scheduling (like the slurm queue used by DKRZ)
Or you might want to compare a heap (which provides access to the minimum element) with a sorted tree (e.g. red-black-tree or AVL tree), which would be more suitable to implement a dictionary.