Complexity

This exercise statement turned out a bit longer, but shouldn’t take very long to solve, so don’t be afraid 😊. It consists of three parts, where the first two parts would require a short textual answer and the last one is a bit of a coding exercise.

Function interface matters

The function numpy.histogram computes a histogram from the values of an array. Although it would be easy to specify the histogram bins e.g. using

np.histogram(..., bins=np.linspace(a, b, n))

it’s also possible to pass the range and number of bins explicitly, e.g.: using

np.histogram(..., bins=n, range=(a, b))

Tasks

Why is this a good interface design?

Zarr can store arrays

Have a look at the diagram conceptualizing the terminology of the Zarr specification. As Zarr is a way of packing n-dimensional datasets into stored objects (e.g. on a filesystem), you may find some pieces which are similar to your work on last week’s exercise. Two pieces which probably haven’t been part of your solution are chunking and codecs.

  • Arrays are split into chunks, which are stored individually.
  • Chunks can be transformed (e.g. compressed) using codecs.

Tasks

  • Reflect on similarities and differences between your solution for last week’s exercise and the sketch of the Zarr specification. Are there surprises? Would you do things differently?
  • Think (and write some notes) about data access, mainly when reading data:
    • Why are chunks considered useful? - How do they impact different workloads (like looking at a global map at one timestamp, observing a time-series at a single location or computing a histogram of all values).
    • What influence might (chunk-by-chunk) compression have?

A binary heap

A binary heap is a data structure forming a binary tree. A binary tree is a tree where each parent has up to two children. A binary heap is suited to implement an efficient priority queue, i.e. data structure which allows to insert elements in any order and extract an extreme (e.g. minimum, maximum, etc…) element. A heap for extracting the minimum element is called a min-heap and one to extract the maximum element is called a max-heap. It’s of course possible to implement a generic heap by accepting a comparison function as an argument. For simplicity, and without loss of generality, we’ll continue talking about a min-heap and the “less than” comparison function.

A min-heap has the smallest element in it’s root node.

The binary tree forming the heap must always satisfy the heap property: every parent must be smaller than their children. This property ensures that the smallest element can always be found efficiently. An empty heap trivially satisfies this property, so to keep this property, we’ll just have to ensure to maintain it upon any modification (i.e. insert or extract) of the tree. As we’ll see, for normal operations, there are only two positions where the tree will be modified:

  • the root (if we insert a too-large element, we must move it down to keep the heap property. This function is called heapify_down)
  • a leaf (if we insert a too-small element, we must move it up to keep the heap property. This function is called heapify_up)

There’s an interesting (and well-suited) way of encoding “full” binary trees, that is, the tree is balanced (height difference <= 1) and if there are any empty leaves, the empty leaves are on the right. Such binary trees can be stored without any explicitly written pointers in a plain array (or Python list). The parent-child relations can be found using some arithmetic formulas, and as no explicit relations are stored, this type of data structure is called an implicit data structure.

Binary tree encoded as an array

For an array / list with 0-based indexing, the formulas for finding adjacent nodes to a given node at index i are:

  • parent: (i - 1) // 2 (// is integer-division)
  • left child: 2 * i + 1
  • right child: 2 * i + 2

Tasks

  1. verify that the formulas for parent / child relations are correct (think about it)
  2. implement a binary Heap, based on the code scaffold below
    • start with _heapify_up, _heapify_down and _swap
    • implement insert() by appending an element at the back of the list and fixing the heap property using _heapify_up
    • implement extract() by taking the element from the front of the list. You’ll have to fill the gap by _swap-ping the last element of the list to the front and fixing the heap property using _heapify_down
  3. test your Heap implementation
  4. use your Heap implementation to build a sorting algorithm (heapsort)
    • test it
    • compare it’s performance to bubble sort and quick sort from the lecture
class Heap:
    def __init__(self, op=lambda a, b: a < b):
        self.nodes = []  # this is the list of elements in the heap
        self.op = op  # this is the generalized "less-than" operation
    def _swap(self, i, j):
        """swap elements at index i and j"""
        ...

    def _heapify_up(self, i = None):
        """swap element at i with parent (and their parents) if necessary."""
        i = i or len(self) - 1  # None means last
        ...
            
    def _heapify_down(self, i = 0):
        """swap element at i with "smaller" child (and their children) if necessary."""

    def insert(self, item):
        ...

    def extract(self):
        ...

Outlook (not part of the exercise, but maybe interesting)

You may want to read into other uses of a heap (or priority queue in general):

  • A*-Algorithm for finding a shortest path
  • batch scheduling (like the slurm queue used by DKRZ)

Or you might want to compare a heap (which provides access to the minimum element) with a sorted tree (e.g. red-black-tree or AVL tree), which would be more suitable to implement a dictionary.