Data structures

A data structure is a data organization and storage format.

Usually chosen for efficient access to data.

from bits to applications

applications data structures primitive types bits

… a story of packing things 📦

bits & primitive types

bits form computer memory

Core & SD memory by Daniel Sancho from Málaga, Spain, CC BY 2.0

bits & primitive types

  • a computer works on bits (which are either 0 or 1)
  • bits themselves have a position, but no particular meaning
  • bits are usually grouped into bytes, which consist of 8 bits
  • (primitive) types provide meaning to groups of bits or bytes

primitive types: unsigned integer

---
title: "8 bits of the value 81"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "0"
  1: "1"
  2: "0"
  3: "1"
  4: "0"
  5: "0"
  6: "0"
  7: "1"

interpretation:

---
title: "bit values of an 8 bit unsigned integer"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "128"
  1: "64"
  2: "32"
  3: "16"
  4: "8"
  5: "4"
  6: "2"
  7: "1"

\[ 0 \cdot 128 + 1 \cdot 64 + 0 \cdot 32 + 1 \cdot 16 + 0 \cdot 8 + 0 \cdot 4 + 0 \cdot 2 + 1 \cdot 1 = 81 \]

primitive types: unsigned integer

---
title: "8 bits of the value 81"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "0"
  1: "1"
  2: "0"
  3: "1"
  4: "0"
  5: "0"
  6: "0"
  7: "1"

used in upcoming slides:

---
title: "showing value"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0-7: "81"

---
title: "using hexadecimal"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0-3: "0x5"
  4-7: "0x1"

primitive types: signed integer

---
title: "bit values of an 8 bit signed integer"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "-128"
  1: "64"
  2: "32"
  3: "16"
  4: "8"
  5: "4"
  6: "2"
  7: "1"

note, it’s -128!

this representation is called two’s complement

  • the most commonly used
  • simplifies hardware implementations

another (very uncommon, but obvious) option:

---
title: "sign and value representation"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "sign"
  1-7: "value"

primitive types: float (IEEE 754)

---
title: "32 bit float (single precision)"
config:
    packet:
        bitsPerRow: 32
---
packet-beta
  0: "sign"
  1-8: "exponent"
  9-31: "fraction"

\[ \textrm{value} = (-1)^\textrm{sign} \cdot 2^{\textrm{exponent} - 127} \cdot \left( 1 + \textrm{fraction} \right) \]

  • exponent is an unsigned integer
  • fraction is a binary fraction:

---
title: "bit values of the fraction"
config:
    packet:
        showBits: false
        bitsPerRow: 23
---
packet-beta
  0: "1/2"
  1: "1/4"
  2: "1/8"
  3: "1/16"
  4: "1/32"
  5: "1/64"
  6-22: "..."

primitive types: characters

a character set (e.g. ASCII) assigns a different meaning to each stored number:

decimal  hex meaning
10 0x0a new line
48 0x30 0
63 0x3f ?
65 0x41 A
97 0x61 a

(these day’s we use UTF-8 instead of ASCII, which uses more than one byte, such that characters of all languages can be represented)

representation vs value

representation: the bit-pattern (0b01010001 or 0x51)

value: the interpretation (81 or Q)

The same bit-pattern may be interpreted differently depending on the type!

Hands On

Take the bit-representation of a 32 bit float value of 1.0 and re-interpret it as an integer. Which value do you get? What’s the bit-representation?

  • In Python, you may want to use numpy as this provides more fine-grained control about the exact primitive types.
  • A 32 bit float has the dtype <f4, a 32 bit unsigned integer the dtype <u4.
  • .view(dtype) can be used to re-interpret memory as different type.

solution

import numpy as np
i = int(np.array(1.0, dtype="<f4").view("<u4"))
print(f"{i} {i:032b}")
1065353216 00111111100000000000000000000000
i = int(np.array(1.5, dtype="<f4").view("<u4"))
print(f"{i} {i:032b}")
1069547520 00111111110000000000000000000000

compounds

⚠️ memory locations

Memory is usually addressed (numbered) byte-wise.

From now, we’ll use byte-addressing instead of bit-addressing in the figures.

compounds

Multiple bits form primitive types.

Multiple primitive types form compound types

struct or @dataclass, tuple etc… form a fixed group of elements, each with a distinct meaning

compound: complex number

---
title: "2x4 byte (32 bit) complex number"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0-3: "real part"
  4-7: "imaginary part"

\[ \textrm{value} = \textrm{real part} + i \cdot \textrm{imaginary part} \]

from dataclasses import dataclass

@dataclass
class ComplexNumber:
    real: float
    imag: float

it’s about co-location

The groupings of bits and bytes we’ve seen so far use fixed, pre-defined and agreed-on sequences of co-located memory locations to provide meaning to those memory locations.

To make use of the bits, we have to know what to expect.

other relations

Are there other (less fixed) ways to group and provide meaning to memory locations?

  • computed addresses (e.g. array)
  • tabulated addresses (e.g. pointers)

Array

0x01 1 0x02 5 0x03 17 0x04 6

a series of consecutive memory cells

Array

0x01 1 0x02 5 0x03 0 0x04 6

element access is easy

Array

0x01 1 0x02 5 0x03 0 0x04 6 0x05 3

appending: easy if free space available

Array

0x01 1 0x02 3 0x03 5 0x04 0 0x05 6 0x06 3

insert or removal is expensive

Array

Arrays are containers, which store data elements in consecutive memory cells.

  • There’s no overhead, just data.
  • Accessing elements is cheap.
  • Modifying the structure / order etc.. can be expensive.
  • Some extra metadata may be added for convenience.

other names for similar things: vector, string

Array metadata

What metadata might be relevant for arrays?

e.g.:

  • start location
  • size (length / end / sentinel)
  • element size
  • data type

Type-Length-Value (TLV)

TLV is a typical data structure for data acquisition 1 2

---
title: "TLV temperature data"
config:
    packet:
        bitsPerRow: 24
---
packet-beta
  0-3: "TEMP"
  4-7: "4 (or 16)"
  8-11: "12.4"
  12-15: "17.2"
  16-19: "21.1"
  20-23: "23.6"

  • (fixed size) type tag (TEMP for temperature as float)
  • (fixed size) size of data (4 values, alternative: 16 bytes)
  • (variable size) actual data values

TLV is a combination of a struct and an array.

Pointers and References

An integer (stored in memory) can be interpreted as a byte-address in memory (where another variable is located).

This is called a pointer.

The term “reference” is used for variable names which act like pointers, but hide their pointer-like nature.

references in Python

Python usually doesn’t expose pointers directly, but references are everywhere.

a = [5]    # a is a reference to the list
print(a)
[5]
b = a      # b is another reference to the SAME list
b[0] = 10
print(a)   # NOTE, it's a again!
[10]

pointers & references form indirections

Pointers (and thus references) can be used to form more complex relations and data structures.

More Containers

(Linked) lists

  • Variable size, can but needn’t be contiguous in memory

  • Next element is at the address saved in the additional pointer

  • The previous element can only be found by starting at the beginning

(Linked) lists

  • Element removal only requires to change a pointer address

  • Element insertion only requires allocating memory for the new value and setting two pointers

(Linked) lists

  • in theory, insert + remove is cheaper than on an array
  • in practice, memory allocation can be more expensive than moving data
  • Python’s list is more like an array of pointers

Dictionaries

also called associative array, key-value store, map(ping), object etc…

Access elements by key (e.g. name) instead of position

temperatures = {'hamburg': 21, 'madrid': 37, 'reykjavik': 10}
print(temperatures['hamburg'])
21

use dictionaries to group things

items = ["knife", "fork", "hammer", "nail"]
by_length = {}
for element in items:
    l = len(element)
    if l not in by_length: by_length[l] = []
    by_length[l].append(element)

for l, elements in sorted(by_length.items()):
    print(l, elements)
4 ['fork', 'nail']
5 ['knife']
6 ['hammer']

use dictionaries to group things

from collections import defaultdict
items = ["knife", "fork", "hammer", "nail"]
by_length = defaultdict(list)
for element in items:
    by_length[len(element)].append(element)

for l, elements in sorted(by_length.items()):
    print(l, elements)
4 ['fork', 'nail']
5 ['knife']
6 ['hammer']

building a dictionary

Key space is often (infinitely) large (e.g. all possible strings)

There can’t be one “slot” for every possible key

Let’s make a list of keys and values:

temperatures = [('hamburg', 21), ('madrid', 37), ('reykjavik', 10)]

def get(kvlist, key):
    for k, v in kvlist:
        if k == key:
            return v
    raise KeyError(f"key '{key}' not found!")

print(get(temperatures, 'hamburg'))
21

building a dictionary

A dictionary as a list works and is useful for data exchange, but getting an element requires accessing all of them (slow).

Better options for efficient access:

  • (binary) tree
  • hash map

Set

Sets are dictionaries without values.

{"a", "b", "c", "a"}
{'a', 'b', 'c'}

Combined data structures

people = [
    {
        "givenname": "Jane",
        "name": "Doe",
        "skills": ["Python", "C++", "Clouds"],
    },
    {
        "givenname": "John",
        ...
    }
    ...
]

Shotgun Buffet

Bit Orderings

Depending on the computer architecture or application, the ordering of the bits may be reversed:

---
title: "MSB first 8 bit integer"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "128"
  1: "64"
  2: "32"
  3: "16"
  4: "8"
  5: "4"
  6: "2"
  7: "1"

---
title: "LSB first 8 bit integer"
config:
    packet:
        bitsPerRow: 8
---
packet-beta
  0: "1"
  1: "2"
  2: "4"
  3: "8"
  4: "16"
  5: "32"
  6: "64"
  7: "128"

binary-coded decimal (BCD)

BCD numbers use four bits to encode a digit (0 … 9).

The possible values 10 … 15 are unused.

---
title: "4 bit integer"
config:
    packet:
        bitsPerRow: 4
---
packet-beta
  0: "8"
  1: "4"
  2: "2"
  3: "1"

These digits can be combined to form a decimal number.

---
title: "4 digit BCD number"
config:
    packet:
        bitsPerRow: 16
---
packet-beta
  0-3: "1000"
  4-7: "100"
  8-11: "10"
  12-15: "1"

(it’s mostly an ancient and wasteful thing, but reduces complexity for transformations between bit representation and decimal value interpretation)

Stack

Only add or remove elements at the end (LIFO: last in, first out)

stack = ['this', 'is']
print(stack)
['this', 'is']
stack.append('new')
print(stack)
['this', 'is', 'new']
stack.pop()
stack.append('changed')
print(stack)
['this', 'is', 'changed']

(i.e. like an array with enough space at the end)

Queue

Add at the end, remove at the front (FIFO: first in, first out)

from queue import Queue

queue = Queue()
queue.put('First')
queue.put('Second')
print(list(queue.queue))
['First', 'Second']
print(queue.get())
First
queue.put('Third')
print(list(queue.queue))
['Second', 'Third']

can be implemented using an array as a circular buffer

Hash Map

hash map