File and Data Systems

Florian Ziemen and Karl-Hermann Wieners

Storing data

  • Recording of information in a medium
    • Deoxyribonucleic acid (DNA)
    • Hand writing
    • Magnetic tapes
    • Hard disks

Topics

  • Any storage is finite (except for /dev/null and /dev/zero).
  • The pros and cons of different storage media
  • Indexing of storage / storage metadata
  • Challenges of parallel storage access

Quota and permission

Quota

  • Distributing a scarce resource between users.
  • Every user / project gets a specified share.
  • Usually no over-commitment.
/sw/bin/lfsquota /work/bb1153
Disk quotas for prj 30001639 (pid 30001639):
     Filesystem    used   quota   limit   grace   files   quota   limit   grace
          /work  588.4T    595T    595T       - 22140190       0       0       -

Permissions

  • Am I allowed to read / write a file?
  • How about others?
  • See man ls and man chmod for details on a standard file system.
  • Other storage systems can have varying ways of controlling access.
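
The two questions above can also be asked from Python. The following is a minimal sketch for a standard POSIX file system; the helper name describe_permissions and the temporary file are made up for illustration.

```python
import os
import stat
import tempfile

def describe_permissions(path):
    """Return (mode string, octal mode, readable by me?, writable by me?)."""
    info = os.stat(path)
    mode_string = stat.filemode(info.st_mode)       # same notation as ls -l
    octal_mode = oct(stat.S_IMODE(info.st_mode))    # same notation as chmod
    return (mode_string, octal_mode,
            os.access(path, os.R_OK), os.access(path, os.W_OK))

with tempfile.NamedTemporaryFile() as tmp:
    # owner: read/write, group: read, others: no access
    os.chmod(tmp.name, 0o640)
    print(describe_permissions(tmp.name))
```

os.access answers the question for the current user only; what others may do has to be read from the mode bits.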

Properties of storage systems

Latency

  • How long does it take until we get the first bit of data?
  • Crucial when opening many small files (e.g. starting Python)
  • Less crucial when reading one big file start-to-end.
  • Largely determined by moving parts in the storage medium.

Continuous read / write

  • How much data do we get per second when we are reading continuously?
  • Important for reading/writing large blobs of data from/into individual files.

Random read / write

  • Mixture of latency and continuous read/write
  • Reading many small files / skipping around in files
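
A rough way to see the difference between continuous and random reads is to read the same chunks once in order and once shuffled. This is only a sketch: the file size, chunk size and helper name timed_read are arbitrary choices, and because the test file was just written it will most likely be served from the operating system cache, so the gap is much smaller than on a cold disk.

```python
import os
import random
import tempfile
import time

def timed_read(path, offsets, chunk_size):
    """Read chunk_size bytes at each offset; return (seconds, bytes read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as f:    # no buffering on the Python side
        for offset in offsets:
            f.seek(offset)
            total += len(f.read(chunk_size))
    return time.perf_counter() - start, total

chunk_size = 4096
n_chunks = 2048                                 # 8 MiB test file
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "probe.bin")
    with open(path, "wb") as f:
        f.write(os.urandom(chunk_size * n_chunks))

    sequential = [i * chunk_size for i in range(n_chunks)]
    shuffled = sequential[:]
    random.Random(0).shuffle(shuffled)          # fixed seed for repeatability

    t_seq, n_seq = timed_read(path, sequential, chunk_size)
    t_rand, n_rand = timed_read(path, shuffled, chunk_size)

print(f"sequential: {n_seq / t_seq / 1e6:.0f} MB/s")
print(f"random:     {n_rand / t_rand / 1e6:.0f} MB/s")
```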

Caching

  • Keeping data in memory for frequent re-use.
  • Usually storage media like disks have small caches with better properties.
  • e.g. HDD of 16 TB with 512 MB of RAM cache.
  • Operating systems also cache reads.
  • Caching writes in RAM is risky: data that has not yet reached the storage medium is lost on power loss or a system crash.
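
When a write must not linger in a cache, a program can ask for it to be pushed out explicitly. A minimal sketch, with durable_write as a made-up helper name; whether the device's own cache is flushed as well depends on the system.

```python
import os
import tempfile

def durable_write(path, payload):
    """Write payload and ask the OS to push it to the storage device."""
    with open(path, "wb") as f:
        f.write(payload)        # lands in Python's own buffer first
        f.flush()               # Python buffer -> operating system cache
        os.fsync(f.fileno())    # operating system cache -> storage device
    return len(payload)

with tempfile.TemporaryDirectory() as tmpdir:
    target = os.path.join(tmpdir, "important.dat")
    written = durable_write(target, b"results worth keeping\n")
    size_on_disk = os.path.getsize(target)

print(written, size_on_disk)
```

The price is speed: every os.fsync call waits for the storage system, so it is usually reserved for data that cannot be recomputed.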

Hardware types

Speed vs cost per space

Device   Latency      Cont. R/W     Rand. R/W     EUR/TB
RAM      10s of ns    10s of GB/s   10s of GB/s   ~ 3000
SSD      100s of µs   GB/s          GB/s          ~ 100
HDD      ms           200 MB/s      MB/s          ~ 10
Tape     minutes      300 MB/s      minimal       ~ 5
  • All figures based on a quick Google search in 06/2024.
  • RAM needs electricity to keep the data (volatile memory).
  • All but tape usually remain powered in an HPC system.
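
To get a feeling for what these numbers mean, the sketch below estimates the time to read 1 GB either as one file or as a million files of 1 kB. The concrete values are assumptions picked from the ranges in the table (for example 5 ms for "ms"), and the model simply charges one latency per file. Tape is left out because this per-file model does not describe it well.

```python
# Representative values picked from the table (assumptions):
# latency in seconds, continuous throughput in bytes per second.
DEVICES = {
    "RAM": {"latency": 10e-9, "throughput": 10e9},
    "SSD": {"latency": 100e-6, "throughput": 1e9},
    "HDD": {"latency": 5e-3, "throughput": 200e6},
}

def transfer_time(device, n_files, file_size):
    """Estimated seconds to read n_files files of file_size bytes each."""
    spec = DEVICES[device]
    return n_files * (spec["latency"] + file_size / spec["throughput"])

for name in DEVICES:
    one_big = transfer_time(name, 1, 1e9)
    many_small = transfer_time(name, 1_000_000, 1e3)
    print(f"{name}: 1 x 1 GB = {one_big:.3f} s, 1e6 x 1 kB = {many_small:.3f} s")
```

With these assumptions the hard disk needs about 5 s for the single file and about 5000 s for the small files, almost all of it latency.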

RAM disk

  • Use RAM as if it were a disk
    • tmpfs filesystems on levante (/dev/shm)
  • High speed, low volume, lost on reboot.

Solid-state disk/flash drives

  • Non-volatile electronic medium.
    • Keeps state (almost) without energy supply.
  • High speed, also under random access.

Hard disk

  • Basically a stack of modern record players.
  • Stack of magnetic disks with read/write heads.
  • Spinning to make every point accessible by heads.
  • Good for bigger files, not ideal for random access.

Tape

  • Spool of magnetizable bands.
  • Sequential access only.
  • Used for backup / long-term storage.

Hands-on

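import getpass
import os
import timeit

# Per-user scratch directory of the form /scratch/<first letter>/<user>.
user = getpass.getuser()
destination = f"/scratch/{user[0]}/{user}"

def run_write(path, length, blocksize=1024**2):
    """Write `length` zero bytes to `path` in chunks of `blocksize` bytes."""
    times, remainder = divmod(length, blocksize)
    data = bytearray(blocksize)
    with open(path, "wb") as outfile:
        for _ in range(times):
            outfile.write(data)
        # Write the part that does not fill a whole block.
        if remainder:
            outfile.write(bytearray(remainder))

# Time a single write of 1024**3 bytes; the result is in seconds.
duration = {}
duration['1GB'] = timeit.timeit(lambda: run_write(f"{destination}/test.1GB", 1024**3), number=1)
print(duration)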
{'1GB': 0.4544727400643751}

Take this set of calls and measure the write speed for different file sizes on your /scratch/.

Storage Architectures

These aren’t books in which events of the past are pinned like so many butterflies to a cork. These are the books from which history is derived. There are more than twenty thousand of them; each one is ten feet high, bound in lead, and the letters are so small that they have to be read with a magnifying glass.

from “Small Gods” by Terry Pratchett

Thou shalt have identifiable data

  • There must be a way to reference the data stored on a medium
  • Usual means are (symbolic) names or (numerical) identifiers
  • Must be determined at time of data storage
  • Either implicit, stored with the data or externally
  • Medium is “formatted” to provide required infrastructure

The more the merrier

  • Additional information (metadata) may be needed
    • Required by the storage architecture
      • Support optimized data storage or access
    • Defined by users or applications
      • Allows indexing of data beyond name or id
      • Especially for Content-Addressed Storage (e.g. git)
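
In content-addressed storage the identifier is derived from the data itself. The toy store below keeps everything in memory and uses SHA-256 as the identifier; the class name ContentStore and the choice of hash are assumptions for illustration, and git uses its own hashing scheme.

```python
import hashlib

class ContentStore:
    """Toy in-memory store where the key is the SHA-256 hash of the data."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data     # identical content ends up under one key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)

store = ContentStore()
first = store.put(b"same bytes")
second = store.put(b"same bytes")   # stored only once
other = store.put(b"other bytes")
print(first == second, len(store))
```

Because the key follows from the content, duplicates are detected for free and a retrieved object can be verified by hashing it again.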

File systems (POSIX)

  • Data is organized in “Files”
  • Files are grouped in special files, “Directories”
  • File data is stored in fixed size blocks
  • Focus on consistently managing changing data

Blocks

  • Minimal size of data transfer
  • Reduces the effect of latency under random access
  • Read-ahead for sequential processing
  • “Sweet spot” between single bytes and big blocks
  • Usually a multiple of device blocks
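
A small sketch of what fixed-size blocks mean for a file: how many blocks a given size occupies, and how a file is consumed block by block. The 4096-byte block size and the helper names are assumptions for illustration; 6268 bytes is the size of slides.qmd in the stat example.

```python
import os
import tempfile

def blocks_needed(size, block_size):
    """Number of fixed-size blocks needed to hold size bytes."""
    return -(-size // block_size)       # ceiling division

def read_in_blocks(path, block_size):
    """Yield the content of a file one block at a time."""
    with open(path, "rb") as f:
        while chunk := f.read(block_size):
            yield chunk

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.bin")
    with open(path, "wb") as f:
        f.write(bytes(6268))
    chunk_sizes = [len(chunk) for chunk in read_in_blocks(path, 4096)]

n_blocks = blocks_needed(6268, 4096)
unused = n_blocks * 4096 - 6268         # bytes allocated but not used
print(n_blocks, chunk_sizes, unused)
```

The unused tail of the last block is the price of large blocks when storing many small files.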

Inodes

  • The file’s name refers to a metadata table (“inode”)
    • unique numerical identifier
    • times of state change, permissions
    • contains actual block locations
stat slides.qmd
  File: slides.qmd
  Size: 6268        Blocks: 16         IO Block: 4194304 regular file
Device: 84f0b5a2h/2230367650d   Inode: 144131850904906407  Links: 1
Access: (0644/-rw-r--r--)  Uid: (20472/ m221078)   Gid: (32054/  mpiscl)
Access: 2024-06-21 15:01:18.000000000 +0200
Modify: 2024-06-21 15:01:18.000000000 +0200
Change: 2024-06-21 17:29:10.000000000 +0200
 Birth: 2024-06-21 15:01:18.000000000 +0200

File system in action

stat --file-system .
  File: "."
    ID: 84f0b5a200000000 Namelen: 255     Type: lustre
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 31118373528 Free: 22547214420 Available: 20963365466
Inodes: Total: 2465930855 Free: 2227421415
df --block-size=4096 .
Filesystem                             4K-blocks       Used   Available Use% Mounted on
10.128.100.149@o2ib2[...]:/home/home 31118373528 8571074099 20963485307  30% /home
df --inodes .
Filesystem                               Inodes     IUsed      IFree IUse% Mounted on
10.128.100.149@o2ib2[...]:/home/home 2465932215 238493275 2227438940   10% /home
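
The same block and inode counters are available from Python through os.statvfs (Unix only). A minimal sketch; the helper name filesystem_usage and the dictionary keys are made up for illustration.

```python
import os

def filesystem_usage(path="."):
    """Collect block and inode counters of the file system holding path."""
    vfs = os.statvfs(path)
    return {
        "block_size": vfs.f_frsize,         # fundamental block size in bytes
        "blocks_total": vfs.f_blocks,
        "blocks_free": vfs.f_bfree,
        "blocks_available": vfs.f_bavail,   # free blocks for unprivileged users
        "inodes_total": vfs.f_files,
        "inodes_free": vfs.f_ffree,
    }

usage = filesystem_usage(".")
total = usage["blocks_total"]
used_fraction = 1 - usage["blocks_free"] / total if total else 0.0
print(f"{used_fraction:.0%} of the blocks are in use")
print(f"{usage['inodes_free']} of {usage['inodes_total']} inodes are free")
```

A file system can run out of inodes long before it runs out of blocks if it holds very many small files.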

Directories

  • Directories link names to inode numbers,
    i.e. they do not “contain” files
  • Links to itself (.) and the “parent” directory (..)
  • UNIX has “root” directory (/) as global anchor
  • Files identified via path in directory tree
  • Tree may embed (“mount”) different file systems

Directories in action

pwd
/home/[...]/generic_software_skills/lecture-materials/lectures/file-and-data-systems
ls -lia
total 24
144131850904906406 drwxr-xr-x  3 m221078 mpiscl 4096 Jun 21 15:01 .
144131850904906358 drwxr-xr-x 16 m221078 mpiscl 4096 Jun 21 15:01 ..
144131850904906407 -rw-r--r--  1 m221078 mpiscl 6268 Jun 21 15:01 slides.qmd
144131850904906408 drwxr-xr-x  2 m221078 mpiscl 4096 Jun 21 15:01 static
144131850904906413 -rw-r--r--  1 m221078 mpiscl 1849 Jun 21 15:01 timer.ipynb
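
That a directory only maps names to inode numbers becomes visible with hard links: two names can point to the same inode, and the data survives until the last name is removed. A minimal sketch in a temporary directory, assuming the file system there supports hard links; all file names are made up.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    original = os.path.join(tmpdir, "data.txt")
    alias = os.path.join(tmpdir, "alias.txt")
    with open(original, "w") as f:
        f.write("hello\n")

    os.link(original, alias)            # second directory entry, same inode
    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    links_before = os.stat(original).st_nlink

    os.unlink(original)                 # removes one name, not the data
    with open(alias) as f:
        content_after = f.read()
    links_after = os.stat(alias).st_nlink

    # Name -> inode number, comparable to the first column of ls -lia
    listing = {entry.name: entry.inode() for entry in os.scandir(tmpdir)}

print(same_inode, links_before, links_after, repr(content_after))
```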

Fragmentation

  • Reading is fast if data stays “in line”
  • The line is broken when blocks are added later
    or blocks of unlinked inodes are re-used
  • Reads slow down due to “jumps” across the medium
  • Reduced by block range reservations (“Extents”)
  • Defragmentation tools move the blocks around but keep the inode IDs

Hands-on

Add a similar function to the previous one for reading, and read the data you just produced. What’s your throughput?

Other architectures

Object storage

  • Data is presented as immutable “objects” (“BLOB”)
  • Each object has a globally unique identifier (e.g. UUID or hash)
  • Objects may be assigned names, grouped in “buckets”
  • Generally supports creation (“put”) and retrieval (“get”), can support much more (versioning, etc.).
  • Focus on data distribution and replication, fast read access

Object storage – metadata

  • Object metadata is stored independently of the data location
    • Allows object migration or replication across devices
    • Allows optimizations (e.g. database, indices)
    • Metadata, too, may be replicated
  • Enables additional or user defined metadata
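
The separation of data and metadata can be sketched with a toy in-memory object store. The class ObjectStore, its method names and the example metadata are assumptions for illustration and do not follow the API of any real service.

```python
import uuid

class ObjectStore:
    """Toy object store: immutable blobs plus a separate metadata index."""

    def __init__(self):
        self._data = {}         # object id -> bytes
        self._metadata = {}     # object id -> dict, kept apart from the data

    def put(self, payload: bytes, bucket: str, name: str, **user_metadata) -> str:
        object_id = str(uuid.uuid4())           # globally unique identifier
        self._data[object_id] = bytes(payload)
        self._metadata[object_id] = {
            **user_metadata,                    # user-defined metadata
            "bucket": bucket,
            "name": name,
            "size": len(payload),
        }
        return object_id

    def get(self, object_id: str) -> bytes:
        return self._data[object_id]

    def find(self, **criteria):
        """Search the metadata index only; the data is never touched."""
        return [oid for oid, meta in self._metadata.items()
                if all(meta.get(key) == value for key, value in criteria.items())]

store = ObjectStore()
oid = store.put(b"\x00" * 16, bucket="experiments", name="run1.bin", variable="tas")
store.put(b"\x01" * 8, bucket="experiments", name="run2.bin", variable="pr")
print(store.find(variable="tas") == [oid], len(store.get(oid)))
```

Because lookups go through the metadata index, the blobs themselves could be moved or replicated without changing how they are found.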

Object storage – applications

  • Cloud storage services
  • Parallelized file systems

In-file storage systems

  • Explicit: zip, tar, …
  • Implicit: HDF5, mp4, databases
  • Pseudo-Filesystems: git, …

In-file storage – features

  • Counterbalance latency
  • Decrease blocking loss
  • Use storage features in application
  • Portable across storage systems
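
As an example of explicit in-file storage, the sketch below packs many small members into a single zip archive and reads one of them back. The archive is kept in memory to stay self-contained; the helper names pack and read_member and the member names are made up.

```python
import io
import zipfile

def pack(files):
    """Pack a dict {member name: bytes} into one zip archive (as bytes)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for name, payload in files.items():
            archive.writestr(name, payload)
    return buffer.getvalue()

def read_member(archive_bytes, name):
    """Read one member using the archive's internal index."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        return archive.read(name)

small_files = {f"part_{i:03d}.txt": f"record {i}\n".encode() for i in range(100)}
archive_bytes = pack(small_files)
print(read_member(archive_bytes, "part_042.txt"))
```

The storage system sees one file and one open call instead of a hundred, which is where the latency and blocking-loss benefits come from.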

Redundancy

Protection against

  • accidental deletion
  • data loss due to hardware failure
  • downtimes due to hardware failure

Backups

  • Keep old states of the file system available.
  • Need at least as much space as the (compressed version of the) data being backed up.
  • Often low-frequency full backups and high-frequency incremental backups
    to balance space requirements and restoring time
  • Ideally at different locations
  • Automate them!

RAID

Combining multiple hard disks into bigger and/or more fault-tolerant units, often at controller level.

  • RAID 0 distributes the blocks across all disks - more space, but data loss if one fails.
  • RAID 1 mirrors one disk on an identical copy.
  • RAID 5 is similar to 0, but with one extra disk for (distributed) parity info
  • RAID 6 is similar to 5, but with two extra disks for parity info (levante uses 8+2 disks).
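
The parity idea behind RAID 5 can be sketched with XOR: the parity block is the XOR of all data blocks, so any single lost block can be rebuilt from the remaining ones. The block contents and the helper name xor_blocks are illustrative, and the second parity of RAID 6 needs different mathematics that is not shown here.

```python
from functools import reduce

def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Three data blocks, one per disk, plus one parity block.
data_blocks = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data_blocks)

# The disk holding the second block fails: rebuild it from the survivors.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```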

Erasure coding

Similar to RAID, but more flexible with the numbers of disks (more than two parity disks are possible).

  • Used in object stores.
  • Usually, data is distributed across independent servers for higher availability.
  • Requires more computational resources than RAID.

Lustre as a parallel file system

What if you are not the only one controlling the FS?

The file system becomes an independent system.

File system via network

  • All nodes see the same set of files.
  • A set of central servers manages the file system.
  • All nodes accessing the Lustre file system run local clients.
  • Many nodes can write into the same file at the same time (MPI-IO).
  • Optimized for high traffic volumes in large files.

Metadata and storage servers

  • The index is spread over a group of Metadata servers (MDS, 8 for /work on levante).
  • The files are spread over another group (40 OSS / 160 OST on levante).
  • Every directory is tied to one MDS.
  • A file is tied to one or more OSTs.
  • An OST contains many hard disks.

The work file system of levante in context

Striping

Zheng et al. 2020 CC-BY-4.0

Striping – features

  • Increased bandwidth by parallel reads
    • Eventually limited by network interfaces
  • More points of failure in one dataset
    • Additional redundancy or error correction
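
Striping can be pictured as a round-robin mapping from byte offsets in the file to storage targets. The sketch below assumes a plain round-robin layout with a fixed stripe size and stripe count; the function name locate and the numbers are made up, and real layouts are configurable.

```python
def locate(offset, stripe_size, stripe_count):
    """Map a byte offset in a file to (target index, offset on that target)."""
    stripe_number = offset // stripe_size
    target = stripe_number % stripe_count
    local_offset = (stripe_number // stripe_count) * stripe_size + offset % stripe_size
    return target, local_offset

MiB = 1024**2
for offset in (0, 1 * MiB, 4 * MiB, 5 * MiB + 10):
    print(offset, locate(offset, stripe_size=1 * MiB, stripe_count=4))
```

Consecutive stripes land on different targets, so a large sequential read can be served by several targets at once.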

Shotgun buffet

IPFS (InterPlanetary File System)

  • Content addressable storage on the internet (the hash of the data identifies a file).
  • Distributed index tables.
  • You either provide the data online or pay others to do so.
  • Automated methods of caching and replication.

fsspec

  • Python package that provides pseudo-filesystems on various backends.
  • Unifies access to various implementations.
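
A minimal sketch of the unified access, assuming the fsspec package is installed. The same helper (roundtrip, a made-up name) writes and reads through fsspec's in-memory backend and through the local file system; the paths are arbitrary.

```python
import os
import tempfile

import fsspec

def roundtrip(fs, path, payload):
    """Write payload to path on the given file system and read it back."""
    with fs.open(path, "wb") as f:
        f.write(payload)
    with fs.open(path, "rb") as f:
        return f.read()

payload = b"same code, different backend\n"

# In-memory pseudo-filesystem
memory_fs = fsspec.filesystem("memory")
from_memory = roundtrip(memory_fs, "/fsspec-demo/data.bin", payload)

# Local file system
local_fs = fsspec.filesystem("file")
with tempfile.TemporaryDirectory() as tmpdir:
    from_disk = roundtrip(local_fs, os.path.join(tmpdir, "data.bin"), payload)

print(from_memory == from_disk == payload)
```

Other backends are selected the same way by protocol name, usually with an additional package installed for each of them.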

DNA encoding of data

  • Repeating nucleotides are problematic
  • Using “Trits” referring to previous nucleotide
Previous   0   1   2
Thymine    A   C   G
Guanine    T   A   C
Cytosine   G   T   A
Adenine    C   G   T
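
The table can be applied directly in code: each trit selects the next nucleotide depending on the previous one, so the same nucleotide never occurs twice in a row. The sketch below encodes and decodes a list of trits; the starting nucleotide "A" and the function names are assumptions, and the conversion of bytes into trits is not covered.

```python
# Row: previous nucleotide; position in the string: trit 0, 1 or 2.
NEXT_NUCLEOTIDE = {
    "T": "ACG",
    "G": "TAC",
    "C": "GTA",
    "A": "CGT",
}

def encode(trits, previous="A"):
    """Turn a sequence of trits (0, 1, 2) into a nucleotide string."""
    strand = []
    for trit in trits:
        previous = NEXT_NUCLEOTIDE[previous][trit]
        strand.append(previous)
    return "".join(strand)

def decode(strand, previous="A"):
    """Recover the trits from a nucleotide string."""
    trits = []
    for nucleotide in strand:
        trits.append(NEXT_NUCLEOTIDE[previous].index(nucleotide))
        previous = nucleotide
    return trits

message = [0, 0, 0, 0, 2, 1, 1, 2]
strand = encode(message)
print(strand, decode(strand) == message)
```

Even a run of identical trits such as 0, 0, 0, 0 is written as the changing sequence CGTA.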

Further reading

  • John Harris, Remo Software: History of Storage from Cave Paintings to Electrons
    http://www.remosoftware.com/info/history-of-storage-from-cave-paintings-to-electrons/