rzsun blog about photography

Systems Notes: Storage

Personal notes on storage architectures.

Disk I/O

Two types of I/O on disks:

  1. Random
  2. Sequential

Random is much slower than sequential. On HDDs, the overhead is from disk head movement. On SSDs, overhead is reduced but exists due to still needing additional page reads.

LSM tree (log-structured merge-tree)

Tree like structure with multiple levels, used for write heavy cases, main optimization comes from sequential writes. Most commonly for key-value data.

L0 [Memtable keys a-z]
L1 [SST keys a-e] [SST keys f-p] [SST keys q-z]
L2 [SST keys a-c] [SST keys d-f] [SST keys g-p] [SST keys q-t] [SST keys u-z]
LN [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] [SST] ...


  • L0 “Memtable” stores most recently inserted data in memory
  • Implemented as a sorted data structure such as self balancing tree or skip list
  • Has a max size, once size is filled, the contents are flushed to disk (sequential writes).


  • Storage format for key-value pairs flushed to disk
  • Two sections, Index and Segments
  • Segments is organized by fixed-size blocks
  • Index is sparse and contains keys and offsets for faster reads
  • Can apply additional optimizations such bloom filter (faster read) and delta encoding (lower storage)


  • Looks for key until it finds it with from top to lowest level


  • When a level has reached capacity, compaction will merge level into lower level


Event streaming queue that takes advantage of sequential I/O. Producers write to it and consumers read from it. Runs on JVM.



  • a stream of messages of a certain type


  • design principle: “sequential disk access can in some cases be faster than random memory access!”
  • filesystem used for both storing and caching
  • storage: append only segment files
  • caching: uses kernel pagecache (disk)
  • cache to storage flush can be configured based on number of messages and duration


  • topic shard, across brokers
  • guaranteed in-order reads per partition by consumer
  • smallest unit of parallelism
  • smallest unit of replication


  • replicated log: logs are replicated from leader to follower

Leader Election

  • Kafka dynamically maintains a set of in-sync replicas (ISR) that are caught-up to the leader
  • any replica in the ISR is eligible to be elected leader


  • distance in bytes from the beginning of the log


  • client controls which partition, can implement load balancing
  • send is async for batching to reduce I/O ops


  • Consumers request messages by offset

Consumer Group

  • set of consumers that receive each message at least once
  • each partition is consumed by exactly one consumer within each subscribing consumer
  • one broker is group coordinator and manages offsets


Schema Design

Reference: https://cloud.google.com/spanner/docs/whitepapers/optimizing-schema-design

  • PRIMARY KEY organizes table lexicographically
  • root table (parent) -> interleaved table (child) (similar to snowflake schema)
    • large columns can also be stored in non-interleaved tables
  • partition happens automatically based on data locality of root table
  • prioritize desired locality for most important root entities and most common access patterns


  • external consistency: system behaves as if all transactions were executed sequentially