Stratum: High-Performance Columnar Analytics for Clojure

authors: Christian Weilbach

last change: 2026-02-17

Introduction

Stratum is a persistent columnar analytics engine built for the JVM, combining Clojure's immutability with Java Vector API SIMD acceleration. It provides:

  • Blazing fast queries — competitive with or faster than DuckDB on OLAP benchmarks
  • Clojure-native API — query maps, SQL strings, or tablecloth datasets
  • Copy-on-write semantics — fork datasets in O(1), structural sharing
  • Persistent storage — Konserve-backed with snapshot isolation and time-travel
  • SQL interface — PostgreSQL wire protocol (psql, JDBC, DBeaver, psycopg2)

Setup

(ns stratum-intro
  (:require [stratum.api :as st]
            [stratum.dataset :as dataset]    ;; transient mutations
            [stratum.storage :as storage]    ;; branch management
            [tablecloth.api :as tc]
            [tech.v3.dataset :as ds]
            [scicloj.kindly.v4.kind :as kind]))

1. Your Data, Your Way

Stratum works with column maps — maps from keywords to typed arrays. The same shape tablecloth uses under the hood.

(def orders
  {:product (into-array String ["Apple" "Banana" "Apple" "Cherry" "Banana" "Apple"])
   :qty     (double-array [10.0 15.0 8.0 5.0 20.0 12.0])
   :price   (double-array [1.2  0.5  1.2  2.5  0.5  1.2])})

DSL query — Clojure map with keywords:

(st/q {:from   orders
       :group  [:product]
       :agg    [[:sum :qty]
                [:avg :price]
                [:count]]})
[{:product "Apple", :_count 3, :sum 30.0, :avg 1.2, :count 3}
 {:product "Banana", :_count 2, :sum 35.0, :avg 0.5, :count 2}
 {:product "Cherry", :_count 1, :sum 5.0, :avg 2.5, :count 1}]

SQL string — same query, same result:

(st/q "SELECT product, SUM(qty), AVG(price), COUNT(*) FROM orders GROUP BY product"
      {"orders" orders})
[{:product "Apple", :_count 3, :sum 30.0, :avg 1.2, :count 3}
 {:product "Banana", :_count 2, :sum 35.0, :avg 0.5, :count 2}
 {:product "Cherry", :_count 1, :sum 5.0, :avg 2.5, :count 1}]

Predicates in the DSL accept both keyword and symbol style:

(st/q {:from  orders
       :where [[:> :price 0.6]]
       :agg   [[:sum :qty] [:count]]})
[{:_count 4, :sum 35.0, :count 4}]

2. Tablecloth Integration

Pass a tablecloth (or tech.ml.dataset) dataset directly — Stratum reads its columns with zero copy. Results come back as a sequence of maps by default.

(def tc-ds
  (tc/dataset {:category ["Electronics" "Clothing" "Electronics" "Clothing" "Electronics"]
               :revenue  (double-array [250.0 80.0 320.0 110.0 280.0])
               :qty      (long-array [1 2 1 3 2])}))

Query it like any Stratum source:

(st/q {:from  tc-ds
       :group [:category]
       :agg   [[:sum :revenue]
               [:sum :qty]
               [:count]]})
[{:category "Electronics",
  :_count 3,
  :sum_revenue 850.0,
  :sum_qty 4.0,
  :count 3}
 {:category "Clothing",
  :_count 2,
  :sum_revenue 190.0,
  :sum_qty 5.0,
  :count 2}]

SQL also works directly with tablecloth datasets:

(st/q "SELECT category, SUM(revenue) AS total FROM t WHERE qty > 1 GROUP BY category"
      {"t" tc-ds})
[{:category "Electronics", :_count 1, :total 280.0}
 {:category "Clothing", :_count 2, :total 190.0}]

3. Why It's Fast: Fused SIMD Execution

Stratum compiles filter + aggregate into a single pass using the Java Vector API (AVX-512 / AVX2 / NEON depending on CPU). No intermediate arrays, no row-by-row iteration. At 1M rows you'll see sub-millisecond aggregations.

(def large-ds
  {:amount (double-array (repeatedly 1000000 #(* (rand) 1000.0)))
   :region (long-array   (repeatedly 1000000 #(rand-int 10)))
   :flag   (long-array   (repeatedly 1000000 #(rand-int 2)))})

Single-pass fused filter + aggregate:

(time
  (st/q {:from  large-ds
         :where [[:> :amount 500.0] [:= :flag 1]]
         :agg   [[:sum :amount] [:count]]}))
OUT
"Elapsed time: 633.059803 msecs"
[{:_count 249871, :sum 1.872378534388573E8, :count 249871}]

Dense group-by (direct array indexing, no hash for ≤200K groups):

(time
  (st/q {:from  large-ds
         :group [:region]
         :agg   [[:sum :amount] [:avg :amount] [:count]]}))
OUT
"Elapsed time: 113.438972 msecs"
[{:region 0,
  :_count 99810,
  :sum 4.98691338161383E7,
  :avg 499.64065540665564,
  :count 99810}
 {:region 1,
  :_count 100345,
  :sum 5.022336075499722E7,
  :avg 500.5068588868127,
  :count 100345}
 {:region 2,
  :_count 99849,
  :sum 4.990947493019742E7,
  :avg 499.84952208031547,
  :count 99849}
 {:region 3,
  :_count 99731,
  :sum 4.987591717699816E7,
  :avg 500.1044527478734,
  :count 99731}
 {:region 4,
  :_count 100065,
  :sum 4.9989384241334185E7,
  :avg 499.5691224837274,
  :count 100065}
 {:region 5,
  :_count 100226,
  :sum 5.005925636449416E7,
  :avg 499.46377551228386,
  :count 100226}
 {:region 6,
  :_count 99627,
  :sum 4.9861620384415016E7,
  :avg 500.4830054544954,
  :count 99627}
 {:region 7,
  :_count 100136,
  :sum 5.00775462426446E7,
  :avg 500.09533277387357,
  :count 100136}
 {:region 8,
  :_count 99715,
  :sum 4.978032174562664E7,
  :avg 499.2260115892959,
  :count 99715}
 {:region 9,
  :_count 100496,
  :sum 5.020025729390712E7,
  :avg 499.5249292897938,
  :count 100496}]

4. Zone Map Pruning

For index-backed datasets, Stratum tracks min/max per 8192-row chunk. Range queries classify each chunk as: skip / stats-only / SIMD — so a predicate like ts >= 900000 only touches ~1% of chunks.

(def time-series
  (st/make-dataset
    {:ts    (st/index-from-seq :int64   (range 0 1000000))
     :value (st/index-from-seq :float64 (repeatedly 1000000 rand))}))

Only the last ~1% of chunks are scanned:

(time
  (st/q {:from  time-series
         :where [[:>= :ts 900000] [:< :ts 910000]]
         :agg   [[:sum :value] [:count]]}))
OUT
"Elapsed time: 14.440906 msecs"
[{:_count 10000, :sum 4935.906568791859, :count 10000}]

5. Persistence: Version Control for Analytics

Datasets are immutable Clojure values. st/sync! durably persists to a Konserve store (file, S3, memory …). st/fork is O(1) — structural sharing with copy-on-write on mutation.

(require '[konserve.store :as kstore])
(def store-cfg
  {:backend :file
   :path    "/tmp/stratum-intro"
   :id      #uuid "550e8400-e29b-41d4-a716-446655440000"})
(when (kstore/store-exists? store-cfg {:sync? true})
  (kstore/delete-store store-cfg {:sync? true}))
nil
(def store (kstore/create-store store-cfg {:sync? true}))

Create a dataset with index-backed columns (required for persistence):

(def base
  (st/make-dataset
    {:product (st/index-from-seq :int64   [101 102 103])
     :qty     (st/index-from-seq :float64 [100.0 50.0 75.0])
     :revenue (st/index-from-seq :float64 [1000.0 750.0 1125.0])}
    {:name "sales-q1"}))

Persist to "main" branch — returns new dataset with commit metadata:

(def v1 (st/sync! base store "main"))
(:id (:commit-info v1))
#uuid "4e4eef56-9cd5-4892-a960-3a7bf7fa3c66"

=> #uuid "..."

Fork for a what-if scenario. Fork is O(1) — all chunks shared. Mutations use the transient/persistent protocol (like Clojure's collections):

(def what-if
  (-> (st/fork base)
      transient
      (dataset/set-at!    :qty     0 150.0)      ;; optimistic product 101 sales
      (dataset/set-at!    :revenue 0 1500.0)
      (dataset/append! {:product 104 :qty 60.0 :revenue 900.0})
      persistent!))

Sync what-if to a separate branch:

(def v1-whatif (st/sync! what-if store "what-if"))
(storage/list-dataset-branches store)
#{"what-if" "main"}

=> #{"main" "what-if"}

Time-travel: load any version and query:

(let [main-ds   (st/load store "main")
      whatif-ds (st/load store "what-if")]
  {:main   (first (st/q {:from main-ds   :agg [[:sum :revenue]]}))
   :whatif (first (st/q {:from whatif-ds :agg [[:sum :revenue]]}))})
{:main {:sum 2875.0, :_count 3}, :whatif {:sum 4275.0, :_count 4}}

=> {:main {:sum 2875.0} :whatif {:sum 4275.0}}

GC: mark-and-sweep from all branch HEADs — unreachable chunks deleted:

(st/gc! store)
{:deleted-pss-nodes 0,
 :deleted-index-commits 0,
 :deleted-dataset-commits 0,
 :kept-pss-nodes 6,
 :kept-index-commits 6,
 :kept-dataset-commits 2}

6. Advanced: Statistics and Joins

VARIANCE, STDDEV, CORR — Welford's online algorithm, single pass in Java:

(def sensors
  {:temp     (double-array [20.1 20.5 21.0 19.8 20.3 22.0])
   :humidity (double-array [65.0 68.0 70.0 62.0 66.0 72.0])
   :sensor   (into-array String ["A" "A" "A" "B" "B" "B"])})
(st/q {:from  sensors
       :group [:sensor]
       :agg   [[:avg :temp]
               [:stddev :temp]
               [:corr :temp :humidity]]})
[{:sensor "A",
  :avg 20.533333333333335,
  :stddev 0.4509249752821684,
  :corr 0.984018104872232}
 {:sensor "B",
  :avg 20.7,
  :stddev 1.1532562594670808,
  :corr 0.9819805060619333}]

Hash joins — INNER, LEFT, RIGHT, FULL:

(def fact
  {:order-id   (long-array [1 2 3 4 5])
   :product-id (long-array [101 102 101 103 102])
   :qty        (double-array [2 1 3 1 2])})
(def dim
  {:product-id (long-array [101 102 103])
   :name       (into-array String ["Widget" "Gadget" "Gizmo"])
   :price      (double-array [10.0 20.0 15.0])})
(st/q {:from fact
       :join [{:with dim
               :on   [:= :product-id :product-id]
               :type :inner}]
       :group [:name]
       :agg   [[:sum :qty]]})
[{:name "Widget", :_count 2, :sum 5.0}
 {:name "Gadget", :_count 2, :sum 3.0}
 {:name "Gizmo", :_count 1, :sum 1.0}]

7. SQL Interface

Any PostgreSQL-compatible client works: psql, DBeaver, Python psycopg2, JDBC.

(comment
  (def srv (st/start-server {:port 5433}))
  (st/register-table! srv "orders" orders)

  ;; psql -h localhost -p 5433 -U stratum
  ;; SELECT product, SUM(qty) FROM orders GROUP BY product;

  (st/stop-server srv))

Performance Numbers (6M rows, 8-core Lunar Lake, vs DuckDB v1.5)

QueryStratum 1TStratum 8TDuckDB 1TDuckDB 8T
Filtered agg (TPC-H)12ms2.9ms34ms6.9ms
TPC-H Q161ms26ms102ms19ms
SSB Q1.112ms2.8ms35ms7.3ms
Filtered count2.1ms1.3ms14ms3.4ms
Group-by 10 keys12ms3.5ms28ms5.9ms
STDDEV group-by28ms19ms54ms30ms
CORR group-by53ms32ms72ms37ms

All benchmarks: single JVM process, G1GC, JDK 21, no JVM tuning.

source: notebooks/stratum_intro.clj