authors: Christian Weilbach
last change: 2026-02-17
Stratum is a persistent columnar analytics engine built for the JVM, combining Clojure's immutability with Java Vector API SIMD acceleration. It provides:
(ns stratum-intro
(:require [stratum.api :as st]
[stratum.dataset :as dataset] ;; transient mutations
[stratum.storage :as storage] ;; branch management
[tablecloth.api :as tc]
[tech.v3.dataset :as ds]
[scicloj.kindly.v4.kind :as kind]))Stratum works with column maps — maps from keywords to typed arrays. The same shape tablecloth uses under the hood.
(def orders
{:product (into-array String ["Apple" "Banana" "Apple" "Cherry" "Banana" "Apple"])
:qty (double-array [10.0 15.0 8.0 5.0 20.0 12.0])
:price (double-array [1.2 0.5 1.2 2.5 0.5 1.2])})DSL query — Clojure map with keywords:
(st/q {:from orders
:group [:product]
:agg [[:sum :qty]
[:avg :price]
[:count]]})[{:product "Apple", :_count 3, :sum 30.0, :avg 1.2, :count 3}
{:product "Banana", :_count 2, :sum 35.0, :avg 0.5, :count 2}
{:product "Cherry", :_count 1, :sum 5.0, :avg 2.5, :count 1}]
SQL string — same query, same result:
(st/q "SELECT product, SUM(qty), AVG(price), COUNT(*) FROM orders GROUP BY product"
{"orders" orders})[{:product "Apple", :_count 3, :sum 30.0, :avg 1.2, :count 3}
{:product "Banana", :_count 2, :sum 35.0, :avg 0.5, :count 2}
{:product "Cherry", :_count 1, :sum 5.0, :avg 2.5, :count 1}]
Predicates in the DSL accept both keyword and symbol style:
(st/q {:from orders
:where [[:> :price 0.6]]
:agg [[:sum :qty] [:count]]})[{:_count 4, :sum 35.0, :count 4}]
Pass a tablecloth (or tech.ml.dataset) dataset directly — Stratum reads its columns with zero copy. Results come back as a sequence of maps by default.
(def tc-ds
(tc/dataset {:category ["Electronics" "Clothing" "Electronics" "Clothing" "Electronics"]
:revenue (double-array [250.0 80.0 320.0 110.0 280.0])
:qty (long-array [1 2 1 3 2])}))Query it like any Stratum source:
(st/q {:from tc-ds
:group [:category]
:agg [[:sum :revenue]
[:sum :qty]
[:count]]})[{:category "Electronics",
:_count 3,
:sum_revenue 850.0,
:sum_qty 4.0,
:count 3}
{:category "Clothing",
:_count 2,
:sum_revenue 190.0,
:sum_qty 5.0,
:count 2}]
SQL also works directly with tablecloth datasets:
(st/q "SELECT category, SUM(revenue) AS total FROM t WHERE qty > 1 GROUP BY category"
{"t" tc-ds})[{:category "Electronics", :_count 1, :total 280.0}
{:category "Clothing", :_count 2, :total 190.0}]
Stratum compiles filter + aggregate into a single pass using the Java Vector API (AVX-512 / AVX2 / NEON depending on CPU). No intermediate arrays, no row-by-row iteration. At 1M rows you'll see sub-millisecond aggregations.
(def large-ds
{:amount (double-array (repeatedly 1000000 #(* (rand) 1000.0)))
:region (long-array (repeatedly 1000000 #(rand-int 10)))
:flag (long-array (repeatedly 1000000 #(rand-int 2)))})Single-pass fused filter + aggregate:
(time
(st/q {:from large-ds
:where [[:> :amount 500.0] [:= :flag 1]]
:agg [[:sum :amount] [:count]]}))"Elapsed time: 633.059803 msecs"
[{:_count 249871, :sum 1.872378534388573E8, :count 249871}]
Dense group-by (direct array indexing, no hash for ≤200K groups):
(time
(st/q {:from large-ds
:group [:region]
:agg [[:sum :amount] [:avg :amount] [:count]]}))"Elapsed time: 113.438972 msecs"
[{:region 0,
:_count 99810,
:sum 4.98691338161383E7,
:avg 499.64065540665564,
:count 99810}
{:region 1,
:_count 100345,
:sum 5.022336075499722E7,
:avg 500.5068588868127,
:count 100345}
{:region 2,
:_count 99849,
:sum 4.990947493019742E7,
:avg 499.84952208031547,
:count 99849}
{:region 3,
:_count 99731,
:sum 4.987591717699816E7,
:avg 500.1044527478734,
:count 99731}
{:region 4,
:_count 100065,
:sum 4.9989384241334185E7,
:avg 499.5691224837274,
:count 100065}
{:region 5,
:_count 100226,
:sum 5.005925636449416E7,
:avg 499.46377551228386,
:count 100226}
{:region 6,
:_count 99627,
:sum 4.9861620384415016E7,
:avg 500.4830054544954,
:count 99627}
{:region 7,
:_count 100136,
:sum 5.00775462426446E7,
:avg 500.09533277387357,
:count 100136}
{:region 8,
:_count 99715,
:sum 4.978032174562664E7,
:avg 499.2260115892959,
:count 99715}
{:region 9,
:_count 100496,
:sum 5.020025729390712E7,
:avg 499.5249292897938,
:count 100496}]
For index-backed datasets, Stratum tracks min/max per 8192-row chunk. Range queries classify each chunk as: skip / stats-only / SIMD — so a predicate like ts >= 900000 only touches ~1% of chunks.
(def time-series
(st/make-dataset
{:ts (st/index-from-seq :int64 (range 0 1000000))
:value (st/index-from-seq :float64 (repeatedly 1000000 rand))}))Only the last ~1% of chunks are scanned:
(time
(st/q {:from time-series
:where [[:>= :ts 900000] [:< :ts 910000]]
:agg [[:sum :value] [:count]]}))"Elapsed time: 14.440906 msecs"
[{:_count 10000, :sum 4935.906568791859, :count 10000}]
Datasets are immutable Clojure values. st/sync! durably persists to a Konserve store (file, S3, memory …). st/fork is O(1) — structural sharing with copy-on-write on mutation.
(require '[konserve.store :as kstore])(def store-cfg
{:backend :file
:path "/tmp/stratum-intro"
:id #uuid "550e8400-e29b-41d4-a716-446655440000"})(when (kstore/store-exists? store-cfg {:sync? true})
(kstore/delete-store store-cfg {:sync? true}))nil
(def store (kstore/create-store store-cfg {:sync? true}))Create a dataset with index-backed columns (required for persistence):
(def base
(st/make-dataset
{:product (st/index-from-seq :int64 [101 102 103])
:qty (st/index-from-seq :float64 [100.0 50.0 75.0])
:revenue (st/index-from-seq :float64 [1000.0 750.0 1125.0])}
{:name "sales-q1"}))Persist to "main" branch — returns new dataset with commit metadata:
(def v1 (st/sync! base store "main"))(:id (:commit-info v1))#uuid "4e4eef56-9cd5-4892-a960-3a7bf7fa3c66"
=> #uuid "..."
Fork for a what-if scenario. Fork is O(1) — all chunks shared. Mutations use the transient/persistent protocol (like Clojure's collections):
(def what-if
(-> (st/fork base)
transient
(dataset/set-at! :qty 0 150.0) ;; optimistic product 101 sales
(dataset/set-at! :revenue 0 1500.0)
(dataset/append! {:product 104 :qty 60.0 :revenue 900.0})
persistent!))Sync what-if to a separate branch:
(def v1-whatif (st/sync! what-if store "what-if"))(storage/list-dataset-branches store)#{"what-if" "main"}
=> #{"main" "what-if"}
Time-travel: load any version and query:
(let [main-ds (st/load store "main")
whatif-ds (st/load store "what-if")]
{:main (first (st/q {:from main-ds :agg [[:sum :revenue]]}))
:whatif (first (st/q {:from whatif-ds :agg [[:sum :revenue]]}))}){:main {:sum 2875.0, :_count 3}, :whatif {:sum 4275.0, :_count 4}}
=> {:main {:sum 2875.0} :whatif {:sum 4275.0}}
GC: mark-and-sweep from all branch HEADs — unreachable chunks deleted:
(st/gc! store){:deleted-pss-nodes 0,
:deleted-index-commits 0,
:deleted-dataset-commits 0,
:kept-pss-nodes 6,
:kept-index-commits 6,
:kept-dataset-commits 2}
VARIANCE, STDDEV, CORR — Welford's online algorithm, single pass in Java:
(def sensors
{:temp (double-array [20.1 20.5 21.0 19.8 20.3 22.0])
:humidity (double-array [65.0 68.0 70.0 62.0 66.0 72.0])
:sensor (into-array String ["A" "A" "A" "B" "B" "B"])})(st/q {:from sensors
:group [:sensor]
:agg [[:avg :temp]
[:stddev :temp]
[:corr :temp :humidity]]})[{:sensor "A",
:avg 20.533333333333335,
:stddev 0.4509249752821684,
:corr 0.984018104872232}
{:sensor "B",
:avg 20.7,
:stddev 1.1532562594670808,
:corr 0.9819805060619333}]
Hash joins — INNER, LEFT, RIGHT, FULL:
(def fact
{:order-id (long-array [1 2 3 4 5])
:product-id (long-array [101 102 101 103 102])
:qty (double-array [2 1 3 1 2])})(def dim
{:product-id (long-array [101 102 103])
:name (into-array String ["Widget" "Gadget" "Gizmo"])
:price (double-array [10.0 20.0 15.0])})(st/q {:from fact
:join [{:with dim
:on [:= :product-id :product-id]
:type :inner}]
:group [:name]
:agg [[:sum :qty]]})[{:name "Widget", :_count 2, :sum 5.0}
{:name "Gadget", :_count 2, :sum 3.0}
{:name "Gizmo", :_count 1, :sum 1.0}]
Any PostgreSQL-compatible client works: psql, DBeaver, Python psycopg2, JDBC.
(comment
(def srv (st/start-server {:port 5433}))
(st/register-table! srv "orders" orders)
;; psql -h localhost -p 5433 -U stratum
;; SELECT product, SUM(qty) FROM orders GROUP BY product;
(st/stop-server srv))| Query | Stratum 1T | Stratum 8T | DuckDB 1T | DuckDB 8T |
|---|---|---|---|---|
| Filtered agg (TPC-H) | 12ms | 2.9ms | 34ms | 6.9ms |
| TPC-H Q1 | 61ms | 26ms | 102ms | 19ms |
| SSB Q1.1 | 12ms | 2.8ms | 35ms | 7.3ms |
| Filtered count | 2.1ms | 1.3ms | 14ms | 3.4ms |
| Group-by 10 keys | 12ms | 3.5ms | 28ms | 5.9ms |
| STDDEV group-by | 28ms | 19ms | 54ms | 30ms |
| CORR group-by | 53ms | 32ms | 72ms | 37ms |
All benchmarks: single JVM process, G1GC, JDK 21, no JVM tuning.
source: notebooks/stratum_intro.clj