A high-performance HTTP/1.1 and WebSocket load generator built on Linux io_uring.
Official load generator for Http Arena
An extremely fast HTTP load generator. Built on io_uring's asynchronous I/O and zero-copy receive to maximize requests per second from a single machine.
Per-request latency tracking with microsecond-resolution histograms via CLOCK_MONOTONIC. Every response is recorded — percentiles are exact, not estimated.
Pass --tui for a rich terminal interface with live progress, throughput graph, and colored results.
During execution, the TUI displays a progress bar, real-time throughput stats, and a sparkline graph showing req/s over time. Updates every second.
Percentile latencies displayed in a clean box-drawn table with color coding: cyan for normal, yellow for p99, red for p99.9.
When using -r N, each connection closes and reconnects after N request/response pairs. The results show the total reconnect count and confirm that every response was latency-sampled. In this example, 28.97M responses across 10 req/conn = ~2.9M reconnects.
The histogram automatically zooms into your data range. Bucket boundaries are computed from p0 to p99.9 of the actual latency distribution, divided into equal-width slices. Control granularity with -b.
Simple CLI interface. The only required argument is the target URL.
| Flag | Default | Description |
|---|---|---|
| <url> | required | Target URL. Only http:// is supported (no TLS). Can be omitted if --raw files contain a Host header. |
| -c <N> | 100 | Total number of concurrent TCP connections, distributed evenly across worker threads. |
| -t <N> | 1 | Number of worker threads. Each runs an independent io_uring event loop. Typically set to the number of CPU cores. |
| -d <duration> | 10s | Benchmark duration. Accepts seconds (5s) or minutes (1m). A 100ms warmup is excluded from results. |
| -p <N> | 1 | HTTP pipeline depth (max 64). Sends N requests per connection before waiting for responses. Higher values increase throughput but raise latency. |
| -r <N> | unlimited | Requests per connection. After N request/response pairs complete, the connection is closed and reopened. Useful for testing connection handling. Default is keep-alive forever. |
| -s <code> | 200 | Expected HTTP status code. Glass Cannon warns and exits with code 1 if a significant portion of responses don't match this status class. |
| --raw <files> | (none) | Comma-separated list of raw HTTP request files. Each file contains a complete HTTP request including headers. Connections rotate through templates on reconnect. |
| --ws | off | WebSocket echo mode. Performs an HTTP upgrade, then sends and receives WebSocket frames. |
| --ws-msg <text> | hello | Custom WebSocket message payload. Implies --ws. |
| --tui | off | Enable TUI mode. Shows a live progress bar with throughput sparkline during execution, and colored table-formatted results. |
| -b <N> | 10 | Number of latency histogram buckets in TUI mode (max 100). Buckets are computed adaptively from the actual data range, not fixed thresholds. |
| --cqe-latency | off | Measure latency at io_uring CQE arrival (when the kernel signals data is ready) instead of after the full HTTP response is parsed. Gives a lower-level "time to first data" measurement that excludes userspace parse time. |
Latency numbers are only useful if they're measured correctly. Glass Cannon tracks per-request latency with microsecond resolution using a two-tier histogram.
### CLOCK_MONOTONIC
All timestamps use the Linux kernel's CLOCK_MONOTONIC via clock_gettime(). This is a kernel-maintained clock that counts nanoseconds since an arbitrary point (usually boot). Unlike wall-clock time (CLOCK_REALTIME), it is immune to NTP adjustments, leap seconds, and manual time changes — it only moves forward at a steady rate. This makes it the correct clock for measuring elapsed durations. On modern x86_64, clock_gettime(CLOCK_MONOTONIC) reads the TSC register via the kernel's vDSO, so it doesn't even require a syscall — it's a fast userspace read with nanosecond resolution.
When a batch of pipelined requests is dispatched via io_uring, clock_gettime(CLOCK_MONOTONIC) captures a nanosecond timestamp. This timestamp is stored for each request in the batch in a per-connection circular buffer (send_times[]). Since all requests in a pipeline batch go out in a single io_uring_prep_send(), they share the same dispatch timestamp — this measures latency from application dispatch, not individual wire time.
When the HTTP response parser (picohttpparser) completes parsing a response, a second clock_gettime(CLOCK_MONOTONIC) call captures the arrival time. The latency for that request is arrival − send_times[oldest], measured in microseconds. The oldest timestamp is consumed from the circular buffer, matching responses to requests in FIFO order.
The latency sample is recorded in a two-tier histogram. Tier 1 covers 0–10ms at 1μs resolution (10,000 buckets) — this is where most samples land for fast servers. Tier 2 covers 10ms–5s at 100μs resolution (49,900 buckets). Anything above 5s goes into an overflow counter. This gives exact percentile calculations without storing individual samples or doing any heap allocation.
After the benchmark completes, histograms from all worker threads are merged by summing corresponding bucket counts. Percentiles (p50, p90, p99, p99.9) are computed by walking the merged histogram and finding the bucket where the cumulative count crosses the target threshold. This is an exact calculation, not a statistical estimate.
Many benchmarking tools report only averages or sample a fraction of requests. Averages hide tail latency — your p99 could be 100x your average, and you'd never know. Glass Cannon records every single response in a fixed-memory histogram, so percentiles are exact, not estimated.
The two-tier histogram uses ~2MB per worker and requires zero allocations during the benchmark. The sub-10ms tier has 1μs precision, which matters when your server responds in tens of microseconds. And because CLOCK_MONOTONIC reads via vDSO, the timestamping itself adds negligible overhead — no syscall, no context switch, just a fast register read.
The tool itself must not be the bottleneck. If your load generator maxes out before your server does, you're measuring the wrong thing.
| Feature | Glass Cannon | wrk | hey | bombardier |
|---|---|---|---|---|
| I/O model | io_uring | epoll | goroutines | goroutines |
| Zero-copy recv | Yes | No | No | No |
| Pipelined requests | Up to 64 | No | No | No |
| Cross-thread sync | None | Minimal | Channels | Channels |
| Syscalls per request | ~0 (batched) | 2+ per request | 2+ per request | 2+ per request |
| Latency histogram | μs resolution | HdrHistogram | Summary only | HdrHistogram |
| Memory allocations in hot path | Zero | Minimal | GC pressure | GC pressure |
| WebSocket support | Yes | No | No | No |
io_uring processes thousands of operations per kernel transition. With epoll, each send/recv is a separate syscall — at 1M req/s that's 2M+ context switches per second.
The IORING_SETUP_SINGLE_ISSUER and IORING_SETUP_DEFER_TASKRUN setup flags tell the kernel that only one thread submits per ring, and to defer completion work until the next submission. This eliminates internal locking and reduces interrupts.
Written in C with all buffers pre-allocated. No GC pauses, no allocation jitter. The benchmark runs at constant memory usage from start to finish.
Glass Cannon uses a fundamentally different approach to I/O than traditional load generators. Instead of one thread per connection or async callbacks, it talks directly to the Linux kernel's io_uring interface.
Most load generators (wrk, hey, ab) use epoll — the application asks the kernel "which sockets are ready?", then makes individual read() and write() system calls for each one. Every syscall means a context switch between your program and the kernel.
io_uring uses two shared memory ring buffers between your program and the kernel. You write requests into the submission queue, and the kernel writes results into the completion queue. No system calls per operation. The kernel processes batches of I/O while your code processes batches of results.
The main thread spawns N worker threads. Each worker owns an independent io_uring ring, a set of connections, and pre-built request buffers. There is zero communication between workers during the benchmark — no mutexes, no atomics, no shared state.
Pre-registered memory that the kernel writes recv data into directly. No copying from kernel space to user space.
One submission arms continuous receive on a socket. The kernel keeps delivering data without repeated requests.
N copies of the HTTP request are pre-built and concatenated at startup. One send call pushes the entire pipeline.
Each connection has a generation counter packed into io_uring user-data. Stale completions from reconnected sockets are safely ignored.