Profiling Mode
Ox Content ships a built-in profiler — ox_content_profiler — for chasing down
allocations and time in the Markdown engine. It is intentionally tuned for
"how much is this code actually doing?" rather than wall-clock benchmarking,
which is what criterion and the JS benchmark harness in benchmarks/ cover.
Use the profiler when you want to answer questions like:
How many allocations does parsing this file produce, and from which spans?
Which block-level parser function dominates time for documents in our corpus?
Did this change actually reduce allocations, or just trade them around?
What it measures
There are three independent layers, all exposed through one CLI:
Counting global allocator (ox_content_profiler::CountingAllocator) wraps std::alloc::System and atomically records every allocation, deallocation, byte counter, peak live bytes, and a power-of-two size-class histogram. Installed as #[global_allocator] in the ox-content-profile binary, so the counts include everything the process does during a measurement window.
Hierarchical timing spans (ox_content_profiler::scope) maintain a thread-local span stack with self / inclusive time aggregation, plus per-span allocation deltas. The parser and renderer crates each have a profile Cargo feature that swaps in real profile_span! guards at their hot block-level entry points (parse_block, parse_html_block, visit_heading, write_escaped, etc.). With the feature disabled (the default), profile_span! expands to a zero-sized binding the optimizer drops.
Report formatting (ox_content_profiler::Report) folds per-iteration records into percentile timings, allocation summaries, span breakdown, and a histogram. Renders as a monospace table or a single-line JSON document for CI consumption.
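To make the first layer concrete, here is a minimal sketch of a counting allocator that wraps std::alloc::System. It is not the ox_content_profiler source: the type SketchCountingAlloc, its field names, and the size_class helper are assumptions made for illustration, and the real crate may organize its counters differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

const CLASSES: usize = 32;
// A const item lets the array repeat expression below build non-Copy atomics.
const ZERO: AtomicUsize = AtomicUsize::new(0);

/// Hypothetical counting allocator; every name here is illustrative.
pub struct SketchCountingAlloc {
    pub allocs: AtomicUsize,
    pub deallocs: AtomicUsize,
    pub bytes: AtomicUsize,
    pub live: AtomicUsize,
    pub peak: AtomicUsize,
    pub classes: [AtomicUsize; CLASSES],
}

impl SketchCountingAlloc {
    pub const fn new() -> Self {
        Self {
            allocs: AtomicUsize::new(0),
            deallocs: AtomicUsize::new(0),
            bytes: AtomicUsize::new(0),
            live: AtomicUsize::new(0),
            peak: AtomicUsize::new(0),
            classes: [ZERO; CLASSES],
        }
    }
}

/// Bucket index: exponent of the smallest power of two >= size, clamped.
pub fn size_class(size: usize) -> usize {
    size.max(1)
        .checked_next_power_of_two()
        .map_or(CLASSES - 1, |p| p.trailing_zeros() as usize)
        .min(CLASSES - 1)
}

unsafe impl GlobalAlloc for SketchCountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // SAFETY: the caller upholds GlobalAlloc's contract for `layout`.
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let size = layout.size();
            self.allocs.fetch_add(1, Relaxed);
            self.bytes.fetch_add(size, Relaxed);
            self.classes[size_class(size)].fetch_add(1, Relaxed);
            // Track live bytes and remember the highest value seen.
            let live_now = self.live.fetch_add(size, Relaxed) + size;
            self.peak.fetch_max(live_now, Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // SAFETY: `ptr` came from `alloc` above with this same layout.
        unsafe { System.dealloc(ptr, layout) };
        self.deallocs.fetch_add(1, Relaxed);
        self.live.fetch_sub(layout.size(), Relaxed);
    }
}

// To count process-wide, a binary would install it like this:
// #[global_allocator]
// static GLOBAL: SketchCountingAlloc = SketchCountingAlloc::new();
```

The sketch only touches atomics inside alloc and dealloc, because an allocator that itself allocates would recurse. Relaxed ordering is enough for counters that are read after the workload finishes.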
CLI quick start
The CLI lives in crates/ox_content_profile_cli and enables the profile
feature of both ox_content_parser and ox_content_renderer:
# Pipeline (parse + render) over the embedded corpus
cargo run --release -p ox_content_profile_cli -- pipeline
# Profile a specific file, GFM-enabled, with 200 measured iterations
cargo run --release -p ox_content_profile_cli -- \
  pipeline --gfm --iters 200 --warmup 20 \
  docs/content/api/types.md
# Parse only — useful for isolating parser work
cargo run --release -p ox_content_profile_cli -- parse path/to/file.md
# Render only — input is parsed once outside the measurement loop
cargo run --release -p ox_content_profile_cli -- render path/to/file.md
# Machine-readable output for diffing in CI
cargo run --release -p ox_content_profile_cli -- pipeline --json path/to/file.md
Always build with --release: the macro-expanded profile_span! guards are
cheap, but in a debug build they dominate the actual work.
Reading the report
Timing
  min           15.50 µs
  p50           35.83 µs
  p95           38.38 µs
  ...
  throughput    680.30 MB/s
Allocations (per iteration)
  count         46.0
  bytes         57.01 KB
  peak (max)    46.25 KB
  largest       90.00 KB
Spans (sorted by total inclusive time)
  name                        hits    self      inclusive  share   allocs  bytes
  parser::parse_html_block    7600    3.56 ms   3.56 ms    55.4%   0       0 B
  ...
Timing percentiles are computed over --iters iterations after dropping the first --warmup to discard cold-cache effects.
Allocations per iteration are the mean count + bytes over those iterations, plus the maximum peak live bytes any single iteration reached above its starting baseline.
Spans are aggregated across all measured iterations. self is inclusive minus child-span inclusive time, so it's the time the function spends in its own body. share is the span's self time as a fraction of the total self time across all spans, which is a quick proxy for "fraction of CPU spent here."
Size-class histogram shows the last iteration's allocations bucketed by power-of-two size. Useful for spotting spikes in small short-lived allocations.
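The arithmetic behind the timing and span columns can be sketched in a few lines. This is an illustration, not the Report implementation: the function and struct names are hypothetical, and the nearest-rank percentile method is an assumption (the profiler may interpolate instead).

```rust
/// Nearest-rank percentile over the samples that remain after warmup.
/// `p` is a whole-number percentile such as 50 or 95.
pub fn percentile_after_warmup(samples_ns: &[u64], warmup: usize, p: usize) -> Option<u64> {
    // Drop the first `warmup` samples; None if warmup exceeds the sample count.
    let measured = samples_ns.get(warmup..)?;
    if measured.is_empty() {
        return None;
    }
    let mut sorted = measured.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    // Integer ceil(p * n / 100), clamped into 1..=n.
    let rank = ((p.min(100) * n + 99) / 100).clamp(1, n);
    Some(sorted[rank - 1])
}

/// Hypothetical aggregate for one span across all measured iterations.
pub struct SpanTotals {
    pub name: &'static str,
    pub inclusive_ns: u64,
    /// Sum of the inclusive time of this span's direct children.
    pub children_ns: u64,
}

impl SpanTotals {
    /// self = inclusive minus child-span inclusive time.
    pub fn self_ns(&self) -> u64 {
        self.inclusive_ns.saturating_sub(self.children_ns)
    }
}

/// share = a span's self time divided by the total self time of all spans.
pub fn shares(spans: &[SpanTotals]) -> Vec<(&'static str, f64)> {
    let total: u64 = spans.iter().map(SpanTotals::self_ns).sum();
    spans
        .iter()
        .map(|s| {
            let share = if total == 0 {
                0.0
            } else {
                s.self_ns() as f64 / total as f64
            };
            (s.name, share)
        })
        .collect()
}
```

Because share is built from self time rather than inclusive time, the shares of all spans sum to 1 and a parent is not charged for work done in its children.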
Profile-feature anatomy
The instrumentation hooks are gated on a Cargo feature per crate. To profile a different consumer of the parser/renderer:
[dependencies]
ox_content_parser = { workspace = true, features = ["profile"] }
ox_content_renderer = { workspace = true, features = ["profile"] }
ox_content_profiler = { workspace = true }
Inside your binary, install the global allocator and enable both layers before the workload, then drain the report:
use ox_content_profiler::{CountingAllocator, Recorder, scope};

// Route every allocation in this process through the counting allocator.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator::new();

fn main() {
    // Enable both layers before the workload runs.
    CountingAllocator::enable();
    scope::enable();

    let mut recorder = Recorder::new("my-workload");
    for _ in 0..100 {
        // Each call records one measured iteration.
        recorder.record(|| {
            // ...exercise parser + renderer...
        });
    }

    let report = recorder.finish();
    println!("{}", report.render_table());
}
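For intuition about the span layer, the following is a minimal sketch of a thread-local span stack with RAII guards that aggregates hits, inclusive time, and self time. It is an assumption-laden illustration rather than the ox_content_profiler::scope code: enter, SpanGuard, SpanStats, and stats are hypothetical names, and the sketch leaves out the per-span allocation deltas and the enable switch.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Hypothetical per-span aggregate.
#[derive(Default, Clone, Copy, Debug)]
pub struct SpanStats {
    pub hits: u64,
    pub inclusive: Duration,
    pub self_time: Duration,
}

/// One open span on the current thread's stack.
struct Frame {
    name: &'static str,
    start: Instant,
    /// Inclusive time of children that have already closed.
    children: Duration,
}

thread_local! {
    static STACK: RefCell<Vec<Frame>> = RefCell::new(Vec::new());
    static STATS: RefCell<HashMap<&'static str, SpanStats>> = RefCell::new(HashMap::new());
}

/// Closes its span when dropped.
pub struct SpanGuard(());

/// Opens a span; keep the returned guard alive for the span's duration.
pub fn enter(name: &'static str) -> SpanGuard {
    STACK.with(|s| {
        s.borrow_mut().push(Frame {
            name,
            start: Instant::now(),
            children: Duration::ZERO,
        })
    });
    SpanGuard(())
}

impl Drop for SpanGuard {
    fn drop(&mut self) {
        STACK.with(|s| {
            let mut stack = s.borrow_mut();
            let Some(frame) = stack.pop() else { return };
            let inclusive = frame.start.elapsed();
            // Charge this span's inclusive time to its parent's child total.
            if let Some(parent) = stack.last_mut() {
                parent.children += inclusive;
            }
            STATS.with(|st| {
                let mut st = st.borrow_mut();
                let entry = st.entry(frame.name).or_default();
                entry.hits += 1;
                entry.inclusive += inclusive;
                entry.self_time += inclusive.saturating_sub(frame.children);
            });
        });
    }
}

/// Reads the aggregate for one span name on the current thread.
pub fn stats(name: &str) -> Option<SpanStats> {
    STATS.with(|st| st.borrow().get(name).copied())
}
```

A caller would write `let _span = enter("parse_block");` at the top of a function. Note that a recursive span in this sketch adds its nested inclusive time more than once, which a production implementation would need to handle.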
Suggested workflow for performance work
Run the profiler against a representative corpus before changing anything. Save the table or JSON.
Find the highest-share span that is not obviously memory-bandwidth bound. Look at its allocs and bytes columns: span-level allocations are usually low-hanging fruit.
Make a change that targets that span.
Re-run the profiler with the same flags and compare span counts, allocations, and tail percentiles.
Run cargo bench -p ox_content_parser to confirm the synthetic benchmarks haven't regressed.
This was the loop used to land issue #159:
the first run on docs/content/api/types.md showed parse_html_block
consuming 86.9% of pipeline time with to_ascii_lowercase() allocating
per line; replacing that with a byte-level case-insensitive search and
inlining consume_line's newline scan moved the same file from
240 MB/s → 803 MB/s end-to-end while cutting per-iteration allocations
from 122 → 32.
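The comparison step of the workflow, and the question of whether a change reduced allocations or just moved them, can be sketched as a per-span diff. The helper names below are hypothetical, and the sketch assumes you have already extracted a span-name to allocation-count map from each run's report by whatever means you prefer; it does not parse the profiler's JSON.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Per-span change in allocation count (after minus before), sorted by name.
/// A span missing from one run is treated as having zero allocations there.
pub fn diff_allocs<'a>(
    before: &BTreeMap<&'a str, u64>,
    after: &BTreeMap<&'a str, u64>,
) -> Vec<(&'a str, i64)> {
    // Union of span names from both runs.
    let names: BTreeSet<&'a str> = before.keys().chain(after.keys()).copied().collect();
    names
        .into_iter()
        .map(|name| {
            let b = before.get(name).copied().unwrap_or(0) as i64;
            let a = after.get(name).copied().unwrap_or(0) as i64;
            (name, a - b)
        })
        .collect()
}

/// Sum of all per-span changes; negative means fewer allocations overall.
pub fn net_change(deltas: &[(&str, i64)]) -> i64 {
    deltas.iter().map(|(_, d)| d).sum()
}
```

With made-up inputs before = {a: 10, b: 5} and after = {a: 2, c: 8}, the deltas are a: -8, b: -5, c: +8 and the net change is -5. Allocations left a and b, but most of them reappeared in c, which is the "traded around" case the per-span view is meant to expose.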