Profiling Mode

Ox Content ships a built-in profiler — ox_content_profiler — for chasing down allocations and time in the Markdown engine. It is intentionally tuned for "how much is this code actually doing?" rather than wall-clock benchmarking, which is what criterion and the JS benchmark harness in benchmarks/ cover.

Use the profiler when you want to answer questions like:

How many allocations does parsing this file produce, and from which spans?
Which block-level parser function dominates time for documents in our corpus?
Did this change actually reduce allocations, or just trade them around?

What it measures

There are three independent layers, all exposed through one CLI:

Counting global allocator (ox_content_profiler::CountingAllocator) wraps std::alloc::System and atomically tracks allocation and deallocation counts, byte totals, peak live bytes, and a power-of-two size-class histogram. It is installed as #[global_allocator] in the ox-content-profile binary, so the counts include everything the process does during a measurement window.
Hierarchical timing spans (ox_content_profiler::scope) maintain a thread-local span stack with self / inclusive time aggregation, plus per-span allocation deltas. The parser, renderer, and docs-generator crates each have a profile Cargo feature that swaps in real profile_span! guards at their hot entry points — parse_block, parse_html_block, visit_heading, write_escaped in the Markdown engine, and docs::oxc_parse, docs::parse_jsdoc, docs::visit_ast, docs::render_entry_page, etc. in the JS/TS docs generator. With the feature disabled (the default), profile_span! expands to a zero-sized binding the optimizer drops.
Report formatting (ox_content_profiler::Report) folds per-iteration records into percentile timings, allocation summaries, span breakdown, and a histogram. Renders as a monospace table or a single-line JSON document for CI consumption.
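To make the first layer concrete, here is a minimal sketch of a counting allocator that wraps std::alloc::System. It is not the source of ox_content_profiler::CountingAllocator: the SketchAllocator type, its field names, and the size_class helper are hypothetical, and the real crate may bucket sizes or order its atomics differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering::Relaxed};

/// One bucket per power of two that a `usize` size can round up to.
const CLASSES: usize = usize::BITS as usize + 1;

/// Index of the smallest power of two that is >= `size` (0 and 1 share bucket 0).
pub fn size_class(size: usize) -> usize {
    (usize::BITS - size.saturating_sub(1).leading_zeros()) as usize
}

/// Hypothetical stand-in for a counting allocator; not the crate's real type.
pub struct SketchAllocator {
    pub allocs: AtomicUsize,
    pub deallocs: AtomicUsize,
    pub bytes: AtomicUsize,
    pub live: AtomicUsize,
    pub peak: AtomicUsize,
    pub classes: [AtomicUsize; CLASSES],
}

impl SketchAllocator {
    pub const fn new() -> Self {
        // A `const` item can be repeated to initialise an array of atomics.
        const ZERO: AtomicUsize = AtomicUsize::new(0);
        Self {
            allocs: ZERO,
            deallocs: ZERO,
            bytes: ZERO,
            live: ZERO,
            peak: ZERO,
            classes: [ZERO; CLASSES],
        }
    }
}

unsafe impl GlobalAlloc for SketchAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let size = layout.size();
            self.allocs.fetch_add(1, Relaxed);
            self.bytes.fetch_add(size, Relaxed);
            // `fetch_add` returns the previous value, so add `size` back
            // to get the live total that includes this allocation.
            let live = self.live.fetch_add(size, Relaxed) + size;
            self.peak.fetch_max(live, Relaxed);
            self.classes[size_class(size)].fetch_add(1, Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        self.deallocs.fetch_add(1, Relaxed);
        self.live.fetch_sub(layout.size(), Relaxed);
    }
}

#[global_allocator]
static GLOBAL: SketchAllocator = SketchAllocator::new();
```

Because the allocator is process-wide, a measurement window is usually expressed as a difference: read the counters before the workload, read them again afterwards, and subtract.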

CLI quick start

The CLI lives in crates/ox_content_profile_cli and builds both ox_content_parser and ox_content_renderer with their profile feature enabled:

# Pipeline (parse + render) over the embedded corpus
cargo run --release -p ox_content_profile_cli -- pipeline

# Profile a specific file, GFM-enabled, with 200 measured iterations
cargo run --release -p ox_content_profile_cli -- \
    pipeline --gfm --iters 200 --warmup 20 \
    docs/content/api/types.md

# Parse only — useful for isolating parser work
cargo run --release -p ox_content_profile_cli -- parse path/to/file.md

# Render only — input is parsed once outside the measurement loop
cargo run --release -p ox_content_profile_cli -- render path/to/file.md

# Machine-readable output for diffing in CI
cargo run --release -p ox_content_profile_cli -- pipeline --json path/to/file.md

Always build with --release: the macro-expanded profile_span! guards are cheap, but in a debug build their overhead dominates the actual work.
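To make the --warmup and --iters flags concrete, the following sketch shows one plausible shape for a measurement loop: run the workload a few times unmeasured, time the remaining iterations, and read percentiles off the sorted samples. The measure and percentile functions are hypothetical names, and the nearest-rank percentile rule is an assumption; the CLI may interpolate differently.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `warmup` times unmeasured, then `iters` times measured.
/// Returns the measured samples sorted in ascending order.
pub fn measure<F: FnMut()>(warmup: usize, iters: usize, mut work: F) -> Vec<Duration> {
    for _ in 0..warmup {
        work();
    }
    let mut samples = Vec::with_capacity(iters);
    for _ in 0..iters {
        let start = Instant::now();
        work();
        samples.push(start.elapsed());
    }
    samples.sort();
    samples
}

/// Nearest-rank percentile over an ascending-sorted slice.
/// `p` is given in percent, for example 50.0 or 95.0.
pub fn percentile(sorted: &[Duration], p: f64) -> Option<Duration> {
    if sorted.is_empty() {
        return None;
    }
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}
```

With this rule, percentile(&samples, 50.0) and percentile(&samples, 95.0) correspond to the p50 and p95 rows of a report, and the smallest sample is the min row.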

Profiling the JS/TS docs generator

The docs-* subcommands profile ox_content_docs (the "cargo doc for JavaScript" generator) over a source directory rather than a single Markdown file. They reproduce the production pipeline — OXC parse → JSDoc parse → AST visit → normalize → TypeDoc/pure-Markdown render:

# Extraction only: OXC parse + JSDoc parse + AST visit + normalize,
# for every .ts/.tsx/.mts/.cts file under the directory.
cargo run --release -p ox_content_profile_cli -- docs-extract path/to/src

# Render only: extraction is hoisted out of the measurement loop so the
# timing reflects the Markdown render path in isolation.
cargo run --release -p ox_content_profile_cli -- docs-render path/to/src

# Full pipeline: extraction + normalize + Markdown render.
cargo run --release -p ox_content_profile_cli -- docs-pipeline path/to/src --json

The report's throughput is computed over the total bytes of source ingested, and span rows are prefixed with docs:: so they're easy to tell apart from the Markdown-engine spans.

Reading the report

 Timing
   min   15.50 µs
   p50   35.83 µs
   p95   38.38 µs
   ...
   throughput     680.30 MB/s

 Allocations (per iteration)
   count               46.0
   bytes           57.01 KB
   peak (max)      46.25 KB
   largest         90.00 KB

 Spans (sorted by total inclusive time)
   name                          hits      self  inclusive  share   allocs  bytes
   parser::parse_html_block      7600   3.56 ms    3.56 ms  55.4%      0     0 B
   ...

Timing percentiles are computed over the --iters measured iterations; the --warmup iterations that run first are discarded to remove cold-cache effects.
Allocations per iteration are the mean count and mean bytes over those iterations, plus the maximum peak live bytes any single iteration reached above its starting baseline.
Spans are aggregated across all measured iterations. self is inclusive minus child-span inclusive time, so it's the time the function spends in its own body. share is the span's self time as a fraction of the total self time across all spans, which is a quick proxy for "fraction of CPU spent here."
Size-class histogram shows the last iteration's allocations bucketed by power-of-two size. Useful for spotting spikes in small short-lived allocations.
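The self, inclusive, and share columns can be illustrated with a small thread-local span stack. This is a sketch under assumptions, not the code in ox_content_profiler::scope: the span function, SpanGuard, SpanStats, snapshot, and share are hypothetical names, and per-span allocation deltas are left out to keep it short.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::time::{Duration, Instant};

#[derive(Default, Clone, Copy, Debug)]
pub struct SpanStats {
    pub hits: u64,
    pub inclusive: Duration,
    pub self_time: Duration,
}

struct Frame {
    name: &'static str,
    start: Instant,
    children: Duration, // inclusive time of direct child spans
}

thread_local! {
    static STACK: RefCell<Vec<Frame>> = RefCell::new(Vec::new());
    static STATS: RefCell<HashMap<&'static str, SpanStats>> = RefCell::new(HashMap::new());
}

/// RAII guard: the span closes when the guard is dropped.
pub struct SpanGuard;

pub fn span(name: &'static str) -> SpanGuard {
    STACK.with(|s| {
        s.borrow_mut().push(Frame { name, start: Instant::now(), children: Duration::ZERO })
    });
    SpanGuard
}

impl Drop for SpanGuard {
    fn drop(&mut self) {
        STACK.with(|s| {
            let mut stack = s.borrow_mut();
            let Some(frame) = stack.pop() else { return };
            let inclusive = frame.start.elapsed();
            // Charge this span's inclusive time to its parent's child total.
            if let Some(parent) = stack.last_mut() {
                parent.children += inclusive;
            }
            STATS.with(|t| {
                let mut stats = t.borrow_mut();
                let entry = stats.entry(frame.name).or_default();
                entry.hits += 1;
                entry.inclusive += inclusive;
                // self = inclusive minus child-span inclusive time
                entry.self_time += inclusive.saturating_sub(frame.children);
            });
        });
    }
}

/// Copy of the current thread's aggregated span statistics.
pub fn snapshot() -> HashMap<&'static str, SpanStats> {
    STATS.with(|t| t.borrow().clone())
}

/// A span's self time as a fraction of total self time across all spans.
pub fn share(stats: &HashMap<&'static str, SpanStats>, name: &str) -> f64 {
    let total: Duration = stats.values().map(|s| s.self_time).sum();
    if total.is_zero() {
        return 0.0;
    }
    stats
        .get(name)
        .map_or(0.0, |s| s.self_time.as_secs_f64() / total.as_secs_f64())
}
```

Computing share from self time rather than inclusive time means the shares of all spans sum to 1, so a parent is not credited for work done in its children.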

Profile-feature anatomy

The instrumentation hooks are gated on a Cargo feature per crate. To profile a different consumer of the parser/renderer:

[dependencies]
ox_content_parser    = { workspace = true, features = ["profile"] }
ox_content_renderer  = { workspace = true, features = ["profile"] }
ox_content_profiler  = { workspace = true }

Inside your binary, install the global allocator and enable both layers before the workload, then drain the report:

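use ox_content_profiler::{CountingAllocator, Recorder, scope};

// Route every allocation in the process through the counting allocator.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator::new();

fn main() {
    // Turn on both layers before the workload runs.
    CountingAllocator::enable();
    scope::enable();

    let mut recorder = Recorder::new("my-workload");
    for _ in 0..100 {
        // Each call records one iteration.
        recorder.record(|| {
            // ...exercise parser + renderer...
        });
    }
    // Fold the recorded iterations into a report and print it as a table.
    let report = recorder.finish();
    println!("{}", report.render_table());
}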
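The per-crate feature gate described in this section can be pictured with a pair of cfg-selected macros. The sketch below is an illustration only: sketch_span!, NoopSpan, and TimedSpan are hypothetical names, and the real profile_span! macro may expand differently. It assumes a Cargo feature named profile, as in the dependency snippet above.

```rust
/// Zero-sized guard used when profiling is compiled out.
pub struct NoopSpan;

/// Real guard, compiled only when the (assumed) `profile` feature is on.
#[cfg(feature = "profile")]
pub struct TimedSpan {
    pub name: &'static str,
    pub start: std::time::Instant,
}

#[cfg(feature = "profile")]
impl Drop for TimedSpan {
    fn drop(&mut self) {
        // A real profiler would aggregate this; the sketch just prints it.
        eprintln!("{}: {:?}", self.name, self.start.elapsed());
    }
}

/// With the feature on, the macro builds a timing guard.
#[cfg(feature = "profile")]
macro_rules! sketch_span {
    ($name:expr) => {
        TimedSpan { name: $name, start: std::time::Instant::now() }
    };
}

/// With the feature off, the macro expands to a zero-sized value.
#[cfg(not(feature = "profile"))]
macro_rules! sketch_span {
    ($name:expr) => {
        NoopSpan
    };
}

/// Example hot entry point: the guard lives until the function returns.
pub fn parse_block(input: &str) -> usize {
    let _span = sketch_span!("parser::parse_block");
    input.lines().count()
}
```

Since NoopSpan has no fields and no Drop implementation, the binding in parse_block carries no data and runs no code when the feature is disabled, which is what lets the optimizer remove it.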

Suggested workflow for performance work

Run the profiler against a representative corpus before changing anything. Save the table or JSON.
Find the highest-share span that is not obviously memory-bandwidth bound. Look at its allocs and bytes columns — span-level allocations are usually low-hanging fruit.
Make a change that targets that span.
Re-run the profiler with the same flags and compare span counts, allocations, and tail percentiles.
Run cargo bench -p ox_content_parser to confirm the synthetic benchmarks haven't regressed.

This was the loop used to land issue #159. The first run on docs/content/api/types.md showed parse_html_block consuming 86.9% of pipeline time, with to_ascii_lowercase() allocating on every line. Replacing that call with a byte-level case-insensitive search and inlining consume_line's newline scan moved the same file from 240 MB/s to 803 MB/s end-to-end and cut per-iteration allocations from 122 to 32.
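For readers unfamiliar with the technique, the difference between the two approaches looks roughly like this. It is a sketch of the general idea, not the code that landed for issue #159: both function names are hypothetical, and the window scan shown here is the simplest possible form.

```rust
/// Allocating baseline: lowercases the whole line into a new String
/// before searching, so every call allocates.
pub fn contains_lowercased(line: &str, needle_lower: &str) -> bool {
    line.to_ascii_lowercase().contains(needle_lower)
}

/// Allocation-free variant: slides a window over the raw bytes and
/// compares each window to the needle, ignoring ASCII case.
pub fn contains_ignore_ascii_case(line: &[u8], needle: &[u8]) -> bool {
    if needle.is_empty() {
        return true; // `windows(0)` would panic, and "" is in every string
    }
    line.windows(needle.len())
        .any(|window| window.eq_ignore_ascii_case(needle))
}
```

The window scan does more comparisons in the worst case than a tuned substring search, but it never touches the allocator, which is the cost the profiler flagged in this case.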