Reading the output

ns/op, B/op, allocs/op — what each tells you.

6 min · easy

You wrote a benchmark and got numbers. Now make them mean something. This lesson is fluency with ns/op, B/op, and allocs/op — what each tells you, when each matters, and what red flags to look for.

ns/op — the headline

Nanoseconds per operation. The number you optimise for, most of the time.

  • Under 100ns — fast. Probably a pure function with no syscalls.
  • 100ns – 1µs (1000ns) — typical for string ops, simple JSON marshal, small loops.
  • 1µs – 100µs — non-trivial work. Allocation, regex, complex parsing.
  • 100µs – 1ms — usually some I/O or a big loop. If you didn't expect this, investigate.
  • 1ms+ — likely the wrong target for microbenchmarking. Profile the broader system instead.

B/op — bytes per op

How many bytes of heap memory the operation allocates per iteration.

  • 0 B/op — zero allocations. Often achievable for pure ops with stack-only data. The dream for hot paths.
  • 16-64 B/op — typical for ops that build a small string, return a struct on the heap, or allocate a small slice.
  • 1KB+ /op — large; either you copied data or allocated a buffer. Usually a flag for optimisation.

allocs/op — number of heap allocations

Sometimes more important than bytes. Each allocation:

  • Costs CPU to satisfy.
  • Increases the work the GC has to do later.
  • Contributes to how often the GC runs, and each GC cycle includes brief pauses.

For libraries used a million times per second (logging, request routing), driving allocs/op to 0 or 1 is a common target.
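
The B/op and allocs/op columns only show up when memory statistics are switched on, either with `go test -bench=. -benchmem` or by calling `b.ReportAllocs()` inside the benchmark. As a minimal sketch, the program below uses `testing.Benchmark` to run a benchmark outside `go test` and read all three numbers from the result. `pluralize` and `measure` are hypothetical names invented for this illustration, not functions from the course code.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is package-level so the compiler cannot discard the benchmarked call.
var sink string

// pluralize is a hypothetical stand-in for the function under test.
func pluralize(word string) string {
	return word + "s"
}

// measure runs the benchmark programmatically and returns its result.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // under `go test`, this adds the B/op and allocs/op columns
		var s string
		for i := 0; i < b.N; i++ {
			s = pluralize("cat")
		}
		sink = s
	})
}

func main() {
	r := measure()
	fmt.Println("iterations:", r.N)
	fmt.Println("ns/op:     ", r.NsPerOp())
	fmt.Println("B/op:      ", r.AllocedBytesPerOp())
	fmt.Println("allocs/op: ", r.AllocsPerOp())
}
```

The exact values depend on your machine and Go version, so treat whatever it prints as a reading to interpret, not a reference figure.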

Reading a realistic output

BenchmarkPluralize-10       5102937     226 ns/op     64 B/op    2 allocs/op
BenchmarkGoType-10          7891234     151 ns/op     48 B/op    1 allocs/op
BenchmarkZodType-10          892145    1342 ns/op    256 B/op    8 allocs/op   ← red flag
BenchmarkGORMTag-10         2912034     411 ns/op    128 B/op    3 allocs/op
BenchmarkInjectBefore-10     503456    2385 ns/op    512 B/op   12 allocs/op   ← bigger flag
BenchmarkParseInline-10     1056432    1147 ns/op    192 B/op    5 allocs/op

Two red flags worth noting:

  • BenchmarkZodType at 1.3µs with 8 allocs. Likely building intermediate strings. strings.Builder would drop both numbers.
  • BenchmarkInjectBefore at 2.4µs with 12 allocs. 12 allocations PER OP is a lot — almost certainly a string search-and-replace with many intermediate values. A scanner or single buffer would help.
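
Each line of that output has a fixed shape: the benchmark name with a GOMAXPROCS suffix (the "-10"), the iteration count, then value/unit pairs. The sketch below splits one line into those parts to make the columns explicit. The `result` type and `parseLine` function are hypothetical helpers written for this lesson. For real comparisons, benchstat does this parsing for you.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// result holds the fields of one line of `go test -bench` output.
type result struct {
	Name        string  // benchmark name without the GOMAXPROCS suffix
	Procs       int     // GOMAXPROCS during the run (the "-10" suffix)
	Iterations  int     // how many times the loop body ran (b.N)
	NsPerOp     float64 // nanoseconds per operation
	BytesPerOp  int     // heap bytes allocated per operation
	AllocsPerOp int     // heap allocations per operation
}

// parseLine splits one benchmark output line into its columns.
func parseLine(line string) (result, error) {
	f := strings.Fields(line)
	if len(f) < 4 {
		return result{}, fmt.Errorf("too few fields in %q", line)
	}
	r := result{Name: f[0], Procs: 1}
	// The number after the last dash is GOMAXPROCS; it is omitted when that is 1.
	if i := strings.LastIndex(f[0], "-"); i >= 0 {
		if p, err := strconv.Atoi(f[0][i+1:]); err == nil {
			r.Name, r.Procs = f[0][:i], p
		}
	}
	n, err := strconv.Atoi(f[1])
	if err != nil {
		return result{}, fmt.Errorf("iterations: %w", err)
	}
	r.Iterations = n
	// The rest is value/unit pairs; unknown units are skipped.
	for i := 2; i+1 < len(f); i += 2 {
		v, err := strconv.ParseFloat(f[i], 64)
		if err != nil {
			break // trailing annotation such as "← red flag", not a metric
		}
		switch f[i+1] {
		case "ns/op":
			r.NsPerOp = v
		case "B/op":
			r.BytesPerOp = int(v)
		case "allocs/op":
			r.AllocsPerOp = int(v)
		}
	}
	return r, nil
}

func main() {
	r, err := parseLine("BenchmarkZodType-10 892145 1342 ns/op 256 B/op 8 allocs/op ← red flag")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", r)
}
```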

Spotting hidden allocations

// Allocates a new string every call (strings are immutable in Go)
func concat1(s string) string {
	return s + " · v1"
}
// Bench: 30 ns/op, 16 B/op, 1 allocs/op

// Allocates a new string on every += (each iteration copies everything built so far)
func concat2(words []string) string {
	out := ""
	for _, w := range words {
		out += w + " "
	}
	return out
}
// Bench (10 words): 290 ns/op, 192 B/op, 9 allocs/op ← yikes

// strings.Builder appends into one growing buffer (needs import "strings")
func concat3(words []string) string {
	var b strings.Builder
	for _, w := range words {
		b.WriteString(w)
		b.WriteByte(' ')
	}
	return b.String()
}
// Bench (10 words): 110 ns/op, 32 B/op, 2 allocs/op ← much better

Same logical work, roughly 2.6x faster, with 6x fewer bytes and 4.5x fewer allocations (290 → 110 ns/op, 192 → 32 B/op, 9 → 2 allocs/op). The benchmark numbers told you exactly where the inefficiency was.
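
If you only care about the allocation count, `testing.AllocsPerRun` gives a quick reading without a full benchmark. The sketch below compares the loop-concatenation and `strings.Builder` approaches, plus a third variant that calls `Grow` up front. The function names, the `allocsPerCall` helper, and the ten-word input are assumptions made for this illustration, and the exact counts can vary with Go version and input, so the program prints them without promising specific values.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps each result reachable so the calls are not optimised away.
var sink string

// words is a ten-word sample input.
var words = strings.Fields("the quick brown fox jumps over the lazy dog again")

// concatNaive builds a new string on every iteration.
func concatNaive(words []string) string {
	out := ""
	for _, w := range words {
		out += w + " "
	}
	return out
}

// concatBuilder appends into one buffer that grows as needed.
func concatBuilder(words []string) string {
	var b strings.Builder
	for _, w := range words {
		b.WriteString(w)
		b.WriteByte(' ')
	}
	return b.String()
}

// concatGrow sizes the buffer once before writing anything.
func concatGrow(words []string) string {
	n := 0
	for _, w := range words {
		n += len(w) + 1 // word plus trailing space
	}
	var b strings.Builder
	b.Grow(n)
	for _, w := range words {
		b.WriteString(w)
		b.WriteByte(' ')
	}
	return b.String()
}

// allocsPerCall reports the average number of heap allocations per call of f.
func allocsPerCall(f func([]string) string) float64 {
	return testing.AllocsPerRun(100, func() { sink = f(words) })
}

func main() {
	fmt.Println("naive:  ", allocsPerCall(concatNaive))
	fmt.Println("builder:", allocsPerCall(concatBuilder))
	fmt.Println("grow:   ", allocsPerCall(concatGrow))
}
```

Pre-sizing with `Grow` is the string equivalent of `make([]T, 0, cap)` for slices: when the final size is known, the buffer never needs to be reallocated.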

What "noise" looks like

# Run 5 times, results swinging by almost 40%:
BenchmarkX-10 5102937 226 ns/op
BenchmarkX-10 4992014 289 ns/op
BenchmarkX-10 5023412 232 ns/op
BenchmarkX-10 4123912 312 ns/op
BenchmarkX-10 5202345 228 ns/op

This is noise. Your laptop is busy with something else (browser, Spotlight, antivirus), so don't trust these numbers. Fix by:

  • Closing background apps.
  • Running on a quiet machine (CI, dedicated box).
  • Using -count=10 + benchstat (chapter 3).
  • Avoiding battery / thermal throttling.
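
To put a number on the swing, compare the fastest and slowest run. The sketch below does that arithmetic for the five ns/op values shown above. `spread` and `mean` are hypothetical helpers for this illustration. benchstat computes proper statistics and should be preferred for real comparisons.

```go
package main

import "fmt"

// runs holds the ns/op values from the five noisy runs.
var runs = []float64{226, 289, 232, 312, 228}

// spread returns how much slower the slowest run is than the fastest, in percent.
func spread(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	lo, hi := samples[0], samples[0]
	for _, s := range samples[1:] {
		if s < lo {
			lo = s
		}
		if s > hi {
			hi = s
		}
	}
	return (hi - lo) / lo * 100
}

// mean returns the arithmetic mean of the samples.
func mean(samples []float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range samples {
		sum += s
	}
	return sum / float64(len(samples))
}

func main() {
	// Prints: mean 257.4 ns/op, spread 38%
	fmt.Printf("mean %.1f ns/op, spread %.0f%%\n", mean(runs), spread(runs))
}
```

A spread this large means a 10% improvement would be invisible, which is why quieting the machine comes before comparing results.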

The compiler can elide your work. If your benchmark's return value is unused, the Go compiler may skip the call entirely, so the benchmark reports close to 0 ns/op. The fix: assign the result to a package-level variable so the compiler can't prove the work is dead.

var sink string // package-level

func BenchmarkPluralize(b *testing.B) {
	var s string
	for i := 0; i < b.N; i++ {
		s = Pluralize("cat")
	}
	sink = s // prevent dead-code elimination
}

What ns/op alone can't tell you

Benchmarks measure ONE thing in isolation. They don't tell you:

  • How often this function is called. A 5µs function called 10× a day is fine. A 50ns function called a million times per request matters.
  • What the rest of the system is doing while it runs. Lock contention, GC pauses, network jitter all dwarf microbenchmark differences.
  • Real-world data shapes. Your benchmark hits one input; production hits a distribution.

Profiling (next chapter) helps with the "how often" question. Load testing (the K6 course) helps with the "real world" question. Use the right tool for the right question.
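
The "how often" point is plain multiplication, and writing it out shows how lopsided it can be. The sketch below uses the two figures from the first bullet. `totalCost`, `rare`, and `hot` are hypothetical names for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// totalCost is the time spent in a function that costs perCall and runs calls times.
func totalCost(perCall time.Duration, calls int64) time.Duration {
	return perCall * time.Duration(calls)
}

var (
	rare = totalCost(5*time.Microsecond, 10)        // slow function, 10 calls a day
	hot  = totalCost(50*time.Nanosecond, 1_000_000) // fast function, a million calls per request
)

func main() {
	fmt.Println("5µs x 10 per day:            ", rare) // 50µs per day
	fmt.Println("50ns x 1,000,000 per request:", hot)  // 50ms per request
}
```

The "slow" function costs 50µs across a whole day, while the "fast" one adds 50ms to every single request, a thousand times more.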

Quick check

Benchmark A: 50 ns/op, 0 B/op, 0 allocs/op. Benchmark B: 30 ns/op, 32 B/op, 1 allocs/op. Which is faster?

Try it

Interpret real numbers:

  1. Take the benchmarks you wrote last lesson. Look at each B/op number.
  2. Find the benchmark with the highest allocs/op. Read the source. Make ONE hypothesis about why it allocates that much.
  3. Try one quick change (use strings.Builder, pre-allocate a slice with make([]T, 0, cap), or avoid an intermediate string). Re-run.
  4. Did the numbers move? Write down: how much, in which direction, and your guess at why.
  5. Paste before / after in notes.md.

What's next

Chapter 2 — Profiling with pprof. Benchmarks tell you ONE function's cost; pprof tells you where a whole program spends its time. The CPU + heap flame graph is the highest-bandwidth tool in Go.
