Introduction to Binary Logging — Comparing 6 Formats and Selection Criteria

2026-04-02

Treeru

After building many software projects, I became interested in collecting event logs in binary format. Binary logging has a lot of advantages for AI training data, and I've been applying it to actual projects. I plan to share empirical results and implementation details in upcoming posts in this category. Today, as the first post, I'll introduce and compare six binary logging formats.

Why I Got Interested in Binary Logging

JSON logging is great. It's human-readable, universally supported, and works well for most projects. I still run my database audit logs in JSON, and there's no reason to change that.

Binary logging caught my attention when I started thinking about AI training data. As you build multiple projects, event logs naturally accumulate. To feed them into a training pipeline later, you end up going through an export-and-convert process. What if you collected them from the start in a format the training pipeline can read directly? That question was the starting point.

There turned out to be more options than I expected, so I needed to organize them. This post is the result.

General Advantages of Binary Logging

Size reduction. Most binary formats come out at 35–67% of the equivalent JSON size. Schema-based formats drop the repeated key names entirely, and even schemaless formats encode numbers and structure more compactly than text.
Serialization/deserialization speed. Many formats serialize and parse 2–10x faster than JSON, and zero-copy formats can read faster still. This matters when reading large volumes in analysis or training pipelines.
Python ecosystem compatibility. Avro, MessagePack, and Protobuf all have Python libraries, so logs can be read in Python and converted to Pandas DataFrames or PyTorch Datasets.
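To make the size argument concrete, here is a minimal sketch using only the Python standard library. It is not one of the six formats: the `struct` layout simply stands in for "a structure agreed on in advance", and the event fields are hypothetical.

```python
import json
import struct

# Hypothetical event, used only for illustration.
event = {"ts": 1775088000, "user_id": 48213, "event_type": 3, "duration_ms": 125.5}

# Text encoding: every record repeats all four key names.
json_bytes = json.dumps(event, separators=(",", ":")).encode("utf-8")

# Binary encoding with a layout fixed in advance (a stand-in for a schema):
# little-endian uint64 timestamp, uint32 user id, uint8 event type, float64 duration.
LAYOUT = struct.Struct("<QIBd")
binary_bytes = LAYOUT.pack(
    event["ts"], event["user_id"], event["event_type"], event["duration_ms"]
)

print(len(json_bytes), len(binary_bytes))
```

For this event the compact JSON is 68 bytes and the fixed layout is 21 bytes. Real formats add some framing (field tags, lengths), so they usually land somewhat above this bare minimum.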

Six Formats at a Glance

One-Line Summaries

MessagePack: a binary translation of JSON. Schema-free; simplicity is the point.
Protobuf: Google's schema-first format. Strict, but compact and fast.
CBOR: an IETF standard (RFC 8949) widely used in IoT. Similar to MessagePack, with more precisely specified types.
FlatBuffers: Google's format, originally built for games. Zero-copy access without a parsing step.
Avro: Apache's big-data format. The schema is stored together with the data.
Cap'n Proto: designed by the primary author of Protobuf version 2. True zero-copy.
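To show what "binary translation of JSON" means in practice, here is a hand-rolled sketch of a small subset of the MessagePack encoding (nil, booleans, small unsigned integers, floats, short strings, small lists and maps). It is for illustration only; in a real project you would use a MessagePack library rather than this `pack` function, which is my own hypothetical helper.

```python
import json
import struct

def pack(obj):
    """Encode a small subset of MessagePack. Illustration only."""
    if obj is None:
        return b"\xc0"                              # nil
    if obj is True:
        return b"\xc3"                              # true
    if obj is False:
        return b"\xc2"                              # false
    if isinstance(obj, int):
        if 0 <= obj <= 0x7F:
            return bytes([obj])                     # positive fixint: the value itself
        if 0 <= obj <= 0xFF:
            return b"\xcc" + struct.pack(">B", obj)  # uint 8
        if 0 <= obj <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", obj)  # uint 16
        if 0 <= obj <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", obj)  # uint 32
        raise ValueError("integer outside the range this sketch supports")
    if isinstance(obj, float):
        return b"\xcb" + struct.pack(">d", obj)     # float 64
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) > 31:
            raise ValueError("sketch only supports strings up to 31 bytes")
        return bytes([0xA0 | len(raw)]) + raw       # fixstr: length lives in the header byte
    if isinstance(obj, list):
        if len(obj) > 15:
            raise ValueError("sketch only supports lists up to 15 items")
        return bytes([0x90 | len(obj)]) + b"".join(pack(v) for v in obj)  # fixarray
    if isinstance(obj, dict):
        if len(obj) > 15:
            raise ValueError("sketch only supports maps up to 15 entries")
        out = bytes([0x80 | len(obj)])              # fixmap: entry count in the header byte
        for key, value in obj.items():
            out += pack(key) + pack(value)          # key names are still written per record
        return out
    raise TypeError(f"unsupported type: {type(obj).__name__}")

sample = {"compact": True, "schema": 0}
packed = pack(sample)
as_json = json.dumps(sample, separators=(",", ":")).encode("utf-8")
print(len(as_json), len(packed), packed.hex())
```

The sample map takes 27 bytes as compact JSON and 18 bytes in MessagePack. The saving comes from dropping quotes, colons, and commas, not from dropping key names, which are still written in every record.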

Speed, Size, and Scale Comparison

Speed and Size

Format	Serialization (write)	Deserialization (read)	Size vs JSON
JSON	~300 MB/s	~250 MB/s	100%
MessagePack	~800 MB/s	~900 MB/s	65%
Protobuf	~1,000 MB/s	~1,200 MB/s	35%
CBOR	~700 MB/s	~800 MB/s	67%
FlatBuffers	~600 MB/s	~3,000 MB/s (zero-copy)	40%
Avro	~900 MB/s	~1,000 MB/s	38%
Cap'n Proto	~2,000 MB/s	~5,000 MB/s (zero-copy)	42%
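The "(zero-copy)" entries refer to reading a field in place from the buffer instead of decoding the whole record first. The sketch below illustrates the idea with fixed-size records and the standard library. The record layout is hypothetical and is not the actual wire layout of FlatBuffers or Cap'n Proto.

```python
import struct

# Hypothetical fixed layout: uint64 ts, uint32 user_id, uint8 event_type, float64 duration_ms.
RECORD = struct.Struct("<QIBd")
DURATION_OFFSET = 8 + 4 + 1  # byte offset of duration_ms inside one record

def build_log(events):
    """Pack a sequence of event tuples into one contiguous buffer."""
    buf = bytearray(RECORD.size * len(events))
    for i, ev in enumerate(events):
        RECORD.pack_into(buf, i * RECORD.size, *ev)
    return bytes(buf)

def read_duration(buf, index):
    """Read one field of one record in place, without decoding anything else."""
    view = memoryview(buf)  # a view over the same memory, not a copy
    (duration,) = struct.unpack_from("<d", view, index * RECORD.size + DURATION_OFFSET)
    return duration

log = build_log([
    (1775088000, 48213, 3, 125.5),
    (1775088001, 48214, 1, 40.25),
])
print(read_duration(log, 1))
```

Because the position of every field is known from the layout, the reader jumps straight to the bytes it needs. That is why read throughput for this style of format can look so much higher than for formats that must parse each record.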

Scale Suitability

Format	Small (tens of thousands/day)	Medium (millions/day)	Large (hundreds of millions+)
MessagePack	★★★ Ideal	★★ Sufficient	★ Inefficient
Protobuf	★★ Can be overkill	★★★ Ideal	★★★ Ideal
CBOR	★★★ Good	★★ Sufficient	★ Inefficient (similar to MessagePack)
FlatBuffers	★ Overkill	★★ Special use	★★★ Read-intensive
Avro	★ Overkill	★★★ Ideal	★★★ Best with Hadoop
Cap'n Proto	★ Overkill	★★ Special use	★★★ Extreme performance

Schema vs Schemaless — The Biggest Decision

When comparing the six formats, everything converges to one question: do you define the event structure upfront, or keep it flexible?

Classification	Formats	Characteristics
Schemaless	MessagePack, CBOR	No code changes when event structure changes. But invalid data can slip in unnoticed.
Schema required	Protobuf, FlatBuffers, Cap'n Proto	Field names and types enforced by the schema, compact encoding. Schema changes require updates and redeployment.
Middle ground	Avro	Schema required, but stored in the file alongside the data. Built-in schema evolution for backward compatibility.

In the exploration phase (event structure keeps changing), schemaless is more convenient. In the stable phase (fields are fixed, long-term collection), schema-based formats win on size and integrity.
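Much of the size advantage of schema-based formats comes from keeping field names out of the data. As a rough sketch, this is how Protobuf's wire format writes an unsigned integer field: a key built from the field number and wire type, followed by the value as a varint. The code is hand-rolled for illustration and does not use the protobuf library; the field numbers 1 and 2 and the names in the comments are hypothetical.

```python
def encode_varint(value):
    """Encode a non-negative integer as a base-128 varint (7 bits per byte)."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def encode_uint_field(field_number, value):
    """Encode one varint field: key = (field_number << 3) | wire_type, then the value."""
    key = (field_number << 3) | 0    # wire type 0 means varint
    return encode_varint(key) + encode_varint(value)

# Hypothetical schema: field 1 is user_id, field 2 is event_type.
message = encode_uint_field(1, 48213) + encode_uint_field(2, 3)
print(message.hex(), len(message))
```

The two fields take 6 bytes in total, and the names `user_id` and `event_type` never appear in the output. The cost is that a reader cannot interpret the bytes without the schema, which is the trade-off described in the table above.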

AI Training Data Fitness

When using logs as AI training data, the key question is: how easy is it to read and process later? Python ecosystem compatibility, Pandas conversion, and schema evolution support are the deciding factors.

Format	Python	Pandas	Schema Evolution	Unstructured Data	AI Fit
MessagePack	★★★	Direct	Flexible	Excellent	★★★
Protobuf	★★★	Needs conversion	Versioned	Difficult	★★
CBOR	★★	Direct	Flexible	Excellent	★★
Avro	★★★	Direct	Built-in	Schema needed	★★★
FlatBuffers	★★	Cumbersome	Difficult	Difficult	★
Cap'n Proto	★	Cumbersome	Possible	Difficult	★

MessagePack excels with unstructured data, which makes it a good fit for exploratory collection where fields vary per event. Avro has built-in schema evolution: data written six months ago and data written today can be read with the same reader schema, and schema resolution handles fields that were added or removed in between. That makes it well suited to long-term AI training datasets.
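The idea behind schema evolution can be sketched in a few lines. This is a simplification in the spirit of Avro's record resolution rules, operating on plain dicts rather than on Avro schemas or the Avro library: fields the reader does not know are ignored, and fields missing from an old record are filled from the reader's defaults. The schema, field names, and `resolve` helper are all hypothetical.

```python
REQUIRED = object()  # sentinel: the field has no default

# Hypothetical reader schema: field name -> default value (or REQUIRED).
READER_SCHEMA = {
    "ts": REQUIRED,
    "event_type": REQUIRED,
    "region": "unknown",  # added later, so it carries a default for old records
}

def resolve(record, reader_schema):
    """Project a record onto the reader schema, filling defaults for missing fields."""
    resolved = {}
    for name, default in reader_schema.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not REQUIRED:
            resolved[name] = default
        else:
            raise ValueError(f"record is missing required field {name!r}")
    return resolved  # fields absent from the reader schema are dropped

old_record = {"ts": 1, "event_type": "click", "debug_flag": True}  # written before "region" existed
new_record = {"ts": 2, "event_type": "view", "region": "eu"}

print(resolve(old_record, READER_SCHEMA))
print(resolve(new_record, READER_SCHEMA))
```

Both records come out with the same three fields, so downstream code such as a DataFrame conversion sees one consistent structure regardless of when the data was written.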

Selection Guide by Situation

Small scale + unstructured + quick start → MessagePack
Medium scale or larger + stable structure + team → Protobuf
Big data ecosystem (Kafka / Spark / Hadoop) → Avro
Game / real-time + extreme read speed → FlatBuffers
IoT + international standard required → CBOR
Extreme performance + C++-centric system → Cap'n Proto

Using Protobuf or Avro at small scale can mean schema management overhead exceeds the savings. Conversely, using MessagePack at large scale means key name repetition wastes non-trivial storage space. Matching the format to your scale and situation is what matters.
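A back-of-the-envelope estimate shows how key name repetition scales. In MessagePack, a string of up to 31 bytes costs one header byte plus its UTF-8 length, so the bytes spent on key names alone can be computed directly. The key names and daily volumes below are hypothetical.

```python
def key_overhead_bytes(keys, events):
    """Bytes spent on map key names alone, assuming every key fits MessagePack's fixstr."""
    per_event = 0
    for key in keys:
        length = len(key.encode("utf-8"))
        if length > 31:
            raise ValueError("this estimate assumes keys of at most 31 bytes")
        per_event += 1 + length  # 1 header byte + the key bytes
    return per_event * events

KEYS = ["ts", "user_id", "event_type", "duration_ms"]  # hypothetical event fields

small = key_overhead_bytes(KEYS, 50_000)        # tens of thousands of events per day
large = key_overhead_bytes(KEYS, 100_000_000)   # hundreds of millions per day

print(small / 1e6, "MB/day")
print(large / 1e9, "GB/day")
```

With these four keys, each event spends 34 bytes on names. That is about 1.7 MB per day at 50,000 events, which is negligible, and about 3.4 GB per day at 100 million events, before any compression is applied.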

Coming Up Next

This post was an introduction and comparison of binary logging formats. Upcoming posts in this category will cover:

Applying binary logging in production (async queue, file separation strategies)
Feeding collected binary logs into an AI training pipeline
Measured size and speed comparison of JSON vs binary logs (with empirical data)
Designing role separation between DB audit logs and file-based event logs

I'll share each as the empirical data accumulates.

Performance figures (MB/s) are general benchmark references and vary by hardware and data characteristics. Measure on your target system before committing to any format.

