Introduction to Binary Logging — Comparing 6 Formats and Selection Criteria
After building many software projects, I became interested in collecting event logs in binary format. Binary logging has a lot of advantages for AI training data, and I've been applying it to actual projects. I plan to share empirical results and implementation details in upcoming posts in this category. Today, as the first post, I'll introduce and compare six binary logging formats.
Why I Got Interested in Binary Logging
JSON logging is great. It's human-readable, universally supported, and works well for most projects. I still run my database audit logs in JSON, and there's no reason to change that.
Binary logging caught my attention when I started thinking about AI training data. As you build multiple projects, event logs naturally accumulate. To feed them into a training pipeline later, you end up going through an export-and-convert process. What if you collected them in a format AI can read directly from the start? That question was the starting point.
There turned out to be more options than I expected, so I needed to organize them. This post is the result.
General Advantages of Binary Logging
- Size reduction. The formats compared below come in at roughly 35–67% of the size of JSON. Schema-based formats drop key names from the payload entirely, and schemaless ones still save space by encoding numbers and lengths compactly.
- Serialization/deserialization speed. Many formats are 2–10x faster than JSON parsing. This matters when reading large volumes in analysis or training pipelines.
- Python ecosystem compatibility. Avro, MessagePack, and Protobuf can be read directly in Python and converted to Pandas DataFrames or PyTorch Datasets.
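To make the size argument concrete, here is a minimal sketch that encodes one event three ways using only the standard library. The event fields, the `pack_msgpack_subset` helper, and the `ACTION_CODES` table are all hypothetical names made up for this illustration. The helper hand-encodes only a small subset of the MessagePack wire format (small maps, short strings, unsigned ints, booleans), and the fixed `struct` layout stands in for the general idea of a schema-based encoding rather than for Protobuf or any other specific format.

```python
import json
import struct

# Hypothetical event used only for this illustration.
event = {"user_id": 48213, "action": "login", "latency_ms": 187, "ok": True}


def pack_msgpack_subset(obj):
    """Encode a tiny subset of MessagePack: bool, uint < 2**32, short str, small map."""
    if obj is True:
        return b"\xc3"
    if obj is False:
        return b"\xc2"
    if isinstance(obj, int) and 0 <= obj < 2**32:
        if obj < 0x80:
            return bytes([obj])                       # positive fixint
        if obj < 0x100:
            return b"\xcc" + struct.pack(">B", obj)   # uint 8
        if obj < 0x10000:
            return b"\xcd" + struct.pack(">H", obj)   # uint 16
        return b"\xce" + struct.pack(">I", obj)       # uint 32
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        if len(raw) < 32:
            return bytes([0xA0 | len(raw)]) + raw     # fixstr
    if isinstance(obj, dict) and len(obj) < 16:
        out = bytes([0x80 | len(obj)])                # fixmap
        for key, value in obj.items():
            out += pack_msgpack_subset(key) + pack_msgpack_subset(value)
        return out
    raise TypeError(f"unsupported value in this sketch: {obj!r}")


# Schema-style encoding: field order and types are agreed on in advance,
# so no key names are written at all. The action string becomes an enum code.
ACTION_CODES = {"login": 1, "logout": 2}  # assumed mapping, part of the "schema"

as_json = json.dumps(event, separators=(",", ":")).encode("utf-8")
as_msgpack = pack_msgpack_subset(event)
as_fixed = struct.pack(
    "<IBH?", event["user_id"], ACTION_CODES[event["action"]],
    event["latency_ms"], event["ok"],
)

for label, blob in [("json", as_json), ("msgpack subset", as_msgpack), ("fixed schema", as_fixed)]:
    print(f"{label:15s} {len(blob):3d} bytes ({len(blob) / len(as_json):.0%} of JSON)")
```

The schemaless encoding still carries every key name in every event, which is why its savings are more modest than the fixed layout's. A real schema-based format adds field tags and framing, so it will not shrink as far as this bare `struct` example.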
Six Formats at a Glance
One-Line Summaries
- MessagePack: Binary translation of JSON. Schema-free, simplicity is the point.
- Protobuf: Google's schema-first format. Strict but compact and fast.
- CBOR: An IETF standard (RFC 8949) that is popular in IoT. Similar to MessagePack with more precise types.
- FlatBuffers: Google's game-oriented format. Zero-copy access without parsing.
- Avro: Apache's big-data format. Schema is stored together with the data.
- Cap'n Proto: Designed by the primary author of Protobuf v2. True zero-copy.
Speed, Size, and Scale Comparison
Speed and Size
| Format | Serialization (write) | Deserialization (read) | Size vs JSON |
|---|---|---|---|
| JSON | ~300 MB/s | ~250 MB/s | 100% |
| MessagePack | ~800 MB/s | ~900 MB/s | 65% |
| Protobuf | ~1,000 MB/s | ~1,200 MB/s | 35% |
| CBOR | ~700 MB/s | ~800 MB/s | 67% |
| FlatBuffers | ~600 MB/s | ~3,000 MB/s (zero-copy) | 40% |
| Avro | ~900 MB/s | ~1,000 MB/s | 38% |
| Cap'n Proto | ~2,000 MB/s | ~5,000 MB/s (zero-copy) | 42% |
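Since figures like these shift with hardware and data shape, a small harness for measuring throughput on your own events can help. The following is a rough sketch, not a rigorous benchmark: `throughput_mb_s` is a hypothetical helper, the sample events are invented, and only the standard-library `json` module is measured. It takes the best of a few runs and reports MB/s relative to the JSON-encoded size.

```python
import json
import time


def throughput_mb_s(fn, arg, payload_bytes, repeat=5):
    """Best-of-N throughput of fn(arg) in MB/s, counted against payload_bytes."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return payload_bytes / best / 1e6


# Invented sample data; substitute a slice of your real event log.
events = [
    {"user_id": i, "action": "click", "latency_ms": i % 500}
    for i in range(50_000)
]
encoded = json.dumps(events)
size = len(encoded.encode("utf-8"))

write_speed = throughput_mb_s(json.dumps, events, size)
read_speed = throughput_mb_s(json.loads, encoded, size)
print(f"JSON write: {write_speed:.0f} MB/s, read: {read_speed:.0f} MB/s")
```

To compare another format, pass its encode and decode functions instead, for example `msgpack.packb` and `msgpack.unpackb` if the `msgpack` package is installed, and keep `payload_bytes` fixed so the numbers stay comparable.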
Scale Suitability
| Format | Small (tens of thousands/day) | Medium (millions/day) | Large (hundreds of millions+) |
|---|---|---|---|
| MessagePack | ★★★ Ideal | ★★ Sufficient | ★ Inefficient |
| Protobuf | ★★ Can be overkill | ★★★ Ideal | ★★★ Ideal |
| CBOR | ★★★ Good | ★★ Sufficient | ★ Inefficient (as with MessagePack) |
| FlatBuffers | ★ Overkill | ★★ Special use | ★★★ Read-intensive |
| Avro | ★ Overkill | ★★★ Ideal | ★★★ Best with Hadoop |
| Cap'n Proto | ★ Overkill | ★★ Special use | ★★★ Extreme performance |
Schema vs Schemaless — The Biggest Decision
When comparing the six formats, everything converges on one question: do you define the event structure upfront, or keep it flexible?
| Classification | Formats | Characteristics |
|---|---|---|
| Schemaless | MessagePack, CBOR | No code changes when event structure changes. But invalid data can slip in unnoticed. |
| Schema required | Protobuf, FlatBuffers, Cap'n Proto | Field names and types are enforced by the schema, and size is minimized. Schema changes require updates and redeployment. |
| Middle ground | Avro | A schema is required to write, but it is stored in the file with the data, so readers need no separate definition. Built-in schema evolution for backward compatibility. |
In the exploration phase (event structure keeps changing), schemaless is more convenient. In the stable phase (fields are fixed, long-term collection), schema-based formats win on size and integrity.
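The trade-off can be sketched in a few lines. In this illustration, `json.dumps` stands in for any schemaless writer, and `SCHEMA` and `validate` are hypothetical names for a hand-written check, not the API of any of the six formats. The point is only where the bad event gets caught.

```python
import json

# Hypothetical event schema: field name -> expected Python type.
SCHEMA = {"user_id": int, "action": str, "latency_ms": int}


def validate(event, schema=SCHEMA):
    """Return the event unchanged, or raise ValueError if it does not match the schema."""
    missing = schema.keys() - event.keys()
    extra = event.keys() - schema.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for name, expected in schema.items():
        if not isinstance(event[name], expected):
            raise ValueError(
                f"{name}: expected {expected.__name__}, "
                f"got {type(event[name]).__name__}"
            )
    return event


good = {"user_id": 1, "action": "login", "latency_ms": 120}
typo = {"user_id": 2, "action": "login", "latencyms": "fast"}  # misspelled key, wrong type

# Schemaless path: both events are written without complaint.
schemaless_log = [json.dumps(e) for e in (good, typo)]

# Schema path: the malformed event is rejected at write time.
accepted, rejected = [], []
for e in (good, typo):
    try:
        accepted.append(validate(e))
    except ValueError as err:
        rejected.append((e, str(err)))

print(f"schemaless wrote {len(schemaless_log)} events")
print(f"schema path accepted {len(accepted)}, rejected {len(rejected)}")
```

With the schemaless writer, the misspelled field only surfaces later, when something downstream looks for `latency_ms` and does not find it.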
AI Training Data Fitness
When using logs as AI training data, the key question is: how easy is it to read and process later? Python ecosystem compatibility, Pandas conversion, and schema evolution support are the deciding factors.
| Format | Python | Pandas | Schema Evolution | Unstructured Data | AI Fit |
|---|---|---|---|---|---|
| MessagePack | ★★★ | Direct | Flexible | Excellent | ★★★ |
| Protobuf | ★★★ | Needs conversion | Versioned | Difficult | ★★ |
| CBOR | ★★ | Direct | Flexible | Excellent | ★★ |
| Avro | ★★★ | Direct | Built-in | Schema needed | ★★★ |
| FlatBuffers | ★★ | Cumbersome | Difficult | Difficult | ★ |
| Cap'n Proto | ★ | Cumbersome | Possible | Difficult | ★ |
MessagePack excels with unstructured data, which makes it a good fit for exploratory collection where fields vary per event. Avro has built-in schema evolution: data written six months ago and data written today can be read together, with compatibility handled automatically. That makes it well suited to long-term AI training datasets.
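The idea behind schema evolution can be illustrated in plain Python. This is a toy model of the resolution rules, not the Avro library: `READER_FIELDS` and `resolve` are hypothetical names, and records are ordinary dicts. A field that the reader expects but the old data lacks is filled from its default, and a field that only the old writer knew about is ignored.

```python
# Hypothetical reader schema: the "region" field was added after the
# oldest records were written, so it carries a default.
READER_FIELDS = [
    {"name": "user_id"},
    {"name": "action"},
    {"name": "region", "default": "unknown"},
]


def resolve(record, reader_fields=READER_FIELDS):
    """Project a written record onto the reader's fields, applying defaults."""
    out = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return out


# Written six months ago: no "region", plus a field the reader no longer uses.
old_record = {"user_id": 7, "action": "login", "debug_flag": True}
# Written today with the current field set.
new_record = {"user_id": 8, "action": "logout", "region": "eu"}

rows = [resolve(r) for r in (old_record, new_record)]
for row in rows:
    print(row)
```

Both rows come out with the same columns, which is the property that matters when old and new logs are loaded into one training dataset.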
Selection Guide by Situation
- Small scale + unstructured + quick start → MessagePack
- Medium scale or larger + stable structure + team → Protobuf
- Big data ecosystem (Kafka / Spark / Hadoop) → Avro
- Game / real-time + extreme read speed → FlatBuffers
- IoT + international standard required → CBOR
- Extreme performance + C++-centric system → Cap'n Proto
Using Protobuf or Avro at small scale can mean schema management overhead exceeds the savings. Conversely, using MessagePack at large scale means key name repetition wastes non-trivial storage space. Matching the format to your scale and situation is what matters.
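A back-of-the-envelope calculation shows how the gap scales. The size ratios below are the ones from the comparison table. The average JSON event size of 200 bytes and the two daily volumes are assumptions chosen for illustration, and `daily_gb` is a hypothetical helper.

```python
# Size relative to JSON, taken from the "Speed and Size" table.
SIZE_VS_JSON = {
    "JSON": 1.00,
    "MessagePack": 0.65,
    "Protobuf": 0.35,
    "CBOR": 0.67,
    "FlatBuffers": 0.40,
    "Avro": 0.38,
    "Cap'n Proto": 0.42,
}

AVG_JSON_EVENT_BYTES = 200  # assumed average event size


def daily_gb(events_per_day, avg_json_bytes, ratio):
    """Estimated storage per day in GB for a format with the given size ratio."""
    return events_per_day * avg_json_bytes * ratio / 1e9


for label, volume in [("small", 50_000), ("large", 100_000_000)]:
    msgpack = daily_gb(volume, AVG_JSON_EVENT_BYTES, SIZE_VS_JSON["MessagePack"])
    protobuf = daily_gb(volume, AVG_JSON_EVENT_BYTES, SIZE_VS_JSON["Protobuf"])
    print(
        f"{label}: MessagePack {msgpack:.4f} GB/day, "
        f"Protobuf {protobuf:.4f} GB/day, gap {msgpack - protobuf:.4f} GB/day"
    )
```

Under these assumptions the gap is a few megabytes per day at the small scale and several gigabytes per day at the large one, which is where schema management starts to pay for itself.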
Coming Up Next
This post was an introduction and comparison of binary logging formats. Upcoming posts in this category will cover:
- Applying binary logging in production (async queue, file separation strategies)
- Feeding collected binary logs into an AI training pipeline
- Measured size and speed comparison of JSON vs binary logs (with empirical data)
- Designing role separation between DB audit logs and file-based event logs
I'll share each as the empirical data accumulates.
Performance figures (MB/s) are general benchmark references and vary by hardware and data characteristics. Measure on your target system before committing to any format.
© 2026 TreeRU. All rights reserved.
All content is copyrighted by TreeRU. Unauthorized reproduction without attribution is prohibited.