SDBD: Creating a Data Format

Let's Create a Data Format

by CheddarCrisp

I have a problem. I want to be able to transfer self-contained binary data with metadata through a variety of protocols with no knowledge of the binary data's format or the protocol being used for transfer.

Or in other words, I want to be able to send files anywhere without losing the filename.

That's a bit simpler than my actual goal, but I think this is a problem every software developer has considered at some point. We've all asked the question, "Why isn't the filename attached to the file?" or slightly more advanced, "Why isn't the file format attached to the file?"

The answer isn't all that complicated.

But I'm going to declare that Good Enough isn't good enough. Perhaps this is the metadata that is most useful for files, but it's not the only useful metadata. It also depends on the transfer protocol to preserve the metadata. What if I don't want to rely on a specific protocol?

And so, knowing full well that this is likely to go nowhere and that solutions to this problem almost certainly already exist, I'm going to set out to create a new data format that encapsulates data and metadata into a single file.

The hard problem

The first thing to do is give my new format a name. After some deliberation, I'm going to settle on Self-described Binary Document or SDBD for short. It contains arbitrary binary data. It's a document and not a file because it could live anywhere. And the whole purpose is to let the document describe its own contents. Now that I've tackled one of the hard problems, the rest should be easy.

The next step is to talk about how we talk about the format. How is it structured and what concepts do we use to build it?


Semantics of SDBD

To really start bringing this new data format to life, we need to talk about what's in it. Establishing the semantics of a format gives us the terms and concepts we need to talk about the format abstractly before we get to any concrete details.

Start with the data

Let's start with a simple example. Here is a base64 encoding of some arbitrary data.

oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU
uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE=

How does somebody make sense of this? On my system I know how I created this file and I know the filename, so I know exactly what it is. But without that, what do you do? Let's use the file tool to make a guess by running file -i on it.

application/octet-stream; charset=binary

That's not useful. The output might as well have been

¯\_(ツ)_/¯

Add the metadata

The only way you're going to make sense of this data is if I start giving you more information. The first thing I'm going to tell you is that this is compressed with brotli.

The format will use the concept that data encoding is different from the file type. Generally that will mean we can compress the data without having to nest our metadata. I'll add a metadata field to the format that tells you the encoding.

content-encoding: br

If you decompress the data and look at the result, it will immediately become apparent what the file is. For the sake of continuing this article, let's pretend that we're not human beings with advanced pattern recognition capabilities, but a computer that's not being allowed to guess.

In order to understand this file, the next thing you need is to know the file type. Running file -i on the decompressed file gets it right this time, but we're no longer allowed to guess. I'll give you another metadata field that tells you the file type.

content-type: text/plain; charset=us-ascii

It's just plain ASCII text! Now that you have this information, you can properly interpret the data. Go ahead and try decoding the data now.
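
If you'd rather let a program do the decoding, here is a minimal sketch in C#. It peels the layers off in order: base64 (only there so the bytes could be pasted into an article), then brotli as named by content-encoding, then ASCII as named by content-type. It assumes .NET's built-in BrotliStream; the variable names are illustrative and not part of SDBD.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

// The base64 text from the example above, joined into one string.
const string encoded =
    "oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU" +
    "uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE=";

// Step 1: undo the base64 wrapping to get the raw bytes.
byte[] compressed = Convert.FromBase64String(encoded);

// Step 2: content-encoding: br says to brotli-decompress those bytes.
using var source = new MemoryStream(compressed);
using var brotli = new BrotliStream(source, CompressionMode.Decompress);
using var plain = new MemoryStream();
brotli.CopyTo(plain);

// Step 3: content-type says the result is US-ASCII text.
string text = Encoding.ASCII.GetString(plain.ToArray());

Console.WriteLine($"{compressed.Length} compressed bytes");
Console.WriteLine(text);
```

Without the two metadata fields, neither step 2 nor step 3 could be chosen without guessing.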

This looks familiar...

Go ahead and give yourself a gold star if you recognize what I've been doing so far. That's right, I am straight up ripping off HTTP. The metadata of SDBD will be a list of headers semantically equivalent to HTTP headers. I'll even say, for the sake of implementation, we'll follow similar rules:

  1. Header names are case-insensitive
  2. Multiple headers can have the same name
  3. The order of headers is significant and must be preserved

Note that when we get to the proof of concept, I won't be following any of these rules.
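
The three rules above can be sketched with an ordered list of name/value pairs instead of a dictionary. This is only an illustration: the GetAll helper and the x-tag header are made up for the example, and nothing here comes from the proof of concept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A list of pairs keeps duplicate names and preserves insertion order.
var headers = new List<KeyValuePair<string, string>>
{
    new("Content-Type", "text/plain; charset=us-ascii"),
    new("x-tag", "first"),   // hypothetical header, used twice on purpose
    new("X-Tag", "second"),
};

// Hypothetical helper: every value stored under a name, compared without
// regard to case, returned in the order the headers were added.
IEnumerable<string> GetAll(string name) =>
    headers
        .Where(h => string.Equals(h.Key, name, StringComparison.OrdinalIgnoreCase))
        .Select(h => h.Value);

Console.WriteLine(string.Join(", ", GetAll("x-tag")));        // first, second
Console.WriteLine(string.Join(", ", GetAll("CONTENT-TYPE"))); // text/plain; charset=us-ascii
```

A plain Dictionary<string, string> fails on all three counts by default: its keys are case-sensitive, it rejects duplicate keys, and it makes no promise about enumeration order.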

What's your name?

I do need to make a few tweaks. Solving part of our original problem requires storing the filename. There aren't any standard HTTP headers that are exactly meant for this. content-disposition can contain a filename, but its real purpose is something else. That header will normally look something like this:

content-disposition: attachment; filename="filename.jpg"

This header is meant to tell a browser whether the response should be displayed in the browser or downloaded. SDBD isn't meant specifically for browsers so we would be including useless information just to store a filename in an awkwardly formatted field.

I'll create a new header for this.

content-name: sample.txt

When does it end?

We need a way to identify exactly what size the data is. content-length is technically optional in HTTP, since you can mark the end of a response by closing the connection. But we're not making any assumptions about the context of an SDBD, so we can't assume a connection to close. I'd rather not try to create a marker for the end of the data, so I'll say that content-length is required.

What do we actually need?

Let's wrap this up by deciding what's required, what should be present, and what's optional. I've already decided that content-length is required, and I think that's the only thing we absolutely need.

I think content-type should be present, but I don't want to make it an absolute requirement. I don't see a use case where you wouldn't want content-type, but let's not limit ourselves unnecessarily. Also, since I haven't explicitly stated it yet, the value of content-type must be a MIME type.

If the document is encoded (compressed), content-encoding must also be present unless the metadata is specifically describing the encoded data. I don't want to define exactly what the value of content-encoding must be, other than that implementations should support the common values used on the web: gzip, deflate, and br.

Other encodings should perhaps use their own MIME type, such as application/x-bzip2. I'm not ready to set that in stone in case somebody comes up with a use case for encoding that isn't about compression.

The data should have a good identifier as well, so we should give it a filename with content-name, a URI with content-location, or both.

Anything else is optional.

With that, here is the complete metadata for the example document.

content-name: sample.txt
content-encoding: br
content-type: text/plain; charset=us-ascii
content-length: 95
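
To make the required/recommended split concrete, here is a small sketch of a checker in C#. The Check helper is hypothetical and covers only the presence rules described above; it does not try to verify that content-type is a valid MIME type or that content-encoding matches the data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: sorts problems into errors (required headers)
// and warnings (headers that should be present).
static (List<string> Errors, List<string> Warnings) Check(
    IReadOnlyList<KeyValuePair<string, string>> metadata)
{
    // Header names are compared without regard to case.
    bool Has(string name) =>
        metadata.Any(h => string.Equals(h.Key, name, StringComparison.OrdinalIgnoreCase));

    var errors = new List<string>();
    var warnings = new List<string>();

    if (!Has("content-length")) errors.Add("content-length is required");
    if (!Has("content-type")) warnings.Add("content-type should be present");
    if (!Has("content-name") && !Has("content-location"))
        warnings.Add("content-name or content-location should be present");

    return (errors, warnings);
}

// The complete metadata for the example document.
var example = new List<KeyValuePair<string, string>>
{
    new("content-name", "sample.txt"),
    new("content-encoding", "br"),
    new("content-type", "text/plain; charset=us-ascii"),
    new("content-length", "95"),
};

var result = Check(example);
Console.WriteLine($"errors: {result.Errors.Count}, warnings: {result.Warnings.Count}"); // errors: 0, warnings: 0
```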

Sounds simple enough

That's all the information a computer would need to interpret the example data. The format consists of the data combined with metadata that is a list of headers semantically similar to HTTP headers. There's one header required for the length of the data, and a few others that are recommended. That's all there is to it.

Now we need to define what this will all look like in a real binary format.


What Does SDBD Actually Look Like?

To turn the semantics into a real file format, we need to define what they look like as actual data. How do you write an SDBD to a stream or a disk? How do we turn our HTTP-like semantics into a real document? That question almost answers itself. I'll keep stealing from HTTP.

Consider the options

There are three major versions of HTTP in use today, and each represents headers in a different format. HTTP/1.1 uses a human-readable plain text format. It's basically the format you'll see in browser development tools and blog posts about HTTP headers. HTTP/2 uses a compressed binary format called HPACK. HTTP/3 uses a variant of HPACK called QPACK that's designed to be less trouble for the underlying QUIC protocol.

I'm going to decide that SDBD will be a binary format. No part of it needs to be plain text. I'm going to choose HPACK over QPACK. Mostly that's because I think it should be easier to implement the proof of concept with HPACK due to available tools.

Represent

It's time to put it together. I already have plans for updating the format, so the very first byte of SDBD will be a version number. We'll start simply with 0x01. The rest of this section will describe a version 1 document.

Next comes the metadata. We need to know how long the metadata is, so we'll use two bytes to store an unsigned integer that gives the length of the metadata section in bytes. In the example below, those two bytes are stored low byte first (little-endian). (But what happens if the metadata is longer than 65,535 bytes, the most two bytes can count? We'll worry about that later.) The metadata headers will be encoded with HPACK.

After the metadata comes the data, which better be the exact length defined by content-length, or bad things will happen.

That's it. If we encode the original example with this format, this is the result:

0000	01 3a 00 00 89 21 ea 49  6a 4a d5 0e 92 ff 8d 5f   .:...!.IjJ....._
0010	ff f8 20 74 d7 41 57 4f  94 af 1d 9f 0f 0b 02 62   .. t.AWO.......b
0020	72 0f 10 94 49 7c a5 8a  e8 19 aa fb 50 93 8e c4   r...I|......P...
0030	15 30 5a 85 86 82 18 df  0f 0d 02 39 35 a1 20 04   .0Z........95. .
0040	00 20 50 6e eb 2b e9 96  6c 96 ac 25 3e a8 69 22   . Pn.+..l..%>.i"
0050	88 0b ea c9 42 ee de f8  24 4d b3 bd 08 d5 80 db   ....B...$M......
0060	f0 b9 02 31 a3 be 91 b7  65 bd 28 82 7f 92 1c 88   ...1....e.(.....
0070	f2 38 1d eb 56 54 b9 93  d3 59 62 6a 7f 65 30 f4   .8..VT...Ybj.e0.
0080	dd 5c 62 80 01 74 42 12  6e 34 a8 1c e5 9d f4 91   .\b..tB.n4......
0090	3c 34 83 10 00 1a 70 2e  bb f5 aa 01               <4....p.....
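
As a sanity check on the dump, the layout arithmetic can be worked through in a few lines of C#. This sketch reads the length as little-endian, which is what the dump shows (3a 00 for 58), and takes the data length from the content-length header in the example metadata.

```csharp
using System;
using System.Buffers.Binary;

// The first three bytes of the dump above.
byte[] prefix = { 0x01, 0x3a, 0x00 };

int version = prefix[0];

// Two-byte unsigned length, low byte first: 0x3a 0x00 -> 0x003a.
int metadataLength = BinaryPrimitives.ReadUInt16LittleEndian(prefix.AsSpan(1, 2));

int dataOffset = 1 + 2 + metadataLength;  // version byte + length field + metadata
int contentLength = 95;                   // from the content-length header
int totalLength = dataOffset + contentLength;

Console.WriteLine($"version {version}, metadata {metadataLength} bytes");
// prints "version 1, metadata 58 bytes"
Console.WriteLine($"data at 0x{dataOffset:x4}, total {totalLength} bytes");
// prints "data at 0x003d, total 156 bytes"
```

Offset 0x003d in the dump is the byte a1, which matches the first byte of the base64 example (oSAE decodes to a1 20 04), and the dump ends after 156 bytes.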

It is finally time to get to the good part. Let's write some code.


Building a Proof of Concept

The format certainly looks sound. Could there be any surprises when we try to implement it? There's one way to find out. I'll write a Proof of Concept. Finally, we get to the code!

Looking Sharp

My career experience has mostly been with .NET and web development. My latest job had me coding in Ruby on Rails. After a two-year absence, I'm anxious to get back into the world of .NET. I'm going to do the implementation in C# with .NET 8.0.

Start at the interface

When I code, I always start at the interfaces. Whether it's the user interface (UI) or the application programming interfaces (APIs), starting at the interfaces sets up good boundaries that guide good design. I'll start by defining the contract for an API that encodes and decodes SDBD data.

namespace SDBD;

public interface ICodec {
  Document Decode(byte[] data);
  byte[] Encode(Document document);
}

public record Document(
  Dictionary<string, string> Metadata,
  byte[] Data
);

A basic demo

The demo program will be a command line encoder/decoder. To encode, pass a file path on its own or with -e as the first parameter. The program will encode the file to SDBD with the original filename embedded and write a file with a .sdbd extension. To decode, pass -d as the first parameter and the path to an .sdbd file. It will write a file with the original data and the original filename.

SDBD.ICodec codec = new SDBD.Codec();

var (encode, filepath) = ParseArgs(args);
var inputData = File.ReadAllBytes(filepath);
var filename = Path.GetFileName(filepath);

if(encode) {
  SDBD.Document document = new (
    new() {
      { "content-name", filename }
    },
    inputData
  );

  var outputData = codec.Encode(document);

  File.WriteAllBytes($"{filename}.sdbd", outputData);
} else {
  var document = codec.Decode(inputData);

  File.WriteAllBytes(document.Metadata["content-name"], document.Data);
}

(bool encode, string filepath) ParseArgs(string[] args) {
  return args switch {
    [var filepath] => (true, filepath),
    ["-d", var filepath] => (false, filepath),
    ["-e", var filepath] => (true, filepath),
    _ => throw new Exception("I don't like those arguments")
  };
}

First run

I'll mock up an implementation of the interface to make sure the program is working. All it will do is echo back the data it's given, and give the document the name text.txt.

namespace SDBD;

public class Codec : ICodec {
  public Document Decode(byte[] data) {
    return new(
      new () {
        { "content-name", "text.txt" }
      },
      data
    );
  }

  public byte[] Encode(Document document) {
    return document.Data;
  }
}

If we run the program on any file, it will output the same file with the .sdbd extension added. Run the program on a file with the -d parameter and it will output the same data with the file name text.txt. Looking good so far.


The Heart of the Code

To complete the proof of concept we need to implement SDBD.ICodec. This is what we're proving after all. The most complicated part will be the HPACK encoding. I'd rather not implement that myself, not for something basic. Fortunately there is a NuGet package that should do the trick. It's called simply hpack.

At the risk of pulling a "rest of the owl", I'm going to jump straight to the final code. I think a short walkthrough is enough to show how the code works without going through the coding process step by step.

Encoder

public byte[] Encode(Document document) {
  var dataLength = document.Data.Length;
  var contentLength = new Dictionary<string, string> () {
    { "content-length", dataLength.ToString() }
  };
  var headers = document.Metadata.Union(contentLength);

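  var packedHeaders = packHeaders(headers);
  // Throws OverflowException if the packed headers exceed 65,535 bytes
  var headerLength = Convert.ToUInt16(packedHeaders.Length);

  using var output = new MemoryStream();

  output.WriteByte(0x01); // format version
  // BitConverter uses the machine's byte order (little-endian on x86 and most ARM systems)
  output.Write(BitConverter.GetBytes(headerLength));
  output.Write(packedHeaders);
  output.Write(document.Data);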

  return output.ToArray();
}

private byte[] packHeaders(IEnumerable<KeyValuePair<string, string>> headers) {
  // 0 disables the dynamic table, which we don't need anyway
  var encoder = new hpack.Encoder(0);
  using var output = new MemoryStream();
  using var writer = new BinaryWriter(output);

  foreach(var (name, value) in headers) {
    encoder.EncodeHeader(writer, name, value);
  }

  return output.ToArray();
}

The first thing we do is add our one required header to the metadata: content-length. We then pack the headers. The hpack encoder is pretty easy to use. Passing that 0 into the constructor should disable some HPACK features that don't make sense in the context of SDBD.

That gives us all the bits we need to write out a document. Version number: 0x01. Header length as an unsigned 16-bit integer. The packed headers themselves. And finally the data.

If I encode a file named test.txt with the content This is a test. This is only a test., this is the output:

0000	01 17 00 00 89 21 ea 49  6a 4a d5 0e 92 ff 86 49   .....!.IjJ.....I
0010	50 95 d3 e5 3f 0f 0d 02  33 36 54 68 69 73 20 69   P...?...36This i
0020	73 20 61 20 74 65 73 74  2e 20 54 68 69 73 20 69   s a test. This i
0030	73 20 6f 6e 6c 79 20 61  20 74 65 73 74 2e         s only a test.

Looks promising. The first byte is the version, the next two decode to integer 23. The next 23 bytes look like the headers to me, and we can even see the content length 36 in the text. Finally there's the 36 bytes of data.
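
That reading of the dump can be checked without the hpack package. The sketch below splits the 62 bytes by hand and decodes just the content-length header using the HPACK integer rules from RFC 7541. The ReadInt helper is hypothetical, and the offset 18 comes from inspecting the dump (the content-name header occupies the first 18 bytes of the header block); this is not how the proof of concept does it.

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

// HPACK prefix integer (RFC 7541, section 5.1): the low prefixBits of the first
// byte hold the value; if they are all ones, 7-bit continuation bytes follow.
static int ReadInt(byte[] buf, ref int pos, int prefixBits)
{
    int max = (1 << prefixBits) - 1;
    int value = buf[pos++] & max;
    if (value < max) return value;

    int shift = 0;
    byte next;
    do
    {
        next = buf[pos++];
        value += (next & 0x7f) << shift;
        shift += 7;
    } while ((next & 0x80) != 0);
    return value;
}

// The 62 bytes from the dump above, one string per row.
byte[] doc = Convert.FromHexString(
    "011700008921ea496a4ad50e92ff8649" +
    "5095d3e53f0f0d023336546869732069" +
    "73206120746573742e20546869732069" +
    "73206f6e6c79206120746573742e");

int version = doc[0];
int headerLength = BinaryPrimitives.ReadUInt16LittleEndian(doc.AsSpan(1, 2));
byte[] headers = doc[3..(3 + headerLength)];
byte[] data = doc[(3 + headerLength)..];

// The content-length header starts at offset 18 of the header block.
int pos = 18;
int nameIndex = ReadInt(headers, ref pos, 4);   // 0x0f 0x0d -> 15 + 13
bool huffman = (headers[pos] & 0x80) != 0;      // top bit of the length byte
int valueLength = ReadInt(headers, ref pos, 7);
string value = Encoding.ASCII.GetString(headers, pos, valueLength);

Console.WriteLine($"version {version}, {headerLength} header bytes, {data.Length} data bytes");
// prints "version 1, 23 header bytes, 36 data bytes"
Console.WriteLine($"static table index {nameIndex}, huffman {huffman}, value \"{value}\"");
// prints "static table index 28, huffman False, value "36""
Console.WriteLine(Encoding.ASCII.GetString(data));
// prints "This is a test. This is only a test."
```

Index 28 in the HPACK static table is content-length, so the header reads content-length: 36, which agrees with the 36 bytes of data that follow.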

Decoder

public Document Decode(byte[] data) {
  using var input = new MemoryStream(data);

  var version = input.ReadByte();

  return version switch {
    0x01 => DecodeV1(input),
    _ => throw new Exception("Unsupported version")
  };
}

private Document DecodeV1(Stream stream) {
  var headerLengthBytes = new byte[2];
  stream.ReadExactly(headerLengthBytes);
  var headerLength = BitConverter.ToUInt16(headerLengthBytes);

  var headerBytes = new byte[headerLength];
  stream.ReadExactly(headerBytes);
  var headers = unpackHeaders(headerBytes);
   
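  // content-length is required, so fail clearly if it is missing
  if (!headers.Remove("content-length", out var contentLengthString)) {
    throw new Exception("Missing content-length header");
  }
  var contentLength = int.Parse(contentLengthString);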

  var data = new byte[contentLength];
  stream.ReadExactly(data);

  return new Document(headers, data);
}

To decode we first read the version byte. If it's 0x01, the only version implemented so far, we jump into the real implementation. (Otherwise we throw an exception.)

We read the next two bytes to get the header length. We read that number of bytes to get the encoded headers. We decode the headers, extract content-length, and read that number of bytes as the data. Then we pack it up in our Document data structure and send it back.

I left out the implementation of unpackHeaders here because it's mildly confusing if you're not familiar with old .NET patterns. You can find it with the complete source for this implementation on GitHub.

And now...

It's time for a break. I have successfully created a brand new data format. Or at least stitched one together like Frankenstein's monster. I have a working implementation that can encode and decode the format, albeit with only one header implemented.

I'm already building a list of improvements. I'm not planning on tackling them for a bit. First I want to hear other people's feedback. Since this is the internet, somebody will no doubt tell me that something like this already exists. If that's true, great! I'll be sure to link it here for anybody interested.

Either way I certainly haven't wasted my time, and I hope you don't feel like you've wasted yours. I think the process of developing the SDBD format and a proof of concept was a valuable learning experience on its own. I want to hear feedback, so feel free to log issues on GitHub or find me on the Fediverse.

If you like SDBD and want to use it, go for it! I think the format itself is sound. The demo implementation has at least one security issue, so use it with caution.


Improvements to SDBD

Format

  1. QPACK might be better than HPACK, but I can't find a .NET implementation of QPACK.
  2. Both QPACK and HPACK use a static table designed for compressing the most common headers on the web. A different static table designed for SDBD could compress things better.
  3. For better interaction with streams, I want to change the metadata length value from number of bytes to number of headers. We know how many headers there are before we start encoding them. We don't know how many bytes the result will be until encoding is done.
  4. Also 64KB of header data is possible, but having over 64K headers shouldn't happen. Even 256 headers seems absurd, so maybe number of headers could be shrunk to one byte.

Demo Implementation

  1. The proof of concept could use tests, error handling, data validation, and a fix for the warnings. At the least, validate the content-name so nobody can do anything hacky with it.
  2. It shouldn't be hard to add content-type to the demo output.
  3. I think a reference API built around streams instead of byte arrays would feel nicer.
  4. Wrangling the hpack NuGet package to work around streams might take some effort.
  5. Also a spec-compliant version of Document. Dictionary<string, string> breaks all three rules for headers listed back in Semantics.
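
For the first demo item, one way to validate content-name before writing a decoded file is sketched below. IsSafeContentName is a hypothetical helper, not something in the demo, and it is deliberately strict: it accepts only a bare file name and makes no claim to cover every platform quirk (reserved device names on Windows, for example, are not handled).

```csharp
using System;
using System.IO;

// Hypothetical helper: accept a content-name only if it is a bare file name.
static bool IsSafeContentName(string name)
{
    if (string.IsNullOrWhiteSpace(name)) return false;
    if (name is "." or "..") return false;
    // Reject both separator styles regardless of the current OS.
    if (name.Contains('/') || name.Contains('\\')) return false;
    // Reject characters the current platform does not allow in file names.
    if (name.IndexOfAny(Path.GetInvalidFileNameChars()) >= 0) return false;
    return true;
}

Console.WriteLine(IsSafeContentName("sample.txt"));        // True
Console.WriteLine(IsSafeContentName("../../etc/passwd"));  // False
Console.WriteLine(IsSafeContentName(@"C:\temp\evil.exe")); // False
```

In the demo program, a check like this would run on document.Metadata["content-name"] before the call to File.WriteAllBytes, so a crafted document can't choose where its data lands.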