Crisp Byte

Semantics of SDBD

To really start bringing this new data format to life, we need to talk about what's in it. Establishing the semantics of a format gives us the terms and concepts we need to talk about the format abstractly before we get to any concrete details.

Start with the data #

Let's start with a simple example. Here is a base64 encoding of some arbitrary data.

oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU
uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE=

How does somebody make sense of this? On my system I know how I created this file and I know the filename, so I know exactly what it is. But without that, what do you do? Let's use the file tool to make a guess by running file -i on it.

application/octet-stream; charset=binary

That's not useful. The output might as well have been

¯\_(ツ)_/¯

Add the metadata #

The only way you're going to make sense of this data is if I start giving you more information. The first thing I'm going to tell you is that this is compressed with brotli.

The format will treat the encoding of the data as separate from its file type. In practice, that means we can compress the data without having to nest one layer of metadata inside another. I'll add a metadata field to the format that tells you the encoding.

content-encoding: br

If you decompress the data and look at the result, it will immediately become apparent what the file is. For the sake of continuing this article, let's pretend that we're not human beings with advanced pattern recognition capabilities, but a computer that's not being allowed to guess.

In order to understand this file, the next thing you need is to know the file type. Running file -i on the decompressed file gets it right this time, but we're no longer allowed to guess. I'll give you another metadata field that tells you the file type.

content-type: text/plain; charset=us-ascii

It's just plain ASCII text! Now that you have this information, you can properly interpret the data. Go ahead and try decoding the data now.
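
As a rough sketch of those two steps in Python: undo the content-encoding first, then interpret the resulting bytes according to content-type. The `brotli` module here is the third-party Brotli bindings, which is an assumption on my part since nothing in this article depends on Python. The names `SAMPLE_B64`, `charset_of`, and `decode_sample` are made up for the illustration.

```python
import base64
from email.message import Message

try:
    import brotli  # third-party Brotli bindings; assumed to be installed
except ImportError:
    brotli = None

# The base64 text from the start of the article, joined into one string.
SAMPLE_B64 = (
    "oSAEACBQbusr6ZZslqwlPqhpIogL6slC7t74JE2zvQjVgNvwuQIxo76Rt2W9KIJ/khyI8jgd61ZU"
    "uZPTWWJqf2Uw9N1cYoABdEISbjSoHOWd9JE8NIMQABpwLrv1qgE="
)


def charset_of(content_type):
    """Return the charset parameter of a content-type value, or None."""
    msg = Message()
    msg["content-type"] = content_type
    return msg.get_content_charset()


def decode_sample():
    """Apply the two metadata fields in order: encoding first, then type."""
    raw = base64.b64decode(SAMPLE_B64)

    # content-encoding: br
    if brotli is None:
        raise RuntimeError("the third-party 'brotli' package is required")
    plain = brotli.decompress(raw)

    # content-type: text/plain; charset=us-ascii
    return plain.decode(charset_of("text/plain; charset=us-ascii"))
```

With the Brotli package installed, `print(decode_sample())` should show the original text.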

This looks familiar... #

Go ahead and give yourself a gold star if you recognize what I've been doing so far. That's right, I am straight up ripping off HTTP. The metadata of SDBD will be a list of headers semantically equivalent to HTTP headers. I'll even say, for the sake of implementation, we'll follow similar rules:

  1. Header names are case-insensitive
  2. Multiple headers can have the same name
  3. The order of headers is significant and must be preserved

Note that when we get to the proof of concept, I won't be following any of these rules.
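
For a sense of what those three rules ask of an implementation, here is a minimal Python sketch of a header container. `HeaderList` and its methods are hypothetical names invented for this example, not part of any SDBD implementation.

```python
class HeaderList:
    """An ordered list of (name, value) pairs with case-insensitive names."""

    def __init__(self):
        # A plain list keeps insertion order and allows repeated names.
        self._items = []

    def add(self, name, value):
        """Append a header without replacing earlier ones of the same name."""
        self._items.append((name, value))

    def get_all(self, name):
        """Return every value stored under `name`, ignoring case, in order."""
        wanted = name.lower()
        return [v for n, v in self._items if n.lower() == wanted]

    def get(self, name, default=None):
        """Return the first value stored under `name`, or `default`."""
        values = self.get_all(name)
        return values[0] if values else default

    def __iter__(self):
        # Iteration yields the headers exactly as they were added.
        return iter(self._items)
```

A dictionary keyed by header name would be simpler, but it would lose rule 2 and make rule 3 harder to guarantee, which is why the sketch stores pairs in a list.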

What's your name? #

I do need to make a few tweaks. Solving part of our original problem requires storing the filename. There aren't any standard HTTP headers that are exactly meant for this. content-disposition can contain a filename, but its real purpose is something else. That header will normally look something like this:

content-disposition: attachment; filename="filename.jpg"

This header is meant to tell a browser whether the response should be displayed in the browser or downloaded. SDBD isn't meant specifically for browsers so we would be including useless information just to store a filename in an awkwardly formatted field.

I'll create a new header for this.

content-name: sample.txt

When does it end? #

We need a way to identify exactly what size the data is. content-length is technically optional in HTTP, since you can mark the end of a response by closing the connection. But we're not making any assumptions about the context of an SDBD, so we can't assume a connection to close. I'd rather not try to create a marker for the end of the data, so I'll say that content-length is required.
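
To show why a required length is enough, here is a small Python sketch that reads exactly content-length bytes from a stream and stops, with no end marker involved. The function name `read_body` is hypothetical, and the sketch assumes the length has already been pulled out of the metadata somehow, since the actual layout hasn't been defined yet.

```python
import io


def read_body(stream, content_length):
    """Read exactly `content_length` bytes from a binary stream."""
    if content_length < 0:
        raise ValueError("content-length must not be negative")

    chunks = []
    remaining = content_length
    while remaining:
        chunk = stream.read(remaining)
        if not chunk:
            # The stream ended before the promised number of bytes arrived.
            raise EOFError(f"expected {remaining} more bytes")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


# Anything after the body is left in the stream for whoever reads next.
stream = io.BytesIO(b"hello world")
body = read_body(stream, 5)
rest = stream.read()
```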

What do we actually need? #

Let's wrap this up by deciding what's required, what should be present, and what's optional. I've already decided that content-length is required, and I think that's the only thing we absolutely need.

I think content-type should be present, but I don't want to make it an absolute requirement. I don't see a use case where you wouldn't want content-type, but let's not limit ourselves unnecessarily. Also, since I haven't explicitly stated it yet, the value of content-type must be a MIME type.

If the document is encoded (compressed), content-encoding must also be present unless the metadata is specifically describing the encoded data. I don't want to define exactly what the value of content-encoding must be, other than that implementations should support the common values used on the web: gzip, deflate, and br.

Other encodings should perhaps use their own MIME type, such as application/x-bzip2. I'm not ready to set that in stone in case somebody comes up with a use case for encoding that isn't about compression.
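
One way an implementation might support those common values is a lookup table from content-encoding to a decoder. This is only a sketch: `DECODERS` and `decode_content` are hypothetical names, `br` is registered only if the third-party `brotli` package happens to be installed, and "deflate" is taken to mean zlib-wrapped data as it does in HTTP.

```python
import gzip
import zlib

# Decoders available from the Python standard library.
DECODERS = {
    "gzip": gzip.decompress,
    "deflate": zlib.decompress,  # HTTP's "deflate" is the zlib format
}

try:
    import brotli  # third-party; optional in this sketch

    DECODERS["br"] = brotli.decompress
except ImportError:
    pass


def decode_content(data, content_encoding=None):
    """Undo a single content-encoding, or return the data untouched."""
    if content_encoding is None:
        return data
    name = content_encoding.strip().lower()
    try:
        decoder = DECODERS[name]
    except KeyError:
        raise ValueError(f"unsupported content-encoding: {name}") from None
    return decoder(data)
```

Keeping the table open-ended leaves room for encodings that aren't about compression, without committing to them now.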

The data should have a good identifier as well, so we should give it a filename with content-name, a URI with content-location, or both.

Anything else is optional.

With that, here is the complete metadata for the example document.

content-name: sample.txt
content-encoding: br
content-type: text/plain; charset=us-ascii
content-length: 95
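
The rules from this section can be checked mechanically. The Python sketch below treats the `name: value` lines above purely as this article's notation, not as the eventual binary layout, and `parse_headers` and `check_headers` are hypothetical helpers. It reports a missing required header as an error and a missing recommended one as a warning.

```python
SAMPLE = """\
content-name: sample.txt
content-encoding: br
content-type: text/plain; charset=us-ascii
content-length: 95
"""


def parse_headers(text):
    """Split 'name: value' lines into (lowercased name, value) pairs."""
    headers = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"not a header line: {line!r}")
        headers.append((name.strip().lower(), value.strip()))
    return headers


def check_headers(headers):
    """Return (errors, warnings) for a list of (name, value) pairs."""
    first = {}
    for name, value in headers:
        first.setdefault(name, value)

    errors, warnings = [], []

    # Required: content-length, as a non-negative decimal integer.
    length = first.get("content-length")
    if length is None:
        errors.append("content-length is required")
    elif not (length.isascii() and length.isdigit()):
        errors.append("content-length must be a non-negative integer")

    # Recommended: content-type, plus at least one identifier.
    if "content-type" not in first:
        warnings.append("content-type should be present")
    if "content-name" not in first and "content-location" not in first:
        warnings.append("content-name or content-location should be present")

    # A missing content-encoding can't be detected from the headers alone,
    # so this sketch doesn't try to check that rule.
    return errors, warnings


errors, warnings = check_headers(parse_headers(SAMPLE))
```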

Sounds simple enough #

That's all the information a computer would need to interpret the example data. The format consists of the data combined with metadata that is a list of headers semantically similar to HTTP headers. There's one header required for the length of the data, and a few others that are recommended. That's all there is to it.

Now we need to define what this will all look like in a real binary format.