Binary Markup Language

Binary Markup Language, or BML, is a platform-independent binary file meta-format with semantics inspired by XML.

Goals

BML draws on concepts from a number of existing formats. In particular, BML is very similar to IFF, except that element and attribute names are not limited to 4 characters, and it discourages the use of fixed-format structures in binary files. BML only supports elements and attributes. There seems to be no need for the XML equivalent of text nodes, comments, processing instructions or entities (although "enum attributes" are similar to entities, see below). BML currently does not support validation or DTDs, although that can be added if needed.

BML does not yet define how floating-point numbers are stored. This will need to be addressed soon by selecting a standard, machine-independent way to encode them.

Theory of Operation

Nodes and Attributes

A BML stream encodes a hierarchical tree structure of element nodes and attributes. Each element node can have an arbitrary number of child elements and an arbitrary number of attributes.

A BML stream consists of a set of opcodes which can reconstruct this tree, either as a series of nested events (as in SAX) or as an in-memory structure (as in DOM).

Each element node and attribute has an associated identifier. In the case of a node, the identifier indicates the "type" of the node. It is equivalent to the element name in XML. In the case of the attribute, the identifier indicates the name of the attribute.

Encoding of Identifiers

At the API level, these identifiers appear to be plain ASCII strings. At the stream level, however, they are encoded as tokens. Each time an identifier is passed to the BML writer API, the writer determines whether that identifier has been seen before. If it has not, the writer emits a "define token" opcode. In either case it then uses the resulting token ID in place of the identifier for the rest of the document. A dictionary is kept of the tokens defined up to that point in the stream.
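The dictionary lookup described above could be sketched roughly as follows. The class name TokenTable and the isNew out-parameter are hypothetical and only serve the illustration. Numbering token IDs from 1 follows the comment on the "define token" opcode further down.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical helper: hands out token IDs in order of first use.
class TokenTable {
public:
    // Returns the token ID for `name`. `isNew` is set to true when the
    // identifier has not been seen before; the caller would then emit a
    // "define token" opcode before using the ID.
    std::uint32_t intern(const std::string &name, bool &isNew) {
        auto it = ids_.find(name);
        if (it != ids_.end()) {
            isNew = false;
            return it->second;
        }
        isNew = true;
        std::uint32_t id = nextId_++;
        ids_.emplace(name, id);
        return id;
    }

private:
    std::unordered_map<std::string, std::uint32_t> ids_;
    std::uint32_t nextId_ = 1;  // token IDs start at 1
};
```

A reader would presumably keep the mirror image of this table, mapping IDs back to names as "define token" opcodes arrive.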

Variable-length Integers: VarInt

BML streams make use of "variable-length integers", or VarInts. These are integers stored in a variable number of bytes, depending on the magnitude of the integer. The format for storing a VarInt is exactly the same as in the Standard MIDI File format. A value of 0-127 is stored as a single byte. Values larger than this are stored as a series of bytes, each byte containing 7 bits of the integer, in order from most significant to least significant bits. The high bit of each byte marks whether the sequence continues: every byte of the sequence has the high bit set except for the last byte.
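As an illustration of the byte layout, here is a minimal sketch of a VarInt encoder and decoder. The function names writeVarInt and readVarInt, the use of std::vector as the byte buffer, and the 32-bit value range are assumptions made for this example, not part of the format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Append `value` to `out` as a VarInt: 7 bits per byte, most significant
// group first, high bit set on every byte except the last.
void writeVarInt(std::vector<std::uint8_t> &out, std::uint32_t value) {
    std::uint8_t groups[5];  // a 32-bit value needs at most five 7-bit groups
    int count = 0;
    do {
        groups[count++] = static_cast<std::uint8_t>(value & 0x7f);
        value >>= 7;
    } while (value != 0);
    while (count > 1) {
        out.push_back(static_cast<std::uint8_t>(groups[--count] | 0x80));
    }
    out.push_back(groups[0]);
}

// Decode one VarInt from `data` starting at `pos`, advancing `pos` past it.
// Returns false if the data ends before a byte with a clear high bit.
// Overflow beyond 32 bits is not checked in this sketch.
bool readVarInt(const std::vector<std::uint8_t> &data, std::size_t &pos,
                std::uint32_t &value) {
    assert(pos <= data.size());
    value = 0;
    while (pos < data.size()) {
        std::uint8_t b = data[pos++];
        value = (value << 7) | (b & 0x7fu);
        if ((b & 0x80) == 0) return true;
    }
    return false;
}
```

With this layout, 127 occupies the single byte 0x7F, 128 becomes the two bytes 0x81 0x00, and 16383 becomes 0xFF 0x7F.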

Negative integers are stored similarly, but using a different opcode.

In many cases, if a number is very small (<64), it will be stored directly in the opcode.

Strings are stored as a VarInt containing the length of the string, followed by the actual string bytes. The terminating null character is not stored.
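A length-prefixed string might be written along these lines. The function name writeString is hypothetical, and the sketch caps the length at four 7-bit groups purely to keep the inline VarInt code short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Append a string as a VarInt byte count followed by the raw bytes.
// No terminating null character is written.
void writeString(std::vector<std::uint8_t> &out, const std::string &s) {
    std::size_t len = s.size();
    assert(len < (1u << 28));  // limit of this sketch: four 7-bit groups
    // VarInt length prefix, most significant group first. Leading groups
    // are skipped while everything at or above them is zero.
    for (int shift = 21; shift > 0; shift -= 7) {
        if ((len >> shift) != 0) {
            out.push_back(
                static_cast<std::uint8_t>(((len >> shift) & 0x7f) | 0x80));
        }
    }
    out.push_back(static_cast<std::uint8_t>(len & 0x7f));
    // The string bytes themselves.
    out.insert(out.end(), s.begin(), s.end());
}
```

For example, "abc" becomes the four bytes 0x03 'a' 'b' 'c', and a 200-byte string starts with the two-byte length prefix 0x81 0x48.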

Encoding of String Values

Actual string attribute values (as opposed to identifiers) should always be encoded as UTF-8. (Q: Is this correct? Or should we have different encoding types.)

Stream Signature

Each BML stream begins with a stream signature. The signature is similar to that used for PNG, except that the letters PNG have been replaced with BML:
#define BML_SIGNATURE  "\211BML\r\n\032\n"
The signature should be followed by a version number, stored as a VarInt. This is the version number of BML itself, not the version number of the encoded elements. It is recommended that the top-most element of a BML stream carry a "version" attribute for versioning the actual data contained within.

The current BML version number is 1.
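Writing and checking the stream header could look something like the sketch below. The function names writeHeader and checkHeader and the constant kBmlVersion are hypothetical. The check also assumes that a reader only accepts version 1, which the text does not actually require.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

#define BML_SIGNATURE  "\211BML\r\n\032\n"

// Signature length without the C string's terminating null.
const std::size_t kSignatureLength = sizeof(BML_SIGNATURE) - 1;
static_assert(sizeof(BML_SIGNATURE) - 1 == 8, "signature is 8 bytes");

const std::uint8_t kBmlVersion = 1;  // fits in a single-byte VarInt

// Append the signature and the BML version number to `out`.
void writeHeader(std::vector<std::uint8_t> &out) {
    const char *sig = BML_SIGNATURE;
    out.insert(out.end(), sig, sig + kSignatureLength);
    out.push_back(kBmlVersion);
}

// Returns true if `data` starts with the signature followed by version 1.
bool checkHeader(const std::vector<std::uint8_t> &data) {
    if (data.size() < kSignatureLength + 1) return false;
    if (std::memcmp(data.data(), BML_SIGNATURE, kSignatureLength) != 0) {
        return false;
    }
    return data[kSignatureLength] == kBmlVersion;
}
```

As with PNG, the non-ASCII first byte and the embedded line endings are there so that a file mangled by a text-mode transfer no longer matches the signature.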

A Sample Writer API:

Here is a sample of what an API for writing BML files might look like:
class BinaryMarkupWriterInterface {
public:
        /** Add a string-valued attribute to the current element. */
    void putStringAttribute( const char *name, const char *val );
        /** Add an enum-valued attribute to the current element.
            (This is semantically identical to putStringAttribute,
            except the value is stored as a token.)
        */
    void putEnumAttribute( const char *name, const char *val );
        /** Add an integer-valued attribute to the current element */
    void putIntegerAttribute( const char *name, long val );
        /** Add a double-valued attribute to the current element */
    void putDoubleAttribute( const char *name, double val );
        /** Add a boolean-valued attribute to the current element */
    void putBooleanAttribute( const char *name, bool val );
        /** Add a binary-valued attribute to the current element */
    void putBinaryAttribute( const char *name, const char *data, size_t len );
        /** Start a new nested element */
    void startElement( const char *className );
        /** Pop the current element */
    void endElement();
        /** Write an end-of-stream marker */
    void end();
};
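To show the intended call order, here is a small sketch that drives a writer with the same method names. TraceWriter is a hypothetical stand-in that records calls as text instead of emitting opcodes, and it covers only the methods the example needs. The element and attribute names ("document", "layer", and so on) are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a real writer: logs each call as a line of text.
class TraceWriter {
public:
    void putStringAttribute(const char *name, const char *val) {
        log += indent() + name + "=\"" + val + "\"\n";
    }
    void putIntegerAttribute(const char *name, long val) {
        log += indent() + name + "=" + std::to_string(val) + "\n";
    }
    void startElement(const char *className) {
        log += indent() + "<" + className + ">\n";
        ++depth_;
    }
    void endElement() {
        assert(depth_ > 0);  // every endElement needs a matching startElement
        --depth_;
        log += indent() + "</>\n";
    }
    void end() {
        assert(depth_ == 0);  // all elements must be closed first
        log += "end\n";
    }
    std::string log;

private:
    std::string indent() const { return std::string(2 * depth_, ' '); }
    std::size_t depth_ = 0;
};

// Works with any writer that offers the methods used here.
template <typename Writer>
void writeSampleDocument(Writer &w) {
    w.startElement("document");
    w.putIntegerAttribute("version", 1);  // attribute of "document"
    w.startElement("layer");
    w.putStringAttribute("name", "background");  // attribute of "layer"
    w.endElement();  // closes "layer"
    w.endElement();  // closes "document"
    w.end();
}
```

Attributes always apply to the most recently started element that is still open, so in this sketch each element's attributes are written before its children are started.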

Sample Reader API:

A BML reader in the SAX style would often consist of two separate classes: a generic "parser" class, combined with an application-specific "handler" class:
class BinaryMarkupInputHandler {
public:
    virtual bool onStringAttribute( const char *name, const char *val ) = 0;
    virtual bool onEnumAttribute( const char *name, const char *val ) = 0;
    virtual bool onIntegerAttribute( const char *name, long val ) = 0;
    virtual bool onDoubleAttribute( const char *name, double val ) = 0;
    virtual bool onBooleanAttribute( const char *name, bool val ) = 0;
    virtual bool onBinaryAttribute( const char *name,
                                    const char *data,
                                    size_t length ) = 0;
    virtual bool onStartElement( const char *name ) = 0;
    virtual bool onEndElement( const char *name ) = 0;
};
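An application-specific handler might then look like the sketch below. The class ElementCounter is hypothetical. The text does not say what the bool return values mean, so the sketch assumes that returning false asks the parser to stop. The interface is repeated so that the example compiles on its own, with a virtual destructor added.

```cpp
#include <cassert>
#include <cstddef>

// Interface repeated from the listing above.
class BinaryMarkupInputHandler {
public:
    virtual ~BinaryMarkupInputHandler() {}  // added here; not in the listing
    virtual bool onStringAttribute( const char *name, const char *val ) = 0;
    virtual bool onEnumAttribute( const char *name, const char *val ) = 0;
    virtual bool onIntegerAttribute( const char *name, long val ) = 0;
    virtual bool onDoubleAttribute( const char *name, double val ) = 0;
    virtual bool onBooleanAttribute( const char *name, bool val ) = 0;
    virtual bool onBinaryAttribute( const char *name,
                                    const char *data,
                                    size_t length ) = 0;
    virtual bool onStartElement( const char *name ) = 0;
    virtual bool onEndElement( const char *name ) = 0;
};

// Hypothetical handler: counts elements and attributes, tracks nesting.
class ElementCounter : public BinaryMarkupInputHandler {
public:
    bool onStringAttribute(const char *, const char *) override { return count(); }
    bool onEnumAttribute(const char *, const char *) override { return count(); }
    bool onIntegerAttribute(const char *, long) override { return count(); }
    bool onDoubleAttribute(const char *, double) override { return count(); }
    bool onBooleanAttribute(const char *, bool) override { return count(); }
    bool onBinaryAttribute(const char *, const char *, size_t) override {
        return count();
    }
    bool onStartElement(const char *) override {
        ++elements;
        ++depth;
        return true;
    }
    bool onEndElement(const char *) override {
        if (depth == 0) return false;  // unbalanced end: assumed to mean "stop"
        --depth;
        return true;
    }
    int elements = 0;
    int attributes = 0;
    int depth = 0;

private:
    bool count() { ++attributes; return true; }
};
```

The generic parser would own the stream decoding and the token dictionary, and call into a handler like this one as it walks the opcodes.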

BML Opcode definitions:

enum {
        // End of stream marker
        // format: op()
    BMOp_End = 0,
        // Define a token for later use. The token is assigned the next available token id,
        // starting from 1 and incrementing by 1 for each token defined.
        // format: op( nameLength:varInt, name:uint8[ len ] )
    BMOp_DefineToken = 0x01,
        // A string-valued property
        // format: op( nameToken:varInt, strLength:varInt, strVal:uint8[ len ] )
    BMOp_StringAttribute = 0x02,
        // An integer-valued property
        // format: op( nameToken:varInt, value:varInt )
    BMOp_IntegerAttribute = 0x03,
        // An integer-valued property (stored as negative of real value)
        // format: op( nameToken:varInt, value:varInt )
    BMOp_NegativeIntegerAttribute = 0x04,
        // A double-valued property
        // format: op( nameToken:varInt, value:double[encoded?] )
    BMOp_DoubleAttribute = 0x05,
        // A boolean-valued property with value "true"
        // format: op( token:varInt )
    BMOp_BooleanTrueAttribute = 0x06,
        // A boolean-valued property with value "false"
        // format: op( token:varInt )
    BMOp_BooleanFalseAttribute = 0x07,
        // A string-valued property, stored as a token
        // format: op( token:varInt, valueToken:varInt )
    BMOp_EnumAttribute = 0x08,
        // A binary-valued property
        // format: op( token:varInt, length:varInt, data:uint8[ length ] )
    BMOp_BinaryAttribute = 0x09,
        // Start an element
        // format: op( elementNameToken:varInt )
    BMOp_StartElement = 0x0a,
        // Finish an element
        // format: op()
    BMOp_EndElement = 0x0b,
        // 0x0c - 0x1f reserved for future expansion and fixing boneheaded mistakes
        // A property for small negative integer values
        // Integer stored in low 5 bits.
        // format: op+val( token:varInt )
    BMOp_SmallNegIntegerAttribute = 0x20,
        // Define a token with a short name
        // Token name length stored in low 6 bits.
        // format: op+len( name:uint8[ len ] )
    BMOp_DefineSmallToken = 0x40,
        // A property for short strings
        // String length stored in low 6 bits.
        // format: op+len( val:uint8[ len ] )
    BMOp_SmallStringAttribute = 0x80,
        // A property for small integer values
        // Integer stored in low 6 bits.
        // format: op+val( token:varInt )
    BMOp_SmallIntegerAttribute = 0xc0,
};
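The "small" opcodes pack an operand into the low bits of the opcode byte. A reader could separate the two parts roughly as in the sketch below. The names DecodedOp and splitOpcode are hypothetical, and the ranges are inferred from the base values and bit widths in the table above: 0x20-0x3f carries 5 bits, and 0x40-0x7f, 0x80-0xbf and 0xc0-0xff each carry 6 bits.

```cpp
#include <cassert>
#include <cstdint>

struct DecodedOp {
    std::uint8_t base;     // opcode with the embedded operand bits cleared
    std::uint8_t operand;  // embedded length or value; 0 for plain opcodes
};

// Split an opcode byte into its base opcode and embedded operand.
DecodedOp splitOpcode(std::uint8_t op) {
    if (op >= 0x40) {
        // 0x40, 0x80 and 0xc0 families: top 2 bits select the opcode,
        // low 6 bits hold a length or value.
        return { static_cast<std::uint8_t>(op & 0xc0),
                 static_cast<std::uint8_t>(op & 0x3f) };
    }
    if (op >= 0x20) {
        // 0x20 family: low 5 bits hold the value.
        return { 0x20, static_cast<std::uint8_t>(op & 0x1f) };
    }
    // 0x00 - 0x1f: plain opcodes with no embedded operand.
    return { op, 0 };
}
```

For instance, the byte 0xc5 splits into the base 0xc0 (BMOp_SmallIntegerAttribute) with the value 5, while 0x0a (BMOp_StartElement) is returned unchanged with operand 0.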
