BYOB - object serialiazion

In the initial post I described the motivation behind developing a new OO database from scratch in pharo. At the end of that post I've sketched a possible API for Soil. Let's pick that and see how we can make it happen. I copied the snippet over here:

"create a new database path and initialize its content"
soil := Soil new
    path: 'soil';
    initializeFilesystem.
"create a transaction from soil which will be done within a process"
tx := soil newTransaction.
"set the root object to any object or object graph"
tx root: 'soil is ramping up'.
"commit the transaction which will write the content back to disk and updates all
dependent structures"
tx commit.
"clean up and release all OS resources"
soil close.

The snippet should set up a new database on disk under the path soil. Then it writes the string "soil is ramping up" as the root object into the database. This is a very simple example and also a very good start in order not to make it too complicated.

memory layouts

We start with an in-memory object and put that on disk. What we know is that on disk it needs to be represented as a sequence of bytes because disks can only store it that way. The act of writing the in-memory object into the sequence of bytes we call serialization. Looking at the in-memory object it is bytes as well but the location of those bytes and its structure need to be taken care of. Dealing with an object implies two things:

there needs to be a representation of an object somewhere in memory
we need to have a reference to that object

immediate objects

Unlike java, pharo does not have primitive types for integers, floats etc. "Everything is an object" is meant that way without exception. It means integers, floats and characters are objects as well. But it does not seem to be feasible to have that, right? Imagine the integer 1 which has only one bit of information. Having an object somewhere in memory that knows its class for that number would be overkill.

We've just found out that having access to an object means we have a reference to it. We can store information in the reference as well. In pharo object references have a few bits reserved to tag the reference as integer, float, character or heap object. For integers, floats and characters it is possible to encode their value in the reference itself. An integer is just the same as reference into memory, a float has a few bytes for mantissa and for exponent which fit into a reference. For a character we store the unicode code point which is also just a number.

Instead of having an object somewhere in memory where a reference points to we get the class information (from the tag bits) and the content immediately from the reference value. Therefor these are called immediate objects. The virtual machine handles immediate objects in a way they appear the same as every other object, we can do things like

1 class

and we get

SmallInteger

In the context of serialization we note that we can handle immediate objects almost like any other object as it has a class in the system. When the serializer manages object references we need to make that a special case. Instead of a reference to an immediate we always write the value of the immediate object.

heap objects

A heap object (not sure there is a better term) is an object where the reference to the object contains an address to a memory location where we can find that object. At the memory location the first thing we will find is another reference: The reference to the class of this object.

A class consists of a class layout which determines the memory structure after the class reference in the memory presentation. These are e.g.:

FixedLayout for objects with instance variables. It means the size of that object is fixed, it is the number of instance variables of that object. Instance variables are names in that layout as replacement for the index of that instance variables. Just like in a C struct.
VariableLayout for collections. The size of such an object can vary, we can add and remove items to/from a collection
many more we leave out here. If you are interested just have a look at the subclasses of AbstractLayout

For such objects we need to encode the information of its class and then we can ask the layout to do the concrete structure. All of these objects are heap objects and therefor references to those objects need to be treated by the serializer.

building the serializer

What we learned before is that we need to serialize based on the structure of an object and if it can be referenced or not. Let's sketch the general case for heap objects. We only know that they are derived from the class Object.

Object>>#soilSerialize: aSoilSerializer 
    aSoilSerializer 
        registerObject: self 
        ifAbsent: [ self soilBasicSerialize: aSoilSerializer ].

The method registerObject:ifAbsent: is for handling references to such an object. It can serialize surrogates of a known references and has an fallback to soilBasicSerialize: (the implementation of registerObject:ifAbsent: will be handled in a future post about graph serialization)

Object>>soilBasicSerialize: serializer
    "Delegate serialization to the class layout"
    self class classLayout soilBasicSerialize: self with: serializer

For the generic case we just delegate to the class layout of that object which has the responsibility to write the proper format. For FixedLayout this is

FixedLayout>>#soilBasicSerialize: anObject with: serializer
    | description |
    description := self soilSerializeBehaviorDescription: anObject with: serializer.

    description instVarNames do: [:ivarName | (anObject instVarNamed: ivarName) soilSerialize: serializer ]

We store the information about its class (will be handled in a future post about behavior management) and then we store each of its instance variable values in order. Calling soilSerialize: on the value of the instance variable enables to have a per class serialization.

We have now an API where we can:

implement soilBasicSerialize: if we want a different serialization to bytes for a class
implement soilSerialize: if we want to avoid the registration of an object (immediate objects) are have needs that needs a very differenct handling

storing the string

In the snippet at the beginning of this post we started with storing a string. What would happen if we store the string with this generic serializer? In pharo a string is a collection of characters. The serializer would encounter the collection and would call soilSerialize: on it which will end up in the class layout of that collection which is VariableLayout. The case for VariableLayout is this:

soilBasicSerialize: anObject with: serializer
    | description basicSize|

    description := self soilSerializeBehaviorDescription: anObject with: serializer.
    basicSize := anObject basicSize.

    serializer nextPutLengthEncodedInteger: basicSize.
    description instVarNames do: [:ivarName | (anObject instVarNamed: ivarName) soilSerialize: serializer ].
    1 to: basicSize do: [:i | (anObject basicAt: i) soilSerialize: serializer ]

It stores the information about the class, then the number of items in the collection and then it stores each item by calling soilSerialize: on it. A character being an immediate object does not make the situation easier. The serialization might even use the memory representation of a character which can be many bytes for a wide string. There are many problems for a string. But we can overwrite the way a string is serialized:

String>>#soilBasicSerialize: serializer
    serializer nextPutString: self

It seems to be easiest to just encode the string with utf-8 and store the resulting bytes as a compact and less error-prone variant. But how can we write that in a way that on materialization we know it is meant to be a string?

disk layouts

We can do something which is not too different to what immediate objects do. Instead of embedding class information in the serialized artefact we encode tags to the sequence of bytes to make us recognize on materialization. For that we need a format how to encode this

type - length - value

Encoding structured data can be done in basically two ways: textual or binary. A textual format has the advantage that it can be edited easily by humans. The downside is that textual formats need character delimiters to encode the structure. Delimiters are characters that are from the same set as the content in the structure is encoded. This makes it necessary to escape delimiters in the content which is brittle. As human editability is not one of our concerns we can just use a binary format which is likely to be more performant. And the best format I encountered so far is type - length - value

type

type can be a byte that we use as tag to identify what class the content belongs to. We do not expect a larger number of those types so a single byte should be sufficient

length

length is the number of bytes to read for the content . For the length it is not easy to tell how many bytes we need for it. Choosing a small number of bytes restricts the format to support only lengths up to what you encode in those bytes. Choosing a large number of bytes wastes memory where the lengths are rather small.

variable length encoding

to our help there is encoding that solves that. The number of bytes for the length is variable. It is using the MSB (most significant bit) of a byte to indicate an extended length. As long as we read bytes that have the MSB set we know there is another byte following that belongs to this length. This way the length is not restricted by size at all. The smalltalk code to help understand it is this

nextLengthEncodedInteger
    | value |
    value := self nextByte.
    (value < 128) ifTrue: [ ^ value ].
    ^ (self nextLengthEncodedInteger bitShift: 7) bitOr: (value bitAnd: 127)

and writing

SoilSerializer>>#nextPutLengthEncodedInteger: anInteger 
    "store length of integer encoded in a way that the presence of a
    most significant bit indicates that the next byte is part of
    the value"
    anInteger < 128 ifTrue: [ ^ self nextPutByte: anInteger ].
    self
        nextPutByte: ((anInteger bitAnd: 127) bitOr: 128);
        nextPutLengthEncodedInteger: (anInteger bitShift: -7)

value is the byte representation of the content. As it is only bytes and won't interpreted as such we can put everything there transparently

type codes

For the type we use a byte code. In order to avoid magic numbers everywhere in the code we want to have the values assigned once and then reused with the symbols defined for it. This could be class variables that are assigned their numeric value. In order to reuse the type code mapping we use a SharedPool for that so that we can use it in a test class as well.

SharedPool subclass: #SoilTypeCodes
    instanceVariableNames: ''
    classVariableNames: '... TypeCodeString ...'
    package: 'Soil-Core-Serialization'

For serialization we need

SoilTypeCodes class>>#initializeTypeCodes
...
   TypeCodeString := 13.
...

and the reverse mapping for reading

SoilTypeCodes class>>#initializeTypeCodeMapping

    TypeCodeMapping := Array new: 255.
    TypeCodeMapping
...
        at: TypeCodeString                  put: String;
...

defines the mapping from type code to instance creator.

wrapping up the serializer

While building the serializer above we wanted to have a special serialization for our string. After the detour over the type codes we can continue here. So this was the intended code:

String>>#soilBasicSerialize: serializer
    serializer nextPutString: self

the serializer can now be done this way

SoilSerializer>>#nextPutString: aString
    self
        nextPutByte: TypeCodeString;
        basicNextPutString: aString

SoilSerializer>>#basicNextPutString: aString
    | buf |
    buf := aString asByteArray.
    self
        nextPutLengthEncodedInteger: buf size;
        nextPutBytesFrom: buf

Here the string is turned into an byte array which might be confusing. The implementation for String is the general case which is ByteString, a string objects that only contains 8bit characters. If a wide character enters the string the string becomes a WideString and the implementation there is

WideString>>#soilBasicSerialize: serializer
    
    serializer nextPutWideString: self

SoilSerializer>>#nextPutWideString: aWideString 
    | buf |
    buf := self class encodeString: aWideString.
    self
        nextPutByte: TypeCodeWideString;
        nextPutLengthEncodedInteger: buf size;
        nextPutBytesFrom: buf

Finally we can use the serializer to convert the string to its byte representation

bytes := ByteArray streamContents: [ :stream |
    SoilSerializer new
        stream: stream;
        serialize: 'soil is ramping up' ]

results into

#[13 18 115 111 105 108 32 105 115 32 114 97 109 112 105 110 103 32 117 112]

which we can read back using

SoilMaterializer new
    stream: bytes readStream;
    materialize

and the materializer uses the inverse path to decode

SoilMaterializer>>#materialize
    ^ self nextSoilObject

SoilMaterializer>>#nextSoilObject
    ^ (TypeCodeMapping at: self nextByte) soilMaterialize: self

String class>>#soilMaterialize: materializer
      ^ materializer nextString

SoilMaterializer>>#nextString
    | string |
    string := self basicNextString.
    self registerObject: string.
    ^ string

SoilMaterializer>>#basicNextString
    | buf length |
    buf := ByteArray new: (length := self nextLengthEncodedInteger).
    stream readInto: buf startingAt: 1 count: length.
    ^ buf asString

which gives us

soil is ramping up

Conclusion

We have a serialization infrastructure that can serialize non-complex objects (not containing more references) to bytes which is a prerequisite to store it on disk. Making the snippet at the top work needs nothing more than to open a file and writing to it. With a pre-defined filename writing and reading works. In this second post we have almost built a minimal database that can store one simple object. Not very useful but a good starting point.

Things that were left out of this post: how to serialize a small graph involving more than a single object and how is the class/behavior information managed and stored. This will be the topics of the coming posts