In the initial post I described the motivation behind developing a new OO database from scratch in pharo. At the end of that post I've sketched a possible API for Soil. Let's pick that and see how we can make it happen. I copied the snippet over here:
"create a new database path and initialize its content"
soil := Soil new
path: 'soil';
initializeFilesystem.
"create a transaction from soil which will be done within a process"
tx := soil newTransaction.
"set the root object to any object or object graph"
tx root: 'soil is ramping up'.
"commit the transaction which will write the content back to disk and updates all
dependent structures"
tx commit.
"clean up and release all OS resources"
soil close.
The snippet should set up a new database on disk under the path soil. Then it writes the string "soil is ramping up" as the root object into the database. This is a very simple example and also a very good start in order not to make it too complicated.
We start with an in-memory object and put that on disk. What we know is that on disk it needs to be represented as a sequence of bytes because disks can only store it that way. The act of writing the in-memory object into the sequence of bytes we call serialization. Looking at the in-memory object it is bytes as well but the location of those bytes and its structure need to be taken care of. Dealing with an object implies two things:
Unlike java, pharo does not have primitive types for integers, floats etc. "Everything is an object" is meant that way without exception. It means integers, floats and characters are objects as well. But it does not seem to be feasible to have that, right? Imagine the integer 1 which has only one bit of information. Having an object somewhere in memory that knows its class for that number would be overkill.
We've just found out that having access to an object means we have a reference to it. We can store information in the reference as well. In pharo object references have a few bits reserved to tag the reference as integer, float, character or heap object. For integers, floats and characters it is possible to encode their value in the reference itself. An integer is just the same as reference into memory, a float has a few bytes for mantissa and for exponent which fit into a reference. For a character we store the unicode code point which is also just a number.
Instead of having an object somewhere in memory where a reference points to we get the class information (from the tag bits) and the content immediately from the reference value. Therefor these are called immediate objects. The virtual machine handles immediate objects in a way they appear the same as every other object, we can do things like
1 class
and we get
SmallInteger
In the context of serialization we note that we can handle immediate objects almost like any other object as it has a class in the system. When the serializer manages object references we need to make that a special case. Instead of a reference to an immediate we always write the value of the immediate object.
A heap object (not sure there is a better term) is an object where the reference to the object contains an address to a memory location where we can find that object. At the memory location the first thing we will find is another reference: The reference to the class of this object.
A class consists of a class layout which determines the memory structure after the class reference in the memory presentation. These are e.g.:
For such objects we need to encode the information of its class and then we can ask the layout to do the concrete structure. All of these objects are heap objects and therefor references to those objects need to be treated by the serializer.
What we learned before is that we need to serialize based on the structure of an object and if it can be referenced or not. Let's sketch the general case for heap objects. We only know that they are derived from the class Object.
Object>>#soilSerialize: aSoilSerializer
aSoilSerializer
registerObject: self
ifAbsent: [ self soilBasicSerialize: aSoilSerializer ].
The method registerObject:ifAbsent: is for handling references to such an object. It can serialize surrogates of a known references and has an fallback to soilBasicSerialize: (the implementation of registerObject:ifAbsent: will be handled in a future post about graph serialization)
Object>>soilBasicSerialize: serializer
"Delegate serialization to the class layout"
self class classLayout soilBasicSerialize: self with: serializer
For the generic case we just delegate to the class layout of that object which has the responsibility to write the proper format. For FixedLayout this is
FixedLayout>>#soilBasicSerialize: anObject with: serializer
| description |
description := self soilSerializeBehaviorDescription: anObject with: serializer.
description instVarNames do: [:ivarName | (anObject instVarNamed: ivarName) soilSerialize: serializer ]
We store the information about its class (will be handled in a future post about behavior management) and then we store each of its instance variable values in order. Calling soilSerialize: on the value of the instance variable enables to have a per class serialization.
We have now an API where we can:
In the snippet at the beginning of this post we started with storing a string. What would happen if we store the string with this generic serializer? In pharo a string is a collection of characters. The serializer would encounter the collection and would call soilSerialize: on it which will end up in the class layout of that collection which is VariableLayout. The case for VariableLayout is this:
soilBasicSerialize: anObject with: serializer
| description basicSize|
description := self soilSerializeBehaviorDescription: anObject with: serializer.
basicSize := anObject basicSize.
serializer nextPutLengthEncodedInteger: basicSize.
description instVarNames do: [:ivarName | (anObject instVarNamed: ivarName) soilSerialize: serializer ].
1 to: basicSize do: [:i | (anObject basicAt: i) soilSerialize: serializer ]
It stores the information about the class, then the number of items in the collection and then it stores each item by calling soilSerialize: on it. A character being an immediate object does not make the situation easier. The serialization might even use the memory representation of a character which can be many bytes for a wide string. There are many problems for a string. But we can overwrite the way a string is serialized:
String>>#soilBasicSerialize: serializer
serializer nextPutString: self
It seems to be easiest to just encode the string with utf-8 and store the resulting bytes as a compact and less error-prone variant. But how can we write that in a way that on materialization we know it is meant to be a string?
We can do something which is not too different to what immediate objects do. Instead of embedding class information in the serialized artefact we encode tags to the sequence of bytes to make us recognize on materialization. For that we need a format how to encode this
Encoding structured data can be done in basically two ways: textual or binary. A textual format has the advantage that it can be edited easily by humans. The downside is that textual formats need character delimiters to encode the structure. Delimiters are characters that are from the same set as the content in the structure is encoded. This makes it necessary to escape delimiters in the content which is brittle. As human editability is not one of our concerns we can just use a binary format which is likely to be more performant. And the best format I encountered so far is type - length - value
type can be a byte that we use as tag to identify what class the content belongs to. We do not expect a larger number of those types so a single byte should be sufficient
length is the number of bytes to read for the content . For the length it is not easy to tell how many bytes we need for it. Choosing a small number of bytes restricts the format to support only lengths up to what you encode in those bytes. Choosing a large number of bytes wastes memory where the lengths are rather small.
variable length encoding
to our help there is encoding that solves that. The number of bytes for the length is variable. It is using the MSB (most significant bit) of a byte to indicate an extended length. As long as we read bytes that have the MSB set we know there is another byte following that belongs to this length. This way the length is not restricted by size at all. The smalltalk code to help understand it is this
nextLengthEncodedInteger
| value |
value := self nextByte.
(value < 128) ifTrue: [ ^ value ].
^ (self nextLengthEncodedInteger bitShift: 7) bitOr: (value bitAnd: 127)
and writing
SoilSerializer>>#nextPutLengthEncodedInteger: anInteger
"store length of integer encoded in a way that the presence of a
most significant bit indicates that the next byte is part of
the value"
anInteger < 128 ifTrue: [ ^ self nextPutByte: anInteger ].
self
nextPutByte: ((anInteger bitAnd: 127) bitOr: 128);
nextPutLengthEncodedInteger: (anInteger bitShift: -7)
value is the byte representation of the content. As it is only bytes and won't interpreted as such we can put everything there transparently
For the type we use a byte code. In order to avoid magic numbers everywhere in the code we want to have the values assigned once and then reused with the symbols defined for it. This could be class variables that are assigned their numeric value. In order to reuse the type code mapping we use a SharedPool for that so that we can use it in a test class as well.
SharedPool subclass: #SoilTypeCodes
instanceVariableNames: ''
classVariableNames: '... TypeCodeString ...'
package: 'Soil-Core-Serialization'
For serialization we need
SoilTypeCodes class>>#initializeTypeCodes
...
TypeCodeString := 13.
...
and the reverse mapping for reading
SoilTypeCodes class>>#initializeTypeCodeMapping
TypeCodeMapping := Array new: 255.
TypeCodeMapping
...
at: TypeCodeString put: String;
...
defines the mapping from type code to instance creator.
While building the serializer above we wanted to have a special serialization for our string. After the detour over the type codes we can continue here. So this was the intended code:
String>>#soilBasicSerialize: serializer
serializer nextPutString: self
the serializer can now be done this way
SoilSerializer>>#nextPutString: aString
self
nextPutByte: TypeCodeString;
basicNextPutString: aString
SoilSerializer>>#basicNextPutString: aString
| buf |
buf := aString asByteArray.
self
nextPutLengthEncodedInteger: buf size;
nextPutBytesFrom: buf
Here the string is turned into an byte array which might be confusing. The implementation for String is the general case which is ByteString, a string objects that only contains 8bit characters. If a wide character enters the string the string becomes a WideString and the implementation there is
WideString>>#soilBasicSerialize: serializer
serializer nextPutWideString: self
SoilSerializer>>#nextPutWideString: aWideString
| buf |
buf := self class encodeString: aWideString.
self
nextPutByte: TypeCodeWideString;
nextPutLengthEncodedInteger: buf size;
nextPutBytesFrom: buf
Finally we can use the serializer to convert the string to its byte representation
bytes := ByteArray streamContents: [ :stream |
SoilSerializer new
stream: stream;
serialize: 'soil is ramping up' ]
results into
#[13 18 115 111 105 108 32 105 115 32 114 97 109 112 105 110 103 32 117 112]
which we can read back using
SoilMaterializer new
stream: bytes readStream;
materialize
and the materializer uses the inverse path to decode
SoilMaterializer>>#materialize
^ self nextSoilObject
SoilMaterializer>>#nextSoilObject
^ (TypeCodeMapping at: self nextByte) soilMaterialize: self
String class>>#soilMaterialize: materializer
^ materializer nextString
SoilMaterializer>>#nextString
| string |
string := self basicNextString.
self registerObject: string.
^ string
SoilMaterializer>>#basicNextString
| buf length |
buf := ByteArray new: (length := self nextLengthEncodedInteger).
stream readInto: buf startingAt: 1 count: length.
^ buf asString
which gives us
soil is ramping up
We have a serialization infrastructure that can serialize non-complex objects (not containing more references) to bytes which is a prerequisite to store it on disk. Making the snippet at the top work needs nothing more than to open a file and writing to it. With a pre-defined filename writing and reading works. In this second post we have almost built a minimal database that can store one simple object. Not very useful but a good starting point.
Things that were left out of this post: how to serialize a small graph involving more than a single object and how is the class/behavior information managed and stored. This will be the topics of the coming posts