BYOB - build your own (data)base

Persistency is not simple. We can make programs that create data structures and operate on it rather easy. But persisting the same thing is not easy. If the data structure you are dealing with is not super simple, persisting gets complicated really quick. Why is that? Why do we even need to persist data? At one time in history storage technology divided itself into fast-expensive and slow-cheap memory. We learned it is due to the difficulty in producing the same density for non-volatile storage than it is for volatile storage. For the same amount of money you can have a some non-volatile fast memory or much more volatile fast memory or way more slow non-volatile memory. In the early days of computing I think that the availbility of more memory could unblock a lot of developments which is good. But the price to pay was to settle on an architecture that has split-brain memory, one small fast storage (volatile main memory) to do the computation and a secondary storage (slow hard-disks) for persisting.

Having to deal with two kind of memories makes projects a lot more complex. You need to load data from secondary memory into main memory for it to operate fast enough. You need to track modifications of that data in order to write it back because otherwise it won't persist. Having the persisted data on secondary memory allowed to store more data than what fits in main memory. So you need a way to read a subset of secondary memory data into main memory. The mitigation of those problems developed into general purpose components we call database today. A database is about the organization how data is stored on secondary memory, it allows to query a subset of the data stored on secondary memory into main memory for processing.

Later relational databases and SQL became popular as a common way how to deal with data. Most of that happened more than 50 years ago. I guess it developed into a habit that using a database is the way to go. This is something that strikes me. When talking about databases they are never questioned. If I tell people that they can store their data in a file on disk they look at me in a very strange way. I've seen people building web sites with a mysql database that have 50 records in total. Telling people that they can load all their data in main memory and use their programming language as a query language they look at me in a similar strange way. And then nobody seems to know how databases work internally. It is almost as if there would be a sign "you shall not pass!" making it a big miracle. And so databases are considered to do something magical. They do not!

Database systems are frameworks. They provide a DSL that you can use to store, query and update your data. That's it. It means you have your program and the data objects you've created and then you have to map that to the DSL/structure of that database system you are using. This is cumbersome in the best case. In the worst case there is a huge conceptional gap between your programming language and the way the database works. You will fail to properly handle complex scenarios. A prominent example of this are ORMs. An ORM (object relational mapper) is a mitigator between an object oriented programming language and a relational database. A relational database is an entity-relationship engine. It has a powerful query engine that enables you to query your data sets on an attribute level and the result is also defined as a composition of individual attributes. SQL is nice, flexible and powerful. On the other hand you have an object oriented language. One criteria for object orientation is the support of polymorphism, the multi-shapedness of objects. It means that for a single query you need to operate on different shapes. No matter what you try there is no way making polymorphism proper on a relational database, there is just a conceptional mismatch. If you try to do it you will force object oriented schemes on a relational model which ends up being a limited way of polymorphism and at the same time you have to sacrifize relational assets.

The limits in a lower layer (database) will influence the expressiveness of what you do in an upper layer (program). You are forced to make your model more simple, more relational like. This is a big loss. So maybe it is not object orientation that failed but the technology world that failed on object orientation. This is especially painful when working in a live-programming environment like pharo. When you are using a relational database you should use it in a relational not object oriented fashion. Otherwise it will be the worst of both worlds.

I don't want to sacrifize object orientation just because I need a persistence layer. So are there any options left? We ruled out SQL databases already. NoSQL databases look more promising here. Databases like MongoDB are document oriented and schemaless. Being schemaless enables to use polymorphism more easy. Document orientation enables to store nested objects. But referencing objects is not really supported there. You have to do it on your own. Even if you are willing to do it (what we do with voyage) you still need to use BSON (a binary JSON format) and its really limited set of types. In the pharoverse there are other lightweight approaches to deal with the problem. There are prevlayer style approaches which read all data in memory and support writing the dirty objects back on disk...making a long story short: Most of the databases are ruled out by either not supporting object orientation or by not dealing with concurrency, data integrity etc. And we are not even talked about inspection, debugging etc.

Our curent project is apptivegrid which is a web based visual database that supports plenty of types. Sooner or later the gap between what we need and what a database system has to offer will slow us down or even block us. What we need is something which is simple, changeable and less likely to kill the fun.

This is how I grew my motivation to start an object oriented database myself. I'm not an expert on the topic, so I have to see how far I can get. But it will be beneficial for our project to have a better database. And pharo deserves a lightweight OO database. I always admired GemStone/S as a great product. But to my taste it is too heavy weight for most of our applications. While being a user of it I learned useful things that I might be able to copy partially. One thing I already achieved from that is mimicking GemStone's exception handling with SnapDump.

So let's start the journey into making an object oriented database. And let's start from scratch to have the best chance to end with something simple.

What does it need to be an object database?

Thinking of object there are two things come to mind:

  1. objects reference each other building an object graph
  2. an object is the combination of state and behavior

As we make a pharo image a database we have something that is suited perfectly to hold behavior. The database should be able to store behavior but I'm not tempted to rely on that from the beginning. So we can concentrate on 1.

As an example let's assume we want to develop a HTTP service application that needs a database.

The first requirement we developed is:

- The database should be able to write an object graph to disk and read it back

If it were only about storing a whole graph we wouldn't have to do much because the pharo image does this. So let's see

saving the image

Pharo is image based and everything is an object. If an image is saved all objects are hibernated to disk. If we add our application and create some data it will be included. But if we save the image it has drawbacks:

  • the http server needs to shutdown because OS resources like sockets cannot be hibernated. This will lead to service interruption because requests are being rejected
  • the image needs to freeze because that is the only reliable way of storing a sane graph to disk

We could create a tiny image to lower the time it needs to interrupt the service. Or we could fork the current OS process and store in background which will use a lot of system resources and that would not be able to scale

Saving the image is useful when developing an application. You can create example data for your application and store it along with your code in the image on disk. Not having to deal with databases and setups accelerate application development in the early phase.

This is not the perfect scenario we imagined, right? But we got new requirements for our database definition.

- The database should be able to write an object graph to disk and read it back
- Writing to disk should not interrupt the service
- Storage times should be as small as possible 

Let's see if we can improve

fuel serializer

Pharo comes with an object graph serializer: fuel. It is able to store arbitrary object graphs to disk. Taking a deeper look we discover that fuel has the benefits:

  • it doesn't "stop the world" when serializing so the system can go uninterrupted
  • fuel is handed out a single object and it will serialize everything that is connected from there. This enables us to store only the model of our http service and not all of the image

The drawbacks are:

  • it only serializes whole graphs
  • not stopping the world makes it open for concurrency problems. If an object is modified while it serializes the modified object will be stored and can lead to data corruption.

Compared to the image approach we improved the situation. We do not have to reject incoming requests, we just block them for some time making the requests just go slower but working. However when the model grows the ratio of modified to written objects will get worse the bigger the model grows.

Still not the approach we are looking for. But some new requirements for our database defintion:

- The database should be able to write an object graph to disk and read it back
- Writing to disk should not interrupt the service
- Storage times should be as small as possible 
- Efficient read/write: Only needed objects are read and only changed objects are written
- Concurrent use should be supported 

If we are able to come which supports what fuel does but avoids it problems we should be pretty good.

A journey into OO databases

Let's start with a simple model that we can extend to the point it satisifes all our requirements. This is an introductory post for Soil. In the upcoming posts we will ramp up an object oriented database that should be simple and lightweight.

Naming things is one of the hardest topics in IT. When I decided to start a new database I needed a name. After some research Soil came into my vision. Persistency is always considered *below* somehow and as far as I understand the term soil it means ground layer that we stand on / build on. But I'm not a native speaker ...

Let's define a simple model

Soil

The class Soil is the logical database. As we are writing to the local disk the responsibility of that object is to have a path of a directory that we consider to contain all of the database content. It needs to provide a way to create a new database from nothing or to open an existing one.

Transaction

We want to be concurrent safe and also efficient. The smalltalk-process to transaction relationship is 1:m which means that a process can have any number of transactions but a transaction always has only one process. This way all actions inside a transaction are thread safe as there can't be another thread executing it. Only when the transaction is accessing an external component we need to take care of synchronization. The transaction is the main object to organize reading and writing of objects. It will therefor hold read objects for subsequent access and it will provide a way to mark objects dirty so that only the changed objects will be written. A transaction also provides the root object of the database graph. This is the only object that has a known location.

A simple API sketch to begin with would be

"create a new database path and initialize its content"
soil := Soil new 
    path: 'soil';
    initializeFilesystem.
"create a transaction from soil which will be done within a process"
tx := soil newTransaction.
"set the root object to any object or object graph"
tx root: 'soil is ramping up'.
"commit the transaction which will write the content back to disk and updates all 
dependent structures"
tx commit.
"clean up and release all OS resources"
soil close.

code repository

The code is available on github if you want to have a look or remarks