
Introduction
LiteCluster is a globally replicated SQL database with a leaderless architecture that delivers high performance while retaining ACID semantics and strong consistency. In this post I will walk through the features of LiteCluster, how they are designed, and the tradeoffs that were made along the way.
It is important to note that there are no silver bullets: every design comes with tradeoffs that make it more suitable for some use cases than others. Nevertheless, some designs strike a balance that benefits a very broad set of use cases, and I certainly hope LiteCluster is one of them.
Background
LiteCluster uses SQLite as its backing data store and leverages its SQL capabilities to offer a rich set of SQL features right out of the box. One of the design goals of LiteCluster was to impose as few restrictions on traditional SQLite operation as possible. Hopefully this was a success, as only a handful of restrictions exist, and they will be explained later. It is also worth mentioning that SQLite, fascinating as it is, imposes heavy limitations on the concurrency model, so LiteCluster had to come up with novel ways to overcome them, as we will see.
Features
We will quickly go through the high level list of features and briefly explain what each one means, then delve into the details of how they operate.
LiteCluster is:
Replicated
Write transactions sent to one LiteCluster node are replicated to all nodes in the cluster synchronously. By the time a write transaction concludes and returns to the caller, all data is guaranteed to exist on all nodes and to durably reside on each node’s disk. The maximum round trip time between nodes sets a lower bound on transaction latency.
Leaderless
All nodes are peers and clients can freely write to any node in the cluster. Node failure does not disrupt operation and does not require leader election. Global setups can exist but require certain node counts to remain in operation to resist network partitions.
Follower nodes can also be added to the cluster. A follower is attached to a specific leader; that leader includes the follower when it broadcasts data throughout the cluster, but it does not wait for the follower’s confirmation before proceeding, as it does with the other leaders.
Highly Available
As a result of replication and the leaderless design, LiteCluster remains highly available even in the case of node failures. A globally replicated setup will ensure high availability even against global internet outages.
Strongly Consistent
LiteCluster is fully ACID compliant with serializable isolation level between transactions. All read queries are strongly consistent with snapshot isolation. LiteCluster uses optimistic locking to ensure conflict free operation. It also supports automatic transaction retries with bounded operation guarantees.
Concurrent Writes
Contrary to typical SQLite journaling modes (DELETE & WAL), LiteCluster can perform write transactions concurrently. Furthermore, it supports concurrent creation of indexes, allowing writes to the database, and to the table being indexed itself, to proceed while the index is being created.
High Performance
LiteCluster enjoys very low read latencies and extremely high, scalable read throughput. Write latency is bounded below by the maximum round trip time between nodes. Write throughput is bounded by the lowest single threaded performance among the cluster nodes, though thanks to the concurrency model it achieves an order of magnitude higher write throughput than a typical SQLite setup.
Secure Operation
All messages sent between cluster nodes are encrypted (regardless of whether the transport layer supports TLS or not) to ensure data cannot be sniffed over the network. Furthermore, any node joining the cluster has to go through a secure handshake to prove it is an actual LiteCluster replicator and not something else.
Vanilla SQLite
LiteCluster doesn’t require changing your SQL queries or the higher level logic that uses them. The only limitations imposed by LiteCluster are:
- Auto-increment keys and incrementing rowid tables are strongly discouraged as they can cause a lot of recoverable, but very expensive, conflicts (see the sketch after this list).
- Usage of non-deterministic user defined functions to generate data that will be inserted into the database is prohibited. It is totally fine though to insert data from random(), randomblob() and the time and date functions, including time('now').
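As a rough illustration of the first limitation, here is a sketch of sidestepping auto-increment keys by generating identifiers on the client, or letting SQLite's own randomblob() do it. DB stands for an open LiteCluster connection; the call shape simply mirrors the benchmark code shown later in the post.

```ruby
require 'securerandom'

# Instead of an AUTOINCREMENT/rowid key, which forces every node to agree
# on the next counter value and invites expensive conflicts, generate a
# collision-resistant id on the client side.
id = SecureRandom.uuid

# DB is assumed to be an open LiteCluster connection (illustrative handle).
DB.query_array("INSERT INTO users (id, name) VALUES (?, ?)", [id, "anwar"])

# Or let SQLite generate the key; randomblob() stays safe under LiteCluster.
DB.query_array("INSERT INTO users (id, name) VALUES (hex(randomblob(16)), ?)", ["anwar"])
```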
Limitations, Caveats
LiteCluster is a very young project and is undergoing a lot of change. Here is a current set of limitations/caveats:
LiteCluster requires language specific drivers, only Ruby is currently supported
In order to operate correctly with an embedded database, LiteCluster needs to integrate with the host language concurrency model. Thus a language specific driver is required for each host language supported by LiteCluster. Currently this support only exists for the Ruby language.
Even within Ruby, ActiveRecord & Sequel are not yet supported
Connection pooling and juggling connections between threads/fibers is very rigid in ActiveRecord and Sequel, while LiteCluster does a lot of connection reuse. I hope to have a working solution for that soon.
Dynamic cluster reconfiguration not yet implemented
While the current implementation can tolerate failing nodes, a new node joining a running cluster is not yet implemented. That is a very high priority and hopefully a straightforward feature, so it should be out soon.
Long running write queries are prohibited
If a write query exceeds a certain time threshold it will be killed. Any long running write operation should be split into smaller ones that are able to finish within the time boundaries, as sketched below.
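A minimal sketch of that workaround, chunking a bulk write into many short transactions. The DB.transaction helper used here is an assumption about the driver API, not something the post confirms.

```ruby
# Hypothetical sketch: instead of inserting a million rows in one long write
# transaction (which would exceed the time threshold and be killed),
# chunk the work into many short transactions.
rows = (1..1_000_000).map { |i| [i, "payload #{i}"] }

rows.each_slice(1_000) do |chunk|
  # DB.transaction is an assumed driver helper; each iteration is a
  # separate, short-lived write transaction that replicates on its own.
  DB.transaction do
    chunk.each do |id, data|
      DB.query_array("INSERT INTO events (id, data) VALUES (?, ?)", [id, data])
    end
  end
end
```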
Availability
LiteCluster is not currently available to the public, and the source code is not available.
System Architecture
A LiteCluster deployment is composed of multiple nodes; each node holds a complete copy of the whole database and is able to independently receive both read and write queries from clients. The nodes are connected as a mesh, with each node able to reach every other node directly.
LiteCluster deploys SQLite with a custom VFS (called dfvs) which is responsible for ensuring deterministic operation of transactions when certain initial conditions are met. Transactions remain deterministic, and can be safely retried, even when they call functions like random(), randomblob() or time('now'):
-- this statement will still produce deterministic results
INSERT INTO users
(id, name, created_at)
VALUES
(hex(randomblob(16)), 'anwar', unixepoch('now'));
LiteCluster requires that applications access SQLite via its custom driver, which, behind the scenes, communicates with a replicator process that broadcasts the logical write commands to different nodes in the cluster. The driver halts execution of the running context (thread or fiber) until a response comes back from the replicator.
The replicator is responsible for the grouping, ordering and broadcasting of transactions to all other nodes. It also maintains the system clock and ensures all nodes are in lockstep.
Once a specific step is validated to be available on all nodes, the committer starts checking the transactions, in their new order, for conflicts. The ordering is done to minimize the potential for conflicts, and to bring previously conflicted transactions forward to increase their chances of success.
After all transactions are committed, a tick is concluded and the results are returned to local clients, which resume from suspension and either complete execution or raise an exception due to a detected conflict. Conflicted transactions are automatically retried a certain number of times (3 by default), after which the exception is raised to the caller.
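To make that flow concrete, here is a conceptual sketch of what the driver's automatic retry behaviour amounts to. ConflictError and MAX_RETRIES are illustrative names, not LiteCluster's actual internals; DB.query_array mirrors the call used in the benchmark app later in the post.

```ruby
# Conceptual sketch only, not the real driver implementation.
class ConflictError < StandardError; end

MAX_RETRIES = 3 # the default retry count stated above

def replicated_write(sql, params)
  attempts = 0
  begin
    # Hand the logical write to the replicator and suspend the current
    # thread/fiber until the tick containing it is committed cluster-wide.
    DB.query_array(sql, params)
  rescue ConflictError
    attempts += 1
    retry if attempts < MAX_RETRIES
    raise # retry budget exhausted: surface the conflict to the caller
  end
end
```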

Physical vs Logical Replication
Physical replication (like what Litestream or Litefs do, for example) involves replicating data that’s already been written by the leader node to the follower nodes. Logical replication, on the other hand, is about sharing the write commands (INSERT, UPDATE, DELETE, etc.) and their parameters and applying them on all nodes to bring them all to the same state. Each approach has its advantages and disadvantages.
In physical replication the data shared is usually in units of database pages, even if the actual change is smaller, because the database writes data in units of pages (usually 4KB or more). Logical replication, on the other hand, only sends the data and the textual commands required to write it.
Also, because in logical replication the data can be sent before transactions are committed, transaction re-ordering is possible to ensure determinism and reduce conflicts.
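To make the payload difference tangible, here is a purely illustrative comparison in Ruby; the page numbers, page size and message structure are stand-ins, not LiteCluster's actual wire format.

```ruby
# Logical replication only ships the SQL text plus the bound parameters.
logical_payload = {
  sql:    "INSERT INTO events (id, data, created_at) VALUES (?, ?, ?)",
  params: [42, "some event information", 1_700_000_000]
}

# Physical replication ships every modified page in full, typically 4KB each,
# even when the logical change above only touches a few bytes.
PAGE_SIZE = 4096
modified_page_numbers = [17, 18]            # pages touched by the insert (made up)
physical_payload = modified_page_numbers.map do |n|
  [n, "\x00" * PAGE_SIZE]                   # stand-in for the raw page bytes
end

puts "logical:  ~#{logical_payload.to_s.bytesize} bytes"
puts "physical: #{physical_payload.sum { |_, bytes| bytes.bytesize }} bytes"
```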
A summary of differences between physical and logical replication can be seen in the following table:
| | Physical Replication | Logical Replication |
|---|---|---|
| Payload size | Larger, sends complete pages even for partial page writes | Smaller, sends only the data written and the SQL text |
| Maintaining consistency | Easier, the replica is a byte for byte copy from the leader database | Harder, fully deterministic writes a must to ensure consistency |
| Transaction re-ordering | Not possible since data is already committed | Possible in synchronous replication scenarios |
LiteCluster uses logical replication to enable multi-leader support by ensuring all nodes are always in the same state even though all of them can directly accept writes.
Pessimistic vs Optimistic Locking
Concurrent access control involves locking the data sets that the concurrent transactions are operating on, such that no other transaction would change them in a conflicting manner.
Pessimistic Locking
Locking a specific dataset happens before the transaction proceeds to use this dataset. All other transactions trying to use the same dataset will be blocked until the transaction that holds it is finished with it.
This ensures consistency but can lead to deadlocks and starvation. Also in a multi-leader setup it requires locking across the cluster which is quite expensive time wise.
Optimistic Locking
In this case transactions are allowed to proceed, and as they do so they track all the data they access. At commit time they check if any of this data was changed, in which case a conflict is detected and the transaction is aborted or potentially retried.
Whether pessimistic or optimistic locking is more suitable depends on the anticipated number of conflicting transactions. If that number is high, pessimistic locking prevents those conflicts from happening and the cost of locking beforehand is well spent. On the other hand, if the number of anticipated conflicts is low, optimistic locking is the potentially much more efficient solution.
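A minimal, generic sketch of the optimistic approach (independent of LiteCluster's actual committer): remember the version seen at read time, then verify it at commit time.

```ruby
# Generic optimistic-locking sketch: the in-memory store and versions are
# illustrative, not LiteCluster internals.
store = { 42 => { value: 10, version: 1 } }

def optimistic_update(store, key, expected_version)
  current = store[key]
  # Commit-time check: has anyone changed the row since we read it?
  return :conflict if current[:version] != expected_version

  store[key] = { value: current[:value] + 1, version: expected_version + 1 }
  :ok
end

snapshot = store[42]                               # read phase: note version 1
p optimistic_update(store, 42, snapshot[:version]) # => :ok, version bumped to 2
p optimistic_update(store, 42, snapshot[:version]) # => :conflict, stale version detected
```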
| | Pessimistic Locking | Optimistic Locking |
|---|---|---|
| Locking behaviour | Locks resources before using them, preventing others from access to these resources. | Proceeds without locking resources, checks resource state conflicts at commit time |
| Deadlocks and Starvation | Transactions can deadlock and can be starved for resources held by other transactions | No deadlocks but a conflicting transaction can keep conflicting even after retrying |
| Efficiency | Efficient when there are many potential conflicts between transactions | Efficient when there are few potential conflicts between transactions |
LiteCluster uses optimistic locking to avoid taking locks across the network. It uses transaction reordering to greatly minimize potential conflicts.
Fault Tolerance
The replicator is always aware of the current state of the cluster, and expects to get signals from each node at given intervals, even if the payload is empty. A node can generally fail in two ways:
Sudden death
The node is suddenly cut off from the network, crashes, or is rebooted. In this case the connection with the node will be closed; the other nodes in the cluster will detect this event and will remove the affected node from the nodes list.
Slow death
A node can still be present on the network but not responding, or very slow to respond, due to network issues, hardware failure (e.g. a disk error) or some other system issue. In this case the heartbeats from and to the node will time out and it will be removed from the cluster.
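An illustrative sketch of that kind of heartbeat bookkeeping follows; the timeout value and data structure are assumptions, not LiteCluster's actual implementation.

```ruby
# Illustrative heartbeat monitor, not LiteCluster's actual code.
HEARTBEAT_TIMEOUT = 2.0 # seconds (assumed value; the real threshold is not documented)

last_seen = {
  "node-a" => Time.now,
  "node-b" => Time.now - 10, # has not signalled for a while
}

def prune_unresponsive(last_seen, timeout)
  now = Time.now
  # Drop any node whose heartbeats have timed out; the remaining nodes keep
  # operating without it, with no leader election required.
  last_seen.reject { |_node, seen_at| now - seen_at > timeout }
end

p prune_unresponsive(last_seen, HEARTBEAT_TIMEOUT).keys # => ["node-a"]
```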
Care is taken to ensure that writes from a failed node are never considered by some nodes of the cluster and not the rest. This results in a doubling of the write latency, but it ensures that nodes don’t end up in an inconsistent state if one node fails after sending data to some nodes but not the others. This is relaxed when there are only two nodes in the cluster, as there is no risk of data sent by the failing node reaching only some of the nodes.
Sync Modes
LiteCluster has multiple sync modes, the default of which is full sync mode.
Full Sync Mode
Every write or write transaction will be fully committed, and fsync is called to ensure it has reached the physical disk on all cluster nodes, before returning to the caller. This is critical for ensuring data durability but is also important for conflict detection. This mode is the one suitable for your main app’s database or a distributed cache store.
Network Sync Mode
In this mode the caller is sure that the write they are performing is idempotent and will not conflict with any other write in the system, and is not interested in the result of the call. The replicator returns to the caller once the logical instructions are replicated across the cluster, but before they are committed to disk. This is suitable for operations like a background job queue, where you want to ensure that the job is durably replicated to the cluster but can hit the disk at a later time.
Full Async Mode
In this mode the caller is likewise sure that the write will not conflict with any other write, and is comfortable tolerating a small potential for data loss in case of a failing node. Clients simply move forward after sending the write instructions to the replicator. This mode is most suitable for applications like a replicated log or a metrics collector.
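If the sync mode is selectable per write, usage might look roughly like the following; the sync_mode: option shown here is an assumption about the driver API, not a documented interface.

```ruby
# Hypothetical sketch: choosing a sync mode per write.
# :full_sync is described as the default and the safest.
DB.query_array(
  "INSERT INTO accounts (id, balance) VALUES (?, ?)",
  [1, 100],
  sync_mode: :full_sync       # wait for replication + fsync on all nodes
)

# A background-job enqueue can use :network_sync: it is idempotent,
# conflict-free, and only needs to be durably replicated, not yet on disk.
DB.query_array(
  "INSERT INTO jobs (id, payload) VALUES (?, ?)",
  ["job-123", "{\"action\":\"send_email\"}"],
  sync_mode: :network_sync
)

# Metrics or log style writes may accept a small risk of loss with :full_async.
DB.query_array(
  "INSERT INTO metrics (name, value, at) VALUES (?, ?, ?)",
  ["requests", 1, Time.now.to_i],
  sync_mode: :full_async
)
```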

Security
Only trusted nodes are accepted in the cluster. Upon joining the cluster each node sends an encrypted identification message which then allows other nodes to include it in their nodes list. This prevents a rogue process residing on one of the cluster nodes from acting as a replicator in order to corrupt the whole cluster.
The transport and encoding layer is responsible for making sure messages sent across the cluster are intact and secure. Each cluster node holds a copy of a shared security key that is used to encrypt messages before they are sent over the network and to decrypt them as they arrive at each node.
For performance reasons, the data is encrypted once on each node rather than separately for each node it is sent to.
Given the above, all data in transit (outside of the node boundaries) is encrypted. Further protection can be applied by using filesystem encryption to encrypt data at rest as well.
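The post does not specify the cipher or framing, so purely as an illustration, here is how a pre-shared symmetric key could be used to encrypt a message once before broadcasting it to all peers, using Ruby's OpenSSL bindings with AES-256-GCM.

```ruby
require 'openssl'

# Illustration only: LiteCluster's actual cipher and message framing are
# not documented here. This shows the general shared-key idea.
CLUSTER_KEY = OpenSSL::Random.random_bytes(32) # pre-shared key (32 bytes for AES-256)

def encrypt_message(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  ciphertext = cipher.update(plaintext) + cipher.final
  # The same [iv, tag, ciphertext] triple is sent to every node.
  [iv, cipher.auth_tag, ciphertext]
end

def decrypt_message(iv, tag, ciphertext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.decrypt
  cipher.key = key
  cipher.iv = iv
  cipher.auth_tag = tag
  cipher.update(ciphertext) + cipher.final # raises if the message was tampered with
end

iv, tag, ct = encrypt_message("logical write instructions...", CLUSTER_KEY)
puts decrypt_message(iv, tag, ct, CLUSTER_KEY)
```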
Comparison to other replicated SQLite solutions
There are several SQLite replication solutions available. As engineering solutions, each makes its own tradeoffs and tunes its operation along the three most important dimensions: consistency, availability and performance (I know the canonical rendition says partition tolerance, but I believe that is an extension of consistency, and that performance, as in latency and throughput, is a dimension that shouldn’t be ignored).
| – | Bedrock | ComDB2 | cr-sqlite | Litefs | Dqlite | rqlite | LiteCluster |
|---|---|---|---|---|---|---|---|
| Replication model | Leader / Followers | Leader / Followers | Leaderless | Leader / Followers | Leader / Followers | Leader / Followers | Leaderless |
| Replication method | Physical | Physical | Physical | Physical | Physical | Physical | Logical |
| Locking mode | Optimistic | Pessimistic | Optimistic | Pessimistic | Pessimistic | Pessimistic | Optimistic |
| Consensus method | Paxos | ? | CRDT | Leases | RAFT | RAFT | Lockstep |
| Replication Lag | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Always consistent | Yes | Yes | No | Yes | Yes | Yes | Yes |
Performance Evaluation
LiteCluster is very early in the implementation phase, but we were able to get some interesting performance numbers already. These benchmarks do not compare LiteCluster to any other replicated SQLite solution; they only focus on LiteCluster’s current performance.
Please note that there are a ton of optimizations not yet implemented, but I think the numbers are interesting enough to publish already.
Also note that while the test simulates network latency, it is penalized by the fact that all writes are happening on the same machine.
Benchmark Setup
In order to measure the cluster performance we prepared the following setup:
- LiteCluster: A 2 node cluster running on the same machine
- Only one of the nodes in the cluster had client processes connected to it (8 processes)
- Network latency simulated using tc with varying latency values:
  - 0ms latency to establish a baseline
  - 1 to 2ms latency for nodes in the same datacenter
  - 5 to 10ms latency for nodes in the same city
  - 10 to 20ms latency for nodes in the same state
- Load was generated using wrk, running 256 to 2048 concurrent connections
- After each run, the database files were compared using sqldiff to ensure consistency
- All operations were performed in full sync mode
The tc command was used as follows to introduce delays:
sudo tc qdisc add dev lo root netem delay 10ms
And then to reset the delay:
sudo tc qdisc del dev lo root netem
Wrk was used as follows:
wrk -t4 -d5 -c 2048 --latency http://192.168.0.232:3000/
Benchmark App
A small Rack application was used to simulate write and read IO (with a 50% chance for each):
# DB is the LiteCluster driver connection; START_TIME is captured at boot.
app = proc do |env|
  if rand(2) > 0
    # Write path: derive a unique id from the elapsed time and the process pid
    time = (Time.now - START_TIME).to_f
    id = ("#{(time * 1_000_000).to_i}%05d" % [Process.pid]).to_i
    result = DB.query_array(
      "INSERT INTO events(id, data, created_at) VALUES(?, ?, ?)",
      [id, "some event information", time.to_i]
    )
  else
    # Read path: fetch the most recent event
    result = DB.query_array("SELECT * FROM events ORDER BY id DESC LIMIT 1")[0]
  end
  [200, {}, ["OK"]] # a Rack response body must respond to #each, hence the array
end
Performance Results
Throughput (in requests per second)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 25,197 | 53,507 | 88,059 | 111,921 |
| 1ms | 24,918 | 47,132 | 75,702 | 100,134 |
| 2ms | 20,485 | 39,856 | 66,356 | 96,951 |
| 5ms | 12,759 | 26,199 | 48,670 | 67,948 |
| 10ms | 7,860 | 15,507 | 30,709 | 35,432* |
| 20ms | 4,281 | 8,542 | 15,081 | 14,475* |
P90 Latency (in ms)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 24.35 | 18.91 | 23.09 | 37.68 |
| 1ms | 19.14 | 20.54 | 26.06 | 40.93 |
| 2ms | 22.04 | 22.97 | 28.24 | 38.58 |
| 5ms | 32.04 | 30.06 | 33.48 | 162.19 |
| 10ms | 47.51 | 48.02 | 47.12 | 234.69* |
| 20ms | 81.38 | 84.46 | 95.04 | 337.76* |
P99 Latency (in ms)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 39.78 | 30.63 | 30.67 | 45.3 |
| 1ms | 31.91 | 29.66 | 36.73 | 55.73 |
| 2ms | 36.57 | 33.71 | 42.33 | 60.21 |
| 5ms | 54.01 | 43.98 | 50.46 | 559.8 |
| 10ms | 61.62 | 63.15 | 102.36 | 697.8* |
| 20ms | 90.68 | 92.2 | 332.82 | 958.43* |
Conclusion
LiteCluster shows nice performance in lower latency environments; even with higher latency, a LiteCluster based application will hold its own with 1024 or fewer connected clients.
We expect better numbers with further optimizations to the committer/replicator duo and with prepared statement support in clients.
Future Work
As LiteCluster matures, more testing should be done, including on real life clusters spanning multiple machines, both in local and wide area networks, as well as testing of failure scenarios and cluster recovery.
One area that is not implemented yet is dynamically joining a running cluster, which should be a priority for the coming efforts.
A lot of optimization opportunities in the committer/replicator should be addressed in order to eliminate some inefficiencies and reduce latencies.
Finally, ActiveRecord/Sequel & Litestack integration should materialize, followed by benchmarking with a full fledged Rails app.