
Introduction
LiteCluster is a globally replicated SQL database with a leaderless architecture that delivers high performance while retaining ACID semantics and strong consistency. In this post I will walk through the features of LiteCluster, how they are designed, and the tradeoffs that were made along the way.
It is important to note that there are no silver bullets: every design comes with tradeoffs that make it more suitable for some use cases than others. Nevertheless, some designs strike a balance that benefits a very broad set of use cases, and I certainly hope LiteCluster is one of them.
Background
LiteCluster uses SQLite as its backing data store and leverages its SQL capabilities to offer a rich set of SQL features right out of the box. One of the design goals of LiteCluster was to impose as few restrictions on traditional SQLite operation as possible. Hopefully this was a success, as only a handful of restrictions exist, and they will be explained later. It is also worth mentioning that SQLite, fascinating as it is, imposes heavy limitations on the concurrency model, so LiteCluster had to come up with novel ways to overcome them, as we will see.
Features
We will quickly go through the high level list of features and briefly explain what each one means, then delve into the details of how they operate.
LiteCluster is:
Replicated
Write transactions sent to one LiteCluster node are replicated to all nodes in the cluster synchronously. By the time a write transaction concludes and returns to the caller, all data is guaranteed to exist on all nodes and to durably reside on each node’s disk. The maximum round trip time between nodes sets a lower bound on transaction latency.
Leaderless
All nodes are peers and clients can freely write to any node in the cluster. Node failure does not disrupt operation and does not require leader election. Global setups can exist but require certain node counts to remain in operation to resist network partitions.
Follower nodes can also be added to the cluster. A follower is attached to a specific leader; that leader includes the follower when it broadcasts data throughout the cluster, but it does not wait for the follower’s confirmation before proceeding, as it does with the other leaders.
Highly Available
As a result of replication and the leaderless design, LiteCluster remains highly available even in the case of node failures. A globally replicated setup will ensure high availability even against global internet outages.
Strongly Consistent
LiteCluster is fully ACID compliant with serializable isolation level between transactions. All read queries are strongly consistent with snapshot isolation. LiteCluster uses optimistic locking to ensure conflict free operation. It also supports automatic transaction retries with bounded operation guarantees.
Concurrent Writes
Contrary to typical SQLite journaling modes (DELETE & WAL), LiteCluster can perform write transactions concurrently. Furthermore, it supports concurrent creation of indexes, allowing writes to the database, and to the table being indexed itself, to proceed while the index is being created.
High Performance
LiteCluster enjoys very low read latencies and extremely high, scalable read throughput. Write latency is bounded below by the maximum round trip time between nodes. Write throughput is bounded by the lowest single threaded performance among the cluster nodes, though thanks to the concurrency model it achieves an order of magnitude higher write throughput than a typical SQLite setup.
Secure Operation
All messages sent between cluster nodes are encrypted (regardless of whether the transport layer supports TLS or not) to ensure data cannot be sniffed over the network. Furthermore, any node joining the cluster has to go through a secure handshake to prove it is an actual LiteCluster replicator and not something else.
Vanilla SQLite
LiteCluster doesn’t require changing your SQL queries or the higher level logic that uses them. The only limitations imposed by LiteCluster are:
- Auto-increment keys and incrementing rowid tables are strongly discouraged as they can cause a lot of recoverable, but very expensive, conflicts (see the sketch after this list).
- Usage of non-deterministic user defined functions to generate data that will be inserted into the database is prohibited. It is totally fine though to insert data from random(), randomblob() and the time and date functions, including time('now').
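As a rough illustration of the first limitation, here is a sketch of sidestepping auto-increment keys by generating identifiers on the client, or letting SQLite's own randomblob() do it. DB stands for an open LiteCluster connection; the call shape simply mirrors the benchmark code shown later in the post.

```ruby
require 'securerandom'

# Instead of an AUTOINCREMENT/rowid key, which forces every node to agree
# on the next counter value and invites expensive conflicts, generate a
# collision-resistant id on the client side.
id = SecureRandom.uuid

# DB is assumed to be an open LiteCluster connection (illustrative handle).
DB.query_array("INSERT INTO users (id, name) VALUES (?, ?)", [id, "anwar"])

# Or let SQLite generate the key; randomblob() stays safe under LiteCluster.
DB.query_array("INSERT INTO users (id, name) VALUES (hex(randomblob(16)), ?)", ["anwar"])
```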
Limitations, Caveats
LiteCluster is a very young project and is undergoing a lot of change. Here is a current set of limitations/caveats:
LiteCluster requires language specific drivers, only Ruby is currently supported
In order to operate correctly with an embedded database, LiteCluster needs to integrate with the host language concurrency model. Thus a language specific driver is required for each host language supported by LiteCluster. Currently this support only exists for the Ruby language.
Even within Ruby, ActiveRecord & Sequel are not yet supported
Connection pooling and juggling connections between threads/fibers is very rigid in ActiveRecord and Sequel, while LiteCluster does a lot of connection reuse. I hope to have a working solution for that soon.
Dynamic cluster reconfiguration not yet implemented
While the current implementation can tolerate failing nodes, a new node joining a running cluster is not yet implemented. That is a very high priority and hopefully a straightforward feature, so it should be out soon.
Long running write queries are prohibited
If a write query exceeds a certain time threshold it will be killed. Any long running write operation should be split into smaller ones that are able to finish within the time boundaries, as sketched below.
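A minimal sketch of that workaround, chunking a bulk write into many short transactions. The DB.transaction helper used here is an assumption about the driver API, not something the post confirms.

```ruby
# Hypothetical sketch: instead of inserting a million rows in one long write
# transaction (which would exceed the time threshold and be killed),
# chunk the work into many short transactions.
rows = (1..1_000_000).map { |i| [i, "payload #{i}"] }

rows.each_slice(1_000) do |chunk|
  # DB.transaction is an assumed driver helper; each iteration is a
  # separate, short-lived write transaction that replicates on its own.
  DB.transaction do
    chunk.each do |id, data|
      DB.query_array("INSERT INTO events (id, data) VALUES (?, ?)", [id, data])
    end
  end
end
```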
Availability
LiteCluster is not currently available to the public, and the source code is not available.
System Architecture
A LiteCluster deployment is composed of multiple nodes; each node holds a complete copy of the whole database and is able to independently receive both read and write queries from clients. The nodes are connected as a mesh, with each node able to reach every other node directly.
LiteCluster deploys SQLite with a custom VFS (called dfvs) which is responsible for ensuring deterministic operation of transactions when certain initial conditions are met. Transactions remain deterministic, and can be safely retried, even when they call functions like random(), randomblob() or time('now'):
-- this statement will still produce deterministic results
INSERT INTO users
(id, name, created_at)
VALUES
(hex(randomblob(16)), 'anwar', unixepoch('now'));
LiteCluster requires that applications access SQLite via its custom driver, which, behind the scenes, communicates with a replicator process that broadcasts the logical write commands to different nodes in the cluster. The driver halts execution of the running context (thread or fiber) until a response comes back from the replicator.
The replicator is responsible for the grouping, ordering and broadcasting of transactions to all other nodes. It also maintains the system clock and ensures all nodes are in lockstep.
Once a specific step is validated to be available on all nodes, the committer starts checking the transactions, in their new order, for conflicts. The ordering is done to minimize the potential for conflicts, and to bring previously conflicted transactions forward to increase their chances of success.
After all transactions are committed, a tick is concluded and the results are returned to local clients, which resume from suspension and either complete execution or raise an exception due to a detected conflict. Conflicted transactions are automatically retried a certain number of times (3 by default), after which the exception is raised to the caller.
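To make that flow concrete, here is a conceptual sketch of what the driver's automatic retry behaviour amounts to. ConflictError and MAX_RETRIES are illustrative names, not LiteCluster's actual internals; DB.query_array mirrors the call used in the benchmark app later in the post.

```ruby
# Conceptual sketch only, not the real driver implementation.
class ConflictError < StandardError; end

MAX_RETRIES = 3 # the default retry count stated above

def replicated_write(sql, params)
  attempts = 0
  begin
    # Hand the logical write to the replicator and suspend the current
    # thread/fiber until the tick containing it is committed cluster-wide.
    DB.query_array(sql, params)
  rescue ConflictError
    attempts += 1
    retry if attempts < MAX_RETRIES
    raise # retry budget exhausted: surface the conflict to the caller
  end
end
```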

Physical vs Logical Replication
Physical replication (like what Litestream or Litefs do, for example) involves replicating data that’s already been written by the leader node to the follower nodes. Logical replication, on the other hand, is about sharing the write commands (INSERT, UPDATE, DELETE, etc.) and their parameters and applying them on all nodes to bring them all to the same state. Each approach has its advantages and disadvantages.
In physical replication the data shared is usually in units of database pages, even if the actual change is smaller, because the database writes data in units of pages (usually 4KB or more). Logical replication, on the other hand, only sends the data and the textual commands required to write it.
Also, because in logical replication the data can be sent before transactions are committed, transaction re-ordering is possible to ensure determinism and reduce conflicts.
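To make the payload difference tangible, here is a purely illustrative comparison in Ruby; the page numbers, page size and message structure are stand-ins, not LiteCluster's actual wire format.

```ruby
# Logical replication only ships the SQL text plus the bound parameters.
logical_payload = {
  sql:    "INSERT INTO events (id, data, created_at) VALUES (?, ?, ?)",
  params: [42, "some event information", 1_700_000_000]
}

# Physical replication ships every modified page in full, typically 4KB each,
# even when the logical change above only touches a few bytes.
PAGE_SIZE = 4096
modified_page_numbers = [17, 18]            # pages touched by the insert (made up)
physical_payload = modified_page_numbers.map do |n|
  [n, "\x00" * PAGE_SIZE]                   # stand-in for the raw page bytes
end

puts "logical:  ~#{logical_payload.to_s.bytesize} bytes"
puts "physical: #{physical_payload.sum { |_, bytes| bytes.bytesize }} bytes"
```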
A summary of differences between physical and logical replication can be seen in the following table:
| | Physical Replication | Logical Replication |
|---|---|---|
| Payload size | Larger, sends complete pages even for partial page writes | Smaller, sends only the data written and the SQL text |
| Maintaining consistency | Easier, the replica is a byte for byte copy from the leader database | Harder, fully deterministic writes a must to ensure consistency |
| Transaction re-ordering | Not possible since data is already committed | Possible in synchronous replication scenarios |
LiteCluster uses logical replication to enable multi-leader support by ensuring all nodes are always in the same state even though all of them can directly accept writes.
Pessimistic vs Optimistic Locking
Concurrent access control involves locking the data sets that the concurrent transactions are operating on, such that no other transaction would change them in a conflicting manner.
Pessimistic Locking
Locking a specific dataset happens before the transaction proceeds to use this dataset. All other transactions trying to use the same dataset will be blocked until the transaction that holds it is finished with it.
This ensures consistency but can lead to deadlocks and starvation. Also in a multi-leader setup it requires locking across the cluster which is quite expensive time wise.
Optimistic Locking
In this case transactions are allowed to proceed, and as they do so they track all the data they access. At commit time they check if any of this data was changed, in which case a conflict is detected and the transaction is aborted or potentially retried.
Whether pessimistic or optimistic locking is more suitable depends on the anticipated number of conflicting transactions. If that number is high, pessimistic locking prevents those conflicts from happening and the cost of locking beforehand is well spent. On the other hand, if the number of anticipated conflicts is low, optimistic locking is the potentially much more efficient solution.
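A minimal, generic sketch of the optimistic approach (independent of LiteCluster's actual committer): remember the version seen at read time, then verify it at commit time.

```ruby
# Generic optimistic-locking sketch: the in-memory store and versions are
# illustrative, not LiteCluster internals.
store = { 42 => { value: 10, version: 1 } }

def optimistic_update(store, key, expected_version)
  current = store[key]
  # Commit-time check: has anyone changed the row since we read it?
  return :conflict if current[:version] != expected_version

  store[key] = { value: current[:value] + 1, version: expected_version + 1 }
  :ok
end

snapshot = store[42]                               # read phase: note version 1
p optimistic_update(store, 42, snapshot[:version]) # => :ok, version bumped to 2
p optimistic_update(store, 42, snapshot[:version]) # => :conflict, stale version detected
```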
| | Pessimistic Locking | Optimistic Locking |
|---|---|---|
| Locking behaviour | Locks resources before using them, preventing others from access to these resources. | Proceeds without locking resources, checks resource state conflicts at commit time |
| Deadlocks and Starvation | Transactions can deadlock and can be starved for resources held by other transactions | No deadlocks but a conflicting transaction can keep conflicting even after retrying |
| Efficiency | Efficient when there are many potential conflicts between transactions | Efficient when there are few potential conflicts between transactions |
LiteCluster uses optimistic locking to avoid taking locks across the network. It uses transaction reordering to greatly minimize potential conflicts.
Fault Tolerance
The replicator is always aware of the current state of the cluster, and expects to get signals from each node at given intervals, even if the payload is empty. A node can generally fail in two ways:
Sudden death
The node is suddenly cut off from the network, crashes, or is rebooted. In this case the connection with the node will be closed; the other nodes in the cluster will detect this event and will remove the affected node from the nodes list.
Slow death
A node can still be present on the network but not responding, or very slow to respond, due to network issues, hardware failure (e.g. a disk error) or some other system issue. In this case the heartbeats from and to the node will time out and it will be removed from the cluster.
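An illustrative sketch of that kind of heartbeat bookkeeping follows; the timeout value and data structure are assumptions, not LiteCluster's actual implementation.

```ruby
# Illustrative heartbeat monitor, not LiteCluster's actual code.
HEARTBEAT_TIMEOUT = 2.0 # seconds (assumed value; the real threshold is not documented)

last_seen = {
  "node-a" => Time.now,
  "node-b" => Time.now - 10, # has not signalled for a while
}

def prune_unresponsive(last_seen, timeout)
  now = Time.now
  # Drop any node whose heartbeats have timed out; the remaining nodes keep
  # operating without it, with no leader election required.
  last_seen.reject { |_node, seen_at| now - seen_at > timeout }
end

p prune_unresponsive(last_seen, HEARTBEAT_TIMEOUT).keys # => ["node-a"]
```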
Care is taken to ensure that writes from a failed node are never considered by some nodes of the cluster and not the rest. This results in a doubling of the write latency, but it ensures that nodes don’t end up in an inconsistent state if one node fails after sending data to some nodes but not the others. This is relaxed when there are only two nodes in the cluster, as there is no risk of data sent by the failing node reaching only some of the nodes.
Sync Modes
LiteCluster has multiple sync modes, the default of which is full sync mode.
Full Sync Mode
Every write or write transaction will be fully committed, and fsync is called to ensure it has reached the physical disk on all cluster nodes, before returning to the caller. This is critical for ensuring data durability but is also important for conflict detection. This mode is the one suitable for your main app’s database or a distributed cache store.
Network Sync Mode
In this mode the caller is sure that the write they are performing is idempotent and will not conflict with any other write in the system, and is not interested in the result of the call. The replicator returns to the caller once the logical instructions are replicated across the cluster, but before they are committed to disk. This is suitable for operations like a background job queue, where you want to ensure that the job is durably replicated to the cluster but can hit the disk at a later time.
Full Async Mode
In this mode the caller is likewise sure that the write will not conflict with any other write, and is comfortable tolerating a small potential for data loss in case of a failing node. Clients simply move forward after sending the write instructions to the replicator. This mode is most suitable for applications like a replicated log or a metrics collector.
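If the sync mode is selectable per write, usage might look roughly like the following; the sync_mode: option shown here is an assumption about the driver API, not a documented interface.

```ruby
# Hypothetical sketch: choosing a sync mode per write.
# :full_sync is described as the default and the safest.
DB.query_array(
  "INSERT INTO accounts (id, balance) VALUES (?, ?)",
  [1, 100],
  sync_mode: :full_sync       # wait for replication + fsync on all nodes
)

# A background-job enqueue can use :network_sync: it is idempotent,
# conflict-free, and only needs to be durably replicated, not yet on disk.
DB.query_array(
  "INSERT INTO jobs (id, payload) VALUES (?, ?)",
  ["job-123", "{\"action\":\"send_email\"}"],
  sync_mode: :network_sync
)

# Metrics or log style writes may accept a small risk of loss with :full_async.
DB.query_array(
  "INSERT INTO metrics (name, value, at) VALUES (?, ?, ?)",
  ["requests", 1, Time.now.to_i],
  sync_mode: :full_async
)
```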

Security
Only trusted nodes are accepted in the cluster. Upon joining the cluster each node sends an encrypted identification message which then allows other nodes to include it in their nodes list. This prevents a rogue process residing on one of the cluster nodes from acting as a replicator in order to corrupt the whole cluster.
The transport and encoding layer is responsible for making sure messages sent across the cluster are intact and secure. Each cluster node holds a copy of a shared security key that is used to encrypt messages before they are sent over the network and to decrypt them as they arrive at each node.
For performance reasons, the data is encrypted once on each node rather than separately for each node it is sent to.
Given the above, all data in transit (outside of the node boundaries) is encrypted. Further protection can be applied by using filesystem encryption to encrypt data at rest as well.
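The post does not specify the cipher or framing, so purely as an illustration, here is how a pre-shared symmetric key could be used to encrypt a message once before broadcasting it to all peers, using Ruby's OpenSSL bindings with AES-256-GCM.

```ruby
require 'openssl'

# Illustration only: LiteCluster's actual cipher and message framing are
# not documented here. This shows the general shared-key idea.
CLUSTER_KEY = OpenSSL::Random.random_bytes(32) # pre-shared key (32 bytes for AES-256)

def encrypt_message(plaintext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.encrypt
  cipher.key = key
  iv = cipher.random_iv
  ciphertext = cipher.update(plaintext) + cipher.final
  # The same [iv, tag, ciphertext] triple is sent to every node.
  [iv, cipher.auth_tag, ciphertext]
end

def decrypt_message(iv, tag, ciphertext, key)
  cipher = OpenSSL::Cipher.new('aes-256-gcm')
  cipher.decrypt
  cipher.key = key
  cipher.iv = iv
  cipher.auth_tag = tag
  cipher.update(ciphertext) + cipher.final # raises if the message was tampered with
end

iv, tag, ct = encrypt_message("logical write instructions...", CLUSTER_KEY)
puts decrypt_message(iv, tag, ct, CLUSTER_KEY)
```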
Comparison to other replicated SQLite solutions
There are several SQLite replication solutions available. As engineering solutions, each makes its own tradeoffs and tunes its operation along the three most important dimensions: consistency, availability and performance (I know the canonical rendition says partition tolerance, but I believe that is an extension of consistency, and that performance, as in latency and throughput, is a dimension that shouldn’t be ignored).
| – | Bedrock | ComDB2 | cr-sqlite | Litefs | Dqlite | rqlite | LiteCluster |
|---|---|---|---|---|---|---|---|
| Replication model | Leader / Followers | Leader / Followers | Leaderless | Leader / Followers | Leader / Followers | Leader / Followers | Leaderless |
| Replication method | Physical | Physical | Physical | Physical | Physical | Physical | Logical |
| Locking mode | Optimistic | Pessimistic | Optimistic | Pessimistic | Pessimistic | Pessimistic | Optimistic |
| Consensus method | Paxos | ? | CRDT | Leases | RAFT | RAFT | Lockstep |
| Replication Lag | Yes | Yes | Yes | Yes | Yes | Yes | No |
| Always consistent | Yes | Yes | No | Yes | Yes | Yes | Yes |
Performance Evaluation
LiteCluster is very early in the implementation phase, but we were able to get some interesting performance numbers already. These benchmarks do not compare LiteCluster to any other replicated SQLite solution; they only focus on LiteCluster’s current performance.
Please note that there are a ton of optimizations not yet implemented, but I think the numbers are interesting enough to publish already.
Also note that while the test simulates network latency, it is penalized by the fact that all writes are happening on the same machine.
Benchmark Setup
In order to measure the cluster performance we prepared the following setup:
- LiteCluster: A 2 node cluster running on the same machine
- Only one of the nodes in the cluster had client processes connected to it (8 processes)
- Network latency simulated using tc with varying latency values:
  - 0ms latency to establish a baseline
  - 1 to 2ms latency for nodes in the same datacenter
  - 5 to 10ms latency for nodes in the same city
  - 10 to 20ms latency for nodes in the same state
- Load was generated using wrk, running 256 to 2048 concurrent connections
- After each run, the database files were compared using sqldiff to ensure consistency
- All operations were performed in full sync mode
The tc command was used as follows to introduce delays:
sudo tc qdisc add dev lo root netem delay 10ms
And then to reset the delay:
sudo tc qdisc del dev lo root netem
Wrk was used as follows:
wrk -t4 -d5 -c 2048 --latency http://192.168.0.232:3000/
Benchmark App
A small Rack application was used to simulate write and read IO (with a 50% chance for each):
# DB is the LiteCluster driver connection; START_TIME is captured at boot.
app = proc do |env|
  if rand(2) > 0
    # Write path: derive a unique id from the elapsed time and the process pid
    time = (Time.now - START_TIME).to_f
    id = ("#{(time * 1_000_000).to_i}%05d" % [Process.pid]).to_i
    result = DB.query_array(
      "INSERT INTO events(id, data, created_at) VALUES(?, ?, ?)",
      [id, "some event information", time.to_i]
    )
  else
    # Read path: fetch the most recent event
    result = DB.query_array("SELECT * FROM events ORDER BY id DESC LIMIT 1")[0]
  end
  [200, {}, ["OK"]] # a Rack response body must respond to #each, hence the array
end
Performance Results
Throughput (in requests per second)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 25,197 | 53,507 | 88,059 | 111,921 |
| 1ms | 24,918 | 47,132 | 75,702 | 100,134 |
| 2ms | 20,485 | 39,856 | 66,356 | 96,951 |
| 5ms | 12,759 | 26,199 | 48,670 | 67,948 |
| 10ms | 7,860 | 15,507 | 30,709 | 35,432* |
| 20ms | 4,281 | 8,542 | 15,081 | 14,475* |
P90 Latency (in ms)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 24.35 | 18.91 | 23.09 | 37.68 |
| 1ms | 19.14 | 20.54 | 26.06 | 40.93 |
| 2ms | 22.04 | 22.97 | 28.24 | 38.58 |
| 5ms | 32.04 | 30.06 | 33.48 | 162.19 |
| 10ms | 47.51 | 48.02 | 47.12 | 234.69* |
| 20ms | 81.38 | 84.46 | 95.04 | 337.76* |
P99 Latency (in ms)
| Latency/Clients | 256 Clients | 512 Clients | 1024 Clients | 2048 Clients |
|---|---|---|---|---|
| 0ms | 39.78 | 30.63 | 30.67 | 45.3 |
| 1ms | 31.91 | 29.66 | 36.73 | 55.73 |
| 2ms | 36.57 | 33.71 | 42.33 | 60.21 |
| 5ms | 54.01 | 43.98 | 50.46 | 559.8 |
| 10ms | 61.62 | 63.15 | 102.36 | 697.8* |
| 20ms | 90.68 | 92.2 | 332.82 | 958.43* |
Conclusion
LiteCluster shows nice performance in lower latency environments; even with higher latency, a LiteCluster based application will hold its own with 1024 or fewer connected clients.
We expect better numbers with further optimizations to the committer/replicator duo and with prepared statement support in clients.
Future Work
As LiteCluster matures, more testing should be done, including on real life clusters spanning multiple machines, both in local and wide area networks, as well as testing of failure scenarios and cluster recovery.
One area that is not implemented yet is dynamically joining a running cluster, which should be a priority for the coming efforts.
A lot of optimization opportunities in the committer/replicator should be addressed in order to eliminate some inefficiencies and reduce latencies.
Finally, ActiveRecord/Sequel & Litestack integration should materialize, followed by benchmarking with a full fledged Rails app.