I've been thinking a bit about roles and responsibilities for the primary and secondary nodes of a log. Here I'm sketching a model that is mostly compatible with the current replication protocol, and which makes the nodes a bit more independent (e.g., could be run by different organizations).
A log instance consists of a primary node and (ideally) several secondary nodes. A log is identified by its key, i.e., the key that is used to sign the log's advertised tree heads (the same tree heads that are cosigned by witnesses, and for which the log operator states the intended reliability, etc.). Each node is identified by a separate node key.
* Local trees
Each node (including the primary) keeps its own local tree. That tree may be larger than the log's advertised tree, but not smaller (except while a new node is starting up). Each node signs the tree heads of its local tree with its node key. These signatures must not be confused with the log's signed tree heads; if using separate keys is not enough, they could also use a separate signature namespace.
The semantics of a signature on a local tree head is that the node promises that its local tree is append-only, and that all data covered by the signed tree head is committed to local storage, i.e., the tree should survive events like a local power outage. However, reliability is best effort. If the node suffers a disk failure, or is decommissioned for any other reason, the contents of the tree may be lost (except for parts of it replicated elsewhere, as described below).
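To illustrate the namespace idea, here is a minimal sketch of how the bytes to be signed could be built so that a node's tree head can never be mistaken for the log's tree head, even if the same key were used by accident. The function name, the message layout and the two namespace strings are all hypothetical; this is not the format used by the current protocol.

```python
# Hypothetical namespace strings, chosen only for this example.
LOG_NAMESPACE = "example.org/log-tree-head"
NODE_NAMESPACE = "example.org/node-tree-head"


def tree_head_message(namespace, size, root_hash):
    """Build the bytes that would be passed to the signing function.

    The namespace is the first line, so two tree heads with the same
    size and root hash still give different messages (and therefore
    different signatures) when the namespaces differ.
    """
    if "\n" in namespace:
        raise ValueError("namespace must be a single line")
    if size < 0:
        raise ValueError("tree size must be non-negative")
    return f"{namespace}\n{size}\n{root_hash.hex()}\n".encode("utf-8")
```

The actual signature (e.g., with the node key) would then be computed over the returned bytes; that part is left out here.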
* Primary node
The primary node's responsibility is to accept new leaves from users, commit them to its local tree, and sign the resulting local tree head using its node key. Periodically, it queries the signed tree heads of the secondary nodes' trees, checks consistency, and publishes a new version of the *log*'s signed tree head once data is replicated to all secondaries. (If we have a larger number of secondaries, we could consider allowing the primary to proceed even if a single secondary is behind or unreachable.)
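The publishing rule can be sketched as a small function. This is only an illustration under stated assumptions: the names publishable_size and allowed_behind are made up here, the sizes are assumed to come from secondary tree heads whose signatures and consistency have already been checked, and an unreachable secondary is assumed to be reported with size 0.

```python
def publishable_size(primary_size, secondary_sizes, allowed_behind=0):
    """Return the largest tree size the primary may advertise for the log.

    primary_size:    size of the primary's local tree.
    secondary_sizes: verified local tree sizes of the secondaries
                     (0 for a secondary that could not be reached).
    allowed_behind:  number of secondaries that may lag or be
                     unreachable; 0 means every secondary must have
                     replicated the data.
    """
    if not secondary_sizes:
        raise ValueError("at least one secondary is required")
    if not 0 <= allowed_behind < len(secondary_sizes):
        raise ValueError("allowed_behind out of range")
    # Largest first; ignore the allowed_behind smallest trees and take
    # the smallest of the rest.
    ranked = sorted(secondary_sizes, reverse=True)
    replicated = ranked[len(ranked) - 1 - allowed_behind]
    # Never advertise more than the primary itself has committed.
    return min(primary_size, replicated)
```

With a primary at size 12 and secondaries at 10, 8 and 9, the strict rule gives 8, while tolerating one lagging secondary gives 9. The primary would only publish if the result is larger than the size it last advertised.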
* Secondary nodes
Secondaries only accept new leaves from the primary. A secondary that is new, or for some reason is behind, will first get the log's signed tree head and retrieve all leaves it is missing. It must check inclusion and consistency before committing the leaves to its local tree and underlying storage. Next, it will periodically get the primary node's local tree head (verifying the signature using the node key of the node that is the current primary), and similarly incorporate the new leaves after the inclusion and consistency checks pass. Periodically, or when asked by the primary, it will sign the head of its local tree using its own node key.
So at all times we have this relation between tree sizes:
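As a rough sketch of the check a secondary makes before committing, the code below recomputes the Merkle root over its existing leaves plus the newly fetched ones and compares it with the root in the tree head. Matching roots imply both that the old local tree is a prefix of the new one and that the new leaves are included. The assumptions: RFC 6962 style hashing, a tree head whose signature has already been verified, and a hypothetical SecondaryTree class that keeps everything in memory. A real node would use inclusion and consistency proofs or incremental hashing, and would sync to disk before signing.

```python
import hashlib


def leaf_hash(data):
    # RFC 6962 leaf hash: SHA-256 over a 0x00 prefix and the leaf data.
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left, right):
    # RFC 6962 interior node hash: SHA-256 over 0x01 and both children.
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves):
    """Merkle tree hash of a list of leaves, as defined in RFC 6962."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


class SecondaryTree:
    """In-memory stand-in for a secondary's local tree (hypothetical)."""

    def __init__(self):
        self.leaves = []

    def try_extend(self, new_leaves, head_size, head_root):
        """Commit new_leaves only if the result matches the tree head."""
        candidate = self.leaves + list(new_leaves)
        if len(candidate) != head_size:
            return False
        if merkle_root(candidate) != head_root:
            return False
        self.leaves = candidate  # a real node would write to disk here
        return True
```

Because existing leaves are never modified, only extended, a successful try_extend keeps the local tree append-only.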
log's tree <= each secondary node tree <= primary node tree
Extensions: It may be useful to let secondaries also act as mirrors, each republishing the latest tree head it has received from the primary node, together with available cosignatures. It may also be possible to distribute new leaves in more of a peer-to-peer fashion, instead of each secondary retrieving them directly from the primary.
* Migration on primary failure
What needs to happen when a primary fails or is to be replaced? We need the following steps:
0. If possible, the primary node's access to the log signing key should be removed.
1. Each secondary must be told that the primary is down. This most likely has to be a manual procedure, with a human determining that the primary should no longer be used. On each secondary, this means that the node key of the old primary is removed from the configuration.
2. Once all the secondaries agree that there is no longer any primary node, one of the secondaries can become the new primary. If the local trees of the secondaries are of different sizes, the one with the largest tree should be selected as the new (interim) primary, but not yet with access to the log's signing key. (If, for some reason, a different node is chosen, the nodes that are ahead of the chosen node must be reset: discard the extra leaves, destroy the previous node key and create a new one.)
3. The secondaries that were not chosen as primary are now reconfigured to use the chosen node (identified by node key, as usual) as primary, and retrieve all leaves and commit them to their local trees.
4. After some time, all nodes should be in sync. If desired, the chosen node can now be demoted back to secondary (after which all the other secondaries are again reconfigured to have no primary), and a new primary node can be selected.
5. Finally, the new primary should be given access to the log's signing key and start normal operation (accept leaves from users, advertise new tree heads, request cosignatures, etc).
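The selection in step 2 can be sketched as follows. This is an illustration only: choose_interim_primary is a made-up name, nodes are represented by arbitrary string identifiers standing in for node keys, and the function assumes that all listed nodes have already removed the old primary from their configuration (step 1).

```python
def choose_interim_primary(tree_sizes, chosen=None):
    """Pick the interim primary and list the nodes that must be reset.

    tree_sizes: dict mapping node identifier to local tree size, for
                the remaining secondaries.
    chosen:     optionally force a specific node; by default the node
                with the largest tree is picked (ties are broken by
                identifier order, an arbitrary choice for this sketch).

    Returns (primary, nodes_to_reset). A node must be reset (extra
    leaves discarded, node key replaced) if its tree is larger than
    the chosen primary's tree.
    """
    if not tree_sizes:
        raise ValueError("no secondaries available")
    if chosen is None:
        chosen = max(sorted(tree_sizes), key=lambda node: tree_sizes[node])
    elif chosen not in tree_sizes:
        raise ValueError("chosen node is not a known secondary")
    limit = tree_sizes[chosen]
    to_reset = sorted(n for n, size in tree_sizes.items() if size > limit)
    return chosen, to_reset
```

Picking the largest tree always gives an empty reset list, which is why it is the preferred choice; forcing a smaller node shows which nodes would lose leaves and need new node keys.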
If we are willing to have secondaries coordinate with each other, part of this process could potentially be automated. If all nodes are connected to each other (with the exception of the failing primary, which is explicitly and manually removed from the set of nodes in step (1) above), we could maybe have a protocol that lets nodes first agree that there is no primary, and then elect a new primary based on tree size and on which nodes are configured as candidates for getting access to the log's signing key.
Regards, /Niels
sigsum-general@lists.sigsum.org