Thanks Rasmus! I've cc'd the list and added Bob who's interested in this topic too.
What submit latency are you willing to accept? I'm asking because
depending on if you need ~1s or ~10s will influence the options.
I'd like to keep this latency as low as possible. It would be a breaking change across the ecosystem if we upped latency to ~10s, as I'm assuming clients have not configured their timeouts to expect this high of a latency. That's not to say we couldn't make this change, as we could provide a different API, I'd just like to explore a low latency initially.
I.e., the log can keep track of a witness' latest state X, then provide
to the witness a new checkpoint Y and a consistency proof that is valid from X -> Y. If all goes well, the witness returns its cosignature. If they are out of sync, the log needs to try again with the right state.
Assuming that all witnesses are responsive and maintain the same state, this could work. Keeping track of N different witnesses is doable, but I think it's likely they would get out of sync, e.g. a request to cosign a checkpoint times out but the witness still verifies and persists the checkpoint. This isn't a blocker though, it's just an extra call if needed.
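As a rough sketch of that bookkeeping (in Go, with made-up types and an illustrative AddTreeHead signature, not the real Sigsum API): the log remembers each witness's last known tree size and resyncs once when a conflict reveals the bookkeeping was stale, e.g. because a timed-out request was nevertheless persisted by the witness:

```go
package main

import (
	"errors"
	"fmt"
)

// Illustrative types only; not the actual Sigsum API surface.
type Checkpoint struct {
	Size uint64
	Root string
}

// ErrConflict signals that the witness holds a different tree size
// than the log assumed; Got carries the witness's actual size.
type ErrConflict struct{ Got uint64 }

func (e ErrConflict) Error() string { return fmt.Sprintf("witness is at size %d", e.Got) }

type Witness interface {
	// AddTreeHead sends a new checkpoint plus a consistency proof
	// computed from oldSize, returning a cosignature on success.
	AddTreeHead(oldSize uint64, cp Checkpoint, proof []string) (string, error)
}

// cosign uses the log's per-witness state (known) and retries once if
// that state turned out stale -- the out-of-sync case discussed above,
// where an earlier request timed out on the log side but the witness
// still verified and persisted the checkpoint.
func cosign(w Witness, known *uint64, cp Checkpoint,
	proofFrom func(oldSize uint64) []string) (string, error) {

	for attempt := 0; attempt < 2; attempt++ {
		sig, err := w.AddTreeHead(*known, cp, proofFrom(*known))
		if err == nil {
			*known = cp.Size // witness accepted cp; remember it
			return sig, nil
		}
		var c ErrConflict
		if errors.As(err, &c) {
			*known = c.Got // resync and retry with the right proof
			continue
		}
		return "", err
	}
	return "", errors.New("witness still out of sync after resync")
}
```

So the "extra call if needed" is just one more round trip with a proof computed from the size the witness reports.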
The current plan for Sigsum is to accept up to T seconds of logging
latency, where T is in the order of 5-10s. Every T seconds the log selects the current checkpoint, then it collects as many cosignatures as possible before making the result available and starting all over again.
This seems like the most sensible approach, assuming that latency can be accepted by the ecosystem. Batching entries is something we've discussed before; there are other performance benefits besides witnessing.
An alternative implementation of the same witness protocol would be as follows: always be in the process of creating the next witnessed checkpoint. I.e., as soon as one finalized a witnessed checkpoint, start all over again because the log's tree already moved forward. To keep the latency down, only collect the minimum number of cosignatures
needed to satisfy all trust policies that the log's users depend on.
This makes sense, though I think adding some latency as suggested above makes this more straightforward. One detail, which may not be relevant depending on your order of operations, is that we just need to confirm that the inclusion proof returned will be based on the cosigned checkpoint. Currently our workflow is first requesting an inclusion proof for the latest tree head, then signing the tree head.
On Fri, Feb 2, 2024 at 3:37 AM Rasmus Dahlberg rgdd@glasklarteknik.se wrote:
Hi Hayden,
Exciting that you're exploring this area, answers inline!
On Thu, Feb 01, 2024 at 01:05:48PM -0800, Hayden Blauzvern wrote:
Hey y'all! I was reading up on Sigsum docs and witnessing and had a question about if or how you're handling logs with significant traffic.
Context is I've been looking at improving our witnessing story with Sigstore and exploring the viability of the bastion-based witnessing approach. Currently, the Sigstore log does no batching of entry uploads, and so the tree head/checkpoint is frequently updated. Consequently this means that two witnesses are very unlikely to witness the same checkpoint.
To solve this, we added a 'stable' checkpoint, one that is published every X minutes (5 currently). Witnesses are expected to compute consistency proofs off that checkpoint so that multiple witnesses verify the same checkpoint.
Sounds similar to the initial witness protocol we used: the log makes available a checkpoint for some time, and witnesses poll to cosign it.
We moved away from this communication pattern to solve two problems:
- High submit latency, which is the issue you're experiencing.
- Ensure logs without publicly reachable endpoints are not excluded.
While reworking this, we also tried to keep as many of the properties we liked with the old protocol. For example, the bastion host stems from the nice property that witnesses can be pretty locked down behind a NAT.
I've been exploring the bastion-based approach where, for each entry or tree head update, the log requests cosignatures from a set of witnesses. What I'm pondering now is how to deal with a log that frequently updates its tree head due to frequent new entries. One solution is to batch entries for a long enough period, let's say 1 minute, so that the log can fetch cosignatures from a quorum of witnesses while accounting for some latency. But this is not our preferred user experience, to have signers wait that long. Lowering the batch to 1 second would solve the UX issue.
What submit latency are you willing to accept? I'm asking because depending on if you need ~1s or ~10s will influence the options.
However, now there's an issue with updating a witness's checkpoint. Using the API Filippo has documented for the witness, the log makes two requests to the witness: one for the latest witness checkpoint, and one to provide the log's new checkpoint.
The current witness protocol allows the log to collect a cosignature from a witness in a single API call, see the add-tree-head endpoint:
https://git.glasklar.is/sigsum/project/documentation/-/blob/d8de0eeebbb5bb01...
(Warning: the above API document is being reworked and moved to C2SP. The new revision will revolve around checkpoint names and encodings. You'll find links to all the decided proposals on www.sigsum.org/docs.)
I.e., the log can keep track of a witness' latest state X, then provide to the witness a new checkpoint Y and a consistency proof that is valid from X -> Y. If all goes well, the witness returns its cosignature. If they are out of sync, the log needs to try again with the right state.
This seemingly would not work with a high-volume log since the witness's latest checkpoint would update too frequently.
Did you have any thoughts on how to handle this?
The current plan for Sigsum is to accept up to T seconds of logging latency, where T is in the order of 5-10s. Every T seconds the log selects the current checkpoint, then it collects as many cosignatures as possible before making the result available and starting all over again.
The rationale is: a witness that is online will be able to respond in 5-10s, so waiting longer than that will not really do much. I.e., the witness is either online and responding or it isn't. So: under normal circumstances one would expect cosignatures from all reliable witnesses.
An alternative implementation of the same witness protocol would be as follows: always be in the process of creating the next witnessed checkpoint. I.e., as soon as one finalized a witnessed checkpoint, start all over again because the log's tree already moved forward. To keep the latency down, only collect the minimum number of cosignatures needed to satisfy all trust policies that the log's users depend on.
For example, if you're opinionated and say users should rely on 10 selected witnesses with a 3-of-10 policy; the log server can publish the next checkpoint as soon as it received cosignatures from 3 witnesses.
Both approaches work, but depending on which one you choose, the properties and complexity will differ slightly. I'll avoid hashing out that analysis here to keep this initial answer brief, but if you need the ~1s latency, the second option should get you close.
By the way, would it be OK to CC the sigsum-general list? Pretty sure this is a conversation other folks would be interested in as well!
-Rasmus
Hi Hayden! Thank you for reaching out with this.
The requirements you describe seem analogous to what I (with my Go project hat on) need for the Go Checksum Database, and were indeed a motivation for moving to a synchronous witness API.
My plan is what Rasmus described, except with somewhat more optimistic expectations of the latency of witnesses.

1. Incorporate a batch of new leaves, possibly holding the submission requests.
2. Sign a new checkpoint.
3. Send out parallel requests to all witnesses to cosign the checkpoint, over keep-alive connections.
4. As soon as enough cosignatures are returned, publish the checkpoint and release the client requests.
5. If a witness doesn't return by the time the next checkpoint is signed, ignore it for the next round(s).
6. If a witness times out, assume it kept the previous state. If that's incorrect, it will send a 409 Conflict response with its correct state (this is a recent API change based on Trust Fabric feedback) and can be contacted successfully in the next round.

I believe all of that can happen in 1s: 500ms to batch leaves, 500ms for m-of-n witnesses to respond to a single HTTP request over an established connection. US west coast to EU RTT is <150ms, which leaves 350ms for the witness to do signature verify+sign and database read+write.
On Fri, Feb 02, 2024 at 01:40:21PM -0800, Hayden Blauzvern wrote:
Thanks Rasmus! I've cc'd the list and added Bob who's interested in this topic too.
Great, and hi Bob! Happy to have you CC:ed here as well :).
What submit latency are you willing to accept? I'm asking because
depending on if you need ~1s or ~10s will influence the options.
I'd like to keep this latency as low as possible. It would be a breaking change across the ecosystem if we upped latency to ~10s, as I'm assuming clients have not configured their timeouts to expect this high of a latency. That's not to say we couldn't make this change, as we could provide a different API, I'd just like to explore a low latency initially.
FWIW we haven't made any detailed analysis of why the KISS approach that I referred to as ~10s couldn't be, e.g., ~3s. So if you'd like to weigh that into your exploration, it might be worth thinking about. But it makes sense to try and minimize latency given your current design!
I.e., the log can keep track of a witness' latest state X, then provide
to the witness a new checkpoint Y and a consistency proof that is valid from X -> Y. If all goes well, the witness returns its cosignature. If they are out of sync, the log needs to try again with the right state.
Assuming that all witnesses are responsive and maintain the same state, this could work. Keeping track of N different witnesses is doable, but I think it's likely they would get out of sync, e.g. a request to cosign a checkpoint times out but the witness still verifies and persists the checkpoint. This isn't a blocker though, it's just an extra call if needed.
I think that's a reasonable (and crucial) assumption. I would not recommend putting a witness into a trust policy that doesn't have a convincing plan for how to stay online and responsive most of the time.
Pretty sure you will run into some interesting implementation details here though if you go for the lowest-latency option, as no one has dogfooded such an implementation of the protocol yet. It is a bit more involved than the KISS approach (so you're trading latency vs complexity here).
The current plan for Sigsum is to accept up to T seconds of logging
latency, where T is in the order of 5-10s. Every T seconds the log selects the current checkpoint, then it collects as many cosignatures as possible before making the result available and starting all over again.
This seems like the most sensible approach, assuming that latency can be accepted by the ecosystem. Batching entries is something we've discussed before; there are other performance benefits besides witnessing.
Yeah, and it kinda makes sense not to push more complicated low-latency solutions on use cases that work without them. I think most use cases can tolerate a little bit of latency, but there are of course some exceptions.
An alternative implementation of the same witness protocol would be as follows: always be in the process of creating the next witnessed checkpoint. I.e., as soon as one finalized a witnessed checkpoint, start all over again because the log's tree already moved forward. To keep the latency down, only collect the minimum number of cosignatures
needed to satisfy all trust policies that the log's users depend on.
This makes sense, though I think adding some latency as suggested above makes this more straightforward. One detail, which may not be relevant depending on your order of operations, is that we just need to confirm that the inclusion proof returned will be based on the cosigned checkpoint. Currently our workflow is first requesting an inclusion proof for the latest tree head, then signing the tree head.
You could also consider making the submitter fetch the proof using a separate API based on the cosigned checkpoint that they are happy with.
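For concreteness, the submitter-side check is: recompute the root from the inclusion proof and compare it to the root in the cosigned checkpoint the submitter already trusts. This is the standard RFC 6962/9162 computation, sketched here in Go with SHA-256 (helper names are mine):

```go
package main

import (
	"bytes"
	"crypto/sha256"
)

// leafHash and nodeHash use the RFC 6962 domain-separation prefixes.
func leafHash(data []byte) []byte {
	h := sha256.Sum256(append([]byte{0x00}, data...))
	return h[:]
}

func nodeHash(l, r []byte) []byte {
	h := sha256.Sum256(append(append([]byte{0x01}, l...), r...))
	return h[:]
}

// verifyInclusion recomputes the root from a leaf hash and an
// RFC 9162-style inclusion path, and compares it against the root
// taken from the cosigned checkpoint.
func verifyInclusion(index, size uint64, leaf []byte, path [][]byte, root []byte) bool {
	if index >= size {
		return false
	}
	fn, sn := index, size-1
	r := leaf
	for _, p := range path {
		if sn == 0 {
			return false
		}
		if fn&1 == 1 || fn == sn {
			r = nodeHash(p, r) // sibling is on the left
			for fn&1 == 0 && fn != 0 {
				fn >>= 1
				sn >>= 1
			}
		} else {
			r = nodeHash(r, p) // sibling is on the right
		}
		fn >>= 1
		sn >>= 1
	}
	return sn == 0 && bytes.Equal(r, root)
}
```

Note the proof is only meaningful relative to a specific tree size and root, which is exactly why the proof request should be pinned to the cosigned checkpoint rather than "the latest tree head".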
-Rasmus
sigsum-general@lists.sigsum.org