Let me start with a description of my current understanding of the checksum field, present in the tree_leaf struct in the spec.
The submitter submits a message M to the log (M typically a hash of some data not disclosed to the log), together with a public key and signature.
The log first verifies the signature, and then adds a leaf to the Merkle tree. The signatures are done using ssh format, configured to use sha256. This implies that the sign and verify operations include computing SHA256(M); we call this value the "checksum" and include it in the tree_leaf struct together with the signature.
My first question: Does it matter in any way that the checksum happens to be a value used internally in the ssh signature formatting?
If we instead publish a signature of M created using the SHA512 hash internally, as is the ssh-keygen -Y sign default, and publish this signature together with checksum = SHA256(M), wouldn't that work just as well?
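To make the first question concrete, here is a minimal sketch of the data an ssh signature covers, assuming the layout described in OpenSSH's PROTOCOL.sshsig (magic preamble, namespace, reserved, hash algorithm name, H(message)). The namespace value and the helper names are placeholders for illustration, not taken from the sigsum spec.

```python
import hashlib
import struct


def ssh_string(data: bytes) -> bytes:
    """Encode an SSH 'string': 4-byte big-endian length, then the bytes."""
    return struct.pack(">I", len(data)) + data


def sshsig_signed_data(message: bytes, namespace: bytes, hash_alg: str) -> bytes:
    """Build the blob that the key signs (assumed PROTOCOL.sshsig layout)."""
    if hash_alg not in ("sha256", "sha512"):
        raise ValueError("unsupported hash algorithm")
    digest = hashlib.new(hash_alg, message).digest()
    return (
        b"SSHSIG"                        # magic preamble
        + ssh_string(namespace)
        + ssh_string(b"")                # reserved, empty
        + ssh_string(hash_alg.encode())
        + ssh_string(digest)             # H(message)
    )


M = b"example message"
NAMESPACE = b"example-namespace"  # placeholder, not the namespace sigsum uses

checksum = hashlib.sha256(M).digest()
blob256 = sshsig_signed_data(M, NAMESPACE, "sha256")
blob512 = sshsig_signed_data(M, NAMESPACE, "sha512")

# With sha256 the checksum is literally the tail of the signed blob;
# with sha512 it is an independent value published next to the signature.
print(blob256.endswith(checksum), blob512.endswith(checksum))
```

In both cases checksum = SHA256(M) can be computed from M alone; only in the sha256 case does it coincide with a value that the signature operation computes internally.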
Next question: Do we really need to publish the checksum at all? It serves as a unique and random-looking identifier for the message M, but who's using this id? We have the following roles:
1. Submitter. Will collect signatures on the submitted leaf. Obviously knows everything needed to query for the leaf hash, and will create the "sigsum proof package" to distribute to sigsum verifiers.
2. Sigsum verifier (the party that gets the message M and wants to verify that it is properly logged). The verifier needs to get (by other means than querying the log by itself) all of
   * M itself
   * signature of M
   * inclusion proof for the leaf including this signature
   * witness signatures
   * all related public keys
As far as I see, the verifier will clearly recompute the checksum, in the internals of the signature verification. It could also explicitly compare it to the value stored in the leaf, but what benefit does that give, if the signature is already verified? On the other hand, it seems essential to verify that the signature and the public key hash in the leaf are as expected.
I think this is the core of the question: Is there any reason for the verifier to validate the checksum stored in the leaf, in addition to verifying the signature? If not, what use is that field?
3. Witness. The witness doesn't have access to M, so to a witness the checksum is just a random string; there's no way to validate it. It could possibly use it to verify the signature (not by using ssh-keygen though, but by digging into the internals of ssh signatures), except that the witness is not expected to have access to the submitter's public key.
4. Monitors. The purpose of a monitor is to query the log and alert whenever an unexpected signature appears. My understanding of monitoring is somewhat fuzzy, but I think the monitor is expected to query for recent tree heads (relying on witness cosignatures to know that what it gets is recent), download all (new) leaves in the tree, and filter on one or more public key hashes of interest. For the leaves found, it will then alert the key owner on "unexpected" checksums. However, couldn't one do without the checksums and just as well look at unexpected signatures?
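The monitor's filtering step can be sketched as below. This is only an illustration under assumptions: the Leaf type is a hypothetical subset of tree_leaf with just the three fields discussed here, and the key owner is assumed to keep a record of the checksums and signatures of everything it submitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    # Hypothetical subset of the tree_leaf fields discussed above.
    checksum: bytes
    signature: bytes
    key_hash: bytes


def unexpected_leaves(leaves, key_hash, known, field):
    """Return the leaves for key_hash whose `field` value is not in `known`."""
    return [
        leaf for leaf in leaves
        if leaf.key_hash == key_hash and getattr(leaf, field) not in known
    ]


leaves = [
    Leaf(b"c1", b"s1", b"alice"),
    Leaf(b"c2", b"s2", b"bob"),
    Leaf(b"c3", b"s3", b"alice"),  # not submitted by the key owner
]

# What the owner of key "alice" knows it has submitted.
known_checksums = {b"c1"}
known_signatures = {b"s1"}

by_checksum = unexpected_leaves(leaves, b"alice", known_checksums, "checksum")
by_signature = unexpected_leaves(leaves, b"alice", known_signatures, "signature")
```

In this toy data both filters flag the same leaf, which is the point of the question: as long as the key owner records its own signatures, filtering on signatures appears to give the same alerts as filtering on checksums.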
The checksum uniquely (except for hash collisions) identifies a single message M. But the signature itself also uniquely identifies a single message M (it seems highly unlikely to have collisions, even if we allow the public key to vary, and in case we insist on having the same public key, any collision represents a break of the security of the signature algorithm).
The difference is that the checksum can be (re)computed from M only, while computing the signature also requires the private key. That sounds like a big difference, but the only roles above that are expected to know M are the submitter and the verifier. The submitter by definition knows the private key. And the verifier should be provided with the signature by other means, and just verify it.
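The verifier's side of this can be sketched as follows, separating the checks that seem essential (key hash and signature in the leaf are as expected, and the signature verifies) from the explicit checksum comparison in question. The Leaf type, the function name and the verify_sig callable are hypothetical; real ssh signature verification is deliberately left as a parameter rather than implemented here.

```python
import hashlib
from typing import Callable, NamedTuple, Tuple


class Leaf(NamedTuple):
    # Hypothetical subset of the tree_leaf fields discussed above.
    checksum: bytes
    signature: bytes
    key_hash: bytes


def check_leaf(
    message: bytes,
    signature: bytes,
    public_key: bytes,
    expected_key_hash: bytes,
    leaf: Leaf,
    verify_sig: Callable[[bytes, bytes, bytes], bool],
) -> Tuple[bool, bool]:
    """Return (essential_ok, checksum_ok) for a leaf from a proof package."""
    # Checks that seem essential: the leaf carries the expected key hash
    # and signature, and the signature verifies for the message.
    essential_ok = bool(
        leaf.key_hash == expected_key_hash
        and leaf.signature == signature
        and verify_sig(public_key, message, signature)
    )
    # The additional check under discussion: recompute SHA256(M) and
    # compare it to the checksum stored in the leaf.
    checksum_ok = leaf.checksum == hashlib.sha256(message).digest()
    return essential_ok, checksum_ok
```

The open question is whether there is any situation where the second value should influence the verifier's decision once the first one is true.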
Regards, /Niels