Let me start with a description of my current understanding of the
checksum field, present in the tree_leaf struct in the spec.
The submitter submits a message M to the log (M typically a hash of some
data not disclosed to the log), together with a public key and
signature.
The log first verifies the signature, and then adds a leaf to the merkle
tree. The signatures are done using ssh format, configured to use
sha256. This implies that the sign and verify operations include
computing SHA256(M); we call this value the "checksum" and include it in
the tree_leaf struct together with the signature.
My first question: Does it matter in any way that the checksum happens
to be a value used internally in the ssh signature formatting?
If we instead publish a signature of M created using the SHA512 hash
internally, as is the ssh-keygen -Y sign default, and publish this
signature together with checksum = SHA256(M), wouldn't that work just as
well?
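To make the question concrete, here is a rough Python sketch of the blob
that actually gets signed in the ssh signature format, as I read
PROTOCOL.sshsig: the magic string "SSHSIG", followed by the namespace, a
reserved field, the hash algorithm name, and H(M), each encoded as a
length-prefixed ssh string. The namespace value and the helper names are
made up for illustration, and no real signing is done. It only shows
that SHA256(M) sits inside the signed data when sha256 is configured,
and that it can be computed from M alone either way.

```python
import hashlib
import struct


def ssh_string(data: bytes) -> bytes:
    """Encode bytes as an ssh 'string': 4-byte big-endian length, then data."""
    return struct.pack(">I", len(data)) + data


def sshsig_signed_data(message: bytes, namespace: bytes, hash_alg: str) -> bytes:
    """Build the blob that is signed, following my reading of PROTOCOL.sshsig."""
    digest = hashlib.new(hash_alg, message).digest()
    return (
        b"SSHSIG"                              # magic preamble
        + ssh_string(namespace)
        + ssh_string(b"")                      # reserved, empty
        + ssh_string(hash_alg.encode("ascii"))
        + ssh_string(digest)                   # H(message)
    )


M = b"example message"
NAMESPACE = b"example-namespace"  # hypothetical, not the value from the spec

# The checksum depends on M only.
checksum = hashlib.sha256(M).digest()

blob_sha256 = sshsig_signed_data(M, NAMESPACE, "sha256")
blob_sha512 = sshsig_signed_data(M, NAMESPACE, "sha512")

# With sha256 the checksum is literally the tail of the signed data;
# with sha512 it is not, but it is still well-defined as SHA256(M).
checksum_is_internal_sha256 = blob_sha256.endswith(checksum)
checksum_is_internal_sha512 = blob_sha512.endswith(checksum)
```

In the sha512 case the published checksum would just be a separate
digest of M, unrelated to the internals of the signature.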
Next question: Do we really need to publish the checksum at all? It
serves as a unique and random-looking identifier for the message M, but
who's using this id? We have the following roles:
1. Submitter. Will collect signatures on the submitted leaf. Obviously
knows everything needed to query for the leaf hash, and will create
the "sigsum proof package" to distribute to sigsum verifiers.
2. Sigsum verifier (the party that gets the message M and wants to
verify that it is properly logged). The verifier needs to get (by
other means than querying the log by itself) all of
  - M itself
  - signature of M
  - inclusion proof for the leaf including this signature
  - witness signatures
  - all related public keys
As far as I see, the verifier will clearly recompute the checksum, in
the internals of the signature verification. It could also explicitly
compare the checksum to the value stored in the leaf, but what
benefit does that give, if the signature is already verified? On the
other hand, it seems essential to verify that the signature and the
public key hash in the leaf are as expected.
I think this is the core of the question: Is there any reason for the
verifier to validate the checksum stored in the leaf, in addition
to verifying the signature? If not, what use is that field?
3. Witness. The witness doesn't have access to M, so to a witness the
checksum is just a random string; there's no way to validate it. It
could possibly use it to verify the signature (not by using
ssh-keygen though, but by digging into the internals of ssh
signatures), except that the witness is not expected to have access to
the submitter's public key.
4. Monitors. The purpose of a monitor is to query the log and alert
whenever an unexpected signature appears. My understanding of
monitoring is somewhat fuzzy, but I think the monitor is expected to
query for recent tree heads (relying on witness cosignatures to know
that what it gets is recent), download all (new) leaves in the tree,
and filter on one or more public key hashes of interest. For the
leaves found, it will then alert the key owner on "unexpected" checksums.
However, couldn't one do without the checksums and just as well look
at unexpected signatures?
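A toy Python sketch of the two monitoring variants, to illustrate why I
think they are interchangeable. The Leaf type, its field names and the
function names are my own assumptions for illustration, not the log's
API; the monitor is assumed to hold a set of checksums (or signatures)
that the key owner knows about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Leaf:
    # Field names are assumed, loosely following the tree_leaf struct.
    checksum: bytes
    signature: bytes
    key_hash: bytes


def unexpected_by_checksum(leaves, key_hash, known_checksums):
    """Leaves for our key whose checksum the key owner does not recognize."""
    return [l for l in leaves
            if l.key_hash == key_hash and l.checksum not in known_checksums]


def unexpected_by_signature(leaves, key_hash, known_signatures):
    """Same filter, but keyed on the signature instead of the checksum."""
    return [l for l in leaves
            if l.key_hash == key_hash and l.signature not in known_signatures]


# Dummy data: two leaves for our key (one of them unknown to the key
# owner), and one leaf for somebody else's key.
ours = b"K" * 32
leaves = [
    Leaf(checksum=b"c1", signature=b"s1", key_hash=ours),
    Leaf(checksum=b"c2", signature=b"s2", key_hash=ours),
    Leaf(checksum=b"c3", signature=b"s3", key_hash=b"X" * 32),
]

alerts_checksum = unexpected_by_checksum(leaves, ours, {b"c1"})
alerts_signature = unexpected_by_signature(leaves, ours, {b"s1"})
```

Both variants flag the same leaf here. The difference is only in what
the key owner has to remember: the checksums of the messages signed, or
the signatures produced.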
The checksum uniquely (except for hash collisions) identifies a single
message M. But the signature itself also uniquely identifies a single
message M (it seems highly unlikely to have collisions, even if we allow
the public key to vary, and in case we insist on having the same public
key, any collision represents a break of the security of the signature
algorithm).
The difference is that the checksum can be (re)computed from M only,
while computing the signature also requires the private key. That
sounds like a big difference, but the only roles above that are expected
to know M, are the submitter and the verifier. The submitter by
definition knows the private key. And the verifier should be provided
with the signature by other means, and just verify it.
Regards,
/Niels