Document what happens if the primary-secondary setup is subject to a cosmic ray
Background:
- https://groups.google.com/a/chromium.org/g/ct-policy/c/PCkKU357M2Q/m/zmaD_ezAAQAJ
- https://groups.google.com/a/chromium.org/g/ct-policy/c/S17_j-WJ6dI
Fortunately, sigsum does not have SCTs, which makes things a bit better; submitters also resubmit entries by default until they have been merged correctly (i.e., without bit-flips). So my intuition is that this would not affect end users, and would mainly be a headache for log operators who need to communicate what happened, why, and who is affected. It would be a good idea to hash this out in detail, so that if/when it happens, it is already clear which steps to take.
Here's a brief sketch as a starting point. It is probably missing some cases and was noted down way too quickly, be warned. :)
- case a) bit-flip while the add-leaf request has yet to be merged in the primary
- case b) bit-flip after the add-leaf request was merged in the primary but before it was replicated to the secondary
- case c) bit-flip after the add-leaf request has been merged in both primary and secondary
- case d) bit-flip in the tree head right before the log signs it
- case e) more cases?
What can be flipped for add-leaf blobs?
- key_hash
- checksum
- signature
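To make the list above a bit more concrete, here is a rough sketch showing that a single flipped bit in any of the three fields yields a different leaf hash, so the sequenced leaf would no longer match what the submitter sent. The field sizes and their order, as well as the helper names `encode_leaf`, `flip_bit`, and `flipped_fields_changing_hash`, are assumptions made for illustration; the leaf hash uses an RFC 6962-style `0x00` prefix, which is also assumed here rather than taken from the log implementation.

```python
import hashlib

# Assumed field sizes in bytes and field order (illustrative only).
FIELDS = {"checksum": 32, "signature": 64, "key_hash": 32}


def encode_leaf(leaf: dict) -> bytes:
    """Concatenate the fixed-size fields in the assumed order."""
    return b"".join(leaf[name] for name in FIELDS)


def leaf_hash(leaf: dict) -> bytes:
    """RFC 6962-style leaf hash: SHA-256 over 0x00 followed by the leaf bytes."""
    return hashlib.sha256(b"\x00" + encode_leaf(leaf)).digest()


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted (bit 0 is the LSB of byte 0)."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def flipped_fields_changing_hash(leaf: dict) -> list:
    """List the fields for which a single bit-flip changes the leaf hash."""
    original = leaf_hash(leaf)
    changed = []
    for name in FIELDS:
        corrupted = dict(leaf)
        corrupted[name] = flip_bit(leaf[name], 0)
        if leaf_hash(corrupted) != original:
            changed.append(name)
    return changed


if __name__ == "__main__":
    # Dummy all-zero leaf; real values would come from an add-leaf request.
    leaf = {name: bytes(size) for name, size in FIELDS.items()}
    print(flipped_fields_changing_hash(leaf))
```

The point is only that no field is "safe" to flip: whichever one is hit, the sequenced entry differs from the submitted one. Whether the flipped leaf still passes signature verification is a separate question that this sketch does not model.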
What can be flipped relating to (signed) tree heads?
- (constants)
- tree size
- root hash
- signature
Drafty notes for why I think this would not really have that much impact:
- case a) we will sequence and replicate the wrong leaf request; the submitter already re-submits the right request, so recovery is automatic. The annoying part is if we sequence an invalid checksum or signature, which a monitor would spot and consider log misbehavior. I think this would be hand-waved as "oops, cosmic ray as you can see. Sorry about that".
- case b) I guess that, due to subtree caching in trillian, the secondary will download the flipped leaf, which will not correspond to the primary's tree, so everything halts (due to inconsistencies) before the primary moves its (externally signed) tree head forward.
- case c) I think this just means we would return the wrong (bit-flipped) log entry on the get-entries endpoint, so rebuilding the tree would fail for monitors, and inclusion queries for that entry would similarly fail. Once the bit-flip is fixed, we should be back in a good state.
- case d) witnesses would not cosign because the tree head is invalid; similarly, any inclusion proofs returned for this tree head would be invalid, so submitters would reject them. Strictly speaking a split-view, but one that would likely be resolved by explaining what happened and resetting state in trillian.
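
As a rough illustration of cases b) and c), the sketch below recomputes an RFC 6962-style Merkle root over two copies of the same leaves, one of which has a flipped bit, and reports the first index where the trees diverge. This is a naive recomputation for illustration only, not how trillian or the primary/secondary replication actually compares state; `merkle_root`, `first_divergence`, and `flip_bit` are hypothetical helper names.

```python
import hashlib


def leaf_hash(data: bytes) -> bytes:
    """RFC 6962 leaf hash: SHA-256(0x00 || data)."""
    return hashlib.sha256(b"\x00" + data).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    """RFC 6962 interior node hash: SHA-256(0x01 || left || right)."""
    return hashlib.sha256(b"\x01" + left + right).digest()


def merkle_root(leaves: list) -> bytes:
    """RFC 6962 Merkle tree hash over a list of leaf byte strings."""
    n = len(leaves)
    if n == 0:
        return hashlib.sha256(b"").digest()
    if n == 1:
        return leaf_hash(leaves[0])
    # Split at the largest power of two strictly smaller than n.
    k = 1
    while k * 2 < n:
        k *= 2
    return node_hash(merkle_root(leaves[:k]), merkle_root(leaves[k:]))


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit // 8] ^= 1 << (bit % 8)
    return bytes(out)


def first_divergence(primary: list, secondary: list):
    """Index of the first leaf at which the two trees' roots differ, or None."""
    for size in range(1, min(len(primary), len(secondary)) + 1):
        if merkle_root(primary[:size]) != merkle_root(secondary[:size]):
            return size - 1
    return None


if __name__ == "__main__":
    primary = [b"leaf-0", b"leaf-1", b"leaf-2", b"leaf-3", b"leaf-4"]
    secondary = list(primary)
    secondary[2] = flip_bit(secondary[2], 3)  # simulated cosmic ray
    print(first_divergence(primary, secondary))  # prints 2
```

Anyone rebuilding the tree from the flipped entries (a secondary or a monitor) ends up with a root that does not match the one derived from the correct entries, which is why I expect these cases to surface as a loud inconsistency rather than go unnoticed.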