Header-Based Patch Attestation
==============================
Author: Konstantin Ryabitsev <konstantin@linuxfoundation.org>
Status: Alpha, soliciting comments
Preamble
--------
Projects participating in decentralized development continue to use
RFC-2822 (email) formatted messages for code submissions and review.
This remains the only widely accepted mechanism for code collaboration
that does not rely on centralized infrastructure maintained by a single
entity, which necessarily introduces a single point of dependency and
a single point of failure.
RFC-2822 formatted messages can be delivered via a variety of means. To
name a few of the more common ones:
- email
- usenet
- aggregated archives (e.g. public-inbox)
Among these, email remains the most widely used transport mechanism for
RFC-2822 messages, most commonly delivered via subscription-based
services (mailing lists).
Email and end-to-end attestation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are two commonly used standards for cryptographic email
attestation: PGP and S/MIME. When it comes to patches sent via email,
there are significant drawbacks to both:
- Mailing list software may modify email body contents to add
subscription information footers, causing message attestation to
fail.
- Attestation via detached MIME signatures may not be preserved by
mailing list software that aggressively quarantines attachments.
- Inline PGP attestation generally frustrates developers working with
patches due to extra surrounding content and the escaping it
performs for strings containing dashes at the start of the line for
canonicalization purposes.
- Only the body of the message is attested, leaving metadata such as
"From", "Subject", and "Date" open to tampering. Git uses this
metadata to formulate git commits, so leaving them unattested is
suboptimal (they can be duplicated into the body of the message,
but git format-patch will not do this by default).
- PGP key distribution and trust delegation remains a difficult
problem to solve. Even if PGP attestation is available, the
developer on the receiving end of the patches may not make any use
of it due to not having the sender's key in their keyring.
- S/MIME certificates are increasingly difficult to obtain for
developers not working in corporate environments. At the time of
writing, only two commercial CAs continue to provide this service --
and only one does it for free.
For these reasons, end-to-end attestation is rarely used in communities
that continue to use email as their main conduit for code submissions
and review.
Email and domain-level attestation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Since unsolicited emails (spam) frequently forge headers in order to
appear to come from trusted sources, most major service providers
have adopted DKIM (RFC-6376) to provide cryptographic attestation for
header and body contents. A message that originates from gmail.com will
contain a "DKIM-Signature" header that attests the contents of the
following headers (among others):
- from
- date
- message-id
- subject
The "DKIM-Signature" header also includes a hash of the message body
(bh=) that is included in the final verification hash. When a DKIM
signature is successfully verified using a public key that is published
via gmail.com DNS records, this provides a degree of assurance that the
email message has not been modified since leaving gmail.com
infrastructure.
Just as with PGP and S/MIME attestation, this has important problems when it
comes to patches sent via mailing lists:
- If the "sender" header is included in the attestation, the DKIM
signature will no longer verify due to mailing lists necessarily
rewriting it for bounce handling.
- ML software commonly modifies the subject header in order to insert
list identification (e.g. ``[some-topic]``). Since the "subject"
header is almost always included in the list of headers attested
by DKIM, this causes DKIM signatures to fail verification.
- ML software also routinely modifies the message body for the
purposes of stripping attachments or inserting list subscription
metadata. Since the bh= hash is included in the final signature
hash, this results in a failed DKIM signature check.
Even if all of the above does not apply and the DKIM signature is
successfully verified, body canonicalization routines mandated by the
DKIM RFC may result in a false-positive successful attestation for
patches. The "relaxed" canonicalization instructs that all consecutive
whitespace is collapsed, so patches for languages like Python or GNU
Make where whitespace is syntactically significant may have different
code result in the same hash.
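To make the collision concrete, here is a minimal sketch of the
"relaxed" body canonicalization as we read it in RFC-6376, section
3.4.4. The names ``relaxed_body`` and ``body_hash`` and the sample
fragments are illustrative and are not taken from the POC. The two
Python fragments differ only in indentation depth, and therefore in
behaviour, yet produce the same body hash:

```python
import base64
import hashlib
import re


def relaxed_body(body: bytes) -> bytes:
    """Approximate DKIM 'relaxed' body canonicalization (RFC 6376, 3.4.4)."""
    lines = body.replace(b"\r\n", b"\n").split(b"\n")
    # Collapse runs of spaces/tabs to one space, then drop trailing whitespace.
    lines = [re.sub(rb"[ \t]+", b" ", line).rstrip(b" ") for line in lines]
    # Ignore empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    return b"".join(line + b"\r\n" for line in lines)


def body_hash(body: bytes) -> str:
    """Base64-encoded SHA-256 of the canonicalized body, like the bh= value."""
    digest = hashlib.sha256(relaxed_body(body)).digest()
    return base64.b64encode(digest).decode("ascii")


# run(x) is inside the "if" in the first fragment and outside it in the second.
inside = b"for x in xs:\n    if x:\n        log(x)\n        run(x)\n"
outside = b"for x in xs:\n    if x:\n        log(x)\n    run(x)\n"

print(body_hash(inside) == body_hash(outside))  # True
```

Leading whitespace is reduced to a single space rather than removed,
which is why any two non-zero indentation levels become
indistinguishable.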
DKIM works well enough for end-to-end email attestation, but has
important drawbacks for domain-level attestation of patches, especially
when they are delivered via mailing lists.
Proposal
--------
The goal of this document is to propose a scheme that would provide
cryptographic attestation for all message contents necessary for trusted
distributed code collaboration. It draws on the success of the DKIM
standard in order to adapt (and adopt) it for this purpose.
Anatomy of an email patch
~~~~~~~~~~~~~~~~~~~~~~~~~
A patch submitted via an RFC-2822 formatted message consists of the
following three significant parts:
- *metadata*, which includes the Author, Email, Subject, and Date of
the submission
- *commit message*, which describes what the change is supposed to
accomplish
- *diff content*, which is structured data that should be applied
to the codebase in order to implement the changes proposed
Patch submissions also routinely provide additional content that may
have significance to the author or to the reviewer, but is not preserved
in the codebase after patches are applied, such as:
- information describing changes between revisions
- statistics about what files are changed (diffstat)
- structured data indicating tree dependencies (base-commit)
- author's signature and software version info
- mailing list subscription metadata
Our goal is to provide attestation for the significant parts and ignore
the parts that are not preserved after code is committed to a git
repository.
Three hashes per patch
~~~~~~~~~~~~~~~~~~~~~~
Instead of creating a single attestation hash, we create a separate hash
for each meaningful part of the patch submission:
- **i**: patch metadata
- **m**: commit message
- **p**: diff content
This allows the person performing verification to identify which part of
the submission has been altered since being signed. A change to a commit
message may be explained by the addition of a ``Signed-off-by`` (or
similar) trailer, so the developer performing the review may ignore a
failure in the "m" hash if the other two hashes are passing.
Similarly, a patch that goes through a chain of maintainers will
necessarily have its commit message modified by the inclusion of various
trailers. Having a separate hash for the patch content and patch
metadata provides a way to track whether or not any of the
submaintainers made changes to the patch code, or just to the commit
message, as expected.
To generate the three parts, we rely on the ``git mailinfo`` command,
which does most of what we need::
git mailinfo m p > i < email.msg
The above command will produce three files that closely match what we
need, but require a bit of extra processing to remove content that is
likely to be altered in transmission.
To get the "m" hash, we take the "m" file as-is::
sha256sum m
To get the "i" hash, we remove the "Date" header from the output,
because it can be modified by git during format-patch or send-email
stages (or, infrequently, by SMTP relays). We only take the "Author",
"Email", and "Subject" headers::
grep -E '^(Author|Email|Subject)' i | sha256sum
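The same filtering can be sketched in Python. This assumes the "i"
output of ``git mailinfo`` is already in memory as text; the helper
name ``info_hash`` and the sample values are illustrative. It returns
a base64-encoded digest, which is the form the hash values shown in
this document appear to use, rather than the hex digest printed by
``sha256sum``:

```python
import base64
import hashlib
import re


def info_hash(mailinfo_output: str) -> str:
    """Hash only the Author, Email and Subject lines of `git mailinfo` output."""
    kept = [
        line
        for line in mailinfo_output.splitlines()
        if re.match(r"^(Author|Email|Subject)", line)
    ]
    data = "".join(line + "\n" for line in kept).encode("utf-8")
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


# Hypothetical mailinfo output, for illustration only.
sample = (
    "Author: Dev Eloper\n"
    "Email: dev@example.org\n"
    "Subject: fix the frobnicator\n"
    "Date: Wed, 16 Sep 2020 10:00:00 -0400\n"
)

# Changing only the Date line leaves the hash untouched.
redated = sample.replace("10:00:00", "11:30:00")
print(info_hash(sample) == info_hash(redated))  # True
```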
The "p" file requires the most work, as it contains data from the "below the
cut" portion of the commit message (usually, diffstat and revision
information), plus trailing content such as signatures or mailing list
subscription info. All of this is stripped away to leave just the diff
content.
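One possible way to do this stripping is sketched below. The rules
used here are assumptions made for illustration, not a description of
the POC: the diff is taken to start at the first ``diff --git`` line
and to end at a ``-- `` signature separator. The helper name
``diff_only`` is hypothetical, and a removed source line whose content
is "- " would be mistaken for the separator, so a real implementation
needs to be more careful:

```python
def diff_only(patch_text: str) -> str:
    """Keep lines from the first 'diff --git' up to a '-- ' signature marker."""
    kept = []
    in_diff = False
    for line in patch_text.splitlines():
        if line.startswith("diff --git "):
            in_diff = True
        elif line == "-- ":
            # Conventional signature separator; everything after it is dropped.
            break
        if in_diff:
            kept.append(line)
    return "".join(line + "\n" for line in kept)


# Hypothetical "p" file contents: diffstat, the diff, then a version footer.
p_file = (
    "---\n"
    " frob.c | 2 +-\n"
    " 1 file changed, 1 insertion(+), 1 deletion(-)\n"
    "\n"
    "diff --git a/frob.c b/frob.c\n"
    "--- a/frob.c\n"
    "+++ b/frob.c\n"
    "@@ -1 +1 @@\n"
    "-old\n"
    "+new\n"
    "-- \n"
    "2.26.2\n"
)

print(diff_only(p_file), end="")
```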
Why not use git patch-id?
~~~~~~~~~~~~~~~~~~~~~~~~~
Git provides a command to generate a "patch-id" that can be used to
quickly identify similar patches. To generate the patch-id hash, git
performs several canonicalization routines that make this hash
unsuitable for attestation purposes:
- it collapses all whitespace together
- it removes all line numbers from diff contents
It is possible for a malicious actor to create two patches that generate
identical patch-id hashes but have drastically different results in the
code. For more information, see the discussion here:
- https://lore.kernel.org/git/20200210164115.x4gciujyjisivfgi@chatter.i7.local/
X-Patch-Hashes header
~~~~~~~~~~~~~~~~~~~~~
After the i, m, p hashes are generated, we insert them into the email
message as a separate header. You can use the proof-of-concept code
included to generate one yourself::
$ ./main.py hashes-hdr
Using emails/unsigned.eml as message source
--- HEADER STARTS ---
X-Patch-Hashes: v=1; h=sha256; i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;
m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;
p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=
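Assembling the header from the three parts can be sketched as
follows. The function names are hypothetical, the inputs are
placeholders rather than real patch data, and the sketch emits the
header on a single line instead of folding it as the POC output above
does:

```python
import base64
import hashlib


def b64_sha256(data: bytes) -> str:
    """Base64-encoded SHA-256 digest, matching the 44-character values above."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def patch_hashes_header(i: bytes, m: bytes, p: bytes) -> str:
    """Build an X-Patch-Hashes header from the three prepared parts."""
    fields = [
        "v=1",
        "h=sha256",
        f"i={b64_sha256(i)}",
        f"m={b64_sha256(m)}",
        f"p={b64_sha256(p)}",
    ]
    return "X-Patch-Hashes: " + "; ".join(fields)


# Placeholder inputs standing in for the processed i, m and p content.
header = patch_hashes_header(b"info\n", b"message\n", b"diff\n")
print(header)
```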
Running POC code
~~~~~~~~~~~~~~~~
The POC code is written in Python and requires an extra set of libraries
in order to work. To get going, please do the following::
$ python3 -mvenv .venv
$ source .venv/bin/activate
$ pip install --upgrade pip
$ pip install -r requirements.txt
Domain-level attestation
------------------------
Once the X-Patch-Hashes header is generated and inserted into the email,
it will need to be signed in order to be useful for attestation
purposes. Adding domain-level signatures is the simplest way to
accomplish this, as it would allow entire companies to automatically
attest all patches sent out via their infrastructure.
This can be easily done by introducing a patch-attestation milter that
would automatically analyze body contents and generate the
X-Patch-Hashes header if it finds that the message contains a patch
(unless this header is already present). This milter can then either
create its own cryptographic signature or let the usual DKIM-signing
infrastructure create the necessary attestation.
Using vanilla DKIM
~~~~~~~~~~~~~~~~~~
Vanilla DKIM is well-suited for this purpose, as it was specifically
created to sign email headers. The following changes will need to be
made to the configuration:
- add "x-patch-hashes" to the list of signed headers
- ensure that "sender" is not included
- potentially, exclude "subject" from the list of signed headers, in
order to hedge against mailing lists that add ``[topic]`` to all
email subjects
Here's how it looks with the POC command, using the bundled rsa.key::
./main.py sign-dkim
Signing: plain DKIM
Using emails/unsigned.eml as message source
Using rsa.key to sign
--- MESSAGE STARTS ---
[...]
X-Patch-Hashes: v=1; h=sha256; i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;
m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;
p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=example.org;
i=@example.org; q=dns/txt; s=patches; t=1600264001; h=from : date :
x-patch-hashes; bh=g2Sv1ZR+jIrWukzdXbqb+aeiqyFQOBLDQY6z0BBnGg4=;
b=pphhMzvqehfxDDLx/OqjbrP6HnMjhlklrQacWqwf5bpZ3cVZ00z5D+BcwpzsKnpQF7c7A
2FmO6Mtjtn/lVRwppIF+tlph46sLE9XfdS+60X6Bzzxu1u/l0uieQ+cIT3DjUuejfxVpvIE
Zd4oAeVHD/OWRTJrWGYzrK3e+9UpIZJnxRkJLNj9OKOCwZDiGobM6+NusTWduqjYLRlMXXt
EvRbs8QXsTkoTttngM5DwSFRXC7zYSprKxbL6i/DdE+GM+iN2UQk10lpVfhYXtDBoKX1/vX
CXb77/X1ug1/ktfYU1xEDUU/NrovqnAfcJHCAL2lHomznoi/IYBC1qfR5t2w==
[...]
Note that the b= value will be different for you, since the timestamp is
included in the hashed content and changes each time the code runs.
This header was created by a generic DKIM implementation (dkimpy),
commonly used in production via the popular dkimpy-milter daemon.
This POC also includes a few example emails signed by the kernel.org DKIM
key. You can run the POC verification yourself::
./main.py -m emails/korg-signed-dkim.eml verify
Using emails/korg-signed-dkim.eml as message source
Verifying: Plain DKIM
DNS-lookup: default._domainkey.kernel.org.
PASS : identity and domain match From header
PASS : time drift between Date and t (2 days, 23:24:18)
PASS : DKIM signature for d=kernel.org, s=default
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
As you can see, the verification steps will check several things:
- that the DKIM signature passes verification (this is done by
normalizing and concatenating all signed headers, plus the
DKIM-signature header itself, minus the signature content following
b=)
- that the x-patch-hashes header is included in the content attested
by DKIM
- that the domain (d=) and identity (i=) values match what is in the
From: field of the email message
- that time drift between the Date header and the timestamp of the
signature is reasonable
- that all patch hashes that we generate match the hashes in the
signed header
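The last two checks in this list can be sketched in a few lines. This
is an illustration under stated assumptions, not the POC code: the
helper names are hypothetical, and the seven-day drift limit is an
arbitrary value chosen for the example. The header value used below
is the one from the examples in this document:

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def parse_tags(value: str) -> dict:
    """Split a 'k=v; k=v' header value into a dict, dropping folding whitespace."""
    tags = {}
    for chunk in value.split(";"):
        if "=" in chunk:
            key, val = chunk.split("=", 1)
            tags[key.strip()] = "".join(val.split())
    return tags


def check_hashes(header_value: str, computed: dict) -> dict:
    """Compare signed i/m/p hashes with locally computed ones, part by part."""
    signed = parse_tags(header_value)
    return {part: signed.get(part) == computed.get(part) for part in ("i", "m", "p")}


def drift_ok(date_header: str, t: str, limit: timedelta = timedelta(days=7)) -> bool:
    """True if the Date header and the signature timestamp are close enough."""
    sent = parsedate_to_datetime(date_header).timestamp()
    return abs(int(t) - sent) <= limit.total_seconds()


signed_value = (
    "v=1; h=sha256; i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;\n"
    " m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;\n"
    " p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484="
)
# Pretend the locally computed metadata hash differs (e.g. altered subject).
computed = {
    "i": "not-the-signed-value",
    "m": "yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=",
    "p": "iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=",
}
for part, ok in check_hashes(signed_value, computed).items():
    print("PASS" if ok else "FAIL", part)
```

Reporting each part separately is what lets a reviewer accept, for
example, a failing "m" hash when only trailers were added.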
Note that this check specifically excludes checking the body hash (bh=)
value, for the reasons described in the previous section concerning DKIM
drawbacks. Also, since we excluded "subject" from the list of signed
headers, the verification will succeed even with the usual
Mailman-induced changes to the email content::
./main.py -m emails/korg-signed-dkim-with-ml-junk.eml verify
Using emails/korg-signed-dkim-with-ml-junk.eml as message source
Verifying: Plain DKIM
DNS-lookup: default._domainkey.kernel.org.
PASS : identity and domain match From header
PASS : time drift between Date and t (2 days, 23:24:18)
PASS : DKIM signature for d=kernel.org, s=default
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
However, since we include the subject of the commit (as git sees it)
in the "i" hash, any changes to the subject header that aren't extra
prefixes like ``[topic]`` will result in verification failure::
./main.py -m emails/korg-signed-dkim-changed-subject.eml verify
Using emails/korg-signed-dkim-changed-subject.eml as message source
Verifying: Plain DKIM
DNS-lookup: default._domainkey.kernel.org.
PASS : identity and domain match From header
PASS : time drift between Date and t (2 days, 23:24:18)
PASS : DKIM signature for d=kernel.org, s=default
----- ---------------
FAIL : metadata
PASS : commit message
PASS : diff content
----- ---------------
FAIL : Some or all hashes failed verification
Using the X-Patch-Sig header
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
There are several reasons why you may not want to use DKIM for the
purpose of attesting the X-Patch-Hashes header:
- you may not have sufficient control over the infrastructure
performing DKIM signing, for example if your company uses a
commercial upstream relayhost that performs DKIM signing for your
domain
- you may not want to exclude the "subject" header from your DKIM
configuration, as it reduces the overall scope of your email
attestation
- you may not want to rely on DNS for the purposes of public key
lookups, since DNS records are easily spoofed (and DNSSEC adoption
is still very low)
For these reasons, we also introduce a separate "X-Patch-Sig" header
that acts as a compatible subset of the DKIM RFC:
- we only use the "x-patch-hashes" header, omitting the need for the
h= record, and always normalize it as "relaxed"
- we omit the bh= field entirely
- we omit the v= field, since we will rely on the v= value in the
X-Patch-Hashes header for versioning info
- we add the m= field to indicate the signature mode (dk, wk, pgp,
wkd, discussed below)
- for the purposes of the POC, we hardcode the algorithm to
ed25519-sha256, though other algorithms like rsa-sha256 or
rsa-sha512 can be easily implemented
The signature is generated in the exact same way as the DKIM signature,
by concatenating the x-patch-hashes header and the x-patch-sig header
(after normalizing them using the "relaxed" mode), obviously excluding
the content that follows b=.
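A minimal sketch of how the data to be signed could be assembled is
shown below. It follows our reading of the "relaxed" header
canonicalization in RFC-6376, section 3.4.2; the function names are
hypothetical, and the actual Ed25519 signing step is left out because
it needs a third-party library:

```python
import hashlib
import re


def relaxed_header(name: str, value: str) -> bytes:
    """Approximate DKIM 'relaxed' header canonicalization (RFC 6376, 3.4.2)."""
    value = re.sub(r"\r?\n", "", value)  # unfold continuation lines
    value = re.sub(r"[ \t]+", " ", value).strip(" ")
    return f"{name.lower().strip()}:{value}\r\n".encode("ascii")


def signing_payload(hashes_value: str, sig_value: str) -> bytes:
    """Concatenate both canonicalized headers, with the b= content removed."""
    unsigned_sig = re.sub(r"((?:^|;)\s*b=)[^;]*", r"\1", sig_value)
    payload = relaxed_header("X-Patch-Hashes", hashes_value)
    # As in DKIM, the signature header itself carries no trailing CRLF.
    payload += relaxed_header("X-Patch-Sig", unsigned_sig).rstrip(b"\r\n")
    return payload


# Shortened placeholder values, folded the way a mail client might fold them.
hashes_value = "v=1; h=sha256; i=AAA=;\n m=BBB=;\n p=CCC="
sig_value = (
    "m=dk; d=example.org; i=@example.org; s=patches; t=1600268242;\n"
    " a=ed25519-sha256;\n"
    " b=Ot3276T9\n"
    " 3hmxYCG9"
)

payload = signing_payload(hashes_value, sig_value)
# With ed25519-sha256, this digest is what gets handed to the signer.
print(hashlib.sha256(payload).hexdigest())
```

Because the b= content is blanked before hashing, a verifier can
rebuild the same payload from a received message and check the
signature against it.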
Here's the result of running the POC code, using the bundled dk.key::
./main.py sign-dk
Signing: X-Patch-Sig header using dk mode
Using emails/unsigned.eml as message source
--- MESSAGE STARTS ---
[...]
X-Patch-Hashes: v=1; h=sha256; i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;
m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;
p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=
X-Patch-Sig: m=dk; d=example.org; i=@example.org; s=patches; t=1600268242;
a=ed25519-sha256;
b=Ot3276T9ebQJ5Rzof7TNjz70IVpq9y/4ggevAO9iHVDg3P2tgBesuu2w/6mRIZ6m7mYuy22fNUW
3hmxYCG9VCegq3sEw9y0B7Poj6fvA6ZBcza41HhCNxb5J44UFgnDM
[...]
DK Mode
~~~~~~~
The DK mode is fully compatible with the DKIM standard and will perform
the exact same DNS query to look up the public key for the selector
specified::
./main.py -m emails/korg-signed-dk.eml verify
Using emails/korg-signed-dk.eml as message source
Verifying: X-Patch-Sig (mode=dk)
DNS-lookup: patches._domainkey.kernel.org.
PASS : identity and domain match From header
PASS : time drift between Date and t (4 days, 5:56:18)
PASS : mode=dk signature verified for: d=kernel.org, i=@kernel.org, s=patches
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
WK Mode
~~~~~~~
Instead of looking up the public key using DNS, we perform an HTTPS
lookup. This has the advantage of being more secure, but requires
caching, TTL expiration, and proxy configuration by the client, and is
more fragile because the web is less distributed than the
fault-tolerant implementation of DNS.
The query is performed to the domain name specified in the signature,
using the following rule::
https://[domain]/.well-known/_domainkey/[selector].txt
We have it set up for kernel.org and you can perform a verification
lookup using the provided example::
./main.py -m emails/korg-signed-wk.eml verify
Using emails/korg-signed-wk.eml as message source
Verifying: X-Patch-Sig (mode=wk)
Retrieving: https://kernel.org/.well-known/_domainkey/patches.txt
PASS : identity and domain match From header
PASS : time drift between Date and t (4 days, 6:18:45)
PASS : mode=wk signature verified for: d=kernel.org, i=@kernel.org, s=patches
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
Developer-level attestation
---------------------------
The domain-level attestation has significant advantages, but also
important drawbacks:
- advantage: it allows auto-enrolling entire companies, without the
need for individual developers to make any changes to their usual
routines
- advantage: it piggybacks on the existing DKIM standard, which has
a proven track record
- disadvantage: it requires changes to the IT infrastructure, including
adding a new milter daemon to the authenticated SMTP relay, which has
security and stability implications
- disadvantage: it requires explicit trust that the infrastructure
performing the hashing and signing has not been compromised by
malicious attackers
- disadvantage: it allows someone with access to a compromised account
to send out patches purporting to be coming from an official employee
of the company
- disadvantage: it is not useful to unaffiliated developers sending
patches from generic email addresses (gmail, yahoo, hotmail, etc.).
These disadvantages can be mitigated by allowing individual developers
to provide their own signatures, using the "pgp" and "wkd" modes of the
X-Patch-Sig header.
PGP mode
~~~~~~~~
Many open-source projects already provide a mechanism for developers to
exchange and use PGP keys for the purposes of code attestation (e.g. via
signed git tags and git commits). We can easily use GnuPG to provide the
signature content of the X-Patch-Sig header.
Here are the headers from emails/mricon-signed-pgp.eml::
X-Patch-Hashes: v=1; h=sha256;
i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;
m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;
p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=
X-Patch-Sig: m=pgp; i=mricon@kernel.org; s=0xE63EDCA9329DD07E;
b=iHUEABYIAB0WIQR2vl2yUnHhSB5njDW2xBzjVmSZbAUCX1+/nQAKCRC2xBzjVmSZbFiQAQD42c
l5It3AVJbtkwbY5XZxb9I9YuvvX3L3buU+EwjumwD9HBH8t6xcavIKQF6dwKjsmhwJnDj1tCfaxg
3WRdUllgM=
Since a lot of the attesting information is already embedded into the
PGP signature itself, the signature structure is different from the "dk"
or "wk" mode:
- we don't need to know the domain, since we won't be doing any
lookups on our own (GnuPG can handle this, if configured)
- the selector field identifies the public key ID of the certification
subkey, for ease of lookups
- the identity field is informational only, but can be used by GnuPG
to perform WKD lookups, if it matches the From header (not
implemented)
- the timestamp field is missing, since this data is embedded into the
PGP signature itself
On the verification side, if the key specified by the selector is
already present in the verifier's default keyring, we will verify that
the signature is GOOD, VALID, and that it is either TRUST_FULLY or
TRUST_ULTIMATE.
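The decision described above can be sketched as a check over GnuPG's
machine-readable status lines. This assumes the status output was
captured from a ``gpg --verify`` run with ``--status-fd``; the helper
name ``signature_trusted`` and the sample output are illustrative,
and the key ID and fingerprint in it are placeholders:

```python
def signature_trusted(status_output: str) -> bool:
    """True if GnuPG reported a good, valid signature from a trusted key."""
    keywords = set()
    for line in status_output.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "[GNUPG:]":
            keywords.add(parts[1])
    good = {"GOODSIG", "VALIDSIG"} <= keywords
    trusted = bool(keywords & {"TRUST_FULLY", "TRUST_ULTIMATE"})
    return good and trusted


# Placeholder status output, trimmed to the lines that matter here.
sample_status = (
    "[GNUPG:] GOODSIG 0123456789ABCDEF Example Developer <dev@example.org>\n"
    "[GNUPG:] VALIDSIG 0123456789ABCDEF0123456789ABCDEF01234567 2020-09-14\n"
    "[GNUPG:] TRUST_FULLY 0 pgp\n"
)

print(signature_trusted(sample_status))  # True
```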
If the key is not present in the verifier's default keyring, the POC
will check if there is a matching entry in
.keys/openpgp/keys/[keyid].asc, and if so, will use
.keys/openpgp/pubring.kbx for performing the verification. In this case,
TRUST_* fields are not used, as they will always be "unknown".
In-git key distribution is discussed further below.
WKD mode (EXPERIMENTAL)
~~~~~~~~~~~~~~~~~~~~~~~
I wanted to provide a way for developers to use a WK-like mode for
public key lookups as an alternative to PGP. The signature is generated
just like for the domain-level WK mode, using the ed25519 key provided
by each individual developer.
Here's the POC running with the bundled "ingit.key"::
./main.py sign-wkd
Signing: X-Patch-Sig header using wkd mode
Using emails/unsigned.eml as message source
--- MESSAGE STARTS ---
[...]
X-Patch-Hashes: v=1; h=sha256; i=pkD5Pg8+cndZAzQQzo3RBSOOUzZM3GYWxiFIKFGIKe0=;
m=yW4TvC/DGWCUJTa11Aw1b/2ZAXobsLD45aLA/440yQI=;
p=iJdYN6+isP/3HmQaf1IiG7OfA1vzRxXlPGZtvecS484=
X-Patch-Sig: m=wkd; d=example.org; i=dev@kernel.org; s=patches; t=1600270651;
a=ed25519-sha256;
b=/s2WOrzK2tmqCYj3x22uck6Yi6V1ODX+PZiE2TLstSoVDGvTAaYoPZwmO7IKbUC148KEeGVXB0W
g+wGNtQn3AmUsvnoX0Jppqc5ei6GDzr0yMQKzEbUt0DkPrd/Y000b
[...]
It is very similar to content created in the "DK" or "WK" mode, except
the identity field includes the entire email address of the developer.
When we verify the attestation, we will do the following:
- check if that key is available in .keys/devkey/[domain]/[local]/[selector].txt
- if it is not present, we perform an HTTPS query to
https://[domain]/.well-known/devkey/[zbase32-encoded-hash-of-local]/[selector].txt
The hashing and zbase32 encoding are chosen to be compatible with
OpenPGP's WKD implementation and are done to prevent someone from easily
finding out full email addresses from the directory listing.
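The two lookup locations can be derived as sketched below. The sketch
assumes the same hashing as OpenPGP's WKD, that is, SHA-1 over the
lowercased local part followed by zbase32 encoding; the function
names are hypothetical and the sketch only builds the paths, it does
not fetch anything:

```python
import hashlib

ZBASE32 = "ybndrfg8ejkmcpqxot1uwisza345h769"


def zbase32(data: bytes) -> str:
    """Encode bytes with the zbase32 alphabet, five bits per character."""
    bits = "".join(f"{byte:08b}" for byte in data)
    bits += "0" * (-len(bits) % 5)  # pad to a multiple of five bits
    return "".join(ZBASE32[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5))


def devkey_locations(identity: str, selector: str) -> tuple:
    """Return (in-git path, HTTPS URL) for a developer key, in lookup order."""
    local, domain = identity.rsplit("@", 1)
    hashed = zbase32(hashlib.sha1(local.lower().encode("utf-8")).digest())
    in_git = f".keys/devkey/{domain}/{local}/{selector}.txt"
    url = f"https://{domain}/.well-known/devkey/{hashed}/{selector}.txt"
    return in_git, url


in_git, url = devkey_locations("dev@kernel.org", "patches")
print(in_git)
print(url)
```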
You can run the verification using the POC example. Here it is without
a matching in-git key::
./main.py -m emails/mricon-signed-wkd.eml verify
Using emails/mricon-signed-wkd.eml as message source
Verifying: X-Patch-Sig (mode=wkd)
Retrieving: https://kernel.org/.well-known/devkey/sapsizz4qsj4zmmscbz9f7y8cunt496y/patches.txt
PASS : identity and domain match From header
PASS : time drift between Date and t (4 days, 6:58:47)
PASS : mode=wkd signature verified for: d=kernel.org, i=mricon@kernel.org, s=patches
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
Here it is with the public key provided in the git repository itself::
./main.py -m emails/dev-signed-wkd-ingit.eml verify
Using emails/dev-signed-wkd-ingit.eml as message source
Verifying: X-Patch-Sig (mode=wkd)
Loading: WKD key from /var/home/user/work/git/patch-attestation-poc/.keys/devkey/kernel.org/dev/patches.txt
PASS : identity and domain match From header
PASS : time drift between Date and t (4 days, 7:28:47)
PASS : mode=wkd signature verified for: d=kernel.org, i=dev@kernel.org, s=patches
----- ---------------
PASS : metadata
PASS : commit message
PASS : diff content
----- ---------------
PASS : All hashes verified
The structure and nature of the WKD mechanism are entirely up for
discussion, along with everything else in this README.
Public keys bundled with git repos
----------------------------------
TBA.