refs/heads/whitespaces - pub/scm/linux/kernel/git/viro/sparse

commit: aeddbcf2878ebf8bc5a489c13922d82ae0c9a277
author: Al Viro <viro@zeniv.linux.org.uk>
Sun Mar 15 19:53:16 2026 -0400
committer: Al Viro <viro@zeniv.linux.org.uk>
Wed Apr 01 05:31:30 2026 -0400
tree: 890a988da02d3bf7166f0942cdbd3832820c7315
parent: 18732e35971d11d6ed44438be912116c37e2bb2c

try to get whitespaces right

The preprocessor started as a text filter - it took a stream of characters
and produced a stream of characters, with the output fed to the compiler.
The semantics were not _quite_ as bad as with Bourne shell, but they
were full of dark corners all the same.

By C99 times it had been replaced with something far better defined.
Now it operates on a stream of tokens, and the output stream is _not_
fed through the tokenizer again.  Operations are defined in terms of
manipulations of that stream of tokens, which avoids a lot of
headaches.

Unfortunately, it's not exactly a stream of tokens - it's a stream of
tokens and whitespaces.  Which is where things get interesting.
When the standard says "replace the tokens from <here> to <there> with
the sequence of tokens obtained by <this>", the general understanding
is that
	* whatever whitespaces might've been between <here> and <there>
are removed
	* any whitespaces prior to <here> and past <there> remain as-is
	* any whitespaces between the tokens of the sequence we are
told to substitute go into the result.
What's not agreed upon is what should be done to possible leading or
trailing whitespaces in the sequence.

In a lot of cases it doesn't matter - subsequent phases care only about
tokens.  Moreover, '#' operator (which does care about whitespaces) is
explicitly required to trim any leading and trailing whitespace.  However,
there are cases where it is observable and different implementations
yield different results.

Currently sparse is prone to losing whitespace in many situations
where it definitely shouldn't.  This commit attempts to fix that;
the semantics I'm going for match what clang is doing, namely "trim the
leading and trailing whitespace off the sequence being substituted".
I would consider doing what gcc does (and it does diverge from clang in
some cases), but... it's full of interesting corner cases and downright
broken around combining __VA_OPT__() with ##.  When/if they decide what
to do with that...

What this commit does is
	* don't lose ->pos.{newline,whitespace} deposited on TOKEN_UNTAINT
tokens; when scan_next() finds and skips those, have it collect their
->pos.{whitespace,newline}.  If we are not at the end of the list and
eventually get to a normal token, add the collected flags to it. Turns
out that the comment in expand() had been too pessimistic - it's _not_
terribly costly, provided that the slow case of scan_next() is uninlined.

	* have substitute() keep better track of pending whitespace.
The tricky part is the logic related to placemarkers and concatenation;
see the comments in front of substitute() for details.

Note: gcc code generation for bitfields really, really stinks.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

5 files changed
