Machine assisted license cleanup
--------------------------------
1. Tools
1.1 scancode toolkit
A license scanner tool which can be run from the command line and
provides excellent parallelisation. While fast, it is best
run on a machine with tons of CPUs and tons of memory.
A run with 128 parallel scan threads takes about 15 minutes. Go
figure how long it will take on your laptop :)
https://github.com/nexB/scancode-toolkit
1.2 spdx helper scripts
A bunch of horrible python scripts with even more horrible shell
glue.
git://git.kernel.org/pub/scm/utils/spdx/spdx-utils
The main workhorse is lcheck.py. I wrote it initially to gather
statistics and other information, but over time it evolved into a
Swiss army knife. lcheck.py --help gives you the gory details; no
manpage, sorry.
1.3 git
The git tools must be available.
A clean linux tree must be cloned. Ensure that there are no
artifacts from editing, patch directories etc.
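A minimal sketch of how the "clean tree" requirement could be checked
before starting a run. This is not part of the spdx scripts; the function
names are made up for illustration. It relies only on
'git status --porcelain --ignored', which lists modified, untracked and
ignored files, one per line.

```python
import subprocess

def leftover_paths(porcelain_output):
    """Return the paths listed in 'git status --porcelain' output.

    Each line has the form 'XY path': two status characters, a space,
    then the path, so the path starts at column 3.
    """
    return [line[3:] for line in porcelain_output.splitlines() if line.strip()]

def tree_is_clean(repo_path):
    """True if the tree has no modified, untracked or ignored files."""
    out = subprocess.run(
        ["git", "-C", repo_path, "status", "--porcelain", "--ignored"],
        check=True, capture_output=True, text=True,
    ).stdout
    return not leftover_paths(out)
```

Calling tree_is_clean("path/to/linux/kernel") before invoking the
runscript catches stray patch directories and editor backups early.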
To reproduce the setup (in case you have a big enough machine or
lots of time for thumb twiddling):
- Install scancode and git. If you need help with scancode talk
to Philipe.
- Clone the linux kernel
- Clone the spdx scripts
- cd into the spdx scripts directory
- invoke the runscript with:
./runall.sh path/to/linux/kernel
The path can be relative or absolute
- Wait ....
- Check the results in the stepX directories
- Check the results in the kernel directory (each step creates a
branch).
For your convenience:
Aside from the master branch, the spdx-utils repository contains a
branch named linux-5.0. That branch contains:
- the scancode json files for each step
- the stats.txt file for each step
- the rules which are handled in each step
- the resulting patches
The resulting git tree is pushed to:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/linux-spdx.git
Branches step1, step2, step3 contain the steps documented below.
2. Approach
The Documentation directory is ignored for now. That needs some extra
care.
2.1 Files with no license
These files have not been touched during the first large sweep.
2.1.1 Build files
Make/Kconfig files without license information
2.1.2 Source files which have only MODULE_LICENSE("GPL") and/or
EXPORT_SYMBOL_GPL()
Now that MODULE_LICENSE is clarified this can be tackled.
The scripts identify these files in the scanner result and add the
proper license identifier (GPL-2.0-only).
The scripts generate patches which can be applied with quilt or imported
into git with 'git quiltimport'.
SPDX count goes from 22574 to 25712
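To illustrate the identifier insertion described in 2.1.2, here is a
rough sketch. It is not the code from the spdx scripts, and the function
names are invented. It assumes the scanner has already established that
the file carries no other license text, and it follows the kernel
convention of a '//' comment on the first line of C source files and a
'/* */' comment in headers.

```python
import re

SPDX_TAG = "SPDX-License-Identifier:"
MODULE_LICENSE_RE = re.compile(r'MODULE_LICENSE\s*\(\s*"GPL"\s*\)')

def spdx_line(filename, license_id):
    """Build the identifier line in the comment style for this file type."""
    if filename.endswith(".h"):
        return "/* %s %s */\n" % (SPDX_TAG, license_id)
    return "// %s %s\n" % (SPDX_TAG, license_id)

def add_spdx(filename, text, license_id="GPL-2.0-only"):
    """Prepend an SPDX line if the file only hints at GPL via module macros."""
    if SPDX_TAG in text:
        return text  # already tagged, leave untouched
    if not MODULE_LICENSE_RE.search(text) and "EXPORT_SYMBOL_GPL" not in text:
        return text  # no GPL hint at all, not handled in this step
    return spdx_line(filename, license_id) + text
```

The function is idempotent, so running it twice over the same file does
not stack identifier lines.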
2.2 Files with a single license: GPL-2.0-only or GPL-2.0-or-later
The scripts handle the following tasks:
- Find the affected files in the scanner output
- Generate a list of match rules which represent a unique pattern
This is achieved by normalizing the texts (removing formatting,
white space damage, uppercase / lowercase and punctuation damage).
- Add the appropriate license header and remove the boilerplate
text or the license reference.
- Create a patch series. Each patch contains only the modifications
for a single match rule. The rule (and any variants)
are saved in the change log of each patch to ease review.
- Once a reference dataset (compliance data provided by Siemens) is
available the scripts will also check for conflicts with that
data set.
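The normalization idea from the list above can be sketched as follows.
This is only an illustration under simple assumptions (ASCII license
text, every non-alphanumeric character treated as noise); the actual
scripts may normalize differently, and the function names are
hypothetical.

```python
import re
from collections import defaultdict

def normalize(text):
    """Reduce a license notice to a canonical form.

    Lowercases the text and replaces every run of non-alphanumeric
    characters (comment decoration, punctuation, white space) with a
    single space.
    """
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())

def group_by_rule(notices):
    """Map each normalized notice to the sorted list of files carrying it.

    'notices' maps a file path to the raw notice text found in it.
    Each key of the result corresponds to one match rule, and thereby
    to one patch in the series.
    """
    rules = defaultdict(list)
    for path, text in notices.items():
        rules[normalize(text)].append(path)
    return {rule: sorted(paths) for rule, paths in rules.items()}
```

Two notices which differ only in comment style, line breaks or
capitalization end up under the same key and therefore in the same patch.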
This results in 515 patches at the moment.
The scripts generate patches which can be applied with quilt or
imported into git with 'git quiltimport'
SPDX count goes from 25712 to 46368
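For reference, a count like the one above could be obtained roughly as
shown below. How the scripts actually compute the number is not stated
here, so this is an assumption: a file counts if one of its first few
lines contains 'SPDX-License-Identifier:'. The function name and the
max_lines parameter are made up.

```python
import os

def count_spdx_files(root, max_lines=5):
    """Count files under 'root' with an SPDX tag in their first lines."""
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into the git metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="replace") as f:
                    head = [f.readline() for _ in range(max_lines)]
            except OSError:
                continue  # unreadable file or dangling symlink
            if any("SPDX-License-Identifier:" in line for line in head):
                count += 1
    return count
```

Running it on each step branch gives a quick sanity check that a patch
series tagged as many files as expected.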
2.3 Files with GPL-2.0-only/or-later and Linux-OpenIB
Basically the same as above just with dual licensing.
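For the dual licensed case the identifier becomes an SPDX expression
which joins the alternatives with OR. A tiny sketch, with a hypothetical
helper name, assuming plain license ids without WITH exceptions:

```python
def spdx_expression(license_ids):
    """Join alternative license ids into an SPDX 'OR' expression."""
    ids = list(license_ids)
    if not ids:
        raise ValueError("at least one license id is required")
    return " OR ".join(ids)

def spdx_comment(license_ids):
    """Build the first-line comment for a C source file."""
    return "// SPDX-License-Identifier: " + spdx_expression(license_ids)
```

For a file under GPL-2.0-only and Linux-OpenIB this yields
'// SPDX-License-Identifier: GPL-2.0-only OR Linux-OpenIB'.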
2.4 More fun later :)
I have quite a bunch of steps in preparation, but let's get the above
agreed on and reviewed first.