Occasionally, compiling the Linux kernel with a new compiler flag will result in a boot failure. If you are lucky, there will be some output to the serial console but that may not happen if the issue happens in early boot code before the serial driver has loaded. When this happens, it usually requires building part of the kernel without the compiler flag (or the “negative” version of it) to try and figure out the exact translation unit and function that causes the problem. I’ll go over this process at a high level to help others who might encounter this same issue.
I will be honest up front: this process is not clear-cut. There is some “feel” to it, but I will do my best to explain it. If any of this is confusing, please let me know where you got lost and I will do my best to clarify this post for future travellers.
1. Figure out what flag is causing the issue
Hopefully you are not making multiple code generation flag changes at once ;) but if you are, you need to figure out which flag is causing the problem.
The whole preface of this blog post came from the ClangBuiltLinux issue “INIT_STACK_ALL_ZERO - Framework Laptop system freezes (kernel oops?) on boot”. The reporter already figured out that CONFIG_INIT_STACK_ALL_ZERO was responsible for the change, which we can see in the main kernel Makefile corresponds to:
    $ sed -n '826,832p' Makefile
    ifdef CONFIG_INIT_STACK_ALL_ZERO
    KBUILD_CFLAGS += -ftrivial-auto-var-init=zero
    ifdef CONFIG_CC_IS_CLANG
    # https://bugs.llvm.org/show_bug.cgi?id=45497
    KBUILD_CFLAGS += -enable-trivial-auto-var-init-zero-knowing-it-will-be-removed-from-clang
    endif
    endif
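The line numbers in that sed command only hold for one particular tree, so it can help to look the block up by config symbol instead. The following is a minimal sketch under some assumptions: show_config_block is a hypothetical helper name of my own (Kbuild does not provide it), and it only handles options guarded by a plain ifdef, as in the excerpt above.

```shell
# show_config_block <CONFIG_SYMBOL> [makefile]
# Print every "ifdef <CONFIG_SYMBOL> ... endif" block found in the given
# makefile (default: ./Makefile). Nested conditionals are counted so that
# printing stops at the endif that matches the opening ifdef.
# Hypothetical helper for illustration; not part of Kbuild.
show_config_block() {
    config=$1
    makefile=${2:-Makefile}
    awk -v sym="$config" '
        # Opening line of the block we are looking for.
        depth == 0 && $1 == "ifdef" && $2 == sym { depth = 1; print; next }
        # Inside the block: print everything and track nesting depth.
        depth > 0 {
            print
            if ($1 ~ /^(ifdef|ifndef|ifeq|ifneq)$/) depth++
            else if ($1 == "endif") depth--
        }
    ' "$makefile"
}
```

Run from the top of a kernel tree, `show_config_block CONFIG_INIT_STACK_ALL_ZERO` should print the same block as the sed command, provided the block lives in the top-level Makefile of the tree you are looking at.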
2. Figure out if there is a “negative” flag
Once the problematic flag has been uncovered, we want to know if it has a “negative” version; in other words, a flag that turns the problematic flag off. In the case of -ftrivial-auto-var-init=zero, we can go to Clang’s command line reference and see that the default is uninitialized, which means that -ftrivial-auto-var-init=uninitialized will “undo” -ftrivial-auto-var-init=zero, since the compiler honors the last occurrence of this flag on the command line. This allows us to append -ftrivial-auto-var-init=uninitialized to the flags for certain files to see if the issue disappears, which will allow us to discover the translation unit that has the issue.
If there is not a negative flag, you will have to resort to removing the flag from folders and translation units, which is possible, but I am not going to cover that in this post.
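Before sprinkling a negative flag around the tree, it can be worth confirming that your compiler actually accepts it as spelled. Here is a small sketch of such a check; cc_accepts_flag is a hypothetical helper of my own, loosely modelled on the idea of compiling an empty translation unit with the flag, and it assumes a GCC- or Clang-style command line.

```shell
# cc_accepts_flag <compiler> <flag>
# Succeed (exit status 0) if <compiler> can compile an empty C translation
# unit with <flag>; fail otherwise. Hypothetical helper for illustration.
cc_accepts_flag() {
    cc=$1
    flag=$2
    out=$(mktemp) || return 1
    status=0
    # -Werror makes "unknown option" warnings fatal; /dev/null is the
    # (empty) C source, selected explicitly with "-x c".
    "$cc" -Werror "$flag" -c -x c /dev/null -o "$out" >/dev/null 2>&1 || status=$?
    rm -f "$out"
    return "$status"
}
```

For example, `cc_accepts_flag clang -ftrivial-auto-var-init=uninitialized && echo supported` prints "supported" only when the compile succeeds with your clang.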
3. Strategically start adding negative flag to translation units
The Linux kernel’s build system (Kbuild) allows one to include flags for individual translation units (.o) and entire folders. You can read about the details of each in the Linux Kernel Makefiles documentation. First, we will use subdir-ccflags-y to add the negative flag to entire subfolders, then slowly work towards individual translation units. Picking the folder or folders to start in is one of the hardest parts of this process, as it is not the same every single time.
To start, it is pretty safe to begin with arch/x86/Kbuild (or the equivalent for your architecture) and mm/Makefile, as that is all code that runs early in boot, which would look like:
    diff --git a/arch/x86/Kbuild b/arch/x86/Kbuild
    index 5a83da703e87..4bb2323eba7c 100644
    --- a/arch/x86/Kbuild
    +++ b/arch/x86/Kbuild
    @@ -30,3 +30,5 @@ obj-$(CONFIG_KEXEC_FILE) += purgatory/
     
     # for cleaning
     subdir- += boot tools
    +
    +subdir-ccflags-y := -ftrivial-auto-var-init=uninitialized
    diff --git a/mm/Makefile b/mm/Makefile
    index 4cc13f3179a5..30717aad76de 100644
    --- a/mm/Makefile
    +++ b/mm/Makefile
    @@ -3,6 +3,8 @@
     # Makefile for the linux memory manager.
     #
     
    +subdir-ccflags-y := -ftrivial-auto-var-init=uninitialized
    +
     KASAN_SANITIZE_slab_common.o := n
     KASAN_SANITIZE_slab.o := n
     KASAN_SANITIZE_slub.o := n
If the issue is resolved with this diff, we have a known good kernel and a known bad kernel, which means it is possible to do further bisection. If the issue is not resolved, it means that these folders do not have the code responsible for the issue. The next place I would start is somewhere in
drivers/, as it is possible that a driver problem can cause issues with the device fully starting up. For example, a problem in
drivers/gpu/drm might result in no graphical output:
    diff --git a/drivers/gpu/drm/Makefile b/drivers/gpu/drm/Makefile
    index c2ef5f9fce54..a3ed2f48fa84 100644
    --- a/drivers/gpu/drm/Makefile
    +++ b/drivers/gpu/drm/Makefile
    @@ -3,6 +3,8 @@
     # Makefile for the drm device driver. This driver provides support for the
     # Direct Rendering Infrastructure (DRI) in XFree86 4.1.0 and higher.
     
    +subdir-ccflags-y := -ftrivial-auto-var-init=uninitialized
    +
     drm-y := drm_aperture.o drm_auth.o drm_cache.o \
     drm_file.o drm_gem.o drm_ioctl.o \
     drm_drv.o \
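Since this edit gets repeated in many folders, it may be convenient to script it. The sketch below is one way to do that; add_subdir_flag is a hypothetical helper name, and appending at the end of the file with += (instead of the := used in the diffs above) is my own choice so that any subdir-ccflags-y the file already sets is preserved.

```shell
# add_subdir_flag <dir> <flag>
# Append "subdir-ccflags-y += <flag>" to the Kbuild makefile of <dir> and
# print the path of the file that was modified.
# Hypothetical helper for illustration; not part of Kbuild.
add_subdir_flag() {
    dir=$1
    flag=$2
    # Kbuild reads <dir>/Kbuild when it exists and <dir>/Makefile otherwise,
    # so the flag has to go into whichever file is actually used.
    if [ -f "$dir/Kbuild" ]; then
        target=$dir/Kbuild
    else
        target=$dir/Makefile
    fi
    # Leading blank line keeps the addition visually separate in "git diff".
    printf '\nsubdir-ccflags-y += %s\n' "$flag" >>"$target"
    echo "$target"
}
```

A call such as `add_subdir_flag mm -ftrivial-auto-var-init=uninitialized` would then modify mm/Makefile, and `git diff` shows exactly what changed before you build.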
4. Start moving down the folder that is problematic
At this point, you will want to try and narrow the problem down to one top level folder (such as
mm) so that you can focus on moving down the directory structure quickly.
Once you have the folder (I’ll be using arch/x86 as an example), you will want to find all the Makefiles that are directly below where you started. For example, in arch/x86:

    $ fd -d 2 Makefile arch/x86
    arch/x86/Makefile
    arch/x86/Makefile.um
    arch/x86/Makefile_32.cpu
    arch/x86/boot/Makefile
    arch/x86/coco/Makefile
    arch/x86/crypto/Makefile
    arch/x86/entry/Makefile
    arch/x86/events/Makefile
    arch/x86/hyperv/Makefile
    arch/x86/ia32/Makefile
    arch/x86/kernel/Makefile
    arch/x86/kvm/Makefile
    arch/x86/lib/Makefile
    arch/x86/math-emu/Makefile
    arch/x86/mm/Makefile
    arch/x86/net/Makefile
    arch/x86/pci/Makefile
    arch/x86/platform/Makefile
    arch/x86/power/Makefile
    arch/x86/purgatory/Makefile
    arch/x86/realmode/Makefile
    arch/x86/tools/Makefile
    arch/x86/um/Makefile
    arch/x86/video/Makefile
    arch/x86/xen/Makefile
We want to remove the original
subdir-ccflags-y from the higher
Makefile and move it down into the individual
Makefiles. Once that is done, boot your kernel to make sure the issue is still fixed, as it should be.
Once that is done, remove the addition from half of the Makefiles and see if the issue is still resolved. If it is, we know the issue is in the folders that still have the negative flag. If it is not, we know the issue is in the folders that we just removed the negative flag from. From there, move the flags further and further down until you arrive at a folder with no subdirectories. I would recommend committing your changes via git, as that makes it easier to verify which kernel you are testing and to undo changes.
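Keeping track of which half is which gets tedious by hand, so here is a rough sketch of how the halving could be scripted. Both function names are hypothetical, find is used instead of fd so that nothing beyond common tools is needed (-mindepth and -maxdepth are GNU/BSD extensions), and only files named exactly Makefile in immediate subfolders are listed.

```shell
# list_sub_makefiles <dir>
# Print, sorted, the Makefiles exactly one level below <dir>.
list_sub_makefiles() {
    find "$1" -mindepth 2 -maxdepth 2 -name Makefile | sort
}

# half <first|second>
# Read a list from stdin and print the requested half. With an odd number
# of lines, the first half gets the extra line.
half() {
    awk -v which="$1" '
        { line[NR] = $0 }
        END {
            mid = int((NR + 1) / 2)
            for (i = 1; i <= NR; i++)
                if ((which == "first") == (i <= mid)) print line[i]
        }
    '
}
```

With these, `list_sub_makefiles arch/x86 | half first` gives the Makefiles to strip the negative flag from for one test boot; if the issue stays fixed, the culprit is somewhere in `list_sub_makefiles arch/x86 | half second`.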
5. Bisect individual translation units
Once you have arrived at a folder with just a Makefile and some .c files, you are ready to figure out the problematic translation unit (or units!). Remove the subdir-ccflags-y that you have added. In its place, we need to generate a set of CFLAGS for the individual translation units to test. Assuming you still have the object files from your previous build, you can generate this using a shell command such as:
    $ for file in <subdir>/*.o; do
          echo CFLAGS_$(basename "$file") := <negative_flag>
      done >><subdir>/Makefile
For example, if I were testing mm:

    $ for file in mm/*.o; do
          echo CFLAGS_$(basename "$file") := <negative_flag>
      done >>mm/Makefile
At this point, make sure the issue is still fixed. If it is, do the same process as above by either deleting or commenting out the flags and seeing if the issue is still resolved. If it is, the issue is in one of the files that still has the flags applied. If the issue comes back, you know the issue is in one of the files that just had the flag removed. Repeat this process until you are left with a minimal set of files that does not have the issue.
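Commenting out half of the generated lines can also be scripted. The following sketch assumes the lines in the Makefile have exactly the form produced by the loop above ("CFLAGS_<object> := <flag>"); disable_tu_flags is a hypothetical helper name, not something Kbuild provides.

```shell
# disable_tu_flags <makefile> <flag> <object>...
# Comment out the "CFLAGS_<object> := <flag>" line for each named object,
# leaving every other line of <makefile> untouched.
# Hypothetical helper for illustration.
disable_tu_flags() {
    makefile=$1
    flag=$2
    shift 2
    scratch=$(mktemp) || return 1
    awk -v flag="$flag" -v objs="$*" '
        BEGIN {
            # Build the set of exact lines that should be commented out.
            n = split(objs, o, " ")
            for (i = 1; i <= n; i++) want["CFLAGS_" o[i] " := " flag] = 1
        }
        ($0 in want) { print "#" $0; next }
        { print }
    ' "$makefile" >"$scratch" && cat "$scratch" >"$makefile"
    status=$?
    rm -f "$scratch"
    return "$status"
}
```

For instance, `disable_tu_flags mm/Makefile -ftrivial-auto-var-init=uninitialized hmm.o gup.o` would turn the flag back off for just those two objects (the object names here are only an illustration). Committing between rounds keeps it easy to see which lines are currently active.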
6. Bisect individual functions
Once there is a set of translation units, it might be possible to further narrow down which function causes the problem using function or variable attributes. By this point, it might be obvious why there is a problem but if not, you will apply the attribute to the functions or variables that need it and do a similar process.
In the case of -ftrivial-auto-var-init=uninitialized, there is __attribute((uninitialized)), which can be applied to local variables to opt them out of automatic initialization, such as:
    diff --git a/mm/hmm.c b/mm/hmm.c
    index 3fd3242c5e50..54f53bbe3632 100644
    --- a/mm/hmm.c
    +++ b/mm/hmm.c
    @@ -235,7 +235,7 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
     	struct hmm_vma_walk *hmm_vma_walk = walk->private;
     	struct hmm_range *range = hmm_vma_walk->range;
     	unsigned int required_fault;
    -	unsigned long cpu_flags;
    +	__attribute((uninitialized)) unsigned long cpu_flags;
     	pte_t pte = *ptep;
     	uint64_t pfn_req_flags = *hmm_pfn;
From there, do the same process of applying the attribute, testing, and removing until you are left with a single set of variables. After that, analysis can commence.