README update from Jakub (3296cffe) · Commits · Harris, Austin / register_pressure_kernels

README.md

+227 −2

		# Register pressure kernels
		# Clang Register Allocation Reproducers

		A collection of kernels that stress register pressure and related aspects of Clang/LLVM for AMDGPU compilers.
		A collection of reproducers for Clang register allocation issues in GPU kernels.

		The goal is simple: make it easy for another engineer to reproduce the problem,
		understand what is broken, and start debugging immediately.

		These reproducers are meant to expose issues caused by high register pressure,
		including:

		- correctness bugs
		- performance regressions
		- unstable or surprising code generation

		## What this repo is for

		This repo is for cases where Clang's register allocator appears to make a bad
		choice, falls over under pressure, or generates code that is clearly wrong or
		clearly worse than expected.

		A good reproducer is not a full application dump. It is a compact, focused test
		case that isolates the behavior well enough that somebody else can build it,
		run it, and see the issue without needing a guided tour.

		## What makes a good reproducer?

		The short version: simple beats complete.

		If the bug originally came from a large application, that is fine. But the goal
		should be to reduce it until only the essential ingredients remain.

		We group reproducers into two buckets.

		### 1. Minimal reproducers (ideal)

		This is the gold standard.

		A strong minimal reproducer:

		- fits in one source file
		- fits on one screen if reasonably possible
		- builds with one command
		- runs with one command
		- prints output that makes the problem obvious

		For correctness issues, the output should clearly say whether the test passed or
		failed.

		For performance issues, the reproducer should print a clear figure of merit,
		such as:

		- execution time
		- throughput in GB/s or GFLOP/s
		- slowdown or speedup relative to a baseline

		Example shape:

		```bash
		# Build
		clang++ <flags> reproducer.cpp -o reproducer

		# Run
		./reproducer

		# Output should make the result obvious, for example one of:
		# PASS
		# FAIL
		# 412.7 GFLOP/s
		# 37% regression vs baseline
		```

		If this kind of reduction is possible, this is what we want.
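
As a rough illustration of the checklist above, a correctness reproducer can be organized around three small pieces: the code under test, a trivially simple reference, and a check that prints a one-line verdict. This is a host-only sketch, not a real kernel. The names `run_kernel`, `reference`, and `check` are hypothetical, and `run_kernel` is a stand-in for the reduced GPU kernel and its launch code.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the code under test. In a real reproducer this
// would launch the reduced GPU kernel and copy the results back to the host.
std::vector<float> run_kernel(std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(i) * 2.0f;
    }
    return out;
}

// Reference result, computed on the host in the simplest possible way.
std::vector<float> reference(std::size_t n) {
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<float>(i) * 2.0f;
    }
    return out;
}

// Prints a single PASS or FAIL line and returns a process exit code
// (0 on success, 1 on the first mismatch).
int check(const std::vector<float>& got, const std::vector<float>& want) {
    if (got.size() != want.size()) {
        std::printf("FAIL: got %zu elements, expected %zu\n",
                    got.size(), want.size());
        return 1;
    }
    for (std::size_t i = 0; i < got.size(); ++i) {
        if (got[i] != want[i]) {
            std::printf("FAIL: element %zu produced %g, expected %g\n", i,
                        static_cast<double>(got[i]),
                        static_cast<double>(want[i]));
            return 1;
        }
    }
    std::printf("PASS\n");
    return 0;
}
```

A `main` that simply returns `check(run_kernel(n), reference(n))` would complete the file, so the exit code and the printed line agree and scripts can rely on either.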

		### 2. Larger reproducers with trivial setup

		Sometimes the problem only shows up inside a real framework, a generated kernel,
		or a code path that is annoying to peel apart.

		That is acceptable, but the reproduction workflow still needs to be dead simple.

		A good larger reproducer:

		- has setup instructions that still fit on one screen
		- uses copy-paste-friendly commands
		- avoids mystery dependencies and hand-wavy steps
		- makes the failure or regression obvious
		- leaves no doubt about what the expected result was

		Example shape:

		```bash
		# Clone and build
		git clone <repo-url> && cd <repo-dir> && ./build.sh

		# Run
		./run_test.sh

		# Output should clearly show the problem:
		# PASS / FAIL
		# Expected ~500 GFLOP/s, got ~250 GFLOP/s
		```

		This is not as nice as a one-file reproducer, but it is still good if the setup
		is trivial and the result is undeniable.

		## Rules of thumb

		### Make the result obvious

		Whoever picks this up should not have to guess what went wrong.

		Bad:

		```text
		Looks suspicious
		```

		Good:

		```text
		FAIL: lane 37 produced 0x00000000, expected 0x3f800000
		```

		Or for performance:

		```text
		Baseline: 820 GB/s
		Current: 515 GB/s
		Regression: 37.2%
		```
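
One possible way to produce messages in the two "good" styles above is sketched here. The helper names (`format_mismatch`, `regression_percent`, `format_regression`) are made up for illustration, and the sketch assumes 32-bit `float` results and a throughput metric where higher is better.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Formats a mismatch using raw bit patterns, so values such as NaN or -0.0
// stay unambiguous in the log.
std::string format_mismatch(unsigned lane, float got, float expected) {
    std::uint32_t got_bits = 0;
    std::uint32_t expected_bits = 0;
    std::memcpy(&got_bits, &got, sizeof got_bits);
    std::memcpy(&expected_bits, &expected, sizeof expected_bits);
    char buf[96];
    std::snprintf(buf, sizeof buf,
                  "FAIL: lane %u produced 0x%08x, expected 0x%08x", lane,
                  static_cast<unsigned>(got_bits),
                  static_cast<unsigned>(expected_bits));
    return buf;
}

// Percentage by which `current` falls short of `baseline`
// (positive means slower, for higher-is-better metrics such as GB/s).
double regression_percent(double baseline, double current) {
    return (baseline - current) / baseline * 100.0;
}

// Formats the three-line performance summary.
std::string format_regression(double baseline_gbs, double current_gbs) {
    char buf[128];
    std::snprintf(buf, sizeof buf,
                  "Baseline: %.0f GB/s\nCurrent: %.0f GB/s\nRegression: %.1f%%",
                  baseline_gbs, current_gbs,
                  regression_percent(baseline_gbs, current_gbs));
    return buf;
}
```

With the numbers from the example, (820 - 515) / 820 is about 0.372, which is where the 37.2% figure comes from.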

		### Keep the commands boring

		The best reproducer is one that another engineer can copy, paste, and run in a
		few minutes.

		Try to avoid:

		- long multi-step setup procedures
		- undocumented environment assumptions
		- local patches with no explanation
		- hidden dependencies on private trees or machine-specific scripts

		### Minimize dependencies

		If the reproducer needs an entire framework, fine. But only keep what is
		actually required to trigger the problem.

		Strip out:

		- unrelated code paths
		- unnecessary kernels
		- extra inputs
		- giant build systems when a smaller one will do

		### Reduce noise

		The point is to highlight the compiler issue, not bury it.

		Prefer:

		- one kernel over many
		- one input size over a matrix of cases
		- one strong signal over lots of vague evidence

		## What to include with each reproducer

		Every reproducer submission should include the basics needed to reproduce and
		triage the issue quickly.

		### Required

		- source file or repository link
		- exact build command(s)
		- exact run command(s)
		- expected result
		- actual result
		- Clang version(s) affected
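
The "Clang version(s) affected" item is easier to get right when the reproducer reports the compiler itself. A small sketch, assuming the file is built with Clang: `__clang__` and `__clang_version__` are predefined Clang macros, while the function name `compiler_banner` is hypothetical. It only records the compiler that built this translation unit, so a separate device toolchain, if any, would still need to be noted by hand.

```cpp
#include <string>

// Returns a one-line description of the compiler that built this file.
std::string compiler_banner() {
#if defined(__clang__)
    // __clang_version__ expands to a string literal with the full version.
    return std::string("Compiler: clang ") + __clang_version__;
#elif defined(__VERSION__)
    // Fallback for other compilers that define __VERSION__ (for example GCC).
    return std::string("Compiler (not Clang): ") + __VERSION__;
#else
    return "Compiler: unknown";
#endif
}
```

Printing this line before the PASS/FAIL or performance output keeps the version attached to every pasted log.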

		### Strongly recommended

		- target GPU architecture
		- relevant compiler flags
		- whether the issue is a correctness bug or a performance regression
		- a short note on how the case was reduced from the original application

		### For performance issues

		Also include:

		- the metric being reported
		- the baseline used for comparison
		- the measured regression or speedup
		- run methodology if it matters (for example, warm-up runs, iteration count,
		  and how the timings were aggregated)

		You do not need a full benchmarking paper here. Just enough context that
		another engineer can trust the number and reproduce it.
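
As one example of a methodology that is short enough to state in a sentence, the sketch below runs a few untimed warm-up iterations, times a fixed number of runs, and reports the median. The names `median_seconds` and `gigabytes_per_second` are hypothetical. GPU kernel launches are typically asynchronous, so the `work` callable is assumed to wait for the kernel to finish before it returns.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Calls `work` `warmup` times untimed, then `iters` times timed, and returns
// the median wall-clock time in seconds. Requires iters >= 1. For an even
// number of samples this returns the upper of the two middle values.
template <typename F>
double median_seconds(F&& work, int warmup, int iters) {
    for (int i = 0; i < warmup; ++i) {
        work();
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(iters));
    for (int i = 0; i < iters; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];
}

// Converts bytes moved and elapsed seconds into GB/s (1 GB = 1e9 bytes).
double gigabytes_per_second(std::size_t bytes_moved, double seconds) {
    return static_cast<double>(bytes_moved) / seconds / 1e9;
}
```

Stating the warm-up count, the iteration count, and the fact that the median is reported is usually enough context for someone else to reproduce the number.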

		## What “good enough” looks like

		A reproducer is good enough when another engineer can:

		1. build it without reverse-engineering your environment
		2. run it without asking follow-up questions
		3. see the problem immediately
		4. start debugging the compiler instead of debugging the reproducer

		That is the bar.

		## Non-goals

		This repo is not trying to preserve the full original application or create
		perfectly realistic production benchmarks.

		If realism fights simplicity, simplicity usually wins.

		The point is not to tell the whole story.
		The point is to isolate the bug.

		## Contributing

		When submitting a reproducer, optimize for clarity and reduction.

		If you are unsure whether something is small enough, it probably is not.
		Try one more round of trimming.

		If you cannot reduce it further without losing the issue, that is fine too.
		Just make the reproduction steps trivial and the result unambiguous.