Commit 3296cffe authored by Harris, Austin

README update from Jakub

parent 9bec43ae
+227 −2
# Register pressure kernels
# Clang Register Allocation Reproducers

A collection of kernels that stress register pressure and related aspects of Clang/LLVM for AMDGPU compilers.
A collection of reproducers for Clang register allocation issues in GPU kernels.

The goal is simple: make it easy for another engineer to reproduce the problem,
understand what is broken, and start debugging immediately.

These reproducers are meant to expose issues caused by high register pressure,
including:

- correctness bugs
- performance regressions
- unstable or surprising code generation

## What this repo is for

This repo is for cases where the register allocator in Clang/LLVM appears to
make a bad choice, falls over under pressure, or generates code that is clearly
wrong or clearly worse than expected.

A good reproducer is not a full application dump. It is a compact, focused test
case that isolates the behavior well enough that somebody else can build it,
run it, and see the issue without needing a guided tour.

## What makes a good reproducer?

The short version: **simple beats complete**.

If the bug originally came from a large application, that is fine. But the goal
should be to reduce it until only the essential ingredients remain.

We group reproducers into two buckets.

### 1. Minimal reproducers (ideal)

This is the gold standard.

A strong minimal reproducer:

- fits in **one source file**
- fits on **one screen** if reasonably possible
- builds with **one command**
- runs with **one command**
- prints output that makes the problem obvious

For correctness issues, the output should clearly say whether the test passed or
failed.

For performance issues, the output should print a clear figure of merit, such
as:

- execution time
- throughput in GB/s or GFLOP/s
- slowdown or speedup relative to a baseline

Example shape:

```bash
# Build
clang++ <flags> reproducer.cpp -o reproducer

# Run
./reproducer

# Output should make the result obvious:
# PASS
# FAIL
# 412.7 GFLOP/s
# 37% regression vs baseline
```

If this kind of reduction is possible, this is what we want.
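As a rough illustration of the figures of merit listed above, here is a minimal host-side sketch that turns a measured time into GB/s or GFLOP/s and formats it as a single line. The helper names (`gigabytes_per_second`, `gflops`, `format_figure`) are hypothetical and not part of this repo; the byte and FLOP counts are assumed to be known by the reproducer author.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: decimal gigabytes moved per second.
// Assumes `seconds` is a positive wall-clock measurement.
double gigabytes_per_second(double bytes_moved, double seconds) {
    return bytes_moved / seconds / 1e9;
}

// Hypothetical helper: billions of floating-point operations per second.
double gflops(double flop_count, double seconds) {
    return flop_count / seconds / 1e9;
}

// Format one figure of merit as a single line, e.g. "412.7 GFLOP/s".
std::string format_figure(double value, const char* unit) {
    char text[64];
    std::snprintf(text, sizeof text, "%.1f %s", value, unit);
    return text;
}
```

Printing one such line as the last line of output keeps the result easy to spot and easy to grep for.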

### 2. Larger reproducers with trivial setup

Sometimes the problem only shows up inside a real framework, a generated kernel,
or a code path that is annoying to peel apart.

That is acceptable, but the reproduction workflow still needs to be dead simple.

A good larger reproducer:

- has setup instructions that still fit on one screen
- uses copy-paste-friendly commands
- avoids mystery dependencies and hand-wavy steps
- makes the failure or regression obvious
- leaves no doubt about what the expected result was

Example shape:

```bash
# Clone and build
git clone <repo> && cd <repo> && ./build.sh

# Run
./run_test.sh

# Output should clearly show the problem:
# PASS / FAIL
# Expected ~500 GFLOP/s, got ~250 GFLOP/s
```

This is not as nice as a one-file reproducer, but it is still good if the setup
is trivial and the result is undeniable.

## Rules of thumb

### Make the result obvious

Whoever picks this up should not have to guess what went wrong.

Bad:

```text
Looks suspicious
```

Good:

```text
FAIL: lane 37 produced 0x00000000, expected 0x3f800000
```
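One way to produce a message like the "Good" example is to compare results bit for bit on the host and report the first mismatch. The sketch below assumes the kernel output has already been copied back into a `std::vector<float>`; `check_lanes` and `bits_of` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Reinterpret a float as its raw 32-bit pattern so mismatches print exactly.
static std::uint32_t bits_of(float value) {
    std::uint32_t raw = 0;
    std::memcpy(&raw, &value, sizeof raw);
    return raw;
}

// Hypothetical helper: returns "PASS", or a one-line description of the
// first lane whose bit pattern differs from the expected value.
// Note: a bitwise comparison treats +0.0 and -0.0 as different.
std::string check_lanes(const std::vector<float>& actual,
                        const std::vector<float>& expected) {
    if (actual.size() != expected.size()) {
        return "FAIL: size mismatch";
    }
    for (std::size_t lane = 0; lane < actual.size(); ++lane) {
        if (bits_of(actual[lane]) != bits_of(expected[lane])) {
            char line[96];
            std::snprintf(line, sizeof line,
                          "FAIL: lane %zu produced 0x%08x, expected 0x%08x",
                          lane,
                          static_cast<unsigned>(bits_of(actual[lane])),
                          static_cast<unsigned>(bits_of(expected[lane])));
            return line;
        }
    }
    return "PASS";
}
```

A reproducer could print the returned string and exit with a non-zero status whenever it is not `PASS`, so scripts can detect the failure too.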

Or for performance:

```text
Baseline: 820 GB/s
Current:  515 GB/s
Regression: 37.2%
```
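The regression figure here is the drop relative to the baseline: (820 - 515) / 820 ≈ 37.2%. A small sketch of that arithmetic, assuming a higher-is-better metric; the function names are hypothetical:

```cpp
#include <cstdio>
#include <string>

// Percentage drop of `current` relative to `baseline`.
// Assumes a higher-is-better metric and a positive baseline.
double regression_percent(double baseline, double current) {
    return (baseline - current) / baseline * 100.0;
}

// Hypothetical helper: build the three-line report shown above.
std::string format_report(double baseline_gbps, double current_gbps) {
    char text[128];
    std::snprintf(text, sizeof text,
                  "Baseline: %.0f GB/s\nCurrent:  %.0f GB/s\nRegression: %.1f%%\n",
                  baseline_gbps, current_gbps,
                  regression_percent(baseline_gbps, current_gbps));
    return text;
}
```

For a lower-is-better metric such as execution time, the sign convention would need to be flipped.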

### Keep the commands boring

The best reproducer is one that another engineer can copy, paste, and run in a
few minutes.

Try to avoid:

- long multi-step setup procedures
- undocumented environment assumptions
- local patches with no explanation
- hidden dependencies on private trees or machine-specific scripts

### Minimize dependencies

If the reproducer needs an entire framework, fine. But only keep what is
actually required to trigger the problem.

Strip out:

- unrelated code paths
- unnecessary kernels
- extra inputs
- giant build systems when a smaller one will do

### Reduce noise

The point is to highlight the compiler issue, not bury it.

Prefer:

- one kernel over many
- one input size over a matrix of cases
- one strong signal over lots of vague evidence

## What to include with each reproducer

Every reproducer submission should include the basics needed to reproduce and
triage the issue quickly.

### Required

- source file or repository link
- exact build command(s)
- exact run command(s)
- expected result
- actual result
- Clang version(s) affected

### Strongly recommended

- target GPU architecture
- relevant compiler flags
- whether the issue is a correctness bug or a performance regression
- a short note on how the case was reduced from the original application

### For performance issues

Also include:

- the metric being reported
- the baseline used for comparison
- the measured regression or speedup
- run methodology if it affects the number (for example, warm-up runs or repeat count)

You do not need a full benchmarking paper here. Just enough context that
another engineer can trust the number and reproduce it.
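As one possible methodology, a reproducer might discard a few warm-up runs and report the median of several timed runs. This is only a sketch under assumptions: `median_seconds` is a hypothetical helper, and the callable passed to it is assumed to block until the GPU work has finished (kernel launches are typically asynchronous, so it would need to synchronize before returning).

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Median of the samples; less sensitive to outliers than the mean.
// Returns 0.0 for an empty input.
double median(std::vector<double> samples) {
    if (samples.empty()) {
        return 0.0;
    }
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 == 1
               ? samples[mid]
               : (samples[mid - 1] + samples[mid]) / 2.0;
}

// Hypothetical helper: run `work` a few times untimed, then time it
// `timed_runs` times and return the median wall-clock time in seconds.
// `work` must not return until the measured work has completed.
template <typename Work>
double median_seconds(Work&& work, int warmup_runs, int timed_runs) {
    for (int i = 0; i < warmup_runs; ++i) {
        work();
    }
    std::vector<double> samples;
    for (int i = 0; i < timed_runs; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return median(samples);
}
```

Stating the warm-up and repeat counts next to the reported number is usually enough context for someone else to reproduce it.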

## What “good enough” looks like

A reproducer is good enough when another engineer can:

1. build it without reverse-engineering your environment
2. run it without asking follow-up questions
3. see the problem immediately
4. start debugging the compiler instead of debugging the reproducer

That is the bar.

## Non-goals

This repo is **not** trying to preserve the full original application or create
perfectly realistic production benchmarks.

If realism fights simplicity, simplicity usually wins.

The point is not to tell the whole story.
The point is to isolate the bug.

## Contributing

When submitting a reproducer, optimize for clarity and reduction.

If you are unsure whether something is small enough, it probably is not.
Try one more round of trimming.

If you cannot reduce it further without losing the issue, that is fine too.
Just make the reproduction steps trivial and the result unambiguous.