Compute Shaders
General-purpose GPU computation with compute shaders in BWSL.
Pressure-test the syntax
Take the concept from this page into the playground and deliberately break a pass, binding, or type signature to see how the compiler responds.
Compute shaders enable general-purpose GPU work outside the rasterization pipeline. Use them for particle updates, reductions, image processing, culling, and simulation.
Basic Syntax
A compute pass contains a compute block instead of vertex and fragment blocks:
```
pipeline ComputeBasic {
  pass "Compute" {
    compute "Main" [64, 1, 1] {
      uint3 globalId = input.global_id;
      uint idx = globalId.x;
      if (idx >= 1024u) {
        return;
      }
      float value = float(idx) / 1024.0;
      float squared = value * value;
      // resources.outputBuffer[idx] = squared;
    }
  }
}
```
The stage syntax is:
```
compute "Name" [X, Y, Z] { body }
```
| Part | Description |
|---|---|
| "Name" | Block name for diagnostics and graph references |
| [X, Y, Z] | Workgroup size |
| { body } | Compute shader code |
A pass contains either compute stages or graphics stages, never both. Compute passes do not take use attributes.
Workgroup Size
The workgroup size [X, Y, Z] defines how many invocations run together:
```
compute "Process1D" [256, 1, 1] { ... }
compute "Blur2D" [16, 16, 1] { ... }
compute "Voxel3D" [8, 8, 8] { ... }
```
Common totals are 64 to 256 invocations per workgroup.
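For a 2D stage such as Blur2D, each invocation maps naturally onto one pixel. A minimal sketch, assuming hypothetical resources.width and resources.height values describing the image being processed:

```
compute "Blur2D" [16, 16, 1] {
  uint x = input.global_id.x;
  uint y = input.global_id.y;
  // Skip invocations that fall outside the image.
  if (x >= resources.width || y >= resources.height) {
    return;
  }
}
```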
Compute Built-in Inputs
Access compute-specific values through the input object:
| Built-in | Type | Description |
|---|---|---|
| input.global_id | uint3 | Unique ID across all workgroups |
| input.local_id | uint3 | ID within the current workgroup |
| input.workgroup_id | uint3 | Which workgroup this invocation belongs to |
| input.num_workgroups | uint3 | Total number of workgroups dispatched |
| input.local_index | uint | Flattened 1D index within the workgroup |
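These built-ins are related: an invocation's global ID is its workgroup ID scaled by the workgroup size, plus its local ID. A sketch of that relationship for a [64, 1, 1] stage (the variables exist only to illustrate the identity):

```
compute "IdRelations" [64, 1, 1] {
  // Both expressions identify the same invocation:
  uint fromGlobal = input.global_id.x;
  uint rebuilt = input.workgroup_id.x * 64u + input.local_id.x;
  // input.local_index flattens local_id; for this 1D workgroup it
  // simply equals input.local_id.x.
}
```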
Dispatch Configuration
Dispatch count is host-side configuration: BWSL source declares the local workgroup size with [X, Y, Z], while your engine decides how many workgroups to launch. To cover N elements with an X-wide workgroup, the host typically dispatches ceil(N / X) workgroups, so the last workgroup can overshoot the data. That is why bounds checks are still the normal pattern:
```
compute "FillBuffer" [64, 1, 1] {
  uint idx = input.global_id.x;
  // Guard against the partial final workgroup.
  if (idx >= resources.dataSize) {
    return;
  }
  // Writes to resources would follow here.
}
```
Shared Memory
Shared memory is visible to every invocation in the current workgroup. Declare it with shared:
```
compute "BasicShared" [256, 1, 1] {
  shared float values[256];
  uint localIdx = input.local_index;
  values[localIdx] = float(localIdx) * 0.1;
  barrier();  // every invocation's write above is now visible
  uint neighborIdx = (localIdx + 1u) % 256u;
  float neighbor = values[neighborIdx];
  barrier();  // all reads finish before the writes below overwrite values
  values[localIdx] = neighbor * 2.0;
}
```
Barrier Intrinsics
Synchronize invocations within a workgroup using:
- barrier()
- memoryBarrier()
- storageBarrier()
```
compute "BarrierDemo" [64, 1, 1] {
  shared float data[64];
  uint localIdx = input.local_index;
  data[localIdx] = float(localIdx);
  barrier();  // wait until every invocation has written its slot
  if (localIdx == 0u) {
    data[0] = data[1] + data[2];
  }
  memoryBarrier();   // order memory accesses
  storageBarrier();  // order storage-buffer accesses
  barrier();         // synchronize execution across the workgroup
}
```
Barrier Rules
All invocations in a workgroup must execute the same barrier sequence. Do not place barriers behind divergent control flow.
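A workgroup-wide tree reduction is the classic compliant pattern: the loop bounds depend only on the workgroup size, so every invocation reaches every barrier() even though only some invocations do work between them. A sketch, assuming C-style for loops and a hypothetical resources.partialSums output buffer:

```
compute "ReduceSum" [256, 1, 1] {
  shared float partial[256];
  uint localIdx = input.local_index;
  partial[localIdx] = float(input.global_id.x);
  barrier();
  // Halve the active range each step. The loop is uniform, so the
  // barrier() inside it is executed by the whole workgroup.
  for (uint stride = 128u; stride > 0u; stride = stride / 2u) {
    if (localIdx < stride) {
      partial[localIdx] = partial[localIdx] + partial[localIdx + stride];
    }
    barrier();
  }
  if (localIdx == 0u) {
    // resources.partialSums[input.workgroup_id.x] = partial[0];
  }
}
```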
Atomics and Waves
The current compute surface also includes:
- shared-memory atomics such as atomic_add, atomic_min, atomic_max, atomic_and, atomic_or, atomic_xor, atomic_exchange, and atomic_cmp_exchange
- wave/subgroup intrinsics such as wave_sum, wave_product, wave_min, wave_max, wave_all, wave_any, wave_broadcast, and wave_read_first
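Wave intrinsics combine values across a subgroup without shared memory or barriers. A minimal sketch; the subgroup width is hardware-dependent, and the local variables exist only to show the call shapes:

```
compute "WaveDemo" [64, 1, 1] {
  float value = float(input.local_index);
  // Every lane in the wave receives the same total.
  float total = wave_sum(value);
  // True only if the predicate holds on every lane.
  bool allSmall = wave_all(value < 64.0);
  // Read the value held by the first active lane.
  float first = wave_read_first(value);
}
```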
Atomics operate on lvalues such as array elements:
```
compute "AtomicOps" [64, 1, 1] {
  shared int sharedCounter[1];
  // One invocation initializes the counter...
  if (input.local_index == 0u) {
    sharedCounter[0] = 0;
  }
  barrier();  // ...and everyone waits for that write.
  // Each invocation atomically increments and receives the old value.
  int previous = atomic_add(sharedCounter[0], 1);
}
```
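Atomics earn their keep when many invocations update the same location concurrently. A per-workgroup histogram sketch; the 16-bin layout is an arbitrary choice, and the commented line stands in for a hypothetical resources.histogram output buffer:

```
compute "Histogram" [64, 1, 1] {
  shared int bins[16];
  uint localIdx = input.local_index;
  // A subset of invocations zeroes the bins.
  if (localIdx < 16u) {
    bins[localIdx] = 0;
  }
  barrier();
  // Many invocations may hit the same bin; atomic_add keeps the count exact.
  uint bin = input.global_id.x % 16u;
  int previous = atomic_add(bins[bin], 1);
  barrier();
  if (localIdx < 16u) {
    // resources.histogram[localIdx] = bins[localIdx];
  }
}
```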
Compute Graph
The parser and compiler include a pipeline-level compute_graph { ... } feature for compute-pass dependency tracking. The implementation exists, but the public docs do not yet have a stable example syntax to show here, so this page intentionally stops short of inventing one.
Summary
| Feature | Syntax / Notes |
|---|---|
| Compute stage | compute "Name" [X, Y, Z] { ... } |
| Built-ins | input.global_id, input.local_id, input.workgroup_id, input.num_workgroups, input.local_index |
| Shared memory | shared float cache[256]; |
| Sync | barrier(), memoryBarrier(), storageBarrier() |
| Atomics | atomic_add(sharedCounter[0], 1) |
| Dispatch count | Chosen by the host engine |
See Also
- Shader I/O - Built-in input and output behavior
- Resources - Accessing external buffers, textures, and images
- Intrinsics - Full intrinsic reference