
Compute Shaders

General-purpose GPU computation with compute shaders in BWSL.


Compute shaders enable general-purpose GPU work outside the rasterization pipeline. Use them for particle updates, reductions, image processing, culling, and simulation.

Basic Syntax

A compute pass contains a compute block instead of vertex and fragment blocks:

bwsl
pipeline ComputeBasic {
    pass "Compute" {
        compute "Main" [64, 1, 1] {
            uint3 globalId = input.global_id;
            uint idx = globalId.x;
            if (idx >= 1024u) {
                return;
            }
            float value = float(idx) / 1024.0;
            float squared = value * value;
            // resources.outputBuffer[idx] = squared;
        }
    }
}

The stage syntax is:

text
compute "Name" [X, Y, Z] { body }
Part         Description
"Name"       Block name for diagnostics and graph references
[X, Y, Z]    Workgroup size
{ body }     Compute shader code

A pass contains either compute or graphics stages, never both. Compute passes do not take use attributes.

Workgroup Size

The workgroup size [X, Y, Z] defines how many invocations run together:

bwsl
compute "Process1D" [256, 1, 1] { ... }
compute "Blur2D" [16, 16, 1] { ... }
compute "Voxel3D" [8, 8, 8] { ... }

Common totals are 64 to 256 invocations per workgroup.
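For 2D workloads such as image filters, the x and y components of global_id typically index pixels. A minimal sketch using only built-ins shown on this page (resources.width and resources.height are hypothetical resource names):

bwsl
compute "Grayscale" [16, 16, 1] {
    uint x = input.global_id.x;
    uint y = input.global_id.y;
    // Guard against the partial workgroups at the image edges.
    if (x >= resources.width || y >= resources.height) {
        return;
    }
    // uint pixelIndex = y * resources.width + x;
}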

Compute Built-in Inputs

Access compute-specific values through the input object:

Built-in               Type    Description
input.global_id        uint3   Unique ID across all workgroups
input.local_id         uint3   ID within the current workgroup
input.workgroup_id     uint3   Which workgroup this invocation belongs to
input.num_workgroups   uint3   Total number of workgroups dispatched
input.local_index      uint    Flattened 1D index within the workgroup
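These built-ins are related: global_id is workgroup_id scaled by the workgroup size, plus local_id. A minimal sketch of that identity, using only syntax from this page (the reconstructed value is purely illustrative):

bwsl
compute "IdIdentity" [64, 1, 1] {
    // With workgroup size [64, 1, 1]:
    // input.global_id.x == input.workgroup_id.x * 64 + input.local_id.x
    uint reconstructed = input.workgroup_id.x * 64u + input.local_id.x;
    // In a 1D workgroup, local_index equals local_id.x.
    uint flat = input.local_index;
}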

Dispatch Configuration

Dispatch count is host-side configuration. BWSL source declares the local workgroup size with [X, Y, Z], while your engine decides how many workgroups to launch. For example, covering 1000 elements with [64, 1, 1] takes ceil(1000 / 64) = 16 workgroups, which launch 1024 invocations, so the final 24 must exit early.

Bounds checks are therefore the normal pattern:

bwsl
compute "FillBuffer" [64, 1, 1] {
    uint idx = input.global_id.x;
    if (idx >= resources.dataSize) {
        return;
    }
}

Shared Memory

Shared memory is visible to every invocation in the current workgroup. Declare it with shared:

bwsl
compute "BasicShared" [256, 1, 1] {
    shared float values[256];
    uint localIdx = input.local_index;
    values[localIdx] = float(localIdx) * 0.1;
    barrier();
    uint neighborIdx = (localIdx + 1u) % 256u;
    float neighbor = values[neighborIdx];
    values[localIdx] = neighbor * 2.0;
}
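A common use of shared memory is a workgroup-level reduction. The sketch below is illustrative: it assumes BWSL supports C-style for loops (not shown elsewhere on this page), and resources.partialSums is a hypothetical output buffer:

bwsl
compute "ReduceSum" [256, 1, 1] {
    shared float partial[256];
    uint localIdx = input.local_index;
    partial[localIdx] = float(input.global_id.x);
    barrier();
    // Halve the number of active invocations each step.
    for (uint stride = 128u; stride > 0u; stride = stride / 2u) {
        if (localIdx < stride) {
            partial[localIdx] = partial[localIdx] + partial[localIdx + stride];
        }
        // The barrier stays outside the if, so every invocation reaches it.
        barrier();
    }
    if (localIdx == 0u) {
        // resources.partialSums[input.workgroup_id.x] = partial[0];
    }
}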

Barrier Intrinsics

Synchronize invocations within a workgroup using:

  • barrier() - execution barrier; invocations in the workgroup wait until all have reached it
  • memoryBarrier() - orders memory operations within the workgroup
  • storageBarrier() - orders storage buffer accesses

bwsl
compute "BarrierDemo" [64, 1, 1] {
    shared float data[64];
    uint localIdx = input.local_index;
    data[localIdx] = float(localIdx);
    barrier();
    if (localIdx == 0u) {
        data[0] = data[1] + data[2];
    }
    memoryBarrier();
    storageBarrier();
    barrier();
}

Barrier Rules

All invocations in a workgroup must execute the same barrier sequence. Do not place barriers behind divergent control flow.

Atomics and Waves

The current compute surface also includes:

  • shared-memory atomics such as atomic_add, atomic_min, atomic_max, atomic_and, atomic_or, atomic_xor, atomic_exchange, and atomic_cmp_exchange
  • wave/subgroup intrinsics such as wave_sum, wave_product, wave_min, wave_max, wave_all, wave_any, wave_broadcast, and wave_read_first

Atomics operate on lvalues such as array elements:

bwsl
compute "AtomicOps" [64, 1, 1] {
    shared int sharedCounter[1];
    if (input.local_index == 0u) {
        sharedCounter[0] = 0;
    }
    barrier();
    int previous = atomic_add(sharedCounter[0], 1);
}
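The wave intrinsics listed above can often replace a shared-memory reduction within a single wave. A hedged sketch, assuming wave_sum and wave_read_first each take a per-invocation value (the exact signatures are not documented on this page):

bwsl
compute "WaveDemo" [64, 1, 1] {
    float value = float(input.local_index);
    // Sum of value across the wave, returned to every lane.
    float waveTotal = wave_sum(value);
    // The value held by the first active lane in the wave.
    float first = wave_read_first(value);
}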

Compute Graph

The parser and compiler include a pipeline-level compute_graph { ... } feature for compute-pass dependency tracking. The implementation exists, but the public docs do not yet define a stable example syntax, so this page does not show one.

Summary

Feature          Syntax / Notes
Compute stage    compute "Name" [X, Y, Z] { ... }
Built-ins        input.global_id, input.local_id, input.workgroup_id, input.num_workgroups, input.local_index
Shared memory    shared float cache[256];
Sync             barrier(), memoryBarrier(), storageBarrier()
Atomics          atomic_add(sharedCounter[0], 1)
Dispatch count   Chosen by the host engine

See Also

  • Shader I/O - Built-in input and output behavior
  • Resources - Accessing external buffers, textures, and images
  • Intrinsics - Full intrinsic reference