
Compute Shaders

General-purpose GPU computation with compute shaders in BWSL.


Compute shaders enable general-purpose GPU work outside the rasterization pipeline. Use them for particle updates, reductions, image processing, culling, and simulation.

Basic Syntax

A compute pass contains a compute block instead of vertex and fragment blocks:

bwsl
pipeline ComputeBasic {
    pass "Compute" {
        compute "Main" [64, 1, 1] {
            uint3 globalId = input.global_id;
            uint idx = globalId.x;
            if (idx >= 1024u) {
                return;
            }
            float value = float(idx) / 1024.0;
            float squared = value * value;
            // resources.outputBuffer[idx] = squared;
        }
    }
}

The stage syntax is:

text
compute "Name" [X, Y, Z] { body }
Part         Description
"Name"       Block name for diagnostics and graph references
[X, Y, Z]    Workgroup size
{ body }     Compute shader code

A pass contains either compute or graphics stages, never both. Compute passes do not take use attributes.

Workgroup Size

The workgroup size [X, Y, Z] defines how many invocations run together:

bwsl
compute "Process1D" [256, 1, 1] { ... }
compute "Blur2D" [16, 16, 1] { ... }
compute "Voxel3D" [8, 8, 8] { ... }

Common totals are 64 to 256 invocations per workgroup.
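For 2D workloads such as image filters, the x and y components of global_id typically index pixels. A minimal sketch using only built-ins shown on this page (resources.width and resources.height are hypothetical resource names):

bwsl
compute "Grayscale" [16, 16, 1] {
    uint x = input.global_id.x;
    uint y = input.global_id.y;
    // Guard against the partial workgroups at the image edges.
    if (x >= resources.width || y >= resources.height) {
        return;
    }
    // uint pixelIndex = y * resources.width + x;
}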

Compute Built-in Inputs

Access compute-specific values through the input object:

Built-in               Type    Description
input.global_id        uint3   Unique ID across all workgroups
input.local_id         uint3   ID within the current workgroup
input.workgroup_id     uint3   Which workgroup this invocation belongs to
input.num_workgroups   uint3   Total number of workgroups dispatched
input.local_index      uint    Flattened 1D index within the workgroup
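These built-ins are related: global_id is workgroup_id scaled by the workgroup size, plus local_id. A minimal sketch of that identity, using only syntax from this page (the reconstructed value is purely illustrative):

bwsl
compute "IdIdentity" [64, 1, 1] {
    // With workgroup size [64, 1, 1]:
    // input.global_id.x == input.workgroup_id.x * 64 + input.local_id.x
    uint reconstructed = input.workgroup_id.x * 64u + input.local_id.x;
    // In a 1D workgroup, local_index equals local_id.x.
    uint flat = input.local_index;
}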

Dispatch Configuration

Dispatch count is host-side configuration. BWSL source declares the local workgroup size with [X, Y, Z], while your engine decides how many workgroups to launch. For example, covering 1000 elements with [64, 1, 1] takes ceil(1000 / 64) = 16 workgroups, which launch 1024 invocations, so the final 24 must exit early.

Bounds checks are therefore the normal pattern:

bwsl
compute "FillBuffer" [64, 1, 1] {
    uint idx = input.global_id.x;
    if (idx >= resources.dataSize) {
        return;
    }
}

Shared Memory

Shared memory is visible to every invocation in the current workgroup. Declare it with shared:

bwsl
compute "BasicShared" [256, 1, 1] {
    shared float values[256];
    uint localIdx = input.local_index;
    values[localIdx] = float(localIdx) * 0.1;
    barrier();
    uint neighborIdx = (localIdx + 1u) % 256u;
    float neighbor = values[neighborIdx];
    values[localIdx] = neighbor * 2.0;
}
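A common use of shared memory is a workgroup-level reduction. The sketch below is illustrative: it assumes BWSL supports C-style for loops (not shown elsewhere on this page), and resources.partialSums is a hypothetical output buffer:

bwsl
compute "ReduceSum" [256, 1, 1] {
    shared float partial[256];
    uint localIdx = input.local_index;
    partial[localIdx] = float(input.global_id.x);
    barrier();
    // Halve the number of active invocations each step.
    for (uint stride = 128u; stride > 0u; stride = stride / 2u) {
        if (localIdx < stride) {
            partial[localIdx] = partial[localIdx] + partial[localIdx + stride];
        }
        // The barrier stays outside the if, so every invocation reaches it.
        barrier();
    }
    if (localIdx == 0u) {
        // resources.partialSums[input.workgroup_id.x] = partial[0];
    }
}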

Barrier Intrinsics

Synchronize invocations within a workgroup using:

  • barrier() - execution barrier; invocations in the workgroup wait until all have reached it
  • memoryBarrier() - orders memory operations within the workgroup
  • storageBarrier() - orders storage buffer accesses

bwsl
compute "BarrierDemo" [64, 1, 1] {
    shared float data[64];
    uint localIdx = input.local_index;
    data[localIdx] = float(localIdx);
    barrier();
    if (localIdx == 0u) {
        data[0] = data[1] + data[2];
    }
    memoryBarrier();
    storageBarrier();
    barrier();
}

Barrier Rules

All invocations in a workgroup must execute the same barrier sequence. Do not place barriers behind divergent control flow.

Atomics and Waves

The current compute surface also includes:

  • shared-memory atomics such as atomic_add, atomic_min, atomic_max, atomic_and, atomic_or, atomic_xor, atomic_exchange, and atomic_cmp_exchange
  • wave/subgroup intrinsics such as wave_sum, wave_product, wave_min, wave_max, wave_all, wave_any, wave_broadcast, and wave_read_first

Atomics operate on lvalues such as array elements:

bwsl
compute "AtomicOps" [64, 1, 1] {
    shared int sharedCounter[1];
    if (input.local_index == 0u) {
        sharedCounter[0] = 0;
    }
    barrier();
    int previous = atomic_add(sharedCounter[0], 1);
}
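The wave intrinsics listed above can often replace a shared-memory reduction within a single wave. A hedged sketch, assuming wave_sum and wave_read_first each take a per-invocation value (the exact signatures are not documented on this page):

bwsl
compute "WaveDemo" [64, 1, 1] {
    float value = float(input.local_index);
    // Sum of value across the wave, returned to every lane.
    float waveTotal = wave_sum(value);
    // The value held by the first active lane in the wave.
    float first = wave_read_first(value);
}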

Compute Graph

The parser and compiler include a pipeline-level compute_graph { ... } feature for compute-pass dependency tracking. The implementation exists, but the public docs do not yet define a stable example syntax, so this page does not show one.

Summary

Feature          Syntax / Notes
Compute stage    compute "Name" [X, Y, Z] { ... }
Built-ins        input.global_id, input.local_id, input.workgroup_id, input.num_workgroups, input.local_index
Shared memory    shared float cache[256];
Sync             barrier(), memoryBarrier(), storageBarrier()
Atomics          atomic_add(sharedCounter[0], 1)
Dispatch count   Chosen by the host engine

See Also

  • Shader I/O - Built-in input and output behavior
  • Resources - Accessing external buffers, textures, and images
  • Intrinsics - Full intrinsic reference