The key to creating fast solutions to the BOX-256 challenges is to have a number of threads working in parallel. New threads can be created with the THR instruction. Every cycle each thread executes one instruction.

Before each cycle, a snapshot of memory is stored in a buffer. Memory reads come from the buffer, but the instructions executed are loaded from the current memory. The program counter (PC) for the first thread is stored at the end of memory (0xFFh). The PC for the next thread is stored at 0xFEh, etc.

There are two techniques for launching multiple threads. The terminology is borrowed from Core War.

Vector Launched Threads

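To create a number of parallel threads, subtract one from the number required and convert the result to binary. Reading the bits from most significant to least significant, code a THR 004 for every 1-bit and a MOV @00 @XX 004 for every 0-bit, where XX is the address of the current instruction. Each THR doubles the number of threads. Each MOV copies the THR at 00 over itself, so the first thread to execute it performs the copy and every other thread executes the THR instead, which doubles the number of threads and subtracts one.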

For example, to create 23 threads, subtract one to get 22, which is 10110 in binary:

00 - THR 004         ;
04 - MOV @00 @04 004 ;
08 - THR 004         ; create 23 parallel threads
0C - THR 004         ;
10 - MOV @00 @10 004 ;
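
The rule can be checked with a short Python sketch. The function names are hypothetical, and the thread-count model rests on an assumption drawn from the fetch rule described in this article: on a MOV line one thread performs the copy and every other thread executes the THR that has just been written.

```python
def vector_launch(threads):
    """Return the launch listing for `threads` parallel threads (hypothetical helper)."""
    if threads < 2:
        raise ValueError("need at least 2 threads")
    bits = format(threads - 1, "b")  # most significant bit first
    lines = []
    for i, bit in enumerate(bits):
        addr = 4 * i  # every instruction occupies 4 bytes
        if bit == "1":
            lines.append(f"{addr:02X} - THR 004")
        else:
            lines.append(f"{addr:02X} - MOV @00 @{addr:02X} 004")
    return lines


def threads_after(bits):
    """Model the thread count after running the launch sequence for `bits`.

    Assumption: THR doubles the count; a MOV line doubles it and subtracts one,
    because one thread spends the cycle on the copy instead of splitting.
    """
    n = 1
    for bit in bits:
        n = 2 * n if bit == "1" else 2 * n - 1
    return n


if __name__ == "__main__":
    print("\n".join(vector_launch(23)))
    print(threads_after("10110"))
```

For 23 threads the generated listing matches the five lines above, and the modelled count after the bits 10110 is 23.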

At this point, all the parallel threads are executing at the same address. To dispatch the threads to different addresses, use a MOV to copy an array of locations over the program counters:

14 - MOV @40 @E9 017 ; copy an array of 23 (0x17h) addresses over the PCs
...
40 - 0?? 0?? 0?? 0??
44 - 0?? 0?? 0?? 0??
48 - 0?? 0?? 0?? 0??
4C - 0?? 0?? 0?? 0??
50 - 0?? 0?? 0?? 0??
54 - 0?? 0?? 0?? 000
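
The destination of the dispatch MOV follows from the PC layout: with the first thread's PC at 0xFF and each later thread one byte lower, n threads occupy the last n bytes of memory. A minimal Python sketch of that arithmetic follows. The helper names are hypothetical and not part of BOX-256.

```python
def dispatch_mov(table_addr, threads):
    """Build the MOV that copies `threads` table bytes over the PCs.

    The PCs of n threads occupy addresses 0x100 - n through 0xFF.
    """
    dest = 0x100 - threads
    return f"MOV @{table_addr:02X} @{dest:02X} {threads:03X}"


def table_slot(table_addr, threads, thread_number):
    """Address of the table byte that lands on a given thread's PC.

    `thread_number` is 1-based: thread 1 has its PC at 0xFF, so its
    entry is the last byte of the table.
    """
    return table_addr + threads - thread_number


if __name__ == "__main__":
    print(dispatch_mov(0x40, 23))              # MOV @40 @E9 017
    print(f"{table_slot(0x40, 23, 1):02X}")    # 56
```

The table is therefore written in reverse thread order: its first byte belongs to the last thread and its last byte to the first thread.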

The program counters are stored at the end of memory. Remember that the thread whose PC is stored at 0xFFh executes first, the thread whose PC is stored at 0xFEh executes next, etc. Threads can be pinned in place, so that they execute the same instruction over and over, by sending one thread back to execute the MOV again:

14 - MOV @40 @E9 017 ; copy an array of 23 (0x17h) addresses over the PCs
...
40 - 0?? 0?? 0?? 0??
44 - 0?? 0?? 0?? 0??
48 - 0?? 0?? 0?? 0??
4C - 0?? 0?? 0?? 0??
50 - 0?? 0?? 0?? 0??
54 - 0?? 0?? 014 000 ; 1st thread jumps back to execute the MOV at 0x14h

Remember that memory is buffered at the start of each cycle. Memory reads come from the buffer, but the instructions executed are loaded from the current memory, so it is possible for one thread to modify the instruction another thread will execute.

A practical example - 5 pinned threads display a Sierpiński triangle:

00 - THR 004         ;
04 - MOV @00 @04 004 ; create 5 parallel threads
08 - MOV @04 @08 004 ; 

0C - MOV @10 @FB 005 ; copy an array of 5 addresses over the PCs

10 - 024 020 01C 018
14 - 00C 000 000 000 ; 1st thread jumps back to execute the MOV at 0x0Ch

18 - PIX 000 @29 000 ; 2nd thread
1C - MOV @29 @28 010 ; 3rd thread
20 - ADD @29 @28 @38 ; 4th thread
24 - ADD 001 @19 @19 ; 5th thread

28 - 000 008 000 000 ; seed data for Sierpiński triangle

Binary Launched Threads

Another technique for launching multiple threads is a binary tree. Each THR is a branch node, and each leaf node contains 2 to 3 instructions. Binary launched threads can display any image of x pixels, for x up to about 45, in ⌈1 + log2 x⌉ cycles.
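
As a worked check of the cycle count, the formula can be evaluated in Python. This is only a sketch: the function name is hypothetical, and it uses integer arithmetic, relying on the identity ⌈log2 x⌉ = (x - 1).bit_length() for x ≥ 1.

```python
def launch_cycles(pixels):
    """Evaluate ceil(1 + log2(pixels)) without floating point."""
    if pixels < 1:
        raise ValueError("pixels must be at least 1")
    return 1 + (pixels - 1).bit_length()


if __name__ == "__main__":
    for x in (16, 32, 45):
        print(x, launch_cycles(x))
```

A 16-pixel image gives 5 cycles, and the largest images of about 45 pixels give 7.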

A practical example - 8 binary launched threads display a square:

00 - THR @30
04 -   THR @1C
08 -     THR @14
0C -       PIX 000 008 ; leaf node
10 -       PIX 001 00A ;

14 -       PIX 002 00B ; leaf node
18 -       PIX 003 00C ;

1C -     THR @28
20 -       PIX 004 008 ; leaf node
24 -       PIX 010 00C ;

28 -       PIX 014 00A ; leaf node
2C -       PIX 020 00B ;

30 -   THR @48
34 -     THR @40
38 -       PIX 024 00B ; leaf node
3C -       PIX 030 00A ;

40 -       PIX 034 00C ; leaf node
44 -       PIX 040 008 ;

48 -     THR @54
4C -       PIX 041 00C ; leaf node
50 -       PIX 042 00B ;

54 -       PIX 043 00A ; leaf node
58 -       PIX 044 008 ;
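
The THR targets in a tree like this one follow mechanically from the layout: each branch node is one THR, followed by its left subtree and then its right subtree, and the THR points at the start of the right subtree. The Python sketch below computes them. The function name is hypothetical, and it assumes the number of leaves is a power of two and that every leaf holds the same number of 4-byte instructions.

```python
def thr_targets(leaves, leaf_instructions=2, base=0):
    """Return (thr_address, target_address) pairs in listing order."""
    if leaves < 1 or leaves & (leaves - 1):
        raise ValueError("leaves must be a power of two")

    def size(n):
        # Bytes used by a subtree with n leaves.
        if n == 1:
            return 4 * leaf_instructions
        return 4 + 2 * size(n // 2)

    def walk(n, addr):
        if n == 1:
            return []  # a leaf contains no THR
        half = n // 2
        right = addr + 4 + size(half)  # right subtree starts after the left one
        return [(addr, right)] + walk(half, addr + 4) + walk(half, right)

    return walk(leaves, base)


if __name__ == "__main__":
    for addr, target in thr_targets(8):
        print(f"{addr:02X} - THR @{target:02X}")
```

For 8 leaves of 2 instructions each, this reproduces the seven THR lines of the listing above, from THR @30 at 00 to THR @54 at 48.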