On Interaction Nets and Hardware
Understanding the potential of our novel general-purpose architecture requires understanding the underlying model of computation: interaction nets. We'll therefore introduce interaction nets, their unique properties (specifically locality, parallelism, and linearity), and how these properties help overcome some of the hardest challenges in hardware architecture design.
Interaction Nets
Interaction nets are a graph-based model of computation. An interaction net consists of nodes that each have one principal port and, depending on the node type, zero or more auxiliary ports. The following is a node of type A with one principal port (on the top) and two auxiliary ports (at the bottom).
An interaction net consists of multiple nodes, whose ports are connected by wires.
A wire connecting two principal ports forms an active pair. Those two nodes then interact with each other to perform some computation. For that, an interaction net system defines rules on how two nodes of a particular type interact with each other.
Applying a rule to an active pair in a larger net replaces the two nodes of the active pair with the right-hand side of the rule.
This also brings us to the first important property of interaction nets: locality. One step in the computation only changes a very local part of the graph and does not have a global effect.
For now, let's focus on very simple calculations to get a feel for how interaction nets compute things. First, we'll introduce nodes for unary numbers. We'll have a 0 node with a principal port and no auxiliary ports, and a +1 node with a principal port and one auxiliary port.
We'll represent the numbers 0, 1, and 2 as follows:
Starting with addition, we introduce a "+" node to add two numbers.
Before we do any calculations, we need to define how the "+" node interacts with our number nodes. The rules represent the arithmetic identities $n + 0 = n$ and $n + (m + 1) = (n + m) + 1$.
Using this system, we can now compute $1 + 2$.
The result is $3$, as expected.
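To see the same computation outside of interaction nets, here is a minimal sketch in Rust (our own illustration, not part of any interaction-net runtime): an enum mirrors the 0 and +1 nodes, and `add` applies the two addition rules.

```rust
// Unary numbers built from the 0 and +1 nodes.
#[derive(Clone, Debug, PartialEq)]
enum Num {
    Zero,           // the 0 node: no auxiliary ports
    Succ(Box<Num>), // the +1 node: one auxiliary port pointing at the rest
}

// The two interaction rules for the "+" node:
//   n + 0       = n
//   n + (m + 1) = (n + m) + 1
fn add(n: Num, m: Num) -> Num {
    match m {
        Num::Zero => n,
        Num::Succ(rest) => Num::Succ(Box::new(add(n, *rest))),
    }
}
```

Each `match` arm corresponds to one interaction rule, and each recursive call corresponds to one interaction.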
Next, we'll define multiplication with the help of the equivalences $n \times 0 = 0$ and $n \times (m + 1) = (n \times m) + n$. Notice though that in the first case $n$ does not appear on the right-hand side at all, and in the second case it appears twice. To model that, we'll need two additional node types to erase and duplicate numbers.
Now we can specify the rules for the multiplication node.
With this, we can compute $2 \times 2$.
After the first interaction, something interesting happens. We now have two active pairs in the interaction net. Moreover, the two active pairs are non-overlapping. This is actually always the case; since nodes only have a single principal port, they can only ever be part of a single active pair. This means that all active pairs are completely independent, and they can be executed in any order or, more interestingly, in parallel.
After executing these two active pairs, we get three more which we again can execute in parallel.
We get the expected result of $4$.
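The two multiplication rules can also be sketched outside the net. In this Rust illustration (ours, not an interaction-net runtime), the erasure and duplication that dedicated nodes perform in the net show up as an explicit drop and an explicit `clone`:

```rust
#[derive(Clone, Debug, PartialEq)]
enum Num {
    Zero,
    Succ(Box<Num>),
}

// n + 0 = n;  n + (m + 1) = (n + m) + 1
fn add(n: Num, m: Num) -> Num {
    match m {
        Num::Zero => n,
        Num::Succ(rest) => Num::Succ(Box::new(add(n, *rest))),
    }
}

// n * 0       = 0           (n is erased)
// n * (m + 1) = (n * m) + n (n is duplicated)
fn mul(n: Num, m: Num) -> Num {
    match m {
        Num::Zero => Num::Zero, // `n` is dropped here: the erase node
        Num::Succ(rest) => {
            let copy = n.clone(); // `n` is duplicated here: the dup node
            add(mul(n, *rest), copy)
        }
    }
}
```

The drop and the `clone` are exactly where the erase and duplicate nodes would appear in the net.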
Efficient Interaction Nets
While doing these calculations on paper, we could simply look for the active pairs in the graph. However, when implementing interaction nets in software or hardware, storing graphs and searching for active pairs is incredibly slow. This is why we'll decompose the interaction net into trees. To accomplish this, we cut any link that connects two auxiliary ports and replace the two ends with a label. As an example, we'll do this for one of the intermediate nets of the multiplication calculation.
What remains are pairs of trees. All active pairs of the interaction net are exactly where the roots of two trees meet. This means that we can efficiently store the interaction nets as trees in memory and also immediately have access to all active pairs in the net.
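One way this tree decomposition could look in code is the following simplified sketch (ours, not the actual Vine/IVM data layout): each node owns its auxiliary subtrees, cut wires become matching labels, and all active pairs live in an explicit worklist of root pairs.

```rust
// A sketch of storing an interaction net as trees.
// Each tree is rooted at a principal port; wires between two auxiliary
// ports are cut and both ends replaced by a matching label.
#[derive(Debug)]
enum Tree {
    Node { kind: &'static str, aux: Vec<Tree> }, // a node with its auxiliary subtrees
    Var(u32),                                    // one end of a cut wire
}

// Every active pair is a pair of tree roots, so instead of searching the
// whole graph we keep all redexes in a simple list.
struct Net {
    redexes: Vec<(Tree, Tree)>,
}

impl Net {
    // Fetching the next active pair is O(1); no graph search needed.
    fn next_active_pair(&mut self) -> Option<(Tree, Tree)> {
        self.redexes.pop()
    }
}
```

Because active pairs are independent, such a worklist could also be split across workers, each popping redexes in any order.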
Programming Interaction Nets
Having an efficient implementation is one thing, but programming in raw interaction nets is not really feasible. For this, we'll talk about the third property we haven't mentioned yet: linearity. This property refers to the fact that every value is used exactly once. As a matter of fact, we've already seen this property in action when we defined specific nodes to explicitly erase and duplicate numbers. This property is interesting because, as it turns out, a programming language with a linear type system can be compiled perfectly down to a linear model of computation such as interaction nets. Notably, Rust has popularized a kind of linear type system with its concept of ownership.
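Rust's ownership rules make this concrete. In the toy snippet below (ours, just for illustration), a value is consumed exactly once, and duplication and erasure must be written out explicitly, exactly like the dup and erase nodes in the nets above:

```rust
// Linearity in Rust terms: each owned value is consumed exactly once.
fn consume(s: String) -> usize {
    s.len() // `s` is moved in and dropped here: used exactly once
}

fn ownership_demo() -> usize {
    let s = String::from("net");
    let copy = s.clone(); // duplication must be explicit (like a dup node)
    let n = consume(s);   // `s` is moved; using `s` again would not compile
    drop(copy);           // erasure is explicit too (like an erase node)
    n
}
```

The compiler rejects any second use of a moved value, which is precisely the discipline a linear model of computation demands.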
While there are a couple of big caveats as to why Rust specifically can't easily be compiled to interaction nets, we're able to create a language (Vine) that is similar to Rust but compiles directly to interaction nets.
Equipped with this better tooling, let us tackle a bigger problem: the longest common subsequence problem.
A subsequence of a list is a (not necessarily contiguous) selection of items from the list. In the longest common subsequence problem, two lists are given as input and the task is to determine the length of the longest subsequence that both lists have in common. For example, the longest common subsequence between "Tendrils" and "edits" is "edis", which has length 4.
This problem is well known to admit a dynamic programming solution. If we define $L_{i,j}$ as the length of the longest common subsequence between the first $i$ elements of the first list and the first $j$ elements of the second list, we have the following recurrence:

$$L_{i,j} = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0, \\ L_{i-1,j-1} + 1 & \text{if } a_i = b_j, \\ \max(L_{i-1,j},\, L_{i,j-1}) & \text{otherwise.} \end{cases}$$
This lets us implement the algorithm in Vine. While we are interested in parallel computation, let us start with the straightforward implementation using two nested loops.
pub fn lcs[T*; Eq[T]](a: List[T], b: List[T]) -> N32 {
  let row = [];
  for a_elem in a {
    let prev_row_entry = 0;
    let prev_new_entry = 0;
    let new_row = [];
    for b_elem in b {
      let row_entry = row.pop_front().or_use(0);
      let new_entry = if a_elem == b_elem {
        prev_row_entry + 1
      } else {
        row_entry.max(prev_new_entry)
      };
      prev_row_entry = row_entry;
      prev_new_entry = new_entry;
      new_row.push_back(new_entry);
    }
    row = new_row;
  }
  row.get(b.len() - 1).or_use(0)
}
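For comparison, the same rolling-row scheme could look like this in Rust (a port we wrote for illustration; it mirrors the Vine code but is not Vine compiler output):

```rust
// Rolling-row LCS: keeps only one row of the DP table at a time.
fn lcs<T: Eq>(a: &[T], b: &[T]) -> u32 {
    let mut row = vec![0u32; b.len()];
    for a_elem in a {
        let mut prev_row_entry = 0; // L[i-1][j-1]
        let mut prev_new_entry = 0; // L[i][j-1]
        for (j, b_elem) in b.iter().enumerate() {
            let row_entry = row[j]; // L[i-1][j], read before overwriting
            let new_entry = if a_elem == b_elem {
                prev_row_entry + 1
            } else {
                row_entry.max(prev_new_entry)
            };
            prev_row_entry = row_entry;
            prev_new_entry = new_entry;
            row[j] = new_entry;
        }
    }
    *row.last().unwrap_or(&0)
}
```

On the "Tendrils"/"edits" example from earlier, this returns 4.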
Let's run this function on two random lists of length 200.
> vine run lcs.vi --breadth-first

Interactions
  Total   2_993_893
  Depth      10_920
  Breadth       274
As we can see, the program executes 2,993,893 interactions in total (the interaction-net equivalent of an instruction count).
More interestingly, we ran the program with the --breadth-first flag. This mode pretends you have an infinitely parallel computer, and in each step executes all active pairs in the net, like we did earlier. It's a very optimistic upper bound on the speedup you can get from a parallel machine. The depth of the computation is then the number of parallel steps the program needs, also known as the critical path.
As we can see, this is significantly lower than the total number of interactions. In addition, dividing the total number of interactions by the depth of the computation gives us the breadth, the average number of interactions that can be executed at each step. This is an estimate of how much parallel work the program contains.
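The breadth figure is literally the total interaction count divided by the depth; a trivial helper (ours, just to make the arithmetic explicit):

```rust
// Breadth = total interactions / depth: the average number of
// interactions available to run in parallel at each step.
fn breadth(total_interactions: u64, depth: u64) -> u64 {
    total_interactions / depth
}
```

For the run above, `breadth(2_993_893, 10_920)` rounds down to 274.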
Note that the breadth is 274, meaning that in each step, an average of 274 interactions can be performed in parallel. So although we set out to first implement the straightforward version, this is already the parallel implementation!
To see where this parallelism comes from, let us draw the dependencies of the DP table.
Notably, all entries along a diagonal slice are independent of each other and can be computed simultaneously.
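This wavefront order can be made explicit in code. The following sketch (ours, std-only Rust; a real parallel runtime would farm each diagonal out to workers) fills the DP table one anti-diagonal at a time, where every entry on a diagonal depends only on earlier diagonals:

```rust
// Wavefront LCS: all entries with the same i + j lie on one anti-diagonal
// and are independent of each other, so each inner loop could run
// entirely in parallel.
fn lcs_wavefront<T: Eq>(a: &[T], b: &[T]) -> u32 {
    let (n, m) = (a.len(), b.len());
    let mut table = vec![vec![0u32; m + 1]; n + 1];
    for d in 2..=(n + m) {
        // All (i, j) with i + j == d: independent of each other.
        for i in 1..=n.min(d - 1) {
            let j = d - i;
            if j < 1 || j > m {
                continue;
            }
            table[i][j] = if a[i - 1] == b[j - 1] {
                table[i - 1][j - 1] + 1 // only reads diagonal d - 2
            } else {
                table[i - 1][j].max(table[i][j - 1]) // only reads diagonal d - 1
            };
        }
    }
    table[n][m]
}
```

Every read in the inner loop targets an earlier diagonal, which is exactly the independence the breadth-first interaction counts are picking up on.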
To verify this further, we can observe what happens when we increase the lengths of the input lists to 2000.
Interactions
  Total   296_223_905
  Depth       108_142
  Breadth       2_739
Since this algorithm has an overall runtime of $O(n^2)$, the total number of interactions increases by ~100×. However, the depth of the computation is only ~10× the previous result. The reason is that the critical path, which is just the main diagonal of the DP table, scales linearly with input size.
Importantly, the breadth of the computation has increased by ~10×. This means there is significantly more parallel work as the input sizes grow.
The way the Vine compiler does this is not by magically extracting parallelism out of the program; it simply preserves dependency information all the way down to the hardware (or runtime, in this case). We can find very similar patterns of parallelism in e.g. option pricing algorithms.
Normal compilers already track these dependencies internally and use them to make optimizations. However, this information is then thrown away and the entire program is linearized into a sequence of instructions. Ironically, modern CPUs then try to extract this parallelism to a limited degree in silicon with out-of-order buffers and superscalar execution. Instead of that, with interaction nets, we can keep and exploit this information at the lowest level.
Interaction Net Hardware
Earlier, we claimed that the properties of interaction nets can help us overcome some of the hardest challenges in hardware architecture design. To see why, let's look at some of the modern challenges in architecture design.
One of the key design choices of CPUs is having one unified memory space. This has become a problem in recent years: since the early 2000s, single-core CPU performance has stagnated, and we've thus seen a stark rise in multi-core CPUs. But when all processors work on a single memory space, every load and store instruction is technically a global operation. Of course, modern caches hide most of this, but cache coherency protocols are the price for it; they typically scale poorly and limit the number of cores. With interaction nets, operations are local, and cores can store the relevant data in local SRAM, which does not require any cache coherence. Moreover, for the same reason that Rust's type system allows automatic memory management without a garbage collector, the linearity of interaction nets makes memory management on the chip feasible. With these properties, we can scale to thousands of cores.
Another issue with multi-core CPUs is the difficulty of programming them. Debuggability is a major issue, and entirely new classes of issues arise, such as race conditions. However, in interaction nets, all active pairs are independent, and so can actually be executed in any order. This property is known as confluence, and means that no matter the order in which things are executed, the result will be the same. This makes debugging straightforward and eliminates issues like race conditions. This allows Vine to be parallel out of the box; the programmer does not have to do anything special with threads, processes, CUDA kernels, etc. to get parallel execution.
Of course, CPUs don't only get parallelism from having multiple cores; even within a single core, there is instruction-level parallelism. However, CPUs fundamentally execute a linear sequence of instructions. To be able to exploit instruction-level parallelism, CPUs have to use significant power and die area to analyze and extract that parallelism at runtime. This quickly hits diminishing returns, especially since deeper out-of-order windows and wider execution pipelines have significant cost. That's why one typically only gets a maximum of a handful of instructions per cycle. If the hardware could truly extract all parallelism, no program would ever need multiple threads; a single thread would have the same amount of parallelism. Since interaction nets directly encode all dependencies with their graph structure, the hardware does not have to do any work to extract that parallelism; it is readily available. This eliminates huge amounts of complexity and power consumption within a core.
Moreover, interaction nets capture parallelism at all levels, while CPUs only capture very fine-grained parallelism within a single core, and very coarse-grained parallelism through threads/cores, where the gains must outweigh the large cost of managing them and the high overhead of cross-core communication and synchronization.
If you want to learn more details about what the micro-architecture, memory system, chip interconnect, and compilation process look like, reach out to us and consider applying for one of our open job positions.