Implementing a TCAM in Calyx

Published:

Introduction

This post is aimed at individuals in an introductory Network Programming Languages course. It does not assume immediate expertise in any field, but rather a solid grounding in computer science fundamentals. For this reason, I will try to cover the tools of each trade, and manifest the intersection between programming languages, hardware design, and networks.

The goal is to examine the process of designing a ternary content-addressable memory (TCAM) in Calyx, an intermediate language (IL) and infrastructure for building compilers that generate custom hardware accelerators. I will also cover the role of TCAMs in networks, using the domain-specific language (DSL) P4 as a motivating example. If any term here is lost to you, worry not; I will explain each of these in depth.

Refreshing your memory: SRAM, CAM, TCAM

Computer memory comes in many shapes and sizes. There is internal memory, external memory, read-only memory, random-access memory, static memory, dynamic memory, synchronous memory, asynchronous memory, … you get the idea. Today, we’ll discuss only three memories that are typically found in the networking realm: static random-access memory, content-addressable memory, and ternary content-addressable memory.

SRAM

Static random-access memory (SRAM) uses latching circuitry (think flip-flops) to store each bit. It is volatile, which means that it requires power to maintain any stored information. In the realm of computer architecture, SRAM is usually used for the cache. Its sister memory, dynamic random-access memory (DRAM), is cheaper and slower. It is used in the main memory of a computer.

Aside: What is a Hardware Design Language?

A Hardware Design Language (HDL) is used to describe the structure and behavior of an electronic circuit. While software programming languages tend to be procedural with support for concurrency, HDLs are naturally concurrent, so that processes can execute independently and in parallel.

Examples of HDLs are SystemVerilog and Cornell’s own PyMTL. It can be useful to see what a module for a synchronous SRAM might look like in SystemVerilog:

// An SRAM memory with width 8 and depth 16.
module sram_memory(
  input  logic       clock,
  input  logic       write_enable,
  input  logic [3:0] address,
  input  logic [7:0] data_in,
  output logic [7:0] data_out);
  
  logic [7:0] mem [15:0];
  
  always_ff @(posedge clock)
  begin
    if (write_enable)
      mem[address] <= data_in;

    data_out <= mem[address];
  end
endmodule
  • clock captures timing, and allows for synchronous events or signal drives.
  • write_enable is a 1-bit signal that, when high, will write the value data_in to address. You’ll notice this is in a always_ff process, which represents a flip-flop, i.e. the process is triggered on every positive edge of the clock.
  • address is the address into the array of SRAM.
  • data_in is where one would pass in the data to be written. Note that this will only be written to the memory if write_en is high.
  • data_out is also in the always_ff block, and thus will always hold the value at mem[address].

Thought experiment: Let’s say we wanted to take a procedural program, such as:

if cond:
  y = a1 + a2
else:
  y = b1 + b2

and accelerate its performance by running it on hardware. Given the HDL representation, what limitations may exist in terms of both ease of use and optimizations? If you’ve never used a HDL, this may be difficult to answer, but perhaps reasoning about the differences in concurrent versus procedural interfaces may help.

…stay tuned! Later, I will discuss Calyx, which addresses some of the limitations mentioned here.


CAM

While this post will mainly focus on ternary content-addressable memories (TCAMs), it is obligatory to first cover the binary content-addressable memories (CAMs or BCAMs). Recall that for random-access memory (RAM), we provide a memory address, and receive as output the data at aforementioned address. CAMs are the inverse. We provide data, and the CAM returns the address(es) associated with the data.

SRAM:

CAM:

Sometimes CAMs are compared to the software-level associative arrays:

# Mapping from data -> indices where the data is stored.
bcam = {
  "tomato": [0, 1, 2],
  "banana": [4, 5, 6],
  "orange": [7]
}

print(bcam["banana"]) # Prints [4, 5, 6].

From here on, we’ll focus on CAMs that always return a single address. While it is implementation-dependent, the CAM usually returns the lowest-valued address if multiple matches are found.

Relative to random-access memory, CAMs are faster in the searching process, but more expensive. For example, a hash-based binary match in SRAM is 6x cheaper than a TCAM in terms of silicon area 1. The input data is matched with each stored data word in parallel, which requires that each individual storage cell have a comparison circuit to determine whether the input bit matches the stored bit. This means CAMs tend to be avoided unless the searching speed is absolutely necessary.

A simple binary CAM diagram



A limitation of binary CAMs is that it requires matching on entire addresses. This means that if we want an address for every possible input data with bit width , we would need rows in our CAM! Most of the time, we instead require matching on a subset of the bits, i.e. “I just want these bits to match, I don’t really care about the rest.” And thus, the ternary content-addressable memory (TCAM) was born.

TCAM

Ternary content-addressable memories (TCAMs) allow for match bits to take on 3 different states: 1, 0, or x. In this context, x means wildcard, or “don’t care”. A comparison with the “don’t care” value x will always signal high regardless of the value of the input bit.

A simple ternary CAM diagram


Types of matching

At a high level, matching implementations generally fall under two categories: ternary or longest-prefix match (LPM). Ternary means that the “don’t care” bits may be placed anywhere within the match lines. LPM means the “don’t care” bits must not be followed by any other state. For example,

00100001 #   valid LPM, valid ternary
1110xxxx #   valid LPM, valid ternary

101100x0 # invalid LPM, valid ternary
xxxx0001 # invalid LPM, valid ternary

As you may have noticed, valid LPM match lines are a subset of valid ternary match lines, meaning any valid LPM address can also be implemented using ternary. However, the underlying hardware implementation may be more efficient if it can assume that the “don’t care” are always at the end.

Longest prefix match TCAMs

For the rest of the post, we’ll only focus on LPM TCAMs. A LPM TCAM will return the address that has the longest prefix. A prefix is defined as the number of bits that are not wildcard bits. For example,

# Here, the longest prefix is `012x`, 
# since it has a prefix of length 3.

01xx   # prefix length: 2
012x   # prefix length: 3

As mentioned before, TCAMs are expensive, and therefore only used when absolutely necessary. A prime example of this is routing on the Internet. Core routers running the Internet Protocol may need to support tables with several hundreds of thousands of entries. An IP router is required to support over 60 million lookups per second when the bit rate is 40 Gbps over Ethernet. If the bit rate is 100 Gbps, this is promoted to 150 million lookups per second 2. On top of this, we also need the ability to update table entries, since the action associated with a given IP or set of IPs may change over time.

Trade-offs

Like most tools in engineering, there isn’t a single flawless answer to designing a TCAM. Similar to how a software engineer must reason about the choice of data structure in terms of benefits and costs, so must the hardware engineer. For example, the classic TCAM implementation stores entries in decreasing order of prefix length, and upon a search, can simply choose the first entry among all the entries that match. However, this also means that the TCAM table must remain in sorted order when new entries are inserted.

A naive approach to this may result in moving entries in order to maintain the sorted table. Research has been conducted to improve this; Shah et. al’s “Fast Updating Algorithms For TCAMs” gives an improved worst case bound of , where is the bit width (e.g. 32 for IPv4, 128 for IPv6) 3.

Another approach is trading off search speed (and area) for update speed. Rather than maintaining the post-condition of sorted order upon insertion, the TCAM implementation “carries” with it the size of each prefix length to use during the searching phase.

P4: A network programming language

P4 is an open source language for controlling packet forwarding planes in network devices 4. It is a domain-specific language (DSL), meaning it supports a carefully chosen subset of what a general-purpose language such as C may support. This also ensures it is able to optimize specifically for networking purposes.

Motivation

Bottom-up…

Before P4, most network functions were determined by the underlying, fixed hardware. That means there were a set of fixed functions supported by some 3rd party vendor for a given chip, which was virtually impossible to reconfigure at runtime. If you wanted to add a new feature, this could take an extensive period of time since it would require changing the actual design of the chip.

…to top-down!

P4 aims to provide functionality in the opposite direction. Instead, the client determines what functions are necessary for their networking device. This is done with a programmable chip, configured using the P4 language. This is an extremely powerful transition: from “This is how we must process packets” to “Let me tell you how to process packets.”

A look inside the P4 architecture

Below is an image of the Protocol-Independent Switch Architecture (PISA) for the P4 language.

Taken from the P4 language tutorial.


Given that this post is on writing TCAMs, we’ll focus on the Match-Action Pipeline phase:

In the first stage, data is selected to be searched in the Match Engine. Here, data could be an IP address, or in the case of LPM, a range of IP addresses. P4 supports different Match Engines since it supports different match kinds (ternary, LPM, exact, …). For the sake of this post, we’ll assume the Match Engine is a TCAM. After a look-up in the TCAM, we’ll get an address. This address is fed into the action memory, which is an SRAM that maps each address to a specific action. The respective action is then interpreted by the action ALU. Recall that there is no guarantee about the match rules or forms within the lifetime of the P4 program. Thus, we need to ensure the state of the Match Engine is flexible, i.e. insertions may occur during the P4 runtime.

Calyx

Before delineating Calyx, we need to define hardware accelerator generator. A hardware accelerator generator takes in some high-level specification and compiles it down to hardware design. Calyx acts as an intermediate language for hardware accelerator generators.

Motivation

(Credits to Rachit Nigam for the motivating example below.)

Take a microsecond to digest this uninvolved program:

if cond:
  y = a1 + a2
else:
  y = b1 + b2

If we want to turn it into a circuit-level hardware design, what is the best possible intermediate representation (IR) to allow for optimization? Ideally, we’d like optimizations at both the hardware and software level, as well as those that intersect both domains.

SSA form

Perhaps we could use the Static Single Assignment (SSA) IR:

br %cond true_br, false_br

true_br:
  %1 = add a1, a2; 
  jmp join;

false_br:
  %2 = add b1, b1;
  jmp join;

join:
  x = phi [%2, false_br] [%1, true_br]; 

SSA form, which is used in the compiler titan LLVM provides control flow information. Given the SSA IR above, we can produce the following control flow graph (CFG):

The CFG imparts analyses and properties of a program. In our trivial example, we ultimately know that the two sums, %1 and %2, are disjoint, i.e. they will never execute at the same time. This means the underlying hardware design requires at most a single adder module. However, SSA form doesn’t tell us much about the actual specifications of the add. For example, from SSA form, we don’t really know how many cycles it could take. While adders are usually combinational, a multiply operator may be a multi-cycle pipeline, and thus we have no way to reason about this.

RTL abstraction

We can also look at the other end of the software-hardware IR spectrum. RTL is the register-transfer level abstraction which can be used to model synchronous circuits. It is commonly used in hardware description languages. I should note that this is different than software compiler design, since its typically the level that circuit designers operator on. For software compilers, it is usually the lowest level IR.

A pseudo-representation of the RTL language might be portrayed as:

register x (in, write_en);
adder add0 (a1, a2);
adder add1 (b1, b2);

assign x.in =
  cond ? add1 : add0;

assign x.write_en = 1'b1;

Here, the RTL abstraction provides us with the necessary structural constructs. We can now reason about the timing of our operators. However, it isn’t really possible to recover the control flow graph from the RTL abstraction.

Marriage of two IRs

Hopefully you see the pattern here; RTL and SSA are each missing a key piece for optimizations. This is where Calyx comes in; it is the union of these two representations. Calyx sits between high-level DSLs (think P4) and circuit-level IRs (such as SystemVerilog). Its goal is to provide both high-level control information and low-level structural facts.

Doing so will allow a new class of hardware optimizations such as resource sharing 5. Resource sharing is exactly what our example motivates. With both a control flow graph and structural constructs, we can now safely ensure that only a single adder module is required for our trivial program.

The most up-to-date documentation for Calyx can be found here. A language tutorial within the documentation is found here.

Aside: Constant literals in Calyx

Numbers in Calyx are represented in the following manner (similar to Verilog):

<width>'<representation><value>

To represent the value 3 with bit width 4:

4'd3    // Decimal
4'h3    // Hexadecimal
4'b0011 // Binary

Walking through our example in Calyx

Since this post uses Calyx to implement a TCAM, I will quickly go over a Calyx program for completeness. Note that the Calyx language is rapidly improving; You should refer to the documentation for the most up-to-date information. This was written to work with the Calyx language at git hash 79825bb.

Calyx encapsulates programs with components. A component consists of 3 sections:

  • cells: The cells are the sub-components used within the component.
  • wires: The wires are the connections between the sub-components.
  • control: The control is the execution schedule or control flow of the component.

This can be seen more contretely if we take our example program (changed slightly to account for an actual conditional):

 if x < 42:
   y = a1 + a2
 else:
   y = b1 + b2

and translate it to Calyx:

Calyx component
component main() -> () {
  cells {
    add0 = std_add(32);     // 32 bit adder.
    add1 = std_add(32);
    
    y = std_reg(32);        // 32-bit registers;  
    x = std_reg(32);
    a1 = std_reg(32); 
    a2 = std_reg(32);
    b1 = std_reg(32);
    b2 = std_reg(32);

    lt = std_lt(32);        // 32-bit less than comparator.
    c = std_const(32, 42);  // 32-bit constant with value `42`.
  }
  wires {
    group true_branch {
      add0.left = a1.out;
      add0.right = a2.out;
      y.write_en = 1'b1;    // Writing to `y` requires driving
      y.in = add0.out;      // the `write_en` signal.
      true_branch[done] = y.done;
    }
    group false_branch {
      add1.left = b1.out;
      add1.right = b2.out;
      y.write_en = 1'b1;
      y.in = add1.out;
      false_branch[done] = y.done;
    }
    group cond {
      lt.left = x.out;
      lt.right = c.out;
      cond[done] = 1'b1; 
    }
  }
  control {
    if lt.out with cond {    //   if x < 42:
      true_branch;           //     y = a1 + a2
    } else {                 //   else:
      false_branch;          //     y = b1 + b2
    }
  }
}


In the component above, we’ve created three groups. Let’s pick the true_branch group:

group true_branch {
  add0.left = a1.out;
  add0.right = a2.out;
  y.write_en = 1'b1;
  y.in = add0.out;
  true_branch[done] = y.done;
}

It represents what we see in the true branch of the code. We take the values in registers a1 and a2 as the operands to our adder add0, and write the output to register y.

We execute this group in the control:

if lt.out with cond {
  true_branch;
} else {
  false_branch;
}

How does the execution flow know when the group true_branch has “returned,” or fulfilled its task? Using guard ports. From this group, we have: true_branch[done] = y.done, which essentially says that the true_branch is done when the write to y is complete.

Calyx provides other control operators as well. For example, if we instead wanted to execute two groups g1 and g2 in parallel:

control {
  par { g1; g2; }
}

or sequentially:

control {
  seq { g1; g2; }
}

While this may be enough to grease the groove, the language tutorial and documentation mentioned above will give a much more profound look into the language and its capabilities.

Implementing a TCAM in Calyx

We’ve now provided a foundation to really dig into what this post is all about, building a TCAM in Calyx.

Overview: Comparator Network LPM

The TCAM implemented will be strictly based upon “TCAM-based High Speed Longest Prefix Matching with Fast Incremental Updates” 6. This paper acknowledges that insertion with standard TCAM implementations either requires software sorting algorithms or extensive modifications to the table itself. It avoids this by taking into account the prefix lengths of each entry in the table during the searching process. The paper compares their work with two other TCAMs. First, a standard implementation that uses a priority encoder to determine the longest prefix length (i.e. the prefixes must be in sorted order). Second, it is compared to the G-LPM proposed by Gamache et al., which looks at prefix lengths locally in a step-by-step manner to eventually discover a global winner 7. Below, I’ve briefly summarized the CN-LPM implementation.

The good

  • Removes the insertion post-condition of sorting by length. As a result, updates to the table should be considerably faster.
  • Enables full utilization of the TCAM address space; this isn’t always possible for some TCAM implementations.
  • Provides a scheme that simplifies expansion to larger routing tables.

The bad

  • Requires significantly more area on a FPGA. For 1,024 match lines, just to store the prefix lengths, it is a ~770% increase in the number of Adaptive Look-up Tables (ALUTs) compared to a standard priority encoder implementation.

  • Each Comparator Element requires additional hardware (N-bit 2-to-1 multiplexer, N-bit comparator, …).
  • When the prefix lengths are already inserted in sorted order, the CN-LPM is approximately 56% slower than a standard priority encoder implementation.

The ugly

  • No performance metrics for TCAM insertions are provided for the three implementations. It would be interesting to see how their performances compare, given that the CN-LPM was principally designed to reduce the insertion time.

The Calyx component interface

A good starting point when designing a new component is the interface. The TCAM will provide two operations, write and search. For simplicity, we’ll assume that the provided prefixes will have bit width 8. This is contrary to standard implementations for IPv4 or IPv6, which use 32-bit and 128-bit prefixes respectively. Here is the component interface:

component CNLPM_TCAM(write_en: 1, search_en: 1, in: 8, prefix_len: 4, write_index: 3, rdone: 1) 
                     -> (index: 8, rwrite_en: 1) {
  cells {}
  wires {}
  control {}
}

We provide two input ports, write_en and search_en which, when set to 1'b1, will trigger the control for write or search respectively; these should never both be set high in the same invocation. The in port will take in either the prefix we want for matching or the prefix we are searching, depending on which action is being undertaken. The prefix_len and write_index are only used when writing to the TCAM. Lastly, we provide ports rdone, index, and rwrite_en to write the index from the search to a register. Not every port is necessary for each function. We can use a table to more succinctly describe which ports are used for each operation:

WriteSearch
write_ensearch_en
inin
prefix_lenrdone
write_indexrwrite_en
 index

For a concrete example, let’s say we want to write the address 11000000 with prefix length 3 to TCAM address 6. We could re-write this with wildcard bits as well: 110xxxxx

tcam = CNLPM_TCAM();
...
invoke tcam(write_en=1'b1, write_index=3'd6, in=8'b11000000, prefix_len=4'd4)();

Similarly, we can search for the same address in the following manner:

r = std_reg(3);
...
invoke tcam(search_en=1'b1, in=8'b11000000, rdone=r.done)(rwrite_en=r.write_en, index=r.in);
// Register `r` has the value `6` written to it.

You can think of the invoke control as a way to “call” a component. In other words, it will trigger the execution of a given component, and “return” when the invoked component’s done signal is high.

At a high level, writing to the TCAM is relatively simple, since we’re provided all the pieces (data, index, and prefix length). We’ll instead cover the stages of the TCAM search operation. Under the CN-LPM paper, this is split up into two separate entities: matching and comparison. In the matching phase, we need to determine which prefixes match the input data. This is accompanied by either a high or low signal indicating the validity of the match; this is referred to as the match line. The comparison phase is split up in levels, as seen in the following diagram:

It takes in the match lines, addresses, and prefix lengths of A and B, and returns the corresponding values of the address with the longest prefix. For modularity, we’ll implement each phase as sub-components, which we can then use in our main component.

Match Element

The Match Element determines whether the prefix and prefix length of a given TCAM index matches the in data. The paper does not provide a circuit diagram on how they conduct this, but I imagine they simply compare the first N - length bits of in and the given address, and then assign the corresponding match line as the AND of all these 1-bit comparisons. This can be done combinationally. For the Calyx implementation, we’ll take a slightly different approach that achieves the same goal. Instead, we’ll right shift the prefix and in to remove any “don’t care” bits at the end, and then execute an 8-bit equality comparison. The match_line is then written with the output signal of the comparator.

Calyx Match Element
// Sets the `match_line` signal to high if the value `in` matches the
// `prefix` with the given `length`. For example, given
//     in = 5'b11000, prefix = 5'b11001, length = 2:
//     our prefix-match may be represented as `11xxx`, where `x`
//     is a "don't care" bit. Since `in` does indeed match the
//     prefix `11xxx`, the `match_line` signal is set to high.
//
// Note: the length is an N-bit value,
//       which means length `0` really means length `1`,
//       and more generally, `N-1` length represents `N`.
//       The zero case is caught earlier so that
//       we only need N bits to represent length.
component match_element(in: 8, prefix: 8, length: 3, ml_done: 1) -> (match_line: 1, write_en: 1) {
  cells {
    sub = std_sub(3);
    pad = std_pad(3, 8);
    rsh0 = std_rsh(8);
    rsh1 = std_rsh(8);
    eq  = std_eq(8);
  }
  wires {
    // Note: A simpler way to do this would be to instead split off and compare the 
    // first `8 - length` bits to avoid the right shifts. However, Calyx currently 
    // doesn't have support for bit splitting in this form.
    group compare<"static"=1> {
      sub.left = 3'd7;
      sub.right = length;
      pad.in = sub.out;

      rsh0.left = in;
      rsh0.right = pad.out;
      rsh1.left = prefix;
      rsh1.right = pad.out;
      eq.left = rsh0.out;
      eq.right = rsh1.out;

      write_en = 1'b1;
      match_line = eq.out;
      compare[done] = ml_done;
    }
  }
  control {
    compare;
  }
}


When looking at the Match Element component, you may have noticed that the prefix length was only 3 bits. However, given that the range of the length is , it requires 4 bits to represent all possible values. The zero case is short-circuited before the matching phase, so that we are guaranteed: if length has value N, its actual value is N + 1. This allows us to store each length with 3 bits rather than 4 bits.

Comparator Element

Next, we need to design the Comparator Element. No matter what stage we are in, a Comparator Element’s interface remains the same. The CN-LPM paper provides the following diagram to concretely illustrate a Comparator Element:

Given two addresses and , it takes in:

  1. (length of A)
  2. (length of B)
  3. (match line of A)
  4. (match line of B)
  5. (address of A)
  6. (address of B)

We’ll provide wires to intermediary registers to save the values of the “winning” address, i.e. the address that is (1) valid and (2) has the longest length. These will be stored in:

  1. (length of winning address)
  2. (address of winning address)
  3. (OR of the A and B match lines)
Calyx Comparator Element
// Given two addresses A and B with their lengths and match line, sets the `X` ports
// with the address and length of the maximum length (valid) prefix match. The output
// match line is set as the logical OR of the match lines from A and B. For example,
//
//   lenA: 4, mlA: 1, addrA: 0
//   lenB: 3, mlB: 1, addrB: 1
//   Since lenA > lenB and the match line of A is high,
//   `lenX` = 4, `mlX` = 1, and `addrX` = 0.
component comparator_element(lenA: 3, lenB: 3, addrA: 3, addrB: 3, mlA: 1, mlB: 1, len_done: 1, addr_done: 1, ml_done: 1) ->
                            (lenX: 3, addrX: 3, mlX: 1, len_write_en: 1, addr_write_en: 1, ml_write_en: 1) {
  cells {
    gt0 = std_gt(3);
    or0 = std_or(1);
    or1 = std_or(1);
    not0 = std_not(1);
    and0 = std_and(1);
  }
  wires {
    group select<"static"=0> {
      gt0.left = lenA;
      gt0.right = lenB;
      not0.in = mlB;
      or0.left = not0.out;
      or0.right = gt0.out;
      and0.left = mlA;
      and0.right = or0.out;
      select[done] = 1'b1;
    }
    group A<"static"=1> {
      len_write_en = 1'b1;
      addr_write_en = 1'b1;
      lenX = lenA;
      addrX = addrA;
      A[done] = len_done & addr_done ? 1'b1;
    }
    group B<"static"=1> {
      len_write_en = 1'b1;
      addr_write_en = 1'b1;
      lenX = lenB;
      addrX = addrB;
      B[done] = len_done & addr_done ? 1'b1;
    }
    group or_match_line<"static"=1> {
      or1.left = mlA;
      or1.right = mlB;
      ml_write_en = 1'b1;
      mlX = or1.out;
      or_match_line[done] = ml_done;
    }
  }
  control {
    par {
      if and0.out with select { A; } else { B; }
      or_match_line;
    }
  }
}


In the Comparator Element, we can compute the selection of the correct address and the OR of the match lines in parallel, i.e.

control {
  par {
    if and0.out with select { A; } else { B; }
    or_match_line;
  }
}

Since each of these compute in a single cycle (because we’re writing to a register), the entire sub-component takes exactly one cycle to compute. Again, the comparator network is conducted in stages, so that for an 8-bit address, we’ll need 4 Comparator Elements in the first stage, 2 in the second stage, and 1 in the final stage.

Next, we’ll begin putting together our main CN-LPM TCAM component. This is split into the 3 sections of a Calyx component: cells, wire, and control.

Cells

The primitives in the cells section are used throughout the groups in the wires section. For example, we’ll need registers between stages so that the control is “pipeline-able”. Note that Calyx does not have first-class support for pipelines yet, so the execution flow will assume that values are passed in a strictly sequential pattern, e.g. invocation 2 is stalled until invocation 1 is complete.

We also need to create instances of our Match Element and Comparator Element components. Given a component X, we can construct an instance in the following manner: x0 = X();. We’ll instantiate exactly 8 Match Elements (for each line) and 4 + 2 + 1 Comparator Elements.

Calyx CN-LPM Cells
  cells {
    // Prefixes
    p0 = std_reg(8);
    p1 = std_reg(8);
    p2 = std_reg(8);
    p3 = std_reg(8);
    p4 = std_reg(8);
    p5 = std_reg(8);
    p6 = std_reg(8);
    p7 = std_reg(8);
    l0 = std_reg(3);
    l1 = std_reg(3);
    l2 = std_reg(3);
    l3 = std_reg(3);
    l4 = std_reg(3);
    l5 = std_reg(3);
    l6 = std_reg(3);
    l7 = std_reg(3);

    // Used when determining correct index for writing.
    is_index0 = std_eq(3);
    is_index1 = std_eq(3);
    is_index2 = std_eq(3);
    is_index3 = std_eq(3);
    is_index4 = std_eq(3);
    is_index5 = std_eq(3);
    is_index6 = std_eq(3);
    is_index7 = std_eq(3);

    // Stores the address for writes that
    // have a prefix length of zero.
    zero_index = std_reg(3);

    // Used to determine the invocation operation.
    w_eq = std_eq(1);
    s_eq = std_eq(1);
    z_eq = std_eq(4);

    // Used when verifying the validity of the
    // final comparator stage's match line.
    is_invalid = std_eq(1);

    // Used for subtracting one from the length
    // so that lengths can be stored with N-1 bits.
    slice = std_slice(4, 3);
    sub = std_sub(4);

    me0 = match_element();
    me1 = match_element();
    me2 = match_element();
    me3 = match_element();
    me4 = match_element();
    me5 = match_element();
    me6 = match_element();
    me7 = match_element();

    // Used to store the match line 
    // comparisons.
    ml0 = std_reg(1);
    ml1 = std_reg(1);
    ml2 = std_reg(1);
    ml3 = std_reg(1);
    ml4 = std_reg(1);
    ml5 = std_reg(1);
    ml6 = std_reg(1);
    ml7 = std_reg(1);

    // Each comparator element will store
    // an intermediary length, address and 
    // match line.
    ce00 = comparator_element();
    len00 = std_reg(3);
    addr00 = std_reg(3);
    ml00 = std_reg(1);

    ce01 = comparator_element();
    len01 = std_reg(3);
    addr01 = std_reg(3);
    ml01 = std_reg(1);

    ce02 = comparator_element();
    len02 = std_reg(3);
    addr02 = std_reg(3);
    ml02 = std_reg(1);

    ce03 = comparator_element();
    len03 = std_reg(3);
    addr03 = std_reg(3);
    ml03 = std_reg(1);

    ce10 = comparator_element();
    len10 = std_reg(3);
    addr10 = std_reg(3);
    ml10 = std_reg(1);

    ce11 = comparator_element();
    len11 = std_reg(3);
    addr11 = std_reg(3);
    ml11 = std_reg(1);

    ce20 = comparator_element();
    len20 = std_reg(3);
    out_index = std_reg(3);
    ml20 = std_reg(1);
  }

Wires

Next, the wires have the groups necessary for the control to execute the write and search phases. Above each group, I’ve put a short description to indicates its role.

Calyx CN-LPM Wires
wires {
    // Checks whether this invocation is a `write`.
    group is_write_enabled<"static"=0> {
      w_eq.left = write_en;
      w_eq.right = 1'b1;
      is_write_enabled[done] = 1'b1;
    }

    // Determine if the `write` has a prefix length of zero.
    // This means that any input may match on the value, i.e.
    // it is the default case.
    group is_length_zero<"static"=0> {
      z_eq.left = 4'd0;
      z_eq.right = prefix_len;
      is_length_zero[done] = 1'b1;
    }

    // Checks whether this invocation is a `search`.
    group is_search_enabled<"static"=0> {
      s_eq.left = search_en;
      s_eq.right = 1'b1;
      is_search_enabled[done] = 1'b1;
    }
    
    // Write the index for the default case to
    // the `zero_index` register.
    group write_zero<"static"=1> {
      zero_index.write_en = 1'b1;
      zero_index.in = write_index;
      write_zero[done] = zero_index.done;
    }
   
    // Since we want to store all lengths
    // with 3 bits, first subtract 1 from the
    // length. Pre-condition: The length is
    // non-zero. Then, write the prefix
    // and length to the TCAM's memory.
    group write0<"static"=1> {
      p0.write_en = write_en;
      l0.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l0.in = slice.out;
      p0.in = in;
      write0[done] = p0.done & l0.done ? 1'd1;
    }
    group write1<"static"=1> {
      p1.write_en = write_en;
      l1.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l1.in = slice.out;
      p1.in = in;
      write1[done] = p1.done & l1.done ? 1'd1;
    }
    group write2<"static"=1> {
      p2.write_en = write_en;
      l2.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l2.in = slice.out;
      p2.in = in;
      write2[done] = p2.done & l2.done ? 1'd1;
    }
    group write3<"static"=1> {
      p3.write_en = write_en;
      l3.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l3.in = slice.out;
      p3.in = in;
      write3[done] = p3.done & l3.done ? 1'd1;
    }
    group write4<"static"=1> {
      p4.write_en = write_en;
      l4.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l4.in = slice.out;
      p4.in = in;
      write4[done] = p4.done & l4.done ? 1'd1;
    }
    group write5<"static"=1> {
      p5.write_en = write_en;
      l5.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l5.in = slice.out;
      p5.in = in;
      write5[done] = p5.done & l5.done ? 1'd1;
    }
    group write6<"static"=1> {
      p6.write_en = write_en;
      l6.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l6.in = slice.out;
      p6.in = in;
      write6[done] = p6.done & l6.done ? 1'd1;
    }
    group write7<"static"=1> {
      p7.write_en = write_en;
      l7.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l7.in = slice.out;
      p7.in = in;
      write7[done] = p7.done & l7.done ? 1'd1;
    }
    
    // Find the correct registers to write the
    // given prefix and prefix length.
    group find_write_index<"static"=0> {
      is_index0.left = 3'd0;
      is_index1.left = 3'd1;
      is_index2.left = 3'd2;
      is_index3.left = 3'd3;
      is_index4.left = 3'd4;
      is_index5.left = 3'd5;
      is_index6.left = 3'd6;
      is_index7.left = 3'd7;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index2.right = write_index;
      is_index3.right = write_index;
      is_index4.right = write_index;
      is_index5.right = write_index;
      is_index6.right = write_index;
      is_index7.right = write_index;
      find_write_index[done] = 1'd1;
    }

    // A final check to verify whether any of the match 
    // lines had a valid match after the search phase
    // is complete.
    group validity<"static"=0> {
      is_invalid.left = ml20.out;
      is_invalid.right = 1'd0;
      validity[done] = 1'b1;
    }
 
    // Pull out the index provided in the `zero_index`.
    group default_to_zero_length_index<"static"=1> {
      rwrite_en = 1'b1;
      index = zero_index.out;
      default_to_zero_length_index[done] = rdone;
    }

    // Save the final index to the register hooked
    // up to the CN-LPM TCAM component.
    group save_index<"static"=1> {
      rwrite_en = 1'b1;
      index = out_index.out;
      save_index[done] = rdone;
    }
}

Control

Finally, we need to determine the control of our TCAM component. Recall that the control establishes the execution flow. For example, if we want to execute two groups A, B in sequence, we could use the control: par { A; B; }. Since we’ve established as a pre-condition that writes and searches will never be done simultaneously, we can safely run the search and write phases in parallel without worrying about concurrency issues:

par {
  // Execute the write.
  if w_eq.out with is_write_enabled { ... }
  // Execute the search.
  if s_eq.out with is_search_enabled { ... }
}

The next step is to fill in the bodies of each phase. For the write phase, we have two cases:

  1. The prefix_len provided is zero. We then store its value in a special register named zero_index.
  2. Otherwise, we subtract one from the prefix length and write the data and length to the corresponding register in the TCAM memory. To determine the correct pair of registers to write to, we can use a combinational equality comparison with the write_index provided.

In Calyx control language, this would look like:

if w_eq.out with is_write_enabled {
  if z_eq.out with is_length_zero { 
    write_zero;
  } else {
    // Only one of these will be true.
    par {
      if is_index0.out  with find_write_index { write0; }
      if is_index1.out  with find_write_index { write1; }
      if is_index2.out  with find_write_index { write2; }
      if is_index3.out  with find_write_index { write3; }
      if is_index4.out  with find_write_index { write4; }
      if is_index5.out  with find_write_index { write5; }
      if is_index6.out  with find_write_index { write6; }
      if is_index7.out  with find_write_index { write7; }
    }
  }
}

For the search phase, we need to first execute the Match Elements (in parallel), and then execute the Comparator Elements. Finally, we need to determine whether the final match value is valid. If yes, return the index provided in our final comparator stage. Otherwise, return the value stored at the zero_index. The code follows as:

seq {
  // Match Elements.
  par { me0; me1; me2; me3; me4; me5; me6; me7; }

  // Comparator Elements.
  par { ce00; ce01; ce02; ce03; } // Stage 1
  par { ce10; ce11; }             // Stage 2
  par { ce20; }                   // Stage 3

  if is_invalid.out with validity { 
    // The final match line is invalid. Write the address provided in `zero_index`.
    default_to_zero_length_index; 
  } else { 
    // Otherwise, write the address provided in the final stage of comparator network.
    save_index;
  }
}

This is almost correct. Recall that the Match Elements and Comparator Elements are Calyx components, instead of groups. Calyx components must be invoked, so that the control knows when it is done, and can move on to the next execute. For example, let’s say we have the following component interface:

component X(in: 32, rdone: 1) -> (out: 32, rwrite_en: 1) { ... }

Then, we could invoke the component using the following control:

x = X();
r = std_reg(32);

...

control {
  // Save the output `out` to register `r`.
  invoke x(in=32'd1, rdone=r.done)(out=r.in, rwrite_en=r.write_en);
}

In the same way, we need to (1) invoke each Match Element and Comparator Element instance, and (2) hook up the correct wires. The execution flow for our 8-bit CN-LPM is provided below with both the search and match phases.

Calyx CN-LPM Control
  control {
    par {
      if w_eq.out with is_write_enabled {
        if z_eq.out with is_length_zero {
          write_zero;
        } else {
          par {
            if is_index0.out  with find_write_index { write0; }
            if is_index1.out  with find_write_index { write1; }
            if is_index2.out  with find_write_index { write2; }
            if is_index3.out  with find_write_index { write3; }
            if is_index4.out  with find_write_index { write4; }
            if is_index5.out  with find_write_index { write5; }
            if is_index6.out  with find_write_index { write6; }
            if is_index7.out  with find_write_index { write7; }
          }
        }
      }
      if s_eq.out with is_search_enabled {
        seq {
          par {
            invoke me0(in=in, prefix=p0.out, length=l0.out, ml_done=ml0.done)(match_line=ml0.in, write_en=ml0.write_en);
            invoke me1(in=in, prefix=p1.out, length=l1.out, ml_done=ml1.done)(match_line=ml1.in, write_en=ml1.write_en);
            invoke me2(in=in, prefix=p2.out, length=l2.out, ml_done=ml2.done)(match_line=ml2.in, write_en=ml2.write_en);
            invoke me3(in=in, prefix=p3.out, length=l3.out, ml_done=ml3.done)(match_line=ml3.in, write_en=ml3.write_en);
            invoke me4(in=in, prefix=p4.out, length=l4.out, ml_done=ml4.done)(match_line=ml4.in, write_en=ml4.write_en);
            invoke me5(in=in, prefix=p5.out, length=l5.out, ml_done=ml5.done)(match_line=ml5.in, write_en=ml5.write_en);
            invoke me6(in=in, prefix=p6.out, length=l6.out, ml_done=ml6.done)(match_line=ml6.in, write_en=ml6.write_en);
            invoke me7(in=in, prefix=p7.out, length=l7.out, ml_done=ml7.done)(match_line=ml7.in, write_en=ml7.write_en);
          }
          par {
            invoke ce00(lenA=l0.out, lenB=l1.out, addrA=3'd0, addrB=3'd1, mlA=ml0.out, mlB=ml1.out, len_done=len00.done, addr_done=addr00.done, ml_done=ml00.done)(lenX=len00.in, addrX=addr00.in, mlX=ml00.in, len_write_en=len00.write_en, addr_write_en=addr00.write_en, ml_write_en=ml00.write_en);
            invoke ce01(lenA=l2.out, lenB=l3.out, addrA=3'd2, addrB=3'd3, mlA=ml2.out, mlB=ml3.out, len_done=len01.done, addr_done=addr01.done, ml_done=ml01.done)(lenX=len01.in, addrX=addr01.in, mlX=ml01.in, len_write_en=len01.write_en, addr_write_en=addr01.write_en, ml_write_en=ml01.write_en);
            invoke ce02(lenA=l4.out, lenB=l5.out, addrA=3'd4, addrB=3'd5, mlA=ml4.out, mlB=ml5.out, len_done=len02.done, addr_done=addr02.done, ml_done=ml02.done)(lenX=len02.in, addrX=addr02.in, mlX=ml02.in, len_write_en=len02.write_en, addr_write_en=addr02.write_en, ml_write_en=ml02.write_en);
            invoke ce03(lenA=l6.out, lenB=l7.out, addrA=3'd6, addrB=3'd7, mlA=ml6.out, mlB=ml7.out, len_done=len03.done, addr_done=addr03.done, ml_done=ml03.done)(lenX=len03.in, addrX=addr03.in, mlX=ml03.in, len_write_en=len03.write_en, addr_write_en=addr03.write_en, ml_write_en=ml03.write_en);
          }
          par {
            invoke ce10(lenA=len00.out, lenB=len01.out, addrA=addr00.out, addrB=addr01.out, mlA=ml00.out, mlB=ml01.out, len_done=len10.done, addr_done=addr10.done, ml_done=ml10.done)(lenX=len10.in, addrX=addr10.in, mlX=ml10.in, len_write_en=len10.write_en, addr_write_en=addr10.write_en, ml_write_en=ml10.write_en);
            invoke ce11(lenA=len02.out, lenB=len03.out, addrA=addr02.out, addrB=addr03.out, mlA=ml02.out, mlB=ml03.out, len_done=len11.done, addr_done=addr11.done, ml_done=ml11.done)(lenX=len11.in, addrX=addr11.in, mlX=ml11.in, len_write_en=len11.write_en, addr_write_en=addr11.write_en, ml_write_en=ml11.write_en);
          }
          invoke ce20(lenA=len10.out, lenB=len11.out, addrA=addr10.out, addrB=addr11.out, mlA=ml10.out, mlB=ml11.out, len_done=len20.done, addr_done=out_index.done, ml_done=ml20.done)(lenX=len20.in, addrX=out_index.in, mlX=ml20.in, len_write_en=len20.write_en, addr_write_en=out_index.write_en, ml_write_en=ml20.write_en);
          if is_invalid.out with validity { default_to_zero_length_index; } else { save_index; }
        }
      }
    }
  }

Putting it together: a demonstration

Now that we’ve finished the 3 sections of the TCAM component, we can simply place everything together to get our final component. Below is the entire component (with the sub-component implementations as well).

Calyx 8-bit CN-LPM
// An implementation of Comparator Network Longest Prefix Match (CN-LPM), as described in:
// "TCAM-based high speed longest prefix matching with fast incremental table updates"
// Rasmussen et al. (2013) [https://ieeexplore.ieee.org/document/6602288]

// Sets the `match_line` signal to high if the value `in` matches the
// `prefix` with the given `length`. For example, given
//     in = 5'b11000, prefix = 5'b11001, length = 2:
//     our prefix-match may be represented as `11xxx`, where `x`
//     is a "don't care" bit. Since `in` does indeed match the
//     prefix `11xxx`, the `match_line` signal is set to high.
//
// Note: the length is an N-bit value,
//       which means length `0` really means length `1`,
//       and more generally, `N-1` length represents `N`.
//       The zero case is caught earlier so that
//       we only need N bits to represent length.
component match_element(in: 8, prefix: 8, length: 3, ml_done: 1) -> (match_line: 1, write_en: 1) {
  cells {
    sub = std_sub(3);
    pad = std_pad(3, 8);
    rsh = std_rsh(8);
    eq  = std_eq(8);
  }
  wires {
    group compare<"static"=1> {
      sub.left = 3'd7;
      sub.right = length;
      pad.in = sub.out;

      rsh.left = in;
      rsh.right = pad.out;

      eq.left = rsh.out;
      eq.right = prefix;

      write_en = 1'b1;
      match_line = eq.out;
      compare[done] = ml_done;
    }
  }
  control {
    compare;
  }
}

// Given two addresses A and B with their lengths and match line, sets the `X` ports
// with the address and length of the maximum length (valid) prefix match. The output
// match line is set as the logical OR of the match lines from A and B. For example,
//
//   lenA: 4, mlA: 1, addrA: 0
//   lenB: 3, mlB: 1, addrB: 1
//   Since lenA > lenB and the match line of A is high,
//   `lenX` = 4, `mlX` = 1, and `addrX` = 0.
component comparator_element(lenA: 3, lenB: 3, addrA: 3, addrB: 3, mlA: 1, mlB: 1, len_done: 1, addr_done: 1, ml_done: 1) ->
                            (lenX: 3, addrX: 3, mlX: 1, len_write_en: 1, addr_write_en: 1, ml_write_en: 1) {
  cells {
    gt0 = std_gt(3);
    or0 = std_or(1);
    or1 = std_or(1);
    not0 = std_not(1);
    and0 = std_and(1);
  }
  wires {
    group select<"static"=0> {
      gt0.left = lenA;
      gt0.right = lenB;
      not0.in = mlB;
      or0.left = not0.out;
      or0.right = gt0.out;
      and0.left = mlA;
      and0.right = or0.out;
      select[done] = 1'b1;
    }
    group A<"static"=1> {
      len_write_en = 1'b1;
      addr_write_en = 1'b1;
      lenX = lenA;
      addrX = addrA;
      A[done] = len_done & addr_done ? 1'b1;
    }
    group B<"static"=1> {
      len_write_en = 1'b1;
      addr_write_en = 1'b1;
      lenX = lenB;
      addrX = addrB;
      B[done] = len_done & addr_done ? 1'b1;
    }
    group or_match_line<"static"=1> {
      or1.left = mlA;
      or1.right = mlB;
      ml_write_en = 1'b1;
      mlX = or1.out;
      or_match_line[done] = ml_done;
    }
  }
  control {
    par {
      if and0.out with select { A; } else { B; }
      or_match_line;
    }
  }
}

// A ternary content-addressable memory (TCAM) for IPv4, implemented
// using CN-LPM. One can either write to the TCAM or search the TCAM.
// Each stage in the TCAM is separated by registers to allow
// pipelining in the future.
//
// To write to the TCAM:
// 1. Set the `write_en` signal high.
// 2. Provide an `in` value and a corresponding `prefix_len`.
//    For example, if you want to represent the prefix-match
//    `101xx`, you'd pass in `0b'00101` with prefix_len `3`.
//
// To search the TCAM:
// 1. Set the `search_en` signal high.
// 2. Provide an `in` value indicating the data you're searching.
// 3. The signal is set to `index`, and may be saved to a register
//    using the `rdone` and `rwrite_en` signals.
//
// The `write_en` and `search_en` signals should NOT be set to high
// in the same invocation. If a zero-length prefix is written to the TCAM,
// then invalid searches will be set to the corresponding index. Otherwise,
// invalid searches will be defaulted to index zero.
component CNLPM_TCAM(write_en: 1, search_en: 1, in: 8, prefix_len: 4, write_index: 3, rdone: 1) -> (index: 3, rwrite_en: 1) {
  cells {
    p0 = std_reg(8);
    p1 = std_reg(8);
    p2 = std_reg(8);
    p3 = std_reg(8);
    p4 = std_reg(8);
    p5 = std_reg(8);
    p6 = std_reg(8);
    p7 = std_reg(8);
    l0 = std_reg(3);
    l1 = std_reg(3);
    l2 = std_reg(3);
    l3 = std_reg(3);
    l4 = std_reg(3);
    l5 = std_reg(3);
    l6 = std_reg(3);
    l7 = std_reg(3);

    is_index0 = std_eq(3);
    is_index1 = std_eq(3);
    is_index2 = std_eq(3);
    is_index3 = std_eq(3);
    is_index4 = std_eq(3);
    is_index5 = std_eq(3);
    is_index6 = std_eq(3);
    is_index7 = std_eq(3);

    zero_index = std_reg(3);
    w_eq = std_eq(1);
    s_eq = std_eq(1);
    z_eq = std_eq(4);
    is_invalid = std_eq(1);

    slice = std_slice(4, 3);
    sub = std_sub(4);

    me0 = match_element();
    me1 = match_element();
    me2 = match_element();
    me3 = match_element();
    me4 = match_element();
    me5 = match_element();
    me6 = match_element();
    me7 = match_element();
    ml0 = std_reg(1);
    ml1 = std_reg(1);
    ml2 = std_reg(1);
    ml3 = std_reg(1);
    ml4 = std_reg(1);
    ml5 = std_reg(1);
    ml6 = std_reg(1);
    ml7 = std_reg(1);

    ce00 = comparator_element();
    len00 = std_reg(3);
    addr00 = std_reg(3);
    ml00 = std_reg(1);

    ce01 = comparator_element();
    len01 = std_reg(3);
    addr01 = std_reg(3);
    ml01 = std_reg(1);

    ce02 = comparator_element();
    len02 = std_reg(3);
    addr02 = std_reg(3);
    ml02 = std_reg(1);

    ce03 = comparator_element();
    len03 = std_reg(3);
    addr03 = std_reg(3);
    ml03 = std_reg(1);

    ce10 = comparator_element();
    len10 = std_reg(3);
    addr10 = std_reg(3);
    ml10 = std_reg(1);

    ce11 = comparator_element();
    len11 = std_reg(3);
    addr11 = std_reg(3);
    ml11 = std_reg(1);

    ce20 = comparator_element();
    len20 = std_reg(3);
    out_index = std_reg(3);
    ml20 = std_reg(1);
  }
  wires {
    // Checks whether this invocation is a `write`.
    group is_write_enabled<"static"=0> {
      w_eq.left = write_en;
      w_eq.right = 1'b1;
      is_write_enabled[done] = 1'b1;
    }

    // Determine if the `write` has a prefix length of zero.
    // This means that any input may match on the value, i.e.
    // it is the default case.
    group is_length_zero<"static"=0> {
      z_eq.left = 4'd0;
      z_eq.right = prefix_len;
      is_length_zero[done] = 1'b1;
    }

    // Checks whether this invocation is a `search`.
    group is_search_enabled<"static"=0> {
      s_eq.left = search_en;
      s_eq.right = 1'b1;
      is_search_enabled[done] = 1'b1;
    }
    
    // Write the index for the default case to
    // the `zero_index` register.
    group write_zero<"static"=1> {
      zero_index.write_en = 1'b1;
      zero_index.in = write_index;
      write_zero[done] = zero_index.done;
    }
   
    // Since we want to store all lengths
    // with 3 bits, first subtract 1 from the
    // length. Pre-condition: The length is
    // non-zero. Then, write the prefix
    // and length to the TCAM's memory.
    group write0<"static"=1> {
      p0.write_en = write_en;
      l0.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l0.in = slice.out;
      p0.in = in;
      write0[done] = p0.done & l0.done ? 1'd1;
    }
    group write1<"static"=1> {
      p1.write_en = write_en;
      l1.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l1.in = slice.out;
      p1.in = in;
      write1[done] = p1.done & l1.done ? 1'd1;
    }
    group write2<"static"=1> {
      p2.write_en = write_en;
      l2.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l2.in = slice.out;
      p2.in = in;
      write2[done] = p2.done & l2.done ? 1'd1;
    }
    group write3<"static"=1> {
      p3.write_en = write_en;
      l3.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l3.in = slice.out;
      p3.in = in;
      write3[done] = p3.done & l3.done ? 1'd1;
    }
    group write4<"static"=1> {
      p4.write_en = write_en;
      l4.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l4.in = slice.out;
      p4.in = in;
      write4[done] = p4.done & l4.done ? 1'd1;
    }
    group write5<"static"=1> {
      p5.write_en = write_en;
      l5.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l5.in = slice.out;
      p5.in = in;
      write5[done] = p5.done & l5.done ? 1'd1;
    }
    group write6<"static"=1> {
      p6.write_en = write_en;
      l6.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l6.in = slice.out;
      p6.in = in;
      write6[done] = p6.done & l6.done ? 1'd1;
    }
    group write7<"static"=1> {
      p7.write_en = write_en;
      l7.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 4'd1;
      slice.in = sub.out;
      l7.in = slice.out;
      p7.in = in;
      write7[done] = p7.done & l7.done ? 1'd1;
    }
    
    // Find the correct registers to write the
    // given prefix and prefix length.
    group find_write_index<"static"=0> {
      is_index0.left = 3'd0;
      is_index1.left = 3'd1;
      is_index2.left = 3'd2;
      is_index3.left = 3'd3;
      is_index4.left = 3'd4;
      is_index5.left = 3'd5;
      is_index6.left = 3'd6;
      is_index7.left = 3'd7;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index2.right = write_index;
      is_index3.right = write_index;
      is_index4.right = write_index;
      is_index5.right = write_index;
      is_index6.right = write_index;
      is_index7.right = write_index;
      find_write_index[done] = 1'd1;
    }

    // A final check to verify whether any of the match 
    // lines had a valid match after the search phase
    // is complete
    group validity<"static"=0> {
      is_invalid.left = ml20.out;
      is_invalid.right = 1'd0;
      validity[done] = 1'b1;
    }
 
    // Pull out the index provided in the `zero_index`.
    group default_to_zero_length_index<"static"=1> {
      rwrite_en = 1'b1;
      index = zero_index.out;
      default_to_zero_length_index[done] = rdone;
    }

    // Save the final index to the register hooked
    // up to the CN-LPM TCAM component.
    group save_index<"static"=1> {
      rwrite_en = 1'b1;
      index = out_index.out;
      save_index[done] = rdone;
    }
  }
  control {
    par {
      if w_eq.out with is_write_enabled {
        if z_eq.out with is_length_zero {
          write_zero;
        } else {
          par {
            if is_index0.out  with find_write_index { write0; }
            if is_index1.out  with find_write_index { write1; }
            if is_index2.out  with find_write_index { write2; }
            if is_index3.out  with find_write_index { write3; }
            if is_index4.out  with find_write_index { write4; }
            if is_index5.out  with find_write_index { write5; }
            if is_index6.out  with find_write_index { write6; }
            if is_index7.out  with find_write_index { write7; }
          }
        }
      }
      if s_eq.out with is_search_enabled {
        seq {
          par {
            invoke me0(in=in, prefix=p0.out, length=l0.out, ml_done=ml0.done)(match_line=ml0.in, write_en=ml0.write_en);
            invoke me1(in=in, prefix=p1.out, length=l1.out, ml_done=ml1.done)(match_line=ml1.in, write_en=ml1.write_en);
            invoke me2(in=in, prefix=p2.out, length=l2.out, ml_done=ml2.done)(match_line=ml2.in, write_en=ml2.write_en);
            invoke me3(in=in, prefix=p3.out, length=l3.out, ml_done=ml3.done)(match_line=ml3.in, write_en=ml3.write_en);
            invoke me4(in=in, prefix=p4.out, length=l4.out, ml_done=ml4.done)(match_line=ml4.in, write_en=ml4.write_en);
            invoke me5(in=in, prefix=p5.out, length=l5.out, ml_done=ml5.done)(match_line=ml5.in, write_en=ml5.write_en);
            invoke me6(in=in, prefix=p6.out, length=l6.out, ml_done=ml6.done)(match_line=ml6.in, write_en=ml6.write_en);
            invoke me7(in=in, prefix=p7.out, length=l7.out, ml_done=ml7.done)(match_line=ml7.in, write_en=ml7.write_en);
          }
          par {
            invoke ce00(lenA=l0.out, lenB=l1.out, addrA=3'd0, addrB=3'd1, mlA=ml0.out, mlB=ml1.out, len_done=len00.done, addr_done=addr00.done, ml_done=ml00.done)(lenX=len00.in, addrX=addr00.in, mlX=ml00.in, len_write_en=len00.write_en, addr_write_en=addr00.write_en, ml_write_en=ml00.write_en);
            invoke ce01(lenA=l2.out, lenB=l3.out, addrA=3'd2, addrB=3'd3, mlA=ml2.out, mlB=ml3.out, len_done=len01.done, addr_done=addr01.done, ml_done=ml01.done)(lenX=len01.in, addrX=addr01.in, mlX=ml01.in, len_write_en=len01.write_en, addr_write_en=addr01.write_en, ml_write_en=ml01.write_en);
            invoke ce02(lenA=l4.out, lenB=l5.out, addrA=3'd4, addrB=3'd5, mlA=ml4.out, mlB=ml5.out, len_done=len02.done, addr_done=addr02.done, ml_done=ml02.done)(lenX=len02.in, addrX=addr02.in, mlX=ml02.in, len_write_en=len02.write_en, addr_write_en=addr02.write_en, ml_write_en=ml02.write_en);
            invoke ce03(lenA=l6.out, lenB=l7.out, addrA=3'd6, addrB=3'd7, mlA=ml6.out, mlB=ml7.out, len_done=len03.done, addr_done=addr03.done, ml_done=ml03.done)(lenX=len03.in, addrX=addr03.in, mlX=ml03.in, len_write_en=len03.write_en, addr_write_en=addr03.write_en, ml_write_en=ml03.write_en);
          }
          par {
            invoke ce10(lenA=len00.out, lenB=len01.out, addrA=addr00.out, addrB=addr01.out, mlA=ml00.out, mlB=ml01.out, len_done=len10.done, addr_done=addr10.done, ml_done=ml10.done)(lenX=len10.in, addrX=addr10.in, mlX=ml10.in, len_write_en=len10.write_en, addr_write_en=addr10.write_en, ml_write_en=ml10.write_en);
            invoke ce11(lenA=len02.out, lenB=len03.out, addrA=addr02.out, addrB=addr03.out, mlA=ml02.out, mlB=ml03.out, len_done=len11.done, addr_done=addr11.done, ml_done=ml11.done)(lenX=len11.in, addrX=addr11.in, mlX=ml11.in, len_write_en=len11.write_en, addr_write_en=addr11.write_en, ml_write_en=ml11.write_en);
          }
          invoke ce20(lenA=len10.out, lenB=len11.out, addrA=addr10.out, addrB=addr11.out, mlA=ml10.out, mlB=ml11.out, len_done=len20.done, addr_done=out_index.done, ml_done=ml20.done)(lenX=len20.in, addrX=out_index.in, mlX=ml20.in, len_write_en=len20.write_en, addr_write_en=out_index.write_en, ml_write_en=ml20.write_en);
          if is_invalid.out with validity { default_to_zero_length_index; } else { save_index; }
        }
      }
    }
  }
}


So… does it work? We can test this by executing a few writes followed by a search, and see if the correct address is returned. Luckily, Calyx provides a driver, fud, to support a vast number of toolchains; this gives us a way to simulate Calyx programs! I’ve done exactly this below in a file named tcam_sim.futil.

Calyx 8-bit TCAM demonstration: tcam_sim.futil
import "primitives/core.futil";

// We can either import the `tcam` primitives library, 
// or simply copy and paste the above components.
import "primitives/tcam.futil";

component main() -> () {
  cells {
    tcam = CNLPM_TCAM();
    @external(1) index = std_mem_d1(3, 1, 1);
    r = std_reg(3);
  }

  wires {
    group save_index<"static"=1> {
      index.write_en = 1'b1;
      index.addr0 = 1'd0;
      index.write_data = r.out;
      save_index[done] = index.done;
    }
  }

  control {
    seq {
      // Write.
      invoke tcam(write_en=1'b1, write_index=3'd1, in=8'b11000000, prefix_len=4'd4)();
      invoke tcam(write_en=1'b1, write_index=3'd2, in=8'b00000000, prefix_len=4'd0)();
      invoke tcam(write_en=1'b1, write_index=3'd3, in=8'b01010100, prefix_len=4'd1)();
      invoke tcam(write_en=1'b1, write_index=3'd4, in=8'b11000000, prefix_len=4'd5)();

      // Search.
      invoke tcam(search_en=1'b1, in=8'b11000000, rdone=r.done)(rwrite_en=r.write_en, index=r.in);
      
      // Save the index to a memory for simulation purposes.
      // This should save index `4`.
      save_index;
    }
  }
}


Next, we need to provide the necessary memories for the simulation. We’ll have a single-address memory named index for the output of the search, and two more memories to store the prefixes and prefix lengths in the TCAM (more on this later). This is shown below in a file named tcam_sim.data:

Calyx 8-bit TCAM demonstration: tcam_sim.data
{
  "index": {
    "data": [0],
    "format": {
      "numeric_type": "bitnum",
      "is_signed": false,
      "width": 5
    }
  }
}


Finally, we can simulate this with the following fud command:

fud e tcam_sim.futil --to dat -s verilog.data tcam_sim.data

The value saved in index is 4, as expected!

IPv4 CN-LPM TCAM

Scaling to a larger bit width for the CN-LPM TCAM requires adjusting the number of Match Elements and Comparator Elements (as well as the corresponding hardware for each). This means the IPv4 (32-bit) implementation is just an extended version of our 8-bit TCAM. The implementation can be found below. Corresponding tests can also be found in the Calyx repository’s pull request #487.

Calyx IPv4 CN-LPM
// An implementation of Comparator Network Longest Prefix Match (CN-LPM), as described in:
// "TCAM-based high speed longest prefix matching with fast incremental table updates"
// Rasmussen et al. (2013) [https://ieeexplore.ieee.org/document/6602288]


// Sets the `match_line` signal to high if the value `in` matches the
// `prefix` with the given `length`. For example, given
//     in = 5'b11000, prefix = 5'b11001, length = 2:
//     our prefix-match may be represented as `11xxx`, where `x`
//     is a "don't care" bit. Since `in` does indeed match the
//     prefix `11xxx`, the `match_line` signal is set to high.
//
// Note: the length is an N-bit value,
//       which means length `0` really means length `1`,
//       and more generally, `N-1` length represents `N`.
//       The zero case is caught earlier so that
//       we only need N bits to represent length.
component match_element(in: 32, prefix: 32, length: 5, ml_done: 1) -> (match_line: 1, write_en: 1) {
  cells {
    sub = std_sub(5);
    pad = std_pad(5, 32);
    rsh0 = std_rsh(32);
    rsh1 = std_rsh(32);
    eq  = std_eq(32);
  }
  wires {
    group compare<"static"=1> {
      sub.left = 5'd31;
      sub.right = length;
      pad.in = sub.out;

      rsh0.left = in;
      rsh0.right = pad.out;
      rsh1.left = prefix;
      rsh1.right = pad.out;
      eq.left = rsh0.out;
      eq.right = rsh1.out;

      write_en = 1'd1;
      match_line = eq.out;
      compare[done] = ml_done;
    }
  }
  control {
    compare;
  }
}

// Given two addresses A and B with their lengths and match line, sets the `X` ports
// with the address and length of the maximum length (valid) prefix match. The output
// match line is set as the logical OR of the match lines from A and B. For example,
//
//   lenA: 4, mlA: 1, addrA: 0
//   lenB: 3, mlB: 1, addrB: 1
//   Since lenA > lenB and the match line of A is high,
//   `lenX` = 4, `mlX` = 1, and `addrX` = 0.
component comparator_element(lenA: 5, lenB: 5, addrA: 5, addrB: 5, mlA: 1, mlB: 1, len_done: 1, addr_done: 1, ml_done: 1) ->
                            (lenX: 5, addrX: 5, mlX: 1, len_write_en: 1, addr_write_en: 1, ml_write_en: 1) {
  cells {
    gt0 = std_gt(5);
    or0 = std_or(1);
    or1 = std_or(1);
    not0 = std_not(1);
    and0 = std_and(1);
  }
  wires {
    group select<"static"=0> {
      gt0.left = lenA;
      gt0.right = lenB;
      not0.in = mlB;
      or0.left = not0.out;
      or0.right = gt0.out;
      and0.left = mlA;
      and0.right = or0.out;
      select[done] = 1'd1;
    }
    group A<"static"=1> {
      len_write_en = 1'd1;
      addr_write_en = 1'd1;
      lenX = lenA;
      addrX = addrA;
      A[done] = len_done & addr_done ? 1'd1;
    }
    group B<"static"=1> {
      len_write_en = 1'd1;
      addr_write_en = 1'd1;
      lenX = lenB;
      addrX = addrB;
      B[done] = len_done & addr_done ? 1'd1;
    }
    group or_match_line<"static"=1> {
      or1.left = mlA;
      or1.right = mlB;
      ml_write_en = 1'd1;
      mlX = or1.out;
      or_match_line[done] = ml_done;
    }
  }
  control {
    par {
      if and0.out with select { A; } else { B; }
      or_match_line;
    }
  }
}

// A ternary content-addressable memory (TCAM) for IPv4, implemented
// using CN-LPM. One can either write to the TCAM or search the TCAM.
// Each stage in the TCAM is separated by registers to allow
// pipelining in the future.
//
// To write to the TCAM:
// 1. Set the `write_en` signal high.
// 2. Provide an `in` value and a corresponding `prefix_len`.
//    For example, if you want to represent the prefix-match
//    `101xx`, you'd pass in `0b'00101` with prefix_len `3`.
//
// To search the TCAM:
// 1. Set the `search_en` signal high.
// 2. Provide an `in` value indicating the data you're searching.
// 3. The signal is set to `index`, and may be saved to a register
//    using the `rdone` and `rwrite_en` signals.
//
// The `write_en` and `search_en` signals should NOT be set to high
// in the same invocation. If a zero-length prefix is written to the TCAM,
// then invalid searches will be set to the corresponding index. Otherwise,
// invalid searches will be defaulted to index zero.
component TCAM_IPv4(write_en: 1, search_en: 1, in: 32, prefix_len: 6, write_index: 5, rdone: 1) -> (index: 5, rwrite_en: 1) {
  cells {
    p0 = std_reg(32);
    p1 = std_reg(32);
    p2 = std_reg(32);
    p3 = std_reg(32);
    p4 = std_reg(32);
    p5 = std_reg(32);
    p6 = std_reg(32);
    p7 = std_reg(32);
    p8 = std_reg(32);
    p9 = std_reg(32);
    p10 = std_reg(32);
    p11 = std_reg(32);
    p12 = std_reg(32);
    p13 = std_reg(32);
    p14 = std_reg(32);
    p15 = std_reg(32);
    p16 = std_reg(32);
    p17 = std_reg(32);
    p18 = std_reg(32);
    p19 = std_reg(32);
    p20 = std_reg(32);
    p21 = std_reg(32);
    p22 = std_reg(32);
    p23 = std_reg(32);
    p24 = std_reg(32);
    p25 = std_reg(32);
    p26 = std_reg(32);
    p27 = std_reg(32);
    p28 = std_reg(32);
    p29 = std_reg(32);
    p30 = std_reg(32);
    p31 = std_reg(32);
    l0 = std_reg(5);
    l1 = std_reg(5);
    l2 = std_reg(5);
    l3 = std_reg(5);
    l4 = std_reg(5);
    l5 = std_reg(5);
    l6 = std_reg(5);
    l7 = std_reg(5);
    l8 = std_reg(5);
    l9 = std_reg(5);
    l10 = std_reg(5);
    l11 = std_reg(5);
    l12 = std_reg(5);
    l13 = std_reg(5);
    l14 = std_reg(5);
    l15 = std_reg(5);
    l16 = std_reg(5);
    l17 = std_reg(5);
    l18 = std_reg(5);
    l19 = std_reg(5);
    l20 = std_reg(5);
    l21 = std_reg(5);
    l22 = std_reg(5);
    l23 = std_reg(5);
    l24 = std_reg(5);
    l25 = std_reg(5);
    l26 = std_reg(5);
    l27 = std_reg(5);
    l28 = std_reg(5);
    l29 = std_reg(5);
    l30 = std_reg(5);
    l31 = std_reg(5);

    is_index0 = std_eq(5);
    is_index1 = std_eq(5);
    is_index2 = std_eq(5);
    is_index3 = std_eq(5);
    is_index4 = std_eq(5);
    is_index5 = std_eq(5);
    is_index6 = std_eq(5);
    is_index7 = std_eq(5);
    is_index8 = std_eq(5);
    is_index9 = std_eq(5);
    is_index10 = std_eq(5);
    is_index11 = std_eq(5);
    is_index12 = std_eq(5);
    is_index13 = std_eq(5);
    is_index14 = std_eq(5);
    is_index15 = std_eq(5);
    is_index16 = std_eq(5);
    is_index17 = std_eq(5);
    is_index18 = std_eq(5);
    is_index19 = std_eq(5);
    is_index20 = std_eq(5);
    is_index21 = std_eq(5);
    is_index22 = std_eq(5);
    is_index23 = std_eq(5);
    is_index24 = std_eq(5);
    is_index25 = std_eq(5);
    is_index26 = std_eq(5);
    is_index27 = std_eq(5);
    is_index28 = std_eq(5);
    is_index29 = std_eq(5);
    is_index30 = std_eq(5);
    is_index31 = std_eq(5);

    zero_index = std_reg(5);
    w_eq = std_eq(1);
    s_eq = std_eq(1);
    z_eq = std_eq(6);
    is_invalid = std_eq(1);

    slice = std_slice(6, 5);
    sub = std_sub(6);

    me0 = match_element();
    me1 = match_element();
    me2 = match_element();
    me3 = match_element();
    me4 = match_element();
    me5 = match_element();
    me6 = match_element();
    me7 = match_element();
    me8 = match_element();
    me9 = match_element();
    me10 = match_element();
    me11 = match_element();
    me12 = match_element();
    me13 = match_element();
    me14 = match_element();
    me15 = match_element();
    me16 = match_element();
    me17 = match_element();
    me18 = match_element();
    me19 = match_element();
    me20 = match_element();
    me21 = match_element();
    me22 = match_element();
    me23 = match_element();
    me24 = match_element();
    me25 = match_element();
    me26 = match_element();
    me27 = match_element();
    me28 = match_element();
    me29 = match_element();
    me30 = match_element();
    me31 = match_element();
    mle0 = std_reg(1);
    mle1 = std_reg(1);
    mle2 = std_reg(1);
    mle3 = std_reg(1);
    mle4 = std_reg(1);
    mle5 = std_reg(1);
    mle6 = std_reg(1);
    mle7 = std_reg(1);
    mle8 = std_reg(1);
    mle9 = std_reg(1);
    mle10 = std_reg(1);
    mle11 = std_reg(1);
    mle12 = std_reg(1);
    mle13 = std_reg(1);
    mle14 = std_reg(1);
    mle15 = std_reg(1);
    mle16 = std_reg(1);
    mle17 = std_reg(1);
    mle18 = std_reg(1);
    mle19 = std_reg(1);
    mle20 = std_reg(1);
    mle21 = std_reg(1);
    mle22 = std_reg(1);
    mle23 = std_reg(1);
    mle24 = std_reg(1);
    mle25 = std_reg(1);
    mle26 = std_reg(1);
    mle27 = std_reg(1);
    mle28 = std_reg(1);
    mle29 = std_reg(1);
    mle30 = std_reg(1);
    mle31 = std_reg(1);

    ce00 = comparator_element();
    ce01 = comparator_element();
    ce02 = comparator_element();
    ce03 = comparator_element();
    ce04 = comparator_element();
    ce05 = comparator_element();
    ce06 = comparator_element();
    ce07 = comparator_element();
    ce08 = comparator_element();
    ce09 = comparator_element();
    ce010 = comparator_element();
    ce011 = comparator_element();
    ce012 = comparator_element();
    ce013 = comparator_element();
    ce014 = comparator_element();
    ce015 = comparator_element();
    ce10 = comparator_element();
    ce11 = comparator_element();
    ce12 = comparator_element();
    ce13 = comparator_element();
    ce14 = comparator_element();
    ce15 = comparator_element();
    ce16 = comparator_element();
    ce17 = comparator_element();
    ce20 = comparator_element();
    ce21 = comparator_element();
    ce22 = comparator_element();
    ce23 = comparator_element();
    ce30 = comparator_element();
    ce31 = comparator_element();
    ce40 = comparator_element();

    len00 = std_reg(5);
    len01 = std_reg(5);
    len02 = std_reg(5);
    len03 = std_reg(5);
    len04 = std_reg(5);
    len05 = std_reg(5);
    len06 = std_reg(5);
    len07 = std_reg(5);
    len08 = std_reg(5);
    len09 = std_reg(5);
    len010 = std_reg(5);
    len011 = std_reg(5);
    len012 = std_reg(5);
    len013 = std_reg(5);
    len014 = std_reg(5);
    len015 = std_reg(5);
    len10 = std_reg(5);
    len11 = std_reg(5);
    len12 = std_reg(5);
    len13 = std_reg(5);
    len14 = std_reg(5);
    len15 = std_reg(5);
    len16 = std_reg(5);
    len17 = std_reg(5);
    len18 = std_reg(5);
    len20 = std_reg(5);
    len21 = std_reg(5);
    len22 = std_reg(5);
    len23 = std_reg(5);
    len24 = std_reg(5);
    len30 = std_reg(5);
    len31 = std_reg(5);

    addr00 = std_reg(5);
    addr01 = std_reg(5);
    addr02 = std_reg(5);
    addr03 = std_reg(5);
    addr04 = std_reg(5);
    addr05 = std_reg(5);
    addr06 = std_reg(5);
    addr07 = std_reg(5);
    addr08 = std_reg(5);
    addr09 = std_reg(5);
    addr010 = std_reg(5);
    addr011 = std_reg(5);
    addr012 = std_reg(5);
    addr013 = std_reg(5);
    addr014 = std_reg(5);
    addr015 = std_reg(5);
    addr10 = std_reg(5);
    addr11 = std_reg(5);
    addr12 = std_reg(5);
    addr13 = std_reg(5);
    addr14 = std_reg(5);
    addr15 = std_reg(5);
    addr16 = std_reg(5);
    addr17 = std_reg(5);
    addr20 = std_reg(5);
    addr21 = std_reg(5);
    addr22 = std_reg(5);
    addr23 = std_reg(5);
    addr30 = std_reg(5);
    addr31 = std_reg(5);

    ml00 = std_reg(1);
    ml01 = std_reg(1);
    ml02 = std_reg(1);
    ml03 = std_reg(1);
    ml04 = std_reg(1);
    ml05 = std_reg(1);
    ml06 = std_reg(1);
    ml07 = std_reg(1);
    ml08 = std_reg(1);
    ml09 = std_reg(1);
    ml010 = std_reg(1);
    ml011 = std_reg(1);
    ml012 = std_reg(1);
    ml013 = std_reg(1);
    ml014 = std_reg(1);
    ml015 = std_reg(1);
    ml10 = std_reg(1);
    ml11 = std_reg(1);
    ml12 = std_reg(1);
    ml13 = std_reg(1);
    ml14 = std_reg(1);
    ml15 = std_reg(1);
    ml16 = std_reg(1);
    ml17 = std_reg(1);
    ml20 = std_reg(1);
    ml21 = std_reg(1);
    ml22 = std_reg(1);
    ml23 = std_reg(1);
    ml30 = std_reg(1);
    ml31 = std_reg(1);

    final_valid = std_reg(1);
    out_index = std_reg(5);
  }
  wires {
    group is_write_enabled<"static"=0> {
      w_eq.left = write_en;
      w_eq.right = 1'd1;
      is_write_enabled[done] = 1'd1;
    }
    group is_length_zero<"static"=0> {
      z_eq.left = 6'd0;
      z_eq.right = prefix_len;
      is_length_zero[done] = 1'd1;
    }
    group is_search_enabled<"static"=0> {
      s_eq.left = search_en;
      s_eq.right = 1'd1;
      is_search_enabled[done] = 1'd1;
    }
    group write_zero<"static"=1> {
      zero_index.write_en = 1'd1;
      zero_index.in = write_index;
      write_zero[done] = zero_index.done;
    }
    group write0<"static"=1> {
      p0.write_en = write_en;
      l0.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l0.in = slice.out;
      p0.in = in;
      write0[done] = p0.done & l0.done ? 1'd1;
    }
    group write1<"static"=1> {
      p1.write_en = write_en;
      l1.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l1.in = slice.out;
      p1.in = in;
      write1[done] = p1.done & l1.done ? 1'd1;
    }
    group write2<"static"=1> {
      p2.write_en = write_en;
      l2.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l2.in = slice.out;
      p2.in = in;
      write2[done] = p2.done & l2.done ? 1'd1;
    }
    group write3<"static"=1> {
      p3.write_en = write_en;
      l3.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l3.in = slice.out;
      p3.in = in;
      write3[done] = p3.done & l3.done ? 1'd1;
    }
    group write4<"static"=1> {
      p4.write_en = write_en;
      l4.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l4.in = slice.out;
      p4.in = in;
      write4[done] = p4.done & l4.done ? 1'd1;
    }
    group write5<"static"=1> {
      p5.write_en = write_en;
      l5.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l5.in = slice.out;
      p5.in = in;
      write5[done] = p5.done & l5.done ? 1'd1;
    }
    group write6<"static"=1> {
      p6.write_en = write_en;
      l6.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l6.in = slice.out;
      p6.in = in;
      write6[done] = p6.done & l6.done ? 1'd1;
    }
    group write7<"static"=1> {
      p7.write_en = write_en;
      l7.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l7.in = slice.out;
      p7.in = in;
      write7[done] = p7.done & l7.done ? 1'd1;
    }
    group write8<"static"=1> {
      p8.write_en = write_en;
      l8.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l8.in = slice.out;
      p8.in = in;
      write8[done] = p8.done & l8.done ? 1'd1;
    }
    group write9<"static"=1> {
      p9.write_en = write_en;
      l9.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l9.in = slice.out;
      p9.in = in;
      write9[done] = p9.done & l9.done ? 1'd1;
    }
    group write10<"static"=1> {
      p10.write_en = write_en;
      l10.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l10.in = slice.out;
      p10.in = in;
      write10[done] = p10.done & l10.done ? 1'd1;
    }
    group write11<"static"=1> {
      p11.write_en = write_en;
      l11.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l11.in = slice.out;
      p11.in = in;
      write11[done] = p11.done & l11.done ? 1'd1;
    }
    group write12<"static"=1> {
      p12.write_en = write_en;
      l12.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l12.in = slice.out;
      p12.in = in;
      write12[done] = p12.done & l12.done ? 1'd1;
    }
    group write13<"static"=1> {
      p13.write_en = write_en;
      l13.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l13.in = slice.out;
      p13.in = in;
      write13[done] = p13.done & l13.done ? 1'd1;
    }
    group write14<"static"=1> {
      p14.write_en = write_en;
      l14.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l14.in = slice.out;
      p14.in = in;
      write14[done] = p14.done & l14.done ? 1'd1;
    }
    group write15<"static"=1> {
      p15.write_en = write_en;
      l15.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l15.in = slice.out;
      p15.in = in;
      write15[done] = p15.done & l15.done ? 1'd1;
    }
    group write16<"static"=1> {
      p16.write_en = write_en;
      l16.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l16.in = slice.out;
      p16.in = in;
      write16[done] = p16.done & l16.done ? 1'd1;
    }
    group write17<"static"=1> {
      p17.write_en = write_en;
      l17.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l17.in = slice.out;
      p17.in = in;
      write17[done] = p17.done & l17.done ? 1'd1;
    }
    group write18<"static"=1> {
      p18.write_en = write_en;
      l18.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l18.in = slice.out;
      p18.in = in;
      write18[done] = p18.done & l18.done ? 1'd1;
    }
    group write19<"static"=1> {
      p19.write_en = write_en;
      l19.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l19.in = slice.out;
      p19.in = in;
      write19[done] = p19.done & l19.done ? 1'd1;
    }
    group write20<"static"=1> {
      p20.write_en = write_en;
      l20.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l20.in = slice.out;
      p20.in = in;
      write20[done] = p20.done & l20.done ? 1'd1;
    }
    group write21<"static"=1> {
      p21.write_en = write_en;
      l21.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l21.in = slice.out;
      p21.in = in;
      write21[done] = p21.done & l21.done ? 1'd1;
    }
    group write22<"static"=1> {
      p22.write_en = write_en;
      l22.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l22.in = slice.out;
      p22.in = in;
      write22[done] = p22.done & l22.done ? 1'd1;
    }
    group write23<"static"=1> {
      p23.write_en = write_en;
      l23.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l23.in = slice.out;
      p23.in = in;
      write23[done] = p23.done & l23.done ? 1'd1;
    }
    group write24<"static"=1> {
      p24.write_en = write_en;
      l24.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l24.in = slice.out;
      p24.in = in;
      write24[done] = p24.done & l24.done ? 1'd1;
    }
    group write25<"static"=1> {
      p25.write_en = write_en;
      l25.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l25.in = slice.out;
      p25.in = in;
      write25[done] = p25.done & l25.done ? 1'd1;
    }
    group write26<"static"=1> {
      p26.write_en = write_en;
      l26.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l26.in = slice.out;
      p26.in = in;
      write26[done] = p26.done & l26.done ? 1'd1;
    }
    group write27<"static"=1> {
      p27.write_en = write_en;
      l27.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l27.in = slice.out;
      p27.in = in;
      write27[done] = p27.done & l27.done ? 1'd1;
    }
    group write28<"static"=1> {
      p28.write_en = write_en;
      l28.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l28.in = slice.out;
      p28.in = in;
      write28[done] = p28.done & l28.done ? 1'd1;
    }
    group write29<"static"=1> {
      p29.write_en = write_en;
      l29.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l29.in = slice.out;
      p29.in = in;
      write29[done] = p29.done & l29.done ? 1'd1;
    }
    group write30<"static"=1> {
      p30.write_en = write_en;
      l30.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l30.in = slice.out;
      p30.in = in;
      write30[done] = p30.done & l30.done ? 1'd1;
    }
    group write31<"static"=1> {
      p31.write_en = write_en;
      l31.write_en = write_en;
      sub.left = prefix_len;
      sub.right = 6'd1;
      slice.in = sub.out;
      l31.in = slice.out;
      p31.in = in;
      write31[done] = p31.done & l31.done ? 1'd1;
    }
    group find_write_index<"static"=0> {
      is_index0.left = 5'd0;
      is_index1.left = 5'd1;
      is_index2.left = 5'd2;
      is_index3.left = 5'd3;
      is_index4.left = 5'd4;
      is_index5.left = 5'd5;
      is_index6.left = 5'd6;
      is_index7.left = 5'd7;
      is_index8.left = 5'd8;
      is_index9.left = 5'd9;
      is_index10.left = 5'd10;
      is_index11.left = 5'd11;
      is_index12.left = 5'd12;
      is_index13.left = 5'd13;
      is_index14.left = 5'd14;
      is_index15.left = 5'd15;
      is_index16.left = 5'd16;
      is_index17.left = 5'd17;
      is_index18.left = 5'd18;
      is_index19.left = 5'd19;
      is_index20.left = 5'd20;
      is_index21.left = 5'd21;
      is_index22.left = 5'd22;
      is_index23.left = 5'd23;
      is_index24.left = 5'd24;
      is_index25.left = 5'd25;
      is_index26.left = 5'd26;
      is_index27.left = 5'd27;
      is_index28.left = 5'd28;
      is_index29.left = 5'd29;
      is_index30.left = 5'd30;
      is_index31.left = 5'd31;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index0.right = write_index;
      is_index1.right = write_index;
      is_index2.right = write_index;
      is_index3.right = write_index;
      is_index4.right = write_index;
      is_index5.right = write_index;
      is_index6.right = write_index;
      is_index7.right = write_index;
      is_index8.right = write_index;
      is_index9.right = write_index;
      is_index10.right = write_index;
      is_index11.right = write_index;
      is_index12.right = write_index;
      is_index13.right = write_index;
      is_index14.right = write_index;
      is_index15.right = write_index;
      is_index16.right = write_index;
      is_index17.right = write_index;
      is_index18.right = write_index;
      is_index19.right = write_index;
      is_index20.right = write_index;
      is_index21.right = write_index;
      is_index22.right = write_index;
      is_index23.right = write_index;
      is_index24.right = write_index;
      is_index25.right = write_index;
      is_index26.right = write_index;
      is_index27.right = write_index;
      is_index28.right = write_index;
      is_index29.right = write_index;
      is_index30.right = write_index;
      is_index31.right = write_index;
      find_write_index[done] = 1'd1;
    }
    group validity<"static"=0> {
      is_invalid.left = final_valid.out;
      is_invalid.right = 1'd0;
      validity[done] = 1'd1;
    }
    group default_to_zero_length_index<"static"=1> {
      rwrite_en = 1'd1;
      index = zero_index.out;
      default_to_zero_length_index[done] = rdone;
    }
    group save_index<"static"=1> {
      rwrite_en = 1'd1;
      index = out_index.out;
      save_index[done] = rdone;
    }
  }

  control {
    par {
      if w_eq.out with is_write_enabled {
        if z_eq.out with is_length_zero {
          write_zero;
        } else {
          par {
            if is_index0.out  with find_write_index { write0; }
            if is_index1.out  with find_write_index { write1; }
            if is_index2.out  with find_write_index { write2; }
            if is_index3.out  with find_write_index { write3; }
            if is_index4.out  with find_write_index { write4; }
            if is_index5.out  with find_write_index { write5; }
            if is_index6.out  with find_write_index { write6; }
            if is_index7.out  with find_write_index { write7; }
            if is_index8.out  with find_write_index { write8; }
            if is_index9.out  with find_write_index { write9; }
            if is_index10.out with find_write_index { write10; }
            if is_index11.out with find_write_index { write11; }
            if is_index12.out with find_write_index { write12; }
            if is_index13.out with find_write_index { write13; }
            if is_index14.out with find_write_index { write14; }
            if is_index15.out with find_write_index { write15; }
            if is_index16.out with find_write_index { write16; }
            if is_index17.out with find_write_index { write17; }
            if is_index18.out with find_write_index { write18; }
            if is_index19.out with find_write_index { write19; }
            if is_index20.out with find_write_index { write20; }
            if is_index21.out with find_write_index { write21; }
            if is_index22.out with find_write_index { write22; }
            if is_index23.out with find_write_index { write23; }
            if is_index24.out with find_write_index { write24; }
            if is_index25.out with find_write_index { write25; }
            if is_index26.out with find_write_index { write26; }
            if is_index27.out with find_write_index { write27; }
            if is_index28.out with find_write_index { write28; }
            if is_index29.out with find_write_index { write29; }
            if is_index30.out with find_write_index { write30; }
            if is_index31.out with find_write_index { write31; }
          }
        }
      }
      if s_eq.out with is_search_enabled {
        seq {
          par {
            invoke me0(in=in, prefix=p0.out, length=l0.out, ml_done=mle0.done)(match_line=mle0.in, write_en=mle0.write_en);
            invoke me1(in=in, prefix=p1.out, length=l1.out, ml_done=mle1.done)(match_line=mle1.in, write_en=mle1.write_en);
            invoke me2(in=in, prefix=p2.out, length=l2.out, ml_done=mle2.done)(match_line=mle2.in, write_en=mle2.write_en);
            invoke me3(in=in, prefix=p3.out, length=l3.out, ml_done=mle3.done)(match_line=mle3.in, write_en=mle3.write_en);
            invoke me4(in=in, prefix=p4.out, length=l4.out, ml_done=mle4.done)(match_line=mle4.in, write_en=mle4.write_en);
            invoke me5(in=in, prefix=p5.out, length=l5.out, ml_done=mle5.done)(match_line=mle5.in, write_en=mle5.write_en);
            invoke me6(in=in, prefix=p6.out, length=l6.out, ml_done=mle6.done)(match_line=mle6.in, write_en=mle6.write_en);
            invoke me7(in=in, prefix=p7.out, length=l7.out, ml_done=mle7.done)(match_line=mle7.in, write_en=mle7.write_en);
            invoke me8(in=in, prefix=p8.out, length=l8.out, ml_done=mle8.done)(match_line=mle8.in, write_en=mle8.write_en);
            invoke me9(in=in, prefix=p9.out, length=l9.out, ml_done=mle9.done)(match_line=mle9.in, write_en=mle9.write_en);
            invoke me10(in=in, prefix=p10.out, length=l10.out, ml_done=mle10.done)(match_line=mle10.in, write_en=mle10.write_en);
            invoke me11(in=in, prefix=p11.out, length=l11.out, ml_done=mle11.done)(match_line=mle11.in, write_en=mle11.write_en);
            invoke me12(in=in, prefix=p12.out, length=l12.out, ml_done=mle12.done)(match_line=mle12.in, write_en=mle12.write_en);
            invoke me13(in=in, prefix=p13.out, length=l13.out, ml_done=mle13.done)(match_line=mle13.in, write_en=mle13.write_en);
            invoke me14(in=in, prefix=p14.out, length=l14.out, ml_done=mle14.done)(match_line=mle14.in, write_en=mle14.write_en);
            invoke me15(in=in, prefix=p15.out, length=l15.out, ml_done=mle15.done)(match_line=mle15.in, write_en=mle15.write_en);
            invoke me16(in=in, prefix=p16.out, length=l16.out, ml_done=mle16.done)(match_line=mle16.in, write_en=mle16.write_en);
            invoke me17(in=in, prefix=p17.out, length=l17.out, ml_done=mle17.done)(match_line=mle17.in, write_en=mle17.write_en);
            invoke me18(in=in, prefix=p18.out, length=l18.out, ml_done=mle18.done)(match_line=mle18.in, write_en=mle18.write_en);
            invoke me19(in=in, prefix=p19.out, length=l19.out, ml_done=mle19.done)(match_line=mle19.in, write_en=mle19.write_en);
            invoke me20(in=in, prefix=p20.out, length=l20.out, ml_done=mle20.done)(match_line=mle20.in, write_en=mle20.write_en);
            invoke me21(in=in, prefix=p21.out, length=l21.out, ml_done=mle21.done)(match_line=mle21.in, write_en=mle21.write_en);
            invoke me22(in=in, prefix=p22.out, length=l22.out, ml_done=mle22.done)(match_line=mle22.in, write_en=mle22.write_en);
            invoke me23(in=in, prefix=p23.out, length=l23.out, ml_done=mle23.done)(match_line=mle23.in, write_en=mle23.write_en);
            invoke me24(in=in, prefix=p24.out, length=l24.out, ml_done=mle24.done)(match_line=mle24.in, write_en=mle24.write_en);
            invoke me25(in=in, prefix=p25.out, length=l25.out, ml_done=mle25.done)(match_line=mle25.in, write_en=mle25.write_en);
            invoke me26(in=in, prefix=p26.out, length=l26.out, ml_done=mle26.done)(match_line=mle26.in, write_en=mle26.write_en);
            invoke me27(in=in, prefix=p27.out, length=l27.out, ml_done=mle27.done)(match_line=mle27.in, write_en=mle27.write_en);
            invoke me28(in=in, prefix=p28.out, length=l28.out, ml_done=mle28.done)(match_line=mle28.in, write_en=mle28.write_en);
            invoke me29(in=in, prefix=p29.out, length=l29.out, ml_done=mle29.done)(match_line=mle29.in, write_en=mle29.write_en);
            invoke me30(in=in, prefix=p30.out, length=l30.out, ml_done=mle30.done)(match_line=mle30.in, write_en=mle30.write_en);
            invoke me31(in=in, prefix=p31.out, length=l31.out, ml_done=mle31.done)(match_line=mle31.in, write_en=mle31.write_en);
          }
          par {
            invoke ce00(lenA=l0.out, lenB=l1.out, addrA=5'd0, addrB=5'd1, mlA=mle0.out, mlB=mle1.out, len_done=len00.done, addr_done=addr00.done, ml_done=ml00.done)(lenX=len00.in, addrX=addr00.in, mlX=ml00.in, len_write_en=len00.write_en, addr_write_en=addr00.write_en, ml_write_en=ml00.write_en);
            invoke ce01(lenA=l2.out, lenB=l3.out, addrA=5'd2, addrB=5'd3, mlA=mle2.out, mlB=mle3.out, len_done=len01.done, addr_done=addr01.done, ml_done=ml01.done)(lenX=len01.in, addrX=addr01.in, mlX=ml01.in, len_write_en=len01.write_en, addr_write_en=addr01.write_en, ml_write_en=ml01.write_en);
            invoke ce02(lenA=l4.out, lenB=l5.out, addrA=5'd4, addrB=5'd5, mlA=mle4.out, mlB=mle5.out, len_done=len02.done, addr_done=addr02.done, ml_done=ml02.done)(lenX=len02.in, addrX=addr02.in, mlX=ml02.in, len_write_en=len02.write_en, addr_write_en=addr02.write_en, ml_write_en=ml02.write_en);
            invoke ce03(lenA=l6.out, lenB=l7.out, addrA=5'd6, addrB=5'd7, mlA=mle6.out, mlB=mle7.out, len_done=len03.done, addr_done=addr03.done, ml_done=ml03.done)(lenX=len03.in, addrX=addr03.in, mlX=ml03.in, len_write_en=len03.write_en, addr_write_en=addr03.write_en, ml_write_en=ml03.write_en);
            invoke ce04(lenA=l8.out, lenB=l9.out, addrA=5'd8, addrB=5'd9, mlA=mle8.out, mlB=mle9.out, len_done=len04.done, addr_done=addr04.done, ml_done=ml04.done)(lenX=len04.in, addrX=addr04.in, mlX=ml04.in, len_write_en=len04.write_en, addr_write_en=addr04.write_en, ml_write_en=ml04.write_en);
            invoke ce05(lenA=l10.out, lenB=l11.out, addrA=5'd10, addrB=5'd11, mlA=mle10.out, mlB=mle11.out, len_done=len05.done, addr_done=addr05.done, ml_done=ml05.done)(lenX=len05.in, addrX=addr05.in, mlX=ml05.in, len_write_en=len05.write_en, addr_write_en=addr05.write_en, ml_write_en=ml05.write_en);
            invoke ce06(lenA=l12.out, lenB=l13.out, addrA=5'd12, addrB=5'd13, mlA=mle12.out, mlB=mle13.out, len_done=len06.done, addr_done=addr06.done, ml_done=ml06.done)(lenX=len06.in, addrX=addr06.in, mlX=ml06.in, len_write_en=len06.write_en, addr_write_en=addr06.write_en, ml_write_en=ml06.write_en);
            invoke ce07(lenA=l14.out, lenB=l15.out, addrA=5'd14, addrB=5'd15, mlA=mle14.out, mlB=mle15.out, len_done=len07.done, addr_done=addr07.done, ml_done=ml07.done)(lenX=len07.in, addrX=addr07.in, mlX=ml07.in, len_write_en=len07.write_en, addr_write_en=addr07.write_en, ml_write_en=ml07.write_en);
            invoke ce08(lenA=l16.out, lenB=l17.out, addrA=5'd16, addrB=5'd17, mlA=mle16.out, mlB=mle17.out, len_done=len08.done, addr_done=addr08.done, ml_done=ml08.done)(lenX=len08.in, addrX=addr08.in, mlX=ml08.in, len_write_en=len08.write_en, addr_write_en=addr08.write_en, ml_write_en=ml08.write_en);
            invoke ce09(lenA=l18.out, lenB=l19.out, addrA=5'd18, addrB=5'd19, mlA=mle18.out, mlB=mle19.out, len_done=len09.done, addr_done=addr09.done, ml_done=ml09.done)(lenX=len09.in, addrX=addr09.in, mlX=ml09.in, len_write_en=len09.write_en, addr_write_en=addr09.write_en, ml_write_en=ml09.write_en);
            invoke ce010(lenA=l20.out, lenB=l21.out, addrA=5'd20, addrB=5'd21, mlA=mle20.out, mlB=mle21.out, len_done=len010.done, addr_done=addr010.done, ml_done=ml010.done)(lenX=len010.in, addrX=addr010.in, mlX=ml010.in, len_write_en=len010.write_en, addr_write_en=addr010.write_en, ml_write_en=ml010.write_en);
            invoke ce011(lenA=l22.out, lenB=l23.out, addrA=5'd22, addrB=5'd23, mlA=mle22.out, mlB=mle23.out, len_done=len011.done, addr_done=addr011.done, ml_done=ml011.done)(lenX=len011.in, addrX=addr011.in, mlX=ml011.in, len_write_en=len011.write_en, addr_write_en=addr011.write_en, ml_write_en=ml011.write_en);
            invoke ce012(lenA=l24.out, lenB=l25.out, addrA=5'd24, addrB=5'd25, mlA=mle24.out, mlB=mle25.out, len_done=len012.done, addr_done=addr012.done, ml_done=ml012.done)(lenX=len012.in, addrX=addr012.in, mlX=ml012.in, len_write_en=len012.write_en, addr_write_en=addr012.write_en, ml_write_en=ml012.write_en);
            invoke ce013(lenA=l26.out, lenB=l27.out, addrA=5'd26, addrB=5'd27, mlA=mle26.out, mlB=mle27.out, len_done=len013.done, addr_done=addr013.done, ml_done=ml013.done)(lenX=len013.in, addrX=addr013.in, mlX=ml013.in, len_write_en=len013.write_en, addr_write_en=addr013.write_en, ml_write_en=ml013.write_en);
            invoke ce014(lenA=l28.out, lenB=l29.out, addrA=5'd28, addrB=5'd29, mlA=mle28.out, mlB=mle29.out, len_done=len014.done, addr_done=addr014.done, ml_done=ml014.done)(lenX=len014.in, addrX=addr014.in, mlX=ml014.in, len_write_en=len014.write_en, addr_write_en=addr014.write_en, ml_write_en=ml014.write_en);
            invoke ce015(lenA=l30.out, lenB=l31.out, addrA=5'd30, addrB=5'd31, mlA=mle30.out, mlB=mle31.out, len_done=len015.done, addr_done=addr015.done, ml_done=ml015.done)(lenX=len015.in, addrX=addr015.in, mlX=ml015.in, len_write_en=len015.write_en, addr_write_en=addr015.write_en, ml_write_en=ml015.write_en);
          }
          par {
            invoke ce10(lenA=len00.out, lenB=len01.out, addrA=addr00.out, addrB=addr01.out, mlA=ml00.out, mlB=ml01.out, len_done=len10.done, addr_done=addr10.done, ml_done=ml10.done)(lenX=len10.in, addrX=addr10.in, mlX=ml10.in, len_write_en=len10.write_en, addr_write_en=addr10.write_en, ml_write_en=ml10.write_en);
            invoke ce11(lenA=len02.out, lenB=len03.out, addrA=addr02.out, addrB=addr03.out, mlA=ml02.out, mlB=ml03.out, len_done=len11.done, addr_done=addr11.done, ml_done=ml11.done)(lenX=len11.in, addrX=addr11.in, mlX=ml11.in, len_write_en=len11.write_en, addr_write_en=addr11.write_en, ml_write_en=ml11.write_en);
            invoke ce12(lenA=len04.out, lenB=len05.out, addrA=addr04.out, addrB=addr05.out, mlA=ml04.out, mlB=ml05.out, len_done=len12.done, addr_done=addr12.done, ml_done=ml12.done)(lenX=len12.in, addrX=addr12.in, mlX=ml12.in, len_write_en=len12.write_en, addr_write_en=addr12.write_en, ml_write_en=ml12.write_en);
            invoke ce13(lenA=len06.out, lenB=len07.out, addrA=addr06.out, addrB=addr07.out, mlA=ml06.out, mlB=ml07.out, len_done=len13.done, addr_done=addr13.done, ml_done=ml13.done)(lenX=len13.in, addrX=addr13.in, mlX=ml13.in, len_write_en=len13.write_en, addr_write_en=addr13.write_en, ml_write_en=ml13.write_en);
            invoke ce14(lenA=len08.out, lenB=len09.out, addrA=addr08.out, addrB=addr09.out, mlA=ml08.out, mlB=ml09.out, len_done=len14.done, addr_done=addr14.done, ml_done=ml14.done)(lenX=len14.in, addrX=addr14.in, mlX=ml14.in, len_write_en=len14.write_en, addr_write_en=addr14.write_en, ml_write_en=ml14.write_en);
            invoke ce15(lenA=len010.out, lenB=len011.out, addrA=addr010.out, addrB=addr011.out, mlA=ml010.out, mlB=ml011.out, len_done=len15.done, addr_done=addr15.done, ml_done=ml15.done)(lenX=len15.in, addrX=addr15.in, mlX=ml15.in, len_write_en=len15.write_en, addr_write_en=addr15.write_en, ml_write_en=ml15.write_en);
            invoke ce16(lenA=len012.out, lenB=len013.out, addrA=addr012.out, addrB=addr013.out, mlA=ml012.out, mlB=ml013.out, len_done=len16.done, addr_done=addr16.done, ml_done=ml16.done)(lenX=len16.in, addrX=addr16.in, mlX=ml16.in, len_write_en=len16.write_en, addr_write_en=addr16.write_en, ml_write_en=ml16.write_en);
            invoke ce17(lenA=len014.out, lenB=len015.out, addrA=addr014.out, addrB=addr015.out, mlA=ml014.out, mlB=ml015.out, len_done=len17.done, addr_done=addr17.done, ml_done=ml17.done)(lenX=len17.in, addrX=addr17.in, mlX=ml17.in, len_write_en=len17.write_en, addr_write_en=addr17.write_en, ml_write_en=ml17.write_en);
          }
          par {
            invoke ce20(lenA=len10.out, lenB=len11.out, addrA=addr10.out, addrB=addr11.out, mlA=ml10.out, mlB=ml11.out, len_done=len20.done, addr_done=addr20.done, ml_done=ml20.done)(lenX=len20.in, addrX=addr20.in, mlX=ml20.in, len_write_en=len20.write_en, addr_write_en=addr20.write_en, ml_write_en=ml20.write_en);
            invoke ce21(lenA=len12.out, lenB=len13.out, addrA=addr12.out, addrB=addr13.out, mlA=ml12.out, mlB=ml13.out, len_done=len21.done, addr_done=addr21.done, ml_done=ml21.done)(lenX=len21.in, addrX=addr21.in, mlX=ml21.in, len_write_en=len21.write_en, addr_write_en=addr21.write_en, ml_write_en=ml21.write_en);
            invoke ce22(lenA=len14.out, lenB=len15.out, addrA=addr14.out, addrB=addr15.out, mlA=ml14.out, mlB=ml15.out, len_done=len22.done, addr_done=addr22.done, ml_done=ml22.done)(lenX=len22.in, addrX=addr22.in, mlX=ml22.in, len_write_en=len22.write_en, addr_write_en=addr22.write_en, ml_write_en=ml22.write_en);
            invoke ce23(lenA=len16.out, lenB=len17.out, addrA=addr16.out, addrB=addr17.out, mlA=ml16.out, mlB=ml17.out, len_done=len23.done, addr_done=addr23.done, ml_done=ml23.done)(lenX=len23.in, addrX=addr23.in, mlX=ml23.in, len_write_en=len23.write_en, addr_write_en=addr23.write_en, ml_write_en=ml23.write_en);
          }
          par {
            invoke ce30(lenA=len20.out, lenB=len21.out, addrA=addr20.out, addrB=addr21.out, mlA=ml20.out, mlB=ml21.out, len_done=len30.done, addr_done=addr30.done, ml_done=ml30.done)(lenX=len30.in, addrX=addr30.in, mlX=ml30.in, len_write_en=len30.write_en, addr_write_en=addr30.write_en, ml_write_en=ml30.write_en);
            invoke ce31(lenA=len22.out, lenB=len23.out, addrA=addr22.out, addrB=addr23.out, mlA=ml22.out, mlB=ml23.out, len_done=len31.done, addr_done=addr31.done, ml_done=ml31.done)(lenX=len31.in, addrX=addr31.in, mlX=ml31.in, len_write_en=len31.write_en, addr_write_en=addr31.write_en, ml_write_en=ml31.write_en);
          }
          invoke ce40(lenA=len30.out, lenB=len31.out, addrA=addr30.out, addrB=addr31.out, mlA=ml30.out, mlB=ml31.out, addr_done=out_index.done, ml_done=final_valid.done)(addrX=out_index.in, mlX=final_valid.in, addr_write_en=out_index.write_en, ml_write_en=final_valid.write_en);
          if is_invalid.out with validity { default_to_zero_length_index; } else { save_index; }
        }
      }
    }
  }
}

P4 switch configuration to Calyx program

P4 provides a way to establish user-defined tables in the switch configuration. It has the following structure:

table_add <table_name> <action> <match>/<length> => <action data(s)>

For example, a LPM identity routing table would look like:

table_add ipv4_lpm identity 10.0.0.10/0 => 10.0.0.10
table_add ipv4_lpm identity 10.0.1.10/0 => 10.0.1.10
table_add ipv4_lpm identity 10.0.2.10/0 => 10.0.2.10

The P4 compiler then takes this table and lowers it to a protocol buffer-based format named P4Info. The nice part is this output file is target-independent, i.e. the P4Info is the same for a software switch, ASIC switch, etc.

Supporting the entirety of “P4Runtime to Calyx” would require some extensive engineering. Instead, I’ve created a simple program that takes in a configuration table, as seen above, and lowers it to a Calyx program that:

  1. Writes the prefixes and prefix lengths to the Match Engine.
  2. Writes the corresponding action identifiers (currently just generated pseudorandomly) to the Action Memory.
P4 Switch Configuration to Calyx
from dataclasses import dataclass
from uuid import uuid4
from typing import List
from io import TextIOWrapper
from calyx.py_ast import *


@dataclass
class Action:
    table_name: str
    action_name: str
    ip: str
    subnet_mask: str
    action_data: List[str]

    action_address: int
    action_id: int

    def __init__(
        self,
        table_name: str,
        action_name: str,
        ip: str,
        subnet_mask: str,
        action_data: str,
        action_address: int,
    ):
        self.table_name = table_name
        self.action_name = action_name
        self.ip = ip
        self.subnet_mask = subnet_mask
        self.action_data = action_data
        self.action_address = action_address

        # Generate a random, unique address for simplicity.
        # Realistically, this would be an ID that could be
        # interpreted by the Action ALU.
        addr = uuid4().hex[:8]
        self.action_id = int(addr, 16)


def interpret_p4_config(file: TextIOWrapper, unique_address: int) -> List[Action]:
    """Given a P4 Switch Config, returns a list of Actions.
    Each action is given a unique address in the Action Memory.
    Currently, an action ID is stripped from UUID4. Eventually,
    this should be an op code for the Action ALU."""

    def parse_action(line: str, action_address: int) -> Action:
        """Interprets a P4 configuration table line of the form:
        table_add <table_name> <action> <match>/<length> => <action data(s)>"""
        tokens = line.strip().split(" ")
        assert (
            tokens[0] == "table_add" and tokens[4] == "=>"
        ), f"""The configuration line: {line} 
does not match the expected form:
table_add <table_name> <action> <match>/<length> => <action data(s)>"""

        ip, subnet_mask = tokens[3].split("/", maxsplit=1)
        return Action(
            table_name=tokens[1],
            action_name=tokens[2],
            ip=ip,
            subnet_mask=subnet_mask,
            action_data=[] if len(tokens) <= 5 else tokens[5:],
            action_address=action_address,
        )

    actions = []
    for line in file.readlines():
        actions.append(parse_action(line, unique_address))
        unique_address = unique_address + 1

    return actions


def write_to_action_memory(action_memory_id: CompVar, actions: List[Action]):
    """Returns the groups and control to write
    each action ID to an address in the Action Memory."""
    groups = []
    for a in actions:
        action_addr = a.action_address
        action_id = a.action_id

        # Write `action_id` to `action_addr`.
        group_name = CompVar(f"wr_action_memory_{action_id}")
        connections = [
            Connect(
                ConstantPort(32, action_id), CompPort(action_memory_id, "write_data")
            ),
            Connect(ConstantPort(5, action_addr), CompPort(action_memory_id, "addr0")),
            Connect(ConstantPort(1, 1), CompPort(action_memory_id, "write_en")),
            Connect(CompPort(action_memory_id, "done"), HolePort(group_name, "done")),
        ]
        groups.append(Group(group_name, connections, static_delay=1))

    return groups, SeqComp([Enable(g.id.name) for g in groups])


def ip_to_decimal(ip: str) -> int:
    """Given an IP string, returns the decimal value."""
    octet1 = 16777216  # 256 ** 3
    octet2 = 65536  # 256 ** 2
    octet3 = 256
    ips = ip.split(".", maxsplit=3)
    ips = [int(ip) for ip in ips]

    return ips[0] * octet1 + ips[1] * octet2 + ips[2] * octet3 + ips[3]


def write_to_match_engine(match_engine_id: CompVar, actions: List[Action]):
    """Returns the control sequence that writes each
    IP (with prefix length) to the Match Engine."""
    controls = []
    for a in actions:
        # Write mapping (ip, mask) -> address.
        address = a.action_address
        ip = a.ip
        prefix_length = int(a.subnet_mask)

        controls.append(
            Invoke(
                id=match_engine_id,
                in_connects=[
                    ("write_en", ConstantPort(1, 1)),
                    ("write_index", ConstantPort(5, address)),
                    ("in", ConstantPort(32, ip_to_decimal(ip))),
                    ("prefix_len", ConstantPort(6, prefix_length)),
                ],
                out_connects=[],
            )
        )

    return SeqComp(controls)


if __name__ == "__main__":
    import argparse, json

    parser = argparse.ArgumentParser(description="Interpret P4 Table Configuration")
    parser.add_argument("file", nargs="?", type=str)
    args = parser.parse_args()

    # A counter to ensure each action receives a unique address in the Action Memory.
    unique_address = 0

    if args.file is not None:
        with open(args.file, "r") as file:
            actions = interpret_p4_config(file, unique_address)
    else:
        parser.error("Need to pass in `-f FILE`.")

    action_memory_id = CompVar("action_memory")
    match_engine_id = CompVar("tcam")
    groups, control0 = write_to_action_memory(action_memory_id, actions)
    control1 = write_to_match_engine(match_engine_id, actions)

    main = Component(
        name="main",
        inputs=[],
        outputs=[],
        structs=[
            Cell(action_memory_id, Stdlib().mem_d1(32, 32, 5)),
            Cell(match_engine_id, CompInst("TCAM_IPv4", [])),
        ]
        + groups,
        controls=ParComp([control0, control1]),
    )

    Program(
        imports=[Import("primitives/core.futil"), Import("primitives/tcam.futil")],
        components=[main],
    ).emit()


Again, this shouldn’t be interpreted as anything more than a fun demonstration. Running the script requires installing the calyx-py library. For a concrete example, try placing the following P4 configuration in a file named switch.config:

table_add ipv4_lpm X 10.0.0.10/0 => 10.0.0.10 1
table_add ipv4_lpm X 10.0.1.10/32 => 10.0.1.10 2
table_add ipv4_lpm X 10.0.6.10/16 => 10.0.6.10 7
table_add ipv4_lpm X 244.244.244.244/16 => 10.0.7.10 8

and running the command:

python3 interpret_p4_config.py switch.config

Conclusion

This post covered a lot. We began with some very basic memory concepts, provided some insight into the background and motivation for P4 and Calyx, and finally implemented a TCAM in Calyx. The goal wasn’t to make you an oracle of hardware or programming languages or networks. Rather, I wanted to exhibit a consequence of experts in different domains of computer science coming together to solve difficult problems. Developing an intermediate language to optimize hardware accelerator generators written in domain specific languages is truly the intersection of programming languages and computer architecture.

Please feel free to reach out to me at cpg49 at cornell dot edu with any questions, comments, or concerns!

References

[1]: Pat Bosshart, et al. 2013. Forwarding metamorphosis: fast programmable match-action processing in hardware for SDN. SIGCOMM Comput. Commun. Rev. 43, 4 (October 2013), 99–110. DOI:https://doi.org/10.1145/2534169.2486011

[2]: LAN/MAN Standards Committee of the IEEE Computer Society, “IEEE P802.3ba D2.1-Amendment: MAC parameters, physical layers and management parameters for 40 gb/s and 100 gb/s operation,” p. 29, 2009.

[3]: D. Shah and P. Gupta, “Fast updating algorithms for TCAM,” in IEEE Micro, vol. 21, no. 1, pp. 36-47, Jan.-Feb. 2001, doi: 10.1109/40.903060.

[4]: Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. 2014. P4: programming protocol-independent packet processors. SIGCOMM Comput. Commun. Rev. 44, 3 (July 2014), 87–95. DOI:https://doi.org/10.1145/2656877.2656890

[5]: Rachit Nigam, Samuel Thomas, Zhijing Li, and Adrian Sampson. 2021. A compiler infrastructure for accelerator generators. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, USA, 804–817. DOI:https://doi.org/10.1145/3445814.3446712

[6]: A. Rasmussen, A. Kragelund, M. Berger, H. Wessing and S. Ruepp, “TCAM-based high speed Longest prefix matching with fast incremental table updates,” 2013 IEEE 14th International Conference on High Performance Switching and Routing (HPSR), 2013, pp. 43-48, doi: 10.1109/HPSR.2013.6602288.

[7]: B. Gamache, Z. Pfeffer and S. P. Khatri, “A fast ternary CAM design for IP networking applications,” Proceedings. 12th International Conference on Computer Communications and Networks (IEEE Cat. No.03EX712), 2003, pp. 434-439, doi: 10.1109/ICCCN.2003.1284205.