Synthesis of Parallel and System-on-Chip Designs With Behavioural High-Level Hardware-Synthesis Using Communicating Sequential Processes and the ConPro-Framework

Dr. Stefan Bosse

**Technical Paper** 

4.3.2010



#### Abstract

Traditionally, there are two different ways to model and implement System-On-Chip-Designs (SoC): using a structural and/or a behavioural level. The structural level decomposes a SoC into independent submodules interacting with each other using centralized or distributed networks and communication protocols. The behavioural level usually describes the behaviour of the full design interacting with the environment. Complex reactive systems with dominant and complex control paths play an increasing role in SoC-design. The major contribution to concurrency appears on control path level. This article gives an introduction to SoC-design methodology using the behavioural hardware compiler ConPro providing a programming model based on concurrent communication sequential processes (CSP) with an extensive set of interprocess-communication primitives. An extended case study of a communication protocol used in high density sensor-actuator-networks should demonstrate the design of a SoC for a robot actuator. The communication protocol is suited for high-density intra- and interchip networks.

**Key Words and Phrases:** Circuit Design, Digital Logic, SoC, NoC, Register-Transfer-Logic, Communicating Sequential Processes, Higher-Level-Synthesis, Multiprocessing, Parallel Programming, FPGA, ASIC



# **Table of Content**

| 1 | Introduction and State of the Art                    | 2         |
|---|------------------------------------------------------|-----------|
| 2 | Modelling and Implementing Concurrency               | 4         |
|   | Process Model and RTL-Architecture                   |           |
| 3 | Behavioural Programming Language ConPro              | 9         |
| 4 | Abstract Objects and the External Module Interface 1 | 6         |
| 5 | RTL Architecture 2                                   | 2         |
| 6 | Synthesis2                                           | 5         |
| 7 | Case Study: A Robot Actuator Network2                | :9        |
|   | Control Design                                       | <u>29</u> |
|   | Communication Design: SLIP                           |           |
|   | Synthesis and Results 3                              |           |
| 8 | Conclusion and Outlook 3                             | 5         |
|   | Bibliography3                                        | 6         |



## Introduction and State of the Art

Today there is an increasing requirement for the development of System-On-Chip-Designs (SoC) using Application-Specific Digital Circuits, with increasing complexity, too, serving low-power and miniaturization demands. The structural decomposition of such a SoC into independent submodules requires smart networks and communication (Network-on-Chip, NoC) serving chip area and power limitations. Traditionally, SoCs are composed of microprocessor cores, memory and peripherial components.

But in generally massiv parallel systems require modelling of concurrency both on control- and data-path level. Digital logic systems are preferred for exploration and implementation of concurrency.

Traditionally, digital circuits are modelled on hardware behaviour or gate level, but usually the entry point for a reactive or functional system is the algorithmic level. The Register-Transfer-Logic (RTL) on architecture and hardware level must be derived from the algorithmic level, requiring a raise of abstraction of RTL **ZHU01**.

With increasing complexity, higher abstraction levels are required, moving from hardware to algorithmic level. Naturally imperative programming languages are used to implement algorithms on program-controlled machines which process a sequential stream of data- and control operations. Using this data-processing architecture, a higher-level imperative language can be simply mapped to a lower-level imperative machine language, which is a rulebased mapping, automatically performed by a software compiler.

But in circuit design, there is neither an existing architecture nor an existing low level language that can be synthesized directly from a higher level one.

An imperative programming approach provides both abstraction from hardware and direct implementation of algorithms, but usually reflects the memory-mapped von-Neumann computer architecture model.

Another important requirement of a programming language in circuit design (in contrast to software design) is the ability to have fine-grained control over the synthesis process, usually transparent.

Using generic memory-mapped languages like C makes RTL hardware synthesis difficult because of transparency of object references (using pointers) preventing RTL mapping. Additionally, concurrency models are missing in most software languages. There are many attempts to use C-like languages, but either with restrictions, prohibiting anonymous memory access with pointers, or using a program-controlled (multi-) processor architecture with classical hardware-software-co-design, actually dominant in SoC-Design. But SoC-designs using generic or application-specific processor architectures complicate low-power designs and concurrency is coarse grained.

One example is PICO **KAT02**, addressing the complete hardware design flow targeting SoC and customizable or configurable processors, enhanced with custom-designed hardware blocks (accelerators). The RTL level is modelled with C. The program-controlled approach with processor blocks enables software compilation and unrestricted C (functions, pointers) but lacks support of true bit-scaled data objects.



Another example is SPARK **SPA04**, a C-to-VHDL high-level framework, currently with the restrictions of no pointers, no function recursion, and no irregular control-flow jumps. It is embedded in a traditional hardware-software-codesign flow. It is based on speculative code motions and loop transformations used for exploration of concurrency. SPARK generates pure RTL. Only a single-threaded control-flow is provided.

Though SystemC provides many features suitable for higher-level synthesis, it is primarily used for simulation and verification, and only a subset can be synthesized to circuits. True bit-scaled data types are supported. Concurrency can be modelled using threaded processes, for example used in Forthes commercial synthesis tool Cynthesizer **HLS08**. Interprocess communication is modelled on transaction level (TLM). SystemC provides a high-level-approach to model hardware behaviour and structure, rather than algorithms.

Non of these approaches fully satisfy the requirements for pure RTL circuit design while using C-based languages, especially providing a consistent hardware, software, and concurrency model.

Efficient hardware design requires more knowledge about objects than classical languages like C can provide, for example true bit-scaled registers, access, and implementation models on architecture level (for example singleport versa dualport RAM blocks, static versa dynamic access synchronization). The generic software approach only covers the implementation of algorithms, but in hardware design the synthesized circuit must be connected to and react with the outside world (other circuits, communication links and many more), thus there must be a programming model to interface to hardware blocks, consistent with the imperative programming model. Furthermore, there must be a way to easily implement synchronization always required in presence of concurrency (at least on control path level). A multi-process model, established in the software programmer community, provides a common approach for modelling parallelism, which is the preferred approach to implement and partition reactive systems on algorithmic level.

This article focuses 1. on the design-methodology of SoCs using a concurrent multi-process model and the behavioural programming language Con-Pro **CON08**, 2. the synthesis methods and architecture models for compiling mainly reactive systems using this imperative programming language towards RTL level (modelled on hardware behaviour level VHDL) **CON09**.

Concurrency is modelled explicitly but can be exploited implicitly, too.

The following section 2 describes the used concurrency process model and interprocess communication, and section 3 explains basics of the ConPro programming language. The synthesized RTL architecture with relation to the programming model is described in section 5, and finally section 6 gives an overview of the synthesis process and the synthesis rules.

An extended design study of a protocol implementation suited for sensor- and actuator networks is presented in section 7 and demonstrates the power and suitability of the synthesis approach and tool for complex circuit designs.



Many algorithms used in applications like communication protocolls, sensorsand actuators are in general massiv parallel control systems **LEE06**, with many different tasks to be performed. Therefore, modelling concurrency is an essential part of a complex SoC-design. Concurrency can be either modelled explicitly (not transparent) or implicitly (transparent) by the synthesis tool:

#### **Explicit Parallelism**

The programming model explicitly describes parallelism. Usually this is the preferred method for exploration of coarse-grained parallelism, which requires partitioning on algorithmic level, well done by the programmer, rather by the synthesis tool. No further computational effort must be made by the synthesis tool.

#### **Implicit Parallelism**

The compiler tries to explore and derive parallelism from an initially sequential program specification, described with an imperative language, or using functional languages with (hidden) inherent concurrency **SHA98**. Mostly, concurrency is derived from loops using unroll techniques with allocation of resources in parallel, but concurrency can be explored in basicblocks of data-independent expressions, too. For example, both expressions  $x \leftarrow x + 1$  and  $y \leftarrow y + 1$  can be scheduled (using RTL only) in one time step requiring two adders. Usually this is the preferred method for exploration of fine-grained parallelism on data path level. High computational effort must be made for balancing area and time constraints, usually done with an iterative approach **KU92**.

There are several advantages of the explicit concurrency model versa the implicit model derived from an initially pure sequential code, found in most extended C-like approaches **HLS08**, especially in the context of reactive systems. Knowledge-based modelling of concurrency can lead to a higher degree of concurrency.

A multi-process model with communicating sequential processes provides a concise way, 1. to directly map imperative programming languages to RTL, and 2. to provide parallelism on control path level, required for massive parallel control systems. The multi-process model requires explicit synchronization, shown in figure **1**. Interaction between processes, mainly access of shared resources, is request-acknowledge based.





FIGURE 1: The multi-process model with request-based synchronization (IPC).

Concurrency reduces latency of an algorithm, and minimized latency is a precondition for the design and implementation of real-time systems.

### **Process Model and RTL-Architecture**

A process  $\phi$  provides an execution environment consisting of a control path  $\Gamma$  implemented with a Finite-State-Machine (FSM) and a data path  $\Delta$  performing calculations, shown in figure **2**.



FIGURE 2: The process implementation on hardware architecture level.

A process  $\phi$  bounds a sequence of instructions  $\kappa = \{\kappa_1, \kappa_2, ...\}$  to this execution environment. Process instructions on programming level must be executed in the order they appear (imperative nature). Therefore, the set of program instructions  $\kappa$  can be directly mapped to a set of states  $\Sigma$  of the FSM, implemented entirely in RTL.

An algorithm can be partitioned on control path level using a set of N processes  $\Phi = \{\phi_1, \phi_2, ..., \phi_N\}$ , initially executing independently and concurrently, doing communication, based on the model of communicating sequential processes (CSP) proposed in **HOA85**. A set of interprocess-communication (IPC)  $\Im$  is required for synchronization. IPC creates control relations between processes:  $\Im_1:\phi_n \leftrightarrow \phi_m$ . Using ConPro, it is possible to map multi-processing



and interprocess communication to RTL directly with low resource requirements, shown and proved in this article. The ConPro language CON09 explained in this article provides concurrency both on control path level using processes and on data path level using bounded basicblocks, either specified on programming level or derived automatically by a basicblock scheduler. Synthesis of RTL from an imperative programming language providing the multi-process model can be superior compared with traditional hardware-software-co-design using multi-processor architectures because abstract objects, especially all kind of interprocess-communication, can be implemented more efficiently in hardware than in software, both concerning resources and latency. The set $\Phi$  of processes belongs to a module. On module level, a set of global shared objects $\alpha = \Re \cup \Im$  can be defined, and on process level, local objects can be defined. Processes can access both their local and the global objects. These objects  $\alpha$  are either used for data storage ( $\Re$ ={registers, variables in RAM blocks}), or for IPC ( $\mathfrak{I}=\{mutex, semaphore, queue, timer,...\}$ ). The ConPro synthesis tool maps programming level processes and the instruction sequence<sub>k</sub> to hardware components (entities in VHDL terminology), each consisting of:

- 1. a FSM (state register and state transition network) whose states representing the initial program flow of this particular process,
- 2. combinational data path of RTL (data path multiplexer, demultiplexer, functional units), and
- 3. transitional data path of RTL (data path multiplexer, demultiplexer, functional units, and local registers), shown in figure **2**.

Processes can be controlled by other processes. A process is treated like an abstract data type object (ADTO). Process control is established with the appropriate methods. Starting and stopping of processes are non-blocking operations, thereby calling a process which suspends the caller process until the called (started) process reaches its end state.

## Interprocess-Communication

Concurrency on control path level requires **synchronization AND00**. In the context of a SoC-design consisting of a multi-processor architecture, (coarse-grained) synchronization between different submodules is performed using a distributed or centralized network infrastructure and message passing. In the context of a behavioural multi-process-model (CSP), synchronization is handled by interprocess-communication using abstract objects with an appropriate set of methods applied to, for example, queues (read,write) or semaphores (up,down).

These IPC objects are **content-based**, in contrast to network communication for example in program-controlled multi-processor architectures. Therefore,



these IPC objects are optimally matched to the communication purpose producing lowest communication overhead, and can be implemented directly with hardware blocks. Message passing is reduced to hardware signals.

Realtime-systems require **temporal synchronization**, provided by the IPC objects with time skew and latency lower than one clock cycle, not achievable by generic network architectures and software implementations.

At least the access of shared resources must be protected using mutual exclusion locks (mutex). Access of all global objects is implicitly protected and serialized by a mutex scheduler. IPC and external communication objects are abstract object types, they can only be modified and accessed with a defined set of methods  $v = \{v_1, v_2, ...\}$ , shown in table **1**. Queues and channels can be used in expressions and assignments like any other data storage object.

| IPC Object S | Description                       | Methods v                          |
|--------------|-----------------------------------|------------------------------------|
| mutex        | Mutual Exclusion<br>Lock          | lock, unlock                       |
| semaphore    | Counting<br>Semaphore             | init,up,down                       |
| barrier      | Counting Barrier                  | init, await                        |
| event        | Signal Event                      | init, await,<br>wakeup             |
| timer        | Periodic Timer<br>Event           | init,<br>set,start,<br>stop, await |
| queue (*)    | FIFO queue                        | read, write                        |
| channel (*)  | Handshaken<br>Channel             | read, write                        |
| link         | External<br>Handshaken<br>Channel | read,write                         |

TABLE 1: Available IPC objects. Queues and channels belong both to the core and abstract object class, and can be used within expressions and assignments (\*).

The link IPC object is used for communication between different clock domains, either within a SoC (Interprocess), or between different components (Intraprocess). The link object provides a fast parallel communication channel for Global-Asynchronous-Local-Synchronous (GALS) system design. The link uses dual-rail encoding of data lines and a requestacknowledge handshake, derived from asynchronous circuit design using Muller-C-Gates **BAI01**. Only full-custom ASIC design supports synthesis of asynchronous circuit parts, like C-gates. To provide synthesis of such link components for all technologies (FPGA, standard-cell and custom-designed ASICs) using only synchronous circuits, the Muller-C-Gates are implemented with synchronous FSMs.

Concurrent access of shared resources (including IPC itself) is serialized by a scheduler, which prevents realtime capability on clock resolution, because



the upper limit of access time is unknown and depends on the program and algorithm. But realtime is possible using mutual exclusion which (spatial and temporal) protects parts of a program.



# **Behavioural Programming Language ConPro**

The ConPro programming language consists of two classes of statements: 1. process instructions mapped to FSM/RTL, and 2. type, and object definitions. It is an imperative programming language with strong type checking. Beneath algorithmic statements, the programming language provides some kind of relation to the hardware circuit synthesized from the programming level. Additionally, there is a requirement to get full programmability of the design activities themselves, i. e. of the synthesis process, too **RU87**, implemented here with constrained rules on block level, providing fine-grained control of the synthesis process. The synthesis process can be parameterized by the programmer globally or locally on instruction block level, for example, scheduling and allocation.

The set of objects is splitt into two classes: 1. data storage type set  $\Re$ , and 2. abstract data type object set (ADTO) $\Theta$ , with a subset of the IPC objects  $\Im$ , providing object-orientated programming features with method access. Though it is a traditional imperative programming language, it features true parallel programming both in control and data path, explicitly modelled by the programmer.

Processes provide parallelism on control path level, whereby arbitrary nested bounded blocks inside processes provide parallelism on data path level. There are two extended interfaces connecting the behavioural programming model to external hardware objects: 1. using hardware signals and component structures, or 2. using the External Module Interface (EMI). EMI provides two interface levels: A. the high-level ADTO level with method-based object access, and B. on low-level a behavioural hardware description using a script language extended subset of VHDL. For example. all IPC objects are modelled this way. User objects can be added, too. There are different hierarchy levels provided by the programming model (shown in graph 1):

- 1. The structural module (Module-S) level provides composition of behavioural modules (circuits) to a system-on-chip (SoC) design.
- 2. The behavioural module (Module-B) level contains processes and global objects. A toplevel port defines the interface of the circuit to the outside world.
- 3. The process level contains a finite-state-machine, computational units, and local objects.
- 4. The abstract object Module-O, belonging both to process and behavioural module level. It defines and implements abstract data types objects and the provided method set v.

A module represents a circuit component with an associated toplevel interface hardware port. At least the system clock and reset signals are connected. Some storage objects can be exported with interconnect signals



appearing in the toplevel port. Some abstract objects, for example communication links, have internal input-output signals routed to the toplevel port. A system-on-chip (SoC) can be composed of behavioural module component instantiations using a structural module environment. Either all toplevel port signals of each subcomponent or only a subset is routed to the SoC-toplevelport. In the latter case, there is an interconnection component internally connecting module signals (which can be automatically generated using map instructions).



A **process environment** consists of a unique process name, local object definitions and process instructions. Single processes or an array of processes can be defined using the process environment. Each process executes the associated instruction sequence independently.

**Objects** can be defined within processes with local visibility, or globally on module level. The set of object types  $\alpha$  contains storage  $\Re$ , signals  $\wp$ , and abstract objects  $\Theta = \{\Im, D, E\}$ :  $\alpha = \{\Re, \Theta\}$ . The set D contains data computational objects, for example, random generators and DSP units, and the set E contains external communication objects.

**Data storage** can be implemented with single **registers** or with **variables** sharing one or more memory blocks. Choosing one of these object types is a contraint for synthesis, not a suggestion (in contrast to software programming). Registers provide concurrent-read-exclusive-write (CREW) access behaviour, whereby variables provide only exclusive-read-exclusive-write access behaviour (EREW). Both data storage types can be defined locally on process level or globally on module level. Both registers and variables are **true bit-scaled**, that means, any width ranging from 1 to 64 bit can be used. In the case of variable storage, the data width of the associated memory block is scaled to the widest object stored in this block. Fragmented variable objects are supported.

A strong typed expression model is provided. There is a set of core data



types:  $\beta$ ={logic, int, bool, char}. Product types, both structures and arrays, can be defined to provide user-defined types.

A structure type binds different named elements with defined data types  $\beta$  to a new data type T. The structure type must be defined before an object of this type can be defined: type T: { E1:  $\beta$ 1; E2:  $\beta$ 2; ...}.

The object type  $\alpha$  (register, variable, or signal) is associated during object definition. For each structure element a separate storage element is created. Array definitions consist of object and cell data type specifications in the form: array A:  $\alpha$ [N] of  $\beta$ . Arrays can be accessed dynamically selected. In the case of register or object arrays, index-selected multiplexer and demultiplexer are created. Multi-dimensional storage arrays and arrays of abstract objects including processes are supported.

Example 1 shows some object definitions.

```
◆ Storage Objects ◆
1:
    reg x,y: int[8]; Defines registers
2:
    block ram1; Defines a block RAM
3:
    var a,b,c: int[10] in ram1; Defines variables in RAM
4:
    array mat1: reg[10] of int[23]; Defines an array
5:
    array mat2: var[10] of int[8] in ram1;
6:
7:
    type complex: {
    real: int[16];
8:
     imag: int[16];
9:
    }; Defines a new data type
10:
   reg zcmp: complex; Defines registers of this type
11:
12:
   process xyz:
   begin
13:
    reg t: int[8]; Local data storage
14:
     array ta: var[8] of int[8];
15:
  end; Defines a new process
16:
    ◆ Abstract Objects ◆
17:
    open Mutex; Opens Mutex module required below
18:
    object mu1: mutex; Defines ADTO
19:
```

EXAMPLE 1: Examples of different object definitions distinguished by their object and data type.

```
process p1:
1:
2.
       begin
          a \leftarrow 1, b \leftarrow 3, z \leftarrow x-1; Bounded instruction block
3:
4:
       \Leftrightarrow
         begin
5:
            a \leftarrow 1;
6:
7:
            b←3;
8:
            z←x-1;
9:
          end with bind; Bounded instruction block, too
         x \leftarrow (a+b)*4;
10:
11:
      end;
```



```
12: process p2:

13: begin

14: a \leftarrow 1;

15: b \leftarrow 3;

16: z \leftarrow x-1;

17: x \leftarrow (a+b) * 4;

18: end with schedule="basicblock";
```

EXAMPLE 2: Example of assignments. Lines 3 and 5..9 (parameterized block) reflect equivalent syntax for concurrent statements with identical behaviour. Automatic basicblock scheduling is applied to the second process (parameterized process body block).

- **Expressions** contain data storage objects, constants, and operators. Supported are all common arithmetic, Boolean (logical), and relational operators. Most of them are directly mapped to hardware behaviour level (VHDL operators). Initially, assignments to data storage objects are scheduled in one time step, and the order of a sequence of assignments is preserved. A sequence of data-independent assignments can be bound to one time unit either explicitly by the programmer (bounded block), or implicitly evaluated by the basicblock scheduler (preserving data dependencies, but violating sequence order). A semicolon (without further scheduling constraints) schedules an assignment, whereby a colon-separated list binds assignments to one time unit, shown in example 3, e.g. RTL scheduling originally proposed by Barbacci BAR73. There are different expression models which can be set on block level using the parameter: expr={"flat", "binary", "shared"}. The flat model maps operators of a (nested) expression 1:1 to hardware blocks (no shared resources), the binary mode splits nested expressions into single two-operand subexpressions using temporary registers, improving combinational path delay, and the shared model provides resource sharing of functional operators using ALUs.
- **Control Statements** There are conditional branches, both Boolean and multivalue branching, conditional, unconditional and counting loops, conditional blocking wait-for statements, function calls, and exceptions. Exceptions are abstract (symbolic) signals which can be raised anywhere inside a process, and caught either inside the process where the signal is raised, or outside from any other process calling this respective process. Exceptions are propagated accross process boundaries. Exceptions are the only structural way to leave a control environment, there is no break, continue, or goto statement.
- **Functions** User-defined functions can be implemented in two different ways: 1. as inlined not-shared function macros and 2. as shared function blocks. In the first case, the function call is replaced by all function instructions, and function parameters are replaced by their respective



arguments. In the second case, a function is modelled using the described process with an additional function control block containing a function call lock bound to an access scheduler and registers required for passing function arguments to parameters and returning results. Only call-by-value arguments of atomic objects can be used. The remaining functionality is provided by the underlying process model using the call method. Figure **4** shows the system architecture of a shared function block.

Functions are restricted to non-recursive calls due to a missing stack environment.

Example **3** shows a complete ConPro design. It is the implementation of the dining philosophers' problem using counting semaphores demonstrating resource sharing and scheduling. The story is: five philosophers sit around a circular table. Each philosopher spends his life alternately thinking and eating. In the center of the table is a large platter of spaghetti. Each philosopher needs two forks two eat. But there are only five forks for all. One fork is placed between each pair of philosophers, and they agree that each will use only the forks to the immeadiate left and right **AND00**, here implemented with a semaphore array fork, defined on line 17. The depth parameter specifies the datawidth of the semaphore counter. A register is defined in line 5, and an array of registers in line 7. Though each semaphore is an independent object (in hardware, too), the array can be accessed with dynamic selectors (register, variable), shown for example in lines 21-22 inside a for-loop.

The read ports of the shared registers eating and thinking are exported to the module toplevel port (SoC hardware port level). The design consists of seven processes. The philosophers are implemented with the process array philosopher.

A user-defined structure type is defined in lines 8-14, and finally a register of this type is created in line 15.

A mutex num\_eat\_lock, defined in line 6, is used to protect the incremental and decremental operation of the shared counter num\_eat.

The event ev assures the same starting point for all philosopher processes (synchronization boundary).

```
open Core, Process, Semaphore, System, Event, Mutex;
1:
     object sys: system;
2:
3:
       sys.simu_cycles (500);
     object ev: event;
4:
     reg num_eat: int[8];
5:
    object num_eat_lock: mutex;
6:
     array eating,thinking: reg[5] of logic;
7.
8:
    type stat : {
9:
       phil1: int[8];
10:
       phil2: int[8];
11:
       phil3: int[8];
       phil4: int[8];
12:
       phil5: int[8];
13:
```



```
};
14:
     reg stats: stat;
15:
16:
     array fork: object semaphore[5] with depth=8 and scheduler="fifo";
17:
18:
19:
     process init:
20:
     begin
       for i = 0 to 4 do
21:
          fork.[i].init (1);
22:
        ev.init ();
23:
     end;
24:
25:
     function eat(n: natural):
26:
27:
     begin
       num_eat_lock.lock ();
28:
          num_eat \leftarrow num_eat + 1;
29.
       num_eat_lock.unlock ();
30:
       match n with
31:
        begin
32:
          when 1: stats.phil1 ← stats.phil1 + 1;
33:
          when 2: stats.phil2 ← stats.phil2 + 1;
34:
          when 3: stats.phil3 \leftarrow stats.phil3 + 1;
35:
          when 4: stats.phil4 ← stats.phil4 + 1;
36:
          when 5: stats.phil5 ← stats.phil5 + 1;
37:
38:
        end;
        eating.[n] \leftarrow 1,
39:
          thinking.[n] \leftarrow 0;
40:
        wait for 5;
41:
        eating.[n] \leftarrow 0,
42:
          thinking.[n] \leftarrow 1;
43.
44:
       num_eat_lock.lock ();
45:
          num_eat \leftarrow num_eat - 1;
46:
       num_eat_lock.unlock ();
47:
     end with inline;
48:
     array philosopher: process[5] of
49:
     begin
50:
       if \# < 4 then
51:
52:
       begin
         ev.await ();
53:
         always do
54:
         begin
55:
           - get left fork then right
56:
           fork.[#].down ();
57:
           fork.[#+1].down ();
58:
           eat (#);
59:
           fork.[#].up ();
60:
           fork.[#+1].up ();
61:
         end;
62:
63:
        end
64:
        else
65:
        begin
66:
         always do
67:
         begin
           - get right fork then left
68:
           fork.[4].down ();
69:
```



```
fork.[0].down ();
70:
           eat (#);
71:
           fork.[4].up ();
72:
73:
           fork.[0].up ();
74:
         end;
75:
        end;
76:
     end;
77:
     process main:
78:
     begin
79:
       init.call ();
80:
       for i = 0 to 4 do
81:
       begin
82:
          philosopher.[i].start ();
83:
        end:
84:
       ev.wakeup ();
85:
86:
     end;
87:
     export eating,thinking,num_eat,stats;
88:
```

EXAMPLE 3: A complete ConPro example: the dininig philosopher problem. This implementation demonstrates resource sharing and synchronized access of shared resources using mutex and semaphore objects.

**Synthesis** Gate-level synthesis with a standard cell technology using Leonardo Spectrum and SXLIB standard-cell library results in a circuit with 3919 gates, 235 D-flip-flops, and an estimated longest combinational path of 17 ns (55 MHz maximal clock frequency). The implementation of the semaphore array requires 1733 gates and 70 register, the event requires only 122 gates and 10 register. Each loop iteration of process philosopher up to the and from the eat function requires each 6 clock cycles to complete if all shared resources are immediately granted.



## Abstract Objects and the External Module Interface

Abstract data type objects provide a method based interface to functional and operational hardware blocks, part of the SoC-design. They can be accessed concurrently, requiring an access scheduler serializing and guarding the access to the particular object. A process trying to access the object using a method is blocked (suspended) untill the resource is available and the request was serviced.

Beneath a set of standard objects required for interprocess-communication, the programmer/user can define its own objects and methods using the External Module Interface (EMI).

The EMI defines the object, the available methods and their programming interface, the access of the object and the implementation on hardware behaviour level, at least the access scheduler.

Example **4** shows the definition and implementation of the mutex object. The EMI is divided into different sections.

The EMI language is a combination of a script language (for example lists and list operations) and a subset of VHDL. There are parameter lists, for example the list of all processes \$P accessing the object, or only processes using a particular method M (\$P.M is a subset of \$P). Other core parameters like the object name \$0 or the clock name \$CLK get a value during synthesis.

- **Paramter** This section defines local parameters evaluated during synthesis.
- **Methods** The methods section defines the interface of the provided methods applied to the object.

Assert During synthesis sanity checks can be performed (optional).

- **Interface** The interface section defines interconnect hardware signals required for the RTL implementation of each ConPro process accessing this object.
- **Mapping** Signals required for access of the object (defined in interface section) must be mapped from local process to global module level (in terms of VHDL entity ports and port mappings)
- **Access** This section defines for the object signal access during method call (RTL level).
- **Signals** The signal section defines hardware signals required for the implementation of the object.
- **Process** One or more hardware processes required for object implementation (functional/operational block, at least the scheduler) can be defined and modeled using the process section using a VHDL style language.

```
1: #parameter
```

- 2: begin
- 3: \$scheduler["static","fifo"] <= "static";</pre>



4

[CON10] Page 18

```
end;
4:
5:
     - Supported object methods
6:
7:
8:
     #methods
9:
     begin
       init();
10:
       lock();
11:
       unlock();
12:
13:
     end;
     #assert
14:
     begin
15:
       size($P.init) >= 1;
16:
       size($P.lock) >= 1;
17:
       size($P.unlock) >= 1;
18:
     end;
19:
20:
     - Interface for processes accessing methods
21:
     - of object (process port, local process context)
22:
23:
     #interface
24:
     begin
25:
       foreach $p in $P.init do
26:
27:
       begin
28:
          signal MUTEX_$0_INIT: out std_logic;
       end;
29:
       foreach $p in $P.lock do
30:
       begin
31:
          signal MUTEX_$0_LOCK: out std_logic;
32:
       end;
33
       foreach $p in $P.unlock do
34:
35:
       begin
          signal MUTEX_$0_UNLOCK: out std_logic;
36:
37:
       end;
       foreach $p in $P do
38:
       begin
39:
          signal MUTEX_$0_GD: in std_logic;
40:
       end;
41:
42:
     end;
43:
     - Process mapping of object signals (global module context)
44:
45:
     #mapping
46:
     begin
47:
       foreach $p in $P.init do
48:
49:
       begin
          MUTEX_$0_INIT => MUTEX_$0_$p_INIT;
50:
       end;
51:
       foreach $p in $P.lock do
52:
       begin
53:
          MUTEX_$0_LOCK => MUTEX_$0_$p_LOCK;
54:
55:
       end;
56:
       foreach $p in $P.unlock do
57:
       begin
          MUTEX_$0_UNLOCK => MUTEX_$0_$p_UNLOCK;
58:
       end;
59:
```



```
foreach $p in $P do
60:
        begin
61:
          MUTEX_$0_GD => MUTEX_$0_$p_GD;
62:
63:
        end;
64:
     end;
65:
66:
     - Object method access (local process context)
     - for each method ...
67:
68:
69:
     init: #access
70:
     begin
        #data
71:
        begin
72:
          MUTEX_$0_INIT <= MUTEX_$0_GD when $ACC else '0';</pre>
73:
        end;
74:
        #control
75:
       begin
76.
          wait for MUTEX_$0_GD = '0';
77:
        end;
78:
     end;
79:
     lock: #access
80:
     begin
81:
        #data
82:
83:
        begin
          MUTEX_$0_LOCK <= MUTEX_$0_GD when $ACC else '0';</pre>
84:
85:
        end;
        #control
86:
87:
        begin
          wait for MUTEX_$0_GD = '0';
88:
        end;
89
90:
     end;
91:
     unlock: #access
92:
     begin
        #data
93:
        begin
94:
          MUTEX_$0_UNLOCK <= MUTEX_$0_GD when $ACC else '0';</pre>
95:
        end;
96:
97:
        #control
98:
       begin
99:
          wait for MUTEX_$0_GD = '0';
        end;
100:
     end:
101:
102:
     - Implementation (global module context)
103:
     - VHDL signals required, both for mapping processes
104:
     - and auxilliary signals.
105:
106:
     #signals
107:
     begin
108:
109:
        - Implementation signals
110:
111:
112:
        foreach $p in $P.lock do
113:
        begin
          signal MUTEX_$0_$p_LOCK: std_logic;
114:
115:
        end;
```



```
foreach $p in $P.unlock do
116
       begin
117
          signal MUTEX_$0_$p_UNLOCK: std_logic;
118
119:
        end;
120
        foreach $p in $P.init do
121:
       begin
          signal MUTEX_$0_$p_INIT: std_logic;
122
       end;
123
       foreach $p in $P do
124
       begin
125
          signal MUTEX_$0_$p_GD: std_logic;
126
127
          signal MUTEX_$0_$p_LOCKed: std_logic;
128
       end;
129
     end;
     #signals ($scheduler="static")
130
     begin
131
       signal MUTEX_$0_LOCKed: std_logic;
132
133:
     end;
134:
     - Implementation (global module context)
135:
     - Scheduler process: access serialization
136:
137:
     MUTEX_$0_SCHED: #process ($scheduler="static" and $fsm="moore")
138:
139
     begin
       if $CLK then
140
       begin
141
          if $RES then
142
143:
          begin
            MUTEX_$0_LOCKed <= '0';</pre>
144
            foreach $p in $P do
145
146
            begin
              MUTEX_$0_$p_GD <= '1';</pre>
147
148
            end;
            foreach $p in $P.lock do
149
            begin
150
              MUTEX_$0_$p_LOCKed <= '0';</pre>
151:
            end;
152
          end
153
154
          else
155
          begin
            foreach $p in $P do
156:
            begin
157
              MUTEX_$0_$p_GD <= '1';</pre>
158
            end;
159
            sequence
160
            begin
161:
              foreach $p in $P.init do
162
              begin
163
                 if MUTEX_$0_$p_INIT = '1' then
164
                 begin
165
                   MUTEX_$0_LOCKed <= '0';</pre>
166
167
                   MUTEX_$0_$p_GD <= '0';
168:
                   foreach $1 in $P.lock do
169:
                   begin
                     if MUTEX_$0_$1_LOCKed = '1' then
170
                     begin
171:
```



[CON10] Page 21

```
MUTEX_$0_$1_LOCKed <= '0';</pre>
172
                         MUTEX_0_1GD <= 0';
173
174
                       end:
175
                    end;
176
                  end;
177
               end;
178
               foreach $p in $P.lock do
               begin
179
                  if MUTEX_$0_$p_LOCK = '1' and MUTEX_$0_LOCKed = '0' then
180
                 begin
181:
                    MUTEX_$0_LOCKed <= '1';</pre>
182
183
                    MUTEX_$0_$p_LOCKed <= '1';</pre>
                    MUTEX_$0_$p_GD <= '0';</pre>
184
185:
                  end;
               end;
186
               foreach $p in $P.unlock do
187
               begin
188
                  if MUTEX_$0_$p_UNLOCK = '1' then
189
                  begin
190
                    MUTEX_$0_LOCKed <= '0';</pre>
191:
                    MUTEX_$0_$p_LOCKed <= '0';</pre>
192:
                    MUTEX_$0_$p_GD <= '0';</pre>
193:
                  end;
194:
               end;
195
196
             end;
197
          end;
        end;
198:
     end;
199:
200
```

EXAMPLE 4: Extract from the mutex EMI implementation.

Finally, objects can be created and used within ConPro modules and processes, shown in example **5**. A mutex is used to guard a global counter x, though the access of this register is implicitly guarded, the expression  $x \leftarrow x + 1$  is not mutual, and must be guarded explicitly to guarantee data consistency.

```
open Core; open Process; open Mutex;
1:
     object mu_x: mutex with schedule="fifo";
2:
     reg x: int[6];
3:
     process p1:
4:
     begin
5:
       for i = 1 to 10 do
6:
       begin
7:
          mu_x.lock (); x \leftarrow x + 1; mu_x.unlock();
8:
g.
        end;
10:
     end;
11:
     process p2:
12:
     begin
       for i = 1 to 10 do
13:
       begin
14:
```



```
mu_x.lock (); x \leftarrow x - 1; mu_x.unlock();
15:
        end;
16:
     end;
17:
     process main:
18:
19:
     begin
       mu_x.init (); x \leftarrow 0;
20:
       p1.start (); p2.start ();
21:
     end;
22:
```

**EXAMPLE 5: Abstract Data Type Object creation and method access inside ConPro processes.** 



## **RTL Architecture**

Each high-level process is mapped to a FSM and RTL, already shown in figure **2**. Process instructions are mapped to states of the FSM. Figure **3** shows the process system interconnect using signals. Access of objects is request-based and requires a request signal fed into a mutex-guarded access scheduler, responsible for serialization of concurrent access by different processes. A guard signal is read by the process FSM, providing a simple and efficient two-signal handshake (REQ $\leftrightarrow$ ACT). Each shared object implements an access scheduler block, consisting of an FSM, providing the interface between processes accessing a resource and the resource itself.

The process block interface and system interconnect shown in figure **3** require different signals for the control and data path. Shared objects can be connected to different processes, requiring control signals for atomic access (called guards). All processes and objects are sourced by one system clock and reset signal, thus all functional blocks operate synchronously.



Two different scheduling policies are supported: a simple static priority scheduler and a dynamic FIFO-based scheduler. The first one assigns each process a static priority during compile time. The resource is scheduled in priority order. This can lead to race conditions, whereby one or some processes always get access to the resource, and others never. The dynamic scheduler stores process identifiers in a queue and guarantees resource access in the order the requests arrived.

Local objects are directly implemented in RTL of a process, whereby global shared objects are implemented in separated hardware blocks, connected to processes using signals and to external circuit signals (at least clock and reset). The hardware architecture of a global object consists of the access scheduler block explained above, and the implementation blocks of this object, for example a RAM or communication transmitter. The access scheduler is the interface between the processes (accessing this object) and the implementation blocks (processing the object request).

The structure of the data path depends of chosen expression and temporary register model (shared versa exclusive), shown in graphs 2 and 3. The



[CON10] Page 24 structure of the control path depends on instruction binding and scheduling using bounded blocks and different scheduling strategies, on loop parameters (loop-unrolling), and finally on optimization.

GRAPH 2: Comparison of allocation in different expression models: flat (left) versa shared (right). Instruction sequence:  $x \leftarrow a+b$ ;  $y \leftarrow x+c$ ;  $z \leftarrow y+d$ .



GRAPH 3: Comparison of allocation in different expression models: flat (left) versa shared (right) and shared temporary register model. Instruction:  $z \leftarrow a+b+c+d$ . Additionally the binary expression model is shown in the middle subgraph.



Figure 4 shows the system architecture of a shared function block.





FIGURE 4: Shared function blocks are implemented with a process block and a function call scheduler. Only some of the interconnect signals are displayed.



[CON10] Page 26 The synthesis process is a traditional software compiler flow. The synthesis of RTL circuits from high-level imperative programs passes different phases:

1. First, the source code is **parsed and analyzed**. For each process, an abstract **syntax graph** preserving complex statements is built. Global and local objects are stored in symbol tables (one globally for module level, and one for each process level).

First optimizations are performed on the process instruction graph, for example, constant folding and dead object checking, and elimination of those objects and superfluous statements.

Several program transformations (based on rules and pattern matching) are performed, for example inference of temporary expressions and registers.

A symbolic source code analysis method, called **reference stack scheduler KU92**, examines (local) data storage objects and their history in expressions. The **reference stack scheduler** analyzes the evaluation of data storage expressions with an expression stack, one for each object.

The reference stack transforms a sequence of storage assignments with expression E  $\kappa$ ={ $\Theta \leftarrow E_1, \Theta \leftarrow E_2, ...$ } of a particular storage object  $\Theta$  to a sequence of immutable and unique symbolic variables  $\Theta_i$ : { $\Theta_1 \leftarrow E_1, \Theta_2 \leftarrow E_2, ...$ }. The aim is to reduce statements (using backward substitution and constant folding) and superfluous storage. The reference stack scheduler has an ALAP scheduling behaviour.

2. After analysis and optimization on instruction graph level, these complex instructions (ranging from expressions to loops) are transformed into a linear list of  $\mu$ **Code** instructions, shown in table **2**. The  $\mu$ Code level is an **intermediate representation** of the program code, used in software compilers, too, though no architecture-specific assumption is made on this level, except constraints to the control flow. The  $\mu$ Code can be exported and imported, too. This feature enables a different entry level for other programming language frontends, for example, functional languages **SHA98**.

This intermediate representation allows more fine-grained optimization, allocation, and scheduling. The transformation from syntax graph to  $\mu$ Code infers auxiliary instructions and register (suppose for-loops which require initialization, conditional branching, and loop-counter statements).

Parallelism on data path level is provided by the bind instruction which binds N instructions to one FSM state (one time unit).

The transformation is based on a **set of rules**  $\chi_{\kappa \to \mu}$ , consisting of default rules and user-selectable rules (constrainted rules). This is the

first phase of architecture synthesis by replacing the paradigms of the source language with paradigms of the target machine, in this case a FSM with statements mapped to states and expressions mapped to the data path (RTL). Additionally, the first phase of allocation is performed here.

Data path concurrency is explored either by user-specified bounded blocks or by the **basicblock scheduler**. This scheduler partitions the  $\mu$ Code instructions into basicblocks. These blocks have only one control path entry at the top and one exit at the tail. The instructions of one basicblock (called major block) are further partitioned into minor blocks (containing at least one instruction or a bounded block). From these minor blocks data dependency graphs (DDG) are built. Finally the scheduler selects data-independent instructions from these DDGs with ASAP behaviour.

3. After the first synthesis level, the intermediate  $\mu$ **Code is mapped to an abstract state graph RTL** using a **set of rules**  $\chi_{\mu \to \Gamma\Delta}$ , too, again consisting of default and user-selectable rules. A final conversion step emits VHDL code. This design choice provides the possibility to add other/new hardware languages, like Verilog, without changing the main synthesis path.

The rule set determines resource allocation of temporary registers and functional blocks providing different allocation strategies: shared versa non-shared objects and flat versa shared functional operators and inference of temporary registers. Shared registers and functional blocks introduce signal selectors inside the data path.

RTL is partitioned into a state machine FSM (two hardware blocks, one transitional implementing the state register and one combinational implementing the state switch network), providing the control path, and the data path (consisting of transitional and combinational hardware blocks, implementing functional operators, access of global resources and local registers).

Using the default set of rules, each  $\mu$ Code instruction (except those bounded) is mapped to one state of the FSM requiring one time unit ( $\geq$  1 clock cycle, depending on object guards). Scheduling is mainly determined by the rule set  $\chi_{\kappa \to \mu}$ , rather by  $\chi_{\mu \to \Gamma\Delta}$ .



| Mnemonics                | Descriptions                                                         | Effect                                            |
|--------------------------|----------------------------------------------------------------------|---------------------------------------------------|
| <pre>move(dst,src)</pre> | Data transfer                                                        | ∆:dst←src                                         |
| expr(dst,op1,op,o        | p2Data transfer with<br>binary expression<br>evaluation              | ∆:dst←op(op1,op2)                                 |
| jump(label)              | Unconditional branch                                                 | $\Gamma: \sigma \leftarrow \sigma$ (label)        |
| falsejump(cond,la        | be ${f D}$ onditional branch                                         | $\Gamma:\sigma(\text{label}) _{\neg \text{cond}}$ |
| bind(n)                  | Bind n following<br>instructions to a<br>parallel execution<br>block | $\{\mu_1,\mu_2,\ldots\} \rightarrow \sigma$       |
| fun                      | Abstract Object                                                      | $\Gamma:\sigma _{\mathrm{obj}}$                   |
| obj.meth(args)           | Method Call                                                          | ∆:params <sub>obj</sub> ⇔args                     |
| nop                      | No operation place<br>holder, mostly a<br>result of optimization     | -                                                 |

TABLE 2:  $\mu$ -Code instructions and their effect on data- and control path ( $\Delta$ ,  $\Gamma$ ).

Though no traditional iterative scheduling and allocation strategies are used in this software compiler flow, the non-iterative constraint selective rule based synthesis approach provides inherent scheduling and allocation with strong impact from different optimizers. Summarized there are different levels of scheduling and allocation:

#### **Reference Stack Scheduler**

Operates on syntax graph level and tries to reduce statements, functional operators, and storage, and has impact on scheduling and allocation.

#### **Basicblock Scheduler**

Operates on intermediate  $\mu$ Code level and tries to reduce operational time steps of statements and has only impact on scheduling.

#### **Expression Scheduler**

To meet timing constraints, mainly clock-driven, complex, and nested flat expressions must be partitioned into subexpressions using temporary registers and expanded scheduling. This scheduler has impact both on scheduling and allocation.

#### Optimizer

Classical constant folding, dead code and object elimination, and loop/ branch-invariant code transformations further reduce time steps and resources (operators and storage).



#### **Synthesis Rules**

But finally the largest impact on scheduling and allocation comes from the set of synthesis rules  $\chi = \chi_{\kappa \to \mu} \cup \chi_{\mu \to \Gamma \Delta}$ .

The ConPro synthesis tool was entirely implemented using the functional language ML (OCaML, about 70000 source code lines).

Figure 5 summarizes the synthesis flow. During the synthesis process SYN-TAX  $\rightarrow \mu$ CODE  $\rightarrow$ FSM-STATES  $\rightarrow$ RTL-VHDL, initial block structures from high level control environments (for example branches) are kept, together with source code informations.



FIGURE 5: This figure gives an overview about the ConPro synthesis process, from source code to VHDL RTL.



This case study is part of an ongoing project called ModACT founded by ISIS. Goal of the project is the design and development of a modular robot actuator with a decentralized overall control architecture. Several actuators and sensors are connected in a network. Apart from system architecture and design issues, modern microsystem technologies and integration are related work in this project. Thus, miniaturization and low power design must be satisfied. The high density sensor and actuator network consist of independent nodes, each equipped with: 1. data processing, 2. control and 3. communication blocks **SSI10**. The functionality of each node is implemented entirely in register-transfer-digital-logic using FPGAs (Xilinx, SpartanIII-1000/500).

- An actuator node consists of: 1. a brushless DC-motor, 2. a harmonicdrive-gear, 3. highly miniaturized electronics performing data processing and control, 4. different sensors for angular position, current and temperature, and finally 5. communication.
- A sensor node consists of: 1. a sensor (for example a force/pressure sensor), 2. electronics, and 3. communication.

The ModuACT actuator controller was designed with the multi-process programm model explained in this article and partitioned into 46 behavioural processes and 9 shared functions. Figure **6**, **8** and **9** show most of the processes and their interconnection using IPC, mainly queues. The SoChardware-design can be synthesized for FPGA and ASIC technologies.

## **Control Design**

Figure **6** shows the main parts of the actuator controller consisting of data acquisition (component ADC), data processing performed by process service, a reactive loop activated periodically by the timer service\_timer (time period can be set between 10  $\mu$ s and 10ms). The controller uses a register file consisting of 256 registers (implemented with a scheduled RAM block), and each cell has a datawidth of 12 bit. Each register is assigned to a special purpose, for example PID controller constants, actual measuring results (current, position, temperature and many more).

The position controller is a traditional PID calculator, activated periodically from timer service\_timer, too.

The brushless DC motor requires a three-phase pattern generator to supply the phase-synchronous driving voltages (approximated sinus waves). Each motor coil voltage is generated by PWM generators, performed with process  $pwm_gen.[0]$  to  $pwm_gen.[2]$ . The PWM period time is about 30  $\mu$ s, activated by the timer  $pwm_timer$ . Each PWM generator process is synchronized by the process  $pat_update$  with actual data, getting itself data from the main pattern generator process pat. This process also linearizes the PWM amplitudes (electronic H-bridge compensation). Data is passed by global shared registers. The sinus wave period is different from the PWM period, therefore the process  $pat_syncer$  activates the pattern generator.





FIGURE 6: ModuACT architecture, part 1: control and data processing.

The pattern geneator requires feedback from rotor position of the motor, detected by three magnetic hall sensors, whose data is processed by processes hall and speed. They are activated by timers hall\_timer (1  $\mu$ s) and speed\_timer (10  $\mu$ s), respectively.

## **Communication Design: SLIP**

Communication is an essential part of a complex robot systems, consisting of several actuators and sensor nodes. The Scalable Local Intranet Protocol (SLIP) and a communication controller design was developed. The goals were: 1. low power design and efficiency, 2. low resource design, SoC capable, and 3. adaptable to local communication requirements. SLIP is scalable with respect to network size (address size class ASC), maximal data payload (data size class DSC) and the network topology dimension size (address dimension class ADC).

Network nodes are connected using (serial) point-to-point links, and they are arranged along different metric axes of different geometrical dimensions: a one dimensional network (ADC=1) implements chains and rings, a two-dimensional network (ADC=2) can implement mesh grids, a three-dimensional (ADC=3) can implement cubes, and so on. Both regular (complete) and irregular networks (with missing nodes) are supported for each dimension.

The main problem in message-based communciation is routing and thus addressing of nodes. Absolute and unique addressing of nodes in a high-density sensor-actuator-network is not suitable. An alternative routing strategy is delta-distance routing, used by SLIP. A delta vector  $\Delta$  specifies the way



from the source to a destination node.

An example network (ADC=2) with five nodes connected by bidirectional point-to-point links is shown in figure **7**. Suppose a message should be sent from Node 1 to Node 5. The relative distance is  $\Delta$ =(2,-1).



FIGURE 7: An example network (ADC=2) with five nodes connected by bidirectional point-to-point links.

A message packet contains a header descriptor specifying the type of the packet and the scalable parameters ASC, DSC and ADC.

A packet descriptor follows the header descriptor, containing: the actual delta-vector  $\Delta$ , the original delta-vector  $\Delta^1$ , a backward-propagation vector  $\Gamma$ , a preferred routing direction  $\omega$ , an application layer  $\pi$ , and the length of the following data part.

Each time a packet is forwarded (routed) in some direction, the delta-vector is decreased in the respective dimension entry. For example, routing in x-direction results in:  $\Delta_1 \leftarrow \Delta_1$ -1.

A message has reached the destination if  $\Delta$ =0 and can be delivered to the application layer  $\pi$ .

There are different smart routing rules, applied in order showed below untill the packet can be routed (or discarded):

**route\_normal** This is the main routing rule with the goal to decrease the distance from the actual node to the destination node, finding the shortest path:  $min|\Delta|$ .

Each i-dimension of the  $\Delta$ -vector is checked: if is  $\Delta_i > 0$ , and there is a link in i-direction, route packet with  $\Delta_i \leftarrow \Delta_i$ -1, else if  $\Delta_i < 0$ , and there is a link in -i-direction, route packet with  $\Delta_i \leftarrow \Delta_i$ +1, else try next rule. The starting dimension is specified in the packet descriptor.

- **route\_opposite** Try to send the packet on any other link (but not the direction from which the packet has arrived), resulting in a partial increase of distance. The opposite travel is marked in the  $\Gamma$ -entry.
- **route\_backward** The packet is trapped, no further way to route the packet. Send back the packet on the direction from wwhich it has arrived. The backward travel is marked in the  $\Gamma$ -entry.



[CON10] Page 33 For example, the path N1 $\rightarrow$ N5 can be directly routed using normal routing, first x-direction, finally y-direction. Now assume the link L35 is down, and the packet arrives at node N3, but must be send back to node N2, to change the routing direction, and finally arrives N5 over link L45. The above set of smart routing rules solve such backend-trap problems, supporting irregular network topologies and robustness against link failures. The way back from N5 $\rightarrow$ N1 requires opposite routing at node N4!

The SLIP protocol stack was implemented for ADC=2, ASC=8, and DSC=8, using a partition of 21 processes and 4 functions. Queues are used to supply buffering and synchronized data exchange between processes. Packet structures and packet data are managed in different pools (uisng scheduled RAM arrays).

Figures **8** and **9** show the architecture of the protocol stack. The application layer is implemented with process proto, providing an RPC-based interface to the controller register file.





Incoming data (from a serial link link.[i] with i=1..4) is stored in the first data queue link\_rx.[i]. The data is parsed by the process link\_rx\_proc.[i], reading from the data queue. The process link\_tmo.[i] provides timeout management. If a packet was successfully received, the packet header is stored in the queue pkt\_proc\_queue. Packet are processed by the process pkt\_process, which tries to route the packet using the above routing rules (implemented in three function blocks) or delivers the packet to the application layer. Outgoing packets are processed also by pkt\_process. The routing functions pass the packet to the appropi-



ate queue pkt\_send\_queue.[i], wich is processed by the packet encoder link\_tx\_proc.[i]. This process sends the data stream to the link transmitter.



FIGURE 9: ModuACT architecture, part 3: communication link interface.

## **Synthesis and Results**

The full design consisting of the actuator controller, data processing and communication was synthesized for two target technologies using two different design flows: 1. XilinX-FPGA, Spartan III-1000, Xilinx-ISE synthesis with place & route backends, and 2. standard-cell-library ASIC using the lsi\_10k library and the design compiler from Synopsys. The second one was only used to get a good area estimation of the SoC-design.

Table **3** shows results of the FPGA synthesis. Though a high complex design was implemented, the design fits in a medium sized FPGA. The FPGA resources are fully mapped, but the overall performance (longest comb. path) is still high (up to 60 MHz clock frequency). Table **4** shows results of the ASIC synthesis. This design requires about 300k gates area. In contrast one microprocessor core (for example Leon-2, Sparc V8, using Virtex XCV800 FPGA) requires about 600k gates and achieves only 20 MHz clock frequency **LEO06**.

Table **5** shows synthesis results of some selected processes. Though FSMs have a high number of states, the required area and register count is still low.



| Parameter                               | Value                |
|-----------------------------------------|----------------------|
| number of source code lines<br>(ConPro) | 3381                 |
| number of synthesized VHDL              | 36531                |
| lines                                   |                      |
| total eq. gate count                    | 1433516              |
| longest path time                       | 15 ns (66 MHz clock) |
| number of 4-input LUTs                  | 13448 (15360)        |
| number of block RAMs                    | 20 (24)              |
| number of FSMs                          | 48                   |
| number of flip-flops                    | 3666                 |

TABLE 3: Summarization of VHDL synthesis results of robot joint controller using XIIinx ISE 9.2 and a XIIinx Spartan-III FPGA

| Parameter          | Value            |
|--------------------|------------------|
| number of cells    | 71870            |
| number of nets     | 81624            |
| combinatorial area | 118829 eq. gates |
| non-comb. area     | 188940 eq. gates |
| total area         | 307769 eq. gates |

TABLE 4: Summarization of VHDL synthesis results of robot joint controller using Synopsys Design Compiler and the lsi\_10k standard-cell library.

| Process            | States | Area | Register |
|--------------------|--------|------|----------|
| pkt_process        | 32     | 587  | 42       |
| FUN_route_normal   | 72     | 1269 | 57       |
| FUN_route_opposite | 55     | 879  | 31       |
| FUN_route_backward | 86     | 1244 | 34       |
| FUN_sin            | 33     | 2318 | 71       |
| pid                | 19     | 3723 | 86       |
| proto              | 135    | 3404 | 136      |
| service            | 90     | 2438 | 104      |

TABLE 5: Summarization of synthesis results of some selected processes. The FSM state number is calculated by ConPro, area in equivalent gates and registers by the design compiler (lsi\_10k library).



In this paper, a multi-process programming model was presented for SoC design to explore concurrency required in complex systems like Cyber Physical Systems for control, data processing and communciation.

A new high-level language and a synthesis compiler ConPro used for complex circuit and SoC design was presented, closing the gap between software and hardware level. The programming language provides an algorithmic entry level with additional features for synthesis control concerning scheduling and allocation. True bit-scaled data types are supported. The programming model is based on a concurrent multi-process architecture with interprocess-communication primitives, providing coarse-grained parallelism explicitly modelled. Fine-grained parallelism is supported on data path level and can be explored by the synthesis tool, too.

The synthesis process maps process instructions to states of an FSM and RTL on hardware behaviour level with good performance and resource coverage, using a user selectable set of rules. VLSI design with about 1M gates and beyond can be designed. Synthesis results show good performance of the compiler and good matching results to target technologies like FPGAs compared with traditional microprocessor designs. Though a traditional software compiler design flow is used, the optimizers can reach well-optimized circuit designs. Main application fields are reactive systems, rather than functional and pipelined systems.

The programming-model and synthesis approach was proved with a complex robot actuator design.

In the future, pipelining of the data path must be supported to provide high performance synthesis of functional units.



# Bibliography

| LEE06         | Edward Lee                                                                          |
|---------------|-------------------------------------------------------------------------------------|
|               | Cyber-Physical Systems - Are Computing Foundations                                  |
|               | Adequate?<br>Position Paper for NSF Workshop on Cyber-Physical Sys-                 |
|               | tems: Research Motivation, Techniques and Roadmap,                                  |
|               | Austin, Texas, October 16-17, 2006                                                  |
| SHA98         | Richard Sharp                                                                       |
|               | Higher-Level Hardware Synthesis                                                     |
|               | Springer, 1998                                                                      |
| HOA85         | C. A. R. Hoare                                                                      |
|               | Communicating Sequential Processes                                                  |
|               | Prentice Hall, 1985                                                                 |
| VAN93         | Jan Vanhoof, Karl Van. Rompaey, Ivo Bolsens, Gert                                   |
|               | Goossens, Hugo De Man                                                               |
|               | High-Level Synthesis for Real-Time Digital Signal Processing<br>Kluwer, 1993        |
| KU92          | David C. Ku, Giovanni De Micheli                                                    |
|               | High Level Synthesis of ASICs Under Timing and Synchro-                             |
|               | nization Constraints                                                                |
| • • • • • • • | Kluwer, 1993                                                                        |
| CON08         | Stefan Bosse                                                                        |
|               | ConPro: High-Level Hardware Synthesis With An Imperative<br>Multi-Process Approach  |
|               | Technical Paper, BSSLAB, Bremen, 2008                                               |
| CON09         | Stefan Bosse                                                                        |
| CONCO         | ConPro: Rule-Based Mapping of an Imperative Programming                             |
|               | Language to RTL for Higher-Level-Synthesis Using Commu-                             |
|               | nicating Sequential Processes                                                       |
|               | Technical Paper, BSSLAB, Bremen, 2009                                               |
| SSI10         | Stefan Bosse, Dirk Lehmhus                                                          |
|               | Smart Communication in a Wired Sensor- and Actuator-                                |
|               | Network of a Modular Robot Actuator System using a Hop-                             |
|               | Protocol with Delta-Routing<br>Smart Systems Integration, 23-24.3.2010, Italy, Como |
| ZHU01         | Jianwen Zhu                                                                         |
| 2001          | MetaRTL: Raising the abstraction level of RTL Design                                |
|               | DATE '01: Proceedings of the conference on Design, automa-                          |
|               | tion and test in Europe (2001), pp. 71-76                                           |
| RU87          | Steven M. Rubini                                                                    |
|               | Computer Aids For VLSI Design                                                       |
|               | Addision Wesley 1987                                                                |



[CON10] Page 38

| CLA09 | J. Hilljegerdes, P. Kampmann, S. Bosse, and F. Kirchner<br>Development of an Intelligent Joint Actuator Prototype for<br>Climbing and Walking Robots<br>12th International Conference on Climbing and Walking<br>Robots and the Support Technologies for Mobile Machines,<br>09-11 September 2009, Istanbul, Turkey                                        |
|-------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| KAT02 | V. Kathail, S. Aditya, R. Schreiber, B. R. Rau, D. C. Cronquist<br><i>PICO: Automatically Designing Custom Computers</i><br>IEEE Computer, 35 (9), pp 39-47, 2002                                                                                                                                                                                          |
| SPA04 | Sumit Gupta, R.K. Gupta, N.D. Dutt, A. Nicolau<br>SPARK: A Parallelizing Approach to the High-Level Synthesis<br>of Digital Circuits<br>Kluwer Academic Publishers 2004                                                                                                                                                                                    |
| BAR73 | Mario R. Barbacci<br>Automated Exploration of the Design Space For Register<br>Transfer (RT) Systems<br>Thesis, 1973, Carnegie-Mellon-University                                                                                                                                                                                                           |
| AND00 | Greg Andrews<br><i>Multithreaded, Parallel, and Distributed Programming</i><br>Addison Wesley, 2000                                                                                                                                                                                                                                                        |
| HLS08 | Philippe Coussy, Adam Morawiec (Ed.)<br><i>High-Level Synthesis - from Algorithm to Digital Circuit</i><br>Springer 2008                                                                                                                                                                                                                                   |
| BAI01 | John Baindridge<br>Asynchronous System-on-Chip Interconnect<br>Springer, 2001                                                                                                                                                                                                                                                                              |
| LEO06 | <ul> <li>H. Miranda, J. Tombs, M. A. Echánove</li> <li><i>Implementation of a low cost embedded system using the</i></li> <li><i>Leon2 processor</i></li> <li>XXI Conference on Design of Circuits and Integrated Systems. Conference on Design of Circuits and Integrated Systems (21). Num. 21. Barcelona. Departamento de Electrónica. 2006.</li> </ul> |

