

# **Application Note**



# Compact Zynq System with SW-defined Floating-Point 8xSIMD EdkDSP Accelerator

Trenz Electronic TE0720-2IF Module and TE0706-02 Carrier Board

Jiří Kadlec, Zdeněk Pohl, Lukáš Kohout kadlec@utia.cas.cz, xpohl@utia.cas.cz, kohoutl@utia.cas.cz phone: +420 2 6605 2216 UTIA AV CR, v.v.i.

#### Revision history:

| Rev. | Date       | Author      | Description                                      |
|------|------------|-------------|--------------------------------------------------|
| 1    | 12.01.2018 | Jiří Kadlec | Initial internal draft for the Productive 4.0    |
|      |            |             | consortium meeting 17-18.1.2018 (Lisabon, PT).   |
| 2    | 30.01.2018 | Jiří Kadlec | Demonstrator description prior to the Productive |
|      |            |             | 4.0 project conference 6-7.3.2018 (Athens, GR)   |
| 3    | 14.03.2018 | Jiří Kadlec | Fixed function names in table 6.                 |
|      |            |             | Released for public download.                    |
|      |            |             |                                                  |
|      |            |             |                                                  |
|      |            |             |                                                  |

### Acknowledgements:

This work has been partially supported by ECSEL JU project Productive 4.0 No. 737459.

# **Table of Contents**

| Compact Zynq System with SW-defined Floating-Point 8xSIMD EdkDSP Accelerator and TE0706-02 Carrier Board |    |
|----------------------------------------------------------------------------------------------------------|----|
| 1. EdkDSP IP Core - Introduction                                                                         | 4  |
| 2. Implementation Details                                                                                | 5  |
| 3. EdkDSP IP Core – PicoBlaze6 C Application Interface Functions                                         | 11 |
| 4. EdkDSP IP Core – MicroBlaze C Application Interface Functions                                         | 12 |
| 5. EdkDSP IP Core – Integration with dual core ARM A9 Linux                                              | 17 |
| 6. Setup of Hardware                                                                                     | 18 |
| 7. Reference Application for the 8xSIMD EdkDSP IP Core                                                   | 20 |
| 8. Installation and Use of Base Evaluation Package                                                       | 22 |
| 9. Installation and Use of Extended Evaluation Package                                                   | 35 |
| 10. References                                                                                           | 51 |
| 11. Base Evaluation Package                                                                              | 52 |
| 12. Extended Evaluation Package for PRODUCTIVE 4.0 partners                                              | 53 |
| Disclaimer                                                                                               | 55 |
|                                                                                                          |    |



# **Table of Figures**

| Figure 1: TE0706-02 carrier board with TE0720-2IF Zynq module                                 | 4        |
|-----------------------------------------------------------------------------------------------|----------|
| Figure 2: SDSoC compatible Zynq system with 8xSIMD EdkDSP floating point accelerator          |          |
| Figure 3: 8xSIMD EdkDSP floating point accelerator IP core                                    | 8        |
| Figure 4: Internal details of (8xSIMD) EdkDSP floating point accelerator IP core              | 9        |
| Figure 5: MicroUSB cable: Pmod USBUART. MiniUSB cable: XMOD FTDI JTAG adapter                 | 18       |
| Figure 6: Release demo t01_s. ARM and 8xSIMD EdkDSP terminal output                           | 25       |
| Figure 7: Release demo t01_s. Vivado Lab Tool is open                                         |          |
| Figure 8: Release demo t01_s. Probes file is specified. Trigger conditions are set            | 27       |
| Figure 9: Release demo t01_s. Details of the 8xSIMD EdkDSP LMS filter computation             | 28       |
| Figure 10: Release demo t01_s. Details of the 8xSIMD EdkDSP FIR filter computation            |          |
| Figure 11: Release demo t01_s. Stanalone demo supports measurements of the chip temperature.  | 30       |
| Figure 12: Release demo t01_l. Linux start                                                    |          |
| Figure 13: Release demo t01_I. Login, Compilation of firmware in the EdkDSP C Compiler        |          |
| Figure 14: Release demo t01_l. Program and start 8xSIMD EdkDSP demo                           |          |
| Figure 15: Create new SDK 2017.1 workspace.                                                   |          |
| Figure 16: Import the extended evaluation package projects into the SDK Workspace             | 37       |
| Figure 17: SDK compiles MicroBlaze SW projects for the standalone debug target                |          |
| Figure 18: Debug demo t01_I. Execution of the ./t01_s.elf example from the SD card            |          |
| Figure 19: Debug demo t01_s. Open project edkdsp_fp12_1x8_s for debug                         |          |
| Figure 20: Debug demo t01_s. Start the free-run from the debugger                             |          |
| Figure 21: Debug demo t01_s. Arm started EdkDSP and runs SDSoC akcelerátor demo               |          |
| Figure 22: Debug demo t01_s. MicroBlaze project output (Compiled for Debug)                   |          |
| Figure 23: Compiled EdkDSP firmware. Started debug demo - linux target t01_l                  |          |
| Figure 24: Select MicroBlaze project edkdsp_fp12_1x8_I for debug                              |          |
| Figure 25: Select free run of MicroBlaze project edkdsp_fp12_1x8_l                            |          |
| Figure 26: Output from ARM MicroBlaze fort t01_I. Compiled EdkDSP firmware                    |          |
| Figure 27: Create BOOT.bin for the t01_s demo                                                 |          |
| Figure 28: Create BOOT.bin for the t01_I demo                                                 | 49       |
|                                                                                               |          |
| Table of Tables                                                                               |          |
| Table of Tables                                                                               |          |
|                                                                                               |          |
| Table 1: (8xSIMD) EdkDSP bce_fp12_1x8_40 accelerator vector operations                        | 6        |
| Table 2: PicoBlaze6 ports forming VLIW instruction for the 8xSIMD EdkDSP data flow unit       |          |
| Table 3: PicoBlaze6 precompiled support functions                                             |          |
| Table 4: MicroBlaze access names to 8xSIMD EdkDSP memory banks                                | 11       |
| Table 5: MicroBlaze WAL error codes                                                           | 12       |
| Table 6: MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core                | 12<br>19 |
| Table 7: Organisation of DDR3 memory                                                          |          |
| Table 8: Requirements and results.                                                            |          |
| Table 9: Description of ARM SDSoC acceleration examples compatible with 8xSIMD EdkDSP IP      |          |
| Table 3. Description of Artivi 3D30C acceleration examples compatible with oxiginal EukD3P IP | ∠∠       |



# 1. EdkDSP IP Core - Introduction

This report describes design of compact HW system based on Zynq all programmable 28nm chip with two Arm A9 processors and programmable logic area. System is optimised for Ethernet connected computing nodes serving for industrial automation, local data processing and data communication. The documented HW architecture is one of candidates for wider use within the ECSEL Productive 4.0 project for the edge computing node in the Industry 4.0 solutions.

The demonstrated Zynq system includes the run-time reprogrammable 8xSIMD EdkDSP IP core. It combines the MicroBlaze and the floating point single Instruction multiple data (SIMD) data flow unit (DFU). The SIMD DFU is controlled by a run-time reprogrammable finite state machine implemented by Xilinx PicoBlaze6 8 bit controller with dedicated embedded (on Zyng executed) C compiler.

The application note describes the installation of the HW system, the SW API, algorithmic implementation and mapping to the 8xSIMD EdkDSP IP. Presented HW system is also compatible with the Xilinx SDSoC 2017.1 design environment. The SDSoC is supporting automated compilation of user-defined C/C++ ARM functions into HW accelerators with data movers (zero-copy, DMA, SG-DMA) and the automated integration of generated accelerators with the ARM Linux or stand-alone operating systems.



Figure 1: TE0706-02 carrier board with TE0720-2IF Zyng module



# 2. Implementation Details



Figure 2: SDSoC compatible Zyng system with 8xSIMD EdkDSP floating point accelerator.

#### **Evaluation system**

The 8xSIMD EdkDSP IP Core is programmed, evaluated and debugged in HW Xilinx Zynq module TE0720-IF [1]. The 28nm Xilinx Zynq device xc7z020-2I has two 32 bit ARM Cortex A9 processors operating at 766 MHz and single MicroBlaze 32 bit soft core processor operating at 100 MHz. The Zynq programmable logic area is used for one 8xSIMD EdkDSP IP (operating at 120 MHz) and also for demonstration of compatibility of the EdkDSP IP with examples of Xilinx SDSoC 2017.1 HW accelerators.

The EdkDSP IP Core is 8xSIMD floating point accelerator. It is reprogrammable in runtime by change of firmware of a PicoBlaze6 8bit controller. The PicoBlaze6 controller schedules vector operations performed in the 8xSIMD floating point data paths. The PicoBlaze6 controller serves as re-programmable finite state machine (FSM). It is programmed by firmware compiled by an EdkDSP C Compiler and Assembler. The EdkDSP C Compiler and Assembler are implemented as application programs running on the embedded PetaLinux 2017.1 operating system. The 8xSIMD EdkDSP IP is controlled by the 32bit MicroBlaze processor. MicroBlaze runs programs from the DDR3 memory. The DDR3 is interfaced by an Instruction and Data cache (32k x 32bit) with HPO AXI interface. The 8xSIMD EdkDSP IP is connected to the MicroBlaze by local dual-ported memories. MicroBlaze implements data communication from DDR3 to 8xSIMD EdkDSP dual-ported memories in software. This communication is performed in parallel with the 8xSIMD parallel floating point computation in the 8xSIMD EdkDSP IP.

### Parameters of the 8xSIMD EdkDSP IP core

8x SIMD EdkDSP floating point accelerator IP core supports 8xSIMD vector floating point operations performed from/to dual-ported BRAMs A, B, Z. Each dual-ported BRAM has 8 parallel layers of 1024 32bit words. The set of supported floating point operations is different for different grades [10|20|30|40] of the 8xSIMD EdkDSP accelerator IPs. The supported floating point operations are summarised in  $Table\ 1$ .

5/55



The accelerator bce\_fp12\_1x8\_0\_axiw\_v1\_10 is area optimized and supports only data transfers and vector floating point operations FPADD, FPSUB in 8 SIMD data paths.

The accelerator bce\_fp12\_1x8\_0\_axiw\_v1\_20 performs identical operations as bce\_fp12\_1x8\_0\_axiw\_v1\_10 plus the vector floating point MAC operations in 8 SIMD data paths. MAC is supported for length of vectors 1 up to 10. This accelerator is optimized for applications like floating point matrix multiplication with one row and column dimensions <= 10.

The accelerator bce\_fp12\_1x8\_0\_axiw\_v1\_30 supports identical operations as bce\_fp12\_1x8\_0\_axiw\_v1\_20 plus HW-accelerated computation of the floating point vector-by-vector dot-product operators performed in 8 SIMD data paths. It is optimized for parallel computation of up to 8 FIR or LMS filters, each with size up to 250 coefficients. It is also efficient in case of floating point matrix by matrix multiplications, where one of the dimensions is large (in the range from 11 to 250).

The accelerator bce fp12 1x8 0 axiw v1 40 supports identical operations as bce fp12 1x8 0 axiw v1 30 plus an additional HW support of dot product. It is computed in 8 data paths with HW-supported wind-up into single scalar result propagated into all SIMD planes.

All bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerators support single data path for pipelined, floating-point division operations with vector operands taken from the first SIMD plain and the result is propagated into all 8 SIMD plains.

All bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerators are suitable for applications like adaptive normalised LMS and NLMS filters and square root free versions of adaptive RLS QR filters and adaptive RLS LATTICE filters.

*Table 1:* (8xSIMD) EdkDSP bce\_fp12\_1x8\_40 accelerator vector operations.

| Name in MicroBlaze C value (dec) |      | 8xSIMD Floating point Operation                                     |
|----------------------------------|------|---------------------------------------------------------------------|
| WAL_BCE_JK_VVER                  | = 0  | Return capabilities of the (8xSIMD) EdkDSP accelerator              |
| WAL_BCE_JK_VZ2A                  | = 1  | 8xSIMD copy $a_m[i] \le z_m[j]; m=18$ IP core: 10,20,30,40          |
| WAL_BCE_JK_VB2A                  | = 2  | 8xSIMD copy $a_m[i] \le b_m[j]; m=18$ IP core: 10,20,30,40          |
| WAL_BCE_JK_VZ2B                  | = 3  | 8xSIMD copy $b_m[i] \le z_m[j]; m=18$ IP core: 10,20,30,40          |
| WAL_BCE_JK_VA2B                  | = 4  | 8xSIMD copy $b_m[i] \le a_m[j]; m=18$ IP core: 10,20,30,40          |
| WAL DOE IK WADD                  |      | 0.5040                                                              |
| WAL_BCE_JK_VADD                  | = 5  | 8xSIMD add $z_m[i] \le a_m[j] + b_m[k]$ ; m=18 IP core: 10,20,30,40 |
| WAL_BCE_JK_VADD_BZ2A             | = 6  | 8xSIMD add $a_m[i] \le b_m[j] + z_m[k]$ ; m=18 IP core: 10,20,30,40 |
| WAL_BCE_JK_VADD_AZ2B             | = 7  | 8xSIMD add $b_m[i] \le a_m[j] + z_m[k]$ ; m=18 IP core: 10,20,30,40 |
|                                  |      |                                                                     |
| WAL_BCE_JK_VSUB                  | = 8  | 8xSIMD sub $z_m[i] \le a_m[j] - b_m[k]$ ; m=18 IP core: 10,20,30,40 |
| WAL_BCE_JK_VSUB_BZ2A             | = 9  | 8xSIMD sub $a_m[i] \le b_m[j] - z_m[k]$ ; m=18 IP core: 10,20,30,40 |
| WAL_BCE_JK_VSUB_AZ2B             | = 10 | 8xSIMD sub $b_m[i] \le a_m[j] - z_m[k]$ ; m=18 IP core: 10,20,30,40 |
|                                  |      |                                                                     |
| WAL_BCE_JK_VMULT                 | = 11 | 8xSIMD mult $z_m[i] \le a_m[j] * b_m[k]; m=18$ IP core: 10,20,30,40 |
| WAL_BCE_JK_VMULT_BZ2A            | = 12 | 8xSIMD mult $a_m[i] \le b_m[j] * z_m[k]; m=18$ IP core: 10,20,30,40 |
| WAL_BCE_JK_VMULT_AZ2B            | = 13 | 8xSIMD mult $b_m[i] \le a_m[j] * z_m[k]; m=18$ IP core: 10,20,30,40 |
|                                  |      |                                                                     |
| WAL_BCE_JK_VPROD                 | = 14 | 8xSIMD vector products: IP core: 30,40                              |
|                                  |      | $z_m[i] \le a_m'[jj+nn]*b_m[kk+nn]; m=18; nn range 1255$            |
|                                  |      |                                                                     |
| WAL_BCE_JK_VMAC                  | = 15 | 8xSIMD vector MACs: IP core: 20,30,40                               |
| ar bijoressind                   |      | http://zs.utia.cas.cz                                               |

|                     |      | $z_m[ii+nn] \le z_m[ii+nn] + a_m[jj+nn]$<br>nn range 113                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | * b <sub>m</sub> [kjk+nn];                                                                             |
|---------------------|------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
| WAL_BCE_JK_VMSUBAC  | = 16 | 8xSIMD vector MSUBACs  z <sub>m</sub> [ii+nn] <= z <sub>m</sub> [ii+nn] - a <sub>m</sub> [jj+nn] *  nn range 113                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | IP core: 20,30,40<br>' b <sub>m</sub> [kjk+nn];                                                        |
| WAL_BCE_JK_VPROD_S8 | = 17 | $S$ 8xSIMD vector product (extended) $z_m[i] <= ( (a_1'[jj+nn]*b_1[kk+nn]+a_1+(a_3'[jj+nn]*b_3[kk+nn]+a_1+(a_5'[jj+nn]*b_5[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]+a_5'[kk+nn]$ | -a <sub>4</sub> '[jj+nn]*b <sub>4</sub> [kk+nn]) )<br>-a <sub>6</sub> '[jj+nn]*b <sub>6</sub> [kk+nn]) |
| WAL_BCE_JK_VDIV     | = 20 | vector division (extended)<br>$z_m[i] \le a_1[j] / b_1[k]; m=18$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   | IP core: 10,20,30,40                                                                                   |

## Ports of the 8xSIMD EdkDSP accelerator

| • | bce_atoa[0:9]    | Memory A address (addressing 1024 32 bit floating point values)                    |
|---|------------------|------------------------------------------------------------------------------------|
| • | bce_atob[0:9]    | Memory B address (addressing 1024 32 bit floating point values)                    |
| • | bce_atoz[0:9]    | Memory Z address (addressing 1024 32 bit floating point values)                    |
| • | bce_done[0:7]    | Vector operation in progress or finished                                           |
| • | bce_led4b[0:3]   | 4 bit output, intended for led signalling. (Unconnected in the evaluation design). |
| • | bce_mode[0:3]    | Mode of the communication protocol PicoBlaze6 - MicroBlaze                         |
| • | bce_op[0:7]      | Vector operation to be performed.                                                  |
| • | bce_port[0:7]    | 8 bit output port. (Unconnected in the evaluation design).                         |
| • | bce_port_id[0:7] | 8 bit output External port address.                                                |
|   |                  | Address space [0x0 0x1F] is reserved for optimized construction of the VLIW        |
|   |                  | instruction to the 8xSIMD vector processing unit of the EdkDSP.                    |
|   |                  | Address space [0x20 0xFF] can be used by the user.                                 |
| • | bce_port_wr      | 1 bit output. Write strobe for write of 8 bit data to the external port address.   |
| • | bce_r_pb         | 1 bit output. Reset of the PicoBlaze6.                                             |
| • | bce_we           | 1 bit output. Write strobe signals start of execution of a VLIW instruction by the |
|   |                  | 8xSIMD vector processing unit of the EdkDSP.                                       |
| • | bce_dip4b[0:3]   | 4bit input (Connected to a constant in the evaluation design).                     |
| • | Bce_gpi8b[0:7]   | 8bit input (Connected to a constant in the evaluation design).                     |





Figure 3: 8xSIMD EdkDSP floating point accelerator IP core.

## Interface of the 8xSIMD EdkDSP IP to the MicroBlaze processor

The EdkDSP IP core is connected to the 100 MHz MicroBlaze processor via the 100 MHz 32bit AXI lite bus represented by port **s\_axi**, 100 MHz clock input **axi\_aclk** and an asynchronous reset signal **axi\_aresetn**. See *Figure 3*.



The debug ports are used for the real-time visualisation, debug and analysis of the computation implemented inside of the 8xSIMD data flow unit (DFU) of the (8xSIMD) EdkDSP accelerator IP. This makes easier to debug the compiled PicoBlaze6 firmware code. The implemented in circuit logic analyser (ILA) debug probes can capture 8096 data samples and provide visibility for the auto-generated addresses and for the detailed schedule of vector operation in the 8xSIMD EdkDSP IP core. See *Figure 3*.

Figure 4 presents connection of the two parts of the 8xSIMD EdkDSP IP core.



Figure 4: Internal details of (8xSIMD) EdkDSP floating point accelerator IP core.

All bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerators versions have identical Edk IP part.

The DSP part has identical ports and connectivity (see

Figure 4) for all bce\_fp12\_1x8\_0\_axiw\_v1\_[10|20|30|40] accelerators versions.

The Edk part of the EdkDSP floating point accelerator IP core **bce\_fp12\_1x8\_0\_axiw\_v1\_0\_c** includes inside the PicoBlaze6 controller, its program memories P0 and P1 and the 8xSIMD dual-ported block-ram memories 8xA, 8xB and 8xZ designed for parallel access. The **bce\_fp12\_1x8\_0\_axiw\_v1\_0\_c** IP is designed in the Xilinx System Generator 14.5 and ported to the Vivado 2017.1 compatible IP core. The PicoBlaze6 firmware executes C code and supports C constructs like loops, while, if, else, function calls etc.

The first of the two ports of all block-rams are accessed by the MicroBlaze as memory via the Axi-lite bus.

- The second of the two ports of both program memories P0 and P1 are connected to the PicoBlaze6 controller.
- The second of the two ports of all data memories 8xA, 8xB and 8xZ are connected to the floating point data paths of the data flow unit (DFU) unit and support parallel access.

9/55



The DFU bce fp12 1x8 0 dsp is designed in the Xilinx System Generator for DSP 2017.1. It contains 8 pipelined floating point ADD units, 8 pipelined floating point MULT units and one pipelined floating point DIV unit. The DFU supports all vector operations defined in *Table 1*.

- The 100bit VLIW instruction is transferred in two 50bit ports mem\_bce\_i\_lo and mem\_bce\_i\_hi. The VLIW instruction is set by dedicated PicoBlaze6 output ports. See *Table 2*.
- The 8xSIMD data flow unit executes 8xSIMD floating point operations defined in Table 1.
- The concrete 8xSIMD operation is defined by the PicoBlaze6 DFU OP 8bit output register driving the mem bce op port of the bce fp12 1x8 0 axiw v1 0 c IP. The transfer of the complete VLIW instruction (100+8 bits) is triggered by the write strobe signal mem\_bce we . It is activated by PicoBlaze6 program write of the 8xSIMD operation DFU OP. See Table 2.

The 8xSIMD data flow unit (DFU) indicates end of the operation in the 8bit output port mem\_bce\_done. PicoBlaze6 program can execute few instructions in parallel to the 8xSIMD operation defined in DFU\_OP. End of the 8xSIMD operation is detected by the PicoBlaze6 program by reading of the input 8bit port mem\_bce\_done. PicoBlaze6 firmware defines the sequence of VLIW instructions for the 8xSIMD DFU unit by its dedicated output registers. PicoBlaze6 addresses of these dedicated output registers are listed in Table 2.

Table 2: PicoBlaze6 ports forming VLIW instruction for the 8xSIMD EdkDSP data flow unit.

| Disc Disco C was sistens would found affinition of | Ганнаа      | \/LI\A/       | Description of sections defined in the |
|----------------------------------------------------|-------------|---------------|----------------------------------------|
| PicoBlaze6 registers used for definition of        | Format      | VLIW          | Description of sections defined in the |
| the 100 bit wide VLIW instruction for the          | [msblsb]    | [2x 50bit]    | VLIW instruction for the EdkDSP Data   |
| EdkDSP Data Flow Unit                              |             | mem_bce_i_hi  | Flow Unit                              |
|                                                    |             | mem_bce_i_lo  |                                        |
| [00b, DFU_CNT]                                     | [2bit,8bit] | 10 bit [4940] | Number of 8xSIMD steps (0 255)         |
| [00b, DFU_Z_INC]                                   | [2bit,8bit] | 10 bit [3930] | Auto increment of Z address (0 255)    |
| [DFU_Z_MEM_BANK, DFU_Z_MEM_SADDR]                  | [2bit,8bit] | 10 bit [2920] | Set Z address after auto incr overflow |
| [DFU_Z_MEM_BANK, DFU_Z_MEM_ADDR]                   | [2bit,8bit] | 10 bit [1910] | Initial Z address                      |
| [00b, DFU_B_INC]                                   | [2bit,8bit] | 10 bit [0900] | Auto increment of B address (0 255)    |
| [DFU_B_MEM_BANK, DFU_B_MEM_SADDR]                  | [2bit,8bit] | 10 bit [4940] | Set B address after auto incr overflow |
| [DFU_B_MEM_BANK, DFU_B_MEM_ADDR]                   | [2bit,8bit] | 10 bit [3920] | Initial B address                      |
| [00b, DFU_A_INC]                                   | [2bit,8bit] | 10 bit [2920] | Auto increment of A address (0 255)    |
| [DFU_A_MEM_BANK, DFU_A_MEM_SADDR]                  | [2bit,8bit] | 10 bit [1910] | Set A address after auto incr overflow |
| [DFU_A_MEM_BANK, DFU_A_MEM_ADDR]                   | [2bit,8bit] | 10 bit [0900] | Initial A address                      |
|                                                    |             |               |                                        |
| [0000b, PBP_REG01]                                 | [4bit,4bit] | 8 bit         | Set actual VLIW instr. memory (0 15)   |
| [DFU_OP]                                           | [8bit]      | 8 bit         | Execute SIMD operation with            |
|                                                    |             |               | parameters in the actual VLIW instr.   |
|                                                    |             |               | memory (set by the PBP_REG01 port).    |

10/55



# 3. EdkDSP IP Core – PicoBlaze6 C Application Interface Functions

The EdkDSP compiler embedded compilation of simple C and ASM programs or the PicoBlaze6 controller. PicoBlaze6 programs can use predefined and precompiled library functions listed in *Table 3*. Functions are optimized in the PicoBlaze6 assembler code, and occupy fixed area of the firmware and serve as common simple API for C and ASM PicoBlaze6 programs.

PicoBlaze6 firmware image with precompiled support functions is present in MicroBlaze header file **fill\_def\_program\_store.h** PicoBlaze6 application program firmware is merged with this precompiled image by the MicroBlaze SW program.

Table 3: PicoBlaze6 precompiled support functions

| PicoBlaze6 predefined functions                      | Description                                                         |
|------------------------------------------------------|---------------------------------------------------------------------|
|                                                      |                                                                     |
| unsigned char mb2pb_read_data();                     | Single unsigned char from MicroBlaze to PicoBlaze6                  |
| <pre>void pb2mb_write(unsigned char data);</pre>     | Single unsigned char from PicoBlaze6 to MicroBlaze                  |
| <pre>void pb2mb_eoc(unsigned char data);</pre>       | EOC unsigned char from PicoBlaze6 to MicroBlaze                     |
| <pre>void pb2mb_req_reset(unsigned char data);</pre> | Request from PicoBlaze6 to MicroBlaze to initiate PB reset          |
| <pre>void pb2mb_reset();</pre>                       | Information from PicoBlaze6 to MicroBlaze - PB reset                |
| void pb2dfu_set(unsigned char mem,                   | Set one section of the VLIW instruction for the data flow unit      |
| unsigned char data);                                 | (DFU) to an unsigned char data. VLIW instruction sections are       |
|                                                      | addressed as PicoBlaze6 8bit output ports defined in <i>Table 2</i> |
| void pb2dfu_wait4hw();                               | PicoBlaze6 function is waiting for the termination of data flow     |
|                                                      | unit operation.                                                     |
| unsigned char led2pb();                              | Write from PicoBlaze6 to 4 bit led output port                      |
| unsigned char btn2pb();                              | Read from 4 bit input port to PicoBlaze6                            |
| unsigned char hex_h(unsigned char ch);               | Translate upper 4 bit nibble of an unsigned char to ascii           |
| unsigned char hex_l(unsigned char ch);               | Translate lower 4 bit nibble of an unsigned char to ascii           |
| void pb2lcd_ascii_char(unsigned char ch,             | Write from PicoBlaze6 to LCD asci alphanumerical display            |
| unsigned char pos);                                  |                                                                     |

11/55





# 4. EdkDSP IP Core – MicroBlaze C Application Interface Functions

MicroBlaze program is responsible for data communication, programming and initialization of the PicoBlaze6 and global scheduling of the implemented algorithm. The API providing MicroBlaze - Picoblaze6 interface is called Worker Abstraction Layer (WAL).

- 8xSIMD EdkDSP memory pointers and program memory pointers (from MicroBlaze view) are defined in *Table 4*.
- WAL error codes are defined in Table 5.
- 8xSIMD EdkDSP is supported by API functions collected in the WAL API are listed and described in *Table* 6.

Table 4: MicroBlaze access names to 8xSIMD EdkDSP memory banks

| MicroBlaze access names | Description of the 8xSIMD EdkDSP memory banks                      |
|-------------------------|--------------------------------------------------------------------|
| WAL_BCE_JK_DMEM_A       | index of the A data memory banks (8x [01023] 32bit words)          |
| WAL_BCE_JK_DMEM_B       | index of the B data memory banks (8x [01023] 32bit words)          |
| WAL_BCE_JK_DMEM_Z       | index of the Z data memory banks (8x [01023] 32bit words)          |
|                         |                                                                    |
| WAL_CMEM_MB2PB          | index to MB2PB control memory (the control register of the worker) |
| WAL_CMEM_PB2MB          | index to PB2MB control memory (the status register of the worker)  |
| WAL_PBID_P0             | index to P0 control memory (PicoBlaze program memory 1)            |
| WAL_PBID_P1             | index to P1 control memory (PicoBlaze program memory 2)            |

Table 5: MicroBlaze WAL error codes

| MicroBlaze WAL codes | Value | Description                 |
|----------------------|-------|-----------------------------|
| WAL_RES_OK           | 0     | all is OK                   |
| WAL_RES_WNULL        | 1     | argument is a NULL          |
| WAL_RES_ERR          | -1    | generic error               |
| WAL_RES_ENOINIT      | -2    | not initiated               |
| WAL_RES_ENULL        | -3    | null pointer                |
| WAL_RES_ERUNNING     | -4    | worker is running           |
| WAL_RES_ERANGE       | -5    | index/value is out of range |

Table 6: MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core

# MicroBlaze API functions for communication with 8xSIMD EdkDSP IP core

wal\_init\_worker() - generalised function for worker initialising

This function is designed for calling from user application. The function checks if the \*wrk structure is prepared to initiate worker (the family description structure must be set). Then the assigned family function (init\_wrk()) is called. In the called function all arrays of pointers to shared memories should be initiated.

Return Value: The function returns return code WAL\_RES\_OK if successful and WAL\_RES\_E... if any error occurs.

int wal\_init\_worker(struct wal\_worker \*wrk);





<sup>\*</sup>wrk is a pointer to the worker structure.

## wal done worker - generalised function for worker clean-up

\*wrk is a pointer to the worker structure

This function is designed for calling from user application. The function calls done function (done\_wrk()) assigned to family description structure. In the called function all dynamically allocated worker structures, memories and resources should be clean-up and released if they have been created in the worker init function.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_done\_worker(struct wal\_worker \*wrk);

# wal\_reset\_worker() - generalised function for worker hard reset

\*wrk is a pointer to the worker structure

This function is designed for calling from user application. The function calls reset function (reset\_wrk()) assigned to the family description structure. In the called function the worker control registers should be reset (by HARD RESET bit in the worker control register). The reset is not acknowledged by accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_reset\_worker(struct wal\_worker \*wrk);

wal start operation() - generalised function for starting operation on the accelerator.

\*wrk is a pointer to the worker structure. \*pbid is an index of used PB firmware ( WAL\_PBID\_...)

This function is designed for calling from user application. The function checks if the accelerator is in the idle state and then it calls function for starting operation (start\_op()) assigned to the family description structure. The called function should start a new accelerator operation by setting accelerator control register and checking status register. This function is blocking, i.e. it waits for acknowledgement from accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_start\_operation(struct wal\_worker \*wrk, unsigned int pbid);

wal end operation() - generalised function for finishing operation on the accelerator.

\*wrk is a pointer to the worker structure.

This function is designed for calling from user application. The function checks if the accelerator is in processing state and then it calls function for ending operation (end\_op()) assigned to the family description structure. The called function should stop processing operation on the accelerator. And it waits for synchronization with the accelerator, therefore the function is blocking.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_end\_operation(struct wal\_worker \*wrk);

**wal\_mb2pb()** - generalised function for setting worker control register.

\*wrk is a pointer to the worker structure. data is user data to be send to worker control register.

This function is designed for calling from user application. The function calls function for setting worker control

13/55



register (mb2pb()) assigned to the family description structure. The called function should send user data through control register with controlling READ bit. It should also waits for synchronization with accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal mb2pb(struct wal worker \*wrk, const uint32 t data);

wal\_pb2mb() - generalised function for reading worker status register.

\*wrk is a pointer to the worker structure. \*data is a pointer to an output buffer where read user data is written.

This function is designed for calling from user application. The function calls function for reading worker status register (pb2mb()) assigned to the family description structure. The called function should read user data through worker status register with waiting for synchronization with accelerator.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_pb2mb(struct wal\_worker \*wrk, uint32\_t \*data);

wal\_mb2cmem() - generalised function for writing a block of data to any worker control or support memory

\*wrk is a pointer to the worker structure. memid is an index of control/support memory where data are written to ( WAL\_CMEM\_... or WAL\_...\_SMEM\_...). memoffs is offset in selected memory (in words not in bytes). outbuf is a pointer to memory where data are read from. len is a number of words to copy from outbuf to accelerator control memory.

This function is designed for calling from user application. The function checks index of the required memory and then it calls function for writing data to any control/support memory (mb2cmem()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required index **memid**. For accessing support memories they have to define indices greater then indices to control memories. Then the called function should copy a block of data from CPU memory **outbuf** to an accelerator control/support memory selected by **memid** and offset in selected memory **memoffs**.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_mb2cmem(struct wal\_worker \*wrk, unsigned int memid, unsigned int memoffs, const uint32\_t \*outbuf, unsigned int len);

wal\_cmem2mb() - generalised function for reading a block of data from any worker control or support memory

\*wrk is a pointer to the worker structure. memid is an index of control/support memory where data are read from

( WAL\_CMEM\_... or WAL\_...\_SMEM\_...). **memoffs** is offset in selected memory (in words not in bytes). \*inbuf is a pointer to memory where data are written to. len is a number of words to copy from accelerator control memory.

This function is designed for calling from user application. The function checks index of the required memory and then it calls function for reading data from any control/support memory (cmem2mb()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required index **memid**. For accessing support memories they have to define indices greater then indices to

signal processing



control memories. Then the called function should copy a block of data from the accelerator control/support memory selected by **memid** and offset in selected memory **memoffs**.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_cmem2mb(struct wal\_worker \*wrk, unsigned int memid, unsigned int memoffs, uint32 t \*inbuf, unsigned int len);

wal\_mb2dmem() - generalised function for writing a block of data to any worker data memory

\*wrk is a pointer to the worker structure. simdid is an index of SIMD which data memories are indexed. memid is an index of control/support memory where data are written to (WAL\_CMEM\_... or WAL\_...\_SMEM\_...). memoffs is offset in selected memory (in words not in bytes). \*outbuf is a pointer to memory where data are read from. len is a number of words to copy from \*outbuf to accelerator control memory.

This function is designed for calling from user application. The function checks index of the required memory and then it calls function for writing data to any data memory (mb2dmem()) assigned to the family description structure. The called function should get a pointer to the right memory according to the required SIMD **simdid** and memory index **memid**. Then the called function should copy a block of data from CPU memory \*outbuf to the accelerator data memory with offset inside the selected memory **memoffs**.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_mb2dmem(struct wal\_worker \*wrk, unsigned int simdid, unsigned int memid, unsigned int memoffs, const void \*outbuf, unsigned int len);

# wal\_dmem2mb() - generalised function for writing a block of data to any worker data memory

\*wrk is a pointer to the worker structure. simdid is an index of SIMD which data memories are indexed. memid is an index of control/support memory where data are read from ( WAL\_CMEM\_... or WAL\_...\_SMEM\_...). memoffs is offset in selected memory (in words not in bytes). \*inbuf is a pointer to memory where data are written to. len is a number of words to copy from accelerator control memory.

This function is designed for calling from user application. The function checks index of the required memory and then it calls function for reading data from any data memory (dmem2mb()) assigned to the family description structure. The called function should get pointer to the right memory according to the required SIMD **simdid** and memory index **memid**. Then the called function should copy a block of data from the accelerator data memory with offset inside the selected memory **memoffs**.

Return Value: The function returns WAL RES ... codes.

int wal\_dmem2mb(struct wal\_worker \*wrk, unsigned int simdid, unsigned int memid, unsigned int memoffs, void \*inbuf, unsigned int len);

## wal\_set\_firmware() - generalised function for writing PicoBlaze firmware

\*wrk is a pointer to the worker structure. **pbid** is an index of used PB firmware ( WAL\_PBID\_...). \*fwbuf is a pointer to a firmware in CPU memory. fwsize is a size of the firmware in words, it can be a negative value to set full firmware (4096 words).

This function is designed for calling from user application. The function checks if all arguments are correct and then it calls function for writing PB firmware (set\_fw()). The called function should copy firmware from CPU memory \*fwbuf to PicoBlaze6 program memory in the accelerator. The PB program memory is selected by the

signal processing



argument **pbid**. The firmware needn't be full 4096 word long. The firmware length (in words) can be set by the argument **fwsize**. If the **fwsize** is a negative value (you can use defined value WAL\_FW\_WHOLE) the function assumes the FW length is 4096 words.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_set\_firmware(struct wal\_worker \*wrk, int pbid, const unsigned int \*fwbuf, int fwsize);

wal bce jk get id() - implementation of the worker get id() function for the BCE JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of used PB firmware ( WAL\_PBID\_...). **outval** is a pointer to an output buffer for read worker ID.

The function emulates reading worker ID from hardware because the BCE\_JK families don't support this operation in the hardware.

Return Value: The function always returns WAL RES OK.

int wal jk get id(struct wal worker \*wrk, int pbid, unsigned int \*outval);

wal\_bce\_jk\_get\_cap() - implementation of the worker get\_cap() function for the BCE\_JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of used PB firmware ( WAL\_PBID\_...). \*outval is a pointer to an output buffer for read capabilities.

The function sends operation WAL\_BCE\_JK\_VVER to accelerator, reads the worker capabilities and returns the read value in the \*outval buffer.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_bce\_jk\_get\_cap(struct wal\_worker \*wrk, int pbid, unsigned int \*outval);

wal\_bce\_jk\_get\_lic() - implementation of the get\_lic() function for the BCE\_JK families

\*wrk is a pointer to the worker structure. **pbid** is an index of used PB firmware (WAL\_PBID\_...). \*outval is a pointer to an output buffer for read license.

The function reads the license from the worker. For BCE\_JK families the license is a 2bit license down-counter contained in the value returned by accelerator operation WAL\_BCE\_JK\_VVER. The 2bit license counter is returned in the \*outval buffer.

Return Value: The function returns WAL\_RES\_... codes.

int wal\_bce\_jk\_get\_lic(struct wal\_worker \*wrk, int pbid, unsigned int \*outval);

All worker abstraction layer API functions listed in  $Table\ 6$  are precompiled into the MicroBlaze library **wal.a** and declared in MicroBlaze header files wal.h and **wal\_bce\_jk.h** .

The worker abstraction layer API functions listed in *Table 6* support instantiation of several (more than 1) instances of the 8xSIMD EdkDSP IP core.





# 5. EdkDSP IP Core – Integration with dual core ARM A9 Linux

The 8xSIMD EdkDSP IP core is integrated in a tester system with architecture presented in *Figure 2* and photo of the HW presented by *Figure 1* and *Figure 5*.

The dual core ARM Cortex A9 system runs configured PetaLinux 2017.1 operating system and supports:

- Ethernet 1 Gbit
- SSH, telnet, FTP, ...
- The system image is located on SD card. After the initial boot, the file system is decompressed to the RAM FS in DDR3. The SD card file system is mounted and visible in the running Petalinux.
- Symmetrical multiprocessing on two ARM A9 processors
- SDSoC 2017.1 generated HW accelerators with data movers based on:
  - Simple DMA with HW supported data movers (DMA data width 32bit or 64bit) with no ARM interrupts. Simple DMA requires allocation of continuous memory space.
  - SG DMA with data movers (DMA data width 32bit or 64bit) with ARM interrupts. SG DMA can
    work with continuous allocation of memory or with standard Linux allocation of memory,
    where the continuous allocation is not guaranteed.
  - HW data movers connected to the advanced cache coherent port resolving in HW the cache coherency of dual core ARM access and data mover access to DDR3.

The MicroBlaze processor and the 8xSIMD EdkDSP IP core require initialisation and synchronisation with Linux and the dual core ARM subsystem. This is arranged by the following configuration of reserved DDR3 memory (1 GB)

Memory Area (in Bytes) Size Description 0x0000 0000 ... 0x27FF FFFF 640 M Byte Memory managed by standard Linux memory allocation mechanism. Used by dual core Arm A9 symmetrical multiprocessing 32 bit Linux 0x2800 0000 ... 0x280F FFFF Reserved for MicroBlaze – ARM communication 1 M Byte It is continuous memory reserved in Linux configuration 0x2800 0000 ... 0x2810 0FFF 4 kByte Reserved for PicoBlaze6 f0 firmware (MicoBlaze and ARM) 0x2800 1000 ... 0x2810 1FFF 4 kByte Reserved for PicoBlaze6 f1 firmware (MicoBlaze and ARM) 0x2800 2000 ... 0x2810 2FFF 4 kByte Reserved for PicoBlaze6 f2 firmware (MicoBlaze and ARM) 0x2800 3000 ... 0x2810 3FFF 4 kByte Reserved for PicoBlaze6 f3 firmware (MicoBlaze and ARM) 0x2800 4000 ... 0x281F FFFF Reserved for 8xSIMD EdkDSP data (MicoBlaze and ARM) Reserved 0x2810 0000 ... 0x29FF FFFF MicroBlaze program & data. Microblaze processor IP is 15 M Byte configured for execution of its code from 0x28100000. It is a part of the continuous memory reserved in Linux. 0x2A00 0000 ... 0x2FFF FFFF Continuous memory reserved for video frame buffers. 112 M Byte 0x3000 0000 ... 0x3FFF FFFF 256 M Byte Memory reserved for SDSoC data mover and DMA drivers.

Table 7: Organisation of DDR3 memory

Linux user application uses the four reserved 4k Byte areas for copy of four PicoBlaze6 firmware programs. These programs can be compiled on the dual core ARM A9 from the C and ASM source codes stored as asci files on the mounted SD card file system. Compiled firmware programs are read by the user application running on ARM from the SD card files and copied as data to the reserved 4kB continuous memory areas. MicroBlaze program (after HW mutex based synchronisation) reads this data and uses them for programming of PicoBlaze6 FSM of the 8xSIMD EdkDSP IP.

signal processing



# 6. Setup of Hardware

HW setup is based on commercially accessible components [1], [2], [3], [4], [5]:

**TE0720-2IF**; Part: XC7Z020-2CLG484I; 1 GByte DDR; Industrial Grade (-40°C to +85°C) [1].

Heatsink for TE0720, spring-loaded embedded [2].

**TE0706-02 Carrier Board** from Trenz Electronic [3]

Pmod USBUART Serial converter & interface [4].

TE0790-02 XMOD FTDI JTAG Adapter - Xilinx compatible [5].

See the technical reference manuals (TRM) for the description of the TE0720-02-2IF or TE0720-03-2IF module [1] and the TE0706-02 carrier board [3].



Figure 5: MicroUSB cable: Pmod USBUART. MiniUSB cable: XMOD FTDI JTAG adapter.



## Set the TE0706-02 carrier board switches and jumpers for the TE0720-03-2IF module as follows:

The TE0720-2IF Zyng device works with all IO-bank supply-voltages 3.3V.

- Set jumpers to generate VCCIOA= VCCIOB= VCCIOC=3.3V J10: connect 2-3; J11: connect 2-3; J12: connect 2-3
- Set switch S1:
   1=ON 2=ON 3=ON 4=OFF

The TE0706-02 board ARM serial terminal/JTAG is connected to the PC by a Mini USB (type B) cable via the TE0790-02 XMOD FTDI JTAG adapter [5]. See *Figure 1* and *Figure 5*.

- Set switch in the XMOD module to:
- 1=ON 2=OFF 3=ON 4=OFF

The serial terminal for MicroBlaze is connected to the PC by a Micro USB cable via the USBUART pmod. The J6 connector on the TE0706 carrier board has three lines of 32 pins named:

[A1 ... A32] [B1 ... B32] [C1 ... C32].

The USBUART pmod is connected to pins [B1 ... B6] of connector J6B (central line B). See the implemented solution on *Figure 5*.

• The jumper on the USBUART pmod is set to the default: connect lcl-vcc. With his setup, the USBUART pmod convertor chip is powered from the 5V USB from the PC and generates the local 3.3V for the pmod. See *Figure 1* and *Figure 5*.





# 7. Reference Application for the 8xSIMD EdkDSP IP Core

The reference application problem is the active acoustics noise cancellation for the hands free telephony.

The near end signal e(i) (voice of a speaker) is disturbed by a disturbance signal received by the near end microphone. This unknown disturbance y(i) is generated by a known (measured) far end signal (example: noise from the motor engine) u(i). The objective of the active acoustics noise cancellation is to use the measured disturbed near end microphone signal d(i) and the signal measured by the far end microphone u(i) for reconstruction of the near end speaker signal e(i) with cancelled disturbance.

The transfer function from the far end (known) source of the disturbance is modelled by a recursive FIR filter with 2000 coefficients with sampling rate 90 kHz.

#### **Recursive FIR filter algorithm:**

Objective of FIR filter is to generate sequence of modelled system outputs d(i) based on the sequence of system inputs u(i) and constant vector of N FIR filter coefficients. The generated output sequence includes also the random additive output noise defined by white noise signal e(i).

```
x(i) = u(i)

y(i) = [w(1), w(2), ..., w(N)] * [x(i), x(i-1), ... x(i-N+1)]^T

d(i) = y(i) + e(i)
```

## Recursive adaptive LMS filter algorithm:

Objective of adaptive LMS filter is to identify recursively an unknown vector of N=2000 FIR filter coefficients from a sequence of system inputs u(i) and system outputs d(i) with sampling rate 90 kHz. The algorithm works under an assumption that the measured output sequence d(i) has been generated by a FIR filter with unknown coefficients with dimension N=2000 and includes also the unknown random white noise signal. Signal e(i) is estimated by the adaptive LMS filter.

```
x(i) = u(i)

y(i) = [w(1), w(2), ..., w(N)] * [x(i), x(i-1), ... x(i-N+1)]^T

e(i) = d[i]-y[i]

[w(1), w(2), ..., w(N)] = [w(1), w(2), ..., w(N)] + mu * e(i) * [x(i), x(i-1), ... x(i-N+1)]
```

Where N is order of the FIR and LMS filter. N = 2000 in the implemented designs.

- u(i) is scalar, floating point input to the system
- d(i) is scalar, floating point output of a system
- y(i) is scalar, floating point output of FIR filter
- e(i) is scalar, floating point prediction error
- [w(1), w(2), ..., w(N)] is vector of N scalar, floating point FIR filter coefficients, N=2000.

mu is scalar, floating point constant used for control of the speed of convergence of the adaptive LMS filter.

## The 8xSIMD EdkDSP IP Core

The 8xSIMD EdkDSP IP Core is configured for accelerated floating point computation of the recursive FIR filter with constant parameters N=2000 and for acceleration of the adaptive recursive LMS filter with N=2000 unknown coefficients with required sustained sampling frequency 90 kHz. The FIR filter models the environment and generates the sequence of u(i), d(i) data measurements. The LMS filter serves for reconstruction of the unknown e(i) sequence – the speaker voice with partially cancelled disturbance from the far distance source.

signal processing http://zs.utia.cas.cz



Requirements and main implementation results (for the floating point FIR & LMS filter implementation on the 8xSIMD EdkDSP IP) are listed in *Table 8*.

Table 8: Requirements and results.

| Parameter                              | Requirement | SW MicroBlaze 100 MHz | 8xSIMD EdkDSP 120 MHz |
|----------------------------------------|-------------|-----------------------|-----------------------|
| FIR filter sampling rate Order N=2000  | 90 kHz      | 2.5 kHz (NO)          | 288 kHz (YES)         |
| FIR sustained performance (MFLOPs)     | 360 MFLOPs  | 10 MFLOPs (NO)        | 1152 MFLOPs (YES)     |
| LMS filter sampling rate Order N=2000  | 90 kHz      | 1.25 kHz (NO)         | 92 KHz (YES)          |
| LMS sustained performance (MFLOPs)     | 720 MFLOPs  | 10 MFLOPs (NO)        | 738 MFLOPs (YES)      |
| Bit exact identical results for 8xSIMD | Required    | YES                   | YES                   |
| EdkDSP IP and MB (FIR and LMS)         |             |                       |                       |
| Parallel EdkDSP computation and data   | Required    | YES                   | YES                   |
| transfers to/from DDR3 by MB           |             |                       |                       |
| Runtime change of 8xSIMD EdkDSP IP     | Required    | NA                    | YES                   |
| Embedded 8xSIMD EdkDSP C compiler      | Required    | NA                    | YES                   |
| Compatibility with SDSoC 2017.1        | Required    | YES                   | YES                   |
| Compatibility with PetaLinux 2017.1    | Required    | YES                   | YES                   |
| Compatibility with free SDK 2017.1     | Required    | YES                   | YES                   |
| and free edition of Vivado HLS 2017.1  |             |                       |                       |

### Summary of main results:

- The required LMS filter sampling rate 90 KHz (with N=2000) was reached.
- The maximum is 92 kHz for the adaptive LMS filter and 288 kHz for the FIR filter.
- The sustained floating-point performance of the 8xSIMD EdkDSP is 738 MFLOPs in case of the adaptive LMS filter and 1152 MFLOPs in case of the FIR filter.
- The 8xSIMD EdkDSP operates in parallel to the Cortex A9 processor without additional computing load.
- The 8xSIMD EdkDSP operates in parallel to each of the 21 Linux examples and 19 standalone examples
  of HW accelerators generated from selected Cortex A9 C/C++ functions in the Xilinx SDSoC 2017.1
  design environment.
- The embedded C/ASM compiler utilities for the 8xSIMD EdkDSP accelerator run as Linux applications on the dual core Arm Cortex A9 processor. These utilities can re-compile new EdkDSP firmware from the modified C/ASM source code in the runtime.



# 8. Installation and Use of Base Evaluation Package

This chapter describes the installation and use of a base evaluation package. Package is demonstrating:

- In-circuit Logic Analyser (ILA) JTAG based inspection/observation/debug of the 8xSIMD EdkDSP IP. ILA works with internal buffer for 8k samples and operates at 120 MHz. See Figure 7, Figure 8, Figure 9, Figure 10.
- The standalone examples support ILA and additionally can display the on-chip temperature via JTAG. See Figure 11
- Embedded Compilation from a C/ASM source code to firmware for the reprogrammable PicoBlaze6 finite state machine (FSM) scheduling inside of the 8xSIMD EdkDSP IP core the floating point computation sequences performed in the 8xSIMD data flow unit (DFU). This embedded compilation is supported for the Linux examples. See Figure 12 Figure 13, Figure 14.
- There is no need to install Xilinx SDK 2017.1 or Xilinx Vivado 2017.1 tools.
- The In-circuit Logic Analyser (ILA) JTAG based inspection/observation/debug can be performed from the free Xilinx Lab Vivado 2017.1 tool installed on Win7 (64bit) or Win 10 (64bit) PC
- The In-circuit Logic Analyser (ILA) JTAG based inspection/observation/debug can be also performed from a 32bit PC with the Xilinx Lab Vivado 2016.4 tool installed on Win7 (32bit) or Win 10 (32bit).
- The linux target examples support 1GBit Ethernet, SSH telnet and file system management tools like the Total Commander for an Ethernet based access from PC to the SD card files and editing of these files from user PC.

The base evaluation package provides 21 demos for the Linux target and the 19 precompiled demos for the standalone target. Table 9 describes demos, PL resources and the HW/SW SDSoC 2017.1. acceleration data.

Table 9: Description of ARM SDSoC acceleration examples compatible with 8xSIMD EdkDSP IP

| Linux          | Stand-<br>alone | Description of ARM SDSoC acceleration examples. All examples are extended versions                                                              |
|----------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------------------------|
|                | alone           | of the Xilinx GitHub SDSoC 2017.1 examples. SW extensions support the initialisation of the MicroBlaze processor and the 8xSIMD EdkDSP IP core. |
| HW/SW          | HW/SW           | of the Microbiaze processor and the oxshirid Eukosp in tore.                                                                                    |
| Accel.         | Accel.          |                                                                                                                                                 |
| t01_l          | t01_s           | array_partition - This example shows how to use array partitioning to improve                                                                   |
| 7.06           | 40.00           | performance of a hardware function.                                                                                                             |
| 7.86x          | 10.08x          | Slices: 78.28% Luts: 46.27% Registers: 29.12% Block RAM: 91.79% DSPs: 79.55%                                                                    |
| t02_l          | t02_s           | burst_rw - This is simple example of using AXI4-master interface for burst read and                                                             |
|                |                 | write.                                                                                                                                          |
|                |                 | Slices: 68.96% Luts: 52.59% Registers: 24.57% Block RAM: 51.43% DSPs: 9.55%                                                                     |
| t03_l          | t03_s           | <b>custom_data_type</b> - This is a simple example of RGB to HSV conversion to demonstrate                                                      |
|                |                 | Custom Data Type usage in hardware accelerator. Xilinx HLS compiler supports custom                                                             |
|                |                 | data type to operate within the hardware function and also it acts as a memory                                                                  |
|                |                 | interface between PL to DDR3.                                                                                                                   |
| 19.06x         | 20.36x          | Slices: 74.23% Luts: 49.58% Registers: 26.35% Block RAM: 51.43% DSPs: 10.91%                                                                    |
| t04_l          | t04_s           | data_access_random - This is a simple example of matrix multiplication (Row x Col) to                                                           |
|                |                 | demonstrate random data access pattern.                                                                                                         |
| 0.584x         | 0.59x           | Slices: 78.04% Luts: 51.03% Registers: 28.83% Block RAM: 62.86% DSPs: 13.64%                                                                    |
| t05_l          | t05_s           | dependence_inter - This is a simple example to demonstrate inter dependence                                                                     |
|                |                 | attribute. Using inter dependence attribute user can provide additional dependency                                                              |
|                |                 | details to compiler which allow compiler to perform unrolling/pipelining to get better                                                          |
| department of  |                 | performance.                                                                                                                                    |
| lau processing |                 |                                                                                                                                                 |



| 5.67x   | 6.25x   | Slices: 73.64% Luts: 47.76% Registers: 26.05% Block RAM: 55.00% DSPs: 22.27%            |
|---------|---------|-----------------------------------------------------------------------------------------|
| t06_l   | t06_s   | direct_connect - This is a simple example of matrix multiplication with matrix addition |
|         |         | (Out = (A x B) + C) to demonstrate direct connection which helps to achieve increasing  |
|         |         | in system parallelism and concurrency.                                                  |
| 12.6x   | 7.73x   | Slices: 87.95% Luts: 57.94% Registers: 33.76% Block RAM: 95.36% DSPs: 82.27%            |
| t07_l   | t07_s   | dma_sg - This example demonstrates how to use Scatter-Gather DMAs for data              |
|         |         | transfer to/from hardware accelerator.                                                  |
|         |         | Slices: 84.95% Luts: 56.38% Registers: 32.39% Block RAM: 59.29% DSPs: 9.55%             |
| t08_l   | t08_s   | dma_simple - This example demonstrates how to insert Simple DMAs for data transfer      |
|         |         | between user program and hardware accelerator.                                          |
|         |         | Slices: 78.20% Luts: 50.57% Registers: 28.17% Block RAM: 56.43% DSPs: 9.55%             |
| t09_l   | Not     | file_io_manr_sobel - Linux video processing application that reads input video from a   |
|         | Imple-  | file and writes out the output video to a file. Video processing includes Motion        |
|         | mented  | Adaptive Noise Reduction (MANR) followed by a Sobel filter for edge detection. You      |
|         | as      | can run it by supplying a 1080p YUV422 file as input with limiting number of frames to  |
|         | Stand-  | a maximum of 20 frames.                                                                 |
|         | Alone   | Slices: 89.23% Luts: 58.91% Registers: 33.60% Block RAM: 62.50% DSPs: 10.91%            |
| t10_l   | Not     | file_io_optical - Linux video processing application that reads input video from a file |
| _       | Imple-  | and writes out the output video to a file. Video processing performs LK Dense Optical   |
|         | mented  | Flow over two Full HD frames video file. You can run it by supplying a 1080p YUV422     |
|         | as      | file route85_1920x1080.yuv as input.                                                    |
|         | Stand-  | Slices: 99.71% Luts: 81.96% Registers: 49.53% Block RAM: 85.00% DSPs: 35.45%            |
|         | Alone   |                                                                                         |
| t11_l   | t11_s   | full_array_2d - This is a simple example of accessing full data from 2D array.          |
| _       | _       | Slices: 72.08% Luts: 49.26% Registers: 26.47% Block RAM: 87.50% DSPs: 12.27%            |
| t12_l   | t12_s   | hello_vadd - This is a basic hello world kind of example which demonstrates how to      |
| _       | _       | achieve vector addition using hardware function.                                        |
|         |         | Slices: 73.42% Luts: 48.75% Registers: 26.04% Block RAM: 53.21% DSPs: 9.55%             |
| t13_l   | t13_s   | Imem_2rw - This is a simple example of vector addition to demonstrate how to utilize    |
|         |         | both ports of Local Memory.                                                             |
|         |         | Slices: 74.38% Luts: 49.39% Registers: 26.42% Block RAM: 55.36% DSPs: 9.55%             |
| t14_l   | t14_s   | loop_fusion - This example will demonstrate how to fuse two loops into one to           |
|         |         | improve the performance of a C/C++ hardware function.                                   |
|         |         | Slices: 74.77% Luts: 49.95% Registers: 27.02% Block RAM: 53.21% DSPs: 15.00%            |
| t15_l   | t15_s   | loop_perfect - This nearest neighbor example is to demonstrate how to achieve better    |
|         |         | performance using perfect loop.                                                         |
|         |         | Slices: 86.75% Luts: 60.82% Registers: 32.78% Block RAM: 53.21% DSPs: 15.45%            |
| t16_l   | t16_s   | loop_pipeline - This example demonstrates how loop pipelining can be used to            |
|         |         | improve the performance of a hardware function.                                         |
|         |         | Slices: 73.42% Luts: 48.75% Registers: 26.04% Block RAM: 53.21% DSPs: 9.55%             |
| t17_l   | t17_s   | loop_reorder - This is a simple example of matrix multiplication (Row x Col) to         |
|         |         | demonstrate how to achieve better pipeline II factor by loop reordering.                |
| 7.31x   | 9.93x   | Slices: 81.13% Luts: 57.76% Registers: 30.33% Block RAM: 91.79% DSPs: 81.82%            |
| t18_l   | t18_s   | shift_register - This example demonstrates how to shift values in each clock cycle.     |
| 1.91x   | 4.09x   | Slices: 76.20% Luts: 49.99% Registers: 27.48% Block RAM: 53.21% DSPs: 24.55%            |
| t19_l   | t19_s   | sys_port - This is a simple example which demonstrates sys_port usage.                  |
|         |         | Slices: 88.41% Luts: 58.06% Registers: 34.23% Block RAM: 61.07% DSPs: 9.55%             |
| t20_l   | t20_s   | systolic_array - Matrix multiplication implemented as systolic array.                   |
| 0.07x   | 0.17x   | Slices: 80.22% Luts: 54.76% Registers: 30.10% Block RAM: 53.21% DSPs: 61.36%            |
| t21_l   | t21_s   | wide_memory_rw - Wide memory read write 64 bit wide.                                    |
| departm | nent of | Slices: 73.82% Luts: 47.24% Registers: 26.98% Block RAM: 55.36% DSPs: 9.55%             |
| 771 000 | essina  |                                                                                         |

signal processing



## Installation and use of the Base Evaluation Package - standalone examples

In case of standalone target:

- (1) In Win 7 or Win 10 (32bit or 64bit PC), unzip the basic evaluation package TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1.zip to directory of your choice. We will use: C:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\
- (2) Select one of the examples (t01\_s ... t21\_s) and copy the content of sd\_card directory to the SD card. Example. Copy BOOT.bin from C:TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\SD\_release\t01\_s\sd\_card\BOOT.bin to the root of the SD card as single file.
- (3) Connect USB cable from J7 connector to the PC. It will serve as ARM terminal and JTAG line.
- (4) Connect another USB cable to the USBUART Pmod module present in the J5 connector to the PC. It will serve as MicroBlaze terminal.
- (5) Power ON the carrier board and open putty (or similar) terminal client for both USB serial lines. Set the serial communication to: [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.
- (6) Insert SD card to the TE0706-02 carrier board.
- (7) Reset the carrier board (S2 button).
  - The standalone system will start. See *Figure 6*.
  - The ARM terminal will present output from the t01\_s example.
  - The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP. See *Figure 6*.
- (8) In PC, open the Vivado Lab tool. See Figure 7.

Open Hardware Manager

Press Auto Connect icon in Hardware window

- Open description of debug nets present in file, thus specifying the probes file

c:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\SD\_release\t01\_s\debug\_nets.ltx

- Set the ILA trigger conditions and observe process of computation in the 8xSIMD EdkDSP IP. See *Figure 8, Figure 9, Figure 10*.
- Open new perspective and observe the chip temperature. See *Figure 11*.
- (9) Close Vivado Lab tool project.
- (10)Remove SD card and reprogram it in PC to test another example.
- (11)Go to step (6).





```
_ D X
Device IDCODE ->
                        23727093
 Revision ->
                  (7z020)
 Device ->
SoM: TE0720-02-2I F SC REV:04
MAC: 00 04 A3 DB 32 FC
ARMCPU0: MB0 reset removed, ARM waiting ...
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 588718
Number of CPU cycles running application in hardware: 58440
Speed up: 10.0739
Note: Speed up is meaningful for real hardware execution only, not f
or emulation.
TEST PASSED
```

```
_ D X
COM57 - PuTTY
 MB0 : (EdkDSP 8xSIMD) Write firmware ...
 MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
 MBO : (HW FP unit ) Far-end signal ...
 MB0 : (EdkDSP 8xSIMD) FIR room response ... 1152 MFLOPs
 MBO : (HW FP unit ) Add near-end signal ...
 MB0 : (EdkDSP 8xSIMD) LMS Identification ... 738 MFLOPs MB0 : (HW FP unit ) LMS Identification ... 10 MFLOPs
 MB0 : (EdkDSP 8xSIMD) OK
 MB0 : (EdkDSP 8xSIMD) Write firmware ...
 MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
MB0 : (EdkDSP 8xSIMD) VZ2A 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VB2A 'worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) VA2B 'worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) VADD 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VADD_BZ2A 'worker1' .. OK
MB0 : (EdkDSP 8xSIMD) VADD_AZ2B 'worker1' .. OK
 MB0 : (EdkDSP 8xSIMD) VSUB 'worker1' ..... OK
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1' . . OK
MB0 : (EdkDSP 8xSIMD) VSUB BZ2A 'worker1' . . OK
MB0 : (EdkDSP 8xSIMD) VSUB AZ2B 'worker1' . . OK
MB0 : (EdkDSP 8xSIMD) VMULT 'worker1' . . OK
MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1' . OK
 MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1' . OK
 MB0 : (EdkDSP 8xSIMD) VPROD 'worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) VMAC 'worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) VMSUBAC 'worker1' .... OK
 MB0 : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
 MBO : (EdkDSP 8xSIMD) VDIV worker1' ..... OK
 MB0 : (EdkDSP 8xSIMD) Write firmware ...
 MB0 : (EdkDSP 8xSIMD) Capabilities1 = 13ffff
 MBO : (HW FP unit ) Far-end signal ...
 MB0 : (EdkDSP 8xSIMD) FIR room response ... 1152 MFLOPs
 MBO : (HW FP unit ) Add near-end signal ...
 MBO : (EdkDSP 8xSIMD) LMS Identification ...
                                                         738 MFLOPs
```

Figure 6: Release demo t01\_s. ARM and 8xSIMD EdkDSP terminal output.





Figure 7: Release demo t01\_s. Vivado Lab Tool is open.

The Vivado Lab tool is connected to the chip. You have to specify the probes file (See Figure 8).

X:\SD release\t01 s\debug nets.ltx

Names and parameters of probes are added to the ILA Waveform window. See Figure 8.

Use + to select probes used for triggering, and select the condition for the trigger for each probe and their combinations (use AND as default).

Some of debug probes can be used to trigger the capturing of data. The ILA can be triggered from the EdkDSP firmware running on the PicoBlaze6 running inside of the (8xSIMD) EdkDSP unit.





Figure 8: Release demo t01\_s. Probes file is specified. Trigger conditions are set.

In SDK, open the X:\zsys\edkdsp\a\f2.c file. See section of the LMS C code firmware. This C code includes the additional call to the pb2dfu\_set() function used for selective triggering of the ILA scope in specified point of computation of the EdkDSP accelerator.

27/55





Figure 9: Release demo t01 s. Details of the 8xSIMD EdkDSP LMS filter computation.

In Vivado Lab Edition, in the ILA configuration page, change the trigger condition to: (bce\_port\_wr ==1) AND (bce\_port\_id[0:7]==0x20) AND (bce\_port[0:7]==0x00)

In Vivado Lab Edition 2017.1, arm the hw\_ila\_1 core by pressing **Run Trigger** button in **Hardware** window. Armed hw\_ila\_1 core will wait until the recompiled EdkDSP firmware comes to the point, where PicoBlaze6 calls function **pb2dfu\_set(0x20, 0)**.

ILA core starts to capture 8K samples of all debug signals with the sampling rate 120 MHz. Data are captured and sent via jtag USB connection in Vivado Lab Edition 2017.1 for visualisation and analysis in the waveform window.

28/55

This snapshot stores the detailed trace of the initial 8192 clock cycles of the FIR filter computation. See *Figure 10*.





Figure 10: Release demo t01 s. Details of the 8xSIMD EdkDSP FIR filter computation.

In Vivado Lab. Edition, in the ILA configuration page, change the trigger condition to bce\_port[0:7]==**0x01**: to capture start of the FIR filter. See *Figure 10*. The PicoBlaze C code of the FIR example is listed in *Figure 17*. The Vivado Lab. screens presented in *Figure 9* and *Figure 10* display also the 1024 samples before the trigger event. This mode is set in the trigger mode settings window. Screens display how the PicoBlaze6 controller reset signal **bce\_r\_pb** is deactivated. Picoblaze6 reads the 8 bit parameters **op** and **n** from the MicroBlaze before the trigger evet. See complete program listing in *Figure 17* with these initial lines of the PicoBlaze6 SW:





Figure 11: Release demo t01\_s. Stanalone demo supports measurements of the chip temperature.

The standalone demos support measurement of the chip temperature in a new dashboard connected to the XADC system monitor.

## Installation and use of Base Evaluation Package - linux examples

In case of Linux target:

- (1) In Win 7 or Win 10 (32bit or 64bit PC), unzip the basic evaluation package TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1.zip to directory of your choice. We will use: C:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\
- (2) Select one of the examples (t01\_I ... t21\_I) and copy the content of sd\_card directory to the SD card. Example. Copy content of the directory from C:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\SD\_release\t01\_I\sd\_card\ to the root of the SD card
- (3) Connect Mini USB cable from J7 connector to the PC. It will serve as ARM terminal and JTAG line.
- (4) Connect Micro USB cable from to the USBUART Pmod module present in the J5 connector) to the PC. It will serve as MicroBlaze terminal.
- (5) Power ON the carrier board. And open putty (or similar) terminal client for both USB serial lines. Set the serial communication to [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.
- (6) Insert SD card to the TE0706-02 carrier board.
- (7) Reset the carrier board.
  - The linux system will start. See *Figure 12*. type user name:

root

type password:

root

- Mount SD card to the directory (See Figure 13) /mnt by typing:

mount /dev/mmcblk0p1 /mnt

- Change directory (See Figure 13) to /mnt

cd/mnt

- -Compile firmware for the PicoBlaze6 by the EdkDSP C compiler (See *Figure 13*): ./edkdsp/tools/cc\_fx.sh ./edkdsp/a (or ./edkdsp/tools/cc\_fx.sh ./edkdsp/b or ./edkdsp/tools/cc\_fx.sh ./edkdsp/c ...)
- The PicoBlaze6 C source code f0.c f1.c f2.c and f3.c from the directory ./edkdsp/a are compiled by the EdkDSP C compiler to the firmware files (See *Figure 13*):





./f0.dec ./f1.dec ./f2.dec ./f3.dec

- The ARM terminal will present output from the EdkDSP C compiler
- The MicroBlaze terminal is not active. EdkDSP is not programmed yet.
- Start the linux application (See Figure 14) by typing:

./t01\_l.elf

- The ARM terminal will present output from the t01\_l.elf example. See Figure 14.
- The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP working with new firmware programs as re-compiled by the EdkDSP C compiler from the C source code files: f0.c f1.c f2.c and f3.c from the directory ./edkdsp/a The output from the 8xSIMD EdkDSP is identical to the standalone output. See Figure 6.
- (8) In PC, open the Vivado Lab tool. See Figure 7.
  - Open Hardware Manager
  - Press Auto Connect icon in Hardware window
  - Open description of debug nets present in file, thus specifying the probes file. See Figure 8.

c:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_2if\_ila\_8k\_usb\_sw1\SD\_release\t01\_s\debug\_nets.ltx

- Set the ILA trigger conditions and observe process of computation in the 8xSIMD EdkDSP IP. See *Figure 9*, *Figure 10*.
- (9) Close Vivado Lab tool project.
- (10)Remove SD card and reprogram it in PC to test another example.

(11)Go to step (6).



```
ᄰ COM39 - PuTTY
Starting internet superserver: inetd.
INIT: Entering runlevel: 5
Configuring network interfaces... udhcpc (v1.24.1) started
Sending discover...
Sending discover...
Sending discover...
No lease, forking to background
done.
Starting Dropbear SSH server: Generating key, this may take a while.
Public key portion is:
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDQ1ADWXAXBQ1MTnJ/Au7a1kstH3RJk
SDJ/UYsdR/qnHIQ9I1K1CQggx4vvi2Cil0GIU8QmOWUOgN2u/KFDmrzRVxrF2glSw679
/198M0y7Mu5ebPF8HWLoocTKkusljmy/MDn6feQSK/1ns3Agsd3kMj/dB0oOZUjQ7CgC
WM6ZVSdkUEm00ZTJPb4kMj7ZUBJSJfXm+bNqA8ahUOej1R0gKEyoXh1PHxVMg5Np89XV
UlaDTdO1IrMVI1iI/Lj98KR6foSABtJPQUaEp3KrgcY+aFpRSVSrCtzsG+TvediSB3m8
2E1FUKT8T2t3EW/nwYaIfUFvdBqrFlXXmH39g381 root@plnx arm
Fingerprint: md5 7d:71:76:ec:d9:53:2d:f2:b6:b5:7f:7d:a1:03:8f:ed
dropbear.
Starting syslogd/klogd: done
Starting tcf-agent: OK
PetaLinux 2017.1 plnx arm /dev/ttyPS0
plnx arm login:
```

Figure 12: Release demo t01\_l. Linux start.

```
COM39 - PuTTY
2E1FUKT8T2t3EW/nwYaIfUFvdBqrFlXXmH39g381 root@plnx arm
Fingerprint: md5 7d:71:76:ec:d9:53:2d:f2:b6:b5:7f:7d:a1:03:8f:ed
dropbear.
Starting syslogd/klogd: done
Starting tcf-agent: OK
PetaLinux 2017.1 plnx arm /dev/ttyPS0
plnx arm login: root
Password:
root@plnx_arm:~# mount /dev/mmcblk0p1 /mnt
root@plnx_arm:~# cd /mnt
root@plnx arm:/mnt# ls
           _sds
BOOT.bin
                        f0.dec
                                    f2.dec
                                                image.ub
README.txt edkdsp
                                                t01 l.elf
                        f1.dec
                                    f3.dec
root@plnx arm:/mnt# ./edkdsp/tools/cc fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: fl.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm .
root@plnx arm:/mnt#
```

Figure 13: Release demo t01 I. Login, Compilation of firmware in the EdkDSP C Compiler.



```
_ D X
COM39 - PuTTY
root@plnx arm:/mnt#
root@plnx arm:/mnt# ./edkdsp/tools/cc fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: fl.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm ...
root@plnx arm:/mnt# ./t01 l.elf
/dev/mem opened.
Memory mapped at address 0xb6fd7000.
Memory mapped at address 0xae450000.
ARMCPU0: Write firmware ...
ARMCPU0: Open input file f0.dec ... OK
ARMCPU0: Open input file f1.dec ... OK
ARMCPU0: Open input file f2.dec ... OK
ARMCPU0: Open input file f3.dec ... OK
ARMCPU0: Close input file f0.dec ... OK
ARMCPU0: Close input file f1.dec ... OK
ARMCPU0: Close input file f2.dec ... OK
ARMCPU0: Close input file f3.dec ... OK
ARMCPU0: Write firmware Done.
Reset for 1 sec. ... Done.
ARMCPU0: MB0 reset removed, ARM waiting ... Done.
ARMCPU0: MB0 indicates - running ...
Number of CPU cycles running application in software: 591792
Number of CPU cycles running application in hardware: 76670
Speed up: 7.71869
TEST PASSED
root@plnx arm:/mnt#
```

Figure 14: Release demo t01\_l. Program and start 8xSIMD EdkDSP demo.



# 9. Installation and Use of Extended Evaluation Package

The extended evaluation package is offered to the ECSEL PRODUCTIVE 4.0 project partners [8] on their written request to UTIA for free. See the license conditions listed in next sections of this report.

The extended evaluation package supports:

- Compilation from C source code and debug for the MicroBlaze processor for Linux and standalone targets
- Creation and Release of SD cards with new compiled MicroBlaze SW and new compiled Picoblaze6 firmware for Linux and standalone targets.
- In-circuit Logic Analyser (ILA) JTAG based inspection/observation/debug of the 8xSIMD EdkDSP IP. ILA works with internal buffer for 8k samples and operates at 120 MHz.
- Embedded Compilation from a C/ASM source code to firmware for the reprogrammable PicoBlaze6 finite state machine (FSM) scheduling inside of the 8xSIMD EdkDSP IP core the floating point computation sequences performed in the 8xSIMD data flow unit (DFU). This embedded compilation is supported for the Linux examples.
- The standalone examples also support ILA and additionally can display the on-chip temperature via JTAG.
- The extended evaluation package requires the Xilinx SDK 2017.1 tools (download is free). SDK serves for compilation of MicroBlaze code, download of compiled MicroBlaze code via JTAG and for the debug of this code in parallel with the ILA inspection/observation/debug of the EdkDSP IP core.
- The In-circuit Logic Analyser (ILA) JTAG based inspection/observation/debug can be performed from the free Xilinx Lab Vivado 2017.1 tool installed on Win7 (64bit) or Win 10 (64bit) PC.
- The linux target examples support 1G Bit Ethernet, SSH telnet and file system management tools like the Total Commander for an Ethernet based access from PC to the SD card files and editing of these files from user PC.

The extended evaluation package provides 21 precompiled designs for the linux target and 19 precompiled designs for the standalone target as described in *Table 9*.





## Installation and use of extended evaluation package - standalone examples

In case of standalone target:

(1) In Win 7 or Win 10 (64bit PC), unzip the basic evaluation package TE0720\_MB\_EdkDSP\_1x8\_zsys\_SDK\_2if\_ila\_8k\_usb\_sw1\_INSTALL.zip to directory of your choice. We will use:

C:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_SDK\_2if\_ila\_8k\_usb\_sw1\_INSTALL

In Xilinx SDK 2017.1 create a new workspace in the directory

X:\zsys



Figure 15: Create new SDK 2017.1 workspace.

Import (with copy) all SDK projects from

C:\TE0720\_MB\_EdkDSP\_1x8\_zsys\_SDK\_2if\_ila\_8k\_usb\_sw1\\_INSTALL\zsys

to the SDK workspace X:\zsys





Figure 16: Import the extended evaluation package projects into the SDK Workspace.

(2) Select one of the examples (t01\_s ... t21\_s) and copy the content of the sd\_card directory to the SD card. Example. Copy BOOT.bin from

X:\SD\_debug\t01\_s\sd\_card\BOOT.bin

to the root of the SD card as a single file.

- (3) Connect Mini USB cable from J7 connector to the PC. It will serve as ARM terminal and JTAG line.
- (4) Connect Micro USB cable to the USBUART Pmod module present in the J5 connector to the PC. It will serve as MicroBlaze terminal.
- (5) Power ON the carrier board. And open putty (or similar) terminal client for both USB serial lines. Set the serial communication to [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.





Figure 17: SDK compiles MicroBlaze SW projects for the standalone debug target.

- (6) Insert SD card to the TE0706-02 carrier board.
- (7) Reset the carrier board.
  - The standalone system will start.
  - The ARM terminal will present output from the t01\_s example. The Arm application is waiting for the MicroBlaze.

38/55





Figure 18: Debug demo t01 I. Execution of the ./t01 s.elf example from the SD card.

- The Xilinx SDK project edkdsp\_fp12\_1x8\_s already includes PicoBlaze6 firmware header files fill\_f0\_program\_store.h, fill\_f1\_program\_store.h, fill\_f2\_program\_store.h, fill\_f3\_program\_store.h These files can be recreated from C source code by the EdkDSP C compiler in the linux target session as described in the next section.
- In the Xilinx SDK workspace, compile the edkdsp\_fp12\_1x8\_s project with the existing (or new, recompiled) PicoBlaze6 firmware headers fill\_f0\_program\_store.h, fill\_f1\_program\_store.h, fill\_f2\_program\_store.h, fill\_f3\_program\_store.h.
- In the Xilinx SDK workspace, select Debug of MicroBaze project **edkdsp\_fp12\_1x8\_s**. In the Debug Configurations, select "No reset", unselect "Run ps7 init", unselect "Run ps7 post config" click "Apply".



Figure 19: Debug demo t01\_s. Open project edkdsp\_fp12\_1x8\_s for debug.





Figure 20: Debug demo t01\_s. Start the free-run from the debugger.

- In the SDK debugger, step through the MicroBlaze source code, inspect content of variables, set the breakpoints, step through the code and finally select the free run of the MicroBlaze code.
- At this stage, the ARM terminal will present the output from the ARM t01\_s.elf example. See Figure 21.

```
ARMCPU0: MB0 reset removed, ARM waiting ...

ARMCPU0: MB0 indicates - running ...

Number of CPU cycles running application in software: 586256

Number of CPU cycles running application in hardware: 59428

Speed up: 9.86498

Note: Speed up is meaningful for real hardware execution only, not for emulation.

TEST PASSED
```

Figure 21: Debug demo t01\_s. Arm started EdkDSP and runs SDSoC akcelerátor demo.



The MicroBlaze terminal will present output from the debugged MicroBlaze and the 8xSIMD EdkDSP IP core. See *Figure 22*.

```
COM57 - PuTTY
 MB0 : Start of MB ... Done.
 MB0 : Read firmware ... Done.
 Initialize TmrCtr for axi timer 0...
 MB0 : (EdkDSP 8xSIMD) Write firmware ...
      (EdkDSP 8xSIMD) Capabilities1 = 13ffff
       (HW FP unit
                   ) Far-end signal ...
       (EdkDSP 8xSIMD) FIR room response ... 1138 MFLOPs
                   ) Add near-end signal ...
       (HW FP unit
                                              733 MFLOPs
 MB0 :
       (EdkDSP 8xSIMD) LMS Identification ...
                   ) LMS Identification ...
 MB0 : (HW FP unit
                                               3 MFLOPs
 MB0 : (EdkDSP 8xSIMD) OK
 MB0 : (EdkDSP 8xSIMD) Write firmware ...
       (EdkDSP 8xSIMD) Capabilities1 = 13ffff
       (EdkDSP 8xSIMD) VZ2A 'worker1'
 MB0 :
       (EdkDSP 8xSIMD) VB2A 'worker1'
 MB0 : (EdkDSP 8xSIMD) VZ2B 'worker1'
 MB0 : (EdkDSP 8xSIMD) VA2B 'worker1'
      (EdkDSP 8xSIMD) VADD 'worker1'
       (EdkDSP 8xSIMD) VADD BZ2A 'worker1'
       (EdkDSP 8xSIMD) VADD AZ2B 'worker1'
       (EdkDSP 8xSIMD) VSUB
                            'worker1'
 MB0
       (EdkDSP 8xSIMD) VSUB BZ2A 'worker1' .. OK
 MB0
      (EdkDSP 8xSIMD) VSUB AZ2B 'worker1' .. OK
 MB0 :
 MB0 : (EdkDSP 8xSIMD) VMULT 'worker1'
 MB0 : (EdkDSP 8xSIMD) VMULT BZ2A 'worker1'
 MB0 : (EdkDSP 8xSIMD) VMULT AZ2B 'worker1'
       (EdkDSP 8xSIMD) VPROD 'worker1'
 MB0:
       (EdkDSP 8xSIMD) VMAC 'worker1'
                      VMSUBAC 'worker1'
       (EdkDSP 8xSIMD)
 MBO : (EdkDSP 8xSIMD) VPROD S8 'worker1' ... OK
 MB0 : (EdkDSP 8xSIMD) VDIV 'worker1'
```

Figure 22: Debug demo t01\_s. MicroBlaze project output (Compiled for Debug).

(8) In PC, open the Vivado Lab tool. See Figure 7.

- Open Hardware Manager.
- Press Auto Connect icon in Hardware window to connect to the board via JTAG line.
- Open description of debug nets present in file, thus specifying the probes file.

X:\SD\_debug\t01\_s\debug\_nets.ltx

- Set the ILA trigger conditions and observe process of computation in the 8xSIMD EdkDSP IP. See *Figure 8, Figure 9, Figure 10*.



- Open new perspective and observe the chip temperature. See Figure 11. Close Vivado Lab tool project.
- (9) In SDK debugger, stop MicroBlaze processor and close the debug session.
- (10) Remove SD card and reprogram it in the PC to test another example.
- (11) Go to step (6).

### Installation and use of Extended Evaluation Package – linux examples

In case of Linux target:

(1) In Win 7 or Win 10 (64bit PC), unzip the basic evaluation package TE0720\_MB\_EdkDSP\_1x8\_zsys\_SDK\_2if\_ila\_8k\_usb\_sw1\_INSTALL.zip to directory of your choice. We will use:

C:\TE0720 MB EdkDSP 1x8 zsys SDK 2if ila 8k usb sw1 INSTALL\

Open new Xilinx SDK 2017.1 workspace in the directory

X:\zsys

Import (with copy) all SDK projects from

C:\TE71\TE0720 MB EdkDSP 1x8 zsys SDK 2if ila 8k usb sw1 INSTALL\zsys\

to the SDK.

- (2) Select one of the examples (t01 | ... t21 |) and copy the content of sd card directory to the SD card. Example. Copy content of the directory from X:\SD debug\t01 I\sd card\ to the root of the SD card
- (3) Connect USB cable from J7 connector to the PC. It will serve as ARM terminal and JTAG line.
- (4) Connect USB cable to the USBUART Pmod module present in the J5 connector to the PC. It will serve as MicroBlaze terminal.
- (5) Power ON the carrier board. And open putty (or similar) terminal client for both USB serial lines. Set the serial communication to [speed 115200, data bits 8, stop bits 1, parity none and flow control None] in both cases.

42/55

- (6) Insert SD card to the TE0706-02 carrier board.
- (7) Reset the carrier board.
  - The linux system will start. See Figure 23.
  - Type the linux user name and password:

root

root



```
COM39 - PuTTY
Password:
root@plnx arm:~# mount /dev/mmcblk0p1 /mnt
root@plnx arm:~# cd /mnt
root@plnx arm:/mnt# ls
BOOT.BIN
           README.txt sds
                                   edkdsp
                                                image.ub
                                                            t01 l.el
root@plnx arm:/mnt# ./edkdsp/tools/cc fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPPSM: f0.psm ...
EDKDSPCC : f1.c ...
EDKDSPPSM: fl.psm ...
EDKDSPCC : f2.c ...
EDKDSPPSM: f2.psm ...
EDKDSPCC : f3.c ...
EDKDSPPSM: f3.psm ...
root@plnx arm:/mnt# ./edkdsp/tools/cs fx.sh ./edkdsp/a
EDKDSPCC : f0.c ...
EDKDSPASM: f0.psm ...
Generated M function file in the M file ././fill f0 program store.m
Generated C header file in the H file ./fill f0 program store.h
EDKDSPCC : f1.c ...
EDKDSPASM: fl.psm ...
Generated M function file in the M file ././fill f1 program store.m
Generated C header file in the H file ./fill f1 program store.h
EDKDSPCC : f2.c ...
EDKDSPASM: f2.psm ...
Generated M function file in the M file ././fill f2 program store.m
Generated C header file in the H file ./fill f2 program store.h
EDKDSPCC : f3.c ...
EDKDSPASM: f3.psm ...
Generated M function file in the M file ././fill f3 program store.m
Generated C header file in the H file ./fill f3 program store.h
root@plnx_arm:/mnt# ./t01_l.elf
/dev/mem opened.
Memory mapped at address 0xb6f2f000.
Memory mapped at address 0xae3a8000.
ARMCPU0: Write firmware ...
ARMCPU0: Open input file f0.dec ... OK
ARMCPU0: Open input file f1.dec ... OK
ARMCPU0: Open input file f2.dec ... OK
ARMCPU0: Open input file f3.dec ... OK
ARMCPU0: Close input file f0.dec ... OK
ARMCPU0: Close input file f1.dec ... OK
ARMCPU0: Close input file f2.dec ... OK
ARMCPU0: Close input file f3.dec ... OK
ARMCPU0: Write firmware Done.
ARMCPU0: place 0x28100000 at start of MB0 vectors
Reset for 1 sec. ... Done.
ARMCPU0: MB0 reset removed, ARM waiting ...
```

Figure 23: Compiled EdkDSP firmware. Started debug demo - linux target t01\_l.



UTIA

http://zs.utia.cas.cz

Ústav teorie informace a automatizace AV ČR, v.v.i.

- Mount SD card to the directory (See Figure 23) /mnt by typing:

```
mount /dev/mmcblk0p1 /mnt
```

- Change directory to /mnt

cd /mnt

- -Compile firmware for the PicoBlaze6 by the EdkDSP C compiler (see *Figure 23*): ./edkdsp/tools/cc\_fx.sh ./edkdsp/a (or ./edkdsp/tools/cc\_fx.sh ./edkdsp/b or ./edkdsp/tools/cc\_fx.sh ./edkdsp/c etc.)
- The PicoBlaze6 C source code files from the directory ./edkdsp/a ./edkdsp/a/f0.c ./edkdsp/a/f1.c ./edkdsp/a/f2.c ./edkdsp/a/f3.c (or from the directory ./edkdsp/b or from the directory ./edkdsp/c etc.)

are compiled by the EdkDSP C compiler to the firmware files:

```
./f0.dec ./f1.dec ./f2.dec ./f3.dec
```

 Optionally, you can also compile the PicoBlaze6 firmware into header files for the standalone target. Compile firmware for the PicoBlaze6 by the EdkDSP C compiler (See Figure 23):

```
./edkdsp/tools/cs_fx.sh ./edkdsp/a (or ./edkdsp/tools/cs_fx.sh ./edkdsp/b or ./edkdsp/tools/cs_fx.sh ./edkdsp/c etc.)
```

Generated header files with PicoBlaze6 firmware for the standalone target EdkDSP IP target are created and stored in the SD card root directory:

```
./fill_f0_program_store.h ./fill_f1_program_store.h ./fill_f2_program_store.h ./fill_f3_program_store.h
```

These headers serve for the standalone MicroBlaze projects. Headers are compiled directly into the debugged MicroBlaze standalone application as described above.

- Execute the ARM linux application See Figure 23.
- The ARM terminal will present output from the EdkDSP C compiler
- The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP
- Start the linux application by typing

./t01\_l.elf

- The ARM terminal will present output from the t01\_l.elf example. The Arm application is waiting for the MicroBlaze in this stage.

44/55

In the Xilinx SDK environment on the PC, select debug project (See Figure 24):
 X:\zsys\edkdsp fp12 1x8 I





Figure 24: Select MicroBlaze project edkdsp\_fp12\_1x8\_I for debug.

- In the SDK debugger, step through the MicroBlaze source code, inspect the content of variables, set breakpoints etc. See *Figure 25*.
- In the SDK debugger, select free run of the MicroBlaze code. See *Figure 25*.
- The MicroBlaze terminal will present output from the 8xSIMD EdkDSP IP working with new firmware programs as re-compiled by the EdkDSP C compiler from the C source code files: f0.c, f1.c, f2.c and f3.c from the directory ./edkdsp/a Output is identical to *Figure 22*.
- The ARM terminal will continue to present output from the t01\_l.elf example. See Figure 26.
- In ARM terminal, type:Is Ir

to see listing of files compiled by the EdkDSP C compiler. See *Figure 26*.

The compiled header files fill\_f0\_program\_store.h, fill\_f1\_program\_store.h, fill\_f2\_program\_store.h, and fill\_f3\_program\_store.h. can be used as new source code for the standalone MicroBlaze project  $X:\zys\ed{dsp_fp12\_1x8_s}$ .





Figure 25: Select free run of MicroBlaze project edkdsp\_fp12\_1x8\_l.

- (8) In PC, open the Vivado Lab tool. See Figure 7.
- Open Hardware Manager.
- Press Auto Connect icon in Hardware window to connect to the board via JTAG line
- Open description of debug nets present in file, thus specifying the probes file
- Open description of debug nets present in file
- X:\SD\_debug\t01\_l\debug\_nets.ltx
- Set the ILA trigger conditions and observe process of computation in the 8xSIMD EdkDSP IP. See *Figure 8*, *Figure 9*, *Figure 10*.
- (9) Close Vivado Lab tool project.
- (10) In SDK debugger, stop MicroBlaze processor and close the debug session.



- (11) Exit from linux by typing on the ARM terminal: exit
- (12) Remove SD card and reprogram it in the PC to test another example.
- (13) Go to step (6).



Figure 26: Output from ARM MicroBlaze fort t01 l. Compiled EdkDSP firmware.



## Updating of the release SD card images for new standalone-release-target

Modified Picoblaze6 C source code can be compiled to firmware headers in the embedded EdkDSP C compiler (linux target). Resulting headers can be included in the SDK MicroBlaze standalone release target project. See *Figure 17*. The standalone-release-target SD card image can be updated by re-compilation of the (possibly modified) C source code for the MicroBlaze in the SDK project with included updated PicoBlaze firmware header files. See *Figure 27*.



Figure 27: Create BOOT.bin for the t01\_s demo.

48/55

Copy the content of directory:

```
\label{lem:c:teo720_MB_edkDSP_1x8_zsys_SDK_2if_ila_8k_usb_sw1_INSTALL\SD\_release\t01\_s\to
```

X:\SD\_release\t01\_s\uboot\

The new BOOT.bin image can be created from these files:

```
X:\SD_release\t01_s\uboot\t01_s.bif
X:\SD_release\t01_s\uboot\zynq_fsbl.elf
X:\SD_release\t01_s\uboot\zynq_wrapper.bit.elf
X:\SD_release\t01_s\uboot\t01_s.elf
X:\SD_release\t01_s\uboot\edkdsp_fp12_1x8_s.elf
```



Replace the old

X:\SD release\t01 |\uboot\edkdsp fp12 1x8 s.elf

with the new file recompiled in the SDK (with new PicoBlaze6 firmware headers) from the SDK project directory: X:\zsys\edkdsp\_fp12\_1x8\_s\Release\edkdsp\_fp12\_1x8\_s.elf

Use the BOOT.bin generation utility (In the SDK workspace: Xilinx Tools -> Create Boot Image) and create the new BOOT.bin file (See *Figure 27*):

X:\SD\_release\t01\_s\uboot\BOOT.bin

Copy this new BOOT.bin file it to:

X:\SD release\t01 s\sd card\BOOT.bin

The content of the standalone-release-target SD card is updated with new MicroBlaze and PicoBlaze6 firmware.

## Updating of the release SD card images for new linux-release-target

The linux-release-target SD card image can be updated by re-compilation of the (possibly modified) C source code for the MicroBlaze in the SDK project. See *Figure 28*.



Figure 28: Create BOOT.bin for the t01\_I demo.



```
Copy the content of directory:

C:\TE0720_MB_EdkDSP_1x8_zsys_SDK_2if_ila_8k_usb_sw1_INSTALL\SD_release\t01_l\uboot\

to

X:\SD_release\t01_l\uboot\
```

The new BOOT.bin image can be created from these files:

```
X:\SD_release\t01_l\uboot\t01_l.bif
X:\SD_release\t01_l\uboot\zynq_fsbl.elf
X:\SD_release\t01_l\uboot\zynq_wrapper.bit.elf
X:\SD_release\t01_l\uboot\u-boot.elf
X:\SD_release\t01_l\uboot\edkdsp_fp12_1x8_l.elf
```

Replace

```
X:\SD release\t01 |\uboot\edkdsp fp12 1x8 ||.elf
```

with a new file recompiled in the SDK project directory:

```
X:\zsys\edkdsp_fp12_1x8_l\Release\edkdsp_fp12_1x8_l.elf
```

Use the BOOT.bin generation utility of the SDK and create the new BOOT.bin file:

```
X:\SD_release\t01_l\uboot\BOOT.bin
```

```
Copy this new BOOT.bin file to: X:\SD_release\t01_I\sd_card\BOOT.bin
```

Update the linux-release-target SD card with the actual PicoBlaze6 C programs and actual firmware files compiled from C source code by the embedded EdkDSP C compiler:

```
X:\SD_release\t01_l\sd_card\edkdsp\a\f0.c
X:\SD_release\t01_l\sd_card\edkdsp\a\f1.c
X:\SD_release\t01_l\sd_card\edkdsp\a\f2.c
X:\SD_release\t01_l\sd_card\edkdsp\a\f3.c
X:\SD_release\t01_l\sd_card\f0.dec
X:\SD_release\t01_l\sd_card\f1.dec
X:\SD_release\t01_l\sd_card\f2.dec
X:\SD_release\t01_l\sd_card\f3.dec
```

The content of the linux-release-target SD card is updated with new MicroBlaze and PicoBlaze6 firmware.



# 10. References

[1] TE0720-2IF; Part: XC7Z020-2CLG484I; 1 GByte DDR; Grade: Industrial;

http://shop.trenz-electronic.de/en/TE0720-03-2IF-Xilinx-Zynq-module-XC7Z020-2CLG484I-ind.-temp.-range-1-Gbyte

https://www.trenz-electronic.de/fileadmin/docs/Trenz\_Electronic/TE0720/REV02/Documents/TE0720%20User%20Manual-v45-20150323 1407.pdf

https://www.trenz-electronic.de/fileadmin/docs/Trenz Electronic/TE0720/REV03/Documents/TRM-TE0720-03.pdf

[2] Heatsink for TE0720, spring-loaded embedded;

https://shop.trenz-electronic.de/en/26922-Heatsink-for-TE0720-spring-loaded-embedded?c=38

[3] TE0706 - Carrierboard for Trenz Electronic Modules with 4 x 5 cm Form factor <a href="https://shop.trenz-electronic.de/en/TE0706-02-TE0706-Carrierboard-for-Trenz-Electronic-Modules-with-4-x-5-cm-Form-factor?c=261">https://shop.trenz-electronic.de/en/TE0706-02-TE0706-Carrierboard-for-Trenz-Electronic-Modules-with-4-x-5-cm-Form-factor?c=261</a>

https://www.trenz-electronic.de/fileadmin/docs/Trenz Electronic/carrier boards/TE0706/REV02/documents/SCH-TE0706-02.PDF

[4] Pmod USBUART: Serial converter & interface.

https://shop.trenz-electronic.de/en/24242-Pmod-USBUART-USB-to-UART-Interface?c=80

[5] XMOD FTDI JTAG Adapter - Xilinx compatible

https://shop.trenz-electronic.de/en/TE0790-02-XMOD-FTDI-JTAG-Adapter-Xilinx-compatible

[6] Vivado HLx Web Install Client - 2017.1.

https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/vivado-design-tools/2015-4.html

[7] SDSoC - 2017.1 Full Product Installations.

https://www.xilinx.com/support/download/index.html/content/xilinx/en/downloadNav/sdx-development-environments/sdsoc/2015-4.html

[8] PRODUCTIVE 4.0 Project www page in UTIA with pointers to evaluation packages for download http://sp.utia.cz/index.php?ids=projects/productive40



# 11. Base Evaluation Package

The base evaluation package can be downloaded from UTIA www pages [8] free of charge.

#### **Deliverables:**

The base evaluation package [8] includes evaluation bitstreams with single (8xSIMD) EdkDSP IP working in parallel with selected HW-accelerated SDSoC algorithms on the Trenz Electronic TE0720-2IF module [1] located on the Trenz Electronic TE0706-02 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5].

The evaluation package [8] includes bitstreams compiled with the evaluation version of the (8xSIMD) EdkDSP IP core. Bitstreams contain these IPs:

bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c Evaluation version of the AXI-lite interface bce\_fp12\_1x8\_40 Evaluation version of the floating point data path

The base evaluation version of the (8xSIMS) EdkDSP IP is compiled into bitstreams with a HW limit on number of vector operations. The termination of the nonexclusive, non-transferable evaluation license of this evaluation IP core is reported in advance by the demonstrator on the PMOD USBUART terminal. The evaluation designs run again after reset (TE0706-02 button S2).

The base evaluation package [8] includes these binary applications:

edkdsppp.elf EdkDSP C pre-processor binary for ARM PetaLinux running on the evaluation board. EdkDSP C compiler binary for ARM PetaLinux running on the evaluation board. edkdsppsm.elf EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.

These binary applications have no time restriction. The user of the evaluation package has nonexclusive, non-transferable license from UTIA to use these utilities for compilation of the firmware for the Xilinx PicoBlaze6 processor inside of the 8xSIMD EdkDSP IP in precompiled designs. The source code of these compilers is owned by UTIA and it is not provided in the evaluation package.

The base evaluation package [8] includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz Electronic TE0720-2IF module [1] on Trenz Electronic TE0706-02 carrier board [3].

HW boards are not part of deliverables. HW can be ordered separately from [1] – [5].

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.



# 12. Extended Evaluation Package for PRODUCTIVE 4.0 partners

This extended evaluation package includes MicroBlaze and PicoBlaze6 C code and precompiled bitstreams of HW projects for the Trenz Electronic TE0720-2IF module [1] located on the Trenz Electronic TE0706-02 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5] with the evaluation version of the (8xSIMD) EdkDSP IP. Partners of the ECSEL PRODUCTIVE 4.0 project [8] can order this extended package from UTIA AV CR, v.v.i., by email request for quotation to kadlec@utia.cas.cz.

UTIA AV CR, v.v.i., will provide to the PRODUCTIVE 4.0 project partner quotation by email. After confirmation of the quotation by the customer, UTIA AV CR, v.v.i., will send to the customer this invoice:

The extended evaluation package with MicroBlaze and PicoBlaze6 C code and precompiled bitstream of HW projects for the Trenz Electronic TE0720-2IF module [1] located on the Trenz Electronic TE0706-02 carrier [3] with PMOD USBUART adapter [4] and XMOD FTDI JTAG Adapter [5] with the evaluation version of the 8xSIMD EdkDSP IP for the partners in the ECSEL PRODUCTIVE 4.0 project (Without VAT)

After receiving confirmation from the PRODUCTIVE 4.0 project partner about the zero-invoice received, UTIA AV CR, v.v.i. will send within 5 working days by standard mail printed version of this application note together with DVD with the Deliverables described in this section.

#### **Deliverables:**

The extended evaluation package for PRODUCTIVE 4.0 partners [8] includes MicroBlaze and PicoBlaze6 C code and precompiled bitstreams of HW projects. MicroBlaze and PicoBlaze6 SW projects can be modified and recompiled by the PRODUCTIVE 4.0 project partner.

The extended evaluation version of the UTIA 8xSIMD EdkDSP accelerator IP is provided in precompiled bitstreams of HW projects with these IPs:

bce\_fp12\_1x8\_0\_axiw\_v1\_10\_c Evaluation version of the AXI-lite interface bce\_fp12\_1x8\_40 Evaluation version of the floating point data path

The extended evaluation version of the 8xSIMS EdkDSP IP is compiled into bitstream with an HW limit on number of vector operations. The termination of the nonexclusive, non-transferable evaluation license of this evaluation IP core is reported in advance by the demonstrator on the PMOD USBUART terminal. The evaluation designs run again after the reset.

The extended evaluation package [8] includes these binary applications:

edkdsppp.elf
 EdkDSP C pre-processor binary for ARM PetaLinux running on the evaluation board.
 edkdspcc.elf
 EdkDSP C compiler binary for ARM PetaLinux running on the evaluation board.
 edkdsppsm.elf
 EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.
 edkdspasm.elf
 EdkDSP ASM compiler binary for ARM PetaLinux running on the evaluation board.

These binary applications have no time restriction. The user of the evaluation package has nonexclusive, non-transferable license from UTIA to use these utilities for compilation of the firmware for the Xilinx PicoBlaze6 processor inside of the UTIA EdkDSP accelerators in precompiled designs. The source code of these compilers is owned by UTIA and it is not provided in the evaluation package.





The extended evaluation package for PRODUCTIVE 4.0 partners includes demonstration firmware in C source code for the Xilinx PicoBlaze6 processor for the family of UTIA EdkDSP accelerators for the Trenz Electronic TE0720-2IF module [1] on Trenz Electronic TE0706-02 carrier board [3].

The extended evaluation package for PRODUCTIVE 4.0 partners includes SDK SW projects with C source code for MicroBlaze. The extended evaluation package [8] includes static library for MicroBlaze processor:

libwal.a SDK 2017.1 UTIA static library with EdkDSP API for MicroBlaze

This library has no time restriction. Source code of this library is not provided in this evaluation package.

HW boards are not part of deliverables. HW can be ordered separately from references [1] – [5].

Partners of the ECSEL PRODUCTIVE 4.0 project [8] can order the hardware [1] - [5] directly from the company Trenz Electronic or order the complete evaluation system from UTIA AV CR, v.v.i.

In case of an order from UTIA AV CR, v.v.i., an email request for a quotation to <a href="kadlec@utia.cas.cz">kadlec@utia.cas.cz</a> is required. UTIA AV CR, v.v.i., will provide to the PRODUCTIVE 4.0 project partner quotation by email. After confirmation of the quotation by the PRODUCTIVE 4.0 project partner, UTIA AV CR, v.v.i., will buy from company Trenz Electronic boards [1]-[5] with cables and power supply. UTIA will assemble and test the complete evaluation system and send them to the PRODUCTIVE 4.0 project partner for price identical to the price offered by the company Trenz Electronic plus the transport cost and the VAT.

Any and all legal disputes that may arise from or in connection with the use, intended use of or license for the software provided hereunder shall be exclusively resolved under the regional jurisdiction relevant for UTIA AV CR, v. v. i. and shall be governed by the law of the Czech Republic. See also the Disclaimer section.





# **Disclaimer**

This disclaimer is not a license and does not grant any rights to the materials distributed herewith. Except as otherwise provided in a valid license issued to you by UTIA AV CR v.v.i., and to the maximum extent permitted by applicable law:

- (1) THIS APPLICATION NOTE AND RELATED MATERIALS LISTED IN THIS PACKAGE CONTENT ARE MADE AVAILABLE "AS IS" AND WITH ALL FAULTS, AND UTIA AV CR V.V.I. HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY PARTICULAR PURPOSE; and
- (2) UTIA AV CR v.v.i. shall not be liable (whether in contract or tort, including negligence, or under any other theory of liability) for any loss or damage of any kind or nature related to, arising under or in connection with these materials, including for any direct, or any indirect, special, incidental, or consequential loss or damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a result of any action brought by a third party) even if such damage or loss was reasonably foreseeable or UTIA AV CR v.v.i. had been advised of the possibility of the same.

## **Critical Applications:**

UTIA AV CR v.v.i. products are not designed or intended to be fail-safe, or for use in any application requiring fail-safe performance, such as life-support or safety devices or systems, Class III medical devices, nuclear facilities, applications related to the deployment of airbags, or any other applications that could lead to death, personal injury, or severe property or environmental damage (individually and collectively, "Critical Applications"). Customer assumes the sole risk and liability of any use of UTIA AV CR v.v.i. products in Critical Applications, subject only to applicable laws and regulations governing limitations on product liability.

