Programmable Logic


FPGA co-processors promise a brighter future for DSP

17 October 2012

Digital signal processing (DSP) applications have always placed great demands on available technology. Ever-increasing system requirements drive an extraordinary degree of design optimisation to process data in real-time, and resulting systems usually employ a number of different processing technologies, each best suited to a particular function in the processing chain.

One of the most common forms of DSP application involves the gathering of information from sensors at very high data rates, processing this information into a form that is meaningful to humans and presenting this to the human at a much lower data rate. The traditional architecture for such a system is shown in Figure 1.

Figure 1. Traditional DSP architecture.

A field-programmable gate array (FPGA) is used for front-end ‘bit-rate’ or ‘chip-rate’ processing where data rates are highest and often presented as a continuous stream. FPGAs are programmed using a hardware description language (HDL) such as VHDL or Verilog, which allows state machines to be built up from combinations of logic blocks.

Programming in HDL is the hardware equivalent of programming in assembly language – the results are highly efficient (especially for low-level computations like FFTs and filters) but development requires great effort and HDL programmers are relatively scarce.

A DSP is used for ‘symbol rate’ processing in the middle of the chain. DSPs are typically programmed in C or assembly language, but exploiting their efficiency requires careful programming, often with code written specifically for the device architecture.
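To give a feel for what such symbol-rate code looks like, the sketch below shows a plain-C FIR filter kernel. The function name and buffer layout are illustrative only; a production version for a specific DSP would typically use circular buffers, intrinsics or hand-written assembly to keep the multiply/accumulate units fully loaded.

#include <stddef.h>

/* Illustrative symbol-rate kernel: a direct-form FIR filter.
 * 'coeffs' holds ntaps filter coefficients and 'delay' holds the
 * last ntaps input samples (delay[0] = newest). Names are
 * illustrative only, not taken from any particular DSP SDK. */
static float fir_sample(const float *coeffs, float *delay,
                        size_t ntaps, float input)
{
    /* Shift the delay line (a real DSP port would use a
     * circular buffer to avoid this copy). */
    for (size_t i = ntaps - 1; i > 0; i--)
        delay[i] = delay[i - 1];
    delay[0] = input;

    /* Multiply/accumulate across the taps - the operation a
     * DSP's MAC units are built to pipeline. */
    float acc = 0.0f;
    for (size_t i = 0; i < ntaps; i++)
        acc += coeffs[i] * delay[i];

    return acc;
}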

A microprocessor is used for back-end ‘data rate’ processing and control. These allow the use of operating systems and high-level languages to provide a human-machine interface and access to data storage.

So, with reference to Figure 1, FPGAs offer the best efficiency (and therefore the smallest size and lowest weight), the lowest power consumption and the best determinism (lowest latencies).

However, because development time and cost are usually significant drivers in system design choices, much functionality is pushed in the direction of DSPs and CPUs, resulting in higher power consumption, more physical space, more weight, longer latencies and more execution jitter (indeterminism).

In deciding how to map functions to hardware, it should be noted that FPGAs work best when:

1. The computation is independent of the data, ie, there is little conditional processing and the algorithm can be expressed as a state machine.

2. The algorithm is mature. Developing and modifying algorithms using logic gates requires a considerable amount of effort compared to developing and modifying the same using a programming language.

Conversely, this implies that if an algorithm is data-dependent or subject to change, it is likely better implemented somewhere other than on an FPGA.
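The distinction can be made concrete with a short C sketch (illustrative only): the first routine below performs identical work on every sample regardless of its value, so it unrolls naturally into a fixed pipeline of logic blocks, while the second branches on the data itself and therefore maps poorly onto a state machine.

/* Data-independent: identical operations per sample, no branching
 * on the data. A loop like this unrolls naturally into an FPGA
 * pipeline (one multiply-add per stage). */
void scale_and_offset(const float *in, float *out, int n,
                      float gain, float offset)
{
    for (int i = 0; i < n; i++)
        out[i] = in[i] * gain + offset;
}

/* Data-dependent: the work done per sample depends on the sample
 * value, so the control flow cannot be fixed in advance. Code like
 * this is usually better left on a processor core. */
int count_peaks(const float *in, int n, float threshold)
{
    int peaks = 0;
    for (int i = 1; i < n - 1; i++) {
        if (in[i] > threshold && in[i] > in[i - 1] && in[i] > in[i + 1])
            peaks++;
    }
    return peaks;
}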

However, implementing a solution using three vastly diverse technologies requires development teams comprising three different specialised skill sets.

The resulting system also suffers from additional latencies introduced by interfaces between the three subsystems.

As the DSP component offers a compromise between the strengths of the FPGA and the microprocessor, it makes sense to investigate the possibility of removing it and implementing its functions elsewhere.

Where else can DSP functions be implemented?

A technology that has emerged from development in high-performance graphics processing is the graphics processing unit (GPU) and its less specialised cousin, the general purpose GPU (GPGPU). These devices can excel in processing large data sets where there is little interdependency in the data, but this restricts their use to a few specific applications.

They have very high power consumption and poor determinism because data must be routed in and out through a microprocessor (see Figure 2). Efficiency is generally very poor, in practice achieving less than 25% of theoretical performance for most applications.

Figure 2. The GPU approach.

A more common solution is to put microprocessor cores inside the FPGA as indicated in Figure 3. So-called ‘soft’ cores (programmed onto standard FPGA logic cells) are not an optimal use of available gates and generally enforce non-standard C-language implementations.

Figure 3. The embedded core approach.

‘Hard’ cores (such as ARM devices implemented within the FPGA chip) might be useful for implementing protocol stacks but are not optimised for DSP functions as they have smaller feature sets and lower clock speeds than dedicated DSP chips.

High-level synthesis (HLS) tools, C-to-RTL tools and C-to-gates tools allow algorithms to be implemented in C, but the C dialect involved is non-standard and the implementation usually involves a runtime module, effectively a ‘soft’ CPU core, being instantiated in the FPGA.

None of these approaches is ideal.

The FPGA co-processor

A new approach is introduced here, as shown in Figure 4. In essence, an external multicore co-processor is coupled to the FPGA.

Figure 4. The co-processor approach.

The co-processor allows the use of the C language to implement data-dependent algorithms while presenting itself to the HDL tools as if it were an embedded core. So, in some ways it resembles the embedded core approach discussed earlier, but an important difference is that it is optimised to provide highly efficient parallel processing capability.

Incorporating floating-point capability (DSP-friendly multiply/accumulate instructions) in the co-processor simplifies algorithm development. If the internal architecture of the co-processor allows the use of a standard C implementation while still achieving optimal performance, then the same developers can write code for both the microprocessor and the FPGA co-processor.
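The practical payoff is that ordinary ISO C expresses the multiply/accumulate pattern directly. The routine below is an illustrative sketch, not vendor code: in principle the same source can be compiled for either the host microprocessor or the co-processor cores without intrinsics or language extensions.

/* A single-precision dot product written in plain ISO C - no
 * intrinsics, pragmas or vendor extensions. The co-processor's
 * multiply/accumulate hardware is reached through ordinary C,
 * so the same source serves both targets. (Illustrative only.) */
float dot_product(const float *a, const float *b, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc += a[i] * b[i];   /* one MAC per iteration */
    return acc;
}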

It is important for the co-processor to allow fine-grained interaction with the FPGA by integrating directly to FPGA fabric, as many applications require data to flow back and forth between DSP and FPGA functions.

A multicore architecture is well suited to the inherently parallel nature of DSP algorithms and also provides for scalability. To make a multicore topology effective, a high-throughput mesh network is required so that data is available to all cores and synchronisation between cores is catered for.

Shared memory mapping should be transparent to the cores to simplify programming and memory should be cache-less to ensure determinism.
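A rough sketch of how such a topology is programmed is given below. The constants and function names are invented for illustration and are not taken from any vendor SDK: the data block is split into per-core slices of a flat, cache-less shared buffer, each core processes only its own slice, and there is no cache-coherency protocol to reason about.

#define NCORES      16      /* e.g. a 4 x 4 mesh of cores */
#define BLOCK_LEN   4096    /* samples per processing block */

/* Shared, cache-less buffers visible to every core at the same
 * address (a simplifying assumption for this sketch; real devices
 * expose per-core memory through a global address map). */
static float g_in[BLOCK_LEN];
static float g_out[BLOCK_LEN];

/* Work function executed by one core. 'core_id' would normally be
 * read from a hardware register; here it is passed in for clarity. */
void process_slice(int core_id)
{
    int slice = BLOCK_LEN / NCORES;
    int start = core_id * slice;

    /* Each core touches only its own slice, so no locking is needed
     * for the computation itself; a barrier (not shown) would be used
     * to synchronise the cores before the results are consumed. */
    for (int i = start; i < start + slice; i++)
        g_out[i] = g_in[i] * 0.5f;   /* placeholder per-sample work */
}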

Another benefit of this approach is that immature algorithms ultimately destined for the FPGA can be implemented and refined quickly in code running on the co-processor. When the algorithm is stable it can then be ‘ported’ to an HDL implementation with little development risk.

An example in practice

The FPGA co-processor concept described above has already been implemented, with very encouraging results. An example is BittWare’s Anemone AN104 chip, based on the Epiphany architecture from Adapteva, which provides a 4 x 4 array of cores in a single chip.

Each core offers similar real-world performance to a TigerSHARC DSP. The AN104’s internal architecture is shown in Figure 5.

Figure 5. The Anemone AN104 internal architecture.

Board-level implementations are available from BittWare, which currently provides four Anemone chips on a single FMC module for a total of 64 cores to augment its existing Altera-based FPGA products.

Programming the Anemone does not require the use of any C subset, language extensions, SIMD or other ‘funny stuff.’ Standard GNU development tools are supported, including an optimised C compiler, simulator, GDB debugger with support for multicore, and an Eclipse multicore IDE.

Benchmarks for 8x8 matrix multiplication, 16-tap FIR filter, biquad IIR filter and 256-point dot product algorithms show between two and ten times the code efficiency of SIMD architectures. It should also be noted that implementing each benchmark required only one or two days’ effort.
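For context, a biquad (second-order) IIR section of the kind benchmarked amounts to only a few lines of standard C. The sketch below uses the common Direct Form I recurrence and is illustrative rather than a reproduction of the benchmark code.

/* One biquad (second-order IIR) section, Direct Form I:
 *   y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
 * State and coefficient layout are illustrative only. */
typedef struct {
    float b0, b1, b2;   /* feed-forward coefficients */
    float a1, a2;       /* feedback coefficients (a0 assumed to be 1) */
    float x1, x2;       /* previous inputs  */
    float y1, y2;       /* previous outputs */
} biquad_t;

float biquad_step(biquad_t *s, float x)
{
    float y = s->b0 * x + s->b1 * s->x1 + s->b2 * s->x2
            - s->a1 * s->y1 - s->a2 * s->y2;

    /* advance the delay line */
    s->x2 = s->x1;  s->x1 = x;
    s->y2 = s->y1;  s->y1 = y;
    return y;
}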

The Anemone AN104 shows enormous promise for this architecture, delivering a performance-to-power ratio of 24 to 32 GFLOPS/Watt, compared with around 20 GFLOPS/Watt for the best GPGPU solutions currently available (both figures for single-precision floating point).

The next generation of the chip will implement 64 cores for an impressive 96 GFLOPS of double precision floating-point processing while achieving efficiencies exceeding 50 GFLOPS/Watt.

From what has already been demonstrated, the co-processor reduces system development costs and directly addresses the FPGA’s two weak points, the need for algorithmic maturity and for data independence, by enabling out-of-the-box execution of applications written in regular ANSI C.

In other words: high efficiency, low power consumption and low latency with less development effort.

For more information contact Alastair Knight, Redlinx, +27 (0)21 526 4160, [email protected], www.redlinx.co.za




