NEON Basics


From a software perspective, NEON technology is based on single instruction, multiple data (SIMD) operations in ARMv7 processors that implement the Advanced SIMD architecture extension.

From a hardware perspective, NEON is a separate hardware unit on Cortex-A series processors, together with a vector floating point (VFP) unit.

If an algorithm can be designed to exploit dedicated hardware, performance can be maximized.


SIMD Introduction
SIMD is a computational technique for processing many data values in parallel using a single instruction, with the data for the operands packed into special, wide registers.

Therefore, one instruction can do the work of many separate instructions on single instruction, single data (SISD) architectures.


Many software programs operate on large data sets. Each element in a data set can be less than 32 bits. 8-bit data is common in video, graphics, and image processing, and 16-bit data in audio codecs.

In these contexts, the operations to be performed are simple, repeated many times, and have little need for control code. SIMD can offer considerable performance improvements for this type of data processing. It is particularly useful for digital signal processing and multimedia algorithms, such as:
• Block-based data processing, such as FFTs, matrix multiplication, etc.
• Audio, video, and image processing codecs, such as MPEG-4, H.264, On2 VP6/7/8, AVS, etc.
• 2D graphics based on rectangular blocks of pixels
• 3D graphics
• Color-space conversion
• Physics simulations
• Error correction, such as Reed-Solomon codecs, CRCs, elliptic curve cryptography, etc.


SIMD enables a single instruction to treat a register value as multiple data elements and to perform multiple, identical operations on those elements.

32-bit Scalar Add
Achieving four separate 8-bit additions with scalar operations requires four add instructions, plus additional instructions to prevent one result from overflowing into the adjacent byte.
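The scalar approach described above can be sketched in portable C. This is only an illustration, assuming four unsigned 8-bit values packed into one 32-bit word; the function names `scalar_add_u8x4` and `naive_add_u32` are hypothetical and do not come from the referenced application note.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed into 32-bit words, one lane at a
   time. Each lane needs its own add plus shifts and masks so that a carry
   never reaches the neighbouring byte. */
uint32_t scalar_add_u8x4(uint32_t a, uint32_t b)
{
    uint32_t result = 0;
    for (int lane = 0; lane < 4; ++lane) {
        int shift = 8 * lane;
        uint32_t x = (a >> shift) & 0xFFu;   /* extract lane from a */
        uint32_t y = (b >> shift) & 0xFFu;   /* extract lane from b */
        uint32_t sum = (x + y) & 0xFFu;      /* the mask drops the carry */
        result |= sum << shift;              /* put the lane back */
    }
    return result;
}

/* For contrast: a plain 32-bit add treats the word as one number, so a
   carry out of one byte leaks into the byte above it. */
uint32_t naive_add_u32(uint32_t a, uint32_t b)
{
    return a + b;
}
```

With a = 0x01FF01FF and b = 0x01010101, the lane-wise version gives 0x02000200 (each 0xFF + 0x01 lane wraps to 0x00), while the plain add gives 0x03000300 because the carries spill into the adjacent bytes.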

SIMD Parallel Add

SIMD needs only one instruction to do this, and you do not need to manage the overflow. Moreover, with a dedicated ALU, SIMD instructions generally require fewer cycles than ARM instructions for the same functions.
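As a rough sketch of the same idea in C, the function below adds sixteen 8-bit lanes at once using the `vld1q_u8`, `vaddq_u8`, and `vst1q_u8` intrinsics from `arm_neon.h`. The function name `add_u8x16` is hypothetical, and the scalar fallback exists only so that the example also builds on targets without NEON; it assumes the compiler defines `__ARM_NEON` when NEON is enabled, as current GCC and Clang do.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Add 16 unsigned bytes from a and b into out. Each lane wraps modulo 256
   and no carry crosses from one lane into the next. */
void add_u8x16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__ARM_NEON)
    uint8x16_t va = vld1q_u8(a);          /* load 16 bytes into a Q register */
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(out, vaddq_u8(va, vb));      /* one vector add covers all lanes */
#else
    /* Portable fallback with the same lane-wise behaviour. */
    for (int i = 0; i < 16; ++i)
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}
```

On a NEON build the three intrinsics typically compile to a pair of loads, a single vector add, and a store, which is worth confirming in the disassembly for your own toolchain.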


Register
NEON architecture allows for 64-bit or 128-bit parallelism. Its register bank can be viewed as either sixteen 128-bit registers (Q0-Q15) or as thirty-two 64-bit registers (D0-D31). Each of the Q0-Q15 registers maps to a pair of D registers.
Figure: NEON Register Bank (sixteen 128-bit registers Q0-Q15, which alias thirty-two 64-bit registers D0-D31)
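The two views of the register bank can be modelled with a C union. This is a software model for illustration only, not how the hardware is programmed; the type `neon_regbank_model` and the helpers `q_low_d` and `q_high_d` are hypothetical names. It relies on the architectural mapping in which Qn occupies the pair D(2n) and D(2n+1), with D(2n) holding the low 64 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Model of the NEON register bank: the same 256 bytes of storage, viewed
   either as 32 D registers or as 16 Q registers of two 64-bit halves. */
typedef union {
    uint64_t d[32];      /* D0-D31 */
    uint64_t q[16][2];   /* Q0-Q15: q[n][0] is the low half, q[n][1] the high half */
} neon_regbank_model;

/* Index of the D register that holds the low 64 bits of Qn. */
unsigned q_low_d(unsigned n)
{
    assert(n < 16);
    return 2 * n;
}

/* Index of the D register that holds the high 64 bits of Qn. */
unsigned q_high_d(unsigned n)
{
    assert(n < 16);
    return 2 * n + 1;
}
```

For example, writing to `d[6]` and `d[7]` in this model changes the low and high halves of `q[3]`, which mirrors how Q3 overlaps D6 and D7.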


NEON vs. VFP

The key difference between NEON and VFP is that NEON works only on vectors, whereas VFP does not, even though it has “vector” in its name. In fact, for the Cortex-A9 processor, calling it a floating-point unit (FPU) is more appropriate.

For floating-point operation, VFP can support both single-precision (Float-32) and double-precision (Double-64), whereas NEON only supports single-precision (Float-32).

VFP can also support more complex functions, such as square roots, division, and others, but NEON cannot.

NEON and VFP share the thirty-two 64-bit registers in hardware. This means that VFP is present in VFPv3-D32 form, which has 32 double-precision floating-point registers. This makes support for context switching simpler. Code that saves and restores VFP contexts also saves and restores NEON contexts.


Data Types
• Unsigned integer: U8, U16, U32, U64
• Signed integer: S8, S16, S32, S64
• Integer of unspecified type: I8, I16, I32, I64
• Floating-point number: F16, F32
• Polynomial over {0,1}: P8

Data type specifiers in NEON instructions consist of a letter that indicates the type of data and a number that indicates the width.
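To make the letter-plus-width convention concrete, the sketch below works out how many elements of a given type fit into a 64-bit D register or a 128-bit Q register. The helper `neon_lane_count` and the table `k_specifiers` are hypothetical names used only for this example; the table contains exactly the specifiers listed above.

```c
#include <stdlib.h>
#include <string.h>

/* The data type specifiers listed in the text above. */
static const char *const k_specifiers[] = {
    "U8", "U16", "U32", "U64",
    "S8", "S16", "S32", "S64",
    "I8", "I16", "I32", "I64",
    "F16", "F32", "P8"
};

/* Return how many elements of type dt fit in a register of reg_bits bits
   (64 for a D register, 128 for a Q register). Returns 0 if the specifier
   is not in the table or the register size is not 64 or 128. */
unsigned neon_lane_count(const char *dt, unsigned reg_bits)
{
    size_t n = sizeof k_specifiers / sizeof k_specifiers[0];

    if (reg_bits != 64 && reg_bits != 128)
        return 0;

    for (size_t i = 0; i < n; ++i) {
        if (strcmp(dt, k_specifiers[i]) == 0) {
            /* The digits after the type letter give the element width. */
            unsigned width = (unsigned)strtoul(dt + 1, NULL, 10);
            return reg_bits / width;
        }
    }
    return 0;
}
```

For instance, `neon_lane_count("U8", 128)` is 16 and `neon_lane_count("F32", 64)` is 2, which matches the usual way of describing a Q register as sixteen bytes or a D register as two single-precision floats.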


NEON Instruction
All mnemonics for NEON instructions (as with VFP) begin with the letter V.
You can use this indicator to find NEON instructions in disassembly code when checking the efficiency of a compiler.

V{<mod>}<op>{<shape>}{<cond>}{.<dt>}{<dest>}, src1, src2
<mod> Modifier (Q for saturating, H for halving, D for doubling, R for rounding)
<op> Operation (for example, ADD, SUB, MUL)
<shape> Shape (L for long, W for wide, N for narrow) [Ref 4]
<cond> Condition, used with IT instruction
<.dt> Data type
<dest> Destination
<src1> Source operand 1
<src2> Source operand 2
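To illustrate how the modifier and shape fields change the meaning of an operation, here is a scalar model of a single 16-bit lane for three forms of add: VADD.I16 (plain), VQADD.S16 (Q modifier, saturating), and VADDL.S16 (L shape, long). The function names are hypothetical, and the code only models the arithmetic of one lane in plain C; it is not NEON code.

```c
#include <stdint.h>

/* One lane of VADD.I16: the result wraps modulo 2^16.
   The final conversion back to int16_t is implementation-defined in C,
   but it wraps as two's complement on mainstream compilers. */
int16_t lane_vadd_i16(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* One lane of VQADD.S16: the Q modifier saturates the result to the
   range of the element type instead of wrapping. */
int16_t lane_vqadd_s16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + (int32_t)b;
    if (sum > INT16_MAX) return INT16_MAX;
    if (sum < INT16_MIN) return INT16_MIN;
    return (int16_t)sum;
}

/* One lane of VADDL.S16: the L shape makes the result twice as wide as
   the operands, so the full sum is kept. */
int32_t lane_vaddl_s16(int16_t a, int16_t b)
{
    return (int32_t)a + (int32_t)b;
}
```

Adding 32767 and 1 gives -32768 with the plain form, 32767 with the saturating form, and 32768 with the long form.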

• NEON instructions should be used to the maximum extent when designing algorithms. Emulating functions with instruction sequences can lower performance significantly.
• Compilers might not be able to generate optimal code, so you might have to read the disassembly and determine whether or not the generated code is optimal.
• For time-critical applications, you might have to write NEON assembler code to realize the best performance.


Ref: Xilinx XAPP1206

