Which Registers Are Floating Point Registers
Larry D. Pyeatt, William Ughetta, in ARM 64-Bit Assembly Language, 2020
9.5 Data movement instructions
With the addition of all of the FP registers, there are many more possibilities for how data can be moved. There are many more registers, and FP registers may be 32 or 64 bits wide. This results in several combinations for moving data among all of the registers. The FP instruction set includes instructions for moving data between two FP registers, between FP and integer registers, and between the various system registers.
9.5.1 Moving between data registers
The most basic move instruction involving FP registers simply moves data between two floating point registers, or moves data between an FP register and an integer register. The instruction is:
- fmov: Move Between Data Registers.
9.5.1.1 Syntax
- The two registers specified must be the same size.
9.5.1.2 Operations
Name | Effect | Description |
---|---|---|
fmov | Fd ← Fn | Move Fn to Fd |
9.5.1.3 Examples
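No examples survive in this excerpt, so here is a minimal C sketch of what a register-to-register fmov does when moving between FP and integer registers: the bits are copied unchanged. On AArch64, a compiler will typically lower each memcpy below to a single fmov between a W (integer) and an S (FP) register; the values and the program itself are purely illustrative.

#include <stdint.h>
#include <string.h>
#include <stdio.h>

int main(void)
{
    float f = 1.5f;
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);  /* FP -> integer register, e.g. fmov w0, s0 */
    printf("bits of 1.5f = 0x%08x\n", bits);

    bits = 0x40490FDBu;              /* IEEE-754 single-precision encoding of ~3.14159 */
    memcpy(&f, &bits, sizeof f);     /* integer -> FP register, e.g. fmov s0, w0 */
    printf("value = %f\n", f);
    return 0;
}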
9.5.2 Floating point move immediate
The FP/NEON instruction set provides an instruction for moving an immediate value into a register, but there are some restrictions on what the immediate value can be. The instruction is:
- fmov: Floating Point Move Immediate.
9.5.2.1 Syntax
- The floating point constant, fpimm, may be specified as a decimal number such as 1.0.
- The floating point value must be expressible as ±(n ÷ 16) × 2^r, where n and r are integers such that 16 ≤ n ≤ 31 and −3 ≤ r ≤ 4.
- The floating point number will be stored as a normalized binary floating point encoding with 1 sign bit, 4 bits of fraction and a 3-bit exponent (see Chapter 8, Section 8.7).
- Note that this encoding does not include the value 0.0; however, that value can be loaded by moving it from the integer zero register (e.g. fmov s0, wzr).
9.5.2.2 Operations
Name | Effect | Description |
---|---|---|
fmov | Fd ← fpimm | Move Immediate Data to Fd |
9.5.2.3 Examples
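The examples themselves are not reproduced in this excerpt. As a stand-in, the following C sketch tests whether a value satisfies the immediate constraint described above, i.e. whether it equals ±(n ÷ 16) × 2^r with 16 ≤ n ≤ 31 and −3 ≤ r ≤ 4; those ranges are the usual AArch64 definition and are stated here as an assumption, since the excerpt's own formula was garbled.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>

/* Returns true if value can be encoded as an fmov immediate:
   +/- (n / 16) * 2^r with 16 <= n <= 31 and -3 <= r <= 4 (assumed ranges). */
static bool is_fmov_immediate(double value)
{
    double v = fabs(value);
    for (int r = -3; r <= 4; r++)
        for (int n = 16; n <= 31; n++)
            if (v == (n / 16.0) * ldexp(1.0, r))
                return true;
    return false;
}

int main(void)
{
    printf("1.0   %d\n", is_fmov_immediate(1.0));   /* 1: 16/16 * 2^0   */
    printf("0.125 %d\n", is_fmov_immediate(0.125)); /* 1: 16/16 * 2^-3  */
    printf("31.0  %d\n", is_fmov_immediate(31.0));  /* 1: 31/16 * 2^4   */
    printf("0.0   %d\n", is_fmov_immediate(0.0));   /* 0: not encodable */
    printf("0.2   %d\n", is_fmov_immediate(0.2));   /* 0: not encodable */
    return 0;
}

Note that 0.0 fails the test, which is exactly why the separate move from the integer zero register mentioned above is needed.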
URL: https://www.sciencedirect.com/science/article/pii/B978012819221400016X
Embedded Software in Real-Time Signal Processing Systems: Design Technologies
GERT GOOSSENS, ... MEMBER, IEEE, in Readings in Hardware/Software Co-Design, 2002
2 Data Routing
The above mentioned extension of graph coloring toward heterogeneous register structures has been applied to general-purpose processors, which typically have a few register classes (e.g., floating-point registers, fixed-point registers, and address registers). DSP and ASIP architectures often have a strongly heterogeneous register structure with many special-purpose registers.
In this context, more specialized register allocation techniques have been developed, often referred to as data routing techniques. To transfer data between functional units via intermediate registers, specific routes may have to be followed. The selection of the most appropriate route is nontrivial. In some cases indirect routes may have to be followed, requiring the insertion of extra register-transfer operations. Therefore an efficient mechanism for phase coupling between register allocation and scheduling becomes essential [73].
As an illustration, Fig. 12 shows a number of alternative solutions for the multiplication operand of the symmetrical FIR filter application, implemented on the ADSP-21xx processor (see Fig. 8).
Fig. 12. Three alternative register allocations for the multiplication operand in the symmetrical FIR filter. The route followed is indicated in bold: (a) storage in AR, (b) storage in AR followed by MX, and (c) spilling to data memory DM. The last two alternatives require the insertion of extra register transfers.
Several techniques have been presented for data routing in compilers for embedded processors. A first approach is to determine the required data routes during the execution of the scheduling algorithm. This approach was first applied in the Bulldog compiler for VLIW machines [18], and subsequently adapted in compilers for embedded processors like the RL compiler [48] and CBC [74]. In order to prevent a combinatorial explosion of the problem, these methods only incorporate local, greedy search techniques to determine data routes. The approach typically lacks the power to identify good candidate values for spilling to memory.
A global data routing technique has been proposed in the Chess compiler [75]. This method supports many different schemes to route values between functional units. It starts from an unordered description, but may introduce a partial ordering of operations to reduce the number of overlapping live ranges. The algorithm is based on branch-and-bound searches to insert new data moves, to introduce partial orderings, and to select candidate values for spilling. Phase coupling with scheduling is supported, by the use of probabilistic scheduling estimators during the register allocation process.
URL: https://www.sciencedirect.com/science/article/pii/B9781558607026500399
Architecture
Sarah L. Harris, David Harris, in Digital Design and Computer Architecture, 2022
6.6.4 Floating-Point Instructions
The RISC-V architecture defines optional floating-point extensions called RVF, RVD, and RVQ for operating on single-, double-, and quad-precision floating-point numbers, respectively. RVF/D/Q define 32 floating-point registers, f0 to f31, with a width of 32, 64, or 128 bits, respectively. When a processor implements multiple floating-point extensions, it uses the lower part of the floating-point register for lower-precision instructions. f0 to f31 are separate from the program (also called integer) registers, x0 to x31. As with program registers, floating-point registers are reserved for certain purposes by convention, as given in Table 6.7.
Table 6.7. RISC-V floating-point register set
Name | Register Number | Use |
---|---|---|
ft0–7 | f0–7 | Temporary variables |
fs0–1 | f8–9 | Saved variables |
fa0–1 | f10–11 | Function arguments/Return values |
fa2–7 | f12–17 | Function arguments |
fs2–11 | f18–27 | Saved variables |
ft8–11 | f28–31 | Temporary variables |
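To make the register conventions in Table 6.7 concrete, consider a small C function (hypothetical, not from the text): compiled for RVF, its float argument arrives in fa0, its result is returned in fa0, and the compiler may hold the intermediate sum in a caller-saved ft register.

float add_bonus(float score)
{
    /* score is passed in fa0; the returned value is also placed in fa0 */
    float bonus = score + 10.0f;
    return bonus;
}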
Table B.3 in Appendix B lists all of the floating-point instructions. Computation and comparison instructions use the same mnemonics for all precisions, with .s, .d, or .q appended at the end to indicate precision. For example, fadd.s, fadd.d, and fadd.q perform single-, double-, and quad-precision addition, respectively. Other floating-point instructions include fsub, fmul, fdiv, fsqrt, fmadd (multiply-add), and fmin. Memory accesses use separate instructions for each precision. Loads are flw, fld, and flq, and stores are fsw, fsd, and fsq.
Floating-point instructions use R-, I-, and S-type formats, as well as a new format, the R4-type instruction format (see Figure B.1 in Appendix B). This format is needed for multiply-add instructions, which use four register operands. Code Example 6.31 modifies Code Example 6.21 to operate on an array of single-precision floating-point scores. The changes are in bold.
Code Example 6.31 Using a for Loop to Access an Array of Floats
High-Level Code
int i;
float scores[200];
for (i = 0; i < 200; i = i + 1)
scores[i] = scores[i] + 10;
RISC-V Assembly Code
# s0 = scores base address, s1 = i
addi s1, zero, 0 # i = 0
addi t2, zero, 200 # t2 = 200
addi t3, zero, 10 # t3 = 10
fcvt.s.w ft0, t3 # ft0 = 10.0
for:
bge s1, t2, done # if i >= 200 then done
slli t3, s1, 2 # t3 = i * 4
add t3, t3, s0 # address of scores[i]
flw ft1, 0(t3) # ft1 = scores[i]
fadd.s ft1, ft1, ft0 # ft1 = scores[i] + 10
fsw ft1, 0(t3) # scores[i] = ft1
addi s1, s1, 1 # i = i + 1
j for # repeat
done:
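As a small illustration of why the R4-type format mentioned above needs four register operands, the C99 fmaf function computes a fused multiply-add of three sources into one destination; a compiler targeting RVF can lower it to a single fmadd.s. This snippet is illustrative and not part of the original text.

#include <math.h>

float fma_single(float a, float b, float c)
{
    /* one fused operation: three source registers, one destination register */
    return fmaf(a, b, c);
}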
URL: https://www.sciencedirect.com/science/article/pii/B9780128200643000064
Operating Systems Overview
Peter Barry, Patrick Crowley, in Modern Embedded Computing, 2012
Task Context
Each task or thread has a context store; the context store keeps all the task-specific data for the task. The kernel scheduler will save and restore the task state on a context switch. The task's context is stored in a Task Control Block in VxWorks; the equivalent in Linux is the struct task_struct.
The Task Control Block in VxWorks contains the following elements, which are saved and restored on each context switch:
- The task program/instruction counter.
- Virtual memory context for tasks within a process, if enabled.
- CPU registers for the task.
- Non-core CPU registers, such as SSE and floating-point registers, which are saved/restored based on use of the registers by a thread. It is prudent for an RTOS to minimize the data it must save and restore for each context switch to minimize the context switch times.
- Task program stack storage.
- I/O assignments for standard input/output and error. As in Linux, a task's or process's output is directed to the standard console for input and output, but the file handles can be redirected to a file.
- A delay timer, to postpone the task's availability to run.
- A time slice timer (more on that later in the scheduling section).
- Kernel structures.
- Signal handlers (for C library signals such as divide by zero).
- Task environment variables.
- Errno, the C library error number set by some C library functions such as strtod().
- Debugging and performance monitoring values.
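A minimal C sketch of a context store holding the items listed above is shown below. The structure and field names are illustrative only; they are not the actual VxWorks Task Control Block or the Linux struct task_struct layout.

struct task_context {
    void          *pc;              /* task program/instruction counter      */
    unsigned long  regs[16];        /* core CPU registers                    */
    unsigned long  fp_state[64];    /* FP/SSE state, saved only if used      */
    void          *stack_base;      /* task program stack storage            */
    int            fd_in, fd_out, fd_err; /* I/O assignments (redirectable)  */
    unsigned       delay_ticks;     /* delay timer                           */
    unsigned       slice_ticks;     /* time-slice timer                      */
    void          *sig_handlers;    /* C library signal handlers             */
    char         **envp;            /* task environment variables            */
    int            errno_value;     /* per-task errno                        */
};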
URL: https://www.sciencedirect.com/science/article/pii/B9780123914903000072
Architecture
David Money Harris, Sarah L. Harris, in Digital Design and Computer Architecture (Second Edition), 2013
6.7.4 Floating-Point Instructions
The MIPS architecture defines an optional floating-point coprocessor, known as coprocessor 1. In early MIPS implementations, the floating-point coprocessor was a separate chip that users could purchase if they needed fast floating-point math. In most recent MIPS implementations, the floating-point coprocessor is built in alongside the main processor.
MIPS defines thirty-two 32-bit floating-point registers, $f0–$f31. These are separate from the ordinary registers used so far. MIPS supports both single- and double-precision IEEE floating-point arithmetic. Double-precision (64-bit) numbers are stored in pairs of 32-bit registers, so only the 16 even-numbered registers ($f0, $f2, $f4, … , $f30) are used to specify double-precision operations. By convention, certain registers are reserved for certain purposes, as given in Table 6.8.
Table 6.8. MIPS floating-point register set
Name | Number | Use |
---|---|---|
$fv0–$fv1 | 0, 2 | function return value |
$ft0–$ft3 | 4, 6, 8, 10 | temporary variables |
$fa0–$fa1 | 12, 14 | function arguments |
$ft4–$ft5 | 16, 18 | temporary variables |
$fs0–$fs5 | 20, 22, 24, 26, 28, 30 | saved variables |
Floating-point instructions all have an opcode of 17 (10001₂). They require both a funct field and a cop (coprocessor) field to indicate the type of instruction. Hence, MIPS defines the F-type instruction format for floating-point instructions, shown in Figure 6.35. Floating-point instructions come in both single- and double-precision flavors. cop = 16 (10000₂) for single-precision instructions or 17 (10001₂) for double-precision instructions. Like R-type instructions, F-type instructions have two source operands, fs and ft, and one destination, fd.
Figure 6.35. F-type machine instruction format
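A small C sketch of how an F-type word can be assembled from its fields is shown below. The field order assumed here (op, cop, ft, fs, fd, funct, from most to least significant bits) is the standard MIPS coprocessor-1 register format and is an assumption for illustration, since Figure 6.35 is not reproduced in this excerpt.

#include <stdint.h>

/* Pack an F-type instruction word: op(6) | cop(5) | ft(5) | fs(5) | fd(5) | funct(6) */
static uint32_t encode_ftype(uint32_t cop, uint32_t ft, uint32_t fs,
                             uint32_t fd, uint32_t funct)
{
    const uint32_t op = 17;   /* all floating-point instructions use opcode 17 */
    return (op << 26) | (cop << 21) | (ft << 16) | (fs << 11) | (fd << 6) | funct;
}

/* Example: add.s $f2, $f4, $f6 -> encode_ftype(16, 6, 4, 2, 0),
   where cop = 16 selects single precision and funct = 0 selects add. */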
Instruction precision is indicated by .s and .d in the mnemonic. Floating-point arithmetic instructions include addition (add.s, add.d), subtraction (sub.s, sub.d), multiplication (mul.s, mul.d), and division (div.s, div.d) as well as negation (neg.s, neg.d) and absolute value (abs.s, abs.d).
Floating-point branches have two parts. First, a compare instruction is used to set or clear the floating-point condition flag (fpcond). Then, a conditional branch checks the value of the flag. The compare instructions include equality (c.seq.s/c.seq.d), less than (c.lt.s/c.lt.d), and less than or equal to (c.le.s/c.le.d). The conditional branch instructions are bc1f and bc1t that branch if fpcond is FALSE or TRUE, respectively. Inequality, greater than or equal to, and greater than comparisons are performed with seq, lt, and le, followed by bc1f.
Floating-point registers are loaded and stored from memory using lwc1 and swc1. These instructions move 32 bits, so two are necessary to handle a double-precision number.
URL: https://www.sciencedirect.com/science/article/pii/B9780123944245000069
Device architectures
David Kaeli, ... Dong Ping Zhang, in Heterogeneous Computing with OpenCL 2.0, 2015
Server CPUs
Intel's Itanium architecture and its more successful successors (the latest being the Itanium 9500) represent an interesting attempt to make a mainstream server processor based on VLIW techniques [6]. The Itanium architecture includes a large number of registers (128 integer and 128 floating point registers). It uses a VLIW approach known as EPIC, in which instructions are stored in 128-bit, three-instruction bundles. The CPU fetches four instruction bundles per cycle from its L1 cache and can hence execute up to 12 instructions per clock cycle. The processor is designed to be efficiently combined into multicore and multisocket servers.
The goal of EPIC is to move the problem of exploiting parallelism from runtime to compile time. It does this by feeding back information from execution traces into the compiler. It is the task of the compiler to package instructions into the VLIW/EPIC packets, and as a result, performance on the architecture is highly dependent on compiler capability. To assist with this, numerous execution masks, dependence flags between bundles, prefetch instructions, speculative loads, and rotating register files are built into the architecture. To improve the throughput of the processor, the latest Itanium microarchitectures have included SMT, with the Itanium 9500 supporting independent front-end and back-end pipeline execution.
The SPARC T-series family (Figure 2.9), originally from Sun and under continuing development at Oracle, takes a throughput computing multithreaded approach to server workloads [7]. Workloads on many servers, particularly transactional and Web workloads, are often heavily multithreaded, with a large number of lightweight integer threads using the memory system. The UltraSPARC Tx and later SPARC Tx CPUs are designed to efficiently execute a large number of threads to maximize overall work throughput with minimal power consumption. Each of the cores is designed to be simple and efficient, with no out-of-order execution logic, until the SPARC T4. Within a core, the focus on thread-level parallelism is immediately apparent, as it can interleave operations from eight threads with only a dual issue pipeline. This design shows a clear preference for latency hiding and simplicity of logic compared with the mainstream x86 designs. The simpler design of the SPARC cores allows up to 16 cores per processor in the SPARC T5.
Figure 2.9. The Niagara 2 CPU from Sun/Oracle. The design intends to make a high level of threading efficient. Note its relative similarity to the GPU design seen in Figure 2.8. Given enough threads, we can cover all memory access time with useful compute, without extracting instruction-level parallelism (ILP) through complicated hardware techniques.
To support many active threads, the SPARC architecture requires multiple sets of registers, but as a trade-off requires less speculative register storage than a superscalar design. In addition, coprocessors allow acceleration of cryptographic operations, and an on-chip Ethernet controller improves network throughput.
As mentioned previously, the latest generations, the SPARC T4 and T5, back off slightly from the earlier multithreading design. Each CPU core supports out-of-order execution and can switch to a single-thread mode where a single thread can use all of the resources that previously had to be dedicated to multiple threads. In this sense, these SPARC architectures are becoming closer to other modern SMT designs such as those from Intel.
Server chips, in general, try to maximize parallelism at the cost of some single-threaded performance. As opposed to desktop chips, more area is devoted to supporting quick transitions between thread contexts. When wide-issue logic is present, as in the Itanium processors, it relies on help from the compiler to recognize instruction-level parallelism.
URL: https://www.sciencedirect.com/science/article/pii/B9780128014141000028
Multicore and data-level optimization
Jason D. Bakos, in Embedded Systems, 2016
2.10.1 ARM11 VFP short vector instructions
The ARMv6 VFP instruction set offers SIMD instructions through a feature called short vector instructions, in which the programmer can specify a vector width and stride field through the floating-point status and control register (FPSCR). Setting the FPSCR will cause all the thread's subsequently issued floating-point instructions to perform the number of operations and access the registers using a stride as defined in the FPSCR. Note that VFP short vector instructions are not supported by ARMv7 processors. Attempting to change the vector width or stride on a NEON-equipped processor will trigger an invalid instruction exception.
The 32 floating-point VFP registers are arranged in four banks of eight registers each (four registers each if using double precision). Each bank can be used as a short vector when performing short vector instructions. The first bank, registers s0-s7 (or d0-d3), will be used as scalars in a short vector instruction when specified as the second input operand. For example, when the vector width is 8, the fadds s16,s8,s0 instruction will add each element of the vector held in registers s8-s15 with the scalar held in s0 and store the result vector in registers s16-s23.
The fmrx and fmxr instructions allow the programmer to read and write the FPSCR register. The latency of the fmrx instruction is two cycles and the latency of the fmxr instruction is four cycles. The vector width is stored in FPSCR bits 18:16 and is encoded such that values 0 through 7 specify lengths 1-8.
When writing to the FPSCR register you must be careful to change only the bits you intend to change and leave the others alone. To do this, you must first read the existing value using the fmrx instruction, change bits 18:16, and then write the value back using the fmxr instruction.
Be sure to change the length back to its default value of 1 after the kernel since the compiler would not do this automatically, and any compiler-generated floating-point code can potentially be adversely affected by the change to the FPSCR.
You can use the following function to change the length field in the FPSCR:
void set_fpscr_reg (unsigned char len) {
unsigned int fpscr;
asm("fmrx %[val], fpscr\n\t" : [val]"=r"(fpscr));
len = len - 1;
fpscr = fpscr & ~(0x7<<16);
fpscr = fpscr | ((len&0x7)<<16);
asm("fmxr fpscr, %[val]\n\t" : : [val]"r"(fpscr));
}
To maximize the benefit of the short vector instructions, target the maximum vector size of 8 by unrolling the outer loop by 8. In the original assembly implementation, each fmacs instruction is followed by a dependent fmacs instruction two instructions later. To fully cover the eight-cycle latency of all the fmacs instructions, use each fmacs instruction to perform its operations for 8 loop iterations.
In other words, unroll the outer loop to calculate eight polynomial values on each iteration and use short vector instructions of length 8 for each instruction. Since the fmacs instruction adds the value in its Fd register, the code requires the ability to load copies of each coefficient into each of the four Fd registers. To make this easier, re-write your coefficient array so each coefficient is replicated eight times:
float coeff[64] = {1.2,1.2,1.2,1.2,1.2,1.2,1.2,1.2,
1.4,1.4,1.4,1.4,1.4,1.4,1.4,1.4,…
2.6,2.6,2.6,2.6,2.6,2.6,2.6,2.6};
Change the short vector length to 8 and unroll the outer loop by 8, so change the iteration step in the outer loop to 8:
set_fpscr_reg (8);
for (i=0;i<N/4;i+=8) {
Now load the first coefficient into a scalar register and eight values of the x array into vector register s15:s8:
asm("flds s0, %[mem]\n\t" : : [mem]"m" (coeff[0]) : "s0");
asm("fldmias%[mem],{s8,s9,s10,s11,s12,s13,s14,s15}\n\t"::
[mem]"r"(&x[i]) : "s8", "s9", "s10", "s11", "s12", "s13", "s14", "s15");
Next load eight copies of the second coefficient into vector register s23:s16 and perform our first fmacs by multiplying the x vector by the first coefficient and adding the result to the second coefficient, leaving the running sum in vector register s23:s16:
asm("fldmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t": :
[mem]"r"(&coeff[8]) :
"s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");
asm("fmacs s16, s8, s0\n\t" : : :
"s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");
Now repeat this process, swapping the vector registers s23:s16 with s31:s24:
asm("fldmias %[mem],{s24,s25,s26,s27,s28,s29,s30,s31}\n\t": :
[mem]"r"(&coeff[16]) :
"s24", "s25", "s26", "s27", "s28", "s29", "s30", "s31");
asm("fmacs s24, s8, s16\n\t" : : :
"s20", "s17", "s18", "s19", "s28", "s29", "s30", "s31");
Now repeat these last two steps two more times. End with the following code:
asm("fldmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t": :
[mem]"r"(&coeff[56]) :
"s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");
asm("fmacs s16, s8, s24\n\t" : : :
"s16", "s17", "s18", "s19", "s20", "s21", "s22", "s23");
asm("fstmias %[mem],{s16,s17,s18,s19,s20,s21,s22,s23}\n\t" : :
[mem]"r" (&d[i]));
Be sure to reset the short vector length to 1 after the outer loop:
set_fpscr_reg (1);
Table 2.4 shows the resulting performance improvement on the Raspberry Pi relative to the software pipelined implementation. The use of scheduled SIMD instructions provides a 37% performance improvement over software pipelining. This optimization increases CPI because each eight-way SIMD instruction requires eight cycles to issue, but comes with a larger relative decrease in instructions per flop (the product of CPI slowdown and instructions per flop speedup gives a total speedup of 1.36).
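To make the arithmetic explicit: combining the instructions-per-flop improvement with the CPI degradation from Table 2.4 gives 3.17 × 0.43 ≈ 1.36, which matches the overall speedup quoted above.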
Table 2.4. Performance Improvement from Short Vector Instructions Versus Software Pipelining
Platform | Raspberry Pi |
---|---|
CPU | ARM11 |
Throughput/efficiency | 1.37 speedup, 55.2% efficiency |
CPI | 0.43 speedup (slowdown) |
Cache miss rate | 1.89 speedup |
Instructions per flop | 3.17 speedup |
Another benefit of this optimization is the reduction in cache miss rate due to the SIMD load and store instructions.
URL: https://www.sciencedirect.com/science/article/pii/B978012800342800002X
Management of Cache Contents
Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
3.3.1 Combined Approaches to Partitioning
Several examples of partitioning revolve around the PlayDoh architecture from Hewlett-Packard Labs.
HPL-PD, PlayDoh v1.1 — General Architecture
One content-management mechanism in which the hardware and software cooperate in interesting ways is the HPL PlayDoh architecture, renamed the HPL-PD architecture, embodied in the EPIC line of processors [Kathail et al. 2000]. Two facets of the memory system are exposed to the programmer and compiler through instruction-set hooks: (1) the memory-system structure and (2) the memory disambiguation scheme.
The HPL-PD architecture exposes its view or definition of the memory system, shown in Figure 3.36, to the programmer and compiler. The instruction-set architecture is aware of four components in the memory system: the L1 and L2 caches, an L1 streaming or data-prefetch cache (sits next to the L1 cache), and main memory. The exact organization of each structure is not exposed to the architecture. As with other mechanisms that have placed separately managed buffers adjacent to the L1 cache, the explicit goal of the streaming/prefetch cache is to partition data into disjoint sets: (1) data that exhibits temporal locality and should reside in the L1 cache, and (2) everything else (e.g., data that exhibits only spatial locality), which should reside in the streaming cache.
FIGURE 3.36. The memory system defined by the HPL-PD architecture. Each component in the memory system is shown with the assembly-code instruction modifier used by a load or store instruction to specify that component. The L1 cache is called C1, the streaming or prefetch cache is V1, the L2 cache is C2, and the main memory is C3.
To manage data movement in this hierarchy, the instruction set provides several modifiers for the standard set of load and store instructions.
Load instructions have two modifiers:
1. A latency and source cache specifier hints to the hardware where the data is expected to be found (i.e., the L1 cache, the streaming cache, the L2 cache, main memory) and also specifies to the hardware the compiler's assumed latency for scheduling this particular load instruction. In machine implementations that require rigid timing (e.g., traditional VLIW), the hardware must stall if the data is not available with this latency; in machine implementations that have dynamic scheduling around cache misses (e.g., a superscalar implementation of the architecture), the hardware can ignore the value.
2. A target cache specifier indicates to hardware where the load data should be placed within the memory system (i.e., place it in the L1 cache, place it in the streaming cache, bring it no higher than the L2 cache, or leave it in main memory). Note that all loads specify a target register, but the target register may be r0, a read-only bit-bucket in both general-purpose and floating-point register files, providing a de facto form of non-binding prefetch. Presumably the processor core communicates the binding/non-binding status to the memory system to avoid useless bus activity.
Store instructions have one modifier:
1. The target cache specifier, like that for load instructions, indicates to the hardware the highest component in the memory system in which the store data should be retained. A store instruction's ultimate target is main memory, and the instruction can leave a copy in the cache system if the compiler recognizes that the value will be reused soon or can specify main memory as the highest level if the compiler expects no immediate reuse for the data.
Abraham's Profile-Directed Partitioning
Abraham describes a compiler mechanism to exploit the PlayDoh facility [Abraham et al. 1993]. At first glance, the authors note that it seems to offer too few choices to be of much use: a compiler can only distinguish between short-latency loads (expected to be found in L1), long-latency loads (expected in L2), and very long-latency loads (in main memory). A simple cache-performance analysis of a blocked matrix multiply shows that all loads have relatively low miss rates, which would suggest using the expectation of short latencies to schedule all load instructions.
However, the authors show that by loop peeling one can do much better. Loop peeling is a relatively simple compiler transformation that extracts a specific iteration of a loop and moves it outside the loop body. This increases code size (the loop body is replicated), but it opens up new possibilities for scheduling. In particular, keeping in mind the facilities offered by the HPL-PD instruction set, many loops display the following behavior: the first iteration of the loop makes (perhaps numerous) data references that miss the cache; the main body of the loop enjoys reasonable cache hit rates; and the last iteration of the loop has high hit rates, but it represents the last time the data will be used.
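A rough C sketch of loop peeling (not taken from the paper; the array and function names are invented for illustration) shows the shape of the transformation: the first and last iterations are pulled out of the loop so that their memory references can be treated differently from those in the main body.

void scale(float *a, int n, float k)
{
    if (n <= 0)
        return;

    a[0] = a[0] * k;                  /* peeled first iteration: likely to miss the cache */
    for (int i = 1; i < n - 1; i++)   /* main loop body: expected to hit in the L1 cache  */
        a[i] = a[i] * k;
    if (n > 1)
        a[n - 1] = a[n - 1] * k;      /* peeled last iteration: last use of the data      */
}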
The HPL-PD transformation of the loop peels off first and last iterations:
- The first iteration of the loop uses load instructions that specify main memory as the likely source cache; the store instructions target the L1 cache.
- The body of the loop uses load instructions that specify the L1 cache as the likely source; the store instructions also target the L1 cache.
- The last iteration of the loop uses load instructions that specify the L1 cache as the likely source; the store instructions target main memory.
The authors note that such a transformation is easily automated for regular codes, but irregular codes present a difficult challenge. The focus of the Abraham et al. study is to quantify the predictability of memory access in irregular applications. The study finds that, in most programs, a very small number of load instructions cause the bulk of cache misses. This is encouraging because if those instructions can be identified at compile time, they can be optimized by hand or perhaps by a compiler.
Hardware/Software Memory Disambiguation
The HPL-PD's memory disambiguation scheme comes from the memory conflict buffer in William Chen's Ph.D. thesis [1993]. The hardware provides to the software a mechanism that can detect and patch up memory conflicts, provided that the software identifies loads that are risky and then follows each up with an explicit invocation of a hardware check. The compiler/programmer can exploit the scheme to speculatively issue loads ahead of when it is safe to issue them, or it can ignore the scheme. The scheme by definition requires the cooperation of software and hardware to reap any benefits. The point of the scheme is to enable the compiler to improve its scheduling of code for which compile-time analysis of pointer addresses is not possible. For example, the following code uses pointer addresses in registers a1, a2, a3, and a4 that cannot be guaranteed to be conflict free:
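The original assembly listing is not reproduced in this excerpt. A C-level analogue of the same problem (with hypothetical names) is shown below: the compiler cannot hoist the load of src[1] above the store to dst[0] unless it can prove that the two pointers never refer to the same location.

void copy_pair(int *dst, int *src)
{
    dst[0] = src[0];   /* store through dst                                   */
    dst[1] = src[1];   /* this load may alias dst[0], so it cannot be hoisted */
}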
The code has the following conservative schedule (assuming 2-cycle load latencies—equivalent to a 1-cycle load-use penalty, as in separate EX and MEM pipeline stages in an in-order pipe—and 1-cycle latencies for all else):
A better schedule would be the following, which moves the second load instruction ahead of the first store:
If we assume two memory ports, the following schedule is slightly better:
However, the compiler cannot guarantee the safety of this code, because it cannot guarantee that a3 and a2 will contain different values at run time. Chen's solution, used in HPL-PD, is for the compiler to inform the hardware that a particular load is risky. This allows the hardware to make note of that load and to compare its run-time address to stores that follow it. The scheme also relies upon the compiler to perform a post-verification that can patch up errors if aggressively scheduling the load ahead of the store did indeed cause a conflict.
The scheme centers around the LDS log, a record of speculatively issued load instructions that maintains in each of its entries the target register of the load and the memory address that the load uses. There are two types of instructions that the compiler uses to manage the log's state, and store instructions affect its state implicitly:
1. LDS instructions are load-speculative instructions that explicitly allocate a new entry in the log (remember an entry contains the target register and memory address). On executing an LDS instruction, the hardware creates a new entry and invalidates any old entries that have the same target register.
2. Store instructions modify the log implicitly. On executing a store, the hardware checks the log for a live entry that matches the same memory address and deletes any entries that match.
3. LDV instructions are load-verification instructions that must be placed conservatively in the code (after a potentially conflicting store instruction). They check to see if there was a conflict between the speculative load and the store. On executing an LDV instruction, the hardware checks the log for a valid entry with the matching target register. If an entry exists, the instruction can be treated as an NOP; if no entry matches, the LDV is treated as a load instruction (it computes a memory address, fetches the datum from memory, and places it into the target register).
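The following C sketch models the behavior of the log just described. The data structure, its size, and the function names are illustrative only; they are not part of the HPL-PD definition.

#include <stdbool.h>
#include <stddef.h>

struct lds_entry { int target_reg; size_t addr; bool valid; };
static struct lds_entry lds_log[8];          /* log size chosen arbitrarily */

static void lds(int reg, size_t addr)        /* speculative load (LDS)       */
{
    for (int i = 0; i < 8; i++)              /* invalidate old entries with  */
        if (lds_log[i].valid && lds_log[i].target_reg == reg)
            lds_log[i].valid = false;        /* the same target register     */
    for (int i = 0; i < 8; i++)
        if (!lds_log[i].valid) {             /* allocate a new entry         */
            lds_log[i] = (struct lds_entry){ reg, addr, true };
            return;
        }
}

static void store(size_t addr)               /* stores update the log implicitly */
{
    for (int i = 0; i < 8; i++)
        if (lds_log[i].valid && lds_log[i].addr == addr)
            lds_log[i].valid = false;        /* conflict: delete matching entry  */
}

static bool ldv(int reg)                     /* verification (LDV): true means  */
{                                            /* "treat as NOP", false means the */
    for (int i = 0; i < 8; i++)              /* load must be re-executed        */
        if (lds_log[i].valid && lds_log[i].target_reg == reg)
            return true;
    return false;
}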
The example code becomes the following, where the second LD instruction is replaced by an LDS/LDV pair:
The compiler can schedule the LDS instruction aggressively, keeping the matching LDV instruction in the conservative spot behind the store instruction (note that in HPL-PD, memory operations are prioritized left to right, so the LDV operation is technically "behind" the ST).
If we assume two memory ports, there is not much to be gained, because the LDV must be scheduled to happen after the potentially aliasing ST (store) instruction, which would yield effectively the same schedule as above. To address this type of issue (as well as many similar scenarios) the architecture also provides a BRDV instruction, a post-verification instruction similar to LDV that, instead of loading data, branches to a specified location on detection of a memory conflict. This instruction is used in conjunction with compiler-generated patch-up code to handle more complex scenarios. For instance, the following could be used for implementations with a single memory port:
The following can be used with multiple memory ports:
where the patch-up code is given as follows:
Using the BRDV instruction, the compiler can achieve optimal scheduling.
There are a number of issues that the HPL-PD mechanism must handle. For instance, the hardware must ensure that no virtual-address aliases can cause problems (e.g., different virtual addresses that map to the same physical address, if the operating system supports this). The hardware must also handle partial overwrites, for instance, a write instruction that writes a single byte to a four-byte word that was previously read speculatively (the addresses would not necessarily match). The compiler must ensure that every LDS is followed by a matching LDV that uses the same target register and address register (for obvious reasons), and the compiler also must ensure that no intervening operations disturb the log or the target register. The LDV instruction must block until complete to achieve effectively single-cycle latencies.
URL: https://www.sciencedirect.com/science/article/pii/B9780123797513500059
EXCEPTION AND INTERRUPT HANDLING
ANDREW N. SLOSS, ... CHRIS WRIGHT, in ARM System Developer's Guide, 2004
9.3.2 NESTED INTERRUPT HANDLER
A nested interrupt handler allows for another interrupt to occur within the currently called handler. This is achieved by reenabling the interrupts before the handler has fully serviced the current interrupt.
For a real-time system this feature increases the complexity of the system but also improves its performance. The additional complexity introduces the possibility of subtle timing issues that can cause a system failure, and these subtle problems can be extremely difficult to resolve. A nested interrupt method is designed carefully so as to avoid these types of problems. This is achieved by protecting the context restoration from interruption, so that the next interrupt will not fill the stack (cause stack overflow) or corrupt any of the registers.
The first goal of any nested interrupt handler is to respond to interrupts quickly so the handler neither waits for asynchronous exceptions, nor forces them to wait for the handler. The second goal is that execution of regular synchronous code is not delayed while servicing the various interrupts.
The increase in complexity means that the designers have to balance efficiency with safety, by using a defensive coding style that assumes problems will occur. The handler has to check the stack and protect against register corruption where possible.
Figure 9.9 shows a nested interrupt handler. As can be seen from the diagram, the handler is quite a bit more complicated than the simple nonnested interrupt handler described in Section 9.3.1.
Figure 9.9. Nested interrupt handler.
The nested interrupt handler entry code is identical to the simple nonnested interrupt handler, except that on exit, the handler tests a flag that is updated by the ISR. The flag indicates whether further processing is required. If further processing is not required, then the interrupt service routine is complete and the handler can exit. If further processing is required, the handler may take several actions: reenabling interrupts and/or performing a context switch.
Reenabling interrupts involves switching out of IRQ mode to either SVC or system mode. Interrupts cannot simply be reenabled when in IRQ mode because this would lead to possible link register r14_irq corruption, especially if an interrupt occurred after the execution of a BL instruction. This problem will be discussed in more detail in Section 9.3.3.
Performing a context switch involves flattening (emptying) the IRQ stack because the handler does not perform a context switch while there is data on the IRQ stack. All registers saved on the IRQ stack must be transferred to the task's stack, typically on the SVC stack. The remaining registers must then be saved on the task stack. They are transferred to a reserved block of memory on the stack called a stack frame.
EXAMPLE 9.9
This nested interrupt handler example is based on the flow diagram in Figure 9.9. The rest of this section will walk through the handler and describe in detail the various stages.
This example uses a stack frame structure. All registers are saved onto the frame except for the stack register r13. The order of the registers is unimportant except that FRAME_LR and FRAME_PC should be the last two registers in the frame because we will return with a single instruction:
There may be other registers that are required to be saved onto the stack frame, depending upon the operating system or application being used. For example:
- Registers r13_usr and r14_usr are saved when there is a requirement by the operating system to support both user and SVC modes.
- Floating-point registers are saved when the system uses hardware floating point.
There are a number of defines declared in this example. These defines map various cpsr/spsr changes to a particular label (for example, the I_Bit).
A set of defines is also declared that maps the various frame register references with frame pointer offsets. This is useful when the interrupts are reenabled and registers have to be stored into the stack frame. In this example we store the stack frame on the SVC stack.
The entry point for this example handler uses the same code as for the simple nonnested interrupt handler. The link register r14 is first modified so that it points to the correct return address, and then the context plus the link register r14 are saved onto the IRQ stack.
An interrupt service routine then services the interrupt. When servicing is complete or partially complete, control is passed back to the handler. The handler then calls a function called read_RescheduleFlag, which determines whether further processing is required. It returns a nonzero value in register r0 if no further processing is required; otherwise it returns a zero. Note we have not included the source for read_RescheduleFlag because it is implementation specific.
The return flag in register r0 is then tested. If the register is not equal to zero, the handler restores context and returns control back to the suspended task.
Register r0 is set to zero, indicating that further processing is required. The first operation is to save the spsr, so a copy of the spsr_irq is moved into register r2. The spsr can then be stored in the stack frame by the handler later on in the code.
The IRQ stack address pointed to by register r13_irq is copied into register r0 for later use. The next step is to flatten (empty) the IRQ stack. This is done by adding 6 * 4 bytes to the top of the stack because the stack grows downwards and an ADD instruction can be used to set the stack.
The handler does not need to worry about the data on the IRQ stack being corrupted by another nested interrupt because interrupts are still disabled and the handler will not reenable the interrupts until the data on the IRQ stack has been recovered.
The handler then switches to SVC mode; interrupts are still disabled. The cpsr is copied into register r1 and modified to set the processor mode to SVC. Register r1 is then written back into the cpsr, and the current mode changes to SVC mode. A copy of the new cpsr is left in register r1 for later use.
The next stage is to create a stack frame by extending the stack by the stack frame size. Registers r4 to r11 can be saved onto the stack frame, which will free up enough registers to allow us to recover the remaining registers from the IRQ stack still pointed to by register r0.
At this stage the stack frame will contain the information shown in Table 9.7. The only registers that are not in the frame are the registers that are stored upon entry to the IRQ handler.
Table 9.7. SVC stack frame.
Label | Offset | Register |
---|---|---|
FRAME_R0 | +0 | — |
FRAME_R1 | +4 | — |
FRAME_R2 | +8 | — |
FRAME_R3 | +12 | — |
FRAME_R4 | +16 | r4 |
FRAME_R5 | +20 | r5 |
FRAME_R6 | +24 | r6 |
FRAME_R7 | +28 | r7 |
FRAME_R8 | +32 | r8 |
FRAME_R9 | +36 | r9 |
FRAME_R10 | +40 | r10 |
FRAME_R11 | +44 | r11 |
FRAME_R12 | +48 | — |
FRAME_PSR | +52 | — |
FRAME_LR | +56 | — |
FRAME_PC | +60 | — |
Table 9.8 shows the registers in SVC mode that correspond to the existing IRQ registers. The handler can now retrieve all the data from the IRQ stack, and it is safe to reenable interrupts.
Table 9.8. Data retrieved from the IRQ stack.
Registers (SVC) | Retrieved IRQ registers |
---|---|
r4 | r0 |
r5 | r1 |
r6 | r2 |
r7 | r3 |
r8 | r12 |
r9 | r14 (return address) |
IRQ exceptions are reenabled, and the handler has saved all the important registers. The handler can now complete the stack frame. Table 9.9 shows a completed stack frame that can be used either for a context switch or to handle a nested interrupt.
Table 9.9. Complete frame stack.
Label | Offset | Register |
---|---|---|
FRAME_R0 | +0 | r0 |
FRAME_R1 | +4 | r1 |
FRAME_R2 | +8 | r2 |
FRAME_R3 | +12 | r3 |
FRAME_R4 | +16 | r4 |
FRAME_R5 | +20 | r5 |
FRAME_R6 | +24 | r6 |
FRAME_R7 | +28 | r7 |
FRAME_R8 | +32 | r8 |
FRAME_R9 | +36 | r9 |
FRAME_R10 | +40 | r10 |
FRAME_R11 | +44 | r11 |
FRAME_R12 | +48 | r12 |
FRAME_PSR | +52 | spsr_irq |
FRAME_LR | +56 | r14 |
FRAME_PC | +60 | r14_irq |
At this stage the remainder of the interrupt servicing may be handled. A context switch may be performed by saving the current value of register r13 in the current task's control block and loading a new value for register r13 from the new task's control block.
It is now possible to return to the interrupted task/handler, or to another task if a context switch occurred.
SUMMARY
Nested Interrupt Handler
- Handles multiple interrupts without a priority assignment.
- Medium to high interrupt latency.
- Advantage: can enable interrupts before the servicing of an individual interrupt is complete, reducing interrupt latency.
- Disadvantage: does not handle prioritization of interrupts, so lower priority interrupts can block higher priority interrupts.
URL: https://www.sciencedirect.com/science/article/pii/B9781558608740500101
Hardware and Application Profiling Tools
Tomislav Janjusic, Krishna Kavi, in Advances in Computers, 2014
3.3 Multiple-Component Simulators
Medium-complexity simulators model multiple components and the interactions among the components, including a complete CPU with in-order or out-of-order execution pipelines, branch prediction and speculation, and the memory subsystem. A prime example of such a system is the widely used SimpleScalar tool set [8]. It is aimed at architecture research, although some academics deem SimpleScalar invaluable for teaching computer architecture courses. An extension known as ML-RSIM [10] is an execution-driven computer system simulator that models several subcomponents, including an OS kernel. Other extensions include M-Sim [12], which extends SimpleScalar to model multithreaded architectures based on simultaneous multithreading (SMT).
3.3.1 SimpleScalar
SimpleScalar is a set of tools for computer architecture research and education. Developed in 1995 as part of the Wisconsin Multiscalar project, it has since sparked many extensions and variants of the original tool. It runs precompiled binaries for the SimpleScalar architecture. This also implies that SimpleScalar is not an FS simulator but rather a user-space, single-application simulator. SimpleScalar is capable of emulating the Alpha, portable instruction set architecture (PISA) (MIPS-like instructions), ARM, and x86 instruction sets. The simulator interface consists of the SimpleScalar ISA and POSIX system call emulations.
The available tools that come with SimpleScalar include sim-fast, sim-safe, sim-profile, sim-cache, sim-bpred, and sim-outorder:
- sim-fast is a fast functional simulator that ignores any microarchitectural pipelines.
- sim-safe is an instruction interpreter that checks for memory alignments; this is a good way to check for application bugs.
- sim-profile is an instruction interpreter and profiler. It can be used to measure application dynamic instruction counts and profiles of code and data segments.
- sim-cache is a memory simulator. This tool can simulate multiple levels of cache hierarchies.
- sim-bpred is a branch predictor simulator. It is intended to simulate different branch prediction schemes and measures misprediction rates.
- sim-outorder is a detailed architectural simulator. It models a superscalar pipelined architecture with out-of-order execution of instructions, branch prediction, and speculative execution of instructions.
3.3.2 M-Sim
M-Sim is a multithreaded extension to SimpleScalar that models detailed individual key pipeline stages. M-Sim runs precompiled Alpha binaries and works on most systems that also run SimpleScalar. It extends SimpleScalar by providing a cycle-accurate model for thread context pipeline stages (reorder buffer, separate issue queue, and separate arithmetic and floating-point registers). M-Sim models a single SMT capable core (and not multicore systems), which means that some processor structures are shared while others remain private to each thread; details can be found in Ref. [12].
The look and feel of M-Sim is similar to SimpleScalar. The user runs the simulator as a stand-alone simulation that takes precompiled binaries compatible with M-Sim, which currently supports only the Alpha AXP ISA.
3.3.3 ML-RSIM
This is an execution-driven computer system simulator that combines detailed models of modern computer hardware, including I/O subsystems, with a fully functional OS kernel. ML-RSIM's environment is based on RSIM, an execution-driven simulator for instruction-level parallelism (ILP) in shared memory multiprocessors and uniprocessor systems. It extends RSIM with additional features including I/O subsystem support and an OS. The goal behind ML-RSIM is to provide detailed hardware timing models so that users are able to explore OS and application interactions. ML-RSIM is capable of simulating OS code and memory-mapped access to I/O devices; thus, it is a suitable simulator for I/O-intensive interactions.
ML-RSIM implements the SPARC V8 instruction set. It includes cache and TLB models, and exception handling capabilities. The cache hierarchy is modeled as a two-level structure with support for cache coherency protocols. Load and store instructions to the I/O subsystem are handled through an uncached buffer with support for store instruction combining. The memory controller supports the MESI (modified, exclusive, shared, invalid) snooping protocol with accurate modeling of queuing delays, bank contention, and dynamic random access memory (DRAM) timing. The I/O subsystem consists of a peripheral component interconnect (PCI) bridge, a real-time clock, and a number of small computer system interface (SCSI) adapters with hard disks. Unlike other FS simulators, ML-RSIM includes a detailed timing-accurate representation of various hardware components. ML-RSIM does not model any particular system or device; rather, it implements detailed general device prototypes that can be used to assemble a range of real machines.
ML-RSIM uses a detailed representation of an OS kernel, the Lamix kernel. The kernel is Unix-compatible, specifically designed to run on ML-RSIM, and implements core kernel functionalities, primarily derived from NetBSD. Applications linked for Lamix can (in most cases) run on Solaris. With a few exceptions, Lamix supports most of the major kernel functionalities such as signal handling, dynamic process termination, and virtual memory management.
3.3.4 ABSS
An augmentation-based SPARC simulator, or ABSS for short, is a multiprocessor simulator based on AugMINT, an augmented MIPS interpreter. The ABSS simulator can be either trace-driven or program-driven. We have described examples of trace-driven simulators, including DineroIV, where only some abstracted features of an application (i.e., instruction or data address traces) are simulated. Program-driven simulators, on the other hand, simulate the execution of an actual application (e.g., a benchmark). Program-driven simulations can be either interpretive simulations or execution-driven simulations. In interpretive simulations, the instructions are interpreted by the simulator one at a time, while in execution-driven simulations, the instructions are actually run on real hardware. ABSS is an execution-driven simulator that executes the SPARC ISA.
ABSS consists of several components: a thread module, an augmenter, cycle-accurate libraries, memory system simulators, and the benchmark. Upon execution, the augmenter instruments the application and the cycle-accurate libraries. The thread module, libraries, the memory system simulator, and the benchmark are linked into a single executable. The augmenter then models each processor as a separate thread; in the event of a break (context switch) that the memory system must handle, execution pauses and the thread module handles the request, usually saving registers and reloading new ones. The goal behind ABSS is to allow the user to simulate timing-accurate SPARC multiprocessors.
3.3.5 HASE
HASE, the hierarchical architecture design and simulation environment, and SimJava are educational tools used to design, test, and explore computer architecture components. Through abstraction, they facilitate the study of hardware and software designs on multiple levels. HASE offers a GUI for students trying to understand complex system interactions. The motivation for developing HASE was to provide a tool for rapid and flexible development of new architectural ideas.
HASE is based on SIM++, a discrete-event simulation language. SIM++ describes the basic components, and the user can link the components together. HASE will then produce the initial code that forms the basis of the desired simulator. Since HASE is hierarchical, new components can be built as interconnected modules of core entities.
HASE offers a variety of simulation models intended for teaching and educational laboratory experiments. Each model must be used with HASE, a Java-based simulation environment. The simulator produces a trace file that is later used as input to the graphic environment to represent the interior workings of an architectural component. The following are a few of the models available through HASE:
- Simple pipelined processor based on MIPS
- Processor with scoreboards (used for instruction scheduling)
- Processor with prediction
- Single instruction, multiple data (SIMD) array processors
- A two-level cache model
- Cache coherency protocols (snooping and directory)
URL: https://www.sciencedirect.com/science/article/pii/B9780124202320000039
Source: https://www.sciencedirect.com/topics/computer-science/floating-point-register