
HAQu: Hardware-Accelerated Queueing for Fine-Grained Threading on a Chip Multiprocessor*

Sanghoon Lee, Devesh Tiwari, Yan Solihin, James Tuck
Department of Electrical & Computer Engineering, North Carolina State University
{shlee5,dtiwari2,solihin,jtuck}@ncsu.edu

* This work was supported in part by NSF Award CNS-0834664.

Abstract

Queues are commonly used in multithreaded programs for synchronization and communication. However, because software queues tend to be too expensive to support fine-grained parallelism, hardware queues have been proposed to reduce the overhead of communication between cores. Hardware queues require modifications to the processor core and need a custom interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches. To solve these problems, we propose a hardware-accelerated queue, or HAQu. HAQu adds hardware to a CMP that accelerates operations on software queues. Our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways: (1) it provides a single instruction for enqueueing and dequeueing, which significantly reduces the overhead when used in fine-grained threading; (2) operations on the queue are designed to leverage low-level details of the coherence protocol; and (3) hardware ensures that the full state of the queue is stored in the application's address space, thereby ensuring virtualization. We have evaluated our design in two application domains: offloading fine-grained checks for improved software reliability, and automatic, fine-grained parallelization using decoupled software pipelining.

1. Introduction

Chip multiprocessors (CMPs) are now ubiquitous, but achieving high performance on a wide range of applications is still a big challenge. It is important to consider new architectural features that can expand the programmability and performance of CMPs. In this paper, we consider such a feature by adding architectural support for a single-producer, single-consumer queue.

Queues are a commonly used programming construct in multithreaded programs for synchronization and communication. They are effective when the granularity of parallelism is large enough to amortize the cost of enqueueing and dequeueing, but software queues tend to be too expensive to support fine-grained parallelism. For this reason, hardware queues have been proposed to reduce the overhead of communicating between cores. However, hardware queues tend to be unattractive to designers. They come at a large design cost because they require modifications to the processor core and need a custom, global interconnect. They also pose difficulties for the operating system because their state must be preserved across context switches, and these challenges grow with ever-increasing processor virtualization.

To solve these problems, we propose a hardware-accelerated queue, or HAQu (read like haiku). HAQu adds hardware to a CMP that accelerates operations on software queues. Rather than adding a custom hardware queue, our design implements fast queueing through an application's address space with operations that are compatible with a fully software queue. Our design provides accelerated and OS-transparent performance in three general ways.
(1) It provides instructions for enqueueing and dequeueing that significantly reduce the overhead of fine-grained threading. These instructions are critical for achieving fast queueing since they eliminate the high instruction count involved in queueing operations. (2) Operations on the queue are designed to leverage low-level details of the on-chip caches and coherence protocol to provide fast communication. (3) The full state of the queue is stored in the application's address space, thereby providing virtualization, queues in nearly unlimited quantity, and queues of large size. On the whole, HAQu is no worse for programmability than a software queue.

We have evaluated our design in three contexts. First, we evaluate our mechanism using a micro-benchmark designed to show the peak performance of our architecture compared to state-of-the-art queueing algorithms in software. Our system achieves a 6.5× speedup over an efficient software implementation proposed by Lee et al. [9]. Second, we show how our design can facilitate fine-grained parallelization by parallelizing Mudflap, a pointer-checking utility built into GCC. Using HAQu, we can afford to parallelize fine-grained, inlined pointer-checking code. Using 16 cores, we achieve an average speedup of 1.8× (geometric mean) and a maximum speedup of 3.3× on a set of SPECint applications. Finally, we demonstrate the potential of HAQu for automatic parallelization using decoupled software pipelining (DSWP). On a small set of SPECint applications, we show that our mechanism enables fully automatic parallelization.

The rest of this paper proceeds as follows: Section 2 discusses the overall idea of our approach; Section 3 describes the architecture of the queue, and Section 4 provides the implementation. Section 5 describes our experimental setup, and Section 6 provides the full evaluation in the context of three case studies. Section 7 discusses related work, and Section 8 concludes.

2. Main Idea: Hardware Accelerated Queue

2.1. Motivation

Queues that support fine-grained parallelism on chip multiprocessors must meet several criteria. We have identified four that would make a queue acceptable for a wide range of uses. Throughout this article, we restrict our discussion to single-producer/single-consumer queues. First, (Criterion I) enqueue and dequeue operations must be made as efficient as possible; since these operations may occur frequently, their total latency must be low. Second, (Criterion II) the delay through the queue (i.e., the time between an enqueue and a dequeue of the same item) must be low enough that concurrent execution proceeds efficiently. Third, (Criterion III) they should be no worse to program than software queues. For example, there should be enough queues available to support a wide variety of applications, and their synchronization constraints should be no harder to understand than those of fully software implementations. Finally, (Criterion IV) the queue must work seamlessly on an unmodified operating system (OS).

The current state-of-the-art implementations for queueing are fully in software. Figure 1 shows an implementation of a Lamport queue designed for sequential consistency; part (a) shows the queue definition, and parts (b) and (c) show the enqueue and dequeue operations, respectively. The enqueue and dequeue operations shown here are non-blocking and lock-free; a blocking implementation is trivial to construct using these operations. For weaker consistency models, a fence is needed to properly read and update the queue state.
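To make the fence requirement concrete, the following sketch recasts the Lamport enqueue and dequeue for a weakly ordered machine using C11 atomics; the acquire/release orderings around the buffer store and buffer load supply what sequential consistency otherwise provides for free. This is an illustrative sketch rather than code from the paper: it uses array indices instead of the pointer fields of Figure 1, and the Item type and SIZE are assumed.

    /* Sketch: Lamport-style single-producer/single-consumer queue for a weakly
     * ordered machine, using C11 atomics. Illustrative only; Item, SIZE, and
     * NEXT() are assumptions matching Figure 1. */
    #include <stdatomic.h>
    #include <stdbool.h>

    typedef int Item;
    #define SIZE 1024
    #define NEXT(i) (((i) + 1) % SIZE)      /* indices rather than pointers, for brevity */

    struct wq {
        _Atomic unsigned head;              /* next slot to fill (written by producer) */
        _Atomic unsigned tail;              /* next slot to drain (written by consumer) */
        Item buffer[SIZE];
    };

    bool wq_enqueue(struct wq *q, Item val) {
        unsigned head = atomic_load_explicit(&q->head, memory_order_relaxed);
        unsigned next = NEXT(head);
        if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
            return false;                   /* full */
        q->buffer[head] = val;              /* store the item */
        /* release: publish the item before advancing head */
        atomic_store_explicit(&q->head, next, memory_order_release);
        return true;
    }

    bool wq_dequeue(struct wq *q, Item *val) {
        unsigned tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
        if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
            return false;                   /* empty */
        *val = q->buffer[tail];             /* load the item */
        /* release: free the slot only after the item has been read */
        atomic_store_explicit(&q->tail, NEXT(tail), memory_order_release);
        return true;
    }

A blocking variant, as noted above, simply retries in a loop whenever these operations report a full or empty queue.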
While these implementations are not well suited to fine-grained parallelism, they do meet most of the criteria above. In particular, they provide low latency for communicating through the queue because they exploit the fast on-chip interconnect, and this latency can be optimized through careful consideration of the on-chip caches and the coherence protocol. They meet Criterion III trivially because they are implemented in software. Criterion IV is met trivially as a result of Criterion III: since the queues are stored in virtual memory, they are virtualized by any OS. Software queues, however, do not provide enqueue and dequeue operations that are efficient enough for fine-grained parallelism, so they do not meet Criterion I. Even with efficient non-blocking and lock-free implementations of enqueue and dequeue, the cost of these operations tends to be high: on the order of 10 instructions per queue operation.

Hardware queues have appeared in a variety of proposals [13, 11, 6]. Due to their similarities, we treat them the same in this discussion. These designs support Criterion I and Criterion II. Using special instructions or special hardware, data is inserted directly into the hardware queue, which makes enqueue and dequeue highly efficient; in fact, the latency of an enqueue or dequeue can be hidden by out-of-order execution. Furthermore, the storage for the queue is provided by special buffers and an interconnect between two or more cores, which provides very low-latency communication. However, these mechanisms are costly, and there are practical concerns over their programmability and the limits on the number of queues that can be supported (Criterion III). Furthermore, these hardware queues have not been designed with the OS in mind and often do not consider the challenges of context virtualization. For example, how is the state of the queue preserved when one of the threads, either the producer or the consumer, is context switched? In addition, allocation of global queueing resources may need to be managed by the operating system. Given the complexity of a modern OS, hardware that requires such changes in the OS is considerably less desirable.

Neither software nor proposed hardware mechanisms meet all of the criteria for fine-grained threading on current CMPs. However, the shortcomings of these software and hardware mechanisms point the way to a hybrid design that leverages strengths from each one.

2.2. Our Proposal: Hardware-Accelerated Queue

The Hardware-Accelerated Queue (HAQu) aims to exploit the critical benefits of hardware queueing while retaining the many desirable features of software queues. In a sense, we start with a software implementation and speed up the features that are critical for faster and more efficient fine-grained threading. Using hardware to accelerate queueing is not new to this proposal; the VAX [4] also supported hardware-accelerated queues (see the comparison in Section 7). However, we will show that HAQu provides a simple, lean implementation that integrates well into modern superscalar architectures, and that it can provide much higher throughput and lower latency than optimized software implementations.

HAQu provides two new instructions for enqueue and dequeue operations to meet Criterion I.
These new instructions keep the code footprint low, and their implementation works much like software enqueue and dequeue operations. However, because they are implemented in hardware, we incorporate detailed knowledge of the memory consistency model, coherence protocol, and on-chip resources to provide a highly optimized enqueue or dequeue operation.

    // (a) Queue definition
    struct queue {
        Item *head;
        Item *tail;
        uint size;          // = SIZE
        Item buffer[SIZE];
    };
    typedef struct queue *queue_p;

    // (b) Enqueue
    int enqueue(queue_p q, Item val) {
        Item *next = NEXT(q->head);
        if (q->tail == next)
            return BLOCK;           // queue full
        *q->head = val;
        // memory barrier
        q->head = next;
        return SUCCESS;
    }

    // (c) Dequeue
    int dequeue(queue_p q, Item *val) {
        Item *tail = q->tail;
        if (tail == q->head)
            return BLOCK;           // queue empty
        // memory barrier
        *val = *tail;
        q->tail = NEXT(tail);
        return SUCCESS;
    }

Figure 1. Lamport's queueing operations for a single-producer, single-consumer queue on a sequentially consistent processor: (a) queue definition, (b) enqueue, (c) dequeue.

HAQu stores the queue's state in the application's address space. As a result, Criterion III and Criterion IV are met nearly trivially. This means that we can support a large number of queues efficiently, and we can design HAQu to provide the synchronization semantics that programmers expect. Also, since the state of the queue is held in virtual memory, a context switch poses almost no problems: the queue need not be saved or restored. Furthermore, HAQu can leverage the existing on-chip interconnect and cache coherence protocol for transferring data from producer to consumer, which usually provides low enough latencies to meet Criterion II.

Arriving at an efficient implementation is not as straightforward as it may seem. Criterion I and Criterion II are difficult to meet without leveraging queue-specific information in the hardware. We find that some information about the queue must be tracked in hardware; however, we can do so only if we preserve OS transparency and programmability. Our architecture for HAQu balances the competing concerns of achieving a high-throughput implementation (Criterion I and Criterion II) while still providing an implementation that can be deployed on conventional CMPs, supports a wide range of applications, and remains easy to program (Criterion III and Criterion IV).

3. System Architecture

To explain the overall architecture of our proposal, we begin with a software queue and modify it to support enqueueing and dequeueing operations in hardware. There are three main components to this discussion: the allocation and layout of the queue in memory, the enqueue and dequeue operations, and making these operations efficient for fine-grained threading.

3.1. Allocation and Layout

Our queue resides in memory but is operated on directly by hardware. Therefore, the layout of the queue is determined a priori [1]. For now, we assume a layout similar to Figure 1(a); this is not an optimized configuration but is sufficient for the following discussion [2]. A client allocates a queue using a memory allocator like malloc (in C) and prepares it for use. The head, tail, size, and buffer fields of the queue are located at fixed offsets from the beginning of the queue. We presume that the queue data structure and its initialization can be provided through a runtime library. Furthermore, since the buffer is the last field of the struct, the queue can be of arbitrary size.

[1] This is a limitation of our proposal, but a handful of layout configurations could be supported and configured with firmware at little additional cost.
[2] An optimized implementation will pad the head and tail to fit on their own cache lines to prevent false sharing.
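To make the layout concrete, the sketch below shows one way a runtime library could allocate and initialize such a queue with malloc: the buffer is a flexible array member so the queue can be sized at allocation time, and the head and tail are padded onto separate cache lines as footnote [2] suggests. The padding, the 64-byte line size, and the haqu_queue_create name are illustrative assumptions rather than part of the proposal; the hardware only requires that the fields sit at known fixed offsets from the queue's base address.

    /* Sketch: allocating a HAQu-compatible queue in the application's address
     * space. Field order follows Figure 1(a); padding and names are assumptions. */
    #include <stdlib.h>
    #include <string.h>

    typedef unsigned long Item;
    #define CACHE_LINE 64                          /* assumed line size */

    struct queue {
        Item *head;                                /* producer's insertion point    */
        char  pad0[CACHE_LINE - sizeof(Item *)];   /* keep head on its own line     */
        Item *tail;                                /* consumer's removal point      */
        char  pad1[CACHE_LINE - sizeof(Item *)];   /* keep tail on its own line     */
        unsigned size;                             /* number of buffer slots        */
        Item  buffer[];                            /* last field, so any size works */
    };

    /* Hypothetical runtime-library helper: allocate and initialize a queue of
     * 'size' slots. */
    struct queue *haqu_queue_create(unsigned size) {
        struct queue *q = malloc(sizeof(*q) + (size_t)size * sizeof(Item));
        if (!q)
            return NULL;
        memset(q, 0, sizeof(*q));                  /* clear the header fields only */
        q->size = size;
        q->head = q->buffer;                       /* empty queue: head == tail    */
        q->tail = q->buffer;
        return q;
    }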
3.2. Enqueue and Dequeue Operations

A key goal of HAQu is reducing the cost of enqueue and dequeue operations by adding two new instructions to carry out these tasks. A naive design directly implements Lamport's algorithm [8] in hardware.

Consider an enqueue operation. When an enqueue instruction is executed, it is passed the base address of the queue as an input argument. Using the base address, the head and tail fields of the data structure are read. From this point on, hardware implements Lamport's algorithm.

Figure 2 shows a possible timeline for the execution of a queue operation. Since enqueue and dequeue operations are similar, we describe only the enqueue operation in detail. In cycle 1, the head and tail are loaded into temporary registers (these are not architectural registers or part of the register file). For this example, we assume that both memory operations hit in the L1 cache and return 2 cycles later. In cycle 3, we perform the test to validate that the queue is not full. In this timeline, we assume it is not and proceed in cycle 4 to store the value in the queue. We refer to this store as the enqueueing-store, since it actually places the value in the queue. (Similarly, we refer to the operation in cycle 4 of the dequeue as the dequeueing-load.) At this point, we must wait for the store to complete before updating the head to the next position. Note that this store ordering is required for Lamport's algorithm to function properly and without races.

    (a) Enqueue:  cycle 1: t0 = MEM[head], t1 = MEM[tail]   (access cache, wait for completion)
                  cycle 3: check NEXT(t0) != t1
                  cycle 4: MEM[t0] = val                    (access cache, wait for completion)
                  cycle 6: MEM[head] = NEXT(t0)
    (b) Dequeue:  cycle 1: t0 = MEM[head], t1 = MEM[tail]   (access cache, wait for completion)
                  cycle 3: check t0 != t1
                  cycle 4: val = MEM[t1]                    (access cache, wait for completion)
                  cycle 6: MEM[tail] = NEXT(t1)

Figure 2. Enqueue (a) and dequeue (b) operations, assuming cache hits with a dual-ported, 2-cycle-access L1 cache.

The enqueue and dequeue instructions are shown in Figure 3. Since we target a MIPS architecture in our evaluation, these instructions are shown using MIPS-like mnemonics and conventions. We support variants of these operations for each size of memory load or store supported in MIPS to ensure efficient code generation. Each enqueue instruction takes two input arguments: (1) the base address of the queue, and (2) the data to be enqueued. Dequeue operations require only the base address of the queue as input. Non-blocking instructions return an error code in their destination register.

    Mnemonic                       Description
    benq{b,h,w,d} Rs, QAddr        Enqueue Rs into QAddr; block on queue full.
    bdeq{b,h,w,d} Rd, QAddr        Dequeue from QAddr into Rd; block on queue empty.
    enq{b,h,w,d}  Rt, Rs, QAddr    Enqueue Rs into QAddr; Rt returns status.
    deq{b,h,w,d}  Rt, Rd, QAddr    Dequeue from QAddr into Rd; returns status in Rt.

Figure 3. Enqueue and dequeue instructions.
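As a usage illustration, the following sketch shows how a producer and a consumer might communicate through one queue using the blocking word-sized forms from Figure 3, wrapped here in hypothetical C intrinsics. The haqu_benqw/haqu_bdeqw names and the process() callback are invented for this example; the paper defines the instructions, not a C-level API.

    /* Sketch: fine-grained producer/consumer communication through one queue.
     * The intrinsics below are hypothetical wrappers assumed to compile to
     * single benqw/bdeqw instructions from Figure 3. */
    #include <stdint.h>

    extern void     haqu_benqw(void *qaddr, uint32_t value);  /* block on queue full  */
    extern uint32_t haqu_bdeqw(void *qaddr);                  /* block on queue empty */

    /* Application-specific work on the consumer core, e.g. an offloaded check. */
    extern void process(uint32_t item);

    /* Producer: one enqueue instruction per fine-grained work item. */
    void producer(void *q, const uint32_t *items, int n) {
        for (int i = 0; i < n; i++)
            haqu_benqw(q, items[i]);
    }

    /* Consumer: runs on another core, draining the same queue. */
    void consumer(void *q, int n) {
        for (int i = 0; i < n; i++)
            process(haqu_bdeqw(q));
    }

The non-blocking enq/deq forms instead return a status code in a destination register, which lets the caller fall back to other work when the queue is full or empty.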
3.3. Making Fine-Grained Queueing Efficient

So far, our design does not consider the interaction of queue operations with the cache coherence protocol or the on-chip caches. It is critical to do so in order to achieve high throughput. Consider the implications of the proposed timeline in Figure 2 on a state-of-the-art CMP used for fine-grained threading. First, if enqueueing and dequeueing operations are happening concurrently, then the head and tail variables of a queue are being read frequently by two different cores. This results in frequent transitions between the exclusive and shared states for the cache line holding each variable, which significantly lengthens the time needed to decide whether an enqueue or dequeue can proceed. Second, even when the threads are not operating concurrently, the number of queue operations that can occur per unit time is determined by the amount of ILP that can be extracted from these instructions. That parallelism is limited by the number of ports on the cache, the cache access time, and the degree of ILP the microarchitecture provides for enqueue and dequeue operations. Overall, these effects could drastically reduce the throughput of our mechanism. To improve its efficiency, we adopt a few key measures: relaxed memory ordering for queue operations, reduced accesses to shared variables, and pipelined queue operations.

3.3.1. Relaxed Memory Ordering for Queue Operations

To enhance the parallelism available in the microarchitecture, we assume a system with weak memory ordering, and we allow queueing instructions to be re-ordered dynamically as long as they operate on different queues. Since each queue operation actually involves multiple memory accesses, we are allowing more aggressive reordering of memory operations than would be available in a software implementation. This scenario is shown in Figure 4(a): the first enqueue and dequeue operations on Q1 must be properly ordered, otherwise FIFO ordering may not be preserved (indicated by the ordering arrows), but the operations on Q2 can be re-ordered with respect to those on Q1 (shown by the lack of arrows to the enqueue or dequeue on Q2). For queueing operations on different queues that must occur in a given order, a memory fence is necessary to guarantee their order.

Furthermore, the order of other memory operations relative to queueing operations is also relaxed. In Figure 4(b), a flag is set to 1 before the enqueue in Thread A, and the flag is read in Thread B after the corresponding dequeue. In this case, there is no guarantee that Thread B observes the value 1. However, this ordering can be enforced with a fence before the enqueue in Thread A and a fence after the dequeue in Thread B, as shown in Figure 4(c).

    (a) Relaxed queue-op ordering:
        Thread A: enqw x,Q1;  enqw y,Q1;  enqw z,Q2
        Thread B: deqw a,Q1;  deqw b,Q1;  deqw c,Q2
        (operations on the same queue are ordered; those on Q2 may be re-ordered with respect to Q1)

    (b) No ordering on the flag update (initial state: flag = 0):
        Thread A: flag = 1;   enqw x,Q1
        Thread B: deqw a,Q1;  temp = flag

    (c) Ordering the flag update (initial state: flag = 0):
        Thread A: flag = 1;   fence;  enqw x,Q1
        Thread B: deqw a,Q1;  fence;  temp = flag

Figure 4. Relaxed memory ordering for queue operations.
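In C-like form, the Figure 4(c) idiom looks as follows: a fence orders the ordinary flag write before the enqueue in Thread A, and a second fence orders the dequeue before the flag read in Thread B. The haqu_enqw/haqu_deqw/haqu_fence wrappers and their success-return convention are hypothetical, introduced only to make the ordering explicit.

    /* Sketch of the Figure 4(c) idiom: ordering an ordinary store/load around a
     * queue operation with explicit fences. Intrinsic names are hypothetical. */
    #include <stdint.h>

    extern int  haqu_enqw(void *qaddr, uint32_t value);   /* enqw: nonzero on success */
    extern int  haqu_deqw(void *qaddr, uint32_t *value);  /* deqw: nonzero on success */
    extern void haqu_fence(void);                         /* full memory fence        */

    uint32_t flag = 0;                      /* ordinary shared variable, as in Figure 4 */

    /* Thread A (producer side of Figure 4(c)). */
    void thread_A(void *Q1, uint32_t x) {
        flag = 1;
        haqu_fence();                       /* order the flag write before the enqueue */
        while (!haqu_enqw(Q1, x)) { }       /* retry while the queue is full */
    }

    /* Thread B (consumer side of Figure 4(c)). */
    void thread_B(void *Q1, uint32_t *a, uint32_t *temp) {
        while (!haqu_deqw(Q1, a)) { }       /* retry while the queue is empty */
        haqu_fence();                       /* order the dequeue before the flag read */
        *temp = flag;                       /* with the fences, the update to flag is visible */
    }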