Research Article  Open Access
Yibo Chen, Yu Wang, Yuan Xie, Andres Takach, "Parametric Yield-Driven Resource Binding in High-Level Synthesis with Multi
Parametric Yield-Driven Resource Binding in High-Level Synthesis with Multi-$V_{th}/V_{dd}$ Library and Device Sizing
Abstract
The ever-increasing chip power dissipation in SoCs has imposed great challenges on today’s circuit design. It has been shown that multiple threshold and supply voltage assignment (multi-$V_{th}/V_{dd}$) is an effective way to reduce power dissipation. However, most prior multi-$V_{th}/V_{dd}$ optimizations are performed under deterministic conditions. With increasing process variability, which has a significant impact on both the power dissipation and the performance of circuit designs, it is necessary to employ statistical approaches in analysis and optimization for low power. This paper studies the impact of process variations on the multi-$V_{th}/V_{dd}$ technique at the behavioral synthesis level. A multi-$V_{th}/V_{dd}$ resource library is characterized for delay and power variations at different voltage combinations. Meanwhile, device sizing is performed on the resources in the library to mitigate the impact of variation and to enlarge the design space for better design quality. A parametric yield-driven resource binding algorithm is then proposed, which uses the characterized power and delay distributions and efficiently maximizes power yield under a timing yield constraint. During the resource binding process, voltage level converters are inserted between resources when required. Experimental results show that significant power reduction can be achieved with the proposed variation-aware framework, compared with traditional worst-case-based deterministic approaches.
1. Introduction
Integrating billions of nanoscale transistors on a single chip has resulted in great challenges for chip designers. One of these challenges is that the pace of productivity gains has not kept up with the increases in design complexity. Consequently, we have seen a recent trend of moving design abstraction to a higher level, with an emphasis on electronic system level (ESL) design methodologies. A very important component of ESL is raising the level of abstraction of hardware design. High-level synthesis (HLS) provides this component by automating the generation of optimized hardware from a high-level description of the function or algorithm to be implemented. HLS generates a cycle-accurate specification at the register-transfer level (RTL) that is then used in existing ASIC or FPGA design methodologies. Commercial high-level synthesis tools [1] have recently gained a lot of attention, as evidenced by recent conference HLS workshops (DATE2008, DAC2008, and ASPDAC2009), conference panels, and publications that track the industry. While high-level synthesis is able to quickly generate implementations of circuits, it is not intended to replace the existing low-level synthesis. The major benefits of high-level synthesis are high design efficiency and the ability to perform fast prototyping, functional verification, and early-stage design space exploration, which in turn provide guidance for the succeeding low-level design steps and help produce high-quality circuits.
Power consumption and process variability are among the other critical design challenges as technology scales. Because it is believed that tackling these issues at a higher level of the design hierarchy can lead to better design decisions, a lot of work has been done on low-power high-level synthesis [2–4] as well as process-variation-aware high-level synthesis [5–8]. These techniques have been successfully implemented, but most of the existing work addresses only one of the two issues in isolation. Recently, Srivastava et al. [9] explored the multi-$V_{th}/V_{dd}$ design space with consideration of process variations at the gate level. Nevertheless, variation-aware low-power exploration for behavioral synthesis is still in its infancy.
Multiple threshold and supply voltage assignment (multi-$V_{th}/V_{dd}$) has been shown to be an effective way to reduce circuit power dissipation [2, 3, 10, 11]. Existing approaches assign circuit components on critical paths to operate at a higher $V_{dd}$ or lower $V_{th}$, and noncritical portions of the circuit are made to operate at a lower $V_{dd}$ or higher $V_{th}$, respectively. The total power consumption is thus reduced without degrading circuit performance. However, circuit performance is nowadays affected by process variations. If the variations are underestimated, for example, by using nominal delays of circuit components to guide the design, noncritical components may become critical due to the variations, and circuit timing constraints may be violated. On the other hand, in existing corner-based worst-case analysis, variations are overestimated, resulting in design specs that are hard to meet, which consequently increases design effort and degrades circuit performance.
Device sizing is a well-studied technique for performance and power tuning at the gate or circuit level [12]. To improve performance, upsizing a high-$V_{th}$ transistor, which increases switching power and die area, can be traded off against using a low-$V_{th}$ transistor, which increases leakage power. Therefore, treating multi-$V_{th}/V_{dd}$ assignment and device sizing as an integrated problem can increase the design flexibility and further improve the design quality. Meanwhile, in terms of mitigating process variations, increasing the size of transistors can reduce the randomness of the device parameters through averaging.
This paper presents a variation-aware power optimization framework in high-level synthesis using simultaneous multi-$V_{th}/V_{dd}$ assignment and device sizing. Firstly, the impact of parameter variations on the delay and power of circuit components is explored at different operating points of threshold and supply voltages. Device sizing is then performed to mitigate the impact of variations and to enlarge the design space for better design quality. A variation-characterized resource library, containing the parameters of delay and power distributions at different voltage “corners” and different device sizes, is built once for the given technology, so that it is available for high-level synthesis to query the delay/power characteristics of resources. The concept of parametric yield, which is defined as the probability that the design meets specified constraints such as delay or power constraints, is then introduced to guide design space exploration. Statistical timing and power analysis on the data flow graph (DFG) is used to propagate the delay and power distributions through the DFG and to estimate the overall performance and power yield of the entire design. A variation-aware resource binding algorithm is then proposed to maximize power yield under a timing yield constraint, by iteratively searching for the operations that have the maximum potential for performance/power yield improvement and replacing them with better candidates in the multi-$V_{th}/V_{dd}$ resource library. During the resource binding process, voltage level converters are inserted for chaining of resource units having different supplies.
The contributions of this paper can be summarized as follows:
(i) this is the first work to apply multi-$V_{th}/V_{dd}$ techniques during high-level synthesis in the context of both delay and power variations. A flow for variation-aware power optimization in multi-$V_{th}/V_{dd}$ HLS is proposed. This flow includes library characterization, statistical timing and power analysis methodologies for HLS, and resource binding optimization with a variation-characterized multi-$V_{th}/V_{dd}$ library;
(ii) combined multi-$V_{th}/V_{dd}$ assignment and device sizing for high-level synthesis are performed at the granularity of function units, to improve the design quality and at the same time reduce the design complexity;
(iii) voltage level conversion is explored during resource binding in high-level synthesis, enabling the full utilization of multi-$V_{dd}$ components for parametric yield maximization.
2. Related Work
Prior research work closely related to this paper mainly falls into three categories: gate-level power minimization by simultaneous multi-$V_{th}$ assignment and gate sizing; low-power high-level synthesis using multi-$V_{th}$ or multi-$V_{dd}$; and process-variation-aware high-level synthesis.
Several techniques were proposed to consider $V_{th}$ allocation and transistor sizing as an integrated problem [13–16]. Wei et al. [14] presented simultaneous dual-$V_{th}$ assignment and gate sizing to minimize the total power dissipation while maintaining high performance, while Karnik et al. [16] improved the simultaneous $V_{th}$ allocation and device sizing using a Lagrangian relaxation method. However, all of the reported techniques focus on tuning at the transistor level or gate level. While the fine granularity can yield optimal results, it also leads to high design complexity.
Shiue [2] proposed low-power scheduling schemes with multi-$V_{dd}$ resources by maximizing the utilization of resources operating at reduced supply voltages. Khouri and Jha [3] performed high-level synthesis using a dual-$V_{th}$ library for leakage power reduction. Tang et al. [4] formulated the synthesis problem using dual-$V_{th}$ as a maximum weight-independent set (MWIS) problem, within which near-optimal leakage power reduction is achieved with greatly reduced run time. Very recently, Insup et al. explored optimal register allocation for high-level synthesis using dual supply voltages [17]. However, all of these techniques were applied under deterministic conditions without taking process variation into consideration.
Process-variation-aware high-level synthesis has recently gained much attention. Jung and Kim [6] proposed a timing-yield-aware HLS algorithm to improve resource sharing and reduce overall latency. Lucas et al. [8] integrated timing-driven floorplanning into the variation-aware high-level design. Mohanty and Kougianos’s work [18] took into account the leakage power variations in low-power high-level synthesis; however, the major difference between [18] and our work is that the delay variation of function units was not considered in [18], so the timing analysis during synthesis was still deterministic. Recently, Wang et al. [19] proposed a joint design-time optimization and post-silicon tuning framework that tackles both timing and power variations. Adaptive body biasing (ABB) was applied to function units to reduce leakage power and improve power yield.
3. Multi-$V_{th}/V_{dd}$ Library Characterization under Process Variations
Scheduling and resource binding are key steps in the high-level synthesis process. The scheduler is in charge of sequencing the operations of a control/data flow graph (CDFG) into control steps and within control steps (operator chaining), obeying control and data dependencies and cycle constraints while optimizing for area, power, and performance. The binding process binds operations to hardware units in the resource library to complete the mapping from abstract descriptions of circuits to practical designs. This section presents the characterization of the variation-aware multi-$V_{th}/V_{dd}$ resource library, including the delay and power characterization flow and the selection of dual threshold and supply voltages.
3.1. Variation-Aware Library Characterization Flow
In order to facilitate design space exploration while considering process variations, the resource library of function units for HLS has to be characterized for delay/power variations. As shown in Figure 1, under the influence of process variations, the delay and power of each component are no longer fixed values but are represented by probability density functions (PDFs). Consequently, the characterization of function units with delay and power variations requires statistical analysis methodologies.
Process variations come from a set of sources, including random doping fluctuation (RDF) [20] and geometric variations of the gate (primarily in channel length) [21]. Since both RDF and channel length variations manifest themselves as fluctuations of the effective threshold voltage of the transistor [22], their effects can be expressed by the variations of $V_{th}$. Since this work focuses on demonstrating the effectiveness of variation-aware synthesis rather than on comprehensive modeling of all variation effects, we focus on $V_{th}$ variations under the simplifying assumption that they are normally distributed, rather than covering all physical-level variation factors with different distributions. The magnitude of $V_{th}$ variations in real circuits can be obtained via on-chip sensing and measurement. In this work, we use the NCSU FreePDK 45 nm technology library [23] for all the characterization and experiments. We set the standard deviation of $V_{th}$ to 50 mV, which is projected from the silicon measurement data in [24].
We then use a commercial gate-level statistical timing analysis tool, Synopsys PrimeTime VX, to perform the characterization. This variation-aware tool increases the accuracy of timing analysis by considering the statistical distribution of device parameters such as threshold voltage and gate oxide thickness. Given the distribution of a device parameter P, PrimeTime VX calculates the distribution of gate delay continuously throughout the range of values, using linear interpolation and extrapolation at the library-defined functional operating points, as shown in Figure 2. Validation against SPICE Monte Carlo statistical analysis [25] shows that PrimeTime VX analysis achieves similar accuracy while significantly reducing the run time.
The characterization flow takes as input the statistical distributions of process parameters ($V_{th}$ in this work) and generates the statistical distributions of delay and power for each resource in the library. To characterize the delay of function units under the impact of process variations, the following steps are performed:
(1) all the standard cells in a technology library are characterized using variation-aware analysis, and the results, including the parameters of the cell delay distributions, are collected to build a variation-aware technology library;
(2) the function units used in HLS are then synthesized and linked to the variation-aware technology library;
(3) statistical timing analysis for the function units is performed using PrimeTime VX, and the parameters of the delay distributions are reported.
Statistical power characterization for function units in the resource library can be done using Monte Carlo analysis in SPICE. The power consumption of function units consists of dynamic and leakage components. While dynamic power is relatively immune to process variation, leakage power is greatly affected and becomes dominant as technology continues to scale down [26]. Therefore, in this paper only leakage power is characterized using statistical analysis. However, this does not mean that dynamic power can be ignored. In fact, dynamic power optimization in high-level synthesis is a well-explored topic [1]. Our variation-oriented work, which emphasizes leakage power optimization, can be stacked on or integrated into existing power optimization approaches in high-level synthesis to further reduce the total power consumption of circuits.
According to the Berkeley short-channel BSIM4 model [27], a higher threshold voltage leads to an exponential reduction in leakage power. The subthreshold leakage current is given by
$$I_{leakage} = I_0 \exp\left(\frac{V_{gs} - V_{th}}{n V_t}\right)\left(1 - \exp\left(-\frac{V_{ds}}{V_t}\right)\right),$$
where $I_0$ is the gate leakage current, $V_{gs}$ and $V_{ds}$ are the gate-to-source and drain-to-source voltages, $V_t$ is the thermal voltage, and $n$ is the subthreshold swing factor. Since we assume that the device parameter $V_{th}$ follows a normal distribution, the leakage power follows a lognormal distribution. Therefore, the leakage power of a function unit is the sum of a set of lognormal distributions, which describe the leakage power of each library cell. According to [28], the sum of several lognormal random variables can be approximated by another lognormal random variable, as shown in (3):
$$e^{Y_{FU}} \approx \sum_{i=1}^{n} e^{Y_i}, \tag{3}$$
where $e^{Y_{FU}}$ describes the leakage power distribution of the function unit, while $e^{Y_1}, \ldots, e^{Y_n}$ are the corresponding variables for the library cells that build up the function unit, and each $Y$ is normally distributed. The mean and deviation of $Y_{FU}$ can be estimated via iterative moment matching from the leakage power distributions of the library cells [28].
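To make the lognormal approximation concrete, the following minimal sketch matches the first two moments of a sum of lognormal cell leakages in a single pass (the Fenton-Wilkinson style of moment matching). It is an illustration only, not the iterative procedure of [28]: it assumes the cells are statistically independent, and the function names and the cell parameters are hypothetical.

```python
import math

def lognormal_moments(mu, sigma):
    """Mean and variance of exp(N(mu, sigma^2))."""
    mean = math.exp(mu + 0.5 * sigma ** 2)
    var = (math.exp(sigma ** 2) - 1.0) * math.exp(2.0 * mu + sigma ** 2)
    return mean, var

def approximate_lognormal_sum(cells):
    """Fit one lognormal to a sum of independent lognormals.

    cells: list of (mu_i, sigma_i), the log-domain parameters of each
    cell's leakage power. Independence is an assumption of this sketch.
    Returns (mu, sigma) of the approximating lognormal.
    """
    total_mean = sum(lognormal_moments(m, s)[0] for m, s in cells)
    total_var = sum(lognormal_moments(m, s)[1] for m, s in cells)
    sigma_sq = math.log(1.0 + total_var / total_mean ** 2)
    mu = math.log(total_mean) - 0.5 * sigma_sq
    return mu, math.sqrt(sigma_sq)

# Hypothetical log-domain leakage parameters for three library cells.
cells = [(-2.0, 0.4), (-1.5, 0.5), (-2.2, 0.3)]
mu_fu, sigma_fu = approximate_lognormal_sum(cells)
print(mu_fu, sigma_fu)
```

By construction, the fitted lognormal has the same mean and variance as the sum of the cell distributions; higher moments are not matched.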
The power characterization flow is as follows. Process variations are set in the MOS model files, and 1000 Monte Carlo runs are performed for each library cell. After the characterization, the parameters of the leakage power distributions of the library cells are extracted.
Note that in our work we only characterize subthreshold leakage, since it becomes dominant for technology nodes of 45 nm and below. The gate leakage can also be characterized with similar methods.
3.2. Multi-$V_{th}/V_{dd}$ Library Characterization
Previous implementations using multiple threshold and supply voltages in conjunction have shown a very effective reduction in both dynamic and leakage power [11]. Therefore, our approach considers the combination of dual threshold and dual supply voltages, and characterizations are performed at the four “corners” of voltage settings, namely ($V_{th}^{L}$, $V_{dd}^{H}$), ($V_{th}^{H}$, $V_{dd}^{H}$), ($V_{th}^{L}$, $V_{dd}^{L}$), and ($V_{th}^{H}$, $V_{dd}^{L}$), where ($V_{th}^{L}$, $V_{dd}^{H}$) is the nominal case and the other three are low-power settings. Note that although only four voltage settings are discussed in this paper, it is natural to extend the approach presented here to deal with more voltage settings. To reduce the process technology cost, in this paper the multi-$V_{th}/V_{dd}$ techniques are applied at the granularity of function units. That is, all the gates inside a function unit operate at the same threshold and supply voltages, and voltages differ only from one function unit to another.
The selection of appropriate values of threshold and supply voltages for power minimization has been discussed under deterministic conditions [11]. Rules of thumb are derived in [11] for the second $V_{dd}$ and $V_{th}$ as functions of the original voltages and of the initial ratio between dynamic and static power. While the empirical models in [11] are validated on actual circuit benchmarks [29], they may not be accurate under the impact of process variations. A refined model taking into account the process variations is presented in [9]. As shown in Figure 3, the total power reduction with variation awareness is plotted under different combinations of threshold and supply voltages, and this guides the optimal value selection in this work.
The characterization results (which will be further discussed in Section 6) show that power reduction is always achieved at the cost of delay penalties. Moreover, larger delay variations are observed for slower units operating at high $V_{th}$ or low $V_{dd}$, which means a larger probability of timing violations when they are placed on near-critical paths. This further demonstrates the necessity of statistical analysis and parametric yield-driven optimization approaches.
3.3. Device Sizing for the Resource Library
Conventionally, device sizing is an effective technique to optimize CMOS circuits for dynamic power dissipation and performance. In this work, we show that device sizing may also be utilized to mitigate the impact of process variations. As previously mentioned, the sources of process variations mainly consist of random doping fluctuation (RDF) [20] and geometric variations (GVs). GVs affect the real $V_{th}$ through the drain-induced barrier lowering (DIBL) effect. Originally, RDF and GV have almost equal importance in determining the $V_{th}$ variance. As we propose to use low-$V_{dd}$ and high-$V_{th}$ resource units in the design, the difference between supply voltage and threshold voltage diminishes, and this reduces the DIBL effect. As a result, the uncertainty in $V_{th}$ arising from GV rapidly falls as $V_{dd}$ approaches $V_{th}$. On the other hand, the RDF-induced $V_{th}$ variation is independent of $V_{dd}$ changes and solely a function of channel area [30]. Therefore, $V_{th}$ variation resulting from RDF becomes dominant as $V_{dd}$ approaches $V_{th}$.
Due to the independent nature of RDF variations, it is possible to reduce their impact on circuit performance through averaging. Therefore, upsizing the device, which enlarges the channel area, can be an effective way to mitigate variability. According to [31], the $V_{th}$ variance resulting from RDF is roughly proportional to $(WL)^{-1}$, which means we can increase the transistor width, the channel length, or both. Conventional sizing approaches focus on tuning the transistor width for performance. In terms of process variability mitigation, the measurement data of $V_{th}$ variation for 4 different device sizes is plotted in Figure 4, which shows that increasing the transistor width is a more effective way to reduce the $V_{th}$ variance [24]. Although a larger transistor width means larger leakage power, the fluctuations in leakage power are reduced and the design space for resource binding is significantly enlarged, so using larger resources in the design may still improve the parametric yield.
In this work, we upsize all the function units in the resource library to generate alternatives for power tuning and variability mitigation. The sizing is performed on all the gates with two different settings: the basic size (1W1L) and the double-width size (2W1L). We then perform the variation characterization for the upsized function units under all four voltage “corners” presented in the previous section. The characterization results will be presented in Section 6.
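As a rough numerical illustration of the area dependence, the sketch below scales the standard deviation of the threshold voltage under the assumption that the variance is inversely proportional to channel area. Treating the whole 50 mV reference value as RDF-induced is a simplification made here for illustration, and the function name is hypothetical. This idealized model treats width and length symmetrically, so it does not capture the measured difference between width and length upsizing reported in [24].

```python
import math

def rdf_sigma_vth(sigma_ref_mv, width_mult, length_mult):
    """Scale sigma(Vth) assuming its variance is proportional to 1/(W*L).

    sigma_ref_mv is the standard deviation at the reference size (1W1L).
    """
    return sigma_ref_mv / math.sqrt(width_mult * length_mult)

sigma_1w1l = 50.0                              # mV, reference device
sigma_2w1l = rdf_sigma_vth(sigma_1w1l, 2, 1)   # double-width device
print(round(sigma_2w1l, 1))                    # about 35.4 mV
```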
4. Yield Analysis in Statistical High-Level Synthesis
In this section, a parametric yield analysis framework for statistical HLS is presented. We first show the necessity of statistical analysis with a simple motivational example and then demonstrate the statistical timing and power analysis for HLS as well as the modeling and integration of level converters for multi-$V_{dd}$ HLS.
4.1. Parametric Yield
To bring process-variation awareness to the high-level synthesis flow, we first introduce a new metric called parametric yield. The parametric yield is defined as the probability of the synthesized hardware meeting a specified constraint, $\mathrm{Yield} = P(Y \le Y_{\max})$, where $Y$ can be delay or power.
Figure 5 shows a motivational example of yield-aware analysis. Three resource units $R_1$, $R_2$, and $R_3$ have the same circuit implementation but operate at different supply or threshold voltages. Figure 5 shows the delay and power distributions for $R_1$, $R_2$, and $R_3$. In this case the mean power follows $R_1 > R_2 > R_3$, and the mean delay follows $R_1 < R_2 < R_3$, which is as expected since power reduction usually comes at the cost of increased delay. The clock cycle time and the power consumption constraint (e.g., the TDP (thermal design power) of most modern microprocessors) are also shown in the figure. If the variation is disregarded and nominal-case analysis is used, any of the resource units can be chosen since they all meet timing. In this case, $R_3$ would be chosen as it has the lowest power consumption. However, from a statistical point of view, $R_3$ has a low timing yield (approximately 50%) and is very likely to cause timing violations. In contrast, with corner-based worst-case analysis only $R_1$ can be chosen under the clock cycle time constraint (the worst-case delay of $R_2$ slightly violates the limit), whereas $R_1$ has a poor power yield. In fact, if we set a timing yield constraint instead of enforcing the worst-case delay limitation, $R_2$ can be chosen with a slight timing yield loss but a well-balanced delay and power tradeoff. Therefore, a yield-driven statistical approach is needed for exploring the design space to maximize one parametric yield under other parametric yield constraints.
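The comparison in this example can be reproduced numerically with a short sketch. It assumes normally distributed delays, and the delay means, standard deviations, clock period, and unit names are hypothetical values chosen only to mimic the situation described above, not data from Figure 5.

```python
import math

def normal_cdf(x, mean, std):
    """P(X <= x) for X ~ N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def parametric_yield(limit, mean, std):
    """Probability that a normally distributed metric meets its limit."""
    return normal_cdf(limit, mean, std)

t_clk = 1.00  # ns, hypothetical clock cycle time

# (mean delay, std of delay) in ns; hypothetical values.
units = {
    "fast_high_power": (0.80, 0.05),
    "balanced": (0.90, 0.05),
    "slow_low_power": (1.00, 0.06),
}

yields = {name: parametric_yield(t_clk, m, s) for name, (m, s) in units.items()}
for name, y in yields.items():
    print(f"{name}: timing yield = {y:.4f}")
```

With these numbers the slowest unit, whose mean delay equals the clock period, has a timing yield of exactly 50%, while the balanced unit sits two standard deviations below the limit and would pass a 95% timing yield constraint.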
4.2. Statistical Timing and Power Analysis for HLS
High-level synthesis (HLS) is the process of transforming a behavioral description into a register-transfer-level structural description. Operations such as additions and multiplications in the DFG are scheduled into control steps. During the resource allocation and binding stages, operations are bound to corresponding function units in the resource library that meet the type and latency requirements.
Given the clock cycle time $T_{clk}$, the timing yield of the entire DFG, $Y_{timing}$, is defined as
$$Y_{timing} = P(T_1 \le T_{clk}, T_2 \le T_{clk}, \ldots, T_n \le T_{clk}), \tag{5}$$
where $P(\cdot)$ is the probability function and $T_1, T_2, \ldots, T_n$ are the arrival time distributions at control steps $1, 2, \ldots, n$, respectively.
The arrival time distribution of each clock cycle can be computed from the delay distributions of the function units bound at that cycle. Two operations, sum and max, are used to compute the distributions:
(i) the sum operation is used when two function units are chained in cascade within a clock cycle, as shown in Figure 6. The total delay can be computed as the “sum” of their delay distributions (normal distribution assumed);
(ii) the max operation is used when the outputs of two or more units are fed to another function unit in the same clock cycle, as shown in Figure 6. The “maximum” delay distribution can be computed from the contributing distributions using tightness probability and moment matching [19].
Figure 6: (a) No level conversion. (b) Synchronous conversion. (c) Asynchronous conversion.
With these two operations, the arrival time distribution of each clock cycle is computed, and the overall timing yield of the DFG is obtained using (5).
The total power consumption of a DFG can be computed as the sum of the power consumptions of all the function units used in the DFG. Given a power limitation $P_{lim}$, the power yield of the DFG, $Y_{power}$, is computed as the probability that the total power $P_{total}$ is less than the requirement, as expressed in (6):
$$Y_{power} = P(P_{total} \le P_{lim}). \tag{6}$$
Since dynamic power is relatively immune to process variations, it is regarded as a constant portion which only affects the mean value of the total power consumption. Therefore, the total power is still normally distributed, although statistical analysis is only applied to the leakage power. As mentioned in Section 3, our proposed yield-driven statistical framework can be stacked on existing approaches for dynamic power optimization to further reduce the total power consumption of circuits.
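The two operations can be sketched as follows, using the standard moment-matching formulas for the maximum of two normal random variables. This is a minimal illustration under stated assumptions rather than the implementation of [19]: the two inputs are taken to be independent, the result of max is re-approximated as a normal distribution, and the function names and delay values are hypothetical.

```python
import math

def phi(x):
    """Standard normal probability density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stat_sum(a, b):
    """Chained units: means add and variances add (independence assumed)."""
    (m1, s1), (m2, s2) = a, b
    return m1 + m2, math.sqrt(s1 * s1 + s2 * s2)

def stat_max(a, b):
    """Max of two independent normals, matched to a normal by two moments."""
    (m1, s1), (m2, s2) = a, b
    theta = math.sqrt(s1 * s1 + s2 * s2)
    if theta == 0.0:
        return max(m1, m2), 0.0
    alpha = (m1 - m2) / theta
    tight = Phi(alpha)  # tightness probability: P(first input is larger)
    mean = m1 * tight + m2 * (1.0 - tight) + theta * phi(alpha)
    second = ((m1 * m1 + s1 * s1) * tight
              + (m2 * m2 + s2 * s2) * (1.0 - tight)
              + (m1 + m2) * theta * phi(alpha))
    return mean, math.sqrt(max(second - mean * mean, 0.0))

def step_yield(arrival, t_clk):
    """Probability that one control step's arrival time meets t_clk."""
    mean, std = arrival
    return Phi((t_clk - mean) / std)

# Hypothetical delays in ns: two units feed a third unit in the same cycle.
arrival = stat_sum(stat_max((0.40, 0.03), (0.45, 0.04)), (0.35, 0.03))
print(arrival, step_yield(arrival, 1.0))
```

Evaluating the joint probability in (5) across control steps additionally requires the correlation between steps, which this sketch does not model.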
4.3. Voltage Level Conversion in HLS
In designs using multi-$V_{dd}$ resource units, voltage level converters are required when a low-voltage resource unit drives a high-voltage resource unit. Level conversion can be performed either synchronously or asynchronously. Synchronous level conversion is usually embedded in flip-flops and occurs at the active clock edge, while asynchronous level converters can be inserted anywhere within the combinational logic block.
When process variations are considered, asynchronous level converters are even more favorable, because they are not bounded by clock edges, and timing slack can be passed through the converters. Therefore, time borrowing can happen between low-voltage and high-voltage resource units. As function units that are slow due to variations may get more time to finish execution, the timing yield can be improved, and the impact of process variations is consequently reduced.
While many fast and low-power level conversion circuits have been proposed recently, this paper uses the multi-$V_{th}$ level converter presented in [32], taking advantage of the fact that there is no extra process technology overhead for multi-$V_{th}$ level converters, since multi-$V_{th}$ is already deployed for function units. The level converter is composed of two dual-$V_{th}$ cascaded inverters. Its delay and power are then characterized in HSPICE using the listed parameters [32].
The delay penalty of a level converter can be accounted for by adding its delay to the delay of the function unit it is associated with. The power penalty can be addressed by counting the level converters used in the DFG and adding the corresponding power to the total power consumption.
5. Yield-Driven Power Optimization Algorithm
In this section, we propose our yield-driven power optimization framework based on the aforementioned statistical timing and power yield analysis. During the high-level synthesis design loop, resource binding selects the optimal resource instances in the resource library and binds them to the scheduled operations at each control step. A variation-aware resource binding algorithm is proposed to maximize power yield under a preset timing yield constraint, by iteratively searching for the operations with the maximum potential for timing/power yield improvement and replacing them with better candidates in the multi-$V_{th}/V_{dd}$ resource library.
5.1. Variation-Aware Resource Binding Algorithm Overview
Our variation-aware resource binding algorithm uses a search strategy called variable depth search [19, 33, 34] to iteratively improve the power yield under performance constraints. The outline of the algorithm is shown in Algorithm 1, where a DFG is initially scheduled and bound to the resource library with the nominal voltages ($V_{th}^{L}$, $V_{dd}^{H}$), that is, low threshold and high supply voltage. A lower-bound constraint on the timing yield is set, so that the probability that the design can operate at a given clock frequency will be larger than or equal to a preset threshold (e.g., 95%). In the algorithm, a move is defined as a local and incremental change to the resource bindings. As shown in the subroutine GENMOVE in Algorithm 1, the algorithm generates a set of moves and finds a sequence of moves that maximizes the accumulated gain, which is defined as a weighted combination of the timing and power yield improvements, with a weighting factor that balances the two. The optimal sequence of moves is then applied to the DFG, and the timing and power yields of the DFG are updated before the next iteration. The iterative search ends when there is no yield improvement or the timing yield constraint is violated.

Note that our worst-case resource binding algorithm uses the same search strategy (variable depth search) [19, 33, 34] as the variation-aware resource binding algorithm. The key difference is that, instead of iteratively improving the power yield under performance constraints, the worst-case resource binding algorithm iteratively reduces the power consumption under specified performance constraints, where both the power consumption and the performance constraints are specified as deterministic numbers, rather than using the concepts of power yield and performance yield.
5.2. Voltage Level Conversion Strategies
Moves during the iterative search may result in low-voltage resource units driving high-voltage resource units. Therefore, level conversion is needed during resource binding. However, if resources are selected and bound so that low-voltage resource units never drive high-voltage ones, level conversion will not be necessary, and the delay and power overheads brought by level converters can be avoided. This reduces the flexibility of resource binding for multivoltage module combinations and may consequently decrease the attainable yield improvement. The tradeoff in this conversion-avoidance strategy can be explored and evaluated within our proposed power optimization algorithm.
We also incorporate two other level conversion strategies in the power optimization algorithm for comparison. All three strategies are listed as follows:
(i) level conversion avoidance: resource binding is performed with the objective that low-voltage resources never drive high-voltage ones. As shown in Figure 6(a), no dark-to-light transition between operations is allowed (dark operations are bound to low-$V_{dd}$ units), so that level conversion is avoided. This is the most conservative strategy;
(ii) synchronous level conversion: voltage level conversion is done synchronously in level-converting flip-flops (SLCs). As shown in Figure 6(b), the dark-to-light transition only happens at the beginning of each clock cycle. The flip-flop structure proposed in [35] is claimed to have smaller delay than the combination of an asynchronous converter and a conventional flip-flop. However, as discussed previously, synchronous level conversion may reduce the flexibility of resource binding as well as the possibility of time borrowing. The effectiveness of this strategy is to be explored by the optimization algorithm;
(iii) asynchronous level conversion: asynchronous level converters (ALCs) are inserted wherever level conversion is needed, so a dark-to-light transition can happen anywhere in Figure 6. This aggressive strategy provides the maximum flexibility for resource binding and time borrowing. Although it brings in delay and power overhead, it still has great potential for timing yield improvement.
5.3. Moves Used in the Iterative Search
In order to fully explore the design space, three types of moves are used in the iterative search for resource binding:
(i) resource rebinding: in this move, an operation is assigned to a different function unit in the library with different timing and power characteristics. The key benefit of the multi-Vth/Vdd technique is that it provides an enlarged design space for exploration, so that better improvements are more likely to be obtained;
(ii) resource sharing: in this move, two operations that are originally bound to different function units are merged to share the same function unit. This type of move reduces the resource usage and consequently improves the power yield;
(iii) resource splitting: in this move, an operation that originally shared a function unit with other operations is split from the shared function unit. This type of move might lead to other moves such as resource rebinding and resource sharing.
After each move, the algorithm checks where the low-supply-voltage function units are used and decides whether to insert or remove level converters, according to the predefined level conversion strategy. If a move violates the strategy, it is revoked, and new moves are generated until a qualifying move is found.
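The move types and the revoke-and-retry rule can be sketched as follows. This is an illustrative sketch under simplifying assumptions, not the procedure from Algorithm 1: a binding is modelled as a plain dictionary from operation name to function unit instance name, and all function names (`rebind`, `share`, `split`, `violates_avoidance`, `next_valid_move`) are hypothetical.

```python
import random


def rebind(binding, op, new_unit):
    """Resource rebinding: bind op to a different function unit."""
    new = dict(binding)
    new[op] = new_unit
    return new


def share(binding, op_a, op_b):
    """Resource sharing: op_b joins the unit that op_a is bound to."""
    if binding[op_a] == binding[op_b]:
        raise ValueError("operations already share a unit")
    new = dict(binding)
    new[op_b] = binding[op_a]
    return new


def split(binding, op, fresh_unit):
    """Resource splitting: op leaves a shared unit for a fresh instance."""
    users = [o for o, u in binding.items() if u == binding[op]]
    if len(users) < 2:
        raise ValueError("operation does not share its unit")
    new = dict(binding)
    new[op] = fresh_unit
    return new


def violates_avoidance(binding, edges, low_units):
    """True if any low-voltage unit drives a high-voltage unit."""
    return any(binding[s] in low_units and binding[d] not in low_units
               for s, d in edges)


def next_valid_move(binding, moves, is_valid, rng, max_tries=100):
    """Draw candidate moves until one satisfies the conversion strategy.

    A move that violates the strategy is simply discarded (revoked),
    because the original binding is never modified in place.
    """
    for _ in range(max_tries):
        trial = rng.choice(moves)(binding)
        if is_valid(trial):
            return trial
    return None
```

Because each move returns a new dictionary, revoking a move costs nothing beyond discarding the trial binding; a production implementation would more likely apply and undo moves incrementally.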
5.4. Algorithm Analysis
It should be noted that, in the procedure GENMOVE shown in Algorithm 1, a move can still be accepted even though the returned gain might be negative. Because the algorithm considers the sequence of moves with a positive cumulative gain, individual negative gains help it escape from local minima through hill climbing.
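The role of negative gains can be illustrated with a generic variable-depth selection step. This sketch assumes a Kernighan-Lin style rule in which a tentative sequence of moves is generated and only the prefix with the best cumulative gain is committed; Algorithm 1 itself is not reproduced here, and the name `best_prefix` is hypothetical.

```python
from itertools import accumulate


def best_prefix(gains):
    """Pick how many leading moves of a tentative sequence to commit.

    gains[i] is the yield gain of the i-th move, possibly negative.
    Returns (k, total): commit the first k moves for a cumulative gain
    of total. k == 0 means no prefix has a positive cumulative gain.
    """
    best_k, best_total = 0, 0.0
    for k, total in enumerate(accumulate(gains), start=1):
        if total > best_total:
            best_k, best_total = k, total
    return best_k, best_total
```

For the gain sequence `[-1.0, 3.0, -0.5, 1.0]`, the cumulative gains are -1.0, 2.0, 1.5, and 2.5, so all four moves are committed even though two of them are individually harmful.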
As for the computational complexity, it is generally not possible to give nontrivial upper bounds of run time for local search algorithms [33]. However, for variable depth search in general graph partitioning, Aarts and Lenstra [33] found a nearoptimal growth rate of run time to be , where is the number of nodes in the graph. In our proposed algorithm, the timing and power yield evaluation, as well as the level converter insertion, are performed at each move. Since the yield can be updated using a gradient computation approach [19], the run time for each move is at most . Therefore, the overall run time for the proposed resource binding algorithm is .
6. Experimental Results
In this section, we present the experimental results of our variationaware power optimization framework for highlevel synthesis. The results show that our method can effectively improve the overall power yield of given designs and reduce the impact of process variations.
We first show the variation-aware delay and power characterization of function units. The characterization is based on the NCSU FreePDK 45 nm technology library [23]. The voltage corners for the characterization are set as , , , and . The characterization results for five function units, including two 16-bit adders (bkung and kogge), two 8-bit × 8-bit multipliers (pmult and booth), and one 16-bit multiplexer (mux21), are depicted in Figures 7, 8, 9, and 10. In the figures, the color bars show the nominal-case values, while the error bars show the deviations. The results clearly show that with lower Vdd and/or higher Vth, significant power reductions are achieved at the cost of a delay penalty. Meanwhile, upsizing the transistors improves circuit performance but also leads to larger power consumption. In terms of variability mitigation, both voltage scaling and device sizing have a significant impact on the delay and leakage variations. We can explore this trend further in Figures 11 and 12, where the delay and power distributions of the function unit bkung are sampled at a third of . The plotted curves show that the magnitude of delay variation increases for higher-Vth units, which means larger probabilities of timing violations if these high-Vth units are placed on near-critical paths. The figures also show that upsizing the device can effectively reduce the delay and leakage variations, as depicted by the error bars in Figures 11 and 12.
[Figures 7 to 10: characterization results of the function units. Each figure has two panels: (a) Vth = 0.37 V and (b) Vth = 0.56 V.]
With the variation-aware multi-Vth/Vdd resource library characterized, our proposed resource binding algorithm is applied to a set of industrial high-level synthesis benchmarks, which are listed in Table 1. A total power limit is set for each benchmark to evaluate the power yield improvement. The dynamic power consumption of function units is estimated by Synopsys Design Compiler with multi-Vth/Vdd technology libraries generated by Liberty NCX. In this work with FreePDK 45 nm technology, the dynamic power is about 2 times the mean leakage power. The power yield before and after the improvement is then computed using (6) in Section 4.2. The proposed resource binding algorithm is implemented in C++, and experiments are conducted on a Linux workstation with an Intel Xeon 3.2 GHz processor and 2 GB RAM. All the experiments run in less than 60 s of CPU time.
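Power yield under a total power limit can be read as the probability that dynamic plus leakage power stays within the limit. The sketch below estimates that probability by Monte Carlo sampling; it is a stand-in for illustration only and is not equation (6) of Section 4.2. It assumes a fixed dynamic power and independent lognormal leakage per function unit, and the function name and parameters are hypothetical.

```python
import random


def power_yield(dynamic_power, leakage_params, power_limit,
                samples=10000, seed=0):
    """Estimate P(dynamic + total leakage <= power_limit) by sampling.

    leakage_params holds one (mu, sigma) pair per function unit: the
    parameters of the normal distribution underlying an assumed
    lognormal leakage power. Independence between units is assumed.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(samples):
        leakage = sum(rng.lognormvariate(mu, sigma)
                      for mu, sigma in leakage_params)
        if dynamic_power + leakage <= power_limit:
            passed += 1
    return passed / samples
```

With a fixed seed the same leakage samples are drawn on every call, so the estimate is reproducible and never decreases when the power limit is relaxed.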

We compare our variation-aware resource binding algorithm against the traditional deterministic approach, which uses the worst-case () delay values of function units in the multi-Vth/Vdd library to guide the resource binding. For the deterministic approach, we leverage a commercial HLS tool, CatapultC, to obtain the delay/area/power estimation. The worst-case based approach naturally leads to 100% timing yield; however, the power yield is poor, as shown in the motivational example in Figure 5. In contrast, our yield-aware statistical optimization algorithm takes the delay and power distributions as inputs, explores the design space with the guidance of YieldGain, and iteratively improves the power yield under a slight timing yield loss. The comparison results are shown in Figures 13, 14, 15, and 16.
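The gap between worst-case and statistical timing can be illustrated on a chain of function units executed within one clock cycle. This sketch assumes independent Gaussian unit delays and a worst-case corner of mean plus `k` standard deviations, with `k = 3` chosen purely for illustration; neither assumption is taken from the characterization above, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def worst_case_delay(units, k=3.0):
    """Sum of per-unit corner delays, each taken as mu + k * sigma."""
    return sum(mu + k * sigma for mu, sigma in units)


def timing_yield(units, clock_period):
    """P(chain delay <= clock_period) for independent Gaussian delays.

    units is a list of (mu, sigma) pairs; the chain delay is then
    Gaussian with summed means and root-sum-square standard deviation.
    """
    mean = sum(mu for mu, _ in units)
    std = sqrt(sum(sigma ** 2 for _, sigma in units))
    return NormalDist(mean, std).cdf(clock_period)
```

For two units with delays (1.0, 0.1) and (2.0, 0.1), the worst-case sum is 3.6, while the statistical chain delay has a standard deviation of only about 0.14, so a clock period well below 3.6 already gives a timing yield close to 100%. This slack is what a yield-driven binding can trade for lower-power units.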
Figure 13 shows the power yield improvement against the worst-case delay based approach, with different level conversion strategies. A fixed timing yield constraint of 95% is set for the proposed variation-aware algorithm, using the function units with default device sizes (1W1L). The overheads of the level converters used in this paper are listed in Table 2. The usage of function units and level converters under the three listed conversion strategies (conversion avoidance, synchronous conversion, and asynchronous conversion) is listed in Table 3, in which “VddH FUs number” and “VddL FUs number” show the numbers of function units with high and low supply voltages, respectively, and “LCs number” counts the number of converters used in the design. The last column counts the total power overhead of the asynchronous level converters. The average power yield improvements for the three strategies are 11.7%, 17.9%, and 22.2%, respectively. From Figure 13 and Table 3 we can see that larger power yield improvements can be achieved when more low-Vdd function units are used in the design. The results also validate our claims in Section 4.3 and Section 5.2 that asynchronous level conversion is more favorable in statistical optimization, because it enables timing borrowing between function units and leads to a timing yield improvement that can compensate for the overhead of the converters. Therefore, compared to the synchronous case, more asynchronous converters are used while yielding better results.


Figure 14 shows the power yield improvement with the multi-Vth technique only, which means that only resource units with nominal supply voltage can be selected. In this case, no level conversion is needed, so there is no overhead for level converters. Only function units with default device sizes (1W1L) are used. The average power yield improvements against the worst-case delay based approach, under timing yield constraints of 99%, 95%, and 90%, are 5.7%, 7.9%, and 9.8%, respectively. At a timing yield of 95%, the average power yield improvement (7.9%) is smaller than in the LC-avoidance case (11.5%) in Figure 13, which shows that using multi-Vdd resource units can further improve the power yield.
Figure 15 shows the power yield improvement against the worst-case delay based approach, under different timing yield constraints. Asynchronous level conversion is chosen in this series of experiments. Only function units with default device sizes (1W1L) are used. The average power yield improvements under timing yield constraints of 99%, 95%, and 90% are 11.6%, 20.6%, and 26.9%, respectively. The results clearly show that the power yield improvement largely depends on how much timing yield loss is affordable for the design. This motivates further design space exploration for a well-balanced timing and power tradeoff.
Figure 16 shows the power yield improvement against the worst-case delay based approach, using function units with different device sizes. Asynchronous level conversion is chosen in this series of experiments, and a fixed timing yield constraint of 95% is set for the proposed variation-aware algorithm. Compared with the average 20.6% yield improvement in the case using the default device size (1W1L) only, using both default-size (1W1L) and double-size (2W1L) resources leads to an average power yield improvement of 30.9%. Upsized devices, with higher performance and smaller variability, provide additional flexibility for design space exploration; however, this is achieved at the cost of larger silicon area.
7. Conclusions
In this paper, we investigate the impact of process variations on multi-Vth/Vdd and device sizing techniques for low-power high-level synthesis. We characterize delay and power variations of function units under different threshold and supply voltages, and feed the variation-characterized resource library to the HLS design loop. Statistical timing and power analysis for high-level synthesis is then introduced, to help our proposed resource binding algorithm explore the design space and maximize the power yield of designs under given timing yield constraints. Experimental results show that significant power reduction can be achieved with the proposed variation-aware framework, compared with traditional worst-case based deterministic approaches.
Acknowledgment
This work was supported in part by NSF 0643902, 0903432, and 1017277, NSFC 60870001/61028006, and a grant from SRC.
References
[1] P. Coussy and A. Morawiec, High-Level Synthesis: From Algorithm to Digital Circuit, Springer, 2009.
[2] W. T. Shiue, “High level synthesis for peak power minimization using ILP,” in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors, pp. 103–112, July 2000.
[3] K. S. Khouri and N. K. Jha, “Leakage power analysis and reduction during behavioral synthesis,” IEEE Transactions on Very Large Scale Integration Systems, vol. 10, no. 6, pp. 876–885, 2002.
[4] X. Tang, H. Zhou, and P. Banerjee, “Leakage power optimization with dual-Vth library in high-level synthesis,” in Proceedings of the 42nd Design Automation Conference (DAC '05), pp. 202–207, June 2005.
[5] W. L. Hung, X. Wu, and Y. Xie, “Guaranteeing performance yield in high-level synthesis,” in Proceedings of the International Conference on Computer-Aided Design (ICCAD '06), pp. 303–309, November 2006.
[6] J. Jung and T. Kim, “Timing variation-aware high-level synthesis,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD '07), pp. 424–428, November 2007.
[7] F. Wang, G. Sun, and Y. Xie, “A variation aware high level synthesis framework,” in Proceedings of the Design, Automation and Test in Europe (DATE '08), pp. 1063–1068, March 2008.
[8] G. Lucas, S. Cromar, and D. Chen, “FastYield: Variation-aware, layout-driven simultaneous binding and module selection for performance yield optimization,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '09), pp. 61–66, January 2009.
[9] A. Srivastava, T. Kachru, and D. Sylvester, “Low-power-design space exploration considering process variation using robust optimization,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 1, pp. 67–78, 2007.
[10] K. Usami and M. Igarashi, “Low-power design methodology and applications utilizing dual supply voltages,” in Proceedings of the Design Automation Conference (ASP-DAC '00), pp. 123–128, Yokohama, Japan, 2000.
[11] A. Srivastava and D. Sylvester, “Minimizing total power by simultaneous Vdd/Vth assignment,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 23, no. 5, pp. 665–677, 2004.
[12] C. P. Chen, C. C. N. Chu, and D. F. Wong, “Fast and exact simultaneous gate and wire sizing by Lagrangian relaxation,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD '98), pp. 617–624, ACM, New York, NY, USA, 1998.
[13] S. Sirichotiyakul, T. Edwards, C. Oh et al., “Standby power minimization through simultaneous threshold voltage selection and circuit sizing,” in Proceedings of the 1999 36th Annual Design Automation Conference (DAC '99), pp. 436–441, June 1999.
[14] L. Wei, K. Roy, and C. K. Koh, “Power minimization by simultaneous dual-Vth assignment and gate-sizing,” in Proceedings of the 22nd Annual Custom Integrated Circuits Conference (CICC '00), pp. 413–416, May 2000.
[15] P. Pant, R. K. Roy, and A. Chatterjee, “Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits,” IEEE Transactions on Very Large Scale Integration Systems, vol. 9, no. 2, pp. 390–394, 2001.
[16] T. Karnik, Y. Ye, J. Tschanz et al., “Total power optimization by simultaneous dual-Vt allocation and device sizing in high performance microprocessors,” in Proceedings of the 39th Annual Design Automation Conference (DAC '02), pp. 486–491, June 2002.
[17] S. Insup, P. Seungwhun, and S. Youngsoo, “Register allocation for high-level synthesis using dual supply voltages,” in Proceedings of the 46th ACM/IEEE Design Automation Conference (DAC '09), pp. 937–942, July 2009.
[18] S. P. Mohanty and E. Kougianos, “Simultaneous power fluctuation and average power minimization during nano-CMOS behavioral synthesis,” in Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference on Embedded Systems (VLSID '07), pp. 577–582, January 2007.
[19] F. Wang, X. Wu, and Y. Xie, “Variability-driven module selection with joint design time optimization and post-silicon tuning,” in Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC '08), pp. 2–9, March 2008.
[20] R. W. Keyes, “Physical limits in digital electronics,” Proceedings of the IEEE, vol. 63, no. 5, pp. 740–767, 1975.
[21] D. S. Boning and S. Nassif, “Models of process variations in device and interconnect,” in Design of High Performance Microprocessor Circuits, IEEE Press, 2000.
[22] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, “Analysis and mitigation of variability in subthreshold design,” in Proceedings of the International Symposium on Low Power Electronics and Design, pp. 20–25, August 2005.
[23] NCSU, “45 nm FreePDK,” http://www.eda.ncsu.edu/wiki/FreePDK.
[24] M. Meterelliyoz, A. Goel, J. P. Kulkarni, and K. Roy, “Accurate characterization of random process variations using a robust low-voltage high-sensitivity sensor featuring replica-bias circuit,” in Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC '10), pp. 186–187, February 2010.
[25] C. Jacoboni and P. Lugli, The Monte Carlo Method for Semiconductor Device Simulation, Springer, 1990.
[26] N. S. Kim, T. Austin, D. Blaauw et al., “Leakage current: Moore's law meets static power,” Computer, vol. 36, no. 12, pp. 68–75, 2003.
[27] K. M. Cao, W. C. Lee, W. Liu et al., “BSIM4 gate leakage model including source-drain partition,” in Proceedings of the IEEE International Electron Devices Meeting, pp. 815–818, December 2000.
[28] N. C. Beaulieu, A. A. Abu-Dayya, and P. J. McLane, “Comparison of methods of computing lognormal sum distributions and outages for digital wireless applications,” in Proceedings of the IEEE International Conference on Communications, pp. 1270–1275, May 1994.
[29] S. H. Kulkarni, A. N. Srivastava, and D. Sylvester, “A new algorithm for improved VDD assignment in low power dual VDD systems,” in Proceedings of the 2004 International Symposium on Low Power Electronics and Design (ISLPED '04), pp. 200–205, August 2004.
[30] M. J. M. Pelgrom, A. C. J. Duinmaijer, and A. P. G. Welbers, “Matching properties of MOS transistors,” IEEE Journal of Solid-State Circuits, vol. 24, no. 5, pp. 1433–1440, 1989.
[31] J. Kwong and A. P. Chandrakasan, “Variation-driven device sizing for minimum energy subthreshold circuits,” in Proceedings of the 11th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '06), pp. 8–13, October 2006.
[32] S. A. Tawfik and V. Kursun, “Multi-Vth level conversion circuits for multi-VDD systems,” in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '07), pp. 1397–1400, May 2007.
[33] E. Aarts and J. K. Lenstra, Local Search in Combinatorial Optimization, Princeton University Press, 2003.
[34] A. Raghunathan and N. K. Jha, “Iterative improvement algorithm for low power data path synthesis,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 597–602, November 1995.
[35] F. Ishihara, F. Sheikh, and B. Nikolić, “Level conversion for dual-supply systems,” IEEE Transactions on Very Large Scale Integration Systems, vol. 12, no. 2, pp. 185–195, 2004.
Copyright
Copyright © 2012 Yibo Chen et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.