Low-power design techniques have been around for quite a while, but until recently many application domains treated them as nice-to-have optimizations rather than a tape-out requirement. The rapid growth of the consumer market for handheld devices and growing awareness of the environmental impact of power usage have changed this. Today, low-power design is not a feature; it is a requirement for gaining and keeping market share. The problem now is how to use all the techniques in the low-power design puzzle to minimize power consumption and still meet a design schedule.
Before looking at how Algorithmic Synthesis accelerates low-power design, let’s take a look at the spectrum of alternatives to achieve low power. Analysis and optimization of a design’s power footprint can start as early as the system-level specification or as late as the physical layout. While design changes to reduce power can occur at any stage of the design cycle, the effort required and the power reduction achieved are inversely related: the later the change, the more effort it takes and the less power it saves. Figure 1 illustrates this point.

Figure 1: Power Reduction Opportunities Across Stages of a Design Cycle
Figure 1 shows that the closer a design is to the gate level, the harder it is to make changes that will reduce power consumption. At the same time, the maximum possible reduction in power consumption decreases the closer a design is to the final gate-level implementation. Several reasons explain this phenomenon. The inverse relationship between effort and power savings can be summarized primarily by the following constraints at the RTL level:
- Once the RTL is functionally complete and in verification, touching the RTL for power savings is a complex and risky approach which can delay tape-out. Once a design is in verification, the most common type of RTL change has to do with functional correctness, not with design improvements.
- Opportunities to change the power profile of an algorithm once RTL is complete are very limited. Maybe a logic equation can be further optimized here or there, but it is too late to make architectural changes with real impact on the power profile. For example, changing the memory representation of data can have a large impact on power. The problem is that this change usually involves an algorithm change, which will not be taken up at the RTL level.
- Once the RTL is complete, it is typically too late in the design cycle to consider advanced design techniques such as clock gating. Without clock gating, a design can easily consume 50% more power than is needed for functional correctness.
Looking at Figure 1 and the constraints at the RTL level, it is clear that the solution for low power has to be applied earlier in the design cycle. The engineer has to move to a higher level of abstraction to gain the freedom to test algorithmic variations for their impact on the power profile. One way of increasing the level of abstraction is to move the design capture from RTL to C by using Algorithmic Synthesis (AS) tools.
AS refers to a class of hardware design tools that raise the level of abstraction for design capture from RTL to a programming language such as C. One of the key advantages of AS tools is that efficient hardware implementations are derived from untimed, sequential C algorithms. This allows the designer to focus on the algorithm while being shielded from the error-prone steps involved in writing and verifying RTL. While all AS tools offer an increased level of design abstraction compared to RTL, they do not all provide the same capabilities for power reduction and optimization of an algorithm. Among AS tools, PICO Extreme Power from Synfora is the first to automatically optimize power consumption at both the system and the architecture level by using a variety of techniques such as multi-level clock gate insertion. As shown by Figure 1, there are clear benefits to tackling power consumption at the system and architecture level instead of the transistor and layout levels.
Another conclusion that can be drawn from Figure 1 is that the higher the abstraction level for design capture, the faster it is to test and verify different power-saving strategies. With this in mind, the question is how a technique such as multi-level clock gating can be used efficiently on a design captured in an AS tool.
The answer to this question requires the explanation of a more basic concept. What is clock gating and what are its benefits? The basic premise of clock gating is that portions of a computational datapath can be turned on and off depending on dynamic processing requirements by shutting off sections of the clock tree network. While the concept is simple, its implementation is actually quite complex. Effective use of clock gating requires:
- Fine-grained knowledge about the schedule of datapath sections and blocks relative to other elements in the design. One common mistake with clock gating is to turn off a block or datapath section without taking into account the downstream effects of that decision, which leads to deadlocks.
- Increased verification effort and complexity to cover all the cases when a block may be inactive and turned off. The verification team also has to take into account the cases where the block is turned on again. Both the shutdown and startup of a clock gated element must be tested to occur only in a safe state of the circuit operation.
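To make the basic premise concrete, here is a minimal Python sketch that counts how many clock edges reach each block with and without gating. The block names and per-cycle activity traces are invented for illustration, and the result is only a count of clock edges, not a power estimate.

```python
# Hypothetical per-cycle activity traces: True means the block has work to do
# in that cycle. Block names and values are invented for illustration.
activity = {
    "fir_filter": [True, True, False, False, True, False, False, False],
    "dma_engine": [False, False, False, True, True, False, False, False],
}


def clock_edges(trace, gated):
    """Count the clock edges delivered to one block over the trace."""
    # Ungated: the block is clocked every cycle.
    # Gated: the block is clocked only in cycles where it is active.
    return sum(trace) if gated else len(trace)


def edges_saved_fraction(traces):
    """Fraction of block-level clock edges removed by gating idle cycles."""
    ungated = sum(clock_edges(t, gated=False) for t in traces.values())
    gated = sum(clock_edges(t, gated=True) for t in traces.values())
    return (ungated - gated) / ungated


print(edges_saved_fraction(activity))  # 0.6875 for the traces above
```

The sketch deliberately ignores the hard parts listed above: it assumes the activity of each block is known exactly and that stopping one block never stalls another.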
While clock gating has the potential to deliver significant power reduction in a given design, the complexity of verifying it keeps many traditional hand-written RTL flows from using it. An AS tool like PICO Extreme Power solves the problems associated with designing clock-gated hardware through automation. In the case of the PICO solution, the tool is in complete control of the RTL being generated. This means that PICO has complete knowledge of block inactive/active states, and of cross-block dependencies that affect the clock gating implementation. Without affecting how the user creates the design in the AS tool, automatic clock gating insertion happens at the following levels:
- Coarse-grain: Automatic startup and shutdown of large portions of a design from the top-level module. At this level, the AS tool has to guarantee both functional correctness and the correctness of the control logic associated with clock gating. The correctness of the clock gating has to be verified both statically and through simulation to give the user confidence in the solution.
- Fine-grain: Even if an entire block cannot be turned off, portions of that block can be. The AS tool should detect this possibility and create the appropriate control logic and the verification infrastructure to prove correct operation. One way of enabling fine-grain clock gating is through the use of multi-level hierarchical design using a TCAB design methodology. TCABs will be discussed in more detail in a follow-up posting.
In addition to inserting the clock gating circuits at different levels of the design hierarchy, the AS tool needs to verify the correct sequencing of all clock and clock enable signals. Without a verification component as part of any automated clock gating solution, the power savings achieved by this technique will be overshadowed by the manual effort of verifying the correctness of the circuit. As in a traditional hand-written RTL flow, clock gating is a powerful technique, but it will not be used if the verification burden is high.
Unlocking the low-power design puzzle requires a combination of techniques, which can be readily applied at the C algorithmic level. In addition to the classical approaches in AS tools such as architectural exploration and algorithmic changes, clock gating is an important tool in minimizing power consumption.





Just to add some more technical background on my previous comment:
Clock gates are typically added before or during synthesis. This is a good thing, but has two fundamental limitations. The first limitation is that automatic clock gating techniques can only identify a limited set of gating opportunities: basically, they find explicit recirculation muxes and replace them with a clock gate. No other gating opportunities are exploited, which leaves a lot of power savings on the table.
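As an illustration of the recirculation-mux replacement described above, the following Python sketch models a single register both ways and checks that the stored values match while the gated version sees fewer clock edges. The function names and the input data are assumptions made for this example; real tools perform the transformation on a netlist, not on a cycle loop like this.

```python
def run_recirculating_mux(data, enable, reset_value=0):
    """Register clocked every cycle; a mux feeds Q back when enable is low."""
    q, trace, edges = reset_value, [], 0
    for d, en in zip(data, enable):
        q = d if en else q  # recirculation mux: new data or the old value
        edges += 1          # the flop sees a clock edge in every cycle
        trace.append(q)
    return trace, edges


def run_clock_gated(data, enable, reset_value=0):
    """Same register with the mux removed and the clock gated by enable."""
    q, trace, edges = reset_value, [], 0
    for d, en in zip(data, enable):
        if en:              # the clock gate passes an edge only when enabled
            q = d
            edges += 1
        trace.append(q)     # with no edge, the flop simply holds its value
    return trace, edges


# Invented stimulus for illustration.
data = [5, 7, 9, 11, 13, 15]
enable = [True, False, False, True, False, True]

mux_trace, mux_edges = run_recirculating_mux(data, enable)
gated_trace, gated_edges = run_clock_gated(data, enable)
print(mux_trace == gated_trace, mux_edges, gated_edges)
```

Both versions hold the same value in every cycle, which is why this particular pattern is easy to find and replace automatically.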
The second limitation of clock gate insertion at RTL level or higher is that design timing is basically unknown, or only very vaguely known (remember: most delay is in the wires and the wires are unknown before placement). This means that the feasibility of the clock gate enable timing is basically unknown. This is not a minor point. The timing on a clock gate enable signal is always very problematic and is one of the key limiting factors on the maximum effectiveness of clock gating. Basically, you want the clock gate to be as high up the tree as possible for maximum power savings. But for timing closure it is better to push the clock gate as low down the clock tree as possible (closer to the FFs). Finding the optimal position for the clock gate is a classic engineering trade-off problem that cannot be solved at the RTL level.
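The trade-off can be sketched with a toy model in Python. Everything here is an assumption made for illustration: a balanced binary clock tree, one fixed delay per buffer, an enable signal launched from flops at the bottom of the tree, and no setup time on the gate. Real clock trees and timing analysis are far less regular.

```python
def gate_tradeoff(depth, level, period_ps, buffer_delay_ps, enable_logic_ps):
    """Toy model of one clock gate placed in front of a buffer at `level`.

    Level 0 is the root of a balanced binary tree with `depth` buffer levels.
    Returns (buffers_silenced, enable_slack_ps).
    """
    # Buffers in the subtree below the gate stop toggling when it is closed.
    buffers_silenced = 2 ** (depth - level) - 1
    # The enable is launched by flops at the bottom of the tree, so it leaves
    # after the full clock insertion delay plus the enable logic delay.
    enable_arrival = depth * buffer_delay_ps + enable_logic_ps
    # The next clock edge reaches the gate after only `level` buffer delays.
    next_edge_at_gate = period_ps + level * buffer_delay_ps
    return buffers_silenced, next_edge_at_gate - enable_arrival


# Invented numbers: 4 buffer levels, 1000 ps period, 100 ps per buffer,
# 700 ps of enable logic.
results = [(lvl, *gate_tradeoff(4, lvl, 1000, 100, 700)) for lvl in range(4)]
for lvl, silenced, slack in results:
    print(f"level {lvl}: {silenced} buffers silenced, slack {slack} ps")
```

In this model a gate at the root silences the most buffers but fails enable timing, while gates lower in the tree meet timing and silence fewer buffers. The actual numbers depend on the placed design.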
My point is that both of these limitations are overcome by adding and optimizing clock gates at the gate level during CTS. The reasons are:
(a.) many more clock gating opportunities become visible at the gate level that are not visible at the RTL level.
(b.) the optimal placement of clock gates can only be done during CTS because there is no clock tree before then (duh!), and there is not enough timing information before then to determine the feasibility of the gate.
There are more issues that have to do with power and activity, but that gets us into deeper waters than I have time for in this comment.
In summary: Clock gating is indeed important, but RTL clock gating is only half the story and the other half can only be done at the gate level during CTS.
Fernando is right that power saving opportunities must be exploited at all levels, and he is also correct that clock gating has proven to be one of the most practical and successful techniques.
Fernando is wrong, however, when he claims that automatic clock gating is not possible after RTL coding is done. Azuro’s PowerCentric clock tree synthesis product has the ability to add a significant number of clock gates at the gate level, during CTS.
The benefits of this technique are:
(1.) 20% to 40% additional clock power savings (verified by TSMC and included in their Ref Flow 10)
(2.) Complete formal equivalence with RTL
(3.) Gate enable paths meet timing because PowerCentric synthesizes the enable logic together with the CTS and places it too. So PowerCentric has full visibility into the placed-gate timing (clock and logic) and ensures its correctness.
Azuro’s advanced clock gating technology is one of the reasons PowerCentric has been adopted by 4 of the top 5 semiconductor vendors in the world.