DRAM and Memory System Trends

Steven Woo
Rambus Inc.

October 24, 2004
Moore’s Law Driving Performance

Performance driven by increasing clock speeds and functionality…

Increasing transistor counts need to be fed with more data…

…increasing demands being placed on memory systems

<table>
<thead>
<tr>
<th>Processor</th>
<th>Year</th>
<th>Clock Speed</th>
<th>Process</th>
<th>Transistors</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel® 4004</td>
<td>1971</td>
<td>108 kHz</td>
<td>10 μm</td>
<td>2.3K</td>
</tr>
<tr>
<td>Intel® 8086</td>
<td>1978</td>
<td>10 MHz</td>
<td>3 μm</td>
<td>29K</td>
</tr>
<tr>
<td>Intel® 80286</td>
<td>1982</td>
<td>12 MHz</td>
<td>1.5 μm</td>
<td>134K</td>
</tr>
<tr>
<td>Intel® 386™ DX</td>
<td>1989</td>
<td>33 MHz</td>
<td>1 μm</td>
<td>275K</td>
</tr>
<tr>
<td>Intel® 486™ DX</td>
<td>1991</td>
<td>50 MHz</td>
<td>0.8 μm</td>
<td>1.2M</td>
</tr>
<tr>
<td>Intel® Pentium®</td>
<td>1993</td>
<td>66 MHz</td>
<td>0.8 μm</td>
<td>3.1M</td>
</tr>
<tr>
<td>Intel® Pentium® II</td>
<td>1997</td>
<td>300 MHz</td>
<td>0.35 μm</td>
<td>7.5M</td>
</tr>
<tr>
<td>Intel® Pentium® III</td>
<td>1999</td>
<td>450 MHz</td>
<td>0.25 μm</td>
<td>9.5M</td>
</tr>
<tr>
<td>Intel® Pentium® 4</td>
<td>2000</td>
<td>1.5 GHz</td>
<td>0.18 μm</td>
<td>42M</td>
</tr>
<tr>
<td>Intel® Pentium® 4 HT</td>
<td>2002</td>
<td>3 GHz</td>
<td>0.13 μm</td>
<td>55M</td>
</tr>
</tbody>
</table>

• Amdahl’s Law
  • System performance limited by slowest component
• The Processor-Memory Performance Gap

Memory system is becoming more of a limiting factor in overall system performance
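
As a rough illustration of Amdahl's Law, the sketch below computes the overall speedup when only part of the run time benefits from a faster component. The function name and the 50% memory-stall split are assumptions chosen for illustration, not figures from the talk.

```python
def amdahl_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """Overall speedup when only `accelerated_fraction` of run time gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if component_speedup <= 0:
        raise ValueError("component_speedup must be positive")
    unaffected = 1.0 - accelerated_fraction
    return 1.0 / (unaffected + accelerated_fraction / component_speedup)


# Hypothetical workload: half the time is CPU-bound, half is spent waiting on memory.
# Only the CPU-bound half speeds up as the processor gets faster.
for cpu_speedup in (2, 10, 1000):
    print(cpu_speedup, round(amdahl_speedup(0.5, cpu_speedup), 3))
```

In this hypothetical split, even an arbitrarily fast CPU leaves the overall speedup below 2x, because the memory-bound half of the run time is untouched.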
What Makes A Good Memory System?

Basic PC Architecture (diagram)

- CPU to Memory Controller: Front Side Bus
- Memory Controller to Memory: Memory Bus
- Memory Controller to Graphics: PCI Express / AGP
- I/O Controller to Disk: SATA / IDE
What Makes A Good Memory System?

- Low latency
- High bandwidth
- High capacity
- High bank count

High DRAM bandwidth keeps CPU and graphics engine pipelines filled

High memory capacity reduces interactions with slow disks

Low DRAM latency reduces the amount of time the CPU is stalled waiting for data

High bank count aids concurrent execution of multiple threads
Characteristics of a Good Memory System

- **Low latency**
  - Latencies (from CPU’s point of view) are rising
  - Increasingly hard to fix; CPU+controller integration helps
  - Expect more features that leverage bandwidth

- **High bandwidth**
  - Physical constraints, signal integrity are limiters
  - Changes to DRAMs and bus topologies occurring

- **High memory system capacity**
  - Application data set sizes are growing
  - Signal integrity with multiple modules becoming more difficult

- **High memory system bank count**
  - Threads occupy banks, too few banks leads to contention
  - Contention increases latency and reduces bandwidth
  - Important for multi-threaded and multi-core systems
Memory Latency Trends

- Memory latency (in CPU clocks) increasing
  - CPU can stall hundreds of cycles waiting for data
  - Getting more difficult to hide latency
  - Contention increases latency

**DRAM latency dramatically increasing from CPU’s point of view, total memory system latency even higher**

<table>
<thead>
<tr>
<th>Year</th>
<th>CPU Frequency (MHz)</th>
<th>DRAM Access Time (ns)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1990</td>
<td>25</td>
<td>125</td>
</tr>
<tr>
<td>2004</td>
<td>3800</td>
<td>30</td>
</tr>
</tbody>
</table>

- CPU Frequency: 152x increase
- DRAM Access Time: 4.2x decrease
- DRAM Access Time in CPU clocks: 28.5x increase
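
To make the "in CPU clocks" view concrete, the sketch below converts the two table rows into clock cycles. The helper name is an assumption. Note that these two endpoints give a ratio of roughly 36x, so the 28.5x figure quoted on the slide presumably comes from slightly different underlying data points.

```python
def access_time_in_clocks(cpu_mhz: float, access_ns: float) -> float:
    """Convert a DRAM access time in nanoseconds to CPU clock cycles."""
    # cycles = frequency * time; MHz * ns = 1e6 * 1e-9 = 1e-3
    return cpu_mhz * access_ns / 1000.0


clocks_1990 = access_time_in_clocks(25, 125)   # 25 MHz CPU, 125 ns DRAM
clocks_2004 = access_time_in_clocks(3800, 30)  # 3800 MHz CPU, 30 ns DRAM
growth = clocks_2004 / clocks_1990

print(clocks_1990, clocks_2004, round(growth, 2))
```

The DRAM itself became about four times faster in absolute terms, yet each access costs the CPU far more clock cycles than it did in 1990.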
Addressing Memory Latency

“Bandwidth problems can be cured with money. Latency problems are harder because the speed of light is fixed – you can’t bribe God.” -- Anonymous

- Physical distances remaining roughly fixed
  - Electrical distance increases as CPU clock speeds increase
- Reducing DRAM access latency comes at a price
  - Increasing storage cell capacitance reduces latency...
  - ...but increases storage cell area, reducing storage capacity
- DRAMs must provide high density, inexpensive storage

Charge in bit cell drives bit line for detection by sense amp

More charge drives bit line more quickly, reducing latency. But this also requires a larger bit cell.

**DRAMs must balance access latency and storage density**
Integration of Mem Controller Into CPU Offers Large Latency Reduction

- Latency between CPU and mem controller eliminated
  - Memory controller runs at CPU clock speed
  - Shifts decision on memory type forward in design cycle
  - Benefit is large -- AMD, Compaq, others integrating memory controllers into CPU

Compaq Alpha 21364 (EV7)

On-chip memory controllers support 10 RDRAM channels per CPU
Rising Front Side Bus Speeds Fuel the Need for Increased Memory Bandwidth

<table>
<thead>
<tr>
<th>Processor</th>
<th>Year</th>
<th>Front Side Bus Speed</th>
<th>Bandwidth</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel® Pentium® II</td>
<td>1998</td>
<td>66 MHz</td>
<td>533 MB/sec</td>
</tr>
<tr>
<td>Intel® Pentium® III</td>
<td>1999</td>
<td>133 MHz</td>
<td>1.1 GB/sec</td>
</tr>
<tr>
<td>Intel® Pentium® 4</td>
<td>2000</td>
<td>400 MHz</td>
<td>3.2 GB/sec</td>
</tr>
<tr>
<td>Intel® Pentium® 4 HT</td>
<td>2002</td>
<td>533 MHz</td>
<td>4.2 GB/sec</td>
</tr>
<tr>
<td>Intel® Pentium® 4 HT</td>
<td>2003</td>
<td>800 MHz</td>
<td>6.4 GB/sec</td>
</tr>
</tbody>
</table>
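
The bandwidth figures in the table follow from the bus speed times the bus width. The sketch below reproduces them under two assumptions that the slide does not state: a 64-bit (8-byte) data path, and nominal rates such as 66.67 MHz behind the rounded "66 MHz" label.

```python
BUS_WIDTH_BYTES = 8  # assumption: 64-bit front side bus data path


def peak_bandwidth_mb_per_s(transfer_rate_mhz: float,
                            width_bytes: int = BUS_WIDTH_BYTES) -> float:
    """Peak bandwidth in MB/sec for a bus moving `width_bytes` per transfer."""
    return transfer_rate_mhz * width_bytes


# Nominal rates assumed for the table's rounded 66, 133, 400, 533 and 800 MHz entries.
for rate_mhz in (66.67, 133.33, 400.0, 533.33, 800.0):
    print(rate_mhz, round(peak_bandwidth_mb_per_s(rate_mhz)))
```

Because the width is fixed in this sketch, every increase in bandwidth has to come from a faster transfer rate, which is the pressure the following slides describe.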
Ways to Increase Memory Bandwidth

• Wider memory bus

• Faster per-pin signaling

Physical constraints and signal integrity being pushed to their limits in current memory systems
Physical constraints limit bus widths. Increasing memory bandwidth requires increased per-pin bandwidths.

128b DDR interface
65mm chipset perimeter

Asus P4C800, Intel 875 Chipset

Trace length matching difficult for wide buses
Closing the Processor-Memory Performance Gap With Faster Signaling

Increasing data transfer rates helps close the Processor-Memory Performance Gap...

...but this comes at a price

Signal integrity is limiting performance in traditional computer systems
ISMM 2004 Keynote

Signaling Speed Limited by Capacitance

At 100 Mbps signaling rates, must be able to bring a wire high and low in 10ns

Capacitance determines how quickly a signal can rise and fall

At 400 Mbps signaling rates, must be able to bring a wire high and low in 2.5ns

Excess capacitance prevents data from being driven high and low at the desired signaling rate

At higher speeds, multi-drop bus topology needs reduced capacitance to meet reduced timing windows. Fewer modules reduces capacitance, but also reduces capacity.
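
The timing windows quoted above are simply the reciprocal of the signaling rate. The sketch below computes them and compares them against a lumped RC rise-time estimate. The RC model, the 50 ohm drive impedance, and the 5 pF per-module load are illustrative assumptions, not values from the talk.

```python
def bit_time_ns(rate_mbps: float) -> float:
    """Time available to drive one bit, in nanoseconds."""
    return 1000.0 / rate_mbps


def rc_rise_time_ns(resistance_ohms: float, capacitance_pf: float) -> float:
    """Approximate 10%-90% rise time of a single-pole RC load (2.2 * R * C)."""
    # ohm * pF = 1e-12 s = 1e-3 ns
    return 2.2 * resistance_ohms * capacitance_pf * 1e-3


DRIVE_OHMS = 50.0      # hypothetical driver/line impedance
LOAD_PF_PER_MODULE = 5.0  # hypothetical capacitive load added by each module

for rate in (100, 400):
    print(rate, "Mbps ->", bit_time_ns(rate), "ns per bit")

for modules in (4, 2):
    rise = rc_rise_time_ns(DRIVE_OHMS, modules * LOAD_PF_PER_MODULE)
    print(modules, "modules ->", round(rise, 2), "ns rise time")
```

With these hypothetical values, four modules leave almost no margin inside a 2.5 ns bit time, while two modules roughly halve the rise time at the cost of capacity.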
Reflections Limit Signal Integrity

- Initial signal is sent from the memory controller
- Energy from the signal travels down both paths in the multi-drop bus topology
- Energy reflects back from the DRAM…
- …and travels back down the multi-drop bus
- Another signal is sent from the memory controller

Reflections cause interference with subsequent signals, impacting signal integrity.

Termination resistor eliminates reflections from the end of the bus.

Many modules can be supported with low-speed signaling...

...fewer modules can be supported with high-speed signaling.

As signaling speeds increase, signal integrity can be maintained only with fewer modules.
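
The role of the termination resistor can be illustrated with the standard transmission-line reflection coefficient, (Z_load - Z_line) / (Z_load + Z_line). The sketch below is a minimal example; the 50 ohm line impedance is an assumed value, not one given in the talk.

```python
import math


def reflection_coefficient(load_ohms: float, line_ohms: float = 50.0) -> float:
    """Fraction of the incident voltage wave reflected at the end of a line."""
    if math.isinf(load_ohms):
        return 1.0  # open circuit: the full wave reflects back
    return (load_ohms - line_ohms) / (load_ohms + line_ohms)


print(reflection_coefficient(math.inf))  # unterminated end of the bus
print(reflection_coefficient(50.0))      # terminator matched to the line
print(reflection_coefficient(75.0))      # mismatched terminator
```

A terminator matched to the line impedance gives a coefficient of zero, which is why it removes the reflection from the end of the bus. Stubs and connectors along a multi-drop bus still introduce their own mismatches.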
Signal Quality at DRAM Changes With Signaling Speed and Number of Modules

DDR signaling with 4 connectors at 200 Mbps (eye diagram showing signal quality at the receiver)

- Large separation between high and low voltages enables receiver to easily distinguish 0’s and 1’s
- Large separation between edges gives good timing margin
- Reflections increase voltage uncertainty
- Trace length mismatches increase timing uncertainty
- Large, open data eye ensures data integrity

DDR signaling with 4 connectors at 400 Mbps

- Signaling twice as fast results in twice as many data eyes per unit of time.
- Data eyes almost closed, data errors at receiver very likely to occur.

DDR signaling with 2 connectors at 400 Mbps

- Removing two modules opens data eyes substantially.
Signal Integrity Issues Limiting Signaling Speeds and Number of Memory Modules

- At higher signaling speeds, signal integrity degrades, fewer modules can be supported.
- Memory capacity (in terms of number of DRAMs) is shrinking.
- Server performance depends heavily on memory capacity.

Need to increase signal integrity and number of modules at the same time.
Improving Signal Integrity With On-Die Termination

Termination resistors placed inside each DRAM

- Signal reflections handled with on-die terminators
- Improved signal quality enables higher speed signaling

On-die termination becoming standard in DRAMs
Coming to a Server Near You: The Advanced Memory Buffer (AMB)

- Change in bus topology with AMB
  - Daisy chained AMBs increase number of modules
  - Unidirectional differential point-to-point signaling increases signal integrity, enables higher speeds

AMB = Advanced Memory Buffer

Unidirectional point-to-point links between AMBs are simple and have no tees

High-speed serial interface using differential signaling can achieve transfer rates much higher than wide buses

Point-to-point topologies offer a better signaling environment. Many interfaces are going serial (narrowing) to improve signal integrity and increase signaling speed.
Benefits of the Fully Buffered DIMM

Use of the AMB increases signal integrity, but adds a small amount of latency compared with no AMB...

...but for many apps the increased capacity is more important than memory system latency

Point-to-point topology enhances signal integrity, enabling higher capacities and faster communication speeds. FB-DIMM coming in 2006.
Memory Latency and Bandwidth are not Independent

- As memory bandwidth increases, so does latency
- Latency under load critical to system performance
- Increasing functionality will drive bandwidths higher

*Increased memory system bank count reduces contention*
Memory System Banks and Ranks

- DRAM bits partitioned into banks
  - DRAM interface connects the DRAM core to the memory bus
  - The DRAM core holds the data bits. Bits are grouped into banks.

- On a memory module, multiple DRAMs may respond to each memory request
  - A rank is the group of DRAMs that responds to a memory request

Why Memory Systems Need More Banks

- Bank conflicts reduce performance
  - Back-to-back accesses to the same bank can incur a latency penalty (also reduces effective bandwidth)
- More banks reduce the chances of bank conflicts

Bank count can be increased by adding modules

The banks in each rank are independent. 4 ranks each with 4 banks = 16 total banks in memory system.

Need to support many ranks (modules) for good performance. Signal integrity is critical.
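
The bank arithmetic above, and why more banks help, can be sketched as follows. The conflict estimate assumes two independent accesses spread uniformly across all banks, which is a simplification introduced here; real access patterns depend on address mapping and workload.

```python
def total_banks(ranks: int, banks_per_rank: int) -> int:
    """Independent banks in the memory system."""
    return ranks * banks_per_rank


def same_bank_probability(num_banks: int) -> float:
    """Chance that two independent, uniformly distributed accesses hit the same bank."""
    return 1.0 / num_banks


for ranks in (1, 2, 4):
    banks = total_banks(ranks, banks_per_rank=4)
    print(ranks, "ranks ->", banks, "banks, conflict chance",
          same_bank_probability(banks))
```

Under this simplified model, going from one rank to four ranks of 4-bank DRAMs cuts the chance of a back-to-back bank conflict from 25% to 6.25%.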
DRAM Capacity and Banks

- High capacity needed for good performance
  - Important to support many modules
  - ITRS roadmap: DRAM capacity will double every ~3 years
- Upcoming transition to 1Gb generation
  - DRAM bank count increasing from 4 to 8
  - More banks per rank, more banks in memory system
  - Benefits latency under load and effective bandwidth
- High bank count needed for future systems
  - Threads occupy banks, contention if too few banks
  - Architecture-aware mapping of data improves performance
Memory System Trends and Issues

- **Memory latency**
  - Latencies (from CPU’s point of view) are rising, hard to reduce
  - Low access latency and high bit cell density at odds
  - CPU+memory controller integration helps

- **Memory bandwidth**
  - Expect more features that drive bandwidth in the future
  - Physical constraints, signal integrity are limiters
  - Changes to DRAMs, bus topologies, signaling occurring

- **Memory capacity**
  - Application data set sizes are growing
  - Signaling and bus topology changes will improve capacity

- **Memory system bank count**
  - Important for multi-threaded and multi-core systems
  - Signaling, bus topology, and DRAM architecture changes help
  - Architecture-aware mapping of data improves performance
DRAMs and Memory Systems Must Address Emerging Trends

- **Cost reduction**
  - Critical for consumer devices with long lifetimes

- **Thermal dissipation**
  - High bandwidth DRAMs pushing cooling solutions

- **Power consumption**
  - Mobile/handheld computing needs low operating and standby power
Cost Reduction Critical in Some Consumer Markets

- High per-pin signaling rate means fewer pins required, enabling cost reduction in game consoles

PlayStation 2 (2000)  
PlayStation 2 (2004)

DRAM shrink for cost reduction

Fewer pins needed means chips are not pad-limited, allowing aggressive cost reduction through die shrinks
High Performance DRAMs Pushing Thermal Limits

High performance DRAMs

Graphics Processor

Fan+heatsink cools graphics processor and DRAMs. Some graphics cards using heat pipes over the DRAMs.

Trend will continue in future high performance memory systems.
Reduced Power Consumption Critical for Mobile / Handheld Markets

- Lower Vdd for lower operating power
  - SDRAM (~1996): 3.3 V
  - DDR2 (~2004): 1.8 V
  - Mobile SDRAM (~2004): 1.8 V
- Standby/Powerdown power extremely important
  - Partial Array Self-Refresh: Refresh only a portion of the bits in the DRAM core
  - Temperature Compensated Self-Refresh: Use on-die circuitry to adjust self-refresh rate
  - Deep Powerdown: Do not refresh data in DRAM core
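
The benefit of a lower Vdd can be estimated with the usual CMOS dynamic power relation, in which power scales with the square of the supply voltage. The sketch below assumes switched capacitance and frequency stay constant, which the slide does not claim, so treat the result as a first-order estimate only.

```python
def relative_dynamic_power(vdd_new: float, vdd_old: float) -> float:
    """Dynamic power at vdd_new relative to vdd_old, assuming P ~ C * V^2 * f
    with capacitance C and frequency f held constant."""
    return (vdd_new / vdd_old) ** 2


# SDRAM at 3.3 V versus DDR2 or Mobile SDRAM at 1.8 V (voltages from the list above).
ratio = relative_dynamic_power(1.8, 3.3)
print(round(ratio, 3))
```

Under these assumptions, the move from 3.3 V to 1.8 V cuts dynamic power to roughly 30% of its original value. Standby power is governed by refresh behavior instead, which is what the self-refresh and powerdown modes target.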
Challenges for Future DRAM Technologies

- Increasing signaling rates
- Shrinking system form factors
Signaling Rates Limited by Signaling Topologies

Point-to-point topologies have lower attenuation, making them more amenable to high-speed signaling.

Multi-drop bus topologies have larger signal attenuation at higher speeds.

Point-to-point signaling a strong candidate to replace multi-drop buses in PC memory systems.
Differential Point-to-Point Signaling for High Speed Communication

Differential point-to-point signaling needed for very high speed, low swing signaling. AMB will use differential point-to-point signaling.

- DDR signaling with 2 connectors at 800 Mbps
- XDR differential point-to-point signaling at 3.2 Gbps
Equalization Enables Even Higher Signaling Speeds

- Equalization pre-processes signals to ensure good signal quality at the receiver

Differential Point-to-Point Signaling at 8 Gbps

No Equalization

Two-tap Equalization
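
One common way to pre-process a signal is a two-tap transmit FIR filter that subtracts a scaled copy of the previous bit. The sketch below is a toy model under stated assumptions: the channel response and tap weights are hypothetical, and the slide does not say which equalizer structure was actually used.

```python
def convolve(a, b):
    """Discrete convolution of two short sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, a_val in enumerate(a):
        for j, b_val in enumerate(b):
            out[i + j] += a_val * b_val
    return out


# Hypothetical channel: each bit leaks half its amplitude into the next bit time.
channel = [1.0, 0.5]
# Two-tap transmit equalizer: main tap plus one inverted post-cursor tap.
equalizer_taps = [1.0, -0.5]

unequalized = convolve(channel, [1.0])
equalized = convolve(channel, equalizer_taps)

print(unequalized)  # [1.0, 0.5]
print(equalized)    # [1.0, 0.0, -0.25]
```

In this toy model the equalizer cancels the interference in the bit time immediately after the pulse and leaves a smaller residual one bit later, which corresponds to a more open data eye at the receiver.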
Shrinking Form Factors Driving Innovation in Packaging

- Computing environments are shrinking
  - Yesterday’s desktop is today’s handheld
- System in Package (SiP), stacked dies enable smaller form factors, better signal integrity

Stacked dies offer smaller form factors, but are difficult to re-work

Stacked packages leverage existing infrastructure, address known good die issue

Problems
- Known Good Die issue reduces yield
- Thermal Issues

http://www.tessera.com/technologies/products/z_mcp/ball_stacked.htm
Summary

- Latency continues to be difficult to reduce
- Physical limitations challenging increases in signaling speeds
- Different market segments with different needs evolving
- DRAMs and memory systems evolving to meet these challenges