Session 14 – TAPA II

Technology/Circuits Focus Session - Embedded Memory

Thursday, June 14, 3:25 p.m.

Chairpersons: L. Cheng, Oracle

M. Yamaoka, Hitachi America, Ltd.

14.1 - 3:25 p.m.

Isolated Preset Architecture for a 32nm SOI embedded DRAM Macro, J. Barth, D. Plass, A. Vehabovic, R. Joshi*, R. Kanj*, S. Burns, T. Weaver, IBM Systems and Technology Group, *IBM Research

The Isolated Preset Architecture (IPA) improves retention characteristics by implementing a weak read ‘1’ isolation scheme, allowing a lower stored ‘1’ level to be sensed. The architecture also reduces sub-array area by 15% and bit-line activation power by 2x compared to the previous design, without impacting performance. The architecture was implemented in IBM’s 32nm High-K/Metal SOI embedded DRAM technology. Hardware results confirm a 1.8ns random cycle and 2x improved retention characteristics with optimized analog reference tuning.

14.2 - 3:50 p.m.

A 260mV L-shaped 7T SRAM with Bit-Line (BL) Swing Expansion Schemes Based on Boosted BL, Asymmetric-V_TH Read-Port, and Offset Cell VDD Biasing Techniques, M.-P. Chen, L.-F. Chen, M.-F. Chang, S.-M. Yang, Y.-J. Kuo, J.-J. Wu, M.-S. Ho**, H.-Y. Su*, Y.-H. Chu*, W.-C. Wu*, T.-Y. Yang*, H. Yamauchi^, National Tsing Hua University, *ICL, ITRI, **National Chung Hsing University, ^Fukuoka Institute of Technology

This work proposes bit-line (BL) swing expansion schemes (BL-EXPD) that, to the best of our knowledge, minimize the product of SRAM cell area (A) and minimum operating voltage (VDDmin). The key enablers for minimizing A*VDDmin are the L-shaped 7T cell (L7T) and BL-EXPD. L7T features (1) an area-efficient compact cell layout, (2) a read-disturb-free decoupled 1T read port (RP), and (3) a half-select-disturb-free write-back scheme. BL-EXPD enables a 9x larger read-BL (RBL) swing at the 6σ point than that of our previously proposed Z8T and allows single-BL sensing for cell area saving. A fabricated 65nm 256-row BL 32Kb L7T SRAM achieves a 260mV VDDmin. As a result, its A*VDDmin is ~50% lower than that of Z8T and conventional 8T SRAM cells.

14.3 - 4:15 p.m.

A 1.6-mm2 38-mW 1.5-Gb/s LDPC Decoder Enabled by Refresh-Free Embedded DRAM, Y.S. Park, D. Blaauw, D. Sylvester, Z. Zhang, University of Michigan

Memory dominates the power consumption of high-throughput LDPC decoders. A 700 MHz refresh-free embedded DRAM (eDRAM) is designed as a low-power memory to retain data for the required access window. 32 1-kb eDRAM arrays are integrated in a 1.6 mm2, 65nm LDPC decoder suitable for IEEE 802.11ad. The LDPC decoder consumes 38 mW for a 1.5 Gb/s throughput at 90 MHz and 10 decoding iterations, and it achieves up to 9 Gb/s at 540 MHz.
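The figures quoted in this abstract can be cross-checked with a few lines of arithmetic. The sketch below is not from the paper: it assumes throughput scales linearly with clock frequency at a fixed iteration count, and the function names are hypothetical.

```python
def scaled_throughput_gbps(throughput_gbps, clock_mhz, new_clock_mhz):
    """Scale throughput to a new clock, assuming bits per cycle stay constant."""
    return throughput_gbps * new_clock_mhz / clock_mhz


def energy_per_bit_pj(power_mw, throughput_gbps):
    """Energy per decoded bit in picojoules (power divided by throughput)."""
    joules_per_bit = (power_mw * 1e-3) / (throughput_gbps * 1e9)
    return joules_per_bit * 1e12


# Figures taken from the abstract: 1.5 Gb/s at 90 MHz, 38 mW, peak clock 540 MHz.
peak_gbps = scaled_throughput_gbps(1.5, 90, 540)
pj_per_bit = energy_per_bit_pj(38, 1.5)
print(f"Projected throughput at 540 MHz: {peak_gbps:.1f} Gb/s")
print(f"Energy per bit at 90 MHz: {pj_per_bit:.1f} pJ/bit")
```

Under the linear-scaling assumption, the 6x clock increase from 90 MHz to 540 MHz accounts for the 6x throughput increase from 1.5 Gb/s to 9 Gb/s. The energy-per-bit value applies only to the 90 MHz operating point, since the abstract gives no power figure at 540 MHz.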

14.4 - 4:40 p.m.

1Gsearch/sec Ternary Content Addressable Memory Compiler with Silicon-Aware Early-Predict Late-Correct Single-Ended Sensing, I. Arsovski, T. Hebig, D. Dobson, R. Wistort, IBM Systems and Technology Group

This paper describes a Ternary Content Addressable Memory (TCAM) that uses a novel Early-Predict Late-Correct (EPLC) search scheme to achieve the highest published TCAM search throughput of 1 billion searches/sec, while using power-efficient two-phase search sensing that consumes only 0.76W on a 2048x640 TCAM.

Abstract:

A Ternary Content Addressable Memory (TCAM) uses a two-phase search operation in which an early prediction of the pre-search result prematurely activates the subsequent main-search operation, which is later interrupted only if the final pre-search result contradicts the early prediction. This early main-search activation improves performance by 30%, while the low probability of a late correction has a negligible power impact. This Early-Predict Late-Correct (EPLC) sensing enables a high-performance TCAM compiler implemented in a 32nm High-K Metal Gate SOI process to achieve 1Gsearch/sec throughput on a 2048x640bit TCAM instance while consuming only 0.76W. Embedded Deep-Trench (DT) capacitance for power supply noise mitigation adds 5% overhead for a total TCAM area of 1.56mm2.
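The control flow of the two-phase search described above can be illustrated with a small behavioral model. This is a sketch under stated assumptions, not the authors' circuit: the early prediction is passed in as a plain boolean (the abstract does not say how it is derived), the split between pre-search and main-search bits is a hypothetical parameter, and the case where the prediction is a miss but the final pre-search result is a match is assumed to fall back to a normally timed main search.

```python
from typing import NamedTuple


class SearchResult(NamedTuple):
    match: bool          # final match result for this entry
    early_start: bool    # main search was activated on the early prediction
    late_correct: bool   # main search was interrupted after a wrong prediction


def ternary_match(stored: str, key: str) -> bool:
    """Compare a stored ternary word ('0', '1', 'X' = don't care) with a binary key."""
    return all(s == "X" or s == k for s, k in zip(stored, key))


def eplc_search(stored: str, key: str, pre_width: int,
                predicted_pre_match: bool) -> SearchResult:
    """Behavioral model of one entry's early-predict late-correct search."""
    # Final pre-search result, available only after the prediction was made.
    pre_final = ternary_match(stored[:pre_width], key[:pre_width])

    # Early predict: start the main search before the pre-search has settled.
    early_start = predicted_pre_match

    # Late correct: interrupt the main search if the prediction was wrong.
    late_correct = early_start and not pre_final

    # The main search only completes when the pre-search truly matched
    # (assumption: a predicted miss that turns out to match starts late).
    main_match = pre_final and ternary_match(stored[pre_width:], key[pre_width:])
    return SearchResult(main_match, early_start, late_correct)


# Hypothetical 6-bit entry with a 2-bit pre-search segment.
hit = eplc_search("10X1X0", "101100", pre_width=2, predicted_pre_match=True)
corrected = eplc_search("10X1X0", "111100", pre_width=2, predicted_pre_match=True)
print(hit)
print(corrected)
```

In the first call the prediction is correct, so the main search runs to completion with an early start. In the second call the pre-search segment mismatches, so the early-started main search is interrupted and the entry reports no match.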

14.5 - 5:05 p.m.

A 2.8GHz 128-entry x 152b 3-Read/2-Write Multi-Precision Floating-Point Register File and Shuffler in 32nm CMOS, S. Hsu, A. Agarwal, M. Anders, H. Kaul, S. Mathew, F. Sheikh, R. Krishnamurthy, S. Borkar, Intel Corporation

A 128-entry x 152b 3-read/2-write ported multi-precision floating-point register file/shuffler with measured 2.8GHz operation is fabricated in 1.05V, 32nm CMOS. Single-precision (24b-mantissa), 2-way 12b or 4-way 6b reduced mantissa precision modes, certainty tracking bits, mode-dependent gating, area-efficient windowing using 1R/1W cells, and ultra-low-voltage read/write circuits enable 350mV-1.2V wide dynamic voltage range with measured peak energy-efficiency of 751GOPS/W at 400mV, 4-way 6b-mode (22.3x higher than 1.05V single-precision mode) and 19% area reduction over single-precision 3R/2W implementations.