CIRCUITS
SESSION 14 – TAPA II
Technology/Circuits
Focus Session - Embedded Memory
Thursday, June 14, 3:25 p.m.
Chairpersons: L. Cheng, Oracle
M. Yamaoka, Hitachi America, Ltd.
14.1 - 3:25 p.m.
Isolated Preset Architecture for a 32nm SOI embedded DRAM Macro, J. Barth,
D. Plass, A. Vehabovic, R. Joshi*, R. Kanj*, S. Burns, T. Weaver, IBM Systems
and Technology Group, *IBM Research
The Isolated Preset Architecture (IPA) improves retention
characteristics by implementing a weak read ‘1’ isolation scheme, allowing a
lower stored ‘1’ level to be sensed. The architecture also reduces sub-array
area by 15% and bit-line activation power by 2x compared to the previous
design, without impacting performance. The
architecture was implemented in IBM’s 32nm High-K/Metal SOI embedded DRAM
technology. Hardware results confirm a 1.8ns random cycle and 2x improved
retention characteristics with optimized analog reference tuning.
14.2 - 3:50 p.m.
A 260mV L-shaped 7T SRAM with Bit-Line (BL) Swing Expansion Schemes Based on
Boosted BL, Asymmetric-VTH Read-Port, and Offset Cell VDD Biasing Techniques,
M.-P. Chen, L.-F. Chen, M.-F. Chang, S.-M. Yang, Y.-J. Kuo, J.-J. Wu,
M.-S. Ho**, H.-Y. Su*, Y.-H. Chu*, W.-C. Wu*, T.-Y. Yang*, H. Yamauchi^,
National Tsing Hua University, *ICL, ITRI, **National Chung Hsing University,
^Fukuoka Institute of Technology
This work proposes bit-line (BL) swing expansion schemes
(BL-EXPD) that, to the best of our knowledge, achieve the lowest product of
SRAM cell area (A) and minimum operating voltage (VDDmin). The key enablers
for minimizing AVDDmin are the L-shaped 7T cell (L7T) and BL-EXPD. L7T features
(1) an area-efficient compact cell layout, (2) a read-disturb-free decoupled 1T
read port (RP), and (3) a half-select-disturb-free write-back scheme. BL-EXPD
enables a 9x larger read-BL (RBL) swing at the 6σ point than that of
our previously proposed Z8T and allows single-BL sensing to save cell
area. A fabricated 65nm 256-row BL
32Kb L7T SRAM achieves a 260mV VDDmin. As a result, its AVDDmin is ~50%
lower than that of Z8T and conventional 8T SRAM cells.
14.3 - 4:15 p.m.
A 1.6-mm² 38-mW 1.5-Gb/s LDPC Decoder Enabled by Refresh-Free Embedded DRAM,
Y.S. Park, D. Blaauw, D. Sylvester, Z. Zhang, University of Michigan
Memory dominates the power consumption of high-throughput
LDPC decoders. A 700 MHz refresh-free embedded DRAM (eDRAM) is designed as a
low-power memory to retain data for the required access window. 32 1-kb eDRAM
arrays are integrated in a 1.6 mm², 65nm LDPC decoder suitable for IEEE
802.11ad. The LDPC decoder consumes 38 mW for a 1.5 Gb/s throughput at 90 MHz
and 10 decoding iterations, and it achieves up to 9 Gb/s at 540 MHz.
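The figures quoted in this abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the linear scaling of throughput with clock frequency assumes the same iteration count and datapath at both operating points, which the abstract does not state explicitly.

```python
def energy_per_bit_pj(power_mw: float, throughput_gbps: float) -> float:
    """Energy per decoded bit in pJ. Units: mW / (Gb/s) = pJ/bit."""
    return power_mw / throughput_gbps


def scaled_throughput_gbps(base_gbps: float, base_mhz: float, target_mhz: float) -> float:
    """Throughput at a new clock, ASSUMING it scales linearly with frequency."""
    return base_gbps * target_mhz / base_mhz


# 38 mW at 1.5 Gb/s (90 MHz, 10 iterations), as quoted in the abstract.
print(round(energy_per_bit_pj(38.0, 1.5), 1))      # 25.3 (pJ/bit)

# Scaling the 90 MHz point to 540 MHz reproduces the quoted peak rate.
print(scaled_throughput_gbps(1.5, 90.0, 540.0))    # 9.0 (Gb/s)
```

Under that assumption, the 540 MHz peak of 9 Gb/s is consistent with the 90 MHz operating point.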
14.4 - 4:40 p.m.
1Gsearch/sec Ternary Content Addressable Memory Compiler with Silicon-Aware
Early-Predict Late-Correct Single-Ended Sensing, I. Arsovski, T. Hebig,
D. Dobson, R. Wistort, IBM Systems and Technology Group
This paper describes a Ternary Content Addressable Memory
(TCAM) that uses a novel Early-Predict Late-Correct (EPLC) search scheme to
achieve the highest published TCAM search throughput of 1 billion
searches/sec, while using power-efficient two-phase search sensing that
consumes only 0.76W on a 2048x640 TCAM.
Abstract:
A Ternary Content Addressable Memory (TCAM) uses a two-phase
search operation in which an early prediction of the pre-search results
activates the subsequent main-search operation ahead of time; the main search
is later interrupted only if the final pre-search results contradict the early
prediction. This early main-search activation improves performance by 30%,
while the low probability of a late correction has a negligible power impact.
This Early-Predict Late-Correct (EPLC) sensing enables a high-performance TCAM
compiler implemented in a 32nm High-K Metal Gate SOI process to achieve
1Gsearch/sec throughput on a 2048x640-bit TCAM instance while consuming only
0.76W. Embedded Deep-Trench (DT) capacitance for power supply noise mitigation
adds 5% overhead for a total TCAM area of 1.56 mm².
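As a rough aid to reading these numbers, the quoted power and throughput imply an energy per search, and dividing by the array size gives an energy per stored bit per search. This is a minimal sketch with hypothetical function names; it assumes the 0.76W figure is the average power at the full 1Gsearch/sec rate and that every bit of the 2048x640 array takes part in each search.

```python
def energy_per_search_nj(power_w: float, searches_per_sec: float) -> float:
    """Average energy per search in nJ (power divided by search rate)."""
    return power_w / searches_per_sec * 1e9


def energy_per_bit_search_fj(power_w: float, searches_per_sec: float,
                             entries: int, width_bits: int) -> float:
    """Energy per stored bit per search in fJ, ASSUMING all bits are searched."""
    joules_per_search = power_w / searches_per_sec
    return joules_per_search / (entries * width_bits) * 1e15


# 0.76 W at 1e9 searches/sec on a 2048 x 640-bit instance, as quoted above.
print(round(energy_per_search_nj(0.76, 1e9), 2))                  # 0.76 (nJ/search)
print(round(energy_per_bit_search_fj(0.76, 1e9, 2048, 640), 2))   # 0.58 (fJ/bit/search)
```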
14.5 - 5:05 p.m.
A 2.8GHz 128-entry x 152b 3-Read/2-Write Multi-Precision Floating-Point
Register File and Shuffler in 32nm CMOS, S. Hsu, A. Agarwal, M. Anders,
H. Kaul, S. Mathew, F. Sheikh, R. Krishnamurthy, S. Borkar, Intel Corporation
A 128-entry x 152b 3-read/2-write ported multi-precision
floating-point register file/shuffler with measured 2.8GHz operation is
fabricated in 1.05V, 32nm CMOS. Single-precision (24b-mantissa), 2-way 12b or
4-way 6b reduced mantissa precision modes, certainty tracking bits,
mode-dependent gating, area-efficient windowing using 1R/1W cells, and
ultra-low-voltage read/write circuits enable 350mV-1.2V wide dynamic voltage
range with measured peak energy-efficiency of 751GOPS/W at 400mV, 4-way 6b-mode
(22.3x higher than 1.05V single-precision mode) and 19% area reduction over
single-precision 3R/2W implementations.
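The abstract gives the peak efficiency and its ratio to the 1.05V single-precision mode, but not the baseline value itself. The sketch below derives the implied baseline and converts both to energy per operation. The function names are hypothetical, and the derived baseline is an inference from the two quoted figures, not a number reported in the abstract.

```python
def implied_baseline_gops_per_w(peak_gops_per_w: float, ratio: float) -> float:
    """Baseline efficiency implied by a peak value and its improvement ratio."""
    return peak_gops_per_w / ratio


def energy_per_op_pj(gops_per_w: float) -> float:
    """Energy per operation in pJ: 1 W / (N GOPS) = 1000 / N pJ."""
    return 1000.0 / gops_per_w


peak = 751.0    # GOPS/W at 400mV, 4-way 6b mode (quoted)
ratio = 22.3    # improvement over 1.05V single-precision mode (quoted)

baseline = implied_baseline_gops_per_w(peak, ratio)
print(round(baseline, 1))                     # 33.7 (GOPS/W, inferred)
print(round(energy_per_op_pj(peak), 2))       # 1.33 (pJ/op at peak efficiency)
print(round(energy_per_op_pj(baseline), 1))   # 29.7 (pJ/op, inferred baseline)
```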