
DDR4 A Controller PHY For Managed DRAM Solution With Damping-Resistor-Aided Pulse-Based Feed-Forward Equalizer

This document presents a controller PHY for high-capacity DRAM that utilizes a damping-resistor-aided pulse-based feed-forward equalizer (PB-FFE) to mitigate intersymbol interference and reflection issues in command/address (C/A) channels. The proposed solution improves energy efficiency and timing margins while reducing manufacturing costs by employing a managed DRAM solution. The architecture includes various digital modules for training and calibration, ensuring effective communication with multiple DRAM chips.


IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 8, AUGUST 2021 2563

A Controller PHY for Managed DRAM Solution With Damping-Resistor-Aided Pulse-Based Feed-Forward Equalizer

Hyeongjun Ko, Mino Kim, Hyunkyu Park, Graduate Student Member, IEEE, Sangyoon Lee, Jaewook Kim, Suhwan Kim, Senior Member, IEEE, and Joo-Hyung Chae, Member, IEEE

Abstract— A controller PHY for high-capacity DRAM is presented. To reduce precursor and postcursor intersymbol interference due to its dispersive channel characteristics and a heavy load of many DRAM chips and to attenuate reflection on a highly reflective command/address (C/A) channel, a damping-resistor-aided three-tap pulse-based feed-forward equalizer (PB-FFE) is introduced. An appropriate damping resistance can attenuate reflection, and the PB-FFE compensates for increased insertion loss due to the damping resistor. In addition, the current flows only before and after a signal transition in the PB-FFE, improving energy efficiency and maintaining the turn-ON resistance during the no-transition region. A controller PHY based on this equalizer was fabricated in a 55-nm CMOS process. The PB-FFE increases the timing margin of the C/A signal from 0.23 to 0.29 UI at 1067 Mb/s. At 2133 Mb/s, the read timing and voltage margins of the DQ signal are 0.53 UI and 211 mV after read training, and its write margin is 0.72 UI and 230 mV, respectively, after write training.

Index Terms— DRAM interface, dual-inline memory module (DIMM), feed-forward equalizer (FFE), glitch-free digitally controlled delay-line, memory controller, pulse-based equalizer.

Manuscript received August 21, 2020; revised December 31, 2020; accepted February 18, 2021. Date of publication March 17, 2021; date of current version July 23, 2021. This article was approved by Associate Editor Daniel Friedman. This work was supported in part by the Research Resettlement Fund for the New Faculty of Kwangwoon University in 2021. (Corresponding author: Joo-Hyung Chae.)
Hyeongjun Ko and Mino Kim were with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea. They are now with SK Hynix, Icheon 17336, South Korea.
Hyunkyu Park, Sangyoon Lee, Jaewook Kim, and Suhwan Kim are with the Department of Electrical and Computer Engineering, Seoul National University, Seoul 08826, South Korea (e-mail: [email protected]).
Joo-Hyung Chae was with SK Hynix, Icheon 17336, South Korea. He is now with the Department of Electronics and Communications Engineering, Kwangwoon University, Seoul 01897, South Korea (e-mail: [email protected]).
Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2021.3062876.
Digital Object Identifier 10.1109/JSSC.2021.3062876
0018-9200 © 2021 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: Korea Advanced Inst of Science & Tech - KAIST. Downloaded on May 24,2025 at 15:17:07 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

THE recent information and communication technology (ICT) topics can be represented by big data, Internet of Things (IoT), and cloud services, which are data-centric technologies. After the Exabyte Era, we have entered the Zettabyte Era, and technology for effectively storing and processing such massive data becomes very important [1]. Therefore, the demand for high capacity, low power, and low-cost memory continues to increase.

To increase storage capacity, the main memory is typically configured in the form of dual in-line memory modules (DIMMs). When using multiple DIMMs for additional capacity, registered DIMMs (RDIMMs) are used to reduce the loading on the command/address (C/A) lines [2], and load-reduced DIMMs (LRDIMMs) with additional data buffers are also used to reduce further the loading of the data bus [3]. Another way to increase capacity is to stack multiple DRAM chips on one package. In this case, to prevent an increase in input and output (IO) loadings, the internal data buses of each DRAM are usually connected using a through-silicon-via (TSV); however, the TSV process increases the manufacturing cost. To overcome the cost problem, a managed DRAM solution (MDS) was recently proposed as a cost-efficient solution with the moderate performance [4]. In the MDS DIMM, eight DRAM chips are stacked in each package using wire bonding, and the IO loadings for four DQ lines are reduced using an on-die repeater. However, there are 33 C/A pins in each DRAM, and too much area is required to repeat these ones. Therefore, C/A lines are connected to all DRAM chips, and each C/A transmitter (Tx) of the controller has to drive 80 DRAM chips. As a result, the C/A lines have a very large capacitive loading, and precursor intersymbol interference (ISI) should be addressed in these dispersive channels [5]. In addition, there is no input termination at the C/A receivers in DRAM, and the resulting impedance mismatch causes postcursor reflections on the C/A lines.

Approaches such as impedance-matched bidirectional multidrop (IMBM) and parallel branching with write-direction impedance matching (PBIM) topologies have been used to improve impedance matching to reduce reflection in multidrop memory channels [6], [7]. In IMBM, the input impedance is adjusted to Z_0 by using resistors with resistances of n × Z_0 and Z_0/n in each branch of the multidrop, where n = 1, 2, ..., k − 1 and k is the number of the branch. Because a large number of resistors are needed, and there is not enough space to mount these resistors, it is difficult to implement in a DIMM. The PBIM topology also requires an additional resistor to match impedance, and each branch should have a transmission line with a characteristic impedance of Z_0/2; thus, it is also difficult to implement in a DIMM. A decision feedback equalizer (DFE) can be used on the receiver side to improve ISI in multidrop memory channels with impedance discontinuities [8]–[10]. However, DFE cannot remove precursor ISI that needs to be addressed for dispersive channels such as the MDS C/A channel. It is disadvantageous in terms

of cost and power consumption to implement equalizers such as continuous-time linear equalizer (CTLE) and feed-forward equalizer (FFE) in each DRAM receiver, and the precursor ISI is generally dominated by the first precursor; thus, a Tx equalizer is appropriate to compensate for this. Conventional feed-forward equalizing schemes in the transmitter shift the output data and sum the tap currents, which wastes power and changes the output impedance of the drivers when there are no transitions in the signal [11].

A pulse-based FFE (PB-FFE) [11]–[15] can overcome the above disadvantages of the conventional FFE while compensating for ISI. Wang and Gai [11] presented the PB-FFE using precoded data, but the area and wire-consuming encoder and ten serializers occupy a large area. In [12], a small current and a large termination resistor are used. However, using a termination resistor larger than the channel characteristic impedance can cause large reflection in the highly reflective channel with heavy DRAM loading. A pre-emphasis-based FFE using the transition detector cannot remove the precursor ISI [13], [14]. The PB-FFE in [15] requires a quadrature clock for an additional return-to-zero (RZ) data aligner, resulting in increased power consumption. Moreover, these PB-FFE designs cannot reduce the postcursor reflections.

To alleviate the above issues, we present an MDS controller PHY with damping-resistor-aided three-tap pulse-based feed-forward equalizing C/A Tx on heavy load DRAM interfaces. The C/A Tx needs to drive the signal to 80 DRAM dies without receiver termination, making a highly reflective channel environment. A damping resistor is used at the DRAM receiver to attenuate reflection in this channel environment but this increases insertion loss; the three-tap PB-FFE compensates for this insertion loss. The proposed PB-FFE only injects current before and after a signal transition to compensate for precursor and postcursor ISI, and no current flows through the PB-FFE when there is no transition. In addition, the impedance of the output driver does not change during the no-transition region. The position of the third tap, which can cancel the postcursor reflection, can be controlled by introducing an adjustable delay. Furthermore, for our PHY with 132 C/A Txs, the area of an encoder and serializers needs to be minimized; thus, the PB-FFE uses one serializer and a simple encoder to encode the serialized data.

Fig. 1. PHY architecture of the MDS controller; this article is predominantly about the C/A Tx, which is a sub-block of the controller PHY.

II. MDS CONTROLLER PHY ARCHITECTURE

Fig. 1 shows the architecture of the MDS controller PHY and how it communicates with DRAMs. The PHY consists of an all-digital phase-locked loop (ADPLL), an all-digital delay-locked loop (ADDLL), a clock distribution circuit, a link training finite-state machine (LTFSM), eight pairs of clock signal (CK) Txs, four groups of 33 Txs for the C/A lines, 80 DQ signal transceivers for 20 nibbles of data, and 20 transceiver pairs for the corresponding DQS signals. Here, a nibble is a bundle of four IOs and one strobe pair. The ADPLL generates both the global CK (PHYCLK) used by the transceivers and the system clock (SYSCLK) used by the LTFSM. The frequency of PHYCLK is 1066 MHz, and that of SYSCLK is 533 MHz. The ADDLL has the same delay line as that in each transceiver and provides a delay control-code corresponding to 1-cycle of PHYCLK to each transceiver. Each delay line divides the received control-code

KO et al.: CONTROLLER PHY FOR MDS WITH DAMPING-RESISTOR-AIDED PB-FFE 2565

Fig. 3. C/A channel environment (a) without and (b) with damping resistor.

Fig. 2. MDS C/A channel.

by 128, allowing the clock phase to be adjusted in 1/128 cycle intervals. The clock tree minimizes the skew between each C/A group and between each nibble of DQ. To prevent skews between clock phases, PHYCLK is distributed as a single phase, and complementary CKs are generated in each transceiver.
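
As a rough numerical check of the 1/128-cycle phase step, the sketch below converts a step count into picoseconds using the 1066-MHz PHYCLK quoted in this section. The names `PHYCLK_MHZ`, `STEPS_PER_CYCLE`, and `steps_to_delay_ps` are illustrative assumptions rather than identifiers from the design, and the model assumes an ideally linear delay line.

```python
# Illustrative model only: the names and the ideal-linearity assumption are ours.
PHYCLK_MHZ = 1066          # PHYCLK frequency quoted in the paper
STEPS_PER_CYCLE = 128      # each delay line divides the one-cycle code by 128

T_CK_PS = 1e6 / PHYCLK_MHZ            # PHYCLK period in picoseconds
STEP_PS = T_CK_PS / STEPS_PER_CYCLE   # nominal delay added per step


def steps_to_delay_ps(steps: int) -> float:
    """Nominal delay for a given step count, assuming a linear delay line."""
    return steps * STEP_PS


if __name__ == "__main__":
    print(f"T_CK = {T_CK_PS:.1f} ps, one step = {STEP_PS:.2f} ps")
    print(f"64 steps (half a PHYCLK cycle) = {steps_to_delay_ps(64):.1f} ps")
```

Under these assumptions one step is roughly 7.3 ps, which gives a sense of the phase resolution available to the training modules.
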
The LTFSM contains a number of digital modules, including circuits for power-up and initialization, ZQ calibration, CA training, read preamble training, read latency and DQ training, write leveling, and write training. The power-up and initialization module performs power-up and initial reset procedures for both the PHY and the DRAMs. Also, it programs the mode register sets (MRS) for each DRAM chip, using gear-down mode in which C/A signals are widened to improve the initial sampling margin before CA training. The ZQ calibration module transmits the ZQ calibration command to the DRAM, and also calibrates the drive strengths of the C/A drivers and DQ drivers in the PHY. Other training modules adjust the DCDLs in each transceiver during the training sequences specified in the DDR4 standard. To avoid a clock domain crossing problem between the LTFSM and each Tx, regardless of the delay being produced by the DCDL, each Tx divides the PHYCLK before it passes through the DCDL and samples the input DQ and C/A signals using this divided CK. Each Tx outputs DQ and C/A signals to DRAM using the optimized timing determined by the training procedure. The DQ receiver receives data from the DRAM and passes it to the LTFSM through asynchronous FIFO. The structures of the Tx and receiver are described in more detail later.

III. DAMPING-RESISTOR-AIDED PULSE-BASED FEED-FORWARD EQUALIZER

A. Command/Address Channel With Damping Resistor

Fig. 2 shows the structure of the MDS C/A channel. Each channel is connected to ten DRAM packages, each of which has an octal-die package (ODP) structure in which eight DRAMs are stacked and connected by interchip bond wire. Thus, each C/A Tx drives ten DRAM packages, a heavy load of 80 DRAM dies; five packages are placed on each of the front and back sides of the DIMM. The C/A signal uses the center-tap termination (CTT), of which the bias voltage is VDDQ/2. Although the C/A channel has the characteristic impedance of 50 Ω and is terminated with a 20-Ω resistor (R_CH,TERM) at the end, each C/A input of the DRAM is not terminated to reduce power consumption; this makes a highly reflective channel environment, leading to large postcursor reflections.

The input parasitic of each DRAM chip can be modeled as a π-network [17], as shown in the upper right corner of Fig. 2. The parasitic input capacitance of each DRAM input, C_PAD, includes the capacitance of the PAD, electrostatic discharge (ESD) protection diodes, and the metal interconnects. The resistance R_CHIP between PAD and C/A receiver circuit includes the resistance of the metal lines and ESD protection resistor. The capacitive load C_LOAD includes the gate capacitance of the receiver circuit and controllable capacitor.

Fig. 4. Simulated (a) waveform and (b) reflection of SBR according to R_CHIP value.

Fig. 3(a) shows the simplified highly reflective MDS C/A channel. R_CHIP acts as a damping resistance, and reflections can be attenuated as the resistance value increases, as shown in Fig. 3(b). Simulated waveform and reflection of single-bit response (SBR) according to the R_CHIP value are


presented in Fig. 4(a) and (b), where SBR Amplitude means the main-cursor amplitude after C/A channel and reflection means the largest amplitude of reflection. We applied the 640-mV single-bit signal before the C/A channel and measured this signal at the C/A receiver input of the farthest DRAM chip. It shows that increasing R_CHIP attenuates reflection due to damping. However, the larger R_CHIP is multiplied by C_LOAD, increasing the insertion loss; thus, the PB-FFE is applied to overcome this tradeoff.

Fig. 5 shows the eye width at the C/A input of the farthest DRAM chip, for different values of R_CHIP and C_LOAD, obtained from a simulation in which the bond wires were modeled by their s-parameters and Tx FFE was not used. The purpose of this simulation is to find an optimum value of R_CHIP and C_LOAD to minimize the signal reflection. When C_LOAD is less than 0.3 pF, the timing margin tends to increase due to the reduced reflection with increasing R_CHIP; but when the load exceeds 0.3 pF, the timing margin decreases rapidly due to increased insertion loss with increasing R_CHIP. The maximum timing margin is achieved when C_LOAD is 0.2 pF and R_CHIP is 1.2 kΩ. When R_CHIP becomes larger than 1.2 kΩ, the timing margin is reduced regardless of C_LOAD. As a result of this simulation, we set C_LOAD to 0.2 pF and R_CHIP to 1 kΩ in each DRAM, leaving some margin so as not to exceed 1.2 kΩ.

Fig. 5. Variation of the simulated eye width with the input parasitic resistance and capacitance of each DRAM.

With the DRAM input parasitic values found above, Figs. 6 and 7 show the insertion loss and return loss of the C/A channel, respectively. The insertion loss for the farthest DRAM is 11.7 dB and increases rapidly around the operating frequency; thus, the ISI due to these insertion losses should be compensated. Due to the termination at the end of the channel, the insertion loss of 4.6 dB is shown for all DRAMs even at the low frequency. The return loss measured at the C/A Tx output is 8.5 dB at low frequency and 6.8 dB at the operating frequency. Since each DRAM does not have C/A receiver termination, the return loss is relatively high.

Fig. 6. Insertion loss of the C/A channel.

Fig. 7. Return loss of the C/A channel, probed on the output of the C/A Tx in Fig. 2.

B. Command/Address Transmitter

Fig. 8. Tx for one subgroup of 3 C/A signals.

The controller PHY transmits four groups of 33 C/A signals including CS[0:1], C[0:2], ACT_n, A[0:17], BG[0:1], BA[0:1], CKE[0:1], ODT[0:1], and PAR as standard DDR4 for each group. The timing skew between C/A signals should be minimized because a group of 33 C/A signals is sampled simultaneously by CK at each DRAM. However, it is inefficient in terms of area and power consumption to perform per-pin deskewing for C/A signals. In our PHY, each group of 33 C/A signals is divided into 11 subgroups, and C/A training is performed separately for each subgroup to reduce the skew. Because the internal timing margin of the CS[0:1] signals in the DRAM chip is different from that of the other C/A signals due to the pre-CMD scheme described in [4], the CS[0:1] signals are composed of one subgroup. Four of the remaining 31 C/A signals are composed of another subgroup, and the other 27 signals are composed of nine subgroups of three each. Fig. 8 shows the block diagram of one of the subgroup C/A Tx of three signals. The two-phase data sequences CA_P[0:1] for each C/A are applied from the LTFSM. To avoid domain


crossing problems between the PHYCLK and the signals from the LTFSM depending on the delay of the DCDL, the signals CA_P[0:1] are sampled first with a divided PHYCLK (CLK2), and then serialized with a delayed clock (CLK2_CA). The serialized C/A data are transmitted in SDR mode. Unlike the DQ path using the high-tap termination (HTT), the C/A path uses the CTT.

CA training is performed using the CA parity mode described in the DDR4 standard [16]. CS signals are trained first. During CS training, the controller transmits C/A signals with a bit pattern designed to produce a failure during parity checking. When CS is sampled successfully by the DRAM, it asserts the ALERT_N signal, which is the error flag signal of the DDR4, to low. By increasing the delays applied to the CS signals, the LTFSM trains the output timing of the CS signals and sets the delay at 1/4 of the pass window to provide more setup than hold time. After the CS signal has been trained, each subgroup of C/A signals is trained sequentially in the same manner. The LTFSM sets the delay at the center of the pass window for C/A signals except for the CS signals.

C. Pulse-Based Feed-Forward Equalizer

A PB-FFE [11]–[15] was presented to eliminate power wasting in conventional FFE [18]. The input to the PB-FFE is precoded so that the pre- or post-taps only inject current to the output nodes upon voltage level transition of the output signal, but conventional FFE makes the DC current path to the ground even in data nontransition region, increasing the power consumption [19]. To evaluate the power saving in the PB-FFE, we simulated both conventional and PB-FFE with a data-rate of 1067 Mb/s and a PRBS8 data pattern; a supply voltage of 1.1 V, a temperature of 40 °C, and a typical process corner are used. Under the condition of achieving the same margin improvement, the current consumption of conventional and PB-FFE is 10.1 and 7.6 mA, respectively; thus, the PB-FFE saves power consumption by 25%. Fig. 9 shows a block diagram of the proposed PB-FFE using edge-detection logic. The incoming serialized data are shifted using D flip-flops, and the combinational logic gates generate pulse signals of 1 unit-interval (UI) in length before and after each transition as follows: just before the rising edge (PRE_L2H), just before the falling edge (PRE_H2L), just after the rising edge (POST_L2H), and just after the falling edge (POST_H2L). The inverted signals PREB_L2H, PREB_H2L, POSTB_L2H, and POSTB_H2L are generated separately, so that the pull-up and pull-down drivers can be controlled independently. Signals with the prefix “PRE” act as pretaps to remove the precursor, and the direction of compensation is selected by the SIGN_PRE signal. Signals with the prefix “POST” are pre-emphasis pulses whose position is controllable by using the DLY_POST signal, reducing postcursor reflection. Its delay range is from 0 to 1050 ps with the resolution of 150 ps. The SIGN_PRE and SIGN_POST signals decide whether the tap coefficient adds or subtracts, selecting the direction of compensation. The tap-coefficient ranges of the PB-FFE are from 0 to 0.5 for pre-tap and from 0 to 0.25 for post-tap. When the PB-FFE is providing simple pre-emphasis, inverted pre-emphasis pulses are not required; but it is available for use as a post-tap to remove the postcursor by adding a delay to the pre-emphasis pulse. By equalizing channel loss using pulses when there is a signal transition, the proposed FFE does not consume DC power, and the output driver impedance problem can be addressed during the no-transition region. Because MDS C/A channel is an ISI-dominant dispersive channel, we set SIGN_PRE to 0 to compensate for the precursor and SIGN_POST to 0 to boost the transition edge using a pre-emphasis pulse. DLY_POST is set to 0.

Fig. 9. Block diagram of the PB-FFE.

Fig. 10 shows the detailed implementation of the output driver. The main tap consists of 24 driver units, with a turn-ON resistance (R_ON) of 240 Ω each, and operates with 14 Ω or 10 Ω by enabling 17 or 24 units, respectively, according to the RONSEL signal. Each driver unit consists of active transistors and a passive resistor to assure driver linearity and to provide ESD protection. R_ON of the driver unit is automatically determined by adjusting PCODE and NCODE so that its value is the same as the 240-Ω external resistor connected to a separate ZQ pin. The pre-tap and post-tap also use identical driver units. The pretap consists of 12 driver units that can be selected by EQ_PRE signals to adjust the pretap coefficient, and the post-tap consists of six driver units that are controlled by EQ_POST signals.

Fig. 11 shows an example timing diagram when the SIGN_PRE, SIGN_POST, and DLY_POST signals are 0. The signals of MPRE_PU, MPRE_PD, MPOST_PU, and MPOST_PD are the equalizing pulses of each pre-tap and post-tap, as shown in Fig. 9. Each pre-tap and post-tap driver operates before and after the transition, and the PB-FFE performs the same function as a conventional FFE at the output of the C/A Tx while keeping R_ON unchanged during data nontransition. Fig. 12 shows the simulated SBR of the C/A channel for the nearest and farthest DRAM input, with and without applying the PB-FFE. Fig. 12(b) shows that the precursor is reduced by 40 mV, and the transition is boosted by 20 mV with pre-emphasis for the farthest DRAM. Fig. 13 shows the simulated eye diagram of the farthest DRAM input with and without applying the PB-FFE at 1.2 Gb/s. The


simulation result shows that the timing margin is increased by 0.06 UI, which means improvement by 28%, with the tap coefficient of 0.5 and 0.25 for pre- and post-tap, respectively, for the same input rectangular mask. Considering supply and reference voltage noise, crosstalk, receiver offset in DRAM, and timing skew between C/A signals, the required target of the input mask is 200 ps of a horizontal eye with 200 mV of a vertical eye. The simulation result meets the required eye mask.

Fig. 10. Detailed implementation of the output driver in the PB-FFE.

Fig. 11. Example timing of the PB-FFE.

Fig. 12. SBR of (a) nearest and (b) farthest DRAM from the controller PHY.

Fig. 13. Simulated eye diagram of the farthest DRAM input (a) without and (b) with applying the PB-FFE at 1.2 Gb/s. The vertical eye mask is 200 mV.

IV. OTHER BUILDING BLOCKS

A. DQ/DQS Transmitter

Fig. 14. Block diagram of a nibble of DQ/DQS Tx.

Fig. 14 shows a section of the DQ/DQS Tx, which serializes a nibble of data from the LTFSM and outputs it in DDR mode. Since each DQ bus has only four DRAM loads and ISI is not large, we did not use equalization in our DQ Tx, improving the power efficiency. The Tx receives the four 4-phase data signals WRDQ0–WRDQ3 from the LTFSM, together with the two-phase write-enable signal WREN_P. The arriving data are sampled using the CLK2 signal to avoid domain crossing problems caused by variable delay. The strobe generator DQS_GEN generates the DQS preamble pattern specified by the mode register, and subsequently the clock pattern; and these are output as the strobe signal DQS_T, together with its differential pair DQS_C, by two of the drivers DRV. These drivers, together with the other four drivers DRV which output the data signals, are turned ON by the driver-enable signal DRV_EN, which is generated by serializing WREN_P.
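
The serialization of multi-phase LTFSM data can be pictured with a small behavioral sketch. It assumes that the LTFSM delivers one four-phase word per DQ line (or one two-phase word per C/A line) in each 533-MHz SYSCLK cycle and that phase 0 is transmitted first; neither the phase ordering nor the helper names `serialize` and `line_rate_mbps` come from the paper.

```python
# Behavioral sketch under stated assumptions; not the PHY's actual logic.
SYSCLK_MHZ = 533  # LTFSM clock frequency quoted in the paper (rounded)


def serialize(words):
    """Flatten per-cycle phase words into one serial bit stream.

    Each word holds the bits the LTFSM supplies in one SYSCLK cycle;
    phase 0 is assumed to be transmitted first.
    """
    return [bit for word in words for bit in word]


def line_rate_mbps(phases_per_cycle, clk_mhz=SYSCLK_MHZ):
    """Serial bit rate when `phases_per_cycle` bits leave per clock cycle."""
    return phases_per_cycle * clk_mhz


if __name__ == "__main__":
    wrdq0 = [(1, 0, 1, 1), (0, 0, 1, 0)]   # two SYSCLK cycles of 4-phase data
    print(serialize(wrdq0))
    print("DQ rate  :", line_rate_mbps(4), "Mb/s")
    print("C/A rate :", line_rate_mbps(2), "Mb/s")
```

With the rounded 533-MHz figure, a four-phase path comes out at about 2132 Mb/s and a two-phase path at about 1066 Mb/s, which is consistent with the 2133-Mb/s DQ and 1067-Mb/s C/A rates reported in the paper.
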


edge of the incoming DQS could not be found because the


DQS does not arrive at the expected read latency, and in this
case, LTFSM increases the internal latency of RDEN_P.
Since DQ and DQS for each nibble arrive at the same time,
the DQ signals are sampled by a DQS with a phase lag of 90◦
and this delay is fine-tuned during read training. The received
DQ signals are deserialized to four-phases, and transmitted
to the LTFSM through asynchronous FIFO, together with the
VALID signal. When all the VALID signals from 20 nibbles
are received, the LTFSM accepts the deserialized data.

Fig. 15. Block diagram of a nibble of DQ/DQS receiver.


C. Glitch-Free DCDL
DCDL is used for all Tx and receiver blocks for link
The PHY supports per-nibble write leveling and write training, and NAND-based structure with seamless boundary
training. The DCDL for CLK_DQS is adjusted so that CK switching scheme [20] is used in our PHY for the small
and DQS are aligned at the DRAM input after write leveling. minimum delay, high resolution, good monotonicity, and sim-
The DCDL for CLK_DQ produces 0.5 UI less delay, so that ple layout. When changing the delay of the DCDL, a glitch
the phase of CLK_DQ leads that of CLK_DQS by 90◦ . This problem may occur. A glitch in the output of a DCDL causes
delay, which is fine-tuned during write training, allows DRAM a phase error in the divided CK reaching the serializer and
to sample DQ with DQS. Because both the rising and falling deserializer, and write or read failure occurs accordingly.
edges of the CKs are used to output DQ in DDR mode, the To avoid this problem, the delay of the DCDL may change step
duty-cycle error of each DCDL is corrected by the duty-cycle by step from the final value to the target value after finishing
corrector (DCC). the training. The resolution of a step is TCK /128 or 1 UI/64,
and the DCDLs have a range of 0–191 steps in our PHY to
cover the delay of 1.5 × TCK . Thus, it may take 768 × TCK
B. DQ/DQS Receiver to adjust the DCDL for the worst-case when changing delay
Fig. 15 shows a 1-nibble DQ/DQS receiver, which receives in every four cycles of PHYCLK.
the four data signals DQ0–DQ3 from the DRAM, together An alternative way to suppress glitches is to switch each
with the strobes DQS_T and DQS_C. Upon receiving the delay stage sequentially, but an implementation [21] of this
data and strobe signal, only the pMOS transistor and passive scheme requires 1.5 times the area and power. Since our PHY
resistor of the Tx driver in Fig. 14 turn ON and operate as a has five DCDLs in each nibble and 11 in each CA group,
receiver termination, whereas the other components of the Tx driver turn off. The time at which read data arrive at the PHY after the read command is transmitted is $T_{CMD} + RL \times T_{CK} + T_{DQ}$, where $T_{CMD}$ is the flight time of the C/A signals from the Tx to each DRAM; $T_{CK}$ is the period of PHYCLK; and $T_{DQ}$ is the flight time of the DQ and DQS signals from the Tx of each DRAM to the PHY. RL, which is defined in the mode register, denotes the read latency, that is, the number of clock cycles between a DRAM receiving a read command and outputting the read data. The time at which the data arrive is not synchronized to PHYCLK, and each nibble arrives at a different time. Thus, the buffer-enable time must be different for each nibble, and it is determined by read preamble training, which is performed as follows. The LTFSM sends the two-phase read-enable signal RDEN_P to the receiver after a delay of RL, and the receiver serializes it to generate the BUF_EN signal. The correct buffer-enable time is determined by issuing a read command and then waiting for the rising edge of the DQS. This edge is found by sampling the incoming DQS with the BUF_EN signal, which is delayed in steps of $T_{CK}/128$ during read preamble training. When the rising edge of the incoming DQS and the BUF_EN signal are aligned, the sampled DQS changes from 0 to 1, and the LTFSM terminates read preamble training and sets the DCDL delay 0.5 UI shorter than the value at which the sweep stopped, to provide buffer-enable timing margin. If $T_{CMD} + T_{DQ}$ is greater than $T_{CK}$, the rising

making a total of 152 DCDLs, it is difficult to employ a DCDL structure with a large area and power consumption as described in [21].

In our PHY, there is a period of bus idle time when changing the delay after training. To alleviate the above issues, the DCDL is implemented so that a glitch is not transmitted to the SER or DES, by blocking the DCDL output during this period. Fig. 16(a) and (b) show a block diagram of the glitch-free DCDL in our PHY and its timing diagram, respectively. When TIME_CODE, which determines the delay, is changed by the LTFSM, the code change detector asserts the FLAG signal, and the CK input to the DCDL, DCDL_IN, is blocked accordingly. Then the internal delay-control code DCDL_CODE is changed. Finally, the FLAG signal is deasserted and the CK is again supplied to the DCDL. The code change detector is supplied with CLK4, which is a 1/4-rate version of PHYCLK. The CK to the DCDL is blocked for two cycles of CLK4, which avoids EVEN/ODD phase inversion problems. With this glitch-free DCDL, the time required to update the training result is reduced to only $8 \times T_{CK}$, and the overall training time is reduced compared to the step-by-step methodology described above. During the training procedure, the blocking period of the DCDL does not affect the overall training time because there is a waiting time, longer than the blocking period, to receive feedback from the DRAM after transmitting a training pattern.
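The read preamble training sweep described above can be sketched as a small behavioral model. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `sample_dqs` callback, and the toy edge position are hypothetical. The sketch steps the BUF_EN delay in units of T_CK/128, stops at the first 0-to-1 change of the sampled DQS, and then backs off by 0.5 UI, which is 32 steps because one step corresponds to 1/64 UI.

```python
STEPS_PER_TCK = 128   # DCDL resolution from the text: T_CK / 128
STEPS_PER_UI = 64     # one step is 1/64 UI, so 0.5 UI is 32 steps


def read_arrival_time(t_cmd, rl, t_ck, t_dq):
    """Arrival time of read data at the PHY: T_CMD + RL * T_CK + T_DQ."""
    return t_cmd + rl * t_ck + t_dq


def read_preamble_training(sample_dqs, max_steps):
    """Sweep the BUF_EN delay until the sampled DQS changes from 0 to 1.

    sample_dqs(step) is a hypothetical callback that issues a read command
    and returns the DQS level (0 or 1) sampled by BUF_EN at that delay.
    Returns the final delay code in steps, or None if no edge is found.
    """
    previous = sample_dqs(0)
    for step in range(1, max_steps):
        current = sample_dqs(step)
        if previous == 0 and current == 1:
            # Edge found: back off by 0.5 UI for buffer-enable margin.
            return max(step - STEPS_PER_UI // 2, 0)
        previous = current
    return None


def toy_dqs(step, edge_step=200):
    """Toy nibble whose DQS rising edge sits at step 200 (made-up value)."""
    return 1 if step >= edge_step else 0


trained_code = read_preamble_training(toy_dqs, max_steps=4 * STEPS_PER_TCK)
```

Because each nibble sees a different flight time, a model like this would be run once per nibble, each with its own delay code.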


Fig. 18. Measurement setup.

Fig. 16. (a) Block diagram and (b) timing diagram of the proposed glitch-free
DCDL.
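The update sequence of the glitch-free DCDL (assert FLAG, block the clock, change the code, release the clock) can be modeled with a few lines of Python. This is a behavioral sketch only: the class name, the event log, and the assumption that the update time equals the blocking window are introduced here for illustration. With CLK4 running at one quarter of the PHYCLK rate and a blocking window of two CLK4 cycles, the model yields the 8 × T_CK update time quoted in the text.

```python
CLK4_DIVIDER = 4        # CLK4 is a 1/4-rate version of PHYCLK
BLOCK_CLK4_CYCLES = 2   # CK to the DCDL is blocked for two CLK4 cycles


class GlitchFreeDcdl:
    """Behavioral model of the code-update handshake (hypothetical names)."""

    def __init__(self, code=0):
        self.dcdl_code = code   # internal delay-control code (DCDL_CODE)
        self.flag = False       # FLAG from the code change detector
        self.log = []           # ordered record of handshake events

    def update(self, time_code):
        """Apply a new TIME_CODE and return the blocking time in T_CK units."""
        if time_code == self.dcdl_code:
            return 0            # no code change, so the clock is never blocked
        self.flag = True        # 1) detector asserts FLAG, CK is gated off
        self.log.append("block")
        self.dcdl_code = time_code   # 2) code changes while CK is blocked
        self.log.append("update")
        self.flag = False       # 3) FLAG is deasserted, CK is supplied again
        self.log.append("release")
        return BLOCK_CLK4_CYCLES * CLK4_DIVIDER


dcdl = GlitchFreeDcdl(code=10)
blocked_tck = dcdl.update(42)
```

Changing the code only while the clock is gated is what keeps a partially switched delay path from producing a runt pulse at the SER or DES input.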

Fig. 19. (a) Measured waveforms of CK-CKB, CS, and C/A at the package
ball, and simulated C/A signal at (b) package ball and (c) receiver buffer input
of the farthest DRAM die. All of them were observed at the B4 DRAM ODP
in Fig. 2.
Fig. 17. Die micrograph of the MDS controller chip.

V. EXPERIMENTAL RESULTS

The MDS controller chip shown in Fig. 17 was fabricated in a 55-nm CMOS process and occupies 77.2 mm². The controller was mounted on an MDS DIMM and interfaces with 40 DRAM packages, of which 20 are on the front side and 20 are on the back side, as shown in Fig. 18. The C/A_AU group of the controller drives the upper right ten DRAM packages, five on the front and five on the back of the DIMM. The C/A_AD group drives the lower right ten DRAM packages. The C/A_BU and C/A_BD groups drive the upper left and lower left ten DRAM packages, respectively. The DIMM is mounted on the test board, and the control and data signals are connected to the automatic test equipment (ATE), which transmits the measurement results to the PC. When power is on, the controller performs initialization sequences including media power-up, MRS initialization, ZQ calibration on both the DRAM and controller side, CA training, read preamble training, read training, write leveling, and write training.

Fig. 19(a) shows measured waveforms of CK-CKB, CS, and one of the C/A signals, A5, measured at the package ball of the B4 DRAM ODP in Fig. 2. During CA training, the CS signal is trained to have 75% setup time and 25% hold time, whereas the other C/A signals are trained to have setup and hold times of 50% each. As a result of the CA training, the CS signal transitions earlier than the A5 signal. The waveforms in Fig. 19(a) show reflection because the DRAM receiver is not terminated. The simulation results of Fig. 19(b) and (c) show that the actual input signal at the receiver of the top chip has less reflection than the signal at the DRAM package ball. We also confirmed that the receiver inputs of the other dies show less reflection. Fig. 20 shows the timing margin of each C/A subgroup on the C/A lines, measured by the DCDL in the C/A Tx with a resolution of $T_{CK}/128$. Because the CS channel has less DRAM load than the other C/A channels, the timing margin


TABLE I
PERFORMANCE SUMMARY AND COMPARISON WITH OTHER DRAM INTERFACES

Fig. 22. Measured read timing and voltage margin on the DIMM.
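The read margins reported for the DIMM can be cross-checked with a short calculation. The sketch below uses only numbers stated in the paper (a 4.3 mV reference step, a 49-step voltage margin, and a 0.53 UI timing margin at 2133 Mb/s); the variable and function names are assumptions made for this example.

```python
def ui_ps(data_rate_mbps):
    """Unit interval in picoseconds for a per-pin data rate given in Mb/s."""
    return 1e6 / data_rate_mbps


READ_RATE_MBPS = 2133      # data rate at which the read margin is reported
VREF_STEP_MV = 4.3         # internal reference voltage step of the PHY
VREF_MARGIN_STEPS = 49     # measured voltage margin in steps
READ_MARGIN_UI = 0.53      # read timing margin measured on the MDS DIMM

# 49 steps of 4.3 mV give about 210.7 mV, reported as 211 mV.
voltage_margin_mv = VREF_STEP_MV * VREF_MARGIN_STEPS

# The same timing margin expressed in picoseconds.
timing_margin_ps = READ_MARGIN_UI * ui_ps(READ_RATE_MBPS)
```

Expressing the timing margin in picoseconds makes it easier to compare against jitter and duty-cycle error budgets, which are usually specified in absolute time.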

Fig. 20. Measured C/A Margin with and without the PB-FFE.
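To put the C/A margin comparison in absolute terms, the minimum timing margins of 0.23 UI without the FFE and 0.29 UI with the PB-FFE at 1067 Mb/s can be converted to picoseconds. This is a worked-arithmetic sketch; the names are hypothetical, and it assumes one UI of the C/A channel equals the reciprocal of the 1067 Mb/s rate.

```python
CA_RATE_MBPS = 1067          # C/A data rate quoted for the margin measurement
MARGIN_WITHOUT_FFE_UI = 0.23
MARGIN_WITH_FFE_UI = 0.29

ca_ui_ps = 1e6 / CA_RATE_MBPS                      # one C/A unit interval in ps

margin_without_ps = MARGIN_WITHOUT_FFE_UI * ca_ui_ps
margin_with_ps = MARGIN_WITH_FFE_UI * ca_ui_ps

gain_ps = margin_with_ps - margin_without_ps       # absolute improvement
gain_percent = 100 * (MARGIN_WITH_FFE_UI / MARGIN_WITHOUT_FFE_UI - 1)
```

Under these assumptions the improvement is roughly 56 ps, or about 26% of the margin measured without the FFE.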

Fig. 23. Measured write timing and voltage margin on the DIMM.
Fig. 21. Read margin measured by ATE.

of the CS channel is larger than the others. The minimum timing margin without FFE is 0.23 UI at 1067 Mb/s. Applying the PB-FFE with tap coefficients of 0.5 and 0.25 for the pre- and post-tap, respectively, increases the timing margin to 0.29 UI.

Fig. 21 shows a shmoo plot of the read operation with data driven by the ATE, and the read margin of our controller PHY is 0.58 UI. However, the read margin measured on the MDS DIMM is reduced to 0.53 UI due to the duty-cycle error and output jitter of the DRAM. The internal reference voltage step of the PHY is 4.3 mV and the measured voltage margin is 49 steps or 211 mV, as shown in Fig. 22.

The write timing margin is measured by varying the DQ timing with a step size of $T_{CK}/128$, or 1/64 UI, and reading back the previously written data. The write voltage margin is measured by varying the reference voltage of the DRAM with a step size of 0.65% of VDDQ, as in the DDR4 standard. The measured write timing and voltage margins are 0.72 UI and 230 mV, as shown in Fig. 23.

Fig. 24 shows the power breakdown of the burst write and read operations. The total power consumption of


TABLE II
PERFORMANCE SUMMARY AND COMPARISON WITH OTHER PB-FFES

the controller during burst write and read operation is 1.97 W, which satisfies the requirement for an MDS DIMM [4]. Table I lists the comparison of this PHY to other DRAM interfaces, and Table II shows the performance comparison with other PB-FFE designs. Our damping-resistor-aided PB-FFE can transmit the signal to many loads with better energy efficiency.

Fig. 24. Power breakdown of (a) write and (b) read operation.

VI. CONCLUSION

A controller PHY for a high-capacity DRAM solution was presented. It was mounted on an MDS DIMM [4] and interfaced to 40 DRAM packages. This controller supports all the training sequences specified in the DDR4 standard, including link training for C/A, read, and write operations. A glitch-free DCDL reduces training time. To attenuate reflection and reduce the ISI caused by the heavy load of multiple DRAM chips on a C/A channel, a damping-resistor-aided PB-FFE is used in the C/A Tx. The controller was fabricated in a 55-nm CMOS process and occupies 77.2 mm². Its C/A timing margin at 1067 Mb/s is improved from 0.23 to 0.29 UI by applying the PB-FFE. At 2133 Mb/s, the measured read timing and voltage margins are 0.53 UI and 211 mV after read training, and the write margins are 0.72 UI and 230 mV after write training. The power consumption of the controller during burst write and read operation is 1.97 W, which satisfies the requirement of the MDS DIMM [4]. Our damping-resistor-aided PB-FFE can be applied to the standard RDIMM or LRDIMM to drive the C/A channel with improved power efficiency.

REFERENCES

[1] A. M. Ionescu, “Energy efficient computing and sensing in the Zettabyte era: From silicon to the cloud,” in IEDM Tech. Dig., San Francisco, CA, USA, Dec. 2017, pp. 1.2.1–1.2.8.
[2] DDR4 SDRAM Registered DIMM Design Specification, Standard 21C 4.20.28-1, JEDEC, May 2019.
[3] DDR4 SDRAM Load Reduced DIMM Design Specification, Standard 21C 4.20.27-1, JEDEC, Aug. 2015.
[4] S. Lee et al., “23.4 A 512GB 1.1 V managed DRAM solution with 16GB ODP and media controller,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2019, pp. 384–386.
[5] J. Ren et al., “Precursor ISI reduction in high-speed I/O,” in Proc. IEEE Symp. VLSI Circuits, Kyoto, Japan, Jun. 2007, pp. 134–135.
[6] W.-Y. Shin et al., “A 4.8Gb/s impedance-matched bidirectional multi-drop transceiver for high-capacity memory interface,” in Proc. IEEE Int. Solid-State Circuits Conf., San Francisco, CA, USA, Feb. 2011, pp. 494–496.
[7] W. Lee et al., “Parallel branching of two 2-DIMM sections with write-direction impedance matching for an 8-Drop 6.4-Gb/s SDRAM interface,” IEEE Trans. Compon., Packag., Manuf. Technol., vol. 9, no. 2, pp. 336–342, Feb. 2019.
[8] J. Seo et al., “A 7.8-Gb/s 2.9-pJ/b single-ended receiver with 20-tap DFE for highly reflective channels,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 28, no. 3, pp. 818–822, Mar. 2020.
[9] H.-J. Chi et al., “A single-loop SS-LMS algorithm with single-ended integrating DFE receiver for multi-drop DRAM interface,” IEEE J. Solid-State Circuits, vol. 46, no. 9, pp. 2053–2063, Sep. 2011.
[10] S.-J. Bae, H.-J. Chi, Y.-S. Sohn, and H.-J. Park, “A 2Gb/s 2-tap DFE receiver for multi-drop single-ended signaling systems with reduced noise,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, 2004, pp. 244–245.
[11] Y. Wang and W. Gai, “Power-efficient pre-emphasis method for transmitters with LVDS drivers,” Electron. Lett., vol. 50, no. 24, pp. 1811–1813, Nov. 2014.
[12] B. Kim and V. Stojanovic, “An energy-efficient equalized transceiver for RC-dominant channels,” IEEE J. Solid-State Circuits, vol. 45, no. 6, pp. 1186–1197, Jun. 2010.
[13] S. Han, S. Lee, M. Choi, J.-Y. Sim, H.-J. Park, and B. Kim, “A Coefficient-Error-Robust feed-forward equalizing transmitter for eye-variation and power improvement,” IEEE J. Solid-State Circuits, vol. 51, no. 8, pp. 1902–1914, Aug. 2016.
[14] H.-G. Ko, S. Shin, J. Oh, K. Park, and D.-K. Jeong, “6.7 An 8Gb/s/μm FFE-combined crosstalk-cancellation scheme for HBM on silicon interposer with 3D-staggered channels,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2020, pp. 128–130.
[15] S.-G. Kim, T. Kim, D.-H. Kwon, and W.-Y. Choi, “A 5–8 Gb/s low-power transmitter with 2-tap pre-emphasis based on toggling serialization,” in Proc. IEEE Asian Solid-State Circuits Conf. (A-SSCC), Toyama, Japan, Nov. 2016, pp. 249–252.
[16] DDR4 SDRAM, Standard JESD79-4C, JEDEC, Jan. 2020.
[17] H.-H. Chuang et al., “Signal/Power integrity modeling of high-speed memory modules using chip-package-board coanalysis,” IEEE Trans. Electromagn. Compat., vol. 52, no. 2, pp. 381–391, May 2010.


[18] C. Menolfi et al., “A 16Gb/s source-series terminated transmitter in 65 nm CMOS SOI,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Feb. 2007, pp. 446–447.
[19] J.-H. Chae, Y.-U. Jeong, and S. Kim, “Data-dependent selection of amplitude and phase equalization in a quarter-rate transmitter for memory interfaces,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 67, no. 9, pp. 2972–2983, Sep. 2020.
[20] J.-T. Kwak, C.-K. Kwon, K.-W. Kim, S.-H. Lee, and J.-S. Kih, “A low cost high performance register-controlled digital DLL for 1 Gbps×32 DDR SDRAM,” in Proc. Symp. VLSI Circuits. Dig. Tech. Papers, Kyoto, Japan, 2003, pp. 283–284.
[21] D. De Caro, “Glitch-free NAND-based digitally controlled delay-lines,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 55–66, Jan. 2013.
[22] W. Yun et al., “A digital DLL with hybrid DCC using 2-step duty error extraction and 180° phase aligner for 2.67Gb/S/pin 16Gb 4-H stack DDR4 SDRAM with TSVs,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, San Francisco, CA, USA, Mar. 2015, pp. 1–3.
[23] M. Kim et al., “A 4266 Mb/s/pin LPDDR4 interface with an asynchronous feedback CTLE and an adaptive 3-step eye detection algorithm for memory controller,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 65, no. 12, pp. 1894–1898, Dec. 2018.
[24] M. Kossel et al., “DDR4 transmitter with AC-boost equalization and wide-band voltage regulators for thin-oxide protection in 14-nm SOI CMOS technology,” in Proc. 43rd IEEE Eur. Solid State Circuits Conf., Leuven, Belgium, Sep. 2017, pp. 115–118.

Hyeongjun Ko received the B.S. degree in electrical and electronics engineering from Korea University, Seoul, South Korea, in 2005, and the Ph.D. degree in electrical and computer engineering from Seoul National University, Seoul, in 2020.
He joined the SK Hynix, Icheon, South Korea, in 2005. Since then, he has been engaged in I/O circuit design and failure analysis of high-speed DRAM such as DDR2, DDR3, and DDR4 and low-power DRAM such as LPDDR2, LPDDR3, and LPDDR4. His research interests include high-speed and low-power I/O interface and signal integrity.

Mino Kim received the B.S. and Ph.D. degrees in electrical engineering from Seoul National University, Seoul, South Korea, in 2010 and 2017, respectively.
In 2017, he joined SK Hynix, Icheon, South Korea. His research interests include high-speed I/O circuits, clock generation circuits, and high-speed memory interfaces.

Hyunkyu Park (Graduate Student Member, IEEE) received the B.S. degree in electrical and electronic engineering from Sungkyunkwan University, Suwon, South Korea, in 2014. He is currently pursuing the Ph.D. degree with Seoul National University, Seoul, South Korea.
His research interests are the design of high-speed I/O circuits, clock generation circuits, and memory interface.

Sangyoon Lee received the B.S. degree in electrical and electronics engineering from Korea University, Seoul, South Korea, in 2016. He is currently pursuing the Ph.D. degree with Seoul National University, Seoul.
His research interests include high-speed and low-power I/O interface and memory interface.

Jaewook Kim received the B.S. degree in electrical and electronics engineering from Korea University, Seoul, South Korea, in 2015. He is currently pursuing the Ph.D. degree with Seoul National University, Seoul.
His research interests include high-speed I/O and memory interfaces.

Suhwan Kim (Senior Member, IEEE) received the B.S. and M.S. degrees in electrical engineering and computer science from Korea University, Seoul, South Korea, in 1990 and 1992, respectively, and the Ph.D. degree in electrical engineering and computer science from the University of Michigan, Ann Arbor, MI, USA, in 2001.
From 1993 to 1999, he was with LG Electronics, Seoul. From 2001 to 2004, he was a Research Staff Member with the IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA. In 2004, he joined Seoul National University, Seoul, where he is currently a Professor of electrical and computer engineering. His research interests include analog and mixed-signal integrated circuits, high-speed I/O circuits, low-power sensor readout circuits, and silicon-photonic integrated circuits.
Dr. Kim received the 1991 Best Student Letter Award of the IEEE Korea Section and the First Prize (Operational Category) in the VLSI Design Contest of the 2001 ACM/IEEE Design Automation Conference, the Best Letter Award of the 2009 Korean Conference on Semiconductors, and the 2011 Best Letter Award of the International Symposium on Low-Power Electronics and Design. He served as a Guest Editor of Special Issue on the IEEE Asian Solid-State Circuits Conference for the IEEE JOURNAL OF SOLID-STATE CIRCUITS. He has also served as the Organizing Committee Chair for the IEEE Asian Solid State Conference, and a General Co-Chair and the Technical Program Chair for the IEEE International System-on-Chip (SoC) Conference. He has participated multiple times on the Technical Program Committee of the IEEE International SoC Conference, the International Symposium on Low-Power Electronics and Design, the IEEE Asian Solid-State Circuits Conference, and the IEEE International Solid-State Circuits Conference.

Joo-Hyung Chae (Member, IEEE) received the B.S. and Ph.D. degrees in electrical engineering from Seoul National University, Seoul, South Korea, in 2012 and 2019, respectively.
In 2013, he joined the Department of LPDDR Memory Design, SK Hynix, Icheon, South Korea, as an Intern. From 2019 to 2021, he was with SK Hynix, where he worked on GDDR memory design. In 2021, he joined Kwangwoon University, Seoul, where he is currently an Assistant Professor of electronics and communications engineering. His research interests are the design of high-speed and low-power I/O circuits, clocking circuits, memory interfaces, and mixed-signal in-memory computing.
Dr. Chae received the Doyeon Academic Award from the Inter-University Semiconductor Research Center, Seoul National University, in 2020.
