A Lean, Low Power, Low Latency DRAM
Memory Controller for Transprecision
Computing
Chirag Sudarshan1 , Jan Lappas1 , Christian Weis1 , Deepak M. Mathew1 ,
Matthias Jung2 , and Norbert Wehn1
1 Technische Universität Kaiserslautern, Germany
{sudarshan,lappas,weis,deepak,wehn}@eit.uni-kl.de
2 Fraunhofer Institute for Experimental Software Engineering (IESE), Germany
[email protected]
Abstract. Energy consumption is one of the major challenges for advanced Systems on Chip (SoCs). It is addressed by adopting heterogeneous and approximate computing techniques. One of the recent evolutions in this context is the transprecision computing paradigm. The idea of transprecision computing is to spend an adequate amount of energy on each operation by performing dynamic precision reduction. The memory subsystem plays a crucial role in such systems. Hence, the energy efficiency of a transprecision system can be further optimized by tailoring the memory subsystem to transprecision computing. In this work, we present a lean, low power, low latency memory controller that is suited to the transprecision methodology. The memory controller consumes an average power of 129.33 mW at a frequency of 500 MHz and has a total area of 4.71 mm2 in a UMC 65 nm process.

Keywords: DRAM · DDR3 · Memory Controller · Transprecision · PHY.
1 Introduction
Approximate computing has been recognized as an effective technique to overcome the energy scaling barrier of computing systems by compromising the accuracy of results. Inspired by approximate computing, transprecision computing [1] has emerged as a new computing paradigm that applies dynamic precision reduction to the intermediate computation stages in order to achieve higher energy efficiency without introducing any errors in the final output. In other words, the accuracy of the final result is the same as that of traditional full-precision computing. The dynamic precision reduction is achieved by spanning all layers of the computing system (i.e. from the algorithm down to specifically tuned hardware that supports a variety of precision settings) and offering multiple control loops across these layers. The objective of the H2020 European project OPRECOMP is to implement transprecision computing platforms for applications ranging from Internet-of-Things (mW platforms, ASIC) to High Performance Computing (kW platforms, FPGA).
The OPRECOMP project uses the open PULP (Parallel Ultra-Low-Power Processing) platform [2], which is implemented using RISC-V cores for energy-efficient computing. RISC-V is an open Instruction Set Architecture (ISA) that has gained wide recognition across industry and academia. The high degree of customization offered by the RISC-V architecture enables the design of energy-efficient transprecision computation units. However, the ASIC implementation of PULP is bound to low-density memory devices like Static Random Access Memories (SRAMs), while many applications demand larger memory footprints, i.e. external memories. The most prominent external memories are Dynamic Random Access Memories (DRAMs). DRAMs require specialized circuitry, called a memory controller, to manage the complex protocol. These memory controllers are designed for general-purpose use and are not available as open hardware architectures. Additionally, DRAMs consume a major portion of the overall system power and degrade the system performance due to their long latency. This is partially caused by DRAM operations such as refresh, activation and precharge (refer to Section 2), which contribute high energy [3] and result in long data access latencies [4]. The impact of refresh further increases for the next generation of high-density DRAM devices (64 Gb devices) [5, 6].
In this work we focus on the mW platform, which has a memory channel consisting of a single DDR3 DRAM device (×8 device). We present a DDR3 memory controller that includes several advanced features and optimizes the memory subsystem for transprecision computing.
One of the fundamental features is to adapt the idea of approximate computing to the DRAM (i.e. Approximate DRAM) [7–9]. The key knob for approximating the DRAM is to vary the refresh rate in order to trade off energy efficiency, performance and reliability. This technique allows the processing units to store the application data in an appropriate refresh/reliability zone (ranging from no refresh to a high refresh rate) without incurring any computation errors. For example, data is stored in the no-refresh zone if its lifetime is shorter than the required refresh period of the DRAM [10]. A typical approximate DRAM employs a fine granular refreshing technique in order to refresh the different reliability zones at different rates. The most favourable approach to realize a fine granular refresh is the Optimized Row Granular Refresh (ORGR) methodology [11]. The authors of [11] present a reverse engineering method to determine the minimum DRAM timings, which are unknown to the user, in order to realize a fine granular refresh that is as effective as, but more flexible than, the conventional DRAM Auto-Refresh.
The second important feature is to exploit application knowledge. General-purpose memory controllers are confined to online scheduling techniques that only have a local view of the executed application. However, numerous applications feature deterministic memory access patterns, which can be exploited to improve bandwidth and energy. The authors of [4] present a methodology to generate an Application-Specific Address Mapping (ASAM), which has a global view of the application and exploits the application knowledge to optimally map the data to DRAM locations such that the number of row misses, i.e. the number of precharge and activate operations, decreases. The authors of [4] showed up to 9× and 8.6× improvements in bandwidth utilization and energy efficiency by employing this technique.
However, all the previous works discussed above (i.e. Approximate DRAM, ORGR, ASAM) presented their proofs of concept mainly by simulation. To the best of our knowledge, our memory controller (i.e. frontend + physical layer, or PHY) is the first to combine all the previously discussed advanced techniques. The designed DDR3 memory controller will be integrated with the PULP cluster to demonstrate the advantage of transprecision computing for IoT/embedded applications (mW platform). The key features of our memory controller are as follows:
– Lean, low latency and low power, optimized for embedded systems that apply
the transprecision computing methodology.
– Enables fine granular refresh control using ORGR.
– Supports the exploitation of application knowledge using ASAM.
– Scalable and robust PHY design with an All-Digital-DLL (AD-DLL) that
uses glitch-free delay-lines without special filters.
The paper is structured as follows: Section 2 gives a brief overview of DRAM and its operation. We present our implementation of the transprecision memory controller and the PHY in Section 3 and Section 4, respectively. Section 5 discusses the post-layout results and the power estimation. Finally, the paper is concluded in Section 6.
2 DRAM Background
In this section we introduce the basic terminology and the operation of a DRAM device. DRAM devices are organized as a set of memory banks (e.g. eight) that contain the memory arrays. The banks operate concurrently (bank parallelism) with some constraints on data access due to the shared data and command/address bus. Accessing data from the DRAM is a two-step process. First, the activate command (ACT) must be issued to the row of a certain bank. Then, a column access (CAS), i.e. a read (RD) or write command (WR), is executed to read or write data from/to the specific column. The ACT command opens an entire row of the memory array and buffers it in the Primary Sense Amplifiers, which mimic a small cache, often called the row buffer. If a memory access targets the same row as the currently cached row (called a row hit), it results in a low latency and low energy memory access. Whereas, if a memory access targets a different row than the currently activated row (called a row miss), it results in higher latency and energy consumption. If a certain row in a bank is active, it must be precharged (PRE) before activating another row in the same bank. In addition to the normal RD and WR commands, there exist CAS commands with an integrated auto-precharge (RDA, WRA). If auto-precharge is selected, the row
being accessed will be precharged at the end of the read or write access. Furthermore, the DRAM device is issued an Auto-Refresh (AREF) command every tREFI interval, which internally performs the refresh operation. Table 1 shows the key timings of a DDR3 DRAM device and their values as defined in [12].

Table 1: Key Timings for a DDR3-800D Device

Name  | Value  | Explanation
tRCD  | 5 clk  | Row-to-Column Delay: the time interval between ACT and RD on the same bank.
tRAS  | 15 clk | Row Active: the minimum active time for a row.
tRTP  | 4 clk  | Read-to-Precharge Delay: the time interval between a RD and a PRE command on the same bank.
tWR   | 6 clk  | Write Recovery: the minimum time interval between the end of a WR burst and a PRE command.
tRP   | 5 clk  | Row Precharge: the time interval between PRE and ACT on the same bank.
tRRD  | 4 clk  | Row-to-Row Delay: the minimum time interval between two consecutive ACT commands to different banks.
tCCD  | 4 clk  | Column-to-Column Delay: the minimum time interval between two consecutive WR or RD commands.
tWTR  | 4 clk  | Write-to-Read: the minimum time interval between the end of a WR burst and a RD command.
RL    | 5 clk  | Read Latency: the delay between the RD command and the availability of the first RD data burst on the DRAM data interface.
WL    | 5 clk  | Write Latency: the delay between the WR command and the first WR data burst on the DRAM data interface.
tRTW  | 6 clk  | Read-to-Write: the minimum time interval between a RD and a subsequent WR command; tRTW = RL + tCCD + 2 clk - WL.
tRFC  | 110 ns | Refresh Cycle Time: the minimum time interval between a refresh command and any valid command.
tREFI | 7.8 us | Refresh Interval: the average time interval between consecutive refresh commands.
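To make the timing relations concrete, the following sketch (an illustration of ours, not code from the paper) encodes the DDR3-800D values of Table 1 in C and checks the tRTW relation given there:

/* Illustrative only: DDR3-800D timing set from Table 1, with tRTW following
 * the relation tRTW = RL + tCCD + 2 clk - WL stated in the table. */
#include <assert.h>

typedef struct {
    int tRCD, tRAS, tRTP, tWR, tRP;   /* bank timings, in DRAM clock cycles */
    int tRRD, tCCD, tWTR, tRTW;       /* bus timings, in DRAM clock cycles  */
    int RL, WL;                       /* read/write latency, in cycles      */
} ddr3_timings_t;

static const ddr3_timings_t ddr3_800d = {
    .tRCD = 5, .tRAS = 15, .tRTP = 4, .tWR = 6, .tRP = 5,
    .tRRD = 4, .tCCD = 4,  .tWTR = 4, .tRTW = 6,
    .RL = 5,   .WL = 5,
};

int main(void) {
    /* tRTW = RL + tCCD + 2 - WL = 5 + 4 + 2 - 5 = 6 clock cycles */
    assert(ddr3_800d.tRTW == ddr3_800d.RL + ddr3_800d.tCCD + 2 - ddr3_800d.WL);
    return 0;
}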
3 Memory Controller Architecture
Fig. 1 shows the architecture of the transprecision memory controller. In this section, we describe the architecture of the frontend. It is designed to satisfy the mW platform requirements, i.e. low power, low area and low latency. The frontend-to-PHY clock frequency ratio is 1:4, similar to state-of-the-art memory controllers such as [13, 14]. This allows the frontend to operate at a lower clock frequency, satisfying the timing constraints and consuming less power. In order to compensate for the frequency difference and avoid stalling the PHY, the frontend issues 4× DRAM commands/addresses (i.e. the commands/addresses corresponding to the next 4 PHY cycles) to the PHY.

Fig. 1: Transprecision Memory Controller Architecture (frontend with Host Interface, Register File, ASAM, command buffers, bank machines, ORGR, write/read data buffers, command multiplexer and init logic, connected via the PHY to a DDR3 ×8 device)
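A minimal sketch of what issuing 4× commands per frontend cycle can look like is given below; the struct layout and the NOP filling are assumptions of ours, not the actual frontend-to-PHY interface of this controller:

/* Hypothetical frontend-to-PHY command word for a 1:4 clock ratio: one frontend
 * cycle carries the DRAM commands/addresses for the next four PHY cycles.
 * Unused slots are filled with NOP. */
#include <stdint.h>

typedef enum { CMD_NOP, CMD_ACT, CMD_RD, CMD_WR, CMD_PRE, CMD_AREF } dram_cmd_t;

typedef struct {
    dram_cmd_t cmd;
    uint8_t    bank;     /* B2-B0              */
    uint16_t   row;      /* R15-R0 (for ACT)   */
    uint16_t   col;      /* C10-C0 (for RD/WR) */
} dram_slot_t;

typedef struct {
    dram_slot_t slot[4]; /* slot[i] is executed by the PHY in PHY cycle i */
} phy_cmd_word_t;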
As mentioned in Section 1, the ASAM technique yields better energy and bandwidth results than an online scheduler for applications with deterministic memory access patterns. Additionally, an online scheduler block requires large buffers for reordering the incoming requests and introduces a very high area overhead and latency penalty. Hence, this architecture does not integrate an online scheduler but rather employs an ASAM block. The ASAM block is a dedicated address decoder that translates the incoming address bits from the Host Interface (HI), e.g. an AXI interface, into the equivalent DRAM row, column and bank addresses according to the configured custom address map. The ASAM block incorporates configurable address scrambling hardware as shown in Fig. 2. The incoming logical address from the HI is typically 32 bit, of which only 30 bits are valid since the maximum density of a DDR3 device is 8 Gb. Note that the addresses from the HI are byte addresses and the requests are at the granularity of a cache line (i.e. 8 bytes). The lower 3 bits of the HI address are directly mapped to the lower 3 bits of the DRAM column address (C2-C0). The remaining 27 bits of the HI address are scrambled by the 27× 27-bit multiplexers to determine the DRAM addresses, i.e. bank (B2-B0), row (R15-R0) and column (C10-C3). A typical general-purpose memory controller also supports multiple HIs and has an N-port arbiter to prioritize the incoming requests and deliver them to the scheduler. However, such an arbiter would add further resources and latency to the command processing. Thus, our controller is integrated with only one HI, which is sufficient for a typical embedded processor like the mW platform.
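The following sketch illustrates the kind of configurable bit scrambling the ASAM block performs; the data structures and the function name are ours, while the real block is a hardware multiplexer network configured via the register shown in Fig. 2:

/* Hypothetical software model of the ASAM address scrambler: the lower 3 HI
 * address bits map directly to C2-C0, and each of the remaining 27 DRAM
 * address bits (bank B2-B0, row R15-R0, column C10-C3) selects one of the
 * 27 upper HI address bits according to a configurable map. */
#include <stdint.h>

typedef struct {
    uint8_t sel[27];       /* sel[i]: which upper HI bit (0..26) drives DRAM bit i */
} asam_cfg_t;

typedef struct {
    uint8_t  bank;         /* B2-B0  */
    uint16_t row;          /* R15-R0 */
    uint16_t col;          /* C10-C0 */
} dram_addr_t;

static dram_addr_t asam_translate(uint32_t hi_addr, const asam_cfg_t *cfg)
{
    uint32_t upper = (hi_addr >> 3) & 0x07FFFFFF;    /* 27 scrambled HI bits */
    uint32_t dram_bits = 0;                          /* {B2..B0, R15..R0, C10..C3} */

    for (int i = 0; i < 27; i++)
        dram_bits |= ((upper >> cfg->sel[i]) & 1u) << i;

    dram_addr_t a;
    a.col  = (uint16_t)(((dram_bits & 0xFFu) << 3) | (hi_addr & 0x7u)); /* C10-C3 | C2-C0 */
    a.row  = (uint16_t)((dram_bits >> 8) & 0xFFFFu);                    /* R15-R0 */
    a.bank = (uint8_t)((dram_bits >> 24) & 0x7u);                       /* B2-B0  */
    return a;
}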
Fig. 2: ASAM Architecture (27× 27-bit multiplexers, controlled by a 138-bit configuration register, scramble the upper HI address bits onto the DRAM bank, row and column address bits)

The translated addresses and the corresponding HI command (i.e. read or write) are forwarded to one of the eight command buffers (FIFOs) depending on the bank address. The Bank Machines (BMs) consecutively process the incoming traffic associated with their respective DRAM banks. A BM keeps track of the currently active row of its bank and translates the incoming transactions into a sequence of DRAM commands. The command sequence depends on the current state of the bank and the target row of the incoming transaction.
The BM also guarantees all bank-specific timings such as tRCD, tRTP, tWR, tRAS and tRP. The Command Multiplexer (Cmd Mux) prioritizes the DRAM commands from the multiple BMs, ensures that the bus-related (inter-bank) timings such as tWTR, tRTW and tRRD are maintained, and packs 4× DRAM commands/addresses. The Init block handles the initialization sequence of the DRAM as specified in the DDR3 specification [12]. Until the initialization is finished, the rest of the memory controller is stalled. The write data from the HI is stored in the write buffer and forwarded to the PHY along with its corresponding write command. The read data, which arrives RL clock cycles (i.e. DRAM/PHY clock cycles) later, is stored temporarily in the read buffer and forwarded to the processing unit via the HI. The data bus width of the frontend is 64 bits, i.e. DRAM Burst Length × DQ width (8 × 8). The configuration of the DRAM timings, mode register settings and other internal parameters required by the frontend and the PHY is done via the 8-bit Configuration Bus (Config Bus), which employs a custom protocol.
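As an illustration of the bank-machine decision described above (a pseudocode-style sketch of ours, not the controller's RTL), a transaction expands into a command sequence depending on whether it hits the currently open row:

/* Hypothetical bank-machine expansion of one transaction into DRAM commands.
 * Row hit: CAS only. Row miss with an open row: PRE, ACT, CAS. Bank idle:
 * ACT, CAS. Timings such as tRCD/tRAS/tRP/tRTP/tWR are enforced separately. */
#include <stdbool.h>
#include <stdint.h>

typedef enum { BM_ACT, BM_RD, BM_WR, BM_PRE } bm_cmd_t;

typedef struct {
    bool     row_open;
    uint16_t open_row;
} bank_state_t;

static int bm_expand(bank_state_t *b, uint16_t row, bool is_write, bm_cmd_t out[3])
{
    int n = 0;
    if (b->row_open && b->open_row != row)     /* row miss: close the open row */
        out[n++] = BM_PRE;
    if (!b->row_open || b->open_row != row) {  /* bank idle or row miss: activate */
        out[n++] = BM_ACT;
        b->row_open = true;
        b->open_row = row;
    }
    out[n++] = is_write ? BM_WR : BM_RD;       /* a row hit reaches this point directly */
    return n;                                  /* number of commands produced */
}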
The ORGR block manages the DRAM refresh operation using the optimized fine granular refresh technique. The ORGR block consists of a set of counters that track the refresh intervals of the different DRAM reliability zones. When a counter expires, the ORGR block sends the row addresses to be refreshed to the corresponding BMs. The respective BMs stall their further transactions and service the ORGR request with a sequence of ACT and PRE commands (i.e. fine granular refresh) using the reduced DRAM timings. These ACT and PRE commands are given the highest priority by the Cmd Mux. The reverse engineering methodology presented in [11] to identify the minimum timings unknown to the user is executed during the initialization. However, it is not triggered at every initialization, because the identified minimum timings in most cases remain unchanged for the entire lifetime of the device. Hence, it is triggered only occasionally; for the remaining initializations, the last known minimum timing values are configured from software via the Config Bus. The Config Bus is also used to define the DRAM reliability zones and their respective refresh intervals.
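A minimal sketch of the per-zone refresh bookkeeping described above follows; the zone boundaries, the counter granularity and the callback interface are assumptions of ours:

/* Hypothetical ORGR bookkeeping: each reliability zone is a row range with its
 * own refresh interval; when a zone's down-counter expires, the zone's rows are
 * handed to the corresponding bank machines as ACT/PRE (fine granular refresh)
 * requests and the counter is reloaded. */
#include <stdint.h>

#define MAX_ZONES 4

typedef struct {
    uint32_t first_row, last_row;   /* zone boundaries (configured via Config Bus) */
    uint32_t interval_cycles;       /* refresh interval; 0 = no-refresh zone       */
    uint32_t counter;               /* down-counter in controller clock cycles     */
} orgr_zone_t;

/* Called once per controller clock cycle; issue_refresh() stands in for pushing
 * the zone's row addresses to the bank machines. */
static void orgr_tick(orgr_zone_t zones[MAX_ZONES],
                      void (*issue_refresh)(uint32_t first_row, uint32_t last_row))
{
    for (int i = 0; i < MAX_ZONES; i++) {
        if (zones[i].interval_cycles == 0)
            continue;                                    /* no-refresh zone */
        if (zones[i].counter == 0) {
            issue_refresh(zones[i].first_row, zones[i].last_row);
            zones[i].counter = zones[i].interval_cycles; /* reload */
        } else {
            zones[i].counter--;
        }
    }
}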
4 DDR3 Physical Layer (PHY) Architecture
This new PHY is designed with a focus on simplicity and robustness. All full-custom atomic components are designed to be massively reused across the design in order to decrease design time, implementation time and time to test. The architecture of our DDR3 PHY is shown in Fig. 3. The PHY consists of two major blocks:
– Data Bus (see Fig. 3, block 1)
– Address/Command (ADDR/CMD) Bus (see Fig. 3, block 2)
Fig. 3: DDR3-PHY Architecture (Data Bus Logic with the AD-DLL (Master DLL), Off-Chip Driver Calibration and Data Transceivers; ADDR/CMD Logic with ADDR/CMD Drivers; interfaces to the frontend and the DDR3 device)
The bidirectional Data Bus consists of nine Data Transceivers, eight for data (DQ[7:0]) and one for the differential data strobe (DQS, nDQS). Each transceiver is implemented by seven parallel push-pull drivers, where each of them has its pull-up (PMOS) and pull-down (NMOS) path calibrated by the Off-Chip Driver Calibration Unit to Rdri = 240 Ω. The push-pull driver transistors are built with thick-oxide transistors (60 Å gate oxide thickness). This allows a simple and robust ESD protection of the driver transistors and avoids the complicated and fragile stacked transistor design with thin-oxide transistors as presented in [15]. The seven parallel push-pull drivers can be individually selected to set the required on-chip-termination value and to implement different driving strengths depending on write or read operations. The on-chip-termination impedance is then (Rdri / Ndri-sel) / 2 and the off-chip-driver impedance is Rdri / Ndri-sel, where Ndri-sel is the number of selected drivers. The overall driver architecture is based on [16].
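The relation between the number of enabled driver legs and the resulting impedances can be illustrated with a small sketch; the 240 Ω leg value comes from the text, while the helper names are ours:

/* Sketch (not from the paper): driver and termination impedance as a function
 * of the number of selected push-pull driver legs, following the relations
 * given in the text: Z_drv = R_dri / N_sel and Z_odt = (R_dri / N_sel) / 2. */
#include <stdio.h>

#define R_DRI_OHM 240.0   /* each driver leg calibrated to 240 Ohm */

static double driver_impedance(int n_sel)      { return R_DRI_OHM / n_sel; }
static double termination_impedance(int n_sel) { return (R_DRI_OHM / n_sel) / 2.0; }

int main(void) {
    /* e.g. 6 of the 7 legs enabled -> 40 Ohm driver, 20 Ohm termination */
    for (int n = 1; n <= 7; n++)
        printf("N=%d: driver %.1f Ohm, termination %.1f Ohm\n",
               n, driver_impedance(n), termination_impedance(n));
    return 0;
}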
The receivers for the data transceivers are single-ended receivers with their reference pin connected to the voltage Vref (for DDR3, Vref = VDD/2 = 0.75 V). The data strobe receiver is implemented as a fully differential amplifier with a common-mode voltage equal to Vref. All receivers are biased locally to simplify the global routing of the analog signals. To reduce the power consumption, the bias circuits
are shared between two receivers.

Fig. 4: DDR3-PHY Data Bus: (a) Write Path DDR3 PHY, (b) Read Path DDR3 PHY

This PHY architecture also implements the
90° phase-shift delay needed in DDR interfaces to center-align the data strobe signals (DQS, nDQS) to the data signals (DQ), using a master-slave DLL configuration. After power-up, the Master DLL performs a 360° lock to the internal clock. After locking, the Data Bus Logic broadcasts the new configuration values from the Master DLL to the Slave DLLs. The Slave DLLs are built using a replica delay line of the Master DLL that represents a quarter of the Master DLL's delay. The Slave DLLs are placed inside the read and write paths of the Data Bus (see Fig. 4). The ADDR/CMD Bus consists of 26 single-data-rate drivers (ADDR, nCS, nWE, ...) and two clock drivers (CLK, nCLK). These drivers use the same driver topology as the Data Bus drivers, but in a single-data-rate configuration. This allows reusing the same impedance control values that are broadcast by the Data Bus. The ADDR/CMD Bus is directly controlled by the frontend of the memory controller. The ADDR/CMD Logic is only a thin abstraction layer that implements the serialization of the inputs and the configuration of the drivers.
4.1 All Digital Delay Locked Loop (AD-DLL)
An AD-DLL is selected for this DDR3 PHY due to its robustness and good scaling in deep-sub-micrometer CMOS processes. This enables a fast design time and lower complexity compared to analog counterparts [17]. The AD-DLL is composed of the following main components (see Fig. 5):
– Phase Frequency Detector (PFD),
– DLL Controller,
– four digitally controlled delay lines (DCDL) with fine and coarse delay control.
An abstracted version of the Phase Frequency Detector used in this DLL is shown in Fig. 7. The PFD compares the leading edges of the reference clock (clk in) with the delayed clock (the clock coming from the digitally controlled delay lines). Depending on whether clk in or the delayed clock is leading, a small pulse is generated on the internal up int or down int port. To enable the digital DLL Controller to detect these pulses, an RS-NAND latch is used to hold the last status. This kind of PFD is common for all-digital DLLs, but it has the major drawback of being very susceptible to glitches at its inputs: missing clock edges due to glitches cause faulty detection. This AD-DLL solves the problem by using digitally controlled delay lines (DCDL) that do not generate glitches when switched to a different delay value. The coarse delay of the DCDL is constructed using a 32-element glitch-free NAND-based delay element (DE) structure with the special three-step switching scheme proposed in [18] (see Fig. 6a). The fine delay is implemented by two standard inverters and an RC delay whose capacitor is trimmable with a 4-bit resolution. The trimmable capacitor is built out of MOSFETs whose drain and source are connected to the signal. By switching the gate to VDD or to VSS, the capacitor changes its value.
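As a rough illustration of the control loop described above (a sketch of ours, not the RTL of the presented controller; the names, the lock criterion and the up/down polarity are assumptions), the DLL controller can be viewed as a bounded up/down search over the 9-bit delay code driven by the latched PFD outputs; after lock, the code is broadcast to the Slave DLLs, which use a quarter of the Master delay for the 90° shift:

/* Hypothetical AD-DLL control step: the latched PFD outputs steer the 9-bit
 * delay-control code (coarse + fine) until the delayed clock is aligned with
 * the reference (360 degree lock). */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint16_t delay_code;   /* delay cntrl[8:0] */
    bool     locked;
} ad_dll_t;

static void ad_dll_step(ad_dll_t *dll, bool pfd_up, bool pfd_down)
{
    if (pfd_up == pfd_down) {      /* no clear lead/lag indication: keep the code */
        dll->locked = true;
        return;
    }
    dll->locked = false;
    if (pfd_up && dll->delay_code < 0x1FF)
        dll->delay_code++;         /* delayed clock assumed early: add delay    */
    else if (pfd_down && dll->delay_code > 0)
        dll->delay_code--;         /* delayed clock assumed late: remove delay  */
}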
Fig. 5: All Digital Delay Locked Loop Architecture (PFD, DLL Controller with 9-bit delay control, and four digitally controlled delay lines generating clk 90, clk 180, clk 270 and clk 360)

Fig. 6: Coarse and fine delay elements: (a) coarse delay [18], (b) fine delay with trimmable capacitor

Fig. 7: Phase Frequency Detector (PFD) with Latch Output
5 Results
The memory controller is implemented in the UMC 65 nm Low-Leakage CMOS bulk technology. The synthesis, place and route, and timing/power analysis of the digital logic are carried out using Synopsys Design Compiler, IC Compiler and PrimeTime, respectively. The circuit-level simulations and the full-custom layout of the analog blocks of the PHY are done using Cadence Spectre and Virtuoso. The power estimation of the DRAM device is performed using Micron's Power Calculator [19]. DRAM models provided by multiple vendors and the bit-true model of our PHY are used for the functional verification (post-layout) of the DDR3 controller and the PHY. Fig. 8 shows the floor plan and layout of our memory controller designed for the mW platform. The pin pitch of the data IOs is 200 µm and the ADDR/CMD IOs have a pin pitch of 100 µm. The area distribution of the memory controller, consisting of the frontend, the PHY digital logic and the PHY IO transceivers, is shown in Table 2. The core of the memory controller, i.e. the frontend, is extremely lean, consuming only 2% of the total chip area.
Table 2: Area Distribution of the Controller and PHY

Component          | Area
PHY-IO transceiver | 3.820 mm2
PHY-Digital        | 0.551 mm2
Frontend           | 0.339 mm2
The post-layout results show that the DDR3 controller achieves a performance of 533 MHz (PHY clock), leading to a data rate of 1066 Mbit/pin/s under worst-case conditions (i.e. slow process, low VDD and high temperature). The peak frequency of the design is limited by the package (QFN64) used for the mW demonstrator. The controller has a very low latency of 3 frontend cycles plus 1 PHY clock for processing the host interface (AXI4) requests and delivering the associated commands to the DRAM.
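For orientation (a back-of-the-envelope figure of ours, not stated in the paper): with the 1:4 frontend-to-PHY clock ratio and a 500 MHz PHY clock, the frontend runs at 125 MHz, so 3 frontend cycles + 1 PHY cycle correspond to roughly 3 × 8 ns + 2 ns = 26 ns of command-path latency.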
The power consumption of a single I/O driver subjected to a 100% toggling rate is 6.0 mW at 166 MHz and 18.0 mW at 500 MHz. Similarly, a single data receiver consumes 3.75 mW and 5.5 mW at these frequencies.

Fig. 8: Floor Plan (PHY-IO ADDR/CMD block, PHY-IO DATA block with DQ transceivers, AD-DLL, DLL controller and receivers, and the controller core logic)

The power estimation of the memory controller is done using a random trace (for worst-case estimation) with
an equal number of reads and writes as input to the controller, resulting in a data bus utilization of 60%. Table 3 shows the distribution of the estimated power at 166 MHz and 500 MHz. The power consumed by the ADDR/CMD block does not differ substantially from the DATA block power and the DRAM device power. This holds only for a memory subsystem with a single DRAM device; in a typical SO-DIMM architecture the DRAM power and the PHY-IO DATA power would be predominant. The contribution of the DRAM power to the total power is low (i.e. it is not the major contributor) due to the relatively good data bus utilization of 60% (less overhead, i.e. a low number of row misses) and because only a single DRAM device was used. Note that the vendors of commercially available DDR3 controllers do not disclose power and performance values of their IPs.
Table 3: Power Distribution of the Controller and PHY

Component              | Power at 166 MHz | Power at 500 MHz
Frontend + PHY Digital | 6.312 mW         | 18.764 mW
PHY-IO ADDR/CMD block  | 19.98 mW         | 59.96 mW
PHY-IO DATA block      | 15.96 mW         | 50.61 mW
DRAM Device (2 Gb)     | 24.0 mW          | 67.0 mW
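Summing the three controller contributions at 500 MHz (18.764 mW + 59.96 mW + 50.61 mW) yields approximately 129.33 mW, which matches the average controller power quoted in the abstract; the DRAM device power is reported separately.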
6 Conclusion
In this work, we presented a memory controller tailored for IoT/embedded applications that leverage the transprecision computing methodology. This memory controller adopts several advanced techniques, such as approximate DRAM, a sophisticated refresh policy, and optimal address mapping and data placement by exploiting application knowledge. These techniques allow the energy and performance optimization of DRAM subsystems. Experimental results show that the memory controller design is lean, low latency and low power. Furthermore, the presented DDR3 PHY design is scalable, low-complexity and robust even under worst-case corner conditions (i.e. slow process, low VDD and high temperature). Finally, with this memory controller design we enable open hardware platforms, such as RISC-V-based systems, to integrate external DRAM devices.
Acknowledgment
The project OPRECOMP acknowledges the financial support of the Future and Emerging Technologies (FET) programme within the European Union's Horizon 2020 research and innovation programme, under grant agreement No. 732631 (http://www.oprecomp.eu). This work was also supported by the Fraunhofer High Performance Center for Simulation- and Software-based Innovation.
References
1. A. C. I. Malossi, et al. The transprecision computing paradigm: Concept, design, and applications. In 2018 Design, Automation & Test in Europe Conference & Exhibition (DATE), pages 1105–1110, March 2018.
2. D. Rossi, et al. Energy-Efficient Near-Threshold Parallel Computing: The PULPv2
Cluster. IEEE Micro, 37(5):20–31, Sep. 2017.
3. Christian Weis, et al. DRAMSpec: A High-Level DRAM Timing, Power and Area
Exploration Tool. International Journal of Parallel Programming, 45(6):1566–1591,
Dec 2017.
4. Matthias Jung, et al. ConGen: An Application Specific DRAM Memory Controller
Generator. In Proceedings of the Second International Symposium on Memory
Systems, MEMSYS ’16, pages 257–267, New York, NY, USA, 2016. ACM.
5. Ishwar Bhati, et al. Flexible auto-refresh: enabling scalable and energy-efficient
DRAM refresh reductions. In Proceedings of the 42nd Annual International Symposium on Computer Architecture, pages 235–246. ACM, 2015.
6. Jamie Liu, et al. RAIDR: Retention-Aware Intelligent DRAM Refresh. In Proceedings of the 39th Annual International Symposium on Computer Architecture,
ISCA ’12, pages 1–12, Washington, DC, USA, 2012. IEEE Computer Society.
7. Matthias Jung, et al. Approximate Computing with Partially Unreliable Dynamic
Random Access Memory - Approximate DRAM. In Proceedings of the 53rd Annual
Design Automation Conference, DAC ’16, pages 100:1–100:4, New York, NY, USA,
2016. ACM.
8. Jan Lucas, et al. Sparkk: Quality-Scalable Approximate Storage in DRAM. In The
Memory Forum, June 2014.
9. Song Liu, et al. Flikker: Saving DRAM Refresh-power Through Critical Data Partitioning. SIGPLAN Not., 46(3):213–224, March 2011.
10. Matthias Jung, et al. Omitting Refresh - A Case Study for Commodity and Wide
I/O DRAMs. In 1st International Symposium on Memory Systems (MEMSYS
2015), Washington, DC, USA, October 2015.
11. Deepak M. Mathew, et al. Using Run-Time Reverse-Engineering to Optimize
DRAM Refresh. In International Symposium on Memory Systems (MEMSYS17),
2017.
12. Jedec Solid State Technology Association. DDR3 SDRAM (JESD 79-3), 2012.
13. Cadence Inc. Cadence Denali DDR Memory IP. http://ip.cadence.com/ipportfolio/ip-portfolio-overview/memory-ip/ddr-lpddr, October 2014, last access 18.02.2015.
14. Synopsys, Inc. DesignWare DDR IP. http://www.synopsys.com/IP/InterfaceIP/DDRn/Pages/, 2015, last access 18.02.2015.
15. X. Fan et al. ESD protection circuit schemes for DDR3 DQ drivers. In Electrical
Overstress/Electrostatic Discharge Symposium Proceedings 2010, pages 1–6, Oct
2010.
16. C. Yoo, et al. A 1.8 V 700 Mb/s/pin 512 Mb DDR-II SDRAM with on-die termination and off-chip driver calibration. In 2003 IEEE International Solid-State Circuits
Conference, 2003. Digest of Technical Papers. ISSCC., pages 312–496 vol.1, Feb
2003.
17. S. Chen, et al. An all-digital delay-locked loop for high-speed memory interface applications. In Technical Papers of 2014 International Symposium on VLSI Design,
Automation and Test, pages 1–4, April 2014.
18. D. De Caro. Glitch-Free NAND-Based Digitally Controlled Delay-Lines. IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 21(1):55–66, Jan
2013.
19. Micron. DDR3 SDRAM System Power Calculator, July 2011. Last access 2014-07-03.