Computer Organization & Design: The Hardware/Software Interface

Presentation Outline
- Random Access Memory and its Structure
- Memory Hierarchy and the Need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
- Improving Cache Performance
- Multilevel Caches

Random Access Memory
- Large arrays of storage cells
- Volatile memory: holds the stored data only as long as it is powered on
- Random access: access time is practically the same to any data on a RAM chip
- Output Enable (OE) control signal: specifies a read operation
- Write Enable (WE) control signal: specifies a write operation
- A 2^n x m RAM chip has an n-bit address and m-bit data
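As a quick illustration of the 2^n x m organization, the sketch below computes a chip's capacity from its address and data widths; the function name and the 22-bit x 4 example are illustrative (the example happens to match the 16 Mbit package described later), not taken from the slides.

```python
def ram_chip_capacity(n_address_bits: int, m_data_bits: int) -> int:
    """Capacity in bits of a 2^n x m RAM chip: 2^n addressable words, m bits each."""
    return (1 << n_address_bits) * m_data_bits

# Illustrative example: n = 22 address bits, m = 4 data bits -> 2^22 x 4 = 16 Mbit
bits = ram_chip_capacity(22, 4)
print(bits, "bits =", bits // (1024 * 1024), "Mbit")
```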
Memory Technology
- Static RAM (SRAM), used for caches
  - Requires 6 transistors per bit
  - Requires low power to retain the bit
- Dynamic RAM (DRAM), used for main memory
  - One transistor + capacitor per bit
  - Must be re-written after being read
  - Must also be periodically refreshed; each row can be refreshed simultaneously
  - Address lines are multiplexed
    - Upper half of the address: Row Access Strobe (RAS)
    - Lower half of the address: Column Access Strobe (CAS)

Static RAM Storage Cell
- Static RAM (SRAM): fast but expensive RAM
- 6-transistor cell, typically used for caches
- Provides fast access time
- Cell implementation:
  - Cross-coupled inverters store the bit
  - Two pass transistors
  - The row decoder selects the word line
  - The pass transistors enable the cell to be read and written

Dynamic RAM Storage Cell
- Dynamic RAM (DRAM): slow, cheap, and dense memory
- Typical choice for main memory
- Cell implementation:
  - 1-transistor cell (pass transistor)
  - Trench capacitor (stores the bit)
- The bit is stored as a charge on the capacitor
- Must be refreshed periodically because of charge leakage from the tiny capacitor
- Refreshing covers all memory rows: each row is read and written back to restore the charge

Typical DRAM Packaging
- 24-pin dual in-line package for a 16 Mbit = 2^22 x 4 memory
- The 22-bit address is divided into:
  - 11-bit row address
  - 11-bit column address
  - Interleaved on the same address lines

Typical Memory Structure
- Row decoder: selects the row to read/write
- Column decoder: selects the column to read/write
- Cell matrix: 2D array of tiny memory cells
- Sense/write amplifiers
  - Sense and amplify data on a read
  - Drive the bit lines with data-in on a write
- The same data lines are used for data in/out
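To make the multiplexed addressing concrete, here is a minimal Python sketch (illustrative, not from the slides) that splits a 22-bit address into the 11-bit row and column halves that would be latched on RAS and CAS:

```python
ROW_BITS = 11          # upper half of the 22-bit address (latched on RAS)
COL_BITS = 11          # lower half of the 22-bit address (latched on CAS)

def split_dram_address(addr: int) -> tuple[int, int]:
    """Split a 22-bit DRAM address into (row, column) for multiplexed address lines."""
    row = (addr >> COL_BITS) & ((1 << ROW_BITS) - 1)   # upper 11 bits
    col = addr & ((1 << COL_BITS) - 1)                 # lower 11 bits
    return row, col

row, col = split_dram_address(0x2ABCD & 0x3FFFFF)      # mask to 22 bits
print(f"row = {row:#05x}, column = {col:#05x}")
```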
DRAM Operation
- Row Access (RAS)
  - Latch and decode the row address to enable the addressed row
  - A small change in voltage is detected by the sense amplifiers
  - The whole row of bits is latched
  - The sense amplifiers drive the bit lines to recharge the storage cells
- Column Access (CAS): read and write operations
  - Latch and decode the column address to select m bits
  - m = 4, 8, 16, or 32 bits depending on the DRAM package
  - On a read, send the latched bits out to the chip pins
  - On a write, charge the storage cells to the required value
  - Multiple column accesses can be performed on the same row (burst mode)

Burst Mode Operation
- Block transfer (see the sketch after this slide):
  - The row address is latched and decoded
  - A read operation causes all cells in the selected row to be read
  - The selected row is latched internally inside the SDRAM chip
  - The column address is latched and decoded
  - The selected column data is placed in the data output register
  - The column address is incremented automatically
  - Multiple data items are read, depending on the block length
- Fast transfer of blocks between memory and cache
- Fast transfer of pages between memory and disk
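The following Python sketch (an illustration with assumed names, not the slides' own material) models a burst read: the row is latched once, then consecutive columns are streamed out while the column address auto-increments:

```python
class SimpleSDRAM:
    """Toy model of burst-mode reads: latch a row once, then stream consecutive columns."""
    def __init__(self, rows: int, cols: int):
        self.cells = [[0] * cols for _ in range(rows)]
        self.row_buffer = None                 # row latched by the sense amplifiers

    def activate(self, row: int):
        self.row_buffer = self.cells[row]      # RAS: read the whole row into the latch

    def burst_read(self, start_col: int, burst_length: int):
        # CAS: the column address is incremented automatically for each beat of the burst
        return [self.row_buffer[start_col + i] for i in range(burst_length)]

mem = SimpleSDRAM(rows=8, cols=16)
mem.cells[3] = list(range(100, 260, 10))       # fill one row with sample data
mem.activate(3)                                # one row activation ...
print(mem.burst_read(start_col=4, burst_length=4))   # ... serves four consecutive reads
```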
Trends in DRAM

SDRAM and DDR SDRAM
- SDRAM is Synchronous Dynamic RAM
  - A clock was added to the DRAM interface
  - SDRAM is synchronous with the system clock
  - Older DRAM technologies were asynchronous
  - As the system bus clock improved, SDRAM delivered higher performance than asynchronous DRAM
- DDR is Double Data Rate SDRAM
  - Like SDRAM, DDR is synchronous with the system clock, but DDR transfers data on both the rising and falling edges of the clock signal

Transfer Rates & Peak Bandwidth
- 1 transfer = 64 bits = 8 bytes of data
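As a worked example of peak bandwidth (the 200 MHz bus clock below is an assumed value for illustration; the slides' own rate table is not reproduced here), DDR performs two 8-byte transfers per bus clock cycle:

```python
def peak_bandwidth_bytes_per_sec(bus_clock_hz: float, transfers_per_cycle: int,
                                 bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth = clock rate x transfers per cycle x 8 bytes per transfer."""
    return bus_clock_hz * transfers_per_cycle * bytes_per_transfer

# Assumed example: a 200 MHz memory bus
sdr = peak_bandwidth_bytes_per_sec(200e6, 1)   # SDRAM: one transfer per clock cycle
ddr = peak_bandwidth_bytes_per_sec(200e6, 2)   # DDR: rising and falling clock edges
print(f"SDRAM: {sdr/1e9:.1f} GB/s, DDR: {ddr/1e9:.1f} GB/s")   # 1.6 GB/s vs 3.2 GB/s
```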
DRAM Refresh Cycles
- The refresh cycle is about tens of milliseconds
- Refreshing is done for the entire memory
- Each row is read and written back to restore the charge
- Some of the memory bandwidth is lost to refresh cycles

Loss of Bandwidth to Refresh Cycles
- Example: a 256 Mbit DRAM chip
  - Organized internally as a 16K x 16K cell matrix
  - Rows must be refreshed at least once every 50 ms
  - Refreshing a row takes 100 ns
  - What fraction of the memory bandwidth is lost to refresh cycles?
- Solution:
  - Refreshing all 16K rows takes 16 x 1024 x 100 ns = 1.64 ms
  - So 1.64 ms is lost out of every 50 ms
  - Fraction of lost memory bandwidth = 1.64 / 50 = 3.3%
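The same refresh-overhead calculation, written out as a short script (a sketch of the slide's arithmetic, with the example's parameters hard-coded):

```python
ROWS           = 16 * 1024     # 16K rows in the 16K x 16K cell matrix
ROW_REFRESH_NS = 100           # time to refresh one row
REFRESH_PERIOD = 50e-3         # every row must be refreshed once per 50 ms

refresh_time = ROWS * ROW_REFRESH_NS * 1e-9          # seconds spent refreshing all rows
lost_fraction = refresh_time / REFRESH_PERIOD
print(f"refresh time = {refresh_time*1e3:.2f} ms")   # ~1.64 ms
print(f"lost bandwidth = {lost_fraction:.1%}")       # ~3.3%
```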
Expanding the Data Bus Width
- Memory chips typically have a narrow data bus
- We can expand the data bus width by a factor of p
  - Use p RAM chips and feed the same address to all chips
  - Use the same Output Enable and Write Enable control signals
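A minimal sketch of this widening (the chip model and names are illustrative assumptions): p chips share one address and the same control signals, and each contributes m bits of the wide word:

```python
class RamChip:
    """2^n x m RAM chip; every chip in the bank sees the same address, OE, and WE."""
    def __init__(self, n_address_bits: int, m_data_bits: int):
        self.words = [0] * (1 << n_address_bits)
        self.m = m_data_bits

    def read(self, addr: int) -> int:
        return self.words[addr]

def wide_read(chips: list[RamChip], addr: int) -> int:
    """Concatenate the m-bit outputs of p chips into one (p x m)-bit word."""
    word = 0
    for i, chip in enumerate(chips):               # same address driven to every chip
        word |= chip.read(addr) << (i * chip.m)
    return word

bank = [RamChip(10, 8) for _ in range(4)]          # 4 chips of 8 bits -> 32-bit data bus
bank[0].words[5], bank[1].words[5] = 0xAA, 0xBB
print(hex(wide_read(bank, 5)))                     # 0xbbaa
```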
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the Need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
- Improving Cache Performance
- Multilevel Caches

Processor-Memory Performance Gap
- 1980: no cache in the microprocessor
- 1995: two-level cache on the microprocessor
- CPU performance: 55% per year, slowing down after 2004
- DRAM: 7% per year
- The result is a widening performance gap

The Need for Cache Memory
- Widening speed gap between the CPU and main memory
  - A processor operation takes less than 1 ns
  - Main memory requires about 100 ns to access
- Each instruction involves at least one memory access
  - One memory access to fetch the instruction
  - A second memory access for load and store instructions
- Memory bandwidth limits the instruction execution rate
- Cache memory can help bridge the CPU-memory gap
- Cache memory is small in size but fast

Typical Memory Hierarchy
- Registers are at the top of the hierarchy
- Disk storage is at the bottom: typical size > 200 GB, access time 5-10 ms
Principle of Locality of Reference
- Programs access a small portion of their address space
- At any time, only a small set of instructions & data is needed
- Temporal locality (in time)
  - If an item is accessed, it will probably be accessed again soon
  - The same loop instructions are fetched each iteration
  - The same procedure may be called and executed many times
- Spatial locality (in space)
  - Tendency to access contiguous instructions/data in memory
  - Sequential execution of instructions
  - Traversing arrays element by element (both kinds of locality are illustrated in the sketch below)
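A short loop makes both properties concrete (an illustrative sketch, not from the slides): the loop body and the variable total are reused on every iteration (temporal locality), while the array elements are touched at consecutive memory locations (spatial locality):

```python
def sum_array(values: list[int]) -> int:
    total = 0                      # 'total' and the loop code are reused each iteration (temporal locality)
    for i in range(len(values)):   # elements are visited at consecutive addresses (spatial locality)
        total += values[i]
    return total

print(sum_array(list(range(1000))))   # 499500
```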
What is a Cache Memory?
- A small and fast (SRAM) memory technology
  - Stores the subset of instructions & data currently being accessed
- Used to reduce the average access time to memory
- Caches exploit temporal locality by keeping recently accessed data closer to the processor
- Caches exploit spatial locality by moving blocks consisting of multiple contiguous words
- Goal: achieve the fast speed of cache memory access while balancing the cost of the memory system

Cache analogy
- Studying books in a library
- Option 1: every time you switch to another book, return the current book to the shelf and get the new book from the shelf. Latency = 5 minutes.
- Option 2: keep 10 commonly used books on a shelf above the desk. Latency = 1 minute.
- Option 3: keep three books open to the appropriate locations on the desk. Latency = 10 seconds.

Cache Memories in the Datapath

Almost Everything is a Cache!
- In computer architecture, almost everything is a cache!
- Registers: a cache on variables (software managed)
- First-level cache: a cache on the L2 cache or memory
- Second-level cache: a cache on memory
- Memory: a cache on the hard disk
  - Stores recent programs and their data
  - The hard disk can be viewed as an extension to main memory
- Branch target and prediction buffer: a cache on branch target and prediction information

Main memory is implemented from DRAM (dynamic random access memory), caches use SRAM (static random access memory), and bulk storage uses magnetic disk; flash memory is used instead of disks in many embedded devices.
Memory hierarchy
- Use a small array of SRAM: being small keeps it fast and keeps its cost down
- Use a larger amount of DRAM: cheaper than SRAM, faster than flash/disk
- Use a lot of flash and/or disk: non-volatile, cheap, and big
- Don't try to buy 2^64 bytes of anything; use "virtual memory" to make it look like the entire address range is available
- A few TB is enough for most desktop machines today, or for a smartphone in a few years

Memory hierarchy
- Use a small array of SRAM for the cache (hopefully it covers most loads and stores)
- Use a bigger amount of DRAM for the main memory
- Use a lot of disk for virtual memory and non-volatile storage

Memory hierarchy
- (Figure: cache (SRAM), main memory (DRAM), and disk (magnetic or floating gate), compared by cost, latency, and access time)
Next . . .
- Random Access Memory and its Structure
- Memory Hierarchy and the Need for Cache Memory
- The Basics of Caches
- Cache Performance and Memory Stall Cycles
- Improving Cache Performance
- Multilevel Caches

A Very Simple Memory System
- Cache: 2 cache lines, a 4-bit tag field, and a 1-byte block per line
- Memory: addresses 0-15 hold the byte values 100, 110, 120, ..., 250
- Instruction trace executed by the processor:
  Ld R1 <- M[1];  Ld R2 <- M[5];  Ld R3 <- M[1];  Ld R3 <- M[7];  Ld R2 <- M[7]
- First access, Ld R1 <- M[1]: is it in the cache? There are no valid tags yet, so this is a cache miss
  - Allocate a line: address -> tag, M[1] -> block
  - The line now holds tag 1 with data 110, the other line becomes LRU, and R1 = 110
  - Running count: Misses = 1, Hits = 0
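The walkthrough can be reproduced with a small simulator (an illustrative sketch, assuming a 2-entry fully associative cache with LRU replacement and 1-byte blocks, as the slide suggests):

```python
def simulate(trace, num_lines=2):
    """Tiny fully associative cache: 1-byte blocks, LRU replacement."""
    memory = {addr: 100 + 10 * addr for addr in range(16)}   # 100, 110, ..., 250
    lines = []            # each entry: [tag, data]; front of the list = most recently used
    hits = misses = 0
    for reg, addr in trace:
        for entry in lines:
            if entry[0] == addr:                  # tag match -> hit
                hits += 1
                lines.remove(entry)
                lines.insert(0, entry)            # move to the most-recently-used position
                break
        else:                                     # miss -> allocate, evicting the LRU line
            misses += 1
            if len(lines) == num_lines:
                lines.pop()
            lines.insert(0, [addr, memory[addr]])
        print(f"Ld {reg} <- M[{addr}]  misses={misses} hits={hits}")
    return misses, hits

trace = [("R1", 1), ("R2", 5), ("R3", 1), ("R3", 7), ("R2", 7)]
simulate(trace)   # expected pattern: miss, miss, hit, miss, hit
```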