Chapter 4: Instruction-Level Parallelism, Software Approaches (lecture slides, .ppt)
Computer Architecture: A Quantitative Approach
Chapter 4 (2): Instruction-Level Parallelism, Software Approaches
王奕

Lecture for ILP: Software approaches
- Basic compiler technique for exposing ILP: loop unrolling
- Static branch prediction
- Static multiple issue: VLIW
- Advanced compiler support for exposing and exploiting ILP
  - Software pipelining
  - Global code scheduling
- Hardware support for exposing more parallelism at compile time
  - Conditional or predicated instructions
  - Compiler speculation with hardware support

FP Loop: Where are the Hazards?

    Loop: LD    F0,0(R1)    ; F0 = vector element
          ADDD  F4,F0,F2    ; add scalar from F2
          SD    0(R1),F4    ; store result
          SUBI  R1,R1,8     ; decrement pointer 8B (DW)
          BNEZ  R1,Loop     ; branch R1 != zero
          NOP               ; delayed branch slot

Assumed latencies of the FP operations:

    Instruction producing result   Instruction using result   Latency (clock cycles)
    FP ALU op                      Another FP ALU op          3
    FP ALU op                      Store double               2
    Load double                    FP ALU op                  1
    Load double                    Store double               0
    Integer op                     Integer op                 0

Where are the stalls?

Reducing stalls by scheduling within the basic block and using the delayed branch

Original loop, with the pipeline stages of each instruction (s = stall): 10 clock cycles

    Loop: LD    F0,0(R1)     F D X M W
          ADDD  F4,F0,F2     F D s A1 A2 A3 A4 W
          SD    0(R1),F4     F s D s s X M W
          SUBI  R1,R1,#8     F s s D X M W
          BNEZ  R1,Loop      F s D X M W

Scheduled loop, with the store moved into the delayed branch slot: 6 clock cycles

    Loop: LD    F0,0(R1)     F D X M W
          SUBI  R1,R1,#8     F D X M W
          ADDD  F4,F0,F2     F D A1 A2 A3 A4 W
          BNEZ  R1,Loop      F D X M W
          SD    +8(R1),F4    F D s X M W
          (following instruction)  F s D X M W
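The 10-cycle and 6-cycle totals can be reproduced with a small issue model. The following is a minimal sketch, not the slides' own method: it assumes a single-issue, in-order pipeline in which an instruction issues one cycle after its predecessor at the earliest, and additionally waits for the latency between it and the producer of each source register. The names `LATENCY` and `count_cycles` and the tuple encoding of instructions are hypothetical. The one-cycle stall between an integer op and a dependent branch is an assumption read off the stalled BNEZ in the diagram; it does not appear in the latency table.

```python
# Stall cycles required between a producing and a consuming instruction.
LATENCY = {
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "fpalu"): 1,
    ("load", "store"): 0,
    ("int", "int"): 0,
    ("int", "branch"): 1,  # assumption: not listed in the latency table
}


def count_cycles(program):
    """Return the cycle (1-based) in which the last instruction issues.

    Each instruction is a tuple (kind, destination register or None,
    list of source registers).
    """
    producer = {}  # register -> (kind, issue cycle) of its latest writer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1  # in-order, one instruction per cycle
        for reg in sources:
            if reg in producer:
                p_kind, p_cycle = producer[reg]
                stall = LATENCY.get((p_kind, kind), 0)
                earliest = max(earliest, p_cycle + 1 + stall)
        cycle = earliest
        if dest is not None:
            producer[dest] = (kind, cycle)
    return cycle


# Original order, with a NOP in the delayed branch slot.
unscheduled = [
    ("load", "F0", ["R1"]),         # LD   F0,0(R1)
    ("fpalu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("store", None, ["F4", "R1"]),  # SD   0(R1),F4
    ("int", "R1", ["R1"]),          # SUBI R1,R1,#8
    ("branch", None, ["R1"]),       # BNEZ R1,Loop
    ("int", None, []),              # NOP
]

# Scheduled order, with the store in the delayed branch slot.
scheduled = [
    ("load", "F0", ["R1"]),         # LD   F0,0(R1)
    ("int", "R1", ["R1"]),          # SUBI R1,R1,#8
    ("fpalu", "F4", ["F0", "F2"]),  # ADDD F4,F0,F2
    ("branch", None, ["R1"]),       # BNEZ R1,Loop
    ("store", None, ["F4", "R1"]),  # SD   +8(R1),F4
]

print(count_cycles(unscheduled))  # 10
print(count_cycles(scheduled))    # 6
```

The model tracks only true (read-after-write) dependences through registers. It ignores structural hazards and memory dependences, so it is meant for checking small schedules like these rather than as a general pipeline simulator.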
Unroll Loop Four Times (straightforward way)

Rewrite loop to minimize stalls?

    Loop: LD    F0,0(R1)
          ADDD  F4,F0,F2
          SD    0(R1),F4      ; drop SUBI & BNEZ
          LD    F6,-8(R1)
          ADDD  F8,F6,F2
          SD    -8(R1),F8     ; drop SUBI & BNEZ
          LD    F10,-16(R1)
          ADDD  F12,F10,F2
          SD    -16(R1),F12   ; drop SUBI & BNEZ
          LD    F14,-24(R1)
          ADDD  F16,F14,F2
          SUBI  R1,R1,#32     ; alter to 4*8
          SD    +8(R1),F16
          BNEZ  R1,LOOP
          NOP

- Each LD to ADDD pair costs a 1-cycle stall; each ADDD to SD pair costs a 2-cycle stall
- 15 + 4 x (1 + 2) = 27 clock cycles, or 6.8 per iteration
- Assumes the iteration count is a multiple of 4 (R1 is a multiple of 32)

Unrolled Loop That Minimizes Stalls

What assumptions were made when the code was moved?
- OK to move the store past SUBI even though SUBI changes the register
- OK to move loads before stores: do we still get the right data?
- When is it safe for the compiler to make such changes?

    Loop: LD    F0,0(R1)
          LD    F6,-8(R1)
          LD    F10,-16(R1)
          LD    F14,-24(R1)
          ADDD  F4,F0,F2
          ADDD  F8,F6,F2
          ADDD  F12,F10,F2
          ADDD  F16,F14,F2
          SD    0(R1),F4
          SD    -8(R1),F8
          SUBI  R1,R1,#32
          SD    +16(R1),F12
          BNEZ  R1,LOOP
          SD    8(R1),F16     ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

Using Loop Unrolling and Scheduling with Static Multiple Issue

          Integer instruction     FP instruction        Clock cycle
    Loop: L.D    F0,0(R1)                               1
          L.D    F6,-8(R1)                              2
          L.D    F10,-16(R1)      ADD.D F4,F0,F2        3
          L.D    F14,-24(R1)      ADD.D F8,F6,F2        4
          L.D    F18,-32(R1)      ADD.D F12,F10,F2      5
          S.D    F4,0(R1)         ADD.D F16,F14,F2      6
          S.D    F8,-8(R1)        ADD.D F20,F18,F2      7
          S.D    F12,-16(R1)                            8
          DADDUI R1,R1,#-40                             9
          S.D    F16,16(R1)                             10
          BNE    R1,R2,Loop                             11
          S.D    F20,8(R1)                              12

Static Branch Prediction

Static branch predictors are used in processors when branch behavior is expected to be highly predictable at compile time.

Several different methods:
- Always predict a branch as taken, or always as untaken
- Predict on the basis of branch direction
  - Backward-going branch: predict taken
  - Forward-going branch: predict not taken
- Profile-based prediction (based on profile information collected from earlier runs, covering many aspects of behavior)

Static Multiple Issue: VLIW

VLIW: Very Long Instruction Word
- Each "instruction" has explicit coding for multiple operations
  - In EPIC, the grouping is called a "packet"
  - In Transmeta, the grouping is called a "molecule" (with "atoms" as ops)
- Trades instruction space for simple decoding
  - The long instruction word has room for many operations
  - By definition, all the operations the compiler puts in the long instruction word are independent, so they execute in parallel
  - E.g., 2 integer operations, 2 FP ops, 2 memory refs, 1 branch
  - 16 to 24 bits per field gives 7*16 = 112 bits to 7*24 = 168 bits wide
- Needs a compiling technique that schedules across several branches

Loop Unrolling in VLIW

    Memory reference 1   Memory reference 2   FP operation 1     FP operation 2     Int. op/branch    Clock
    LD F0,0(R1)          LD F6,-8(R1)                                                                 1
    LD F10,-16(R1)       LD F14,-24(R1)                                                               2
    LD F18,-32(R1)       LD F22,-40(R1)       ADDD F4,F0,F2      ADDD F8,F6,F2                        3
    LD F26,-48(R1)                            ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                              ADDD F20,F18,F2    ADDD F24,F22,F2                      5
    SD 0(R1),F4          SD -8(R1),F8         ADDD F28,F26,F2                                         6
    SD -16(R1),F12       SD -24(R1),F16                                                               7
    SD -32(R1),F20       SD -40(R1),F24                                             SUBI R1,R1,#56    8
    SD 8(R1),F28                                                                    BNEZ R1,LOOP      9

- Unrolled 7 times to avoid delays
- 7 results in 9 clocks, or 1.3 clocks per iteration (1.8X faster than the 2.4 clocks per iteration of the dual-issue schedule)
- Average: 2.5 ops per clock, 50% efficiency
- Note: VLIW needs more registers (15 vs. 6 in the superscalar version)

Problems for VLIW
- Technical problems
  - Increase in code size: loop unrolling, unused function slots
  - Limitations of lockstep operation: a stall in any function unit may cause the entire processor to stall
- Logistical problem
  - Binary code compatibility
- Major challenge for all multiple-issue processors: exploiting large amounts of ILP

Advanced Compiler Support for Exploiting ILP
- Detecting and enhancing loop-level parallelism
- Eliminating dependent computations
- Software pipelining: symbolic loop unrolling
- Global code scheduling
  - Trace scheduling: focus on the critical path
  - Superblocks

Detecting and Enhancin
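The loop-unrolling transformation applied to the assembly listings in this chapter can also be viewed at the source level. The sketch below is an illustration under stated assumptions, not code from the slides: the function names `add_scalar` and `add_scalar_unrolled4` are hypothetical, and the cleanup loop shows one common way to drop the "iteration count is a multiple of 4" assumption.

```python
def add_scalar(x, s):
    """Reference loop: x[i] = x[i] + s, walking from the last element down."""
    for i in range(len(x) - 1, -1, -1):
        x[i] = x[i] + s


def add_scalar_unrolled4(x, s):
    """Same computation, unrolled four times, with a cleanup loop."""
    i = len(x) - 1
    # Main body: four independent element updates per index update and test,
    # mirroring the four LD/ADDD/SD groups that share one SUBI and one BNEZ.
    while i >= 3:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4
    # Cleanup: the 0 to 3 leftover elements when len(x) is not a multiple of 4.
    while i >= 0:
        x[i] = x[i] + s
        i -= 1


a = [float(v) for v in range(10)]
b = list(a)
add_scalar(a, 2.5)
add_scalar_unrolled4(b, 2.5)
print(a == b)  # True
```

Because the four updates in the main body touch different elements, a scheduler is free to reorder them, which is what makes the stall-free schedules possible.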
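The summary figures quoted for the VLIW schedule (about 1.3 clocks per iteration, about 2.5 operations per clock, about 50% efficiency) follow from counting the operations in the unrolled loop. A minimal sketch of that arithmetic, with the hypothetical helper `vliw_stats` and assuming five operation slots per instruction word as in the schedule's five columns:

```python
def vliw_stats(ops, clocks, slots_per_word, iterations):
    """Return (ops per clock, slot utilization, clocks per iteration)."""
    ops_per_clock = ops / clocks
    utilization = ops / (clocks * slots_per_word)
    clocks_per_iteration = clocks / iterations
    return ops_per_clock, utilization, clocks_per_iteration


# Seven unrolled iterations: 7 loads, 7 FP adds, 7 stores, 1 SUBI, 1 BNEZ.
ops = 7 + 7 + 7 + 1 + 1
per_clock, util, per_iter = vliw_stats(ops, clocks=9, slots_per_word=5, iterations=7)
print(round(per_clock, 2), round(util, 2), round(per_iter, 2))  # 2.56 0.51 1.29
```

The slide rounds these to 2.5 ops per clock, 50% efficiency, and 1.3 clocks per iteration.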