Parallelization of Coupled Cluster Code with OpenMP. Anil Kumar Bohare, Department of Computer Science, University of Pune, Pune-7, India
Multi-core architecture and its implications for software This presentation has been made in OpenOffice. Multi-core architectures have a single chip package that contains one or more dies, each with multiple execution cores (computational engines). Jobs run simultaneously on separate software threads. contd.
Multi-core architecture and its implications for software Current computer architectures, such as multi-core processors on a single chip, increasingly rely on parallel programming techniques like the Message Passing Interface (MPI) and Open specifications for Multi-Processing (OpenMP) to improve application performance, driving developments in High Performance Computing (HPC).
Parallelization of Coupled Cluster Code With the increasing popularity of Symmetric Multiprocessing (SMP) systems as the building blocks of high-performance supercomputers, the need for applications that can exploit the multiple levels of parallelism in clusters of SMPs has also increased. This presentation describes the parallelization of an important molecular dynamics application, 'Coupled Cluster Singles and Doubles (CCSD)', on multi-core systems. contd.
Parallelization of Coupled Cluster Code To reduce the execution time of sequential CCSD code, we optimize and parallelize it for accelerating its execution on multi-core systems.
Agenda Introduction Problem & Theories Areas of application Performance Evaluation System OpenMP implementation discussion General performance recommendations Advantages & Disadvantages Performance Evaluations Further improvement Conclusion References
Introduction / Background Coupled-cluster (CC) methods are now widely used in quantum chemistry to calculate the electron correlation energy. They are commonly used in ab initio quantum chemistry methods in the field of computational chemistry. The technique is used for describing many-body systems. Some of the most accurate calculations for small to medium-sized molecules use this method.
Problem The CCSD project contains 5 different files; 'vbar' is one of the many subroutines under focus. It: Computes the effective two-electron integrals, which are CPU intensive. Performs iterative calculations. Has big, time-consuming loops. Has up to 8 levels of nested loops. Takes approximately 12 minutes to execute as sequential code. contd.
Problem The goal is to reduce this time by at least 30%, i.e. to 7-8 minutes, by applying OpenMP parallelization techniques.
Parallelization: Description of the theory Shared-memory architecture (SMP): these parallel machines are built from a set of processors that all have access to a common memory. Distributed-memory architecture (Beowulf clusters): each processor has its own private memory, and information is exchanged through messages. Wiki: MPI is a computer software protocol that allows many computers to communicate with one another. In the last few years a new industry standard has evolved to serve the development of parallel programs on shared-memory machines: OpenMP.
OpenMP is an API (Application Program Interface) used to explicitly direct multi-threaded, shared-memory parallelism. It defines a portable programming interface based on directives, runtime routines, and environment variables. OpenMP is a relatively new programming paradigm that can easily deliver good parallel performance for small numbers (<16) of processors. It is usually applied to existing serial programs to achieve moderate parallelism with relatively little effort.
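As a minimal illustration of the directive-based model (a sketch in C rather than the deck's Fortran, since OpenMP serves both; the function name is hypothetical), a single pragma parallelizes a summation loop. Compiled without OpenMP support, the pragma is simply ignored and the loop runs serially with the same result:

```c
/* Sum 1..n in parallel: the reduction clause gives each thread a
   private partial sum and combines the partial sums at the end
   of the loop. */
long triangular(int n) {
    long sum = 0;
#pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```

With gcc the directive takes effect under -fopenmp; without that flag the pragma is ignored and the function still returns the correct value.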
Use of OpenMP OpenMP is used in applications with intense computational needs, from video games to big science and engineering problems. It can be used by everyone from beginning programmers in school to scientists to parallel computing experts. It is available to millions of programmers in every major (Fortran and C/C++) compiler.
System used Supermicro compute node Dual quad-core CPUs = 8 cores Intel(R) Xeon(R) CPU X5472 @ 3.00GHz 8GB RAM Red Hat Enterprise Linux WS release 4 Kernel: 2.6.9-42.ELsmp Compiler: Intel ifort (IFORT) 11.0 20090131 The parallel CCSD implementation with OpenMP is compiled with the Intel Fortran Compiler version 11.0 using the -O3 optimization flag.
How to apply OpenMP? Identify compute intensive loops Scope of Data Parallelism Use of PARALLEL DO directive Reduction variables Mutual Exclusion Synchronization - Critical Section Use of Atomic directive OpenMP Execution Model
Identify compute-intensive loops If you have big loops that dominate execution time, these are ideal targets for OpenMP. Divide loop iterations among threads: we focus mainly on loop-level parallelism in this presentation. Make the loop iterations independent, so they can safely execute in any order without loop-carried dependencies. Place the appropriate OpenMP directives and test.
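For instance (a hypothetical sketch in C; the CCSD code itself is Fortran), a loop whose iterations each write a distinct element has no loop-carried dependence, so a single directive can split it among threads:

```c
/* y[i] = a * x[i]: every iteration touches only its own element,
   so iterations may run in any order across threads. */
void scale(double *y, const double *x, double a, int n) {
#pragma omp parallel for
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i];   /* no loop-carried dependence */
}
```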
Scope of Data Parallelism Shared variables are shared among all threads. Private variables vary independently within each thread. By default, all variables declared outside a parallel block are shared, except the loop index variable, which is private. In the shared-memory setup, private variables in each thread avoid dependencies and false sharing of data.
PARALLEL DO Directive This directive specifies that the loop immediately following should be executed in parallel. For codes that spend the majority of their time executing the contents of loops, the PARALLEL DO directive can yield a significant increase in performance. contd.
PARALLEL DO Directive This is an actual example taken from the OpenMP version of CCSD:

C$OMP PARALLEL
C$OMP DO SCHEDULE(STATIC,2)
C$OMP&PRIVATE(ib,ibsym,orbb,iab,iib,iq,iqsym,ibqsym,iaq,iiq,
C$OMP&        ig,igsym,orbg,iig,iag,ir,irsym,iir,iar,orbr,irgsym,
C$OMP&        kloop,kk,ak,rk,f4,vqgbr,imsloc,is,issym,iis,orbs,ias)
      do 1020 ib=1,nocc
         ... body of loop ...
 1020 continue
C$OMP END DO
C$OMP END PARALLEL
Reduction variables Variables that are used in collective operations over the elements of an array can be labeled as REDUCTION variables:

      xsum=0
C$OMP PARALLEL DO REDUCTION(+:xsum)
      do in=1,ntmax
         xsum=xsum+baux(in)*t(in)
      enddo
C$OMP END PARALLEL DO

Each thread has its own copy of xsum. After the parallel work is finished, the master thread collects the values generated by each thread and performs the global reduction.
Mutual Exclusion Synchronization - Critical Section Certain parallel programs may require that each thread execute a section of code where it is critical that only one thread executes that section at a time. These regions can be marked with CRITICAL/END CRITICAL directives. Example:

C$OMP CRITICAL(SECTION1)
      call findk(orbq,orbr,orbb,orbg,iaq,iar,iab,iag,kgot,kmax)
C$OMP END CRITICAL(SECTION1)
Atomic Directive The ATOMIC directive ensures that a specific memory location is updated atomically, rather than exposing it to the possibility of multiple, simultaneous writing threads. Example:

C$OMP ATOMIC
      aims31(imsloc) = aims31(imsloc)-twoe*t(in1)
Problem solution: Flow
Compilation & Execution Compile the OpenMP version of the CCSD code:

anil@node:~# ifort -openmp ccsd_omp.F -o ccsd_omp.o

Set the OpenMP environment variables:

[email_address] :~# cat exports
export OMP_NUM_THREADS=2   (or 4 or 8: the number of threads to be spawned while executing the specified loops)
export OMP_STACKSIZE=1G    (a smaller size may result in a segmentation fault)
[email_address] :~# source exports

contd.
Compilation & Execution  Execute the OpenMP version of CCSD code anil@node:~# date>run_time; time ./ccsd_omp.o; date>>run_time
OpenMP Execution Model
General performance recommendations Be aware of Amdahl's law. Minimize serial code. Remove dependencies among iterations. Be aware of directive cost. Parallelize outer loops. Minimize the number of directives. Minimize synchronization: minimize the use of BARRIER and CRITICAL. Reduce false sharing. Make use of private data as much as possible.
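Amdahl's law bounds what the recommendations above can achieve: if only a fraction p of the runtime is parallelized, N cores give at best a speedup of 1 / ((1 - p) + p/N). A small sketch (hypothetical helper name):

```c
/* Amdahl's law: upper bound on speedup with n_cores cores when a
   fraction p of the runtime is parallelizable. */
double amdahl_speedup(double p, int n_cores) {
    return 1.0 / ((1.0 - p) + p / n_cores);
}
```

For example, with p = 0.9 even 8 cores yield at most about a 4.7x speedup, which is why minimizing serial code heads the list.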
Advantages With multiple cores, we can use them to extract thread-level parallelism from a program and hence increase the performance of the sequential code. The original source code is left almost untouched. Can substantially reduce the execution time (up to 40%) of a given code, resulting in power savings. Designed to make programming threaded applications quicker, easier, and less error-prone.
Disadvantages OpenMP code will only run on SMP machines. When the processor must perform multiple tasks simultaneously, performance can degrade. Several iterations of trials may be needed before the user gets the expected timings from OpenMP codes.
Result
Descriptive statistics The graph shows that as the number of cores increases, the wall-clock time decreases; overall, run time was reduced by 35.66% (equivalent to a speedup of roughly 1.55x).
Further improvement This technique is applicable to multi-level nested do loops that are highly complex and require more time. This code could also benefit from a hybrid approach, i.e. the outer loop is parallelized across processors using MPI and the inner loop is parallelized across the processing elements inside each processor with OpenMP directives, though this effectively means rewriting the complete code.
Conclusion In this presentation, I parallelized and optimized the 'vbar' subroutine in the CCSD code. I conducted a detailed performance characterization on an 8-core processor system. Optimization techniques such as SIMD (Single Instruction Multiple Data, one of the four classes in Flynn's taxonomy) proved effective. Runtime decreased nearly linearly when adding more compute cores to the same problem. Multiple cores/CPUs will dominate future computer architectures, and OpenMP will be very useful for parallelizing sequential applications on these architectures.
References Barney, Blaise. "Introduction to Parallel Computing". Lawrence Livermore National Laboratory. http://www.llnl.gov/computing/tutorials/parallel_comp/ The official OpenMP website: www.openmp.org http://www.llnl.gov/computing/tutorials/openMP/ R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2000. http://www.nersc.gov/nusers/help/tutorials/openmp MPI web pages at Argonne National Laboratory: http://www-unix.mcs.anl.gov/mpi
 
