We have been optimizing the single-core CPU performance of the AGBNP2 implicit solvent implementation. This is the first phase of a two-year NSF-sponsored effort to radically improve the performance of the software and port it to modern parallel architectures such as multi-core CPUs, general-purpose graphics processing units (GPGPUs), and Many Integrated Core (MIC) coprocessors.
The main handle we have exploited so far to improve single-core performance is SSE vectorization, that is, the capacity of most CPUs to perform multiple floating-point operations in parallel. High-performance compilers claim to vectorize code automatically. In reality, however, automatic vectorization rarely yields a significant improvement in performance. The reason is that, for optimal performance, data structures need to be reorganized so that contiguous areas of memory are accessed and processed. In addition, vectorization often requires reworking the underlying algorithms to avoid instruction branches ("if" statements) and other obstacles to a single, simple stream of vectorized processing.
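As an illustration of these two points, here is a minimal sketch in plain C, not taken from the AGBNP2 sources. The `AtomsSoA` type and the `inv_r2_sum` function are hypothetical names chosen for this example. Coordinates are stored as separate contiguous arrays, and the usual "skip the self term" test is replaced by an arithmetic mask so that the inner loop contains no branch.

```c
#include <stddef.h>

/* Hypothetical structure-of-arrays layout: each coordinate is stored
   contiguously, so consecutive j values touch adjacent memory. */
typedef struct {
    size_t n;
    const float *x, *y, *z;
} AtomsSoA;

/* Sum of 1/r^2 from atom i to every other atom.
   The self term is removed with a 0/1 mask instead of an "if". */
float inv_r2_sum(const AtomsSoA *a, size_t i)
{
    const float xi = a->x[i], yi = a->y[i], zi = a->z[i];
    float sum = 0.0f;
    for (size_t j = 0; j < a->n; ++j) {
        const float dx = a->x[j] - xi;
        const float dy = a->y[j] - yi;
        const float dz = a->z[j] - zi;
        const float r2 = dx * dx + dy * dy + dz * dz;
        const float mask = (float)(j != i);  /* 0 for the self term, 1 otherwise */
        /* For j == i the denominator becomes 1 and the numerator 0,
           so the term vanishes without dividing by zero. */
        sum += mask / (r2 + (1.0f - mask));
    }
    return sum;
}
```

Whether a given compiler actually vectorizes this loop depends on the compiler and its flags; the point of the sketch is only the layout and the branch-free form.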
The structure of the AGBNP2 implicit solvent implementation is outlined here on the right. It is organized into three levels. At the first level, geometrical descriptors of the molecular conformation (atomic volumes and surface areas) are gathered. At the second level, Born radii, which measure solvent exposure, are computed. Finally, at the third level, these quantities are integrated into the components of the solvation free energy: electrostatic, non-polar, and short-ranged solute-solvent hydrogen bonding. In this phase we have optimized the subroutines for the Born radii computation (level 2) and for the Generalized Born electrostatic pair and self energies (level 3), that is, the ones that most resemble conventional pair-wise molecular mechanics potentials.
As a further simplification, in this first phase we have not implemented a non-bonded cutoff and the corresponding non-bonded neighbor lists. Because neighbor lists cause scattered reads and writes to memory, specialized techniques are required to vectorize and parallelize them optimally (see for example Eastman & Pande, 2010). We plan to take advantage of these algorithms when the AGBNP2 library is ported to OpenMM in subsequent phases of the project. As the cases below show, for systems with up to ~2,000 particles, which form the majority of our current application areas, ignoring cutoffs leads to better performance despite the larger number of interactions. We therefore expect this effort to yield immediate and long-lasting benefits despite the lack of cutoffs.
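The scattered-read problem can be seen in a small sketch. The two hypothetical functions below compute one row of a pair-wise sum. `pair_term` is a placeholder for a real interaction kernel, and none of these names come from the AGBNP2 sources. The all-pairs version walks the array in order, while the neighbor-list version indexes it through a list of arbitrary atom indices.

```c
#include <stddef.h>

/* Placeholder for a real pair kernel: any function of two per-atom values. */
static float pair_term(float a, float b) { return a * b; }

/* All-pairs row: v[0..n-1] is read in order, so loads are contiguous.
   The self term is masked out arithmetically. */
float row_dense(size_t n, const float *v, size_t i)
{
    float sum = 0.0f;
    for (size_t j = 0; j < n; ++j)
        sum += (float)(j != i) * pair_term(v[i], v[j]);
    return sum;
}

/* Neighbor-list row: nbr[] holds arbitrary atom indices, so each
   v[nbr[k]] is a read from a scattered address. */
float row_neighbors(size_t nn, const size_t *nbr, const float *v, size_t i)
{
    float sum = 0.0f;
    for (size_t k = 0; k < nn; ++k)
        sum += pair_term(v[i], v[nbr[k]]);
    return sum;
}
```

When the neighbor list contains every other atom, the two functions return the same value. The dense version does more arithmetic once a cutoff would exclude atoms, but its memory access pattern is the one that vectorizes simply. Without a cutoff, 2,000 atoms give 2,000 × 1,999 / 2 = 1,999,000 unique pairs.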
The graphs below summarize the speed improvements for three molecular systems: the Trp-cage peptide (272 atoms), the GCN4 dimer (1,094 atoms), and the protein-ligand complex between chlorophenol and the ligand-binding domain of T4 lysozyme (1,323 atoms). The GCN4 dimer was chosen as a challenging case because its long, thin structure is particularly disadvantageous to the removal of non-bonded cutoffs. AGBNP3 refers to the optimized working version of AGBNP2. The horizontal lines mark the reference AGBNP2 single-core speeds; the higher line is the speed with cutoffs. AGBNP3 speeds are given as a function of version date.
The results show very significant improvements for the calculation of GB pair energies and Born radii (generally 3X over AGBNP2 with cutoffs and 2X over AGBNP2 without cutoffs). The overall speed-up is smaller because of remaining bottlenecks, particularly the calculation of self-volumes, which now dominates the run time. Updates will be posted as optimization progresses.