Parallel Preconditioned Conjugate Gradients
CS267 Fall 2002 – Assignment 3
Rabin K. Patra, Sergiu Nedevschi, Sukun Kim
Source code:
1. Introduction
The purpose of this assignment is to solve the linear system Ax=b quickly. There are two major approaches: the Conjugate Gradient algorithm and sparse matrix (direct) algorithms. We concentrate on the Conjugate Gradient method and on efficient parallel implementations of it.
(1) Conjugate gradient algorithm.
To make the conjugate gradient algorithm converge faster, preconditioning is used ([1]). The preconditioning matrix M is an essential factor in performance: the closer MAM^T is to the identity I, the better the performance.
In the block Jacobi method, for a given A, diagonal blocks B1, B2, …, Bn are chosen, and their Cholesky factors C1, C2, …, Cn are computed. The preconditioning matrix M is the collection of these Cholesky factors ([2]). Since Cholesky factors are used, solving Mx=b is easy; moreover, the block structure of M makes solving Mx=b parallelizable. We did not change the preconditioning matrix M itself; instead, we tried reordering the rows and columns of A.
In SSOR, the preconditioning matrix M is built from A itself (its diagonal and triangular parts). However, since this is not well suited to parallelization, we did not use SSOR in the parallel implementation.
(2) Parallelism
UPC and Titanium codes were given. UPC ([3]) did not work on either Millennium or Seaborg. MPI ([4]) offered more functionality and flexibility than Titanium ([5]), and we felt more comfortable with it, so we used MPI for the parallel implementation.
2. Serial implementation
For the serial implementation, we simply tested the given code.
For the Jacobi method, the block size is 50.
Conclusion:
3. Parallel implementation using MPI
Background:
From the performance of the serial code for solving Ax=b, it was clear that the Jacobi and SSOR preconditioning methods were far superior to the vanilla version (no preconditioning).
An analysis of both the Jacobi and SSOR algorithms shows that although SSOR is better in terms of the number of iterations and the total computation time, the Jacobi method is easier to parallelize. It divides the matrix into blocks, which are factorized using Cholesky factorization. This lends itself naturally to a parallel model in which the blocks are divided among the processors. The other operations, matrix-vector multiplication and the inner product, can also be parallelized.
The UPC implementation posted on the homework page was not feasible as neither Millennium nor Seaborg supported the UPC library. The alternatives were MPI and Titanium and we chose MPI as it is more versatile.
The MPI implementation:
Results: We used matrices from the Matrix Market site to test the MPI implementation. We tested matrices from two classes of problems, choosing only the reasonably large ones.
BCS: Structural Engineering Matrices
Matrix bcsstk18 (759 iterations)
Matrix bcsstk16 (75 iterations)
CYLSHELL: Finite element analysis of cylindrical shells
Matrix s1rmt3m1 (425 iterations)
Matrix s1rmq4m1 (430 iterations)
Matrix s2rmq4m1 (650 iterations)
Matrix s2rmt3m1 (670 iterations)
Conclusion:
4. Fill-reducing using METIS
METIS ([6]) contains a library of fill-reducing reordering routines. This library reorders the rows and columns of matrix A so that the reordered matrix has fewer fill elements when Cholesky factorized. We tried this method, but the performance degraded. We see two main reasons for this. First, we Cholesky-factorize only the diagonal blocks of the matrix, not the whole matrix, so this reordering does not guarantee reduced fill in the blocked Cholesky factors. Second, this algorithm increases the bandwidth of the given matrix. Since the preconditioning matrix M contains only the diagonal part, a wider bandwidth of A makes MAM^T farther from I, resulting in worse performance.
5. Narrowing bandwidth
We thought that narrowing the bandwidth would make MAM^T closer to I and thus give better performance. We looked for a library implementing the Reverse Cuthill-McKee algorithm ([7]), and in the end implemented our own. The algorithm performs a breadth-first search of the sparse matrix graph, which yields an ordering (permutation) of the elements. Applying this permutation to the original graph gives the modified sparse matrix. A key point in the algorithm is selecting a good starting vertex for the BFS, and we implemented a heuristic for doing that as proposed in ([7]).
Results:
The CG algorithm does not always converge after applying the Reverse Cuthill-McKee transformation, and it sometimes produces the wrong solution.
Matrix   | Size | Elements | Initial BW | Final BW | Converged? | Initial iterations | Final iterations
Nos1     |  237 |      627 |          4 |        4 | Yes        |                 30 |               15
Gr_30_30 |  900 |     4322 |         31 |       59 | Yes        |                 21 |               44
S3rmt3m3 | 5357 |   207695 |       5302 |      212 | No         |                  - |                -
1138_bus | 1138 |     4054 |       1030 |      131 | No         |                  - |                -
Conclusion:
6. References
[1] Ian Sobey. Conjugate Gradient Algorithm. http://www.sjc.ox.ac.uk/scr/sobey/iblt/chapter1/notes/node20.html
[2] Department of Computer Sciences, The University of Texas at Austin. Cholesky Factorization. http://www.cs.utexas.edu/users/plapack/icpp98/node2.html
[3] The High Performance Computing Laboratory, The George Washington University. Unified Parallel C. http://upc.gwu.edu/
[4] MPI Forum. Message Passing Interface. http://www.mpi-forum.org/
[5] Computer Science Division, University of California at Berkeley. Titanium. http://titanium.cs.berkeley.edu/
[6] George Karypis. METIS. http://www-users.cs.umn.edu/~karypis/metis/index.html
[7] Dale Olesky. The Reverse Cuthill-McKee Algorithm. http://www.csc.uvic.ca/~dolesky/csc449-540/1.5-1.6c.pdf