CS267 Fall 2002 – Assignment 2

Sharks and Fish Particle Simulation

 

Professor: Horst Simon; Teaching Assistant: David Garmire

Students: Rabin K. Patra, Sukun Kim, Sergiu Nedevschi

 

 

Exploring Several Implementation Techniques

 

Several techniques for writing parallel programs can be used to solve the Sharks and Fish particle simulation problem. We began by exploring each of these techniques (MPI, Pthreads, and Titanium) and running them on both the Millennium and Seaborg clusters.

 

 

Running MPI on Millennium:

 

We plotted the total completion time and the communication time against the number of processors used, for 10, 100, and 1000 fish:

 

In all the above plots, the time was measured for 100 execution steps.

As we can see, for a small number of fish, spreading the computation over several nodes brings no speedup (the performance actually degrades). This is easily explained by the fact that the computation on a single node is extremely small, and dividing this small computation does not make up for the (inter-node) communication overhead. We can observe this by looking at the large fraction of the total time taken by communication.

However, in the case of 1000 fish, where CPU time becomes the bottleneck, parallelizing the computation helps a lot, because the communication time then represents only a small fraction of the total execution time.

 

Running MPI on Seaborg:

 

 

Running Pthreads on Millennium:

 

Using pthreads on Millennium is pointless. Pthreads work within a single node and cannot be used to communicate between nodes. Unfortunately, if the parent process that spawns the pthreads occupies one processor (blocking while waiting for the threads), all the threads are left to run on the other processor of the node. The only thing this approach adds is the overhead of creating the threads and switching contexts, because the code effectively executes on one processor anyway. The results below (for 10, 100, and 1000 fish) confirm our hypothesis.

 

 

 

Running Pthreads on Seaborg:

 

In this case, pthreads are useful, because each of Seaborg's nodes has 16 processors that communicate through shared memory. So, as long as we do not need to scale the computation beyond 16 processors, pthreads seem to be the best choice: with shared memory there is no communication overhead in the usual sense (the communication overhead reduces to synchronization overhead).

 

 

Running Titanium on Millennium:

 

The Titanium implementation proved to be exceptionally slow. The results for 10, 100, and 200 fish are shown in the following figures. Please note that the time is given in seconds per step, and a single step can take a long time (sometimes more than 2 seconds).

 

As the number of processors increases, the time it takes to complete increases greatly. This shows poor scalability.

 

 

Running Titanium on Seaborg:

 

The Titanium implementation on Seaborg runs very similarly to the way it runs on Millennium (comparable, slightly longer total and communication times):


Optimizations

 

Trying to Optimize the Titanium Code

 

(1) Code level optimization:

 

There are two major tools in Titanium for speeding up parallel computation: immutable classes and Titanium arrays.

 

Immutable classes: to qualify as immutable, a class must have fields that are never updated. Even though the simulation stores the fish in an array of Fish objects, the Fish class cannot be made immutable, because its fields have to be updated at every step.

 

Titanium arrays: these can only be used when there is a regular correspondence between the elements of the arrays involved in a computation, for example c[i] = a[i] + b[i] for each i. In this simulation there is no such correspondence, since every particle interacts with every other particle.

 

Therefore, neither of these features could be applied.

 

 

(2) Optimization using compiler options:

 

We tried the following compiler optimization options. Since the performance on Millennium fluctuated too much, we collected data and plotted graphs only on Seaborg.

 

processors      Naïve      --nobcheck      --nobcheck --optimize
     1           1127          868                  335
     2          16875        16497                15831
     4          41921        40709                39524
     8          96091        95375                92539

 

 

‘--nobcheck’ disables bounds checking on array references; it gave an improvement of around 2%.

‘--optimize’ optimizes the generated C code; it gave another improvement of around 2%.

 

 

Optimizing the MPI and Pthreads Code

 

MPI and Pthreads are both very useful. Pthreads are great for SMPs but cannot scale beyond a single node; on the other hand, their communication cost is minimal, because the processors communicate through shared memory. The natural solution is to combine the Pthreads approach (for communication within one node) with MPI (for communication between nodes).

 

Combining MPI and Pthreads

 

Our approach was to combine these techniques, in order to take advantage of the fast communication of Pthreads and of the scalability of MPI.

The array containing the information about all the particles is divided among the nodes. Each node's local array is further divided into smaller pieces handled by pthreads. Each thread has access to the node's whole array, but computes the interactions only for its own part of it.

 

 

Only one of the threads is responsible for the MPI communication with the other nodes.
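A minimal sketch of this hybrid structure is shown below, in C. It is only an illustration under our own assumptions (NUM_THREADS, compute_forces(), move_particles() and the other names are placeholders, not the actual assignment code): every thread works on its slice of the node-local portion of the replicated particle array, and only thread 0 performs the inter-node exchange.

/* Minimal hybrid MPI + Pthreads skeleton (a sketch, not the actual
 * assignment code): one MPI process per node, NUM_THREADS worker threads
 * per process.  Every thread reads the replicated particle array but
 * writes only its own slice; only thread 0 calls MPI.
 * compute_forces() and move_particles() are hypothetical placeholders. */
#include <mpi.h>
#include <pthread.h>
#include <stdlib.h>

#define NUM_THREADS 4
#define NUM_STEPS   100

typedef struct { double x, y, vx, vy, fx, fy, m; } particle_t;

static particle_t *all;                  /* all particles, replicated on every node */
static int n_total, n_local, local_off;  /* this node's slice of the array          */
static pthread_barrier_t bar;

static void compute_forces(int lo, int hi) { /* interactions for particles lo..hi-1 */ }
static void move_particles(int lo, int hi) { /* position/velocity update            */ }

static void *worker(void *arg)
{
    int tid   = (int)(long)arg;
    int chunk = (n_local + NUM_THREADS - 1) / NUM_THREADS;
    int lo    = local_off + tid * chunk;
    int hi    = lo + chunk;
    if (hi > local_off + n_local) hi = local_off + n_local;

    for (int step = 0; step < NUM_STEPS; step++) {
        compute_forces(lo, hi);          /* reads everyone's positions           */
        pthread_barrier_wait(&bar);      /* all forces done before anyone moves  */
        move_particles(lo, hi);          /* writes only this thread's slice      */
        pthread_barrier_wait(&bar);
        if (tid == 0)                    /* only thread 0 talks to other nodes   */
            MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                          all, n_local * (int)sizeof(particle_t), MPI_BYTE,
                          MPI_COMM_WORLD);   /* assumes equal slices per node    */
        pthread_barrier_wait(&bar);      /* everyone waits for fresh positions   */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    pthread_t threads[NUM_THREADS];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    n_total   = 1000;                    /* e.g. 1000 fish                     */
    n_local   = n_total / nprocs;        /* assumes nprocs divides n_total     */
    local_off = rank * n_local;
    all       = calloc(n_total, sizeof(particle_t));

    pthread_barrier_init(&bar, NULL, NUM_THREADS);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_create(&threads[t], NULL, worker, (void *)t);
    for (long t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    pthread_barrier_destroy(&bar);

    free(all);
    MPI_Finalize();
    return 0;
}

In practice one would use one thread per processor of the node (e.g. 2 on Millennium, up to 16 on Seaborg), and the MPI library must tolerate the MPI calls being issued from a single worker thread.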


Barnes-Hut Algorithm for the n-Body Problem

 

Background: The naive implementation of the n-body problem performs n^2 force computations for each step, as it calculates the gravitational force between every pair of particles. This is clearly inefficient, since the force due to distant particles can be approximated.

We can therefore think of an algorithm that uses spatial locality when allocating particles to processors, and that replaces a block of neighboring particles by its center of mass when evaluating its force on distant particles.
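For reference, the naive O(n^2) step amounts to a double loop over all pairs, as in the following sketch (gravity-style pairwise forces; the function name and the constants G and eps are purely illustrative):

/* Naive O(n^2) force computation: every pair of particles interacts
 * directly.  Illustrative sketch only; G and the softening term eps
 * are placeholder constants.                                        */
#include <math.h>

typedef struct { double x, y, vx, vy, fx, fy, m; } body_t;

void compute_forces_naive(body_t *b, int n)
{
    const double G = 6.67e-11, eps = 1e-9;   /* eps avoids division by zero */
    for (int i = 0; i < n; i++) { b[i].fx = 0.0; b[i].fy = 0.0; }
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            if (i == j) continue;
            double dx = b[j].x - b[i].x, dy = b[j].y - b[i].y;
            double r2 = dx * dx + dy * dy + eps;
            double r  = sqrt(r2);
            double f  = G * b[i].m * b[j].m / r2;   /* force magnitude          */
            b[i].fx  += f * dx / r;                 /* accumulate on particle i */
            b[i].fy  += f * dy / r;
        }
}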

 

The Barnes-Hut algorithm makes use of a "divide and conquer" strategy to recursively subdivide the problem space in order to facilitate calculation of inter-particle distances.

 

Steps:

 

1. Creating the tree:

The first step in the Barnes-Hut method is to define the spatial extent of the problem. The Barnes-Hut approach to this aspect of the problem is to make the problem space a rectangular parallelepiped whose sides are of equal length. The length of the sides is the maximum spatial extent of the particles in any spatial dimension.

 

The second step is to divide the space recursively. The approach taken in the Barnes-Hut method is to evenly divide the space in each dimension recursively. For example, in two dimensions the space is divided into four square regions; in three dimensions, the initial cube is divided into eight sub-cubes. The result of this division is a set of problem sub-spaces that are congruent with one another and are spatially scaled versions of the original problem space. Thus, this process can be repeated recursively until either one or no particles are in a cell. In each case, and at each stage of the recursive process, the center of mass of the particle distribution in each cell is calculated.

function InsertBody(global root, p) =
    FindMyLeaf(p)
    if (leaf_is_full)
        SplitLeaf(root, p)
        InsertBody(new_leaf, p)
    else
        InsertToLeaf(p)
    endif
end
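A concrete two-dimensional (quadtree) version of this insertion could look like the following C sketch. The data structure and the function names are our own illustration, not the exact code used in the assignment:

/* Sketch of recursive insertion into a 2-D Barnes-Hut quadtree (our own
 * illustration).  A node is an empty leaf, a leaf holding one body, or
 * an internal node with up to four children.                            */
#include <stdlib.h>

typedef struct { double x, y, vx, vy, fx, fy, m; } body_t;

typedef struct node {
    double cx, cy, half;          /* center and half-width of this square cell */
    double com_x, com_y, mass;    /* running center of mass and total mass     */
    body_t *body;                 /* the body, if this is a non-empty leaf     */
    struct node *child[4];        /* all NULL for leaves                       */
} node_t;

static int is_internal(node_t *n)
{
    return n->child[0] || n->child[1] || n->child[2] || n->child[3];
}

static int quadrant(node_t *n, body_t *b)      /* which child cell b falls into */
{
    return (b->x > n->cx ? 1 : 0) + (b->y > n->cy ? 2 : 0);
}

static node_t *new_child(node_t *n, int q)
{
    node_t *c = calloc(1, sizeof(node_t));
    c->half = n->half / 2.0;
    c->cx = n->cx + ((q & 1) ? c->half : -c->half);
    c->cy = n->cy + ((q & 2) ? c->half : -c->half);
    return c;
}

/* InsertBody: descend to the body's leaf, splitting full leaves on the way. */
void insert_body(node_t *n, body_t *b)
{
    /* update this cell's center of mass with the new body */
    n->com_x = (n->com_x * n->mass + b->x * b->m) / (n->mass + b->m);
    n->com_y = (n->com_y * n->mass + b->y * b->m) / (n->mass + b->m);
    n->mass += b->m;

    if (!is_internal(n) && n->body == NULL) {  /* empty leaf: just store it    */
        n->body = b;
        return;
    }
    if (n->body != NULL) {                     /* full leaf: split it and push */
        body_t *old = n->body;                 /* the old body down one level  */
        n->body = NULL;
        int qo = quadrant(n, old);
        if (n->child[qo] == NULL) n->child[qo] = new_child(n, qo);
        insert_body(n->child[qo], old);
    }
    int q = quadrant(n, b);                    /* then descend with the new body */
    if (n->child[q] == NULL) n->child[q] = new_child(n, q);
    insert_body(n->child[q], b);
    /* (bodies at identical positions would recurse forever; real code
       limits the depth of the tree) */
}

In this sketch the center of mass of each cell is kept up to date on the way down, so no separate pass over the tree is needed afterwards.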

 


2. Walking the tree:

Given that the tree is constructed, we now must "walk the tree" in order to carry out the force calculations. Since the primary objective is to make the inter-particle force calculations as cheap as possible, we note that if a group of particles is "far enough" away from an individual particle, the group can be treated as a single particle with a composite mass (or charge), located at the group's center of mass (or charge).

 

Thus, for each particle in turn, the tree is traversed, starting at the topmost "universe" node. The spatial extent of the node is divided by the distance from the center of mass of the node to the particle. If this quotient is less than a specified quantity (customarily called the theta parameter) the particles in the node are "far enough" away to be considered a single particle. If the quotient is greater than theta, the tree is recursively descended.

 

After the total force is calculated on each particle, the particles can be moved. After movement, the tree may be reconstructed and walked iteratively. This is the general way that the Barnes-Hut tree is used to calculate the motions of a large number of particles. Detailed analysis of the algorithm indicates that both the tree building phase and the tree walking phase are of order N log N, clearly superior to the unsophisticated N-squared direct summation method.

 

function treeForce(p, node) =
    if region at node contains 1 particle then
        compute direct interaction between the two particles
    else
        if check_distant_enough(p, node) then
            compute direct interaction between p
            and the center of mass of node
        else
            for each subnode of node in parallel
                compute treeForce(p, subnode) recursively
        endif
    endif
end

function check_distant_enough(p, node)
    distance = distance from p to center of mass of node
    if (size_of_node / distance < theta)
        return true
    else
        return false
    endif
end
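Using the node_t and body_t structures from the insertion sketch above, this tree walk translates almost directly into C (again just an illustration; G, eps and the force law are our own assumptions):

/* Walk the Barnes-Hut tree accumulating the force on body p.
 * theta is the opening parameter discussed above.              */
#include <math.h>

static void add_pair_force(body_t *p, double x, double y, double m)
{
    const double G = 6.67e-11, eps = 1e-9;       /* illustrative constants */
    double dx = x - p->x, dy = y - p->y;
    double r2 = dx * dx + dy * dy + eps;
    double r  = sqrt(r2);
    double f  = G * p->m * m / r2;
    p->fx += f * dx / r;
    p->fy += f * dy / r;
}

void tree_force(node_t *n, body_t *p, double theta)
{
    if (n == NULL || n->mass == 0.0) return;     /* empty subtree              */
    if (n->body != NULL) {                       /* leaf: direct interaction   */
        if (n->body != p)
            add_pair_force(p, n->body->x, n->body->y, n->body->m);
        return;
    }
    double dx = n->com_x - p->x, dy = n->com_y - p->y;
    double dist = sqrt(dx * dx + dy * dy);
    if (2.0 * n->half / dist < theta) {          /* far enough: use the cell's */
        add_pair_force(p, n->com_x, n->com_y, n->mass);  /* center of mass     */
    } else {
        for (int q = 0; q < 4; q++)              /* otherwise open the cell    */
            tree_force(n->child[q], p, theta);
    }
}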

 

Parallel Implementation:

The Barnes-Hut algorithm is inherently parallelizable in both steps: in tree construction, all processors can cooperatively build a global tree, each inserting its own particles (allocated by some initial breakdown); in force calculation, once we have the global tree, every processor can walk it independently of the others to evaluate the forces on its own particles.

 

Our implementation:

The problems with parallelization are:

  1. Creating a global tree requires a locking protocol, as more than one processor might try to insert a particle at a leaf or split an internal node of the tree.
  2. Constructing the tree from scratch at every iteration might be prohibitive; we observe that only a small fraction of the particles cross the boundaries of their region in each step.

 

So we decided to have each processor build its own full copy of the tree, which it can do as long as it has information about all the particles. Each processor is then allocated a group of particles for force evaluation. All processors walk their copy of the tree to calculate the forces and update the positions of their particles. Each processor also checks, for each of its particles, whether it has moved out of its original leaf, and stores this result. The updated positions are distributed to all processors in the next step. Each processor then re-adjusts its copy of the tree by deleting the particles that have migrated and re-inserting them into the tree.

 

for all processors
      construct the full tree

for each step do:
      for all processors
            for each particle allocated to it
                  compute the total force on it by walking the tree
                  update its position
                  count the number of force computations done for the particle
            for each particle
                  check if it has left the rectangle defined by its leaf

      do an Allgatherv so that every processor gets the updated information
      about all particles

      // Adjust the tree
      for all processors
            for each particle that has left its leaf
                  delete it from its earlier leaf
                  insert it into the tree again
            if number of re-insertions > (number of nodes in tree) / 10
                  construct the tree again

      // Calculate the load and split the particle sequence
      // (load of a particle = number of force computations done for it
      //  in this iteration)
      split the particle sequence among the nproc processors,
      dividing the load equally among them

      output timing/communication information
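The two MPI-specific pieces of this loop, the variable-count exchange of updated particles and the load-based repartitioning, could be shaped as in the sketch below (fish_t and all the names are illustrative, not the exact code):

/* Sketch of the variable-count exchange and of the load-based split.
 * Every rank owns the slice all[displs[rank] .. displs[rank]+counts[rank]-1];
 * after exchange_particles() every rank has fresh data for all particles.   */
#include <mpi.h>

typedef struct { double x, y, vx, vy, mass; } fish_t;

void exchange_particles(fish_t *all, int nprocs,
                        const int *counts, const int *displs)
{
    int recvcnt[nprocs], recvdispl[nprocs];            /* counts in bytes      */
    for (int r = 0; r < nprocs; r++) {
        recvcnt[r]   = counts[r] * (int)sizeof(fish_t);
        recvdispl[r] = displs[r] * (int)sizeof(fish_t);
    }
    MPI_Allgatherv(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,  /* own slice already in */
                   all, recvcnt, recvdispl, MPI_BYTE,   /* place in the buffer  */
                   MPI_COMM_WORLD);
}

/* Split the particle sequence so each rank gets roughly the same amount of
 * work, where work[i] is the number of force computations counted for
 * particle i in the last iteration.                                        */
void balance_load(const long *work, int n_total, int nprocs,
                  int *counts, int *displs)
{
    long total = 0, acc = 0;
    for (int i = 0; i < n_total; i++) total += work[i];
    long target = (total + nprocs - 1) / nprocs;

    int r = 0, start = 0;
    for (int i = 0; i < n_total; i++) {
        acc += work[i];
        if (acc >= target && r < nprocs - 1) {        /* close this rank's chunk */
            counts[r] = i + 1 - start;
            displs[r] = start;
            start = i + 1;
            acc = 0;
            r++;
        }
    }
    counts[r] = n_total - start;                      /* last rank takes the rest */
    displs[r] = start;
    for (r++; r < nprocs; r++) { counts[r] = 0; displs[r] = n_total; }
}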

Results: The Barnes-Hut algorithm has an asymptotic complexity of N log N, where N is the number of particles, and it is very efficient for large numbers of particles. We show the performance of this algorithm for 10, 100, and 1000 fish (particles) and compare it with the naive MPI implementation. As we can see, the Barnes-Hut method is about 3 to 4 times faster than the naive method for a moderate number of processors. The difference is due to the vastly reduced number of computations (e.g., for 3000 particles, Barnes-Hut makes about 300,000 calculations per iteration, compared with 3000^2 for the naive implementation). However, as the number of processors increases, the communication time starts to dominate the runtime, although the total runtime still decreases quite well.

 

Conclusions: Since the Barnes-Hut method approximates forces, we have to be careful that the errors do not accumulate; this depends on the parameter theta used to decide whether to open a node during tree walking. With values of theta around 0.5 we got quite good accuracy, but we used theta = 0.3, for which the error was less than 1% over 1000 iterations. We also think that the Euler approximation we used is not very accurate, and a better method (e.g. Runge-Kutta) would give better error bounds. Moreover, Barnes-Hut has no inherent bound on the errors it can produce, so one should use one of the modifications of this method that have better error behavior.

 

 

 

 

 

Extension of Sharks and Fish Particle Simulations

 

Using pthreads, we extended this simulation in three main ways: an aquarium simulation, a solar system simulation, and a moon simulation. As attachments to this document we provide simulation results for the cases described below, which can be visualized using the Aquarium visualization tool.

 

(1) Aquarium Simulation

 

Modelling and Implementation

 

We tried to simulate an aquarium as closely as possible.

First, we removed the unrealistic gravity and added flow: we introduced an ocean flow imitating the water currents present on Earth (case 1).

In reality, water produces considerable friction, so we added the effect of friction (case 2).

Finally we introduced SHARKS.

At first, fish are repelled by each shark (through a repulsion force), while sharks are attracted by all fish (case 3).

However, with the model above, the sharks wandered around without being very effective at catching fish, since every fish attracts each shark. A really good predator chases only one prey at a time, so we changed the model: each shark chases a single fish until it succeeds, and a shark never chases a fish already chased by another shark (case 4).

To make the simulation more realistic, we tuned the parameter values. With these settings the fish seem rather helpless compared to the sharks, but that is exactly what happens in the real world (case 5, case 6).
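A sketch of the resulting per-step targeting rule is given below (in C, with illustrative names and data layout; the actual code and parameter values differ). A shark keeps its current target; a shark without one claims the nearest fish that no other shark is chasing:

/* One step of the shark-targeting rule (case 4).  chased_by == -1 means
 * the fish is free; target == -1 means the shark has no prey yet.
 * All names are illustrative.                                            */
typedef struct { double x, y, vx, vy; int chased_by; } fish_t;
typedef struct { double x, y, vx, vy; int target;    } shark_t;

void choose_targets(shark_t *sharks, int ns, fish_t *fish, int nf)
{
    for (int s = 0; s < ns; s++) {
        if (sharks[s].target >= 0) continue;         /* still chasing its fish */
        int best = -1; double best_d2 = 1e300;
        for (int f = 0; f < nf; f++) {
            if (fish[f].chased_by >= 0) continue;    /* someone else's prey    */
            double dx = fish[f].x - sharks[s].x, dy = fish[f].y - sharks[s].y;
            double d2 = dx * dx + dy * dy;
            if (d2 < best_d2) { best_d2 = d2; best = f; }
        }
        if (best >= 0) { sharks[s].target = best; fish[best].chased_by = s; }
    }
}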

 

Difficulties

Each processor updates its own portion of memory. However, we needed to update two objects managed by two different processors at the same time. To prevent two sharks from chasing the same fish, we mark each chased fish. Beginning to chase a fish and marking it must happen atomically: if a shark begins the chase and marks the fish only later, another shark may already have begun to chase the same fish.

This problem is not specific to our application; there are many cases in which two different objects must be updated atomically. The problem gets worse when the two objects reside on different nodes.
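In the shared-memory (pthreads) setting, one way to make the test and the mark a single atomic step is to protect the mark with a mutex, as in this sketch (using the fish_t of the previous example; a lock per fish would reduce contention, and the distributed-memory case is harder still):

/* Claim a fish atomically: the test "is it free?" and the marking happen
 * inside the same critical section, so two sharks running on different
 * processors can never claim the same fish.  Sketch only.                */
#include <pthread.h>

static pthread_mutex_t chase_lock = PTHREAD_MUTEX_INITIALIZER;

int try_claim_fish(fish_t *fish, int f, int shark_id)
{
    int claimed = 0;
    pthread_mutex_lock(&chase_lock);
    if (fish[f].chased_by < 0) {       /* still free: mark it while we hold  */
        fish[f].chased_by = shark_id;  /* the lock under which we tested it  */
        claimed = 1;
    }
    pthread_mutex_unlock(&chase_lock);
    return claimed;                    /* 0: another shark got there first   */
}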

 

 

 

 

 

 

(2) Solar System Simulation

Modelling and Implementation

At first we positioned the sun, with its huge mass, at the center, and gave the planets initial speeds that let them rotate around the sun (case 1). We then assigned a different mass to each planet (case 2).

Next, we tested planets with different initial speeds: some of them followed an elliptical orbit (case 3), while others escaped from the sun.

We then removed the sun and could watch the planets form a system around the planet with the largest mass (case 4_1). During the simulation we also found an interesting case: when two planets have similarly large masses, they behave like binary stars (case 4_2).
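The initial speed for a circular orbit follows from balancing gravity against the centripetal force, G*M*m/r^2 = m*v^2/r, i.e. v = sqrt(G*M/r). A small helper of this kind could set up the starting conditions for case 1 (names and units are illustrative):

/* Give a planet at distance r from the sun the tangential speed needed
 * for a circular orbit: v = sqrt(G*M/r).  Illustrative sketch only.     */
#include <math.h>

typedef struct { double x, y, vx, vy, m; } planet_t;

void set_circular_orbit(planet_t *p, double sun_x, double sun_y,
                        double sun_mass, double G)
{
    double dx = p->x - sun_x, dy = p->y - sun_y;
    double r  = sqrt(dx * dx + dy * dy);
    double v  = sqrt(G * sun_mass / r);
    /* velocity perpendicular to the sun-planet line (counter-clockwise) */
    p->vx = -dy / r * v;
    p->vy =  dx / r * v;
}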

 

Difficulties

In the solar system simulation, the movement of the planets should be continuous rather than in discrete steps, so that each planet follows a perfect ellipse; only then is the balance between force and speed preserved, keeping the planets rotating around the sun. Otherwise very small errors accumulate, leading to imbalance and deviation from the orbit. With the Euler method, the radius of the orbit increases constantly.

This problem is well known; we found several web sites discussing it. Among them, the site below was the most useful: it covers the problem extensively and suggests solutions.

http://homepage.ntlworld.com/david.conner/Solar%20System%20Dynamics.htm

Thanks to our teaching assistant, we recently became aware of several other approximation methods, but we didn’t have time to include them in our model.
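For reference, the forward-Euler step we refer to is essentially the following: position and velocity are both advanced from the already-outdated state of the previous step, which is why small errors accumulate and the orbit slowly drifts outward (illustrative sketch):

/* One forward-Euler step: advance the position with the current velocity
 * and the velocity with the current acceleration (ax, ay).  The scheme is
 * only first-order accurate, hence the slow outward drift of the orbit.   */
void euler_step(double *x, double *y, double *vx, double *vy,
                double ax, double ay, double dt)
{
    *x  += *vx * dt;    /* position advanced with the old velocity       */
    *y  += *vy * dt;
    *vx += ax * dt;     /* velocity advanced with the old acceleration   */
    *vy += ay * dt;
}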

 

 

(3) Moon Simulation

Modelling and Implementation

We have three objects: the sun, the earth, and the moon. The earth rotates around the sun, and the moon rotates around the earth (case 1). We also added some other planets (case 2).

 

 Difficulties

Just as in the solar system simulation, the discrete time steps and round-off errors made the moon deviate from its orbit, and this deviation appeared more quickly. The movement of the moon relative to the earth is smaller and finer, so very small errors, which would be insignificant for movements relative to the sun, can push the moon off its orbit around the earth.