Sunday, August 1, 2010

[From Jeffrey] cuda-hydro speedup

This is going to be a very short update.

I improved the CUDA code in nearly every way I know, so now is the moment of truth. How much faster is it?

16 times.

This is on k3, comparing the speed of my c code to my CUDA c code under the exact same conditions.

This is slightly slower than I have hoped. Probably I'm still doing something wrong.

Another disappointing thing is the maximum resolution I can get is still just 732 by 732. If I push it to 1k by 1k, I get a segmentation fault that indicates memory shortage, which is confusing since the amount of memory the program needs should not be that much.

The next thing I plan to do is to run it on k4/k5. Sounds like a simple thing to do, but unfortunately k4 & k5 don't have mathgl installed yet. Guess I need to do that first.

Friday, July 30, 2010

[from Anthony] progress cont...

Just a quick comment on the asymptotic behaviour of D...what I thought for k3_real (red line) appeared as it looked was that we need to take care of the sign when doing a square-root...so one root for k=+sqrt(-D) and another for k=-sqrt(-D) which was why k3_real looked like a reflection of k1_real for r>>1. I agree...we need to do more analysis to explain k2 and also k3_real for r<<1.

****

So I tried integrating starting from the left with the following initial settings...

m=2 (yes...m was equal to 2)
dr = 0.003 (small enough to see if there could be kinks or interesting lumps)
r_initial = 0.01
eta(r=r_initial)= 0
v_eta(r=r_initial)=1

The coefficients B(r) and C(r) should be correct as I did the asymptotic predictions analytically and confirmed it numerically. Oh, from what I found, for r>>1, C(r) is actually increasing (rather than decreasing from our other parameters) because of the last term D/c^2, which is directly proportional to r (assuming c ~ v_kep). Nevertheless, the only part I think that has the greatest contribution to the solution of eta is I(r). Doing the asymptotic predictions for I(r) was a problem for me because I was not sure what assumptions I can do to simplify the term so I had to do it numerically.

Before I show the results, I'm going to define some of the symbols I used...

tau_1 ~ perturbed optical depth
c ~ sound speed
v_tau ~ dtau/dr (for unperturbed optical depth)
f ~ unperturbed radiative force
WW ~ shorthand for m*(Omega-Omega_p)
sigma ~ surface density
a ~ 2nd derivative of eta
v ~ 1st derivative of eta

From the list: sigma, c, f, WW, and v_tau have already been calculated prior to this integration.

So with the initial settings I adapt the leapfrog method for the integration...

tau_1[j+1] = tau_1[j]+(1/(c[j]*c[j]))*v_tau[j]*eta[j]*dr;


I[j+1]= I[j]+f[j]*(((log(r[j+1]*sigma[j+1]*f[j+1]/D[j+1])-log(r[j]*sigma[j]*f[j]/D[j]))/dr)+
((2*m*W[j])/(r[j]*WW[j])))*tau_1[j];

a[j]= - B[j]*v[j] - C[j]*eta[j]- I[j];

v[j+1] = v[j] + a[j]*dr;
eta[j+1] = eta[j] + v[j+1]*dr;

So again, the only part I'm suspicious is tau_1 which includes I(r). But here's what I got from this....



I stopped the integration at R=5 because I thought it was meaningless to continue on. This seems to act very similar to the wave for k_3 (look back at my previous post). Now, I thought it was weird that 'nothing' was happening between r=0 to r=1 so I restricted r to this range and I get...


I noticed something was happening between r=0 and r=0.2 so I restricted r further...

Looking at the numerical results, it seems that C(r) was the dominating factor (which I think was the one making these responses) and after r=0.4, I(r) became the dominate one very quickly. Moreover, this also resembles the waves for k_1 and k_2. I also plotted eta between 0.4 to r=0.8-ish...

So if you piece together the last 3 pics you would get the first pic I posted.

That's all I have so far in terms of integrating. If you need to see more pics, let me know and I'll post them.

***

Pawel, I believe you reminded me that this second ODE we been dealing with is actually an integro-differential equation. So the only method that rings to my head to solve this kind is using Laplace transform. With the initial conditions we defined (eta(r_initial)=0; v_eta(r_initial)=1) it would be convenient to use this method where we transform and solve for L(eta) then inverse laplace transform it to get an analytic solution. The first couple of terms are easy to transform but for I(r) I was not sure whether we should treat (1/c^2)(dtau/dr) as a constant and pull them out of the integral. If I did that, it would be easier to solve. So now I'm at simplifying and solving for L(eta) and so far I have this fraction with a 3rd degree polynomial at the denominator. Now I just need to find some way to simplify it and inverse transform it. But before I go further, do you think this method is actually suitable for our case???

Anthony

Thursday, July 29, 2010

[From Jeffrey] cuda-hydro

So the code works now...



The resolution for this simulation is 732x732 ( in case you wonder why such strange numbers: 732=3*(256-12) ). Approximately 10000 time steps were taken and it took 2 hours on k3. Total duration in simulation time is approximately 2 crossing time.

This is definitely not the fastest it can get. Computations are currently only done on one block that has a size of 256, so the simulation box is 3x3=9 times too large for the block. I should be able to utilize 9 blocks in parallel to do this simulation, which would in principle give me another factor of 9 in speed. The only reason why this is not done yet is that the memory sharing between blocks is not trivial and I didn't want to involve too much complication before I get it to work. Now that I understand cuda a little more, I already have a plan for the improvement and it shouldn't be too difficult.

Tuesday, July 20, 2010

[from Anthony] Progress report on modal analysis

First of all, I'm now back to Toronto from Santa Cruz and I have to say ISIMA was excellent. I definitely learned that there's a lot of more complicated problems in astro modeling where it involves lots of hydrodynamics (MHD) and there was one group from UC San Diego working on fusion (plasma physics). So, after meeting a bunch of graduate students from all over the world there's definitely a lot more we could learn in astro modeling.

Here's the site for what's currently happening in the program...

***

So while I was at Santa Cruz, I managed to get some work done on my modal analysis with Pawel. Having to solve for the roots of k (wave number) via WKB approximation along with Cardano's method on the third order polynomial; so there are three roots to this polynomial.

This is the first root of k with both real and imaginary parts. Notice the the three lumps on the graph. Those are the Inner Lindblad, corotation, and Outer Lindblad resonances respectively.

This is the second root for k.

This is the third root for k.

By integrating for eta (look back on my previous entry on what eta is...) using the WKB solution exp(ikr) or more specifically exp(i*integral(k_{i} dr)) we get the following three eta's...

This is the first eta using k_1 with both real and imaginary parts. There seems to be a large response between 0.1 to 0.5 and then some response between r=1 to r=1.7-ish.


This is the second eta using k_2. It seems nothing happens after r=0.5.

This is the strangest one. It seems eta acts violently (look at the number on the eta axis) as r grows which I can't seem wrap my head around. Looking back at the roots of k, k3 sort of acts opposite to k1, so there's a sign reversal on the k's which implies that the graph is flipped horizontally but I don't quite understand why there's such large values for eta_3.


Thursday, July 1, 2010

[from James] current status of dust-disk sim

Currently there are a couple of versions of simulation based on Josh's original code. On Cudak4, they're somewhere inside the student/.../C/src/ directories (I don't recall the exact path but it's the same folder in which all the other CUDA programs are kept). There are two versions in two separate folders called "tau" and "taunot", respectively.

For some reason, the program which ran just fine on cudak2 and cudak3 Sunday night wouldn't run on cudak4 or cudak5; the first two cudaMemCpy calls (from host to device) worked fine, but all subsequent cudaMemCpy calls, regardless of direction, threw a "wrong direction" error. Equally strange, commenting out everything tau-related made it work again.

The "taunot" folder contains the working simulation in which everything relating to the tau grid has been commented out (which is why, if you re-compile it, nvcc complains several times about nested comments). Also in this version, just for the fun of it we included a force term in the theta'' calculation to simulate a non-rotating bar. As mentioned previously, you can also ssh into cudak4 and run this sim remotely (although when I tried it from home the display was hideously slow, presumably because I have a slow internet connection and a 5-year-old computer).

The "tau" folder contains the non-working sim-in-progress in which the tau-grid is being re-implemented. After much discussion and debate (even some argument) on Tuesday we decided to re-implement tau using the following scheme:
- use two grids: an integer array "dtau_set" storing the number of particles in each grid cell; and a float array "tau_set" storing the value of tau in each cell (this will of course take up more memory but for now it seems that speed is more important)
- when the disk is initially generated, find out which cell each particle belongs to and increment that cell in the integer dtau_set
- start the simulation loop
- kernel 1: send one thread along each "ray" (solid angle) from r_min to r_max, calculating tau for each cell in the float array tau_set based on the data stored in the dtau_set integer array
- kernel 2: integration, one thread per particle: first figure out which cell the particle is in, retrieve the appropriate value of tau, and calculate beta; then calculate the acclerations and update the new positions and velocities; now re-calculate the particle's cell location, and if it has changed cells, decrement the old dtau_set location and increment the new dtau_set location

That's the plan, still in progress. Feel free to work on the code. Clearly this algorithm still isn't super-efficient: there are several calls to global memory, and some values get re-calculated more than once even though they haven't changed (most notably the particle's cell index). However we have managed to limit ourselves to two kernels, and each thread is independent with no need to invoke syncthreads(). The presentation date has come and gone (with no presentation :( ) but for my part I'd still like to see this thing up and running.

Tuesday, June 29, 2010

[from Pawel] Last minute results

Robert, James and Pawel spent a lot of time on Monday working on
(i) making infiniband work concurrently with ethernet, and installing all openMPI
tests - a lot of success with that but tests aren't finished yet. a lot of software setup and configuration needs to be done. it look like we'll get our 20Gbps!

(ii) simulation of particles.

we decided to skip tau-effects and concentrate on benchmarking our naive, barely optimized code. after conquerring the problem of 512 particles-only
in the old code, we can now run up to 30 million particles. around 10pm we even started doing some "galaxy" simulations by adding external forces. we assumed a non-rotating, weakly barred force.

we benchmarked the compute part of the code at roughly 100 GFLOP/s.
this is approximately what we'd expect from a global memory bandwidth-limited problem:
the card can stream 120 GB/s from DDR5 to GPU, that is 30 GFLoats/s.
If 3.3 FLOP are done on every float from RAM, then we are comp/bandwidth balanced. Our leapfrog integration is probably limited by bandwidth, although the GPU heats up to 93C at times, and the fan becomes a hair-dryer.

Sunday, June 27, 2010

[from Pawel] Josh's code

Since Josh has left us a cuda proto-code. We had our work-party gathering this Sunday. Only Pawel, Jeffrey, Anthony and James were available to enjoy the BBQd fish (basa=panga), but there was (i) no flying because of Buttenville airport lockdown to please 20 alien leaders, and (ii) no swimming either, because it started raining and we preferred to stay under the umbrella.
the wrong forcast must be due to the lack of CUDA.

Those of us who had looked into the code told us about how impossible it is to compile it on their computers. Anthony had to leave and the 3 remaining Magnificent retreated upstairs and started up cudak2 (yes, a clandestine copy of cudak1 does exist).

After an hour or two we figured out how to compile Josh's code:
cudak2[152]:~...C/src/tau/CUDA_pSim$ nvcc CUDA_pSim.cu -I/usr/local/cuda/sdk2.3/C/common/inc/ -L/usr/local/cuda/sdk2.3/C/lib/ -lm -lglut -lcudart -lcutil
and after a few additional hours we figured out what the !@#$ does it do that 500 particles that were stared in a nice disk begin following some very strange orbits essentially going up and falling onto the central star, then scattering at high speed to infinity.

ok, to make the long story short we corrected the initial conditions and the equations of motion to the point where some of us (me) thought they now represented the correct, Binney-Tremaine-like equations, only written using the reversed phi <--> theta convention of Josh, and some of us (Jeffrey) were complaining about the signs in front of the sin(phi_Josh). The code was still doing its weird thing, unless we forced it to simulate a 2-D, flat disk, where it worked uncomfortably slowly, but apprently ok.

In desperation, we allowed Jeffrey to do the nonsense (-:) change of signs and... everything became like Newton intended it: a stable 3-d disk. It turned out Josh's phi and theta were not the spherical-coordinates phi & theta, his phi was actually theta+pi/2 not pi/2-theta! so given his (non-standard) definition of meridional axis pointing downward not upward, our equations had two out of a dozen signs in front of acceleration terms wrong --> non-conservation of angular momentum --> trouble on timescale of a few orbits.

We put the corrected code on cudak3 here:
cudak3[9]:~/cuda/projects/tau$

cudak3[10]:~/cuda/projects/tau$ ls

3Ddisplay.h CUDA_pSim.sln movPos.h
a.out CUDA_pSim_vc90.sln movPos.h~
CUDA_pSim.cu CUDA_pSim_vc90.suo particle.h
CUDA_pSim.cu~ CUDA_pSim_vc90.vcproj particle.h~
CUDA_pSim_gold.cpp CUDA_pSim_vc90.vcproj.Josh-PC.Josh.user vc90.pdb
CUDA_pSim_kernel.cu CUDA_pSim.vcproj
CUDA_pSim_kernel.cu~ forces.h

cudak3[11]:~/cuda/projects/tau$
cudak3[11]:~/cuda/projects/tau$ a.out

size of each particle is 24 bytes

Number of particles: 1000
Mass ratio (0 <>
minimum radius from solar mass (0.0 <>
Time step: .01
Npart=1000 u=1.000e-02 R_min=5.000e-01 dt=1.000e-02 h=2.500e-04 M=1.000e+05 G=1.000e-05
creating particle array with 1000 particles...
setting initial conditions for particle array...
creating optical thickness array with 4 slices on device...
optical thickness array created...
freeglut (a.out): Unable to create direct context rendering for window 'Particle Simulation'
This may hurt performance.
cudak3[12]:~/cuda/projects/tau$
* * *
Interestingly, we did get openGL output on our remote machine (cudak2) from cudak3. It was very slow but it worked. Apparently, if you transfer data back to host from gpu device and then plot it with openGL, which is not the fastest way to do it, and you have the ssh -X ... connection to a cuda-capable machine, you can get you graphics served over internet. that's nice. All SDK demos fail to do it since they keep the frame buffer for plotting on the gpu runnig the calculation, and openGL extensions to somehow replicate the data on a remote client are not available, so demos print error messages and quit.
* * *
Next step: why is Josh's code running only ~500 particles (anything more and it refused to move the particles, the dsplay refreshed but the particles don't evolve in time)?
{Oh! I think I know. just as I was writing this I think I realized.. of course the max number must be 512, and 512 is the limit of threds per block
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum sizes of each dimension of a block: 512 x 512 x 64
this is of course the output from cudak2[6]:~...C/bin/linux/release$ deviceQuery on the gtx280 cards }

Saturday, June 26, 2010

[From Robert] Installing the software Part 1

Installing the software proved to be just as difficult, if not more, to set up than the hardware. The specifications on Nvidia's website state that the CUDA software will run on (one of many flavours of linux) Fedora 10 .
Because of this, we decided to work with version 10 64-bit (even though the latest build was 13 at the time). However, due to troubles in recognizing our new video cards (GTX 480), we were not able to load the Graphical User Interface (GUI) properly. This then prompted us to install Fedora 11 64-bit, but we ran into the main issue of the NVIDIA drivers not installing on our system due to incompatible kernel. With so much frustration, we decided to take a chance with Fedora 12 64-bit. Sure enough, everything started to work out, and installing the NVIDIA and CUDA software were on their way.
---
First off, installing the NVIDIA drivers, we decided to use Linux x86_64 Display Driver Version 256.25 Beta.
Because the NVIDIA drivers require a non-X11 interface, the computer needs to be exited out from it. To do so, a Terminal window was opened up, and logged in as root
# su
Then, the OS initialization file called 'inittab' was edited
# cd /etc/
Here, the last line of the file was edited from
id:5:initdefault:
to
id:3:initdefault:
The file was saved and the computer was rebooted. From here the computer then loaded in the OS but loaded a text based log-in screen, due to the changes made in the file. Here, the OS was then logged in as root, and then moved to the location of the NVIDIA driver to run the installation file. In our case:
# cd /Downloads/
# sh NVIDIA-Linux-x86_64-256.25.run
After installing, the inittab file needed to be reverted back to 'id:5:initdefault:' so that on boot, it would load the GUI. Once saved, the computer was restarted.
--
After installing the NVIDIA drivers, the CUDA compiler needed to be installed. In our case:
# cd /Downloads/
# sudo sh cudatoolkit_3.0_linux_64_fedora10.run.sh
This was installed in the default path (/usr/local/cuda)
To make the compiler command (nvcc) work, the .bash_profile and the .bashrc (located at ~/) needed to be edited. Both PATH and LD_LIBRARY_PATH needs to be pointed to the CUDA library:
#
PATH=/usr/local/cuda/bin:$PATH
LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH #if 32 bit machine is used, use lib:$
export PATH LD_LIBRARY_PATH
#
For the .bashrc file, the following needed to be added:
#
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda/lib64
#

The system then needed to be restarted for the changes to take place.
--
The only thing left now was installing the SDK, which includes both sample and benchmark programs and diagnostics for the hardware. In our case:
# cd /Downloads/
# sh gpucomputingsdk_3.0_linux.run.sh
This was then installed in the default path (~/NVIDIA_GPU_Computing_SDK/), although it could have been installed in non-user path as well.

Once this was installed, a MAKE file had to be run on the SDK at ~/NVIDIA_GPU_Computing_SDK/C/
In our machine, our make file did not execute properly, complaining about " cannot find -lglut "
To fix this, we needed to install the glut packages:
# sudo yum install freeglut
and
# sudo yum install freeglut-devel

Once these were installed, the MAKE file was executed again, and it passed, installing the SDK examples to ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/
From there, a numerous executables are available, mainly deviceQuery and bandwidthTest.

# ./bandwidthTest --device=all
[bandwidthTest]
./bandwidthTest Starting...


!!!!!Cumulative Bandwidth to be computed from all the devices !!!!!!

Running on...

Device 0: GeForce GTX 480
Device 1: GeForce GTX 480
Device 2: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5155.2

Device to Host Bandwidth, 3 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4267.7

Device to Device Bandwidth, 3 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 120670.9


[bandwidthTest] - Test results:
PASSED


# ./deviceQuery --all
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 3 devices supporting CUDA

Device 0: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1609760768 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 1: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

Device 2: "GeForce GTX 480"
CUDA Driver Version: 3.00
CUDA Runtime Version: 3.00
CUDA Capability Major revision number: 2
CUDA Capability Minor revision number: 0
Total amount of global memory: 1610285056 bytes
Number of multiprocessors: 15
Number of cores: 480
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.45 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4246847, CUDA Runtime Version = 3.00, NumDevs = 3, Device = GeForce GTX 480, Device = GeForce GTX 480


PASSED

At this point, the NVIDIA CUDA was installed and working.
To be continued...

[from Pawel] success with drivers, methinks

well, it seems that some of our previous problems stemmed from our not RTFM; we had mismatched versions on three things in the nvidia stack, all of which need to always be updated together:
in /home/student/Downloads/ on both cudak4 and cudak5 we now have the newest driver (newer than was available from nvidia download page a few days ago when I reported last)
NVIDIA-Linux-x86_64-256.35.run [version 256.35]
we also have a toolkit cudatoolkit_3.1_linux_64_fedora12.run, and finally
the SDK in gpucomputingsdk_3.1_linux.run
all these things except the last need to be installed by a superuser like
#sh (file name with .run extension)

I think you should set up a directory, either NVIDIA_GPU_Computing_SDK/
or like I do, cuda31 (to have cuda toolkit/SDK version visible) **in your own**
home directory. Then you can modify the sdk programs in the, say,
/home/your-name/cuda31/C/src/ directory at will, and use their makefiles which
put the demos or your own program executables in the /home/your-name/cuda31/C/bin/linux/release/ directory

* * *
It seems that all the sdk demos are working properly on all cards, on both
cudak4 and cudak5 now!

[I can see temperatures in the nvidia-settings utility on k4 now, but not yet on k5,
I'll try rebooting. in any case, I have stress-tested k4, and it works ok, also doesn't get stuck in high-temp state].

I'm curious about the CPU speed. the Fedora 12 utility shows that it's remaining in the slow/cool state 1.6(?)GHz, even when running multiple cuda examples. I'd think it should switch to the max ~2.86 GHz. does that have to do with any benchmark results?

[from Josh]

I have sent an email with my code in it. It is written in C/C++ and CUDA C. It is almost working. What it really needs is someone who knows more about CUDA to take a look at it. Pawel knows more than any of us on the subject I'm sure. Anyway I wont be able to work on it at my camp so I'm leaving it to you. Theres 3 days and I've completed 99% of the code. All that really needs to be done is adjust the way some things are written to avoid some memory bugs that wont allow it to run properly (I get through about 5 time steps then there's an error). I will post what the problems are and the layout of my code to make things easier for you.

1. in the particle.h define all the constants and structs (CUDA C only deals with structs not classes).
2. in 3ddisplay.h is the visualization which i have debugged completely.
3. in movPos.h is the calling of the kernels and memcpy from device to host.
4. in CUDA_pSim.cu is the main function and setup resides in runSim function.
5. in CUDA_pSim_kernel.cu are the kernels

Things that need to be checked over: 3,4,5
I think the problem may stem from how I have defined the tau_set. I compressed a 3D array into a 1D array. since the array was of dimensions theta_s x phi_s x r_s the 1D array goes from [0..theta_s * phi_s * r_s-1] where each vertical slice of the shell starts at the bottom at (theta=0,phi=-pi/2,r=0). The easiest thing to do here would be for someone who understands CUDA better to implement a 3D array and simplify things greatly.

There is need in the second kernel to place an individual particle in the corresponding tau cell and I had calculated the index to be (int(theta/dtheta)%theta_s)*phi_s*r_s + (int(phi/dphi)%phi_s)*r_s + (int(r/dr_%r_s). based on theta,phi,r of particle, and tau cell spacing dtheta,dphi,dr and total number of pieces in each demension theta_s,phi_s,r_s.

However this did not work since obviously a negative number is achievable. This caused an error. Also it would be good if someone could think of a cheap way of reducing theta and phi into a lower angle once it exceeds 2pi or pi/2 respectively.

If you implement a 3d array then you must modify the kernels so that every loop in movPos calls the first kernel which zeros all tau grid. Then the second kernel must be called (after __syncthreads()) which counts the number of particles in each grid and saves it in the respective tau cell of your array. Then the final kernel may be called again after synchronizing which integrates the particles by calculating tau and using the beta equation.

Everything is well commented and if worst comes to worst and you just want to dispaly CUDA then you can try to remove all aspects of tau and simply show an N-body simulation. That would mean that the only array is a 1D particle array. And I already know that my integration code works from my previous version.

Also to compile my code you must specify in linux "-lm -lglut -lglu -lcudart -lcutil" You can most likely just use a make file from another project that has openGL enabled.

I'm very sorry I have wont be able to help out the next few days. I wanted to be at the presentation very much and had planned on it until the date changed (after I had already made plans with my family).

I wish you all good luck. I will check my email in the next few hours to before I leave to see if any of you respond.

I will also send contact info so that you can reach me and ask questions about my code if you choose to try and fix it. I think only someone good and CUDA can fix the issue however I think you all should try.

Josh,

Friday, June 25, 2010

[from Pawel] new driver/SDK on cudak5

we have a 195.36.31 driver (stable, new) on cudak5.
the previous one, 256.25 was a beta version that we needed to compile anything
using the gcc-3.4 compiler, but gradually showed some serious problems,
like not seeing chosen devices and putting them in high-gear mode (high T) permanently.

Josh installed the appropriate toolkit with SDK and it now works ~ok!!
I've tested every SDK demo on all cards and the result is that we no longer have
any problems with multi-GPU!

the problems you'll encouner running nvidia-settings utility & examples in
/home/student/NVIDIA_GPU_Computing_SDK/C/bin/linux/release

are restricted to following programs
====================================================

nvidia-setting does not show:
Thermal Settings
PowerMizer
DFP-1(20 - GeForce GTX 480)

for devices 1 and 2.
to see temperatures you must say
% nvidia-smi -q -a
which prints correct temperatures. I've been able to generate T=90-94C while running
3 examples of nbody test on 3 cards. it's normal.
=======================================================================================
printout from ./deviceQuery
(..)
Concurrent copy and execution: Yes
Run time limit on kernels: No
(...)

and yet...

nt@cudak5 release]$
[student@cudak5 release]$ ./concurrentKernels
[concurrentKernels] - Starting...

CUDA Device GeForce GTX 480 has 15 Multi-Processors
CUDA Device GeForce GTX 480 is NOT capable of concurrent kernel execution
concurrentKernels.cu(111) : cudaSafeCall() Runtime API error : unspecified launch failure.
[student@cudak5 release]$
======================================================================================

[student@cudak5 release]$ ./fluidsGL
[fluidsGL] - [OpenGL/CUDA simulation]
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL_kernels.cu(44) : cutilCheckMsg() CUTIL CUDA error : cudaMemcpy failed : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$

[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=2
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 2: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$
[student@cudak5 release]$ ./fluidsGL --device=1
[fluidsGL] - [OpenGL/CUDA simulation]
Using device 1: GeForce GTX 480
CUDA device [GeForce GTX 480] has 15 Multi-Processors
fluidsGL.cu(183) : cudaSafeCall() Runtime API error : unspecified launch failure.
fluidsGL.cu(379) : cutilCheckMsg() CUTIL CUDA error : cudaGLUnregisterResource failed : unspecified launch failure.
[student@cudak5 release]$
[student@cudak5 release]$


====================================================================

Those who are close to producing working CUDA programs please do so on cudak5 in student account. Don;t worry (by necessity) about infiniband.
While having no internode comm looks moderately bad, not having any CUDA will definitely look bad during our 29th Jun meeting.

We have very little time.. full time engagement is expected now to the end of month.

Thursday, June 24, 2010

[from Josh] CUDAizing progress report

I would say I'm about half way done CUDA conversion. Anyway this is my scheme for now since it seems the simplest with the given time we have. I have a single copy of tau grid on device and a copy of particle array on both host and device. Once i initialize data on host I send it to the device and call a single kernel. The kernel executes force command and integrates a single time step. I only retrieve data from the device at intervals of modulus N. In between kernel calls I have to sync the threads so that the particle array remain synchronous. The only type of memory I am currently using is global memory for the two arrays, all other data is passed as params. After I get this first version working I would like to make it work on multiple device (which require memory sharing and work load splitting) have groups of particles in the same tau shell use shared page locked memory to share the tau slice properties of that slice and do particles in order by slice instead of chronological. This would have two benefits: page locked memory is very fast (it is the memory that thread blocks can see and use fastest. It is also called pinned memory and it is very limited in size so only small usages are applicable), and doing particle by slice would cause tau not to be retrieved multiple times (wasted time). Anyway I think I'll be able to make it to the meeting tomorrow however I may spend the time finishing my code since I'm leaving tomorrow for my cottage and need to finish the code before the 29th. I'll see if I can get my code running on c4 or 5 before I leave on a single device.

Wednesday, June 23, 2010

[from Anthony] Brief Update

I should explain what I've been doing. For the past several weeks I've been reading quite a lot of papers on modal anaylsis by going through their derivations of the modal equation (specifically Goldreich/Tremaine 1979, Val-Borro et al. 2008, Laughin/Bodenheimer 1994,...and more). We mainly stayed with Goldreich/Tremaine (GT79) derivation where the modal equation is...
...where...
Phi_1 is the total perturbed potential, phi_1 can be thought of as the perturbed gravitational potential (self-gravity, external potential, etc.) and eta is the perturbed enthalpy.

Now, with radiation pressure, the modal equation becomes...

Sorry if it's too small...I can't seem to make it bigger. Essentially, this equation is a second order ODE of the form...
where...
Note that the term I is also in terms of eta so we believe this equation can be thought of as a third order (homogeneous?) equation. And obviously the coefficients we're dealing are non-constant with derivatives all over the place.

So far, I tried to code this type of ODE (assuming the coefficients are constant) via leapfrog and I have something like this...
...like a decaying free-vibration...which is expected. Also remember that equations like these typically have two solutions (if you solve the characeristic equation you would a few roots - real or imagninary).

So now, I'm trying to figure out how to code this equation with all the non-constant coefficients included.

Tuesday, June 22, 2010

[from James]

I've discovered that my integrator isn't working properly; I've tried three different formulations of the leapfrog method, with different initial conditions, but with all of them my error keeps growing. Also, the program gradually slows down, even with only 1 particle, and after about 50,000 force calculations it slows to a crawl (and the relative error climbs close to unity).

leapfrog version 1:

R = r + 0.5*h*v # drift for 1/2 step
T = theta + 0.5*h*omega # drift for 1/2 step

dv = R*omega*omega - GM/(R*R) # acceleration in r
dw = -2.0*omega*v/R # acceleration in theta

v = v + h*dv # kick for full step
omega = omega + h*dw # kick for full step

r = R + 0.5*h*v # drift for remaining 1/2 step
theta = T + 0.5*h*omega # drift for remaining 1/2 step

dE = h*(GM*v/(r*r) + v*dv + r*v*omega*omega + r*r*omega*dw)
error = log(abs(dE/E),10)

Version 2 (this one does an initial 1/2 kick before the loop starts):

# incremental acceleration
dv = L*L/(r*r*r) - GM/(r*r)
dw = -2.0*omega*v/r

# drift for full step
r = r + h*v
theta = theta + h*omega

# Kick for full step - note that because of the initial 1/2 kick, this
# will put the velocities 1/2 step ahead of the positions
v = v + h*dv
omega = omega + h*dw

# compute total energy, as a check:
E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error = log(abs((E-E2)/E),10)


Version 3 (this one modifies the equations to allow full time steps over
the same time interval, but is mathematically equivalent to the other
versions):

dv1 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t
dw1 = -2.0*omega*v/r

r = r + h*v + 0.5*h*h*dv1
theta = theta + h*omega + 0.5*h*h*dw1

dv2 = L*L/(r*r*r) - GM/(r*r) # incremental acceleration in r at t+dt
dw2 = -2.0*omega*v/r

v = v + 0.5*h*(dv1 + dv2)
omega = omega + 0.5*h*(dw1 + dw2)

E2 = 0.5*(v*v + r*r*omega*omega) - GM/r
error=log(abs((E-E2)/E),10)


You probably noticed that I calculated the error in total energy differently in the first method compared to the second two; the first one calculates dE and then the error as dE/E; the second way is to calculate E2 based on the updated positions and velocities and then the error as (E-E2)/E. Both methods should yield similar errors.

[from Pawel] Mail problem

Today I have a strange mail problem - maybe it's some campus upgrade gone wrong. Anyway, what I was going to mention is that we're in our final 1-1.5 weeks of work and that means: switch to panic mode. I expect progress and problem reports on this blog from all of you almost every day! I'd much prefer the former :-)

The St George campus will be closed from Thu so we'll meet on Fri at UTSC,
and on Sat. at my place. We'll of course spend some time with some of you who haven't been to Buttonville, within 50 km radius from there >->--

The meeting is 29 June, 2pm in AA bldg rm 401 or something like that (Vce Principal Research room on 4th floor). Those who can attend, say so!

There are many things which don;t work right now.
cudak4/5 don't have:
1. fixed IP addresses,
2. a working set of driver + SDK which sees 3 fermi cards properly (I installed driver 195.36.31 which was released days ago, but that doesn't work with the more experimental 256.25 (or some such)
that we had before. I didn;t have time to solve that thing and see if all cards are usable with the new recommended driver.
3. device 2 on cudak5 overheated, was 78C at idle, instead of normal 56C.
The reason was that it was in the high=gear state all the time (max clock speed on gpu and mem).
After driver installation, toolkit failed to install so I don;t yet know if that problem went away.
nvidia-settings showed that 1 of of 3 devices (0,1,2) has a different version of
VBIOS. however, it was dev 1 not 2, so it's false hint.
4. even though most SDK demos ran correctly on dev 0 and 1, there was a disturbingly long startup time, as opposed to instantaneous load on old cuda setups. dev 2 wass also hanging for a looong time before generating the Cuda Mem Alloc Error or Unspecified Launch Error.
this is some kind of software/driver problem. Robert swapped 2 cards and
dev 2 was always having problems (confusingly, dev 1 has problems on cudak4).

Any help in solving these fascinating problems will be appreciated.
Maybe I'll go to UTSC today..

keep up the good work

Pawel

Monday, June 21, 2010

[from Jeffrey] hydrocode

Code works now. Yay.

This is a shock tube example done in 2d. The resolution is 500 by 500, and the Courant number is 0.5.

video

Now onto cuda-izing.

[from Josh] progress update

video
Google erased my last post so I'm posting this again in a hurried fashion.
I finished my tau simulation.
Used same format as previous simulation in xyz that I did.
Used sphereical coords with r.. i derived in previous post.
Where radial acceleration I use F = Fgrav*(1-beta*exp(-tau))
tau is calculated by
tau = gamma*(A_*)*sum[from i=0 by increments of dr up to i=r](N_i/A(i))
Where gamma is a constant defined by
gamma = -log(I/I0) (for an individual particle)
N_i is the number of particles at radius i (in slice)
A(i) is the surface area of section at radius r defined as
(r^2 + r*dr + 0.5*dr^2)*dtheta*dphi

Program is slow because I hurried to finish and I will need to spend some time optimizing. But now I need to convert it to CUDA C by Thursday since I plan to leave for my cottage by Friday.

EDIT: again I just realized that I can't see anything with google resolution modification so I'll post on youtube when I get a chance and post a link

Wednesday, June 16, 2010

[from Josh] r.. (..=double derivative) in r,theta,phi coord system

The coord system I use along with r.. in spherical coords. I'm nearly positive that I'm accurate since it was very messy in the rough but I think I caught everything and simplified as far as can be done.




Tuesday, June 15, 2010

[from Josh] r.. theta.. and phi.. (..=double derivative) in x,y,z coord system


When using spherical coords (r,theta,phi)
where r is the radius from center,
theta is the angle from the x axis (0..2pi),
and phi is the angle from the x-y plane (-pi/2..pi/2):

x = r*cos(theta)*cos(phi)
y = r*sin(theta)*cos(phi)
z = r*sin(phi)

In reverse:

r = sqrt(x^2+y^2+z^2)
theta = arctan(y/x)
phi = arctan(z/sqrt(x^2+y^2))

The following time derivative were derived using the above three equations for
r, theta, and phi. I tried to make it as legible as possible
Let me know if you would like a neater copy and I'll post a picture:


The first time derivative of r(t) is:

/ d \ / d \ / d \
x(t) |--- x(t)| + y(t) |--- y(t)| + z(t) |--- z(t)|
\ dt / \ dt / \ dt /
---------------------------------------------------
(1/2)
/ 2 2 2\
\x(t) + y(t) + z(t) /



The second time derivative of r(t) is:

/ 2
1 |/ d \ / d / d \\
------------------------ ||--- x(t)| + x(t) |--- |--- x(t)||
(1/2) \\ dt / \ dt \ dt //
/ 2 2 2\
\x(t) + y(t) + z(t) /

2 2
/ d \ / d / d \\ / d \ / d / d \\
+ |--- y(t)| + y(t) |--- |--- y(t)|| + |--- z(t)| + z(t) |--- |--- z(t)||
\ dt / \ dt \ dt // \ dt / \ dt \ dt //

2
/ / d \ / d \ / d \\
\ |x(t) |--- x(t)| + y(t) |--- y(t)| + z(t) |--- z(t)||
| \ \ dt / \ dt / \ dt //
| - ------------------------------------------------------
/ (3/2)
/ 2 2 2\
\x(t) + y(t) + z(t) /

The first time derivative of theta is:

/ d \ / d \
|--- y(t)| x(t) - y(t) |--- x(t)|
\ dt / \ dt /
---------------------------------
2 2
x(t) + y(t)

The second time derivative of theta is:

/ d / d \\ / d / d \\
|--- |--- y(t)|| x(t) - y(t) |--- |--- x(t)||
\ dt \ dt // \ dt \ dt //
---------------------------------------------
2 2
x(t) + y(t)

-

// d \ / d \\ / / d \ / d \\
||--- y(t)| x(t) - y(t) |--- x(t)|| |2 x(t) |--- x(t)| + 2 y(t) |--- y(t)||
\\ dt / \ dt // \ \ dt / \ dt //
---------------------------------------------------------------------------
2
/ 2 2\
\x(t) + y(t) /


The first time derivative of phi is:

/ d \ / 2 2\ / d \ / d \
|--- z(t)| \x(t) + y(t) / - z(t) x(t) |--- x(t)| - z(t) y(t) |--- y(t)|
\ dt / \ dt / \ dt /
------------------------------------------------------------------------
(1/2)
/ 2 2\ / 2 2 2\
\x(t) + y(t) / \x(t) + y(t) + z(t) /

The second time derivative of phi is (might as well shoot yourself now):

/ 2
1 |/ 2 2\ / 2 2
------------------------------------------- |\x(t) + y(t) / \x(t) + y(t)
(3/2) 2 \
/ 2 2\ / 2 2 2\
\x(t) + y(t) / \x(t) + y(t) + z(t) /

2\ / d / d \\
+ z(t) / |--- |--- z(t)||
\ dt \ dt //

/ 2 2\ / 2 2 2\ / d / d \\
- x(t) z(t) \x(t) + y(t) / \x(t) + y(t) + z(t) / |--- |--- x(t)||
\ dt \ dt //

/ 2 2\ / 2 2 2\ / d / d \\
- z(t) y(t) \x(t) + y(t) / \x(t) + y(t) + z(t) / |--- |--- y(t)||
\ dt \ dt //

2 2
/ d \ / 2 2\ / 2 2\ / 2 2
- 2 z(t) |--- z(t)| \x(t) + y(t) / - 2 \x(t) + y(t) / \x(t) - z(t)
\ dt /

/
2\ / / d \ / d \\ / d \ |/ 1 4
+ y(t) / |x(t) |--- x(t)| + y(t) |--- y(t)|| |--- z(t)| + 2 ||- - y(t)
\ \ dt / \ dt // \ dt / \\ 2

2
4 1 2 2 1 2 2\ / d \
+ x(t) - - y(t) z(t) + - x(t) y(t) | |--- x(t)|
2 2 / \ dt /

/ 2 2 1 2\ / d \ / d \
+ 3 |x(t) + y(t) + - z(t) | |--- y(t)| y(t) x(t) |--- x(t)|
\ 3 / \ dt / \ dt /

2\ \
1 / 4 / 2 2\ 2 4\ / d \ | |
- - \x(t) + \z(t) - y(t) / x(t) - 2 y(t) / |--- y(t)| | z(t)|
2 \ dt / / /

Happy Coding,

Josh,

[from Jeffrey]mathgl on k3

The mathgl library is now available on k3 for everyone to use. Here is a simple example for it:

//////////////////////////////////////////////////////

#include </usr/local/include/mgl/mgl_zb.h>

void image(char *fname){

int imax=60;
int jmax=60;
mglGraph *gr = new mglGraphZB;

mglData dat(imax,jmax);
//dat is declared as 2d with dimensions imax by jmax.

for (int i=0; i<imax; i++){
for (int j=0; j<jmax; j++){
dat.a[i+imax*j]={insert data to array};
}
}

//double *a is a public variable in
//the mglData class where data is
//saved as a 1d array.

//similarly for 3d plotting, dat can be declared as:
//
//mglData dat(imax,jmax,kmax);
//
//and each number in the array can be accessed though:
//
//for (int i=0; i<imax; i++){
// for (int j=0; j<jmax; j++){
// for (int k=0; k<kmax; k++){
// dat.a[i+imax*(j+jmax*k)]={insert data to array};
// }
// }
//}

gr->Box();
gr->Dens(dat);
gr->Colorbar();

cout << fname << " is being saved." << endl;
gr->WritePNG(fname);

return;
}

/////////////////////////////////////////////////////////

The function above produces a 2d density plot which is saved as a png file. For more fancy plots just check out the mathgl manual. Have fun!

In other news, my ppm sweep is not working quite right for unknown reasons. More debugging for me.

Monday, June 14, 2010

[from Pawel]

Yes, these things are challenging. last Friday(?) we played around with SDK demos on cudak4 with Jeffrey.

Some don't work (notably fluidsGL!), and with others we've had system freeze-ups. :-| They may not be related to SDK demos however, it could be a thermal instablility(?) of something on the mobo.

We need to investigate - hardware team, can you spend some time runnig things on cudak4, e.g., multiple SDK demos, to heat up the box and see if this will occur again. Before that, please install additional 14 cm fans (one in the vertical position behind harddisk compartment. The fans should snap into the black plastic frames, which should be attached to the bottom of the case with 3 screws I believe.
* * *

Next thing - networking with infiniband, we need to install drivers.

* * *
NOTICE!!
I've rescheduled the 24 June meeting for 29 June, rm AA401 UTSC, at 10-11am.
Time is very short and we must spent full time now working on this, ok?

I'd like us to deliver the working pair of nodes, with with our software demos, and networking
by the end ot this month. If we fail to deliver any working, useful programs, some of us may see a pay cut. :-(
cheers

P

Friday, June 11, 2010

[From Robert] Installing the hardware

Our super computer hardware set-up has proved to be more challenging step than what we originally planned for.
Custom build for high performance, each computer consists of:
1 Thermaltake VH6000BWS tower case
1 Asus P6T7 WS Supercomputer motherboard
1 Intel i7 930 2.8GHz processor
6 OCZ 2GB DDR3-1333MHz Triple Channel memory giving us 12GB in total
1 Western Digital 2TB Green Caviar
3 EVGA Nvidia GTX 480 with 1.5GB Video memory
1 LG GH22N S50 DVD Writer
1 Silverstone Strider ST1500 1.5kW power supply
1 Infiniband InfiniHost III Lx Single-Port MHGS18 20Gb/s card

There were numerous challenges when putting the parts together.
Our first speed bump happened when our newly built system would not recognize more than 8GB RAM, and at other times, no more than 2GB. It was a very peculiar error, as it randomly changed how much the computer saw. It did not help when moving the RAM sticks around, as it sometimes would not even start up. After confirming that it was not the memory modules that was bringing down the machine (by verifying it with the other computer), it was determined that it was the motherboard.
An repair was placed for it, but unfortunately, the repair was said to take 1 - 2 weeks, which we could not afford, and hence, another motherboard was purchased.
Our second speed bump occurred when the replacement motherboard arrived. After configuring the system, the computer would not turn on. After diagnosing and checking everything over numerous times, it was decided, that this new motherboard was also defective. This too was sent in for repair, though fortunately, our first motherboard had been 'fixed' and returned to us.
After receiving back the motherboard, the system was built again, and fortunately this time, everything worked flawlessly.

More updates to come...

Wednesday, June 9, 2010

[from Josh] Progress Report



I would like to say I'm almost done my CUDA conversion process but who knows with debugging. I would say my best guess is that I'll be done in 4 days, so by Monday. It should hopefully include a Cpu version and working CUDA version of a tau modeling polar coord simulation.

Also me and James were discussing how it would be possible for an adaptive leapfrog scheme and came to the conclusion that we would have to adjust the time stepping globally to do it. Individual time step adjustment would lead to skewed time system since some particles would not update in synchronous time with all the rest (causing undesired shadow effects).

Josh,

Monday, June 7, 2010

see how it flies

link to a book by a physicist on flying, it's actually the best I've seen anywhere, as it concentrates on the physical things. It's a complex reading at itmes if you haven't encountered aerodynamics before, and therefore it's interesting.

http://www.av8n.com/how/

* * *

please report difficulties and successes in blog entries. ask for help if you need to.
by now I'd think everybody's got the cpu side working and is deciphering the cuda stuff.

time is T-17 days for being ready with our testbed system running cuda examples and we're not even half way :-|

the infiniband switch is somewhere near - canada customs is already asking for money..
[edit: it's in the UTSC shipping dept!]

Monday, May 31, 2010

[from Pawel] how to compute tau? Party on Sat.

I'm talking about the optical thicknaess calculation for a bunch of particles.
Suppose the disk is 2D, otherwise you'll have to generalize. Then each particle must deposit
its dtau (small contribution to tau).

we talked a lot about this with James amd he's your guru on that (right James?). ask him any questions. you first declare an array (2d: r,phi) that will hold dtau(r,phi)

we established that it is much better to use the so-called "scatter" approach than "gather" approach. in the first, you traverse the list of N particles, and donate their dtau's to the
appropriate grid locations (grid indexes are not stored but found each time you need them for a given particle, it should save time and space in ram). in the second approach, you'd traverse the
dtau grid and look for particles inside each grid cell (that's a lot of searching! ~N.)

we need exp(-tau) in the end, we can get it by integrating tau(r,phi) = integral_0^r dtau(r,phi)
= sum_along_r_from_the_innermost_to_the_particle's_cell of dtau(r,phi).

we've talked about a nice trick to avoid computing exponential functions later on.
suppose you have dtau(r) in an array and want exp(-tau). then you can get a very good approximation to f=exp(-tau) by
f = integral_0^tau (-f) dtau
why? because df(tau) = -f(tau) dtau is the differential equation with solution f=exp(-tau) .
you need to start with initial value f=1 at the inner radius and integrate outside.

what is dtau contributed by each particle? it's dA/(4pi r^2) where dA is its geom. cross section.
in general, we don't want to count particles, all we want is to arrive at tau(r,phi) which
in the initial axisymmetric state has tau(r=outer radius) = tau_infinity = 10 (or 3, or something you input). Therefore do an integration using dtau = C/r^2 where C=1 initially, and after getting
tau(outer) or exp(-tau(outer)) computed with the above algorithm, you recalculate C so in all the
later calls to the tau-computing procedure it gives the right outside exp(-tau_infinity) number.

* * *

We will meet on Friday downtown, meanwhile there seems to be enough work and enough computers already doing or starting to do cuda at UTSC to drop in there the rest of the week.

On Sat, as I mentioned, you are invited for a pool-side party (swimming encouraged,
take your swimming pants). bring a friend if you'd like.
I'll make some bbq, but if you want fancy salads maybe we can make it a potluck = byo.. party) hopefully the weather will cooperate so we can go flying for 20 min with each of you out of buttonville airport (north of 404 & 407). http://planets.utsc.utoronto.ca/~pawel/VANsRV.html

please let me know how many are coming and whether you have a car (it would be good to have
more than just my car for the buttonville trip).

finally, the coordinates (time will still be confirmed, we'll see if the Wx is ok the whole or just part of the day):

37 Fairhill Cres., toronto, 2pm, Sat 26 May. I can give you a 1-2km ride from from Underhill and Lawrence to my place, send me the details of how you're coming so we'll set some time to meet
at the gas station there. the bus that goes on Lawrence is called 54.

Thursday, May 27, 2010

[from Josh] OpenGL code

I propose that if ones wishes to post code then they post it between "pre" tags so that they preserve formatting and allow others to test their code out.
"look at how I did this under HTML view to see what a pre tag is"


The way that I have built my code everything that is needed to run the simulation (such as "math.h", "freeglut.h",...etc) is "#include"ed in one header file. Also within that header file I define the global variables that will be used for the simulation such as number of particles, current runtime, dt, solar mass,... etc. I also define a class object called particle and define a dynamic array of particle objects. I then include this first header file in another header file called "forces.h". In the header file "forces.h" I define all the different forces that I want to take effect in my simulation (currently just gravitational but its very easy to add in other forces). I then include the forces header file into another header file called "movePos.h". In "movPos.h" all the numerical integration takes place. It calls functions from "forces.h" to do so and updates the particles positions. I then include "movePos.h" into my graphic header called "3Ddisplay.h" which I have posted below. Here, within my render function I call functions in "movePos.h" which update the particle object array that I defined in the very first header file. I then graph the positions of each of the particles adding visualization techniques and also display data.

Up until now everything has been contained in seperate header files that are easily updated, debugged, and managed. You should know that header files release their properties to any file that includes them, (ie. things that include header files inherit all things within the header files). I then have a file (not a header) called "main.cpp". This file includes "3Ddisplay.h" and thus inherits all the other header files functions and global variables. In "main.cpp" I define a single function called "main" that interactivity asks the user what type of simulation they would like to perform, what the parameters of the simulation are and then calls the displaySim function in "3Ddisplay.h" which then displays the simulation as you requested. The program is terminated when the user presses "q" in the display window.

Also my simulation utilizes openMP and will run in parallel across all available CPU cores.

I tested my program using "extern" variables instead of in inherited header variables but the complexity rose considerably and made debugging a nightmare. I also test various pointer array object declaration techniques to see which one was the fastest and I found that the using my current method of memory allocation is the fastest. This is similar to the way we would do things with CUDA too since CUDA works fastest with "pinned memory" (memory that is allocated once and remains static for the remainder of the program).

If you would like to use the below code you will have to either a) modify it completely so that it works with your code format, or b) try and follow my format.

In general you really only need to do two things:
- change displaySim function to "main" and add #include freeglut.h at the top of your program
- in the render function replace the movePos3D with your own code that updates particle positions. (Note where it says pPart[i] is my particle object and you should replace it with what ever your particle is or the corresponding x,y,z,vx,vy,vz of your code.

Note: you will need to have freeglut installed on your system and locatable by your compiler. See my previous post on the topic for those using windows.


#include "movePos.h"

/*openGL vars*/
static int left_click = GLUT_UP;
static int right_click = GLUT_UP;
static int xold;
static int yold;
static int width;
static int height;
static int wh;
static int hw;
static int vidtype = 'n';
static int leavetrail = 0;
static int paused = 0;
static double rotate_x = 30;
static double rotate_y = 15;
static double alpha = 0;
static double beta = 0;
static double scale = 100;
static double trans = -0.1*scale;

/*openGL functions*/
void ReshapeFunc(int new_width, int new_height){
width = (double)new_width;
height = (double)new_height;

hw = height / (double) width;
wh = width / (double) height;

glMatrixMode(GL_PROJECTION);

glLoadIdentity();
gluPerspective(20 * sqrt(double(1 + hw * hw)), wh, 8, 12);

glMatrixMode(GL_MODELVIEW);
glutPostRedisplay();
}

void MotionFunc(int x, int y){
if (GLUT_DOWN == left_click){
rotate_y = rotate_y + (y - yold) / 5.f;
rotate_x = rotate_x + (x - xold) / 5.f;
if (rotate_y > 90) rotate_y = 90;
if (rotate_y < -90) rotate_y = -90; glutPostRedisplay(); } if (GLUT_DOWN == right_click){ beta = beta + (y - yold) / 2.f; alpha = alpha + (x - xold) / 2.f; glutPostRedisplay(); } xold = x; yold = y; } void keyCB(unsigned char key, int x, int y){ if( key == 'q' ){ free(pPart); exit(0); } if( key == '+' ){ scale *= 2.0/3.0; } if( key == '-' ){ scale *= 1.5; } if( key == 'r' ){ vidtype = 'r'; } if( key == 'v' ){ vidtype = 'v'; } if( key == 'n' ){ vidtype = 'n'; } if( key == 't' ){ leavetrail = abs(leavetrail - 1); } if( key == 'p' ){ paused = abs(paused - 1); } } void renderBitmapCharacher(float x, float y, float z, void *font,char *string){ char *c; glRasterPos3f(x,y,z); for (c=string; *c != '\0'; c++){ glutBitmapCharacter(font, *c); } } void Render(int face){ double D; int i; /* Axis */ glBegin(GL_LINES); glColor3d(1, 0, 0); glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale); glVertex3d( 0.7*scale, -0.7*scale, -0.7*scale); glColor3d(0, 1, 0); glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale); glVertex3d(-0.7*scale, 0.7*scale, -0.7*scale); glColor3d(0, 0, 1); glVertex3d(-0.7*scale, -0.7*scale, -0.7*scale); glVertex3d(-0.7*scale, -0.7*scale, 0.7*scale); glEnd(); glRotated(beta, 1, 0, 0); glRotated(alpha, 0, 1, 0); glColor3d(1, 1, 1); glBegin(GL_POINTS); if (face==0) movePos3D(); KE = 0.0; #pragma omp parallel for for (i=0;i width)
glOrtho(-scale, scale, -scale * hw, scale * hw, -scale, scale);
else
glOrtho(-scale * wh, scale * wh, -scale, scale, -scale, scale);


glMatrixMode(GL_MODELVIEW);
/* Perspective view */
glViewport(0, 0, width / 2, height / 2);
glPushMatrix();

/* Display data */
char Npart_str[33];
sprintf(Npart_str,"N=%d",Npart);
renderBitmapCharacher(-scale,scale - 0*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,Npart_str);
char G_str[33];
sprintf(G_str,"G=%2.4Lf",G);
renderBitmapCharacher(-scale,scale - 1*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,G_str);
char t_str[33];
sprintf(t_str,"t=%2.4Lf",t);
renderBitmapCharacher(-scale,scale - 2*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,t_str);
char h_str[33];
sprintf(h_str,"h=%2.4Lf",h);
renderBitmapCharacher(-scale,scale - 3*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,h_str);
char scale_str[33];
sprintf(scale_str,"axis scale=%2.4Lf",scale*2);
renderBitmapCharacher(-scale,scale - 4*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,scale_str);
char K_str[33];
sprintf(K_str,"Kinetic Energy=%2.4Lf",KE);
renderBitmapCharacher(-scale,scale - 5*scale*0.06,0,(void *)GLUT_BITMAP_8_BY_13,K_str);

glTranslated(0,0,trans);
glRotated(rotate_y, 1, 0, 0);
glRotated(rotate_x, 0, 1, 0);
Render(0);
glPopMatrix();

/* Right view */
glViewport(0, height / 2 + 1, width / 2 + 1, height / 2);
glPushMatrix();
glRotated(90, 0, -1, 0);
Render(1);
glPopMatrix();

/* Face view */
glViewport(width / 2 + 1, height / 2 + 1, width / 2, height / 2);
glPushMatrix();
Render(2);
glPopMatrix();

/* Top view */
glViewport(width / 2 + 1, 0, width / 2, height / 2);
glPushMatrix();
glRotated(90, 1, 0, 0);
Render(3);
glPopMatrix();

/* Projection matrix is restored */
glMatrixMode(GL_PROJECTION);
glPopMatrix();
glMatrixMode(GL_MODELVIEW);

/* End */
glFlush();
glutSwapBuffers();
}


/* Run upon click */
void MouseFunc(int button, int state, int x, int y)
{
if (GLUT_LEFT_BUTTON == button)
left_click = state;
if (GLUT_RIGHT_BUTTON == button)
right_click = state;
xold = x;
yold = y;
}

/* Timed refresh to control refresh rate */
void TimerdisplayCB(int value){
if (paused == 0) glutPostRedisplay();
glutTimerFunc(33,TimerdisplayCB,value);
}


void displaySim(int *argcp, char **argv){
/* Creation of the window */
glutInit(argcp, argv);
glutInitDisplayMode(GLUT_RGBA);
glutInitWindowSize(glutGet(GLUT_SCREEN_WIDTH),glutGet(GLUT_SCREEN_HEIGHT));
glutCreateWindow("Particle Simulation");

/* OpenGL settings */
glClearColor(0, 0, 0, 0);
glEnable(GL_CULL_FACE);
glCullFace(GL_BACK);
glEnable(GL_BLEND);
glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);

/* Declaration of the callbacks */
glutDisplayFunc(DisplayFunc);
glutReshapeFunc(ReshapeFunc);
glutKeyboardFunc(keyCB);
glutMouseFunc(MouseFunc);
glutMotionFunc(MotionFunc);
glutTimerFunc(33,TimerdisplayCB,0);

/* Main display */
glutMainLoop();
}

Wednesday, May 26, 2010

[from Robert] cudak4 & cudak5 up and running

Both machines are running Fedora 12 along with the newest 256.25 Beta drivers from NVIDIA finally. The CUDA compiler is also running, and I have sent everyone an email on how to access c5 (for now). Try it out!

[from Anthony] some CUDA examples

There's a website I found which has a handful of CUDA codes such as adding matrics and solving Laplace's equation.

https://visualization.hpc.mil/wiki/GPGPU

I do find it useful for CUDA beginner's like myself since they explain what is happening at each step of their code. They also compare codes that are programmed for the CPU and GPU in terms of how they are written (in C++ btw) and how fast the CPU and GPU execute the program.

For instance, in the Laplace 2D solver they discussed that the smaller the grid (the smaller then NxN array) and fewer iterations the more likely that the CPU wins in terms of speed because the GPU puts more effort in setting up. Here are some of my results with different N and iterations...

n=10 iterations=1000 time_gpu=6586.000 time_cpu=1601.000

n=10 iterations=100 time_gpu=615.000 time_cpu=150.000

n=60 iterations=100 time_gpu=617.000 time_cpu=5980.000

I won't be posting their codes - it's pretty length-y. They also have a plot where they compare the CPU/GPU times of executing the program.


By the way, the devices I have in my macbook pro are...


Device 0: "GeForce 9600M GT"

CUDA Driver Version: 3.0

CUDA Runtime Version: 3.0

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 1

Total amount of global memory: 268107776 bytes

Number of multiprocessors: 4

Number of cores: 32

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.25 GHz

Concurrent copy and execution: Yes

Run time limit on kernels: Yes

Integrated: No

Support host page-locked memory mapping: No

Compute mode: Default (multiple host threads can use this device simultaneously)


Device 1: "GeForce 9400M"

CUDA Driver Version: 3.0

CUDA Runtime Version: 3.0

CUDA Capability Major revision number: 1

CUDA Capability Minor revision number: 1

Total amount of global memory: 266010624 bytes

Number of multiprocessors: 2

Number of cores: 16

Total amount of constant memory: 65536 bytes

Total amount of shared memory per block: 16384 bytes

Total number of registers available per block: 8192

Warp size: 32

Maximum number of threads per block: 512

Maximum sizes of each dimension of a block: 512 x 512 x 64

Maximum sizes of each dimension of a grid: 65535 x 65535 x 1

Maximum memory pitch: 2147483647 bytes

Texture alignment: 256 bytes

Clock rate: 1.10 GHz

Concurrent copy and execution: No

Run time limit on kernels: Yes

Integrated: Yes

Support host page-locked memory mapping: Yes

Compute mode: Default (multiple host threads can use this device simultaneously)



It's not as great in comparison tp what we have in our new machines where our group installed the new Fermi cards (I think three for each node or something?). So if I were to carry out some massive, intense computations and graphics it would be recommended to use our new machines since they have a better cooling system.

Just now I tried the code in Dr. Dobbs part 1 CUDA. It had no problems compiling with the nvcc command. It also suggested to add printf to track what is happening in the program. I will try that and see what kind of results I get.

For our CPU programs, most of us have written our N*(2-body) simulations and I have done that with both cartesian and polar coords. Still working on the visuals where I'm trying with OpenGL. Josh, you should post your openGL classes, codes, whatnot - it will save *all* of us a lot of time.

I already have mathGL in place but apparently it uses kuickshow as its image viewer and I seem to have difficulties getting kuickshow on my mac, so right now it can't output any image with mathGL. But it has no problems compiling so I'm pretty sure I installed it correclty.

As if now, I am (most of us as well....) modifying our codes to include radiation pressure as a source of physical interactions between particles and I'm still figuring out how to do this with our codes.


Anthony