
For programming assignments it is best to make a script (log) file showing
your compilation, executions and results:
script
do your compilation and runs
exit
Everything you did under "script" is then logged in a file named typescript.
You may edit it lightly to tidy it up.
Assignment 1. Due: 02/04/22 by email.
Read Chapters 1, 2, and 5 (parts covered in class) of the Pacheco text.
Cache profiling using Valgrind ( http://www.valgrind.org , https://www.valgrind.org/docs/manual/cg-manual.html )
for matrix-vector multiplication.
Ex. 5.12, pp. 265-266, Pacheco text:
answer the questions for k = 8.

On thor, Valgrind is in /usr/bin/valgrind (but you could download Valgrind and do this part on your system).
Use the job scheduler. When using torque on thor, allow for the necessary memory
in your torque script.
In your torque script, run valgrind as, e.g.:
valgrind --tool=cachegrind ./omp_mat_vect_rand_split 1 8000 8000
You can examine your omp output file, or the file cachegrind.out.pid generated in your directory (where pid is the process ID of the run), using:
cg_annotate cachegrind.out.pid > outputcachegrind.out.pid
 Without cache profiling, determine the speedup S = T1/Tp and the efficiency E = S/p for the case k = 8 (matrix sizes 8 x 8000000,
8000 x 8000, 8000000 x 8), using 1, 2, 4, 8, and 16 threads. Give the results in a table or graph.
 Extra credit: Repeat for matrix x matrix multiplication of (m x n) times (n x q) matrices,
with sizes, e.g., m, n, q = k, k, 1000000 k; m, n, q = 100 k, 100 k, 100 k;
m, n, q = 1000000 k, k, k.
Note: think about which loop to parallelize.
Assignment 2: Choose one of 2a, 2b, 2c.
Due: 03/21/22 by email.
Assignment 2a.
Submit by email and hard copy.
Read Kumar et al. paper and Chapter 3 (MPI) of the Pacheco text (parts covered in class).
Compare ARR, GRR, RP and NN load balancing.
Start with all the work (W) in one process.
Use consecutive load balancing steps until the work is nearly uniformly distributed over the processes.
For NN, specify the neighbors of each process in its list of neighbors.
Try various configurations.
Compare the number of steps needed for each of the LB methods.
Do this for a varying number of processes and different sizes of the work load (W).
Submit your results, e.g., graphs that give the number of steps needed for
each of the methods for a varying number of processes p and for different work load sizes W. Also submit a report that describes your implementation,
your approach for NN, and termination detection, and that discusses your observations and the performance.
Assignment 2b. Component labeling problem.
Submit by email and hard copy.
Quinn p. 337, #13.7
 Write a sequential program that solves the problem for (a) 4-connected, (b) 8-connected pixels.
 Generate two images at random, with 0s and 1s as in the problem description, of about the size
of a page, so that you can print it on a page. Run your program with these images as input and
print the results.
 Parallelize your program using MPI or OpenMP.
 Test your program for multiple processes/threads,
and print the outputs for your images.
 Explain your sequential and parallel algorithms, and comment on the parallel performance.
Assignment 2c. Adaptive Integration with load balancing.
Submit by email and hard copy.
Read Kumar et al. paper and Chapter 3 (MPI) of the Pacheco text (parts covered in class).
Integration.pdf
Assignment 3. Due: final exam week. Submit by email and hard copy.
CUDA: matrix addition
Implement matrix addition C = A + B in CUDA, where the matrices are N x N and N is large. This is an extension of the program for adding
two very long vectors.
In your main program assign (float) values to the elements of A and B: a[i][j] = 2*i + j + 1 and b[i][j] = i + 4*j + 2.
Call your kernel. Then check if all elements of C are correct; if they are correct, print "We did it!".
Also execute the matrix addition sequentially, and time this (nested loop) with gettimeofday(). Compare the time to the
execution time of the kernel and calculate the speedup. Time the cudaMemcpy calls separately.
Do this for 3 (large to very large) values of N.
Submit a typescript showing: a listing (with "cat") of your source code, your compilation, and executions with output.
Discuss your findings in your report.
Extra Credit: Implement matrix multiplication.

