II. SUGGESTIONS FOR DEBUGGING FOR SERIAL EXECUTION OF SCIENTIFIC PROGRAMS

The following section provides suggestions for debugging programs written for scientific applications, although many of the suggestions also apply to other types of applications. This section deals only with debugging programs that run serially, not in parallel.

A program can appear to have no bugs with some sets of data because not all paths through the program are executed. To debug your program, it is important to test it with a variety of data sets so that, as far as possible, every path in the program is exercised.

Now assume that your program compiles and produces an executable, but execution either does not complete or completes with wrong answers. Carefully going through the following steps will help you find many of the commonly occurring bugs. All compiler options mentioned in this section are valid for the Fortran/77, Fortran/90, C and C++ compilers unless indicated otherwise.

STEP 1. Using lint

If your program is written in C, use the lint utility, which helps identify problems with your code at the compile step. If your C program is in the file prog.c, invoke lint with:

    lint prog.c

The output from this invocation is directed to your screen. There is a public domain version of lint for Fortran/77 called ftnchek that can be obtained from the /pub directory at the anonymous ftp site ftp.dsm.fordham.edu at Fordham University.

STEP 2. Check for out-of-bounds array accesses

A common programming error is the use of array indices outside their declared limits. For help finding these errors in your Fortran/90 program, compile as follows (currently there is no such support for the C/C++ compilers):

    f90 -g -DEBUG:subscript_check:trap_uninitialized:conform_check prog.f90

Then run the generated executable by itself or under dbx or cvd.
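
As an illustration of the kind of error this step is meant to catch, the following is a minimal sketch of a Fortran/90 program with an out-of-bounds access. The program name oob_demo and its variables are hypothetical and chosen only for this example. If it were saved as prog.f90 and compiled with the options shown above, the bounds check would be expected to flag the access when i reaches n + 1.

```fortran
! Hypothetical test program; all names are illustrative only.
program oob_demo
  implicit none
  integer, parameter :: n = 5
  real :: a(n)
  integer :: i

  ! Bug: the loop runs one element past the declared upper bound of a,
  ! so the final iteration stores into a(n+1).
  do i = 1, n + 1
     a(i) = real(i)
  end do

  print *, 'sum = ', sum(a)
end program oob_demo
```

Without bounds checking, a program like this may appear to run normally, because the extra store can silently overwrite neighboring memory.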
The "subscript_check" option enables bounds checking. The "trap_uninitialized" option causes the program to detect when the value of a variable is used before it has been set. The "conform_check" option enables conformance checking of array operands in array expressions. (See the DEBUG_group man page for more information on this option.) To run under cvd, start cvd on the executable and click on the RUN button.

Compiling with the -g option causes the compiler to generate symbolic debugging information so your program will execute under cvd. It also disables optimization. Sometimes disabling optimization will cause the bug to disappear. If this happens, you should still go through each of these steps as carefully as you can.

(a) If you are running a C or C++ program, your program will stop at the first occurrence of an array index going out of bounds. You can now examine the value of the index which caused the problem using any of the methods described in section I part C.

(b) If you are using the Fortran/90 compiler, after compiling with the above options and starting the generated executable under cvd, enter the following in the cvd pane:

    cvd> stop in __f90_bounds_check

Now click on the RUN button. Next click on VIEWS, select Call Stack, and double click on the function/subroutine immediately below __f90_bounds_check. This will cause the source code for this function/subroutine to be displayed in the Main View Window and the line where cvd has stopped will be highlighted. You can now find the value of the index which caused the out-of-bounds problem.

(c) If you are using the Fortran/77 compiler, after compiling with the above options and starting the generated executable under cvd, enter the following in the cvd pane:

    cvd> stop in s_rnge

Now click on the RUN button. Next click on VIEWS, select Call Stack, and double click on the function/subroutine immediately below s_rnge.
This will cause the source code for this function/subroutine to be displayed in the Main View Window and the line where cvd has stopped will be highlighted. You can now find the value of the index which caused the out-of-bounds problem.

Note: For Fortran programs, bounds checking cannot be done in subprograms if arrays passed to a subprogram are declared with extents of "1" or "*" instead of passing in their sizes and using this information in their declarations. An example of how the declarations should be written to allow for bounds checking is:

    SUBROUTINE SUB(A,LDA,N, ...)
    INTEGER LDA,N
    REAL A(LDA,N)

STEP 3. Check for uninitialized variables being used in calculations

To find uninitialized REAL variables being used in floating point calculations, compile your program with

    -g -DEBUG:trap_uninitialized=ON

This forces all uninitialized stack, automatic and dynamically allocated variables to be initialized with 0xFFFA5A5A. When this value is used as an operand in a floating point calculation, it is treated as a floating point NaN and causes a floating point trap. When it is used as a pointer or as an address, a segmentation violation may occur. For example, if x and y are real variables and the program is compiled as above,

    x = y

will not be detected when y is uninitialized, since no floating point calculation is being done. However, the following will be detected:

    x = y + 1.0

After compiling your program with the above options, start cvd on the executable and click the RUN button. To find out where your program has stopped, click on VIEWS and select Call Stack, where you will see that many system routines have been called. Double click on the highest routine in the call stack that is clearly not a system routine. This will bring up the source code for this routine and the line where the first uninitialized variable (subject to the above-mentioned conditions) was used.
You can now examine the values of the variables which caused the problem using any of the methods described in section I part C. At the present time, it is not possible to use cvd to detect the use of uninitialized INTEGER variables.

STEP 4. Finding Divisions by Zero and Overflows

A. To find floating point divisions by zero and overflows, first enter

    setenv TRAP_FPE ON

if you are using the csh or tcsh shell. For other shells, see their man pages. Next compile your program with -g and link with -lfpe:

    -g -lfpe

Then start cvd on the executable and, in the cvd command/message pane, enter

    cvd> stop in __catch

Click on the RUN button, select Call Stack from VIEWS, and then double click on the highest routine that is not a system routine. The line where execution stopped will now be highlighted in the source code display area of the cvd Main View window. You may now use any of the methods in section I part C to find variable values to discover why the divide by zero or overflow occurred. For more information on handling floating point exceptions, see the man page for handle_sigfpes.

B. To find integer divisions by zero, compile your program with

    -g -DEBUG:div_check=1

and start cvd on the executable. Click the RUN button and the program will automatically stop at the first line where an integer divide by zero occurred. You may now use any of the methods of section I part C to find variable values to discover why the divide by zero occurred.

STEP 5. A core file is produced

Sometimes during program execution a core file is produced and the program does not complete execution. This file is placed in your working directory with the file name 'core'. You can find the place in your program where execution stopped and the core file was produced by entering

    cvd executable core

where executable is the name of the executable that you were running. The cvd Main View window will come up and the source line where execution stopped may be highlighted in green.
If it is not highlighted in green, then select Call Stack under VIEWS and double-click on the highest routine that is not a system routine. This will bring up the source code for this routine and the last line executed will be highlighted in green. If the executable was compiled with the -g option, you can view the values of program variables at the point where execution stopped. You can find the assembly instruction where execution stopped by clicking on VIEWS and selecting Disassembly View.

Remember that this is the last statement executed before the core file was produced, and hence the bug in your program is not necessarily in this line of code. For example, a program variable may have been initialized incorrectly, but the core file was not produced until the variable was used later in the program.

Some machines are configured not to produce a core file. To find out if this is the case on the machine you are using, enter

    limit

If the limit on coredumpsize is zero, no core file will be produced. If the limit on coredumpsize is not large enough to hold the program's memory image, the core file produced will not be usable. To change the configuration to allow useful core files to be produced, enter

    unlimit coredumpsize

STEP 6. Incorrect answers are being produced

Assume that the above steps have been taken and that all problems they can detect have been corrected. This means that your program completes execution but obtains incorrect answers. What you do at this point will likely depend on special circumstances. The following is a list of some commonly used debugging procedures that may or may not apply to your situation.

1. Try running your program on a very small problem size where you can easily obtain intermediate results. Run your program under cvd on this small problem and compare with the known correct results.

2.
If you know that a certain answer being calculated is not correct, set breakpoints in your program so you can monitor the value of the answer at various points in your program.

3. You may want to set breakpoints on each call to a selected function/subroutine where you suspect there may be problems; see section I part C.

4. Debugging COMMON blocks and EQUIVALENCE statements in Fortran. Variables used in these statements must have exactly the same type and dimension everywhere they appear, and they must occur in exactly the same order. Normally ftnchek will find these errors in Fortran/77 programs. However, for Fortran/77 programs it is best to use an include statement for each COMMON block. For Fortran/90 programs, it is best to use a module for each COMMON block. It is best not to use EQUIVALENCE statements.

5. Local data not saved. In Fortran, values of local variables are not guaranteed to be saved from one execution of the subprogram to the next unless they are either initialized in their declarations or declared with the SAVE attribute. Some compilers/machines automatically give all local variables the SAVE attribute, so moving a working program from such a compiler/machine to one that does not do this may introduce this kind of bug. You should give local variables the SAVE attribute if you would like their values saved.
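
The point in item 5 can be illustrated with a minimal Fortran/90 sketch. The names save_demo, count_calls and ncalls are hypothetical and chosen only for this example. The counter is given the SAVE attribute explicitly and is initialized in its declaration, so its value is kept from one call to the next.

```fortran
! Hypothetical example: a call counter kept in a local variable.
program save_demo
  implicit none
  integer :: k

  do k = 1, 3
     call count_calls()
  end do
end program save_demo

subroutine count_calls()
  implicit none
  ! SAVE keeps the value of ncalls between calls; the initialization
  ! in the declaration is applied once, not on every call.
  integer, save :: ncalls = 0

  ncalls = ncalls + 1
  print *, 'count_calls has been called', ncalls, 'time(s)'
end subroutine count_calls
```

If the declaration were written as "integer :: ncalls" with no SAVE and no initialization, the value of ncalls on entry would be undefined, and the program might count correctly on one compiler/machine and fail on another.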