SUGGESTIONS FOR DEBUGGING FOR SERIAL (NON-PARALLEL) EXECUTION OF SCIENTIFIC PROGRAMS

This section provides suggestions for debugging programs written for scientific applications, although many of the suggestions also apply to other types of applications. It deals only with programs that run serially, not in parallel.

A program can appear to have no bugs with some sets of data because not all paths through the program are executed. When debugging, test your program with a variety of data sets so that, as far as possible, every path through the program is tested for errors.

Now assume that your program compiles and produces an executable, but execution either does not complete or completes with wrong answers. Carefully going through the following steps will help you find many of the commonly occurring bugs. All compiler options mentioned in this section are valid for the Fortran/77, Fortran/90, C and C++ compilers unless indicated otherwise.

STEP 1. Using lint

If your program is written in C, use the lint utility, which helps identify problems with your code at the compile step. If your C program is in the file prog.c, invoke lint with:

    lint prog.c

The output from this invocation is written to your screen. There is a public domain version of lint for Fortran/77 called ftnchek that can be obtained from the /pub directory at the anonymous ftp site ftp.dsm.fordham.edu at Fordham University.

STEP 2. Check for out-of-bounds array accesses

A common programming error is the use of array indices outside their declared limits. For help finding these errors, compile your program as follows:

    f90 -g -C prog.f

or

    CC -g -C prog.c

Then run the generated executable under ddd. The -C option enables bounds checking.
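As an illustration of the kind of error Step 2 targets, here is a minimal C sketch. The helper names index_in_bounds and sum_first are hypothetical and belong to no compiler or library. The explicit check only stands in for what a bounds-checking option does automatically; how the -C option reports a violation on your system is not shown here.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if i is a valid index for an array
   of n elements, 0 otherwise. */
int index_in_bounds(long i, size_t n)
{
    return i >= 0 && (size_t)i < n;
}

/* Sums a[0] .. a[count-1] without stepping outside the declared
   extent n. Returns 0 on success, -1 if an index would be out of
   bounds (in which case *result is left unchanged). */
int sum_first(const double *a, size_t n, long count, double *result)
{
    double s = 0.0;
    long i;

    for (i = 0; i < count; i++) {
        if (!index_in_bounds(i, n))
            return -1;          /* an off-by-one loop limit ends up here */
        s += a[i];
    }
    *result = s;
    return 0;
}
```

With a three-element array, sum_first(a, 3, 4, &r) returns -1 instead of silently reading a[3], which is the access that a loop written with <= in place of < would make.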
Using the ddd debugger for non-parallel programs

To run the program a.out under the control of the ddd visual debugger, issue

    ddd a.out

When the window comes up, you can click on run, or you can type run in the command pane at the bottom of the window. You will need to use the command pane if the program takes arguments or if you normally redirect an input file, e.g.

    run arg1 arg2 < input

The program should stop at the first statement that steps outside a known array bound. (In some cases the bound is not known, e.g. when * is used as the last dimension of an array, or when arrays are declared with incorrect bounds.) Once the program has stopped at this statement, you can type w to see the surrounding 10 statements, and where to see the list of calls that have been made. You can also type print j to see the current value of the variable j in the program.

You can run to a certain statement by inserting a breakpoint: click to the left of the statement at which you wish to stop, and a stop symbol should appear. After that, you can click on continue to run to the next breakpoint or until the program stops naturally, or you can use the NEXT and STEP buttons to move a single statement in the current routine or a single statement in the program, respectively.

Note: STEP stops at the next program statement, even if that statement is inside a subroutine or function; NEXT does not stop inside called routines. STEP and NEXT are also slow compared with the breakpoint and CONTINUE technique, so get close with breakpoints and then move in with STEP or NEXT.

Any time the program is stopped, you can examine the value of any variable that is valid in the current routine. See man ddd for more information, and man DB for information on the underlying non-parallel debugger.
Note: For Fortran programs, bounds checking cannot be done in a subprogram if the arrays passed to it are declared with extents of "1" or "*" instead of passing in their sizes and using those sizes in the declarations. An example of declarations written to allow bounds checking is:

    SUBROUTINE SUB(A,LDA,N, ...)
    INTEGER LDA,N
    REAL A(LDA,N)

Debugging from a core file

Sometimes during program execution a core file is produced and the program does not complete execution. This file is placed in your working directory with the file name 'core'. You can find the place in your program where execution stopped and the core file was produced by entering

    ddd <executable> core

where <executable> is the executable that you were running. The offending statement should be highlighted. If the executable was built by compiling with the -g option, you can view the values of program variables at the point where execution stopped. Remember that this is the last statement executed before the core file was produced, so the bug is not necessarily in this line of code. For example, a program variable may have been initialized incorrectly, but the core file was not produced until the variable was used later in the program.

Some machines are configured not to produce a core file. To find out if this is the case on the machine you are using, enter

    limit

If the limit on coredumpsize is zero, no core file will be produced. If the limit on coredumpsize is not large enough to hold the program's memory image, the core file produced will not be usable. To change the configuration so that useful core files can be produced, enter

    unlimit coredumpsize

Debugging when the program completes execution, but produces incorrect answers

Assume that the above steps have been taken and that all problems they can detect have been corrected. This means that your program completes execution, but obtains incorrect answers.
What you do at this point will likely depend on special circumstances. The following is a list of some commonly used debugging procedures that may or may not apply to your situation.

1. Try running your program on a very small problem size where you can easily obtain intermediate results. Run your program under the debugger on this small problem and compare with the known correct results.

2. If you know that a certain answer being calculated is not correct, set breakpoints in your program so you can monitor the value of the answer at various points in your program.

3. You may want to set breakpoints on each call to a selected function/subroutine where you suspect there may be problems (see above).

4. Debugging COMMON blocks and EQUIVALENCE statements in Fortran. Variables used in these statements must have exactly the same type and dimension everywhere they appear, and they must occur in exactly the same order. Normally ftnchek will find these errors in Fortran/77 programs. However, for Fortran/77 programs it is best to use an include statement for each COMMON block. For Fortran/90 programs, it is best to use a module for each COMMON block. It is best not to use EQUIVALENCE statements.

5. Local data not saved. In Fortran, values of local variables are not guaranteed to be saved from one execution of the subprogram to the next unless they are either initialized in their declarations or declared with the SAVE attribute. Some compilers/machines automatically give all local variables the SAVE attribute, so moving a working program from such a compiler/machine to one that does not do this may introduce this kind of bug. Give local variables the SAVE attribute if you need their values saved.
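
Item 1 in the list above can be sketched in C as follows. The routine partial_sums is a hypothetical stand-in for your own computation, and the reference values are assumed to be results you have verified by hand. The point is only to compare intermediate values, not just the final answer, on a problem small enough to check.

```c
#include <stddef.h>

/* Hypothetical computation under test:
   out[k] = 1 + 2 + ... + (k+1) for k = 0 .. n-1. */
void partial_sums(double *out, size_t n)
{
    double s = 0.0;
    size_t k;

    for (k = 0; k < n; k++) {
        s += (double)(k + 1);
        out[k] = s;
    }
}

/* Returns the index of the first intermediate value that differs
   from the reference by more than tol, or -1 if all n values agree. */
long first_mismatch(const double *got, const double *ref,
                    size_t n, double tol)
{
    size_t k;

    for (k = 0; k < n; k++) {
        double d = got[k] - ref[k];
        if (d < 0.0)
            d = -d;             /* absolute difference */
        if (d > tol)
            return (long)k;
    }
    return -1;
}
```

The index returned by first_mismatch shows where the computed values first depart from the known results, which is a natural place to set a breakpoint as described in item 2.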