Program Comprehension


Reading Skills

The importance of reading skills is grossly underestimated, or simply ignored, in computer science education. Early computer science courses tend to focus on activities we typically consider constructive (the result is a tangible product). The explicit goal is to have students producing programs written in a high-level language as quickly as possible. The justification for this is fairly straightforward: students must demonstrate their competence in language constructs, and program writing provides the opportunity. Hence the focus is on language acquisition and on the use of the language in solving a variety of problems. Skill at code reading is, at best, an incidental outcome learned during the course of these activities.

It may be difficult to pose a counterargument, since we know all programming books contain source code as examples, and students are expected to "read" these in order to understand an algorithm or data structure. Furthermore, students are expected to learn new languages, and the language constructs are typically presented as code fragments (showing how a looping or selection statement works). Students read code fragments to understand the behavior of a construct. The focus, and hence the student's attention, is on the language construct, not on understanding how such an assembly of statements solves a particular piece of the problem. This instructional regime, in combination with the constructive approach mentioned in the previous paragraph, leaves program comprehension an accidental outcome of computer science education. Systematic or thoughtful approaches to reading technical material such as this are not explicit in the educational program. Any claim that students learn to read these products is probably exaggerated. As Deimel and Naveda (Deimel90) write, "the available evidence suggests that our current neglect of the topic cannot be justified by the argument that adequate program reading skills develop naturally and without special encouragement in students otherwise well prepared to enter professional practice" (p. 6).

The lack of attention to program comprehension is somewhat puzzling. Code reading, program understanding, and program comprehension appear to be essential for learning computer science and programming, as well as for the practical work of the practitioner. Certainly, understanding an algorithm requires considerable effort. If the practitioner or student cannot fully comprehend the material presented, then they are doomed to invent their own strategy. Most would agree that this is a less than optimal situation.

The world of software development abounds with opportunities to engage coded passages in an attempt to understand the material or the intent of the writer. Reading competence is necessary for any activity in which the reader must gain a sufficient understanding of an artifact to accomplish a task. To accomplish the task, the individual must first gain a substantial understanding of the existing program through interacting with the code and the program documentation, if it exists. The software maintainer must comprehend a given program sufficiently well to plan, design, and implement modifications (to extend, adapt, or correct an artifact without undue harm to its original integrity or structure). Whether or not there is adequate supporting documentation, the programmer is bound inextricably to the code to determine how the current program performs its tasks. It would appear from this that one's ability to "make sense" of a legacy artifact is an economic issue for the industry, making it a formidable teaching and training issue for those who encounter the novice practitioner. The problems of program understanding for the maintenance programmer are magnified by unstructured or poorly structured legacy code. These problems are being tackled by researchers in reverse engineering (Müller94, Storey96), who are attempting to build tools to help in the comprehension activity.

Program comprehension is an important activity for other industrial practices too. There are numerous activities during development in which an individual is required to take a foreign artifact and gain enough from reading it to accomplish a task. Reviews, walk-throughs, and inspections require individuals not acquainted with particular artifacts to become sufficiently versed in their operation to identify problems and defects, or simply to conduct an intelligent conversation about them. The reader attempts to understand the product sufficiently to identify defects. Though these techniques are usually considered an effective mechanism for defect removal, Rifkin and Deimel (Rifkin94) hypothesized that they were sometimes unsuccessful because the reviewers were unable to deal with the work product. Interestingly, organizations understand the importance of thorough training in the various aspects of inspections, but do not provide any guidance on looking for defects or on program understanding. What Rifkin and Deimel were able to demonstrate was that reading and understanding are "teachable/learnable" and that training programs to help individuals learn these skills are beneficial in the inspection/review task.

Aside from the importance of comprehension for industrial practice, improvement in code reading should support learning in computer science and software engineering classes. Students are typically given coded examples illustrative of a particular point. The effectiveness of the example is dependent on the ease with which the appropriate information can be extracted by the learner. A learner without strategies may not gain what was expected from the example or may do so only through the expenditure of undue effort. By improving the ability of students to interact with code-based material, instructional practices in these courses may be improved.

Program comprehension through reading appears to be an essential skill for software developers. As such, it seems reasonable to devote more explicit attention to helping students develop that skill during their studies. Skilled reading behavior would be built by engaging students early and often in guided, deliberate instructional activities requiring the acquisition of information from code or code fragments. For our purposes, reading skills refer to those activities one engages in to make "sense" of software-related products (e.g., code, specifications, designs, test cases). The activity is guided and deliberate in that principles from the research on program comprehension, software maintenance, and reverse engineering are integrated into the learning activities.

Reading Strategies

Programs are not read like novels. The flow of control in a program depends on the nature of the input and the structure of the program. Thus, programs have more of a hypertext flavor for the reader. As the reader encounters an item, a decision must be made whether to transfer attention to that region of the program or to continue reading the current passage. Yet, as with a novel, the programmer must uncover the plot(s) involved in the program and the roles various constructs play in the plot.

Contrary to our intuition, program understanding is a constructive process. The reader builds a model of how the program works through interacting with the artifact. Knowledge of the program is acquired from the code, comments, and associated documentation. The reader's existing knowledge structures are particularly important in providing specific information for the comprehension task. Exactly what knowledge is required to completely understand programs is still an open question. Brooks suggests that comprehension requires a problem domain model, a model for the problem, generic design templates, and programming-language-specific characteristics. Letovsky suggests a reader has a complete understanding of a program when she understands the goals of the program, how the goals and subgoals are achieved through program characteristics, and the role of the various program components in achieving those goals. The reader has a great effect on how successfully a particular program is read. Comprehension models suggest that the reader's knowledge of programming, of the programming language, and of the application domain, as well as his reading strategy, are important variables.

Top-down

One begins by gaining an understanding of the overall goals or purpose of the program. Each component of the program is then viewed from the perspective of how it contributes to carrying out that purpose.

The strategy employed in this activity is hypothesis formation and evaluation (Brooks83). For each component (procedure, function, loop, fragment) the reader conjectures its role, typically as a mental act, then attempts to confirm or reject that hypothesis through more thorough examination. Rejection forces the reformulation of the hypothesis and a continuation of the process. This continues until the reader is satisfied with her understanding of the program.
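
A minimal sketch of hypothesis formation and evaluation, written in Python. The fragment `f`, the two candidate hypotheses, and the probe inputs are all invented for illustration; none of them comes from Brooks83. Each hypothesis is stated as a small executable model of what the reader believes the fragment does, and any probe on which the model and the fragment disagree is grounds for rejecting it.

```python
# Hypothetical fragment under study; its name is deliberately uninformative.
def f(values):
    result = []
    seen = set()
    for v in values:
        if v not in seen:
            seen.add(v)
            result.append(v)
    return result


def evaluate(hypothesis, probes):
    """Return True if the hypothesis agrees with f on every probe input."""
    return all(f(list(p)) == hypothesis(list(p)) for p in probes)


# Probe inputs chosen so that the competing hypotheses give different answers.
probes = [[3, 1, 3, 2, 1], [], [5, 5, 5]]


# Hypothesis 1: "f sorts its input."
def sorts_input(xs):
    return sorted(xs)


# Hypothesis 2: "f removes duplicates, keeping first occurrences in order."
def removes_duplicates_keeps_order(xs):
    return list(dict.fromkeys(xs))


rejected_first = not evaluate(sorts_input, probes)
kept_second = evaluate(removes_duplicates_keeps_order, probes)
```

Here the first hypothesis is rejected and the reader reformulates it as the second. Agreement on every probe is evidence rather than proof: the second hypothesis stands only until a closer reading of the code confirms or refutes it.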

Bottom-up

A bottom-up strategy requires progress in the opposite direction. The reader begins with the details, understanding small fragments. These small fragments are composed into larger aggregates whose purpose is constructed from their component parts. This process continues until the program function is discovered. The bottom-up approach is sometimes referred to as stepwise abstraction (Linger79). During code reading, the reader looks at critical subroutines in the program and determines their function. Once the function is determined, that function, as a behavior, can be used to describe the block of code (abstraction). The reader works through the program hierarchy in this manner, assembling abstractions to describe higher-level components until the function of the program is determined. This is a bottom-up strategy requiring the understanding of code, and requiring the reader to map the code to the suggested problem domain activity.
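
As a rough illustration of stepwise abstraction, the Python sketch below annotates a small routine from the inside out. The routine `mystery`, the abstraction labels A1 to A3, and the restatement `mean_of_present` are hypothetical and are not taken from Linger79. Each comment replaces the lines above it with a one-sentence description of their effect, and the final abstraction is then written out as code so that it can be compared with the original.

```python
# Hypothetical routine being read bottom-up.
def mystery(readings):
    total = 0.0
    count = 0
    for r in readings:
        if r is not None:
            total += r
            count += 1
    # A1 (loop): total is the sum of the readings that are present,
    #            count is how many readings are present.
    if count == 0:
        return None
    return total / count
    # A2 (tail): return total / count, or None when nothing is present.
    # A3 (whole routine, from A1 and A2): the mean of the readings that
    #            are present, or None if there are none.


# The reader's abstraction A3, restated directly as code.
def mean_of_present(readings):
    present = [r for r in readings if r is not None]
    return sum(present) / len(present) if present else None


# Compare the abstraction with the original on a few hand-picked inputs.
samples = [[1, None, 2], [None, None], [], [2.0, 4.0]]
agrees = all(mystery(s) == mean_of_present(s) for s in samples)
```

Agreement on the samples supports the abstraction but does not establish it; the argument that A1 and A2 really describe the code still has to be made by reading.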

The literature (vonMayrhauser95) indicates that programmers use some combination of the two strategies. Furthermore, the approach to reading and comprehension seems to be influenced by the nature of the task required of the individual (e.g., review, maintenance, testing). This raises important concerns in light of the variety of settings in which code reading is useful, and may suggest the need for increased attention across the task areas. It also clearly indicates the need for increased attention to the nature of the code reading activity in the educational setting.

Reading Guidelines

Researchers (Deimel90) at the Software Engineering Institute, Carnegie Mellon University, offer a general set of guidelines as a starting point:

  1. Be aware that code and comments (or other documentation) may not agree. The code may be correct and the comments wrong, or the reverse. Both may be wrong. (The code may not accomplish what it is supposed to do, and the comments may describe neither what should be done nor what is done.)

  2. Use indentation to help understand structure. However, incorrect indentation (more likely to come from a human than a compiler or prettyprinter) may be misleading.

  3. Try to build a model of the style conventions used in the program. If, for example, a consistent scheme has been used for identifiers, this knowledge can be used to help understand the meaning of newly encountered identifiers. It is important to read the program with the programmer's conventions, rather than your own, in mind.

  4. Arbitrary or stylistic differences may simply indicate programmer inconsistency, but they may also signal modified code (a maintainer with different habits has modified the code) or code whose functions are not as analogous as they might at first appear.

  5. Consider the possibility that the programmer did not know what he was doing.

  6. Watch out for code written to overcome compiler or computer limitations or code containing apparently magic numbers.

  7. Watch out for use of nonstandard language features. (Some compilers, for example, initialize variables that other compilers do not.)

  8. Use stepwise abstraction.

  9. Odd-looking arithmetic operations may be required to maintain accuracy. Consider possible roundoff implications.

  10. Because changes to the code often introduce errors and inconsistencies, look for evidence of changes. Look for change logs or comments about changes imbedded in comments. Stylistic differences can indicate changes by a programmer other than the author. If multiple versions of a program exist, using a tool to find changes (e.g., UNIX diff) can be helpful.

  11. If, after a good deal of study, a piece of code is making no sense, ask another programmer to look at it. Consider explaining to him what you think you do know.

  12. Search for information, particularly in documentation, that relates objects in different knowledge domains, for example, comments that associate variables with problem-domain objects.

  13. Be wary of objects that have the same identifier but different scopes. Reasoning about the wrong objects can be frustrating.

  14. Be wary of objects having nearly the same names, particularly those whose identifiers differ by a single character.

  15. Particular code may be an artifact that no longer serves a function.

  16. Be sure you make no inessential assumptions when reasoning about concurrent programs.

  17. Be alert for variables that serve more than one function or that are used inconsistently, as they can mislead the reader.

  18. The effect of apparent bugs in the program can be undone by an inverse bug somewhere else.

  19. Use symbolic execution to determine function.

  20. Use code substitution to verify or refine hypotheses. Substitute code for what you think is being performed into the program, and examine how your code differs from what code is actually there.

  21. Tracing code with test data, whether by hand or using a symbolic debugger, will not by itself tell you what function the code performs. However, it can help suggest some hypotheses and eliminate others.

  22. Be alert for literals that are conceptually distinct but that happen to have the same values. (The trouble usually begins when one tries to modify such code.)

  23. In languages that permit operator overloading, be sure the operator you think you have is really the one you do have.

  24. Be willing to abandon hypotheses for which there is insufficient evidence.

  25. Use an editor or browser to traverse the code. Editors that support multiple windows can show several parts of the program at once.

  26. File search tools such as UNIX grep can be used to find identifiers that may be in one of several files.

  27. In the absence of tools like a cross-reference generator, such unlikely tools as spelling checkers can be useful for listing the identifiers used in the program.

  28. Traditional debugging techniques can be used to read code. The addition of print statements, for example, can be useful in verifying hypotheses.

  29. Read programs with a cross-reference listing, structure chart, or similar summaries of program information close at hand. It is sometimes useful to generate such charts by hand if they cannot be obtained automatically.
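
Several of the guidelines above (14, 27, and 29) depend on having a list of the identifiers used in a program. Where no cross-reference generator is at hand, one can be improvised. The sketch below assumes the program being read is Python source and uses the standard library `tokenize` module. The function names and the sample text are hypothetical, and the near-duplicate check is a simplification that covers only names of equal length differing in exactly one position.

```python
import io
import keyword
import tokenize
from collections import defaultdict
from itertools import combinations


def cross_reference(source):
    """Map each identifier in Python source text to the lines it appears on."""
    xref = defaultdict(set)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # Keep names only; skip reserved words such as `for` and `in`.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            xref[tok.string].add(tok.start[0])
    return {name: sorted(lines) for name, lines in sorted(xref.items())}


def differ_by_one_char(a, b):
    """True if a and b have equal length and differ in exactly one position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) == 1


def near_duplicates(names):
    """List pairs of identifiers that are easy to mistake for one another."""
    return [(a, b) for a, b in combinations(sorted(names), 2)
            if differ_by_one_char(a, b)]


# Hypothetical program text: `total1` and `totall` differ by one character.
SAMPLE = '''\
total1 = 0
totall = 0
for item in items:
    total1 += item
print(totall)
'''

listing = cross_reference(SAMPLE)
suspects = near_duplicates(listing)
```

For the sample text, `listing` records that `total1` appears on lines 1 and 4 and `totall` on lines 2 and 5, and `suspects` contains the single pair `('total1', 'totall')`. A reader would then check whether the two names were meant to be the same variable.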


NOTES:

Reverse Engineering
The process of recovering the initial design of a software system by reconstructing it from the existing source code.

Modified: June 1, 1997