Abstract: We replicated a controlled experiment first run in the early 1980's to evaluate the effectiveness and efficiency of 50 student subjects who used three defect-detection techniques to observe failures and isolate faults in small C programs. The three techniques were code reading by stepwise abstraction, functional (black-box) testing, and structural (white-box) testing. Two internal replications showed that our relatively inexperienced subjects were similarly effective at observing failures and isolating faults with all three techniques. However, our subjects were most efficient at both tasks when they used functional testing. Some significant differences among the techniques in their effectiveness at isolating faults of different types were seen. These results suggest that inexperienced subjects can apply a formal verification technique (code reading) as effectively as an execution-based validation technique, but they are most efficient when using functional testing.
Abstract: In this paper we present a concrete method for validating software product measures for internal attributes and provide guidelines for its application. This method integrates much of the relevant previous work, such as measurement theory, properties of measures, and GQM. We identify two types of validation: theoretical and empirical. The former addresses the question "is the measure measuring the attribute it is purporting to measure?", and the latter addresses the question "is the measure useful in the sense that it helps reach some corporate goal?"
Abstract: This paper presents the external replication of a controlled experiment which compared three defect detection techniques (Ad Hoc, Checklist, and Defect-based Scenario) for software requirements inspections, and evaluated the benefits of collection meetings after individual reviews. The results of our replication were partially different from those of the original experiment. Unlike the original experiment, we did not find any empirical evidence of better performance when using scenarios. To explain these negative findings we provide a list of hypotheses. On the other hand, the replication confirmed one result of the original experiment: the defect detection rate is not improved by the collection meetings. The external replication was made possible by the existence of an experimental kit provided by the original investigators. We discuss what difficulties we encountered in applying the package to our environment, having different cultures and skills. We also discovered some critical problems in the original experiment which can be considered threats to its internal validity. Using our results, experience and suggestions, other researchers will be able to improve the original experimental design before attempting further replications.
Abstract: In order to improve software maintenance processes, we first need to be able to characterize and assess them. These tasks must be performed in depth and with objectivity since the problems are complex. One approach is to set up a measurement-based software process improvement program specifically aimed at maintenance. However, establishing a measurement program requires that one understands the problems to be addressed by the measurement program and is able to characterize the maintenance environment and processes in order to collect suitable and cost-effective data. Also, enacting such a program and getting usable data sets takes time. A short term substitute is therefore needed.
We propose in this paper a characterization process aimed specifically at maintenance and based on a general qualitative analysis methodology. This process is rigorously defined in order to be repeatable and usable by people who are not acquainted with such analysis procedures. A basic feature of our approach is that actual implemented software changes are analyzed in order to understand the flaws in the maintenance process. Guidelines are provided and a case study is shown that demonstrates the usefulness of the approach.
Abstract: This paper describes an empirical comparison of several modeling techniques for predicting the quality of software components early in the software life cycle. Using software product measures, we built models that classify components as high-risk, i.e., likely to contain faults, or low-risk, i.e., likely to be free of faults.
The modeling techniques evaluated in this study include principal component analysis, discriminant analysis, logistic regression, logical classification models, layered neural networks, and holographic networks. These techniques provide a good coverage of the main problem-solving paradigms: statistical analysis, machine learning, and neural networks.
Using the results of independent testing, we determined the absolute worth of the predictive models and compared their performance in terms of misclassification errors, achieved quality, and verification cost. Data came from 27 software systems, developed and tested during three years of project-intensive academic courses. A surprising result is that no model was able to effectively discriminate between components with faults and components without faults.
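As an illustration of what "misclassification errors" can mean for a high-risk/low-risk classifier, the following minimal sketch counts the two kinds of error. The function name, the dictionary keys, and the sample labels are hypothetical and are not taken from the study; the abstract does not say how the authors defined or weighted these errors.

```python
def misclassification_summary(actual_faulty, predicted_high_risk):
    """Count both kinds of misclassification for a binary risk model.

    actual_faulty[i] is True if component i was found faulty in testing;
    predicted_high_risk[i] is True if the model flagged it as high-risk.
    """
    if len(actual_faulty) != len(predicted_high_risk) or not actual_faulty:
        raise ValueError("need two equal-length, non-empty sequences")
    pairs = list(zip(actual_faulty, predicted_high_risk))
    # Faulty components that the model classified as low-risk.
    missed = sum(1 for faulty, flagged in pairs if faulty and not flagged)
    # Fault-free components that the model classified as high-risk.
    false_alarms = sum(1 for faulty, flagged in pairs if flagged and not faulty)
    return {
        "missed_faulty": missed,
        "false_alarms": false_alarms,
        "error_rate": (missed + false_alarms) / len(pairs),
    }


# Hypothetical labels for six components (illustration only).
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
summary = misclassification_summary(actual, predicted)
```

Keeping the two error counts separate matters because they carry different costs: a missed faulty component affects achieved quality, while a false alarm adds verification cost.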
Abstract: Despite significant progress in the last 15 years, implementing a successful measurement program for software development is still a challenging undertaking. Most problems are not of theoretical but of methodological or practical nature. The Fraunhofer Institute for Experimental Software Engineering (FhG IESE) represents a unique European expertise acquired from a long series of measurement-based process improvement programs. These have been implemented over the years in many North American and European organizations. In this article, we present lessons learned from these experiences and we structure them into practical guidelines for efficient and useful software measurement aimed at process improvement in industry.
Abstract:
We consider reading techniques a fundamental means of achieving high quality software. Due to the lack of research in this area, we are experimenting with the application and comparison of various reading techniques. This paper deals with our experiences with a family of reading techniques known as Perspective-Based Reading (PBR), and its application to requirements documents. The goal of PBR is to provide operational scenarios where members of a review team read a document from a particular perspective, e.g., tester, developer, user. Our assumption is that the combination of different perspectives provides better coverage of the document, i.e., uncovers a wider range of defects, than the same number of readers using their usual technique.
To test the effectiveness of PBR, we conducted a controlled experiment with professional software developers from the National Aeronautics and Space Administration / Goddard Space Flight Center (NASA/GSFC) Software Engineering Laboratory (SEL). The subjects read two types of documents, one generic in nature and the other from the NASA domain, using two reading techniques, a PBR technique and their usual technique. The results from these experiments, as well as the experimental design, are presented and analyzed. Teams applying PBR are shown to achieve significantly better coverage of documents than teams that do not apply PBR.
We thoroughly discuss the threats to validity so that external replications can benefit from the lessons learned and improve the experimental design if the constraints are different from those posed by subjects borrowed from a development organization.
Abstract: Careful analysis of Software Engineering measurement data is essential in deriving the right conclusions from performed experiments. Different data analysis techniques may provide data analysts with different and complementary insights into the studied phenomena. In this paper, two data analysis techniques - Rough Sets and Logistic Regression - are compared, from both the theoretical and the experimental points of view. In particular, the empirical study was performed as part of the ESPRIT/ESSI project CEMP on a real-life maintenance project, the DATATRIEVE project carried out at Digital Engineering Italy. The two data analysis techniques are different in nature: Logistic Regression uses a statistical approach, while the Rough Sets analysis technique does not. We have applied both techniques to the same data set. The goal of the experimental study was to determine the major factors affecting reliability and reusability in the application context. Results obtained with either analysis technique are discussed and compared, to identify commonalities and differences between the two techniques. Finally, both analysis techniques are evaluated with respect to their weaknesses and strengths.
Abstract: This paper proposes a comprehensive suite of measures to quantify the level of class coupling during the design of object-oriented systems. This suite takes into account the different OO design mechanisms provided by the C++ language (e.g., friendship between classes, specialization, and aggregation) but it can be tailored to other OO languages. The different measures in our suite thus reflect different hypotheses about the different mechanisms of coupling in OO systems. Based on actual project defect data, the hypotheses underlying our coupling measures are empirically validated by analyzing their relationship with the probability of fault detection across classes. The results demonstrate that some of these coupling measures may be useful early quality indicators of the design of OO systems. These measures are conceptually different from the OO design measures defined by Chidamber and Kemerer; in addition, our data suggests that they are complementary quality indicators.
Abstract: Carrying out empirical studies is widely held to be of importance. A view less widely held is that experiments should be replicated externally to verify and validate the original results.
This paper serves two main functions. First, the need for external replications is established. The role of replication in experimental software engineering is discussed. Without the confirming power of external replications, results in experimental software engineering should only be provisionally accepted, if at all. An extension to the framework for experimentation in software engineering by Basili et al. [5] is proposed to differentiate between the various kinds of internal and external replication and their powers of confirmation and to allow a better appreciation of the context of a piece of empirical work. Second, this paper presents a concrete example of an external replication of an experiment which tested the benefits to maintenance of using modular code against non-modular (monolithic) code. The results of the original experiment by Korson [32, 33] showed that a modular program could be maintained significantly faster than an equivalent monolithic version of the same program under the condition that modularity has been used to implement information hiding which localizes changes required by a modification. The results of our replication, however, were strikingly different from those of the original and showed no significant difference between the average times taken to maintain modular and monolithic code. An inductive analysis was undertaken to investigate the reasons for this difference. Evidence was uncovered suggesting an ability effect (which was not observed in the original experiment), a lack of realism in the context in which subjects were asked to perform tasks, differing degrees of subjects' understanding of the programs, different approaches taken by subjects toward making the required modifications, and possible deficiencies of subject monitoring. Other sources of variability are also discussed.
It is concluded that external replications, combined with inductive analysis techniques, have an important if not vital part to play in the realization of generalizable results.
Abstract: This empirical research was undertaken as part of a multi-method programme of research to investigate unsupported claims made of object-oriented technology. A series of subject-based laboratory experiments, including an internal replication, tested the effect of inheritance depth on the maintainability of object-oriented software. Subjects were timed performing identical maintenance tasks on object-oriented software with a hierarchy of three levels of inheritance depth and equivalent object-based software with no inheritance. This was then replicated with more experienced subjects. In a second experiment of similar design, subjects were timed performing identical maintenance tasks on object-oriented software with a hierarchy of five levels of inheritance depth and the equivalent object-based software.
The collected data showed that subjects maintaining object-oriented software with three levels of inheritance depth performed the maintenance tasks significantly quicker than those maintaining equivalent object-based software with no inheritance. In contrast, subjects maintaining the object-oriented software with five levels of inheritance depth took longer, on average, than the subjects maintaining the equivalent object-based software (although statistical significance was not obtained). Subjects' source code solutions and debriefing questionnaires provided some evidence suggesting subjects began to experience difficulties with the deeper inheritance hierarchy.
It is not at all obvious that object-oriented software is going to be more maintainable in the long run. These findings are sufficiently important that attempts to verify the results should be made by independent researchers.
Abstract: Several important questions still need to be answered regarding the maintainability of object-oriented design documents. This paper focuses on the following issues: are object-oriented design documents easier to understand and modify than structured design documents? Do they need to comply with quality guidelines such as the ones provided by Coad and Yourdon? What is the impact of such quality standards on the understandability and modifiability of design documents? Answers can be based on informed opinion or empirical evidence. Since software technology investments are substantial and there exist contradictory opinions regarding design strategies, performing experimental studies on these topics is a relevant research activity.
This paper presents a controlled experiment performed with computer science students as subjects. Results strongly suggest that quality guidelines based on Coad and Yourdon principles have a beneficial effect on the maintainability of object-oriented design documents. However, there is no strong evidence regarding the alleged higher maintainability of object-oriented design documents over structured design documents. Furthermore, results suggest that object-oriented design documents are more sensitive to poor design practices, in part because their cognitive complexity becomes increasingly unmanageable. However, because our ability to generalise these results is limited, they should be considered as preliminary, i.e., it is very likely that they can only be generalised to programmers with little object-oriented training and programming experience. Such programmers can, however, be commonly found on maintenance projects. As well as additional research, external replications of this study are required to confirm the results and achieve confidence in these findings.
Introduction: In recent years a substantial number of organizations have gained experience in software process improvement (SPI). Furthermore, some researchers have studied such organizations by collecting and analyzing costs and benefits data on their SPI efforts. The objective of this report is to review and summarize the empirical evidence thus far on the costs and benefits of SPI. The intention is that this review would be utilized to support the business case for initiating and continuing SPI programs, to aid in the selection amongst the alternative improvement paradigms, to make more accurate estimates of the costs and benefits of such efforts, and to help set and manage the expectations of technical staff and management.
Abstract: Inspection of software development artifacts has become an integral part of software quality improvement. However, experience tells us that inspection effectiveness (i.e., its capability to find defects) and efficiency (i.e., its cost-effectiveness) vary significantly across organizations or, even more striking, from one inspection to another in a given organization. Thus, we investigate in this paper the variations across inspections at one of our customers' sites. By measuring and modelling their inspection processes, we identified some of the important factors that have an impact on effectiveness and efficiency and that can explain most of their variation. The models we developed show that exponential relationships exist between the inspected document size, the effort spent on preparation, and the resulting effectiveness and efficiency. Moreover, these models appear to make accurate predictions (a goodness of fit of R^2 = 0.68 and R^2 = 0.89 for effectiveness and efficiency, respectively). We show how they can be practically used for inspection resource planning, quality control, evaluation, and improvement.
Abstract: Productivity benchmarking allows software development projects and organizations to compare themselves to the market place in a given sector of industry. However, in practice benchmarking presents many difficulties such as identifying a meaningful basis of comparison. The European Space Agency (ESA) outsources many software projects. They have accumulated a large cost database from these projects. In this paper, we present a method for productivity benchmarking, as well as the productivity benchmarks we derived for one of our customers based on the ESA database. Furthermore, we provide usage scenarios for these models by describing how these models can be practically applied for benchmarking purposes. We developed alternative types of benchmarks using two different modelling techniques, namely least-squares regression and regression trees. The most accurate model, obtained using least-squares regression, explains 92% of the variation in project effort, i.e., R^2 = 0.92, corresponding to an average magnitude of relative error (MRE) of 0.34. Nevertheless, regression tree models are more intuitive and easier to apply for benchmarking purposes.
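The two accuracy figures quoted above can be illustrated with a minimal sketch, assuming the usual definitions: MRE for one project is |actual - predicted| / actual, and R^2 is one minus the ratio of residual to total sum of squares. The function names and the effort values are hypothetical, and the abstract does not say whether the authors computed R^2 on raw or transformed effort.

```python
def mean_mre(actual, predicted):
    """Average magnitude of relative error over a set of projects."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical effort values in person-months (illustration only).
actual_effort = [100.0, 200.0, 400.0]
predicted_effort = [110.0, 180.0, 400.0]
mre = mean_mre(actual_effort, predicted_effort)
r2 = r_squared(actual_effort, predicted_effort)
```

The two measures answer different questions, which is why a model can report a high R^2 alongside a sizeable MRE: R^2 is dominated by the largest projects, while MRE weights every project's relative error equally.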
Abstract: Counts of defects found during the various defect detection activities in software projects and their classification provide a basis for product quality evaluation and process improvement. However, since defect classifications are subjective, it is necessary to ensure that they are repeatable (i.e., that the classification is not dependent on the individual). In this paper we evaluate a commonly used defect classification scheme that has been applied in IBM's Orthogonal Defect Classification work, and in the SEI's Personal Software Process. The evaluation utilizes the Kappa statistic. We use defect data from code inspections conducted during a development project. Our results indicate that the classification scheme is in general repeatable. We further evaluate classes of defects to find out if confusion between some categories is more common, and suggest a potential improvement to the scheme.
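For readers unfamiliar with the Kappa statistic, the following is a minimal sketch of the two-rater (Cohen) form, which corrects observed agreement for the agreement expected by chance. Assuming this variant is an illustration choice; the abstract does not state which form of Kappa was used, and the defect-type labels below are hypothetical.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters classifying the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same class by both.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal class frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used a single identical class throughout
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical classifications of six defects by two inspectors.
inspector_1 = ["logic", "interface", "logic", "data", "logic", "data"]
inspector_2 = ["logic", "interface", "data", "data", "logic", "logic"]
kappa = cohen_kappa(inspector_1, inspector_2)
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, so Kappa is a stricter test of repeatability than the raw percentage of matching classifications.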
Abstract: The most common techniques for detecting defects in software artifacts are inspection and testing. Since both techniques are effort consuming, they are often presented as being counterparts or even rivals rather than as being complementary. Hence, few controlled empirical studies investigate the effects of inspection and testing on software quality when applied in sequence. This paper contributes a controlled experiment to shed light on this issue. Twenty subjects performed code inspection and then structural testing, using different coverage values as test criteria, on a C-code module. We adopted this sequence because it is recommended for use in industry.
The results of this experiment show that inspection significantly outperforms structural testing with respect to (cost-)effectiveness for defect detection. Furthermore, the experimental results indicate little evidence to support the hypothesis that structural testing detects defects of a particular class that were missed by inspection and vice versa. These findings lead us to the conclusion that inspection and structural testing do not complement each other well. In fact, prior inspection seems to hinder the (cost-)effectiveness of structural testing. Since inspection out-performs structural testing and since 39 percent (on average) of the defects were not detected at all, it might be more valuable to apply inspection together with other testing techniques, such as boundary value analysis, to achieve a better defect coverage.
We are aware that a single experiment does not provide conclusive evidence. Hence, we consider it only one step in the determination of the optimal mix of defect detection techniques. Additional research, as well as replication of this experiment, is required to make further progress in this direction.
Abstract: The empirical study described in this paper addresses software reading for construction: how application developers obtain an understanding of a software artifact for use in new system development. This study focuses on the processes developers would engage in when learning and using object-oriented frameworks. We analyzed 15 student software development projects using both qualitative and quantitative methods to gain insight into what processes occurred during framework usage. The contribution of the study is not to test predefined hypotheses but to generate well-supported hypotheses for further investigation. The main hypotheses we produce are that example-based techniques are well suited to use by beginning learners while hierarchy-based techniques are not because of a larger learning curve. Other more specific hypotheses are proposed and discussed.
Abstract: Software inspection is an effective method of defect detection. Recent research activity has considered the development of tool support to further increase the efficiency and effectiveness of inspection, resulting in a number of prototype tools being developed. However, no comprehensive evaluations of these tools have been carried out to determine their effectiveness in comparison with traditional paper-based inspection. This issue must be addressed if tool-supported inspection is to become an accepted alternative to, or even replace, paper-based inspection.
This paper describes a controlled experiment comparing the effectiveness of tool-supported software inspection with paper-based inspection, using a new prototype software inspection tool known as ASSIST (Asynchronous/Synchronous Software Inspection Support Tool). 43 students used ASSIST and paper-based inspection to inspect two C++ programs of approximately 150 lines. The subjects performed both individual inspection and a group collection meeting, representing a typical inspection process. It was found that subjects performed equally well with tool-based inspection as with paper-based, measured in terms of the number of defects found, the number of false positives reported, and meeting gains and losses.
Comments should be sent to Richard Upchurch (rupchurch@umassd.edu).
This document created: March 5, 1996 by RLU
Modified: May 7, 1997 by RLU