Problems of Empirical Software Engineering Studies

Empirical studies are nowadays mainstream approaches in software engineering research. This can be witnessed in the Google Scholar journal/conference ranking for the “Software Systems” subcategory. According to the ranking obtained on Jan 10, 2020, the journal “Empirical Software Engineering” ranks higher (7th) than “IEEE Transactions on Software Engineering” (8th), which has traditionally been considered the best journal in this field. Furthermore, many presentations at other top conferences (ICSE, FSE, and ASE) are empirical in nature, and an emerging conference, MSR (“Mining Software Repositories”), is aimed precisely at empirical studies of software repositories, which are now widely available thanks to GitHub and other alternatives.

So, what is wrong with this? Surely, science is a study of an empirical nature, isn’t it? “Induction” from empirical data is a primary method of science, isn’t it? However, the idea that science is based on mere “induction” is a misconception.

Suppose we have data which suggest that such and such code metrics correlate with the number of bugs. But is this correlation a fundamental/universal law? The correlation may depend on a particular programming language, programming paradigm, methodologies in vogue for only a handful of years, programming communities, application domains, etc. Of course, we can empirically study the factors which affect the correlation we have just discovered. However, whatever the results of further empirical studies are, we still have further doubts about the universality of these studies. This shows that mere “induction” from data cannot establish any useful law.

Usually, most (empirical, but not only) software engineering studies pose research questions based on a hypothesis. These research questions are then (in most cases) investigated empirically. For example, if someone thinks the length of function names affects the number of bugs, data from, say, a bug repository of Java programs are studied empirically, and the hypothesis is tested by statistical means.

Unfortunately, a hypothesis considered in software engineering research is often “one-shot” in nature and does not form a systematic theory. Being a one-shot hypothesis, we cannot be sure about its scope, as stated above. Examining Java code repositories alone may not be sufficient, but there is no way to tell, because there is no background theory. Even when a systematic theory is presented, it is not like the theories of other fields of engineering, in which rigorous mathematical formulations are employed. Therefore, rigorous verification or falsification of a proposed theory is almost impossible.

Is there no way to improve this situation? I don’t think so, because software has rich background theories. Software (development) has two aspects. The first aspect is that it is about computer programs. We have mature and rich mathematical theories of programs and their computation. The second is that software development is all about human beings. Therefore, we can employ psychology, sociology, economics, and other social sciences and humanities to study software development.

An interesting (so I think) possibility: code mutation is widely used as a “proxy” for bugs introduced by human programmers. It would be interesting to investigate the effects of code mutation on program behavior theoretically, so that we no longer rely on empirical data to understand the results of code mutation. Then, the remaining task is to empirically test the hypothesis that code mutation behaves similarly to human-introduced bugs.
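For readers unfamiliar with code mutation, here is a toy sketch of the idea. The function names and the specific mutation operator are invented for illustration: a “mutant” is produced by a small syntactic change (here, flipping a comparison operator) that mimics a typical human slip.

```python
# Toy illustration of code mutation; names and data are hypothetical.

def max_of(a, b):
    """Original program."""
    return a if a >= b else b

def max_of_mutant(a, b):
    """Mutant: the operator ">=" has been mutated to "<="."""
    return a if a <= b else b

# A test suite "kills" the mutant if some test input distinguishes
# it from the original program.
tests = [(1, 2), (5, 3), (4, 4)]
killed = any(max_of(a, b) != max_of_mutant(a, b) for a, b in tests)
print("mutant killed:", killed)
```

A theory of how such mutations change program behavior would let us reason about mutants directly; the empirical question that remains is whether mutants killed (or missed) in this way resemble real, human-introduced bugs.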