A comprehensive empirical study on bug characteristics of deep learning frameworks
Abstract
Context:
Deep Learning (DL) frameworks enable developers to build DNN models without mastering the underlying algorithms and models. As DL-based software systems are increasingly deployed in safety-critical areas, such as self-driving cars and medical diagnostics, characterizing the bugs of DL frameworks, and thereby helping researchers design targeted quality assurance techniques, has become an urgent need.
Objective:
Our research aims to characterize bugs typical of DL frameworks at the source code level for an in-depth analysis of bug symptoms, root causes, and bug fixes. In this way, we hope to provide insights for researchers to design automatic quality assurance techniques, such as automatic repair techniques and fault location techniques, applicable to DL frameworks and DL-based software systems.
Method:
We started by summarizing the DL framework reference architecture and proposing the DL framework bug taxonomy. Then, we mined 1,127 DL framework bug reports from eight popular DL frameworks and labeled the bug types, root causes, and symptoms. Finally, we discussed the bug characteristics and explored how developers could possibly deal with these bugs.
Results:
Our main findings are: (i) DNN model building bugs and general type bugs accounted for one-third of the total defects. (ii) DNN model building bugs are more prone to algorithm logic constraints, internal API errors, and data/numerical errors. (iii) Fifteen bug-fixing patterns are summarized, providing reference for common DL framework bug repair and future research on the development of automatic DL framework bug detection tools.
Conclusion:
By analyzing the bug-fixing changes, we characterize the occurrences, root causes, symptoms, and fixing of these bugs. The study results have provided researchers with insights into how to ensure DL framework quality and presented actionable suggestions for DL framework developers to improve their code quality.
Introduction
The exponential advancement in deep learning (DL) techniques has driven the emergence of DL-based applications that offer commercial benefits and are increasingly used in life-critical fields, such as medical diagnosis [1], [2], autonomous vehicles [3], [4], [5], and air traffic control [6]. Amid such unprecedented progress, the reliability of DL frameworks is indispensable, especially for safety-critical systems. It is therefore vital to comprehensively study DL framework bugs: doing so will help DL frameworks achieve greater reliability and show developers how to exploit DL technologies to their full potential.
So far, DL system defect research has received extensive academic attention [7], [8], [9], [10], [11], [12], [13]. Initially, researchers used traditional software bug classification to study the defects of machine learning-based systems [7] and machine learning frameworks [8]. For instance, Zhang et al. [9] performed an empirical study of user coding errors in TensorFlow-based applications. Islam et al. [10], [11] followed Zhang's bug root causes and conducted a study on five popular deep learning frameworks. Jia et al. [13] particularly delved into the bugs inside TensorFlow. Humbatova et al. [12] proposed an extensive taxonomy of DL system defects by manually labeling open source community posts and interviewing industry practitioners.
These studies have greatly contributed to our knowledge about DL frameworks and their weaknesses; however, they generally analyze bugs from the perspective of DL framework users, focusing, e.g., on bugs caused by improper use of DL frameworks (Incorrect Model Parameter or Structure, Unaligned Tensor, API Misuse) [9], [10] and bugs related to the training process (Training Data Quality, Missing/Wrong Preprocessing, Hyperparameters) [12]. In real-world development, bugs in DL-based systems arise not only from users' improper operations but also from the reliability of the DL framework itself.
To more comprehensively understand what leads to defects in DL framework development, and to better deal with the various bugs generated therein, this paper takes a deep dive into the bugs inside DL frameworks at the source code level. We started by summarizing the reference architecture of DL frameworks by combining existing research [14], [15], [16], [17] and domain knowledge [18], [19], [20], [21], [22], [23], [24], [25], [26], [27]. We also proposed a taxonomy of DL framework bugs by referring to research on traditional software bugs. We then analyzed the defects at the source code level, explored the bug root causes and symptoms, and proposed feasible solutions to tackle them. Our endeavor seeks to facilitate researchers' and developers' understanding of the underlying bugs of DL frameworks, which in turn will enable them to build more reliable bug detectors.
Understanding DL framework characteristics is the first yet critical step towards improving the quality of DL frameworks and DL-based applications. In this regard, we come up with the following questions: RQ1: What are the common types of DL framework bugs? DNN software often differs vastly from general-purpose software in structure [5], and the development process of DNN components is fundamentally different from that of traditional software [28]. Hence, we need to figure out what types of bugs exist in DL frameworks and their quantity and distribution. RQ2: What are the symptoms and root causes of DL framework bugs? Only with a clear understanding of the symptoms and root causes of DL framework bugs can we develop effective solutions to fix them. RQ3: How can DL framework bugs be fixed? Answering this question will help DL framework developers fix bugs. To answer these research questions, we investigated 1,127 bug reports from eight popular DL frameworks. Specifically, we summarized the reference architecture of the DL framework and constructed a bug taxonomy. We labeled the type, root cause, and symptom of each bug, discussed the root causes and symptoms of different bug types, and answered when and how developers had dealt with DL framework bugs.
Overall, this paper makes the following contributions:
A large-scale empirical study of DL frameworks. Our study identifies DL framework bug types, root causes, and symptoms, and traces defect history information to analyze bug-fixing methods. We collected 44,803 pull requests from GitHub and filtered the data to obtain 7,036 bug reports. Employing stratified sampling, we manually labeled 1,127 bug reports. The labeling results were discussed and agreed upon by all authors. With such quantity and quality, this dataset can serve as a valuable reference for future research, whether on bug characteristics or DL framework bug fixing.
(RQ1) Reference architecture and bug taxonomy. We summarized the reference architecture of DL frameworks and, based on it, constructed a DL framework bug taxonomy. Different from the user-perspective bug types discussed in existing studies [7], [8], [9], [10], [11], [12], [13], our taxonomy classifies bug types into DL model building bugs and general bugs. This classification supports more targeted maintenance of DL frameworks and helps them realize their full potential. Also, through statistical analysis, we found that one-third of DL framework bugs do not fall into traditional software bug types, which suggests that more effort may be needed to analyze and resolve bugs specific to DL frameworks (Finding 1–2).
(RQ2) Bug symptoms and root causes. Referring to existing studies, we classified bug root causes into numerical root causes [29] and general programming root causes [30] and, on this foundation, summarized our findings. (1) Unlike bugs in DL-based software, those in DL frameworks rarely show symptoms of hang, ineffectiveness, or inefficiency, and are more likely to result in system crashes or errors; this indicates that bugs in DL frameworks should not be approached in the same way as those in DL systems and require further research (Finding 3). (2) Since DL model building bugs are rarely discussed in existing research, we fill this gap by thoroughly exploring the types and root causes of DL model building bugs. Specifically, we found that the common root causes of DL model building bugs are algorithm implementation errors, lack of necessary checks, and a few numerical errors. Algorithm errors may arise from developers' misunderstanding of deep learning theory, which shows that DL framework development sets a high bar for practitioners' expertise. While building DNN models, DL framework developers often fail to sufficiently check tensor properties such as types and shapes, leading to incorrect output or even crashes. The most common numerical errors are mathematical expression errors, followed by loss of precision, NaN errors, and overflow/underflow (Finding 4–7). (3) For computational graphs and computational platforms, the root causes of framework failures vary: beyond those mentioned above, they include logic errors, resource management errors, and environment configuration errors (Finding 8–9). (4) Wrong documentation and environment configuration errors are two common root causes of DL framework bugs. Because of the sheer volume of documentation and inadequate maintenance by open-source community administrators, wrong documentation is frequently encountered in DL framework development.
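To make the numerical root causes above concrete, the following sketch (a hypothetical illustration, not code from any of the studied frameworks) shows how a naive softmax operator overflows on large inputs and yields NaN, while the standard max-subtraction trick avoids it:

```python
import numpy as np

def naive_softmax(x):
    # Overflow/NaN root cause: exp(1000) -> inf, and inf / inf -> NaN.
    e = np.exp(x)
    return e / e.sum()

def stable_softmax(x):
    # Subtracting the max is mathematically a no-op for softmax,
    # but keeps every exponent <= 0, so np.exp never overflows.
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1000.0])
print(naive_softmax(x))   # [nan nan] (after an overflow warning)
print(stable_softmax(x))  # [0.5 0.5]
```

The buggy and fixed versions differ by a single term, which is typical of the mathematical expression errors described above.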
DL frameworks must remain highly compatible with hardware and third-party libraries. However, relatively young DL frameworks are often not adaptable enough to keep pace with rapidly changing hardware and third-party libraries, leading to frequent environment configuration errors.
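One way frameworks mitigate such environment configuration errors is to fail fast with a clear message instead of crashing deep inside native code. The sketch below is a hypothetical version guard; the dependency name and version bounds are invented for illustration and not taken from any real framework:

```python
# Hypothetical compatibility window for a third-party dependency;
# the bounds below are illustrative only.
MIN_SUPPORTED = (1, 20)
MAX_SUPPORTED = (3, 0)

def is_compatible(version_string):
    """Return True if a 'major.minor.patch' version falls in the window."""
    major, minor = (int(part) for part in version_string.split(".")[:2])
    return MIN_SUPPORTED <= (major, minor) < MAX_SUPPORTED

def require_compatible(name, version_string):
    # Raising at import time with an actionable message beats an
    # obscure crash once tensors start flowing through native kernels.
    if not is_compatible(version_string):
        raise ImportError(
            f"{name} {version_string} is unsupported; "
            f"need >= 1.20 and < 3.0"
        )

print(is_compatible("1.24.3"))  # True
print(is_compatible("1.19.5"))  # False
```

Tuple comparison gives a simple total order over (major, minor) pairs, so the window check stays a one-liner.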
(RQ3) Bug fixing methods and costs. By analyzing bug-fixing change codes, we proposed 15 bug-fixing patterns. We described the bug scenario, root cause, and code change for each pattern. These patterns can well inform DL framework developers about how to detect and fix the bugs more efficiently (Finding 10–11).
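One recurring pattern of this kind is "add a missing check": a tensor-handling routine that previously crashed with an obscure low-level error is patched to validate its inputs first. The before/after sketch below is our own hedged reconstruction of the pattern, not an actual patch from the studied frameworks:

```python
import numpy as np

def matmul_unchecked(a, b):
    # Pre-fix behavior: incompatible shapes surface as a low-level
    # error from deep inside the numerical kernel.
    return a @ b

def matmul_checked(a, b):
    # Post-fix behavior ("add missing check" pattern): validate
    # rank and shapes up front and fail with an actionable message.
    if a.ndim != 2 or b.ndim != 2:
        raise ValueError(
            f"expected 2-D tensors, got {a.ndim}-D and {b.ndim}-D"
        )
    if a.shape[1] != b.shape[0]:
        raise ValueError(
            f"shape mismatch: {a.shape} cannot be multiplied by {b.shape}"
        )
    return a @ b
```

The fix changes no numerical behavior on valid inputs; it only converts a crash into a diagnosable error.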
Paper Organization. The remainder of this paper is organized as follows. We describe the background and motivation in Section 2. We then present the methodology in Section 3, where we propose the bug taxonomy applied in this study and describe the data collection process. In Section 4, we present our research results, including bug types, root causes, and fixing methods, each with descriptions and examples. We discuss the findings in Section 5. After that, we review related work in Section 6 and threats to the validity of this study in Section 7. Finally, we conclude with future work in Section 8.
Section snippets
Traditional bug characteristics
Various empirical studies [30], [31], [32] have been conducted on the bugs that lead to traditional software incorrect functionality. These studies have successfully guided the design of traditional software testing, bug detection, and troubleshooting. Seaman et al. [30] collected 81 projects' bug reports and proposed a set of bug classification schemes, which have good compatibility and can be applied to general-purpose software systems and help people build models to guide future development
Methodology
DL model building bug
Table 6 shows the number and proportion of different DL modeling bugs within DL frameworks. NN-MI bugs, including model structure errors, errors in the model optimization mechanism, and model implementation errors, account for 14.2% of all DL modeling bugs, the highest proportion among the categories. Next are Math-OP bugs, taking up 5% of the bugs related to DL modeling. The percentages of CP bugs, CGSE bugs, and NN-OP bugs are comparable, respectively making up 4.3%, 4.0%
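As a hypothetical illustration of an NN-MI bug in the model optimization mechanism (not an actual defect from the dataset), consider a hand-written sigmoid gradient in which the (1 - s) factor was dropped; a standard finite-difference gradient check exposes the error immediately:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # Correct analytic derivative: s(x) * (1 - s(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

def sigmoid_grad_buggy(x):
    # Hypothetical NN-MI bug: the (1 - s) factor is missing.
    return sigmoid(x)

def numeric_grad(f, x, h=1e-6):
    # Central finite difference: the usual gradient-checking tool.
    return (f(x + h) - f(x - h)) / (2.0 * h)

ref = numeric_grad(sigmoid, 0.0)
print(abs(sigmoid_grad(0.0) - ref))        # tiny: implementation agrees
print(abs(sigmoid_grad_buggy(0.0) - ref))  # ~0.25: bug exposed
```

Because the model still trains (just poorly) with such a gradient, this class of bug rarely crashes and is easy to miss without an explicit check.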
Comparison with prior works
Previous studies [9], [10], [11], [12], [13] have investigated the root causes of accidents [38] in DL systems, i.e., unintended behaviors that occur in machine learning systems when users specify the wrong objective function or commit implementation errors related to machine learning. In contrast, this paper conducts an in-depth empirical study to characterize the defects inside DL frameworks from the source code level.
Reference architecture. Following the approach of Hassan et al. [14], [15],
Related work
Empirical study on bug characteristics. Thung et al. [7] researched bugs in machine learning systems (Apache Mahout, Lucene, and OpenNLP). They analyzed the categories of these bugs and their corresponding fixes and divided these bugs into eleven categories. Sun et al. [8] conducted an empirical study on machine learning bugs by collecting bug categories, fixing patterns, fixing scale, fixed duration, and type of software maintenance from the top 3 popular machine learning projects on GitHub.
Threats to validity
Internal validity. The validity of our research results is highly dependent on the quality of the dataset we collected. Therefore, we applied multiple filters to ensure the quality of the bug reports. First, we obtained the data from GitHub, treating officially merged pull requests as bug reports. Each report contains a bug description and a bug patch, which is important because it indicates that a user raised a bug and the corresponding patch was officially
Conclusion
In this paper, we conducted an empirical study on the bug characteristics of DL frameworks. We presented a comprehensive reference architecture of deep learning frameworks, containing four layers and nine components. We examined 44,803 merged pull requests from eight DL frameworks: TensorFlow, Keras, PyTorch, Caffe, Theano, MXNet, CNTK, and DeepLearning4J. From these, we identified 7,036 bug reports and manually examined 1,127 bugs. We divided DL model building bugs into five types: Computation Graph
CRediT authorship contribution statement
Yilin Yang: Conceptualization, Methodology, Software, Writing – original draft, Validation. Tianxing He: Investigation, Data curation, Formal analysis. Zhilong Xia: Investigation, Data curation. Yang Feng: Supervision, Project administration, Writing – review & editing.
Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Yang Feng reports financial support was provided by National Natural Science Foundation of China.
Acknowledgments
We would like to thank anonymous reviewers for their insightful and constructive comments. This project was partially funded by the National Natural Science Foundation of China under Grant Nos. 62002158, 61832009, and 61932012, and the Science, Technology and Innovation Commission of Shenzhen Municipality (No. CJGJZD20200617103001003).
References (93)
- et al., A systematic literature review on the barriers faced by newcomers to open source software projects, Inf. Softw. Technol. (2015)
- et al., The symptoms, causes, and repairs of bugs inside a deep learning library, J. Syst. Softw. (2021)
- et al., Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, JAMA (2016)
- et al., Artificial intelligence with deep learning technology looks into diabetic retinopathy screening, JAMA (2016)
- et al., End to end learning for self-driving cars (2016)
- et al., An empirical evaluation of deep learning on highway driving (2015)
- et al., DeepXplore: Automated whitebox testing of deep learning systems
- et al., Reluplex: An efficient SMT solver for verifying deep neural networks
- et al., An empirical study of bugs in machine learning systems
- et al., An empirical study on real bugs for machine learning programs
- Repairing deep neural networks: Fix patterns and challenges
- A reference architecture for web servers
- A reference architecture for web browsers
- nuts-flow/ml: Data pre-processing for deep learning (2017)
- Caffe: Convolutional architecture for fast feature embedding
- cuDNN: Efficient primitives for deep learning (2014)
- MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems (2015)
- CNTK: Microsoft's open-source deep-learning toolkit
- Theano: A Python framework for fast computation of mathematical expressions (2016)
- Deeplearning4j: Open-source, distributed deep learning for the JVM (2017)
- Keras: The Python deep learning library (2018)
- PyTorch: An imperative style, high-performance deep learning library, Adv. Neural Inf. Process. Syst. (2019)
- A comprehensive study of real-world numerical bug characteristics
- Defect categorization: Making use of a decade of widely varying historical data
- Understanding and detecting real-world performance bugs, ACM SIGPLAN Not. (2012)
- Bug characteristics in open source software, Empir. Softw. Eng. (2014)
- DeepMutation: Mutation testing of deep learning systems
- DeepTest: Automated testing of deep-neural-network-driven autonomous cars
- Basic concepts and taxonomy of dependable and secure computing, IEEE Trans. Dependable Secure Comput. (2004)
- Concrete problems in AI safety (2016)
- PaddlePaddle: An open-source deep learning platform from industrial practice, Front. Data Comput. (2019)
- An intermediate representation for optimizing machine learning pipelines, Proc. VLDB Endow. (2019)
- Building Machine Learning Pipelines (2020)
- Techniques for automated machine learning, ACM SIGKDD Explor. Newsl. (2021)
- Torch: A modular machine learning software library, Technical Report (2002)
- Transferable graph optimizers for ML compilers, Adv. Neural Inf. Process. Syst. (2020)
© 2022 Elsevier B.V. All rights reserved.
Source: https://www.sciencedirect.com/science/article/pii/S0950584922001306