Sharing Data and Models in Software Engineering

E-Book, EPUB, Adobe DRM
406 pages
English
Elsevier Science & Techn., published 22.12.2014
Data Science for Software Engineering: Sharing Data and Models presents guidance and procedures for reusing data and models between projects to produce results that are useful and relevant. Starting with a background section of practical lessons and warnings for beginning data scientists in software engineering, this edited volume proceeds to identify critical questions of contemporary software engineering related to data and models. Learn how to adapt data from other organizations to local problems, mine privatized data, prune spurious information, simplify complex results, update models for new platforms, and more. Chapters share widely applicable experimental results, discussed with a blend of practitioner-focused domain expertise and commentary that highlights the methods that are most useful and applicable to the widest range of projects. Each chapter is written by a prominent expert and offers a state-of-the-art solution to an identified problem facing data scientists in software engineering. Throughout, the editors share best practices collected from their experience training software engineering students and practitioners to master data science.

Shares the specific experience of leading researchers and the techniques they developed to handle data problems in the realm of software engineering
Explains how to start a data science project for software engineering, and how to identify and avoid likely pitfalls
Provides a wide range of useful qualitative and quantitative principles, ranging from the very simple to cutting-edge research
Addresses current challenges with software engineering data, such as the lack of local data, access issues due to data privacy, and increasing data quality by cleaning spurious chunks from the data


Tim Menzies is a professor in computer science (WVU) and a former software research chair at NASA. He has published 200+ refereed articles, many in the area of data mining and SE. His research includes artificial intelligence, data mining, and search-based software engineering. He is best known for his work on the PROMISE open source repository of data for reusable software engineering experiments.
Available formats
Book, Paperback: EUR 87.50
E-Book, EPUB, Adobe DRM: EUR 68.95

Product

Details
Additional ISBN/GTIN: 9780124173071
Product type: E-Book
Binding: E-Book
Format: EPUB
Format note: Adobe DRM
Year of publication: 2014
Publication date: 22.12.2014
Pages: 406
Language: English
File size: 18,318 KB
Article no.: 3173191
Categories
Genre: 9200

Contents/Reviews

Table of Contents
1;Front Cover;1
2;Sharing Data and Models in Software Engineering;4
3;Copyright;5
4;Why this book?;6
5;Foreword;8
6;Contents;10
7;List of Figures;20
8;Chapter 1: Introduction;30
8.1;1.1 Why Read This Book?;30
8.2;1.2 What Do We Mean by "Sharing"?;30
8.2.1;1.2.1 Sharing Insights;31
8.2.2;1.2.2 Sharing Models;31
8.2.3;1.2.3 Sharing Data;32
8.2.4;1.2.4 Sharing Analysis Methods;32
8.2.5;1.2.5 Types of Sharing;32
8.2.6;1.2.6 Challenges with Sharing;33
8.2.7;1.2.7 How to Share;34
8.3;1.3 What? (Our Executive Summary);36
8.3.1;1.3.1 An Overview;36
8.3.2;1.3.2 More Details;37
8.4;1.4 How to Read This Book;38
8.4.1;1.4.1 Data Analysis Patterns;39
8.5;1.5 But What About … (What Is Not in This Book);39
8.5.1;1.5.1 What About "Big Data"?;39
8.5.2;1.5.2 What About Related Work?;40
8.5.3;1.5.3 Why All the Defect Prediction and Effort Estimation?;40
8.6;1.6 Who? (About the Authors);41
8.7;1.7 Who Else? (Acknowledgments);42
9;Part I: Data Mining for Managers;44
9.1;Chapter 2: Rules for Managers;46
9.1.1;2.1 The Inductive Engineering Manifesto;46
9.1.2;2.2 More Rules;47
9.2;Chapter 3: Rule #1: Talk to the Users;48
9.2.1;3.1 Users Biases;48
9.2.2;3.2 Data Mining Biases;49
9.2.3;3.3 Can We Avoid Bias?;51
9.2.4;3.4 Managing Biases;51
9.2.5;3.5 Summary;52
9.3;Chapter 4: Rule #2: Know the Domain;54
9.3.1;4.1 Cautionary Tale #1: "Discovering" Random Noise;55
9.3.2;4.2 Cautionary Tale #2: Jumping at Shadows;56
9.3.3;4.3 Cautionary Tale #3: It Pays to Ask;56
9.3.4;4.4 Summary;57
9.4;Chapter 5: Rule #3: Suspect Your Data;58
9.4.1;5.1 Controlling Data Collection;58
9.4.2;5.2 Problems with Controlled Data Collection;58
9.4.3;5.3 Rinse (and Prune) Before Use;59
9.4.3.1;5.3.1 Row Pruning;59
9.4.3.2;5.3.2 Column Pruning;59
9.4.4;5.4 On the Value of Pruning;60
9.4.5;5.5 Summary;63
9.5;Chapter 6: Rule #4: Data Science Is Cyclic;64
9.5.1;6.1 The Knowledge Discovery Cycle;64
9.5.2;6.2 Evolving Cyclic Development;66
9.5.2.1;6.2.1 Scouting;66
9.5.2.2;6.2.2 Surveying;67
9.5.2.3;6.2.3 Building;67
9.5.2.4;6.2.4 Effort;67
9.5.3;6.3 Summary;67
10;Part II: Data Mining: A Technical Tutorial;68
10.1;Chapter 7: Data Mining and SE;70
10.1.1;7.1 Some Definitions;70
10.1.2;7.2 Some Application Areas;70
10.2;Chapter 8: Defect Prediction;72
10.2.1;8.1 Defect Detection Economics;72
10.2.2;8.2 Static Code Defect Prediction;74
10.2.2.1;8.2.1 Easy to Use;74
10.2.2.2;8.2.2 Widely Used;74
10.2.2.3;8.2.3 Useful;74
10.3;Chapter 9: Effort Estimation;76
10.3.1;9.1 The Estimation Problem;76
10.3.2;9.2 How to Make Estimates;77
10.3.2.1;9.2.1 Expert-Based Estimation;77
10.3.2.2;9.2.2 Model-Based Estimation;78
10.3.2.3;9.2.3 Hybrid Methods;79
10.4;Chapter 10: Data Mining (Under the Hood);80
10.4.1;10.1 Data Carving;80
10.4.2;10.2 About the Data;81
10.4.3;10.3 Cohen Pruning;82
10.4.4;10.4 Discretization;84
10.4.4.1;10.4.1 Other Discretization Methods;84
10.4.5;10.5 Column Pruning;85
10.4.6;10.6 Row Pruning;86
10.4.7;10.7 Cluster Pruning;87
10.4.7.1;10.7.1 Advantages of Prototypes;89
10.4.7.2;10.7.2 Advantages of Clustering;90
10.4.8;10.8 Contrast Pruning;91
10.4.9;10.9 Goal Pruning;93
10.4.10;10.10 Extensions for Continuous Classes;96
10.4.10.1;10.10.1 How RTs Work;96
10.4.10.2;10.10.2 Creating Splits for Categorical Input Features;97
10.4.10.3;10.10.3 Splits on Numeric Input Features;100
10.4.10.4;10.10.4 Termination Condition and Predictions;103
10.4.10.5;10.10.5 Potential Advantages of RTs for Software Effort Estimation;103
10.4.10.6;10.10.6 Predictions for Multiple Numeric Goals;104
11;Part III: Sharing Data;106
11.1;Chapter 11: Sharing Data: Challenges and Methods;108
11.1.1;11.1 Houston, We Have a Problem;108
11.1.2;11.2 Good News, Everyone;109
11.2;Chapter 12: Learning Contexts;112
11.2.1;12.1 Background;113
11.2.2;12.2 Manual Methods for Contextualization;113
11.2.3;12.3 Automatic Methods;116
11.2.4;12.4 Other Motivation to Find Contexts;117
11.2.4.1;12.4.1 Variance Reduction;117
11.2.4.2;12.4.2 Anomaly Detection;117
11.2.4.3;12.4.3 Certification Envelopes;118
11.2.4.4;12.4.4 Incremental Learning;118
11.2.4.5;12.4.5 Compression;118
11.2.4.6;12.4.6 Optimization;118
11.2.5;12.5 How to Find Local Regions;119
11.2.5.1;12.5.1 License;119
11.2.5.2;12.5.2 Installing CHUNK;119
11.2.5.3;12.5.3 Testing Your Installation;119
11.2.5.4;12.5.4 Applying CHUNK to Other Models;121
11.2.6;12.6 Inside CHUNK;122
11.2.6.1;12.6.1 Roadmap to Functions;122
11.2.6.2;12.6.2 Distance Calculations;122
11.2.6.2.1;12.6.2.1 Normalize;122
11.2.6.2.2;12.6.2.2 SquaredDifference;123
11.2.6.3;12.6.3 Dividing the Data;123
11.2.6.3.1;12.6.3.1 FastDiv;123
11.2.6.3.2;12.6.3.2 TwoDistantPoints;124
11.2.6.3.3;12.6.3.3 Settings;124
11.2.6.3.4;12.6.3.4 Chunk (main function);125
11.2.6.4;12.6.4 Support Utilities;125
11.2.6.4.1;12.6.4.1 Some standard tricks;125
11.2.6.4.2;12.6.4.2 Tree iterators;126
11.2.6.4.3;12.6.4.3 Pretty printing;127
11.2.7;12.7 Putting It all Together;127
11.2.7.1;12.7.1 _nasa93;127
11.2.8;12.8 Using CHUNK;128
11.2.9;12.9 Closing Remarks;129
11.3;Chapter 13: Cross-Company Learning: Handling the Data Drought;130
11.3.1;13.1 Motivation;131
11.3.2;13.2 Setting the Ground for Analyses;132
11.3.2.1;13.2.1 Wait … Is This Really CC Data?;134
11.3.2.2;13.2.2 Mining the Data;134
11.3.2.3;13.2.3 Magic Trick: NN Relevancy Filtering;135
11.3.3;13.3 Analysis #1: Can CC Data be Useful for an Organization?;136
11.3.3.1;13.3.1 Design;136
11.3.3.2;13.3.2 Results from Analysis #1;137
11.3.3.3;13.3.3 Checking the Analysis #1 Results;138
11.3.3.4;13.3.4 Discussion of Analysis #1;138
11.3.4;13.4 Analysis #2: How to Cleanup CC Data for Local Tuning?;140
11.3.4.1;13.4.1 Design;140
11.3.4.2;13.4.2 Results;140
11.3.4.3;13.4.3 Discussions;143
11.3.5;13.5 Analysis #3: How Much Local Data Does an Organization Need for a Local Model?;143
11.3.5.1;13.5.1 Design;143
11.3.5.2;13.5.2 Results from Analysis #3;144
11.3.5.3;13.5.3 Checking the Analysis #3 Results;145
11.3.5.4;13.5.4 Discussion of Analysis #3;145
11.3.6;13.6 How Trustworthy Are These Results?;146
11.3.7;13.7 Are These Useful in Practice or Just Number Crunching?;148
11.3.8;13.8 What's New on Cross-Learning?;149
11.3.8.1;13.8.1 Discussion;152
11.3.9;13.9 What's the Takeaway?;153
11.4;Chapter 14: Building Smarter Transfer Learners;154
11.4.1;14.1 What Is Actually the Problem?;155
11.4.2;14.2 What Do We Know So Far?;157
11.4.2.1;14.2.1 Transfer Learning;157
11.4.2.2;14.2.2 Transfer Learning and SE;157
11.4.2.3;14.2.3 Data Set Shift;159
11.4.3;14.3 An Example Technology: TEAK;160
11.4.4;14.4 The Details of the Experiments;164
11.4.4.1;14.4.1 Performance Comparison;164
11.4.4.2;14.4.2 Performance Measures;164
11.4.4.3;14.4.3 Retrieval Tendency;166
11.4.5;14.5 Results;166
11.4.5.1;14.5.1 Performance Comparison;166
11.4.5.2;14.5.2 Inspecting Selection Tendencies;171
11.4.6;14.6 Discussion;174
11.4.7;14.7 What Are the Takeaways?;175
11.5;Chapter 15: Sharing Less Data (Is a Good Thing);176
11.5.1;15.1 Can We Share Less Data?;177
11.5.2;15.2 Using Less Data;180
11.5.3;15.3 Why Share Less Data?;185
11.5.3.1;15.3.1 Less Data Is More Reliable;185
11.5.3.2;15.3.2 Less Data Is Faster to Discuss;185
11.5.3.3;15.3.3 Less Data Is Easier to Process;186
11.5.4;15.4 How to Find Less Data;187
11.5.4.1;15.4.1 Input;188
11.5.4.2;15.4.2 Comparisons to Other Learners;191
11.5.4.3;15.4.3 Reporting the Results;191
11.5.4.4;15.4.4 Discussion of Results;192
11.5.5;15.5 What's Next?;193
11.6;Chapter 16: How to Keep Your Data Private;194
11.6.1;16.1 Motivation;195
11.6.2;16.2 What Is PPDP and Why Is It Important?;195
11.6.3;16.3 What Is Considered a Breach of Privacy?;197
11.6.4;16.4 How to Avoid Privacy Breaches?;198
11.6.4.1;16.4.1 Generalization and Suppression;198
11.6.4.2;16.4.2 Anatomization and Permutation;200
11.6.4.3;16.4.3 Perturbation;200
11.6.4.4;16.4.4 Output Perturbation;200
11.6.5;16.5 How Are Privacy-Preserving Algorithms Evaluated?;201
11.6.5.1;16.5.1 Privacy Metrics;201
11.6.5.2;16.5.2 Modeling the Background Knowledge of an Attacker;202
11.6.6;16.6 Case Study: Privacy and Cross-Company Defect Prediction;203
11.6.6.1;16.6.1 Results and Contributions;206
11.6.6.2;16.6.2 Privacy and CCDP;206
11.6.6.3;16.6.3 CLIFF;207
11.6.6.4;16.6.4 MORPH;209
11.6.6.5;16.6.5 Example of CLIFF&MORPH;210
11.6.6.6;16.6.6 Evaluation Metrics;210
11.6.6.7;16.6.7 Evaluating Utility via Classification;210
11.6.6.8;16.6.8 Evaluating Privatization;213
11.6.6.8.1;16.6.8.1 Defining privacy;213
11.6.6.9;16.6.9 Experiments;214
11.6.6.9.1;16.6.9.1 Data;214
11.6.6.10;16.6.10 Design;214
11.6.6.11;16.6.11 Defect Predictors;214
11.6.6.12;16.6.12 Query Generator;215
11.6.6.13;16.6.13 Benchmark Privacy Algorithms;216
11.6.6.14;16.6.14 Experimental Evaluation;217
11.6.6.15;16.6.15 Discussion;223
11.6.6.16;16.6.16 Related Work: Privacy in SE;224
11.6.6.17;16.6.17 Summary;225
11.7;Chapter 17: Compensating for Missing Data;226
11.7.1;17.1 Background Notes on SEE and Instance Selection;228
11.7.1.1;17.1.1 Software Effort Estimation;228
11.7.1.2;17.1.2 Instance Selection in SEE;228
11.7.2;17.2 Data Sets and Performance Measures;229
11.7.2.1;17.2.1 Data Sets;229
11.7.2.2;17.2.2 Error Measures;232
11.7.3;17.3 Experimental Conditions;234
11.7.3.1;17.3.1 The Algorithms Adopted;234
11.7.3.2;17.3.2 Proposed Method: POP1;235
11.7.3.3;17.3.3 Experiments;237
11.7.4;17.4 Results;237
11.7.4.1;17.4.1 Results Without Instance Selection;237
11.7.4.2;17.4.2 Results with Instance Selection;239
11.7.5;17.5 Summary;240
11.8;Chapter 18: Active Learning: Learning More with Less;242
11.8.1;18.1 How Does the QUICK Algorithm Work?;244
11.8.1.1;18.1.1 Getting Rid of Similar Features: Synonym Pruning;244
11.8.1.2;18.1.2 Getting Rid of Dissimilar Instances: Outlier Pruning;245
11.8.2;18.2 Notes on Active Learning;246
11.8.3;18.3 The Application and Implementation Details of QUICK;247
11.8.3.1;18.3.1 Phase 1: Synonym Pruning;247
11.8.3.2;18.3.2 Phase 2: Outlier Removal and Estimation;248
11.8.3.3;18.3.3 Seeing QUICK in Action with a Toy Example;250
11.8.3.3.1;18.3.3.1 Phase 1: Synonym pruning;251
11.8.3.3.2;18.3.3.2 Phase 2: Outlier removal and estimation;252
11.8.4;18.4 How the Experiments Are Designed;254
11.8.5;18.5 Results;256
11.8.5.1;18.5.1 Performance;257
11.8.5.2;18.5.2 Reduction via Synonym and Outlier Pruning;257
11.8.5.3;18.5.3 Comparison of QUICK vs. CART;258
11.8.5.4;18.5.4 Detailed Look at the Statistical Analysis;259
11.8.5.5;18.5.5 Early Results on Defect Data Sets;259
11.8.6;18.6 Summary;263
12;Part IV: Sharing Models;264
12.1;Chapter 19: Sharing Models: Challenges and Methods;266
12.2;Chapter 20: Ensembles of Learning Machines;268
12.2.1;20.1 When and Why Ensembles Work;269
12.2.1.1;20.1.1 Intuition;270
12.2.1.2;20.1.2 Theoretical Foundation;270
12.2.2;20.2 Bootstrap Aggregating (Bagging);272
12.2.2.1;20.2.1 How Bagging Works;272
12.2.2.2;20.2.2 When and Why Bagging Works;273
12.2.2.3;20.2.3 Potential Advantages of Bagging for SEE;274
12.2.3;20.3 Regression Trees (RTs) for Bagging;275
12.2.4;20.4 Evaluation Framework;275
12.2.4.1;20.4.1 Choice of Data Sets and Preprocessing Techniques;276
12.2.4.1.1;20.4.1.1 PROMISE data;276
12.2.4.1.2;20.4.1.2 ISBSG data;278
12.2.4.2;20.4.2 Choice of Learning Machines;280
12.2.4.3;20.4.3 Choice of Evaluation Methods;282
12.2.4.4;20.4.4 Choice of Parameters;284
12.2.5;20.5 Evaluation of Bagging+RTs in SEE;284
12.2.5.1;20.5.1 Friedman Ranking;285
12.2.5.2;20.5.2 Approaches Most Often Ranked First or Second in Terms of MAE, MMRE and PRED(25);287
12.2.5.3;20.5.3 Magnitude of Performance Against the Best;289
12.2.5.4;20.5.4 Discussion;290
12.2.6;20.6 Further Understanding of Bagging+RTs in SEE;291
12.2.7;20.7 Summary;293
12.3;Chapter 21: How to Adapt Models in a Dynamic World;296
12.3.1;21.1 Cross-Company Data and Questions Tackled;297
12.3.2;21.2 Related Work;299
12.3.2.1;21.2.1 SEE Literature on Chronology and Changing Environments;299
12.3.2.1.1;21.2.1.1 Chronology;299
12.3.2.1.2;21.2.1.2 Changing environments;299
12.3.2.1.3;21.2.1.3 Chronology and changing environments;300
12.3.2.2;21.2.2 Machine Learning Literature on Online Learning in Changing Environments;300
12.3.3;21.3 Formulation of the Problem;302
12.3.4;21.4 Databases;303
12.3.4.1;21.4.1 ISBSG Databases;303
12.3.4.2;21.4.2 CocNasaCoc81;304
12.3.4.3;21.4.3 KitchenMax;305
12.3.5;21.5 Potential Benefit of CC Data;306
12.3.5.1;21.5.1 Experimental Setup;306
12.3.5.2;21.5.2 Analysis;307
12.3.5.2.1;21.5.2.1 Concept drift in SEE;307
12.3.5.2.2;21.5.2.2 Different sets representing different concepts;309
12.3.5.2.3;21.5.2.3 CocNasaCoc81 findings;309
12.3.6;21.6 Making Better Use of CC Data;309
12.3.7;21.7 Experimental Analysis;311
12.3.7.1;21.7.1 Experimental Setup;312
12.3.7.2;21.7.2 Analysis;313
12.3.7.2.1;21.7.2.1 Performance in comparison to random guess;313
12.3.7.2.2;21.7.2.2 Overall performance across time steps;313
12.3.7.2.3;21.7.2.3 Performance at each time step;316
12.3.8;21.8 Discussion and Implications;318
12.3.9;21.9 Summary;318
12.4;Chapter 22: Complexity: Using Assemblies of Multiple Models;320
12.4.1;22.1 Ensemble of Methods;322
12.4.2;22.2 Solo Methods and Multimethods;323
12.4.2.1;22.2.1 Multimethods;323
12.4.2.2;22.2.2 Ninety Solo Methods;323
12.4.2.2.1;22.2.2.1 Preprocessors;324
12.4.2.2.2;22.2.2.2 Predictors (learners);325
12.4.2.3;22.2.3 Experimental Conditions;326
12.4.3;22.3 Methodology;328
12.4.3.1;22.3.1 Focus on Superior Methods;328
12.4.3.2;22.3.2 Bringing Superior Solo Methods into Ensembles;329
12.4.4;22.4 Results;330
12.4.5;22.5 Summary;332
12.5;Chapter 23: The Importance of Goals in Model-Based Reasoning;334
12.5.1;23.1 Introduction;335
12.5.2;23.2 Value-Based Modeling;335
12.5.2.1;23.2.1 Biases and Models;335
12.5.2.2;23.2.2 The Problem with Exploring Values;335
12.5.2.2.1;23.2.2.1 Tuning instability;336
12.5.2.2.2;23.2.2.2 Value variability;337
12.5.2.2.3;23.2.2.3 Exploring instability and variability;339
12.5.3;23.3 Setting Up;339
12.5.3.1;23.3.1 Representing Value Propositions;339
12.5.3.1.1;23.3.1.1 Representing the space of options;341
12.5.4;23.4 Details;341
12.5.4.1;23.4.1 Project Options: P;342
12.5.4.2;23.4.2 Tuning Options: T;343
12.5.5;23.5 An Experiment;344
12.5.5.1;23.5.1 Case Studies: p ∈ P;344
12.5.5.1.1;23.5.1.1 Searching for rx;344
12.5.5.2;23.5.2 Search Methods;346
12.5.6;23.6 Inside the Models;347
12.5.7;23.7 Results;348
12.5.8;23.8 Discussion;349
12.6;Chapter 24: Using Goals in Model-Based Reasoning;350
12.6.1;24.1 Multilayer Perceptrons;353
12.6.2;24.2 Multiobjective Evolutionary Algorithms;355
12.6.3;24.3 HaD-MOEA;359
12.6.4;24.4 Using MOEAs for Creating SEE Models;360
12.6.4.1;24.4.1 Multiobjective Formulation of the Problem;361
12.6.4.2;24.4.2 SEE Models Generated;362
12.6.4.3;24.4.3 Representation and Variation Operators;362
12.6.4.4;24.4.4 Using the Solutions Produced by a MOEA;363
12.6.5;24.5 Experimental Setup;364
12.6.6;24.6 The Relationship Among Different Performance Measures;368
12.6.7;24.7 Ensembles Based on Concurrent Optimization of Performance Measures;371
12.6.8;24.8 Emphasizing Particular Performance Measures;375
12.6.9;24.9 Further Analysis of the Model Choice;377
12.6.10;24.10 Comparison Against Other Types of Models;377
12.6.11;24.11 Summary;382
12.7;Chapter 25: A Final Word;384
13;Bibliography;386
14;Index;408
Reading Sample

Chapter 1
Introduction



Before we begin: for the very impatient (or very busy) reader, we offer an executive summary in Section 1.3 and a statement on next directions in Chapter 25.


1.1 Why read this book?

NASA used to run a Metrics Data Program (MDP) to analyze data from software projects. In 2003, the research lead, Kenneth McGill, asked: What can you learn from all that data? McGill's challenge (and funding support) resulted in much work. The MDP is no more but its data was the seed for the PROMISE repository (Figure 1.1). At the time of this writing (2014), that repository is the focal point for many researchers exploring data science and software engineering. The authors of this book are long-time members of the PROMISE community.


Figure 1.1 The PROMISE repository of SE data: http://openscience.us/repo.


When a team has been working at something for a decade, it is fitting to ask, What do you know now that you did not know before? In short, we think that sharing needs to be studied much more, so this book is about sharing ideas and how data mining can help that sharing. As we shall see:

• Sharing can be very useful and insightful.

• But sharing ideas is not a simple matter.

The bad news is that, usually, ideas are shared very badly. The good news is that, based on much recent research, it is now possible to offer much guidance on how to use data miners to share.

This book offers that guidance. Because it is drawn from our experiences (and we are all software engineers), its case studies all come from that field (e.g., data mining for software defect prediction or software effort estimation). That said, the methods of this book are very general and should be applicable to many other domains.
1.2 What do we mean by "sharing"?

To understand sharing, we start with a story. Suppose two managers of different projects meet for lunch. They discuss books, movies, the weather, and the latest political/sporting results. After all that, their conversation turns to a shared problem: how to better manage their projects.

Why are our managers talking? They might be friends and this is just a casual meeting. On the other hand, they might be meeting in order to gain the benefit of the other's experience. If so, then their discussions will try to share their experience. But what might they share?
1.2.1 Sharing insights

Perhaps they wish to share their insights about management. For example, our diners might have just read Fred Brooks's book The Mythical Man-Month [59]. This book documents many aspects of software project management, including the famous Brooks' law, which says that adding staff to a late software project makes it later.

To share such insights about management, our managers might share war stories on (e.g.) how upper management tried to save late projects by throwing more staff at them. Shaking their heads ruefully, they remind each other that often the real problems are the early lifecycle decisions that crippled the original concept.
1.2.2 Sharing models

Perhaps they are reading the software engineering literature and want to share models about software development. Now models can mean different things to different people. For example, to some object-oriented design people, a model is some elaborate class diagram. But models can be smaller, much more focused statements. For example, our lunch buddies might have read Barry Boehm's Software Economics book. That book documents a power law of software stating that larger software projects take exponentially longer to complete than smaller projects [34].

Accordingly, they might discuss whether development effort for larger projects can be tamed with some well-designed information hiding.1

(Just as an aside, by model we mean any succinct description of a domain that someone wants to pass to someone else. For this book, our models are mostly quantitative equations or decision trees. Other models may be more qualitative, such as the rules of thumb that one manager might want to offer to another; but in the terminology of this chapter, we would call that more insight than model.)
1.2.3 Sharing data

Perhaps our managers know that general models often need tuning with local data. Hence, they might offer to share specific project data with each other. This data sharing is particularly useful if one team is using a technology that is new to them, but has long been used by the other. Also, such data sharing has become fashionable amongst data-driven decision makers such as Nate Silver [399] and the evidence-based software engineering community [217].
1.2.4 Sharing analysis methods

Finally, if our managers are very experienced, they know that it is not enough just to share data in order to share ideas. This data has to be summarized into actionable statements, which is the task of the data scientist. When two such scientists meet for lunch, they might spend some time discussing the tricks they use for different kinds of data mining problems. That is, they might share analysis methods for turning data into models.
1.2.5 Types of sharing

In summary, when two smart people talk, there are four things they can share. They might want to:

• share models;

• share data;

• share insight;

• share analysis methods for turning data into models.

This book is about sharing data and sharing models. We do not discuss sharing insight because, to date, it is not clear what can be said on that point. As to sharing analysis methods, that is a very active area of current research; so much so that it would be premature to write a book on that topic. However, for some state-of-the-art results in sharing analysis methods, the reader is referred to two recent articles by Tom Zimmermann and his colleagues at Microsoft Research. They discuss the very wide range of questions that are asked of data scientists [27, 64] (and many of those queries are about exploring data before any conclusions are made).
1.2.6 Challenges with sharing

It turns out that sharing data and models is not a simple matter. To illustrate that point, we review the limitations of the models learned from the first generation of analytics in software engineering.

As soon as people started programming, it became apparent that programming was an inherently buggy process. As recalled by Maurice Wilkes [443] speaking of his programming experiences from the early 1950s:



It was on one of my journeys between the EDSAC room and the punching equipment that hesitating at the angles of stairs the realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.



It took several decades to find the experience required to build a size/defect relationship. In 1971, Fumio Akiyama described the first known size law, saying the number of defects D was a function of the number of lines of code; specifically

D = 4.86 + 0.018 * loc
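
To make the size law concrete, here is a minimal sketch (not from the book) that evaluates Akiyama's equation in Python; the function name and the 1,000-line example are purely illustrative.

def akiyama_defects(loc: int) -> float:
    # Akiyama's 1971 size law: predicted defect count as a linear function of lines of code.
    return 4.86 + 0.018 * loc

# Hypothetical example: a 1,000-line module is predicted to contain about 23 defects.
print(akiyama_defects(1000))  # 4.86 + 0.018 * 1000 = 22.86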

Alas, nothing is as simple as that. Lessons come from experience and, as our experience grows, those lessons get refined/replaced. In 1976, McCabe [285] argued that the number of lines of code was less important than the complexity of that code. He proposed cyclomatic complexity, or v(g), as a measure of that complexity and offered the now (in)famous rule that a program is more likely to be defective if

v(g) > 10
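
As an illustration only, McCabe's rule can be read as a simple screening filter over measured complexities; in this sketch the module names and v(g) values are invented, and only the threshold of 10 comes from the text above.

# Flag modules whose cyclomatic complexity v(g) exceeds McCabe's threshold of 10.
modules = {"parser.c": 14, "util.c": 3, "report.c": 11}  # hypothetical v(g) measurements

risky = [name for name, vg in modules.items() if vg > 10]
print(risky)  # ['parser.c', 'report.c']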

At around the same time, other researchers were arguing that not only is programming an inherently buggy process, it is also inherently time-consuming. Based on data from 63 projects, Boehm [34] proposed in 1981 that linear increases in code size lead to exponential increases in development effort:

effort = a × KLOC^b × ∏_i (EM_i × F_i)    (1.1)


Here, a and b are parameters that need tuning for particular projects, and EM_i are effort multipliers that control the impact of some project factor F_i on the effort. For example, if F_i is analyst capability and it moves from very low to very high, then according to Boehm's 1981 model, EM_i moves from 1.46 to 0.71 (i.e., better analysts let you deliver more systems, sooner).
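
A minimal sketch of evaluating Equation (1.1), assuming the multiplicative COCOMO-81 form; the a = 2.4, b = 1.05 pair is a commonly cited COCOMO-81 setting that is not given in this excerpt, while the 1.46 and 0.71 analyst-capability multipliers are the ones quoted above.

from math import prod

def cocomo_effort(kloc: float, a: float, b: float, effort_multipliers: list[float]) -> float:
    # Equation (1.1): effort = a * KLOC^b * product of the effort multipliers EM_i.
    return a * kloc ** b * prod(effort_multipliers)

# Hypothetical 100 KLOC project; all multipliers except analyst capability left at 1.0.
low_capability = cocomo_effort(100, a=2.4, b=1.05, effort_multipliers=[1.46])
high_capability = cocomo_effort(100, a=2.4, b=1.05, effort_multipliers=[0.71])
print(round(low_capability), round(high_capability))  # better analysts -> much less predicted effort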

Forty years later, it is very clear that the above models are true only in certain narrow contexts. To see this, consider the variety of software built at the Microsoft campus, Redmond, USA. A bird flying over that campus would see dozens of five-story buildings. Each of those buildings has (say) five teams working on each floor. These 12 * 5 * 5 = 300 teams build...