Cleaning and preparing data takes up a substantial portion of the time and effort in a data science project, and in many cases the majority of it. It can be tempting to shortcut this process and dive right into modeling without looking very hard at the data set first, especially when you have a lot of data. Resist the temptation. No data set is perfect: some values will be missing, some will be misinterpreted, and some will simply be incorrect. Some data fields will be dirty and inconsistent. If you don’t take the time to examine the data before you start to model, you may find yourself redoing your work repeatedly as you discover bad data fields or variables that need to be transformed. In the worst case, you’ll build a model that returns incorrect predictions, and you won’t be sure why. By addressing data issues early, you can save yourself unnecessary work and a lot of headaches.
In this paper, we’ll demonstrate some of the things that can go wrong with data and explore ways to address those issues in the R statistical language (https://cran.r-project.org/) before moving on to analysis. To take advantage of faster numerical libraries, we will use the Microsoft R Open distribution (https://mran.microsoft.com/open/). Throughout this discussion, we will keep an idealized goal in mind: using machine learning to build a predictive model.