Data Discovery
What is Data Cleaning?
When using data, most people agree that your insights and analysis are only as good as the data you are using. Essentially, garbage data in is garbage analysis out. Data cleaning, also referred to as data cleansing and data scrubbing, is one of the most important steps for your organization if you want to create a culture around quality data decision-making.
What is data cleaning?
Data cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. When combining multiple data sources, there are many opportunities for data to be duplicated or mislabeled. If data is incorrect, outcomes and algorithms are unreliable, even though they may look correct. There is no one absolute way to prescribe the exact steps in the data cleaning process because the processes will vary from dataset to dataset. But it is crucial to establish a template for your data cleaning process so you know you are doing it the right way every time.
What is the difference between data cleaning and data transformation?
Data cleaning is the process that removes data that does not belong in your dataset. Data transformation is the process of converting data from one format or structure into another. Transformation processes are also referred to as data wrangling or data munging: transforming and mapping data from one "raw" form into another format for warehousing and analysis. This article focuses on the processes of cleaning that data.
How do you clean data?
While the techniques used for data cleaning may vary according to the types of data your company stores, you can follow these basic steps to map out a framework for your organization.
Step 1: Remove duplicate or irrelevant observations
Remove unwanted observations from your dataset, including duplicate observations or irrelevant observations. Duplicate observations will happen most often during data collection. When you combine data sets from multiple places, scrape data, or receive data from clients or multiple departments, there are opportunities to create duplicate data. De-duplication is one of the largest areas to be considered in this process. Irrelevant observations are those that do not fit into the specific problem you are trying to analyze. For example, if you want to analyze data regarding millennial customers, but your dataset includes older generations, you might remove those irrelevant observations. This can make analysis more efficient, minimize distraction from your primary target, and create a more manageable, more performant dataset.
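If you work with your data in pandas, both removals fit in a few lines. The sketch below is illustrative only; the file name and the birth_year column are hypothetical stand-ins for your own data.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical source file

# Drop exact duplicate rows that crept in when sources were combined.
df = df.drop_duplicates()

# Drop observations that are irrelevant to the question at hand, e.g. keep
# only millennial customers (hypothetical birth_year column, 1981-1996).
df = df[df["birth_year"].between(1981, 1996)]
```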
Step 2: Fix structural errors
Structural errors arise when you measure or transfer data and end up with strange naming conventions, typos, or incorrect capitalization. These inconsistencies can cause mislabeled categories or classes. For example, you may find "N/A" and "Not Applicable" both appear, but they should be analyzed as the same category.
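A minimal pandas sketch of this kind of fix, using a hypothetical status column:

```python
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "not applicable ", "Applicable", "applicable"]})

# Normalize stray whitespace and casing, then collapse label variants
# onto one canonical category.
df["status"] = (
    df["status"]
    .str.strip()
    .str.lower()
    .replace({"not applicable": "n/a"})
)
```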
Step 3: Filter unwanted outliers
Often, there will be one-off observations that, at a glance, do not appear to fit within the data you are analyzing. If you have a legitimate reason to remove an outlier, like improper data entry, doing so will help the performance of the data you are working with. However, sometimes it is the appearance of an outlier that will prove a theory you are working on. Remember: just because an outlier exists doesn't mean it is incorrect. This step is needed to determine the validity of that number. If an outlier proves to be irrelevant for analysis or is a mistake, consider removing it.
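One common way to surface candidate outliers for review is an interquartile-range (IQR) rule. The sketch below (pandas, hypothetical column name) flags values rather than silently deleting them, leaving the judgment call to you.

```python
import pandas as pd

def iqr_outliers(df: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Return rows whose value falls outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    in_range = df[column].between(q1 - k * iqr, q3 + k * iqr)
    # Return the candidates so a human can judge whether they are data-entry
    # errors or legitimate extreme values worth keeping.
    return df[~in_range]
```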
Step 4: Handle missing data
You can't ignore missing data because many algorithms will not accept missing values. There are a few ways to deal with missing data; none of them is optimal, but all can be considered (a minimal sketch of each option follows the list).
- As a first option, you can drop observations that have missing values, but doing this will lose information, so be mindful of this before you remove them.
- As a second option, you can impute missing values based on other observations; again, there is an opportunity to lose the integrity of the data because you may be operating from assumptions and not actual observations.
- As a third option, you might alter the way the data is used to effectively navigate null values.
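Here is a minimal pandas sketch of the three options, using a hypothetical file and revenue column:

```python
import pandas as pd

df = pd.read_csv("sales.csv")  # hypothetical source file

# Option 1: drop observations with missing values (loses information).
dropped = df.dropna(subset=["revenue"])

# Option 2: impute missing values from other observations (adds assumptions).
imputed = df.fillna({"revenue": df["revenue"].median()})

# Option 3: keep the nulls and carry an explicit "missing" flag so downstream
# logic can treat missingness as its own category.
df["revenue_missing"] = df["revenue"].isna()
```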
Step 5: Validate and QA
At the end of the data cleaning process, you should be able to answer these questions as a part of basic validation (a small scripted example of such checks follows the list):
- Does the data make sense?
- Does the data follow the appropriate rules for its field?
- Does it prove or disprove your working theory, or bring any insight to light?
- Can you find trends in the data to help you form your next theory?
- If not, is that because of a data quality issue?
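One lightweight way to make this validation repeatable is to script a handful of sanity checks. The sketch below uses pandas with hypothetical column names and rules; substitute the rules that matter for your own data.

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical file and columns

# Basic sanity rules; a failing assertion points to a remaining quality issue.
assert df["customer_id"].is_unique, "duplicate customer IDs remain"
assert df["birth_year"].between(1900, 2024).all(), "implausible birth years"
assert df["status"].isin(["active", "churned", "n/a"]).all(), "unexpected status labels"
```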
False conclusions because of incorrect or “dirty” data can inform poor business strategy and decision-making. False conclusions can lead to an embarrassing moment in a reporting meeting when you realize your data doesn’t stand up to scrutiny. Before you get there, it is important to create a culture of quality data in your organization. To do this, you should document the tools you might use to create this culture and what data quality means to you.
Components of quality data
Determining the quality of data requires an examination of its characteristics, then weighing those characteristics according to what is most important to your organization and the application(s) for which they will be used.
5 characteristics of quality data
- Validity. The degree to which your data conforms to defined business rules or constraints.
- Accuracy. The degree to which your data is close to the true values.
- Completeness. The degree to which all required data is known.
- Consistency. The degree to which your data is consistent within the same dataset and/or across multiple data sets.
- Uniformity. The degree to which the data is specified using the same unit of measure.
Benefits of data cleaning
Having clean data will ultimately increase overall productivity and allow for the highest quality information in your decision-making. Benefits include:
- Removal of errors when multiple sources of data are at play.
- Fewer errors make for happier clients and less-frustrated employees.
- Ability to map the different functions and what your data is intended to do.
- Monitoring errors and better reporting to see where errors are coming from, making it easier to fix incorrect or corrupt data for future applications.
- Using tools for data cleaning will make for more efficient business practices and quicker decision-making.
Data cleaning tools and software for efficiency
Software can help you drive a quality data culture by providing visual and direct ways to combine and clean your data. Tableau Prep has two products: Tableau Prep Builder for building your data flows and Tableau Prep Conductor for scheduling, monitoring, and managing flows across your organization. Using a data scrubbing tool can save a database administrator a significant amount of time by helping analysts or administrators start their analyses faster and have more confidence in the data. Understanding data quality and the tools you need to create, manage, and transform data is an important step toward making efficient and effective business decisions. This crucial process will further develop a data culture in your organization. To see how Tableau Prep can impact your organization, read about how marketing agency Tinuiti centralized 100-plus data sources in Tableau Prep and scaled their marketing analytics for 500 clients.
What is Data Inferencing and Mining?
The terms inference and prediction both describe tasks where we learn from data in a supervised manner in order to find a model that describes the relationship between the independent variables and the outcome. Inference and prediction, however, diverge when it comes to the use of the resulting model:
- Inference: Use the model to learn about the data generation process.
- Prediction: Use the model to predict the outcomes for new data points.
Since inference and prediction pursue contrasting goals, specific types of models are associated with the two tasks.
Model interpretability is a necessity for inference
In essence, the difference between models that are suitable for inference and those that are not boils down to model interpretability. What do I mean by model interpretability? I consider a model interpretable if a human, particularly a layman, could retrace how the model generates its estimates. Consider the following approaches for prediction:
- Interpretable: Generalized linear models (e.g. linear regression, logistic regression), linear discriminant analysis, linear support vector machines (SVMs), decision trees
- Less interpretable: neural networks, non-linear SVMs, random forests
Only a subset of interpretable methods is useful for inference. For example, linear SVMs are interpretable because they provide a coefficient for every feature such that it is possible to explain the impact of individual features on the prediction. However, SVMs do not allow for estimating the uncertainty associated with the model coefficients (e.g. the variance), and it is not possible to obtain an implicit measure of model confidence. Note that SVMs are capable of outputting probabilities, but these probabilities are just a transformation of the decision values and are not based on the confidence associated with the parameter estimates. This is why even interpretable methods such as linear SVMs and decision trees are unsuitable for inference.
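To make the contrast concrete, here is a small scikit-learn sketch on synthetic data: a linear SVM exposes one coefficient per feature, but no measure of uncertainty for those coefficients.

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic classification data with five hypothetical features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
print(svm.coef_)  # one weight per feature -> the model is interpretable

# But the fitted object carries no standard errors or confidence intervals
# for these weights, so their uncertainty cannot be assessed -> no inference.
```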
In contrast, consider linear regression, which assumes that the noise in the data follows a Gaussian distribution. Such models determine the standard error of the coefficient estimates and output confidence intervals. Since linear regression allows us to understand the probabilistic nature of the data generation process, it is a suitable method for inference. Bayesian methods are particularly popular for inference because these models can be adjusted to incorporate various assumptions about the data generation process.
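A minimal inference sketch with statsmodels on synthetic data: the fitted model reports standard errors and confidence intervals for each coefficient, which is exactly what the SVM above does not provide.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # two hypothetical predictors
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())    # coefficients, standard errors, p-values
print(model.conf_int())   # 95% confidence intervals for the estimates
```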
Examples for prediction and inference
Merely using a model that is suitable for inference does not mean that you are actually performing inference. What matters is how you are using the model. For example, although generalized linear models are suitable for inference, I recently used them solely for prediction purposes. Consider the following examples that make the distinction between prediction and inference clearer:
- Prediction: You want to predict future ozone levels using historic data. Since you believe that there is a linear relationship between the ozone level and measurements of temperature, solar radiation, and wind, you fit several linear models (e.g. using different sets of features) on the training data and select the model minimizing the loss on the test set. Finally, you use the selected model for predicting the ozone level. Note that you do not care at all about the Gaussian assumption of the model or the additional information that is available on the model estimates as long as the model minimizes the test error (a code sketch of this workflow follows the list).
- Inference: You want to understand how ozone levels are influenced by temperature, solar radiation, and wind. Since you assume that the residuals are normally distributed, you use a linear regression model. Since you want to harness the information from the full data set and do not care about predictive accuracy, you fit the model on the full data set. On the basis of the fitted model, you interpret the role of the features on the measured ozone level, for example, by considering the confidence bands of the estimates.
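A sketch of the prediction workflow just described, with scikit-learn and synthetic stand-ins for the ozone data: candidate feature sets are compared purely by their test-set error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 300
temperature = rng.normal(25, 5, n)
radiation = rng.normal(180, 40, n)
wind = rng.normal(10, 3, n)
ozone = 2.0 * temperature + 0.1 * radiation - 1.5 * wind + rng.normal(0, 5, n)

# Candidate feature sets; the "best" model is simply the one with the
# lowest test-set error, regardless of any distributional assumptions.
candidates = {
    "temperature": np.column_stack([temperature]),
    "temperature + radiation": np.column_stack([temperature, radiation]),
    "temperature + radiation + wind": np.column_stack([temperature, radiation, wind]),
}

best_name, best_mse = None, np.inf
for name, X in candidates.items():
    X_train, X_test, y_train, y_test = train_test_split(X, ozone, random_state=0)
    pred = LinearRegression().fit(X_train, y_train).predict(X_test)
    mse = mean_squared_error(y_test, pred)
    if mse < best_mse:
        best_name, best_mse = name, mse

print("selected feature set:", best_name)
```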
Workflows for inference and prediction
The basic workflows for inference and prediction are described in the following sections.
Inference
- Modelling: Reason about the data generation process and choose the stochastic model that approximates the data generation process best.
- Model validation: Evaluate the validity of the stochastic model using residual analysis or goodness-of-fit tests (a minimal sketch follows this list).
- Inference: Use the stochastic model to understand the data generation process.
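As an example of the validation step, the sketch below (statsmodels and scipy, synthetic data) fits an ordinary least-squares model and applies a normality test to its residuals.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
X = sm.add_constant(rng.normal(size=(200, 2)))  # intercept + two predictors
y = X @ np.array([1.0, 2.0, -0.5]) + rng.normal(scale=0.3, size=200)

residuals = sm.OLS(y, X).fit().resid
print(stats.shapiro(residuals))  # goodness-of-fit test for normal residuals
```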
Prediction
- Modeling: Consider several different models and different parameter settings.
- Model selection: Identify the model with the greatest predictive performance using validation/test sets; select the model with the highest performance on the test set.
- Prediction: Apply the selected model on new data with the expectation that the selected model also generalizes to the unseen data.
The two modelling communities
Note that machine learning is often concerned with predictive modelling, while the statistical community often relies on stochastic models that perform inference. Due to the complexity of machine learning models, they are often treated as black boxes. For inference problems, on the other hand, the working principles of the models used are well understood.
Three contradictions in statistical modeling
In his famous 2001 paper, Leo Breiman argued that there are three contradictions in the modeling community, which are represented by the following terms:
- Rashomon: There is often not a single model that fits a data set best but there usually is a multiplicity of models that are similarly appropriate. The term Rashomon refers to a classic 1950 Japanese feature film in which various characters depict different and even contradictory versions of the same incident.
- Occam: The simplest model is often the most interpretable. However, the simplest model is also often not the most accurate model for a given task. Occam refers to William of Occam's razor, which states that "entities should not be multiplied beyond necessity", underlining the importance of simplicity.
- Bellman: Large feature spaces are often seen as a disadvantage. However, modern machine learning methods heavily rely on expanding the feature space in order to improve predictive accuracy. So, are we subject to a curse or a blessing of dimensionality? The term curse of dimensionality was coined by Richard Bellman. The term is used to emphasize the challenges of statistical modeling in large feature spaces.
Blessing or curse of dimensionality?
Predictive modeling particularly embraces the idea that high dimensionality is a blessing. All of the recently developed popular machine learning models such as neural networks and SVMs rely on the idea of expanding the feature space in order to learn about the non-linear relationships between the independent variables. As a consequence, these models can also be thought to violate Occam’s razor because these models do not utilize compact representations of the variables but rely on partially redundant encodings.
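To illustrate what feature-space expansion looks like in practice, the sketch below (scikit-learn, synthetic data) maps two features into a polynomial basis that a linear model can then use to capture non-linear relationships.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.default_rng(3).normal(size=(5, 2))  # two original features
X_expanded = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)

print(X.shape, "->", X_expanded.shape)  # (5, 2) -> (5, 9): a larger, partially redundant basis
```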
Generative modeling, on the other hand, is often concerned with feature elimination in order to allow for improved interpretability and generalizability. Since these methods strive for simple models for explaining the data generation process with few features, they simultaneously fulfill Occam’s razor and circumvent the curse of dimensionality.
Feature selection and Rashomon
Models based on feature selection are, however, much more affected by Rashomon, the multiplicity of appropriate models. Imagine you are doing generative modeling and the original data set contains 10,000 features. Using a notion of variable importance, you reduce the number of features down to 100. You then interpret the resulting model according to the 100 features. Now, the question is: Is this model the only way to view the data generation process?
Probably not, because, stochastically, it is likely that there exists another model with a different set of 100 features that explains the outcome similarly well. This is due to the huge number of subsets with exactly 100 features: there are 10,000 choose 100 of them, which is roughly 10^242. Thus, when performing feature selection, small perturbations in the data may lead to the selection of a model using a different feature subset.
Summary
Prediction and inference can be differentiated according to the following criteria:
Criterion | Prediction | Inference
---|---|---
Model selection | Evaluate a variety of models; select the best-performing model | Reason about the data generation process; select the model whose assumptions seem most reasonable
Validation | Empirically determine the loss on a test set | Use goodness-of-fit tests
Application | Predict the outcome for new samples | Use the model to explain the data generation process
Ramifications | Model interpretability suffers; model validity is shown only for the test set; the model may overfit if the test data are similar to the training data | High model interpretability; model validity is uncertain since predictive accuracy was not considered; overfitting is prevented through simplifying assumptions
So what can the communities of predictive and generative modeling learn from each other? In my opinion, practitioners of machine learning should spend more time with the data in order to make well-informed decisions about the type of predictive model that is used. In many prediction scenarios, well-established methods such as SVMs or neural networks are used without thinking about the alternatives, according to the saying "if all you have is a hammer, everything looks like a nail." However, there may be other models that are more appropriate for a given data set, for example because they perform better, generalize better, or are more interpretable.
Additionally, there should be more skepticism about the dogmatic use of the test error as the sole measure of model validity. Particularly in settings where the test set is similar to the training data, a low test error may not guarantee a low generalization error. Finally, further effort should be made to peel away some of the layers that render predictive models opaque black boxes.
In the inference setting, model performance should be included as a criterion for the evaluation of model validity, since it is hazardous to draw conclusions from an inaccurate model. If the validity of a model has not been sufficiently established, all interpretations should be taken with a grain of salt, and it may be worth investigating the perspectives that other models offer. Additionally, model assumptions should be well argued rather than assuming a certain distribution (e.g. normal) for mere convenience.
Finally, when pruning the feature space (in both inference and prediction), one should judiciously decide on the feature selection criterion and be aware that the selected model may not be the only reasonable model for the data at hand. To verify the validity of the selected features, resampling methods such as the bootstrap can be used to demonstrate the robustness of the selected variables.
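One way to carry out such a check is sketched below with scikit-learn on synthetic data: refit a simple univariate selector on bootstrap resamples and count how often each feature is chosen.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data: 50 features, only 5 of which actually drive the outcome.
X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)

rng = np.random.default_rng(0)
n_boot = 200
counts = np.zeros(X.shape[1])
for _ in range(n_boot):
    idx = rng.integers(0, len(y), size=len(y))                    # bootstrap resample
    selector = SelectKBest(f_regression, k=5).fit(X[idx], y[idx]) # refit the selector
    counts[selector.get_support()] += 1

# Features selected in (almost) every resample are robust; features selected
# only occasionally reflect the Rashomon-style multiplicity of feature subsets.
print(np.round(counts / n_boot, 2))
```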
https://www.tableau.com/learn/articles/what-is-data-cleaning