
How to do Exploratory Data Analysis (EDA) in R (With Examples)

Hey there! As a fellow data enthusiast, I'm excited to dive into this guide on exploratory data analysis (EDA) using R. EDA is truly the foundation of impactful analysis, so let's jump in and uncover its core concepts and techniques!

What is EDA and Why Does it Matter?

Exploratory data analysis refers to the critical first phase in any data science or analytics project where you get to intimately know your dataset. Rather than jumping straight to building models, EDA allows you to immerse yourself in the data to understand its nuances, subtleties, and hidden insights.

I like to think of EDA as getting to know someone on a first date. You don't just launch into heavy conversation, but instead ask questions, listen, and learn about each other. From this exploration emerges meaning. That's why EDA is so important: it's where the seeds of insight are planted through curious, iterative investigation.

One key benefit of EDA is that it pushes you to question assumptions about the data. Our human bias tends to make us see patterns that may not actually be there. By taking an exploratory approach, we let the data guide us organically rather than imposing our own biases on it.

A commonly repeated rule of thumb is that exploring and preparing data takes 50-80% of the time in an analytics project. Whatever the exact share, dedicate the time to do it right! Thorough EDA will pay dividends when you transition from exploration to modeling and hypothesis testing.

Differences Between Descriptive and Exploratory Analysis

A quick clarification – descriptive analysis and exploratory analysis, while related, have different aims:

Descriptive analysis focuses on summarizing and describing the main characteristics of each individual variable. For example, the distribution, central tendency, spread, outliers, etc. Descriptive analysis helps answer questions like:

  • What are typical values for this variable?
  • How is this variable distributed?

Exploratory analysis goes a step further by exploring the relationships between variables. The goal is to extract insights that explain connections in the data. Exploratory analysis helps answer questions like:

  • How does variable A change with respect to variable B?
  • Are variables X and Y correlated?
  • What factors predict variable Z?

While descriptive analysis characterizes individual variables, exploratory analysis identifies relationships among variables. Both involve visualizations, but the goal of EDA is to uncover actionable insights for modeling.

Core Components of Exploratory Data Analysis

So what are the key tasks involved in effective EDA? Here are the major components:

Initial Data Summarization

First, get an overview of the dataset with summaries of each variable:

  • For numeric data, use summary statistics like mean, median, range, and interquartile range. Visualize distributions with histograms and boxplots.

  • For categorical data, examine value counts and visualize with bar charts.

  • Spot outliers and anomalies that warrant further investigation.

  • Identify patterns like skew through visual inspection.
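The summaries listed above can be sketched in a few lines of base R. This is only a minimal illustration: it uses the built-in mtcars data as a stand-in for your own data frame, and the variable names (mpg_range, cyl_counts, flagged) are made up for the example.

```r
# Base R only; mtcars is a built-in data set used here as a stand-in.
data(mtcars)

# Numeric variable: five-number summary plus mean, then measures of spread.
print(summary(mtcars$mpg))
mpg_range <- range(mtcars$mpg)   # minimum and maximum
mpg_iqr   <- IQR(mtcars$mpg)     # interquartile range

# Categorical variable: value counts (cyl is stored as numeric, so convert).
cyl_counts <- table(factor(mtcars$cyl))
print(cyl_counts)

# Quick visual checks: histogram and boxplot for shape, bar chart for counts.
hist(mtcars$mpg, main = "mpg", xlab = "Miles per gallon")
boxplot(mtcars$mpg, horizontal = TRUE)
barplot(cyl_counts, xlab = "Cylinders")

# Points beyond 1.5 * IQR from the hinges are flagged by boxplot.stats().
# Flagged values deserve a closer look; they are not automatically errors.
flagged <- boxplot.stats(mtcars$hp)$out
print(flagged)
```

A long tail in the histogram or a boxplot with one whisker much longer than the other is usually the first hint of skew.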

Data Cleaning

Real-world data is rarely pristine, so EDA allows you to spot issues and formulate cleaning strategies:

  • Handle missing data. Identify strategy – remove rows, impute values, or model missingness.

  • Remove redundant, irrelevant, or highly correlated variables.

  • Address outliers. Techniques include trimming, smoothing, binning, or modeling outliers separately.
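Here is a rough sketch of the missing-data and outlier steps, assuming the built-in airquality data as an example of a data frame with gaps. Median imputation and the 1.5 * IQR rule are just simple baselines chosen for illustration, not recommendations for every data set.

```r
data(airquality)

# Count missing values per column to see where the gaps are.
na_per_col <- colSums(is.na(airquality))
print(na_per_col)

# Option 1: drop every row that has at least one missing value.
complete_rows <- na.omit(airquality)

# Option 2: impute a numeric column with its median (a simple baseline).
imputed <- airquality
imputed$Ozone[is.na(imputed$Ozone)] <- median(imputed$Ozone, na.rm = TRUE)

# Flag outliers with the 1.5 * IQR rule instead of deleting them outright.
q     <- quantile(imputed$Wind, c(0.25, 0.75))
fence <- 1.5 * diff(q)
imputed$wind_outlier <- imputed$Wind < q[1] - fence |
                        imputed$Wind > q[2] + fence
print(table(imputed$wind_outlier))
```

Keeping a flag column rather than removing rows makes it easy to rerun later analyses with and without the suspicious values.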

Univariate Analysis

Explore the distribution of each individual variable:

  • For data recorded over time, plot the variable against time with a line chart. Decompose time series into trend and seasonal components.

  • For numeric data, visualize distributions with histograms and density plots. Transform skewed data.

  • Bin continuous data into ordinal categories like "low", "medium", and "high" values.
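As a small sketch of the transform and binning steps, the snippet below uses the built-in mtcars data. The tercile cut points and the column name mpg_band are assumptions made for the example; in practice you would pick breaks that mean something in your domain.

```r
data(mtcars)

# Density plot of a right-skewed variable, before and after a log transform.
plot(density(mtcars$hp), main = "hp (raw)")
plot(density(log(mtcars$hp)), main = "hp (log scale)")

# Bin a continuous variable into three ordered categories at its terciles.
breaks <- quantile(mtcars$mpg, probs = c(0, 1/3, 2/3, 1))
mtcars$mpg_band <- cut(
  mtcars$mpg,
  breaks = breaks,
  labels = c("low", "medium", "high"),
  include.lowest = TRUE,   # keep the minimum value in the first bin
  ordered_result = TRUE    # low < medium < high
)
print(table(mtcars$mpg_band))
```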

Bivariate Analysis

Examine relationships between pairs of variables:

  • Use scatterplots and line charts to identify correlations and nonlinear relationships.

  • Color or facet plots by group or subset to uncover interactions.

  • Calculate correlation coefficients to quantify correlations.
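A minimal sketch of these bivariate checks, assuming ggplot2 is installed and again using the built-in mtcars data as a stand-in. The straight-line smoother is only a visual aid; it does not imply the relationship is linear.

```r
library(ggplot2)
data(mtcars)

# Scatterplot of weight vs. fuel economy with a straight-line fit,
# split into one panel per cylinder count.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ cyl)
print(p)

# Pearson measures linear association; Spearman only assumes monotonicity.
r_pearson  <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
r_spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
print(c(pearson = r_pearson, spearman = r_spearman))
```

If the two coefficients differ a lot, that is often a sign of a nonlinear relationship or of a few influential points.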

Multivariate Analysis

Expand beyond pairs of variables to uncover complex variable interactions:

  • Use scatterplot matrices to visualize multivariate relationships.

  • Apply clustering algorithms like k-means to find groupings of observations.

  • Perform principal component analysis (PCA) to derive key underlying dimensions.
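The three techniques above can be tried out in base R. This sketch uses the built-in iris data, and the choice of three clusters is an assumption made purely for illustration; on real data you would compare several values of k.

```r
data(iris)
num <- iris[, 1:4]   # the four numeric measurements

# Scatterplot matrix of all pairs of numeric variables, colored by species.
pairs(num, col = iris$Species)

# k-means is sensitive to scale and to random starts, so standardize the
# columns, fix the seed, and try several starts. k = 3 is an assumption.
set.seed(42)
scaled <- scale(num)
km <- kmeans(scaled, centers = 3, nstart = 25)
print(table(cluster = km$cluster, species = iris$Species))

# PCA on standardized variables; summary() reports variance explained.
pca <- prcomp(num, center = TRUE, scale. = TRUE)
pca_summary <- summary(pca)
print(pca_summary)

# Observations in the space of the first two components, colored by cluster.
plot(pca$x[, 1:2], col = km$cluster, pch = 19)
```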

The goal of EDA is to thoroughly investigate the data from all angles. Let the data itself guide the analysis rather than rigidly following a linear process. Iteratively generate questions, investigate, generate new questions, etc. until insights emerge.

Helpful R Packages for EDA

The R ecosystem provides amazing packages for exploratory data analysis. Here are some of my favorites:

  • tidyverse – dplyr for data wrangling, ggplot2 for beautiful graphics, tidyr for transforming data, readr for data import.

  • skimr – Fantastic package for quick summary statistics on data frames. Gives you an overview of all variables.

  • DataExplorer – Auto EDA! Produces visualizations, statistics, and reports from data frames. Great for initial investigation.

  • ggthemes – Additional color palettes, themes, and scales for even prettier ggplot2 visualizations.

  • patchwork – Easily combine separate ggplot2 plots into panels and grids. Great for multivariate exploration.

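  • visdat – Visualize the column types and missing values of a whole data frame at a glance with vis_dat() and vis_miss(). It does not impute; for multivariate imputation by chained equations (MICE), use the separate mice package.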

  • VIM – Visualization and Imputation of Missing Values. Explore missing data interactively via charts and plots.

  • corrplot – Visualize correlation matrices as heatmaps with correlation coefficients.

There are many more excellent EDA packages, but these form a solid starting toolkit. The tidyverse is especially essential for data manipulation and visualization.
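To give a feel for two of the packages above, here is a rough sketch that draws a missingness map with visdat and a correlation heatmap with corrplot. It uses the built-in airquality and mtcars data as stand-ins, and the requireNamespace() guards are only there so the script still runs if a package is not installed.

```r
# Built-in data sets used purely for illustration.
data(airquality)
data(mtcars)

# visdat: one plot showing where values are missing in each column.
if (requireNamespace("visdat", quietly = TRUE)) {
  print(visdat::vis_miss(airquality))
}

# corrplot: heatmap of a correlation matrix with coefficients overlaid.
corr_mat <- cor(mtcars)
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(corr_mat, method = "color", addCoef.col = "black",
                     number.cex = 0.6)
}
```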

Okay, let's now walk through an example EDA on the mpg dataset, which ships with ggplot2, using this toolkit.

Hands-On Example: EDA on mpg Dataset

The mpg dataset contains fuel economy data for 234 cars, including manufacturer, model, engine details, and miles per gallon. Let's load it along with the tidyverse and skimr:

library(tidyverse)
library(skimr)

data(mpg)

First, use skim() to get an overview of all variables:

skim(mpg)

This quickly summarizes data types, completeness, and distributions. For example, we can see that cty and hwy span a wide range of fuel economy values, and that no column has missing values (the mpg data is complete).
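If you want to double-check structure and completeness without skimr, a few lines do the job. This is a minimal sketch assuming ggplot2 and dplyr are installed; the names missing_per_col and distinct_per_col are just illustrative.

```r
library(ggplot2)  # provides the mpg data set
library(dplyr)    # for glimpse()

# Column names, types, and the first few values of each column.
glimpse(mpg)

# Number of missing values in each column.
missing_per_col <- colSums(is.na(mpg))
print(missing_per_col)

# Number of distinct values per column, handy for spotting categorical columns.
distinct_per_col <- sapply(mpg, function(col) length(unique(col)))
print(distinct_per_col)
```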

Next let's visualize the univariate distributions. A histogram shows cty is roughly bell-shaped, with a tail toward higher values:

ggplot(mpg) +
  geom_histogram(aes(x = cty), bins = 15)

For categorical data like class, a bar chart shows the frequency of each level:

ggplot(mpg) +
  geom_bar(aes(x = class))

We notice many SUVs and compact cars, but few two-seaters (the 2seater level). To examine relationships, let's look at highway mileage vs. engine displacement:

ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy))

The downward trend suggests a negative correlation: as displacement increases, highway mpg decreases. We can quantify this with correlation:

cor(mpg$displ, mpg$hwy)

# approximately -0.766

Correlation does not imply causation, but this relationship passes an initial "sniff test" based on our knowledge about engines.
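One natural follow-up question is whether vehicle class explains part of that trend. The sketch below is one way to look into it, assuming dplyr and ggplot2 are installed; the summary columns (median_hwy, median_displ) are names chosen for this example.

```r
library(dplyr)
library(ggplot2)

# Median highway mpg and displacement for each vehicle class.
by_class <- mpg %>%
  group_by(class) %>%
  summarise(
    n            = n(),
    median_hwy   = median(hwy),
    median_displ = median(displ)
  ) %>%
  arrange(desc(median_hwy))
print(by_class)

# The displacement vs. highway mpg scatterplot again, colored by class.
p <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point()
print(p)
```

If classes form distinct bands in the colored scatterplot, class is worth keeping in mind as a grouping variable when you move on to modeling.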

I've only scratched the surface here, but you can see how iterative exploration combining statistics and visuals guides insights about variables and relationships. This forms a solid foundation before modeling tasks like prediction and classification.

Tips for Impactful Exploratory Analysis

Here are some tips to help guide your exploratory process:

  • Let questions about the data drive the analysis rather than following a rigid process. Ask "why?" frequently.

  • Employ both statistical summaries and visualizations to thoroughly characterize relationships. They complement each other.

  • Treat numeric and categorical data differently. Know when to use histograms vs. bar charts.

  • Watch for groups, clusters, or hidden segments. Color or facet visualizations by subset or group.

  • Treat missing data and outliers deliberately: impute or remove missing values, and handle anomalies with care rather than deleting them by default.

  • Remove redundant variables or variables that won't aid modeling. Avoid saturated models.

  • Check linear modeling assumptions and treat violations accordingly. Nonlinear relationships? Homoscedasticity?

  • Document EDA insights in a report, notebook, or README to transition smoothly to modeling.
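For the tip about linear modeling assumptions, a residual plot is a quick first check. This is a minimal sketch assuming ggplot2 is installed; the straight-line model of hwy on displ is a deliberately simple choice for illustration, not a recommended final model.

```r
library(ggplot2)  # provides the mpg data set

# Fit a simple straight-line model of highway mpg on displacement.
fit <- lm(hwy ~ displ, data = mpg)

# Residuals vs. fitted values: a curve suggests nonlinearity, a funnel
# shape suggests non-constant variance (heteroscedasticity).
diag_df <- data.frame(fitted = fitted(fit), resid = resid(fit))
p <- ggplot(diag_df, aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)
print(p)

# Share of variance in hwy explained by the straight-line fit.
r_squared <- summary(fit)$r.squared
print(r_squared)
```

If the smoothed line bends away from zero, consider a transformation or an extra term before trusting the linear fit.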

Following an exploratory approach rather than a rigid confirmatory path allows organic insights to emerge directly from the data. This sets up modeling success.

Key Takeaways and Next Steps

In summary, thorough exploratory data analysis deserves a large share of any analytics project, often cited as 50-80% of the time. EDA allows you to:

  • Become intimately familiar with your dataset.

  • Uncover patterns, relationships and insights.

  • Clean, transform, and prepare the data for modeling.

  • Identify appropriate modeling techniques for the data types and relationships observed.

R provides amazing packages for data wrangling, visualization, and exploration. For next steps, I recommend spending time practicing EDA on new datasets to sharpen your skills. Kaggle offers many free and interesting datasets to analyze.

I hope you found this guide helpful! Let me know if you have any other questions. Happy exploring!


Written by Alexis Kestler

A female web designer and programmer - Now is a 36-year IT professional with over 15 years of experience living in NorCal. I enjoy keeping my feet wet in the world of technology through reading, working, and researching topics that pique my interest.