Introduction

In this post, I would like to go through some common methods of data exploration. Data exploration is one of the introductory analysis that is performed before any model building task. Data exploration can uncover some hidden patterns and lead to insights into the some phenomenom behind the data.It can inform the selection of appropriate statistical techniques,tools and models. Exploratory techniques are also important for eliminating or sharpening potential hypotheses about the causes of the observed phenomena in the data. We can also detect outliers and anomalies in the data through exploration. Exploratory analysis emphasizes graphical visualizations of the data.

Load Required Packages

The pacman package provides a convenient way to load packages. It installs the package before loading if it not already installed.One of my favorite themes that I use with ggplot is the theme_pubclean. Here I set all themes with ggplot by it.

#install.packages("ggpubr")


#install_github("kassambara/easyGgplot2")
#p_install_gh("kassambara/easyGgplot2")

pacman::p_load(tidyverse,janitor,DataExplorer,skimr,ggpubr,viridis,kableExtra,Amelia,easyGgplot2,VIM)


theme_set(theme_pubclean())

The data for this analysis Orange Juice data, is contained in the ISLR package.The ISLR package created to store the data for the popular introductory statistical learning text, Introduction to Statistical Learning with Applications in R (Gareth James, Daniela Witten, Trevor Hastie and Rob Tibshirani).The data contains 1070 purchases where the customer either purchased Citrus Hill or Minute Maid Orange Juice. A number of characteristics of the customer and product are recorded.The categorical response variable is Purchase with levels CH and MM indicating whether the customer purchased Citrus Hill or Minute Maid Orange Juice. The goal of this data is to predict which of the two brands of orange juice did customers want to buy based on some seventeen features which describes the product and nature of the customers. The dataset can be downloaded here. It contains 1070 observations and seveenteen features plus the response variable purchase.

Description of Variables:

  1. WeekofPurchase: Week of purchase
  2. StoreID: Store ID
  3. PriceCH: Price charged for CH
  4. PriceMM: Price charged for MM
  5. DiscCH: Discount offered for CH
  6. DiscMM: Discount offered for MM
  7. SpecialCH: Indicator of special on CH
  8. SpecialMM: Indicator of special on MM
  9. LoyalCH: Customer brand loyalty for CH
  10. SalePriceMM: Sale price for MM
  11. SalePriceCH: Sale price for CH
  12. PriceDiff: Sale price of MM less sale price of CH
  13. Store7: A factor with levels No and Yes indicating whether the sale is at Store 7
  14. PctDiscMM: Percentage discount for MM
  15. PctDiscCH: Percentage discount for CH
  16. ListPriceDiff: List price of MM less list price of CH
  17. STORE: store id.
# Import dataset
orangejuice<-read_csv('https://raw.githubusercontent.com/NanaAkwasiAbayieBoateng/ExploratoryDataAnalysis/master/orangejuice.csv')

write_csv(orangejuice,"orangejuice.csv")

orangejuice%>%head()%>%
  kable(escape = F, align = "c") %>%
  kable_styling(c("striped", "condensed"), full_width = F)
Purchase WeekofPurchase StoreID PriceCH PriceMM DiscCH DiscMM SpecialCH SpecialMM LoyalCH SalePriceMM SalePriceCH PriceDiff Store7 PctDiscMM PctDiscCH ListPriceDiff STORE
CH 237 1 1.75 1.99 0.00 0.0 0 0 0.500000 1.99 1.75 0.24 No 0.000000 0.000000 0.24 1
CH 239 1 1.75 1.99 0.00 0.3 0 1 0.600000 1.69 1.75 -0.06 No 0.150754 0.000000 0.24 1
CH 245 1 1.86 2.09 0.17 0.0 0 0 0.680000 2.09 1.69 0.40 No 0.000000 0.091398 0.23 1
MM 227 1 1.69 1.69 0.00 0.0 0 0 0.400000 1.69 1.69 0.00 No 0.000000 0.000000 0.00 1
CH 228 7 1.69 1.69 0.00 0.0 0 0 0.956535 1.69 1.69 0.00 Yes 0.000000 0.000000 0.00 0
CH 230 7 1.69 1.99 0.00 0.0 0 1 0.965228 1.99 1.69 0.30 Yes 0.000000 0.000000 0.30 0

Univariate Analysis

plot_str(orangejuice)

There are 40 missing observations in the data set.In this exploratory analysis we would simply delete these missing values. Imputing missing values would be discussed extensively in a later post.When the number of missing values is relative to the sample size is small in a data set, a basic approach to handling missing data is to delete them.

plot_missing(orangejuice)

An alternate visualization approach is to use visna function from the extracat package.The columns represent the variables in the data and the rows the missing pattern.The blue cells represent cells of the variable with missing values.The proportion of missing values for each variable is shown by the bars vertically beneath cells.The right show the relative frequencies of patterns.

pacman::p_load(extracat)


extracat::visna(orangejuice, sort = "b", sort.method="optile", fr=100, pmax=0.05, s = 2)

library(VIM)

aggr(orangejuice , col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(orangejuice), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##        Variable        Count
##       SpecialMM 0.0046728972
##         LoyalCH 0.0046728972
##     SalePriceMM 0.0046728972
##       PctDiscMM 0.0046728972
##         PriceMM 0.0037383178
##          DiscMM 0.0037383178
##          DiscCH 0.0018691589
##       SpecialCH 0.0018691589
##       PctDiscCH 0.0018691589
##           STORE 0.0018691589
##         StoreID 0.0009345794
##         PriceCH 0.0009345794
##     SalePriceCH 0.0009345794
##       PriceDiff 0.0009345794
##        Purchase 0.0000000000
##  WeekofPurchase 0.0000000000
##          Store7 0.0000000000
##   ListPriceDiff 0.0000000000
library(Amelia)
missmap(orangejuice, main = "Missing values vs observed",col=c('navyblue','yellow'),y.cex=0.5)

plot_histogram(orangejuice)

plot_density(orangejuice)

plot_bar(orangejuice)

Purchases made at store store 7 is lower than other stores whereas more customers purchased Citrus Hill than Minute Maid Orange Juice

Multivariate Analysis

Multivariate analysis include examining the correlation structure between variables in the dataset and also the bivariate relationship between the response variable and each predictor variable.

pacman::p_load(GGally)

na.omit(orangejuice)%>%select_if(is.double)%>%ggpairs(  title = "Continuous Variables")

Multiple continuous variables can be visualized by Parallel Coordinate Plots (PCP). Each vertical axis represents a column variable in the data and the observations are drawn as lines connecting its value on the corresponding vertical axes. The ggplot extension GGally package has the ggparcoord function which can be used for PCP plots in R. High values for Week of purchase corresponds with stores with low ID numbers. Low values for Indicator of special on MM corresponds with higher customer loyalty

#p_ <- GGally::print_if_interactive

# this time, color by diamond cut
p <- ggparcoord(data = na.omit(orangejuice), columns = c(2:10), groupColumn = "Purchase", title = "Parallel Coord. Plot of Orange Juice Data",scale = "uniminmax", boxplot = FALSE, mapping = ggplot2::aes(size = 1),showPoints = TRUE,alpha = .05,)+
  #scale_fill_viridis(discrete = T)+
  
    scale_fill_manual(values=c("#B9DE28FF" , "#D1E11CFF" ))+
   ggplot2::scale_size_identity()
#p_(p)
p

na.omit(orangejuice)%>%select_if(is.double)%>%
  mutate(Above_Avg = PriceCH > mean(PriceCH)) %>%
  GGally::ggparcoord(showPoints = TRUE,
 
    alpha = .05,
    scale = "center",
    columns = 1:8,
    groupColumn = "Above_Avg"
    )

Correlation between numeric variables can also be visualized by a heatmap. Heatmaps can identify clusters with strong correlation among variables. The correlation matrix between the variables can be visualized neatly on a heatmap. e the correlation matrix and visualize this matrix with a heatmap. Deep points represent low correlations whereas light yellow represents strong correlations. There exist strong correlations among variable pairs such as (WeekofPurchase, Price) ,( PctDisc, SalePrice )for both CH and MM, ( ListPriceDiff, PriceMM) etc.

plot_correlation(na.omit(orangejuice),type = "continuous",theme_config = list(legend.position = "bottom", axis.text.x =
  element_text(angle = 90)))

The corrplot function can also equivalently plot the correlatio between variables in a dataset as shown below:

pacman::p_load(plotly,corrr,RColorBrewer,corrplot)



na.omit(orangejuice)%>%select_if(is.numeric)%>%cor()%>%corrplot::corrplot()

#Equivalently
#na.omit(orangejuice)%>%select_if(is.numeric)%>%cor()%>%
#  corrplot.mixed(upper = "color", tl.col = "black")
na.omit(orangejuice)%>%
  select_if(is.numeric) %>%
  cor() %>%
  heatmap(Rowv = NA, Colv = NA, scale = "column")

An interactive heatmap can be easily plotted courtesy the d3heatmap package.

pacman::p_load(d3heatmap)

na.omit(orangejuice)%>%
  select_if(is.numeric) %>%
  cor() %>%
d3heatmap(colors = "Blues", scale = "col",
          dendrogram = "row", k_row = 3)

ggsave("/Users/nanaakwasiabayieboateng/Documents/memphisclassesbooks/DataMiningscience/ExploratoryDataAnalysis/d3heatmap.pdf")
library(devtools)

#install_github("easyGgplot2", "kassambara")


pacman::p_load(ggalt,gridExtra,scales,kassambara,easyGgplot2)


p1<-ggplot(orangejuice, aes(x=SalePriceCH, fill=Purchase)) + geom_bkde(alpha=0.5)
p2<-ggplot(orangejuice, aes(x=SalePriceMM, fill=Purchase)) + geom_bkde(alpha=0.5)



# Multiple graphs on the same page
easyGgplot2::ggplot2.multiplot(p1,p2, cols=2)

The sale price for both purchased Citrus Hill and Minute Maid Orange Juice is multimodal and the Citrus Hill has a higher sale price.

The skimr and mlr packages have functions that conveniently summaeizes a dataset and present the output in a tabular form.

skimmed <-skim_to_wide(orangejuice)
skimmed%>%
  kable() %>%
  kable_styling()
type variable missing complete n min max empty n_unique mean sd p0 p25 p50 p75 p100 hist
character Purchase 0 1070 1070 2 2 0 2 NA NA NA NA NA NA NA NA
character Store7 0 1070 1070 2 3 0 2 NA NA NA NA NA NA NA NA
integer SpecialCH 2 1068 1070 NA NA NA NA 0.15 0.35 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer SpecialMM 5 1065 1070 NA NA NA NA 0.16 0.37 0 0 0 0 1 ▇▁▁▁▁▁▁▂
integer STORE 2 1068 1070 NA NA NA NA 1.63 1.43 0 0 2 3 4 ▇▃▁▅▁▅▁▃
integer StoreID 1 1069 1070 NA NA NA NA 3.96 2.31 1 2 3 7 7 ▃▅▅▃▁▁▁▇
integer WeekofPurchase 0 1070 1070 NA NA NA NA 254.38 15.56 227 240 257 268 278 ▆▅▅▃▅▇▆▇
numeric DiscCH 2 1068 1070 NA NA NA NA 0.052 0.12 0 0 0 0 0.5 ▇▁▁▁▁▁▁▁
numeric DiscMM 4 1066 1070 NA NA NA NA 0.12 0.21 0 0 0 0.23 0.8 ▇▁▁▂▁▁▁▁
numeric ListPriceDiff 0 1070 1070 NA NA NA NA 0.22 0.11 0 0.14 0.24 0.3 0.44 ▂▂▂▂▇▆▁▁
numeric LoyalCH 5 1065 1070 NA NA NA NA 0.57 0.31 1.1e-05 0.32 0.6 0.85 1 ▅▂▃▃▆▃▃▇
numeric PctDiscCH 2 1068 1070 NA NA NA NA 0.027 0.062 0 0 0 0 0.25 ▇▁▁▁▁▁▁▁
numeric PctDiscMM 5 1065 1070 NA NA NA NA 0.059 0.1 0 0 0 0.11 0.4 ▇▁▁▂▁▁▁▁
numeric PriceCH 1 1069 1070 NA NA NA NA 1.87 0.1 1.69 1.79 1.86 1.99 2.09 ▂▅▁▇▁▁▅▁
numeric PriceDiff 1 1069 1070 NA NA NA NA 0.15 0.27 -0.67 0 0.23 0.32 0.64 ▁▁▂▂▃▇▃▂
numeric PriceMM 4 1066 1070 NA NA NA NA 2.09 0.13 1.69 1.99 2.09 2.18 2.29 ▁▁▁▃▁▇▃▂
numeric SalePriceCH 1 1069 1070 NA NA NA NA 1.82 0.14 1.39 1.75 1.86 1.89 2.09 ▁▁▁▂▆▇▅▁
numeric SalePriceMM 5 1065 1070 NA NA NA NA 1.96 0.25 1.19 1.69 2.09 2.13 2.29 ▁▁▃▃▁▂▇▆
mlr::summarizeColumns(orangejuice)%>%
  kable() %>%
  kable_styling()
name type na mean disp median mad min max nlevs
Purchase character 0 NA 0.3897196 NA NA 4.17e+02 653.000000 2
WeekofPurchase integer 0 254.3813084 15.5582861 257.00 20.7564000 2.27e+02 278.000000 0
StoreID integer 1 3.9569691 2.3081886 3.00 1.4826000 1.00e+00 7.000000 0
PriceCH numeric 1 1.8674275 0.1020172 1.86 0.1482600 1.69e+00 2.090000 0
PriceMM numeric 4 2.0850375 0.1344285 2.09 0.1334340 1.69e+00 2.290000 0
DiscCH numeric 2 0.0519569 0.1175628 0.00 0.0000000 0.00e+00 0.500000 0
DiscMM numeric 4 0.1234146 0.2141255 0.00 0.0000000 0.00e+00 0.800000 0
SpecialCH integer 2 0.1470037 0.3542755 0.00 0.0000000 0.00e+00 1.000000 0
SpecialMM integer 5 0.1624413 0.3690285 0.00 0.0000000 0.00e+00 1.000000 0
LoyalCH numeric 5 0.5652030 0.3080704 0.60 0.3891084 1.10e-05 0.999947 0
SalePriceMM numeric 5 1.9619343 0.2525100 2.09 0.1482600 1.19e+00 2.290000 0
SalePriceCH numeric 1 1.8155192 0.1434442 1.86 0.1482600 1.39e+00 2.090000 0
PriceDiff numeric 1 0.1463237 0.2716379 0.23 0.1482600 -6.70e-01 0.640000 0
Store7 character 0 NA 0.3327103 NA NA 3.56e+02 714.000000 2
PctDiscMM numeric 5 0.0593881 0.1018414 0.00 0.0000000 0.00e+00 0.402010 0
PctDiscCH numeric 2 0.0273179 0.0622811 0.00 0.0000000 0.00e+00 0.252688 0
ListPriceDiff numeric 0 0.2179907 0.1075354 0.24 0.0889560 0.00e+00 0.440000 0
STORE integer 2 1.6282772 1.4304973 2.00 1.4826000 0.00e+00 4.000000 0
(spec_variables <- attr(orangejuice, "spec"))
## cols(
##   Purchase = col_character(),
##   WeekofPurchase = col_integer(),
##   StoreID = col_integer(),
##   PriceCH = col_double(),
##   PriceMM = col_double(),
##   DiscCH = col_double(),
##   DiscMM = col_double(),
##   SpecialCH = col_integer(),
##   SpecialMM = col_integer(),
##   LoyalCH = col_double(),
##   SalePriceMM = col_double(),
##   SalePriceCH = col_double(),
##   PriceDiff = col_double(),
##   Store7 = col_character(),
##   PctDiscMM = col_double(),
##   PctDiscCH = col_double(),
##   ListPriceDiff = col_double(),
##   STORE = col_integer()
## )
spec_variables<-c("LoyalCH", "SalePriceMM","SalePriceCH" ,"PctDiscMM","PctDiscCH","ListPriceDiff","Purchase","Store7")


spec_variable<-noquote(spec_variables)
 
pm<-ggpairs(orangejuice,spec_variable , title = "",mapping = aes(color = Purchase))+
  theme(legend.position = "top")

pm

We can select one of plots above as follows:

pm[1,7]

na.omit(orangejuice)%>% select_if(~!is.double(.x))%>%
  ggpairs( mapping = aes(color = Purchase) , title = "Categorical Variables")+
  theme(legend.position = "top")

#Equivalently

#na.omit(orangejuice)%>% select_if(funs(!is.double(.)))%>%
 # ggpairs(  title = "Categorical Variables")


#index=!sapply(na.omit(orangejuice), is.double)
#orange_numeric<-orangejuice[index==TRUE]
#orange_numeric%>%ggpairs(  title = "Categorical Variables")



#na.omit(orangejuice)%>%select_if(negate(is.double))%>%
#  ggpairs(  title = "Categorical Variables")
categorical_orange=na.omit(orangejuice)%>% select_if(~!is.double(.x))
continuous_orange=na.omit(orangejuice)%>% select_if(is.double)

categorical_orange<-noquote(names(categorical_orange))
continuous_orange<-noquote(names(continuous_orange))


ggduo(
  orangejuice, rev(continuous_orange), categorical_orange,
  mapping = aes(color = Purchase),
   types = list(continuous = wrap("smooth_loess", alpha = 0.25)),
  showStrips = FALSE,
  title = "Variable Comparison By Purchase",
  xlab = "Continuous Variables",
  ylab = "Categorical",
  legend = c(5,2)
) +
  theme(legend.position = "top")