
A Few Exploratory Analysis Techniques Explained

In my previous blog post I explained the steps needed to solve a data analysis problem. Going further, I will discuss each step of data analysis in detail. In this post, we shall discuss exploratory analysis.

What is Exploratory Analysis?

“Understanding data visually”

 

Exploratory analysis means analyzing datasets to summarize their main characteristics, often visually. It is the first step of any data analysis.

Objectives:

  • Know the data types of the dataset – whether continuous/discrete/categorical
  • Understand how the data is distributed
  • Extract important input variables for the analysis
  • Identify outliers
  • Identify patterns, if any exist

Exploratory Analysis Techniques:

  • Box-Plot
  • Histogram
  • Trend analysis
  • Scatter Plots

Let us understand exploratory analysis by considering a data analysis problem.

Problem statement:

To analyze the incidents/events that occurred over the past 3 years and try to predict events occurring in the future.

Solution: 

After understanding the problem statement and gaining sufficient domain knowledge, identify the data sources and download the data into the programming environment.

The next step is to perform an exploratory analysis as explained here. In today's post we shall look at how exploratory analysis can be done.

Types of Exploratory analysis:

 

Type1: Understanding the data – variable names, dimensions of the dataset, data types of each and every variable. 

data = read.csv("datasource.csv") #load data 

 

dim(data) #know the dimensions of the data

[1] 839 50

 

colnames(data) #know the column names

[1] "Incidents" "Year of Occurance" "Location.of.Occurrence" "Date.of.Occurence"
[5] "Time.of.Occurrence" "Operational Phase"

 

str(data) # know the data type of each variable – continuous/discrete/categorical

$ Incidents: int 41505 41537 41539 41565 41589 41596 41598

$ Vehicle.Type : Factor w/ 7 levels "","Volvo(all series)",..: 6 2 2 2 6 6

$ Location.of.Occurrence: Factor w/ 101 levels "","Abidjan","Accra",..: 53 35 35 35 96

$ Date.of.Occurence: Factor w/ 520 levels "1/1/2010","1/1/2012" ..: 1 32 37 

 

sum(is.na(data$Date.of.Occurence)) # counting the number of missing values in the column
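The missing-value check above looks at a single column. As a minimal sketch, the same idea can be applied to every column at once. The small data frame below is hypothetical and only borrows the column names from the post; it is not the author's dataset.

```r
# Hypothetical toy data standing in for the incidents dataset
incidents <- data.frame(
  Incidents = c(41505L, 41537L, 41539L, 41565L),
  Date.of.Occurence = c("1/1/2010", NA, "3/15/2010", NA),
  Time.of.Occurrence = c("18:30", "6:05", NA, "13:45"),
  stringsAsFactors = FALSE
)

# is.na() on a data frame returns a logical matrix;
# colSums() then counts the TRUE (missing) entries per column
missing_per_column <- colSums(is.na(incidents))
print(missing_per_column)

# Share of missing values per column (between 0 and 1)
missing_share <- colMeans(is.na(incidents))
print(round(missing_share, 2))
```

Columns with a large share of missing values are usually worth a closer look before they are used in any further analysis.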

Type2: Creating new variables and converting data types to suit the analysis – for example, factor variables into numeric, dates into year/month/day, and time into hour of the day, as convenient.

 

#Extracting year/month from Date of occurrence and creating new variables

xn = as.POSIXct(data$"Date.of.Occurence",format="%m/%d/%Y")

data["year"] = as.numeric(format(xn,"%Y"))

data["month"] = as.numeric(format(xn,"%m"))

str(data$"year") 

num [1:839] 2010 2010 2010 2010 2010 2010

#extracting hour of the day and creating new variable TimeOfOccurrence (TOC)

data["TOC"] = sub(":.*", "", data$"Time.of.Occurrence")

str(data$"TOC")

chr [1:839] "18" "6" "21" "13" "16" "13" "11" "11" "15" "1" "13" "6"
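Note that the extracted hour is still a character vector, so it would need a numeric conversion before it can be plotted or summarized as a number. The sketch below shows one possible way to do the date and hour conversions together and to count values that fail to parse. The sample values are hypothetical and only follow the formats seen in the post.

```r
# Hypothetical sample values in the same formats as the post's columns
dates <- c("1/1/2010", "12/25/2011", "not a date")
times <- c("18:30", "6:05", "21:10")

# as.Date() returns NA for strings that do not match the given format
parsed <- as.Date(dates, format = "%m/%d/%Y")
year  <- as.numeric(format(parsed, "%Y"))
month <- as.numeric(format(parsed, "%m"))

# Strip everything from the first colon onward, then convert to integer
hour <- as.integer(sub(":.*", "", times))

print(data.frame(dates, year, month, hour))

# Number of dates that could not be parsed
print(sum(is.na(parsed)))
```

Checking the count of unparsed values right after a conversion is a cheap way to catch rows that use a different date format.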

Type3: Observe the summary of each and every variable to understand the variables.

summary(data$"Vehicle.Type")

Volvo (all series) ASHOK (all series) FIAT (all series) Maruthi (all series)
               210                 49                71                   39
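For a categorical variable, it can also help to look at the counts in sorted order and as proportions. The following is a small sketch with a hypothetical factor; the labels echo the post, but the counts are made up for illustration.

```r
# Hypothetical vehicle types; the counts here are illustrative only
vehicle <- factor(c("Volvo", "Volvo", "FIAT", "ASHOK", "Volvo", "FIAT"))

# Frequency of each level, largest first
counts <- sort(table(vehicle), decreasing = TRUE)
print(counts)

# prop.table() converts the counts into proportions that sum to 1
shares <- round(prop.table(counts), 2)
print(shares)
```

Proportions make it easier to see whether one category dominates the data.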

Type4: Decide which variables are good for analysis by using trends, boxplots, histograms etc. 

boxplot(as.numeric(data$Operational.Phase) ~ data$year, col = "blue")

[Figure: box plot]

Box plot distribution of incidents occurring over the years.
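A box plot also supports the outlier objective listed earlier: points drawn beyond the whiskers are candidates for outliers. As a minimal sketch, boxplot.stats() returns the same numbers without drawing anything. The counts below are hypothetical, with one deliberately extreme value.

```r
# Hypothetical monthly incident counts; 95 is a planted extreme value
monthly_counts <- c(28, 31, 30, 27, 33, 29, 95, 32, 30, 26, 31, 29)

# boxplot.stats() returns the five-number summary in $stats and the
# points lying more than 1.5 * IQR beyond the hinges in $out
stats <- boxplot.stats(monthly_counts)
print(stats$stats)
print(stats$out)
```

Whether a flagged point is a data error or a genuine extreme event is a judgment that needs domain knowledge; the rule only points to where to look.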

hist(data[which(data$"year" == 2011),]$"month",breaks = "Sturges",col=c('blue','red','green'),labels=T) 
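With breaks = "Sturges", hist() picks the bin edges itself, so the bins may group neighboring months together. If one bar per calendar month is wanted, one option is to set the breaks explicitly. The sketch below uses hypothetical month numbers and plot = FALSE so that only the counts are inspected.

```r
# Hypothetical month numbers (1 to 12) for incidents in a single year
months <- c(1, 1, 2, 3, 3, 3, 5, 6, 6, 7, 9, 9, 9, 9, 11, 12)

# Breaks at half-integers give exactly one bin per calendar month
h <- hist(months, breaks = seq(0.5, 12.5, by = 1), plot = FALSE)

# One count per month, January first
print(h$counts)
```

Dropping plot = FALSE draws the histogram with the same bins.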

[Figure: histogram]

The above histogram depicts the month-wise distribution of incidents that occurred in 2011.

Trend analysis

[Figure: trend analysis]

From the above graph, we can draw the following inferences:

  • The sharp fall in the data in 2012 might be due to incidents not being captured
  • An average of 30 incidents occur monthly
  • In the month of Feb there is a sharp fall in the incidents
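The post does not show the code behind the trend graph. One possible way to produce a monthly trend, assuming one row per incident with a parsed date, is sketched below. The dates are hypothetical and this is not necessarily how the original graph was built.

```r
# Hypothetical incident dates, one entry per incident
dates <- as.Date(c("2010-01-05", "2010-01-20", "2010-02-11",
                   "2010-03-02", "2010-03-15", "2010-03-28"))

# Count incidents per calendar month
per_month <- table(format(dates, "%Y-%m"))
print(per_month)

# Average number of incidents per month
avg_per_month <- mean(as.integer(per_month))
print(avg_per_month)

# Line plot of the monthly counts with month labels on the x axis
plot(as.integer(per_month), type = "l", xaxt = "n",
     xlab = "Month", ylab = "Incidents")
axis(1, at = seq_along(per_month), labels = names(per_month))
```

One caveat with this approach: a month with no incidents does not appear in the table at all, so gaps in the data need to be checked separately.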

[Figure: trend analysis]

In the above trend graph, the red line plots the number of incidents against the number of people on the deck.

This suggests that there is no clear relation between incidents occurring and the number of people on the deck.
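A visual impression like this can be backed up with a scatter plot and a correlation coefficient. The sketch below uses hypothetical paired values, not the data from the post, and Pearson correlation is only one of several possible measures.

```r
# Hypothetical paired observations: people on the deck and incident counts
people_on_deck <- c(12, 18, 25, 31, 40, 46, 52, 60)
incidents      <- c(30, 27, 33, 29, 31, 28, 32, 30)

# Scatter plot of the two variables
plot(people_on_deck, incidents, col = "red",
     xlab = "People on the deck", ylab = "Incidents")

# Pearson correlation: values near 0 suggest no linear relation
r <- cor(people_on_deck, incidents)
print(round(r, 2))
```

A correlation near zero only rules out a linear relation, so the scatter plot is still worth inspecting for other patterns.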

Hope the above post gives you a very good understanding of how exploratory analysis can be done.

Please do suggest a few more techniques.

Also, please find more posts at www.dataperspective.info


Comment by Sagar Diwakar Uparkar on December 4, 2014 at 7:39am
Hi, good evening. Could you please, brief on how to box plot use for variable selection? You can consider above example.
Comment by Chandrasekhara S. "C.S." Ganti on March 16, 2014 at 2:20pm

Nice post, brings back old memories, the  fundamental and first principles all brought out nicely for the budding Data analysts / Scientists -- whatever the new lingo is (would be)

© 2019   AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC