# AnalyticBridge

A Data Science Central Community

# Basic Data Exploration in R

When you're cleaning up data, you usually end up using the same 5-8 functions a ton of times, and then a few more once or twice. Here are the functions I find myself using again and again.

Here is a quick overview:

names() - returns the column names of a dataset

str() - gives an overview of a dataset's structure

data.table package - includes functions for creating new columns, among other things

%in% operator - checks if a value is in a vector

Below are some examples. The dataset 'rock' is built into R.

`> names(rock)    # returns the column names`

`[1] "area"  "peri"  "shape" "perm"`

`> str(rock)    # gives the format of the dataframe`

`'data.frame': 48 obs. of 4 variables:`

`$ area : int 4990 7002 7558 7352 7943 7979 9333 8209 8393 6425 ...`

`$ peri : num 2792 3893 3931 3869 3949 ...`

`$ shape: num 0.0903 0.1486 0.1833 0.1171 0.1224 ...`

`$ perm : num 6.3 6.3 6.3 6.3 17.1 17.1 17.1 17.1 119 119 ...`
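names() can also be assigned to, which is a handy way to rename a column while cleaning. Here is a minimal sketch on a copy of 'rock'; the copy name `rockCopy` and the new label "perimeter" are illustrative choices, not something from the dataset itself.

```r
# Work on a copy so the built-in 'rock' dataset stays untouched
rockCopy <- rock

# Rename 'peri' by assigning into names(); the new label is an arbitrary choice
names(rockCopy)[names(rockCopy) == "peri"] <- "perimeter"

names(rockCopy)   # column names after the rename
str(rockCopy)     # same structure as before, with the renamed column
```

Matching on the old name rather than on a position means the rename still hits the right column if the column order changes.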

`# import the data.table package`

`> install.packages("data.table")    # don't forget these 3 steps!`

`> library(data.table)`

`> dtRock <- data.table(rock)`

`> dtRock[1:5]    # returns the first 5 rows`

`   area    peri     shape perm`

`1: 4990 2791.90 0.0903296  6.3`

`2: 7002 3892.60 0.1486220  6.3`

`3: 7558 3930.66 0.1833120  6.3`

`4: 7352 3869.32 0.1170630  6.3`

`5: 7943 3948.54 0.1224170 17.1`

`# and my favorite way to create a new column`

`# area is measured in pixels; dividing by 1000 gives thousands of pixels (true megapixels would be area / 1e6)`

`> dtRock[, areaMP := area / 1000]    `

`> dtRock[1, ]    # indicates the first row, all columns`

`   area   peri     shape perm areaMP`

`1: 4990 2791.9 0.0903296  6.3   4.99`

`> dtRock[, 'areaMP']                 # returns the entire 'areaMP' column`
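The same `:=` operator can add several columns in one call, or build a column from a condition. The sketch below assumes the data.table package is already installed; the column names `areaK`, `logPerm`, and `permClass`, and the cutoff of 100, are made up for illustration.

```r
library(data.table)   # assumes data.table is installed

dtRock <- data.table(rock)

# Add two columns at once using the functional form of :=
dtRock[, `:=`(areaK   = area / 1000,   # area in thousands of pixels
              logPerm = log(perm))]    # natural log of permeability

# Add a label column from a condition (the cutoff of 100 is arbitrary)
dtRock[, permClass := ifelse(perm > 100, "high", "low")]

dtRock[1:3]   # inspect the first 3 rows with the new columns
```

Because `:=` modifies the table by reference, there is no need to assign the result back to `dtRock`.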

`# The %in% operator is one of the most useful functions in R, I think.`

`> a <- c(1,2,3,4)`

`> 4 %in% a    # it's asking, is the value 4 in the vector a?`

`[1] TRUE`
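Where %in% really earns its keep in data cleaning is filtering rows against a list of values. A small sketch in base R follows; the vector name `wanted` and the two permeability values picked from 'rock' are just an example.

```r
a <- c(1, 2, 3, 4)

# %in% is vectorised over its left-hand side: one TRUE/FALSE per element
c(1, 5) %in% a        # TRUE FALSE

# Keep only the rows whose 'perm' matches one of the wanted values
wanted <- c(6.3, 17.1)                 # example values of interest
keep <- rock[rock$perm %in% wanted, ]

# Negate with ! to drop those rows instead
dropped <- rock[!(rock$perm %in% wanted), ]
```

This reads more cleanly than chaining several `==` comparisons with `|`, especially as the list of values grows.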

There are many other functions and packages, such as the 'dplyr' package by Hadley Wickham, but I am just showing the ones I use most frequently.

View the original post, and others from the author here.
