Hands-On Data Wrangling: What, How, and Why
Objective:
Companies are finding that data can be a powerful differentiator, and they are investing heavily in infrastructure, tools and personnel to ingest and curate raw data into an "analyzable" form. This process of data curation is called "Data Wrangling".
This task can be very cumbersome and requires trained personnel. However, with advances in open source and commercial tooling, the process has become much easier and the technical expertise required to do it effectively has dropped considerably.
In this tutorial, we will get a feel for what data wranglers do by working through some hands-on exercises with the tools listed below. Given the introductory nature of these examples, the data sets are small and do not necessitate HDFS or cluster computing. In the next few installments of this blog series, I intend to add such examples.
Here are the tools you will need to install on your computer:
- R and R Studio
- Trifacta Wrangler
- OpenRefine
By no stretch of the imagination is this tutorial meant to be the end-all, all-inclusive learning experience for data wrangling tools and strategies; it merely scratches the surface of the plethora of tools we have at our disposal to wrangle data.
Download Resources
- Data Wrangling - Slide Deck
- Data Wrangling - Exercise Code
- Transform Legacy Enterprise to Data Driven Business - Slide Deck
Exercise 1: Fix date formats
Solution:
Using R & R Studio
Exercise 1 Preparation:
# Install the following packages
install.packages("stringr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("tidyr")
Exercise 1a Solution:
# Load the previously installed packages
library(stringr)
library(dplyr)
library(lubridate)
library(tidyr)
# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"
# download the file containing dates
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQ09oSjVMc3kwajQ&export=download", destfile = rawfile)
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
# use mutate to transform content from dates, ignore the warnings.
# the intelligence is embedded in useful functions in lubridate package
# create three separate columns based on date formats and then merge them all
# into one column
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date)) %>%
  unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep = "")
# replace NAs with blank strings
fixed$fixedup <- gsub("NA","", fixed$fixedup)
# View fixed
fixed
# identify those dates where month and day are both less than or equal to 12
fixed_md_dm <- fixed %>% mutate(fixedup, month = month(fixedup), day = day(fixedup), to_be_reviewed = day(fixedup) <= 12 & month(fixedup) <= 12)
fixed_md_dm
---------
Exercise 1b Solution:
# Continuing from previous example
# Alternative approach highlighting functions in R
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
#Define function
fixdates <- function(date) {
  # empty vector to hold the converted dates
  fixed <- c()
  # iterate over the vector containing raw dates
  for (i in 1:length(date)) {
    # is it yyyymmdd?
    if (is.na(ymd(date[i]))) {
      # is it mmddyyyy?
      if (is.na(mdy(date[i]))) {
        # is it ddmmyyyy?
        if (is.na(dmy(date[i]))) {
          # none of the formats matched, so it is undetermined
          fixed[i] <- NA
        } else {
          # assign ddmmyyyy
          fixed[i] <- dmy(date[i])
        }
      } else {
        # assign mmddyyyy
        fixed[i] <- mdy(date[i])
      }
    } else {
      # assign yyyymmdd
      fixed[i] <- ymd(date[i])
    }
  }
  # convert the numeric vector back to Date
  class(fixed) <- "Date"
  # return vector containing converted dates
  fixed
}
# invoke function to convert raw dates
dates$FixedDate <- fixdates(dates$date)
# view contents
dates
---------
Exercise 1c Solution:
# Continuing from previous example
# Extension exercise: add three different dates
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, stringsAsFactors = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
# adding three more dates
dates[nrow(dates)+1,] <- c("01/02/2013")
dates[nrow(dates)+1,] <- c("12021010")
dates[nrow(dates)+1,] <- c("20021010")
# these three return valid results for more than one of yyyymmdd, ddmmyyyy, mmddyyyy!
# should be reviewed by someone who has domain knowledge about this data and
# hence we need to flag such data accordingly
# use mutate to transform content from dates, ignore the warnings.
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date))
# add a new column to know if a certain date observation has to be reviewed
fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy)) + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE))
# extension to above, here we find those which seem to be valid and convert them
fixed %>%
  mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy)) + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE)) %>%
  filter(to_be_reviewed == FALSE) %>%
  unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep = "") %>%
  mutate(fixedup = str_replace_all(fixedup, "NA", "")) %>%
  select(date, fixedup)
Trifacta recipe using "Wrangle"
Exercise 2: Fix currency values
Solution:
Using R and R Studio
# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"
# download sales file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQONW01LXpUWTdmdXM&export=download", rawfile)
# read the file
sales <- read.csv(rawfile, stringsAsFactors = FALSE, header = T, sep = "\n")
# There is a lot going on here.
# Create a separate column extracting the currency symbol
# Create another column containing numeric content by replacing comma, $ and £
# Convert to numeric values
# Normalize to USD by converting £ to USD using 1.44 conversion ratio
sales %>%
  mutate(currency = str_extract(sales, "\\$|£")) %>%
  mutate(raw_amount = str_replace_all(sales, ",|\\$|£", "")) %>%
  mutate(amount = as.numeric(raw_amount)) %>%
  mutate(convertedAmountInUSD = ifelse(currency == '$', amount, ifelse(currency == '£', round(amount * 1.44, 2), NA)))
Trifacta recipe using "Wrangle"
Exercise 3: Fix Java Log4j log file with exception stack traces
Solution:
Exercise 3 Preparation:
# in addition to the previously loaded libraries
# install another package for working with Excel files; it uses C++ instead of
# an error-prone Java dependency in R
install.packages("openxlsx")
Exercise 3 Solution:
# if you have loaded plyr, there are conflicts with dplyr,
# hence we detach it here
detach("package:plyr", unload=TRUE)
library(openxlsx)
# specify input and output file
rawfile <- "ReplaceWithFileNameIncludingPath"
outputFile <- "ReplaceWithFileNameIncludingPath"
# download the log file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQWlGQzdvbzhNRG8&export=download", rawfile)
# read log file, each line is a row
logs <- readLines(rawfile)
# there should be 311 lines
length(logs)
# vector of characters
class(logs)
# convert into a data frame
logsdf <- as.data.frame(logs, stringsAsFactors = FALSE)
# find structure of the data frame
str(logsdf)
# add a new column with line number
logsdf$ID <- seq.int(nrow(logsdf))
# peek at first 6 lines
head(logsdf)
# look at the class of columns of data frame
sapply(logsdf, class)
# function hacked to group exception lines together
# feel silly hacking such a contrived function...bear with me here
groupLogs <- function(ID) {
  log_group <- c()
  i <- 1
  while (i <= length(ID)) {
    log_group[i] <- ID[i]
    j <- i + 1
    # walk forward while line numbers are consecutive, assigning them to the same group
    while (j <= length(ID)) {
      if (ID[j] - ID[j - 1] == 1) {
        log_group[j] <- ID[i]
      } else {
        log_group[j] <- ID[j]
        i <- j
        break
      }
      j <- j + 1
    }
    # if we have walked past the end of the vector, terminate the outer loop
    # (this also covers the case where the inner loop never runs)
    if (j > length(ID)) i <- j
  }
  log_group
}
# find lines in log file that represent exceptions
exception_logs <- logsdf[str_detect(logsdf$logs, "^[^\\d]"),]
# invoke function to group together exception lines part of the same stack trace
exception_logs$log_grp <- groupLogs(exception_logs$ID)
# group all the lines in the same exception into a single collapsed row,
# separated by a carriage return
exception_logs <- exception_logs %>%
  select(logs, log_grp) %>%
  group_by(log_grp) %>%
  summarise(stacktrace = paste(logs, collapse = "\r"))
# little math trick to help merge the exception into the log line preceding
# this exception
exception_logs$log_grp <- ifelse(exception_logs$log_grp == 1,
                                 exception_logs$log_grp, exception_logs$log_grp - 1)
# log lines that begin with timestamp
proper_logs <- logsdf[str_detect(logsdf$logs, "^[\\d]"),]
# split the log message as a separate column using " - " separator
proper_logs <- proper_logs %>% separate(logs, c("logs", "message" ), sep = "\\s-\\s")
# split the log text into time, thread, level and class
proper_logs <- proper_logs %>% separate(logs, c("time", "thread", "level", "class" ), sep = "\\s+")
# left outer join to include exceptions by joining the two data frames created
# earlier
logs <- merge(proper_logs, exception_logs, by.x = c("ID"), by.y = c("log_grp"), all.x = TRUE)
# write the excel file, example /Users/akuntamu/temp/text.xls
write.xlsx(logs, outputFile)
Trifacta recipe using "Wrangle"
- Requires a two-step approach:
- Create an intermediate data set that distinguishes the regular log statements, which begin with a timestamp, from the exception stack traces
- Split rows based on the dummy identifier Q added in the previous step. This technique can help span multi-row observations/records (a minimal R sketch of the same idea follows this list).
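For readers following the R track, here is a minimal sketch of that multi-row grouping idea, assuming dplyr and stringr are already loaded. The toy data frame, column names and marker pattern below are hypothetical and only illustrate the technique; they are not part of the exercise data or the Trifacta recipe.
# toy example: lines starting with a digit begin a new record (hypothetical data)
toy <- data.frame(text = c("1 first record", "continuation A", "2 second record", "continuation B"), stringsAsFactors = FALSE)
# flag the lines that start a new record, analogous to the dummy identifier above
toy$starts_record <- str_detect(toy$text, "^\\d")
# a running count of record starts gives every line a group id
toy$group <- cumsum(toy$starts_record)
# collapse each group into a single row, just like the summarise step in Exercise 3
toy %>% group_by(group) %>% summarise(record = paste(text, collapse = "\r"))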
Exercise 4: Web Scraping Top 50 pop songs in the last decade
- Web Page
- If the above page is unavailable, use this cached page
Solution:
Using R & R Studio
Exercise 4 Preparation:
# Install the following packages
install.packages("gtools")
install.packages("rvest")
Exercise 4a Solution:
library(rvest)
library(tidyr)
library(stringr)
library(gtools)
# Scrape the page listing top 50 pop hits in the last decade
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
# If the above does not work, I have a cached copy of this page here.
# scraping_top50_pop <- read_html("")
# Scraping using the rvest package (which was inspired by BeautifulSoup, the Python library)
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")
# split the rank and rest of the text
matrix_top_50_pop <- str_split_fixed(top_50_pop, "\\.", 2)
# supply column names
colnames(matrix_top_50_pop) <- c("rank", "text")
# convert to a data frame
top_50_pop_df <- as.data.frame(matrix_top_50_pop, stringsAsFactors = FALSE)
# convert rank from character to numeric
top_50_pop_df$rank <- as.numeric(top_50_pop_df$rank)
# split the artist based on : delimiter
top_50_pop_df <- top_50_pop_df %>% separate(col = text, into = c("artist", "text"), sep = ":")
# split lines that contains all three fields: title, company, year
top_50_pop_df1 <- top_50_pop_df[which(top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "company", "year"), sep = "\\[|\\]")
# split lines that contain only title and year
top_50_pop_df2 <- top_50_pop_df[which(!top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "year"), sep = "\\s\\(")
# use smartbind to combine the above data frames since company is not a common column in both
top_50_pop_combined <- smartbind(top_50_pop_df1,top_50_pop_df2)
# edit the year field to trim whitespace and remove parentheses
top_50_pop_combined$year <- apply(top_50_pop_combined[,"year", drop = FALSE], 2, function(year) {
str_replace_all(trimws(year, which = c("both")), "\\(|\\)", "")
}
)[,"year"]
# trim artist
top_50_pop_combined$artist <- trimws(top_50_pop_combined$artist, which = "both")
# trim title
top_50_pop_combined$title <- trimws(top_50_pop_combined$title, which = "both")
# sort the final data frame using rank in the descending order
top_50_pop_final <- arrange(top_50_pop_combined, -rank)
# print
top_50_pop_final
# Provide output CSV file name
# Example outputCSV <- "/Users/akuntamu/test/top-50-pop.csv"
outputCSV <- "ReplaceWithFileNameIncludingPath"
# export to a CSV
write.table(top_50_pop_final, outputCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")
Trifacta recipe using "Wrangle":
Since Trifacta does not have data ingestion capability at the time of this writing, let's perform the first few steps of data scraping in R and create a raw data file using the following commands.
Exercise 4b Solution:
library(dplyr)
library(rvest)
library(tidyr)
library(stringr)
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")
# replace outputRawCSV
# Example: outputRawCSV <- "/Users/akuntamu/test/top-50-pop-raw.csv"
outputRawCSV <- "ReplaceWithFileNameIncludingPath"
write.table(as.data.frame(top_50_pop), outputRawCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")
Now upload this file as an input data set into Trifacta Wrangler, create a project, and use the following script.
Exercise 5: Web Scraping Using OpenRefine and R
This is a simple, concocted exercise to consolidate the entire 3-day agenda on this page into a simple table showing time slot and session information.
Solution:
Using Open Refine
- Launch the OpenRefine application
- This starts a Jetty web server on your machine, which hosts the OpenRefine web app
- Go to http://localhost:3333
- Create Project
- Using Clipboard
- Paste the following link:
- http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html
- If not available, cached copy
- Follow the prompts and select UTF-8 character encoding
- Create project
- The recipe for data scraping and transformation can be downloaded from here.
- Once the project is created successfully, you will find the Undo/Redo link
- Click on that link to find the Apply button
- Paste the content of the downloaded recipe file into the text box in the dialog and click Perform Operations (a rough sketch of what such a recipe file looks like follows this list)
- It will take a few seconds and run through the 14 steps.
- We will review these 14 steps and walk through them individually
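For reference, the downloaded recipe is an OpenRefine operation history in JSON form. The snippet below is only a rough, hypothetical illustration of the shape of a single step in such a file; the column name, expression and description are placeholders, and it is not one of the actual 14 steps in the recipe.
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column Column 1",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "Column 1",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]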
Using R and R Studio
Exercise 5 Preparation:
# Install the following packages
# you may be prompted to restart the R session; accept it
install.packages("xml2")
Exercise 5 Solution:
# reloading several packages since R session restarted
library(rvest)
library(xml2)
library(dplyr)
library(gtools)
library(openxlsx)
# read html content
webpage <- read_html("http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html")
# extract all tables
tbls <- html_nodes(webpage, "table")
# inspect tables
head(tbls)
# fetch only tables 1, 3 and 4 since the 2nd and 5th are not useful
tbls_list <- tbls %>% .[c(1,3,4)] %>% html_table(fill=TRUE)
# structure of returned object
str(tbls_list)
# set up column names for each of the data frames representing tables
colnames(tbls_list[[1]]) <- c("time-slot", "session-info")
colnames(tbls_list[[2]]) <- c("time-slot", "session-info")
colnames(tbls_list[[3]]) <- c("time-slot", "session-info")
# smartly append three tables
agenda <- smartbind(tbls_list[[1]],tbls_list[[2]], tbls_list[[3]])
# inspect structure again
str(agenda)
# there is a row with NA, the following command shows how to filter complete rows
agenda <- agenda[complete.cases(agenda),]
# replace outputXLS
# Example: outputXLS <- "/Users/akuntamu/test/agenda.xls"
outputXLS <- "ReplaceWithFileNameIncludingPath"
# write it out
write.xlsx(agenda, outputXLS)
This concludes this installment of data wrangling exercises using R, R Studio and Trifacta. I hope this whets your appetite for learning more about data wrangling. In future installments, we will look into dealing with missing values, identifying outliers, etc., using other technologies such as Python pandas, OpenRefine and/or freeware offerings from commercial vendors to get more hands-on expertise.
Feel free to post questions, comments, suggestions for improvement on Twitter. I am reachable @akuntamukkala.
Look forward to hearing from you!