Hands-On Data Wrangling: What, How, and Why
Objective:
Companies are finding that data can be a powerful differentiator, and they are investing heavily in infrastructure, tools and personnel to ingest and curate raw data into an "analyzable" form. This process of data curation is called "Data Wrangling".
This task can be very cumbersome and requires trained personnel. However, with advances in open source and commercial tooling, the process has become much easier and the technical expertise required to do it effectively has dropped considerably.
In this tutorial, we will get a feel for what data wranglers do by working through some hands-on exercises with the tools listed below. Given the introductory nature of these examples, the data sets are small and do not necessitate HDFS or cluster computing. In the next few installments of this blog series, I intend to add such examples.
Here are the tools you will need to install on your computer:
- R and R Studio
- Trifacta Wrangler
- OpenRefine
By no stretch of the imagination is this tutorial meant to be the end-all, all-inclusive learning experience for data wrangling tools and strategies; it merely scratches the surface of the plethora of tools we have at our disposal to wrangle data.
Download Resources
- Data Wrangling - Slide Deck
- Data Wrangling - Exercise Code
- Transform Legacy Enterprise to Data Driven Business - Slide Deck
Exercise 1: Fix date formats
Solution:
Using R & R Studio
Exercise 1 Preparation:
# Install the following packages
install.packages("stringr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("tidyr")
Exercise 1a Solution:
# Load the previously installed packages
library(stringr)
library(dplyr)
library(lubridate)
library(tidyr)
# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"
# download the file containing dates
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQ09oSjVMc3kwajQ&export=download", destfile = rawfile)
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
# use mutate to transform content from dates, ignore the warnings.
# the intelligence is embedded in useful functions in lubridate package
# create three separate columns based on date formats and then merge them all
# into one column
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date)) %>%
  unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep = "")
# replace NAs with blank strings
fixed$fixedup <- gsub("NA","", fixed$fixedup)
# View fixed
fixed
# identify those dates where month and day are both less than or equal to 12
fixed_md_dm <- fixed %>% mutate(fixedup, month = month(fixedup), day = day(fixedup), to_be_reviewed = day(fixedup) <= 12 & month(fixedup) <= 12)
fixed_md_dm
---------
Exercise 1b Solution:
# Continuing from previous example
# Alternative approach highlighting functions in R
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
#Define function
fixdates <- function(date) {
  # empty vector to hold the converted dates
  fixed <- c()
  # iterate over the vector containing raw dates
  for (i in 1:length(date)) {
    # is it yyyymmdd?
    if (is.na(ymd(date[i]))) {
      # is it mmddyyyy?
      if (is.na(mdy(date[i]))) {
        # is it ddmmyyyy?
        if (is.na(dmy(date[i]))) {
          # none of the formats matched, so it is undetermined
          fixed[i] <- NA
        } else {
          # assign ddmmyyyy
          fixed[i] <- dmy(date[i])
        }
      } else {
        # assign mmddyyyy
        fixed[i] <- mdy(date[i])
      }
    } else {
      # assign yyyymmdd
      fixed[i] <- ymd(date[i])
    }
  }
  # convert the numeric vector back to Date
  class(fixed) <- "Date"
  # return vector containing converted dates
  fixed
}
# invoke function to convert raw dates
dates$FixedDate <- fixdates(dates$date)
# view contents
dates
---------
Exercise 1c Solution:
# Continuing from previous example
# Extension exercise: add three different dates
# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, stringsAsFactors = FALSE, col.names = c("date"))
# there should be 22
length(dates$date)
# adding three more dates
dates[nrow(dates)+1,] <- c("01/02/2013")
dates[nrow(dates)+1,] <- c("12021010")
dates[nrow(dates)+1,] <- c("20021010")
# these three return valid results for more than one of yyyymmdd, ddmmyyyy, mmddyyyy!
# should be reviewed by someone who has domain knowledge about this data and
# hence we need to flag such data accordingly
# use mutate to transform content from dates, ignore the warnings.
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date))
# add a new column to know if a certain date observation has to be reviewed
fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy)) + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE))
# extension to above, here we find those which seem to be valid and convert them
fixed %>%
  mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy)) + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE)) %>%
  filter(to_be_reviewed == FALSE) %>%
  unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep = "") %>%
  mutate(fixedup = str_replace_all(fixedup, "NA", "")) %>%
  select(date, fixedup)
Trifacta recipe using "Wrangle"
Exercise 2: Fix currency values
Solution:
Using R and R Studio
# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"
# download sales file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQONW01LXpUWTdmdXM&export=download", rawfile)
# read the file
sales <- read.csv(rawfile, stringsAsFactors = FALSE, header = T, sep = "\n")
# There is a lot going on here.
# Create a separate column extracting the currency symbol
# Create another column containing numeric content by replacing comma, $ and £
# Convert to numeric values
# Normalize to USD by converting £ to USD using 1.44 conversion ratio
sales %>%
  mutate(currency = str_extract(sales, "\\$|£")) %>%
  mutate(raw_amount = str_replace_all(sales, ",|\\$|£", "")) %>%
  mutate(amount = as.numeric(raw_amount)) %>%
  mutate(convertedAmountInUSD = ifelse(currency == '$', amount, ifelse(currency == '£', round(amount * 1.44, 2), NA)))
Trifacta recipe using "Wrangle"
Exercise 3: Fix Java Log4j log file with exception stack traces
Solution:
Exercise 3 Preparation:
# in addition to the previously loaded libraries
# install another package for working with Excel files; it uses C++ instead of
# an error-prone Java dependency in R
install.packages("openxlsx")
Exercise 3 Solution:
# if you have loaded plyr, there are conflicts with dplyr,
# hence we detach it here
detach("package:plyr", unload=TRUE)
library(openxlsx)
# specify input and output file
rawfile <- "ReplaceWithFileNameIncludingPath"
outputFile <- "ReplaceWithFileNameIncludingPath"
# download the log file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQWlGQzdvbzhNRG8&export=download", rawfile)
# read log file, each line is a row
logs <- readLines(rawfile)
# there should be 311 lines
length(logs)
# vector of characters
class(logs)
# convert into a data frame
logsdf <- as.data.frame(logs, stringsAsFactors = FALSE)
# find structure of the data frame
str(logsdf)
# add a new column with line number
logsdf$ID <- seq.int(nrow(logsdf))
# peek at first 6 lines
head(logsdf)
# look at the class of columns of data frame
sapply(logsdf, class)
# function hacked to group exception lines together
# feel silly hacking such a contrived function...bear with me here
groupLogs <- function(ID) {
  log_group <- c()
  i <- 1
  while (i <= length(ID)) {
    log_group[i] <- ID[i]
    j <- i + 1
    # walk forward while line numbers are consecutive, assigning them to the same group
    while (j <= length(ID)) {
      if (ID[j] - ID[j - 1] == 1) {
        log_group[j] <- ID[i]
      } else {
        log_group[j] <- ID[j]
        i <- j
        break
      }
      j <- j + 1
    }
    # if we have walked past the end of the vector, terminate the outer loop
    # (this also covers the case where the inner loop never runs)
    if (j > length(ID)) i <- j
  }
  log_group
}
# find lines in log file that represent exceptions
exception_logs <- logsdf[str_detect(logsdf$logs, "^[^\\d]"),]
# invoke function to group together exception lines part of the same stack trace
exception_logs$log_grp <- groupLogs(exception_logs$ID)
# group all the lines in the same exception into a single collapsed row,
# separated by a carriage return
exception_logs <- exception_logs %>%
  select(logs, log_grp) %>%
  group_by(log_grp) %>%
  summarise(stacktrace = paste(logs, collapse = "\r"))
# little math trick to help merge the exception into the log line preceding
# this exception
exception_logs$log_grp <- ifelse(exception_logs$log_grp == 1,
                                 exception_logs$log_grp, exception_logs$log_grp - 1)
# log lines that begin with timestamp
proper_logs <- logsdf[str_detect(logsdf$logs, "^[\\d]"),]
# split the log message as a separate column using " - " separator
proper_logs <- proper_logs %>% separate(logs, c("logs", "message" ), sep = "\\s-\\s")
# split the log text into time, thread, level and class
proper_logs <- proper_logs %>% separate(logs, c("time", "thread", "level", "class" ), sep = "\\s+")
# left outer join to include exceptions by joining the two data frames created
# earlier
logs <- merge(proper_logs, exception_logs, by.x = c("ID"), by.y = c("log_grp"), all.x = TRUE)
# write the excel file, example /Users/akuntamu/temp/text.xls
write.xlsx(logs, outputFile)
Trifacta recipe using "Wrangle"
- Requires a two-step approach:
- Create an intermediate data set that distinguishes the regular log statements, which begin with a timestamp, from the exception stack traces
- Split rows based on the dummy identifier Q added in the previous step. This technique can help span multi-row observations/records (a minimal R sketch of the same idea follows this list).
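For readers following the R track, here is a minimal sketch of that multi-row grouping idea, assuming dplyr and stringr are already loaded. The toy data frame, column names and marker pattern below are hypothetical and only illustrate the technique; they are not part of the exercise data or the Trifacta recipe.
# toy example: lines starting with a digit begin a new record (hypothetical data)
toy <- data.frame(text = c("1 first record", "continuation A", "2 second record", "continuation B"), stringsAsFactors = FALSE)
# flag the lines that start a new record, analogous to the dummy identifier above
toy$starts_record <- str_detect(toy$text, "^\\d")
# a running count of record starts gives every line a group id
toy$group <- cumsum(toy$starts_record)
# collapse each group into a single row, just like the summarise step in Exercise 3
toy %>% group_by(group) %>% summarise(record = paste(text, collapse = "\r"))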
Exercise 4: Web Scraping Top 50 pop songs in the last decade
- Web Page
- If the above page is unavailable, use this cached page
Solution:
Using R & R Studio
Exercise 4 Preparation:
# Install the following packages
install.packages("gtools")
install.packages("rvest")
Exercise 4a Solution:
library(rvest)
library(tidyr)
library(stringr)
library(gtools)
# Scrape the page listing top 50 pop hits in the last decade
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
# If the above does not work, I have a cached copy of this page here.
# scraping_top50_pop <- read_html("")
# Scraping using the rvest package (which was inspired by BeautifulSoup, the Python library)
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")
# split the rank and rest of the text
matrix_top_50_pop <- str_split_fixed(top_50_pop, "\\.", 2)
# supply column names
colnames(matrix_top_50_pop) <- c("rank", "text")
# convert to a data frame
top_50_pop_df <- as.data.frame(matrix_top_50_pop, stringsAsFactors = FALSE)
# convert rank from character to numeric
top_50_pop_df$rank <- as.numeric(top_50_pop_df$rank)
# split the artist based on : delimiter
top_50_pop_df <- top_50_pop_df %>% separate(col = text, into = c("artist", "text"), sep = ":")
# split lines that contains all three fields: title, company, year
top_50_pop_df1 <- top_50_pop_df[which(top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "company", "year"), sep = "\\[|\\]")
# split lines that contain only title and year
top_50_pop_df2 <- top_50_pop_df[which(!top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "year"), sep = "\\s\\(")
# use smartbind to combine the above data frames since company is not a common column in both
top_50_pop_combined <- smartbind(top_50_pop_df1,top_50_pop_df2)
# edit the year field to trim whitespace and remove parentheses
top_50_pop_combined$year <- apply(top_50_pop_combined[,"year", drop = FALSE], 2, function(year) {
str_replace_all(trimws(year, which = c("both")), "\\(|\\)", "")
}
)[,"year"]
# trim artist
top_50_pop_combined$artist <- trimws(top_50_pop_combined$artist, which = "both")
# trim title
top_50_pop_combined$title <- trimws(top_50_pop_combined$title, which = "both")
# sort the final data frame using rank in the descending order
top_50_pop_final <- arrange(top_50_pop_combined, -rank)
# print
top_50_pop_final
# Provide output CSV file name
# Example outputCSV <- "/Users/akuntamu/test/top-50-pop.csv"
outputCSV <- "ReplaceWithFileNameIncludingPath"
# export to a CSV
write.table(top_50_pop_final, outputCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")
Trifacta recipe using "Wrangle":
Since Trifacta does not have data ingestion capability at the time of this writing, let's perform the first few steps of data scraping in R and create a raw data file using the following commands.
Exercise 4b Solution:
library(dplyr)
library(rvest)
library(tidyr)
library(stringr)
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")
# replace outputRawCSV
# Example: outputRawCSV <- "/Users/akuntamu/test/top-50-pop-raw.csv"
outputRawCSV <- "ReplaceWithFileNameIncludingPath"
write.table(as.data.frame(top_50_pop), outputRawCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")
Now upload this file as an input data set into Trifacta Wrangler, create a project, and use the following script.
Exercise 5: Web Scraping Using OpenRefine and R
This is a simple, concocted exercise to consolidate the entire 3-day agenda on this page into a simple table showing time slot and session information.
Solution:
Using Open Refine
- Launch the OpenRefine application
- This starts a Jetty web server on your machine, which hosts the OpenRefine web app
- Go to http://localhost:3333
- Create Project
- Using Clipboard
- Paste the following link:
- http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html
- If not available, cached copy
- Follow the prompts and select UTF-8 character encoding
- Create project
- The recipe for data scraping and transformation can be downloaded from here.
- Once the project is created successfully, you will find the Undo/Redo link
- Click on that link to find the Apply button
- Paste the content of the downloaded recipe file into the text box in the dialog and click Perform Operations (a rough sketch of what such a recipe file looks like follows this list)
- It will take a few seconds and run through the 14 steps.
- We will review these 14 steps and walk through them individually
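For reference, the downloaded recipe is an OpenRefine operation history in JSON form. The snippet below is only a rough, hypothetical illustration of the shape of a single step in such a file; the column name, expression and description are placeholders, and it is not one of the actual 14 steps in the recipe.
[
  {
    "op": "core/text-transform",
    "description": "Trim whitespace in column Column 1",
    "engineConfig": { "facets": [], "mode": "row-based" },
    "columnName": "Column 1",
    "expression": "value.trim()",
    "onError": "keep-original",
    "repeat": false,
    "repeatCount": 10
  }
]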
Using R and R Studio
Exercise 5 Preparation:
# Install the following packages
# you may be prompted to restart the R session; accept it
install.packages("xml2")
Exercise 5 Solution:
# reloading several packages since R session restarted
library(rvest)
library(xml2)
library(dplyr)
library(gtools)
library(openxlsx)
# read html content
webpage <- read_html("http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html")
# extract all tables
tbls <- html_nodes(webpage, "table")
# inspect tables
head(tbls)
# fetch only tables 1, 3 and 4 since the 2nd and 5th are not useful
tbls_list <- tbls %>% .[c(1,3,4)] %>% html_table(fill=TRUE)
# structure of returned object
str(tbls_list)
# set up column names for each of the data frames representing tables
colnames(tbls_list[[1]]) <- c("time-slot", "session-info")
colnames(tbls_list[[2]]) <- c("time-slot", "session-info")
colnames(tbls_list[[3]]) <- c("time-slot", "session-info")
# smartly append three tables
agenda <- smartbind(tbls_list[[1]],tbls_list[[2]], tbls_list[[3]])
# inspect structure again
str(agenda)
# there is a row with NA, the following command shows how to filter complete rows
agenda <- agenda[complete.cases(agenda),]
# replace outputXLS
# Example: outputXLS <- "/Users/akuntamu/test/agenda.xls"
outputXLS <- "ReplaceWithFileNameIncludingPath"
# write it out
write.xlsx(agenda, outputXLS)
This concludes this installment of data wrangling exercises using R, R Studio and Trifacta. I hope this whets your appetite for learning more about data wrangling. In future installments, we will look into dealing with missing values, identifying outliers, etc., using other technologies such as Python pandas, OpenRefine and/or freeware offerings from commercial vendors to get more hands-on expertise.
Feel free to post questions, comments, suggestions for improvement on Twitter. I am reachable @akuntamukkala.
Look forward to hearing from you!