Friday, May 13, 2016

Data Wrangling Examples

Hands-On Data Wrangling: What, How, and Why

Objective:

Companies are finding that data can be a powerful differentiator and are investing heavily in infrastructure, tools, and personnel to ingest and curate raw data so that it becomes "analyzable". This process of data curation is called "data wrangling".

This task can be very cumbersome and requires trained personnel. However, with advances in open source and commercial tooling, the process has gotten a lot easier and the technical expertise required to do it effectively has dropped several notches.

In this tutorial, we will get a feel for what data wranglers do and work through some hands-on exercises using the tools below. Given the introductory nature of these examples, the data sets are small and do not necessitate HDFS or cluster computing. In the next few installments of this blog series, I intend to add such examples.

Here are the tools you will need to install on your computer:

Tool              | Version       | Download & Install Instructions                     | Type
R language        | 3.2.4         | https://cran.rstudio.com/                           | Open Source
R Studio          | 0.99.887      | https://www.rstudio.com/products/rstudio/download/ | Open Source
OpenRefine        | 2.6           | http://openrefine.org/download.html                 | Open Source
Trifacta Wrangler | 3.0.1-client1 | https://www.trifacta.com/trifacta-wrangler          | Commercial but free offering with limited functionality


By no stretch of the imagination is this tutorial meant to be an end-all, all-inclusive learning experience for data wrangling tools and strategies; it merely scratches the surface of the plethora of tools we have at our disposal to wrangle data.




Download Resources

  1. Data Wrangling - Slide Deck
  2. Data Wrangling - Exercise Code
  3. Transform Legacy Enterprise to Data Driven Business - Slide Deck


Exercise 1: Fix date formats
Solution:

Using R & R Studio


Exercise 1 Preparation:

# Install the following packages
install.packages("stringr")
install.packages("dplyr")
install.packages("lubridate")
install.packages("tidyr")

Exercise 1a Solution:

# Load the previously installed packages
library(stringr)
library(dplyr)
library(lubridate)
library(tidyr)


# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"

# download the file containing dates 
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQ09oSjVMc3kwajQ&export=download", destfile = rawfile)

# read the file; since the file does not contain a header, header is set to FALSE
dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))

# there should be 22
length(dates$date)

# use mutate to transform content from dates, ignore the warnings. 
# the intelligence is embedded in useful functions in lubridate package
# create three separate columns based on date formats and then merge them all 
# into one column 
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date)) %>% unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep ="")
  
# replace NAs with blank strings
fixed$fixedup <- gsub("NA","", fixed$fixedup)

# View fixed
fixed

# identify those dates where month and day are both less than or equal to 12
fixed_md_dm <- fixed %>% mutate(fixedup, month = month(fixedup), day = day(fixedup), to_be_reviewed = day(fixedup) <= 12 & month(fixedup) <= 12)

fixed_md_dm
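
To see why the review flag above matters, here is a minimal sketch (assuming the lubridate package is already loaded) of how ymd(), mdy() and dmy() behave on a single string: each returns NA with a warning when the string does not match its expected order, while an ambiguous string parses successfully under more than one order.

# unambiguous: only ymd() succeeds, the other two return NA with a warning
ymd("2013-07-22")   # "2013-07-22"
mdy("2013-07-22")   # NA

# ambiguous: both mdy() and dmy() succeed, so a human should review it
mdy("01/02/2013")   # "2013-01-02"
dmy("01/02/2013")   # "2013-02-01"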

---------

Exercise 1b Solution:


# Continuing from previous example
# Alternative approach highlighting functions in R

# read the file; since the file does not contain a header, header is set to FALSE

dates <- read.csv(rawfile, header = FALSE, col.names = c("date"))



# there should be 22

length(dates$date)



#Define function

fixdates <- function(date) {

    # empty vector

    fixed <- c()

    # iterate over the vector containing raw dates

    for(i in 1:length(date)) {

        # is it yyyymmdd?

        if(is.na(ymd(date[i]))) {

            # is it mmddyyyy
            if(is.na(mdy(date[i]))) {
                # is it ddmmyyyy

                if(is.na(dmy(date[i]))) {

                    # none matched, so undetermined
                    fixed[i] <- NA
                } else {
                    # assign ddmmyyyy
                    fixed[i] <- dmy(date[i])
                }
            } else {
                # assign mmddyyyy
                fixed[i] <- mdy(date[i])
            }
        } else {
            # assign yyyymmdd
            fixed[i] <- ymd(date[i])
        }
    }
    # convert to date
    class(fixed) <- "Date"
    # return vector containing converted dates
    fixed
}

# invoke function to convert raw dates
dates$FixedDate <- fixdates(dates$date)

# view contents
dates
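
As an aside, lubridate also provides parse_date_time(), which tries a list of candidate orders in a single vectorized call; below is a minimal alternative sketch under the same assumptions. Note that it picks a single interpretation for ambiguous strings rather than flagging them, so the review flag from Exercises 1a and 1c is still worth keeping.

# vectorized alternative: try yyyymmdd, then mmddyyyy, then ddmmyyyy
dates$FixedDate2 <- as.Date(parse_date_time(dates$date, orders = c("ymd", "mdy", "dmy")))

# view contents
dates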

---------

Exercise 1c Solution:

# Continuing from previous example

# Extension exercise: add three different dates


# read the file; since the file does not contain a header, header is set to FALSE

dates <- read.csv(rawfile, header = FALSE, stringsAsFactors = FALSE, col.names = c("date"))



# there should be 22

length(dates$date)



# adding three more dates
dates[nrow(dates)+1,] <- c("01/02/2013")
dates[nrow(dates)+1,] <- c("12021010")
dates[nrow(dates)+1,] <- c("20021010")

# these three return valid dates for more than one of yyyymmdd, ddmmyyyy, mmddyyyy!
# they should be reviewed by someone who has domain knowledge about this data, and
# hence we need to flag such data accordingly


# use mutate to transform content from dates, ignore the warnings. 
fixed <- mutate(dates, yyyymmdd = ymd(date), mmddyyyy = mdy(date), ddmmyyyy = dmy(date))

# add a new column to know if a certain date observation has to be reviewed
fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy))  + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE))

# extension to above, here we find those which seem to be valid and convert them 

fixed %>% mutate(to_be_reviewed = ifelse((!is.na(fixed$yyyymmdd)) + (!is.na(fixed$mmddyyyy))  + (!is.na(fixed$ddmmyyyy)) == 1, FALSE, TRUE)) %>% filter(to_be_reviewed==FALSE) %>%unite("fixedup", yyyymmdd, ddmmyyyy, mmddyyyy, sep ="") %>% mutate(fixedup = str_replace_all(fixedup, "NA","")) %>% select(date, fixedup)
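
The to_be_reviewed flag above boils down to counting how many of the three candidate parses succeeded for each raw date; here is a minimal base-R sketch of that count on its own, using the fixed data frame built above.

# number of formats that produced a valid parse for each raw date
fixed$matches <- rowSums(!is.na(fixed[, c("yyyymmdd", "mmddyyyy", "ddmmyyyy")]))

# exactly one match is unambiguous; more than one needs human review
table(fixed$matches)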



Trifacta recipe using "Wrangle"





    splitrows col: column1 on: '\n'
    replace col: column1 on: `(th)|(st)` with: ''
    derive value: dateformat(column1, 'yyyy-MM-dd') as: 'column2'
    derive value: (day(column2) <= 12) && (month(column2) <= 12) as: 'column3'
    rename col: column2 to: 'Fixed_Dates'
    rename col: column3 to: 'To_Be_Reviewed'

Exercise 2: Fix currency values
Solution:

Using R and R Studio

# Replace this with file name where you want to download the raw file
rawfile <- "ReplaceWithFileNameIncludingPath"

# download sales file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQONW01LXpUWTdmdXM&export=download", rawfile)

# read the file
sales <- read.csv(rawfile, stringsAsFactors = FALSE, header = T, sep = "\n")

# There is a lot going on here. 
# Create a separate column extracting the currency symbol
# Create another column containing numeric content by replacing comma, $ and £
# Convert to numeric values
# Normalize to USD by converting £ to USD using 1.44 conversion ratio

sales %>% mutate(currency = str_extract(sales,"\\$|£") ) %>% mutate(raw_amount = str_replace_all(sales, ",|\\$|£", "") ) %>% mutate(amount = as.numeric(raw_amount)) %>%mutate(convertedAmountInUSD = ifelse(currency == '$', amount, ifelse(currency == '£', round(amount*1.44,2), NA)))
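
To make the regular expressions in that pipeline more concrete, here is a minimal sketch (assuming stringr is loaded and using a hypothetical raw value) of what the extract and replace steps do to a single entry.

x <- "£1,250.75"                                       # hypothetical raw sales value
str_extract(x, "\\$|£")                                # "£"       -> currency
str_replace_all(x, ",|\\$|£", "")                      # "1250.75" -> raw_amount
as.numeric(str_replace_all(x, ",|\\$|£", "")) * 1.44   # rough GBP -> USD conversion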



Trifacta recipe using "Wrangle"


    splitrows col: column1 on: '\n'
    header
    extract col: sales on: `$|£`
    split col: sales on: `$|£`
    merge col: sales2,sales3 as: 'column1'
    rename col: column1 to: 'amounts'
    settype col: amounts type: 'Float'
    replace col: amounts on: `,` with: ''
    drop col: sales2,sales3
    derive value: (sales1 == '$') ? amounts : ((sales1 == '£') ? (amounts * 1.44) : 'test') as: 'column1'
    settype col: column1 type: 'Float'
    set col: column1 value: numformat(column1, '$##,###.##')


Exercise 3: Fix Java Log4j log file with exception stack traces
Solution:



Using R & R Studio

Exercise 3 Preparation:

# in addition to the previously loaded libraries
# install another package to deal with Excel using cpp instead of 
# error prone java dependency from R 
install.packages("openxlsx")

Exercise 3 Solution:

# if you have loaded plyr, there is a conflict with dplyr,
# hence detaching it
detach("package:plyr", unload=TRUE)

library(openxlsx)

# specify input and output file
rawfile <- "ReplaceWithFileNameIncludingPath"
outputFile <- "ReplaceWithFileNameIncludingPath"

# download the log file
download.file("https://docs.google.com/uc?authuser=0&id=0B6UoSlNMfxQOQWlGQzdvbzhNRG8&export=download", rawfile)

# read log file, each line is a row
logs <- readLines(rawfile)


# read 311 lines
length(logs)

# vector of characters
class(logs)

# convert into a data frame
logsdf <- as.data.frame(logs, stringsAsFactors = FALSE)

# find structure of the data frame
str(logsdf)

# add a new column with line number
logsdf$ID <- seq.int(nrow(logsdf))

# peek at first 6 lines 
head(logsdf)

# look at the class of columns of data frame
sapply(logsdf, class)

# function hacked to group exception lines together 
# feel silly hacking such a contrived function...bear with me here
groupLogs <- function(ID) {
  log_group <- c()
  i <- 1
  while (i <= length(ID)) {
    log_group[i] = ID[i]
    j <- i+1
    while(j <=length(ID)) {
      if(ID[j] - ID[j-1] == 1) {
        log_group[j] <- ID[i]
      } else {
        log_group[j] <- ID[j]
        i <- j
        break
      }
      j <- j+1
      if(j > length(ID))
        i <- j
    }
  }
  log_group
}
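
If the nested while loops feel heavy, the same grouping (label every run of consecutive IDs with the first ID in that run) can be sketched more idiomatically with vectorized base R. This is only an alternative under the same assumptions, not part of the original solution.

groupLogsVectorized <- function(ID) {
  # a new run starts wherever the gap to the previous ID is not exactly 1
  run <- cumsum(c(TRUE, diff(ID) != 1))
  # take the first ID of each run and repeat it across that run
  ID[!duplicated(run)][run]
}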

# find lines in log file that represent exceptions 
exception_logs <- logsdf[str_detect(logsdf$logs, "^[^\\d]"),]

# invoke function to group together exception lines part of the same stack trace
exception_logs$log_grp <- groupLogs(exception_logs$ID)

# group all the lines in the same exception into a single collapsed row, separated
# by carriage return
exception_logs <- exception_logs %>% select(logs, log_grp) %>% group_by(log_grp) %>% summarise(stacktrace = paste(logs, collapse = "\r"))


# little math trick to help merge the exception into the log line previous to
# this exception
exception_logs$log_grp <- ifelse(exception_logs$log_grp == 1,
                                 exception_logs$log_grp, exception_logs$log_grp - 1)

# log lines that begin with timestamp
proper_logs <- logsdf[str_detect(logsdf$logs, "^[\\d]"),]


# split the log message as a separate column using " - " separator 
proper_logs <- proper_logs %>% separate(logs, c("logs", "message" ), sep = "\\s-\\s") 

# split the log text into time, thread, level and class
proper_logs <- proper_logs %>% separate(logs, c("time", "thread", "level", "class" ), sep = "\\s+")


# left outer join to include exceptions by joining the two data frames created
# earlier
logs <- merge(proper_logs, exception_logs, by.x = c("ID"), by.y = c("log_grp"), all.x = TRUE)
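
Since dplyr is already loaded, the same left outer join can also be written with left_join(); a minimal equivalent sketch:

# dplyr equivalent of the merge() above
logs <- left_join(proper_logs, exception_logs, by = c("ID" = "log_grp"))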

# write the excel file, example /Users/akuntamu/temp/text.xls
write.xlsx(logs, outputFile)

Trifacta recipe using "Wrangle"

  • Requires a two step approach. 
  • Create an intermediate data set that distinguishes the regular log statements which begin with a timestamp from the exception stack traces

splitrows col: column1 on: '\n'
set col: column1 value: merge([matches([column1], `{start}{digit}`) ? 'Q' : '', column1])
  • Split rows based on the dummy identifier Q added in the previous step. This technique can help span multi-row observations/records.

    splitrows col: column1 on: 'Q'
    split col: column1 on: `\n`
    split col: column2 on: `{delim-ws}`
    split col: column1 on: ' ' limit: 3
    replace col: column7 on: `{start}{delim}` with: ''
    rename col: column2 to: 'Time'
    rename col: column5 to: 'Thread'
    rename col: column6 to: 'Level'
    rename col: column7 to: 'Class'
    rename col: column4 to: 'Message'
    rename col: column3 to: 'Exception'



Exercise 4: Web Scraping Top 50 pop songs in the last decade
  • Web Page
  • If the above page is unavailable, use this cached page
Solution:

Using R & R Studio


Exercise 4 Preparation:

# Install the following packages
install.packages("gtools")

install.packages("rvest")

Exercise 4a Solution:

library(rvest)
library(tidyr)
library(stringr)
library(gtools)


# Scrape the page listing top 50 pop hits in the last decade
scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
                                
# If the above does not work, I have a cached copy of this page here.
# scraping_top50_pop <- read_html("")

# Scraping using the rvest package (which was inspired by BeautifulSoup, the Python library)
top_50_pop <- html_nodes(scraping_top50_pop, ".big") %>% html_text %>% trimws(which ="both")

# split the rank and rest of the text
matrix_top_50_pop <- str_split_fixed(top_50_pop, "\\.", 2)

# supply column names
colnames(matrix_top_50_pop) <- c("rank", "text")

# convert to a data frame
top_50_pop_df <- as.data.frame(matrix_top_50_pop, stringsAsFactors = FALSE)

# convert rank from character to numeric
top_50_pop_df$rank <- as.numeric(top_50_pop_df$rank)

# split the artist based on : delimiter
top_50_pop_df <- top_50_pop_df %>% separate(col = text, into = c("artist", "text"), sep = ":")

# split lines that contain all three fields: title, company, year
top_50_pop_df1 <- top_50_pop_df[which(top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title", "company", "year"), sep = "\\[|\\]")

# split lines that contain only title and year
top_50_pop_df2 <- top_50_pop_df[which(!top_50_pop_df$text %>% str_detect("\\[.*\\]")),] %>% separate(col = text, into = c("title",  "year"), sep = "\\s\\(")


# smartly combine the above data frames since company is not a common column in both
top_50_pop_combined <- smartbind(top_50_pop_df1,top_50_pop_df2)

# edit the year field to trim whitespace and remove parentheses
top_50_pop_combined$year <- apply(top_50_pop_combined[,"year", drop = FALSE], 2, function(year) {
str_replace_all(trimws(year, which = c("both")), "\\(|\\)", "")
}
)[,"year"]

# trim artist
top_50_pop_combined$artist <- trimws(top_50_pop_combined$artist, which = "both")

# trim title
top_50_pop_combined$title <- trimws(top_50_pop_combined$title, which = "both")

# sort the final data frame using rank in the descending order
top_50_pop_final <- arrange(top_50_pop_combined, -rank)

# print
top_50_pop_final

# Provide output CSV file name
# Example outputCSV <- "/Users/akuntamu/test/top-50-pop.csv"

outputCSV <- "ReplaceWithFileNameIncludingPath"

# export to a CSV 
write.table(top_50_pop_final, outputCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")


Trifacta recipe using "Wrangle":

Since Trifacta does not have data ingestion capability at the time of this writing, let's perform the first few steps of data scraping in R and create a raw data file using the following commands.

Exercise 4b Solution:

library(dplyr)
library(rvest)
library(tidyr)
library(stringr)

scraping_top50_pop <- read_html("http://www.pastemagazine.com/blogs/lists/2009/11/the-best-albums-of-the-decade.html?a=1")
top_50_pop <- html_nodes(scraping_top50_pop, ".big")  %>% html_text %>% trimws(which ="both")

# replace outputRawCSV 
# Example: outputRawCSV <- "/Users/akuntamu/test/top-50-pop-raw.csv"

outputRawCSV <- "ReplaceWithFileNameIncludingPath"

write.table(as.data.frame(top_50_pop), outputRawCSV, row.names = FALSE, fileEncoding = "UTF-8", sep = "|")

Now upload this file as an input data set into Trifacta Wrangler, create a project, and use the following script:

    splitrows col: column1 on: '\n' quote: '\"'
    replace col: column1 on: /^"/ with: ''
    replace col: column1 on: /"$/ with: ''
    delete row: matches([column1], `top_50_pop`)
    split col: column1 on: `{delim-ws}`
    split col: column3 on: `{delim-ws}`
    split col: column4 on: ` [`
    split col: column5 on: `] `
    extract col: column3 on: ` \({digit}+\)`
    replace col: column3 on: ` \({digit}+\)` with: ''
    merge col: column5,column6 as: 'column7'
    drop col: column5,column6
    set col: column7 value: trim(column7)
    replace col: column7 on: `\(` with: ''
    replace col: column7 on: `\)` with: ''
    rename col: column2 to: 'rank'
    rename col: column1 to: 'artist'
    rename col: column3 to: 'title'
    rename col: column4 to: 'company'
    rename col: column7 to: 'year'
    sort order: -rank
Exercise 5: Web Scraping Using OpenRefine and R

This is a simple, concocted exercise to consolidate the entire 3-day agenda on this page into a simple table showing time slots and session information.

Solution: 

Using Open Refine
  1. Launch the OpenRefine application
  2. This starts a Jetty web server on your machine which hosts the OpenRefine web app
  3. Go to http://localhost:3333
  4. Click Create Project
  5. Choose Clipboard
  6. Paste the following link: 
    1. http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html
    2. If not available, use the cached copy
  7. Follow the prompts and select UTF-8 character encoding
  8. Create the project
  9. The recipe for data scraping and transformation can be downloaded from here
  10. Once the project is created successfully, you will find the Undo/Redo link
  11. Click on that link to find the Apply button
  12. Paste the content of the downloaded file into the text box in the dialog box and click Perform Operations
  13. It will take a few seconds and run through the 14 steps
  14. We will review these 14 steps and walk through them individually
Using R and R Studio

Exercise 5 Preparation:

# Install the following packages
# you may be prompted to restart R session. accept it
install.packages("xml2")

Exercise 5 Solution:

# reloading several packages since the R session restarted
library(rvest)
library(xml2)
library(dplyr)
library(gtools)
library(openxlsx)

# read html content
webpage <- read_html("http://globalbigdataconference.com/64/dallas/big-data-bootcamp/schedule.html")

# extract all tables
tbls <- html_nodes(webpage, "table")

# inspect tables
head(tbls)

# fetch only tables 1, 3, and 4 since the 2nd and 5th are not useful
tbls_list <- tbls %>% .[c(1,3,4)] %>% html_table(fill=TRUE)

# structure of returned object
str(tbls_list)

# set up column names for each of the data frames representing tables
colnames(tbls_list[[1]]) <- c("time-slot", "session-info")
colnames(tbls_list[[2]]) <- c("time-slot", "session-info")
colnames(tbls_list[[3]]) <- c("time-slot", "session-info")

# smartly append three tables
agenda <- smartbind(tbls_list[[1]],tbls_list[[2]], tbls_list[[3]])
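
Since the three data frames share identical column names after the renaming step above, dplyr's bind_rows() is an equivalent alternative here; a minimal sketch:

# dplyr equivalent of smartbind() when the columns line up
agenda <- bind_rows(tbls_list[[1]], tbls_list[[2]], tbls_list[[3]])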

# inspect structure again
str(agenda)

# there is a row with NAs; the following command shows how to keep only complete rows
agenda <- agenda[complete.cases(agenda),]


# replace outputXLS 
# Example: outputXLS <- "/Users/akuntamu/test/agenda.xls"

outputXLS <- "ReplaceWithFileNameIncludingPath"


# write it out
write.xlsx(agenda, outputXLS)


This concludes this installment of data wrangling exercises using R, R Studio and Trifacta. I hope this whets your appetite for learning more about data wrangling. In future installments, we will look into dealing with missing values, identifying outliers, etc. using other technologies such as Python pandas, OpenRefine and/or a freeware offering from a commercial vendor to get more hands-on expertise.

Feel free to post questions, comments, suggestions for improvement on Twitter. I am reachable @akuntamukkala. 

Look forward to hearing from you!
