Sunday, May 8, 2016

Webinar: Data Wrangling - Why, What and How?


I had the pleasure of sharing my insights on "Data Wrangling", a very exciting and dynamic area of data preparation, in a 45-minute complimentary webinar arranged by Global Big Data Conference on May 7th, 2016.

This is a precursor to the 2.5-hour hands-on session I will be doing on May 14th, 2016. You can find additional details here.

In this webinar, I discussed the overarching objective and basics of data wrangling. I reviewed the open source and commercial tools, such as Trifacta, Paxata, DataWatch, and Datameer, that we can potentially use to curate data.

In my opinion, the best way to learn data wrangling is to get familiar with the process, do hands-on exercises with applicable tools on small "raw" data sets, and eventually perform data wrangling at scale by leveraging Big Data infrastructure powered by the Hadoop ecosystem.

If you couldn't attend the webinar yesterday, here is the recording that you can watch at your leisure. 




Please feel free to post comments or share it. I would love to hear your feedback and any suggestions to improve the content presented.

Also, if you'd like me to focus on any particular use cases during the hands-on session, please let me know!

The following section lists the tools you need to have installed on your laptop before the hands-on session on May 14th, 2016.

Hands-On Data Wrangling: What, How, and Why


Here are the tools you will need to install on your computer before we engage in the hands-on session:

| Tool | Version | Download & Install Instructions | Type |
|---|---|---|---|
| R language | 3.2.4 | https://cran.rstudio.com/ | Open Source |
| R Studio | 0.99.887 | https://www.rstudio.com/products/rstudio/download/ | Open Source |
| OpenRefine | 2.6 | http://openrefine.org/download.html | Open Source |
| Trifacta Wrangler | 3.0.1-client1 | https://www.trifacta.com/trifacta-wrangler | Commercial but free offering with limited functionality |

Please install and load the following R packages:
  • stringr
  • dplyr
  • tidyr
  • readxl
  • xlsx
  • lubridate
  • gtools
  • plyr
  • rvest
  • stringdist

You can use the following commands to install and load a package called "plyr" in R Studio.

> install.packages("plyr")
> library(plyr)

Similarly, use the above mentioned commands and replace "plyr" with other package names.
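If you would rather not repeat the install and load commands above ten times, the whole package list can be handled in one go. The following is only a minimal sketch: the helper names `missing_packages` and `install_and_load` are made up for this illustration and are not part of R or of any package. It also assumes you have a working internet connection and a CRAN mirror configured.

```r
# Packages from the list above. plyr is placed before dplyr because
# dplyr warns about conflicts when plyr is loaded after it.
workshop_packages <- c("stringr", "plyr", "dplyr", "tidyr", "readxl",
                       "xlsx", "lubridate", "gtools", "rvest", "stringdist")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs,
                             installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Hypothetical helper: install whatever is missing, then load everything.
install_and_load <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  # character.only = TRUE lets require() take the package name as a string.
  # require() returns TRUE or FALSE, so the result shows which loads worked.
  loaded <- vapply(pkgs, require, logical(1), character.only = TRUE)
  invisible(loaded)
}

# Run only in an interactive session (for example the R Studio console),
# so that sourcing this file from a script does not trigger any installs.
if (interactive()) {
  install_and_load(workshop_packages)
}
```

Any package that shows up as FALSE in the value returned by `install_and_load` did not load and is worth installing by hand with the two commands shown earlier.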


By no stretch of the imagination is this tutorial meant to be an all-inclusive learning experience of data wrangling tools and strategies. It merely scratches the surface of the plethora of tools we have at our disposal to wrangle data.




You will also need to have Java installed (JDK 1.7 or later).



Thank you!
