Sunday, May 8, 2016

Webinar: Data Wrangling - Why, What and How?


I had the pleasure of sharing my insights on "Data Wrangling", a very exciting and dynamic corner of the data preparation industry, in a complimentary 45-minute webinar arranged by the Global Big Data Conference on May 7th, 2016.

This is a precursor to the 2-1/2 hour hands-on session I will be doing on May 14th, 2016. You can find additional details here.

In this webinar, I discussed the overarching objective and basics of data wrangling, and reviewed open source and commercial tools such as Trifacta, Paxata, DataWatch, and Datameer that we can potentially use to curate data.

In my opinion, the best way to learn data wrangling is to get familiar with the process, do hands-on exercises with applicable tools on small "raw" data sets, and eventually perform data wrangling at scale by leveraging Big Data infrastructure powered by the Hadoop ecosystem.
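To make the idea concrete, here is a minimal sketch of the kind of cleanup such a hands-on exercise on a small "raw" data set involves. The session itself uses R and the tools listed further below; this sketch is in Python purely for illustration, and the data set and field names are invented.

```python
import csv
import io

# A tiny "raw" dataset with the kinds of defects wrangling fixes:
# inconsistent casing, stray whitespace, missing values, mixed date formats.
raw = """name, city ,signup_date
 Alice ,dallas,2016-05-01
BOB,  Austin ,05/02/2016
charlie,Houston,
"""

def clean_row(row):
    # Normalize whitespace and casing; standardize dates; flag missing values.
    name = row["name"].strip().title()
    city = row["city"].strip().title()
    date = row["signup_date"].strip()
    if "/" in date:                      # convert MM/DD/YYYY -> YYYY-MM-DD
        m, d, y = date.split("/")
        date = f"{y}-{m}-{d}"
    return {"name": name, "city": city, "signup_date": date or None}

reader = csv.DictReader(io.StringIO(raw), skipinitialspace=True)
# The raw header itself is messy, so normalize the column names too.
reader.fieldnames = [f.strip() for f in reader.fieldnames]
rows = [clean_row(r) for r in reader]
```

The same steps (trim, normalize casing, standardize formats, mark missing values) map directly onto dplyr/stringr/lubridate verbs in R or onto transforms in OpenRefine and Trifacta.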

If you couldn't attend the webinar yesterday, here is the recording that you can watch at your leisure. 




Please feel free to post comments or share it. I would love to hear your feedback and any suggestions to improve the content presented.

Also, if you'd like me to focus on any particular use cases during the hands-on session, please let me know!

Please see the following section for instructions on which tools you need to have installed on your laptop before the hands-on session on May 14th, 2016.

Hands-On Data Wrangling: What, How, and Why


Here are the tools you will need to install on your computer before we engage in the hands-on session:

Tool                Version         Download & Install Instructions                       Type
R language          3.2.4           https://cran.rstudio.com/                             Open Source
R Studio            0.99.887        https://www.rstudio.com/products/rstudio/download/    Open Source
OpenRefine          2.6             http://openrefine.org/download.html                   Open Source
Trifacta Wrangler   3.0.1-client1   https://www.trifacta.com/trifacta-wrangler            Commercial, but free offering with limited functionality

Please install and load the following R packages:
  • stringr
  • dplyr
  • tidyr
  • readxl
  • xlsx
  • lubridate
  • gtools
  • plyr
  • rvest
  • stringdist

You can use the following commands to install and load a package called "plyr" in R Studio.

> install.packages("plyr")
> library(plyr)

Similarly, use the above commands and replace "plyr" with the other package names.


By no stretch of the imagination is this tutorial meant to be an all-inclusive, end-all learning experience for data wrangling tools and strategies; it merely scratches the surface of the plethora of tools we have at our disposal to wrangle data.




You will also need to have Java installed (JDK 1.7 or later).



Thank you!

Thursday, April 14, 2016

My Interview @ Global Big Data Conference, Dallas

One more month to go until the conference, and I am busy preparing slide content and hands-on data wrangling exercises. It is an awesome experience learning how to teach :)

Anyway, here is the link to my interview, which was published a couple of days ago to give the audience a feel for what they can expect.

http://globalbigdataconference.com/news/129000/interview-with-ashwini-kuntamukkala-software-architect-vizient.html

Monday, March 21, 2016

Upcoming talks at Global Big Data Conference in Dallas, May 13-15, 2016

I am excited to share that I will be giving two talks at the Global Big Data Conference in Dallas on the weekend of May 13th, 2016, at the Las Colinas Convention Center. Here are the topics and corresponding abstracts.


Data Wrangling - What, Why and How? [Industry state, business applicability]

Abstract:

"Garbage in -> Garbage out (GIGO)" is a popular quote in the field of computer science. But is that really true?

We are now entering the post-"Big Data" era, where scaling data storage and compute capacity is almost as easy as pushing a button. The ecosystem of data processing tools is getting richer by the day. In such a thriving environment, data is not the new "oil" but the new "soil" in which companies can grow several data-driven business models. Many forward-looking companies are already unlocking the hidden insights in the treasure troves of data they already have, along with publicly available data.

Just as gold is extracted from ore through a very rigorous refinement process, insights have to be discovered from data in its crudest form through an equally rigorous process. This raw data is typically locked up in spreadsheets, web pages, web/machine logs, PDFs, CSVs, TSVs, XML, JSON, Word documents, images, videos, handwritten notes, audio, RDFs, sensor signals, databases, etc.

One can’t perform accurate statistical analysis on inaccurate data. So, in order to facilitate effective use of raw data, there is an upsurge in the marketplace of tools that ease converting raw data into a usable form. This process of converting raw data into a usable form is called "Data Wrangling".

Once this iterative step has produced curated datasets, we can unleash rich analytics, visualizations, etc. to drive the end business objectives.

In this talk, I will cover the essentials of data wrangling, the necessary workflow, and the open source tools available at our disposal, and also provide a comparison of popular commercial vendors in this field based on my own experience and use cases.


Transform your Enterprise into a Data-Driven Digital Business

Abstract: 

Is your company in on “Digital Transformation”? 

This phenomenon is causing enterprises to rethink their strategy as they modernize themselves to stay relevant in an ultra-competitive world. Darwin's notion that "the fittest survive" is more relevant in business today than ever before. The enterprises that adapt and adopt this fundamentally new "culture" will disrupt their own business and stand the test of time.

The reason I say "culture" instead of "strategy" is that Peter Drucker, the father of modern management, said that "culture eats strategy for breakfast". So if the spirit of innovation and agility is missing in a company, no disruptive strategy has a chance to work.

So if you are someone who wants to be a change agent, a catalyst, or a visionary for your company to create more value, enter a new growth spurt, or perhaps even a new market, but are running into walls because the culture does not facilitate that, you may feel as frustrated as a driver trying to steer a parked car.

In this talk, I will share insights from my own experience and from that of the many others I have had the good fortune to collaborate with. This will help you carve out a successful digital transformation roadmap for your company.


You will be empowered with practical insights, tidbits, and result-oriented practices that you can take back and use to start your company’s "digital transformation".

--------

I look forward to hearing your thoughts as I put together content for my talks. Please feel free to reach out on LinkedIn or Twitter [@akuntamukkala] if you have any questions or comments.

See you there!



Tuesday, November 18, 2014

DZone Refcard on Apache Spark

Glad to share that DZone Refcard for Apache Spark is now available for download at http://t.co/s3tNmWPqcr

It is a short digest of what Apache Spark is about and the capabilities it enables for engineers and data scientists.

Monday, November 10, 2014

DZone - Developer of the Week

Over the years I have benefited from the curated content published by DZone. I am amazed by the way the content writers at DZone publish intriguing and engaging content, especially the useful, developer-friendly Refcards.

Recently I got an opportunity to work with DZone content creators to write a Refcard on Apache Spark. It was a fantastic experience and one that I highly recommend for anyone who intends to work with the best in the industry.

As a preview to the Refcard, I was honored to be featured as the developer of the week on DZone. My interview is published here. 

I am very excited about the future of Apache Spark in the Big Data ecosystem.
Per industry trends, many IT professionals are transitioning their careers to Big Data, as companies realize that data is their new currency, one that can potentially unlock the door to new revenue streams.

Herein lies a challenge where many struggle with "separating noise from the voice". Since the Hadoop ecosystem has grown over the last 10 years, it can be a daunting task for anyone who wants to get into this space, because there are so many tools and solutions for the plethora of Big Data use cases.

This is exactly the reason I am excited about Apache Spark. It is a compelling platform that provides a unified approach to solving the most common Big Data use cases, classified into batch, interactive, and real-time data processing.

In the DZone Refcard on Apache Spark, I have catered to new or moderately experienced IT professionals who want to discover the capabilities of Apache Spark. I have included simple hands-on examples and techniques that demonstrate how easily one can become productive with Apache Spark and start solving Big Data use cases.

At SciSpike, we are excited about helping our clients adopt Apache Spark in their Big Data infrastructure, prove its merits, and make it a platform of choice for Big Data applications.


I look forward to hearing your thoughts on the Refcard. It should be out in Nov 2014.

I am reachable via Twitter @akuntamukkala

Thursday, May 29, 2014

ActiveMQ - Network of Brokers Explained - Part 5


In part 4, we saw how to load balance remote consumers on a queue using network connectors.

In this part 5, we will see how the same configuration behaves when we have concurrent remote durable subscribers on a topic. Consider the following configuration.


Fig 1: Network of Brokers - Load balance subscribers on a topic

As shown above, we have Broker-1 which initiates two network connectors to Broker-2 and Broker-3. A producer sends messages to a topic "moo.bar" on Broker-1 while Broker-2 has subscriber C1 and Broker-3 has two subscribers C2 and C3 on the same topic "moo.bar". 

You may observe that this setup is very similar to part 4. The only difference is that here we are dealing with topics, while in part 4 we were dealing with queues.

Let's see this in action


  1. Add the following network connector configuration in Broker-1's activemq.xml configuration file:

     <networkConnectors>
       <networkConnector
           name="T:broker1->broker2"
           uri="static:(tcp://localhost:61626)"
           duplex="false"
           decreaseNetworkConsumerPriority="false"
           networkTTL="2"
           conduitSubscriptions="false"
           dynamicOnly="true">
         <excludedDestinations>
           <queue physicalName="&gt;" />
         </excludedDestinations>
       </networkConnector>
       <networkConnector
           name="T:broker1->broker3"
           uri="static:(tcp://localhost:61636)"
           duplex="false"
           decreaseNetworkConsumerPriority="false"
           networkTTL="2"
           conduitSubscriptions="false"
           dynamicOnly="true">
         <excludedDestinations>
           <queue physicalName="&gt;" />
         </excludedDestinations>
       </networkConnector>
     </networkConnectors>


  2. Let's start broker-2, broker-3 and broker-1 in that order.
  3. akuntamukkala@localhost~/apache-activemq-5.8.0/cluster/broker-2/bin$ ./broker-2 console
  4. akuntamukkala@localhost~/apache-activemq-5.8.0/cluster/broker-3/bin$ ./broker-3 console
  5. akuntamukkala@localhost~/apache-activemq-5.8.0/cluster/broker-1/bin$ ./broker-1 console

  6. Broker-1's admin console connections show that two network connectors have been established as configured from Broker-1 to Broker-2 and Broker-3 respectively
  7. Broker-1's Connections @ http://localhost:8161/admin/connections.jsp







  8. Let's start subscriber C1 on Broker-2, subscribing to topic "moo.bar", and subscribers C2 and C3 on Broker-3, subscribing to the same topic "moo.bar".
  9. Durable subscribers require a unique combination of client ID and subscriber name. In order to create durable subscribers C2 and C3, we need to enhance the functionality provided in /Users/akuntamukkala/apache-activemq-5.8.0/example/src/ConsumerTool.java, where /Users/akuntamukkala/apache-activemq-5.8.0 is the directory where ActiveMQ is installed.
  10. The modification consists of editing build.xml and ConsumerTool.java to add a new parameter, "subscriberName". The edited build.xml and ConsumerTool.java can be obtained from here and here respectively.
  11. Let's start the subscribers now.
  12. akuntamukkala@localhost~/apache-activemq-5.8.0/example$ ant consumer -Durl=tcp://localhost:61626 -Dtopic=true -Dsubject=moo.bar -DclientId=C1 -Ddurable=true -DsubscriberName=mb.C1
  13. akuntamukkala@localhost~/apache-activemq-5.8.0/example$ ant consumer -Durl=tcp://localhost:61636 -Dtopic=true -Dsubject=moo.bar -DclientId=C2 -Ddurable=true -DsubscriberName=mb.C2
  14. akuntamukkala@localhost~/apache-activemq-5.8.0/example$ ant consumer -Durl=tcp://localhost:61636 -Dtopic=true -Dsubject=moo.bar -DclientId=C3 -Ddurable=true -DsubscriberName=mb.C3

  15. Durable subscriber on Broker-2

    http://localhost:9161/admin/subscribers.jsp
  16. Durable subscribers on Broker-3
    http://localhost:10161/admin/subscribers.jsp

  17. Durable subscribers on Broker-1 (because of network connectors)
    http://localhost:8161/admin/subscribers.jsp
  18. Now let's send 10 durable messages to topic moo.bar on Broker-1.
  19. akuntamukkala@localhost~/apache-activemq-5.8.0/example$ ant producer -Durl=tcp://localhost:61616 -Dtopic=true -Dsubject=moo.bar -Dmax=10 -Ddurable=true
  20. See the console on Broker-3
    Log file output on Broker-3
  21. As you may observe, Broker-3 receives the same message twice, once for each subscription C2 and C3. By default, ActiveMQ does not permit processing of duplicate messages.
  22. This happens because both subscriptions mb.C2 and mb.C3 on Broker-3 are propagated to Broker-1. So when 10 messages are published to moo.bar on Broker-1, those messages are sent once per subscription, mb.C2 and mb.C3, to the same broker: Broker-3. Since the messages have the same ID, the duplicate messages are discarded, hence the warning shown in the log output above.
  23. Here is the console showing statistics on Broker-1
    http://localhost:8161/admin/subscribers.jsp

  24. Here is the console showing statistics on Broker-3
    http://localhost:10161/admin/subscribers.jsp

  25. As you can see, even though the enqueue counter shows 20, the dequeue counter shows only 10, since the other 10 messages were discarded by Broker-3. This is a useful feature that helps ensure a message gets processed at most once by a broker. Again, this occurs because both subscriptions C2 and C3 are propagated to the upstream broker, Broker-1.
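The discard behavior above can be sketched as a toy simulation. This is not ActiveMQ code; the Broker class and its bookkeeping are invented for illustration, modeling only the rule that a broker processes a given message ID at most once. With conduitSubscriptions="false", each published message is forwarded to Broker-3 once per propagated subscription, so 20 copies arrive but only 10 are processed.

```python
class Broker:
    """Toy model of at-most-once processing per broker via message IDs."""

    def __init__(self, name):
        self.name = name
        self.seen_ids = set()
        self.delivered = []   # (subscription, message_id) pairs actually processed
        self.discarded = 0    # duplicate copies dropped

    def receive(self, msg_id, subscription):
        # A broker discards any message whose ID it has already processed.
        if msg_id in self.seen_ids:
            self.discarded += 1
            return
        self.seen_ids.add(msg_id)
        self.delivered.append((subscription, msg_id))

broker3 = Broker("Broker-3")
# conduitSubscriptions="false": Broker-1 sees two subscriptions (mb.C2, mb.C3)
# on Broker-3, so each of the 10 published messages is forwarded twice.
for msg_id in range(10):
    for sub in ("mb.C2", "mb.C3"):
        broker3.receive(msg_id, sub)
```

After the loop, the toy broker's "enqueue" count is 20 (all copies received) while only 10 were processed, mirroring the counters in the admin console screenshots.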


Duplicate Messages on Broker-3


Let's retry the same scenario with a minor tweak to the network connector settings: set conduitSubscriptions="true" on both network connectors from Broker-1 to Broker-2 and Broker-3. After restarting the brokers, delete the inactive durable subscribers and then repeat the above steps.

   <networkConnectors>
     <networkConnector
         name="T:broker1->broker2"
         uri="static:(tcp://localhost:61626)"
         duplex="false"
         decreaseNetworkConsumerPriority="false"
         networkTTL="2"
         conduitSubscriptions="true"
         dynamicOnly="true">
       <excludedDestinations>
         <queue physicalName="&gt;" />
       </excludedDestinations>
     </networkConnector>
     <networkConnector
         name="T:broker1->broker3"
         uri="static:(tcp://localhost:61636)"
         duplex="false"
         decreaseNetworkConsumerPriority="false"
         networkTTL="2"
         conduitSubscriptions="true"
         dynamicOnly="true">
       <excludedDestinations>
         <queue physicalName="&gt;" />
       </excludedDestinations>
     </networkConnector>
   </networkConnectors>



The following screenshot shows that Broker-1 now sees only two durable subscribers, one from each of Broker-2 and Broker-3.

Durable Subscribers in Broker-1 when conduitSubscriptions="true"

Upon publishing 10 durable messages on Broker-1, we find that we don't have the same issue of duplicate messages this time. 

As expected, all 10 messages are processed by C1, C2, and C3, as shown by the screenshots below.

Broker-1's Durable Topic Subscribers

Broker-3's Durable Topic Subscribers C2 and C3 receive and process 10 messages each


Hence we have seen how the conduitSubscriptions attribute can help reduce message traffic by avoiding duplicate messages in a network of brokers.
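As a rough sketch of that effect (again a toy model, not ActiveMQ internals; the function and topology names are invented), the number of copies Broker-1 forwards per published topic message can be illustrated as:

```python
def copies_forwarded(subscriptions_by_broker, conduit):
    """How many copies of one topic message Broker-1 sends to each networked broker.

    subscriptions_by_broker maps a downstream broker name to its list of
    durable subscriptions on the topic. With conduitSubscriptions="true" the
    broker consolidates all of a neighbor's subscriptions into one, so it
    forwards a single copy; with "false" it forwards one copy per subscription.
    """
    return {
        broker: (1 if conduit else len(subs))
        for broker, subs in subscriptions_by_broker.items()
        if subs  # no subscriptions on the topic -> nothing forwarded
    }

# The topology from this post: C1 on Broker-2; C2 and C3 on Broker-3.
topology = {"Broker-2": ["mb.C1"], "Broker-3": ["mb.C2", "mb.C3"]}
```

With conduit=False, Broker-3 is sent two copies of every message (the duplicate scenario above); with conduit=True, it is sent one, and Broker-3 fans it out locally to C2 and C3.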


In part 6, we will see how ActiveMQ provides "message replay" capabilities in order to prevent stuck-message scenarios.


Friday, May 9, 2014

Speaking at Global Big Data Conference in Dallas May 11, 2014

I am going to be speaking about Apache Spark at the Global Big Data Conference on May 11th, 2014, from 11:00 am to 12:00 pm at the Irving Convention Center, 500 W Las Colinas Blvd, Irving, TX 75039.

Here is the abstract of the presentation: 

I am impressed with the capabilities Apache Spark enables to unify batch, streaming, and interactive big data use cases. The brilliant folks at the AMPLab at UC Berkeley have created a tremendous solution that takes big data processing to the next level!

Let's see some lightning fast big data analytics powered by Apache Spark!

Look forward to seeing you there!