Ronen Kahana: “You really have to know both the data and the Spark internals to produce efficient code”

Principal Systems Engineer at AT&T’s Chief Data Office, Ronen Kahana has been practicing software development and enterprise architecture since the late ’90s. He is now designing and implementing ML and AI at scale using Scala, Spark, Hadoop, and Kafka.

At Scala Days New York, Ronen, together with Austin Henslee and Johan Muedsam, is hosting the workshop ‘Building Spark Apps With a Flexible, Proven Project Template’. We spoke to Ronen in advance of the workshop about his professional journey, the problems Spark solves, and the upcoming workshop at Scala Days New York.

What’s your background and what does your current role involve?

I have been involved in software development for over 21 years as an engineer, an architect, and a development manager. The vast majority of those years were spent in the Java EE world, although it was called J2EE back when I started. Yes, that is Java 2. Yes, get off my lawn. Most of my roles involved middleware, API development, and data movement frameworks like JMS and message buses.

My current role is Principal Systems Engineer in AT&T Communications’ Chief Data Office (CDO). I joined the CDO (which was then called the Big Data Center of Excellence) four years ago as an enterprise architect. However, I soon figured out that the software engineers were having much more fun than I was working with these evolving technologies, so I went back to coding about two years ago. I’m currently part of a data insights team that delivers machine learning models to detect authentication fraud across all of AT&T’s business channels. Our deliverables are based on Scala, Spark, and various other components of the Hadoop stack.

What’s the biggest highlight of your career so far?

I was fortunate enough to be involved in a few projects that could make that list, but if pressed I would say it was this: in 2000 I was a development team lead for the first version of Cingular Wireless’ eBill site. The technology stack of Java/EJB/WebLogic Tengah/JSP was rather new and buggy, and Stack Overflow was not even a concept; if you had problems, you had to figure them out on your own. Finding your way in the technological wilderness without a reference architecture and well-documented examples was challenging but extremely rewarding. Those were stressful times, but the fact that we were able to launch on schedule was truly remarkable, and I’m very proud of that project.

Why did you pick Spark and what kind of problems does it solve for you?

Using Spark was a natural evolution from our use of the organic Hadoop stack. The in-memory processing gave us considerable performance improvements over our traditional MapReduce jobs and Hive queries. Our engineers loved the native Scala APIs, which made us much more efficient and productive. The interactive shell, the APIs in R and Python, and the ability to integrate with Zeppelin notebooks let our data scientists use Spark interactively during data exploration and model creation; that allowed us to reuse a lot of their code and shorten the timeline between the start of data science work and deploying a working model in production. The fact that we could use the same framework for both batch and streaming gave us a lot of flexibility and a great opportunity to reuse code and skill sets across projects.
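
To make that batch/streaming reuse concrete, here is a minimal, hypothetical sketch (not the team’s actual code): one windowed-aggregation function applied to both a small in-memory batch table and a streaming source. The column names (accountId, eventType, eventTime), the threshold, and the use of the built-in rate source are assumptions made for the example.

```scala
import java.sql.Timestamp

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._

object AuthTransforms {
  // One transformation, written once and reused on both batch and streaming
  // DataFrames: count "login" events per account in 10-minute windows and
  // keep the accounts with suspiciously many attempts. Schema and threshold
  // are illustrative assumptions.
  def flagBurstyAccounts(events: DataFrame): DataFrame =
    events
      .filter(col("eventType") === "login")
      .groupBy(col("accountId"), window(col("eventTime"), "10 minutes"))
      .agg(count(lit(1)).as("attempts"))
      .filter(col("attempts") > 20)
}

object ReuseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-and-streaming-reuse")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Batch path: a tiny in-memory table stands in for historical data.
    val history = Seq(
      ("acct-1", "login", Timestamp.valueOf("2019-06-18 10:00:00")),
      ("acct-1", "login", Timestamp.valueOf("2019-06-18 10:01:00"))
    ).toDF("accountId", "eventType", "eventTime")
    AuthTransforms.flagBurstyAccounts(history).show()

    // Streaming path: the built-in "rate" source keeps the sketch self-contained;
    // in practice this would be a Kafka source producing the same schema.
    val live = spark.readStream.format("rate").load()
      .withColumnRenamed("timestamp", "eventTime")
      .withColumn("eventType", lit("login"))
      .withColumn("accountId", col("value") % 100)

    AuthTransforms.flagBurstyAccounts(live)
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()
      .awaitTermination()
  }
}
```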

What’s the most important challenge Spark developers are facing today?

Developing distributed applications with large datasets can be a real minefield in terms of debugging and performance tuning. This is a challenge I see almost every new Spark developer face.

One wrong join or transformation can bring your application to a crawl; you really have to know both the data and the Spark internals to produce efficient code. You can probably get away with bad code on smaller datasets, but at scale your mistakes are magnified. Debugging Spark applications is no walk in the park either, especially when you have to sort through the logs of hundreds of executors. Code errors often masquerade as OOM exceptions, and the knee-jerk reaction is to throw more resources at the problem when a logic change in the code would suffice.
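
As an illustration of how a single join choice changes the physical plan, here is a small hypothetical sketch (the tables, the data, and the disabled auto-broadcast threshold are assumptions made for the example): hinting a broadcast of the small lookup table avoids shuffling the large side across the network, and explain() makes the difference visible.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinPlanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("join-plan-sketch")
      .master("local[*]")
      // Disable automatic broadcasting so the difference between the plans is visible.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()
    import spark.implicits._

    // Placeholder data: a "large" fact table and a small lookup table.
    val events   = (1 to 100000).map(i => (i % 5, s"event-$i")).toDF("channelId", "payload")
    val channels = Seq((0, "web"), (1, "retail"), (2, "care"), (3, "ivr"), (4, "app"))
      .toDF("channelId", "channelName")

    // Default: a sort-merge join that shuffles the large side across the network;
    // with skewed keys, a few tasks end up doing most of the work.
    events.join(channels, Seq("channelId")).explain()

    // Broadcasting the small side keeps the large table in place and avoids the shuffle.
    events.join(broadcast(channels), Seq("channelId")).explain()

    spark.stop()
  }
}
```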

What’s one thing that could address this challenge?

Better debugging and performance analysis tools. There are a few open-source projects out there that attempt to assist in that area, but integrating that functionality into the native Spark web UI would go a long way toward helping developers understand what their code is doing and how to improve it.
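
In the meantime, one standard lever is Spark’s event logging, which lets the history server replay the web UI after a run has finished. The sketch below shows those settings with a placeholder log directory; it is only a minimal example, not a full tuning setup.

```scala
import org.apache.spark.sql.SparkSession

object EventLogSketch {
  def main(args: Array[String]): Unit = {
    // With event logging enabled, the Spark history server can replay the web UI
    // (jobs, stages, task skew, shuffle sizes) long after the executors are gone.
    val spark = SparkSession.builder()
      .appName("event-log-sketch")
      .config("spark.eventLog.enabled", "true")
      .config("spark.eventLog.dir", "hdfs:///spark-logs") // placeholder; must exist and be writable
      .getOrCreate()

    // ... run the job as usual, then browse the finished run via the history server.
    spark.stop()
  }
}
```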

Who should attend your workshop at Scala Days and why?

Anybody wishing to learn the tools and best practices needed to quickly start developing Scala/Spark applications should attend this workshop. We will base the training on a flexible base project that we have honed over the last few years, using industry-standard tools such as IntelliJ, sbt, Spark, Scala, HBase, Hive, and Kafka. We will explore the various tools and work through exercises that build up attendees’ ability to develop and test complex Scala/Spark applications locally on a laptop, without having to set up an entire cluster.
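
As a small taste of that local workflow (an illustrative sketch, not the workshop’s actual template), a Spark transformation can be exercised end to end on a laptop with a local[*] master and a few rows of in-memory data:

```scala
import org.apache.spark.sql.SparkSession

// A minimal local smoke test: everything runs inside one JVM, so no cluster,
// HDFS, or Kafka is needed. The data and the assertion are illustrative.
object LocalSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("local-smoke-test")
      .master("local[*]") // run driver and executors in this JVM
      .getOrCreate()
    import spark.implicits._

    val input  = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("accountId", "attempts")
    val result = input.groupBy("accountId").sum("attempts").collect()

    assert(result.length == 2, s"expected 2 accounts, got ${result.length}")
    spark.stop()
  }
}
```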

Don’t miss Ronen Kahana, Austin Henslee, and Johan Muedsam and their workshop ‘Building Spark Apps With a Flexible, Proven Project Template’ at Scala Days in New York on June 18th and 19th. Book your ticket now!
