Spark gets automation: Analyzing code and tuning clusters in production
Spark is the hottest big data tool around, and most Hadoop users are moving towards using it in production. Problem is, programming and tuning Spark is hard. But Pepperdata and Alpine Data bring solutions to lighten the load.
Hadoop and MapReduce, the parallel programming paradigm and API originally behind Hadoop, used to be synonymous. Nowadays when we talk about Hadoop, we mostly talk about an ecosystem of tools built around the common file system layer of HDFS, and programmed via Spark.
Spark is the new Hadoop. One of the defining trends of this time, confirmed by both practitioners in the field and surveys, is the en masse move to Spark for Hadoop users. Spark is itself an ecosystem of sorts, offering options for SQL-based access to data, streaming, and machine learning.
People are migrating to Spark for a number of reasons, including an easier programming paradigm. Easier than MapReduce does not necessarily mean easy, though, and there are a number of gotchas when programming and deploying Spark applications.
So why are people migrating to Spark? The top reason seems to be performance: 91 percent of the 1,615 people from over 900 organizations participating in the Databricks Apache Spark Survey 2016 cited this as a reason for using Spark. But there’s more. Advanced analytics and ease of programming are almost equally important, cited by 82 percent and 76 percent of respondents, respectively.