Big Data Analytics in Spark

Exploring Dataframes, Datasets, RDDs, and Google Colab

An estimated 2.5 quintillion bytes of data are produced every day. At that scale, new technologies are needed to store the data, analyze it, and apply machine learning to it.

Big data typically can’t fit on the disk, let alone in the memory, of a single computer, so you have to turn to distributed computing, which spreads the processing of the data across multiple machines.

The biggest challenges of dealing with data at this scale are the scarcity of big data expertise (the field is still relatively young), storing the data, and analyzing and querying it. In this article, we will look at a technology that addresses these problems.

Systems that process huge amounts of data in a distributed manner

The most popular technologies for processing data in a distributed fashion are Apache Hadoop and Apache Spark. Apache Hadoop enables data to be stored and processed in a distributed environment across a cluster of computers, and it can scale to thousands of machines.

Apache Spark

Apache Spark is an open-source engine for analyzing and processing big data. Every Spark application contains a driver program that runs the user’s main function and executes various operations in parallel on a cluster. A cluster is made up of many nodes, and a node is a single machine or server.

Spark applications are controlled by the SparkContext, which connects to a cluster manager. Spark supports several cluster managers, including its own standalone cluster manager, Apache Mesos, and Hadoop YARN. The cluster manager allocates resources to the various Spark applications.
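
For illustration (this snippet is not part of the article's later setup steps), a driver program tells the SparkContext which cluster manager to connect to through the master URL:

from pyspark import SparkConf, SparkContext

# The driver creates a SparkContext, which connects to a cluster manager.
# "local[2]" runs Spark locally with 2 threads; on a real cluster this would be
# a master URL such as spark://host:7077 or "yarn".
conf = SparkConf().setAppName("my-app").setMaster("local[2]")
sc = SparkContext(conf=conf)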

Executors are processes on worker nodes that run computations and store data for your application. An executor is started for an application on a worker node; it runs tasks and keeps data in memory or on disk across them. A worker node is any node in the cluster that can run application code. Each application has its own executors, and tasks are sent to them by the SparkContext. Because every application gets its own executor processes, applications are isolated from one another.

A task is a unit of work that will be sent to one executor. A parallel computation involving multiple tasks is known as a job.

Why use Spark

Here are some reasons why you would consider using Spark:

  • Speed: Thanks to its in-memory computation, Spark can be up to 100x faster than Hadoop MapReduce for certain workloads.
  • Ease of use: You can use Spark either in Java, Scala, Python, R, or SQL.
  • Runs Everywhere: Spark can run on Hadoop, Apache Mesos, Kubernetes, or even in the cloud.

Apache Spark Data Representations

There are three main data representations in Apache Spark:

  • Resilient Distributed Datasets (RDDs)
  • Dataframes
  • Datasets

A Resilient Distributed Dataset (RDD) is a fault-tolerant collection of elements that can be operated on in parallel. Whenever there is a fault, the RDD is able to recover automatically. It is a good choice when you need low-level transformations on your datasets or when you are working with unstructured data such as media or text streams.

RDD Creation

An RDD can be created by parallelizing an existing collection in your driver program or by referencing a dataset in an external storage system, such as a shared filesystem. By default, Spark will determine the optimal number of partitions for splitting the dataset; however, you can provide the number of partitions yourself as the second argument to the parallelize method.

# Create an RDD by parallelizing an existing list, split into 2 partitions
my_list = [1, 2, 3, 4, 5]
my_list_distributed = sc.parallelize(my_list, 2)

# Create an RDD by referencing a file in external storage
distributed_file = sc.textFile("file.txt")

RDD Persistence

In Spark, you can improve the performance of your application by persisting or caching a dataset. When you do, the result is saved in memory the first time it’s computed, and subsequent uses read it from memory instead of recomputing it. The cache is fault-tolerant: any lost partition is recomputed using the transformations that originally created it.
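
As a minimal sketch (assuming a SparkContext sc, as in the other snippets), caching an RDD looks like this:

# Build an RDD and mark it for in-memory persistence
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x)
squares.cache()                              # persisted with the default MEMORY_ONLY level
squares.count()                              # first action computes the squares and caches the partitions
total = squares.reduce(lambda a, b: a + b)   # later actions reuse the cached data instead of recomputing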

RDD Operations

There are two main types of operations in Spark: transformations and actions. Transformations create a new dataset from an existing one, e.g., when a map operation is applied to a dataset. Actions return a value after running a computation on the dataset, e.g., the reduce method.

Spark Transformations

Transformations in Spark are lazy, meaning they do not compute their results right away. Transformations are only computed when an action requires a result to be returned. Here are a few transformations that can be done in Apache Spark (see the sketch after this list):

  • map(func): Return a new distributed dataset formed by passing each element of the source dataset through the function func.
  • filter(func): Return a new dataset formed by selecting the elements for which func returns true.
  • union(otherDataset): Return a new dataset that is the union of the source dataset and otherDataset.
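
The following sketch (assuming the SparkContext sc from the setup snippets) chains these transformations together:

evens = sc.parallelize([2, 4, 6])
odds = sc.parallelize([1, 3, 5])
doubled = evens.map(lambda x: x * 2)    # map: double each element
small = odds.filter(lambda x: x < 5)    # filter: keep elements smaller than 5
combined = doubled.union(small)         # union: merge the two datasets
# Nothing has been computed yet -- transformations are lazy until an action is called.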

Spark Actions

Let us now look at a few actions that you can call in Apache Spark.

  • collect() — return all the elements of the dataset as an array
  • count() — return the number of items in a dataset
  • take(n) — return the first n elements of the dataset
  • first() — return the first item in the dataset
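
Continuing the sketch from the transformations section, each of these actions triggers the actual computation:

combined.collect()   # [4, 8, 12, 1, 3]
combined.count()     # 5
combined.take(2)     # [4, 8]
combined.first()     # 4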

Spark DataFrames

A Spark DataFrame is an immutable distributed collection of data that is very similar to a Pandas DataFrame. One of the main advantages of a Spark DataFrame is that it can be queried as if it were an SQL table.
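
As a minimal sketch (the SparkSession setup and the rows are illustrative, not from the article), you can register a DataFrame as a temporary view and query it with SQL:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-demo").getOrCreate()
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45)],   # illustrative rows
    ["name", "age"],
)
df.createOrReplaceTempView("people")    # make the DataFrame queryable like an SQL table
spark.sql("SELECT name FROM people WHERE age > 40").show()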

Uses and Installation

The process of setting up Spark can be quite intimidating. The most common recommendations are to use a Docker image or a platform like Databricks; the free Databricks Community Edition is quite sufficient. You can also set Spark up on Google Colab or your local machine for practice, but that doesn’t make sense for genuinely big data because a single machine doesn’t scale. For real-world applications, you’d use cloud solutions such as Databricks or Amazon Web Services, which can scale.

Working with Databricks is straightforward, since all of the Spark dependencies are already installed. If you are setting up Spark yourself, the first thing you want to configure is your Java installation, as it is a required dependency.

Setting up Spark on Google Colab

In order to set up Spark on Google Colab, you will first need to download Java and Spark itself. Next, you will extract Spark and install findspark. This package will ensure that Spark is importable within your Google Colab environment.

# Install Java (a required Spark dependency)
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
# Download and extract Spark
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop2.7.tgz
!tar xf spark-3.0.1-bin-hadoop2.7.tgz
# Install findspark so that Spark can be imported in the notebook
!pip install findspark

The next step is to set the paths for Java and Spark as environment variables. This will enable the virtual machine on Google Colab to locate these packages.

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.1-bin-hadoop2.7"

Finally, you will import findspark and make Spark importable.

import findspark
findspark.init()

Now your Spark installation is ready for use, and you can import it. The first step in every Spark application is to create a SparkContext.

import pyspark
sc = pyspark.SparkContext()

You can create an RDD from a list like this:

my_list = [2,3,4,5,6]
rdd = sc.parallelize(my_list)

Simply running rdd won’t show you anything because, as we said previously, Spark evaluates lazily. To see the result, you have to apply an action such as collect().

rdd.collect()   # returns [2, 3, 4, 5, 6]

Setting up Apache Spark on your local machine

The most painless way to set up Apache Spark on your local machine is by using a Docker image, so the first step is to install Docker on your Mac, Ubuntu, or Windows computer. Docker makes it easy to run applications by packaging all the environment variables, configurations, and dependencies.

With Docker installed, the only thing left is to run the Spark image provided by Jupyter. The -p 8888:8888 flag maps the Docker container’s port 8888 to port 8888 on your computer, so that you can open the Jupyter Notebook running inside the container in your browser.
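
A typical invocation, assuming Jupyter’s pyspark-notebook image (which bundles Spark), looks like this:

docker run -p 8888:8888 jupyter/pyspark-notebook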

When you run this command for the first time, the Jupyter Docker image will be downloaded. Once that’s done, you will be able to access a Jupyter Notebook where you can start running your Spark commands. In the future, you can run the same command to access the Notebook; the image won’t be downloaded again.

You can also use Spark in the cloud by creating a free account on Databricks’ Community Edition.

Final Thoughts

In this article, we have built an understanding of Spark and its key concepts. We have also looked at why you would use it. Finally, we have looked at several options for its installation.
