Welcome to Spark Tutorials. The objective of these tutorials is to provide an in-depth understanding of Spark. We will introduce the basic concepts of Apache Spark and the first few steps necessary to get started with Spark.
In addition to free Spark tutorials, we will cover common interview questions, issues, and how-tos of Spark.
Big data and data science are enabled by scalable, distributed processing frameworks that allow organizations to analyze petabytes of data on large commodity clusters. MapReduce (especially the Hadoop open-source implementation) is the first, and perhaps most famous, of these frameworks.
Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs in Scala, Java, Python, and R that allow data workers to efficiently execute machine learning algorithms that require fast iterative access to datasets (see Spark API Documentation for more info). Spark on Apache Hadoop YARN enables deep integration with Hadoop and other YARN enabled workloads in the enterprise.
Apache Spark is a general-purpose distributed computing engine for processing and analyzing large amounts of data. Though not as mature as the traditional Hadoop MapReduce framework, Spark offers performance improvements over MapReduce, especially when Spark's in-memory computing capabilities can be leveraged.
Spark programs operate on Resilient Distributed Datasets, which the official Spark documentation defines as "a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel."
MLlib is Spark's machine learning library, which we will employ for this tutorial. MLlib includes several useful algorithms and tools for classification, regression, feature extraction, statistical computing, and more.
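As a quick, illustrative sketch (not part of the original tutorial), the snippet below uses MLlib's summary-statistics utility from the Spark 1.x API on a tiny RDD of vectors (RDDs are introduced in the next section); it assumes a SparkContext named sc is already available, as it is in the spark-shell, and the data is invented for illustration:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// A small RDD of feature vectors (toy data, for illustration only)
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)
))

// Column-wise summary statistics: mean, variance, count of non-zeros, etc.
val summary = Statistics.colStats(observations)
println(summary.mean)
println(summary.variance)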
At the core of Spark is the notion of a Resilient Distributed Dataset (RDD), which is an immutable collection of objects that is partitioned and distributed across multiple physical nodes of a YARN cluster and that can be operated on in parallel.
Typically, RDDs are instantiated by loading data from a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat on a YARN cluster.
Once an RDD is instantiated, you can apply a series of operations to it. All operations fall into one of two types: transformations or actions. Transformation operations, as the name suggests, create new datasets from an existing RDD and build out the processing Directed Acyclic Graph (DAG) that can then be applied to the partitioned dataset across the YARN cluster. An action operation, on the other hand, executes the DAG and returns a value.
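To make the distinction concrete, here is a minimal sketch using the Spark 1.x RDD API; the HDFS paths are hypothetical and a SparkContext named sc is assumed, as in the spark-shell:

// Load an RDD from a (hypothetical) text file -- nothing is read yet
val lines = sc.textFile("hdfs:///data/input.txt")

// Transformations: they only extend the DAG, no computation runs yet
val errors = lines.filter(line => line.contains("ERROR"))
val counts = errors.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _)

// Actions: they trigger execution of the DAG and return (or write) a result
val totalErrors = errors.count()
counts.saveAsTextFile("hdfs:///data/error-counts")

Nothing is computed until count() or saveAsTextFile() runs; at that point Spark schedules the whole DAG across the partitions of the dataset on the cluster.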
Spark integrates closely with Hadoop. Therefore, it is better to install Spark on a Linux-based system. The following steps show how to install Apache Spark.
Java is a mandatory prerequisite for installing Spark. Try the following command to verify the Java version.
$ java -version
If Java is already installed on your system, you will see the following response −
java version "1.7.0_71" Java(TM) SE Runtime Environment (build 1.7.0_71-b13) Java HotSpot(TM) Client VM (build 25.0-b02, mixed mode)
In case you do not have Java installed on your system, install Java before proceeding to the next step.
You need the Scala language to work with Spark, so let us verify the Scala installation using the following command.
$ scala -version
If Scala is already installed on your system, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
In case you don’t have Scala installed on your system, proceed to the next step for Scala installation.
Download the latest version of Scala by visiting the following link: Download Scala. For this tutorial, we are using the scala-2.11.6 version. After downloading, you will find the Scala tar file in the download folder.
Follow the steps given below to install Scala.
Type the following command to extract the Scala tar file.
$ tar xvf scala-2.11.6.tgz
Use the following commands to move the Scala software files to their respective directory (/usr/local/scala).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv scala-2.11.6 /usr/local/scala
# exit
Use the following command to set the PATH for Scala.
$ export PATH=$PATH:/usr/local/scala/bin
After installation, it is better to verify it. Use the following command to verify the Scala installation.
$ scala -version
If Scala has been installed correctly, you will see the following response −
Scala code runner version 2.11.6 -- Copyright 2002-2013, LAMP/EPFL
Follow the steps given below for installing Spark.
After downloading spark-1.3.1-bin-hadoop2.6.tgz from the Apache Spark downloads page, use the following command to extract the Spark tar file.
$ tar xvf spark-1.3.1-bin-hadoop2.6.tgz
Use the following commands to move the Spark software files to their respective directory (/usr/local/spark).
$ su -
Password:
# cd /home/Hadoop/Downloads/
# mv spark-1.3.1-bin-hadoop2.6 /usr/local/spark
# exit
Add the following line to the ~/.bashrc file. This adds the location where the Spark software files are located to the PATH variable.
export PATH=$PATH:/usr/local/spark/bin
Use the following command for sourcing the ~/.bashrc file.
$ source ~/.bashrc
Write the following command to open the Spark shell.
$ spark-shell
If Spark is installed successfully, then you will see the following output.
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/06/04 15:25:22 INFO SecurityManager: Changing view acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: Changing modify acls to: hadoop
15/06/04 15:25:22 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/04 15:25:22 INFO HttpServer: Starting HTTP Server
15/06/04 15:25:23 INFO Utils: Successfully started service 'HTTP class server' on port 43292.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.4.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_71)
Type in expressions to have them evaluated.
Spark context available as sc
scala>
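Once the scala> prompt appears, a quick smoke test (not part of the output above) is a handy way to confirm the shell works end to end, for example:

// Create a small RDD and run a simple action on it
val data = sc.parallelize(1 to 1000)
data.filter(n => n % 2 == 0).count()   // should return 500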
Start a Spark Script
To run your Spark script on Mortar, you'll need to place the script in the sparkscripts directory of a Mortar project.
The finished, ready-to-run version of the Spark script is available for your reference in the example project's sparkscripts directory: sparkscripts/text-classifier-complete.py.
Now that you have a project to work with, you're ready to start writing your own Spark script.
ACTION: In your favorite code editor, create a new blank file called text-classifier.py in the sparkscripts directory of your project.
- Speed − Spark runs applications on a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk and storing intermediate processing data in memory.
- Supports multiple languages − Spark provides built-in APIs in Java, Scala, Python, and R, so you can write applications in different languages. Spark also provides 80 high-level operators for interactive querying.
- Advanced analytics − Spark supports not only ‘map’ and ‘reduce’ but also SQL queries, streaming data, machine learning (ML), and graph algorithms (a short sketch follows below).
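As a brief sketch of what "more than map and reduce" looks like in practice, the snippet below mixes the core RDD API with Spark SQL using the Spark 1.x API; the Person case class and the data are invented for illustration, and sc is the SparkContext provided by the spark-shell:

// Set up an SQLContext on top of the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

// A toy dataset turned into a DataFrame
case class Person(name: String, age: Int)
val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 45))).toDF()
people.registerTempTable("people")

// Query it with plain SQL
sqlContext.sql("SELECT name FROM people WHERE age > 40").show()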