Apache Spark is an open-source data processing framework that performs analytic operations on Big Data in a distributed environment. It began as an academic project, started by Matei Zaharia at UC Berkeley’s AMPLab in 2009. Spark was originally built to run on top of a cluster management tool known as Mesos, and was later extended into a general-purpose engine for distributed processing on a cluster.
We will be using Maven to create a sample project for the demonstration. To create the project, execute the following command in a directory that you will use as workspace:
mvn archetype:generate -DgroupId=com.journaldev.sparkdemo -DartifactId=JD-Spark-WordCount -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false
If you are running Maven for the first time, the generate command will take a few seconds to complete, because Maven has to download all the plugins and artifacts it needs for the generation task. Once you have created the project, feel free to open it in your favourite IDE. The next step is to add the appropriate Maven dependencies to the project. Here is the pom.xml file with the appropriate dependencies:
<!-- Import Spark -->
As this is a Maven-based project, there is no need to install and set up Apache Spark on your machine. When we run this project, a runtime instance of Apache Spark is started, and once the program has finished executing, it is shut down. Finally, to see all the JARs that were added to the project along with this dependency, we can run a simple Maven command that prints the complete dependency tree of a project:
mvn dependency:tree
When we run this command, it will show us the following Dependency Tree:
With just two added dependencies, Maven pulled in everything the project requires, including the Scala libraries, because Apache Spark is itself written in Scala.
As we’re going to create a word counter program, we will create a sample input file named input.txt in the root directory of our project. You can put any content inside it; we use the following text:
Hello, my name is Shubham and I am author at JournalDev . JournalDev is a great website to ready
great lessons about Java, Big Data, Python and many more Programming languages.
Big Data lessons are difficult to find but at JournalDev , you can find some excellent
pieces of lessons written on Big Data.
Feel free to use any text in this file.
Before we move on and start working on the code, here is the project structure we will have once we’re finished adding all the code to the project:
[Figure: Project Structure]
Now, we’re ready to start writing our program. When you start working with Big Data programs, imports can create a lot of confusion. To avoid this, here are all the imports we will use in our project:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
Next, here is the structure of our class which we will be using:
package com.journaldev.sparkdemo;
public class WordCounter {
    private static void wordCount(String fileName) {
    }
    public static void main(String[] args) {
    }
}
All the logic will lie inside the wordCount method. We will start by creating an object of the SparkConf class. An object of this class is used to set various Spark parameters as key-value pairs for the program. We provide just two simple parameters:
SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
The master is set to local, which means that Spark runs locally inside the same JVM as this program, using a single worker thread, instead of connecting to a cluster. The app name is just a way to provide Spark with the application metadata. Now, we can construct a Spark Context object with this configuration object:
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
Spark treats every resource it processes as an RDD (Resilient Distributed Dataset), a partitioned collection that organises the data so that it can be analysed efficiently in parallel. We will now load the input file as a JavaRDD object:
JavaRDD<String> inputFile = sparkContext.textFile(fileName);
We will now use a Java 8 lambda expression to process the JavaRDD and split each line of the file into separate words:
JavaRDD<String> wordsFromFile = inputFile.flatMap(content -> Arrays.asList(content.split(" ")));
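To see what this flatMap step does without starting Spark, here is a minimal plain-Java sketch of the same idea using the Java 8 Stream API. It is an analogy, not Spark code: the class FlatMapSketch and the helper splitIntoWords are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class FlatMapSketch {

    // Hypothetical helper: turns a list of lines into one flat list of words,
    // mirroring what flatMap does to the RDD of lines.
    static List<String> splitIntoWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(" ")))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("Big Data lessons", "are difficult to find");
        System.out.println(splitIntoWords(lines));
        // prints [Big, Data, lessons, are, difficult, to, find]
    }
}
```

Because the split is on a single space, punctuation stays attached to words, so "Java," and "Java" would be counted as different words. Also note that in Spark 2.0 and later the function passed to flatMap must return an Iterator, so on those versions the Spark line above needs .iterator() appended to the Arrays.asList(...) call.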
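Next, we use Spark’s mapToPair(...) method with a Java 8 lambda to turn each word into a (word, 1) pair, and reduceByKey(...) to add up the numbers for each word. The result is a set of (word, count) pairs which can be presented as the output: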
JavaPairRDD<String, Integer> countData = wordsFromFile.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey((x, y) -> x + y);
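The pair-then-reduce pattern can be tried out in plain Java as well. The sketch below is only an analogy for mapToPair(...) followed by reduceByKey(...), running on a single machine with an ordinary Map; the class ReduceByKeySketch and the helper countWords are hypothetical names, not part of Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceByKeySketch {

    // Hypothetical helper: pairs each word with 1 and sums the values per key,
    // the same idea as mapToPair(...) followed by reduceByKey(...).
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys for stable output
        for (String word : words) {
            counts.merge(word, 1, Integer::sum); // same reduction as (x, y) -> x + y
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Big", "Data", "lessons", "Big", "Data", "Big");
        System.out.println(countWords(words));
        // prints {Big=3, Data=2, lessons=1}
    }
}
```

The reduction function must be associative and commutative, as addition is here, because Spark applies it to partial results from different partitions in no fixed order.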
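Now, we can save the output as text with saveAsTextFile(...), which writes a directory of part files rather than a single file. The directory name CountData used here is only illustrative:
countData.saveAsTextFile("CountData");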
Finally, we provide the entry point to our program with the main() method, which passes the first command-line argument to wordCount:
public static void main(String[] args) {
    if (args.length == 0) {
        System.out.println("No files provided.");
        return;
    }
    wordCount(args[0]);
}
The complete file looks like:
package com.journaldev.sparkdemo;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;

public class WordCounter {
    private static void wordCount(String fileName) {
        // Run Spark locally inside this JVM
        SparkConf sparkConf = new SparkConf().setMaster("local").setAppName("JD Word Counter");
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        // Read the input file as an RDD of lines
        JavaRDD<String> inputFile = sparkContext.textFile(fileName);
        // Split each line into words (on Spark 2.0 and later, append .iterator() to Arrays.asList(...))
        JavaRDD<String> wordsFromFile = inputFile.flatMap(content -> Arrays.asList(content.split(" ")));
        // Pair each word with 1, then sum the counts per word
        JavaPairRDD<String, Integer> countData = wordsFromFile.mapToPair(t -> new Tuple2<>(t, 1)).reduceByKey((x, y) -> x + y);
        // Write the result as a directory of part files (the directory name is illustrative)
        countData.saveAsTextFile("CountData");
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("No files provided.");
            return;
        }
        wordCount(args[0]);
    }
}
We will now move forward to run this program using Maven itself.
To run the application, go inside the root directory of the program and execute the following command:
mvn exec:java -Dexec.mainClass=com.journaldev.sparkdemo.WordCounter -Dexec.args="input.txt"
In this command, we provide Maven with the fully-qualified name of the main class and the name of the input file. Once the command has finished executing, we can see that a new directory has been created in our project:
[Figure: Project Output Directory]
When we open the directory and the file named part-00000 inside it, we see one (word,count) pair per line:
[Figure: Word Counter Output]
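If you want to read the result back in a program instead of opening the file by hand, the following plain-Java sketch shows one way to do it. It assumes that each output line has the form (word,count), which is how a Tuple2 is written as text, and that the output directory is called CountData. That directory name, the class ReadCountsSketch and its helpers are assumptions made for this illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.AbstractMap;
import java.util.Map;
import java.util.TreeMap;

public class ReadCountsSketch {

    // Parses one output line such as "(Data,3)". The last comma is used as the
    // separator because a word may itself end with a comma, e.g. "(Hello,,1)".
    static Map.Entry<String, Integer> parseLine(String line) {
        int comma = line.lastIndexOf(',');
        String word = line.substring(1, comma);
        int count = Integer.parseInt(line.substring(comma + 1, line.length() - 1));
        return new AbstractMap.SimpleEntry<>(word, count);
    }

    // Reads every part-* file in the output directory into one sorted map.
    static Map<String, Integer> readCounts(Path outputDir) throws IOException {
        Map<String, Integer> counts = new TreeMap<>();
        try (DirectoryStream<Path> parts = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path part : parts) {
                for (String line : Files.readAllLines(part)) {
                    if (line.isEmpty()) {
                        continue; // skip blank lines
                    }
                    Map.Entry<String, Integer> entry = parseLine(line);
                    counts.put(entry.getKey(), entry.getValue());
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // "CountData" is an assumed directory name; pass another one as the first argument
        Path dir = Paths.get(args.length > 0 ? args[0] : "CountData");
        readCounts(dir).forEach((word, count) -> System.out.println(word + " -> " + count));
    }
}
```

The part-* pattern skips the other files Spark places in the output directory, such as the _SUCCESS marker.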
In this lesson, we saw how we can use Apache Spark in a Maven-based project to make a simple but effective Word counter program. Read more Big Data Posts to gain deeper knowledge of available Big Data tools and processing frameworks.
