By Bargunan Somasundaram
Before delving into why Scala suits big data and machine learning, let me address what these terms mean and how they are interrelated.
Big data, machine learning, statistics and statistical machine learning are terms that have been surfacing across the IT world recently. We are in a new era in which huge amounts of data are generated every second. According to forbes.com, at our current pace 2.5 quintillion bytes of data are created each day, a figure that keeps rising with the growth of the Internet of Things (IoT). Insights drawn from this data lead to new business. Analyzing this large volume of structured, semi-structured and unstructured data to extract information and generate insights is what is broadly called big data.
If a system can, with the help of artificial intelligence and algorithms, automatically learn and improve at generating insights from data without explicit intervention or rule-based programming, that is called machine learning.
We are in the midst of a data revolution, which has given rise to completely new data formats and databases of unprecedented scale. This enormous growth in data, together with the ability to analyze it and extract insights from it, is what ties big data to machine learning.
As a rule of thumb, the accuracy of a machine learning algorithm at pattern finding, data mining or knowledge discovery depends on the volume of data it has processed. So the more data, the more learning.
Python and R are the prominent programming languages for machine learning and data science. Scala is now climbing the ladder fast, driven by the rising usage of Apache Spark.
Scala as a Language for Frameworks
Some of the frameworks that rule the roost in the big data world are:
- Apache Spark - a unified analytics engine for big data processing, with built-in modules such as SQL and DataFrames, MLlib for machine learning, GraphX, and Spark Streaming.
Apache Spark, built on Scala, has gained a lot of recognition and is widely used in production. The Resilient Distributed Dataset (RDD) is the fundamental data structure of Spark. Its key properties are that it is immutable, distributed, lazily evaluated and cacheable.
- Apache Kafka - a distributed streaming platform for handling real-time data feeds.
Written in Java and Scala, Kafka is a fast, scalable, durable, and fault-tolerant publish-subscribe messaging system. It works in combination with Apache Storm, Apache HBase and Apache Spark for real-time analysis and rendering of streaming data.
- Apache Samza - a stream processing framework developed in Scala.
Apache Samza uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management. Samza is similar to Apache Storm but easier to operate. Samza stream processing jobs can be written in Scala.
- Scalding - a Scala API for Cascading, an abstraction over MapReduce
Built on top of Cascading, a Java library that abstracts Hadoop MapReduce, Scalding simplifies writing MapReduce jobs in Scala. Scalding is comparable to Pig, while offering tight integration with Scala.
- Apache Flink — a framework for distributed stream and batch data processing
Flink’s core is a hybrid (Real-Time Streaming + Batch) distributed data processing engine written in Java and Scala. Flink contains several APIs for batch processing (DataSet API), real-time streaming (DataStream API) and relational queries (Table API) and also domain-specific libraries for machine learning (FlinkML — pure Scala), complex event processing (CEP) and graph processing (Gelly).
- Akka — a concurrent framework for building distributed applications
Akka is an actor-based, message-driven runtime for managing concurrency, elasticity and resilience on the JVM, with support for both Java and Scala. Akka uses the actor model, which is well suited to highly scalable and concurrent systems.
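Two of the RDD properties mentioned for Spark above, immutability and lazy evaluation, can be illustrated without a cluster. The following is a minimal sketch that uses a plain Scala collection view as a stand-in. It is only an analogy and not Spark code, and the names `LazyViewSketch` and `demo` are made up for illustration.

```scala
object LazyViewSketch {
  // Returns (elements processed before forcing, result, elements processed after forcing).
  def demo(numbers: List[Int]): (Int, List[Int], Int) = {
    var evaluated = 0

    // A view only records the transformation; nothing is computed yet.
    // This loosely resembles declaring a transformation on an RDD.
    val doubled = numbers.view.map { n =>
      evaluated += 1
      n * 2
    }

    val before = evaluated       // still 0: the map has not run
    val result = doubled.toList  // forcing the view, loosely like an action
    (before, result, evaluated)
  }

  def main(args: Array[String]): Unit = {
    val source = List(1, 2, 3)
    val (before, result, after) = demo(source)
    println(s"processed before forcing: $before")
    println(s"result: $result, processed after forcing: $after")
    println(s"source is unchanged: $source") // immutable input
  }
}
```

The source list is never modified; each step produces a new value, which is the same discipline Spark applies to distributed data.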
Scala packs the punch of both functional and object-oriented programming
Scala is a JVM language, and one of its biggest advantages is its support for both object-oriented and functional programming. Both approaches aim to produce readable, bug-free code, but they go about it in very different ways. Where object-oriented programming combines data structures with the actions you want to perform on them, functional programming keeps the two separate.
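As a rough illustration of that difference, here is a small sketch of the same operation written both ways. The `Account`, `Balance` and `deposit` names are hypothetical and chosen only for this example.

```scala
object ParadigmSketch {
  // Object-oriented style: the data and the behaviour live together,
  // and the object changes its own state.
  class Account(private var balance: Long) {
    def deposit(amount: Long): Unit = balance += amount
    def current: Long = balance
  }

  // Functional style: an immutable data structure...
  final case class Balance(amount: Long)

  // ...and a separate function that returns a new value instead of mutating.
  def deposit(balance: Balance, amount: Long): Balance =
    balance.copy(amount = balance.amount + amount)

  def main(args: Array[String]): Unit = {
    val account = new Account(100)
    account.deposit(50)
    println(account.current)

    val before = Balance(100)
    val after = deposit(before, 50)
    println(s"$before -> $after") // `before` is left untouched
  }
}
```

Scala lets both styles coexist in the same codebase, so a team can pick whichever fits a given module.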
Each approach has its advantages. For many people, the object-oriented paradigm makes intuitive sense, and combining behaviors with the data structures they’ll interact with can make it easy to figure out what’s going on in an unfamiliar codebase. At the same time, functional programming’s preference for cleanly separated and immutable data structures and discrete behaviors often allows you to do more with less code.
Functional programming makes heavy use of lambda expressions. The point of all lambdas is deferred execution. After all, if you wanted to execute some code right now, you’d do that, without wrapping it inside a lambda.
There are many reasons for executing code later, such as:
- Running the code in a separate thread
- Running the code multiple times
- Running the code at the right point in an algorithm (for example, the comparison operation in sorting)
- Running the code when something happens (a button was clicked, data has arrived, and so on)
- Running the code only when necessary
It is a good idea to think through what you want to achieve when you set out programming with lambdas. Let us look at a simple example. Suppose you log an event:
logger.info("x: " + x + ", y: " + y);
What happens if the log level is set to suppress INFO messages? The message string is still computed and passed to the info method, which then decides to throw it away. Wouldn’t it be nicer if the string concatenation only happened when necessary? Running code only when necessary is a use case for lambdas.
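In Scala, one way to get this behaviour is a by-name parameter, which defers evaluation of the argument until it is used. The sketch below assumes a tiny hypothetical `Logger` class written for this example; it is not the API of any real logging library.

```scala
object LazyLoggingSketch {
  // Hypothetical minimal logger, for illustration only.
  class Logger(infoEnabled: Boolean) {
    // `message: => String` is a by-name parameter: the expression passed in
    // is evaluated only if and when `message` is used inside the method.
    def info(message: => String): Unit =
      if (infoEnabled) println(s"INFO $message")
  }

  def main(args: Array[String]): Unit = {
    val x = 1
    val y = 2
    var built = 0

    // Counts how often the message string is actually constructed.
    def describe(): String = {
      built += 1
      "x: " + x + ", y: " + y
    }

    new Logger(infoEnabled = false).info(describe()) // string is never built
    new Logger(infoEnabled = true).info(describe())  // string is built once
    println(s"message built $built time(s)")
  }
}
```

The call site looks the same as the eager version, yet the concatenation is skipped whenever INFO is disabled.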
Scala is a fully fledged OOP language, and it’s possible to write highly elegant and expressive programs without even touching its functional features. But for those who are curious about functional programming, Scala provides a rich set of collection operations (like map and reduce), higher-order functions, and a strong static type system.
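A short sketch of those collection operations and of a higher-order function follows. The helper names `sumOfSquares` and `applyTwice` are invented for this example.

```scala
object CollectionOpsSketch {
  // map transforms every element; reduce combines them into one value.
  // Note: reduce throws on an empty list, so callers must pass at least one element.
  def sumOfSquares(xs: List[Int]): Int =
    xs.map(x => x * x).reduce(_ + _)

  // A higher-order function: it takes another function as an argument.
  def applyTwice(f: Int => Int, value: Int): Int = f(f(value))

  def main(args: Array[String]): Unit = {
    println(sumOfSquares(List(1, 2, 3)))
    println(applyTwice(_ + 3, 10))
  }
}
```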
About that static type system:
Where many other modern programming languages are dynamically typed, Scala checks types at compile time, meaning that many trivial but costly bugs can be caught at compile time rather than in production. At the same time, Scala has a highly sophisticated type system with type inference, meaning that developers can enjoy the security of compile-time type checking without having to specify every type every time.
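The sketch below shows what that looks like in practice. The value names are arbitrary, and the commented-out line is there only to show the kind of mistake the compiler rejects.

```scala
object TypeInferenceSketch {
  val count = 42                     // inferred as Int
  val names = List("Ada", "Grace")   // inferred as List[String]
  val lengths = names.map(_.length)  // inferred as List[Int]

  // Explicit annotations are still allowed and are checked by the compiler.
  val total: Int = lengths.sum

  // The following line would be rejected at compile time, not at run time:
  // val wrong: Int = "forty-two"

  def main(args: Array[String]): Unit =
    println(s"count=$count, lengths=$lengths, total=$total")
}
```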
Concise programming with Scala
- The Scala programming language is concise. A loop can often be replaced by a single method call, which makes Scala significantly less verbose than standard Java. In addition, its statically typed and functional nature makes it type-safe.
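For example, the Java code to reverse every string in a list:
List<String> reversedValues = new ArrayList<String>();
for (String n : nameList) {
    // StringBuilder.reverse() reverses the characters of each name
    reversedValues.add(new StringBuilder(n).reverse().toString());
}
return reversedValues;
The Scala equivalent:
for (n <- nameList) yield n.reverse
or, more compactly:
nameList.map(_.reverse)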
- Pattern matching - one of the most widely used features of Scala, which lets you match on any sort of data with a first-match policy.
- The ability to treat functions as values and to reuse utility functions
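The last two points can be sketched together. In the example below, the `Click` and `KeyPress` case classes and the `describe` and `shout` names are hypothetical. Note how the value 0 would also match the `n: Int` case, but the first matching case wins.

```scala
object PatternMatchSketch {
  final case class Click(x: Int, y: Int)
  final case class KeyPress(key: Char)

  // Cases are tried top to bottom; the first one that matches is used.
  def describe(value: Any): String = value match {
    case 0               => "zero"
    case n: Int if n < 0 => "negative number"
    case n: Int          => s"number $n"
    case Click(x, y)     => s"click at ($x, $y)"
    case KeyPress(k)     => s"key '$k'"
    case _               => "something else"
  }

  // A function stored in a value, so it can be passed around and reused.
  val shout: String => String = _.toUpperCase + "!"

  def main(args: Array[String]): Unit = {
    List(0, -5, 7, Click(1, 2), KeyPress('q'), "hi").foreach(v => println(describe(v)))
    println(List("spark", "kafka").map(shout))
  }
}
```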
Stream processing in real time
While Hadoop MapReduce can process and generate large datasets in parallel, it has been criticized for its inability to handle real-time stream processing. Spark gives Scala an edge over other programming languages when it comes to processing streams in real time, and has made Scala a language of choice for fast data processing.
Plethora of machine learning libraries and evolving communities
Even though Scala’s libraries are not as comprehensive as those of Python or R, they provide a solid foundation for big data projects. Awesome Machine Learning, a curated list of machine learning frameworks, libraries and software covering several languages, presents a list of useful Scala libraries and tools for machine learning, data analysis, data visualization, and NLP. In addition, Typelevel provides several helpful libraries and extensions to Scala.
The following are a few of the most used machine learning and data analysis libraries:
- Saddle — a high-performance data manipulation library (strongly influenced by the pandas library for Python)
- ScalaNLP — a suite of different libraries, including Breeze (set of libraries for machine learning and numerical computing) and Epic (high-performance statistical parser and structured prediction library).
- Apache Spark MLlib — machine learning library for Scala, Java, Python, and R
- Apache PredictionIO — a machine learning server based on Apache Spark, HBase and Spray that can be installed as a full machine learning stack
- DEEPLEARNING4J — a distributed deep-learning library for Java and Scala
- Scala-datatable and Framian — for data frames and data tables
Scala has an active community that is expanding rapidly. According to the KDnuggets Analytics/Data Science 2016 Software Poll, Scala was among the tools with the highest growth.
The Scala community is also well represented on Stack Overflow, GitHub and Reddit.
About the Author: I’m an open source lover and a Java enthusiast. It’s my passion to share my knowledge by writing about my experiences with them. I believe “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”