Bargunan Somasundaram

The core of any organization is its data – the crown jewels. Gone are the days where ‘locking up the crown jewels’ sustained businesses. Big data enables unlocking big insights to find newer business opportunities. Big data is the megatrend because it turns data into information, information into insights and insights are the business! The arrival of Big data also rung the death knell for customer segmentation, since each data over a period is a customer. Analyzing them uncovers the hidden patterns, unknown correlations, market trends, customer preferences, and other useful information. These help in making sound decisions at the macro level and focus on smaller areas of opportunities that might have gone unnoticed. The bigger the data, the more you get to focus on, eliminating the Marketing Myopia.

“If you torture the data long enough, it will confess.” – Ronald H. Coase

Roman census Approach for AI/ML platforms

Applying Big Data analytics in any business is never a cakewalk. To harness the value of big data, a robust Data Processing Architecture must be designed. One of the cornerstones of big data architecture is its processing, referred to as the ‘Roman census approach’. This approach, architecturally, is at the core of the functioning of big data. This approach enables big data architecture to accommodate the processing of almost unlimited amounts of data.

About 2000 years ago, the Romans decided that they wanted to tax everyone in the Roman Empire. But to do so, they had to have a census. They quickly figured out that trying to get every person in the Roman Empire to march through the gates of Rome to be counted was an impossibility. There were Roman citizens across the known world of that time. Trying to transport everyone on ships, carts and donkeys to and from the city of Rome was next to impossible.

So, the Romans realized that creating a census where the processing (i.e. counting) was done centrally was not going to work. They solved the problem by creating a body of census takers and sending them all over the Roman Empire. The results of the census were then tabulated centrally in Rome.

Similarly, the work being done was sent to the data, rather than trying to send the data to a central location and doing the work in one place. By distributing the processing, the Romans solved the problem of creating a census over a large diverse population.  

“Distribute the processing, not the data”

Processing is not centralized, instead, it is distributed if the amount of data to be processed is humongous. In doing so, it’s easy to service the processing over an effectively unlimited amount of data. A well-architected big data platform runs the AI/ML/DL algorithms in a distributed fashion on data.

The common components of big data architecture for AI/ML

Data sources

All big data solutions start with one or more data sources.  This can include data from relational databases, data from real-time sources (such as IoT devices), social media data like Twitter streams, LinkedIn updates, real-time user tracking clickstream logs, web server logs, network logs, storage logs, device logs, among others.

Real-time message ingestion

This layer is the first step for the data coming from variable sources to start its journey. Data here is prioritized and categorized, which makes it flow smoothly in further layers. The big data architecture must include a way to capture and store real-time messages for stream processing. This might be a simple data store, where incoming messages are dropped into Hadoop or caught in Kafka for processing. However, many solutions need a message ingestion store to act as a buffer for messages, and to support scale-out processing, reliable delivery, and other message queuing semantics. This portion of a streaming architecture is often referred to as stream buffering. Options include Azure Event Hubs, Azure IoT Hub, Apache Kafka, Apache Flume, Apache Nifi and Elasticsearch with Logstash.

Data storage

A robust, linearly scalable data store is needed for an all-encompassing big data platform, since this layer is at the receiving end for the big data. It receives data from the various data sources and stores it in the most appropriate manner. This layer can even change the format of the data per the requirements of the system. For example, batch processing data is generally stored in a distributed file storage system such as HDFS that can store high volume data that too in different formats. On the other hand, structured data can be stored using RDBMS only. It all depends on the format of the data and the purpose.

Apache HBase, Apache Cassandra, MongoDB, Neo4j, CouchDB, Riak, Apache Hive, Azure CosmosDB etc. are some of the NoSQL data stores that could be employed in a Big Data architecture

Batch processing

Often the data sets are so large, and the big data architecture must process data files using long-running batch jobs to filter, aggregate, and prepare the data for advanced analytics. Usually these jobs involve reading source files, processing them, and writing the output to new files.

The datasets in batch processing are typically

  • bounded: batch datasets represent a finite collection of data
  • persistent: data is almost always backed by some type of permanent storage
  • large: batch operations are often the only option for processing extremely large sets of data

Batch processing is well-suited for calculations where access to a complete set of records is required. For instance, when calculating totals and averages, datasets must be treated holistically instead of as a collection of individual records. These operations require that state be maintained for the duration of the calculations. Batch processing works well in situations where there isn’t any need for real-time analytics results, and when it is more important to process large volumes of information than it is to get quick analytics results.

Apache Hadoop and its MapReduce processing engine offer a well-tested batch processing model that is best suited for handling larger data sets where time is not a significant factor. The Apache Spark framework is the new kid on the block, that can run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Scala, Java, Python and R are the supported languages for spark.

Stream processing

Stream processing is a golden key in generating analytics results in real-time. Stream processing allows to process data in real-time as they arrive and quickly detect conditions from the point of receiving the data. Stream processing allows you to feed data into analytics tools as soon as they get generated and get instant analytics results. There are multiple open source stream processing platforms such as Apache Kafka, Confluent KSQL, Spark Streaming, Apache Beam, Apache Flink, Apache Storm, Apache Samza, etc.

Analytical data store

After batch or stream processing the generated analytical data must be stored in a data store. Operational Data Systems, consisting largely of transactional data, are built for quicker updates. Analytical Data Systems, which are intended for decision making, are built for more efficient analysis. Thus, analytical data is not a one-size-fits-all by any stretch of the imagination! Analytical data is best stored in a Data System designed for heavy aggregation, data mining, and ad hoc queries, called an Online Analytical Processing system, OLAP. Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools. Some of them are Apache Druid, Apache Hive, Azure Synapse Analytics, Elasticsearch, Apache SOLR, Amazon Redshift, among others.

Analysis and reporting

The goal of most big data solutions is to provide insights into the data through analysis and reporting. To empower users to analyze the data, the architecture may include a data modeling layer, such as a multidimensional OLAP cube or a custom UI. It might also support self-service BI, using the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel or Qlik sense, etc. Analysis and reporting can also take the form of interactive data exploration by data scientists or data analysts.  A noteworthy reporting tool is Apache Zeppelin, a web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala.


Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard. To automate these workflows, orchestration technologies like Azkaban, Luigi, Azure Data Factory, Apache Oozie and Apache Sqoop can be employed.

“Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” – Geoffrey Moore

Big data is a broad, rapidly evolving topic. While it is not well-suited for all types of computing, many organizations are turning to big data for certain types of workloads and using it to supplement their existing analysis and business tools. Big data systems are uniquely suited for finding patterns like correlation, prediction, anomalies and providing insight into behaviors that are impossible to find through conventional means. By correctly designing their architecture that deal with big data, organizations can gain incredible value from data that is already available.