Data Management Patterns & Architectures
Data has always been significant to organizations, even before the advent of Big Data and machine-generated data. However, with the recent exponential increase in data volumes, its importance has only grown. In that context, enterprises have adopted different data architecture patterns that help them consolidate their data and generate insights from it.
The following are some data management architectures that have been implemented in enterprises in the last two decades.
- Operational Data Store (ODS): An ODS is a central database that provides a snapshot of the latest data from multiple transactional systems for operational reporting. It enables organizations to combine data in its original format from various sources into a single destination and make it available for business reporting. An ODS is typically associated with OLTP (Online Transaction Processing) systems.
- Data Warehouse: A data warehouse is a type of data management system that is designed to enable and support business intelligence (BI) activities, especially analytics. Data warehouses are solely intended to perform queries and analyses and often contain large amounts of historical data. The data within a data warehouse is usually derived from a wide range of sources such as application log files and transaction applications including ODS.
- Data Mart: A data mart is a subset of a data warehouse focused on a specific line of business, department, or subject area. Data marts make specific data available to a defined group of users, allowing those users to quickly access critical insights without searching through an entire data warehouse.
- Data Catalog: Because these disparate data sets come from multiple source systems, possibly spread across different geographies, it is important to store information about the data itself. A data catalog contains details of all data assets in an organization and is designed to help data professionals quickly look up the most appropriate data for any analytical or business purpose. It uses metadata to create an informative and searchable inventory of all data assets in an organization. Metadata can be simply defined as data about data: a description and context of the data that helps to organize, find, and understand it.
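To make the metadata idea concrete, here is a toy sketch of a searchable catalog inventory. The asset names, metadata fields, and tags are hypothetical illustrations, not the schema of any real catalog product:

```python
# Minimal illustrative data catalog: an in-memory inventory of data
# assets, each described by metadata (source system, owner, tags).
# All field names and values here are made up for illustration.
catalog = [
    {"name": "orders_ods", "source": "OLTP CRM", "owner": "sales-eng",
     "tags": ["orders", "operational"]},
    {"name": "revenue_mart", "source": "data warehouse", "owner": "finance",
     "tags": ["revenue", "reporting"]},
]

def search(tag):
    """Return the names of all assets carrying the given tag."""
    return [asset["name"] for asset in catalog if tag in asset["tags"]]

print(search("revenue"))  # ['revenue_mart']
```

Real catalogs add lineage, schemas, and access policies on top of this basic searchable-inventory pattern.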
Unstructured Data & Streaming Data
With unstructured and streaming data, the syntax and semantics of incoming data are no longer clearly defined at ingestion; instead, they are determined when the data is retrieved. This shift has pushed enterprises toward new kinds of data architectures.
Data Lake: A data lake is a central storage repository that holds big data from many sources in a raw, granular format. It can store structured, semi-structured, or unstructured data, which means data can be kept in a more flexible format for future use.
A data lake works on a principle called schema-on-read – there is no predefined schema into which data needs to be fitted before storage. Only when the data is read during processing is it parsed and adapted into a schema as needed. This saves a lot of time that’s usually spent on defining a schema. This also enables data to be stored as-is, in any format.
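The schema-on-read principle can be sketched with nothing but the standard library: raw records land in storage exactly as they arrive, and structure is imposed only when a reader asks for it. The event payloads and field names below are invented for illustration:

```python
import json

# Schema-on-read sketch: events are stored as-is, with no upfront schema.
# Structure is applied only at read time, so different readers can impose
# different schemas on the same raw data.
raw_events = [
    '{"user": "alice", "action": "click", "ts": 1700000000}',
    '{"user": "bob", "action": "purchase", "amount": 42.5}',  # extra field
]

def read_with_schema(raw, fields):
    """Parse raw records and project only the fields this reader cares about."""
    for line in raw:
        record = json.loads(line)
        # Fields absent from a record simply come back as None.
        yield {f: record.get(f) for f in fields}

# One consumer only needs user/action; the extra "amount" field is ignored,
# and the missing "action-less" cases would surface as None rather than fail.
for row in read_with_schema(raw_events, ["user", "action"]):
    print(row)
```

Note how the second record's extra `amount` field causes no ingestion failure; a schema-on-write system would have rejected or forced a migration for it.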
Is the Data Lake the Answer to All Data Management Issues?
The data lake has brought many advantages to the enterprise data landscape, but it has its limitations too. While data lakes add several new capabilities to enterprise data management architecture, they lack certain well-defined features of traditional data warehouses and operational data stores, such as transactional guarantees and schema enforcement. This gap has given rise to a newer architecture known as the Data Lakehouse.
Emerging Pattern of the Data Lakehouse
A data lakehouse is a data solution concept that combines elements of the data warehouse with those of the data lake. Data lakehouses implement the data structures and management features of data warehouses on top of data lakes, whose storage is typically more cost-effective.
Data lakehouses are enabled by a new, open system design – implementing similar data structures and data management features to those in a data warehouse, directly on the kind of low-cost storage used for data lakes. Merging them together into a single system enables data teams to work faster without needing to access multiple systems. Data lakehouses ensure the availability of the most complete and up-to-date data for data science, machine learning, and business analytics projects.
Technologies Behind Data Lakehouse
While the Data Lakehouse retains most of the underlying technologies of existing data lake platforms, it has to bring in new technologies to support ACID transactions, schema enforcement and governance, and other traditional enterprise features.
One such technology framework is known as Delta Lake.
Delta Lake is an open-source storage layer that brings reliability to data lakes. It enables building a Lakehouse architecture on top of data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
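Delta Lake achieves its ACID guarantees through an ordered transaction log (the `_delta_log` directory of JSON commit files) that records which data files make up the table at each version. The following is a heavily simplified, standard-library-only sketch of that log-replay idea; the file layout, action names, and helper functions are illustrative inventions, not Delta Lake's actual protocol:

```python
import json
import os
import tempfile

# Toy sketch of a Delta-style transaction log: each commit is a numbered
# JSON file, and the table's current state is recovered by replaying the
# commits in order. This is a simplified illustration, not the real format.
log_dir = os.path.join(tempfile.mkdtemp(), "_txn_log")
os.makedirs(log_dir)

def commit(version, actions):
    """Publish a commit atomically: write to a temp file, then rename."""
    tmp = os.path.join(log_dir, f"{version:020d}.json.tmp")
    final = os.path.join(log_dir, f"{version:020d}.json")
    with open(tmp, "w") as f:
        json.dump(actions, f)
    os.rename(tmp, final)  # the atomic step that makes the commit visible

def current_files():
    """Replay the log in version order to compute the live set of data files."""
    live = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    live.add(action["path"])
                elif action["op"] == "remove":
                    live.discard(action["path"])
    return live

# Version 0 adds a file; version 1 compacts it into a new file.
commit(0, [{"op": "add", "path": "part-0.parquet"}])
commit(1, [{"op": "add", "path": "part-1.parquet"},
           {"op": "remove", "path": "part-0.parquet"}])
print(sorted(current_files()))  # ['part-1.parquet']
```

Because readers only ever see fully written, renamed commit files, a crashed or in-flight write is simply invisible; this write-then-rename discipline is the kernel of how a log-structured table gets atomicity on plain file or object storage.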
In their latest trends on data management, analysts have come up with several new concepts, such as:
- Data LakeHouse
- Data Fabric
- Big Data to Small & Wide Data
- Distributed SQL
None of these concepts and architectures is entirely new; each emerges from existing architectures by refining them, bringing together the best of traditional data management and new-age data management.
The Data Lakehouse will further help enterprises bring more data into their analytics scope while providing the data governance and quality typical of the data warehousing era.
Here is a pictorial representation of the Data Lakehouse by Databricks: