Data has always been central to organizations, even before the emergence of big data and machine-generated data. As data volumes grow, however, the need to manage and harness data effectively is increasingly acknowledged. Organizations now focus heavily on data consolidation, storage, security, and analytics to keep data safe and derive insights from it. Data management has become a key organizational priority, and enterprises can choose from numerous data management architectures based on their data needs.
Operational Data Store (ODS): A database that consolidates the most recent transactional data from diverse systems and makes it available for reporting. It is generally used with OLTP (Online Transaction Processing) systems.
Data Warehouse: A data management system with data derived from varied sources like log files, ODSs, applications, and more. The architecture of a data warehouse supports the storage of vast amounts of historical data and is well suited to serve complex querying, BI (Business Intelligence), and analytics needs.
Data Mart: A subset of a data warehouse containing data pertaining to only a department or line of business and made available to a defined user group. This enables these users to quickly access relevant data for insights.
Data Catalog: This contains detailed information about all the data in the organization. A data catalog leverages metadata (data about data) to create this searchable data inventory. This helps understand data and helps locate required data quickly.
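The searchable-inventory idea behind a data catalog can be sketched in a few lines. This is a minimal illustration, not any particular catalog product; all dataset names, owners, and fields here are hypothetical:

```python
# Hypothetical catalog entries: metadata (data about data) describing
# datasets, not the datasets themselves.
catalog = [
    {"name": "sales_orders", "owner": "finance",
     "tags": ["orders", "revenue"],
     "location": "warehouse.sales.orders",
     "description": "Daily order transactions"},
    {"name": "web_clicks", "owner": "marketing",
     "tags": ["clickstream"],
     "location": "lake/raw/clicks/",
     "description": "Raw website click events"},
]

def search(term):
    """Return catalog entries whose metadata mentions the search term,
    letting users locate required data quickly without scanning the data."""
    term = term.lower()
    return [entry for entry in catalog
            if term in entry["name"]
            or term in entry["description"].lower()
            or any(term in tag for tag in entry["tags"])]

print([entry["name"] for entry in search("revenue")])  # ['sales_orders']
```

Real catalogs add lineage, ownership workflows, and automated metadata harvesting on top of this basic inventory-plus-search pattern.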
Unstructured Data and Streaming Data: With devices, websites, services, and other sources generating data continuously, the syntax and semantics of incoming data are not always structured or clearly defined at ingestion time; they are often determined only when the data is retrieved. This has pushed enterprises toward new kinds of data architectures that support unstructured and streaming data.
Data Lake: A data lake accommodates and stores data as-is, in any format, without mandating a predefined schema. It centrally stores big data from diverse sources in its raw form, whether structured, semi-structured, or unstructured. This flexibility saves the time otherwise spent defining a schema upfront and is possible because a data lake works on the 'schema-on-read' principle: stored data is mapped onto a defined schema only when it is read.
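The schema-on-read principle can be illustrated with a small sketch. The raw records and field names below are hypothetical; the point is that no schema is enforced when the data lands, and a schema is projected onto each record only at read time:

```python
import json

# Raw events landed in the lake as-is: no schema was enforced at
# ingestion, so fields and types vary from record to record.
raw_records = [
    '{"user": "alice", "amount": "42.50", "country": "US"}',
    '{"user": "bob", "amount": 10}',                       # no country field
    '{"user": "carol", "amount": "7.25", "extra": true}',  # unexpected field
]

def read_with_schema(raw_lines):
    """Apply a schema only at read time ('schema-on-read'): project each
    raw record onto the columns the consumer needs, coercing types and
    filling defaults for missing fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            "user": str(record.get("user", "unknown")),
            "amount": float(record.get("amount", 0)),
            "country": record.get("country", "N/A"),
        }

rows = list(read_with_schema(raw_records))
print(rows[1])  # {'user': 'bob', 'amount': 10.0, 'country': 'N/A'}
```

The trade-off is visible even at this scale: ingestion is trivial, but every reader must carry the logic for coercing types and handling missing or unexpected fields.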
So are data lakes the answer to all the challenges of unstructured data management? Although data lakes have brought many new capabilities to the enterprise data landscape, they have limitations too: in particular, they lack the well-structured, well-defined data management features of traditional data warehouses and ODSs.
A data lakehouse combines the capabilities of data warehouses and data lakes. It implements data warehouse-like data structures and data management features on the kind of low-cost storage used for data lakes. Data lakehouses also keep consolidated data from varied systems up to date, giving data analysts one-stop, quick access to the reliable data they need for data science, BI, and AI/ML. Thus, data lakehouses help bring more data into the scope of analytics while preserving data governance and quality, the key strengths of data warehouses.
While data lakehouses sit on the same underlying data lake backbone, they need additional capabilities to support ACID transactions, schema enforcement, governance, and other traditional enterprise features. This need has spurred the growth of another technology framework: Delta Lake.
Delta Lake is an open-source storage layer that brings reliability to data lakes and enables building a lakehouse architecture on top of them. It supports ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing on top of existing data lake storage such as S3, ADLS, GCS, and HDFS.
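Delta Lake achieves this by recording every table change as an ordered series of JSON commit files in a `_delta_log` directory, which readers replay to reconstruct the table state. The toy sketch below illustrates that transaction-log idea in plain Python; it is not the real Delta Lake implementation, and the file names and action format are simplified stand-ins:

```python
import json
import os
import tempfile

# A table directory with a _delta_log subdirectory, mimicking the layout
# Delta Lake uses on data lake storage.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

def commit(actions):
    """Write the next numbered commit file. os.O_EXCL makes creation fail
    if another writer already claimed that version, which is the kind of
    atomic primitive an ACID log can build on."""
    version = len(os.listdir(log_dir))
    path = os.path.join(log_dir, f"{version:020d}.json")
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    with os.fdopen(fd, "w") as f:
        json.dump(actions, f)
    return version

def snapshot():
    """Replay all commits in order to derive the current set of data files
    that make up the table."""
    files = set()
    for name in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    files.add(action["file"])
                elif action["op"] == "remove":
                    files.discard(action["file"])
    return files

commit([{"op": "add", "file": "part-0001.parquet"}])
commit([{"op": "add", "file": "part-0002.parquet"}])
commit([{"op": "remove", "file": "part-0001.parquet"}])
print(sorted(snapshot()))  # ['part-0002.parquet']
```

Because readers only ever see fully written, numbered commits, a query gets a consistent snapshot of the table even while writers append new versions, which is the essence of bringing ACID guarantees to files on object storage.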
To conclude, data management analysts have come up with several new architectures and concepts, like the ones below, to address the challenges and deficiencies of currently available architectures for the varied data types and formats that organizations must handle:
- Data Lakehouse
- Data Fabric
- Big Data to Small and Wide Data
- Distributed SQL
Each of these concepts and architectures, while not entirely new, has emerged from existing architectures by refining them and infusing the best of traditional data management with new-age data management. The following image provides a pictorial representation of the high-level differences between the architectures. Image Source: databricks.com
GS Lab | GAVS offers data management solutions using a combination of multi-modal databases, blockchain technologies, and data mining. These leading-edge solutions are focused on data security, privacy, interoperability, and regulatory compliance while bringing more trust and transparency into data management.
Please Note: Content in this blog may have been taken from different online sources, with a link to source given where possible. The aim of these technical blogs is to provide consolidated information on a topic, before delving into GS Lab | GAVS specifics/inputs.