In this blog post
Today it is said that data is the new oil. I will also add that data is the new gold. Industry 4.0 is focused on data. Data is now considered one of the most important commodities. ‘Big data’ has become an inevitable reality today but bigger isn’t always better, Big insights are more important than Big data. Now to extract value from gold or oil, it needs to be processed – fashioned into jewellery, minted into coins or refined to produce different petroleum products. Similarly, data must be processed and held in a vault (database or datastore). Big insights can be possible only with the right database for daily operations. An explosion of consumer data has enabled IT companies and giants to shift the pH of their databases from ACID to BASE. Let’s see how.
In the early years of computers, ‘punch cards’ were used for input, output, and data storage. Punch cards offered a fast way to enter data, and to retrieve it. After Punch cards, databases came along. Database Management Systems allowed us to organize, store, and retrieve data from a computer. It is a way of communicating with a computer’s “stored memory.” Airlines were one of the first industries that identified the need for relational databases. The SABRE system was used by IBM to help American Airlines manage its data. The datastores have started to evolve from the primitive approach of CODASYL to SQL (ACID) to NoSQL (BASE).
The idea of transactions, their semantics, and guarantees, evolved with data management. As computers became more powerful, they were tasked with managing more data. Eventually, multiple users shared data on a machine. This led to problems of data being changed or overwritten while other users were in the middle of a calculation. This was an issue that needed addressing. Thus, the academics were called in and they came up with the ACID properties for transactions that could solve the consistency issues.
In the context of databases, a sequence of database read/write operations that satisfy the ACID properties (these can be perceived as a single logical operation on the data) is called a transaction.
To understand the importance of transactions, consider this analogy -transferring money from one account to another. This operation includes the below two steps:
- Deduct the balance from the sender’s bank account
- Add the amount to the receiver’s bank account
Now think of a situation where the amount is deducted from the sender’s account but is not transferred to the receiver’s account due to some errors. Such issues are managed by the transaction management, where both steps are performed in a single unit. In the case of a failure, the transaction should be roll-backed.
Below are the basic tenets of the ACID Model, a set of guidelines for ensuring the accuracy of database transactions.
It is the guarantee that a series of operations either succeed or fail together, because all components of a transaction are treated as a single action. If one part of a transaction fails, the database’s state remains unchanged, there are no partial updates.
For example, a business transaction might involve confirming a shipping address, charging the customer and creating an order. If one of these steps fails, all should fail.
Consistency is the second stage of the ACID model. A transaction either creates a new and valid state of data or, if any failure occurs, returns all data to its state before the transaction was started.
For example, a column in a database may only have the values for Days as “Monday” to “Sunday”. If a user were to introduce a new day, then the consistency rules for the database would not allow it.
Transactions require concurrency control mechanisms, and they guarantee correctness even when being interleaved. Isolation brings us the benefit of hiding uncommitted state changes from the outside world, as failing transactions shouldn’t ever corrupt the state of the system. Isolation is achieved through concurrency control using pessimistic or optimistic locking mechanisms
Here is an example: If Bob issues a transaction against a database while Harry issues a different transaction, both transactions should operate on the database in isolation. The database should either perform Bob’s entire transaction before executing Harry’s or vice-versa. This prevents Bob’s transaction from reading intermediate data produced as a side effect of part of Harry’s transaction that will not eventually be committed to the database.
|Bob’s Transaction||Harry’s Transaction|
|Read Bob balance|
|Deduct $100 for||Read Bob’s balance|
|movie||which is $900 not|
|Update account with||Add $600 to Bob’s|
|Update Bob’s account|
|(total = $1500 not|
It is important to note that the isolation property does not ensure that a specific transaction will execute first, only that they will not interfere with each other.
After the successful completion of a transaction in the system, the data remains in the correct state, even in case of a failure and system restart.
The Need for BASE Models
Let’s go through the lifecycle of an application to understand the need for the BASE model. Let’s suppose an e-commerce application is developed. At the initial soft launch, the database is moved from a local workstation to a shared, remotely hosted MySQL instance with a well-defined schema. As soon as the application becomes popular, a problem arises. There are just too many reads hitting the database.
This is quite usual with any application. The first attempt would be to cache frequently executed queries. Generally, memcached or any third-party cache providers like EHCache, OSCache, are employed
for caching. But note that the reads are no longer in compliance with the ACID model. The data is inconsistent because it is in more than one place. This also means that the cache is serving older/stale data till the time DB updates the cache.
As the application’s popularity grows, new features like faceted search, on-page check out, customer reviews, live chat, etc. are introduced. If each feature was in its table, hundreds of joins would be required to prepare such a page. This would increase query complexity. To avoid too many joins, denormalization must be done.
If the application’s popularity surges further, it will swamp the server and slow things down. Thus, server-side computations such as stored procedures must be moved to the client-side. Even after this, there would be some queries that are still slow. So, periodically the most complex queries are pre-materialized, and joins are avoided in most cases.
Now, the reads might be okay, but writes are getting slower. Thus, the secondary indexes and triggers are dropped. At this point the DB is left with:
- No ACID properties due to caching.
- No normalized schema due to denormalization
- No stored procedures, triggers, and secondary indexes.
The ACID model is an overkill or would hinder the operation of the database. These issues gave birth to a softer model called BASE, which is extensively used by the NoSQL datastores.
Basic tenets of BASE model
- Basic Availability
The datastore does guarantee availability, in the presence of multiple failures. Thus, the database appears to work most of the time because of replication.
- Soft State.
Soft State indicates that the state of the system may change over time, even without input. This is because of the eventual consistency model. In a way, datastores don’t have to be write-consistent or mutually consistent all the time.
- Eventual Consistency (Weak consistency)
When multiple copies of the data reside on separate servers, an update may not be immediately made to all copies simultaneously. So, the data is inconsistent for a period of time, but the database replication mechanism will eventually update all the copies of the data to be consistent.
Suitability of the ACID or BASE model varies case-by-case and depends on the read and write patterns. Transactions are omnipresent in today’s enterprise systems, providing data integrity even in highly concurrent environments. So, choose ACID when there is a need for strong consistency in transactions and the schema is fixed.
In the age of IoT, AI/ML, High-Performance Computing is inevitable, and the computing requirements are astronomical. Eventual consistency gives the IT giants edge over others in the industry by enabling their applications to interact with customers across the globe, continuously, with the necessary availability and partition tolerance. All this, while keeping their costs down, systems up, and their customers happy. So, go for the BASE model datastores when there’s a high priority for availability and scalability and the schema is evolving. At the same time, BASE datastores don’t offer guaranteed consistency of replicated data at write time but in the future. BASE consistency model is primarily used by aggregate stores, including column family, key-value and document stores. Hbase, SOLR, cassandra, Elastic search are based on BASE models. Every relational database such as MySQL, postgresql, oracle and Microsoft SQL, support ACID properties of transactions.
Bargunan is a Big Data Engineer and a programming enthusiast. His passion is to share his knowledge by writing his experiences about them. He believes “Gaining knowledge is the first step to wisdom and sharing it is the first step to humanity.”