Bargunan Somasundaram

The amount of data we produce every day is truly mind-boggling. To leverage all this data, is a SQL-based RDBMS good enough? Is NoSQL better than SQL? In this age of XaaS, where everything is offered as a Service, is NewSQL relevant?

Let’s try to find answers for the above questions.

Today’s users are pampered with rapid response times. In a web-based application, if a web page doesn’t refresh within milliseconds customers get quickly frustrated. If a website is down, customers fear it’s the end of civilization! If a major e-commerce website has an outage, it makes global headlines.

The performance of an application depends mainly on how the data is stored and how efficiently it is used for computations. Faced with new data types at extreme scale, businesses are increasingly choosing alternative methods to store and use their data. Choosing the right alternative takes their business to the next level.

For a high performant application, the following factors need constant tuning.

Scalability: Applications should be able to handlegrowing amount of workload by adding resources to the system.

High Availability: Minimizing the downtime isimportant since the applications must run 24×7 and be resilient to failure.

High Performance: As the application scales, itsperformance also must remain stable and optimal. At the extreme end, Amazon estimates it loses $1.6 billion a year for each additional second taken for one of its pages to load!

Velocity: As web connected sensors are increasinglybuilt into machines (your smartphone being the obvious one), transactions can repeatedly arrive at millions of transactions per second.

Real Time Analytics: Nightly batch processing andBusiness Intelligence is no longer acceptable. The line between analytical and operational processing is becoming blurred, and there are demands for real-time decision making.

Understanding the Structure, schema and model of data.

“Not every data is created equal”

Some are structured, others are either semi-structured, or unstructured. Understanding the distinction between them is vital since they are directly related to choosing the type of database technologies, persistent storage, query pattern and processing required for a high performant application.

Structured data

  • Structured data is pretty straightforward to analyze since it adheres to a pre-defined model, that is, it is represented by a logical data model that is defined by a schema in a database
  • It is based on Relational database table since the schema is fixed
  • Matured transactions and various concurrency techniques can be applied
  • It can be versioned over tuples, rows and tables
  • It is less flexible since it is schema dependent
  • It is difficult to scale DB schema for Structured data
  • Since it has a fixed data model, it is extremely efficient in quickly aggregating data from various locations in the database

But structured data represents only 5% to 10% of all informatics data. So, let’s introduce semi structured data.

Semi-structured data

Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms

of data tables, but nonetheless contain tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

Therefore, it is also known as self-describing structure. Examples of semi-structured data include JSON and XML.

  • NoSQL databases are considered semi-structured
  • Transaction adapted from DBMS not matured
  • Versioning over tuples or graph is possible
  • It is more flexible than structured data but less than that of unstructured data

Unstructured data

Unstructured data has an internal structure but is not structured via pre-defined data models or schema. It may be textual or non-textual, and human or machine generated. It may also be stored within a non-relational database like NoSQL.

Typical human-generated unstructured data

Text files, Email, Social Media content, Website content, Mobile data like Text messages, locations, Media files like MP3, digital images etc., business applications like MS Office and Libre.

Typical machine-generated unstructured data

Satellite imagery data, Scientific data like oil and gas exploration, space exploration, seismic imagery, atmospheric data, digital surveillance like surveillance photos and videos and Sensor data from traffic, weather, oceanographic sensors.

  • Unstructured data is based on character and binary data
  • Transaction management and concurrency techniques cannot be enforced on unstructured data
  • On the scalability and flexibility front, it is highly scalable and flexible since it is schema-less
  • Only textual queries are possible

Key factors for choosing SQL (RDBMS)

Structure

A SQL Database follows a table like structure which can have an unlimited number of rows and every data present inside the database is properly structured with Predefined Schema, i.e. columns (attributes), rows (records) and keys have constrained logical relationships.

Transactional Properties

A SQL Database is stable and is always preferred when doing heavy duty transactions. The reason being it keeps the data very secure through data integrity and atomicity. It follows ACID Property which stands for Atomicity, Consistency, Isolation, and Durability.

Scalability A SQL Database is always vertically scalable. If there is heavy load on the server, then it can be handled by increasing the CPU, SSD, RAM, etc. This will work only on a single server. Also, the onus of sharding or partitioning is often on the user and is not well supported in SQL. Thus, the load will be high on the single node server.

Support and communities

SQL databases are backed by massive communities, stable codebases, and proven standards. Multitudes of examples are posted online, and experts are available to support those new to programming relational data.

Popular SQL databases and RDBMSs

  • MySQL
  • Oracle
  • Sybase
  • MS SQL Server
  • MariaDB
  • PostgreSQL

Choose SQL when,

  • There is logical relation within the discrete data requirements, which can be identified up-front i.e. Structured data.
  • Handling query intensive complex databases.
  • Data integrity is essential.
  • Standards-based proven technology with good developer experience and support is required.

Key factors for choosing NoSQL Structure

NoSQL databases do not stick to any format, but generally fit into one of the following four broad categories:

  • Column-oriented databases transpose row-oriented RDBMSs, allowing efficient storage of high-dimensional data and individual records with varying attributes.
  • Key-Value stores are dictionaries which access diverse objects with a key unique to each.
  • Document stores hold semi-structured data, objects which contain all their own relevant information, and which can be completely different from each other.
  • Graph databases add the concept of relationships (direct links between objects) to documents, allowing rapid traversal of greatly connected data sets.

Transactional Properties

NoSQL technologies adhere to the ‘CAP’ theorem, which says that in any distributed database, only two of the following properties can be guaranteed at once:

  • Consistency: Every request receives the mostrecent result, or an error. (Note this is different than in ACID)
  • Availability: Every request has a non-error result, regardless of how recent that result is.
  • Partition tolerance: Any delays or lossesbetween nodes will not interrupt the system’s operation.

Scalability

NoSQL databases scale horizontally, meaning that they can handle increased traffic simply by adding more servers to the database. NoSQL databases have the ability to become larger and more powerful, so they are great for handling large or constantly evolving data sets.

Support and communities

NoSQL technologies are being adopted quickly, but communities remain small and more fractured. However, many SQL languages are proprietary or associated with large single vendors, while NoSQL communities benefit from open systems and concerted commitment to onboarding users.

Popular NoSQL Databases

  • MongoDB
  • Apache’s CouchDB
  • HBase
  • Oracle NoSQL
  • Apache’s Cassandra DB

Choose NoSQL when,

  • There is unrelated, indeterminate or evolving data requirements i.e Unstructured or semi-structured data.
  • • While handling hierarchical data, NoSQL database proves to be a better fit.
  • • There are simpler project objectives, and need rapid development.
  • Speed and scalability are imperative.

What’s NewSQL?

NewSQL is a class of modern relational DBMSs that seek to provide the same scalable performance of NoSQL for OLTP workloads and simultaneously guarantee ACID compliance for transactions as in RDBMS. In other words, these systems want to achieve the scalability of NoSQL without having to discard the relational model of SQL and transaction support of the legacy DBMS.

Popular NewSQL Databases

  • ClustrixDB
  • NuoDB
  • CockroachDB
  • MemSQL
  • VoltDB
  • Percona TokuDB
  • ActorDB

Choose NewSQl when,

  • There are less complex applications, greater consistency i.e. Structured data.
  • Convenient standard tooling.
  • SQL influenced extensions.
  • More traditional data and query models for NoSQL-style clustering.

Which Applications Need NewSQL Technology?

Any application which requires very high ingest rates and fast response times (average 1-2 milliseconds), but also demands transactional accuracy provided by ACID guarantees — for example, customer billing.

Typical applications include:

  • Real-Time Authorization — For example,validating, recording and authorizing phone calls for analysis and billing purposes. Typically, 99.999% of database operations must complete within 50 milliseconds.
  • Real-Time Fraud Detection — Used to completecomplex analytic queries to accurately determine the likelihood of fraud before the transaction is authorized.
  • Gaming Analytics — Used to dynamically modifygaming difficulty in real-time based on ability and typical customer behavior. The aim is to retain existing customers and convert others from free to paying players. One client was able increase customer spend by 40% using these techniques, where speed, availability and accuracy are critical.
  • Personalized Web Adverts — To dynamicallyselect personalized web-based adverts in real time, record the event for billing purposes, and maintain the outcome for subsequent analysis

Conclusion

As a rule of thumb, consider SQL when there is a requirement for vertical scaling with ACID properties while NoSQL for horizontal scaling with BASE properties. NewSQL is best choice when working with Bigdata OLTP applications, Since NewSQL is enhancement of SQL providing horizontal scaling while maintaining ACID properties.