
Massively Parallel Processing (MPP)

Jun 14, 2019

by Dharmeswaran P

Big data is a term that describes the large volume of data that inundates businesses on a day-to-day basis. Algorithms that work well on “small” datasets crumble when the size of the data extends into terabytes. Organizations large and small are forced to grapple with problems of big data, which challenge the existing tenets of data science and computing technologies. The importance of big data doesn’t revolve around how much data you have, but what you do with it.

In the early 2000s, big data storage problems were addressed by companies like Teradata, which offer a unified architecture able to store petabytes of data. Teradata distributes datasets across multiple Access Module Processors (AMPs), which speeds up analytics.

Teradata Database is a highly scalable RDBMS produced by Teradata Corporation (TDC). It is widely used to manage large data warehousing operations with Massively Parallel Processing (MPP). It acts as a single data store that accepts many concurrent requests and complex Online Analytical Processing (OLAP) queries from multiple client applications.

Teradata's patented Parallel Database Extension (PDE) software is installed on the hardware. PDE divides a system's processor into multiple virtual software processors, each of which acts as an individual processor and can perform all tasks independently. Similarly, the disk hardware is divided into multiple virtual disks, one for each virtual processor. Because no virtual processor shares its processing or storage with another, Teradata is called a shared-nothing architecture.
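The shared-nothing idea can be illustrated with a small Python sketch. This is not Teradata code: the `VirtualProcessor` class, the `parallel_sum` function, and the round-robin placement are hypothetical names and choices made only for illustration. Each virtual processor owns its own list of rows (its "virtual disk"), computes a partial result on that data alone, and the partial results are merged at the end.

```python
from concurrent.futures import ThreadPoolExecutor


class VirtualProcessor:
    """Hypothetical stand-in for one virtual processor and its virtual disk."""

    def __init__(self, vproc_id):
        self.vproc_id = vproc_id
        self.vdisk = []  # rows owned exclusively by this processor

    def store(self, row):
        self.vdisk.append(row)

    def local_sum(self, column):
        # Each processor scans only its own disk; nothing is shared.
        return sum(row[column] for row in self.vdisk)


def parallel_sum(vprocs, column):
    """Run the local aggregation on every processor, then merge the partials."""
    with ThreadPoolExecutor(max_workers=len(vprocs)) as pool:
        partials = list(pool.map(lambda vp: vp.local_sum(column), vprocs))
    return sum(partials)


vprocs = [VirtualProcessor(i) for i in range(4)]
rows = [{"order_id": i, "amount": i * 10} for i in range(1, 9)]

# Round-robin placement, used here only to keep the sketch short.
for i, row in enumerate(rows):
    vprocs[i % len(vprocs)].store(row)

total = parallel_sum(vprocs, "amount")
print(total)
```

Because every processor works only on the rows it owns, adding processors adds both compute and storage without creating contention on a shared resource.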

Teradata uses parallel processing, and its most important aspect is spreading the rows of a table evenly among the AMPs that read and write data. A hashing algorithm determines which AMP is responsible for storing and retrieving a given row. The algorithm is deterministic: it generates the same 32-bit hash value whenever the same data value is passed into it.
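A minimal sketch of hash-based row placement follows. It assumes a stand-in hash function (`zlib.crc32`, which also yields 32 bits) and a simple modulo step to pick the AMP. Teradata's actual hashing algorithm and its mapping from hash value to AMP are different, and `NUM_AMPS`, `row_hash`, and `amp_for` are hypothetical names used only here.

```python
import zlib
from collections import Counter

NUM_AMPS = 4  # hypothetical number of AMPs in the system


def row_hash(data_value):
    """Return a 32-bit hash of the value (crc32 is a stand-in, not Teradata's algorithm)."""
    return zlib.crc32(str(data_value).encode("utf-8")) & 0xFFFFFFFF


def amp_for(data_value, num_amps=NUM_AMPS):
    """Pick the AMP that stores and retrieves the row (modulo is a simplification)."""
    return row_hash(data_value) % num_amps


# The same value always hashes to the same AMP, so a lookup goes straight
# to the AMP that stored the row.
assert amp_for("customer-42") == amp_for("customer-42")

# Count how many of 10,000 sample keys land on each AMP.
placement = Counter(amp_for(customer_id) for customer_id in range(10_000))
for amp, row_count in sorted(placement.items()):
    print(f"AMP {amp}: {row_count} rows")
```

Printing the counts per AMP shows how evenly a given set of keys spreads. A column with few distinct values would concentrate rows on a few AMPs, which is why the choice of the hashed value matters.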

Tools and Utilities

Teradata Studio – Client-based graphical interface for database administration and query development.

Teradata Parallel Transporter (TPT) – Parallel and scalable utility for loading data from, and unloading data to, external sources.

Viewpoint – Provides Teradata customers with a Single Operational View (SOV): system management and monitoring across the enterprise for both administrators and business users.

Row-level security (RLS) – Restricts data access on a row-by-row basis in accordance with site security policies.

Workload Management – A workload is a class of database requests with common traits whose access to the database can be managed with a set of rules. Workload management is the act of managing Teradata Database workload performance by monitoring system activity and acting when pre-defined limits are reached.
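The workload management idea (classify requests, monitor activity, act when a limit is reached) can be sketched in a few lines of Python. This is an illustration only and does not reflect Teradata's workload management interface: `WorkloadRule`, `Request`, `WorkloadManager`, and the concurrency limit are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkloadRule:
    name: str            # workload class, e.g. "reporting"
    max_concurrent: int  # pre-defined limit for this class


@dataclass
class Request:
    user: str
    workload: str


class WorkloadManager:
    """Hypothetical manager that delays requests once a workload hits its limit."""

    def __init__(self, rules):
        self.rules = {rule.name: rule for rule in rules}
        self.running = {rule.name: 0 for rule in rules}
        self.delayed = []

    def submit(self, request):
        rule = self.rules[request.workload]
        if self.running[rule.name] >= rule.max_concurrent:
            self.delayed.append(request)  # limit reached: hold the request
            return "delayed"
        self.running[rule.name] += 1
        return "running"

    def finish(self, workload):
        """Mark one request as done and release the oldest delayed one, if any."""
        self.running[workload] -= 1
        for i, request in enumerate(self.delayed):
            if request.workload == workload:
                self.running[workload] += 1
                return self.delayed.pop(i)
        return None


manager = WorkloadManager([WorkloadRule("reporting", 2), WorkloadRule("etl", 1)])
statuses = [manager.submit(Request(f"user{i}", "reporting")) for i in range(3)]
released = manager.finish("reporting")
print(statuses, released)
```

Here the third reporting request is held until one of the first two finishes, while requests in the `etl` class are limited independently.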

Hadoop and Cloud Connectors

Teradata Connector for Hadoop (TDCH) – Bi-directional data movement utility between Hadoop and Teradata that runs as a MapReduce application inside the Hadoop cluster.

QueryGrid – Teradata-Hadoop connector that provides a SQL interface for transferring data between Teradata Database and remote Hadoop hosts.

IntelliCloud – Secure cloud offering that provides data and analytic software as a service (SaaS). It lets an enterprise focus on data warehousing and analytic workloads and rely on Teradata for the setup, management, maintenance, and support of the software and infrastructure, either in Teradata data centers or on public cloud from Amazon Web Services (AWS) and Microsoft Azure.

Use Case in the Automobile Industry

A well-known automobile company used Teradata in its product development process. As part of this initiative, employees volunteered to have their vehicle data collected via the OpenXC interface and stored in Teradata for near-real-time analysis.

  • A Controller Area Network (CAN) interface is installed in each participating vehicle
  • Vehicle data is streamed to the phone via Bluetooth
  • Data is collected on the phone using a mobile app
  • Every day, the app transmits a large amount of data (~1 TB) to the Teradata server over the participant's home broadband Internet connection
  • The database is ~200 TB in size. The data is cleaned and standardized for easy, in-depth analysis. Once the data is transformed, the target database size is ~500 GB per day, which the reporting application visualizes
  • The data is then available for engineers to transform and analyze along different dimensions for reporting
  • Using this analytics engine, they have analyzed fuel economy, on-road conditions, performance of car equipment, failures, battery charge limits, and so on

Summary

Massively parallel processing is the fundamental principle behind Teradata's ability to ingest petabytes of data and process it at high speed. Teradata positions its product as the most scalable, flexible, cloud-capable Enterprise Data Warehouse in today's market, and as the world's first parallel database designed to analyze data rather than simply store it. And, it continues in our