AIOps for Service Reliability Engineering (SRE)

Data is the single most critical yet most siloed component within any IT infrastructure. According to a Gartner report, an average enterprise IT infrastructure generates up to 3 times more IT operational data with each passing year. Large businesses are challenged by frequent unplanned service downtime, long IT issue resolution times, and consequently poor user experience. These challenges stem from inefficient management of this data overload, reactive IT operations, and other factors such as:

  • Traditional legacy systems that do not scale
  • Siloed environments preventing unified visibility into the IT landscape
  • Unattended warning signs due to alert fatigue
  • Lack of advanced tools to intelligently identify root causes of cross-tier events
  • Multiple hand-offs that require manual intervention affecting problem remediation workflow

Managing data and automation with AIOps

The rise of AI in IT operations, or AIOps, is helping bridge the gap between the need for meaningful insights and human intervention, to ensure service reliability and business growth. AIOps is fast becoming a critical need since effective management of these humongous data volumes has surpassed human capabilities. AIOps is powered by AI/ML algorithms that enable automatic discovery of infrastructure & applications, 360° observability into the entire IT environment, noise reduction, anomaly detection, predictive and prescriptive analytics, and automatic incident triage and remediation.

AIOps provides clear insights into application & infrastructure performance and user experience, and alerts IT on potential outages or performance degradation. AIOps delivers a single, intelligent, and automated layer across all IT operations, enabling proactive & autonomous IT operations, improved operational efficiency through reduction of manual effort, fatigue, and errors, and improved user experience as predictive & prescriptive analytics drive consistent service levels.

The Need for AIOps for SRE

SRE mandates that the IT team always stays ahead of IT outages and proactively resolves incidents before they impact the user. However, even the most mature teams face challenges due to rapidly increasing data volumes and expanding IT boundaries created by modern technologies such as cloud and IoT. SRE teams also face challenges such as lack of visibility and technology fragmentation while executing these tasks in real time.

SRE teams have started to leverage AI capabilities to detect & analyze patterns in the data, eliminate noise & gain meaningful insights from current & historical data. As AIOps enters the SRE realm, it has enabled accelerated and automated incident management and resolution. With AI at the core, SRE teams can now redirect their time towards strategic initiatives and focus on delivering high value to users.

Transform SRE with AIOps

SREs are moving towards AIOps to achieve these main goals:

  • Improved visibility across the organization’s remote & distributed systems
  • Reduced response time through automation
  • Prevention of incidents through proactive operations

ZIF™, the AIOps platform from GAVS, allows enterprises focused on digital transformation to get proactive with IT incidents, by delivering AI-led predictions and auto-remediation. ZIF is a unified platform with a centralized NOC, powered by AI-led capabilities for automatic environment discovery, going beyond monitoring to observability, predictive & prescriptive analytics, and automation & self-remediation, enabling outcomes such as:

  • Elimination of digital dirt
  • IT team empowered with end-to-end visibility
  • Breaking down the silos in IT infrastructure systems and operations
  • Intuitive visualization of application health and user experience from the digital delivery chain
  • Increasingly precise intelligent root cause analysis that drastically cuts resolution time (MTTR)
  • ML algorithms for continuous learning from the environment driving huge improvements with time
  • Zero-touch automation across the spectrum of services, including delivery of cloud-native applications, traditional mainframes, and process workflows

The future of AIOps

Gartner predicts rapid growth in the AIOps market from USD 1.5 billion in 2020. Gartner also asserts that the future of IT operations cannot function without AIOps, due to these four main drivers:

  • Redundancy of traditional approaches to handling IT complexities
  • The proliferation of IoT devices, mobile applications & devices, APIs
  • Lack of infrastructure to support IT events that require immediate action
  • Growth of third-party services and cloud infrastructure

AIOps has a strong role in five major areas: anomaly detection, event correlation and advanced data analysis, performance analysis, automation, and IT service management. However, to get the most out of AIOps, it is crucial to choose the right AIOps platform, as selecting the right partner is critical to the success of such an important organizational initiative. Gartner recommends prioritizing vendors based on their ability to address challenges, and on their data ingestion & analysis, storage & access, and process automation capabilities. We believe ZIF is that AIOps solution for you! For more on ZIF, please visit www.zif.ai.

5 Key Non-Traditional Areas of Cost Optimization in Healthcare

A recent analysis from the Office of the Actuary at CMS reports that national healthcare spending reached a total of USD 3.8 trillion before the pandemic hit. Since then, inpatient and outpatient volumes have seen steady declines of around 20% and 34%, respectively. As a result, the American Hospital Association (AHA) predicts that higher-than-usual supply expenses and other pandemic-related costs could lead to a USD 323 billion loss for hospitals and health systems across the country. To curb the mounting expenditure in the healthcare industry, healthcare CXOs are shifting focus to cost optimization strategies. However, before focusing on cost optimization, one must consider:

  • Evaluating the role of IT in improving patient care
  • Prioritizing technology initiatives to address medical costs
  • Being prepared for internal backlash on account of budget cuts

Here are five non-traditional areas that can be optimized to bring down healthcare costs.

Medical Waste Management and IoMT

Medical waste in hospitals can have a severe economic impact. The costs associated with medical waste range from USD 760 billion to USD 935 billion, as improper management invites increased premiums and out-of-pocket medical expenses, thus driving up healthcare costs. Investing in care coordinators, community health workers, and social workers can help improve quality and cut costs. Waste management monitoring through sensors and IoMT can help hospitals become more efficient and remain compliant with regulatory bodies such as OSHA.

Care Management and Process Optimization

In a dynamic environment, a lack of process standardization increases the chance of variation in patient outcomes, resulting in longer length of stay (LOS) and increased costs to the healthcare system. By leveraging the power of data and data-driven processes, healthcare providers can standardize processes and identify areas of care delivery that require improvement. Over time, the data can help implement evidence-based workflows that improve consistency and coordination across processes.

Diagnostic Testing and Predictive Analytics

A report states that the costs of diagnostic testing account for more than 10% of all healthcare costs. At the same time, there is a rise in death and disability, with more than a million people harmed by diagnostic errors each year in the US. The repercussions of wrongful diagnosis can incur both direct and indirect costs, including medicolegal costs and increases in medical liability premiums, among others. While there is still an ongoing debate on whether diagnostic tests are being overused or underused, the cost of unnecessary testing must be taken into consideration. With the growing use of technology in healthcare, hospitals and other medical facilities must leverage electronic health records (EHR) for information transfer across care teams. The data acquired from EHRs can be used in combination with new-age technologies such as predictive analytics and machine learning to minimize human errors and costly adverse events.

Data-driven Decision Making

The World Bank reported that the US leads the world in healthcare spending, with healthcare accounting for almost 18% of its GDP. One of the focus areas of cost optimization is the quality of consolidated patient information available to physicians, as it is critical to improving safety processes and quality of care. Lack of insight into patient information and process management leads to increased costs due to longer lengths of stay or higher complication rates. To manage frontline costs, it is imperative to establish a robust data-driven system. Systems such as an enterprise data warehouse (EDW) enable physicians to get real-time answers to clinical quality improvement queries, giving them the opportunity to analyze LOS and make necessary changes. Simply put, a hospital can improve its productivity by up to 26% by creating an environment for better decisions, thus creating more opportunities to optimize cost.

Supply Chain and Standardization

While hospitals focus on increasing volume, revenue, and growth, one of the areas that suffers from lack of attention is the supply chain. Gartner reports that the total healthcare supply chain cost averages 37.3% of the total cost of patient care. The supply chain in a hospital can be affected by inefficiencies, service duplication, and poor labor management. To optimize costs in this segment, hospitals must focus on standardizing processes, managing physician schedules, and leveraging health systems to handle patient access and flow. To build resilience within the healthcare supply chain, top management must implement preventive measures such as improved data analytics and supplier visibility, and external intelligence focused on the healthcare supply chain.

Having discussed the different focus areas for cost optimization, it is important to implement these strategies wisely. While implementing a cost optimization strategy, one must follow these four rules:

  • Have a clearly defined area of focus
  • Build a functioning operating model
  • Learn and implement the right lessons
  • Demonstrate sustainable value

According to Gartner experts, avoiding reactionary cost-cutting in favor of purposeful prioritization requires a partnership between CIOs, CFOs, and CEOs. To that end, Gartner recommends the following:

  • Adopt a scenario planning strategy to identify process and technology impacts using parameters such as TCO, ROI, and value creation
  • Expand IT portfolio in clinical processes using RPA, ML, AI, and NLP capabilities
  • Evaluate, review, and protect funding of transformative technology projects in different areas such as administrative modernization, patient engagement, etc.

Customizing OOTB IT Network Security Software Products

Sundaramoorthy S

As global IT is rapidly digitalized, IT OEMs (Information Technology Original Equipment Manufacturers) offer the network security requirements of major businesses as Out of The Box (OOTB) IT security products.

The products offered by OEMs adhere to global standards and regulations like ISO/IEC 27001, NIST, GDPR, CCPA, and PDPB, which leads businesses to buy licenses for these products with the intention of saving time and money. However, that intention is often defeated while integrating, deploying, and maintaining the product solution.

This article focuses on the customizations of OOTB products that should be avoided, and on steps for keeping customization of licensed products within reasonable limits.

Customization is desirable when it lies within the OOTB product’s scope. Moving beyond those limits leads to multiple operational challenges.

Customizations that are narrow in scope end up being under-utilized, and certain customizations can be done without altogether. It is ideal to conduct an analysis to validate whether the time and money invested in such customizations will give proportionate benefits/returns.

Product OEMs should be consulted on matters of future releases and implementations before taking such decisions. Choosing the right implementation partner is equally important. Failing to do so may result in issues in production systems, in terms of Audit, Governance, Security, and Operations. Realizing the flaw in later stages costs businesses heavily. Extensive testing must be conducted to ensure the end-to-end capabilities of the OOTB product are not violated.

Listed below are a few observations based on my discussions with executives who have faced such issues in ongoing and completed implementations.

Customizations to Avoid

  • OOTB products are customized by overwriting thousands of lines of code. This tightly couples the product to the network and makes future upgrades and migrations of the product complex.
  • Disregarding the recommendations of product architects & SMEs and customizing the existing capabilities of the products to meet the isolated requirements of a business leads to further hidden issues in the products. Ultimately, the business ends up demanding customization for its own sake, which defeats the intent of the OOTB product.
  • Ad-hoc customizations made to force the products to fit the existing enterprise architecture leave the network vulnerable.
    Below are some challenges:
    • In some cases, OOTB products are unable to consume the business data as-is
    • Some business users are unwilling to migrate to new systems, or the organization is unable to train users on the new systems
  • OOTB APIs are not utilized in places where they are required.

Cons of Customizing

  • OEMs provide support for OOTB features only and not for customized ones.
  • The impact of customizations on the product’s performance, optimization, and security is not always clear.
  • Audit and governance become unmanageable if the customizations are not end-to-end.
  • The above issues may lead to a lower return on investment on the customizations.

Steps to Avoid Major Customization

For New implementations

  • The roadmap and strategy should be derived from a detailed analysis of the current and future state while selecting the product solution.
  • PoCs for future-state requirements should be run with multiple products in the market that offer similar services, to select the right one.
  • A future requirements vs. product compliance matrix should be validated.
  • A gap analysis between the current state and future state should be executed through discussions with product owners and key stakeholders in the business.
  • Implementation partners could be engaged in these activities; they can refine the analysis and bring expertise from working with multiple similar products in the market, so that the selected product is the best fit in terms of cost and techno-functional requirements.

For existing implementations where the product solution is already deployed

  • OOTB product features should be utilized efficiently by vendors, partners, and service providers.
  • To utilize the OOTB product, massaging the existing dataset or minimal restructuring post risk analysis is acceptable. This exercise should be done before onboarding the product solution.
  • For any new requirement that is not OOTB, rather than customizing the product solution independently as an end-user (business entity), a collaborative approach with implementation partners and the OEM’s professional services should be taken. This helps address complex requirements without major roadblocks in terms of security and performance of the product solution already deployed in the network. In this approach, support from the product team is also available, which is a great plus.

Role of OEMs

OEMs should take the necessary efforts to understand the needs of the customers and deliver relevant products. This will help in ensuring a positive client experience.

Below are few things the OEMs should consider:

  1. OEMs should have periodic discussions with clients, service providers, and partners, and collect inputs to upgrade their product and remain competitive.
  2. Client-specific local customizations that could benefit global clients should be encouraged and implemented in the product.
  3. OEMs should implement the latest technologies and trends in OOTB products sooner rather than later.
  4. OEMs could use consistent technical terminology across products that offer similar services; currently, individual products use their own terminology, which is neither client- nor user-friendly.

Since security is the top priority for all, the improvements, tips, and pointers discussed above should be followed by all IT OEMs in the market who produce IT network security products.

Customizations in IT security products are not avoidable. However, they should be minimal, configurable, and based on business-specific requirements, rather than major enhancements.

OOTB vs Customization Ratio


About the Author –

Sundar has more than 13 years of experience in IT, IT security, IDAM, PAM, and MDM projects and products. He is interested in developing innovative mobile applications that save time and money. He is also a travel enthusiast.

Introduction to Shift Left Testing

Abdul Riyaz

Never stop until the very end.

The above statement encapsulates the essence of Shift Left Testing.

Quality Assurance should keep up the momentum of testing throughout the end-to-end flow. This ensures Quicker Delivery, a Quality Product, and Increased Revenue with higher Profitability, and it can transform the software development process. Let me elucidate how it helps.

Traditional Testing vs Shift Left Testing

For several decades, software development followed the Waterfall model, in which each phase depends on the deliverables of the previous phase. Over time, the Agile method provided a much better delivery pattern and reduced project delivery timelines. In the Agile model, testing is a continuous process that starts at the beginning of a project, which shortens timelines. Following the traditional approach of testing only after development eventually results in a longer timeline than anticipated.

Hence, it is important to start the testing process in parallel with the development cycle, using techniques such as ‘Business-Driven Development’, to make it more effective and reduce the delivery timeline. To keep Shift Left Testing intact, the AUT (Application Under Test) should be tested in an automated way. There are many proven test automation tools available in the IT industry today that serve this purpose well.
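
To make this concrete, here is a minimal sketch of the kind of automated check that can run in CI from the very first build. It is a generic illustration rather than a recommendation of any specific tool; the service URL, endpoints, and expected responses are hypothetical placeholders for your own AUT.

```python
# Minimal shift-left style automated checks, runnable with pytest.
# BASE_URL and the endpoints are hypothetical placeholders for your AUT.
import requests

BASE_URL = "http://localhost:8080"

def test_health_endpoint_returns_ok():
    # Cheap smoke test that can run on every commit, long before formal QA phases.
    response = requests.get(f"{BASE_URL}/health", timeout=5)
    assert response.status_code == 200

def test_order_without_items_is_rejected():
    # A business rule verified early in the cycle, not after development ends.
    response = requests.post(f"{BASE_URL}/orders", json={"items": []}, timeout=5)
    assert response.status_code == 400
```

Wired into the CI pipeline, such checks run on every commit, which is the essence of shifting testing left.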


End-to-End Testing Applied over Shifting Left!

Software testing can be broadly classified into three categories: Unit, Integration, and End-to-End Testing. Traditionally, not all testing shifts left from unit tests to system tests, but Shift Left Testing changes this approach. Unit Testing straightforwardly tests basic units of code, while End-to-End Testing validates the final product from the customer's or user's perspective. Bringing End-to-End testing to the left gives better visibility into the code and its impact on the entire product during the development cycle itself.

ML (Machine Learning) can best be leveraged to shift testing left towards design and development through continuous testing, visual testing, API coverage, scalable tests and extendable coverage, predictive analytics, and code-less automation.


First Time Right & Quality on Time

Shift Left Testing not only reduces delivery timelines, it also rules out last-minute defects: software flaws and failure conditions are identified and fixed during the development cycle itself, which eventually results in “First Time Right”. The chance of a defect leaking through is very low, and the time spent by development and testing teams on fixing and retesting the software product is also reduced, thereby increasing productivity and delivering “Quality on Time”.

I would like to refer to a research finding by the Ponemon Institute. It found that if vulnerabilities are detected in the early development process, they may cost around $80 on average. But the same vulnerabilities may cost around $7,600 to fix if detected after they have moved into production.


The Shift left approach emphasizes the need for developers to concentrate on quality from the early stages of a software build, rather than waiting for errors and bugs to be found late in the SDLC.

Machine Learning vs AI vs Shift Left Testing

There are opportunities to leverage ML methods to optimize continuous integration of an application under test (AUT), starting almost instantaneously. Making machine learning work is a comparatively small feat; feeding it the right data and the right algorithm is the tough task. In our evolving AI world, gathering data from testing is straightforward; making practical use of all this data within a reasonable time is what remains elusive. A specific instance is the ability to recognize patterns formed within test automation cycles. Why is this important? Patterns are present in the way design specifications change and in the methods programmers use to implement those specifications. Patterns also appear in the results of load testing, performance testing, and functional testing.

ML algorithms are great at pattern recognition. But to make pattern recognition possible, human developers must determine which features in the data might express valuable patterns. Collecting and wrangling the data into a usable form, and knowing which of the many ML algorithms to feed it into, is critical to success.
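
As a hedged illustration of that feature engineering step, the sketch below flags test cases whose latest execution time deviates sharply from their own history. The data frame columns and values are invented for the example; they are not a real test report format.

```python
# Sketch: spotting duration regressions across test automation cycles.
# Column names (test_id, cycle, duration_s) are illustrative assumptions.
import pandas as pd

runs = pd.DataFrame({
    "test_id":    ["login", "login", "login", "checkout", "checkout", "checkout"],
    "cycle":      [1, 2, 3, 1, 2, 3],
    "duration_s": [1.1, 1.2, 1.1, 4.0, 4.2, 9.5],
})

history = runs[runs["cycle"] < runs["cycle"].max()]
latest = runs[runs["cycle"] == runs["cycle"].max()]

stats = history.groupby("test_id")["duration_s"].agg(["mean", "std"]).reset_index()
merged = latest.merge(stats, on="test_id")

# Flag tests whose latest duration falls far outside their historical pattern.
merged["z_score"] = (merged["duration_s"] - merged["mean"]) / merged["std"].replace(0, 1e-9)
print(merged[merged["z_score"].abs() > 3][["test_id", "duration_s", "z_score"]])
```

The same idea extends to failure rates, flaky-test patterns, or load-test latencies; the hard part, as noted above, is choosing and preparing the right features.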

Many organizations are striving to induct shift left into their development process; testing and automation are no longer just QA activities. This indicates that the roles of dedicated developers or testers are fading away. Change is always challenging, but there are a few things every team can do to make this shift effective: training developers to take responsibility for testing, adding quality checks to code reviews, making testers aware of the code, using the same tools across teams, and always beginning with testability in mind.

Shifting left gives a greater ability to automate testing. Test automation provides some critical benefits:

  • Fewer human errors
  • Improved test coverage (running multiple tests at the same time)
  • QA engineers freed from day-to-day activities to focus on innovation
  • Fewer or no production defects
  • A seamless product development and testing model

Introducing and practicing Shift Left Testing improves the efficiency, effectiveness, and coverage of the testing scope in the software product, which helps delivery and productivity.


About the Author –

Riyaz heads the QA Function for all the IP Projects in GAVS. He has vast experience in managing teams across different domains such as Telecom, Banking, Insurance, Retail, Enterprise, Healthcare, etc.

Outside of his professional role, Riyaz enjoys playing cricket and is interested in traveling and exploring things. He is passionate about fitness and bodybuilding and is fascinated by technology.

Evolving Telemedicine Healthcare with ZIF™

Ashish Joseph

Overview

Telemedicine is a powerful tool that was introduced in the 1950s to make healthcare more accessible and cost-effective for the general public. It has helped patients, especially in rural areas, to virtually consult physicians and get prompt treatment for their illnesses.

Telemedicine empowers healthcare professionals to gain access to patient information and remotely monitor their vitals in real time.

In layman’s terms, Telemedicine is the virtual manifestation of the remote delivery of healthcare services. Today, we have 3 types of telemedicine services:

  • Virtual Consultation: Allowing patients and doctors to communicate in real time while adhering to HIPAA compliance
  • EHR Handling: Empowering providers to legally share patient information with healthcare professionals
  • Remote Patient Monitoring: Enabling doctors to monitor patient vitals remotely using mobile medical devices to read and transmit data.

Demand from a technology-embracing population has driven a higher rate of telemedicine adoption today.

Telemedicine can be operated in numerous ways. The standard format is a video or voice-enabled call using a HIPAA-compliant tool, based on the country of operation. Portable telemedicine kits with computers and medical devices are also used for video-enabled patient monitoring.


Need of the Hour

The COVID-19 pandemic has forced healthcare systems and providers to adapt to the situation by adopting telemedicine services to protect both doctors and patients from the virus. This has entirely changed how we will look at healthcare and consultation services going forward. The adoption of modern telemedicine services has proven to bring more convenience, cost savings, and intelligent new features that significantly enhance doctor and patient experience and engagement.

Continuous advancements and innovation in technology and healthcare practices are significantly improving the usability and adoption of telemedicine across the industry. In the next couple of years, the industry is expected to see massive integration of telemedicine services across practices in the country.


A paper titled “Telehealth transformation: COVID-19 and the rise of virtual care”, from the Journal of the American Medical Informatics Association, analyzes the adoption of telemedicine in different phases during the pandemic.

During the initial phase of the pandemic when the lockdown was enforced, telemedicine found the opportunity to scale as per the situation. It dramatically decreased the proportion of in-person care and clinical visits to reduce the community spread of the virus.

As casualties from the pandemic intensified, there was a peak in demand for inpatient consultations, met with the help of TeleICUs. These were well suited to meet the demands of inpatient care while reducing the spread of the virus, expanding human and technical resources, and protecting healthcare professionals.

As pandemic infection rates stabilized, telemedicine proactively engaged with patients and effectively managed contingencies. As restrictions relax with declining infection rates, these systems will shift from crisis mode to sustainable and secure operations that preserve data security and patient privacy.

The Future of Telemedicine

With the pandemic economy serving as an opportunity to scale, telemedicine has evolved into a cost-effective and sustainable system. Rapid advances in technology are enabling telemedicine to evolve even faster.

The future of Telemedicine revolves around augmented reality, with virtual interactions simulated in the same user plane. Both Apple and Facebook are experimenting with their AR technology and are expected to launch products soon.

Telemedicine platforms are now evolving like service desks to measure efficiency and productivity. This helps track the value delivered to patients and the organization.

The ZIF™ Empowerment

ZIF™ helps customers scale their telemedicine systems to be more effective and efficient. It empowers organizations to manage healthcare professionals and customer operations in a new-age digital service desk platform. ZIF™ is a HIPAA-compliant platform that leverages the power of AI-led automation to optimize costs, automate workflows, and improve overall productivity and efficiency.

ZIF™ keeps people, processes, and technology in sync for operational efficiency. Rather than focusing on traditional SLAs to measure performance, the platform focuses on end-user experience and results, using insights to improve each performance parameter.

Here are some of the features that can evolve your existing telemedicine services.

AIOps based Predictive and Prescriptive Analytics Platform

Patient engagements can be assisted with consultation recommendations based on treatment histories. Operations can be streamlined for higher productivity through quicker decision-making and resolutions. A unified dashboard helps track performance metrics and patient sentiment analytics.

AI based Voice Assistants and Chatbots

These provide a consistent patient experience and reduce the workload of healthcare professionals through automated responses and tasks.

Social Media Integration

Omnichannel engagement and integration of different channels for healthcare professionals to interact with their patients across social media networks and instant messaging platforms.

Automation

ZIF™ bots can help organizations automate their workflow processes through intuitive activity-based tools. The platform offers 200+ plug-and-play workflows for consultation requests and incident management.

Virtual Supervisor

Native machine learning algorithms aid the initial triaging of patient consultation requests by assigning priority and auto-routing tickets to the appropriate healthcare professional or group.
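
ZIF™’s internal algorithms are not described here; purely as a generic, hedged sketch of how ML-based triage of free-text requests can work, the snippet below trains a toy classifier that routes consultation requests to a specialty queue. The example texts and queue names are invented.

```python
# Illustrative only: routing free-text consultation requests to specialty queues.
# Training examples and queue names are invented for this sketch.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

requests_text = [
    "chest pain and shortness of breath",
    "irregular heartbeat after exercise",
    "skin rash and severe itching",
    "acne flare up on face",
    "child has high fever and cough",
    "infant not feeding well",
]
queues = ["cardiology", "cardiology", "dermatology", "dermatology",
          "pediatrics", "pediatrics"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
router.fit(requests_text, queues)

print(router.predict(["toddler with persistent cough and fever"]))   # likely pediatrics
print(router.predict(["irregular heartbeat and chest pain"]))        # likely cardiology
```

A production triage model would add priority scoring, confidence thresholds, and a human-in-the-loop fallback for low-confidence routes.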

ZIF™ empowers healthcare organizations to transform and scale to the changing market scenarios. If you are looking for customized solutions for your telemedicine services with the help of ZIF™, feel free to schedule a Demo with us today.

https://zif.ai/

About the Author –

Ashish Joseph is a Lead Consultant at GAVS working for a healthcare client in the Product Management space. His areas of expertise lie in branding and outbound product management.

He runs two independent series called BizPective & The Inside World, focusing on breaking down contemporary business trends and growth strategies for independent artists, on his website www.ashishjoseph.biz.

Outside work, he is very passionate about basketball, music, and food.

Anomaly Detection in AIOps

Vimalraj Subash

Before we get into anomalies, let us understand what AIOps is and its role in IT operations. Artificial Intelligence for IT Operations (AIOps) is the monitoring and analysis of the large volumes of data generated by IT platforms, using Artificial Intelligence and Machine Learning. It helps enterprises with event correlation and root cause analysis, enabling faster resolution. Anomalies and issues are all but inevitable, and it takes experience and talent to take them to closure.

Let us simplify the significance of anomalies and how they can be identified, flagged, and resolved.

What are anomalies?

Anomalies are instances when performance metrics deviate from normal, expected behavior. There are several ways in which this can occur. However, we’ll focus on identifying such anomalies using thresholds.

How are they flagged?

With current monitoring systems, anomalies are flagged based on static thresholds: constant values that define the upper limit of normal behavior. For example, CPU usage may be considered anomalous when it rises above 85%. When anomalies are detected, alerts are sent to the operations team to inspect.

Why is it important?

Monitoring the health of servers is necessary to ensure the efficient allocation of resources. Unexpected spikes or drops in performance metrics such as CPU usage might signal a resource constraint. The operations team needs to address these problems in a timely manner; failing to do so may result in the applications running on those servers failing.

So, what are thresholds, how are they significant?

Thresholds are the limits of acceptable performance. Any value that breaches a threshold is surfaced as an alert and should be investigated and resolved at the earliest. Note that thresholds are set at the monitoring tool level, so whenever a limit is breached, an alert is generated. Manually defined thresholds can be adjusted based on demand.

There are two types of thresholds (a minimal sketch of both approaches follows this list):

  1. Static monitoring thresholds: These thresholds are fixed values indicating the limits of acceptable performance.
  2. Dynamic monitoring thresholds: These thresholds adapt over time; this is what an intelligent IT monitoring tool does. The tool learns the normal range, both a high and a low threshold, at each point in a day, week, month, and so on. For instance, a dynamic system will know that high CPU utilization is normal during a backup window and abnormal at other times.
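
For illustration only (not tied to any particular monitoring product), here is a minimal sketch contrasting the two approaches on a series of CPU readings; the 85% static limit, the 12-sample rolling window, and the 3-sigma band are assumed example values.

```python
# Sketch: static vs. dynamic thresholds on CPU utilization readings.
# The 85% limit, 12-sample window, and 3-sigma band are example choices.
import pandas as pd

cpu = pd.Series([62, 65, 61, 63, 64, 66, 62, 63, 95, 64, 63, 62, 61, 65, 64, 63])

# Static threshold: one fixed limit for every reading.
static_alerts = cpu[cpu > 85]

# Dynamic threshold: learn the "normal" band from prior readings only (shift(1)).
rolling_mean = cpu.rolling(window=12, min_periods=4).mean().shift(1)
rolling_std = cpu.rolling(window=12, min_periods=4).std().shift(1)
upper_band = rolling_mean + 3 * rolling_std
dynamic_alerts = cpu[cpu > upper_band]

print("Static alerts:", static_alerts.to_dict())    # flags the 95% spike
print("Dynamic alerts:", dynamic_alerts.to_dict())  # flags it relative to recent history
```

A real dynamic-threshold engine would also learn seasonality, such as backup windows and time-of-day patterns, but the principle of comparing each reading against a learned band rather than a fixed constant is the same.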

Are there no disadvantages in the threshold way of identifying alerts?

Not quite. Like most things, thresholds have their fair share of problems. There are disadvantages to the static threshold way of doing things, although those of dynamic thresholds are minimal. We should also understand that with appropriate domain knowledge, there are many ways to overcome these.

Consider this scenario. Imagine a CPU threshold set at 85%. Anything that breaches it generates an anomaly alert. Now suppose running above that threshold is actually normal behavior for a particular Virtual Machine (VM). The monitoring tool will generate alerts continuously until the value drops below the threshold. Left unattended, this produces a flood of false positives, which may cause the team to miss the actual issue, disrupt the entire IT platform, and create unnecessary workload. Once an IT platform is down, it leads to downtime and losses for clients.

As mentioned, there are ways to overcome this with domain knowledge. Every organization has its own trade secrets for preventing it. With the right knowledge, this behavior can be corrected and swiftly resolved.

What do we do now? Should anomalies be resolved?

Of course, anomalies should be resolved at the earliest to prevent the platform from being jeopardized. There are many methods and machine learning techniques for this. Broadly, there are two major families of machine learning techniques: Supervised Learning and Unsupervised Learning; there are many articles on the internet one can go through to get an idea of these. In this article, we’ll discuss one unsupervised learning technique: Isolation Forest.

Isolation Forest

The algorithm isolates observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature.

The way the algorithm constructs the separation is by first creating isolation trees, or random decision trees. The anomaly score is then calculated from the path length required to isolate an observation. The following example shows how easily an anomalous observation can be separated:

[Image: scatter plot separating anomalous points (blue) from normal points (brown)]

In the above image, the blue points denote the anomalous points, whereas the brown ones denote the normal points. Anomaly detection allows you to detect abnormal patterns and take appropriate actions. One can use anomaly detection tools to monitor any data source and identify unusual behaviors quickly. It is good practice to research methods to determine the best organizational fit. Ideally, this means checking with clients, understanding their requirements, tuning the algorithms, and hitting the sweet spot that builds a lasting relationship between organization and client.
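
As a minimal, self-contained illustration of the technique (using scikit-learn’s IsolationForest as a generic example, not ZIF’s implementation), the sketch below fits the algorithm on mostly normal CPU/memory readings with a few injected outliers:

```python
# Sketch: flagging anomalous CPU/memory readings with Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Mostly "normal" readings: CPU around 60%, memory around 55%.
normal = rng.normal(loc=[60, 55], scale=[5, 5], size=(200, 2))
# A few injected anomalies: near-saturated CPU and memory.
anomalies = np.array([[97, 96], [99, 91], [95, 98]])
readings = np.vstack([normal, anomalies])

model = IsolationForest(contamination=0.02, random_state=0)
labels = model.fit_predict(readings)  # -1 = anomaly, 1 = normal

print("Points flagged as anomalous:")
print(readings[labels == -1])
```

The contamination parameter encodes the expected fraction of anomalies; in practice it is tuned, together with domain knowledge, to balance false positives against missed incidents.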

Zero Incident Framework™, as the name suggests, focuses on trending organizations towards zero incidents. With the knowledge we’ve accumulated over the years, anomaly detection in ZIF is made as robust as possible, resulting in exponential outcomes.


About the Author –

Vimalraj is a seasoned Data Scientist working with vast data sets to break down information, gather relevant points, and solve advanced business problems. He has over 8 years of experience in the Analytics domain and is currently a Lead Consultant at GAVS.

Site Reliability Engineering


Vasudevan Gopalan

Software engineering is akin to having children; the labor before birth is painful, and the labor after birth is where we dedicate most of our efforts😊.

Software engineering as a discipline spends more time talking about the first period, but research clearly suggests that 40-90% of the costs are incurred after the birth of the systems. These costs are incurred to keep the platforms reliable.

Why should platforms be reliable? Because the average consumer demands speed, convenience, and reliability from every digital experience. While availability focuses on the platform’s operational quotient, reliability focuses on the platform’s useful quotient.

Site Reliability Engineering is a practice and cultural shift towards creating a robust IT operations process that would instill stability, high performance, and scalability to the production environment.

Reliability is the most fundamental feature of any product; a system is not useful if nobody can use it!

Site Reliability Engineers (SREs) are engineers, applying the principles of computer science and engineering to the design and development of computing systems, generally large, distributed ones. As Ben Treynor Sloss of Google states, SRE is what happens when a software engineer is tasked with what used to be called operations. Automation, self-healing, scalability, resilience: these characteristics become mainstream.

An SRE function is run by IT operational specialists who code. These specialist engineers implement a software-first approach to automate IT operations and preempt failures. They apply cutting-edge software practices to integrated Dev and Ops on a single platform and execute test codes across the continuous environment. They possess advanced software skills, including DNS Configuration, remediating server, network, and infrastructure problems, and fixing application glitches.

The software approach codifies every aspect of IT operations to build resilience within infrastructure and applications. Thus, changes are managed via version control tools and checked for issues leveraging test frameworks, while following the principle of observability.


The Principle of Error Budget

SRE engineers verify the code quality of changes in the application by asking the development team to produce evidence via automated test results. SRE managers can fix Service Level Objectives (SLOs) to gauge the performance of changes in the application. They should set a threshold for permissible application downtime, also known as the Error Budget. If the downtime caused by changes in the application is within the Error Budget, SRE teams can approve them. If not, the changes should be rolled back and improved until they fall within the Error Budget.

Error Budgets tend to bring balance between SRE and application development by mitigating risks. An Error Budget remains intact as long as system availability stays within the SLO. The Error Budget can always be adjusted by managing the SLOs or enhancing IT operability. The ultimate goal remains application reliability and scalability.
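
As a quick back-of-the-envelope sketch (the 99.9% SLO and 30-day window are example values, not a prescription), the Error Budget is simply the complement of the SLO applied to the measurement window:

```python
# Sketch: converting an availability SLO into an error budget of allowed downtime.
SLO = 0.999                       # example: 99.9% availability target
minutes_in_window = 30 * 24 * 60  # a 30-day rolling window

error_budget_fraction = 1 - SLO
allowed_downtime_min = error_budget_fraction * minutes_in_window
print(f"Allowed downtime per 30 days: {allowed_downtime_min:.1f} minutes")  # ~43.2

# If changes have already consumed, say, 30 minutes of downtime this window,
# only the remainder of the budget is left for further risky releases.
consumed_min = 30
remaining_min = allowed_downtime_min - consumed_min
print(f"Remaining error budget: {remaining_min:.1f} minutes")
```

A release policy can then be as simple as: if the remaining budget is positive, risky changes may proceed; if it is exhausted, the focus shifts to reliability work.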

DevOps and SRE

We know that DevOps is all about culturally combining development and operations. While DevOps dwells on what needs to be done for this, SRE focuses on how it must be done.

DevOps brings the traditionally separate teams of development and operations under one roof to improve upon collaboration, communication, integration, and software releases. This is accomplished by the focus on end-to-end automation of builds and deployments as well as effectively managing the entire infrastructure as code.

SRE is a discipline that incorporates various aspects of software development and applies them to issues and tasks in IT operations specifically. The main objective of SRE is to develop highly reliable and ultra-scalable software applications or systems. The prime focus is to automate most, if not all, tasks to ensure reliability in the systems. The ‘relentless’ pursuit of automation in SRE helps brands eliminate manual work, giving developers more time to innovate and create.

Also, in comparison to DevOps, SRE provides a good set of detailed steps in each part of the framework to reach a particular goal.


While DevOps and SRE sound like they are on opposite sides of the spectrum, both approaches share the same end goals.

  • To make incremental changes fast and efficiently
  • To reduce the number of organization silos
  • To have a flexible, open-minded, and adaptable working culture
  • To use automation wherever possible
  • To monitor performance and improve when necessary

Let us briefly go back in time. In the old era of system administrators, the sysadmin’s work was mostly assembling existing software components and deploying them to work together to produce a service. As the system grows in complexity and traffic volume, a larger sysadmin team is needed, increasing both direct and indirect costs (differences with the dev team in terms of culture, background, skill set, goals, etc.) to the organization. While the dev team wants to launch new features, the ops team wants to maintain the status quo to ensure service continuity. Hence, the two teams’ goals are fundamentally in tension.

Toil is mundane, repetitive operational work providing no enduring value, which scales linearly with service growth. Taking humans out of the release process can paradoxically reduce SRE’s toil while increasing system reliability.

SRE – Google’s approach to Service Management

SRE is what happens when we ask a software engineer to design an operations team, the common aptitude being developing software systems to solve complex problems. Motivated by “as a software engineer, this is how I would want to invest my time to accomplish a set of repetitive tasks.”

SRE teams generally comprise 50-60% regular software engineers, with the other 40-50% being near-software-engineers who bring rarer skills like Unix system internals, networking expertise, etc.

SRE teams should focus on engineering, to avoid the fate of scaling the team linearly with the service. SREs ensure that the service “runs and repairs itself”. SREs should typically spend only 50% of their time on ops work, and the remaining time on coding for the project itself.

Where to start?

Organizations must identify change agents who will create and promote a culture of maximum system availability. They can champion this change by practicing the principle of observability, of which monitoring is a subset. Observability essentially requires engineering teams to be vigilant about the common and complex problems hindering the attainment of reliability and scalability in the application. See the principles of observability below:

[Figure: The principles of observability]

Tenets of SRE

For a given service, SRE ensures availability, latency, performance, efficiency, change management, monitoring, emergency response, capacity planning, etc.

Google operates a “blame-free postmortem culture”, with the goal of exposing faults and applying engineering to fix these faults, rather than avoiding or minimizing them.

Change Management

Data suggests that ~70% of outages are due to changes in a live system. Removing humans from the process and automating it helps achieve the following (a minimal sketch follows the list):

  1. Implement progressive rollouts
  2. Quick and accurate detection of problems
  3. Safe rollback of changes when problems arise
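
The sketch below is a hedged illustration of that automation loop: expand a release in stages, watch an error-rate signal, and roll back automatically when it degrades. The route_traffic, get_error_rate, and rollback helpers are hypothetical stand-ins for whatever deployment and monitoring tooling is actually in place.

```python
# Sketch: automated progressive rollout with rollback on error-rate regression.
# route_traffic(), get_error_rate(), and rollback() are hypothetical placeholders
# for your own deployment/monitoring tooling, not a real library API.
import time

STAGES = [1, 5, 25, 50, 100]     # percent of traffic on the new version
ERROR_RATE_THRESHOLD = 0.01      # tolerate up to 1% errors during rollout
SOAK_SECONDS = 300               # observe each stage before expanding further

def route_traffic(version: str, percent: int) -> None:
    """Placeholder: shift `percent` of traffic to `version` via the load balancer."""
    print(f"Routing {percent}% of traffic to {version}")

def get_error_rate(version: str) -> float:
    """Placeholder: read the current error rate for `version` from monitoring."""
    return 0.0

def rollback(version: str) -> None:
    """Placeholder: revert all traffic to the previous stable version."""
    print(f"Rolling back {version}")

def progressive_rollout(new_version: str) -> bool:
    for percent in STAGES:
        route_traffic(new_version, percent)                     # 1. progressive rollout
        time.sleep(SOAK_SECONDS)                                # let the stage soak
        if get_error_rate(new_version) > ERROR_RATE_THRESHOLD:  # 2. detect problems
            rollback(new_version)                               # 3. roll back safely
            return False
    return True
```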

SRE represents a significant break from existing industry best practices for managing large, complicated services.

Benefits of SRE

[Figure: Benefits of SRE]

No more toil; organizations should embrace SRE and make their end customers happy.


About the Author –

Vasu heads the Engineering function for A&P. He is a Digital Transformation leader with ~20 years of IT industry experience spanning Product Engineering, Portfolio Delivery, Large Program Management, etc. Vasu has designed and delivered Open Systems, Core Banking, Web / Mobile Applications, etc. Outside of his professional role, Vasu enjoys playing badminton and is a fitness enthusiast.

Understanding Data Fabric

Srinivasan Sundararajan

In the recently announced Technology Trends in Data Management, Gartner has introduced the concept of “Data Fabric”. Here is the link to the document, Top Trends in Data and Analytics for 2021: Data Fabric Is the Foundation (gartner.com).

As per Gartner, the data fabric approach can enhance traditional data management patterns and replace them with a more responsive approach. As it is key for the enterprise data management strategy, let us understand more about the details of data fabric in this article.

What is Data Fabric?

Today’s enterprise data stores and data volumes are growing rapidly. Data fabric aims to simplify the management of enterprise data sources and the ability to extract insights from them. A data fabric has the following attributes:

  • Connects to multiple data sources
  • Provides data discovery of data sources
  • Stores meta data and data catalog information about the data sources
  • Data ingestion capabilities including data transformation
  • Data lake and data storage option
  • Ability to store multi-modal data, both structured and unstructured
  • Ability to integrate data across clouds
  • Inbuilt graph engine to link data for providing complex relationships
  • Data virtualization to integrate with data that need not be physically moved
  • Data governance and data quality management
  • Inbuilt AI/ML engine for providing machine learning capabilities
  • Ability to share the data both within enterprises and across enterprises
  • Easy-to-configure workflows without much coding (low-code environment)
  • Support for comprehensive use cases like Customer 360, Patient 360 and more

As is evident, Data Fabric aims to provide a superset of all the desired data management capabilities under a single unified platform, making it an obvious choice for the future of data management in enterprises.

Data Virtualization

While most of the above capabilities are part of existing enterprise data management platforms, one important capability that distinguishes the data fabric platform is data virtualization.

Data virtualization creates a data abstraction layer by connecting, gathering, and transforming data silos to support real-time and near-real-time insights. It gives you direct access to transactional and operational systems in real time, whether on-premises or in the cloud.

The following is one of the basic implementations of data virtualization, whereby an external data source is queried natively without actually moving the data. For example, a Hadoop HDFS data source can be queried from a data fabric platform such that the external data can be integrated with other data.
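
Purely as a hedged, generic illustration of that pattern (not the specific example from the original diagram), the PySpark sketch below reads an HDFS-resident dataset in place and joins it with local data without first copying the files into a warehouse; the HDFS path and column names are assumptions.

```python
# Sketch: querying an external HDFS data source in place and integrating it with
# other data, in the spirit of data virtualization. Paths and columns are assumed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-virtualization-sketch").getOrCreate()

# Read the external HDFS dataset where it lives; no bulk copy into the warehouse.
external_orders = spark.read.parquet("hdfs://namenode:8020/data/orders/")

# A small "local" reference table registered in the fabric layer.
customers = spark.createDataFrame(
    [(1, "Acme Corp"), (2, "Globex")], ["customer_id", "customer_name"]
)

# Integrate external and local data in a single federated-style query.
enriched = external_orders.join(customers, on="customer_id", how="left")
enriched.groupBy("customer_name").count().show()
```

In a full data fabric platform, the same federated query would also benefit from query pushdown and parallel execution, as noted below.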


While this kind of external data source access has been available for a while, data fabric also aims to solve the performance issues associated with data virtualization. Some of the techniques used by data fabric platforms are:

  • Pushing some computations down to the external source to optimize the overall query
  • Scaling out compute resources by providing parallelism

Multi Cloud   

As explained earlier, another critical capability of data fabric platforms is the ability to integrate data from multiple cloud providers. This is at an early stage, as different cloud platforms have different architectures and no uniform way of connecting to each other. However, this capability will grow in the coming days.

Advanced Use Cases 

Data fabric should support advanced use cases like Customer 360, Product 360, etc. These are comprehensive views of all linkages across enterprise data, typically implemented using graph technologies. Since data fabric supports graph databases and graph queries as an inherent feature, these advanced linkages are part of the data fabric platform.

Data Sharing  

Data fabric platforms should also focus on data sharing, not only within the enterprise but also across enterprises. While a focus on API management helps with data sharing, this functionality has to be enhanced further, as data sharing also needs to take care of privacy and other data governance needs.

Data Lakes 

While earlier platforms similar to data fabric used the enterprise data warehouse as their backbone, data fabric uses a data lake as its backbone. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure it, and run different types of analytics, from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions.

Data Fabric Players

At the time of writing this article, Gartner has not yet rated Data Fabric platforms in the form of a Magic Quadrant. However, there is a report from Forrester that ranks data fabric platforms in the form of a Forrester Wave.

Some of the key platforms mentioned in that report are:

  • Talend
  • Oracle
  • SAP
  • Denodo Technologies
  • Cambridge Semantics
  • Informatica
  • Cloudera
  • Infoworks

While the detailed explanation and architecture of these platforms can be covered in a subsequent article, the building blocks of the Talend data fabric platform are illustrated in the diagram below.

[Diagram: Building blocks of the Talend Data Fabric platform]

Enterprises can also consider building their own data fabric platform by combining the best features of various individual components. For example, from the Microsoft ecosystem perspective:

  • SQL Server Big Data Clusters has data virtualization capabilities
  • Azure Purview has data governance and metadata management capabilities
  • Azure Data Lake Storage provides data lake capabilities
  • Azure Cosmos DB has a graph database engine
  • Azure Data Factory has data integration features
  • Azure Machine Learning and SQL Server have machine learning capabilities

However, as is evident, we are yet to see strong products and platforms in the area of multi-cloud data management, especially performance-focused data virtualization across cloud providers.

About the Author –

Srini is the Technology Advisor for GAVS. He is currently focused on Healthcare Data Management Solutions for the post-pandemic Healthcare era, using the combination of Multi-Modal databases, Blockchain, and Data Mining. The solutions aim at Patient data sharing within Hospitals as well as across Hospitals (Healthcare Interoperability) while bringing more trust and transparency into the healthcare process using patient consent management, credentialing, and zero-knowledge proofs.