Monitoring Data Refinery
This document summarizes the essential monitoring tasks for Data Refinery, covering containers, databases, task management, and the API, without exhaustive technical detail. Together, these tasks help keep Data Refinery stable and efficient for its users.
Container Monitoring
Container monitoring ensures the stability and optimal performance of the system by tracking various key metrics related to the containers running in Data Refinery.
Monitoring API Containers
What is Monitored
The number of active API containers in the system is monitored. Data Refinery runs on API containers, so if none are running, the system is non-operational and a user is alerted immediately.
Why is it Important
Keeping the API service operational is crucial for the system to function. The monitoring system quickly detects service disruptions, allowing for immediate investigation and resolution. This maintains system performance and minimizes downtime.
Monitoring API Container Cluster Performance and Capacity
What is Monitored
The overall performance and capacity of the API container cluster are tracked. This includes CPU utilization, memory utilization, and disk I/O operations. If any of these metrics exceeds its predefined threshold, a user is alerted, allowing for quick intervention and adjustment.
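The threshold check described above can be sketched as follows. This is a minimal illustration, not Data Refinery's actual alerting code: the metric names, the threshold values, and the `breached_metrics` helper are all hypothetical, since the document does not state the real thresholds.

```python
# Hypothetical thresholds; the real values are not given in this document.
THRESHOLDS = {
    "cpu_percent": 80.0,
    "memory_percent": 80.0,
    "disk_io_ops": 5000.0,
}


def breached_metrics(sample, thresholds=THRESHOLDS):
    """Return the sorted names of metrics whose value exceeds its threshold.

    Metrics missing from the sample are treated as 0.0 (an assumption).
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if sample.get(name, 0.0) > limit
    )


# One made-up cluster sample: only CPU is above its threshold.
sample = {"cpu_percent": 91.5, "memory_percent": 62.0, "disk_io_ops": 1200.0}
print(breached_metrics(sample))  # ['cpu_percent']
```

A value exactly equal to its threshold does not trigger an alert in this sketch, matching the "exceeds" wording above.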
Why is it Important
Monitoring the cluster’s performance is essential for maintaining the health and efficiency of the system. By watching these metrics, the monitoring system can detect anomalies and support scaling decisions as needed. This avoids potential bottlenecks and promotes a smooth, responsive user experience.
Monitoring Active AWS Fargate Tasks Proportion (for API Containers)
What is Monitored
The proportion of active Fargate tasks relative to the total possible tasks in the system is shown in real time. It is displayed as a percentage and updated every minute, giving users up-to-date insight into the system’s current capacity and usage.
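As a rough illustration of the percentage described above, the calculation could look like this. The function name and its inputs are assumptions made for the example; how Data Refinery actually obtains the running and maximum task counts is not covered here.

```python
def active_task_percentage(running_tasks: int, max_tasks: int) -> float:
    """Return active tasks as a percentage of the maximum, to one decimal.

    Both arguments are hypothetical inputs; in practice they would come
    from whatever source reports the cluster's task counts.
    """
    if max_tasks <= 0:
        raise ValueError("max_tasks must be positive")
    if running_tasks < 0:
        raise ValueError("running_tasks must not be negative")
    return round(100.0 * running_tasks / max_tasks, 1)


# Example: 6 of a possible 8 tasks are active.
print(active_task_percentage(6, 8))  # 75.0
```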
Why is it Important
Understanding the proportion of active tasks helps users anticipate the system’s ability to handle additional load. Because the system makes scaling decisions automatically, this real-time insight helps users plan operations effectively and supports a smooth, uninterrupted experience.
Database Performance Monitoring
Database Performance Monitoring safeguards the efficiency and reliability of the system. It continually oversees key performance metrics, prevents bottlenecks, and ensures that the databases respond effectively to user demands.
Amazon Aurora Serverless Performance Metrics and Alerting
What is Monitored
CPU utilization is monitored, and users are alerted when usage approaches critical levels. The number of user connections is tracked to avoid overloading the database, and the overall capacity of the Aurora Serverless database instances in the cluster is monitored.
Why is it Important
Tracking these stability and performance metrics helps prevent system overloads and keeps the user experience smooth and stable. Monitoring also helps optimize how the system scales with demand, providing efficient performance at a reasonable cost.
Monitoring Redshift Spectrum
What is Monitored
Prematurely terminated (aborted) queries in Redshift Spectrum are tracked.
Why is it Important
Monitoring aborted queries is essential for maintaining the system’s efficiency. Spotting these unsuccessful queries makes it possible to optimize resource usage, so that the system runs smoothly and cost-effectively.
Task Management and Performance Monitoring
Task Management and Performance Monitoring ensure optimal operation by tracking key metrics related to Background Tasks and Glue Crawlers. This provides visibility into system performance, task failures, and efficiency.
Monitor and Log Background Tasks, Task Failures, and Task Execution Duration
What is Monitored
The number of active Background Tasks running at any given time is tracked. If the number of failed tasks crosses a certain threshold, users are alerted. The average execution time for each type of Background Task is also tracked.
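The failure count and per-type average duration described above can be sketched as below. The record fields (`type`, `status`, `seconds`), the `summarize_tasks` helper, and the choice to average only tasks that did not fail are assumptions for illustration; the document does not specify how Data Refinery stores or aggregates task records.

```python
from collections import defaultdict
from statistics import mean


def summarize_tasks(records, failure_threshold):
    """Count failed tasks and average the duration of the rest, per type.

    An alert is raised when the failed count exceeds failure_threshold.
    Failed tasks are left out of the averages (an assumption).
    """
    failed = 0
    durations = defaultdict(list)
    for record in records:
        if record["status"] == "failed":
            failed += 1
        else:
            durations[record["type"]].append(record["seconds"])
    return {
        "failed": failed,
        "alert": failed > failure_threshold,
        "avg_seconds": {kind: mean(vals) for kind, vals in durations.items()},
    }


# Made-up task records with hypothetical task types.
records = [
    {"type": "import", "status": "succeeded", "seconds": 12.0},
    {"type": "import", "status": "succeeded", "seconds": 18.0},
    {"type": "export", "status": "failed", "seconds": 3.0},
    {"type": "export", "status": "succeeded", "seconds": 40.0},
]
print(summarize_tasks(records, failure_threshold=0))
```

With these records, one task has failed, so a threshold of 0 raises an alert while a threshold of 1 does not.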
Why is it Important
Monitoring these aspects provides insights into system performance and reliability. This helps ensure a smooth and efficient user experience.
Glue Crawlers Monitoring
What is Monitored
The number of failed Glue Crawler jobs and the number of successfully completed Glue Crawler jobs are tracked.
Why is it Important
Monitoring these Glue Crawler jobs enables early failure detection and provides insight into job completion rates.
API Performance, Availability, and Usage Monitoring
API Performance, Availability, and Usage Monitoring focuses on tracking user interactions and response times to ensure a smooth and reliable user experience.
Monitor Expected User Loads / Concurrent Users
What is Monitored
The number of users accessing the system simultaneously (concurrent users) is continually logged. The number of active data Sources and the number of versions for each data Source are logged, as is the number of connections to third-party system integrations.
Why is it Important
Monitoring these aspects helps to understand the system load, data complexity, usage patterns, and potential bottlenecks in integration.
Monitor API Response Time and Errors to Ensure System Availability
What is Monitored
The average response time across all API endpoints is tracked, and users are alerted when it exceeds a predefined threshold. The percentage of errors across all endpoints is also tracked, and alerts are triggered when the error rate surpasses a predefined threshold.
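The two checks above can be illustrated with a short sketch. Everything specific here is an assumption: the `api_health` helper, the `(milliseconds, HTTP status)` sample format, the threshold values, and the choice to count only 5xx responses as errors are made up for the example and are not taken from Data Refinery.

```python
def api_health(samples, max_avg_ms, max_error_pct):
    """Compute average response time and error percentage, and list alerts.

    samples is a list of (response_ms, http_status) tuples. Only statuses
    of 500 and above count as errors here (an assumption).
    """
    if not samples:
        return {"avg_ms": None, "error_pct": None, "alerts": []}
    avg_ms = sum(ms for ms, _ in samples) / len(samples)
    errors = sum(1 for _, status in samples if status >= 500)
    error_pct = 100.0 * errors / len(samples)
    alerts = []
    if avg_ms > max_avg_ms:
        alerts.append("response_time")
    if error_pct > max_error_pct:
        alerts.append("error_rate")
    return {"avg_ms": avg_ms, "error_pct": error_pct, "alerts": alerts}


# Four made-up requests: average 200 ms, one server error (25%).
samples = [(120, 200), (80, 200), (400, 500), (200, 404)]
print(api_health(samples, max_avg_ms=250, max_error_pct=5))
```

In this example the average response time stays under its limit, so only the error-rate alert fires.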
Why is it Important
This monitoring is vital to quickly detect and address performance issues, ensuring that the API remains available and reliable for users.
Integration Alert Setup (Placeholder)
In the future, this section will provide detailed instructions on how clients and users can set up alerts to integrate with their existing systems. These systems include Splunk, VictorOps, GitLab, Jira, and more. It will be a comprehensive guide, ensuring that users can effortlessly streamline alerts with the systems they are already using. For now, consider this as a placeholder, and the information will be added as Data Refinery refines the monitoring processes and integrations.