This content is currently WIP. Diagrams, content, and structure are subject to change.
This section covers the observability and monitoring tools integrated with the C3 Agentic AI Platform. You’ll learn about Prometheus for metrics collection, OpenSearch for log management, and Grafana for visualization and dashboards.

What is observability and monitoring?

Observability and monitoring in the C3 Agentic AI Platform rely on tools that collect data about your applications and infrastructure. Monitoring shows when something is wrong, while observability provides the context to understand why. The platform integrates standard tools that collect, store, and display telemetry, and that notify you when it crosses defined limits:
  • Metrics: Numerical data points collected at regular intervals
  • Logs: Text records of events and actions
  • Traces: Records of requests as they flow through distributed systems
  • Alerts: Notifications when predefined conditions are met

Monitoring architecture

The C3 Agentic AI Platform’s monitoring architecture follows a layered approach:

Data collection

Agents and exporters that gather metrics, logs, and traces from various sources.

Data storage

Time-series databases and log storage systems that efficiently store monitoring data.

Data processing

Systems that analyze, aggregate, and correlate monitoring data to extract insights.

Visualization and alerting

Dashboards and notification systems that present monitoring data and alert on anomalies.

This architecture provides a comprehensive view of your application’s health and performance, from infrastructure to business metrics.

Core monitoring components

The C3 Agentic AI Platform integrates three widely used monitoring tools:

Prometheus for metrics

Prometheus collects and stores time-series metrics from your applications and infrastructure:
  • Resource metrics: CPU, memory, disk, and network usage
  • Application metrics: Request rates, error rates, and latencies
  • Business metrics: User activity, transaction volumes, and other domain-specific metrics
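
To make the idea of a time-series metric more concrete, the following minimal sketch renders a counter in the Prometheus text exposition format using only the Python standard library. The metric name `http_requests_total`, its labels, and the `render_counter` helper are hypothetical and chosen for illustration; a client library would normally generate this output, and this page does not specify how the platform itself exposes metrics. For brevity, the sketch does not escape the help text.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote, and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family.

    `samples` maps a tuple of (label_name, label_value) pairs to a value.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical samples: label pairs mapped to cumulative request counts.
samples = {
    (("method", "GET"), ("status", "200")): 1027,
    (("method", "POST"), ("status", "500")): 3,
}
text = render_counter("http_requests_total", "Total HTTP requests handled.", samples)
print(text, end="")
```

Each label combination becomes its own time series, which is why adding labels increases storage volume.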

OpenSearch for logs

OpenSearch (an open-source fork of Elasticsearch) centralizes logs from all components of your application:
  • Application logs: Messages generated by your application code
  • System logs: Events from the operating system and infrastructure
  • Access logs: Records of API calls and user interactions
  • Audit logs: Security-relevant events and administrative actions

Grafana for visualization

Grafana provides dashboards for visualizing metrics and logs:
  • Predefined dashboards: Ready-to-use visualizations for common monitoring needs
  • Custom dashboards: Tailored views for specific applications or use cases
  • Alerts: Notifications based on metric thresholds or log patterns
  • Annotations: Contextual information about events like deployments or incidents

Monitoring layers

The C3 Agentic AI Platform’s monitoring capabilities span multiple layers:

Infrastructure monitoring

Infrastructure monitoring focuses on the health and performance of the underlying hardware and software:
  • Kubernetes monitoring: Pod status, resource usage, and cluster health
  • Node monitoring: CPU, memory, disk, and network metrics for each server
  • Database monitoring: Query performance, connection counts, and storage usage
  • Network monitoring: Latency, throughput, and error rates

Application monitoring

Application monitoring tracks the behavior and performance of your C3 AI applications:
  • API monitoring: Request rates, error rates, and latencies for each endpoint
  • Service monitoring: Health and performance of individual microservices
  • Dependency monitoring: Interactions with external systems and services
  • Error tracking: Exception rates, stack traces, and error patterns
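
The API-level signals listed above (request rate, error rate, and latency) can be derived from raw request records. The sketch below is one way to do that in plain Python, under several assumptions: the `Request` record and `summarize` helper are hypothetical, any status of 500 or above counts as an error, and the 95th percentile uses the nearest-rank method. A metrics backend would normally compute these values for you.

```python
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int
    latency_ms: float


def summarize(requests: list[Request], window_seconds: float) -> dict[str, dict[str, float]]:
    """Return request rate, error rate, and p95 latency for each endpoint."""
    by_endpoint: dict[str, list[Request]] = {}
    for r in requests:
        by_endpoint.setdefault(r.endpoint, []).append(r)

    summary = {}
    for endpoint, items in by_endpoint.items():
        latencies = sorted(r.latency_ms for r in items)
        errors = sum(1 for r in items if r.status >= 500)  # assumption: 5xx = error
        # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
        rank = (95 * len(latencies) + 99) // 100
        summary[endpoint] = {
            "requests_per_second": len(items) / window_seconds,
            "error_rate": errors / len(items),
            "p95_latency_ms": latencies[rank - 1],
        }
    return summary


# Hypothetical requests observed during a 2-second window.
requests = [
    Request("/predict", 200, 10.0),
    Request("/predict", 200, 20.0),
    Request("/predict", 500, 30.0),
    Request("/predict", 200, 40.0),
    Request("/health", 200, 2.0),
]
summary = summarize(requests, window_seconds=2.0)
print(summary["/predict"])
```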

Business monitoring

Business monitoring focuses on domain-specific metrics that reflect the value your application provides:
  • User activity: Active users, session duration, and feature usage
  • Transaction metrics: Volume, value, and success rates of business transactions
  • Data quality: Completeness, accuracy, and timeliness of data
  • ML model performance: Prediction accuracy, drift, and resource usage

Log analysis

Log analysis helps you understand what’s happening in your application and diagnose issues:

Log collection

Logs are collected from various sources and centralized in OpenSearch for analysis.

Log querying

OpenSearch Dashboards (an open-source fork of Kibana) provides an interface for querying and analyzing logs.
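
For programmatic access, log searches can also be expressed in the OpenSearch query DSL. The sketch below only builds a request body for recent error logs from one service; it does not send it. The field names `level`, `service`, and `@timestamp` are assumptions about how logs are indexed (with `level` and `service` mapped as keyword fields), and `build_error_query` is a hypothetical helper.

```python
import json


def build_error_query(service: str, minutes: int, size: int = 50) -> dict:
    """Build a search body for ERROR logs from one service in the last N minutes."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest first
        "query": {
            "bool": {
                # Filter clauses match documents without affecting relevance scoring.
                "filter": [
                    {"term": {"level": "ERROR"}},
                    {"term": {"service": service}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


body = build_error_query("ingest", minutes=15)
print(json.dumps(body, indent=2))
```

A body like this would be sent to the `_search` endpoint of the index that holds your logs; the index name depends on your deployment.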

Log visualization

OpenSearch Dashboards provides visualizations for log data to help identify patterns and trends.

Alerting

Alerting notifies you when something goes wrong or requires attention:

Alert rules

Alert rules define conditions that trigger notifications based on thresholds and patterns.
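
As an illustration of a threshold-based rule, the sketch below fires only after a value has stayed above its threshold for several consecutive evaluations, which helps avoid alerting on brief spikes. It is loosely inspired by the pending and firing states that Prometheus alerting rules use with a `for` duration, but the `AlertRule` class, the `evaluate` function, and the "ok" state name are simplifications made for this example, not the platform's rule format.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    threshold: float
    for_evaluations: int  # consecutive breaches required before firing


def evaluate(rule: AlertRule, samples: list[float]) -> list[str]:
    """Return the alert state after each sample: "ok", "pending", or "firing"."""
    states = []
    breaches = 0
    for value in samples:
        # Any sample at or below the threshold resets the streak.
        breaches = breaches + 1 if value > rule.threshold else 0
        if breaches == 0:
            states.append("ok")
        elif breaches < rule.for_evaluations:
            states.append("pending")
        else:
            states.append("firing")
    return states


# Hypothetical rule: error ratio above 5% for three evaluations in a row.
rule = AlertRule(name="HighErrorRate", threshold=0.05, for_evaluations=3)
states = evaluate(rule, [0.01, 0.06, 0.07, 0.02, 0.06, 0.08, 0.09])
print(states)
# ['ok', 'pending', 'pending', 'ok', 'pending', 'pending', 'firing']
```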

Alert notifications

Alert notifications are sent through various channels such as email and Slack.
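
The sketch below shows how one alert might be formatted for the two channels mentioned above. It builds the messages without sending them. The alert fields, the addresses, and the severity-based routing policy are all hypothetical; the Slack body uses the simple `text` field accepted by Slack incoming webhooks.

```python
import json
from email.message import EmailMessage


def slack_payload(alert: dict) -> str:
    """Return a JSON body with the "text" field that Slack incoming webhooks accept."""
    text = f"[{alert['severity'].upper()}] {alert['name']}: {alert['summary']}"
    return json.dumps({"text": text})


def email_message(alert: dict, sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) an email describing the alert."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"[{alert['severity'].upper()}] {alert['name']}"
    msg.set_content(alert["summary"])
    return msg


def route(alert: dict) -> list[str]:
    """Choose channels by severity; this policy is illustrative only."""
    return ["slack", "email"] if alert["severity"] == "critical" else ["slack"]


# Hypothetical alert.
alert = {
    "name": "HighErrorRate",
    "severity": "critical",
    "summary": "5xx ratio above 5% for 15 minutes",
}
print(route(alert))
print(slack_payload(alert))
```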

Best practices

Here are some best practices for observability and monitoring in the C3 Agentic AI Platform:

Metric collection

  • Use meaningful metric names with consistent naming conventions
  • Add relevant labels to provide context
  • Choose appropriate metric types for different use cases
  • Balance granularity and volume to avoid overwhelming storage
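
Naming conventions are easier to keep consistent when they are checked automatically. The following sketch validates names against the classic Prometheus character set for metric and label names and applies the common convention that counters end in `_total`. The `check_metric` helper is hypothetical, and your team's conventions may be stricter or different.

```python
import re

# Classic Prometheus character sets for metric and label names.
METRIC_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")
LABEL_NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*")


def check_metric(name: str, metric_type: str, labels: list[str]) -> list[str]:
    """Return a list of naming problems; an empty list means the metric passes."""
    problems = []
    if not METRIC_NAME.fullmatch(name):
        problems.append(f"invalid metric name: {name}")
    if metric_type == "counter" and not name.endswith("_total"):
        problems.append(f"counter {name} should end in _total")
    for label in labels:
        # Label names starting with "__" are reserved for internal use.
        if not LABEL_NAME.fullmatch(label) or label.startswith("__"):
            problems.append(f"invalid or reserved label name: {label}")
    return problems


print(check_metric("http_requests_total", "counter", ["method", "status"]))
print(check_metric("http-requests", "counter", ["__name"]))
```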

Log management

  • Use structured logging with consistent formats
  • Log at appropriate levels based on severity
  • Include context to correlate related events
  • Manage log volume with rotation and retention policies
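
As one way to apply structured logging with correlation context, the sketch below uses Python's standard `logging` module to emit one JSON object per line. The field names, the `request_id` correlation field, and the `example.app` logger name are assumptions for illustration; match them to whatever schema your log pipeline expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Format each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation field supplied via `extra=`; None when absent.
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("example.app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# The same request_id on related events lets you correlate them later.
logger.info("model scored", extra={"request_id": "req-123"})
logger.debug("not emitted at INFO level")
```

Because every line is valid JSON with the same keys, a log store can index fields such as `level` and `request_id` without custom parsing.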

Dashboard design

  • Start with overview dashboards and provide drill-down capabilities
  • Group related metrics for easier analysis
  • Use consistent time ranges across panels
  • Add context with documentation and runbooks

Alerting strategy

  • Alert on symptoms that impact users (such as elevated error rates or latency) rather than on every internal cause
  • Set appropriate thresholds to balance sensitivity and specificity
  • Define clear ownership for each alert
  • Include actionable information for remediation