Context
AI Bulletin
Industry trends, leadership expectations, and technology challenges shaping why SOAR exists
Key Industry Trends
Forces reshaping how operations teams must think, organise, and deliver
Intelligence Over Monitoring
From Monitoring to Operational Intelligence
Operational intelligence delivering insights across incidents, performance, and cost. Explainable analytics supporting decisions and risk management — not just dashboards.
Resilient Outcomes
Reliability & Measurable Service Outcomes
Shifting focus from infrastructure uptime to service reliability and business outcomes. Growing emphasis on proactive operations over reactive firefighting.
Federated Execution
Federated Operations, Centralised Governance
Federated execution with centralised reliability and governance oversight. Integrated ecosystem across cloud providers, platform teams, security, and tooling.
AI as Multiplier
AI-Embedded, Cloud-Scaled Delivery
AI and automation as operational effectiveness multipliers — reducing toil, accelerating decisions, and enabling faster knowledge discovery and collaboration.
Expectations from Leadership
What the organisation needs operations to deliver across people, process, data, and technology
People
Demonstrate measurable operational impact
Upskill teams with AI, governance, and cloud skills
Retain high-value operational talent
Reduce toil and improve engineer experience
Process
Accelerate operational response cycles
Embed FinOps and observability for a holistic view
Standardise tooling to reduce fragmentation
Treat operational data as a strategic asset
Data
Build scalable and reliable operational data platforms
Governance for accurate, explainable, and compliant AI
Cross-platform and cross-domain operational analytics
Embed risk controls and explainability into AI processes
Technology
Build scalable cloud platforms supporting AI-assisted ops
Use AI and automation as effectiveness multipliers
Drive maturity through phased SOAR adoption
Maintain compliance, security, and governance standards
Technology Challenges to Address
SOAR directly addresses each of these challenges through structured, wave-based AI adoption across all operational pillars.
Reference
Operations Layer Diagram
How product offerings map to the operations capabilities required to sustain them
As organisations leverage AI to transform their product offerings, operations must equally evolve to enable reliability, governance, and intelligent support.
Experience Layer
Performance & Experience Monitoring
Usage Analytics
A/B Testing
Latency
Uptime / Availability
Endpoint Protection
Functional Layer
Data Integrity, Retrieval Ops & AI Ops
Rules Validation
Human-in-the-Loop
RAG Checks
Storage Performance
Data Security
Back-up & Recovery
Intelligence Layer
Model Observability, Quality Assurance & MLOps
Workflow Management
LLM / SLM Models
Prompt & Context Ops
Model Bias & Drift
Hallucination Detection
Model Versioning
Data Layer
Resource Optimisation, Reliability & Compliance
Semantic & Feed Layer
Embeddings
Data Sovereignty
Legal Checks
Business Continuity
Incident Management
Infra Layer
FinOps, Compliance Auditing & Networking
Auto-Scaling
GPU Utilisation
API Health
Networking
FinOps
Compliance Auditing
Governance Layer
Security, Risk & Organisational Data
Transactional Data
Proprietary Content
Public / Open Data
Organisation Data
Cost View
Auto-Scaling
Each SOAR pillar maps to one or more layers — ensuring operational capabilities evolve in step with product and AI platform needs.
AI Adoption Program · Cloud Operations
Operational AI Value Tracker
Tracking AI adoption, productivity gains and business value across all SOAR pillars
Program start: Jan 2025
Current wave: Wave 2
Last updated: Mar 2026
Program Vision
Transform operations into an intelligence-driven reliability organisation powered by AI, automation, and insights.
SOAR evolves operations from reactive execution to intelligence-driven service reliability — augmenting engineers, accelerating decisions, and strengthening resilience.
SOAR Enables — Evidenced in Numbers
SOAR Capability Themes
Sense
Detect & Understand
MTTR improvement
-47%
Alert noise reduction
-62%
Proactive detections
134
Wave 2 Active
2 pillars →
Optimize
Reduce & Improve
Cost saved
$2.4M
Waste eliminated
$890K
Rightsizing adoption
71%
Wave 1 Done
1 pillar →
Accelerate
Speed & Automate
Hours saved
3,200
Ticket resolution
-38%
Deploy failure rate
-22%
Wave 2 Active
1 pillar →
Data
Govern & Enable
Platform availability
99.7%
Data quality score
87%
AI data readiness
72%
Wave 2 Active
1 pillar →
Reinforce
Protect & Scale
Security incidents
-31%
Onboarding time
-40%
SME dependency
-28%
Wave 3 Planned
2 pillars →
Guiding Principles
Delivery Waves
Wave 1 — Complete
Quick Wins
Q1 2025
✓ Incident AI summarization
✓ Runbook assistant
✓ Cost anomaly detection
✓ Ticket summarization
Wave 2 — Active
Decision Enablement
Q2–Q3 2025
◎ Root cause assistant
◎ Deployment risk analysis
◎ Log anomaly detection
Wave 3 — Planned
Autonomous Operations
Q4 2025
○ Predictive incident detection
○ Auto-remediation workflows
○ Intelligent auto-scaling
Recent AI Activity
Cost anomaly detected
Idle EC2 cluster — $42K/mo savings identified
2h ago
Incident triage completed
P1 root cause identified in 4 min vs 90 min avg
5h ago
Runbook auto-generated
DB failover runbook created from incident history
Yesterday
Compliance drift detected
3 misconfigured S3 buckets flagged automatically
Yesterday
Alert deduplication active
412 duplicate alerts suppressed this week
2d ago
Deploy risk assessment
High-risk change blocked pre-production — saved est. 3h outage
3d ago
Team Adoption Progress
Sense
Intelligent Incident Management
AI-assisted triage and root cause analysis enabling faster, smarter incident response
MTTR
48 min
Down from 90 min
↑ -47% improvement
MTTD
6 min
Down from 22 min
↑ -73% improvement
Escalations
-34%
Fewer P1 escalations
↑ AI triage impact
Recurrence
-28%
Repeat incidents
↑ KB generation effect
MTTR Before vs After AI
Mean Time To Resolve
Before AI: 90 min
After AI: 48 min
Target: 30 min
On track to reach 30 min target by Wave 3 with auto-remediation.
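The MTTR figures above reduce to a simple calculation over incident detection and resolution timestamps. A minimal sketch — the incident records here are hypothetical, purely to illustrate the arithmetic:

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean Time To Resolve, in minutes, over (detected_at, resolved_at) pairs."""
    durations = [(resolved - detected).total_seconds() / 60
                 for detected, resolved in incidents]
    return sum(durations) / len(durations)

# Hypothetical incident records for illustration only.
before_ai = [(datetime(2025, 1, 1, 2, 0), datetime(2025, 1, 1, 3, 30)),   # 90 min
             (datetime(2025, 1, 5, 9, 0), datetime(2025, 1, 5, 10, 30))]  # 90 min
after_ai  = [(datetime(2025, 6, 1, 2, 0), datetime(2025, 6, 1, 2, 48)),   # 48 min
             (datetime(2025, 6, 5, 9, 0), datetime(2025, 6, 5, 9, 48))]   # 48 min

improvement = 1 - mttr_minutes(after_ai) / mttr_minutes(before_ai)
print(f"MTTR improvement: {improvement:.0%}")  # -> MTTR improvement: 47%
```

Dropping from a 90-minute to a 48-minute mean gives the -47% improvement reported on the card.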
AI Maturity — Incident Management
Current Capability Level
Level 0
Manual operations — human driven
Level 1
AI insights — visibility & summarization
Done
Level 2
AI recommendations — root cause & fix suggestions
Current
Level 3
AI assisted execution — with human approval
Wave 3
Level 4
Autonomous — auto-remediation
Future
Active Initiatives
Incident Management AI Portfolio
01
AI incident triage assistant
Active
High
02
Root cause analysis summarization
Active
High
03
Log anomaly detection
Active
Med
04
Auto remediation suggestions
Planned
High
05
Knowledge base generation from incidents
Done
Med
AI Triage — Sample Output
Incident P1-2024-0847 · Live Analysis
AI Summary · generated in 12s
Root Cause: Memory pressure on prod-api-03 caused cascade failure. OOM triggered at 02:14 UTC. Pod evictions followed across 3 nodes.
Impact: 4.2% error rate on /checkout endpoint. ~1,400 affected users. Latency p99 elevated to 8.4s.
Suggested Fix: Increase memory limits on prod-api deployment. Apply runbook RB-0041. Scale horizontally +2 pods. Monitor for 30 min.
Analysis time
12s
vs 90 min manual
Confidence
91%
Root cause match
Runbook
RB-0041
Auto-matched
Sense
Observability & Reliability Engineering
Unified operational telemetry enabling proactive, intelligence-driven reliability
Alert Noise
-62%
Alerts suppressed/deduped
↑ 412 suppressed/week
False Positives
-54%
Fewer false alerts
↑ Engineer trust up
Proactive Detections
134
Pre-incident catches YTD
↑ +34 this quarter
SLO Compliance
99.4%
Up from 98.1%
↑ +1.3% improvement
Alert Volume — Before vs After AI
Weekly Alert Categories
Total alerts (before): 2,840 / week
After deduplication: 1,080 / week
Actionable alerts: 390 / week
AI noise reduction freed ~6h/week of on-call engineer time.
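Deduplication of the kind shown above typically fingerprints each alert on its identifying fields and suppresses repeats inside a time window. A minimal sketch — the field names (`service`, `check`, `severity`, `ts`) and the 30-minute window are illustrative assumptions, not a specific tool's schema:

```python
import hashlib
from datetime import datetime, timedelta

SUPPRESSION_WINDOW = timedelta(minutes=30)  # illustrative window length

def fingerprint(alert):
    """Stable fingerprint over the fields that identify a duplicate."""
    key = f"{alert['service']}|{alert['check']}|{alert['severity']}"
    return hashlib.sha256(key.encode()).hexdigest()[:12]

def deduplicate(alerts):
    """Keep the first alert per fingerprint inside the window; suppress the rest."""
    last_kept, kept, suppressed = {}, [], 0
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        fp = fingerprint(alert)
        prev = last_kept.get(fp)
        if prev is not None and alert["ts"] - prev < SUPPRESSION_WINDOW:
            suppressed += 1
        else:
            kept.append(alert)
            last_kept[fp] = alert["ts"]
    return kept, suppressed

# A burst of five identical alerts one minute apart: one page, four suppressed.
t0 = datetime(2025, 3, 1, 12, 0)
burst = [{"service": "api", "check": "cpu", "severity": "warn",
          "ts": t0 + timedelta(minutes=i)} for i in range(5)]
kept, suppressed = deduplicate(burst)
print(len(kept), suppressed)  # -> 1 4
```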
SLO Compliance by Service
Current Period Performance
Notifications SLO below 99% target — AI anomaly probe active.
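The per-service compliance check behind this view is straightforward: compare each service's success ratio to the SLO target and flag breaches. A sketch with illustrative request counts (the numbers below are made up to mirror the Notifications case):

```python
def slo_compliance(good, total):
    """Fraction of successful requests, as a percentage."""
    return 100.0 * good / total

def breaching(services, target=99.0):
    """Names of services whose compliance sits below the SLO target."""
    return [name for name, (good, total) in services.items()
            if slo_compliance(good, total) < target]

# Illustrative counts: (successful requests, total requests).
services = {
    "checkout":      (996_500, 1_000_000),  # 99.65%
    "notifications": (987_000, 1_000_000),  # 98.70% -- below the 99% target
}
print(breaching(services))  # -> ['notifications']
```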
Active Initiatives
Observability AI Portfolio
01
Smart alert deduplication
Done
High
02
Log anomaly detection
Active
High
03
Performance anomaly detection
Active
Med
04
Failure prediction
Planned
High
05
Capacity risk prediction
Planned
Med
Proactive Detections — YTD Breakdown
AI-caught issues before user impact
Avg detection lead time
38 min
Est. incidents avoided
$1.1M
Optimize
Cloud Cost Optimization
FinOps-driven cost governance and AI model flexibility delivering measurable financial outcomes
Total Cost Saved
$2.4M
Cloud waste eliminated YTD
↑ +18% vs target
Waste Eliminated
$890K
Idle & orphaned resources
↑ 214 resources reclaimed
Rightsizing Adoption
71%
Recommendations accepted
↑ Up from 42% last Qtr
Forecast Accuracy
93%
AI cost forecast precision
↑ Up from 71% manual
Cost Savings Breakdown
AI-identified savings by category
Monthly run rate
$200K
Annual projection
$2.4M
Next anomaly
Today
Recent Cost Anomalies
AI-detected spend anomalies
EC2 cluster spike — prod-batch
+340% above baseline · Auto-scaling misconfiguration
S3 egress anomaly — data-lake
+180% above baseline · Uncompressed exports
RDS over-provisioned — staging
db.r5.4xlarge at 8% avg CPU · Downsize candidate
NAT gateway waste — dev accounts
34 idle NAT gateways across dev accounts
Last scan: 2h ago · Next scheduled: in 4h
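Anomalies like the "+340% above baseline" EC2 spike can be caught by comparing current spend against a trailing baseline. A deliberately simple baseline-deviation sketch — a production detector would use seasonality-aware forecasting, and the spend series below are invented:

```python
def pct_above_baseline(current, baseline):
    return 100.0 * (current - baseline) / baseline

def detect_anomalies(daily_spend, threshold_pct=50.0):
    """Flag services whose latest daily spend exceeds the mean of
    prior days by more than threshold_pct percent."""
    anomalies = {}
    for service, series in daily_spend.items():
        baseline = sum(series[:-1]) / len(series[:-1])
        deviation = pct_above_baseline(series[-1], baseline)
        if deviation > threshold_pct:
            anomalies[service] = round(deviation)
    return anomalies

# Illustrative daily spend series ($).
spend = {
    "prod-batch": [1000, 1020, 980, 1000, 4400],  # spike: +340% over baseline
    "data-lake":  [500, 510, 495, 495, 520],      # normal variation
}
print(detect_anomalies(spend))  # -> {'prod-batch': 340}
```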
Active Initiatives
Cost Optimization AI Portfolio
01
Idle resource detection
Done
High
02
Rightsizing recommendations
Done
High
03
Cost anomaly detection & alerts
Active
High
04
Waste pattern detection
Active
Med
05
AI cost forecasting
Planned
Med
Rightsizing Adoption by Team
Recommendation acceptance rate
Dev teams below 60% target — AI recommendation UX review planned.
Accelerate
Developer Productivity
AI-assisted workflows evolving toward autonomous execution and reduced engineering toil
Hours Saved
3,200
Engineering hours YTD
↑ 1.6 FTE equivalent
Ticket Resolution
-38%
Avg. resolution time
↑ 4.2h → 2.6h avg
Deploy Failures
-22%
Failed deployments
↑ Risk assessment impact
Change Success
94%
Up from 81%
↑ +13 points
Toil Reduction by Activity
Hours saved per engineer task type
Avg saving / engineer
4.2h
Teams benefiting
14
AI Deploy Risk Assessment
Change success rate — before vs after AI
Change success (before): 81%
Change success (after AI): 94%
Sample Risk Assessment · CHG-2024-1183
Risk Level: Medium — 2 similar changes caused incidents in past 90 days
Recommendation: Deploy during low-traffic window. Enable feature flag. Have rollback ready.
Similar incidents: INC-0812 (DB migration), INC-0934 (cache invalidation)
23 high-risk changes blocked in staging this quarter.
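A risk assessment like CHG-2024-1183's can be derived by counting how often similar past changes caused incidents inside a lookback window. A minimal sketch — the record shapes and thresholds are illustrative assumptions, not a real CMDB schema:

```python
def risk_score(change, history, window_days=90):
    """Rate a change request by incident history of similar changes
    (same service and change type) within the lookback window."""
    similar = [h for h in history
               if h["service"] == change["service"]
               and h["type"] == change["type"]
               and h["age_days"] <= window_days]
    incidents = sum(1 for h in similar if h["caused_incident"])
    if incidents >= 3:
        return "High"
    if incidents >= 1:
        return "Medium"
    return "Low"

# Illustrative history: two similar failures inside 90 days, one outside.
history = [
    {"service": "payments", "type": "db-migration", "age_days": 40,  "caused_incident": True},
    {"service": "payments", "type": "db-migration", "age_days": 75,  "caused_incident": True},
    {"service": "payments", "type": "db-migration", "age_days": 200, "caused_incident": True},
]
change = {"service": "payments", "type": "db-migration"}
print(risk_score(change, history))  # -> Medium
```

Two similar incident-causing changes in the past 90 days yields "Medium" — the same shape of reasoning as the sample assessment above.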
Active Initiatives
Developer Productivity AI Portfolio
01
AI runbook assistant
Done
High
02
AI change risk assessment
Active
High
03
Deployment failure analysis
Active
High
04
AI platform documentation search
Active
Med
05
Infra troubleshooting assistant
Planned
Med
Ticket Resolution Time
Average time by category
All categories improved by 25–45% since AI assistant rollout.
Reinforce
Security & Compliance Automation
Policy-as-code enabling continuous compliance, security resilience, and automated governance
Security Incidents
-31%
YoY reduction
↑ AI detection impact
Time to Patch
-58%
Critical: 14d → 6d avg
↑ AI prioritisation
Compliance Violations
-44%
Drift detections resolved
↑ Continuous monitoring
Risk Score
42
Down from 74 (low = good)
↑ -32 points improvement
Risk Exposure — Before vs After AI
Organisational risk score trend
Risk score (before AI): 74 / High
Risk score (current): 42 / Medium
Target: 25 / Low
Open findings by severity
Critical
Unpatched CVEs (CVSS ≥ 9.0)
3
High
Misconfigured IAM policies
11
Medium
Public S3 buckets / open ports
28
Compliance Drift Detections
AI-flagged violations — last 7 days
Encryption at rest — data-store-07
CIS AWS 2.1.1 · Auto-remediated
MFA not enforced — 4 IAM users
SOC2 CC6.1 · Awaiting owner action
CloudTrail disabled — dev account
PCI DSS 10.1 · Ticket raised
VPC flow logs missing — 3 regions
NIST 800-53 AU-12 · Auto-remediated
Auto-remediated
67%
Avg fix time
1.8h
Open items
15
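Drift detection of this kind is policy-as-code: each control becomes a predicate over resource state, and findings route to auto-remediation or an owner depending on whether the fix is safe to automate. A sketch — the resource attributes and policy table are illustrative, not an actual CIS/SOC2 rule engine:

```python
POLICIES = [
    # Illustrative rules keyed to the controls shown in the feed above.
    {"control": "CIS AWS 2.1.1", "auto_remediable": True,
     "check": lambda r: r.get("encrypted_at_rest", False)},
    {"control": "SOC2 CC6.1", "auto_remediable": False,
     "check": lambda r: r.get("mfa_enforced", False)},
]

def scan(resources):
    """Evaluate every resource against every policy; split drift
    findings into auto-remediated vs awaiting-owner buckets."""
    auto, manual = [], []
    for res in resources:
        for policy in POLICIES:
            if not policy["check"](res):
                finding = (res["id"], policy["control"])
                (auto if policy["auto_remediable"] else manual).append(finding)
    return auto, manual

resources = [
    {"id": "data-store-07", "encrypted_at_rest": False, "mfa_enforced": True},
    {"id": "iam-user-4",    "encrypted_at_rest": True,  "mfa_enforced": False},
]
auto, manual = scan(resources)
print(auto)    # -> [('data-store-07', 'CIS AWS 2.1.1')]
print(manual)  # -> [('iam-user-4', 'SOC2 CC6.1')]
```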
Active Initiatives
Security & Compliance AI Portfolio
01
Misconfiguration detection
Done
High
02
Compliance drift detection
Active
High
03
AI security log analysis
Active
High
04
Vulnerability prioritisation
Planned
Med
05
AI policy explanation assistant
Planned
Med
Time to Patch — By Severity
Average days before vs after AI prioritisation
Critical (CVSS ≥ 9.0)
Before
14d
After
6d
High (CVSS 7–8.9)
Before
30d
After
15d
AI prioritisation cut patch backlog by 58% — on track for 5d critical target.
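Prioritisation like this usually orders the backlog by a composite of base severity and context such as internet exposure. A minimal sketch — the 1.5× exposure weight and the CVE entries are illustrative assumptions:

```python
def patch_priority(findings):
    """Order findings by CVSS, weighted up when the asset is
    internet-exposed. The exposure weight is illustrative."""
    def score(f):
        return f["cvss"] * (1.5 if f["internet_exposed"] else 1.0)
    return sorted(findings, key=score, reverse=True)

findings = [
    {"cve": "CVE-A", "cvss": 9.8, "internet_exposed": False},
    {"cve": "CVE-B", "cvss": 7.5, "internet_exposed": True},
    {"cve": "CVE-C", "cvss": 9.1, "internet_exposed": True},
]
print([f["cve"] for f in patch_priority(findings)])  # -> ['CVE-C', 'CVE-B', 'CVE-A']
```

Note the exposed CVSS 9.1 outranks the internal 9.8 — exactly the kind of re-ordering that shortens time-to-patch for the findings that matter most.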
Reinforce
Knowledge Management & Operational Intelligence
AI-powered knowledge discovery reducing SME dependency and accelerating operational learning
Onboarding Time
-40%
8 weeks → 5 weeks avg
↑ AI onboarding assistant
SME Dependency
-28%
Fewer SME escalations
↑ Self-serve queries up
Knowledge Reuse
73%
Queries resolved by AI KB
↑ Up from 31% baseline
Repeat Incidents
-35%
Same-cause recurrence
↑ AI post-mortems impact
AI Knowledge Assistant — Usage
Query resolution by source
Queries/day
340
Avg response
8s
Satisfaction
4.4/5
AI Post-Incident Analysis — Sample
INC-2024-0931 · Auto-generated in 45s
Post-Incident Summary · AI Generated
What happened: Payment service degraded for 22 min due to DB connection pool exhaustion triggered by a batch job overlap.
Root cause: Batch job RB-cron-042 not throttled. Consumed all 200 pool connections at peak load.
Action items: Add connection limit to batch jobs. Implement pool monitoring alert at 80%. Review cron schedule overlap policy.
Similar past incidents: INC-0714, INC-0823 — same root cause pattern. Prevention runbook created: RB-0054.
Generated in
45s
vs manual
3h
Runbooks created
142
Active Initiatives
Knowledge Management AI Portfolio
01
AI Ops knowledge assistant
Active
High
02
AI runbook generator
Done
High
03
AI post-incident analysis generator
Active
High
04
AI architecture explainer
Planned
Med
05
AI onboarding assistant
Planned
Med
SME Dependency Reduction
Escalation volume by domain
AI knowledge assistant now handles 73% of tier-1 queries without SME involvement.
Data
Data Platforms & AI Governance
Strong data foundations enabling accurate, explainable, and compliant AI operations
Platform Availability
99.7%
Operational data platforms
↑ Up from 98.2%
Data Quality Score
87%
Across governed datasets
↑ Up from 61% baseline
Governed Datasets
1,240
Tagged, searchable, compliant
↑ +340 this quarter
AI Data Readiness
72%
Datasets ready for AI use
→ Target: 90% by Wave 3
Data Platform Health
Availability by Platform
All platforms above 98.5% SLO threshold. Metrics pipeline improved after Nov incident.
AI Data Readiness by Domain
% Datasets Ready for AI Consumption
Capacity metrics dataset below 60% readiness — tagging and lineage work in progress.
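A readiness percentage like this typically requires every criterion — tagging, lineage, quality monitoring, governed access — to hold before a dataset counts as AI-ready. A sketch under that assumption; the criterion names are illustrative:

```python
READINESS_CHECKS = ("tagged", "lineage_tracked", "quality_monitored", "access_governed")

def dataset_ready(ds):
    """A dataset is AI-ready only when every readiness criterion holds."""
    return all(ds.get(check, False) for check in READINESS_CHECKS)

def readiness_pct(datasets):
    ready = sum(1 for ds in datasets if dataset_ready(ds))
    return round(100.0 * ready / len(datasets))

# Illustrative catalogue entries: one fully governed, one mid-tagging.
datasets = [
    {"tagged": True, "lineage_tracked": True,
     "quality_monitored": True, "access_governed": True},
    {"tagged": True, "lineage_tracked": False},  # lineage work in progress
]
print(readiness_pct(datasets))  # -> 50
```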
Active Initiatives
01
Operational data catalogue & tagging
Data
Done
High
02
Data quality monitoring & alerting
Data
Active
High
03
AI data lineage & explainability tracking
Data
Active
High
04
Compliance data governance framework
Data
Planned
Medium
05
Searchable ops knowledge & data store
Data
Planned
Medium
Treat operational data as a strategic asset — searchable, tagged, governed, and AI-ready across all SOAR pillars.
Roadmap
Initiative Roadmap
Wave-based delivery across all capability pillars
Total Initiatives
18
Across all SOAR pillars
Complete
6
Wave 1 delivered
In Progress
7
Wave 2 active
Planned
5
Wave 3 pipeline
Wave 1 — Complete
Quick Wins · Q1 2025
Fastest credibility — high value, low friction
Incident AI Summarization
Done
AI condenses 5,000 log lines into a plain-English root cause summary in under 30 seconds.
Sense
High value
Readiness
Runbook Assistant
Done
AI surfaces and guides engineers through the correct runbook steps during live incidents.
Accelerate
High value
Readiness
Cost Anomaly Detection
Done
AI monitors spend patterns and alerts teams to anomalies within hours of emergence.
Optimize
High value
Readiness
Wave 2 — Active
Decision Enablement · Q2–Q3 2025
Better decisions, reduced risk
Root Cause Assistant
Active
AI correlates alerts, logs and topology to suggest a ranked list of probable root causes.
Sense
High value
Readiness
Deploy Risk Analysis
Active
Scores change requests against historical failure patterns before approval is granted.
Accelerate
High value
Readiness
Compliance Drift Detection
Active
Continuous scanning against CIS, SOC2 and PCI controls with auto-remediation for safe fixes.
Reinforce
High value
Readiness
Wave 3 — Planned
Autonomous Operations · Q4 2025
Execution acceleration and self-healing
Predictive Incident Detection
Planned
ML models detect degradation signals 20–60 min before a user-impacting incident occurs.
Sense
High value
Readiness
Auto-Remediation Workflows
Planned
AI executes safe remediation steps autonomously — with human approval gates for high-risk actions.
Accelerate
High value
Readiness
Intelligent Auto-Scaling
Planned
Predictive scaling driven by AI demand forecasting, reducing over-provisioning and latency spikes.
Optimize
Med value
Readiness
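Predictive incident detection in Wave 3 rests on spotting degradation trends before they breach user-facing thresholds. A threshold-on-trend sketch using an exponentially weighted moving average — a production detector would use a trained model, and the latency series here is invented:

```python
def ewma(series, alpha=0.3):
    """Exponentially weighted moving average of a metric series."""
    avg = series[0]
    for x in series[1:]:
        avg = alpha * x + (1 - alpha) * avg
    return avg

def degradation_signal(latencies_ms, factor=2.0, warmup=10):
    """Index of the earliest sample exceeding the trailing EWMA by
    `factor` -- i.e. the pre-incident warning point, or None."""
    for i in range(warmup, len(latencies_ms)):
        baseline = ewma(latencies_ms[:i])
        if latencies_ms[i] > factor * baseline:
            return i
    return None

# Stable ~100 ms latency, then a ramp toward an incident.
latencies = [100] * 20 + [120, 150, 260, 400]
print(degradation_signal(latencies))  # -> 22
```

Here the warning fires two samples before the worst reading — the same idea, at metric scale, behind the 20–60 minute detection lead time targeted above.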
Value vs Feasibility Matrix
High Value · Low Feasibility
Predictive incidents
Auto-remediation
Plan carefully
High Value · High Feasibility ★
Incident summarisation
Cost anomaly detection
Deploy risk analysis
Compliance drift
Prioritise now
Low Value · Low Feasibility
AI policy explainer
Defer
Low Value · High Feasibility
Architecture explainer
Doc search
Quick add-ons
← Low Feasibility
High Feasibility →
Theme Legend
Sense
Optimize
Accelerate
Reinforce
Readiness Key
Ready
In progress
Prep needed
Value
Value Metrics
Three-layer measurement framework: Adoption → Productivity → Business Value
Layer 1
Adoption
Are people using AI? Without adoption there is no value.
Layer 2
Productivity
Is AI making engineers faster and removing toil?
Layer 3
Business Value
What is the measurable organisational impact? Executives fund this layer.
Layer 1 — Adoption
Are teams using AI tools?
Monthly Active AI Users
68%
of all engineers
↑ Target: 80% by Q3 2026
Teams Onboarded
9 / 12
teams actively using SOAR tools
↑ 3 teams in onboarding
AI-Assisted Incidents
82%
of P1/P2 incidents use AI triage
↑ Up from 12% 12 months ago
Queries per Engineer
14
avg AI queries / engineer / week
↑ Up from 3 at programme start
Adoption Rate by Team
Monthly active users as % of team headcount
Feature Adoption Rate
% of users actively using each AI feature
Layer 2 — Productivity
Are engineers doing more with less toil?
Eng. Hours Saved
3,200h
reclaimed from toil YTD
↑ 1.6 FTE equivalent
Ticket Resolution
-38%
avg resolution time
↑ 4.2h → 2.6h average
Automation Rate
34%
of ops tasks AI-assisted
↑ Target: 60% by Wave 3
Toil Reduction
-41%
self-reported toil per sprint
↑ Engineer NPS up +22pts
Layer 3 — Business Value
What is the executive-level impact?
Financial
Cost impact
Cloud cost reduction
$2.4M
Incidents avoided (est.)
$1.1M
Eng. hours saved ($)
$640K
Total value delivered
$4.14M
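The total is the sum of the three financial lines above, with engineering time converted to dollars at a blended rate. A sketch of the rollup — the $200/h rate is an illustrative assumption consistent with 3,200 hours valuing at $640K:

```python
def engineering_hours_value(hours, blended_rate=200):
    """Dollar value of reclaimed engineering time (rate is assumed)."""
    return hours * blended_rate

value = {
    "cloud_cost_reduction":  2_400_000,
    "incidents_avoided_est": 1_100_000,
    "eng_hours_saved":       engineering_hours_value(3_200),  # $640K
}
total = sum(value.values())
print(f"${total / 1e6:.2f}M")  # -> $4.14M
```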
Reliability
System stability impact
MTTR improvement
-47%
Availability improvement
+1.3%
Customer incidents
-31%
SLO breach rate
-52%
Risk Reduction
Security & compliance
Security incidents
-31%
Compliance violations
-44%
Velocity
Engineering speed
Deploy frequency
+28%
Change success rate
81%→94%
Maturity
AI Maturity Model
Level 0–4 capability progression across all pillars
Program Overall Maturity
2.1
out of 4.0
AI Recommendations — Level 2
AI provides insights and recommendations. Engineers make final decisions. Augmentation before automation.
L0 Manual · L1 Insights · L2 Recommend ◎ · L3 Assist · L4 Autonomous
Sense
Incident Management
L0 · L1 ✓ · L2 ◎ · L3 · L4
MTTR improvement
-47%
Next target
L3 in Wave 3 →
Sense
Observability
L0 · L1 ✓ · L2 ◎ · L3 · L4
Alert noise reduction
-62%
Next target
L3 in Wave 3 →
Optimize
Cost Optimization
L0 · L1 ✓ · L2 ✓ · L3 → · L4
Cost saved
$2.4M
Next target
L3 near-term →
Accelerate
Dev Productivity
L0 · L1 ✓ · L2 ◎ · L3 · L4
Hours saved
3,200h
Next target
L3 in Wave 3 →
Reinforce
Security & Compliance
L0 · L1 ◎ · L2 → · L3 · L4
Security incidents
-31%
Next target
L2 active →
Reinforce
Knowledge Management
L0 · L1 ◎ · L2 · L3 · L4
Knowledge reuse
73%
Next target
L2 in Wave 3 →
Maturity Level Definitions
Level 0
Manual
Fully human-driven. No AI assistance. High cognitive load on engineers.
Level 1
Insights
AI provides visibility and summarisation. Engineers still decide and act. Done
Level 2
Recommendations
AI suggests root causes and fixes. Human approves every action. Current
Level 3
Assisted Execution
AI executes safe actions with human approval gates. Builds trust. Wave 3
Level 4
Autonomous
Self-healing operations. AI acts independently within defined guardrails. Future
❝
AI augments engineers to improve effectiveness and reduce operational burden. Augmentation before automation — build trust at every level.
SOAR Programme Guiding Principle
Insights
AI Adoption Insights
What drives successful AI adoption — and what to avoid
Common Pitfalls to Avoid
Patterns that stall adoption across organisations
⚠
Starting with tools instead of problems to solve
⚠
Many pilots launched with no path to scale
⚠
Poor data readiness — untagged, ungoverned, unsearchable
⚠
No clear ROI model or business value definition
⚠
Change management under-invested — communicate intent clearly
⚠
Standalone AI tools deployed with low workflow integration
What Is Working
Validated approaches across the SOAR programme
✓
AI as assistant, not a replacement — trust is built gradually
✓
Identify pain first, apply AI, then measure the outcome
✓
Standardise → Integrate → Scale: structured rollout model
✓
Tracking adoption, productivity, and business impact together
Rollout Model — AIOps Approach
Alert Intelligence
AIOps platform for alert correlation, incident clustering, and de-duplication
LLM Assist
Root cause analysis, ticket creation, and workflow routing via language models
AI Investigation
Diagnostic analysis with suggested mitigation actions — human approves