We spent a decade hardening human sign-ins - MFA, phishing-resistant credentials, risk-based Conditional Access - and quietly left the non-human half of the directory wide open. Every tenant has hundreds of service principals: SaaS connectors, CI deployers, automation runbooks, the random app some team consented to in 2021. Most carry a client secret valid for two years, broad Microsoft Graph application permissions, and zero conditions on where they can authenticate from. They do not get MFA prompts. They do not trigger a “new device” notification. That is exactly why attackers love them: a stolen secret is a quiet, durable foothold that survives the user password resets you do during incident response.
This guide treats workload identities as first-class subjects of governance. We inventory them, put Conditional Access around them, watch them with Identity Protection, strip the secrets out, right-size their permissions, and finish with the playbook you run when one is compromised.
Licensing note up front: workload-identity Conditional Access and workload identity risk detections require Microsoft Entra Workload ID Premium (a per-service-principal add-on), separate from the user-based Entra ID P2 your humans use. Budget for it before you design around it.
1. The blind spot: why service principals are the favorite persistence mechanism
A service principal is the local representation of an application in your tenant - the thing that actually holds credentials and gets tokens. Two properties make it dangerous:
- Credentials are bearer secrets with long lifetimes. A client secret or certificate authenticates the app from anywhere, and there is no second factor that would make a stolen credential fail. Default secret lifetimes are measured in months to years.
- Permissions are tenant-wide and silent. Application permissions (app roles) like Mail.Read or Directory.ReadWrite.All apply across the whole tenant with no signed-in user to scope them. If an attacker controls the principal, they read every mailbox, not one.
Real intrusions (the Midnight Blizzard / Storm-0558 class of attack) followed this pattern: compromise or mint a credential on an over-privileged app, then quietly use Graph to exfiltrate mail and add their own credential for persistence. The user-focused controls never fired because no user was involved.
The fix is the same Zero Trust posture we apply to people, retargeted at workloads: verify the context of the sign-in, grant least privilege, assume the credential will leak, and remove the credential entirely where you can.
2. Inventory: app registrations, enterprise apps, and over-privileged Graph permissions
You cannot govern what you have not enumerated. Start with the two halves of every app:
- App registration (Get-MgApplication) - the definition, owned in your tenant, where credentials and the requested permissions live.
- Service principal / enterprise app (Get-MgServicePrincipal) - the instance in your tenant, where sign-ins, role assignments, and granted consent live.
Pull a full inventory with Microsoft Graph PowerShell:
Connect-MgGraph -Scopes "Application.Read.All","Directory.Read.All","AuditLog.Read.All"
# All app registrations with credential expiry
Get-MgApplication -All -Property DisplayName,AppId,PasswordCredentials,KeyCredentials |
Select-Object DisplayName, AppId,
@{n='Secrets'; e={ $_.PasswordCredentials.Count }},
@{n='Certs'; e={ $_.KeyCredentials.Count }},
@{n='NextExpiry'; e={ (@($_.PasswordCredentials.EndDateTime) + @($_.KeyCredentials.EndDateTime) |
Where-Object { $_ } | Sort-Object | Select-Object -First 1) }} |
Sort-Object NextExpiry | Format-Table -AutoSize
Now find the genuinely dangerous ones - principals granted high-impact application Graph permissions (not delegated). This is the query that surfaces your blast radius:
$graph = Get-MgServicePrincipal -Filter "appId eq '00000003-0000-0000-c000-000000000000'"
$dangerous = @('Directory.ReadWrite.All','RoleManagement.ReadWrite.Directory',
'AppRoleAssignment.ReadWrite.All','Application.ReadWrite.All',
'Mail.ReadWrite','Mail.Read','User.ReadWrite.All')
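# AppRoleAssignedTo lists grants where Microsoft Graph is the resource, i.e. who holds Graph app roles
Get-MgServicePrincipalAppRoleAssignedTo -ServicePrincipalId $graph.Id -All |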
Where-Object { $_.AppRoleId } | ForEach-Object {
$roleName = ($graph.AppRoles | Where-Object Id -eq $_.AppRoleId).Value
if ($roleName -in $dangerous) {
[pscustomobject]@{ App = $_.PrincipalDisplayName; Permission = $roleName }
}
} | Sort-Object App | Format-Table -AutoSize
AppRoleAssignment.ReadWrite.All and RoleManagement.ReadWrite.Directory are effectively tenant takeover permissions - an app holding either can grant itself any other permission or assign itself Global Administrator. Treat them as Tier 0.
Finally, flag the orphans and stragglers: apps with no owner, apps that have not signed in for 90+ days, and apps with secrets but no certificate. Service principal sign-ins live in the sign-in logs under the service principal sign-in category, while MicrosoftGraphActivityLogs records the Graph calls those principals make afterwards. Pull Get-MgBetaAuditLogSignIn -Filter "signInEventTypes/any(t: t eq 'servicePrincipal')" to see who is actually active (the signInEventTypes filter is a beta-endpoint property, so the v1.0 Get-MgAuditLogSignIn cmdlet may not accept it).
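The three flags above are simple predicates over an inventory row, so they are easy to prototype offline before wiring them to Graph. The following is a minimal Python sketch; the `AppRecord` shape and its field names are hypothetical (they stand in for whatever you export from the inventory queries), and treating "no sign-in on record" as stale is an assumption you may want to change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class AppRecord:
    """Hypothetical flattened inventory row; field names are illustrative."""
    name: str
    owner_count: int
    secret_count: int
    cert_count: int
    last_sign_in: Optional[datetime]  # None means no sign-in on record


def hygiene_flags(app: AppRecord, now: datetime, stale_days: int = 90) -> List[str]:
    """Return the hygiene flags from this section that apply to one row."""
    flags = []
    if app.owner_count == 0:
        flags.append("no-owner")
    # Assumption: a principal with no recorded sign-in is treated as stale.
    if app.last_sign_in is None or now - app.last_sign_in >= timedelta(days=stale_days):
        flags.append("stale")
    if app.secret_count > 0 and app.cert_count == 0:
        flags.append("secret-only")
    return flags


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
legacy = AppRecord("legacy-connector", 0, 2, 0, now - timedelta(days=200))
healthy = AppRecord("ci-deployer", 2, 0, 1, now - timedelta(days=1))
print(hygiene_flags(legacy, now))   # ['no-owner', 'stale', 'secret-only']
print(hygiene_flags(healthy, now))  # []
```

Keeping the rules as a pure function makes it cheap to tune the 90-day threshold against an exported snapshot before anyone disables anything.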
3. Conditional Access for workload identities (IP location restrictions)
Human CA policies do not apply to service principals. There is a separate CA target: workload identities. The single most valuable control here is a location condition - the vast majority of your service principals only ever authenticate from a fixed set of egress IPs (your CI runners, your automation subnet, a partner’s published ranges). Lock them to those IPs and a stolen secret used from an attacker’s infrastructure is simply blocked.
First define a named location for your legitimate egress:
$ip = New-MgIdentityConditionalAccessNamedLocation -BodyParameter @{
"@odata.type" = "#microsoft.graph.ipNamedLocation"
displayName = "Corp egress + CI runners"
isTrusted = $true
ipRanges = @(
@{ "@odata.type" = "#microsoft.graph.iPv4CidrRange"; cidrAddress = "203.0.113.0/24" }
@{ "@odata.type" = "#microsoft.graph.iPv4CidrRange"; cidrAddress = "198.51.100.16/28" }
)
}
Then create a CA policy that targets service principals (not users) and blocks sign-ins from anywhere except that location. Start in enabledForReportingButNotEnforced and read the report-only results before you flip to enabled:
$params = @{
displayName = "WL - Block SP sign-in outside corp egress"
state = "enabledForReportingButNotEnforced" # report-only first
conditions = @{
clientApplications = @{
includeServicePrincipals = @("ServicePrincipalsInMyTenant")
# excludeServicePrincipals = @("<break-glass-automation-sp-object-id>")  # service principal object ID, not the appId
}
applications = @{ includeApplications = @("All") }
locations = @{
includeLocations = @("All")
excludeLocations = @($ip.Id) # everything except our egress
}
}
grantControls = @{ operator = "OR"; builtInControls = @("block") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $params
Caveats that matter in production:
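- Workload-identity CA applies to single-tenant service principals registered in your tenant; third-party SaaS and multi-tenant apps are out of scope. It supports the block control with location and risk conditions. It does not give you MFA (there is no human to prompt) - the control is “is this sign-in coming from where I expect.”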
- Exclude your break-glass automation. Just like human break-glass accounts, keep one tightly-held emergency automation principal out of the policy so a bad IP list cannot lock you out of remediation.
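- Managed identities are not covered by workload-identity CA policies; Microsoft's documentation limits them to service principals backed by an app registration. A managed identity used by an Azure PaaS service therefore needs other guardrails: tightly scoped role assignments and network controls on the resources it reaches.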
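Conceptually, the location lock in this section is a CIDR membership test on the sign-in's source IP. The Python sketch below uses only the standard `ipaddress` module to replay a list of sign-ins against the two example ranges from the named location and report which would be blocked, which is roughly what you are eyeballing during the report-only phase. The function names and the `(app, ip)` tuple shape are hypothetical; this does not reproduce how Entra evaluates policy.

```python
import ipaddress
from typing import Iterable, List, Tuple

# Example ranges copied from the named location defined in this section.
ALLOWED_CIDRS = ["203.0.113.0/24", "198.51.100.16/28"]


def is_allowed(source_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """True if the source IP falls inside any allowed CIDR range."""
    ip = ipaddress.ip_address(source_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


def would_block(sign_ins: Iterable[Tuple[str, str]],
                allowed_cidrs: Iterable[str]) -> List[Tuple[str, str]]:
    """Return the (app, ip) sign-ins that fall outside every allowed range."""
    cidrs = list(allowed_cidrs)
    return [(app, ip) for app, ip in sign_ins if not is_allowed(ip, cidrs)]


sample = [
    ("ci-deployer", "203.0.113.40"),    # inside the /24
    ("runbook", "198.51.100.31"),       # last address of the /28
    ("ci-deployer", "198.51.100.32"),   # one past the /28
    ("shadow-script", "192.0.2.10"),    # unrelated range
]
print(would_block(sample, ALLOWED_CIDRS))
```

Running a week of exported sign-in IPs through a check like this before enforcement is a cheap way to find automations that authenticate from unexpected places.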
4. Detecting compromised service principals with workload identity risk signals
Entra Identity Protection extends to workload identities and produces service-principal risk detections that are exactly the signals an attacker trips:
| Detection | What it means |
|---|---|
| Leaked credentials | The app’s secret/cert was found in a public leak (paste sites, repos). |
| Anomalous service principal activity | Behavior deviates from the app’s learned baseline (new resources, unusual Graph calls). |
| Suspicious sign-ins | Sign-in patterns inconsistent with the principal’s normal behavior. |
| Admin confirmed SP compromised | An analyst manually flagged it - drives automation downstream. |
Surface risky workload identities via Graph (IdentityRiskyServicePrincipal.Read.All):
Get-MgRiskyServicePrincipal -All |
Where-Object { $_.RiskLevel -in @('high','medium') -and $_.RiskState -ne 'dismissed' } |
Select-Object DisplayName, AppId, RiskLevel, RiskState, RiskLastUpdatedDateTime |
Format-Table -AutoSize
Then close the loop with a risk-based Conditional Access policy for workload identities that blocks any service principal at elevated risk. This is the automated containment that runs at 3 a.m. without you:
$risk = @{
displayName = "WL - Block risky service principals"
state = "enabled"
conditions = @{
clientApplications = @{ includeServicePrincipals = @("ServicePrincipalsInMyTenant") }
applications = @{ includeApplications = @("All") }
servicePrincipalRiskLevels = @("high","medium")
}
grantControls = @{ operator = "OR"; builtInControls = @("block") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $risk
For investigation and SIEM correlation, the AADServicePrincipalRiskEvents and AADServicePrincipalSignInLogs tables in Log Analytics / Sentinel are where you hunt. A useful KQL starting point:
AADServicePrincipalSignInLogs
| where TimeGenerated > ago(7d)
| where ResultType == 0 // successful sign-ins only
| summarize SignIns = count(), IPs = make_set(IPAddress, 50) by ServicePrincipalName, AppId
| extend DistinctIPs = array_length(IPs)
| where DistinctIPs > 5 // apps suddenly auth'ing from many IPs
| sort by DistinctIPs desc
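For readers who want to test the same heuristic against an exported CSV or JSON dump rather than a live workspace, here is a rough Python equivalent of the query above. The row keys (`app_id`, `ip`, `result_type`) are hypothetical names for the exported columns, and the threshold of 5 simply mirrors the KQL.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def apps_with_many_ips(sign_ins: Iterable[dict], threshold: int = 5) -> List[Tuple[str, int]]:
    """Count distinct source IPs per app over successful sign-ins.

    Returns (app_id, distinct_ip_count) pairs above the threshold,
    highest count first.
    """
    ips_by_app: Dict[str, Set[str]] = defaultdict(set)
    for row in sign_ins:
        if row["result_type"] == 0:  # successful sign-ins only
            ips_by_app[row["app_id"]].add(row["ip"])
    flagged = [(app, len(ips)) for app, ips in ips_by_app.items() if len(ips) > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample: one app seen from 7 addresses, one from a single address.
rows = [{"app_id": "app-noisy", "ip": f"192.0.2.{i}", "result_type": 0} for i in range(7)]
rows += [{"app_id": "app-quiet", "ip": "203.0.113.5", "result_type": 0} for _ in range(20)]
rows += [{"app_id": "app-quiet", "ip": f"198.51.100.{i}", "result_type": 7000} for i in range(10)]
print(apps_with_many_ips(rows))  # [('app-noisy', 7)]
```

The threshold is a starting point, not a rule: an app behind a large NAT pool can legitimately exceed it, so compare against each app's own history where you can.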
5. Going secretless: federated credentials and managed identities
The most durable fix is to delete the secret. Two mechanisms cover almost every workload:
Managed identities for anything running in Azure (VMs, App Service, Functions, AKS, Container Apps, Automation). Azure manages the credential lifecycle entirely - there is no secret you can leak. Prefer user-assigned so the identity outlives a single resource and is reusable:
az identity create -g rg-platform-prod -n id-app-prod
PRINCIPAL_ID=$(az identity show -g rg-platform-prod -n id-app-prod --query principalId -o tsv)
# Grant data-plane access, not Owner - e.g. read a Key Vault
az role assignment create --assignee "$PRINCIPAL_ID" \
--role "Key Vault Secrets User" \
--scope "/subscriptions/$SUB_ID/resourceGroups/rg-platform-prod/providers/Microsoft.KeyVault/vaults/kv-app-prod"
Federated identity credentials (FIC) for workloads running outside Azure - GitHub Actions, GitLab, other clouds, or Kubernetes via the OIDC issuer. Instead of a secret, the workload presents a short-lived OIDC token whose issuer/subject/audience you pre-registered. Entra exchanges it for an access token only on an exact claim match. For GitHub Actions on the main branch:
az ad app federated-credential create --id "$APP_ID" --parameters '{
"name": "gha-main",
"issuer": "https://token.actions.githubusercontent.com",
"subject": "repo:kloudvin/platform:ref:refs/heads/main",
"audiences": ["api://AzureADTokenExchange"]
}'
Once federation or a managed identity is in place, delete every password credential on the app so there is nothing left to steal:
$app = Get-MgApplication -Filter "appId eq '$AppId'"
$app.PasswordCredentials | ForEach-Object {
Remove-MgApplicationPassword -ApplicationId $app.Id -KeyId $_.KeyId
}
The subject in a GitHub FIC is an exact string match - no wildcards. Register one credential per branch/environment you actually deploy from. That tightness is the point: a leaked workflow file cannot move the trust to another branch.
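The exact-match rule can be illustrated in a few lines of Python. This is only a sketch of the comparison described above, not Entra's implementation: a real token exchange also verifies the token signature, expiry, and issuer keys. The `FederatedCredential` class and `matches` function are hypothetical names; the values are the ones from the example credential.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FederatedCredential:
    """The three values pre-registered on the app (illustrative model)."""
    issuer: str
    subject: str
    audiences: Tuple[str, ...]


def matches(cred: FederatedCredential, claims: dict) -> bool:
    """Exact string comparison of iss/sub, plus audience membership."""
    aud = claims.get("aud")
    token_audiences = [aud] if isinstance(aud, str) else list(aud or [])
    return (
        claims.get("iss") == cred.issuer
        and claims.get("sub") == cred.subject
        and any(a in cred.audiences for a in token_audiences)
    )


gha_main = FederatedCredential(
    issuer="https://token.actions.githubusercontent.com",
    subject="repo:kloudvin/platform:ref:refs/heads/main",
    audiences=("api://AzureADTokenExchange",),
)

main_token = {
    "iss": "https://token.actions.githubusercontent.com",
    "sub": "repo:kloudvin/platform:ref:refs/heads/main",
    "aud": "api://AzureADTokenExchange",
}
# Same repo, different branch: the subject string differs, so no match.
feature_token = dict(main_token, sub="repo:kloudvin/platform:ref:refs/heads/feature-x")

print(matches(gha_main, main_token))     # True
print(matches(gha_main, feature_token))  # False
```

Because the comparison is string equality, a workflow running from any other branch, tag, or repository presents a different `sub` and gets nothing.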
6. Right-sizing application permissions and admin consent workflows
Two governance moves cut the permission blast radius:
Prefer delegated over application permissions, and least-privilege scopes. If an app only sends mail as one mailbox, use Application Access Policies in Exchange to constrain Mail.Send to a single mailbox instead of the whole tenant:
New-ApplicationAccessPolicy -AppId $AppId `
-PolicyScopeGroupId "svc-mailers@contoso.com" `
-AccessRight RestrictAccess `
-Description "Limit app to mailboxes in svc-mailers group"
Turn off ad-hoc user consent and route requests through an admin consent workflow. Stop users from consenting apps into your tenant; force a reviewer.
# Restrict user consent to verified publishers + low-impact permissions only
Update-MgPolicyAuthorizationPolicy -BodyParameter @{
defaultUserRolePermissions = @{
permissionGrantPoliciesAssigned = @("ManagePermissionGrantsForSelf.microsoft-user-default-low")
}
}
Then enable the admin consent request workflow so a user who needs a new app generates a request that named reviewers approve - giving you a governed front door instead of silent grants. Configure reviewers and notifications under Entra admin center > Identity > Enterprise applications > Consent and permissions > Admin consent settings, and review the resulting requests as a recurring control.
7. Credential hygiene: expiry alerting, certificate rotation, and orphan cleanup
Even with secretless as the goal, you will have legacy apps holding credentials for a while. Run hygiene as a scheduled job:
# Alert on credentials expiring within 30 days (or already expired)
$threshold = (Get-Date).AddDays(30)
Get-MgApplication -All -Property DisplayName,AppId,PasswordCredentials,KeyCredentials |
ForEach-Object {
foreach ($c in @($_.PasswordCredentials + $_.KeyCredentials)) {
if ($c.EndDateTime -and $c.EndDateTime -lt $threshold) {
[pscustomobject]@{
App = $_.DisplayName; AppId = $_.AppId
Type = if ($c.GetType().Name -match 'Key') {'Certificate'} else {'Secret'}
KeyId = $c.KeyId; Expires = $c.EndDateTime
}
}
}
} | Sort-Object Expires | Format-Table -AutoSize
Hygiene rules worth enforcing:
- Certificates over secrets for any app that must keep a credential - rotate via Key Vault and roll the public key onto the app before removing the old one (overlap, never gap).
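- Cap secret lifetime in policy. Use an app management policy to forbid long-lived passwords so no one creates a two-year secret again. A custom policy created this way takes effect only on the apps or service principals it is assigned to; for a tenant-wide ceiling, apply the same passwordLifetime restriction to the tenant default policy (Update-MgPolicyDefaultAppManagementPolicy):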
New-MgPolicyAppManagementPolicy -BodyParameter @{
displayName = "No long-lived secrets"
isEnabled = $true
restrictions = @{
passwordCredentials = @(@{
restrictionType = "passwordLifetime"
maxLifetime = "P90D" # 90-day ceiling on client secrets
state = "enabled"
})
}
}
- Clean up orphans monthly: disable (do not immediately delete) principals with no owner and no sign-in for 90 days; delete after a 30-day grace period. Disable with Update-MgServicePrincipal -ServicePrincipalId $id -AccountEnabled:$false.
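The disable-then-delete rule for orphans is a small state machine, sketched below in Python so the thresholds are explicit. The function name, its parameters, and the returned action strings are hypothetical; it assumes you track the date each principal was disabled (for example in a tag or a tracking list), which the text above does not specify.

```python
from datetime import date, timedelta
from typing import Optional


def orphan_action(owner_count: int,
                  last_sign_in: Optional[date],
                  disabled_on: Optional[date],
                  today: date,
                  stale_days: int = 90,
                  grace_days: int = 30) -> str:
    """Decide the monthly cleanup action for one service principal.

    Returns one of: "keep", "disable", "wait", "delete".
    """
    if disabled_on is not None:
        # Already disabled: delete only once the grace period has elapsed.
        return "delete" if today - disabled_on >= timedelta(days=grace_days) else "wait"
    # Assumption: no recorded sign-in counts as stale.
    is_stale = last_sign_in is None or today - last_sign_in >= timedelta(days=stale_days)
    if owner_count == 0 and is_stale:
        return "disable"
    return "keep"


today = date(2025, 6, 1)
print(orphan_action(0, date(2025, 1, 1), None, today))              # disable
print(orphan_action(0, date(2025, 1, 1), date(2025, 5, 20), today)) # wait
print(orphan_action(0, date(2025, 1, 1), date(2025, 4, 1), today))  # delete
print(orphan_action(1, date(2025, 1, 1), None, today))              # keep
```

Note that an owned-but-stale principal returns "keep" here because the rule as written requires both conditions; route those to the owner for review rather than disabling them automatically.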
8. Continuous monitoring and the compromise playbook
Wire the signals into Sentinel and define analytics rules for the high-signal events: new credential added to an app/SP, new app role (permission) granted, app added to a privileged directory role, and service principal risk detection raised. The “credential added” rule catches the classic persistence move:
AuditLogs
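// Match on the stable part of the second name: the dash in it is often logged as an en dash, which breaks an exact comparison
| where OperationName == "Add service principal credentials"
    or OperationName has "Certificates and secrets management"
| extend Target = tostring(TargetResources[0].displayName)
// Fall back to the initiating app when no user was involved
| extend Actor = coalesce(tostring(parse_json(tostring(InitiatedBy.user)).userPrincipalName),
    tostring(parse_json(tostring(InitiatedBy.app)).displayName))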
| project TimeGenerated, OperationName, Target, Actor, Result
| sort by TimeGenerated desc
When a principal is compromised, time-to-contain is everything. The playbook:
- Confirm and contain. In Identity Protection, mark the SP compromised (drives risk-based CA to block it). Then hard-disable: Update-MgServicePrincipal -ServicePrincipalId $id -AccountEnabled:$false.
- Revoke all credentials. Remove every PasswordCredentials / KeyCredentials entry on the app, and any set directly on the service principal. Access tokens already issued to the principal cannot be recalled the way user sessions can; they remain usable until they expire (typically about an hour), so count that window as exposed when you scope the incident.
- Hunt for attacker persistence. Diff the app’s credentials, owners, app role assignments, and federated credentials against your last known-good export. Attackers add a second secret or a rogue FIC so revoking the first one does nothing - check Get-MgApplicationFederatedIdentityCredential too.
- Assess blast radius. Pull MicrosoftGraphActivityLogs for that AppId to see exactly which Graph resources the principal touched during the window (which mailboxes, which directory objects).
- Rebuild clean. Re-create the identity secretless (managed identity or FIC), grant least privilege, and bring it back behind the location + risk CA policies.
- Post-incident: add the attacker's source IP ranges to a block list, tighten the FIC subject, and add an analytics rule for whatever technique you missed.
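The persistence hunt in the playbook comes down to a set difference between a known-good snapshot and the current state. Below is a minimal Python sketch of that diff. The snapshot shape (a dict of field name to list of identifiers such as key IDs, owner object IDs, app role IDs, and FIC subjects) is an assumption about how you store your exports, and all identifiers in the example are made up.

```python
from typing import Dict, List


def diff_snapshot(known_good: Dict[str, List[str]],
                  current: Dict[str, List[str]]) -> Dict[str, Dict[str, List[str]]]:
    """Report identifiers added or removed per field since the known-good export."""
    findings = {}
    for field in sorted(set(known_good) | set(current)):
        before = set(known_good.get(field, []))
        after = set(current.get(field, []))
        added = sorted(after - before)
        removed = sorted(before - after)
        if added or removed:
            findings[field] = {"added": added, "removed": removed}
    return findings


# Hypothetical snapshots of one app registration.
known_good = {
    "password_key_ids": ["key-1"],
    "owners": ["owner-a"],
    "app_roles": ["Mail.Read"],
    "fic_subjects": ["repo:kloudvin/platform:ref:refs/heads/main"],
}
current = {
    "password_key_ids": ["key-1", "key-2"],          # second secret appeared
    "owners": ["owner-a"],
    "app_roles": ["Mail.Read"],
    "fic_subjects": [
        "repo:kloudvin/platform:ref:refs/heads/main",
        "repo:someone-else/fork:ref:refs/heads/main",  # rogue FIC
    ],
}
for field, change in diff_snapshot(known_good, current).items():
    print(field, change)
```

Anything in "added" that nobody on the team can account for is a candidate persistence mechanism; the diff is only as trustworthy as the known-good export, so keep those exports somewhere the app itself cannot write.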
Verify
Confirm the controls are real, not just present:
# 1. Workload-identity CA policies exist and target service principals
Get-MgIdentityConditionalAccessPolicy -All |
Where-Object { $_.Conditions.ClientApplications.IncludeServicePrincipals } |
Select-Object DisplayName, State
# 2. Risk-based blocking is live for workloads
Get-MgRiskyServicePrincipal -All | Measure-Object # baseline count of risky SPs
# 3. No app holds Tier-0 Graph permissions it should not (re-run the section-2 query)
# 4. App management policy caps secret lifetime
Get-MgPolicyAppManagementPolicy | Select-Object DisplayName, IsEnabled
- Trigger a sign-in from outside your named location with a test principal and confirm it is blocked in AADServicePrincipalSignInLogs (look for the CA failure reason).
- Confirm a federated workload gets a token with no secret present on the app ((Get-MgApplication -Filter "appId eq '$AppId'").PasswordCredentials returns empty).
- Add a dummy secret to a test app and confirm your Sentinel “credential added” rule fires within the expected window.
Enterprise scenario
A global logistics platform team ran ~340 service principals across three regions. During a tabletop exercise, their red team demonstrated the gap precisely: they planted a CI deployer’s client secret (scraped from an old pipeline log) on a VPS in another country and authenticated to Microsoft Graph successfully - the app had Mail.Read (application) and a secret valid for another 14 months. Nothing alerted, because no user and no MFA were involved.
The constraint was that they could not go fully secretless overnight - a dozen partner-operated apps and on-prem schedulers genuinely needed credentials for another two quarters. So they sequenced it. First, workload-identity CA with a location lock: every in-tenant service principal was restricted to a named location containing only their three NAT egress CIDRs plus their GitHub runner ranges, rolled out in report-only for two weeks to catch surprises (they found four shadow automations running from a developer’s home IP, which became their first cleanup). The same VPS replay attack, re-run after enforcement, was blocked at the token endpoint:
# The location-locked block policy that stopped the replay
$params = @{
displayName = "WL - SP egress lock"
state = "enabled"
conditions = @{
clientApplications = @{ includeServicePrincipals = @("ServicePrincipalsInMyTenant")
excludeServicePrincipals = @("<breakglass-automation-sp-object-id>") }  # service principal object ID, not the appId
applications = @{ includeApplications = @("All") }
locations = @{ includeLocations = @("All"); excludeLocations = @($corpEgressLocationId) }
}
grantControls = @{ operator = "OR"; builtInControls = @("block") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $params
In parallel they migrated everything Azure-resident to user-assigned managed identities, moved GitHub deployers to federated credentials, and capped secret lifetime at 90 days via an app management policy so the long-lived-secret problem could not regrow. Six months later the credential count was down from ~340 secrets to 11 (all partner apps on a tracked retirement plan), and “leaked credentials” risk detections - which had previously been unmonitored - were wired into a Sentinel playbook that auto-disabled the principal and paged the on-call. The single highest-leverage move, in their post-mortem, was not the secretless migration; it was the location lock, because it neutralized stolen secrets on day one while the slower migration caught up.