Idempotent Windows Post-Deploy: Replacing Decades of Bolted-On PowerShell

A large enterprise customer’s Windows post-deploy automation was a tangle of inline PowerShell, blind sleep timers, and “if this thing fails, paper over it with another step” patterns. The team replaced it with ten production Ansible roles: a clean, idempotent, version-controlled library built on a desired-state dispatcher pattern.

The Challenge

The customer’s legacy Windows post-deploy workflow lived in vRealize Orchestrator. It worked — mostly. But it carried the scars of every workaround that had ever been needed.

Five-to-eight-minute blind sleep timers padded around guest operations calls. The timers existed because some operations occasionally took too long, so somebody had added a sleep “just to be safe.” The sleeps stayed. New sleeps got added when new operations exhibited similar issues. The cumulative effect was provisioning that took longer than necessary, with the VM sitting idle for minutes at a time waiting for timers that didn’t actually correspond to any real condition being met.

Domain joins were happening twice in some paths because nobody had ever fixed the original failure. The first attempt would fail silently. The second attempt — added later as a workaround — would succeed. The first attempt was never removed. New code paths inherited the duplicate-call pattern from the existing template, perpetuating the issue.

Credentials were sprinkled throughout JavaScript actions. Service account passwords appeared as string literals in workflow scripts. Domain join credentials lived in workflow attributes. KMS server addresses were hardcoded. Rotating any of these required hunting through dozens of workflow elements.

Configuration drift was inevitable. Each Windows VM left the post-deploy pipeline in a slightly different state depending on which workflow path it took, which timing windows it hit, and which transient errors got papered over. Operators couldn’t trust that two VMs built the same way were actually configured the same.

The customer needed a clean replacement: idempotent, declarative, version-controlled, and re-runnable against existing VMs without breaking them.

The Solution: Ten Idempotent Ansible Roles

The team replaced the entire vRO post-deploy workflow with an Ansible role library built around a desired-state dispatcher pattern. Each role represents one configuration concern. Each role checks current state before changing it. Each role is safely re-runnable against any VM at any time.

The Connection Strategy

WinRM over HTTPS as the interim approach. The customer’s existing Windows templates used WinRM over HTTPS on port 5986 with NTLM authentication. The Ansible playbook connects using the winrm connection type, with kerberos or ntlm as the transport, configured per environment. Long term, the team documented a migration path to Win32-OpenSSH (Microsoft’s official OpenSSH server for Windows) once the customer’s templates are updated to include it.

The enable_winrm_handler.py ABX bridge. Aria Automation’s deployment lifecycle includes an event topic, compute.provision.post, that fires once a machine has been provisioned. An ABX (Action-Based Extensibility) action subscribed to that topic connects to vCenter with the customer’s service account and pushes the Ansible project’s ConfigureRemotingForAnsible.ps1 script to the new VM through the vSphere Guest Operations API, which works through VMware Tools (the same mechanism the vmware_vm_shell Ansible module uses). By the time Ansible attempts to connect, WinRM is already configured and listening.

This bridge solves the chicken-and-egg problem: Ansible needs WinRM to configure the VM, but the VM needs to be configured before WinRM is available. Guest Operations does not depend on the guest’s network at all: vCenter’s authenticated control channel pushes the configuration into the guest OS through VMware Tools regardless of the VM’s network state.
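A minimal sketch of how a bridge action like this might be shaped, assuming the conventional ABX Python entry point handler(context, inputs). Everything else is hypothetical and for illustration only: the resourceNames and scriptPath input keys, the build_guest_command helper, the assumption that the script has already been copied into the guest, and the injected run_in_guest callable that stands in for the real vCenter Guest Operations call. The actual enable_winrm_handler.py may look quite different.

```python
import base64
from typing import Callable, Optional, Tuple

POWERSHELL = r"C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe"
# Hypothetical location of the script inside the guest.
DEFAULT_SCRIPT = r"C:\Windows\Temp\ConfigureRemotingForAnsible.ps1"


def build_guest_command(script_path: str) -> Tuple[str, str]:
    """Return (program, arguments) for running a PowerShell script in the guest."""
    command = f"& '{script_path}'"
    # -EncodedCommand takes Base64 of the UTF-16LE command text, which
    # sidesteps quoting problems when the arguments pass through several layers.
    encoded = base64.b64encode(command.encode("utf-16-le")).decode("ascii")
    arguments = (
        "-NoProfile -NonInteractive -ExecutionPolicy Bypass "
        f"-EncodedCommand {encoded}"
    )
    return POWERSHELL, arguments


def handler(context, inputs,
            run_in_guest: Optional[Callable[[str, str, str], int]] = None):
    """Bootstrap WinRM on each newly provisioned VM named in the event payload.

    run_in_guest(vm_name, program, arguments) -> exit code is a hypothetical
    hook; a real action would implement it with a vCenter client library.
    """
    if run_in_guest is None:
        raise NotImplementedError("wire in a vCenter Guest Operations call here")

    program, arguments = build_guest_command(inputs.get("scriptPath", DEFAULT_SCRIPT))
    exit_codes = {
        vm_name: run_in_guest(vm_name, program, arguments)
        for vm_name in inputs.get("resourceNames", [])
    }
    failed = sorted(name for name, code in exit_codes.items() if code != 0)
    if failed:
        # Fail the action so the deployment does not proceed to Ansible blind.
        raise RuntimeError(f"WinRM bootstrap failed on: {', '.join(failed)}")
    return {"winrmBootstrapped": sorted(exit_codes)}
```

Injecting the guest call keeps the command-building logic testable without a vCenter connection, which is a design choice of this sketch rather than something the source describes.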

The Ten Roles

Each role represents one configuration concern. The roles run in sequence, and each one is independently idempotent.

1. timezone_set — Configures the Windows timezone. Reads the current timezone via PowerShell, compares it to the desired value, and changes it only if different. Handles the post-change time service restart.

2. kms_activate — Activates the Windows installation against the customer’s KMS server. Checks current activation state via slmgr.vbs /xpr before issuing activation commands. Includes the appropriate KMS keys for each Windows Server version supported.

3. windows_hardening — Applies the customer’s security baseline. Reads the effective policy state via gpresult and PowerShell registry queries, then applies any settings that drift from the desired state. The role is structured around the customer’s specific hardening requirements but the pattern generalizes to CIS or DISA STIG baselines.

4. rdp_enable — Enables Remote Desktop and configures the Windows Firewall rules for it. Checks the current RDP state and firewall rule state before making changes.

5. fleet_agent_install — Installs the customer’s fleet management agent. Checks for the agent’s installer presence, the running service, and the registered enrollment state. Skips reinstallation if the agent is already healthy.

6. build_info_registry — Stamps build metadata into the Windows registry for inventory tracking. Records build date, build version, image source, and provisioning request ID. The role is careful to preserve CreatedDate on re-runs — the original build date doesn’t change if the role runs again later.

7. ad_groups — Adds the VM’s computer object to appropriate AD groups based on its role and location. Reads current group membership before adding to avoid duplicate-add errors.

8. dns_registration_validation — Verifies that the BlueCat-allocated DNS records are actually resolvable from the VM. Tests forward and reverse lookups for the VM’s hostname and IP. Surfaces resolution failures as actionable errors rather than letting them manifest later as application connectivity issues.

9. winrm_https — Promotes the WinRM listener from the bootstrap configuration to a hardened production configuration with proper certificates and authentication restrictions.

10. domain_join — Joins the VM to Active Directory. Always runs last, with a reboot on success. Uses the microsoft.ad.membership module from the microsoft.ad Ansible collection, reading credentials from a vault file. Gates on Win32_ComputerSystem.PartOfDomain so the role is safely re-runnable: if the VM is already domain-joined, the role does nothing.
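The forward and reverse check described for dns_registration_validation can be sketched roughly as follows. This is an illustration in Python, not the role's implementation (the real role presumably runs its lookups on the Windows guest). The validate_dns name, its parameters, and the example hostnames are assumptions.

```python
import socket
from typing import Callable, List, Optional


def validate_dns(hostname: str, ip: str,
                 forward: Callable[[str], str] = socket.gethostbyname,
                 reverse: Optional[Callable[[str], str]] = None) -> List[str]:
    """Return a list of readable problems; an empty list means DNS is consistent."""
    if reverse is None:
        # gethostbyaddr returns (primary_name, aliases, addresses).
        reverse = lambda addr: socket.gethostbyaddr(addr)[0]

    problems: List[str] = []

    # Forward lookup: the hostname should resolve to the allocated address.
    try:
        resolved_ip = forward(hostname)
        if resolved_ip != ip:
            problems.append(
                f"forward lookup of {hostname} returned {resolved_ip}, expected {ip}")
    except OSError as exc:  # socket.gaierror and socket.herror derive from OSError
        problems.append(f"forward lookup of {hostname} failed: {exc}")

    # Reverse lookup: the address should map back to the same name.
    try:
        resolved_name = reverse(ip)
        # Compare case-insensitively and ignore a trailing root dot.
        if resolved_name.rstrip(".").lower() != hostname.rstrip(".").lower():
            problems.append(
                f"reverse lookup of {ip} returned {resolved_name}, expected {hostname}")
    except OSError as exc:
        problems.append(f"reverse lookup of {ip} failed: {exc}")

    return problems
```

Returning every problem found, rather than stopping at the first, is what makes the failure actionable: an operator sees at once whether the A record, the PTR record, or both are wrong.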

The Desired-State Dispatcher Pattern

Every role follows the same skeleton:

  1. Read current state — Query the system to determine the actual current configuration
  2. Compare to desired state — Compare the actual state to what the playbook variable specifies
  3. Apply only if different — Make the change only if needed
  4. Verify the change — Re-read state after change to confirm success
  5. Report changed/unchanged — Use Ansible’s changed_when to accurately reflect whether anything actually happened

This pattern delivers true idempotency. Run the playbook five times against the same VM and the second through fifth runs report ok=N changed=0 — they read state, find it matches, and do nothing. The customer can run any role at any time without fear.
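The five-step skeleton can be sketched in a few lines. This is a language-neutral illustration in Python rather than the Ansible task files themselves; ensure_state, Result, and the in-memory "system" are hypothetical names used only to show the control flow.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Result:
    changed: bool
    before: Any
    after: Any


def ensure_state(desired: Any,
                 read: Callable[[], Any],
                 apply: Callable[[Any], None]) -> Result:
    """Drive one setting to its desired value, touching it only when needed."""
    current = read()                           # 1. read current state
    if current == desired:                     # 2. compare to desired state
        return Result(False, current, current)   # 5. report unchanged
    apply(desired)                             # 3. apply only if different
    after = read()                             # 4. verify the change
    if after != desired:
        raise RuntimeError(f"expected {desired!r} after apply, found {after!r}")
    return Result(True, current, after)        # 5. report changed


if __name__ == "__main__":
    # A dict stands in for the real system so the sketch runs anywhere.
    system = {"timezone": "UTC"}
    read = lambda: system["timezone"]
    apply = lambda value: system.update(timezone=value)

    for run in range(1, 4):
        result = ensure_state("Eastern Standard Time", read, apply)
        print(f"run {run}: changed={result.changed}")
```

Only the first run in the demo applies anything; later runs read the state, find it matches, and return without calling apply, which is the behavior changed_when is meant to report.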

Cloud.Ansible for Native Retry

Aria Automation’s Cloud.Ansible resource type handles the connection, authentication, and retry logic that the legacy workflow papered over with sleep timers. The five to eight minutes of blind sleeps are gone. When WinRM isn’t ready yet, Cloud.Ansible retries automatically with exponential backoff. When it becomes ready, the playbook starts immediately rather than waiting for an arbitrary timer.

The result: faster provisioning, more predictable behavior, and no more “is the VM ready yet?” guesswork.
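To make the contrast with a blind sleep concrete, here is a small sketch of condition-based waiting with exponential backoff. It does not show how Cloud.Ansible implements its retries internally; wait_until_ready, port_open, and all the timing defaults are assumptions chosen for illustration.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int = 5986, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds (5986 is WinRM over HTTPS)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_ready(check: Callable[[], bool],
                     max_wait: float = 600.0,
                     base_delay: float = 2.0,
                     max_delay: float = 60.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> float:
    """Poll check() with exponential backoff and return the seconds waited.

    Unlike a fixed sleep, this returns as soon as the condition holds and
    fails loudly if it never does.
    """
    start = clock()
    delay = base_delay
    while True:
        if check():
            return clock() - start
        if clock() - start + delay > max_wait:
            raise TimeoutError(f"condition not met within {max_wait} seconds")
        sleep(delay)
        delay = min(delay * 2, max_delay)  # double the wait, up to a ceiling
```

A caller would use it as wait_until_ready(lambda: port_open("vm-01.example.test")), where the hostname is a placeholder. The sleep and clock parameters exist so the timing logic can be tested without actually waiting.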

The Results

The new role library replaced the legacy workflow’s tangled PowerShell with clean, version-controlled automation.

Sleep timers eliminated. The five-to-eight minutes of blind sleeps from the legacy workflow are gone. Provisioning is faster and more predictable.

Idempotency throughout. Every role checks current state before changing it. Domain join reads Win32_ComputerSystem.PartOfDomain first. Build info registry writes preserve CreatedDate on re-runs. Hardening checks effective policy before reapplying. The result is a playbook the customer can run against a production VM at 2 a.m. without fear.
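The CreatedDate behavior is a good small example of what "safe to re-run" means in practice. The sketch below uses a plain dictionary as a stand-in for the registry key. Apart from CreatedDate, the value names (BuildVersion, ImageSource, RequestId) and the stamp_build_info function are hypothetical, and refreshing the other values on every run is an assumption.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def stamp_build_info(existing: Dict[str, str],
                     build_version: str,
                     image_source: str,
                     request_id: str,
                     now: Optional[datetime] = None) -> Dict[str, str]:
    """Return the registry values to write, keeping any original CreatedDate."""
    now = now or datetime.now(timezone.utc)
    stamped = dict(existing)  # never mutate the caller's view of current state
    stamped.update({
        "BuildVersion": build_version,
        "ImageSource": image_source,
        "RequestId": request_id,
    })
    # Write CreatedDate only when it is absent, so a re-run months later
    # cannot overwrite the date the VM was first built.
    stamped.setdefault("CreatedDate", now.isoformat())
    return stamped
```

The key point is the setdefault call: every other value may be refreshed, but the first-build timestamp is written once and then left alone.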

Credentials in vault. Service account passwords, domain join credentials, KMS server addresses, AD group definitions — all of it lives in vault.yml encrypted with Ansible Vault. Rotation is a single-file edit. No more credential hunting through workflow scripts.

Configuration drift eliminated. Two VMs built the same way are actually configured the same. The roles enforce the desired state at provision time and any time they’re re-run.

Re-runnability for fleet maintenance. When a security baseline changes, the customer can run the relevant role against every Windows VM in the fleet. The role only changes the VMs that need changing. The roles never break what’s already correct.

Lessons Learned

Idempotency is not optional. Every single role checks state before changing it. This isn’t a stylistic preference — it’s the difference between a role that’s safe to re-run and a role that breaks production at 2 a.m. The customer can run the entire playbook against any production VM without fear because the roles don’t make changes when changes aren’t needed.

The enable_winrm_handler.py bridge is the right pattern. Pre-Ansible Windows configuration through VMware Tools Guest Operations is the cleanest way to bootstrap WinRM. The pattern generalizes — anything that needs to happen before Ansible can connect (from network configuration to certificate installation) belongs in the bridge.

Run domain join last, with a reboot. Domain join is the operation most likely to fail in subtle ways, the operation that affects everything that comes after, and the operation that requires a reboot. Running it last means the rest of the playbook completes against a known-good state, and the reboot happens at a predictable moment.

Use the maintained module, not raw commands. microsoft.ad.membership handles the edge cases (already-joined VMs, partial join states, credential refresh, error reporting) that ad-hoc PowerShell scripts get wrong. The module ships in the microsoft.ad Ansible collection, which is maintained in the open and used far more widely than any in-house script. There’s rarely a good reason to write your own domain join code anymore.

Build a “desired state” mental model. Each role’s variables describe what should be true, not what actions should be taken. timezone: "Eastern Standard Time" rather than set_timezone_command: "tzutil /s 'Eastern Standard Time'". The role figures out what actions are needed to reach the desired state. The customer’s operators reason about what the VM should look like, not about what commands to run.
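As a rough illustration of that split between "what should be true" and "what to run", the sketch below takes a declarative description and derives the commands itself. The plan_actions function and the shape of the dictionaries are hypothetical; only the timezone example comes from the text above.

```python
from typing import Dict, List


def plan_actions(desired: Dict[str, str], current: Dict[str, str]) -> List[str]:
    """Translate a desired-state description into the commands needed to reach it."""
    actions: List[str] = []
    wanted_tz = desired.get("timezone")
    # The operator states the outcome; the command is an implementation detail
    # the role owns, and it is emitted only when the VM is out of state.
    if wanted_tz is not None and current.get("timezone") != wanted_tz:
        actions.append(f'tzutil /s "{wanted_tz}"')
    return actions
```

Calling plan_actions({"timezone": "Eastern Standard Time"}, {"timezone": "UTC"}) yields a single tzutil command, while the same call against a VM already in Eastern Standard Time yields an empty list.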

What We’d Do Differently

Move to Win32-OpenSSH sooner. WinRM is mature and works, but Win32-OpenSSH is the future for Windows automation. Once the customer’s templates include OpenSSH, the connection layer becomes simpler and more aligned with Linux automation patterns. The migration path is documented but the customer’s templates haven’t been updated yet — that’s the next round of template hardening.

Build role-level testing earlier. The team tested the playbook end-to-end against real VMs throughout development, but didn’t add per-role testing harnesses (Molecule, Vagrant test instances) until late in the engagement. Testing each role against an isolated test VM before integrating into the full playbook would have caught some edge cases earlier.

Better integration with the inventory generators. The playbook uses static inventories during initial deployment. The dynamic inventory generators (see the inventory story) supply current VM lists for fleet maintenance runs. Tighter integration — the playbook auto-selecting the right generator based on the run type — would have made fleet maintenance smoother.

Getting Started

The role library requires ansible-core 2.15 or later, Aria Automation 8.x with the companion vRO actions package, Windows VMs with VMware Tools installed and reachable from the Ansible controller, an Active Directory environment with a service account for domain joins, and a KMS server reachable from VMs needing activation.

Clone the repository, create a vault.yml with your environment’s credentials, customize the role variables for your hardening baseline and AD configuration, install the enable_winrm_handler.py ABX action in Aria Automation, configure the event subscription on compute.provision.post, and test against a non-production Windows VM before integrating with production blueprints.

Conclusion

The customer’s legacy Windows post-deploy workflow worked but was unmaintainable. Ten focused Ansible roles, built around the desired-state dispatcher pattern, replaced it with automation the customer’s team can read, modify, version-control, and re-run safely.

The pattern generalizes. Whether you’re modernizing a legacy provisioning pipeline or building one from scratch, idempotent role-based automation beats inline PowerShell every time. State-checking before action is non-negotiable. Vault-stored credentials are non-negotiable. Reboot timing matters. Domain join goes last.

The complete role library, ABX bridge, and integration documentation are available on GitHub.


Repository: github.com/noahfarshad/ansible-windows-postdeploy
