Skip to content

Create new role#1

Open
MarcWazo wants to merge 15 commits into
mainfrom
Init_role
Open

Create new role#1
MarcWazo wants to merge 15 commits into
mainfrom
Init_role

Conversation

@MarcWazo

@MarcWazo MarcWazo commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

To facilitate debian13/ansible13 migration, we need to replace our external dependency to ansible-community/consul role.

This new role will replace it and install consul server or agent like the legacy one.
To facilitate migration, we kept majority of variables names.


Note

Medium Risk
New role touches cluster formation, systemd, and host DNS/resolv.conf on the dnsmasq path; mistakes could affect service discovery or resolution, though impact is limited to playbooks that adopt this role.

Overview
Introduces wazo.consul, a full Ansible role meant to replace the external ansible-community/consul dependency for Debian 11–13 / Ansible 8–13 migration. Most legacy variable names are preserved for drop-in migration.

Install & cluster: Consul comes from the HashiCorp APT repo (optional version pin / upgrade), with consul.hcl rendered for server, bootstrap, or client, including optional gossip encryption, static or cloud retry_join, and bootstrap_expect computed from server hosts in inventory all when unset (legacy semantics).

Runtime fixes: A systemd drop-in forces Type=simple on consul.service to avoid single-node Type=notify timeouts from the package unit. The role starts Consul and waits on the HTTP API and leader election.

DNS: At runtime it picks systemd-resolved when that service is running (Consul domain routed via resolved.conf.d); otherwise it installs dnsmasq, forwards the Consul domain, and may rewrite /etc/resolv.conf (including unsafe_writes for container bind mounts). There is no consul_dnsmasq_enable toggle.

Quality gate: Adds pre-commit/ansible-lint tooling, a Jenkins pipeline (tox linters + parallel Molecule on ECR systemd images), and testinfra checks keyed off CONSUL_DNS_RESOLVER for both resolver paths.

Reviewed by Cursor Bugbot for commit 8453f0d. Bugbot is set up for automated code reviews on this repo. Configure here.

@MarcWazo MarcWazo self-assigned this Jun 10, 2026
Comment thread templates/consul.hcl.j2
Comment thread tasks/consul/dnsmasq.yml
Comment thread tasks/consul/install.yml
Comment thread handlers/main.yml Outdated
Comment thread tasks/consul/dnsmasq.yml
Comment thread tasks/consul/dnsmasq.yml
# /etc/resolv.conf is a bind mount inside containers, so the default atomic
# rename fails with EBUSY. Write in place instead. (The symlink case above
# has already been replaced with a regular file.)
unsafe_writes: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Resolv repointed before dnsmasq

Medium Severity

On the dnsmasq path, /etc/resolv.conf is switched to nameserver 127.0.0.1 immediately after templating, while Restart dnsmasq only runs at the later flush_handlers in consul.yml. Until then, general DNS can fail if the package did not already leave dnsmasq listening on localhost.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit b16da70. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We install dnsmasq from apt, so we can consider it is running and listening from the start.

@MarcWazo MarcWazo requested a review from ahogui12 June 15, 2026 17:41
Comment thread templates/consul.hcl.j2
@@ -0,0 +1,45 @@
# !!! MANAGED BY ANSIBLE !!!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we lost encryption from former role
in the template

{% if consul_encrypt_enable | bool %}
    "encrypt": "{{ consul_raw_key }}",
    "encrypt_verify_incoming": {{ consul_encrypt_verify_incoming | bool | to_json }},
    "encrypt_verify_outgoing": {{ consul_encrypt_verify_outgoing | bool | to_json }},
    {% endif %}

And in the config in dev/staging/prod

    "encrypt": "REDACTED",
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I fixed this and was able to test it on dev.
The conf file now contains all parts we can see in our legacy instances.

Comment thread templates/consul.hcl.j2
@@ -0,0 +1,45 @@
# !!! MANAGED BY ANSIBLE !!!
datacenter = "{{ consul_datacenter }}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as before for the performance configuration

    "performance": {
        "leave_drain_time": "5s",
        "raft_multiplier": 1,
        "rpc_hold_timeout": "7s"
    },

The raft multiplier value is a setup and not a consul default (would be 5). Maybe it is worth testing it before dropping

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, same response as #1 (comment)

Comment thread templates/consul.hcl.j2
Comment thread defaults/main.yml
Comment thread tasks/consul.yml Outdated
Comment thread defaults/main.yml
consul_enable_local_script_checks: false

# Gossip encryption (supply consul_raw_key via vault to enable)
consul_encrypt_enable: false

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old default was true. I think we will have a compat issue if we keep this default
consul_encrypt_enable: "{{ lookup('env', 'CONSUL_ENCRYPT_ENABLE') | default(true, true) }}"

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, when switching to this new role, we will need to set consul_encrypt_enable to true in the inventories/playbooks to avoid issue.
This is weird the default value was true as it breaks consul if none of the encryption related conf are setup.

Now at least, by default encryption is not set and consul is not breaking.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8453f0d. Configure here.

Comment thread tasks/consul/dnsmasq.yml
# /etc/resolv.conf is a bind mount inside containers, so the default atomic
# rename fails with EBUSY. Write in place instead. (The symlink case above
# has already been replaced with a regular file.)
unsafe_writes: true

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dnsmasq not ensured running

Medium Severity

On the dnsmasq resolver path, the role installs and templates dnsmasq and only restarts it via a handler when the config template changes. Unlike consul, there is no task that ensures dnsmasq is started and enabled on every run, so a stopped service stays down after idempotent applies and Consul DNS on the host can remain broken while the play still succeeds.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8453f0d. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We install dnsmasq via apt which ensure dnsmasq service is started initially and enabled.

Comment thread tasks/consul.yml
register: consul_leader
until: consul_leader.content | length > 2
retries: 30
delay: 2

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leader wait blocks partial bootstrap

Medium Severity

Every Consul node waits until the HTTP leader endpoint returns a non-empty leader, with a bounded retry loop. When bootstrap_expect exceeds the number of servers already running—common if servers are applied serially or in waves—the cluster will not elect a leader in time and the role fails even though later passes would succeed.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8453f0d. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will only fail if we bootstrap a cluster (no actual leader) with a bootstrap_expect > number of nodes actually starting.
The only workaround here is to remove leader election check from role as it can happen after node provisioning.

I'd prefer letting it this way, the role will fail if we bootstrap cluster with a expect=3 and serial=1 for example but will ensure each node is actually joining a working cluster in all other cases.

In the case of bootstraping a cluster, we just could play with the expect value to make it work.

@MarcWazo MarcWazo requested a review from ahogui12 June 18, 2026 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants