17 changes: 17 additions & 0 deletions README.md
@@ -224,6 +224,23 @@ For example, to run a dashboard for three nodes:
talm dashboard -f node1.yaml -f node2.yaml -f node3.yaml
```

### `talm reset` — META-preserving default

`talm reset` diverges from upstream `talosctl reset` on one default. Upstream defaults to `--wipe-mode=all`, which wipes the Talos META partition along with STATE and EPHEMERAL — the node cannot self-recover and comes up in maintenance mode requiring a full re-apply. Talm instead populates `--system-labels-to-wipe=STATE,EPHEMERAL` when neither `--wipe-mode` nor `--system-labels-to-wipe` was passed, which preserves META so the node rejoins the cluster from its META-stored bootstrap config on the next boot.

Explicit operator intent is honored unchanged:

```bash
# talm default — preserves META, node self-recovers.
talm reset --reboot --graceful=true --nodes $NODE --endpoints $OTHER_NODE

# Explicit destructive opt-in (upstream's default).
talm reset --wipe-mode=all --reboot --nodes $NODE --endpoints $OTHER_NODE

# Operator-specified narrower scope is honored byte-for-byte.
talm reset --system-labels-to-wipe=STATE --reboot --nodes $NODE --endpoints $OTHER_NODE
```

## Customization

You're free to edit template files in `./templates` directory.
64 changes: 59 additions & 5 deletions docs/manual-test-plan.md
@@ -285,18 +285,72 @@ Expected: delete succeeds; read returns `NotFound`.

Expected: refuses with `etcd data directory is not empty`.

### H2. Reset a control-plane node (graceful + reboot, system labels only)
### H2. Reset a control-plane node (talm safe default — preserves META)

⚠️ Destructive. Run only against a cluster you can afford to lose one node from. Requires `--system-labels-to-wipe=STATE` (and optionally `EPHEMERAL`) for a recoverable reset; `--wipe-mode=all` (the default) removes META too, which makes self-recovery impossible.
⚠️ Destructive. Run only against a cluster you can afford to lose one node from. The talm default populates `--system-labels-to-wipe=STATE,EPHEMERAL` automatically when neither `--wipe-mode` nor `--system-labels-to-wipe` was passed, so META survives and the node self-recovers on the next boot. Upstream `talosctl reset` defaults to `--wipe-mode=all`, which destroys META; that path is exposed in talm as the explicit `--wipe-mode=all` opt-in (see H2a).

```bash
/tmp/talm-safety reset --graceful=true --reboot \
--system-labels-to-wipe=STATE \
--system-labels-to-wipe=EPHEMERAL \
--nodes $NODE --endpoints $OTHER_NODE
```

Expected: etcd member departs (`talm etcd members` from another node shows 2 members), node reboots, `post check passed`.
Expected: etcd member departs (`talm etcd members` from another node shows 2 members), node reboots, `post check passed`. After the reboot the node returns to etcd as a fresh member with placeholder hostname `talos-XXXXX` within ~90s; the next `talm apply` refreshes the hostname.

Regression anchors:

- `talm reset --help` must show the talm-divergence note on both `--wipe-mode` ("preserves META") and `--system-labels-to-wipe` ("STATE,EPHEMERAL"). Without the help text, the default flip is invisible to operators reading the CLI surface.
- The reset request must succeed without the operator having to type `--system-labels-to-wipe` manually. If the node comes back in maintenance mode requiring fresh apply, the wrapper did not apply the safe default and META was wiped — that is a regression.

### H2a. Reset with explicit destructive opt-in (`--wipe-mode=all` or `--wipe-mode=system-disk`)

⚠️ Highly destructive — META wiped, node CANNOT self-recover and requires fresh apply against `--insecure` maintenance mode. Run only on a cluster where the multi-day re-bootstrap cost is acceptable.

Two opt-out values land in the same destructive server-side branch: `--wipe-mode=all` (full system disk + user disks) and `--wipe-mode=system-disk` (system disk only). Both bypass the safety override and wipe META. `--wipe-mode=user-disks` is safe — it doesn't touch system partitions.

```bash
# Equivalent destructive paths:
/tmp/talm-safety reset --wipe-mode=all --graceful=true --reboot \
--nodes $NODE --endpoints $OTHER_NODE
/tmp/talm-safety reset --wipe-mode=system-disk --graceful=true --reboot \
--nodes $NODE --endpoints $OTHER_NODE
```

Expected: same as H2 up to the reboot; after the reboot the node comes up in maintenance mode (no machine config). `talm get hostnames -i --nodes $NODE` succeeds via the insecure path but the node is not yet a cluster member.

Regression anchor: when EITHER of these commands is run the wrapper MUST NOT silently add `--system-labels-to-wipe=STATE,EPHEMERAL` (which would override the operator's stated intent and quietly turn a destructive reset into a selective one). Verify via `talm reset --wipe-mode=all --help` or by observing that the reset request actually destroys META.

### H2b. Reset with operator-specified narrower scope (`--system-labels-to-wipe=STATE` only)

```bash
/tmp/talm-safety reset --system-labels-to-wipe=STATE --graceful=true --reboot \
--nodes $NODE --endpoints $OTHER_NODE
```

Expected: only STATE wiped, EPHEMERAL kept (containerd image cache survives the reset), node returns. The operator's explicit narrower list must be honored byte-for-byte; the wrapper MUST NOT silently expand to `STATE,EPHEMERAL`.

Regression anchor: after the node returns, `talm dmesg --nodes $NODE | grep -i ephemeral` should show no fresh-format markers for the EPHEMERAL partition. If the wrapper silently expanded the operator's list, EPHEMERAL would have been wiped too.

### H2c. Reset with `--graceful=false` (ungraceful, preserves safe default)

```bash
/tmp/talm-safety reset --graceful=false --reboot \
--nodes $NODE --endpoints $OTHER_NODE
```

Expected: ungraceful reset (no etcd leave), but the wrapper's safe default still fires (STATE+EPHEMERAL labels populated by talm because no wipe flag was passed). Node reboots; etcd cluster recovers via remaining quorum; rejoining member appears within ~120s.

Regression anchor: the default-flip MUST be independent of `--graceful`. A change that conditions the flip on `--graceful=true` is a regression — operators on ungraceful reset are the ones who most need the safe default.

### H2d. Reset triggered from modeline-bearing project root

```bash
cd $PROJECT # directory with nodes/$NODE.yaml carrying the modeline
/tmp/talm-safety reset --reboot --graceful=true
```

Expected: same outcome as H2 — modeline supplies `--nodes` / `--endpoints` from `nodes/$NODE.yaml`, no wipe flags on the CLI, wrapper applies the safe default, META preserved.

Regression anchor: the default-flip is gated only on `!Changed("wipe-mode") && !Changed("system-labels-to-wipe")`, that is, on neither wipe flag having been passed; it is independent of where in the PreRunE chain it runs. A refactor that reorders the dispatch chain must keep this path working (modeline-supplied `--nodes` / `--endpoints` plus no operator-supplied wipe flags must still produce the safe default).

### H3. Etcd quorum after reset

14 changes: 0 additions & 14 deletions pkg/commands/init.go
@@ -31,7 +31,6 @@ import (
	"github.com/cozystack/talm/pkg/generated"
	"github.com/cozystack/talm/pkg/secureperm"
	"github.com/spf13/cobra"
	"golang.org/x/term"
	"gopkg.in/yaml.v3"

	"github.com/siderolabs/talos/cmd/talosctl/cmd/mgmt/gen"
@@ -956,19 +955,6 @@ const (
	overwritePolicyNonInteractive
)

// stdinIsTTY reports whether process stdin is connected to a
// terminal. Var-typed so the unit tests can swap a fake.
//
// term.IsTerminal correctly returns false for /dev/null and pipes —
// the naive os.Stdin.Stat()&ModeCharDevice check accepted /dev/null
// (it's a character device) and led the previous version to prompt
// in cron / scripted shells, EOFing the read.
//
//nolint:gochecknoglobals // injection seam for testability; matches stdinReader below.
var stdinIsTTY = func() bool {
	return term.IsTerminal(int(os.Stdin.Fd()))
}

// stdinReader is the io.Reader the interactive prompt reads from.
// Var-typed so unit tests can supply canned input.
//
98 changes: 98 additions & 0 deletions pkg/commands/reset_handler.go
@@ -0,0 +1,98 @@
// Copyright Cozystack Authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package commands

import (
	"github.com/cockroachdb/errors"
	"github.com/spf13/cobra"
)

const (
	resetCmdName = "reset"

	// resetSafeDefaultLabels are the system partition labels talm's
	// wrapper populates into `--system-labels-to-wipe` when an
	// operator runs `talm reset` without explicitly choosing a wipe
	// scope. Wiping STATE clears node-specific persistent state
	// (machine config, identity); wiping EPHEMERAL clears the
	// container/runtime layer. Leaving META untouched is the key
	// property: META carries the bootstrap config Talos uses to
	// rejoin the cluster on the next boot, so a reset with only
	// these two labels self-recovers without operator intervention.
	resetSafeDefaultLabels = "STATE,EPHEMERAL"
)

// wrapResetCommand flips talm's `talm reset` default away from
// upstream's destructive `--wipe-mode=all` toward the META-preserving
// selective-wipe recipe. The flip only fires when the operator passed
// neither `--wipe-mode` nor `--system-labels-to-wipe` on the CLI:
//
// - No wipe flags: PreRunE pre-populates
// `--system-labels-to-wipe=STATE,EPHEMERAL`. The server-side
// reset codepath, when SystemPartitionsToWipe is non-empty,
// takes the label-driven path and "keep[s] other partitions
// intact" per upstream's `--system-labels-to-wipe` flag doc in
// `cmd/talosctl/cmd/talos/reset.go`. META survives; on the next
// boot Talos rejoins the cluster from META without operator
// intervention.
// - Operator passed `--wipe-mode=...`: the safety override is
// skipped. `--wipe-mode=all` remains the explicit destructive
// opt-in; `--wipe-mode=system-disk` / `--wipe-mode=user-disks`
// also bypass the flip on the assumption that the operator
// stated wipe-scope intent.
// - Operator passed `--system-labels-to-wipe=...`: the operator's
// list is honored byte-for-byte. The wrapper does not silently
// expand a narrower selection (e.g. STATE alone) to the safe
// default — operators choosing a narrower scope are doing so
// deliberately.
//
// Help-text overrides on both flags spell out the divergence so
// `talm reset --help` carries the operator-facing story.
//
// Chain order: capture the wrapTalosCommand-installed PreRunE first,
// run the flip BEFORE chaining. Order is not load-bearing here
// (modeline does not touch wipe flags), but matching the shape of
// the crashdump / rotate-ca wrappers keeps the dispatch site
// readable.
func wrapResetCommand(wrappedCmd *cobra.Command) {
	if wipeFlag := wrappedCmd.Flag("wipe-mode"); wipeFlag != nil {
		wipeFlag.Usage = "disk reset mode (talm default: --system-labels-to-wipe=" + resetSafeDefaultLabels +
			" preserves META so the node self-recovers; pass --wipe-mode=all or --wipe-mode=system-disk explicitly for upstream's destructive behaviour — both destroy META)"
	}

	if labelsFlag := wrappedCmd.Flag("system-labels-to-wipe"); labelsFlag != nil {
		labelsFlag.Usage = "wipe selected system disk partitions by label, keeping others intact (talm default when no wipe flag is set: " +
			resetSafeDefaultLabels + ")"
	}

	originalPreRunE := wrappedCmd.PreRunE

	wrappedCmd.PreRunE = func(cmd *cobra.Command, args []string) error {
		if !cmd.Flags().Changed("wipe-mode") && !cmd.Flags().Changed("system-labels-to-wipe") {
			if err := cmd.Flags().Set("system-labels-to-wipe", resetSafeDefaultLabels); err != nil {
				return errors.WithHint(
					errors.Wrap(err, "applying talm safe-default wipe labels"),
					"this should not happen at runtime; if it does, fall back to passing --system-labels-to-wipe=STATE,EPHEMERAL explicitly",
				)
			}
		}

		if originalPreRunE != nil {
			return originalPreRunE(cmd, args)
		}

		return nil
	}
}
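
The three cases from the `wrapResetCommand` doc comment (no wipe flags, explicit `--wipe-mode`, explicit `--system-labels-to-wipe`) can be exercised without cobra. The following is a minimal sketch under stated assumptions, not the project's test suite: it uses Go's standard `flag` package in place of pflag, and `changed` and `effectiveWipe` are hypothetical helpers. `flag.FlagSet.Visit` only visits flags that were set, which approximates pflag's `Changed`. The `"all"` default for `wipe-mode` follows the upstream default described in the README hunk.

```go
package main

import (
	"flag"
	"fmt"
)

// safeDefaultLabels mirrors resetSafeDefaultLabels from the diff.
const safeDefaultLabels = "STATE,EPHEMERAL"

// changed reports whether the named flag was set explicitly on the
// command line. Hypothetical helper approximating pflag's Changed.
func changed(fs *flag.FlagSet, name string) bool {
	found := false
	fs.Visit(func(f *flag.Flag) {
		if f.Name == name {
			found = true
		}
	})
	return found
}

// effectiveWipe parses args and returns the wipe mode and label list
// in effect after the safe default has been applied. Hypothetical helper.
func effectiveWipe(args []string) (mode, labels string, err error) {
	fs := flag.NewFlagSet("reset", flag.ContinueOnError)
	fs.String("wipe-mode", "all", "disk reset mode")
	fs.String("system-labels-to-wipe", "", "system partition labels to wipe")

	if err = fs.Parse(args); err != nil {
		return "", "", err
	}

	// Same gate as the wrapper: fire only when neither flag was passed.
	if !changed(fs, "wipe-mode") && !changed(fs, "system-labels-to-wipe") {
		if err = fs.Set("system-labels-to-wipe", safeDefaultLabels); err != nil {
			return "", "", err
		}
	}

	mode = fs.Lookup("wipe-mode").Value.String()
	labels = fs.Lookup("system-labels-to-wipe").Value.String()

	return mode, labels, nil
}

func main() {
	cases := [][]string{
		{},
		{"--wipe-mode=all"},
		{"--system-labels-to-wipe=STATE"},
	}
	for _, args := range cases {
		mode, labels, err := effectiveWipe(args)
		fmt.Printf("%v -> mode=%q labels=%q err=%v\n", args, mode, labels, err)
	}
}
```

In the no-flag case the mode keeps its default while the label list becomes non-empty; according to the doc comment above, a non-empty label list is what steers the server-side reset onto the label-driven path.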