Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[APP-7364] [APP-7366] [RSDK-9684] [APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. #56

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Otterverse
Copy link
Member

Quick PR for easy fixes.

Needs testing for the linked Jira issues

  1. Install
  2. Upgrade/downgrade viam-agent and ensure a "disabled" systemd service remains disabled after upgrade.
  3. Confirm captive portal works for new wifi (mainly that browser doesn't throw errors) and fixes capitalization issue.
  4. Corrupt /etc/viam.json, ensure errors are clear and file is renamed
  5. Make sure it runs okay for a while (15+ minutes) with the new, shorter timeout.
  6. Kill viam-server (or otherwise make it fail) and confirm it logs and restarts quickly.

@Otterverse Otterverse requested a review from maxhorowitz January 9, 2025 20:25
@Otterverse Otterverse changed the title [ APP-7364][APP-7366][RSDK-9684][APP-7154]Add more logging, reduce monitoring loop time, misc small fixes. [ APP-7364][APP-7366][RSDK-9684][APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. Jan 9, 2025
@Otterverse Otterverse changed the title [ APP-7364][APP-7366][RSDK-9684][APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. [APP-7364] [APP-7366] [RSDK-9684] [APP-7154] Add more logging, reduce monitoring loop time, misc small fixes. Jan 9, 2025
Copy link
Member

@maxhorowitz maxhorowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code scan ✅
Have some questions which are mainly aimed at me getting up to speed! I will manually test these fixes later today.

@@ -37,7 +37,7 @@ <h2>Smart Machine Setup</h2>
<div class="form-group">
<label for="network">Network</label>
{{if eq (len .VisibleSSIDs) 0}}
<input type="text" name="ssid" placeholder="Enter Wifi SSID" id="network" required>
<input type="text" name="ssid" placeholder="Enter Wifi SSID" id="network" required autocorrect="off" autocapitalize="off">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For context these are "on" if omitted and that was not allowing you to connect with a SSID that was lowercased? Nice find if so 😄

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Micheal Lee found/suggested this fix. But yeah, I guess some phones/browsers see an empty text field, and if you type a password that's a normal "word" it auto-capitalizes it, which isn't ideal for passwords.

We don't use a "password" type field here, because I wanted this to be visible before submitting. Otherwise it's way too easy to accidentally typo and then it takes several minutes for the device to try out the thing, fail, and restart a hotspot.

if err := os.MkdirAll(filepath.Dir(serviceFilePath), 0o755); err != nil {
return errw.Wrapf(err, "creating directory %s", filepath.Dir(serviceFilePath))
}
// use this later to avoid re-enabling an existing agent service a user might have disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Want to confirm my understanding...there are users who have Viam agent installed but who don't want it to automatically start up on boot, so they've "disabled" auto start. When they upgrade their Viam agent, the configuration file where they've "disabled" the auto start gets overwritten such that startup on boot becomes true...and that's what you're fixing in this PR.

Copied the following snippet from the ticket - is this the final product (i.e. the user now has to manually restart Viam agent after an upgrade)?

Jan 07 16:11:09 pluto viam-agent[911]: 2025-01-07T22:11:09.496Z        INFO        viam-agent        agent/subsystem.go:355        viam-agent updated from 0.11.0 to 0.12.0
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:09.515Z        INFO        viam-agent        viamagent/viamagent.go:131        writing systemd service file to /usr/local/lib/systemd/system/viam-agent.service
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:09.515Z        INFO        viam-agent        viamagent/viamagent.go:146        enabling systemd viam-agent service
Jan 07 16:11:10 pluto viam-agent[911]: 2025-01-07T22:11:10.091Z        INFO        viam-agent        viamagent/viamagent.go:168        Install complete. Please (re)start the service with 'systemctl restart viam-agent' when ready.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if someone installed viam-agent, then runs systemctl disable viam-agent it won't start at boot. The problem is that, previously, when we did an upgrade, this same code got called, and it ALWAYS enabled the service again. Now it should only enable it on first/new install.

The snippet is the old behavior, where it shows "enabling systemd viam-agent service" was the problem. That should only happen on new installs now.

@@ -24,7 +24,7 @@ import (
)

const (
minimalCheckInterval = time.Second * 60
minimalCheckInterval = time.Second * 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the monitoring loop checking for on a super high level? Is this monitoring for FTDC data in the viam-server? What other things are "monitored" on a regular basis via the agent ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is how often it fetches a new config from the cloud. The cloud itself can say "check again after X time") but that's currently not implemented, so it falls to this.

The main loop basically: check for new config > apply changes/updates (if there) > start subsystems that should be started > check health of subsystems > repeat

So the check interval determins (roughly) the overall timing. At 5 seconds, it may have trouble keeping up if there are slow responses or other work to do, but that should be fine.

}
// use this later to avoid re-enabling an existing agent service a user might have disabled
_, err = os.Stat(serviceFilePath)
newInstall := err != nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth checking for a file does not exist error?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This covers edge cases like file corruption too. If we can't stat, then it's best to treat it as a new install, IMHO. But a good thought! In many other cases it'd be worth differentiating.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants