Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Icinga2 looses state history and notifications during restart #10179

Open
w1ll-i-code opened this issue Sep 30, 2024 · 17 comments · May be fixed by #10191
Open

Icinga2 looses state history and notifications during restart #10179

w1ll-i-code opened this issue Sep 30, 2024 · 17 comments · May be fixed by #10191
Labels
bug Something isn't working ref/IP

Comments

@w1ll-i-code
Copy link

Describe the bug

If an object has a state change during an icinga2 restart (e.g. during a deploy), it is sometimes not written to the database and does not trigger the notifications.

To Reproduce

  1. Import the basket with icingacli director basket restore < icinga-lost-statechange-basket.json
    1. icinga-lost-statechange-basket.json
    2. This basket contains:
      1. A check command that randomly goes into warning to generate the state changes.
      2. A service template that runs the check command
      3. A service group to quickly create lots of services, making the occurrence more likely.
      4. A host template as the target for the apply rule of the service group.
  2. Create a few hosts:
    1. for i in $(seq --equal-width 1 100); do
          icingacli director host create "host-icinga-lost-statehistory-${i}" --imports 'ht-icinga-lost-statechange'
       done
  3. Deploy the config
    1. icingacli director config deploy

With that configuration running, deploy icinga2 a few times: icingacli director config deploy --force --wait

Soon there will be state changes in the state history that should not be possible:
Screenshot from 2024-09-30 14-38-26

In this case, the service went from hard warning into soft warning. The soft warning history says that the last state was Ok, but that was never written into the history.

To find lost state histories quicker I used the following script:
dropped_state_query.tar.gz

It needs as parameters the endpoint, user and password. If the db is postgres, it can be run with the --postgres flag.

Expected behavior

I expect icinga2 to not loose state changes like that.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version):
icinga2 - The Icinga 2 network monitoring daemon (version: r2.14.2-1)

Copyright (c) 2012-2024 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <https://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Build information:
  Compiler: GNU 8.5.0
  Build host: staging5591master
  OpenSSL version: OpenSSL 1.1.1k  FIPS 25 Mar 2021
  • Operating System and version:
System information:
  Platform: Red Hat Enterprise Linux
  Platform version: 8.10 (Ootpa)
  Kernel: Linux
  Kernel version: 4.18.0-553.el8_10.x86_64
  Architecture: x86_64
  • Enabled features (icinga2 feature list):
Disabled features: command compatlog debuglog elasticsearch gelf graphite icingadb influxdb2 journald opentsdb perfdata syslog
Enabled features: api checker ido-mysql influxdb livestatus mainlog notification
  • Icinga Web 2 version and modules (System - About):
Icinga Web 2  NetEye release 4.39 (Traditional bock)

PHP Version   7.4.33
MODULE                  VERSION
analytics               1.58.0
auditlog                1.15.1
cube                    1.1.0
customproblemview       0.0.0
director                1.11.1
geomap                  1.22.0
grafana                 1.4.2
neteye                  1.155.0-1
host2servicedetailview  1.4.0
idoreports              0.10.1
incubator               0.22.0
ipl                     v0.5.0
lampo                   1.2.2
leafletjs               1.9.4
loginaudit              0.0.1
mapDatatype             0.1.0
monitoring              2.10.5
monitoringview          1.7.0
nagvis                  1.1.1
pdfexport               0.10.2
reactbundle             0.9.0
reporting               1.0.0
shutdownmanager         0.0.0
srwebbackend            0.0.0
tornado                 2.19.2
update                  1.44.1-2
  • Config validation (icinga2 daemon -C):
[2024-09-30 14:45:34 +0200] information/cli: Icinga application loader (version: r2.14.2-1)
[2024-09-30 14:45:34 +0200] information/cli: Loading configuration file(s).
[2024-09-30 14:45:34 +0200] information/ConfigItem: Committing config item(s).
[2024-09-30 14:45:34 +0200] information/ApiListener: My API identity: localhost.localdomain
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 NotificationComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 LivestatusListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 InfluxdbWriter.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IdoMysqlConnection.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 CheckerComponent.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ServiceGroup.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 902 Services.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 3 Zones.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 2 NotificationCommands.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 FileLogger.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 IcingaApplication.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 101 Hosts.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 Endpoint.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 6 ApiUsers.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 1 ApiListener.
[2024-09-30 14:45:34 +0200] information/ConfigItem: Instantiated 251 CheckCommands.
[2024-09-30 14:45:34 +0200] information/ScriptGlobal: Dumping variables to file '/neteye/shared/icinga2/data/cache/icinga2/icinga2.vars'
[2024-09-30 14:45:34 +0200] information/cli: Finished validating the configuration file(s).

Additional context

I could observe the loss of notifications in production, have however not yet reproduced that behavior locally. I suspect however that the two behavior are linked.

We could also observe the same behavior when creating objects over the icinga2 api and then immediately sending a check-result. Once again, I have not replicated this locally yet, but I suspect the problem is the same in all these cases.

@w1ll-i-code
Copy link
Author

On vanilla icinga2 I can only reproduce it on the rhel8 package:

image

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Oct 2, 2024

Here is everything you will need to reproduce the same errors locally:

  1. Download the tar archive and unpack it with tar -xaf icinga2-bugreport-reproduce.tar.gz
  2. In the resulting directory you will find two sub-directories:
    1. dropped_state_query contains a script in rust to locate the lost events more easily
    2. reproduce_local contains a docker-compose to setup the test environment-
  3. Fix the credentials in icinga2-bugreport-reproduce/reproduce_local/icinga2-init/rhel8.Dockerfile
    1. The username + password for the icinga2 rhel8 repo
    2. the orgid + activationkey for the redhat subscription
  4. To run the test environment run docker-compose up --build in reproduce_local
    1. The docker-compose is a bit flaky, so if a service fails, it might need to be restarted.
  5. After the test environment is running, run:
    docker exec reproduce_local_icingaweb_1 bash -c "while true; do icingacli director config deploy --force ; sleep 30; done"
  6. In the directory dropped_state_query run
    cargo run -- --host localhost --user icingadb --password icingadb --database icingadb
    1. You might have to install cargo first. For that follow the instructions on https://rustup.rs/

Here is the archive with all needed data:
icinga2-bugreport-reproduce.tar.gz

@w1ll-i-code
Copy link
Author

I could also reproduce the same problem on both Debian and RHEL 9, but it was orders of magnitude less likely to happen there.

@w1ll-i-code
Copy link
Author

I have now also reproduced the same issue with icingadb:

How to Reproduce:

  1. Launch an icinga2 environment from the docker-compose playground
    git clone https://github.com/lippserd/docker-compose-icinga
    cd docker-compose-icinga
    docker-compose up
  2. Copy the icinga director basket into the directory
    icinga-lost-statechange-basket.json
  3. Load the configuration from the basket:
    docker exec docker-compose-icinga_director_1 icingacli director basket restore < icinga-lost-statechange-basket.json
  4. Create a lot of hosts (and with it services from the service set in the basket):
    docker exec docker-compose-icinga_director_1 bash -c 'for i in $(seq -f "%05g" 0 999); do
         icingacli director host create "host-icinga-lost-statehistory-${i}" --imports "ht-icinga-lost-statechange"
       done'
  5. Deploy the new director config:
    docker exec docker-compose-icinga_director_1 icingacli director config deploy
  6. Deploy the new config every 5 minutes to trigger the bug:
    docker exec docker-compose-icinga_director_1 bash -c 'while true; do sleep 300; icingacli director config deploy --force; done'

From my local tests on a VM with 4 Cores and 8GB of RAM, the error should be observable in around 10 Minutes / two deploys.
To locate the errors automatically from the db, I have attached the rust program I used to analyze the history tables.
dropped_state_query.tar.gz

To run that, download the dropped_state_query.tar.gz file and run:

tar -xaf dropped_state_query.tar.gz
cd dropped_state_query
cargo run --release -- --host 127.0.0.1 --user icingadb --password icingadb --icingadb | tee analyzed-history.log

Hint: Rust can be installed easily from https://rustup.rs/

Once the rust program has found some missing state_history and/or notification, you can verify that by looking at the service history.

dropped_state_change_and_notification

@lippserd
Copy link
Member

Hi all,

Thank you for all the details. I am trying to reproduce the scenario, but so far without success. When was the screenshot taken? Immediately after executing the queries to check if something is missing in the database? As there is a high chance that Icinga DB has not inserted everything yet.

Also, missing entries in the database do not necessarily mean that Icinga has not sent a notification. That should rather be verified using custom check and notification plugins.

Best regards,
Eric

@w1ll-i-code
Copy link
Author

w1ll-i-code commented Oct 16, 2024

Hello Eric.

I have run the tests over the weekend. Even after several days they do not appear in icingadb. This is of course obvious to you, as I posted the screenshot on the 14th, while the state changes happened on the 10th. And as you know, if it were a few minutes ago, it would not have shown the date, but rather the delta time since then.

As for the notifications, we became aware of the problem, because one of our services went into critical without sending any notifications. That does normally work and we have tried the configuration to make sure it works. It was on pure coincident that we noticed that, which lead us to investigate.

@w1ll-i-code w1ll-i-code linked a pull request Oct 17, 2024 that will close this issue
@w1ll-i-code
Copy link
Author

@lippserd I have opened a PR to fix the issue.

@w1ll-i-code
Copy link
Author

@lippserd Have you seen my PR? I would really appreciate some feedback.

@Al2Klimov Al2Klimov added the bug Something isn't working label Oct 22, 2024
@yhabteab
Copy link
Member

yhabteab commented Nov 13, 2024

The discussion in #10191 was rapidly superseded by the other comments, which is why I am attempting to clarify it her. I would like to begin by emphasising that I do really appreciate the analyses and PR you have presented and have no intention of discrediting them in any way. However, your script relies heavily on data from the database. I am not saying that the data from the database is entirely inaccurate, but there are other factors at play like Redis, Icinga DB etc. Thus, you can't trace a bug in Icinga 2 solely based on the database info. Though, while writing down this comment, I was able to successfully trigger some of the described behaviours with just a few hosts.

But first, I'll start by outlining the current deactivation order (without your PR) of the config objects. Initially, the checker is deactivated before anything else, and typically, during a normal circumstance, there won't be running checks after scheduler stops its thread.

std::unique_lock<std::mutex> lock(m_Mutex);
m_Stopped = true;
m_CV.notify_all();

if (m_Stopped)
break;

Afterwards, it will not schedule any new checks, but it will not stop already scheduled checks either. As a result, you may still observe running checks even after deactivating the checker. However, such running checks left overs should not cause a state change as once the Host or Service is also deactivated, no check result is processed (should not are the key words here! It is challenging, but it is possible to trigger state change events when the timings are perfectly matched). As illustrated in the example below, even if the check plugin has reported a problem state, most of the check results are nevertheless discarded because the corresponding hosts have already been deactivated.

if (!IsActive())
return Result::CheckableInactive;

Icinga2.conf
include <itl>

object CheckerComponent "checker" {}
object NotificationComponent "notification" {}

object IcingaDB "icingadb" {
  host = "127.0.0.1"
  port = 6381
}

object NotificationCommand "send" {
	command = ["true"]
}

apply Notification "recover" to Host {
	command = "send"
	users = ["icingaadmin"]
	types = [ Recovery ]
	assign where true
}

apply Notification "problem" to Host {
	command = "send"
	users = ["icingaadmin"]
	types = [ Problem ]

	assign where true
}

object User "icingaadmin" {
    types = [ Problem, Recovery ]
}

template Host "default" default {
	check_command = "dummy"
	max_check_attempts = 1
	check_interval = 1s

	var that = this
	vars.dummy_state = () use (that) => {
	    sleep(3s)
            if (that.state != 0.0) {
                log("Host recovered")
                return 0
            } else {
                log("Host run into or is in problem state")
                return 2
            }
    }
	vars.dummy_text = "I'm just testing something"
}

object Host "test" {}
object Host "test1" {}
object Host "test2" {}
object Host "test3" {}
object Host "test4" {}
Local Changes
diff --git a/lib/base/configobject.cpp b/lib/base/configobject.cpp
index 4317771d1..7fee0cc71 100644
--- a/lib/base/configobject.cpp
+++ b/lib/base/configobject.cpp
@@ -599,7 +599,7 @@ void ConfigObject::StopObjects()
 
                for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
 #ifdef I2_DEBUG
-                       Log(LogDebug, "ConfigObject")
+                       Log(LogInformation, "ConfigObject")
                                << "Deactivate() called for config object '" << object->GetName() << "' with type '" << type->GetName() << "'.";
 #endif /* I2_DEBUG */
                        object->Deactivate();
diff --git a/lib/checker/checkercomponent.cpp b/lib/checker/checkercomponent.cpp
index d92101f4e..37a7d3699 100644
--- a/lib/checker/checkercomponent.cpp
+++ b/lib/checker/checkercomponent.cpp
@@ -104,9 +104,12 @@ void CheckerComponent::CheckThreadProc()
                while (idx.begin() == idx.end() && !m_Stopped)
                        m_CV.wait(lock);
 
-               if (m_Stopped)
+               if (m_Stopped) {
+                       Log(LogInformation, "Check Scheduler") << "Scheduler stopped";
                        break;
+               }
 
+               Log(LogInformation, "Check Scheduler") << "Scheduling a check";
                auto it = idx.begin();
                CheckableScheduleInfo csi = *it;
 
diff --git a/lib/icinga/checkable-check.cpp b/lib/icinga/checkable-check.cpp
index 6e3b8764b..509a98881 100644
--- a/lib/icinga/checkable-check.cpp
+++ b/lib/icinga/checkable-check.cpp
@@ -148,6 +148,8 @@ Checkable::ProcessingResult Checkable::ProcessCheckResult(const CheckResult::Ptr
 
        }
 
+       Log(LogInformation, "Checkable") << "ProcessCheckResult IsActive - " << IsActive() << " state - " << cr->GetState() << " output - " << cr->GetOutput();
+
        if (!IsActive())
                return Result::CheckableInactive;
 
diff --git a/lib/icingadb/icingadb-objects.cpp b/lib/icingadb/icingadb-objects.cpp
index 920251969..102e3f3e6 100644
--- a/lib/icingadb/icingadb-objects.cpp
+++ b/lib/icingadb/icingadb-objects.cpp
@@ -34,6 +34,7 @@
 #include <chrono>
 #include <cmath>
 #include <cstdint>
+#include <iostream>
 #include <iterator>
 #include <map>
 #include <memory>
@@ -116,6 +117,7 @@ void IcingaDB::ConfigStaticInitialize()
                const NotificationType& type, const CheckResult::Ptr& cr, const String& author, const String& text,
                const MessageOrigin::Ptr&
        ) {
+               std::cerr << "IcingaDB::Checkable::OnNotificationSentToAllUsers." << std::endl;
                IcingaDB::NotificationSentToAllUsersHandler(notification, checkable, users, type, cr, author, text);
        });
 
@@ -125,6 +127,7 @@ void IcingaDB::ConfigStaticInitialize()
        Checkable::OnFlappingChange.connect(&IcingaDB::FlappingChangeHandler);
 
        Checkable::OnNewCheckResult.connect([](const Checkable::Ptr& checkable, const CheckResult::Ptr&, const MessageOrigin::Ptr&) {
+               std::cerr << "IcingaDB::Checkable::OnNewCheckResult" << std::endl;
                IcingaDB::NewCheckResultHandler(checkable);
        });
 
@@ -1773,6 +1776,7 @@ void IcingaDB::SendSentNotification(
 )
 {
        if (!GetActive()) {
+               std::cerr << "IcingaDB::SendSentNotification dropped a notification" << std::endl;
                return;
        }
$ icinga2 daemon -c icinga2.conf
...
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test1' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:24 +0100] information/config: Host recovered
[2024-11-13 15:17:24 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:24 +0100] information/Checkable: Checkable 'test3' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:24 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test4' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/config: Host recovered
[2024-11-13 15:17:25 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 0 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:25 +0100] information/Checkable: Checkable 'test2' has 2 notification(s). Checking filters for type 'Recovery', sends will be logged.
IcingaDB::Checkable::OnNotificationSentToAllUsers.
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:25 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduling a check
^C[2024-11-13 15:17:26 +0100] information/Application: Received request to shut down.
[2024-11-13 15:17:26 +0100] information/Application: Shutting down...
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'checker' with type 'CheckerComponent'.
[2024-11-13 15:17:26 +0100] information/Check Scheduler: Scheduler stopped
[2024-11-13 15:17:26 +0100] information/CheckerComponent: 'checker' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'notification' with type 'NotificationComponent'.
[2024-11-13 15:17:26 +0100] information/NotificationComponent: 'notification' stopped.
[2024-11-13 15:17:26 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'IcingaDB'.
[2024-11-13 15:17:26 +0100] information/IcingaDB: Flushing history data buffer to Redis.
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 1 state - 2 output - I'm just testing something
IcingaDB::Checkable::OnNewCheckResult
[2024-11-13 15:17:28 +0100] information/Checkable: Checkable 'test' has 2 notification(s). Checking filters for type 'Problem', sends will be logged.
[2024-11-13 15:17:28 +0100] information/Notification: Sending 'Problem' notification 'test!problem' for user 'icingaadmin'
IcingaDB::Checkable::OnNotificationSentToAllUsers.
IcingaDB::SendSentNotification dropped a notification <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
[2024-11-13 15:17:28 +0100] information/Notification: Completed sending 'Problem' notification 'test!problem' for checkable 'test' and user 'icingaadmin' using command 'send'.
[2024-11-13 15:17:28 +0100] critical/IcingaDB: history: 2 queries failed (attempt #3) while we're about to shut down. Giving up and discarding additional 8 queued history queries.
[2024-11-13 15:17:28 +0100] information/IcingaDB: 'icingadb' stopped.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingaadmin' with type 'User'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2' with type 'Host'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'icingadb' with type 'CheckCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test3!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test2!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test4!problem' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test1!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'test!recover' with type 'Notification'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'send' with type 'NotificationCommand'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Deactivate() called for config object 'app' with type 'IcingaApplication'.
[2024-11-13 15:17:28 +0100] information/ConfigObject: Dumping program state to file '/Users/yhabteab/Workspace/icinga2/prefix/var/lib/icinga2/icinga2.state'
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:28 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:28 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/config: Host run into or is in problem state
[2024-11-13 15:17:29 +0100] information/Checkable: ProcessCheckResult IsActive - 0 state - 2 output - I'm just testing something <<<<<<<<<<<< is discarded
[2024-11-13 15:17:29 +0100] information/IcingaApplication: Icinga has shut down.

It is possible that the issue you are experiencing may be resolved by the proposed PR #10191, but not because of the checker is stopped last, rather because of the changes to IDO and the Notification class. Therefore, I guess it makes sense to stop IDO, IcingaDB and the Notification component last, as well as to start them before the other objects, as is the case now.

@w1ll-i-code
Copy link
Author

This did indeed alleviate the problem, but it still does not completely solve it.

@yhabteab
Copy link
Member

This did indeed alleviate the problem, but it still does not completely solve it.

Would you mind sharing a few more details on what you've done and what isn't still fixed? :)

@w1ll-i-code
Copy link
Author

Therefore, I guess it makes sense to stop IDO, IcingaDB and the Notification component last, as well as to start them before the other objects, as is the case now.

I took this suggestion and changed the activation_priority of them to -50 for the ido/icingadb and to -25 for the NotificationComponent. I only experienced a single lost statechange (and subsequent missing notification).

@w1ll-i-code
Copy link
Author

I had also a look on the CheckerComponent::ObjectHandler function and saw that if an object was paused, it was erased from m_IdleCheckables and m_PendingCheckables. For testing I cleared both before the shutdown and have not seen any problems since, but I'm still testing. I hope I understood the use of runtimeRemoved correctly?

void CheckerComponent::Stop(bool runtimeRemoved)
{
	{
		std::unique_lock<std::mutex> lock(m_Mutex);
		m_Stopped = true;
		if (!runtimeRemoved) {
			m_IdleCheckables.clear();
			m_PendingCheckables.clear();
		}
		m_CV.notify_all();
	}

	m_ResultTimer->Stop(true);
	m_Thread.join();

	Log(LogInformation, "CheckerComponent")
		<< "'" << GetName() << "' stopped.";

	ObjectImpl<CheckerComponent>::Stop(runtimeRemoved);
}

@yhabteab
Copy link
Member

I took this suggestion and changed the activation_priority of them to -50 for the ido/icingadb and to -25 for the NotificationComponent. I only experienced a single lost statechange (and subsequent missing notification).

Have you verified in the logs whether these objects were stopped last? And how did you detect the missing state change? You can also add some custom logs to trace the execution of the check without having to enable debug logs for all components.

I just quickly did some poor man's patching and testing, and couldn't find any ignored events from Icinga DB.

diff --git a/lib/base/configobject.cpp b/lib/base/configobject.cpp
index 4317771d1..4435456f4 100644
--- a/lib/base/configobject.cpp
+++ b/lib/base/configobject.cpp
@@ -586,6 +586,13 @@ void ConfigObject::StopObjects()
        std::vector<Type::Ptr> types = Type::GetAllTypes();
 
        std::sort(types.begin(), types.end(), [](const Type::Ptr& a, const Type::Ptr& b) {
+               if (a->GetName() == "IcingaDB" || a->GetName() == "NotificationComponent") {
+                       return false;
+               }
+               if (b->GetName() == "IcingaDB" || b->GetName() == "NotificationComponent") {
+                       return true;
+               }
+
                if (a->GetActivationPriority() > b->GetActivationPriority())
                        return true;
                return false;
@@ -599,7 +606,7 @@ void ConfigObject::StopObjects()
 
                for (const ConfigObject::Ptr& object : dtype->GetObjects()) {
 #ifdef I2_DEBUG
-                       Log(LogDebug, "ConfigObject")
+                       Log(LogInformation, "ConfigObject")
                                << "Deactivate() called for config object '" << object->GetName() << "' with type '" << type->GetName() << "'.";

I hope I understood the use of runtimeRemoved correctly?

That parameter indicates whether a configuration object was deleted at runtime, e.g. via the REST API or via a cluster message from some other endpoint. It is unlikely that this will happen for the Checker component though, as it is typically managed in configuration files and not through the API.

@w1ll-i-code
Copy link
Author

Have you verified in the logs whether these objects were stopped last?

Yes, I did as you suggested in the comment and logged the deactivate on LogInformation.

And how did you detect the missing state change?

As always by looking at the database.

That parameter indicates whether a configuration ...

Thanks, exactly as I expected. With the last fix I posted, I can no longer see any missed State changes. I have seen one lost notifications, which is a problem, as we rely on the notifications to trigger a webhook, but that might have effectively just been the db connection, as I can't replicate it anymore.

@w1ll-i-code
Copy link
Author

@yhabteab Thank you for your insight. It was really helpful to get at the bottom of this bug. With the changes I have made to the PR, this issue does no longer occur in my test environment. If there are any concerns left, please let me know and I'll do my best to address them.

@lippserd lippserd added the ref/IP label Dec 5, 2024
@lippserd
Copy link
Member

lippserd commented Dec 5, 2024

ref/IP/52571

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working ref/IP
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants