Zabbix global event correlation explained

  • Published: 14 Jul 2024
  • Use the built-in functionality to tell the monitoring software which device is the core device. This lets us concentrate on the root cause when a cascade of events occurs.
    Chapters:
    00:00 - Explanation of how and why it works
    05:03 - Create a dummy template to perform simulations really fast.
    06:28 - Create 6 host objects. Now we have 2 locations. Each location has a central router and 2 standard devices.
    07:23 - Label equipment. Configure which device holds the central connection and which devices are behind it.
    09:21 - Final explanation of how the host labeling system makes sense with the 2 rules.
    10:29 - Crash test. A simulation showing that when a device in one site goes down, its problem goes away once the core device of that site goes down too. A device going down in another site does not interfere with the equipment in the first site.
    14:04 - Summarize the setup again, making a more bulletproof solution, and simulate again.
    List all events closed by global correlation:
    SELECT repercussion.clock AS symptom_clock, repercussion.name AS symptom_name,
    rootCause.clock AS root_cause_clock, rootCause.name AS root_cause_name FROM events repercussion
    JOIN event_recovery ON (event_recovery.eventid=repercussion.eventid)
    JOIN events rootCause ON (rootCause.eventid=event_recovery.c_eventid)
    WHERE event_recovery.c_eventid IS NOT NULL
    ORDER BY repercussion.clock ASC;
    Delete all events (symptoms) closed by a correlation rule. The root cause will remain:
    DELETE FROM events WHERE eventid IN (
    SELECT event_recovery.eventid FROM event_recovery
    WHERE event_recovery.c_eventid IS NOT NULL
    );

Comments • 7

  • @d.howardcolesjr4862 1 year ago

    Hey, this is templateable, hehe. Good solution. The location-based tag is very good.

  • @d.howardcolesjr4862 1 year ago

    The template can have a macro called "{$PARENT_LOCATION}" with a default value of "site1", and then tags such as "Location" with a value of {$PARENT_LOCATION}. That macro can be overridden at the closest template linked to the node, or at the node itself. The router's template and macros can be the same. So "Location" would have value of {$PARENT_NAME} in the template tag, with the macro having a default value. Then you could override that macro at the router (or parent) itself. I've tested this and it's working like a charm.
    So, as I said this solution can work with templates, with minimal manual intervention.

    • @d.howardcolesjr4862 1 year ago

      Correction. This sentence: "So "Location" would have value of {$PARENT_NAME} in the template tag, with the macro having a default value." should be: "So "Location" would have value of {$PARENT_LOCATION} in the template tag, with the macro having a default value." DOH.
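The override idea in this thread can be modelled with a short sketch. This is not Zabbix code; it only illustrates the precedence the commenter relies on, where a macro value set on the host wins over the template default. The function name `resolve_tag_value` and the value "site2" are hypothetical; the macro name and "site1" come from the comment above.

```python
import re

# Matches user macros such as {$PARENT_LOCATION} (simplified pattern, no macro context).
MACRO_PATTERN = re.compile(r"\{\$[A-Z0-9_.]+\}")


def resolve_tag_value(tag_value, template_macros, host_macros):
    """Expand user macros in a tag value; host-level values win over template defaults."""
    effective = {**template_macros, **host_macros}
    # Macros that are defined nowhere are left unexpanded.
    return MACRO_PATTERN.sub(lambda m: effective.get(m.group(0), m.group(0)), tag_value)


# Template side: the tag references the macro, and the macro has a default value.
template_macros = {"{$PARENT_LOCATION}": "site1"}
template_tags = {"Location": "{$PARENT_LOCATION}"}

# A host that inherits the template unchanged ends up in site1.
inherited = resolve_tag_value(template_tags["Location"], template_macros, {})

# A host (hypothetical) that overrides the macro ends up with its own location.
overridden = resolve_tag_value(
    template_tags["Location"], template_macros, {"{$PARENT_LOCATION}": "site2"}
)

print(inherited, overridden)
```

With this layout only the hosts that differ from the default need a host-level macro, which is the "minimal manual intervention" the commenter describes.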

  • @clecimarfernandes4992 3 months ago

    Hi Aigars, thank you for this video. I'm trying to do this and add one more tag to the event correlation, a tag for an event, for example a tag pair where the tag is problem1. So the event correlation would run only when problem1 occurs. Is that possible? I'm running some tests.

  • @dombayan 1 year ago

    Hi Aigars, thanks very much, very informative video. To me the only drawback will be the constantly raising/closing triggers of non-router hosts. I tried it in my network, but those endless raising/closing alerts while the router is down spoil the trigger count statistics... Do you think there is a way to suppress the dependent triggers? (without using the host dependency ;) )

    • @aigarskadikis 1 year ago

      It's actually recommended to use a cron job that erases all symptoms from the database once in a while. The SQL commands are in the video description. Usually in production, when we use this cron job, we add a "clock" constraint to delete only records that are older than 14 days.
      Let us know if that improves the view of the statistics.
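As a rough illustration of the "clock" constraint mentioned in this reply, here is a minimal sketch that runs the delete against an in-memory SQLite stand-in. The two tables are reduced to the columns used in the video description and are an assumption for demonstration only; a real Zabbix database runs on MySQL or PostgreSQL and has more columns. The names `delete_old_symptoms`, `demo`, and `RETENTION_DAYS` are hypothetical.

```python
import sqlite3
import time

RETENTION_DAYS = 14  # retention window mentioned in the reply above

# Symptoms are events whose recovery row carries a c_eventid,
# i.e. events that were closed by a global correlation rule.
DELETE_OLD_SYMPTOMS = """
DELETE FROM events
WHERE clock < ?
  AND eventid IN (
    SELECT eventid FROM event_recovery WHERE c_eventid IS NOT NULL
  )
"""


def delete_old_symptoms(conn, now=None):
    """Delete correlation-closed events older than RETENTION_DAYS; return the row count."""
    now = int(time.time()) if now is None else now
    cutoff = now - RETENTION_DAYS * 86400  # events.clock is a Unix timestamp
    cur = conn.execute(DELETE_OLD_SYMPTOMS, (cutoff,))
    conn.commit()
    return cur.rowcount


def demo():
    # Simplified stand-in schema, not the real Zabbix table definitions.
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE events (eventid INTEGER PRIMARY KEY, clock INTEGER, name TEXT);
        CREATE TABLE event_recovery (eventid INTEGER PRIMARY KEY, c_eventid INTEGER);
    """)
    now = 1_700_000_000
    day = 86400
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", [
        (1, now - 20 * day, "router down"),  # root cause, old: kept
        (2, now - 20 * day, "switch down"),  # symptom, old: deleted
        (3, now - 1 * day, "router down"),   # root cause, recent: kept
        (4, now - 1 * day, "switch down"),   # symptom, recent: kept
    ])
    conn.executemany("INSERT INTO event_recovery VALUES (?, ?)", [(2, 1), (4, 3)])
    deleted = delete_old_symptoms(conn, now=now)
    remaining = [row[0] for row in conn.execute("SELECT eventid FROM events ORDER BY eventid")]
    return deleted, remaining


print(demo())
```

Passing the cutoff as a parameter keeps the statement free of database-specific date functions, so the same SQL text could be reused from a scheduled script on another database driver, provided the placeholder style is adjusted.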

  • @stevenleander6101 9 months ago

    Hi, great video. I tried the setup as per your description, and I have an alert action that sends an SMS after 3 minutes of downtime, allowing the correlation to do its magic before the SMS is sent. All fine, I receive only 1 SMS with the problem. But when all the units come online again, I receive "resolved" SMSes for all nodes. Is there a way to avoid this behavior?
    Btw. Thanks for a great Summit 2023, I enjoyed it very much.