Don't Make These 6 Prometheus Monitoring Mistakes | Prometheus Best Practices & Pitfalls

  • Published: Dec 17, 2024

Comments • 29

  • @rishabhverma2631
    @rishabhverma2631 6 months ago +4

    Very happy that I came across your channel. I am quite new to Prometheus and your content has already helped me out a bunch. Thanks Julius

  • @BlackUnicornVlogs
    @BlackUnicornVlogs 1 year ago +1

    YAYYY!!! It's awesome to see how this series is progressing. Another winner 😎

  • @90hijacked
    @90hijacked 6 months ago +1

    I am really hyped to find your channel. I've been working with Prometheus for a few years now; it was also my primary reason to begin learning Golang.
    Instantly subscribed, I'll be lurking here on my weekends! ♥

  • @erikchilders5496
    @erikchilders5496 1 month ago +1

    Thanks Julius!

  • @theophani
    @theophani 1 year ago +1

    Ahhh!! All mistakes I’ve encountered! (including ones I made …). Great summary. I’m sure I’ll refer back to this one.

    • @PromLabs
      @PromLabs 1 year ago

      Ohhh that's good to hear that it's relevant to pitfalls you encountered as well, thank you! :)

  • @rafiraf1522
    @rafiraf1522 1 year ago +1

    Great content, thank you for recording it!

  • @GeorgiKobilarov
    @GeorgiKobilarov 1 year ago +1

    Great video! Thanks Julius

  • @Jam-ht2ky
    @Jam-ht2ky 1 year ago +1

    This would definitely make a great series!

  • @YuruCampSupermacy
    @YuruCampSupermacy 1 year ago +1

    Loved the video. Looking forward to more.

  • @R0SS0
    @R0SS0 7 months ago +1

    Thanks for the video! Very informative

  • @vnavalianyi
    @vnavalianyi 1 year ago +1

    Thank you for great videos!

  • @nikitaruy8619
    @nikitaruy8619 7 months ago +1

    Thank you for the video! :) Very good content!

  • @jhonsen9842
    @jhonsen9842 5 months ago +1

    Great stuff

  • @fauxz3782
    @fauxz3782 8 months ago +1

    Your videos are so good

  • @PromLabs
    @PromLabs 1 year ago

    Let me know what other kinds of Prometheus pitfalls and best practices you'd like to learn about! These were some technical ones that I've mentioned a lot in talks, but there are surely many others.

  • @aminjafer4535
    @aminjafer4535 1 year ago +1

    Awesome channel

  • @bulmust
    @bulmust 1 year ago +1

    Nice video and channel.

  • @prinzgonzo6646
    @prinzgonzo6646 1 year ago +1

    Hey Julius, thank you for your video. Which metric type would you propose to avoid cardinality bombs when I want to record labels that can have lots of different values? Is there a way to change the grouping mechanism so that Prometheus does not create a new time series for every label combination?

    • @PromLabs
      @PromLabs 1 year ago +1

      A time series is identified by its metric name and unique set of labels, so any addition, removal, or change of a label will automatically mean that it's a different / new time series. So there's no way to not create a new series for a new label combination. Other than that, the only thing that matters is the total number of series generated by a metric, which results from the combinations of label values it can have. The metric type doesn't have any impact on that (and mostly just depends on what you are trying to measure), except that of course e.g. in the case of histograms, there's always an automatic "le" bucket label that gets multiplied into the total cardinality, since every custom label combination also has one series per configured bucket (so one strategy is just to configure fewer buckets if histogram cost is becoming an issue).
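
A rough sketch of the arithmetic described in this reply, in Python. The function name `series_count` and the example label counts are made up for illustration. It assumes a classic histogram, where each label combination gets one series per configured bucket, one for the implicit "+Inf" bucket, and one each for `_sum` and `_count`.

```python
from math import prod


def series_count(label_value_counts, histogram_buckets=None):
    """Estimate how many time series a single metric produces.

    label_value_counts: number of distinct values for each custom label.
    histogram_buckets: number of explicitly configured (finite) buckets for
    a classic histogram, or None for a counter or gauge.
    """
    # Every combination of label values is its own series; prod([]) is 1.
    combinations = prod(label_value_counts)
    if histogram_buckets is None:
        return combinations
    # Assumed per-combination cost of a classic histogram:
    # configured buckets + the "+Inf" bucket + the _sum and _count series.
    return combinations * (histogram_buckets + 1 + 2)


# Hypothetical metric with three labels that have 5, 10 and 3 values.
as_counter = series_count([5, 10, 3])
as_histogram = series_count([5, 10, 3], histogram_buckets=10)
fewer_buckets = series_count([5, 10, 3], histogram_buckets=4)
print(as_counter, as_histogram, fewer_buckets)
```

With these made-up numbers, dropping from 10 to 4 buckets roughly halves the histogram's series count, while the metric type itself does nothing to reduce the number of label combinations.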

  • @pulithawanniarachchi7991
    @pulithawanniarachchi7991 26 days ago

    I have a quick question to clarify. When we use the `increase` function in our queries, the very first time a metric is reported, it has a value of 1 in Grafana. However, the `increase` function returns zero in this case, causing the query to fail. Other functions, like `rate`, exhibit the same behavior. I believe this arises from the way `increase` is calculated (the first time, the first and last data points are the same). From the second metric report onward, it works fine. Do you know the reason behind this behavior and how we can avoid it?

    • @PromLabs
      @PromLabs 26 days ago +1

      Yes, when a counter metric just appears for the first time with a value of 1, the rate() and increase() functions do not know whether this was an actual increase, or whether the time series already had the value of 1, but was just temporarily absent for some reason (like a scrape failure or a too short rate window). That's why both functions currently require at least two samples to compare under the provided window. However, there is some work going on to start tracking the creation timestamps of counters, which could then be used in functions like rate() and increase() to handle these situations better. See for example this PromCon 2023 talk: ruclips.net/video/nWf0BfQ5EEA/видео.html. And in this PromCon 2024 talk, there is more information about the possible future metadata store to store all kinds of metadata about metrics, including counter creation timestamps: ruclips.net/video/Torm3M23Uyk/видео.html
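
To make the "at least two samples" rule concrete, here is a deliberately simplified Python sketch. `simple_increase` is a hypothetical helper, not Prometheus code: it leaves out the extrapolation that the real `increase()` applies, and the window boundary handling is an assumption.

```python
def simple_increase(samples, window_start, window_end):
    """Simplified counter increase over a window.

    samples: list of (timestamp, value) tuples sorted by timestamp.
    Returns None (an empty result) when fewer than two samples fall inside
    the window, because a single sample gives nothing to compare against.
    """
    in_window = [(t, v) for t, v in samples if window_start < t <= window_end]
    if len(in_window) < 2:
        return None

    total = 0.0
    previous = in_window[0][1]
    for _, value in in_window[1:]:
        if value < previous:
            # The counter went down, so treat it as a reset to zero.
            total += value
        else:
            total += value - previous
        previous = value
    return total


# A brand-new series with a single sample of 1: no result at all.
print(simple_increase([(100, 1.0)], 0, 300))
# A series that was pre-initialized to 0 and then incremented.
print(simple_increase([(60, 0.0), (120, 0.0), (180, 1.0)], 0, 300))
```

The second call shows why exposing the counter at 0 before the first event helps: there is then an earlier sample to compare the 1 against.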

    • @pulithawanniarachchi7991
      @pulithawanniarachchi7991 26 days ago

      @PromLabs Thank you for the response. However, let’s say a second sample is received after one hour. When the rule is evaluated over the last five minutes at the time the second sample is received, it returns results (even when I have only one data point).
      In my scenario, I need to trigger an alert if at least one failure is detected (I cannot use the sum function because I need to capture all labels as well). This is important because the next failure could occur after some time, as my service does not experience heavy traffic. I am monitoring metrics for the last 10 minutes.

    • @PromLabs
      @PromLabs 26 days ago

      @pulithawanniarachchi7991 "even when I have only one data point" -> No, both functions will return an empty result if they find only one sample under the requested window. What you are maybe seeing is that you do have multiple scraped samples under the window, but only one increment among those samples. In that case, yes, the functions will report that increment.
      If you want to detect whether there was any failure at all under a given window, the best course of action would be to either pre-initialize all relevant counter metrics to 0 upon startup (see also promlabs.com/blog/2023/09/13/dealing-with-missing-time-series-in-prometheus/) or use an expression that has a fallback in case of an empty rate result. For example, you could do something like:
      increase(mymetric[5m]) or (mymetric unless mymetric offset 5m)
      Meaning: give me the increase over "mymetric" over the last 5 minutes, and if it's not present, give me instead the value of "mymetric" right now, but only if it didn't exist already 5 minutes ago (should match the rate window length). Haven't tested it, but something like this.
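
The set logic behind that fallback expression can be sketched in Python, modelling each instant vector as a dict keyed by label set. The helper names and the sample data are hypothetical, and the sketch assumes plain label-set matching with the metric name ignored, which is how the `or` and `unless` operators match series.

```python
def labels(**kv):
    """Build an order-independent, hashable label set."""
    return frozenset(kv.items())


def vector_or(left, right):
    """PromQL-style `or`: everything in left, plus right-hand series
    whose label set does not appear in left."""
    result = dict(left)
    for key, value in right.items():
        if key not in result:
            result[key] = value
    return result


def vector_unless(left, right):
    """PromQL-style `unless`: left-hand series with no match in right."""
    return {key: value for key, value in left.items() if key not in right}


def failures_with_fallback(increase_result, current, five_minutes_ago):
    """Mirrors: increase(m[5m]) or (m unless m offset 5m)."""
    return vector_or(increase_result, vector_unless(current, five_minutes_ago))


old = labels(job="api", code="500")
new = labels(job="api", code="503")  # series that only just appeared

result = failures_with_fallback(
    increase_result={old: 2.0},          # increase() has no result for `new`
    current={old: 7.0, new: 1.0},
    five_minutes_ago={old: 5.0},
)
print(result[old], result[new])
```

Here the established series reports its increase, while the series that did not exist five minutes ago falls back to its current value.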

  • @stephennfernandes
    @stephennfernandes 8 months ago

    Hey Julius, I am pretty new to Prometheus. Can we also track user-level behaviour in Prometheus, or would I need entirely separate tooling for that? I want to know what my users are doing and watch individual users' behaviour.

    • @PromLabs
      @PromLabs 8 months ago

      If you have more than a handful of users, then a metrics-based monitoring system like Prometheus is probably not the right choice. Imagine you have a million users and you want to track one metric about each. Then you already have 1 million time series just for that. The total budget for a big server is usually a couple of million series, so this will only work if you have very few users or you don't track a lot of info about each one :)
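
The budget argument above is simple multiplication. A small Python sketch follows; the function name is made up, and the 2,000,000 default is only an assumed reading of "a couple of million series".

```python
def fits_series_budget(users, metrics_per_user, budget=2_000_000):
    """Return (series needed, whether it fits the assumed server budget)."""
    series = users * metrics_per_user
    return series, series <= budget


# One metric per user already uses half of the assumed budget.
print(fits_series_budget(1_000_000, 1))
# Three metrics per user no longer fit on a single server.
print(fits_series_budget(1_000_000, 3))
```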

  • @suparna20100
    @suparna20100 8 months ago

    Hi, I've just started using Prometheus. I am trying to use the rate function on a counter. I have a counter metric whose value is 1. When I apply rate to it, the result for the first sample starts at 0. Why does the rate for the first sample of a newly instrumented series start at 0, and how can I resolve that? I want rate / changes / increase to start at a non-zero value and then drop to 0 when the series is idle. How can I achieve this?

  • @XavierTibudon
    @XavierTibudon 1 year ago +2

    Awesome content, Julius. I find it extremely useful, but it took me 2 months to find it! I'll do what I can to spread the word through my channels. Insanely popular as it is, most implementations I come across during SRE consultancy engagements only exploit a small fraction of its value. All the best, @XavierThibaudon