Prometheus: Lessons Learned



I've wanted to do this lightning talk for a while. We recently finished a very long project of getting off our old monitoring system, which was Nagios, Graphite, and a combination of other homegrown pieces of software, and onto something more standard, and we chose Prometheus. I know there are a lot of talks like this, but this one is a lessons learned: what I would do differently, and what I would do the same, if I had to redo this project.

About me: Patrick O'Brian. I like cooking, dogs, tacos, and a new hobby, skiing, which is awesome. I'm an SRE at The Trade Desk. We're based in Ventura, California, but we have many offices worldwide, and we're always hiring if you're interested.

Here we go, slide one: think about your hard alerts. A lot of legacy alerting systems have a lot of alerts defined in them that you don't necessarily need anymore, and you can cut many of them. Ninety percent of the alerts you keep will be insanely easy to move over; it's the remaining ten percent that will be difficult, and you want to think up front about how you're actually going to get those useful alerts into the new system. Coming from Nagios especially, we often had Python scripts that do many different things inside a single check to figure out whether there is an issue. Those are the hard ones, and that's where the longest tail of the project will be.

I asked the Prometheus developers a question earlier about documentation. The official Prometheus documentation is extremely clinical, and I think that's by design, so I'm very happy to hear that we can now contribute better documentation. We had to rely on learning PromQL ourselves, learning how Prometheus itself works, and then writing that up internally for our users, because pointing them at the clinical upstream documentation wasn't always helpful. You will get a lot of PromQL questions when you start rolling out Prometheus, and it's best to become as much of an expert in it as you can.

Do the math. We immediately hit cardinality issues because we have a lot of hosts. We had sold the convention as: don't embed any metadata in the metric name, keep the name generic, and put the metadata in labels. We hit two million metrics in a single namespace in about thirty seconds. It was terrible and very painful, and having to walk that back to "well, in this case, maybe do embed some metadata in the metric name" was not fun.
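To make the "easy ninety percent" of the migration concrete, here is a minimal sketch of a Nagios-style disk-space check rewritten as a Prometheus alerting rule. The metric names assume node_exporter is in use, and the threshold, durations, and label values are placeholders rather than anything taken from the talk.

```yaml
# Minimal sketch: a Nagios-style "disk almost full" check as a Prometheus
# alerting rule. Assumes node_exporter metrics; values are placeholders.
groups:
  - name: example-host-alerts
    rules:
      - alert: HostDiskAlmostFull
        # Fire when less than 10% of the filesystem is free, sustained 15m
        expr: |
          (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
            / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Less than 10% disk space left on {{ $labels.instance }} ({{ $labels.mountpoint }})"
```

Simple threshold checks like this translate almost mechanically; the multi-step Nagios Python scripts mentioned above are the ones that usually need to be decomposed into several metrics and rules first.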
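On the PromQL questions that come up once people start writing their own queries, the most common one in practice is why a counter such as http_requests_total looks useless when graphed raw. A hedged sketch of the kind of answer internal docs tend to contain, written as a recording rule (the metric and rule names here are invented, not from the talk):

```yaml
# Counters only ever increase, so graphing http_requests_total directly is
# rarely what people want; the usual answer is rate() over a time window.
# A recording rule keeps the query cheap and gives dashboards a stable name.
groups:
  - name: example-recording-rules
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
```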
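On the cardinality point, the convention of generic metric names plus labels only holds up if no single label carries an unbounded set of values. One generic mitigation, shown purely as an illustration and not necessarily what the team did, is to drop the offending label at scrape time with metric_relabel_configs; the job, target, and label names below are made up.

```yaml
# A sketch of clawing back cardinality at ingestion time: drop a label whose
# values explode the series count before the samples are stored.
scrape_configs:
  - job_name: example-app
    static_configs:
      - targets: ['app-01.example.com:9100']
    metric_relabel_configs:
      # Drop a hypothetical per-request label that creates one series per call
      - action: labeldrop
        regex: request_id
```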
Internal evangelists are insanely helpful. I can talk about how implementing Prometheus will improve people's quality of life, but it's really the internal evangelists who will actually show people, and who have the connections that I might not have. We had a fantastic one, who's here: Nathan, say hi. There we go. He was great because he knew many more developers than I did, so he was able to work with them, show them in code how it works and what the benefits are, and reach much further than I could. It was fantastic.

Create a dedicated team if you can, especially if you're a large company. The more opinions on how to do something, the better; it just makes sense.

Get involved in the community. This one speaks for itself: you learn more about the project, and you're able to help everybody else out.

And last, we're hiring. I think we now do around eleven million requests a second, which is kind of alright. So if you like big data, a lot of requests per second, and large scale, we're perfect for that. And that's it, thank you very much.

Thank you very much, Patrick.
