
That is the $65k question and unfortunately I don't have a pat answer for that yet. I probably need to see more types of projects instead of more time on fewer projects which is where I'm at.

But I can give you a partial picture.

You're going to end up with multiple dashboards carrying duplicate charts, because you're showing correlation between two charts via proximity, especially charts in the same column one row apart, or vice versa. You're trying to show whether a correlation is likely to be causation or not. Grafana has a setting that shows the crosshairs on all graphs at the same time, but they need to be in the same viewport for the user to see them. Generally, for instance, error rates and request rates are proportional to each other, unless a spike in error rates is being triggered by, say, web crawlers that are now hitting you with 300 req/s each whereas they normally send you 50. The difference in the slope of the lines can tell you why an alert fired, or that it's about to. So I let previous RCAs inform whether two graphs need to be swapped because we missed a pattern that spanned a viewport. And sometimes after you fix tech debt, the correlation between two charts goes way up or way down. So what was best in May may not be best come November.
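The "difference in the slope of the lines" check can be made concrete. A minimal sketch, with made-up function names and thresholds: flag the time buckets where the error/request ratio departs from its usual baseline, which is exactly the pattern you'd otherwise spot by eyeballing two adjacent panels.

```python
def error_ratio_alert(request_rates, error_rates, baseline=0.01, factor=3.0):
    """Flag time buckets where errors grow faster than requests.

    request_rates / error_rates: parallel lists of per-bucket counts.
    Returns indices of buckets whose error/request ratio exceeds
    `factor` times the `baseline` ratio. Illustrative values only.
    """
    flagged = []
    for i, (reqs, errs) in enumerate(zip(request_rates, error_rates)):
        if reqs == 0:
            continue
        if errs / reqs > baseline * factor:
            flagged.append(i)
    return flagged

# Errors tracking requests proportionally: nothing flagged.
print(error_ratio_alert([100, 200, 400], [1, 2, 4]))    # []
# A crawler-style spike: requests triple but errors grow 12x.
print(error_ratio_alert([100, 200, 300], [1, 2, 12]))   # [2]
```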

There's a reason my third monitor is in portrait mode, and why that monitor is the first one I see when I come back to my desk after being AFK. I could fit 2 dashboards and group chat all on one monitor. One dashboard showed overall request rate and latency data, the other showed per-node stats like load and memory. That one got a little trickier when we started doing autoscaling. The next most common dashboard which we would check at intervals showed per-service tail latencies versus request rates. You'd check that one every couple of hours, any time there was a weird pattern on the other two, or any time you were fiddling with feature toggles.

From there things balkanized a bit. We had a few dashboards that two or three of us liked and the rest avoided.




Yeah, but that still doesn’t let you see “event A happened before event B, which led to C”. I’ve had far more than one bug where good logs let me investigate and resolve the issue quickly and easily, whereas telemetry would have left me searching around forever.

Here’s the thing though. When you’ve got 1000 req/s split across a couple dozen log files all being scanned in parallel there’s really no such thing as tracing a->b->c anyway. It’s the seashore and you’re looking for a specific shell.

You’ve got correlation IDs, and if your system isn’t reliably propagating those everywhere, you absolutely have to fix that. But you’re only going to use those once you’ve already noticed an uptick in a weird error you haven’t seen before, and it’s hard to spot those when you’re generating 8k log entries per second that are 140-200 characters long, so you’re only seeing twenty of them at a time in Splunk.
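For what "reliably propagating" can look like in practice, here's a minimal sketch using Python's stdlib `contextvars` plus a logging filter, so every log line a request produces carries the same ID without each call site having to pass it along. All names here are illustrative, not from any particular framework:

```python
import contextvars
import logging
import uuid

# One ID per request, visible to every log call in that request's context.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

class CorrelationFilter(logging.Filter):
    def filter(self, record):
        # Stamp the current request's ID onto every record.
        record.correlation_id = correlation_id.get()
        return True

logging.basicConfig(format="%(correlation_id)s %(levelname)s %(message)s")
logger = logging.getLogger("app")
logger.addFilter(CorrelationFilter())

def handle_request(incoming_id=None):
    # Reuse the caller's ID if it sent one; otherwise mint a new one.
    correlation_id.set(incoming_id or uuid.uuid4().hex)
    logger.warning("payment declined")  # this line now carries the ID

handle_request("abc123")  # logs: abc123 WARNING payment declined
```

The same idea extends across service boundaries by copying the ID into an outgoing request header and reading it back on the other side.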

If you have some chatty frontend firing off three requests at the same time, you’re going to struggle, period. You’re going to be down to some janky log searches for that, and you don’t need to be paying someone $$ every month to still have it rough.
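Those janky searches usually boil down to the same move: pull the interleaved lines, bucket them by correlation ID, and sort each bucket by timestamp to recover each request's own a -> b -> c ordering. A hypothetical sketch (tuple layout and IDs are made up):

```python
from collections import defaultdict

def group_by_correlation(lines):
    """lines: (timestamp, correlation_id, message) tuples in arrival order.

    Returns {correlation_id: [messages in time order]}, untangling
    several concurrent requests whose log lines were interleaved.
    """
    by_id = defaultdict(list)
    for ts, cid, msg in sorted(lines):  # sort on timestamp first
        by_id[cid].append(msg)
    return dict(by_id)

# Two concurrent frontend requests whose lines arrived interleaved.
interleaved = [
    (3, "req-2", "db query"),
    (1, "req-1", "auth ok"),
    (2, "req-1", "db query"),
    (4, "req-2", "timeout"),
    (0, "req-2", "auth ok"),
]
print(group_by_correlation(interleaved))
```

This only works if every line actually carries the ID, which is why unreliable propagation has to be fixed first.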

We used to have QA people for this.


But most requests don't generate errors / warnings / failures, so you can easily discard the logs for the requests that don't.
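A sketch of that "discard the boring requests" idea: buffer a request's log lines and only emit them if the request produced a warning or error. This is purely illustrative; in real systems this kind of tail-based filtering usually lives in the logging agent or collector, not application code.

```python
def flush_if_interesting(buffered_lines, keep_levels=("WARNING", "ERROR")):
    """buffered_lines: (level, message) tuples for one request.

    Return the full buffer if any line hit a kept level, else drop
    everything. Keeping the whole buffer preserves the INFO context
    around the failure instead of just the error line itself.
    """
    if any(level in keep_levels for level, _ in buffered_lines):
        return buffered_lines
    return []

ok_request = [("INFO", "auth ok"), ("INFO", "200 in 12ms")]
bad_request = [("INFO", "auth ok"), ("ERROR", "db timeout")]

print(flush_if_interesting(ok_request))        # [] -- dropped entirely
print(flush_if_interesting(bad_request))       # both lines kept
```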

> there’s really no such thing as tracing a->b->c anyway

> and it’s hard to see those when you’re generating 8k log entries per second that are 140-200 characters long and so you’re only seeing twenty of them at a time in Splunk.

Except, as you note, you can have a tag to correlate logs across distributed services. This is already done for Jaeger tracing. It would be insanity to try to look at all logs at once. When you're looking at logs, it's because of something like "customer A complains they had a problem with request XYZ". And honestly, 8k/s is child's play for logging. A system I was running had to start tuning down the log verbosity at ~30k requests/s, and that's because it was generating around 8 logs per request (so ~240k logs/s).

> You’re going to be down to some janky log searches for that and you don’t need to be paying someone $$ every month to still have it rough

That's between you and your log ingestion system. You get to pick where you send your logs and what capabilities that system has. All the companies I've worked at self-hosted their log infrastructure, and it worked fine for not a lot of money. You're conflating best practices with "what can I pay a SaaS company to solve for me". Honeycomb.io may be helpful here, btw: their pricing wasn't exorbitant, and at low to medium scale, tracing the way they do it can supplant the need for logging.



