Troubleshooting Blueprint: Essential Strategies for Engineers
Troubleshooting is a part of an engineer’s life. Whether it is API timeouts, broken functionality, misconfigurations, or any number of other issues, we often need to roll up our sleeves and fix things. Based on my experience and tenure at Zeta, I would like to share some guidelines, learning resources, and tips and tricks that have helped me troubleshoot issues.
Guidelines
Incidents can come at any time and challenge us in new ways. Continuous preparation and learning equip us to resolve them. The list of things to know is ever-growing and will continue to evolve; I have attempted to capture some key information with the 4Os in mind.
The 4Os
- Observability: Signals emitted by the application contribute toward observability. Lack of observability increases the mean time to detect (MTTD) because, in the absence of the right signals, troubleshooting is based on hypotheses that need trial and error to confirm and replicate before anything can be fixed.
- Operability: Controls to operate the system, such as turning features on and off, updating configurations, bumping resources, and restarting applications. Good operability controls help resolve incidents once the root cause is identified and help reduce the mean time to resolve (MTTR).
- Optimization: No system can support 1 million TPS from the first day. Optimization needs to happen continuously to keep up with the expected traffic from our customers. This involves not only code changes but also tuning configurations, choosing the right resources, etc.
- Onboarding: The majority of lower-severity incidents and issues might be due to misconfigurations. A well-defined onboarding process with clear steps is crucial to avoid them.
Preparing for Troubleshooting
Preparing the Application
- Ensure your application is publishing the right signals.
- Use structured logging, as it captures important attributes such as entityID and requestID in every log line and lets you analyze logs end to end.
- Design the system and APIs with operability in mind. Always design CRUD APIs for an entity and make sure you can use them to fix data issues, disable product features temporarily, etc.
- If operability controls cannot be exposed as APIs, expose them via JMX. Operations that can be performed via JMX are as follows:
  - Clear cache
  - Change log levels
  - Disable features
  - Increase or decrease cache size
- Get PGWatch and PGBadger enabled for all the PostgreSQL databases your application connects to. They help in troubleshooting query performance and can be monitored regularly.
- Good test coverage might not seem related to incidents, but having 80% test coverage not only helps prevent incidents but can also help in reproducing issues locally.
- Benchmark the performance of the application on a dedicated setup. Knowing the TPS your APIs support tells you the right configurations for production and the load you can safely handle, and the exercise is a good learning experience in itself.
- Have runbooks handy around the flows with mitigation steps.
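To make the structured logging point concrete, here is a minimal sketch using only Python's standard `logging` module. It is an illustration under assumptions, not the way any particular application does it: the logger name `payments`, the message, and the ID values are hypothetical, and only the attribute names `requestID` and `entityID` come from the text above.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Attribute names taken from the article; other extras are ignored here.
    CONTEXT_FIELDS = ("requestID", "entityID")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            value = getattr(record, name, None)
            if value is not None:
                payload[name] = value
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("payments")  # hypothetical logger name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# Usage: write to an in-memory buffer so the output can be inspected.
buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "card authorization failed",  # hypothetical message and IDs
    extra={"requestID": "req-123", "entityID": "card-42"},
)
entry = json.loads(buffer.getvalue())
```

Because every line is a JSON object with the same keys, a log search tool can filter on one requestID and follow a request across services instead of relying on free-text matching.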
Preparing the Cluster
- Ensure auditing is in place at all layers. This is very useful in determining which calls reached your APIs, how long they took, and who made them.
- Ensure all the applications emit appropriate signals and can be effectively monitored.
- Ensure the customer services owned by the cluster are well known and documented, and that runbooks are prepared for them.
- Ensure the configurations needed to operate the system are properly documented, idempotent, and have minimal steps. Keep iterating on them to include new features.
- Maintain a runbook that documents flows by application and customer service, with known issues and resolution steps.
Preparing Yourself
- Be familiar with the observability tools in use. They provide a lot of insight while troubleshooting incidents.
- Product Context helps a lot. Go through the resources like training videos, documents, and code to be familiar with the critical flows. Connecting the business context, domain, and technical context helps relate the issue with the impact and probable fault points.
- Know thy system well and know the systems you depend on and the systems which depend on you better.
- Familiarity with tools like Kibana, Prometheus queries, Grafana, Eclipse MAT, kubectl, etc., helps a lot.
During Troubleshooting
- Before restarting to solve a problem, always take a thread dump and a heap dump of the Java process. It’s all about the evidence.
- Always check whether the problem affects only one instance. Multi-threading or concurrency issues such as deadlocks can cause that; in this case, taking a thread dump and restarting can quickly resolve the issue.
- There are many ways to solve a problem. Try to get out of the incident/issue first and then work on improvement. Sometimes out-of-the-box solutions can save us a lot of time and get us out of tough situations.
- When reporting an issue or passing the baton to another team, always provide as much supporting information as possible. Filling these in the FIR helps with triaging and can sometimes yield quick pointers from a bird’s-eye view. Some examples are as follows:
  - Kibana link containing logs
  - Inputs passed
  - APIs called
  - Errors observed
  - Time window
  - Grafana dashboard links containing key metrics which might indicate a problem
  - Sequence of steps performed
  - Configurations related to the issue
  - Code references for your team or others
- Actively report your observations in the internal communication channel used for triaging issues or incidents. It keeps the stakeholders posted and might enable parallel debugging and some Eureka moments.
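One way to keep handoffs complete is to treat the supporting information above as a checklist. The sketch below is only an illustration: the class name `IssueHandoff`, its field names, and the sample values are hypothetical and do not reflect the format of any real FIR.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class IssueHandoff:
    """Supporting information to attach when handing an issue to another team."""

    kibana_link: str = ""
    inputs_passed: str = ""
    apis_called: List[str] = field(default_factory=list)
    errors_observed: str = ""
    time_window: str = ""
    grafana_link: str = ""
    steps_performed: List[str] = field(default_factory=list)
    related_configurations: str = ""
    code_reference: str = ""

    def missing(self) -> List[str]:
        """Return the names of the fields that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Usage: fill in what is known, then see what is still worth collecting.
handoff = IssueHandoff(
    apis_called=["POST /payments"],  # hypothetical values
    errors_observed="HTTP 504 from gateway",
    time_window="2024-01-01T10:00Z to 2024-01-01T10:15Z",
)
missing_fields = handoff.missing()
```

Not every field applies to every issue, so the list of missing fields is a prompt for the reporter rather than a gate.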
After Troubleshooting
- Rigor in RCA and IAI is very important. It does not matter whether the issue affects one user or millions of users: ignoring a low-impact issue might let it escalate into a high-impact one.
- Always try to identify IAIs. The IAIs can be process related, product related, or tech related. They might be specific to one cluster or applicable across clusters; both are acceptable.
- When doing the RCA of an incident, consider the 5 Whys complete only when you know the root cause that will fix the issue for good and prevent recurrence.
- Capture all the evidence in the RCA Document. Since links expire, capturing screenshots helps.
- Do not forget to prioritize the IAIs.
- Not all the evidence may be available. Refer to the Tips and Tricks section for ways to fill the gaps.
- Find IAIs which improve the 4Os.
Tips and Tricks
There are times when the observations fall short and it is unclear what to do next. Some tips and tricks that can help are as follows:
Timeouts
- Ensure common libraries used have the right instrumentation and logs to track ingress and egress flows. If available, use them to check logs around that time window.
- Add logs and metrics around ingress and egress flows for an application and simulate again to reproduce.
- Check the resources allocated to Kubernetes pods. CPU throttling of even 1% can impact the application heavily.
- Check the code line by line from source to destination to spot inefficiencies. Some of the common places to look are:
  - Connection pool settings for HTTP calls.
  - Connection pool settings for DB calls.
  - Time taken by external calls: check the 95th and 99th percentile metrics and how much they vary.
  - Requests getting queued in the executor used for external calls.
  - Time spent in executor queues.
- Check the data. The problem might be limited to a specific setup or input, and the data associated with it might be the reason for the inefficiency.
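For the CPU throttling check above, one possible starting point is the `cpu.stat` file that the Linux cgroup CPU controller exposes inside a container. The sketch below parses its `nr_periods` and `nr_throttled` counters to estimate the fraction of scheduling periods in which the container was throttled. The file location depends on the cgroup version and node setup, so the code takes the file contents as text; the sample numbers are made up for illustration.

```python
def throttled_ratio(cpu_stat_text: str) -> float:
    """Fraction of CFS periods in which the cgroup was throttled.

    Expects the contents of a cgroup `cpu.stat` file, where each line
    is a counter name followed by an integer value.
    """
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No enforcement periods recorded, e.g. no CPU limit is set.
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Usage with made-up counters; in a pod the text would be read from cpu.stat.
sample = """nr_periods 2000
nr_throttled 50
throttled_time 1234567890
"""
ratio = throttled_ratio(sample)  # 50 throttled periods out of 2000
```

The counters are cumulative since the container started, so comparing two readings taken around the incident window gives a better picture than a single snapshot.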
Queries taking time on PostgreSQL RDS
- Use EXPLAIN to check the query plan, or EXPLAIN ANALYZE to also execute the query and see the actual timings.
- If the 95th percentile latency of your database queries is above 20 ms, especially for a table with fewer than 1 million rows, assume there is a problem and start analyzing it.
- The "Slowest individual queries" report in PGBadger helps.
- Ensure RDS has sufficient resources and is not running low on CPU, Memory, or IOPS.
- If the cause is still unclear, enable Performance Insights to monitor instance performance.
- PGWatch has nice dashboards which capture very useful information about what’s happening in the database. Check out the Juno integration in OWCC and ensure PGWatch is enabled with dashboards getting populated.
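When reading query plans, sequential scans that discard most of their rows are a common thing to look for first. The sketch below walks the output of `EXPLAIN (ANALYZE, FORMAT JSON)` and lists the sequential scan nodes. The plan shown is hand-written sample data with hypothetical table names, not output from a real database, and a sequential scan is not always a problem (on small tables it is often the cheapest option), so treat the result as a list of candidates to inspect.

```python
import json
from typing import Dict, Iterator, List


def walk(node: Dict) -> Iterator[Dict]:
    """Yield a plan node and all of its descendants, depth first."""
    yield node
    for child in node.get("Plans", []):
        yield from walk(child)


def sequential_scans(explain_json: str) -> List[Dict]:
    """List the Seq Scan nodes found in EXPLAIN (ANALYZE, FORMAT JSON) output."""
    root = json.loads(explain_json)[0]["Plan"]
    return [
        {
            "relation": node.get("Relation Name"),
            "actual_total_ms": node.get("Actual Total Time"),
            "rows_removed_by_filter": node.get("Rows Removed by Filter", 0),
        }
        for node in walk(root)
        if node.get("Node Type") == "Seq Scan"
    ]


# Hand-written sample plan; table names and numbers are hypothetical.
sample_plan = json.dumps([{
    "Plan": {
        "Node Type": "Nested Loop",
        "Actual Total Time": 48.2,
        "Plans": [
            {
                "Node Type": "Index Scan",
                "Relation Name": "accounts",
                "Actual Total Time": 0.05,
            },
            {
                "Node Type": "Seq Scan",
                "Relation Name": "transactions",
                "Actual Total Time": 47.9,
                "Rows Removed by Filter": 899990,
            },
        ],
    },
    "Execution Time": 48.5,
}])
scans = sequential_scans(sample_plan)
```

A scan that removes a large number of rows through its filter, like the one on the hypothetical `transactions` table here, is often a hint that an index on the filtered column is worth evaluating.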
Learning Resources
- kubectl cheatsheet
- Resource Management for Pods and Containers
- Lucene Syntax
- Building Secure and Reliable Systems
- Amazon RDS DB instance storage
- PostgreSQL Key Optimization Areas