Site Reliability Engineering at Starship | by Martin Pihlak | Starship Technologies




Photo by Ben Davis, Instagram slovaceck_

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it actually runs in the backend: remote control, path finding, matching robots with customers, fleet health management, but also interactions with customers and merchants. All of this needs to run 24/7, without interruption, and scale dynamically to match the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging Kafka is the platform of choice, and we use it for just about everything aside from shipping video streams from the robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.

A good portion of SRE time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding Pod Disruption Budgets or optimizing Spot instance usage. Sometimes it's like laying bricks – simply installing a Helm chart to provide a particular piece of functionality. Often, however, the "bricks" must be carefully picked and evaluated (is Loki a good fit for log management, is a service mesh a thing, and if so, which one) and sometimes the functionality doesn't exist in the world and has to be written from scratch. When that happens we usually turn to Python and Golang, but also Rust and C when needed.
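To give a flavor of one such "brick", here is a minimal sketch of what creating a Pod Disruption Budget from Go tooling could look like, using client-go. The deployment name, namespace and the minAvailable value are made up for illustration – this is not our actual configuration, and in practice the same thing is often just a few lines of YAML.

```go
package main

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the code runs inside the cluster; use clientcmd for a local tool.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Keep at least two pods of a hypothetical "route-calculator" service
	// running during voluntary disruptions such as node drains.
	minAvailable := intstr.FromInt(2)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "route-calculator", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "route-calculator"},
			},
		},
	}

	_, err = clientset.PolicyV1().PodDisruptionBudgets("default").
		Create(context.TODO(), pdb, metav1.CreateOptions{})
	if err != nil {
		panic(err)
	}
}
```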

Another important part of the infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDB – a strategy that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousands. Apache Kafka is part of the scaling story, but we also need to figure out partitioning, regional clustering and microservice database architecture. On top of that we are constantly developing tools and automation to manage the current database infrastructure. Examples: add MongoDB observability with a custom sidecar proxy to analyze database traffic, enable PITR support for databases, automate regular failover and recovery testing, collect metrics for Kafka re-sharding, enable data retention.
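To illustrate the sidecar idea, below is a minimal sketch of a TCP proxy that sits next to mongod and exposes connection and traffic counters to Prometheus. The ports and metric names are invented for the example, and a real traffic-analyzing proxy goes much further by parsing the MongoDB wire protocol – this is only the skeleton.

```go
package main

import (
	"io"
	"log"
	"net"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	connections = promauto.NewCounter(prometheus.CounterOpts{
		Name: "mongo_proxy_connections_total",
		Help: "Client connections accepted by the sidecar proxy.",
	})
	bytesCopied = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "mongo_proxy_bytes_total",
		Help: "Bytes forwarded through the proxy, by direction.",
	}, []string{"direction"})
)

func handle(client net.Conn) {
	defer client.Close()
	connections.Inc()

	upstream, err := net.Dial("tcp", "127.0.0.1:27017") // the real mongod
	if err != nil {
		log.Printf("upstream dial failed: %v", err)
		return
	}
	defer upstream.Close()

	// Shovel bytes in both directions and count them.
	go func() {
		n, _ := io.Copy(upstream, client)
		bytesCopied.WithLabelValues("client_to_server").Add(float64(n))
	}()
	n, _ := io.Copy(client, upstream)
	bytesCopied.WithLabelValues("server_to_client").Add(float64(n))
}

func main() {
	// Metrics endpoint scraped by Prometheus.
	go func() {
		http.Handle("/metrics", promhttp.Handler())
		log.Fatal(http.ListenAndServe(":9216", nil))
	}()

	ln, err := net.Listen("tcp", ":27016") // applications connect here
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}
```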

Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship's production. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. There are great opportunities to make an impact!

A day in the life of an SRE

Arrive at work, sometime between 9 and 10 a.m. (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night and see if there is anything interesting there.

Find that MongoDB connection latencies have spiked during the night. Digging into the Prometheus metrics with Grafana, find that this is happening while the backups are running. Why is this suddenly a problem – we've been running those backups for ages? Turns out that we compress the backups very aggressively to save on network and storage costs, and this consumes all available CPU. It looks like the load on the database has grown just enough to make this noticeable. This is happening on a standby node, with no production impact, but it is still a problem should the primary fail. Add a Jira item to fix this.

While at it, modify the MongoDB prober code (Golang) to add more histogram buckets to get a better understanding of the latency distribution. Run a Jenkins pipeline to put the new probe into production.
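For illustration, the bucket change boils down to something like the following; the metric name, bucket boundaries and the simulated probe are placeholders rather than the actual prober code.

```go
package main

import (
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// probeLatency tracks MongoDB probe round-trip time. The extra buckets below
// 100ms are the point of the change – the default buckets are too coarse to
// show where the latency actually sits.
var probeLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "mongodb_probe_latency_seconds",
	Help:    "Round-trip latency of MongoDB probe queries.",
	Buckets: []float64{.001, .0025, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5},
})

func probe() {
	start := time.Now()
	// Placeholder for the actual MongoDB round trip (ping / test query).
	time.Sleep(time.Duration(rand.Intn(20)) * time.Millisecond)
	probeLatency.Observe(time.Since(start).Seconds())
}

func main() {
	go func() {
		for {
			probe()
			time.Sleep(time.Second)
		}
	}()
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```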

At 10 a.m. there's the standup meeting – share your updates with the team and find out what others have been up to: setting up monitoring for a VPN server, instrumenting a Python application with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity issues, piloting canary deployments with Flagger.

After the meeting, resume the planned work for the day. One of the things I planned to do today was to set up an additional Kafka cluster in a test environment. We are running Kafka on Kubernetes, so it should be straightforward to take the existing cluster YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there's a good Kafka operator available by now? No, I'm not going there – too much magic, I want more explicit control over my StatefulSets. Plain YAML it is. An hour and a half later a new cluster is running. The setup was fairly straightforward; only the init containers that register Kafka brokers in DNS needed a configuration change. Generating the credentials for the applications required a small bash script to set up the accounts on Zookeeper. One bit that was left dangling was setting up Kafka Connect to capture database change log events – it turns out the test databases aren't running in ReplicaSet mode and Debezium can't get the oplog from them. Backlog this and move on.
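As an aside, the ReplicaSet requirement is easy to check from code. A small sketch with the official MongoDB Go driver – the connection URI is a placeholder:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Placeholder URI – point this at the test database in question.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// The "hello" command (isMaster on older servers) reports the replica
	// set name if the node is part of one.
	var res bson.M
	if err := client.Database("admin").RunCommand(ctx, bson.D{{Key: "hello", Value: 1}}).Decode(&res); err != nil {
		log.Fatal(err)
	}

	if setName, ok := res["setName"]; ok {
		fmt.Printf("replica set %q – Debezium can read the oplog\n", setName)
	} else {
		fmt.Println("standalone mongod – no oplog for Debezium to tail")
	}
}
```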

Now it's time to prepare a scenario for the Wheel of Misfortune exercise. At Starship we run these to improve our understanding of the systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I'll set up a load test with Hey to overload the microservice for route calculations. Deploy this as a Kubernetes job called "haymaker" and hide it well enough so that it doesn't immediately show up in the Linkerd service mesh (yes, evil 😈). Afterwards run the "Wheel" exercise and take note of any gaps we have in playbooks, metrics, alerts and so on.
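Conceptually the load job is nothing clever – just a lot of concurrent requests against the route-calculation service. A toy equivalent in Go is sketched below; the target URL, concurrency and duration are invented for the example, and in practice Hey already does this (and more) out of the box.

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const (
		target   = "http://route-calculator.test.svc.cluster.local/health" // hypothetical endpoint
		workers  = 50
		duration = 5 * time.Minute
	)

	var requests, errors int64
	deadline := time.Now().Add(duration)
	client := &http.Client{Timeout: 2 * time.Second}

	// Hammer the target from many goroutines until the deadline passes.
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for time.Now().Before(deadline) {
				resp, err := client.Get(target)
				atomic.AddInt64(&requests, 1)
				if err != nil || resp.StatusCode >= 500 {
					atomic.AddInt64(&errors, 1)
				}
				if resp != nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()

	log.Printf("done: %d requests, %d errors", requests, errors)
}
```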

In the last hours of the day, block all interruptions and try to get some coding done. I've reimplemented the Mongoproxy BSON parser as asynchronously streaming (Rust + Tokio) and want to figure out how well it works with real data. Turns out there's a bug somewhere in the guts of the parser and I need to add some deep logging to figure it out. Find a wonderful tracing library for Tokio and get carried away with it …

Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with colleagues have been edited out. We are hiring.




