Hi all,
On December 7th, we had elevated error rates in our database during the peak of the workday. As a result, many users were unable to update their Streak boxes and contacts, and there were sporadic issues loading data throughout the day.
We know you rely on Streak to keep your business running and we apologize for the problems that this caused you. I wanted to give a little bit of context on what happened, how we resolved the issue, and what we’re doing to make sure it doesn’t happen again.
For background, Streak historically has been powered by the Google Cloud Datastore database. Cloud Datastore is very reliable and easy to maintain, but it's restrictive in how we can access data. For instance, if we want to get all boxes connected to the contacts in an email thread, we first have to manually fetch all of the contacts connected to the thread, and then in a second step manually fetch all of the boxes connected to those contacts. This makes the Streak experience slower and limits the amount of context we can give you about who you're talking to, which means more manual work for you.
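To make that concrete, here's a minimal sketch of the two-step fetch pattern, using hypothetical helper names and in-memory stand-ins rather than our actual data layer:

```python
# A minimal sketch of the two-step fetch pattern (illustrative only; these
# helper names and data shapes are not Streak's actual data layer).
from typing import List

def fetch_contacts_for_thread(thread_id: str) -> List[str]:
    # Stand-in for the first Datastore query: thread -> contact IDs.
    return ["contact-1", "contact-2"]

def fetch_boxes_for_contact(contact_id: str) -> List[str]:
    # Stand-in for the second Datastore query: contact -> box IDs.
    return [f"box-for-{contact_id}"]

def boxes_for_thread(thread_id: str) -> List[str]:
    # Step 1: fetch every contact connected to the email thread.
    contacts = fetch_contacts_for_thread(thread_id)
    # Step 2: in a second round of queries, fetch the boxes connected
    # to each of those contacts.
    boxes: List[str] = []
    for contact_id in contacts:
        boxes.extend(fetch_boxes_for_contact(contact_id))
    return boxes

print(boxes_for_thread("thread-123"))
```

Each of those steps is a separate round trip to the database, which is where the extra slowness comes from.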
To better support this kind of lookup, we've been migrating some data from Cloud Datastore onto a platform based on MySQL, an industry-standard database with much better support for these kinds of context-based queries. Early in the migration, we ran into problems where query performance was limited by the disk performance of Google's hosted MySQL service. To work around this limitation, we dramatically over-provisioned processing power and memory to make up for the disk performance shortfall.
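Coming back to the earlier example, a relational database can answer that kind of question in a single query. Here's a hedged sketch; the table and column names are hypothetical, not our actual schema:

```python
# Illustrative only: with a relational schema, the two-step fetch above
# collapses into a single join. Table and column names are hypothetical.
BOXES_FOR_THREAD_SQL = """
    SELECT DISTINCT b.*
    FROM boxes AS b
    JOIN contact_boxes AS cb ON cb.box_id = b.id
    JOIN thread_contacts AS tc ON tc.contact_id = cb.contact_id
    WHERE tc.thread_id = %s
"""
```

Letting the database do this work in one pass is the main reason we expect the migration to make Streak feel faster and give you more context with less manual work.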
We wanted to make sure we had a stable foundation for future work, so on Sunday we moved to a different set of instances with much better disk performance. As part of that move, we returned to the instance size we had been using before the disk issues. Unfortunately, in the intervening period we had deployed additional queries that legitimately used more of that extra processing power than we anticipated, and this didn't become evident until we hit the workday peak. Since our database was at full capacity, it wasn't feasible to migrate to a larger instance until the workload lessened as folks signed off for the evening in Europe and North America. We made some gains by optimizing queries, but error rates and latency remained higher than is acceptable for the remainder of the workday.
On the evening of December 7th, we migrated to instances that have both the better disk performance and the higher processing power and memory. Our metrics are back to their target range, and we’ve added additional monitoring in this area.
We've also made process changes to ensure that, where possible, we add extra capacity in advance of needing it in future migrations. We appreciate your trust in us and apologize again for the issue.
Sincerely,
Fred Wulff
Engineering @ Streak