From 02:16 UTC on April 2nd to 13:52 UTC on April 14th, around 10% of our Gmail Add-on users experienced increasingly long delays between new messages arriving on existing stored conversations in Gmail and those messages appearing in Capsule. This was due to a processing issue, which we have now resolved.
We know a large number of our users have come to rely on our new Gmail Add-on and we’re extremely sorry it hasn’t been working to the usual high standard you would expect from a Capsule feature.
Therefore we feel it’s only right to provide a more detailed explanation and assurances that we’re working to prevent this from happening again.
First, a little background: over the past two weeks we had been receiving sporadic reports from our users that our Gmail Add-on wasn’t storing new messages that were part of an existing stored conversation.
We immediately began an investigation into the issue and identified a potential cause. This related to a timeout inside our code that communicates with Gmail.
A fix for the timeout and some additional logging to assist with the investigation were rolled out. We then asked some of the affected users whether the issue was resolved for them. It quickly became apparent that it was not.
At that point we started analysing the additional logging we had put in place to see if there was any pattern to the reports we were receiving.
The analysis was still ongoing on Friday (April 12th) when we noticed a sudden, unusual increase in the size of one of the queues that form part of the processing for our Gmail Add-on.
Each queue is a stream of events we receive from Google for a subset of the user mailboxes which have our Gmail Add-on enabled.
We consume these events so we know when something has happened on a stored conversation, i.e. a new message has been sent or received. When we receive an event for a stored conversation, we request all the updates since we last fetched its history and store any new messages that have arrived since.
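To illustrate the idea, here is a minimal Python sketch of this kind of incremental sync. It is not the actual Capsule code: `ConversationStore`, `handle_event` and the `fetch_updates_since` callable are hypothetical names, and the "cursor" is simply an assumed marker for the point up to which a conversation's history has already been fetched.

```python
from dataclasses import dataclass, field


@dataclass
class ConversationStore:
    """Hypothetical in-memory stand-in for stored conversations."""
    # conversation id -> position of the last update already fetched
    cursors: dict = field(default_factory=dict)
    # conversation id -> messages stored so far
    messages: dict = field(default_factory=dict)


def handle_event(store, conversation_id, fetch_updates_since):
    """Process one event and return the messages that were newly stored.

    fetch_updates_since(conversation_id, cursor) is an assumed callable that
    returns (new_messages, new_cursor) for everything after `cursor`.
    """
    if conversation_id not in store.cursors:
        return []  # not a stored conversation, so nothing to do

    cursor = store.cursors[conversation_id]
    new_messages, new_cursor = fetch_updates_since(conversation_id, cursor)

    # Store only what arrived since the last fetch, then advance the cursor
    # so the next event does not fetch the same messages again.
    store.messages.setdefault(conversation_id, []).extend(new_messages)
    store.cursors[conversation_id] = new_cursor
    return new_messages
```

Because the cursor only advances after a successful fetch, a second event for the same conversation with nothing new simply stores nothing.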
The increasing queue size suggested we had an issue processing the event stream, which could easily have accounted for the reports of missing messages, so we focused our investigation on that.
After a short time we identified an issue fetching conversation histories from a single user’s mailbox. The fetch from Gmail was taking an abnormally long time and hitting a transaction timeout we have in our code. In an effort to recover automatically, our code re-attempted the fetch. This caused an infinite loop that we didn’t recover from.
Unfortunately this effectively stalled the processing of the event stream from Google for the users sharing the same queue as that user.
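The failure mode described above, where one event is retried forever and holds up everything queued behind it, can be sketched in a few lines of Python. This is an illustration under assumptions rather than our real consumer: `FetchTimeout`, `process_queue` and the `max_attempts` limit are hypothetical, and the sketch shows one common way to bound retries rather than the exact fix applied.

```python
class FetchTimeout(Exception):
    """Raised by the hypothetical fetch when it exceeds its time limit."""


def process_queue(events, fetch, max_attempts=3):
    """Process events in order, giving up on an event after max_attempts timeouts.

    Returns (processed, parked). Without the attempt limit, an event whose
    fetch always times out would be retried forever and block every event
    queued behind it.
    """
    processed, parked = [], []
    for event in events:
        for attempt in range(1, max_attempts + 1):
            try:
                fetch(event)
                processed.append(event)
                break
            except FetchTimeout:
                if attempt == max_attempts:
                    # Set the event aside for later inspection and move on.
                    parked.append(event)
    return processed, parked
```

With a limit in place, an event that can never be fetched costs a fixed number of attempts instead of stalling the whole queue.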
Once we found this loop, we disabled the Gmail Add-on for that single user and restarted the process which consumes the event stream. This kicked off a recovery that worked through the backlog of events. The recovery took a number of hours due to the volume of events which had built up.
After the backlog was processed we confirmed that new messages that were part of an existing stored conversation were being processed and stored in near real time once again.
We’re now in the process of introducing additional monitoring and alerting so we are made aware of any delays in processing.
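As a rough illustration of what such a delay check could look like, the Python sketch below alerts when the oldest unprocessed event has been waiting longer than a threshold. The function names and the five-minute default are assumptions made for the example, not a description of the monitoring being introduced.

```python
from datetime import datetime, timedelta, timezone


def queue_lag(oldest_event_received_at, now):
    """Return how long the oldest unprocessed event has been waiting."""
    return now - oldest_event_received_at


def should_alert(oldest_event_received_at, now, threshold=timedelta(minutes=5)):
    """True when the oldest unprocessed event is older than the threshold."""
    if oldest_event_received_at is None:  # the queue is empty
        return False
    return queue_lag(oldest_event_received_at, now) > threshold


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(should_alert(now - timedelta(hours=2), now))
```

Measuring the age of the oldest event, rather than only the queue length, catches a stalled consumer even when few events are arriving.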
We’re also introducing additional safeguards to ensure an issue with a single user doesn’t impact other users sharing the same queue.
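One possible shape for such a safeguard is to sideline a user whose events keep failing, so the rest of the queue carries on. The Python sketch below is a simplified illustration only: `UserQuarantine`, `drain` and the failure limit are hypothetical and do not describe the safeguards actually being built.

```python
from collections import defaultdict


class UserQuarantine:
    """Tracks consecutive failures per user and sidelines repeat offenders."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = defaultdict(int)
        self.quarantined = set()

    def record_failure(self, user_id):
        self.failures[user_id] += 1
        if self.failures[user_id] >= self.max_failures:
            self.quarantined.add(user_id)

    def record_success(self, user_id):
        self.failures[user_id] = 0


def drain(events, process, quarantine):
    """Process (user_id, payload) events, skipping quarantined users.

    Returns the events that failed or were skipped so they can be replayed
    once the underlying problem is fixed.
    """
    deferred = []
    for user_id, payload in events:
        if user_id in quarantine.quarantined:
            deferred.append((user_id, payload))
            continue
        try:
            process(user_id, payload)
            quarantine.record_success(user_id)
        except Exception:
            quarantine.record_failure(user_id)
            deferred.append((user_id, payload))
    return deferred
```

Deferred events are kept rather than dropped, so the affected user's messages can still be recovered later without delaying anyone else.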
And finally, we’re looking at how we can re-architect the solution so that, if another backlog of events ever does form, we can process it as quickly as possible.
This was quite a challenging incident for us and we’ve learnt a lot from it. Once again we’re extremely sorry for the impact this had on our users and please be assured that we’re working hard to prevent it happening again.