Identifying Incidents
There are several ways that DevOps engineers can be notified of an incident:
- Help Centre support ticket submission.
- Direct communication (email, phone) to a team member.
- Developers or DevOps engineers noticing outliers in metrics.
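As an illustration of the last point, spotting an outlier in a metric can be partly automated. The following is a minimal sketch, not a description of our monitoring stack: the function name `find_outliers`, the sample latency values, and the 3.5 cut-off are all assumptions chosen for the example. It uses the median absolute deviation (MAD) rather than the mean, so a single extreme sample does not hide itself by inflating the average.

```python
from statistics import median


def find_outliers(samples, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds `threshold`.

    Hypothetical helper for illustration; the 3.5 threshold is a common
    rule of thumb, not a value taken from our monitoring configuration.
    """
    if len(samples) < 3:
        return []
    mid = median(samples)
    # Median absolute deviation: a robust measure of spread.
    mad = median(abs(x - mid) for x in samples)
    if mad == 0:
        # All samples are (nearly) identical; nothing can be ranked as an outlier.
        return []
    return [
        (i, x)
        for i, x in enumerate(samples)
        if 0.6745 * abs(x - mid) / mad > threshold
    ]


# Example: response times in milliseconds (made-up values).
latencies_ms = [120, 118, 125, 122, 119, 121, 950, 123]
print(find_outliers(latencies_ms))
```

A flagged sample is only a prompt to look closer; whether it is an incident is still a human decision.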
Types of incidents
Vendor (Google Cloud) Incident
If our cloud hosting provider suffers an isolated incident affecting one of our instances, we should remain online thanks to the High Availability setups we’ve configured. If there is a widespread outage in the hosting Region, we will be offline until the upstream service is restored. In either case, you should:
- Confirm the issue really is the upstream provider.
- Monitor the status updates of the upstream provider and communicate internally so we can inform users.
- If the service disruption lasts longer than a couple of hours, start assessing the possibility of migrating the affected service to another Region.
Security Breach
If you notice a security breach of any kind, you should:
- Communicate the issue internally so it can be escalated and communicated to users. We might need to contact customers directly to inform them.
- Collect evidence that made you classify this as a security breach.
In case of affected instances:
- Turn them off and create snapshots for future investigation.
- Rotate any credential that might have been present in the instances.
In case of affected credentials, for example through email phishing:
- Rotate any credential that might have been compromised.
- Assume more things have been compromised and investigate other possible affected targets.
Security breaches include, but are not limited to:
- Loss or theft of personal computing devices used to store or access Content Ignite systems.
- Breaches of any Content Ignite systems.
- Unintended disclosure of Content Ignite sensitive information.
Reacting to Incidents
- Ensure the whole team knows by announcing it on the #tcm_everyone channel. Use @channel to attract everyone’s attention.
- Try to identify which services are being affected. If this takes more than a couple of minutes, coordinate with other online engineers and ask for help. This might mean starting a Hangouts chat where you can discuss your findings during the incident without stopping the actual remediation efforts.
- When you’ve identified the affected services, decide on the severity of the incident:
  - Was there a security breach?
  - Is customer data affected?
  - Is the incident part of a larger vendor outage?
  - Will a reliable fix be easy to produce?
  - Can you do it on your own?
  - How long will it take you to deploy it?
  - Do you need someone to review your fix before and after you deploy it?
  - Do we need to go into maintenance mode in the meantime?
  - Are you sure what you are fixing is the actual root cause of the problem?
- Make sure the DevOps team are aware of the issue. If none of them are online, contact them immediately by phone. They will most likely know about the issue before anyone else, but verify if you’re unsure.
- Create an activity log to track what changes are being made and what is known about the outage. This could be small updates written in a Slack channel like #dev_department or in a Google Docs document. This is very useful for hand-overs and for writing the post-mortem.
- Discuss in the Content Ignite Team channel if we should enter maintenance mode. Maintenance mode should be used if the outage is expected to take more than a few minutes. If it’s decided that we should enter maintenance mode, a developer should immediately do so.
- Log into status.contentignite.com and create a new incident. Use the Generic Incident Report template and customise the messages as you see fit. Update the component statuses accordingly.
- Update the team on the #tcm_everyone channel by sharing a link to the status report.
- As we learn more about the incident, it is important that we keep updating the status page.
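The severity questions in the steps above can be captured as a small checklist so that the answers are recorded consistently. This is only a sketch under stated assumptions: the `Triage` fields mirror some of the questions, but the labels "critical", "major" and "minor" and the rules that map answers to them are hypothetical and are not defined anywhere in this process.

```python
from dataclasses import dataclass


@dataclass
class Triage:
    """Answers to some of the severity questions (illustrative only)."""
    security_breach: bool
    customer_data_affected: bool
    vendor_outage: bool
    fix_is_easy: bool


def classify(t: Triage) -> str:
    # Hypothetical mapping: the labels and their ordering are assumptions,
    # not an agreed severity scale.
    if t.security_breach or t.customer_data_affected:
        return "critical"
    if t.vendor_outage or not t.fix_is_easy:
        return "major"
    return "minor"


# Example: a regional vendor outage with no data exposure.
print(classify(Triage(security_breach=False, customer_data_affected=False,
                      vendor_outage=True, fix_is_easy=True)))
```

Writing the answers down in this form also makes the first activity log entry easy to produce.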
After the Incident is Solved
- Verify that the incident has indeed been resolved.
- Add a “Resolved” update to the status page.
- Update the team on the #tcm_everyone channel.
- Make sure we’ve left maintenance mode if it was enabled.
- If the maintenance took much longer than planned, we will prepare an email explaining why.
- Verify that monitoring is in place to detect this issue in the future.
- If the incident was long in duration or broad in affected services, create a post-mortem analysis with a detailed timeline so we can better understand the root cause and improve the process in the future.
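If the activity log was kept as structured entries rather than free-form messages, the post-mortem timeline can be produced directly from it. The sketch below assumes a hypothetical `ActivityLog` class; the author names and notes are made up, and nothing here reflects an existing tool of ours.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActivityLog:
    """Hypothetical in-memory activity log for one incident."""
    entries: list = field(default_factory=list)

    def record(self, author, note, when=None):
        # Default to the current time in UTC so entries from different
        # time zones sort consistently.
        when = when or datetime.now(timezone.utc)
        self.entries.append((when, author, note))

    def timeline(self):
        # Sort by timestamp so late-entered notes still appear in order.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{when:%Y-%m-%d %H:%M} UTC  {author}: {note}"
            for when, author, note in ordered
        )


# Example with made-up entries, recorded out of order.
log = ActivityLog()
log.record("alice", "Entered maintenance mode",
           datetime(2024, 1, 5, 10, 12, tzinfo=timezone.utc))
log.record("bob", "Outage reported via Help Centre",
           datetime(2024, 1, 5, 10, 3, tzinfo=timezone.utc))
print(log.timeline())
```

The same entries can be pasted into a Slack channel or a Google Docs document during the incident; the point is only that each one carries a timestamp, an author and a short note.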