Chatie API Server Down Accident Report #98

@su-chang

Token Service Discovery Service Accident

Our Wechaty puppet service discovery service has been experiencing outages since 3 pm on Feb 7.

  1. 10 am Feb 7: We noticed abnormal disk usage on some instances, cleared the log files, and got the instances running normally again. At that time api.chatie.io was working well.
  2. 3 pm Feb 7: The problem broke out in the afternoon. While working on it, we found that api.chatie.io was returning HTTP status code 503.
  3. 2 am Feb 8: @huan shared some detailed information from Heroku, see: 🔥🔥🔥 api.chatie.io service abnormal, HTTP error code 503 #97 (comment)
  4. 8 am Feb 8: We confirmed that api.chatie.io went out of service because it received too many requests (token init on api.chatie.io) within a few seconds.
  5. 9 am Feb 8: We found the bug in wechaty-puppet-workpro: a Node.js timer function that inits the token on api.chatie.io was not cleared correctly. We also noted that the only temporary fix was to restart all containers.
  6. 10 am Feb 8: We confirmed the time window for restarting all containers.
  7. 2 pm Feb 8: We restarted all containers.
  8. 2:30 pm Feb 8: The server was fully restored.
  9. 6 pm Feb 8: We created the hotfix PR to fix this problem.
  10. 9 pm Feb 8: The PR was merged and was ready to deploy.
  11. 0 pm Feb 9: We started deploying to some instances.
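
For readers unfamiliar with this class of bug, here is a minimal sketch of how an uncleared Node.js timer can multiply requests, and one common way to guard against it. This is not the actual wechaty-puppet-workpro code or the hotfix: `TokenRegistrar`, `register`, and `intervalMs` are hypothetical names chosen for illustration. The idea is that if `start()` is called again (for example on a reconnect) without clearing the previous interval, every call leaves another timer firing token-init requests, so the guard clears any existing timer first.

```typescript
// Hypothetical sketch: none of these names come from wechaty-puppet-workpro.
type TimerHandle = ReturnType<typeof setInterval>;

class TokenRegistrar {
  private timer: TimerHandle | undefined;
  private readonly register: () => void;
  private readonly intervalMs: number;

  // `register` stands in for the token init request sent to the discovery service.
  constructor(register: () => void, intervalMs: number) {
    this.register = register;
    this.intervalMs = intervalMs;
  }

  // Safe to call repeatedly: the previous interval is cleared first,
  // so at most one timer is active at any time.
  start(): void {
    this.stop();
    this.timer = setInterval(this.register, this.intervalMs);
  }

  // Clears the active interval, if any. Should be called on shutdown or logout.
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  isRunning(): boolean {
    return this.timer !== undefined;
  }
}
```

Without the `this.stop()` call inside `start()`, three calls to `start()` would leave three intervals running, each sending its own requests. Multiplied across many containers, that pattern could plausibly produce the kind of request burst described in step 4.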
