The day DevOps became NoOps

The Internet haz gone down, at least parts of it. Last night, the Virginia EC2 region started experiencing all sorts of issues with its EBS storage (some suspect it’s the bandwidth), and we are still feeling the impact. For blitz.io (which is down, BTW), we use Heroku, which runs out of the Virginia region. When we first architected blitz.io, we were afraid of regional failures and database bottlenecks, so we ended up deploying an entire CouchDB cluster across multiple regions with master-to-master replication.

Given the number of dynos we run on Heroku, we were pretty sure that the application wouldn’t be the bottleneck. Boy, were we wrong! As we speak, all of our CouchDB instances are live and the scale engines are reachable across all five regions (these don’t have any local storage), but Heroku is down hard.

picture-1.png

The messages (must be an automated script) are repeating every half hour and all those wonderful DevOps scripts have no place to run. :(

Fixing Heroku (feature request)

I posted a note on the Heroku forums and here’s what we need:

  • Add additional EC2 regions to run the dynos and workers
  • Add the ability to specify regions where the dynos (and workers) are deployed
  • Make proxy.heroku.com resolve to geo-local IPs for regional performance
  • Add a command-line option to the Heroku gem to manage this
  • Make the region available as an ENV variable to each dyno, so that the app can pick the local DB
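
To make the last request concrete, here is a rough sketch of how an app could pick its nearest CouchDB once each dyno knows its region. Everything named here is an assumption for illustration: `HEROKU_REGION` is a made-up variable name for the requested feature, and the region keys and hostnames are placeholders, not our real endpoints.

```python
import os

# Hypothetical region -> CouchDB URL map; keys and hostnames are placeholders.
COUCHDB_BY_REGION = {
    "us-east-1": "https://couch-us-east.example.com:6984",
    "us-west-1": "https://couch-us-west.example.com:6984",
    "eu-west-1": "https://couch-eu-west.example.com:6984",
}

# Region to fall back on when the variable is missing or unrecognized.
DEFAULT_REGION = "us-east-1"


def local_couchdb_url(env=None):
    """Return the CouchDB URL for the dyno's region, or the default one."""
    env = os.environ if env is None else env
    # HEROKU_REGION is an assumed name for the requested ENV variable.
    region = env.get("HEROKU_REGION", DEFAULT_REGION)
    return COUCHDB_BY_REGION.get(region, COUCHDB_BY_REGION[DEFAULT_REGION])
```

With master-to-master replication, any node can take reads and writes, so falling back to the default region when the variable is missing or unknown keeps the app working, just with more latency.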

I still believe that the cloud is all about the ecosystem, but it’s one stuck-up system today. DevOps just became NoOps for lots of companies.
