Saturday, January 15, 2005

LiveJournal Goes Down

I think (just kidding) that this is a plot by Microsoft or Google, who got scared of competition from LiveJournal (LJ) after its merger with SixApart. Microsoft is actually based in that same Seattle (where LJ's downed servers are located), and Google has a satellite office there.

Fitzpatrick's comments on the power outage at LJ

(source on Slashdot)

(rough summary in Russian)

They all came back up when the power came back.

But we intentionally don't have databases come back up on boot because if there was a blip, we want to do an integrity check first. (we run InnoDB, so it's ACID, but we're paranoid)

We have clusters of 2 identical databases in separate cabinets, separate switches, separate Internap power feeds... so normally losing one database in each cluster doesn't matter: the other one gets used. But when we lose every single database, in all clusters, all at once... that's the time to be paranoid and double check stuff.


At this point all my whiteboards are full of boxes of each database cluster, the machines in that cluster, which have passed their checksum tests (innodb checksums each 16k page), which replayed their replay/undo logs, where in binlogs each was writing/reading/executing, etc...

So lots of waiting now on the checksum validators. I don't want to put a machine back in and find out in a week there was a database page that was corrupt because the battery-backed write-back cache on the RAID card didn't work as advertised. (which happens on about 95% of RAID cards, in my experience, because they're mostly crap, even the most expensive ones...)
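To make the "checksum validators" idea concrete, here is a minimal sketch of per-page verification. It is an illustration only: it assumes a plain CRC32 over each 16 KiB chunk and a separately stored list of expected values, whereas InnoDB's real on-disk format and checksum algorithm are different. The names `build_checksums` and `find_corrupt_pages` are hypothetical.

```python
import zlib

PAGE_SIZE = 16 * 1024  # 16 KiB, the page size mentioned in the quote


def build_checksums(data: bytes) -> list:
    """Return one CRC32 value per PAGE_SIZE chunk of data."""
    return [
        zlib.crc32(data[offset:offset + PAGE_SIZE])
        for offset in range(0, len(data), PAGE_SIZE)
    ]


def find_corrupt_pages(data: bytes, expected: list) -> list:
    """Return the indices of pages whose checksum differs from the expected one."""
    actual = build_checksums(data)
    bad = [n for n, (a, e) in enumerate(zip(actual, expected)) if a != e]
    # Pages that exist on only one side are treated as mismatches too.
    shorter, longer = sorted((len(actual), len(expected)))
    bad.extend(range(shorter, longer))
    return bad
```

A validator like this has to read every page of every file, which is presumably why the waiting takes so long on large databases.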

Also whenever there's any doubt about something's integrity, we backup or snapshot the potentially corrupt version before operating on it. That operation can take time too.
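The "snapshot before operating" habit can be sketched in a few lines. This is an assumption-laden illustration, not LJ's actual tooling: it copies a single file with the standard library, and `snapshot_then_run`, its parameters, and the `.bak` naming are all hypothetical.

```python
import shutil
import time
from pathlib import Path


def snapshot_then_run(path, operation, snapshot_dir):
    """Copy the file at path into snapshot_dir, then call operation(path).

    Returns the path of the snapshot so the original state can be restored
    if the operation makes things worse.
    """
    src = Path(path)
    dest_dir = Path(snapshot_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%S")
    snapshot = dest_dir / f"{src.name}.{stamp}.bak"
    shutil.copy2(src, snapshot)  # copy contents and metadata before touching anything
    operation(src)
    return snapshot
```

Copying a large data file is itself slow, which matches the remark that this step "can take time too".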

It's going to be a fun night.

Update: news from /powerloss/:

Update #4: 9:12 am: We're back at it. We'll have the site up soon in some sort of crippled state while the clusters with the oldest backups continue to catch up.

In short, this means everything will be fine very soon; we just need to wait a little longer.