I’ve previously written Some Notes on DRBD [1] and a post about DRBD Benchmarking [2].
Previously I had determined that replication protocol C gives the best performance for DRBD, that the batch-time parameters for Ext4 aren’t worth touching for a single IDE disk, that barrier=0 gives a massive performance boost, and that DRBD gives a significant performance hit even when the secondary is not connected. Below are the results of some more tests of delivering mail from my Postal benchmark to my LMTP server, which uses the Dovecot delivery agent to write it to disk. The rates are in messages per minute, where each message averages 70K in size. The ext4 filesystem is used for all tests and the filesystem features list is “has_journal ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize”.
Configuration                               p4-2.8 (messages/minute)
Default Ext4                                1663
barrier=0                                   2875
DRBD no secondary al-extents=7              645
DRBD no secondary default                   2409
DRBD no secondary al-extents=1024           2513
DRBD no secondary al-extents=3389           2650
DRBD connected                              1575
DRBD connected al-extents=1024              1560
DRBD connected al-extents=1024 Gig-E        1544
The al-extents option determines the size of the dirty areas that need to be resynced when a failed node rejoins the cluster. The default is 127 extents of 4M each for a block size of 508MB to be synchronised. The maximum is 3389 for a synchronisation block size of just over 13G. Even with fast disks and gigabit Ethernet it’s going to take a while to synchronise things if dirty zones are 13GB in size. In my tests using the maximum size of al-extents gives a 10% performance benefit in disconnected mode while a size of 1024 gives a 4% performance boost. Changing the al-extents size seems to make no significant difference for a connected DRBD device.
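The arithmetic behind those figures can be checked with a short Python sketch. It uses only numbers from this post (4M per extent and the disconnected-mode delivery rates); the function names are made up for illustration.

```python
# Figures from the post: each extent covers 4M, and the measured delivery
# rates (messages per minute) are for DRBD with the secondary disconnected.
EXTENT_MB = 4

def hot_area_mb(al_extents):
    """Size in MB of the area covered by the given number of extents."""
    return al_extents * EXTENT_MB

def gain_percent(rate, baseline):
    """Percentage improvement of rate over baseline."""
    return (rate - baseline) / baseline * 100

# al-extents value -> messages per minute (127 is the default)
disconnected = {127: 2409, 1024: 2513, 3389: 2650}

for extents, rate in disconnected.items():
    print(f"al-extents={extents}: {hot_area_mb(extents)}MB, "
          f"{gain_percent(rate, disconnected[127]):+.1f}% vs default")
```

The default works out to 508MB and the maximum to 13556MB, which matches the "just over 13G" figure above.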
All the tests on connected DRBD devices were done with 100baseT apart from the last one which was a separate Gigabit Ethernet cable connecting the two systems.
Conclusions
For the level of traffic that I’m using it seems that Gigabit Ethernet provides no performance benefit. The fact that it gave a slightly lower result is not relevant as the difference is within the margin of error.
Increasing the al-extents value helps with disconnected performance, a value of 1024 gives a 4% performance boost. I’m not sure that a value of 3389 is a good idea though.
The ext4 barriers are disabled by DRBD so a disconnected DRBD device gives performance that is closer to a barrier=0 mount than a regular ext4 mount. With the significant performance difference between connected and disconnected modes it seems possible that for some usage scenarios it could be useful to disable the DRBD secondary at times of peak load – it depends on whether DRBD is used as a really current backup or a strict mirror.
Future Tests
I plan to do some tests of DRBD over Linux software RAID-1 and tests to compare RAID-1 with and without bitmap support. I also plan to do some tests with the BTRFS filesystem. I know it’s not ready for production but it would still be nice to know what the performance is like.
But I won’t use the same systems, as they don’t have enough CPU power. In my previous tests I established that a 1.5GHz P4 isn’t capable of driving the 20G IDE disk to its maximum capacity and I’m not sure that the 2.8GHz P4 is capable of running a RAID to its capacity. So I will use a dual-core 64bit system with a pair of SATA disks for future tests. The difference in performance between 20G IDE disks and 160G SATA disks should be a lot less than the performance difference between a 2.8GHz P4 and a dual-core 64bit CPU.
What on earth was Ubuntu thinking when it introduced Unity? I dislike it a lot. In fact I heard today they dropped the development team working on Kubuntu too. What on earth is going on?
You surely know someone like this, or maybe that someone is you. Second guessers research and research decisions before hesitantly stepping forward, only to wonder whether they made the right decision in the end.
If you run Simpana 9.0 and have some issues backing up Windows System State via CentOS/RHEL 6.2 MediaAgents, don’t fear: SP5a will fix some quirks.
I had an issue matching the above and it seems it will be resolved in SP5a (when released). If you have any issues, I recommend opening a support ticket with CommVault to have it checked out.
Dear people,
Many people have those little e-tags in their cars these days. They allow us to drive along tollways without having to stop and throw money into a machine. Another area where people have to stop their cars and throw money into a machine is in parking stations. We also have to grab the card that it spits out, carry it around and remember to pay for it before we leave, and if it doesn't validate or we lose it we have trouble getting out. However, you people have the solution for that.
Instead, we could drive up to the entrance gate of a parking station, the toll sensor would go 'beep', the boom gate would open and we'd drive in. Then, when we wanted to leave, we'd drive up to the exit gate, the toll sensor would go 'beep', the boom gate would open, we'd drive out again and the parking cost would automatically be debited to our account.
This would save us lots of time - time otherwise spent getting a ticket, paying for it, and feeding it into the exit gate. It would save you a lot of cost maintaining and repairing those machines. I'm sure you're already doing data mining on the journeys people take - this gives you a lot more interesting data. And you get a lot more people wanting to use your e-tags - people who like the convenience of being able to drive right into a shopping centre but aren't already toll users.
Go ahead and use this idea, I don't need any credit - just improve the planet.
Have fun,
Paul
Henrik has already posted it over on the Drizzle Blog, but I thought I’d give a shout out here too.
We’re holding a Drizzle Day right after the Percona Live MySQL Conference and Expo in April. So, since you’re all like me and don’t book your travel this far in advance, it’ll be easy to stay for the extra day and come and learn awesome things about Drizzle.
I’m also pretty glad that my employer, Percona is sponsoring the event.
Congratulations to Clinton Roy, the winner of the by-election. Clinton will be joining the 2012 council as OCM.
If you want to vote in the Drupal Association election you have less than 10 hours left to do so!
Where to vote?
https://association.drupal.org/voting
Who can vote?
Voting is open to all individuals who registered an account on drupal.org prior to January 18, 2012 and who have logged into that account at least once in the one-year period prior to February 3, 2012. There is no need to register to vote. The voting system has been set up and prepopulated with the list of eligible voters.
How to vote?
Rank the candidates in order of your preference for them to win. You don't have to rank all the candidates, but should at least choose 2, because we are electing 2 "At Large" members to sit on the DA board.
For all the details...
Go to https://association.drupal.org/2012-elections-voting
Why Vote?
This election is the community's chance to appoint someone to the board of the Drupal Association. The DA does NOT govern the Drupal project. It supports the project by providing the legal framework to run DrupalCon in North America and Europe. Over the coming year it will be looking to add a third DrupalCon to the calendar, in either South America or the Asia-Pacific region. The DA also oversees the Drupal infrastructure that runs Drupal.org, Groups.Drupal.Org and Association.Drupal.Org. See the FAQ on the Drupal Association website for more info: https://association.drupal.org/about/faq
Opposition: Make My Day – wtf?
The ABC reports:
“Earlier today, in an address to the Coalition party room, the Opposition Leader channelled Clint Eastwood’s character Dirty Harry, daring [the Prime Minister] to “make my day” by focusing on the economy in the year ahead.”
If it is correct that the Leader of the Opposition is channeling the Dirty Harry character, it is a pretty weird thing to go around saying.
The quote originates from the movie Dirty Harry. After a gun fight, the “Dirty” Harry Callahan character advances on an injured “punk”, with his revolver drawn. Mr Callahan says to the punk that he doesn’t know whether he has any bullets left in his gun, but invites the punk to find out – “go ahead, make my day”. The punk submits but wants to know if Mr Callahan had any bullets left. He points the gun at the punk’s head and pulls the trigger, demonstrating both that the gun is empty and that he knew it was empty.
So, the subtext of this quote, should you ever pull it on someone as a challenge, is that you’re bluffing and you can’t back your challenge up.
Everyone agrees that backups are generally a good thing. But it seems that there is a lot less agreement about how backups should work. Here is a list of 5 principles of backup software that seem to get ignored most of the time:
(1/5) Backups should not be Application Specific
It’s quite reasonable for people to want to extract data from a backup on a different platform. Maybe someone will want to extract data a few decades after the platform becomes obsolete. I believe that vendors of backup software have an ethical obligation to make it possible for customers to get their data out with minimal effort regardless of the circumstances.
Often when writing a backup application there will be good reasons for not using the existing formats for data storage (tar, cpio, zip, etc). But ideally any data store which involves something conceptually similar to a collection of files in one larger file will use one of those formats. There have been backward compatible extensions to tar and zip for SE Linux contexts and for OS/2 EAs – the possibility of extending archive file formats with no consequence other than warnings on extraction with an unpatched utility has been demonstrated.
For a backup which doesn’t involve source files (EG the contents of some sort of database) then it should be in a format that can be easily understood and parsed. Well designed XML is generally a reasonable option. Generally the format should involve plain text that is readable and easy to understand which is optionally compressed with a common compression utility (pkzip is a reasonable choice).
(2/5) Data Store Formats should be Published
For every data store there should be public documentation about its format to allow future developers to write support for it. It really isn’t difficult to release some commented header files so that people can easily determine the data structures. This includes all data stores including databases and filesystems. If I suddenly find myself with a 15yo image of a NTFS filesystem containing a proprietary database I should be able to find official header files for the version of NTFS and the database server in question so I can decode the data if it’s important enough.
When an application vendor hides the data formats it gives the risk of substantial data loss at some future time. Imposing such risk on customers to try and prevent them from migrating to a rival product is unethical.
(3/5) Backups should be forward and backward compatible
It is entirely unreasonable for a vendor to demand that all their users install the latest versions of their software. There are lots of good reasons for not upgrading, including hardware not supporting new versions of the OS, lack of Internet access to perform the upgrade, application compatibility, and just liking the way the old version works. Even for the case of a critical security fix it should be possible to restore data without applying the fix.
For any pair of versions of software that are only separated by a few versions it should be possible to backup data from one and restore to the other. Even if the data can’t be used directly (EG a backup of AMD64 programs that is restored on an i386 system) it should still be accessible. If a new version of the software doesn’t support the ancient file formats then it should be possible for the users to get a slightly older version which talks to both the old and new versions.
Backups made on 64bit systems running the latest development version of Linux and on 10yo 32bit proprietary Unix systems are interchangeable. Admittedly Unix is really good at preserving file format compatibility, but there is no technical reason why other systems can’t do the same. Source code to cpio, tar, and gzip is freely available!
Apple TimeMachine fails badly in this regard, even a slightly older version of Mac OS can’t do a restore. It is however nice that most of the TimeMachine data is a tree of files which could be just copied to another system.
(4/5) Backup Software should not be Dropped
Sony Ericsson has made me hate them even more by putting the following message on their update web site:
The Backup and Restore app will be overwritten and cannot be used to restore data. Check out Android Market for alternative apps to back up and restore your data, such as MyBackup.
So if you own a Sony Ericsson phone and it is lost, stolen, or completely destroyed and all you have is a backup made by the Sony Ericsson tool then the one thing you absolutely can’t do is to buy a new Sony Ericsson phone to restore the data.
I believe that anyone who releases backup software has an ethical obligation to support restoring to all equivalent systems. How difficult would it be to put a new free app in the Google Market that has as its sole purpose recovering old Sony Ericsson backups onto newer phones? It really can’t be that difficult, so even if they don’t want to waste critical ROM space by putting the feature in all new phones they can make it available to everyone who needs it. When compared to the cost of developing a new Android release for a series of phones the cost of writing such a restore program would be almost nothing.
It is simply mind-boggling that Sony Ericsson go against their own commercial interests in this regard. Surely it would make good business sense to be able to sell replacements for all the lost and broken Sony Ericsson phones, but instead customers who get burned by broken backups are given an incentive to buy a product from any other vendor.
(5/5) The greater the control over data the greater the obligation for protecting it
If you have data stored in a simple and standard manner (EG the /DCIM directory containing MP4 and JPEG files that is on the USB accessible storage in every modern phone) then IMHO it’s quite OK to leave customers to their own devices in terms of backups. Typical users can work out that if they don’t backup their pictures then they risk losing them, and they can work out how to do it.
My Sony Ericsson phones have data stored under /data (settings for Android applications) which is apparently only accessible as root. Sony Ericsson have denied me root access which prevents me running backup programs such as Titanium Backup, therefore I believe that they have a great obligation to provide a way of making a backup of this data and restoring it on a new phone or a phone that has been updated. To just provide phone upgrade instructions which tell me that my phone will be entirely wiped and that I should search the App Market for backup programs is unacceptable.
I believe that there are two ethical options available to Sony Ericsson at this time, one is to make it easy to root phones so that Titanium Backup and similar programs can be used, and the other option is to release a suitable backup program for older phones. Based on experience I don’t expect Sony Ericsson to choose either option.
Now it is also a bad thing for the Android application developers to make it difficult or impossible to backup their data. For example the Wiki for one Android game gives instructions for moving the saved game files to a new phone which starts with “root your phone”. The developers of that game should have read the Wiki, realised that rooting a phone for the mundane task of transferring saved game files is totally unreasonable, and developed a better alternative.
The best thing for developers to do is to allow the users to access their own data in the most convenient manner. Then it becomes the user’s responsibility to manage it and they can concentrate on improving their application.
Why Freedom is Important
Installing CyanogenMod on my Galaxy S was painful, but having root access so I can do anything I want is a great benefit. If phone vendors would do the right thing then I could recommend that other people use the vendor release, but it seems that vendors can be expected to act unethically. So I can’t recommend that anyone use an un-modded Android phone at any time. I also can’t recommend ever buying a Sony Ericsson product, not even when it’s really cheap.
Google have done a great thing with their Data Liberation Front [1]. Not only are they providing access to the data they store on our behalf (which is a good thing) but they have a mission statement that demands the same behavior from other companies – they make it an issue of competitive advantage! So while Sony Ericsson and other companies might not see a benefit in making people like me stop hating them, failing to be as effective in marketing as Google is a real issue. Data Liberation is something that should be discussed at board elections of IT companies.
Keep in mind the fact that ethics are not just about doing nice things, they are about establishing expectations of conduct that will be used by people who deal with you in future. Sony Ericsson has shown that I should expect that they will treat the integrity of my data with contempt and I will keep this in mind every time I decline an opportunity to purchase their products. Google has shown that they consider the protection of my data as an important issue and therefore I can be confident when using and recommending their services that I won’t get stuck with data that is locked away.
While Google has demonstrated that corporations can do the right thing, the vast majority of evidence suggests that we should never trust a corporation with anything that we might want to retrieve when it’s not immediately profitable for the corporation. Therefore avoiding commercial services for storing important data is the sensible thing to do.
On New Years Eve 2011 I was in Geelong at a restaurant, 800km from my home in Adelaide. This year I happened to be away from my children, who were staying elsewhere in Adelaide while I was interstate. My home was supposedly vacant. However I knew it was very hot in Adelaide that day (40C) and I wondered if this would affect my power consumption, for example an increased duty cycle on the fridge. I am just that sort of power-geek.
So I checked my Fluksometer via my 3G android phone. I was surprised to see 1000W being used since 1pm – about what my Air-con uses. I also noticed that around 7pm the power jumped by a few 100W, just like the lights had gone on, or perhaps the TV.
Looked like someone was in my home. On New Years Eve. Hmmmmmm.
The 24 hour plot below was captured on 1 Jan at 5:30pm, so it actually shows the tail end of the Dec 31 festivities. You can see the 1000W consumption until it shoots up around 1900 hours, then the rapid, parentally-induced decline at around 2030 hours as explained below.
I was fortunate to be at the restaurant with a couple of people expert in these situations. Teenagers. They suspected “Party”. I was unsure. I called my beloved 16 year old daughter Amy to see if she “knew” anything about this phantom power problem. My gut feel was to call my mother (Amy’s grandmother) and ask her to visit my home but I thought I’d give Amy the benefit of the doubt. Amy said that she was at a friend’s house but would go around and check my house. She was not keen on using her grandmother to resolve the issue. Exactly 30 minutes later I received a text from her saying the air con and TV were on but she had switched them off.
By this stage half the restaurant (I was with a friend’s extended family) were crowded around my phone, watching the next development with excitement. My teenage brains-trust were calling “Party” but there was no way to know for sure. Sure enough the power drops, down to about 180W. About what the fridge motor uses. However curiously, there was none of the regular fridge cycling on and off. It was as if all the lights were off in the house but the fridge motor was running all the time to cool or freeze something.
I returned to Adelaide the next day (1 Jan). My home was very clean but I found a few tell-tale signs: disposable cups with sticky red liquid in them in one of the bins, a trace of the same red sticky stuff on my sink, and post it notes accidentally left on my fridge saying things like “Molly, you may have to open up another bottle”.
What happened to Amy? Well to be honest I wasn’t very mad, just curious about the mystery. I actually enjoyed the detective work side of guessing what was going on and finding supporting evidence. Bart, the inventor of the Fluksometer, was rolling on the floor laughing when I told him the tale.
All my friends knew about the incident so when Amy joined me in Geelong for the next week she was teased relentlessly. Eventually she came clean, and said:
“All my friends who didn’t know Dad said ‘How could he do that? Who measures power from across the country?’ Those that did know Dad said ‘He knows. Don’t worry!’”
“When I realised we were busted there was a mass exodus. I was the last one out and could see a continuous line of teenagers stretched up the street over three blocks.”
One of Amy’s friends put it well: “You gotta get dumber parents Amy.”
Links
Flukso Web Site
Flukso – Wifi Household Power Logging
Buying a Fluksometer in Australia
It's early days, but preparations for linux.conf.au 2013 are going well. We're working on the prospectus for sponsors and nailing down the final details for venues at the moment. The next publicly visible deadline is the call for papers, which will be in June 2012.
For subcommittee: linux.conf.au Link: http://lca2013.linux.org.au
Oh yeah, it should also be great for gaming:
…not.
I don’t have an ADSL2+ modem (this SpeedStream 4200 isn’t syncing above the theoretical ADSL1 maximums) so I need to go buy a proper one. Most likely a NetComm NB6, as I’m fairly familiar with that model and I only need it to function in bridge mode.
If only I had line-of-sight to Oxley Hill — then I could get my work’s awesome wireless service. Believe it or not, it has a lower ping time than ADSL, and in most cases out-performs ADSL2+ in terms of consistency. (And hey, since when do most people get over 12 Mbit/s on ADSL2+ anyway?)
Just a small heads–up: if you’re accessing any sites hosted on this server over SSL (i.e. this blog) or IPv6, then you’ll be served via Cherokee.
Non–SSL IPv4 requests are still being served via Apache (for now) because that is the lowest common denominator that I would like to break the least.
A couple of interesting things have happened since I have been trying this out.
One was a problem that cropped up on my Apache mod_gnutls setup, where the Apache workers were using 100% CPU. Upon stumbling on this thread I realised that the BerkeleyDB–based GnuTLS cache had become corrupted. Had I not found the fix, I would have used that as an excuse to completely move over to Cherokee.
The other was that I have been playing with FastCGI for the first time. Cherokee makes it a breeze to set up FastCGI, so I thought — why not? In particular, WordPress now takes 0.3 seconds to render a page, rather than 0.9 seconds with Apache + mod_suphp + PHP CGI.
Unfortunately there does not appear to be an equivalent of suPHP for Cherokee (or FastCGI in general). suPHP itself won’t work as it behaves as an Apache module (mod_suphp) that in turn executes the PHP CGI. That’s a bit of a fizzer for me because I need some of the sites I host to be able to write to their own files — but at the same time, not affect anybody else’s (thus chown’ing as www-data would be rather suboptimal).
The current version of Cherokee I am running (1.2.2) has a bug where it does not handle Chrome’s SSL false start technique. Thanks to ‘nosey’ on #cherokee for finding that one.
Overall I have found Cherokee to be less stable than Apache. This is to be expected, it being a young whippersnapper. In particular, if it encounters a config it doesn’t like (for example, I get some regex syntax wrong), it likes to shut down entirely, rather than gracefully continuing on with the old config.
Also I have not found Cherokee to be perceptually any faster than Apache thus far. My sites being hosted on a Linode that is 200 msec away obviously doesn’t help in this area. Perhaps I would see more of a difference on a server that is both closer (latency–wise) and busier (hits-per-second–wise).
More than anything this is a bit of fun. Let’s see how we go.
Frédéric Descamps of Percona.
Percona Toolkit is Maatkit & Aspersa combined. It is open source and the tools are very useful for a DBA.
You need Perl, DBI, DBD::mysql, Term::ReadKey. Most tools are written in Perl, and whatever is in Bash is being re-written in Perl. There is also a tarball or RPM or DEB packages.
Know your environment. The hardware & OS are crucial for you to know. How much memory/CPU do you use? Do you use swap? Is this a physical/virtual machine? Do you have free space? What kind of RAID controller? Volumes? Disk? What about the network interfaces? What IO schedulers are used? Which filesystem is the data stored on? To answer all that, just use pt-summary.
Know your MySQL environment. Version? Build? How many databases? Where is the data directory? What about replication? What are key InnoDB settings? Storage engine in use? Index type? Foreign keys? Full text indexes? To answer all this and more use pt-mysql-summary.
pt-slave-find shows you the topology and replication hierarchy of your MySQL replication instances. An inventory of replicas!
Where is my disk I/O going? Use pt-diskstats which is an improved iostat. There is pt-ioprofile but it can be dangerous in production.
Now it’s time to get more intimate with your database. Let’s try to find the answer to these questions: how are the indexes used? Are there duplicate keys? Which queries are eating most of the resources? You can use pt-duplicate-key-checker to check for duplicate/redundant indexes or foreign keys. pt-index-usage can tell you which indexes are unused. If you think you have bad SQL, check out pt-query-advisor.
You can use pt-query-digest to analyze the slow query log and show a profile of the workload. You mostly use this with slow query logs & tcpdump output. Be careful when you have dropped packets, as the results may be misleading then!
After all this, it’s time to maintain your environment.
pt-deadlock-logger checks InnoDB status to log MySQL deadlock information. It needs to run continually to capture things.
pt-fk-error-logger extracts and logs MySQL foreign key errors.
pt-online-schema-change to alter tables. It makes a “shadow copy” and swaps them. Extremely useful for large, long-running ALTER. Facebook uses the same technique.
Validate your upgrades as upgrades are the leading cause of downtime. Are queries using different indexes? Is query execution plan different? New errors? See pt-upgrade for this. Best to run this on a third machine (i.e. the old machine and a new machine to see how it goes).
Verify replication integrity – pt-table-checksum. Perform an online replication consistency check or checksum MySQL tables efficiently on one or many servers. Use it routinely (mandatory for 95% of MySQL users). Put it in a weekly crontab. Repair differences with pt-table-sync.
Repair out-of-sync replicas – pt-table-sync
Measure delay accurately – pt-heartbeat
Deliberately delay replication – pt-slave-delay
Watch & restart MySQL replication after errors – pt-slave-restart
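The "shadow copy and swap" idea mentioned for pt-online-schema-change above can be sketched in a few lines of Python. This is only an illustration of the general technique, not how the tool is implemented: it uses an in-memory SQLite database instead of MySQL, the table and column names (users, name, email) are made up, and it does nothing to capture writes that arrive while the copy is running, which a real online tool has to handle.

```python
import sqlite3

def shadow_copy_alter(conn, chunk_size=2):
    """Add an 'email' column to the hypothetical 'users' table via a shadow copy."""
    # 1. Create the shadow table with the desired new schema.
    conn.execute(
        "CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    # 2. Copy rows across in small chunks, ordered by primary key.
    last_id = 0
    while True:
        conn.execute(
            "INSERT INTO users_new (id, name) "
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size))
        newest = conn.execute(
            "SELECT COALESCE(MAX(id), 0) FROM users_new").fetchone()[0]
        if newest == last_id:  # nothing was copied, so we are done
            break
        last_id = newest
    # 3. Swap the tables by renaming, then drop the old one.
    conn.execute("ALTER TABLE users RENAME TO users_old")
    conn.execute("ALTER TABLE users_new RENAME TO users")
    conn.execute("DROP TABLE users_old")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob",), ("carol",)])
shadow_copy_alter(conn)
columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
print(columns)
print(rows)
```

Copying in small chunks is what makes this approach attractive for a large, long-running ALTER: the original table stays usable while the copy proceeds and only the final rename needs to be quick.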
When there are problems, get the symptoms when it hurts. Look at pt-stalk (wait for a condition to occur then begin collecting data – eg. every time the threads go over 2,000 you have a problem, so it collects stuff – it calls pt-collect), pt-collect (collect information from a server for some period of time), and pt-sift.
pt-mext looks at many samples of MySQL SHOW GLOBAL STATUS side-by-side. Default STATUS shows counter since the MySQL instances started. It is very helpful to see a delta of recent activity.
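To show why a delta view is more useful than raw counters, here is a minimal Python sketch of the idea. It is not pt-mext itself and makes no claim about that tool's exact output format; the counter values are invented for illustration and the function name is hypothetical.

```python
def status_deltas(samples):
    """Turn successive counter samples into: first value, then per-interval deltas."""
    table = {}
    for name in sorted(samples[0]):
        values = [sample[name] for sample in samples]
        deltas = [later - earlier for earlier, later in zip(values, values[1:])]
        table[name] = [values[0]] + deltas
    return table

# Three made-up samples of two status counters, taken at regular intervals.
samples = [
    {"Com_select": 1000, "Questions": 5000},
    {"Com_select": 1040, "Questions": 5100},
    {"Com_select": 1045, "Questions": 5350},
]

for name, row in status_deltas(samples).items():
    print(f"{name:<12}" + "".join(f"{value:>8}" for value in row))
```

The raw values only ever grow since the server started, while the deltas show how much happened in each interval.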
The future: pt-query-digest will do query reviews; pt-stalk will do “magical fault detection algorithm”. It’s all open source and it’s all on Launchpad at lp:percona-toolkit.
ZDNet has an insightful article by Robin Harris predicting the demise of RAID-6 due to the probability of read errors [1]. Basically as drives get larger the probability of hitting a read error during reconstruction increases and therefore you need to have more redundancy to deal with this. He suggests that as of 2009 drives were too big for a reasonable person to rely on correct reads from all remaining drives after one drive failed (in the case of RAID-5) and that in 2019 there will be a similar issue with RAID-6.
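The shape of that argument can be illustrated with a short Python sketch. Note the assumptions: the error rate of one unrecoverable read error per 10^14 bits is a figure often quoted for consumer drives and is my assumption here, not a number taken from this post, and the model treats every bit read as an independent trial. The capacities are arbitrary examples.

```python
import math

def p_read_error(bytes_read, errors_per_bit):
    """Probability of at least one unrecoverable read error in bytes_read bytes."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p.
    return -math.expm1(bits * math.log1p(-errors_per_bit))

ASSUMED_ERROR_RATE = 1e-14  # assumption: one error per 10^14 bits read

for terabytes in (1, 2, 4, 12):
    probability = p_read_error(terabytes * 1e12, ASSUMED_ERROR_RATE)
    print(f"reading {terabytes}TB: {probability:.1%} chance of a read error")
```

Under these assumptions the chance of hitting an error while reading every remaining sector during a rebuild grows quickly with the amount of data, which is why more redundancy is needed as drives get larger.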
Of course most systems in the field aren’t using even RAID-6. All the most economical hosting options involve just RAID-1 and RAID-5 is still fairly popular with small servers. With RAID-1 and RAID-5 you have a serious problem when (not if) a disk returns random or outdated data and says that it is correct, you have no way of knowing which of the disks in the set has good data and which has bad data. For RAID-5 it will be theoretically possible to reconstruct the data in some situations by determining which disk should have its data discarded to give a result that passes higher level checks (EG fsck or application data consistency), but this is probably only viable in extreme cases (EG one disk returns only corrupt data for all reads).
For the common case of a RAID-1 array if one disk returns a few bad sectors then probably most people will just hope that it doesn’t hit something important. The case of Linux software RAID-1 is of interest to me because that is used by many of my servers.
Robin has also written about some NetApp research into the incidence of read errors which indicates that 8.5% of “consumer” disks had such errors during the 32 month study period [2]. This is a concern as I run enough RAID-1 systems with “consumer” disks that it is very improbable that I’m not getting such errors. So the question is, how can I discover such errors and fix them?
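A rough Python sketch of why that is improbable: if each disk independently has an 8.5% chance of such an error over the study period (the figure cited above), the chance that at least one disk in a fleet is affected rises fast with the number of disks. Independence between disks is an assumption, and the disk counts below are hypothetical examples rather than a count of my own systems.

```python
def p_any_disk_affected(n_disks, p_per_disk=0.085):
    """Probability that at least one of n_disks has an error, assuming independence."""
    return 1 - (1 - p_per_disk) ** n_disks

for n_disks in (2, 10, 20, 50):
    print(f"{n_disks} disks: {p_any_disk_affected(n_disks):.1%}")
```

With 10 disks the chance is already better than even, and with 50 disks it is close to certain.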
In Debian the mdadm package does a monthly scan of all software RAID devices to try and find such inconsistencies, but it doesn’t send an email to alert the sysadmin! I have filed Debian bug #658701 with a patch to make mdadm send email about this. But this really isn’t going to help a lot as the email will be sent AFTER the kernel has synchronised the data with a 50% chance of overwriting the last copy of good data with the bad data! Also the kernel code doesn’t seem to tell userspace which disk had the wrong data in a 3-disk mirror (and presumably a RAID-6 works in the same way) so even if the data can be corrected I won’t know which disk is failing.
Another problem with RAID checking is the fact that it will inherently take a long time and in practice can take a lot longer than necessary. For example I run some systems with LVM on RAID-1 on which only a fraction of the VG capacity is used. In one case the kernel will check 2.7TB of RAID even when there’s only 470G in use!
The BTRFS Filesystem
The btrfs Wiki is currently at btrfs.ipv5.de as the kernel.org wikis are apparently still read-only since the compromise [3]. BTRFS is noteworthy for doing checksums on data and metadata and for having internal support for RAID. So if two disks in a BTRFS RAID-1 disagree then the one with valid checksums will be taken as correct!
I’ve just done a quick test of this. I created a filesystem with the command “mkfs.btrfs -m raid1 -d raid1 /dev/vg0/raid?” and copied /dev/urandom to it until it was full. I then used dd to copy /dev/urandom to some parts of /dev/vg0/raidb while reading files from the mounted filesystem – that worked correctly although I was disappointed that it didn’t report any errors, I had hoped that it would read half the data from each device and fix some errors on the fly. Then I ran the command “btrfs scrub start .” and it gave lots of verbose errors in the kernel message log telling me which device had errors and where the errors are. I was a little disappointed that the command “btrfs scrub status .” just gave me a count of the corrected errors and didn’t mention which device had the errors.
It seems to me that BTRFS is going to be a much better option than Linux software RAID once it is stable enough to use in production. I am considering upgrading one of my less important servers to Debian/Unstable to test out BTRFS in this configuration.
BTRFS is rumored to have performance problems, I will test this but don’t have time to do so right now. Anyway I’m not always particularly concerned about performance, I have some systems where reliability is important enough to justify a performance loss.
BTRFS and Xen
The system with the 2.7TB RAID-1 is a Xen server and LVM volumes on that RAID are used for the block devices of the Xen DomUs. It seems obvious that I could create a single BTRFS filesystem for such a machine that uses both disks in a RAID-1 configuration and then use files on the BTRFS filesystem for Xen block devices. But that would give a lot of overhead of having a filesystem within a filesystem. So I am considering using two LVM volume groups, one for each disk. Then for each DomU which does anything disk intensive I can export two LVs, one from each physical disk and then run BTRFS inside the DomU. The down-side of this is that each DomU will need to scrub the devices and monitor the kernel log for checksum errors. Among other things I will have to back-port the BTRFS tools to CentOS 4.
This will be more difficult to manage than just having an LVM VG running on a RAID-1 array and giving each DomU a couple of LVs for storage.
BTRFS and DRBD
The combination of BTRFS RAID-1 and DRBD is going to be a difficult one. The obvious way of doing it would be to run DRBD over loopback devices that use large files on a BTRFS filesystem. That gives the overhead of a filesystem in a filesystem as well as the DRBD overhead.
It would be nice if BTRFS supported more than two copies of mirrored data. Then instead of DRBD over RAID-1 I could have two servers that each have two devices exported via NBD and BTRFS could store the data on all four devices. With that configuration I could lose an entire server and get a read error without losing any data!
Comparing Risks
I don’t want to use BTRFS in production now because of the risk of bugs. While it’s unlikely to have really serious bugs it’s theoretically possible that a bug could deny access to data until kernel code is fixed and it’s also possible (although less likely) that a bug could result in data being overwritten such that it can never be recovered. But for the current configuration (Ext4 on Linux software RAID-1) it’s almost certain that I will lose small amounts of data and it’s most probable that I have silently lost data on many occasions without realising.
Sergey Petrunia of the MariaDB project & Monty Program.
MySQL 5.5 GA at the end of 2010. MariaDB 5.3 RC towards the end of 2011 (beta in June 2011).
MySQL 5.5 is merged to Percona Server 5.5 which included semi-sync replication, slave fsync options, automatic relay log recovery, RBR slave type conversions (question if this is useful or not), individual log flushing (very useful, but not many using), replication heartbeat, SHOW RELAYLOG EVENTS. About 2/3rds of the audience use MySQL 5.5 in production, with only 2 people using semi-sync replication.
MariaDB 5.3 brings replication features including group commit in the binary log, which is merged into Percona Server 5.5, and checksums for binlog events, which are merged from MySQL 5.6. Sergey goes in-depth about the group commit for the binary log. To find out a little more about MariaDB replication changes, see Replication in the Knowledgebase.
There are several implementations of group commit. Facebook started it, followed by MariaDB & Oracle. Percona 5.5 is GA so the feature is there, it’s not in MySQL 5.6 (yet?), and MariaDB 5.3 is where it’s at. Seems like the MariaDB implementation is the best so far; refer to the Facebook benchmark performed by Mark Callaghan.
Annotated RBR poses a compatibility problem. MariaDB 5.3 has annotate_rows, while MySQL 5.6 has rows_query event. They are different events. So you cannot have a MariaDB 5.3 master and a MySQL 5.6 slave at this moment. So MySQL 5.6 will have a flag to mark “ignorable” binlog events which will be merged into MariaDB and this will make binary logs compatible again.
There is now also optimized RBR for tables with no primary key.
MySQL 5.6 also has crash-safe slave (replication information stored in tables). Crash-safe master (binary log recovery if the server starts & sees the binary log is corrupted). Parallel event execution is something that is new in MySQL 5.6 which is the most important feature for Sergey.
Pre-heating: There is mk-slave-prefetch (famous quote: “Please don’t use mk-slave-prefetch on #MySQL unless you are Facebook.”). There is replication booster by Yoshinori Matsunobu. There is a Python version of mk-slave-prefetch that Facebook uses.