{"id":759,"date":"2006-11-13T11:34:48","date_gmt":"2006-11-13T01:34:48","guid":{"rendered":"http:\/\/www.flamingspork.com\/blog\/2006\/11\/13\/disk-allocation-xfs-ndb-disk-data-and-more\/"},"modified":"2010-05-27T17:14:30","modified_gmt":"2010-05-27T07:14:30","slug":"disk-allocation-xfs-ndb-disk-data-and-more","status":"publish","type":"post","link":"https:\/\/www.flamingspork.com\/blog\/2006\/11\/13\/disk-allocation-xfs-ndb-disk-data-and-more\/","title":{"rendered":"Disk allocation, XFS, NDB Disk Data and more&#8230;"},"content":{"rendered":"<p>I&#8217;ve talked about disk space allocation previously, mainly revolving around XFS (namely because it&#8217;s what I use, is a sensible choice for large file systems and large files, and has a nice suite of tools for digging into what&#8217;s going on). Most people write software that just calls write(2) (or libc things like fwrite or fprintf) to do file IO &#8211; including space allocation. Probably 99% of file IO is fine to do like this and the allocators for your file system get it mostly right (some more right than others). Remember, disk seeks are <strong>really<\/strong> <strong><em>really<\/em><\/strong> expensive so the fewer you have to do, the better (i.e. <strong>fragmentation==bad<\/strong>).<\/p>\n<p>I recently (finally) wrote my patch to use xfsctl to get better allocation for NDB disk data files (datafiles and undofiles).<br \/>\npatch at:<br \/>\n<a href=\"http:\/\/lists.mysql.com\/commits\/15088\">http:\/\/lists.mysql.com\/commits\/15088<\/a><\/p>\n<p>This actually ends up giving us a rather nice speed boost in some of the test suite runs.<\/p>\n<p>The problem is:<br \/>\n&#8211; two cluster nodes on 1 host (in the case of the mysql-test-run script)<br \/>\n&#8211; each node has a complete copy of the database<br \/>\n&#8211; ALTER TABLESPACE ADD DATAFILE \/ ALTER LOGFILEGROUP ADD UNDOFILE creates files on *both* nodes. 
We want to zero these out.<br \/>\n&#8211; files are opened with O_SYNC (IIRC)<\/p>\n<p>The patch I committed uses XFS_IOC_RESVSP64 to allocate (unwritten) extents and then posix_fallocate to zero out the file (the glibc implementation of this call just writes zeros out).<\/p>\n<p>Now, ideally it would be beneficial (and probably faster) to have XFS do this in kernel. Asynchronously would be pretty cool too&#8230; but hey :)<\/p>\n<p>The reason we don&#8217;t want unwritten extents is that NDB has some realtime properties, and futzing about with extents and the like in the FS during transactions isn&#8217;t such a good idea.<\/p>\n<p>So, this would lead me to try XFS_IOC_ALLOCSP64 &#8211; which doesn&#8217;t have the &#8220;unwritten extents&#8221; warning that RESVSP64 does. However, with the two processes writing the files out, I get heavy fragmentation. Even with a RESVSP followed by ALLOCSP I get the same result.<\/p>\n<p>So it seems that ALLOCSP re-allocates extents (even if it doesn&#8217;t have to) and really doesn&#8217;t give you much (I didn&#8217;t do enough timing to see if it was any quicker).<\/p>\n<p>I&#8217;ve asked on the XFS list whether this is expected behaviour&#8230; we&#8217;ll see what the response is (I haven&#8217;t had time yet to go read the code&#8230; I should though).<\/p>\n<p>So what improvement does this patch make? Well, I&#8217;ll quote my commit comments:<\/p>\n<pre><a href=\"http:\/\/bugs.mysql.com\/bug.php?id=24143\">BUG#24143<\/a> Heavy file fragmentation with multiple ndbd on single fs\r\n\r\nIf we have the XFS headers (at build time) we can use XFS specific ioctls\r\n(once testing the file is on XFS) to better allocate space.\r\n\r\nThis dramatically improves performance of mysql-test-run cases as well:\r\n\r\ne.g.\r\nnumber of extents for ndb_dd_basic tablespaces and log files\r\nBEFORE this patch: 57, 13, 212, 95, 17, 113\r\nWITH this patch  :  ALL 1 or 2 extents\r\n\r\n(results are consistent over multiple runs. 
BEFORE always has several files\r\nwith lots of extents).\r\n\r\nAs for timing of test run:\r\nBEFORE\r\nndb_dd_basic                   [ pass ]         107727\r\nreal    3m2.683s\r\nuser    0m1.360s\r\nsys     0m1.192s\r\n\r\nAFTER\r\nndb_dd_basic                   [ pass ]          70060\r\nreal    2m30.822s\r\nuser    0m1.220s\r\nsys     0m1.404s\r\n\r\n(results are again consistent over various runs)\r\n\r\nsimilar for other tests (BEFORE and AFTER):\r\nndb_dd_alter                   [ pass ]         245360\r\nndb_dd_alter                   [ pass ]         211632<\/pre>\n<p>So what about the patch? It&#8217;s actually really tiny:<\/p>\n<pre><span class=\"removed\">\r\n<\/span><span class=\"removed\">--- 1.388\/configure.in\t2006-11-01 23:25:56 +11:00\r\n<\/span><span class=\"added\">+++ 1.389\/configure.in\t2006-11-10 01:08:33 +11:00\r\n<\/span>@@ -697,6 +697,8 @@\r\nsys\/ioctl.h malloc.h sys\/malloc.h sys\/ipc.h sys\/shm.h linux\/config.h \\\r\nsys\/resource.h sys\/param.h)\r\n\r\n<span class=\"added\">+AC_CHECK_HEADERS([xfs\/xfs.h])\r\n<\/span><span class=\"added\">+\r\n<\/span> #--------------------------------------------------------------------\r\n# Check for system libraries. 
Adds the library to $LIBS\r\n# and defines HAVE_LIBM etc\r\n\r\n<span class=\"removed\">--- 1.36\/storage\/ndb\/src\/kernel\/blocks\/ndbfs\/AsyncFile.cpp\t2006-11-03 02:18:41 +11:00\r\n<\/span><span class=\"added\">+++ 1.37\/storage\/ndb\/src\/kernel\/blocks\/ndbfs\/AsyncFile.cpp\t2006-11-10 01:08:33 +11:00\r\n<\/span>@@ -18,6 +18,10 @@\r\n#include\r\n#include\r\n\r\n<span class=\"added\">+#ifdef HAVE_XFS_XFS_H\r\n<\/span><span class=\"added\">+#include\r\n<\/span><span class=\"added\">+#endif\r\n<\/span><span class=\"added\">+\r\n<\/span> #include \"AsyncFile.hpp\"\r\n\r\n#include\r\n@@ -459,6 +463,18 @@\r\nUint32 index = 0;\r\nUint32 block = refToBlock(request-&gt;theUserReference);\r\n\r\n<span class=\"added\">+#ifdef HAVE_XFS_XFS_H\r\n<\/span><span class=\"added\">+    if(platform_test_xfs_fd(theFd))\r\n<\/span><span class=\"added\">+    {\r\n<\/span><span class=\"added\">+      ndbout_c(\"Using xfsctl(XFS_IOC_RESVSP64) to allocate disk space\");\r\n<\/span><span class=\"added\">+      xfs_flock64_t fl;\r\n<\/span><span class=\"added\">+      fl.l_whence= 0;\r\n<\/span><span class=\"added\">+      fl.l_start= 0;\r\n<\/span><span class=\"added\">+      fl.l_len= (off64_t)sz;\r\n<\/span><span class=\"added\">+      if(xfsctl(NULL, theFd, XFS_IOC_RESVSP64, &amp;fl) &lt; 0)\r\n<\/span><span class=\"added\">+        ndbout_c(\"failed to optimally allocate disk space\");\r\n<\/span><span class=\"added\">+    }\r\n<\/span><span class=\"added\">+#endif\r\n<\/span> #ifdef HAVE_POSIX_FALLOCATE\r\nposix_fallocate(theFd, 0, sz);\r\n#endif<\/pre>\n<p>So get building your MySQL Cluster with the XFS headers installed and run on XFS for sweet, sweet disk allocation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve talked about disk space allocation previously, mainly revolving around XFS (namely because it&#8217;s what I use, a sensible choice for large file systems and large files and has a nice suite of tools for digging into what&#8217;s going on).Most 
&hellip; <a href=\"https:\/\/www.flamingspork.com\/blog\/2006\/11\/13\/disk-allocation-xfs-ndb-disk-data-and-more\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_feature_clip_id":0,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":false,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"jetpack_post_was_ever_published":false},"categories":[8,14],"tags":[628,54,137,130],"class_list":["post-759","post","type-post","status-publish","format-standard","hentry","category-linux-kernel","category-mysql","tag-mysql","tag-ndb","tag-posix_fallocate","tag-test"],"jetpack_publicize_connections":[],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p5a6n8-cf","jetpack-related-posts":[{"id":511,"url":"https:\/\/www.flamingspork.com\/blog\/2005\/11\/23\/disk-space-allocation-part-1-seeing-whats-happenned\/","url_meta":{"origin":759,"position":0},"title":"disk space allocation (part 1: seeing what&#8217;s happenned)","author":"Stewart Smith","date":"2005-11-23","format":false,"excerpt":"(a little while ago I was writing a really long entry on everything possible. I realised that this would be a long read for people and that less people would look at it, so I've split it up). 
This sprung out of doing work on the NDB disk data tree.\u2026","rel":"","context":"In &quot;linux-kernel&quot;","block_context":{"text":"linux-kernel","link":"https:\/\/www.flamingspork.com\/blog\/category\/linux-kernel\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":515,"url":"https:\/\/www.flamingspork.com\/blog\/2005\/11\/29\/disk-space-allocation-part-4-allocating-an-extent\/","url_meta":{"origin":759,"position":1},"title":"disk space allocation (part 4: allocating an extent)","author":"Stewart Smith","date":"2005-11-29","format":false,"excerpt":"For XFS, in normal operation, an extent is only allocated when data has to be written to disk. This is called delayed allocation. If we are extending a file by 50MB - that space is deducted from the total free space on the filesystem, but no decision on where to\u2026","rel":"","context":"In &quot;linux-kernel&quot;","block_context":{"text":"linux-kernel","link":"https:\/\/www.flamingspork.com\/blog\/category\/linux-kernel\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":514,"url":"https:\/\/www.flamingspork.com\/blog\/2005\/11\/29\/disk-space-allocation-part-3-storing-extents-on-disk\/","url_meta":{"origin":759,"position":2},"title":"disk space allocation (part 3: storing extents on disk)","author":"Stewart Smith","date":"2005-11-29","format":false,"excerpt":"Here I'm going to talk about how file systems store what part of the disk a part of the file occupies. If your database files are very fragmented, performance will suffer. How much depends on a number of things however. 
XFS can store some extents directly in the inode (see\u2026","rel":"","context":"In &quot;linux-kernel&quot;","block_context":{"text":"linux-kernel","link":"https:\/\/www.flamingspork.com\/blog\/category\/linux-kernel\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":334,"url":"https:\/\/www.flamingspork.com\/blog\/2005\/01\/06\/effective-bk-usage\/","url_meta":{"origin":759,"position":3},"title":"effective bk usage","author":"Stewart Smith","date":"2005-01-06","format":false,"excerpt":"(inspired by jimw talking about it on Planet MySQL) I take a bit of a different approach... I've got directories for 4.0, 4.1 and 5.0, and within them, i have clones of the main ndb tree (called ndb, so there's a path like \"MySQL\/5.0\/ndb\"). I don't ever edit in this\u2026","rel":"","context":"In &quot;mysql&quot;","block_context":{"text":"mysql","link":"https:\/\/www.flamingspork.com\/blog\/category\/work-et-al\/mysql\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1201,"url":"https:\/\/www.flamingspork.com\/blog\/2008\/09\/08\/setfilevaliddata-function-windows-now-with-added-fail\/","url_meta":{"origin":759,"position":4},"title":"SetFileValidData Function (Windows) &#8211; Now with added FAIL","author":"Stewart Smith","date":"2008-09-08","format":false,"excerpt":"SetFileValidData Function (Windows) There seems to be two options on Win32 for preallocating disk space to files. Basically, I want a equivilent to posix_fallocate or the ever wonderful xfsctl XFS_IOC_RESVSP64 call. The idea being to (quickly) create a large file on disk that is stored efficiently (i.e. isn't fragmented). 
From\u2026","rel":"","context":"In &quot;mysql&quot;","block_context":{"text":"mysql","link":"https:\/\/www.flamingspork.com\/blog\/category\/work-et-al\/mysql\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":1229,"url":"https:\/\/www.flamingspork.com\/blog\/2008\/10\/14\/mysql-cluster-ndb-on-win32-progress\/","url_meta":{"origin":759,"position":5},"title":"MySQL Cluster (NDB) on Win32 progress","author":"Stewart Smith","date":"2008-10-14","format":false,"excerpt":"Many things have been happenning in the land of NDB on Win32 as of late. I've fixed about 700 compiler warnings (some of which were real bugs) leaving about 161 to go on Win32 (VS2003). We're getting a few more warnings on Win64 (some of which look merely semantic, while\u2026","rel":"","context":"In &quot;General&quot;","block_context":{"text":"General","link":"https:\/\/www.flamingspork.com\/blog\/category\/general\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/comments?post=759"}],"version-history":[{"count":3,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/759\/revisions"}],"predecessor-version":[{"id":2007,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/759\/revisions\/2007"}],"wp:attachment":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/media?parent=759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\
/categories?post=759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/tags?post=759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}