{"id":4896,"date":"2024-01-13T11:02:42","date_gmt":"2024-01-13T19:02:42","guid":{"rendered":"https:\/\/www.flamingspork.com\/blog\/?p=4896"},"modified":"2024-01-13T11:02:42","modified_gmt":"2024-01-13T19:02:42","slug":"using-llvm-mca-for-predicting-cpu-cycle-impact-of-code-changes","status":"publish","type":"post","link":"https:\/\/www.flamingspork.com\/blog\/2024\/01\/13\/using-llvm-mca-for-predicting-cpu-cycle-impact-of-code-changes\/","title":{"rendered":"Using llvm-mca for predicting CPU cycle impact of code changes"},"content":{"rendered":"\n<p>Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor and see how many CPU cycles each instruction took, and then do the math as to how long a sequence of instructions would take to execute. This cycle counting was used pretty effectively to do really neat things such as <a href=\"https:\/\/www.wired.com\/2009\/03\/racing-the-beam\/\" target=\"_blank\" rel=\"noreferrer noopener\">how you&#8217;d get anything on the screen from an Atari 2600<\/a>. Modern CPUs are&#8230; complex. They can do several things at once, in a different order than what you wrote them in, and have an interesting arrangement of shared resources to allocate.<\/p>\n\n\n\n<p>So, unlike with simpler hardware, if you have a sequence of instructions for a modern processor, it&#8217;s going to be pretty hard to work out how many cycles that could take by hand, and it&#8217;s going to differ for each micro-architecture available for the instruction set.<\/p>\n\n\n\n<p>When designing a microprocessor, simulating what a series of existing instructions will take to execute compared to the previous generation of microprocessor is pretty important. The aim should be for it to take less time or energy or some other metric that means your new processor is better than the old one. 
It can be okay if, from one processor generation to the next, some sequences of instructions take more cycles &#8211; provided your cycles are more frequent, or more power efficient, or better on whatever other metric you&#8217;re designing for.<\/p>\n\n\n\n<p>Programmers may want this simulation too, as some code paths get rather performance-critical for certain applications. Open Source tools for this aren&#8217;t as prolific as I&#8217;d like, but there is <code><a href=\"https:\/\/llvm.org\/docs\/CommandGuide\/llvm-mca.html\" target=\"_blank\" rel=\"noreferrer noopener\">llvm-mca<\/a><\/code>, which I (relatively) recently learned about.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>llvm-mca<\/strong> is a performance analysis tool that uses information available in LLVM (e.g. scheduling models) to statically measure the performance of machine code in a specific CPU.<\/p>\n<cite><a href=\"https:\/\/llvm.org\/docs\/CommandGuide\/llvm-mca.html\" target=\"_blank\" rel=\"noreferrer noopener\">the llvm-mca docs<\/a><\/cite><\/blockquote>\n\n\n\n<p>So, when looking at <a href=\"https:\/\/lore.kernel.org\/netdev\/CANn89i+6d9K1VwNK1Joc-Yb_4jAfV_YFzk=z_K2_Oy+xJHSn_g@mail.gmail.com\/T\/\" target=\"_blank\" rel=\"noreferrer noopener\">an issue in the IPv6 address and connection hashing code in Linux<\/a> last year, and being quite conscious that modern systems deal with a LOT of network packets, and that this code can thus be quite sensitive to CPU usage, I wanted to make sure that my suggested changes weren&#8217;t going to have a large impact on performance &#8211; across the variety of CPU generations in use.<\/p>\n\n\n\n<p>There are two ways to do this. One: run everything, throw a lot of packets at something, and measure it. That can be a long dev cycle, and sometimes just annoying to get going. 
It can be a lot quicker to simulate the small section of code in question and do some analysis of it before going through the trouble of spinning up multiple test environments to prove it in the real world.<\/p>\n\n\n\n<p>So, enter <a href=\"https:\/\/llvm.org\/docs\/CommandGuide\/llvm-mca.html\" target=\"_blank\" rel=\"noreferrer noopener\">llvm-mca<\/a> and the ability to try and quickly evaluate possible changes before testing them. Seeing as the code in question was nicely self-contained, I could get it to a point where I could easily have <code>gcc<\/code> (or <code>llvm<\/code>) spit out assembler for it separately from the kernel tree. My preference was for gcc, as that&#8217;s what most distros end up compiling Linux with, including the Linux distribution that&#8217;s my day job (Amazon Linux).<\/p>\n\n\n\n<p>In order to share the results of the experiments as part of the discussion on where the code changes should end up, I published the code and results in a <a href=\"https:\/\/github.com\/stewartsmith\/inet6_hashfn-sim\/\" target=\"_blank\" rel=\"noreferrer noopener\">GitHub project<\/a>, as things got way too large to throw on a mailing list post and retain sanity.<\/p>\n\n\n\n<p>I used a container so that I could easily run it in a repeatable, isolated environment, as well as have others reproduce my results if needed. Different compiler versions and optimization levels will very much produce different sequences of instructions, and thus possibly <em>quite<\/em> different results. This delta in compiler versions and optimization levels is partially why the numbers don&#8217;t quite match on some of the mailing list messages, although the delta of the various options was all the same. 
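<\/p>\n\n\n\n<p>The extract-and-analyze flow above can be sketched like so; the toy function, file names, and <code>-mcpu<\/code> value below are illustrative stand-ins of my own, <em>not<\/em> the actual kernel hash code:<\/p>\n\n\n\n

```c
/* toy.c - an illustrative, branch-free stand-in function
 * (NOT the kernel's IPv6 hash code from this investigation).
 *
 * Generate assembler:   gcc -O2 -S toy.c -o toy.s
 * Statically analyze:   llvm-mca -mcpu=skylake toy.s
 *
 * To limit analysis to one region of a larger .s file, wrap that
 * region in "# LLVM-MCA-BEGIN name" / "# LLVM-MCA-END" comments.
 */
#include <stdint.h>

uint32_t fold_words(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    /* Straight-line arithmetic only: no branches, so llvm-mca's
     * static model doesn't have to guess at branch behaviour. */
    uint32_t h = a * 2654435761u;
    h ^= b + 0x9e3779b9u + (h << 6) + (h >> 2);
    h ^= c + 0x9e3779b9u + (h << 6) + (h >> 2);
    h ^= d + 0x9e3779b9u + (h << 6) + (h >> 2);
    return h;
}
```

\n\n\n\n<p>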
The other reason is learning how to better use <code>llvm-mca<\/code> to isolate the <em>exact<\/em> sequence of instructions I cared about (and not including things like the guesswork that <code>llvm-mca<\/code> has to do for branches).<\/p>\n\n\n\n<p>One trick I learned along the way for getting the results I was looking for is to very much avoid branches, as they&#8217;re going to be near <em>complete guesswork<\/em>: there&#8217;s no simulation of the branch predictor (at least in the version I was using).<\/p>\n\n\n\n<p>The big thing I wanted to prove: would doing the extra work have a small or a large impact on the number of elapsed cycles? The answer was that <strong>doing a bunch of extra &#8220;work&#8221; was essentially near free<\/strong>. The CPU core could execute enough things in parallel that the incremental cost of doing extra work just&#8230; wasn&#8217;t relevant.<\/p>\n\n\n\n<p>This helped get a patch deployed without impact to performance, as well as get a patch upstream, fixing an issue that was partially fixed 10 years prior and had existed since day 1 of the Linux IPv6 code.<\/p>\n\n\n\n<p>Naturally, this wasn&#8217;t a solo effort, and that&#8217;s one of the joys of working with a bunch of smart people &#8211; both at the same company I work for, and in the broader open source community. 
It&#8217;s always humbling when you&#8217;re looking at code outside your usual area of expertise that was written (and then modified) by Really Smart People, and you&#8217;re then trying to fix a problem in it, while trying to learn all the implications of changing that bit of code.<\/p>\n\n\n\n<p>Anyway, check out<code> llvm-mca<\/code> for your next adventure into premature optimization, as if you&#8217;re going to get started with evil, you may as well start with what&#8217;s at the root of all of it.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Way back in the distant past, when the Apple ][ and the Commodore 64 were king, you could read the manual for a microprocessor and see how many CPU cycles each instruction took, and then do the math as to &hellip; <a href=\"https:\/\/www.flamingspork.com\/blog\/2024\/01\/13\/using-llvm-mca-for-predicting-cpu-cycle-impact-of-code-changes\/\">Continue reading <span class=\"meta-nav\">&rarr;<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2}},"categories":[76,1,7],"tags":[633,779,589,141,625,734,39],"class_list":["post-4896","post","type-post","status-publish","format-standard","hentry","category-code","category-general","category-work-et-al","tag-code","tag-ipv6","tag-kernel","tag-linux","tag-linux-kernel","tag-optimization","tag-programming"],"jetpack_publicize_connections":[],"jetpack_fe
atured_media_url":"","jetpack_sharing_enabled":true,"jetpack_shortlink":"https:\/\/wp.me\/p5a6n8-1gY","jetpack-related-posts":[{"id":1719,"url":"https:\/\/www.flamingspork.com\/blog\/2009\/10\/10\/how-many-cpu-cycles-does-a-sql-query-take-or-pagefaults-caused-or-l2-cache-misses-or-cpu-migrations\/","url_meta":{"origin":4896,"position":0},"title":"How many CPU cycles does a SQL query take? (or pagefaults caused&#8230; or L2 cache misses&#8230; or CPU migrations&#8230;)","author":"Stewart Smith","date":"2009-10-10","format":false,"excerpt":"I like profilers. I use them when trying to make software (such as Drizzle) faster. Many profilers suck - and pretty much all of them are impossible to attach to a running system. Two notable exceptions are oprofile and dtrace (for Linux and Solaris respectively). The downside of oprofile is\u2026","rel":"","context":"In &quot;drizzle&quot;","block_context":{"text":"drizzle","link":"https:\/\/www.flamingspork.com\/blog\/category\/work-et-al\/drizzle-work-et-al\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3775,"url":"https:\/\/www.flamingspork.com\/blog\/2014\/07\/17\/openpower-firmware-up-on-github\/","url_meta":{"origin":4896,"position":1},"title":"OpenPower firmware up on github!","author":"Stewart Smith","date":"2014-07-17","format":false,"excerpt":"With the whole OpenPower thing, a lot of low level firmware is being open sourced, which is really exciting for the platform - the less proprietary code sitting in memory the better in my books. 
If you go to https:\/\/github.com\/open-power you'll see code for a bunch of the low level\u2026","rel":"","context":"In &quot;code&quot;","block_context":{"text":"code","link":"https:\/\/www.flamingspork.com\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3965,"url":"https:\/\/www.flamingspork.com\/blog\/2015\/06\/12\/gcov-code-coverage-for-openpower-firmware\/","url_meta":{"origin":4896,"position":2},"title":"gcov code coverage for OpenPower firmware","author":"Stewart Smith","date":"2015-06-12","format":false,"excerpt":"For skiboot (which provides the OPAL boot and runtime firmware for OpenPower machines), I've been pretty interested at getting some automated code coverage data for booting on real hardware (as well as in a simulator). Why? Well, it's useful to see that various test suites are actually testing what you\u2026","rel":"","context":"In &quot;code&quot;","block_context":{"text":"code","link":"https:\/\/www.flamingspork.com\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3884,"url":"https:\/\/www.flamingspork.com\/blog\/2014\/10\/14\/mysql-5-7-5-on-power-thread-priority\/","url_meta":{"origin":4896,"position":3},"title":"MySQL 5.7.5 on POWER &#8211; thread priority","author":"Stewart Smith","date":"2014-10-14","format":false,"excerpt":"Good news everyone! MySQL 5.7.5 is out with a bunch more patches for running well on POWER in the tree. I haven't yet gone and tried it all out, but since I'm me, I look at bugs database and git\/bzr history first. 
On Intel CPUs, when you're spinning on a\u2026","rel":"","context":"In &quot;IBM&quot;","block_context":{"text":"IBM","link":"https:\/\/www.flamingspork.com\/blog\/category\/work-et-al\/ibm-work-et-al\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":3254,"url":"https:\/\/www.flamingspork.com\/blog\/2013\/03\/12\/is-mysql-bigger-than-linux\/","url_meta":{"origin":4896,"position":4},"title":"Is MySQL bigger than Linux?","author":"Stewart Smith","date":"2013-03-12","format":false,"excerpt":"I'm going to take the numbers from my previous post, MySQL Modularity, Are We There Yet? for the \"kernel\" size of MySQL - that is, everything that isn't a plugin or storage engine. For Linux kernel, I'm just going to use the a-bit-old git tree I have on my laptop.\u2026","rel":"","context":"In &quot;code&quot;","block_context":{"text":"code","link":"https:\/\/www.flamingspork.com\/blog\/category\/code\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]},{"id":715,"url":"https:\/\/www.flamingspork.com\/blog\/2006\/06\/14\/arjens-mysql-community-journal-hyperthreading-not-on-a-mysql-server\/","url_meta":{"origin":4896,"position":5},"title":"Arjen&#8217;s MySQL Community Journal &#8211; HyperThreading? Not on a MySQL server&#8230;","author":"Stewart Smith","date":"2006-06-14","format":false,"excerpt":"Arjen's MySQL Community Journal - HyperThreading? Not on a MySQL server... I blame the Linux Process Scheduler. At least it's better than the earlier 2.6 days where things would get shunted a lot from one \"cpu\" to the other \"cpu\" for no real reason. 
Newer kernel verisons are probably better...\u2026","rel":"","context":"In &quot;linux-kernel&quot;","block_context":{"text":"linux-kernel","link":"https:\/\/www.flamingspork.com\/blog\/category\/linux-kernel\/"},"img":{"alt_text":"","src":"","width":0,"height":0},"classes":[]}],"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/4896","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/comments?post=4896"}],"version-history":[{"count":2,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/4896\/revisions"}],"predecessor-version":[{"id":4902,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/posts\/4896\/revisions\/4902"}],"wp:attachment":[{"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/media?parent=4896"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/categories?post=4896"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.flamingspork.com\/blog\/wp-json\/wp\/v2\/tags?post=4896"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}