{"id":5228,"date":"2025-05-16T20:02:45","date_gmt":"2025-05-16T12:02:45","guid":{"rendered":"https:\/\/cicserver.com\/pushing-ai-system-cooling-to-the-limits-without-immersion\/"},"modified":"2025-05-16T20:02:45","modified_gmt":"2025-05-16T12:02:45","slug":"pushing-ai-system-cooling-to-the-limits-without-immersion","status":"publish","type":"post","link":"https:\/\/cicserver.com\/de\/pushing-ai-system-cooling-to-the-limits-without-immersion\/","title":{"rendered":"Pushing AI System Cooling To The Limits Without Immersion"},"content":{"rendered":"<p><br \/>\n<\/p>\n<div>\n<figure class=\"entry-thumbnail\">\n<img decoding=\"async\" src=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-rack-logo-824x438.jpg\" alt=\"\" title=\"supermicro-xai-rack-logo\"\/><br \/>\n<\/figure>\n<p>Here is a question for you. What is harder to get right now: 1,665 of Nvidia\u2019s \u201cBlackwell\u201d B200 GPU compute engines or 10 megawatts of power for a four year contract in the Northeast region of the United States?<\/p>\n<p>Without question, it is the latter, not the former, and both will cost on the order of $66 million.<\/p>\n<p>The fun bit is that those GPUs will probably actually take 13.4 megawatts of juice to operate in a GB200 NVL72 rackscale system configuration, and that means they will burn around 88.5 megawatts over four years. And if you don\u2019t need a rackscale coherent memory domain for the GPUs because you are using the GPU machinery for AI training instead of inference (which operates at a scale of tens of thousands of GPUs), you will burn about the same power but you can do it with twice as much space and half the power density.<\/p>\n<p>Here is another fun bit about modern AI datacenters: If you can\u2019t prove that you have the power allocated to you, and in a datacenter that is designed to handle the density of the system, Nvidia will not sell the GPUs to you until you can prove you have power. 
And, the word on the street last week, when we spoke at a conference at the NASDAQ exchange in New York City focused on AI in the financial services industry, was that power companies are now trying to stretch their gigawatts of power generation and are increasingly looking at how you are distributing power and doing the cooling in an AI datacenter before they make their allocations.<\/p>\n<p>Increasingly, if you can\u2019t prove you are using the power wisely, you don\u2019t get it, or you don\u2019t get as much as you want.<\/p>\n<p>Add to all of this the fact that compute density is necessary in an AI system running chain of thought models because these require coherent memory links between GPUs with super-low latency for AI inference, and we are in a situation where direct liquid cooling is not just inevitable in the future, but absolutely necessary right now. And a lot of datacenters are not used to it, and those that were way back in the IBM System\/360 and System\/370 mainframe days five and six decades ago have not had liquid-cooled iron in their datacenters for a long time.<\/p>\n<p>Which is why companies like Supermicro have to push the envelope on direct liquid cooling for their GPU-accelerated systems.<\/p>\n<p>\u201cAll of the customers that we talk to are thinking in terms of how many GPUs they can power and cool per megawatt,\u201d Michael McNerney, senior vice president of marketing and network security at Supermicro, tells <em>The Next Platform<\/em>. \u201cThey tell us how many megawatts, and they want the maximum number of GPUs possible. 
The conversations are about GPU density and GPUs per megawatt, and it is not about how much money they can save on power but getting more GPUs to throw at the AI workload.\u201d<\/p>\n<p>Supermicro developed its first generation of direct liquid cooling, with cold plates on the CPUs and GPUs of its eight-GPU servers based on Nvidia\u2019s \u201cHopper\u201d H100 GPUs, in the fall of 2023, when it first became apparent that some of the cooling techniques that had been used for several years in HPC systems needed to go mainstream in AI systems. Supermicro designed and manufactured the whole DLC system, including the cold plates, the coolant distribution units (CDUs) in the racks, and the chillers that provide cool water back to the equipment in the racks.<\/p>\n<p>Notably, half of <a href=\"https:\/\/www.nextplatform.com\/2024\/07\/30\/so-who-is-building-that-100000-gpu-cluster-for-xai\/\">the \u201cColossus\u201d system<\/a> at xAI in its datacenter in Memphis \u2013 comprising a total of 50,000 H100 GPUs \u2013 was built by Supermicro using its DLC-1 technology. 
The other half of the system (with another 50,000 H100s) was built by Dell and is only air-cooled.<\/p>\n<p><a href=\"http:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-colossus-racks.jpg\" rel=\"attachment wp-att-145770\"><img fetchpriority=\"high\" decoding=\"async\" class=\"aligncenter size-full wp-image-145770\" src=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-colossus-racks.jpg\" alt=\"\" width=\"1042\" height=\"697\" srcset=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-colossus-racks.jpg 1042w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-colossus-racks-768x514.jpg 768w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-xai-colossus-racks-600x401.jpg 600w\" sizes=\"(max-width: 1042px) 100vw, 1042px\"\/><\/a><\/p>\n<p>Those nodes in the Colossus machines have a pair of CPUs as well as eight of the H100 GPUs. The server nodes also have eight ConnectX-7 network interface cards (one for each GPU) as well as a pair of lower-speed Ethernet interface cards for system management, PCI-Express switches for linking the GPU complex to the CPUs and the on-node storage, and a number of other components. The DLC-1 system used water that was 30 degrees Celsius and could remove somewhere north of 70 percent of the heat from the system, which was a big improvement in efficiency and power savings. 
The CDUs in the DLC-1 setup were rated at 100 kilowatts.<\/p>\n<p>But given the dearth of power out there around the globe, and its expense, Supermicro pushed harder with the DLC-2 liquid cooling system announced this week and debuting with the Blackwell B200 GPU nodes.<\/p>\n<p>Here is what one of these new 4U nodes with the DLC-2 cooling looks like:<\/p>\n<p><a href=\"http:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-sys-422GS-server.jpg\" rel=\"attachment wp-att-145769\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-145769\" src=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-sys-422GS-server.jpg\" alt=\"\" width=\"771\" height=\"411\" srcset=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-sys-422GS-server.jpg 771w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-sys-422GS-server-768x409.jpg 768w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-sys-422GS-server-600x320.jpg 600w\" sizes=\"(max-width: 771px) 100vw, 771px\"\/><\/a><\/p>\n<p>Technically, using the Supermicro naming convention, this machine above is the SYS-422GS-NBRT-LCC. 
The CDUs are more efficient and can deliver 250 kilowatts of cooling flow, and importantly can run on liquid that is as warm as 45 degrees Celsius, which means the liquid can be cooled with outside cooling towers instead of chillers, cutting back on overall power requirements.<\/p>\n<p>With the DLC-2 setup in the B200 HGX SuperServer, not only do the pair of Intel Xeon 6 CPUs and the eight Blackwell B200 GPUs have cold plates, but the main memory DIMMs, the PCI-Express switches in the node, the power supplies, and the voltage regulators are all equipped with cold plates to remove their heat directly, too.<\/p>\n<p><a href=\"http:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-racks.jpg\" rel=\"attachment wp-att-145768\"><img decoding=\"async\" class=\"aligncenter size-full wp-image-145768\" src=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-racks.jpg\" alt=\"\" width=\"875\" height=\"619\" srcset=\"https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-racks.jpg 875w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-racks-768x543.jpg 768w, https:\/\/www.nextplatform.com\/wp-content\/uploads\/2025\/05\/supermicro-dlc-2-racks-600x424.jpg 600w\" sizes=\"(max-width: 875px) 100vw, 875px\"\/><\/a><\/p>\n<p>And with the HGX B300 systems that Supermicro will ship later this year, the ConnectX-7 and later network interface cards will also be liquid-cooled, and thus around 98 percent of the heat generated by the system will be removed with liquid, not air. The SuperServer B300 node, in fact, will only have two small fans, and it won\u2019t make much noise at all.<\/p>\n<p>The upshot of this is that the GPU systems using the DLC-2 cooling will use 40 percent less power for cooling than the completely air-cooled HGX H100 systems from only two years ago. The power usage effectiveness of the racks using the DLC-2 setup will also be driven very low. 
A normal, legacy rack in an enterprise datacenter has a PUE of 1.6 to 2.0, which means the facility burns 1.6X to 2X the power consumed by the computational units doing the work, with the extra power mostly going to cooling. With DLC-1, the Supermicro racks were down to about 1.2 for a PUE, and the target for DLC-2 is a very low 1.02 PUE.<\/p>\n<p>And the noise level for DLC-2 racks drops down to around 50 dB compared to around 75 dB for the DLC-1 racks. Normal conversation is around 60 dB, and heavy traffic (outside the car) is around 85 dB. A rock concert is on the order of 120 dB and a jet engine at takeoff is 140 dB.<\/p>\n<p>The only way to get more efficient with cooling an AI system is to dunk it in a bath of baby oil or some other coolant that doesn\u2019t wreck computer components. And that is a very heavy solution, so to speak.<\/p>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>Here is a question for you. What is harder to get right now: 1,665 of Nvidia\u2019s \u201cBlackwell\u201d B200 GPU compute engines or 10 megawatts of power on a four-year contract in the Northeast region of the United States? Without question, it is the latter, not the former, and both will cost on the order [&hellip;]<\/p>","protected":false},"author":3,"featured_media":5229,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[1],"tags":[],"class_list":{"0":"post-5228","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-blog"},"_links":{"self":[{"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/posts\/5228","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/users\/3"}],"replies":[{"embeddable":true,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/comments?post=5228"}],"version-history":[{"count":0,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/posts\/5228\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/media\/5229"}],"wp:attachment":[{"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/media?parent=5228"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/categories?post=5228"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cicserver.com\/de\/wp-json\/wp\/v2\/tags?post=5228"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}