{"id":5402,"date":"2025-05-17T02:12:54","date_gmt":"2025-05-16T18:12:54","guid":{"rendered":"https:\/\/cicserver.com\/pliops-bypasses-hbm-limits-for-gpu-servers-blocks-and-files\/"},"modified":"2025-05-17T02:12:54","modified_gmt":"2025-05-16T18:12:54","slug":"pliops-bypasses-hbm-limits-for-gpu-servers-blocks-and-files","status":"publish","type":"post","link":"https:\/\/cicserver.com\/de\/pliops-bypasses-hbm-limits-for-gpu-servers-blocks-and-files\/","title":{"rendered":"Pliops bypasses HBM limits for GPU servers \u2013 Blocks and Files"},"content":{"rendered":"<p><br \/>\n<\/p>\n<div>\n            <!-- image --><\/p>\n<div class=\"td-post-featured-image\"><a href=\"https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser.jpg\" data-caption=\"\"><img fetchpriority=\"high\" decoding=\"async\" width=\"696\" height=\"357\" class=\"entry-thumb td-modal-image\" src=\"https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser-696x357.jpg\" srcset=\"https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser-696x357.jpg 696w, https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser-300x154.jpg 300w, https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser-768x394.jpg 768w, https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser-819x420.jpg 819w, https:\/\/blocksandfiles.com\/wp-content\/uploads\/2025\/05\/Pliops-teaser.jpg 950w\" sizes=\"(max-width: 696px) 100vw, 696px\" alt=\"\" title=\"Pliops teaser\"\/><\/a><\/div>\n<p>            <!-- content --><\/p>\n<p>Key-value accelerator card provider Pliops has unveiled the FusIOnX stack as an end-to-end AI inference offering based on its XDP LightningAI card.<\/p>\n<p>Pliops\u2019 <a href=\"https:\/\/blocksandfiles.com\/2024\/10\/02\/pliops-xdp-lightningai\/\">XDP LightningAI<\/a> PCI card and software augment the high-bandwidth memory (HBM) memory tier for GPU servers and accelerate vLLMs on Nvidia Dynamo by 2.5x. UC Berkeley\u2019s open source virtual large language model (vLLM) library for inferencing and serving uses a key-value (KV) cache as a short-term memory for batching user responses. Nvidia\u2019s <a href=\"https:\/\/www.theregister.com\/2025\/03\/23\/nvidia_dynamo\/\">Dynamo<\/a> framework is open source software to optimize inference engines such as TensorRT LLM and vLLM. The <a href=\"https:\/\/blocksandfiles.com\/2024\/10\/02\/pliops-xdp-lightningai\/\">XDP LightningAI<\/a> is a PCIe add-in card and functions as a memory tier for GPU servers. 
[Slide: Pliops XDP Acceleration Platform]

Pliops says GPU servers have limited amounts of HBM. Its technology is aimed at the situation where a model’s context window – its set of in-use tokens – grows so large that it overflows the available HBM capacity, and evicted contexts have to be recomputed. The model becomes memory-limited, and its execution time climbs as the context window grows.

By storing the already-computed contexts on fast-access SSDs and retrieving them when needed, the model’s overall run time is cut compared with recomputing those contexts. Users can get more HBM capacity by buying more GPU servers, but the cost is high; bulking out HBM with a sub-HBM storage tier is much less expensive and, we understand, almost as fast. The XDP LightningAI card with FusIOnX software provides, Pliops says, “up to 8x faster end-to-end GPU inference.”
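Pliops has not published the internals of this offload path, but the idea maps onto a familiar two-tier cache pattern. The sketch below is a hypothetical illustration, not the FusIOnX API: a small “HBM” tier spills evicted KV blocks to a stand-in for an NVMe-backed key-value store instead of dropping them, so a returning context is reloaded rather than recomputed.

```python
# Hypothetical sketch of the tiering idea Pliops describes (not FusIOnX code):
# keep hot KV-cache blocks in "HBM", and instead of discarding evicted blocks,
# spill them to an SSD-backed key-value store so they can be reloaded far
# more cheaply than re-running the prefill.
from collections import OrderedDict

class TieredKVCache:
    def __init__(self, hbm_capacity):
        self.hbm = OrderedDict()   # hot tier: limited, fast
        self.ssd = {}              # stand-in for an NVMe/RDMA key-value store
        self.hbm_capacity = hbm_capacity

    def put(self, context_id, kv_block):
        self.hbm[context_id] = kv_block
        self.hbm.move_to_end(context_id)
        while len(self.hbm) > self.hbm_capacity:
            victim, block = self.hbm.popitem(last=False)  # evict LRU entry
            self.ssd[victim] = block   # spill instead of dropping

    def get(self, context_id):
        if context_id in self.hbm:                 # HBM hit
            self.hbm.move_to_end(context_id)
            return self.hbm[context_id]
        if context_id in self.ssd:                 # SSD hit: reload, no recompute
            self.put(context_id, self.ssd.pop(context_id))
            return self.hbm[context_id]
        return None                                # miss: recompute the prefill

cache = TieredKVCache(hbm_capacity=2)
for cid in ("user-a", "user-b", "user-c"):         # third insert evicts user-a
    cache.put(cid, f"kv-blocks-for-{cid}")
assert cache.get("user-a") is not None             # served from SSD, not recomputed
```

The economics rest on the miss path: re-reading a spilled block from flash costs milliseconds, while recomputing a long prefill on the GPU can cost far more.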
Think of FusIOnX as glue for the AI stack. Pliops provides several examples:

- FusIOnX vLLM production stack: Pliops vLLM KV-cache acceleration, smart routing across multiple GPU nodes (a routing sketch follows below), and upstream vLLM compatibility.
- FusIOnX vLLM + Dynamo + SGLang BASIC: Pliops vLLM, Dynamo, and KV-cache acceleration integration, smart routing across multiple GPU nodes, and single- or multi-node support.
- FusIOnX KVIO: key-value I/O connectivity to GPUs, and distributed key-value over the network for scale – serving any GPU in a server, with support for RAG/vector-database applications on CPU servers coming soon.
- FusIOnX KV Store: the XDP AccelKV key-value store, XDP RAIDplus self-healing, and distributed key-value over the network for scale – serving any GPU in a server, with support for RAG/vector-database applications on CPU servers coming soon.

[Diagram: Pliops FusIOnX configurations]

The card can be used to accelerate one or more GPU servers hooked up to a storage array or other stored-data resource, or it can run in a hyperconverged all-in-one mode: installed in a GPU server, providing storage through its 24 SSD slots, and accelerating inference – an LLM in a box, as Pliops describes that configuration.

[Slide: Pliops “one giant cache” concept]

Pliops’ PCIe add-in-card approach feeds the GPUs with the model’s bulk data independently of both the storage system and the GPU supplier. The XDP LightningAI card runs in a 2RU Dell server with 24 SSD slots.
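Pliops has not described its smart-routing algorithm, but a common approach in multi-node KV-cache serving is to send requests that share a prompt prefix to the same node, so that node’s cached prefill gets reused. The sketch below illustrates that pattern with hypothetical node names; it is a guess at the technique, not FusIOnX code.

```python
# Hedged guess at what "smart routing" across GPU nodes could look like
# (Pliops has not published its algorithm): hash a prompt's prefix so that
# requests sharing a prefix land on the node whose KV cache already holds it.
import hashlib

NODES = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]  # hypothetical node names
PREFIX_CHARS = 256  # route on the first chunk of the prompt only (approximate)

def route(prompt: str) -> str:
    prefix = prompt[:PREFIX_CHARS]
    digest = hashlib.sha256(prefix.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

shared_system_prompt = "You are a helpful assistant. " * 20
a = route(shared_system_prompt + "Summarize this contract.")
b = route(shared_system_prompt + "Translate this paragraph.")
assert a == b  # same prefix -> same node -> cached prefill is reused
```

A production router would also weigh node load and fall back when the preferred node is saturated, but prefix affinity is the core idea.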
Pliops says its technology accelerates the standard vLLM production stack by 2.5x in terms of requests per second.

[Chart: Pliops vLLM requests-per-second acceleration]

The XDP LightningAI-based FusIOnX LLM and GenAI stack is in production now. It provides “inference acceleration via efficient and scalable KVCache storage, and KV-Cache Disaggregation (for Prefill/Decode nodes separation)” and has a “shared, super-fast Key-Value Store, ideal for storing long-term memory for LLM architectures like Google’s [Titans](https://arxiv.org/abs/2501.00663).”

Three more FusIOnX stacks are coming. FusIOnX RAG and Vector Databases is at the proof-of-concept stage and should provide index-building and retrieval acceleration.

FusIOnX GNN is in development and will store and retrieve node embeddings for large graph neural network (GNN) applications. A FusIOnX DLRM (deep learning recommendation model) stack is also in development and should provide a “simplified, superfast storage pipeline with access to TBs-to-PBs scale embedding entities.”

### Comment

There are various AI workload acceleration products from other suppliers. [GridGain’s software](https://blocksandfiles.com/2024/05/10/generative-ai-and-gridgain-in-memory-data/) lets a cluster of servers share memory, so applications can use more memory than any single server supports. It provides a distributed memory space atop a cluster, or grid, of x86 servers with a massively parallel architecture; AI is one more workload it can carry.

GridGain for AI can support RAG applications, enabling the creation of relevant prompts for language models from enterprise data. It stores both structured and unstructured data, with support for vector search, full-text search, and SQL-based structured-data retrieval, and it integrates with open source and publicly available libraries (LangChain, Langflow) and language models. A [blog post](https://www.gridgain.com/resources/blog/using-gridgain-ai-unified-data-store-online-ai-applications) can tell you more.
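As a generic illustration of the RAG retrieval step such platforms support – this is not GridGain’s API – the sketch below ranks stored passages by cosine similarity to a query embedding and splices the winners into a prompt; the random vectors stand in for a real embedding model.

```python
# Generic sketch of the RAG pattern described above (illustrative only;
# not GridGain's API): retrieve the stored passages nearest to the query
# embedding and splice them into the prompt sent to the language model.
import numpy as np

rng = np.random.default_rng(1)
passages = ["Q3 revenue grew 12%.", "The data center opened in Madrid.",
            "Support tickets fell by a third."]
# Stand-in embeddings; a real system would use an embedding model.
store = rng.standard_normal((len(passages), 8))
store /= np.linalg.norm(store, axis=1, keepdims=True)

def retrieve(query_vec, k=2):
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = store @ query_vec                 # cosine similarity per passage
    top = np.argsort(scores)[::-1][:k]         # best k matches
    return [passages[i] for i in top]

query_vec = rng.standard_normal(8)
context = "\n".join(retrieve(query_vec))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: ..."
print(prompt)
```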
Three further alternatives are Hammerspace’s [Tier Zero](https://blocksandfiles.com/2024/11/14/hammerspace-opens-up-fast-local-gpu-server-nvme-storage-to-speed-ai-training/) scheme, WEKA’s [Augmented Memory Grid](https://www.weka.io/blog/ai-ml/new-augmented-memory-grid-revolutionizes-the-economics-of-ai-inference-infrastructure/), and VAST Data’s [VUA](https://blocksandfiles.com/2025/04/25/vasts-vua-flash-caching-virtually-expands-gpu-server-memory-for-ai-token-generation/) (VAST Undivided Attention); all three support Nvidia’s GPUDirect protocols.