AI HPC cluster build
Project Summary:
Build a brand new HPC cluster to support AI research at Harvard Medical School.
Primary Driver:
The Dean of HMS wished to build an HPC cluster for the design and training of AI models for medical research. This was to coincide with the Dean’s innovation awards for the year, a set of projects funded directly by the Dean’s office. The project had a hard budget and a tight timeline for availability. There was little direction given on how the cluster was to be built; only the end goal was specified.
Decision factors:
- Time
- Budget
- Capability
The first and clearest factor in deciding what the system would look like was time. The Dean only gave us 3 months to get the system up and running. Given that we hadn’t even put out a call for proposals at that point, 3 months was a significant obstacle. We had to find someone who could ship AI-capable systems in a short period of time. This happened to coincide with a major shortage in GPU hardware, making the selection even more difficult.
Budget is always a factor, and the Dean had set a hard limit on what we could spend. Given the visibility and time pressure of the project, vendor discussions were conducted at the highest level of the IT organization.
The system had to be capable of handling the types of AI workloads the research community was submitting as projects. These included AI training and data cleanup projects that would require significant computing power, primarily GPU resources.
Ultimately, the only vendor we could find that could meet both the timeline and the budget was NVIDIA themselves. Thus we ended up with a DGX half pod delivered in less than 30 days and on budget.
A note about the cloud:
The question was raised: why not just do all of this in the cloud? Most of the reasons not to stemmed from budget. At the time of the cluster build, persistent GPU resources in the cloud were prohibitively expensive. If you are running inconsistent, bursty workloads the cloud can make a lot of sense, but we were anticipating running the hardware at full burn. This did turn out to be the case, so the thinking there was sound. There is also the question of ingress/egress charges. The datasets the researchers were proposing to use were large, often big collections of images or other binary data. The ingress and storage costs for that sort of data can be expensive, and we already had a large, extremely high-performing storage system on premises. Given the potentially large and variable costs associated with the cloud and the existing resources on prem, hosting this project in the cloud just didn’t make sense.
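To give a feel for that reasoning, here is a minimal back-of-envelope comparison sketched in Python. Every number in it is a hypothetical placeholder, not an actual figure from this project, and it assumes the fully utilized, always-on usage pattern described above.

```python
# Hypothetical back-of-envelope comparison: amortized on-prem GPU cluster
# vs. equivalent always-on cloud GPU instances. All figures are placeholder
# assumptions for illustration, not the actual project numbers.

HOURS_PER_YEAR = 24 * 365

# Assumed on-prem costs (illustrative only)
onprem_capital = 2_000_000            # hardware purchase, amortized below
onprem_years = 4                      # amortization period
onprem_power_cooling_per_year = 150_000
onprem_yearly = onprem_capital / onprem_years + onprem_power_cooling_per_year

# Assumed cloud costs (illustrative only)
gpu_nodes = 20
cloud_rate_per_node_hour = 25.0       # assumed rate for an 8-GPU instance
utilization = 1.0                     # "full burn", as this cluster actually ran
cloud_compute_yearly = gpu_nodes * cloud_rate_per_node_hour * HOURS_PER_YEAR * utilization

# Assumed data costs (illustrative only)
dataset_tb = 500
storage_rate_per_tb_month = 20.0
cloud_storage_yearly = dataset_tb * storage_rate_per_tb_month * 12

print(f"on-prem / year : ${onprem_yearly:,.0f}")
print(f"cloud   / year : ${cloud_compute_yearly + cloud_storage_yearly:,.0f}")
```

With sustained utilization the cloud figure scales with every hour of the year, while the on-prem figure is fixed once the hardware is bought; that asymmetry is what drove the decision, whatever the exact rates.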
Major technical challenges:
Before we could even receive hardware, we had to prep the datacenter space, networks, and storage systems for the new cluster. This involved creating new subnets, new storage exports, and moving equipment around to free up rack space. There was also the issue of power and cooling: we had to upgrade a number of racks in our datacenter to support the NVIDIA DGX nodes. Each node, running 8 V200 computational GPU cards, pulled significantly more power than anything else in that datacenter space at the time.
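The power planning boils down to simple arithmetic: how many nodes a rack can feed with headroom to spare. The sketch below shows that calculation with assumed placeholder values for per-node draw and rack capacity, not the actual figures from our datacenter.

```python
# Hypothetical rack power sanity check. Per-node draw and rack capacity are
# placeholder assumptions for illustration, not actual datacenter figures.

NODE_DRAW_KW = 6.5        # assumed draw of one 8-GPU DGX-class node under load
RACK_CAPACITY_KW = 17.0   # assumed usable power per upgraded rack
HEADROOM = 0.8            # keep 20% headroom for spikes and redundancy

def nodes_per_rack(node_kw: float, rack_kw: float, headroom: float) -> int:
    """Maximum nodes that fit a rack's power budget with the given headroom."""
    return int((rack_kw * headroom) // node_kw)

def racks_needed(total_nodes: int) -> int:
    """Racks required for the whole cluster, rounded up."""
    per_rack = nodes_per_rack(NODE_DRAW_KW, RACK_CAPACITY_KW, HEADROOM)
    return -(-total_nodes // per_rack)   # ceiling division

if __name__ == "__main__":
    print(f"nodes per rack     : {nodes_per_rack(NODE_DRAW_KW, RACK_CAPACITY_KW, HEADROOM)}")
    print(f"racks for 20 nodes : {racks_needed(20)}")
```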
Basic rollout plan:
The rollout plan was largely dictated by the requirements of the system. Our vendor/expeditor team was instrumental in getting the system set up, as they had experience with the platform that we did not. In addition to the hardware and system configuration, we needed an entire software stack built out for the new cluster. It was decided that, in parallel with the system build, the Research Computing Consulting team would build out the new software so it would be ready to install.
Once the hardware arrived, we worked closely with a local expeditor to get the systems installed and cabled. The system as designed by NVIDIA is a turnkey solution with a number of Ethernet and InfiniBand networks that needed careful consideration and planning to get right. After everything was cabled, the head nodes had to be set up and installed so we could deploy the rest of the cluster from them. The BCM software used by the cluster was new to our team and represented a steep learning curve for our administration staff. Introducing new hardware along with new software and a new management stack is asking a lot in a short period of time. The team worked well together, and we were able to deploy the initial system in just a few days. The tweaking and refinement took another few days to get everything right and how we wanted it. This included not just node configuration but switch configs, storage updates, and firewall changes to get everything working the way it needed to.
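Part of that refinement was simply confirming that every imaged node actually exposed its GPUs. The sketch below shows the kind of check involved; it assumes passwordless SSH to the compute nodes and a hypothetical dgxNN naming scheme, and is illustrative rather than the exact tooling we used.

```python
# Illustrative post-deployment check: confirm each compute node reports the
# expected number of GPUs via nvidia-smi. Assumes passwordless SSH and a
# hypothetical dgxNN naming scheme; not the exact tooling used on this cluster.

import subprocess

NODES = [f"dgx{i:02d}" for i in range(1, 21)]  # hypothetical node names
EXPECTED_GPUS = 8

def gpu_count(node: str) -> int:
    """Ask nvidia-smi on the node how many GPUs it can see."""
    out = subprocess.run(
        ["ssh", node, "nvidia-smi", "--query-gpu=count", "--format=csv,noheader"],
        capture_output=True, text=True, timeout=30, check=True,
    )
    # nvidia-smi prints the total count once per GPU, so take the first line
    return int(out.stdout.strip().splitlines()[0])

if __name__ == "__main__":
    for node in NODES:
        try:
            count = gpu_count(node)
            status = "OK" if count == EXPECTED_GPUS else f"expected {EXPECTED_GPUS}"
            print(f"{node}: {count} GPUs ({status})")
        except Exception as exc:
            print(f"{node}: check failed ({exc})")
```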
The final major hurdle to the rollout was an intensive security review. Fortunately, HMS has a talented internal security team that was able to turn around a full assessment in less than a week. Once the system was cleared for the required data level, we started migrating the applications the RCC team had been building over to the new cluster.
Build results and onboarding:
With a lot of dedication, hard work, and late nights, we were able to purchase, build, install, and make ready an entire HPC environment in less than 90 days, allowing us to onboard users right as the Dean’s awards were granted. While there were a number of growing pains for new users, largely around software they didn’t know they needed, the response was fantastic. Users were able to connect to and utilize the new system, and the new GPU hardware worked as expected. Within a month the system was running at full tilt.
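Most of those early growing pains were environment issues rather than hardware problems. A minimal sanity check of the kind a new user might run to confirm they can see the GPUs, assuming a PyTorch-based environment (just one illustrative case, not a statement about what the funded projects actually used):

```python
# Minimal user-side GPU sanity check, assuming a PyTorch environment.
# Illustrative only; other frameworks have equivalent checks.

import torch

if torch.cuda.is_available():
    n = torch.cuda.device_count()
    print(f"CUDA available with {n} GPU(s)")
    for i in range(n):
        print(f"  [{i}] {torch.cuda.get_device_name(i)}")
    # quick end-to-end test: a small matrix multiply on the first GPU
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x
    print("matmul OK:", tuple(y.shape))
else:
    print("No CUDA devices visible -- check the loaded environment/modules")
```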
Follow-up work:
While the construction of this cluster was an amazing feat by an amazing team, nothing built under these constraints is going to be perfect. Building a cluster on the timescale we did required making a lot of concessions. For one, using a turnkey system from a vendor made the deployment much faster, but as a result the system is a standalone one-off. It lacks integration with many of the standard systems we use for health monitoring and configuration management. While the BCM system from NVIDIA handles some of this, it is incomplete and unfamiliar, requiring a lot more active attention to manage than our standard HPC environment. There needs to be an assessment of the current state of the new cluster, how it’s managed, and what the future of its configuration needs to be. Making changes to an existing running system at that level is never easy and would require significant downtime.
Any system built on a compressed timescale is naturally going to carry this level of technical debt. Cleaning up that debt needs to be planned for at the beginning and accounted for by leadership. If leadership is not pushing a plan for better integration, the work will never get the time and budget needed to complete it.