Releases: GoogleCloudPlatform/cluster-toolkit
v1.31.1: Updated provisioning guide for A3 VM family
What's Changed
The A3 provisioning guide was updated by @tpdownes to support 2 use cases:
- user-created reservations without compact placement policies that are automatically consumed by matching VMs
- Google Cloud-created reservations that must be specifically identified by the Slurm cluster for consumption
See #2420 and reservation consumption documentation for details.
Full Changelog: v1.31.0...v1.31.1
v1.31.0: Improved Local File Management
What's Changed
Key New Features 🎉
Module Improvements 🔨
- Support http_proxy in HTCondor Windows installation by @tpdownes in #2368
- Slurm6. Add support for dynamic nodeset. by @mr0re1 in #1986
Improvements 🛠
- Migrate config-ssh.yaml to Slurm-GCP v6 by @alyssa-sm in #2358
- Add SlurmGCP v6 version of node groups blueprint example by @harshthakkar01 in #2347
- Add Slurm-GCP v6 example for gpu.yaml by @alyssa-sm in #2376
- Add SlurmGCP v6 example of htc-slurm blueprint by @harshthakkar01 in #2348
- Script capture serial by @cboneti in #2380
Deprecations 💤
- Deprecate schedmd-slurm-gcp-v6-partition.network_storage by @mr0re1 in #2379
- Remove quota validator by @mr0re1 in #2382
Bug fixes 🐞
Full Changelog: v1.30.0...v1.31.0
v1.30.0 - Cloud HPC Toolkit A3 VM + NeMo Framework Solution
What's Changed
Key New Features 🎉
- Introduction of the Cloud HPC Toolkit A3 VM family blueprint featuring:
  - A Slurm cluster composed of A3 VMs each with 8 NVIDIA H100 GPUs
  - An example for running the NVIDIA NeMo framework
  - An example for running the common nccl-tests benchmark
Module Improvements 🔨
- Add support for startup script per nodeset by @mr0re1 in #2296
- Allocate IP for vm-instance by @lemaitre-aneo in #2219
- Add random hex to batch job id and specify id when submitting by @nick-stroud in #2259
- HTCondor: support user-managed secret replication by @tpdownes in #2340
- Support setting system default web proxy in Windows startup by @tpdownes in #2351
Improvements 🛠
- Add TPU v4 blueprint and tutorial to demonstrate running TPU workload by @harshthakkar01 in #2287
- Update parameters for TPU nodeset module and add precondition checks and bump TPU to v3 by @harshthakkar01 in #2293
- Add Slurm v6 version for image builder blueprint by @harshthakkar01 in #2297
- Allow ghpc deploy blueprint.yaml by @mr0re1 in #2323
- Slurm GCP version update; will cooldown before deleting orphan nodes by @nick-stroud in #2322
- Add SlurmGCP v6 example of slurm compatible with startup scripts and integration test by @harshthakkar01 in #2346
Version Updates ⏫
Bug fixes 🐞
- Added enable_devel for packer build to fix issue with bp by @cdunbar13 in #2334
New Contributors
- @lemaitre-aneo made their first contribution in #2219
Full Changelog: v1.29.0...v1.30.0
v1.29.0: New Firewall Rules module & Slurm-GCP v6 Improvements
What's Changed
Key New Features 🎉
New Modules 🧱
Module Improvements 🔨
- Set http_proxy, https_proxy variables for user login and during startup-script by @tpdownes in #2237
- Update documentation for Packer to include minimum operational requirements by @tpdownes in #2241
- Modify cloud-storage-bucket to include ability to set bucket viewers by @tpdownes in #2247
- Add "submit" option to batch-job-template module by @aaronegolden in #2210
- Prevent usage of placement with static and auto-scale nodes in same nodeset by @nick-stroud in #2279
Improvements 🛠
- Add login node in spack wrfv3 tutorial example by @alyssa-sm in #2207
Version Updates ⏫
- Update to Slurm-GCP 5.10.4 by @tpdownes in #2251
- Bump cryptography from 42.0.0 to 42.0.2 in /community/front-end/ofe by @dependabot in #2257
- Slurm-GCP Release 6.4.1 by @tpdownes in #2230
- Updated Slurm v6 version, fix placement error by @nick-stroud in #2280
- Bump cryptography from 42.0.0 to 42.0.4 in /community/front-end/ofe by @dependabot in #2288
- Bump Golang version 1.20 -> 1.21 by @mr0re1 in #2291
Bug fixes 🐞
- Fix bug in validation of variable in PBS module by @tpdownes in #2255
- Fix for wait-for-startup.yml ansible task by @cdunbar13 in #2283
Full Changelog: v1.28.1...v1.29.0
v1.28.1: Slurm-GCP v4 reaches End-of-Life, improved Slurm-GCP v6 support
What's Changed
Key New Features 🎉
- Add support for string interpolation by @mr0re1 in #2076
- 🌈 Enable output colorization by default by @mr0re1 in #2145
- Add example of building Slurm on top of Rocky 8 by @nick-stroud in #2155
Module Improvements 🔨
- Slurm6. Make subnetwork_self_link required, don't pass subnetwork_project by @mr0re1 in #2067
- Slurm6. Automagically set nodeset.name from module id. by @mr0re1 in #2068
- Slurm6. Add support for additional_networks, access_config & reservation_name by @mr0re1 in #2062
- Reduce default maximum number of HTCondor execute points by @tpdownes in #2127
- Startup stackdriver option by @nick-stroud in #2120
- HTCondor: variable MIG behavior by @tpdownes in #2140
- Extending GKE Scheduler module by @ek-nag in #2137
- Copies python binaries instead of symlink for more isolated venv by @nick-stroud in #2151
- Increase dynamic node count to a more reasonable default value by @nick-stroud in #2153
- Update Chrome Remote Desktop to Debian 12 by default by @tpdownes in #2180
- Update startup-script module to latest release by @tpdownes in #2183
- Updates to HTCondor autoscaler by @tpdownes in #2204
- Change batch-job-base template from json to YAML by @aaronegolden in #2199
- Add Slurm configuration template for long Prolog/Epilog scripts by @tpdownes in #2218
Improvements 🛠
- Check if supplied value matches expected module variable type by @mr0re1 in #2089
- Update spack openfoam example to Slurm V6 by @harshthakkar01 in #2090
- Add has_to_be_used behaviour to some of modules by @mr0re1 in #2092
- Update spack wrf example and references to use Slurm V6 by @harshthakkar01 in #2132
- Update spack gromacs example and references to Slurm V6 by @harshthakkar01 in #2138
- Update hpc slurm gromacs example and references to use Slurm V6 by @harshthakkar01 in #2149
- Update DAOS blueprints to use google-cloud-daos v0.5.0, slurm v6 by @mark-olson in #2147
- Bring $(...) functionality on par with ((...)) by @mr0re1 in #2053
- Add --force flag to ghpc create by @mr0re1 in #2162
- Update hpc-slurm-legacy-sharedvpc example and references to use Slurm V6 by @harshthakkar01 in #2143
- Add login node in the spack openfoam tutorial example by @harshthakkar01 in #2170
- Add login node to spack gromacs tutorial example by @alyssa-sm in #2227
Deprecations 💤
- Remove pre existing fs example and references by @harshthakkar01 in #2104
- Remove slurm-two-partitions-workstation example and references by @harshthakkar01 in #2106
- Remove use-resources example and references by @harshthakkar01 in #2107
- Update test outputs example and remove slurm partition and controller by @harshthakkar01 in #2109
- Remove test-gcs-fuse example and references by @harshthakkar01 in #2110
- Remove hpc-cluster-service-acct example and references by @harshthakkar01 in #2111
- Remove hpc-cluste-slurm-with-startup example and references by @harshthakkar01 in #2112
- Remove hpc-cluster-project example and references by @harshthakkar01 in #2113
- Remove hpc-cluster-high-io-remote-state example and references by @harshthakkar01 in #2114
- Remove lustre-new-vpc example and references by @harshthakkar01 in #2108
- Remove hpc-slurm-legacy example and references by @harshthakkar01 in #2103
- Remove intel-select blueprints and references by @harshthakkar01 in #2135
- Remove Slurm V4 modules and add note to use and reference V4 modules and examples by @harshthakkar01 in #2160
- Deprecate Dell Omnia module and example blueprint by @tpdownes in #2196
Version Updates ⏫
- Update Slurm Cloud SQL example to V6 by @alyssa-sm in #2168
- Update Slurm-GCP release to 5.10.2 by @tpdownes in #2189
- Bump django from 4.2.7 to 4.2.10 in /community/front-end/ofe by @dependabot in #2213
Bug fixes 🐞
- Update spack openfoam example to use /opt/apps directory by @harshthakkar01 in #2131
- Fix HTCondor Windows URI for latest 23.0 LTS release by @tpdownes in #2141
- Validation added to Slurm v5 login_startup_scripts_timeout by @cdunbar13 in #2148
- Ensure Windows VMs start HTCondor only after successful secret download by @tpdownes in #2174
New Contributors
- @stas00 made their first contribution in #2125
- @aaronegolden made their first contribution in #2199
Full Changelog: v1.27.0...v1.28.0
v1.27.0: Spack support for non-root users
What's Changed
Key New Features 🎉
- Perform spack and ramble actions with system user by @nick-stroud in #2052
New Modules 🧱
- Tutorial for FSI VaR MonteCarlo by @jrossthomson in #1874
Module Improvements 🔨
- Making CloudSQL to use internal IP address instead of external for Slurm Accounting DB. by @ek-nag in #1795
- OFE: Various new features and fixes. by @ek-nag in #2040
- Disable firewall rule logging by default by @tpdownes in #2057
- Slurm6. Add support for enable_slurm_gcp_plugins by @mr0re1 in #2066
- Support explicit reserved_ip_range for Filestore instances by @tpdownes in #2072
- Adopt gcloud storage over gsutil by default by @tpdownes in #2075
- Skip upgrade of wheel/setuptools if already installed by @tpdownes in #2074
- Support use of http/https proxy for pip/apt/yum package managers by @tpdownes in #2079
Improvements 🛠
- Creation of script to check upcoming maintenance on nodes by @cdunbar13 in #1977
- Add documentation how to use max_hops plugin by @nick-stroud in #1983
- Add Zonal with lower capacity band support for filestore instance by @harshthakkar01 in #1982
- Add codespell: workflow, by @yarikoptic in #1888
- Update slurm modules to point to fork by @mr0re1 in #2009
- Document permissions around manual usage of Spack installation by @nick-stroud in #2015
- Front End Deployment Changes by @max-nag in #1895
- vm-instance is now able to use a service account by @nick-stroud in #2051
Version Updates ⏫
Bug fixes 🐞
New Contributors
- @yarikoptic made their first contribution in #1888
- @fustic made their first contribution in #2041
Full Changelog: v1.26.1...v1.27.0
v1.26.1: Fix regression in wait-for-startup module
What's Changed
GoogleCloudPlatform/guest-agent@5c85572 introduced a change in bootup logging that prevented the community/modules/scripts/wait-for-startup module from detecting the end of a failed startup-script. The new solution has been patched to detect failure on new and old releases of the guest-agent.
Bug fixes 🐞
- Hotfix wait for startup update by @cdunbar13 in #2045
Full Changelog: v1.26.0...v1.26.1
v1.26.0: GKE support for GCS, colorized output, improved "ghpc create" output
What's Changed
Key New Features 🎉
- Add native support for Google Cloud Storage with GKE by @nick-stroud in #1972
Module Improvements 🔨
- Slurm6. Fix race condition between GCS config files and instances by @mr0re1 in #1932
- Script run warning stage 2 by @cdunbar13 in #1956
Improvements 🛠
- Allow ramble module to point at forked repositories by @douglasjacobsen in #1927
- 🌈 Add colorization to the output by @mr0re1 in #1949
- Output better executable path by @mr0re1 in #1951
Deprecations 💤
Version Updates ⏫
- Bump django from 4.2.3 to 4.2.7 in /community/front-end/ofe by @dependabot in #1926
Other changes
Full Changelog: v1.25.0...v1.26.0
v1.25.0: CAE solution
What's Changed
Key New Features 🎉
New Modules 🧱
- Add Slurm V6 controller, partition & nodeset modules by @mr0re1 in #1871
- Slurm V6. Add login module by @mr0re1 in #1890
- Add TPU nodeset module for Slurm V6 by @harshthakkar01 in #1905
Module Improvements 🔨
- Update ramble modules to have data file management by @douglasjacobsen in #1837
- Improve Spack and Ramble Refs by @cdunbar13 in #1882
Improvements 🛠
- Add support for reading metadata.yaml in CFT-format, fallback to har… by @mr0re1 in #1841
- Enable usage of GCS URL as Module.Source by @mr0re1 in #1523
Version Updates ⏫
- Upgrade HTCondor to LTS Release 23.0 by @tpdownes in #1862
- Bump minimum Golang version 1.18 -> 1.20 by @mr0re1 in #1873
- Bump google.golang.org/grpc from 1.58.2 to 1.58.3 by @dependabot in #1884
Bug fixes 🐞
- Fix HCLS and gcs fuse installation by @cdunbar13 in #1845
- Fix HTCondor Windows download URI by @tpdownes in #1847
- Update ansible's usage of virtualenv to venv by @cdunbar13 in #1877
- Check for stockouts during bulkInsert in integration tests. by @cdunbar13 in #1880
- Applied recommended changes to gcsfuse and nfs scripts to fix apt-get update by @cdunbar13 in #1901
- Hotfix for vm-instance to allow image names to be used correctly by @cdunbar13 in #1930
Other changes
New Contributors
- @fschuerm made their first contribution in #1836
- @dodecatheon made their first contribution in #1920
Full Changelog: v1.24.0...rc1.25.0
v1.24.0: Support for ephemeral storage on GKE, Slurm-on-GCP update to 5.9.1
What's Changed
Key New Features 🎉
- Add native support for ephemeral storage to GKE modules by @nick-stroud in #1759
Module Improvements 🔨
- Improve A3 VM family support by @tpdownes in #1756
- Improve VPC documentation by @tpdownes in #1764
- Improve pre-existing-vpc documentation by @tpdownes in #1763
- Simplified Script Warning by @cdunbar13 in #1809
- Add instance image name to VM-Instance Module by @cdunbar13 in #1813
- Static nodes in cluster partition via OFE by @ek-nag in #1738
Improvements 🛠
- Create simple command line tool to summarize topology of VMs by @nick-stroud in #1725
- Add support to update mount options for file store instance by @harshthakkar01 in #1783
- Adds basic deploy/destroy test for storage-gke.yaml by @nick-stroud in #1799
- HCLS Integration Test by @cdunbar13 in #1810
Version Updates ⏫
Bug fixes 🐞
- Update the install ramble deps runner based on the spack module by @douglasjacobsen in #1821
- Fix install script in ml-slurm example by @samskillman in #1831
New Contributors
- @nique905 made their first contribution in #1735
- @alyssa-sm made their first contribution in #1823
- @samskillman made their first contribution in #1831
Full Changelog: v1.23.0...1.24.0