v1.42.0: Filestore deletion protection, GCP maintenance as Slurm job, Docker daemon configuration
What's Changed
Key New Features 🎉
- Add support for custom Docker daemon configuration by @tpdownes in #3201
- Adopt local SSD storage for A3 docker images by @tpdownes in #3206
- Adopt google Terraform plugin v6.10.0 and drop support for 5.x by @tpdownes in #3189
- Add support to perform GCP maintenance as slurm job by @harshthakkar01 in #3152
- Add support for Filestore deletion protection by @tpdownes in #3183
Module Improvements 🔨
- Updating notebook module to use workbench_instance by @jrossthomson in #3139
- Initial commit for new logging output by @cdunbar13 in #3150
- SlurmGCP. "All or nothing" bulk insert on requests with placements by @mr0re1 in #3157
- Remove redundant provisioner for printing image name by @cdunbar13 in #3151
- Add direct Terraform support for Slurm SchedulerParameters and PrivateData by @tpdownes in #3164
- Add `use_job_duration` option by @abbas1902 in #3142
- Improvements for CloudSQL by @wiktorn in #3147
- Improve Error Message with Reservation Validation by @arajmane-g in #3174
Improvements 🛠
- Use local paths to embedded modules throughout Toolkit by @tpdownes in #3102
- Update default value for subnetwork_project to null by @alyssa-sm in #3193
- Gke update default taints for user node pools by @ankitkinra in #3200
- Update MTU for a3 mega for GKE based on best practices by @ankitkinra in #3175
- add training example for gke parallelstore blueprint by @chengcongdu in #3181
- Update maintenance.py to support additional format by @alyssa-sm in #3208
- Allow latest Terraform google plugin by @tpdownes in #3213
- update a3 machines local ssd to use nvme instead of scsi for better performance by @chengcongdu in #3232
- Improve fetching and caching job details by @harshthakkar01 in #3194
- SlurmGCP. Add `set -e` to prolog mux by @mr0re1 in #3215
- add gpu health check in prolog and epilog by @NinaCai in #3134
Deprecations 💤
- Delete the new-project module to support adoption of TPG v6 by @RachaelSTamakloe in #3171
- Delete Daos Example Blueprints to support adoption of TPG v6 by @RachaelSTamakloe in #3172
Version Updates ⏫
- Bump integration test to support Go 1.23 by @mohitchaurasia91 in #3154
- Bump go version 1.21 -> 1.22 by @mohitchaurasia91 in #3156
- Update bucket module within Slurm controller module by @tpdownes in #3161
- update vm-instance module to support TPG v6 by @RachaelSTamakloe in #3166
- Update IP address module within VPC module by @tpdownes in #3160
- update Batch module to be compatible with TPG v6 by @RachaelSTamakloe in #3187
- update HTCondor modules to be compatible with TPG v6 by @RachaelSTamakloe in #3186
- Update Slurm-GCP v5 to 5.12.1 by @tpdownes in #3185
- Update workload-identity submodule from v29 to v34 by @RachaelSTamakloe in #3196
- Update ml-slurm examples to use recent copies of pytorch and tensorflow by @tpdownes in #3226
- Make gke-node-pool compatible with TPG 6.x by @tpdownes in #3230
Bug fixes 🐞
- Refactor mount/mode setting for local SSD RAID by @tpdownes in #3214
- Fix a bug where try was hiding extraction of gpu driver version by @ankitkinra in #3257
- Fix the gpu_installation_config default for case where no customer input by @ankitkinra in #3259
- SlurmGCP. Fix bug that prevents resourcePolicies clean up. by @mr0re1 in #3266
New Contributors
- @linsword13 made their first contribution in #3211
- @NinaCai made their first contribution in #3134
Full Changelog: v1.41.0...v1.42.0