Sharcnet HPC
Canada
Joined Sep 13, 2012
This channel provides public access to recordings of SHARCNET training events. These recordings are part of a presentation series organized by SHARCNET staff. For more information about SHARCNET events visit our calendar at www.sharcnet.ca/my/news/calendar.
SHARCNET is a consortium of 19 Canadian academic institutions that share a network of high performance computers. With this infrastructure we enable world-class academic research. We aim to:
accelerate computational academic research,
attract the best students and faculty to our partner institutions by providing cutting-edge expertise and hardware,
and link academic researchers with corporate partners in the search for new business opportunities.
SHARCNET is a partner organization of Compute Ontario and of the Digital Research Alliance of Canada's national advanced research computing platform.
Unlocking the Power of Comet: Streamlining Machine Learning Experimentation
Comet is an easy-to-use platform for tracking and optimizing machine learning experiments. It integrates with popular frameworks like TensorFlow and PyTorch, allowing users to log metrics, hyperparameters, and model results. Comet helps visualize experiment progress in real time, tune hyperparameters, and ensure reproducibility. A live demo will show how Comet improves collaboration and model performance, and how it tracks the environmental impact of machine learning projects.
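The kind of bookkeeping such trackers automate can be illustrated with a small, self-contained Python sketch. This is not Comet's API, just a hypothetical stand-in showing the sort of run metadata an experiment tracker records: hyperparameters logged once, metrics logged as a time series keyed by training step.

```python
import json

class ExperimentLog:
    """Toy stand-in for an experiment tracker (hypothetical; not
    Comet's API): records hyperparameters and per-step metrics."""

    def __init__(self, name):
        self.name = name
        self.params = {}
        self.metrics = []

    def log_parameters(self, params):
        # Hyperparameters are logged once per run.
        self.params.update(params)

    def log_metric(self, name, value, step):
        # Metrics are a time series, keyed by training step.
        self.metrics.append({"step": step, "name": name, "value": value})

    def to_json(self):
        # A real tracker would ship this to a server for dashboards.
        return json.dumps({"name": self.name,
                           "params": self.params,
                           "metrics": self.metrics})

exp = ExperimentLog("demo-run")
exp.log_parameters({"lr": 0.01, "batch_size": 32})
for step in range(3):
    exp.log_metric("loss", 1.0 / (step + 1), step)
print(exp.to_json())
```

Because every run carries its full configuration alongside its metrics, any two runs can be compared or reproduced later, which is the core value such platforms add.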
_________________________________________________
This webinar was presented by Nast Shahparian (SHARCNET) on December 18th, 2024, as a part of a series of weekly Compute Ontario Colloquia. The webinar was hosted by SHARCNET. The colloquia cover different advanced research computing (ARC) and high performance computing (HPC) topics, are approximately 45 minutes in length, and are delivered by experts in the relevant fields. Further details can be found on this web page: www.computeontario.ca/training-colloquia . Recordings, slides, and other materials can be found here: helpwiki.sharcnet.ca/wiki/Online_Seminars
SHARCNET is a consortium of 19 Canadian academic institutions who share a network of high performance computers (www.sharcnet.ca). SHARCNET is a part of Compute Ontario (computeontario.ca/) and Digital Research Alliance of Canada (alliancecan.ca).
Views: 67
Videos
Data Wrangling with Tidyverse (part 3)
Tidyverse is a cohesive set of packages for doing data science in R. In an earlier talk, we began reviewing the data-munging portions of the tidyverse (dplyr, forcats, tibble, readr, stringr, tidyr, and purrr) by using it to reconstruct the data hierarchy in a 500-page reference PDF given only the words on each page and their bounding boxes. This talk will complete this. * Part 1: th-cam.com/video/...
Causal Inference using Probabilistic Variational Causal Effect in Observational Studies
In this presentation, I introduce a novel causal analysis methodology called Probabilistic Variational Causal Effect (PACE) designed to evaluate the impact of both rare and common events in observational studies. PACE quantifies the direct causal effects by integrating total variation, which captures the purely causal component, with interventions on varying treatment levels. This integration a...
Survival guide for the upcoming GPU upgrades (more total power, but fewer GPUs)
In the coming months, national systems will be undergoing significant upgrades. In particular, older GPUs (P100, V100) will be replaced with the newest H100 GPUs from NVIDIA. The total GPU computing power of the upgraded systems will grow by a factor of 3.5, but the number of GPUs will go down significantly (from 3200 to 2100). This will present a significant challenge for our users, as "the busine...
Git Part 3: Managing Workflows
This session explores strategies and tools that enhance collaboration, improve workflow efficiency, and streamline code management. Building on foundational Git concepts presented in parts 1 & 2, we'll explore branching strategies, focusing on how they shape team collaboration and project stability. We'll cover essential commands for rebasing, merging, and resolving conflicts, providing insight...
Graham update
This presentation was made by John Morton (Director of Technology, SHARCNET) on October 30th, 2024. SHARCNET is a consortium of 19 Canadian academic institutions who share a network of high performance computers (www.sharcnet.ca). SHARCNET is a part of Compute Ontario (computeontario.ca/) and Digital Research Alliance of Canada (alliancecan.ca).
Parallel Programming: MPI I/O Basics
MPI-IO is a set of extensions to the MPI library that enable parallel, high-performance I/O operations. It provides a parallel file-access interface that allows multiple processes to read and write the same file simultaneously. MPI-IO allows for efficient data transfer between processes and enables high-performance I/O operations on large datasets. It also provides additional features such as...
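The core idea is that each process writes its own block of a shared file at an offset computed from its rank. That offset arithmetic can be sketched without MPI at all; the following Python sketch simulates four "ranks" doing what MPI_File_write_at does (an illustration of the pattern, not real MPI-IO code):

```python
import os
import tempfile

def write_block(path, rank, block):
    """Write one rank's bytes at its own offset in a shared file,
    mimicking MPI_File_write_at with offset = rank * len(block)."""
    with open(path, "r+b") as f:
        f.seek(rank * len(block))
        f.write(block)

nranks, block_size = 4, 8
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "r+b") as f:
    f.truncate(nranks * block_size)  # pre-size the shared file

# Each "rank" writes a disjoint region; write order does not matter.
for rank in range(nranks):
    write_block(path, rank, bytes([rank]) * block_size)

with open(path, "rb") as f:
    data = f.read()
os.remove(path)
print(len(data))  # 32 bytes: four disjoint 8-byte blocks
```

Because the regions are disjoint, no coordination between writers is needed; in real MPI-IO, collective variants (e.g. MPI_File_write_at_all) additionally let the library aggregate these writes for performance.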
Introspection for Jobs: in-job monitoring of performance
Several types of data are collected about performance while a job is running; ultimately, much of this winds up in portals and can be examined. This data is also available to the job, while it runs, and can provide additional insight. Here, we'll demonstrate scripts that can execute within the job context to, for instance, provide summaries of performance at the job's end, or within sections of...
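As a minimal illustration of in-job introspection (not the talk's actual scripts), a Python job can query its own resource usage at any point with the standard library's resource module (Unix only):

```python
import resource

def job_summary():
    """Report this process's resource usage so far, in the spirit
    of an end-of-job epilogue script (Unix only)."""
    ru = resource.getrusage(resource.RUSAGE_SELF)
    return {
        "user_time_s": ru.ru_utime,
        "system_time_s": ru.ru_stime,
        "max_rss_kb": ru.ru_maxrss,  # kilobytes on Linux
    }

_ = sum(i * i for i in range(100_000))  # do a little work first
summary = job_summary()
for key, value in summary.items():
    print(f"{key}: {value}")
```

Calling such a function at the end of a job script, or around individual sections, gives the kind of per-section summary described above without waiting for the data to appear in a portal.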
Multidimensional Arrays in C++
The C++23 standard has std::mdspan, which provides a lightweight, non-owning multidimensional view of a contiguous one-dimensional array. This enables the reinterpretation of an underlying contiguous array as a multidimensional array, with support for different memory layouts (e.g., C and Fortran) and different ways of accessing elements (e.g., directly, using atomics, etc.). As found in other programming lan...
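The layout idea can be sketched in Python (a toy analogue of mdspan's layout policies, not C++): the same flat buffer is viewed through different strides, so "C" and "Fortran" orderings are just two index mappings over one allocation.

```python
def make_view(flat, shape, layout="C"):
    """Interpret a flat list as a 2-D array using row-major ("C")
    or column-major ("F", Fortran-style) strides; a toy analogue
    of std::mdspan's layout policies."""
    rows, cols = shape
    strides = (cols, 1) if layout == "C" else (1, rows)
    def at(i, j):
        # Linear index = i * row_stride + j * col_stride.
        return flat[i * strides[0] + j * strides[1]]
    return at

flat = [0, 1, 2, 3, 4, 5]              # one allocation...
c_view = make_view(flat, (2, 3), "C")  # ...two interpretations
f_view = make_view(flat, (2, 3), "F")
print(c_view(0, 2))  # row-major: 0*3 + 2 -> flat[2] -> 2
print(f_view(0, 2))  # column-major: 0*1 + 2*2 -> flat[4] -> 4
```

std::mdspan does the same mapping at compile time with zero overhead, which is why it can wrap memory produced by C, Fortran, or GPU libraries without copying.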
Debugging and Optimization of PyTorch Models
Deep learning models are often viewed as uninterpretable "black boxes". As researchers, we often extend this thinking to the memory and compute utilization of such models. Using PyTorch Profiler, we can identify model bugs and bottlenecks to understand how to improve model performance from an efficiency perspective. This will improve training scaling and allow completion of large hyperparameter...
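For plain Python code, the standard library's cProfile plays a role analogous to PyTorch Profiler (this is a generic sketch, not the talk's PyTorch workflow): profile a step, then rank functions by cumulative time so the bottleneck surfaces at the top of the report.

```python
import cProfile
import io
import pstats

def slow_part():
    # Deliberately expensive: this is the bottleneck we want to find.
    return sum(i * i for i in range(200_000))

def fast_part():
    return 42

def train_step():
    slow_part()
    fast_part()

profiler = cProfile.Profile()
profiler.enable()
train_step()
profiler.disable()

# Sort by cumulative time; the slow function rises to the top.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

PyTorch Profiler extends this same workflow to GPU kernels and memory, attributing device time back to the Python operations that launched it.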
Using machine learning to predict rare events
In some binary classification problems, the underlying distribution of positive and negative samples is highly unbalanced. For example, fraudulent credit card transactions are rare compared to the volume of legitimate transactions. Training a classification model in such a case needs to take the skewed distribution into account. In this seminar, we will develop a fraud detector which...
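One common remedy, sketched here in plain Python (an illustration, not the seminar's code), is to weight each class inversely to its frequency so that the rare positives contribute comparably to the training loss:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Per-class weights proportional to 1/frequency, using the
    balanced-weight heuristic: total / (n_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

# 990 legitimate transactions, 10 fraudulent ones.
labels = [0] * 990 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights[1] / weights[0])  # the rare class is weighted 99x higher
```

These weights can then be passed to most training APIs (e.g. a per-class weight argument in a loss function); resampling the minority class is the other common approach.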
Diagnosing Wasted Resources from User Facing Portals on the National Clusters
Researchers often leave resources on the table when specifying their job requirements on the national systems. This talk builds on previous sessions and uses the Digital Research Alliance of Canada's User Facing Portals to explore what different types of jobs look like when they waste resources. Demonstrations will include interactive jobs, parallel jobs, GPU workflows, and more. With more accu...
The Emergence of WebAssembly (Wasm) in Scientific Computing
Developed collaboratively by major browser vendors, including Mozilla, Google, Microsoft, and Apple, WebAssembly (Wasm) addresses the limitations of traditional web programming languages like JavaScript. But what makes it so compelling for scientists? First, Wasm allows code written in languages like C/C++, Fortran, or Rust to be compiled into its instruction format and run directly in the browse...
Exploring Compute Usage from User Facing Portals on the National Clusters
Previous seminars in this series have described using Python tools to explore job properties and usage characteristics on the Digital Research Alliance of Canada general purpose compute clusters. The end goal of exploring job properties and usage characteristics is to get the most out of the resources available to research accounts and to minimize wait times in the job queue. This seminar revie...
Compute Ontario Summer School 2024
UPDATE: registration is now open for Compute Ontario Summer School (June 3-21, 2024): training.computeontario.ca/coss2024.php In this colloquium, we will present the curriculum of the 2024 Compute Ontario Summer School, to be held from the 3rd to the 21st of June. Jointly organized by the Centre for Advanced Computing, SciNet, SHARCNET, and in collaboration with the Research Data Management Net...
Data Wrangling with Tidyverse (part 2)
Accelerating data analytics with RAPIDS cuDF
Make: a declarative, lazy, parallel workload manager. Elegant or obsolete?
Introduction to GPU programming with OpenMP
False Sharing and Contention in Parallel Codes
Skorch: Training PyTorch models with scikit-learn
Squeeze more juice out of a single GPU in deep learning
Generalized End to End Python and Neuroscience Workflows on a Compute Cluster
p2rng - A C++ Parallel Random Number Generator Library for the Masses
Exploring job wait times on Alliance compute clusters: a holistic view
Automating scientific workflows with AiiDA
Isn’t the call at 16:27 “call saxpykernel <<<grid,block>>>…”, not “call saxpy”?
A very important issue that MIG introduces on multi-GPU systems is that all types of peer-to-peer access are disabled, so communication performance across a node's GPUs drops dramatically. This results in an environment in which CUDA-aware MPI and UCX cannot utilize NVLinks. How will you handle this issue? Would users be able to enable MIG on demand? (It usually requires sudo access.) On the other hand, enabling MIG statically on even a small subset of the GPUs would sacrifice more of the cluster's potential peak performance, because all of the inter-GPU NVLinks would be useless until MIG is disabled again. How is this going to work?
Can you describe the scenario in which it would make sense for a job to use multiple MIGs like that? Our main motive with MIGs is to address the extremely common problem of jobs that underutilize a whole GPU. Maybe the short answer is: we expect to have whole GPUs available as well as MIGs. Whole GPUs make sense for jobs that can use them effectively, of course - including across nodes. (I am Sharcnet/DRAC staff.)
I suspect MPS is the answer, but this needs to be tested.
@markhahn0 Well, that's the point. There are virtually no reasonable scenarios in which it would make sense for an MPI job to use multiple MIGs, unless the code is embarrassingly parallel with little to no communication across the GPUs. That's why most users who know what they are doing won't use the MIG-enabled GPUs for their MPI/multi-GPU jobs, making the race to acquire non-MIG GPUs even more competitive than before. This results in longer waiting times than what you already expect. I get your motivation and it's absolutely important to take action. However, I doubt that using MIG for a big chunk of nodes in an HPC cluster is an efficient solution to this problem, and it may backfire. Some considerations/suggestions that come to my mind: 1. Enable MIG only on single-GPU nodes (a portion of them, if there are any). 2. Enable dynamic MIG configuration - preferably by Slurm itself, and more ambitiously, depending on the jobs and their requirements. 3. With number two, maybe you can then enable MIG on multi-GPU nodes, if there are enough requests for them. 4. Do what you are doing: instruct users when and why to use MIG, and help them write better code. @SHARCNET_HPC Yeah, MPS is a better choice I guess, and users should be encouraged to use it. I wrote a couple of personal posts about MIG and MPS-on-top-of-MIG that might be helpful: amirsojoodi.github.io/posts/Enabling-MPS/ amirsojoodi.github.io/posts/MIG/ amirsojoodi.github.io/posts/MPS+MIG/ P.S. Regarding MIG and its related issues, NVIDIA said they might relax the constraints in the future, but who knows when. Read here: docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#application-considerations and here: docs.nvidia.com/datacenter/tesla/mig-user-guide/index.html#cuda-device-enumeration
15:21 MPS and Hyper-Q are separate but related concepts. Hyper-Q is basically a hardware feature that has been available on GPUs since compute capability 3.5, while MPS is a software solution that uses this hardware feature to let multiple processes share the GPU more efficiently.
You are right: "MPS enables cooperative multi-process CUDA applications, typically MPI jobs, to utilize Hyper-Q capabilities on the NVIDIA GPUs."
Thank you so much! I know this is pretty straightforward stuff, but 54:33 really helped me fix an issue with containerized experiments.
DBInterface is not an interface... I was wondering how you might create an interface in JavaScript, since it's not natively supported. TypeScript would be a better choice if you really wanted this, but as it stands, this "interface" contains implementation, something an interface is, by definition, not supposed to do.
Where's the github you mention would be available?
how can I install these GPU utilization packages (GPU dashboard)?
Thank you for the amazing video! 🇧🇷
Nice explanation. I converted my 2D parallelized fortran code to 3D and now it gives me NaN, any suggestions please?
thanks. One of the better videos on collaborative groups.👌
Be like hay.
Oh, my goodness.
😮 It must not be the right way to do it, because Cython should be at least 40 times faster. He should probably get out of the Jupyter notebook and get better, more reasonable measurements.
Is there cuda fortran for windows?
what do you mean? The compiler? yes
@@muhammadfaridkhandaq4379 thanks
This was nice. Thank you.
hi guys, at 57:11, after "ssh -A -J narval nc10201", do you know why I was still prompted for a password, and then got this error: "Received disconnect from UNKNOWN port 65535:2: Too many authentication failures Disconnected from UNKNOWN port 65535"? Thank you in advance
Is it possible to automatically get compute node id and connect to it?
Not necessarily new … we did this with the mainframe and the Cray Vector processors … GPUs are great at vector math and solving simultaneous equations
Excellent talk - great content, very informative, very useful! Thank you.
Fantastic presentation, thank you so much! What about xperf on windows environment? Still relevant?
Amazing job!
Thanks for sharing useful content. It would have been better if you had named it "faculty member edition", because it is specifically for supervisors.
This deserves 10K thumbs up, not just 39!!!
Nice presentation and useful instructions. Thanks a lot!
Your cython explanation and tutorial is the most crystal clear I've heard...and I have listened to > 10 videos! Excellent content! Thank you for this, it makes things less painful ❤
In the code mentioned at 22:50, the for loop initialisation condition must be g.size()/2, right? To accommodate any thread group size.
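The pattern the comment refers to, a pairwise (tree) reduction whose stride starts at half the group size, can be simulated sequentially in Python (an illustration of the indexing, not the talk's CUDA code; the size is assumed to be a power of two):

```python
def tree_reduce(values):
    """Simulate the tree reduction used in GPU thread groups: each
    round, the first `stride` slots add in the slot `stride` away,
    so the stride starts at size // 2 and halves every round."""
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        for i in range(stride):          # one "thread" per index i
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce(range(8)))  # 28, i.e. 0+1+...+7
```

Starting the stride at half the size is what makes the loop work for any power-of-two group size: every round halves the number of active slots until the total sits in slot 0.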
Thank you for the presentation 😊
thanks for the video. It's really helpful. I had a doubt: say you want to train an XGBoost model on a 200GB dataset, and I have a VM with some GPUs having a total combined memory of ~100GB (GPU memory + VM memory). Will I be able to train the model successfully on that using LocalCUDACluster?
did you try?
Great tutorial. May I ask you a question? If I have to use an old version of Node for my project that does not support async/await but supports promises, how can I rewrite the "for" loop at 19:16? I need to sustain the ordering of the promises, which means that just starting them in a loop with "then()" attached to each of them is not a solution. Also want to ask you - in one of my old web apps I used an inline <script> block into which I loaded a file or some web form for new item creation; that script was something like "if (window.parent) { window.parent.doStuff(data) }". It worked fine even in IE6+, but is it a clean approach?
Great video, helped connect to UMIACS HPC
Thanks a lot. It is a very well structured nice video. Without examples it is very hard to learn programming. Thanks a lot. 😄
It was really a nice demonstration for CUDA. Thanks
Glad you liked it!
Very useful!! Especially the explanation of the color coding
Glad it was helpful!
is the source code available for the debugger part?
Any idea of when H100 will be available on Graham?
We plan to replace graham with a new cluster in 2024 (if funding comes through), and this new cluster may contain H100 GPUs. Before then we may get a small number of H100 GPUs for testing purposes.
@@pawelpomorski4025 I'm interested in testing H100 GPUs from a researcher's perspective when they are available. 😁
@@dakezhang2845 There is a lot of interest. When we get some H100 cards for testing, we will advertise it, probably in our monthly newsletter which all users receive.
@@pawelpomorski4025 What happens to the old cluster? Can research groups purchase the old cores or GPUs?
@@ameerracle The cluster hardware is owned by the university, which decides how to dispose of it if we (SHARCNET) retire it. Anyway, it looks like graham will be running with the current hardware at least until early 2025.
how to profile in production using scalene pls
i need the steps to take to get this done
Do you think the AI programmer community will support/adopt RocM and to an extent AMD DataCenter GPUs?
RocM uses HIP which is very similar to CUDA, so any program written in CUDA can be ported to AMD (assuming it does not use any advanced features which are exclusive to NVIDIA GPUs). As for data centres, given the high demand and expense of NVIDIA GPUs, more AMD data center GPUs may be adopted.
Can I use this code?
Thank you for the great explanation 🙏
My pleasure!
Well done. I hope you'll make a follow up about a larger python program, with classes, various files, etc. The challenge I find is that a profiler is not so helpful once the work spreads out across components. I need something more specific to say "how can this function call be more efficient"?
24:38 How are you configuring Jupyterhub like this in the background connecting to these cluster resources. Is there documentation on this available or open source github that I can take a look at?
awesome. thanks man
Glad it helped!
'/^seq2$;/
awk '{print $1, $3, $5}' filename
Thanks a lot for sharing that. It helped me connect UGent's HPC
Glad it helped
How to overcome this issue: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1670525552843/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3 ncclInternalError: Internal check failed. Last error: Duplicate GPU detected: rank 0 and rank 1 both on CUDA device 1a000
same error bro, did u rectify it?
Thanks for the presentation. Please make MIG available for everyone
If you are interested in trying our experimental MIG setup on narval, please submit a ticket with a request. We might be able to give you some access in January.
Dear Ge, thanks for the video. Do you know if it's possible to debug on many nodes?
The first example shown will not use multiple GPUs, only one. You can see at 27:36 that in the code (it would have been nice to have line numbers) [device = 'cuda:0'...] meaning it only uses the first GPU. I believe this should be [device ='cuda:0,1'...]
u are correct
Can you tell exactly how to use or provide syntax to utilize multi gpu