Training Neural Network using multiple GPUs on PARAM SHAKTI supercomputer | SLURM | Batch Job

  • Published on 27 Oct 2024

Comments • 29

  • @vsksam
    @vsksam 4 months ago +1

    Thank You Bro...

  • @s.m.ruhulkabirhowlader7578
    @s.m.ruhulkabirhowlader7578 8 months ago +2

    Thank you for this video.

  • @xiangli1133
    @xiangli1133 1 year ago +2

    Thanks! So insightful!

  • @NicoleRosi
    @NicoleRosi 1 year ago +1

    Thank you for this video!

  • @devashishkprasad
    @devashishkprasad 2 years ago +2

    Thank you for this very helpful video!!

  • @wonder_hd
    @wonder_hd 2 years ago +2

    Sir, I don't understand Hindi, but I can follow along from what you are doing.

  • @prashantmore7912
    @prashantmore7912 5 months ago +1

    Sir, how do I create a module and upload it?

  • @SubrataBarman-h1g
    @SubrataBarman-h1g 11 months ago +1

    I am using WinSCP to access the PARAM Shakti service provided by IIT KGP, but I'm facing an issue while transferring data. Although I have high-speed internet, the data transfer speed is very low (it hardly reaches 100 kbps). Please let me know how to solve this. Is there any alternative software for transferring data?

    • @pankajkasar9512
      @pankajkasar9512 11 months ago +1

      Use MobaXterm.

    • @barman5186
      @barman5186 11 months ago

      @pankajkasar9512 Thank you very much, sir, for your prompt reply.
      I have tried MobaXterm but am not able to connect. Could you please explain how to log in with the hostname using MobaXterm, or how to configure it for that?

  • @himanshu1689
    @himanshu1689 1 year ago +1

    Hi Pankaj! Nice video. It would be really helpful if you shared some sample code for single-GPU and multi-GPU training.

  • @disinlungkamei2869
    @disinlungkamei2869 1 year ago +1

    Hello sir, if we were to run on 10 GPU nodes, how would we write our SLURM script? Thank you, sir.

    • @pankajkasar9512
      @pankajkasar9512 1 year ago

      PARAM Shakti has 11 nodes with 2 GPUs each, so for 10 GPUs you need to reserve 5 nodes: specify nodes = 5. But why do you need 10 GPUs? You would also need to write your script for distributed training; otherwise it is not possible to use 10 GPUs.
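
The node arithmetic in the reply above can be sketched in a few lines. This is only an illustration: the helper names `nodes_needed` and `sbatch_header` are hypothetical, the partition name "gpu" is a placeholder (check `sinfo` on the cluster), and the one-task-per-GPU layout is an assumption rather than something stated in the video.

```python
import math


def nodes_needed(total_gpus: int, gpus_per_node: int = 2) -> int:
    """Number of nodes to reserve so that at least total_gpus GPUs are available."""
    if total_gpus < 1 or gpus_per_node < 1:
        raise ValueError("GPU counts must be positive")
    return math.ceil(total_gpus / gpus_per_node)


def sbatch_header(total_gpus: int, gpus_per_node: int = 2, partition: str = "gpu") -> str:
    """Build the #SBATCH lines for a multi-node GPU job.

    The partition name is a placeholder, not a known PARAM Shakti setting.
    """
    nodes = nodes_needed(total_gpus, gpus_per_node)
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={gpus_per_node}",  # one task per GPU (assumed layout)
        f"#SBATCH --gres=gpu:{gpus_per_node}",         # GPUs requested on each node
        f"#SBATCH --partition={partition}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 10 GPUs at 2 GPUs per node gives 5 nodes, as in the reply.
    print(sbatch_header(10))
```

Reserving the nodes is only half of the job: the training script itself still has to be written for multi-node distributed training, as the reply points out.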

  • @shobhitgautam7058
    @shobhitgautam7058 2 years ago +1

    Sir, how do I create a module? I am getting the error 'no module named tensorflow'.

    • @pankajkasar9512
      @pankajkasar9512 1 year ago

      That error means the package is not available. You can install it with "pip install tensorflow".
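
Before installing, it can help to confirm which packages the job's Python interpreter actually sees. The sketch below uses only the standard library; the helper names are hypothetical, and the `--user` flag is an assumption that you cannot write to the shared environment on the cluster.

```python
import importlib.util
import sys


def is_installed(package: str) -> bool:
    """Return True if `package` can be imported by the current interpreter."""
    return importlib.util.find_spec(package) is not None


def pip_command(package: str) -> str:
    """Shell command that installs `package` for the interpreter running this script."""
    # --user installs into the home directory, which avoids needing write
    # access to a shared system environment (an assumption about the cluster).
    return f"{sys.executable} -m pip install --user {package}"


if __name__ == "__main__":
    for name in ("tensorflow",):
        if is_installed(name):
            print(f"{name}: found")
        else:
            print(f"{name}: missing, try: {pip_command(name)}")
```

Running the check with the same interpreter that the batch job uses matters, because a package installed for one Python environment is not visible from another.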

  • @anilkumarsharma8901
    @anilkumarsharma8901 2 years ago +1

    Let your subscribers 👌 use the supercomputer 💻, then the whole world will follow 👌 you 😇😇

  • @gamermoneyfree8193
    @gamermoneyfree8193 2 years ago +1

    Sir, where do I apply for remote access to PARAM Shakti?

  • @prembabupal1889
    @prembabupal1889 1 year ago

    How do I run MATLAB code on PARAM Shakti?

  • @IITian1988
    @IITian1988 2 years ago

    Hello Pankaj sir, this is Anup Mahato, a research scholar from IIT Kharagpur. Sir, I have a problem running the RegCM model. On PARAM Shakti, RegCM, all its libraries, and MPI are available, and I have written a script file, but I cannot find the input file on the server. Where can I get that file? Please help me. Thanks in advance.

    • @pankajkasar9512
      @pankajkasar9512 2 years ago

      Please state your issue in detail...

  • @nitinkumarchauhan6559
    @nitinkumarchauhan6559 3 years ago

    Is it for a sequential job or a parallel job, with one GPU card or two GPU cards?

    • @pankajkasar9512
      @pankajkasar9512 3 years ago

      It's a sequential job using two GPUs on a single node (a single machine).
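
For readers looking for sample code, here is a minimal sketch of single-node, multi-GPU training. It assumes TensorFlow 2 and uses `tf.distribute.MirroredStrategy`, which is one common way to use both GPUs of a node; the thread does not confirm that the video's script uses this approach. The model and the helper names are purely illustrative.

```python
def global_batch_size(per_gpu_batch: int, num_gpus: int) -> int:
    """Batch size to pass to model.fit() so each GPU sees per_gpu_batch samples."""
    return per_gpu_batch * num_gpus


def build_model_on_all_gpus(input_dim: int = 32, num_classes: int = 10):
    """Create and compile a small Keras model replicated across the visible GPUs."""
    import tensorflow as tf  # imported lazily so the helper above works without TF

    strategy = tf.distribute.MirroredStrategy()  # uses every GPU visible to the job
    with strategy.scope():
        # Variables created inside the scope are mirrored on each GPU.
        model = tf.keras.Sequential([
            tf.keras.Input(shape=(input_dim,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
    return model, strategy.num_replicas_in_sync


if __name__ == "__main__":
    model, replicas = build_model_on_all_gpus()
    print("replicas:", replicas, "global batch:", global_batch_size(64, replicas))
```

With one GPU the same code runs unchanged and `num_replicas_in_sync` is 1, so the single-GPU and two-GPU cases differ only in what the job reserves.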

  • @nitinkumarchauhan6559
    @nitinkumarchauhan6559 3 years ago

    Sir, thanks for this knowledgeable post. When I try to plot testing and validation curves over epochs, I get this error (on PARAM Shivay). How can I overcome this issue?
    QStandardPaths: XDG_RUNTIME_DIR points to non-existing path '/run/user/5475', please create it with 0700 permissions.
    qt.qpa.screen: QXcbConnection: Could not connect to display localhost:22.0
    Could not connect to any X display.

    • @pankajkasar9512
      @pankajkasar9512 3 years ago

      Why are you displaying graphs and plots on the supercomputer while training? Don't do that. I suggest using the "CSVLogger(csv_path)" callback, available via "from tensorflow.keras.callbacks import CSVLogger". It creates a CSV file and stores all the training performance metrics in it; you can then download the file and inspect it manually. Add this callback to model.fit() so that everything is stored in one file, and then produce your graphs and visualizations from that file.
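
The CSVLogger suggestion above might look roughly like this. It is a sketch under assumptions: the helper names `make_csv_logger` and `best_epoch` and the file name `training_log.csv` are hypothetical, and the reader assumes the log contains a `val_loss` column, which is only the case when validation data is passed to `model.fit()`.

```python
import csv


def make_csv_logger(csv_path: str):
    """Return a Keras callback that writes one row of metrics per epoch."""
    from tensorflow.keras.callbacks import CSVLogger  # lazy import: needs TensorFlow
    return CSVLogger(csv_path)


def best_epoch(csv_path: str, metric: str = "val_loss"):
    """Read a CSVLogger file and return (epoch, value) with the lowest `metric`."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    best = min(rows, key=lambda r: float(r[metric]))
    return int(best["epoch"]), float(best[metric])


# On the cluster, inside the training script:
#   model.fit(x, y, validation_data=(xv, yv), epochs=50,
#             callbacks=[make_csv_logger("training_log.csv")])
# Afterwards, on your own machine, once the file is downloaded:
#   print(best_epoch("training_log.csv"))
```

Because the log is plain text, no display is needed on the compute node, which sidesteps the X display error entirely; plotting can happen later on a local machine.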

    • @nitinkumarchauhan6559
      @nitinkumarchauhan6559 3 years ago

      @pankajkasar9512 OK sir, that works for me. There is one more issue. When I import a model pretrained on ImageNet, I can load only VGG16. Other models such as ResNet and DenseNet are not imported and show this error:
      Downloading data from github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5
      ---------------------------------------------------------------------------
      gaierror Traceback (most recent call last)
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
      1316 h.request(req.get_method(), req.selector, req.data, headers,
      -> 1317 encode_chunked=req.has_header('Transfer-encoding'))
      1318 except OSError as err: # timeout error
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in request(self, method, url, body, headers, encode_chunked)
      1228 """Send a complete request to the server."""
      -> 1229 self._send_request(method, url, body, headers, encode_chunked)
      1230
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in _send_request(self, method, url, body, headers, encode_chunked)
      1274 body = _encode(body, 'body')
      -> 1275 self.endheaders(body, encode_chunked=encode_chunked)
      1276
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in endheaders(self, message_body, encode_chunked)
      1223 raise CannotSendHeader()
      -> 1224 self._send_output(message_body, encode_chunked=encode_chunked)
      1225
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in _send_output(self, message_body, encode_chunked)
      1015 del self._buffer[:]
      -> 1016 self.send(msg)
      1017
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in send(self, data)
      955 if self.auto_open:
      --> 956 self.connect()
      957 else:
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in connect(self)
      1383
      -> 1384 super().connect()
      1385
      /home/apps/DL-Conda-Py3.7/lib/python3.7/http/client.py in connect(self)
      927 self.sock = self._create_connection(
      --> 928 (self.host,self.port), self.timeout, self.source_address)
      929 self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
      /home/apps/DL-Conda-Py3.7/lib/python3.7/socket.py in create_connection(address, timeout, source_address)
      706 err = None
      --> 707 for res in getaddrinfo(host, port, 0, SOCK_STREAM):
      708 af, socktype, proto, canonname, sa = res
      /home/apps/DL-Conda-Py3.7/lib/python3.7/socket.py in getaddrinfo(host, port, family, type, proto, flags)
      747 addrlist = []
      --> 748 for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
      749 af, socktype, proto, canonname, sa = res
      gaierror: [Errno -2] Name or service not known
      During handling of the above exception, another exception occurred:
      URLError Traceback (most recent call last)
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras/utils/data_utils.py in get_file(fname, origin, untar, md5_hash, file_hash, cache_subdir, hash_algorithm, extract, archive_format, cache_dir)
      221 try:
      --> 222 urlretrieve(origin, fpath, dl_progress)
      223 except HTTPError as e:
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in urlretrieve(url, filename, reporthook, data)
      246
      --> 247 with contextlib.closing(urlopen(url, data)) as fp:
      248 headers = fp.info()
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
      221 opener = _opener
      --> 222 return opener.open(url, data, timeout)
      223
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in open(self, fullurl, data, timeout)
      524
      --> 525 response = self._open(req, data)
      526
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in _open(self, req, data)
      542 result = self._call_chain(self.handle_open, protocol, protocol +
      --> 543 '_open', req)
      544 if result:
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
      502 func = getattr(handler, meth_name)
      --> 503 result = func(*args)
      504 if result is not None:
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in https_open(self, req)
      1359 return self.do_open(http.client.HTTPSConnection, req,
      -> 1360 context=self._context, check_hostname=self._check_hostname)
      1361
      /home/apps/DL-Conda-Py3.7/lib/python3.7/urllib/request.py in do_open(self, http_class, req, **http_conn_args)
      1318 except OSError as err: # timeout error
      -> 1319 raise URLError(err)
      1320 r = h.getresponse()
      URLError:
      During handling of the above exception, another exception occurred:
      Exception Traceback (most recent call last)
      in
      ----> 1 model_densenet = Densenet()
      in Densenet(seed)
      104
      105 def Densenet(seed = None):
      --> 106 denseNet121 = DenseNet121(weights="imagenet", include_top=False)
      107 for layer in denseNet121.layers[:149]:
      108 layer.trainable = False
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras/applications/__init__.py in wrapper(*args, **kwargs)
      26 kwargs['models'] = models
      27 kwargs['utils'] = utils
      ---> 28 return base_fun(*args, **kwargs)
      29
      30 return wrapper
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras/applications/densenet.py in DenseNet121(*args, **kwargs)
      9 @keras_modules_injection
      10 def DenseNet121(*args, **kwargs):
      ---> 11 return densenet.DenseNet121(*args, **kwargs)
      12
      13
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras_applications/densenet.py in DenseNet121(include_top, weights, input_tensor, input_shape, pooling, classes, **kwargs)
      309 input_tensor, input_shape,
      310 pooling, classes,
      --> 311 **kwargs)
      312
      313
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras_applications/densenet.py in DenseNet(blocks, include_top, weights, input_tensor, input_shape, pooling, classes, **kwargs)
      278 DENSENET121_WEIGHT_PATH_NO_TOP,
      279 cache_subdir='models',
      --> 280 file_hash='30ee3e1110167f948a6b9946edeeb738')
      281 elif blocks == [6, 12, 32, 32]:
      282 weights_path = keras_utils.get_file(
      /home/apps/DL-Conda-Py3.7/lib/python3.7/site-packages/keras/utils/data_utils.py in get_file(fname, origin, untar, md5_hash, file_hash, cache_subdir, hash_algorithm, extract, archive_format, cache_dir)
      224 raise Exception(error_msg.format(origin, e.code, e.msg))
      225 except URLError as e:
      --> 226 raise Exception(error_msg.format(origin, e.errno, e.reason))
      227 except (Exception, KeyboardInterrupt):
      228 if os.path.exists(fpath):
      Exception: URL fetch failure on github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5: None -- [Errno -2] Name or service not known

    • @nitinkumarchauhan6559
      @nitinkumarchauhan6559 3 years ago

      @pankajkasar9512 Kindly guide me, sir. I am stuck on the issue mentioned above: I am unable to fetch pretrained models.

    • @pankajkasar9512
      @pankajkasar9512 3 years ago

      @nitinkumarchauhan6559 Copy the URL "github.com/keras-team/keras-applications/releases/download/densenet/densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5", paste it into a browser, download the file manually, and upload it to the cluster. Then, instead of weights="imagenet", pass the path to the downloaded file. For ResNet50, for example, use weights="resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5" with the proper file path, and it will work. Make this change under """ Pre-trained ResNet50 Model """:
      resnet50 = ResNet50(include_top=False, weights="resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5", input_tensor=inputs)
      The same applies to VGG16UNET and all the other models.
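
For the DenseNet121 case from the traceback, the same idea could be sketched as follows. The helper names are hypothetical, and the sketch imports from `tensorflow.keras.applications`, whereas the traceback shows the standalone `keras` package; in both, the `weights` argument accepts a path to a weights file. The file name must match the model being built.

```python
import os


def resolve_weights(path: str) -> str:
    """Return the absolute path of a weights file, or fail early if it is missing."""
    full = os.path.abspath(os.path.expanduser(path))
    if not os.path.isfile(full):
        raise FileNotFoundError(f"weights file not found: {full}")
    return full


def load_densenet121(weights_file: str):
    """Build DenseNet121 without its classifier head from a local weights file."""
    from tensorflow.keras.applications import DenseNet121  # lazy import: needs TensorFlow
    # Passing a file path instead of "imagenet" skips the download step.
    return DenseNet121(weights=resolve_weights(weights_file), include_top=False)


# Example call, assuming the file was uploaded next to the training script:
#   model = load_densenet121("densenet121_weights_tf_dim_ordering_tf_kernels_notop.h5")
```

Checking the path first turns a missing upload into a clear FileNotFoundError instead of a long network traceback like the one above.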

    • @nitinkumarchauhan6559
      @nitinkumarchauhan6559 3 years ago

      @@pankajkasar9512 Thank you very much sir