At 6:49 he said activations have a batch dimension and the parameters have a batch dimension?? Is that correct? I used to think batch size is an independent dimension when defining the model and is initialised for all parameters, including W and the activation parameters.
Yes, batch size is an independent dimension, so the input of the whole model has a batch dimension, and so do the outputs of every layer, which are the activations. Activations are not parameters; you don't train the activations at all. What you want to train are W and V, not X (the input) or Y (the output). Parameters like W and V don't have a batch dimension, because every input in the batch is multiplied by the same W and V.
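A minimal numpy sketch of that point, with made-up layer sizes, just to show which tensors carry the batch dimension:

```python
import numpy as np

# Made-up sizes, only for illustration.
batch_size, d_in, d_hidden = 8, 16, 32

X = np.random.randn(batch_size, d_in)   # input: carries the batch dimension
W = np.random.randn(d_in, d_hidden)     # parameter: no batch dimension
V = np.random.randn(d_hidden, d_in)     # parameter: no batch dimension

H = X @ W   # activation: shape (batch_size, d_hidden)
Y = H @ V   # output:     shape (batch_size, d_in)

print(X.shape, W.shape, H.shape, Y.shape)  # (8, 16) (16, 32) (8, 32) (8, 16)
# Every example in the batch is multiplied by the same W and V, so only
# X, H and Y have a batch dimension; the trainable W and V do not.
```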
I'm not convinced. Let me say why. Only a happy few, very few, have the possibility to use a supercomputer or a TPU for that matter. But most of us already have access to a cluster of non-homogeneous nodes: some nodes faster and more powerful, some quite slow but perhaps with more memory/disk space. It makes more sense for TensorFlow to be able to detect capabilities, detect latencies, and build a graph that fits that cluster best. That way ALL of us could use model parallelism AND data parallelism with affordable equipment. One might take it a step further and even include nodes over the internet, without needing fiber, if the graphs are set up right. But this needs to be automated, whereas now it is pure manual work.
I believe one of the challenges with heterogeneity is the use of collective communications at the end, i.e. all-reduce. The fastest nodes end up waiting for the slowest node so that the outputs can be collected and the gradients redistributed.
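A toy back-of-the-envelope sketch of that straggler effect (the timings are invented, just to show that a synchronous all-reduce step is gated by the slowest worker):

```python
# Hypothetical per-step gradient compute times for three heterogeneous workers.
compute_times = {"fast_gpu": 1.0, "medium_gpu": 1.5, "slow_node": 4.0}  # seconds
allreduce_time = 0.3                                                    # seconds

# A synchronous all-reduce can only start once every worker has finished,
# so the step time is the slowest worker's time plus the collective itself.
step_time = max(compute_times.values()) + allreduce_time
idle_time = {name: max(compute_times.values()) - t for name, t in compute_times.items()}

print(f"step time: {step_time:.1f}s")  # 4.3s, dominated by slow_node
print(idle_time)                        # the fast workers spend most of the step waiting
```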
So make sure my five sensors are programmed by a TensorFlow supercomputer federated-learning / deep-learning / machine-learning replacement, because Bluetooth sensors are the very best definition of neuro-linguistic programming, like the FBI's COINTELPRO supercomputer NLP from the fifties.
So great!
Please refresh the timeline with this design.
No subtitles for this video?
added
I imagine we can also split some layers by h and some layers by d?
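You can; here is a small numpy sketch of what that might look like on two simulated devices (the sizes and the two-way split are assumptions for illustration). The catch is that when the split dimension changes between layers, the activation shards have to be gathered (a communication step) before the next layer can run:

```python
import numpy as np

batch, d, h = 4, 8, 16                 # made-up sizes
X  = np.random.randn(batch, d)
W1 = np.random.randn(d, h)             # layer 1: d -> h
W2 = np.random.randn(h, d)             # layer 2: h -> d

# Layer 1 split by h: each device holds a (d, h/2) column slice of W1.
W1_shards = np.split(W1, 2, axis=1)
# Layer 2 split by d: each device holds an (h, d/2) column slice of W2.
W2_shards = np.split(W2, 2, axis=1)

# Each device computes its slice of the hidden activation: (batch, h/2).
hidden_shards = [X @ w for w in W1_shards]

# All-gather: layer 2 needs the full (batch, h) activation on every device.
hidden_full = np.concatenate(hidden_shards, axis=1)

# Each device computes its (batch, d/2) slice of the output.
Y = np.concatenate([hidden_full @ w for w in W2_shards], axis=1)

assert np.allclose(Y, (X @ W1) @ W2)   # matches the unsplit computation
```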
Mostly, we don't need to train giant models.
WOOOOOOOOOO Noammmm!
Yay