Please note we have deprecated the dvcorg/cml-py3 container image.
You can get the same results with:
- container: docker://dvcorg/cml-py3:latest
+ steps:
+ - uses: actions/checkout@v3
+ - uses: iterative/setup-tools@v1
You made a complex topic sound very simple with your easy walkthrough steps! Please keep up the good work.
really appreciate it, Nagarjuna! Always feel free to let us know if there's a topic you'd like to see :)
@@dvcorg8370 could you please make a video on how to make unit tests for models in MLOps?
Starting in ML from a non-CS background was already hard enough, but Elle came thru and just made me smile and feel better about this complex subject.
I'm rewatching this entire series again. After looking at udemy, coursera, and even a few other websites there isn't someone talking about how to go from making ML projects on ur laptop to production environment.
Honestly, I'm grateful for the inspiration and I'm more committed to this self-learning route.
I doff my hat to you Elle... for a very crisp, easy to understand and uncluttered explanation of MLOps...
That diff report in pull request is awesome, thank you for sharing. I will try to use this technique in the future.
Wow, incredible clarity in your presentations. Thanks for all the great work, Elle!
This is so cool. I Loved it. We can use this for writing test cases in PRs. Thank you.
Excellent walkthrough! Would be cool to incorporate experiment tracking tools like Weights and Biases to automatically report metrics. But for starters, this is really a job well done!
Thank you for the excellent tutorial Elle and @DVCorg!
Soo cool to see this Elle! thank you for sharing and teaching us a thing or two in the community!
Wow. This was soo good. She made it so easy to understand.
Great tutorial. Thank you!
Glad it was helpful!
Great stuff, I'm learning
Great video! Such precise and clear explanations! Thank you for sharing.
Excellent tutorial. Keep it up!
Thanks Aleksandr! Much appreciated :)
@@dvcorg8370 It's my pleasure! :-)
You are defining things in rightful manner and things are understood easily. AMAZING 🤩
Thanks so much, Mayur! The kind words are really appreciated :)
pretty to explain the topics about the MLOps..keep it up.good work elle.
Great explanation
Glad you think so!
Very good thank you. Superbly explained.
Thank you that was very helpful!
🔥🔥🔥
What is special about github actions and CML so I use them instead of using something like jenkins for example??
Do you have any video showing how to configure the token ? I’m having a hard time with that config
Hi! That's the tutorial I was searching for. Thanks a lot!
Thank you so much for this video and all your work, it is just amazing!
You're very welcome!
Awesome presentation. Thank you for your great work
Thanks Phillipe!
Excellent
Thank you! Cheers!
15:08
- I made an amazing model
cat in the background : Yaaa
Congrats!
Really great video thank you
Very nice explanation indeed, thank you so much, keep it up
Awesome video, want to know how to use tpu and gpu ?
What's the main difference with DVC? How do they fit together, or don't they? Thanks again!
DVC and CML complement each other. CML was created by the DVC team - see cml.dev
A bit more tech details: DVC is usually used to transfer data to CI/CD (CML) runners.
@@dmitrypetrov3542 Ok! So from my understanding DVC is for experiment tracking and CML is more for CI/CD MLOps?
@@jackbauer322 exactly. DVC - data & ML experiments. CML - team collaboration & ML training.
you are just amusing!
this is great! thanks for sharing
This is extremely helpful Elle and DVCorg. Had a follow-up question - if I wanted to generate multiple metric files and residual plots from the train.py script (say because I am running a loop varying max_depth over [5,10,15] or varying some other hyperparameters), what would be the best way to modify the workflow so that I can see all the data and viz in one commit?
A crude way could be to store the metrics and plots with different names in train.py and add them separately to report.md in the cml.yml file. However, as the number of loops increases, this wouldn't be a scalable method.
So what if you were to write out your metrics in one file using longform? So for example....
max_depth | accuracy
5         | 87
10        | 90
15        | 92
And likewise, put all your plots on one axis- so like, many lines of different colors, using your favorite plotting library.
Then you'd be able to print your table and your summary plot in your cml report with only one line of code each, no matter how long your loop is.
@@dvcorg8370 Ahh yes, a very nice workaround. Thanks.
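For anyone wanting to try the long-form idea above, here is a minimal sketch in plain Python. It is only an illustration: `train_and_score` is a hypothetical stand-in for the real training code in train.py, the accuracy values are just the ones from the example table, and `metrics.md` is an assumed file name.

```python
# Hypothetical stand-in for the real training step: returns an accuracy
# for a given max_depth. Replace with your own train/evaluate code.
def train_and_score(max_depth):
    placeholder = {5: 87, 10: 90, 15: 92}  # numbers from the example table above
    return placeholder[max_depth]


def metrics_table(depths):
    """Build one long-form Markdown table with a row per hyperparameter value."""
    rows = ["| max_depth | accuracy |", "|---|---|"]
    for depth in depths:
        rows.append(f"| {depth} | {train_and_score(depth)} |")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    # metrics.md is an assumed file name; append it to report.md in the workflow.
    with open("metrics.md", "w") as f:
        f.write(metrics_table([5, 10, 15]))
```

In the workflow, a single `cat metrics.md >> report.md` should then add the whole table, however many values the loop covers.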
Dear Elle, what would be the required changes for implementing CML in GitLab? Does GitLab have some type of "GitHub Actions" functionality? If so, where can I check for it?
Good q- GitLab has something called GitLab CI, which is extremely similar and gives you most of the same functionality! There are a few subtle differences in how you set up things like environmental variables/secrets, but it's not too bad. We have some docs here: dvc.org/doc/cml/start-gitlab
Thank you very much, but why do I have errors? I couldn't run it after the first commit. I tried nearly everything. The error comes from the line with the importance plot. What could it be?
Let's say we have a couple of commits in the experiment branch and we want to merge the branch with squashed option. What would happen then? All the reports would be combined?
Hi, thanks for comprehensive explanation!)
But I have one more question. Can I use CML with Azure TFS ?
Yes you can! See these docs: cml.dev/doc/cml-with-dvc. And please join us in our Discord server if you have more questions! discord.gg/rpgRdvfyAf
thanks for sharing the talk
Hi Elle, can you shed some light if I can do the same, but with a different docker image, such as continuum/anaconda3, so I can do the same for a conda environment? Other than the docker image link, what else would I need to change?
Awesome tutorial!
Thanks Aniketh!
Very nice tutorial! I really like this concept of integrating into the normal software stack.
How would one handle the situation of adding new metrics over time? E.g. If you begin a project only displaying F1 score, but as you train more models you realise you are also interested in seeing and comparing the precision. Could this be catered for using CML?
Yep, using the existing software stack for ML is one of the ideas behind CML.
That's a really good question. The flow relies on Git a lot. So, if the scores were stored/committed then you can derive F1 as well as precision. However, if the scores were not stored/committed you might need to go back and create another experiment just to get the right scores to compare. How do you do that with the other tools or approaches?
One relevant discussion - github.com/iterative/dvc/issues/4210
@@dmitrypetrov3542 thanks very much for the reply! Yes, I thought the solution might be something along those lines. For database approaches such as MLFlow one can log metrics later on to previous experiments/runs. I suppose with a git-based system of storing metrics one could manually add an extra commit with the new scores? Or of course rerun the experiment in the normal way with the new scores included, as you suggest. Although for long training times that could be a problem, if you are actually just wanting to do scoring, not training.
@@shaunirwin2016 yes, an additional commit is one of the solutions.
Re long-running experiments - you are right, but the same happens with logging tools like mlflow - you need to retrain to get the metrics. The only difference is that the commit is not needed.
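One way to read "derive F1 as well as precision from stored scores" is to commit the raw confusion-matrix counts rather than a single headline number, so new metrics can be computed later without retraining. The sketch below assumes that interpretation; the counts and the idea of a committed counts file are hypothetical, not something shown in the video.

```python
import json


def derive_metrics(counts):
    """Compute precision, recall and F1 from stored confusion-matrix counts."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, as they might appear in a committed JSON file.
stored = json.loads('{"tp": 8, "fp": 2, "fn": 2}')
print(derive_metrics(stored))
```

The extra commit mentioned above would then only need to add the newly derived scores; the training step itself would not have to run again.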
Dear Elle, would you be so kind as to show/describe how one can implement a dvc pull request that is meant to be run by a .github/workflows YAML file, so that it is only run on the git remote repository? An approach through which it would be possible to "gitignore" the DVC data, while allowing the git remote temporary access to the data to properly test the committed CML workflow. Perhaps use some kind of data cache on the git remote repository, and later an automatic deletion of this cached data?
One approach is using a local DVC config file, which lets you have a different data remote/different credentials for when you're working locally than what's in your CI/CD system. That means you can still have a DVC config file that gets pushed to your Git repo, but you'll have a local version that gets used when you're developing in your workspace. Docs here: dvc.org/doc/command-reference/remote#example-add-a-default-local-remote
Another thought that comes to mind is that you could make the credentials to pull from the DVC remote only available to the runner (via secrets). You might then write a control flow statement... if those environmental variables are present, then run dvc pull; else, don't. If you want to discuss this in more detail, stop by the CML channel on our Discord: discordapp.com/invite/dvwXA2N
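The control flow described above could look roughly like this in Python. It is a sketch under assumptions: the variable names are the standard AWS credential names, used here only as an example, so swap in whatever secrets your own DVC remote needs.

```python
import os
import subprocess

# Assumed credential names (these are the standard AWS ones); swap in
# whatever secrets your DVC remote actually needs.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def credentials_present(env=None):
    """Return True only if every required variable is set and non-empty."""
    env = os.environ if env is None else env
    return all(env.get(name) for name in REQUIRED_VARS)


def pull_data_if_possible(env=None):
    """Run `dvc pull` when credentials exist; otherwise skip it."""
    if credentials_present(env):
        subprocess.run(["dvc", "pull"], check=True)
        return True
    print("No DVC remote credentials found; skipping dvc pull.")
    return False
```

Calling `pull_data_if_possible()` at the start of the training script would pull data on the runner, where the secrets are set, and skip the pull anywhere they are missing.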
Thx. The tutorial is amazing. In comments, I am not able to see the PNG files, only the links. Do I need to configure something more?
Hm, that sounds like you might be missing a flag in your cml-publish function. Do you have `cml-publish --show-md >> report.md`? If you don't have the `--show-md` flag, you'll get a link to your image instead of an embedded picture.
@@dvcorg8370 Thank you again! Now, it works for me :)
Have you deleted the experiment branch from the repository?
Yes, but you can see the closed PR and browse the branches at previous points in time github.com/andronovhopf/wine/pull/2
One thing I figured that the actions do not always trigger upon a new commit to a branch. Is there a way to prevent it?
They trigger on push events. If you make several local commits and push them all at once, the workflow runs only for the last commit. So, you need to push after each commit.
Hello Elle, this looks great, it seems that it works for Python only? I develop Machine Learning tools in R, and I would love to help integrate this if possible
The tools we're using here (GitHub Actions and CML) work with any language! Here's a blog about a project using R: mribeirodantas.xyz/blog/index.php/2020/08/10/continuous-machine-learning/
There's a GitHub Action for getting R on your runner, too: github.com/r-lib/actions
DVCorg thanks, you are doing an amazing job
Beautiful.
Hands up if you also had an espresso while watching this.
Thanks a lot
How can I set a secret token in GitHub Actions? My program is calling an API so I need to write the secret token, but I don't know if it's correct to write it in cml.yaml because it's going to be public
You can add the secret to your GitHub repository, which will give the runner access to it via an environmental variable. You can set it so the variable will be hidden even in logs- check out their docs! docs.github.com/en/actions/reference/encrypted-secrets
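On the Python side, reading such a secret might look like the sketch below. `MY_API_TOKEN` is a made-up name for illustration; it assumes the workflow maps the repository secret to an environment variable of the same name, using the same `${{ secrets.… }}` syntax as `repo_token` elsewhere in this thread.

```python
import os


def get_api_token(env=None, name="MY_API_TOKEN"):
    """Read the token from the environment instead of hard-coding it."""
    env = os.environ if env is None else env
    token = env.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; add it as a repository secret.")
    return token
```

This way the token value never appears in cml.yaml or in the committed code, only its name does.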
what can this CML tool do that circleci Continuous Integration can't do?
To be clear, CML isn't a competitor to Circle CI. Circle CI is more analogous to GitHub Actions or GitLab CI; it's a continuous integration system.
CML is a toolkit that works with a continuous integration system to 1) provide big data management (via DVC & cloud storage), 2) help you write model metrics and data viz to comments in GitHub/Lab, and 3) orchestrate cloud resources for model training and testing.
Currently, CML is only available for GitHub Actions and GitLab CI. But it could in the future integrate with Circle CI (i.e., as an Orb).
@@dvcorg8370 thanks for the detailed reply. I've got it clear in my head now 😃, I watched the other vids in the series and you explain very clearly. I look forward to videos on setting up a cloud workflow with CML and versioning, like S3 or GCP. I'm not sure if you are planning to do DL content. As a suggestion, I would love to see PyTorch workflows on cloud with, say, multiple GPUs. And basic training tests in a CML workflow, like a sanity check: fitting/evaluation on a single batch etc.
Please keep up tutorials!
@@jordieclive No problem! Let us know any other questions you have :)
Scenario: I need an NN model and want to test it using a GPU. Is that possible as well?
Yes! We'll be covering that use case in a video soon. For now we have an example project to browse: github.com/iterative/cml_cloud_case
@@dvcorg8370 That looks good. Will it support local machine(not cloud) as well?
@@jwc7663 Yes- you can set GitHub Actions (& GitLab CI, too) to use self-hosted runners, which can be a local machine. Check out the docs here: docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners
@@dvcorg8370 I would love to see the self hosted GPU flow with the ability to compare the results from the model that is in the master branch repo. And using dvc to roll the data set back to the data set that was used to train the model in master branch. So we could compare both models, on new and old data.
@@efels_com We can do this! Adding this to the list of to-dos.
Hi, thanks for your very useful video. I have a question, because I was trying to replicate this example in my own repo and it failed in this part of the cml.yaml
`steps:
  - uses: actions/checkout@v2
  - name: train_model
    env:
      repo_token: ${{ secrets.GITHUB_TOKEN }}`
By GITHUB_TOKEN, do you mean a secret key that I assign in the Settings/Secrets tab of the repo, which is a private key? If this is true, I don't know why it doesn't work if I put my own private key name :(
Hi Leila! You don't have to assign any value to GITHUB_TOKEN- it is assigned by default in a GitHub repository. Please delete any secrets you might have added and try again. If it doesn't work, stop by our Discord channel where we can do more hands-on troubleshooting :) discord.gg/bzA6uY7
@@dvcorg8370 Thanks! It did work!
Hi can you make video on mlcertific.com It is providing free certification on MLOps
How would mlflow come in here?
Good question- you can integrate lots of tools with CML. For example, you can use it with Tensorboard to get a link to your Tensorboard in a PR whenever the model trains. Check out this use case: github.com/iterative/cml_tensorboard_case/pull/3
We haven't tried with MLFlow in particular yet, but expect there could be a similar approach.
@@dvcorg8370 Thanks ! Can't wait for the next videos :)
i love u
🦉 We love you too!
how do you get around the " `GLIBC_2.28' not found " error?
This error typically occurs when trying to run a program that was compiled with a newer version of the GNU C Library (GLIBC) than what's installed on your system. Check that version requirements match up and you should be all set!
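To see which GLIBC version a machine or container actually has, a quick check from Python is possible with the standard library's `platform.libc_ver()`. This is only a diagnostic sketch: the `2.28` default comes from the error message above, and the helper names are made up for illustration.

```python
import platform


def parse_version(text):
    """Turn a dotted version such as '2.28' into a comparable tuple (2, 28)."""
    return tuple(int(part) for part in text.split(".") if part.isdigit())


def glibc_is_new_enough(required="2.28"):
    """Compare the system's C library version against the required one."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        return None  # not a glibc system, or the version could not be detected
    return parse_version(version) >= parse_version(required)


if __name__ == "__main__":
    print(platform.libc_ver(), glibc_is_new_enough())
```

If this reports an older version than the program needs, the usual options are a runner or container image with a newer base OS, or a build of the program made for the older library.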