The presentation by Julien Simon starts at @07:34
Great lectures to revisit.
Thank you, Julien! Looking forward to seeing you at our 9th Data Science UA Conference this week :)
Hi Julien! First of all, thanks for the video! One question, please: to run the processing job, should the SageMaker notebook have permission to access ECR in its IAM role, or should the role passed to the processing job have this permission?
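For context, here's a minimal sketch (the role ARN, bucket, and script name are made up) of where the execution role is passed when launching a processing job with the SageMaker Python SDK. The role given to the processor is the one the job itself assumes at runtime, which is typically where image-pull and S3 permissions would live; the notebook's role only needs to be allowed to create the job and pass that role.

```python
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput

# Hypothetical role ARN and S3 paths, for illustration only.
processing_role = "arn:aws:iam::123456789012:role/MyProcessingRole"

sklearn_processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=processing_role,          # this role is assumed by the processing job itself
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

sklearn_processor.run(
    code="preprocessing.py",       # assumed local script name
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)
```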
👍
Hi Julien, Thanks for sharing your knowledge.
Just one question: if I want to preprocess the raw data, e.g. merging it with another dataset to create new features, which is not available in the standard sklearn column transformer techniques, how would I achieve that? I specifically want to deploy it as an endpoint so the preprocessing is applied to the test data as well.
I think you're looking for Inference Pipelines :) docs.aws.amazon.com/sagemaker/latest/dg/inference-pipeline-real-time.html
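To make that suggestion concrete, here's a rough sketch (bucket paths, model names, and the XGBoost example are assumptions, not from the video) of chaining a fitted preprocessing model with a trained model into a single inference pipeline endpoint, so both run on every prediction request:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel
from sagemaker.sklearn.model import SKLearnModel

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Hypothetical artifacts: a fitted preprocessor and a trained XGBoost model on S3.
preprocessor = SKLearnModel(
    model_data="s3://my-bucket/preprocessor/model.tar.gz",
    role=role,
    entry_point="preprocessing.py",   # serves the fitted transformers at inference time
    framework_version="0.23-1",
)

xgb_image = image_uris.retrieve("xgboost", session.boto_region_name, version="1.2-1")
xgb_model = Model(image_uri=xgb_image,
                  model_data="s3://my-bucket/xgb/model.tar.gz",
                  role=role)

# The endpoint runs the containers in order: preprocessing output feeds the model.
pipeline_model = PipelineModel(name="preprocess-then-predict",
                               role=role,
                               models=[preprocessor, xgb_model])
pipeline_model.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```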
Hello @Julien, can we automate SageMaker Data Wrangler data processing with the Step Functions Python SDK? I'm curious how that could work.
Workflows are constructed interactively and you can export them to SageMaker Pipelines docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-data-export.html
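As a rough idea of what the exported flow can drive programmatically, here is a minimal SageMaker Pipelines sketch (the processor settings, script name, and S3 paths are assumptions) that runs a processing step as part of a pipeline you can start on a schedule or from other automation:

```python
import sagemaker
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

role = sagemaker.get_execution_role()

processor = SKLearnProcessor(
    framework_version="0.23-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

# Hypothetical script and S3 locations.
step_process = ProcessingStep(
    name="PreprocessData",
    processor=processor,
    code="preprocessing.py",
    inputs=[ProcessingInput(source="s3://my-bucket/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(source="/opt/ml/processing/output",
                              destination="s3://my-bucket/processed/")],
)

pipeline = Pipeline(name="data-wrangler-style-pipeline", steps=[step_process])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
pipeline.start()
```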
@juliensimonfr Hi, I would like to elaborate on the above question. I have a lot of use cases where I start from a large dataset (easily 500 columns) that requires a lot of cleaning before it can be fed to a model. Creating a preprocessing job interactively is not an option here: it would take way too much time to manually select the right column for each transformation. Having a Python API for Data Wrangler to create a preprocessing pipeline programmatically would be a really nice feature.

I understand that it is also possible to use SageMaker Processing to execute a preprocessing script. However, all the examples and documentation about this feature focus on scripts that are only used to prepare data for model training and that are not deployed as part of the inference pipeline. For example, some sklearn transformations are specified, then .fit() is called, then .transform(), and finally the processed data is exported. The 'trained' transformations (after calling .fit()) are not saved or exported and hence cannot be reapplied when you want to score the model. Is there a way to train a preprocessing pipeline and deploy it in the same way as you would train a model?
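One possible pattern (not from the video; the column names and paths below are made up): treat the preprocessing pipeline itself as a "model". Fit it in a SageMaker SKLearn training job, persist the fitted transformers with joblib, and reload them at inference time via model_fn. The resulting model artifact can then be chained in front of the real model in an inference pipeline, as in the PipelineModel sketch above.

```python
# preprocessing.py - sketch of a script for the SageMaker SKLearn estimator
# (column names and file names are illustrative assumptions)
import argparse
import os

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def model_fn(model_dir):
    # Reload the fitted preprocessing pipeline when the model is deployed.
    return joblib.load(os.path.join(model_dir, "preprocessor.joblib"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--model-dir", default=os.environ.get("SM_MODEL_DIR"))
    args = parser.parse_args()

    df = pd.read_csv(os.path.join(args.train, "raw.csv"))

    preprocessor = ColumnTransformer([
        ("num", StandardScaler(), ["age", "income"]),                   # hypothetical numeric columns
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),   # hypothetical categorical column
    ])
    preprocessor.fit(df)

    # Persist the *fitted* transformers so they can be reapplied at scoring time.
    joblib.dump(preprocessor, os.path.join(args.model_dir, "preprocessor.joblib"))
```

Running this through the SKLearn estimator's .fit() produces a model.tar.gz containing the fitted transformers, which can be wrapped in an SKLearnModel and deployed together with the trained model behind one endpoint.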
Hi @julien. Thanks for sharing your knowledge. How can I get the notebooks that you used in this session?
Sure, they're part of my book: github.com/PacktPublishing/Learn-Amazon-SageMaker/tree/master/sdkv2/ch6/lda-ntm