It is the best video about Falcon-7B I've seen! Thank you a lot!
You'd be the ultimate champ if you could produce a similar video and Python Notebook for Falcon-40B using the "ml.g5.12xlarge" instance type.
It's very simple: you just have to change the model name to: hub = {'HF_MODEL_ID': 'tiiuae/falcon-40b'}.
And:
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.12xlarge',
    container_startup_health_check_timeout=300
)
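For completeness, the model-definition cell that precedes this deploy call would look roughly like the sketch below. The SM_NUM_GPUS value and the LLM image helper are assumptions on my side (ml.g5.12xlarge has 4 GPUs, so the weights get sharded across all of them); the video's notebook may construct the model slightly differently:

import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

hub = {
    'HF_MODEL_ID': 'tiiuae/falcon-40b',
    'SM_NUM_GPUS': '4'  # assumption: shard across the 4 GPUs of ml.g5.12xlarge
}

huggingface_model = HuggingFaceModel(
    image_uri=get_huggingface_llm_image_uri("huggingface"),  # text-generation-inference container
    env=hub,
    role=sagemaker.get_execution_role()
)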
Just be careful with the cost of the VM, it is quite expensive.
Thank you for your positive feedback!
Awesome video. Thank you so much
Thank you for the great feedback!
Nice Video. Appreciate your work:)
I'm glad you enjoyed the video. Your feedback is valuable!
Awesome video! Keep up the good work.
Thanks a lot for the feedback!
Hi there,
Great video. One question: is there any way to calculate the costs of deploying and shutting down these models by the hour using the calculator?
Regards.
Would it be equivalent to searching for the costs of using serverless inference?
The cost of using Amazon SageMaker for inference can vary depending on the approach you take. If you use a virtual machine (VM) like g5.2xlarge, the pricing is based on the time of usage. For example, if the VM is up and running for 5 hours, you will be charged for 5 hours of usage at the hourly rate of the g5.2xlarge instance.
On the other hand, if you opt for a serverless endpoint, the pricing is calculated differently. It takes into account the duration of each inference request, the amount of RAM needed to run your model, and the data transferred in and out during the inference process. For instance, if the average duration of an inference request is 20 seconds, and you make 1000 requests, you will be billed for 20,000 seconds of inference time.
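As a quick back-of-envelope comparison, here is a minimal sketch of both calculations; the two rates below are assumptions on my part, so plug in the current numbers from the pricing page:

vm_hourly_rate = 1.21            # assumed on-demand $/hour for ml.g5.2xlarge
vm_cost = vm_hourly_rate * 5     # endpoint running for 5 hours -> ~$6

serverless_rate = 0.00012        # assumed $/second for your chosen memory size
serverless_cost = serverless_rate * 1000 * 20  # 1000 requests x 20 s -> ~$2.40

print(vm_cost, serverless_cost)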
For a more detailed explanation of the cost breakdown, you can refer to the 'On-demand pricing' section on Amazon SageMaker's pricing page. Specifically, check the information on 'Serverless inference' for details on the serverless endpoint pricing model, and 'Real-time inference' for VM-based endpoint pricing like g5.2xlarge.
Here are the links to the pricing pages for your reference:
aws.amazon.com/sagemaker/pricing/?nc1=h_ls
instances.vantage.sh/?cost_duration=monthly&selected=g5.2xlarge
I hope this clears up any doubts you have about the pricing options for SageMaker inference.
@@NechuENG Great! Thank you very much for the clear response and for the great content :)
Hi, I am using an instance in the Tokyo region, but when deploying the model it shows an 'insufficient memory' error for the model instance. What should I do about that? Should I switch to US East (N. Virginia) so that the Falcon model will deploy?
If you're encountering an 'insufficient memory' issue while trying to upload your model in the Tokyo region, it's important to understand a few key factors. Changing the region to East Virginia might not necessarily resolve the problem.
1 - Model Size: If you're trying to upload a larger model (for example, a 40GB model), it could lead to memory constraints, regardless of the region you choose.
2 - Instance Type: The instance type you're using also plays a significant role in memory availability. If you're using a smaller instance type with limited memory, it could lead to memory-related issues during model upload. You might want to choose an instance type with more memory capacity.
Before switching regions, assess these factors and consider making adjustments accordingly. Ensure that your instance type is suitable for the model size you're working with because changing regions might not directly address the memory issue.
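To see why model size alone can blow past the available memory, here is a rough weights-only estimate (real usage is higher once you add activations and framework overhead):

params = 40e9              # falcon-40b parameter count
bytes_per_param = 2        # fp16/bf16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"~{weights_gb:.0f} GB for the weights alone")  # ~75 GB before any overhead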
Hi, I have another question. I have been getting the error: "is not authorized to perform: iam:GetRole on resource: role SageMaker-ExecutionRole with an explicit deny in a service control policy". Do you know why this could be the case? Would I need to contact the administrator?
It seems that the issue is related to your IAM permissions. The error message indicates that your IAM user doesn't have the necessary permission to perform the action iam:GetRole, and there is an explicit deny in a service control policy affecting this action. Keep in mind that an explicit deny in a service control policy overrides any allow you attach, so if the deny really does come from an SCP, only your organization's administrator can lift it.
To rule out a simple missing permission, you can try adding the IAMReadOnlyAccess policy to your IAM user. This policy grants read-only access to IAM resources, including iam:GetRole. Here's how you can do it (a programmatic equivalent is sketched after the steps):
1. Navigate to the IAM service in the AWS Management Console.
2. Select "Users" from the left-hand menu and find your IAM user in the list.
3. Click on your user to view details and permissions.
4. Under the "Permissions" tab, click on "Add permissions."
5. In the "Attach policies directly" section, search for "IAMReadOnlyAccess."
6. Check the box next to "IAMReadOnlyAccess" to select the policy.
7. Click "Add permissions" to apply the policy to your user.
8. Once you've added the IAMReadOnlyAccess policy, try performing the action again. The error should be resolved, and you should be able to access the required resources.
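If you prefer to do the same thing programmatically, a boto3 equivalent of the steps above looks like this; the user name is a placeholder, and running it requires iam:AttachUserPolicy permission, so an administrator may need to execute it for you:

import boto3

iam = boto3.client("iam")
iam.attach_user_policy(
    UserName="YOUR_USER_NAME",  # placeholder: your IAM user name
    PolicyArn="arn:aws:iam::aws:policy/IAMReadOnlyAccess"
)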
If the issue persists, consider checking for any additional policies attached to your IAM user that might conflict with the required action.
Alternatively, if you are using your company's AWS account, you can reach out to the AWS administrator within your company to review and adjust your permissions if necessary.
@@NechuENG Thanks, I ended up reaching out to the AWS administrator and it worked. Everything works fine now, thanks! Waiting for your course on SageMaker and Langchain, great content. 👌
I have 5mm prompts I want to run. Is this simply unsustainable to do using SageMaker? I want to spend less than $100.
Thank you for reaching out! When you say '5mm,' do you mean 5 million prompts? That's indeed a substantial amount of prompts to process, and there might not be a very cheap way to handle such a large workload. Could you also clarify whether these prompts are short or long? The number of words (tokens) in each prompt impacts the final price.
Regarding your options, you have a couple of approaches you can consider. One option is to use a service like OpenAI, where you pay per request. For smaller quantities of requests, this can be the most cost-effective choice. However, when processing a very large number of prompts, the cumulative cost can add up significantly. gptforwork.com/tools/openai-chatgpt-api-pricing-calculator
The second option is to deploy the model as we did in the video. While the initial cost might be higher, it could potentially become more cost-effective in the long run if you need to process a substantial number of prompts. However, keep in mind that even with this approach, processing 5 million prompts will still require a considerable investment.
Do you have any limitations with time? If the number of requests you need to handle is indeed 5 million, with only one VM it might take a long time. To speed up the process, you can launch multiple instances of virtual machines (VMs), but keep in mind that the price will increase accordingly.
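To make the scale concrete, here is a rough estimate; the latency and hourly rate are assumptions, so substitute your own measurements:

prompts = 5_000_000
seconds_per_prompt = 2     # assumed average latency on a single instance
vm_hourly_rate = 1.21      # assumed on-demand $/hour

total_hours = prompts * seconds_per_prompt / 3600
print(total_hours, total_hours * vm_hourly_rate)
# ~2,778 hours (~115 days) and roughly $3,400 on one VM; running several VMs
# in parallel shortens the wall-clock time but not the total cost.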
Excellent video!
I'd love to see a real deployment, like connecting the model to a chat on a website, working on something of your own with Langchain or something like that.
Regards!
Building a UI would be kind of cool.
Thank you very much for the feedback! I'm glad to hear you enjoyed the video.
A real deployment connected to a web chat is definitely an interesting topic, and I'm working on related content right now. In upcoming videos, I plan to show how to connect models to the web using tools like Langchain and Streamlit. Stay tuned!
As a preview, here's a link to the Langchain documentation that shows how to integrate your SageMaker models into your code; as always, they do a great job of simplifying the process: python.langchain.com/docs/integrations/llms/sagemaker
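In case it helps, here is a minimal sketch of that integration; the endpoint name and region are placeholders for the values from your own deployment:

import json
from langchain.llms import SagemakerEndpoint
from langchain.llms.sagemaker_endpoint import LLMContentHandler

class ContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        return json.dumps({"inputs": prompt, "parameters": model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        return json.loads(output.read().decode("utf-8"))[0]["generated_text"]

llm = SagemakerEndpoint(
    endpoint_name="YOUR_ENDPOINT_NAME",  # placeholder: the endpoint created in the video
    region_name="us-east-1",             # placeholder: your AWS region
    content_handler=ContentHandler()
)
print(llm("Why is the sky blue?"))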
Also, I invite you to follow my Spanish-language channel, where you'll find more videos on this topic and other content related to machine learning and model deployment: www.youtube.com/@NechuBM
Thanks! I will take it into account; stay tuned for future content on integrating this model into a UI.
@@NechuENG That would be super interesting and useful! Thanks!!
@@NechuENG Incredible, thanks for the quality content. I'm already following both channels.
Hey NechuENG, really nice video! I was wondering if I could help you edit your videos and also make highly engaging thumbnails to help your videos reach a wider audience.
I appreciate your enthusiasm for improving the content. Let's continue the conversation privately. Please feel free to connect with me on LinkedIn so we can discuss this further: www.linkedin.com/in/daniel-benzaquen-moreno/
Instead of deleting the instance, can you just pause it when you are not using it? I understand that there would probably be storage costs, but I would suspect it would be less expensive than the running costs.
Great question! While pausing instances would be a great feature, SageMaker doesn't currently support pausing real-time endpoints. For models smaller than 6 GB, you can instead use the serverless deployment option, which offers a cost-effective way to deploy smaller models without worrying about managing instances.
However, in our case, our model exceeds the 6 GB limit for serverless deployment, which is why we have to resort to deleting (can’t pause) the endpoint and recreating it when needed. During the deletion process, the underlying infrastructure and resources associated with the endpoint are removed, and when you recreate the endpoint, it sets up a new instance with fresh resources to handle your model.
It's important to note that this deletion and recreation process might take a few minutes, so keep that in mind when planning the usage of your model.
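For reference, the teardown and redeploy cycle is just a couple of calls, assuming the predictor and huggingface_model objects from the video's notebook:

predictor.delete_endpoint()  # stops the billing; the model definition remains

# later, when you need it again (provisioning takes a few minutes):
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.g5.2xlarge',
    container_startup_health_check_timeout=300
)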