Given that classification models are trained on images of small resolution, e.g. 224x224px, is there any simple way to tell if there is an object in a larger picture (or higher-resolution image)? That is, say I trained a model to identify sheep in a 224x224px image. Can I use that to identify if there is a sheep in a larger picture? I don't need to know the bounding box, just whether there is a sheep or not. Which part of ML theory addresses that problem? P.S. Thank you for the videos - high-level content, very much appreciated. Hard to find something dense and not just someone explaining linear regression to average people!
Hello and thanks for the comment!
There are 2 easy techniques that come to mind regarding your question.
- You can rescale the big image down to 224x224px and apply your model. If the sheep is big enough (>10% of the picture), the model should have no problem identifying it.
- The "Sliding Window" approach: slide your 224x224px window across the bigger image. If there is a sheep in one such window, you will detect that.
Having said that, I should mention that most pre-2015 image classification models were trained on the ImageNet dataset. It contains "canonical" views of objects: the sheep will be in the center, facing directly into the camera, and unobstructed. The COCO dataset, like real-life pictures, tends to have a lot of obstructions, objects facing in different directions, and objects located away from the center. Proper object detectors trained on COCO will be much better in terms of quality when used in real life than old-school ImageNet models.
The first option won't work, and not just for me I guess: any reasonably decent image would easily be, say, 1200x1200, so 224x224 is roughly 25x smaller in pixel count. Rescaling makes the target object just too small. I also tried it on some images to see the effect visually - I can barely find the object with my eye.
The second approach will be too expensive. I'd have to slide the window roughly 1000x1000 times to cover all possible positions.
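For what it's worth, the ~1000x1000 estimate assumes the window is moved one pixel at a time; in practice the window is stepped by a larger stride, which changes the count dramatically. A quick back-of-the-envelope check (the 112px stride is my assumption, not something stated in the thread):

```python
# Number of 224x224 window positions in a 1200x1200 image.
dense = (1200 - 224 + 1) ** 2              # stride 1: 977^2 = 954,529 windows
strided = ((1200 - 224) // 112 + 1) ** 2   # stride 112: 9^2 = 81 windows
print(dense, strided)
```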
I'm puzzled now - what's the value of these pretrained networks like AlexNet if one cannot use them to identify an object in a real-world picture? Only if it's 224x224, or slightly bigger so it can be rescaled without too much loss of information.
Second thought - what's the value of an MNIST-trained model? Say I have a scanned A4 page with one digit (or a postcode) somewhere on the page. How do I tell if the scanned image contains a postcode? Or if I have an image of a building and I want to know whether it contains the house number? How does one use these models to locate objects in real-world situations… without having to train a different network to predict the rectangle.
@zholud Apart from their scientific value (the whole of deep learning received a huge boost thanks to the ImageNet competition), these models are typically unusable in real-world applications by themselves, but they are used inside other models as a "backbone" responsible for the initial image processing. For example, YOLO takes a pretrained backbone (ResNet in some implementations) and trains a few extra layers on top to do the actual object detection.
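To make the backbone idea concrete, here is a minimal sketch assuming torchvision's ResNet-50; the one-layer "objectness" head is a toy stand-in, not YOLO's actual detection head:

```python
# Illustrative backbone-plus-head pattern: reuse a pretrained ResNet for
# feature extraction and train only a small new head on top.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
backbone = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool + fc
for p in backbone.parameters():
    p.requires_grad = False  # freeze: reuse the ImageNet features as-is

# Toy head: per-cell "is there an object here?" score on the feature grid.
head = nn.Conv2d(2048, 1, kernel_size=1)

x = torch.randn(1, 3, 448, 448)             # detectors often use larger inputs
features = backbone(x)                       # (1, 2048, 14, 14) feature grid
objectness = torch.sigmoid(head(features))   # (1, 1, 14, 14) per-cell scores
```

The pattern is the same in real detectors: the pretrained convolutional stack supplies the features, and only the task-specific head (plus, optionally, the last backbone stages) is trained on COCO-style data.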
@makgaiduk Thanks! I will watch your other videos, in particular the more in-depth YOLO one - I haven't gotten there yet :) And keep up with the videos - really valuable content, and at the right technical level for someone who already knows some math :)