I just tried this. It is very simple to use, but it is basically just a wrapper for the together-ai package. All this is doing is restricting configurability! But thank you very much for the video and for pointing me to this project. I was surprised at how accurate it is.
Vision models be mysterious wizardry. They excite me the most of all because I firmly believe a future conscious 'model' could be iterated from vision models (not a new idea, but not mentioned enough, I think). If there were a way to keep the vision model exclusively in virtual space... a whole wealth of experimentation could open up around visualizing things. It might even turn hallucinations into useful features.
Nice! How does it differ from the Docling or LlamaParse solutions?
I'm gonna suggest this video to PAPERLESS-NGX, I think this needs to be a MUST feature on that project.
There used to be a Chrome extension that made the text in any image editable. Where is it?
I'd be super interested in knowing the process of training on object detection / region of interest. Anyone have pointers where I can read up on this?
I've done it before using YOLOv7 (don't use v8; that requires you to use some cringe website).
For labeling images I used CVAT. CVAT lets you label and store your images and then export them in YOLO format. After that, it's a matter of piping them through to the YOLOv7 framework for training.
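For anyone following the reply above: a YOLO label file is plain text with one line per box, holding the class index followed by the box center and size, all normalized to the image dimensions. Here is a minimal sketch of that conversion, assuming your boxes start out as pixel coordinates `(x_min, y_min, x_max, y_max)`. The function name is made up for illustration and is not part of CVAT or YOLOv7.

```python
def to_yolo_line(class_id, box, img_w, img_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    # YOLO stores the box center and size as fractions of the image size.
    cx = (x_min + x_max) / 2 / img_w
    cy = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Example: one box of class 0 on a 1000x800 image.
print(to_yolo_line(0, (100, 200, 300, 400), 1000, 800))
```

If you export from CVAT in YOLO format you should not need this yourself, but it is handy for sanity-checking the exported label files.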
Hm…I’d rather use Python to crop the image to a given region, then feed the entire cropped image to the vision model. Not sure why / if you can train a “general vision model” to only look at certain regions of an image…could be interesting but doesn’t that turn the model into a more traditional supervised model at that point?
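The crop-first approach described above could look roughly like this. It is only a sketch: the helper name and the idea of describing the region as fractions of the image are my assumptions, and the actual cropping is left to an imaging library such as Pillow (shown in a comment so the snippet runs without it).

```python
def region_to_box(region, img_w, img_h):
    """Turn a fractional region (left, top, right, bottom in 0..1) into a pixel box."""
    def clamp(value):
        return min(max(value, 0.0), 1.0)

    left, top, right, bottom = region
    box = (
        round(clamp(left) * img_w),
        round(clamp(top) * img_h),
        round(clamp(right) * img_w),
        round(clamp(bottom) * img_h),
    )
    if box[2] <= box[0] or box[3] <= box[1]:
        raise ValueError("region is empty after clamping")
    return box


# With Pillow installed (an assumption), the box can go straight to Image.crop,
# which takes a (left, upper, right, lower) tuple. "receipt.png" is a placeholder:
#   from PIL import Image
#   img = Image.open("receipt.png")
#   cropped = img.crop(region_to_box((0.5, 0.0, 1.0, 0.5), *img.size))
print(region_to_box((0.5, 0.0, 1.2, 0.5), 1000, 800))
```

The cropped image is then what you would send to the vision model, instead of the full page.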
Can it recognize license plates in non-Latin alphabets?
Question @Sam: would designing forms, documents, etc. to assist OCR help? For example, delimiting label:data with a colon (:), assuming colons have no other reason to exist in the text. In your opinion, what works best? Delimiters, color contrast?
Can it capture handwritten text perfectly?
Does it work with handwritten text?
No, but you can train your own to work on your own handwriting specifically, without too much difficulty.
GPT-4o is surprisingly good at handwriting OCR, but as with all GenAI output, you must validate before using it for anything critical.
Doing simple OCR via an LLM is like shooting a fly with a bazooka.
I think they have the capability to understand the context of the input. If there are any mistakes, like simple letter mistakes, there could maybe be a feature to automatically correct them. There could also be a slider to adjust between most original and most sensible. Without any of these, it's just like any other model, I guess.
Actually, it can be even worse than specialized models like YOLO, Tesseract, Paddle, etc.
For instance, if you have custom ASCII symbols, no LLM can recognize them as well as a fine-tuned OCR library can.
Can we get bounding boxes using this model?
Is it possible to do it in JavaScript?
How can I convert the output to the required output as JSON?
exciting!
When are you integrating it with agents?
How do you get rid of hallucinations, especially in this kind of project? Is JSON a good output format?
I would say that JSON output on its own won’t help. It is only helpful if you know the structure of the data to be extracted (say, every document has a title, a table with certain columns, etc.). Then specifying a JSON schema (the expected output format) should help.
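To make the "known structure" point concrete, here is a small sketch of checking a model's JSON output against the fields you expect. It is a hand-rolled check using only the standard library, not a full JSON Schema validator, and the field names are hypothetical.

```python
import json

# Hypothetical expected structure for a receipt; adjust to your own documents.
EXPECTED_FIELDS = {"title": str, "total": (int, float), "items": list}


def check_extraction(raw_text):
    """Parse model output and return a list of problems (empty list = looks OK)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    problems = []
    for name, expected_type in EXPECTED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Fields the model invented are a common sign of hallucination.
    for name in data:
        if name not in EXPECTED_FIELDS:
            problems.append(f"unexpected field: {name}")
    return problems


print(check_extraction('{"title": "Walmart", "total": "12.5"}'))
```

A non-empty result is a reasonable trigger for a retry or a manual review rather than trusting the extraction.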
A few things that have helped me: Use a temperature at/near zero. If you have the potential for empty data, prompt to leave it out rather than give empty values.
@coredog64 Leaving it out is definitely a good strategy. It even saves tokens.
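Even with a prompt that asks the model to leave out missing data, it can still return empty values now and then. Here is a minimal sketch of a post-processing safety net that drops them. The helper name is made up, and treating `None`, `""`, `[]` and `{}` as "empty" is my assumption about what you want removed.

```python
def drop_empty(value):
    """Recursively remove None, empty strings, and empty lists/dicts."""
    empties = (None, "", [], {})
    if isinstance(value, dict):
        cleaned = {key: drop_empty(item) for key, item in value.items()}
        return {key: item for key, item in cleaned.items() if item not in empties}
    if isinstance(value, list):
        cleaned = [drop_empty(item) for item in value]
        return [item for item in cleaned if item not in empties]
    # Scalars such as 0 or False are real data and are kept as-is.
    return value


extracted = {"name": "A", "phone": "", "tags": [], "addr": {"zip": None}, "qty": 0}
print(drop_empty(extracted))
```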
Another trick, if latency isn't an issue, is to sample multiple times and use an LLM as a judge to look for what is consistent and what just gets hallucinated occasionally.
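A cheaper variation on the same idea, sketched below, swaps the LLM judge for a plain per-field majority vote over the samples. That is a substitution on my part, not what the comment above describes, and the function name and the 0.6 agreement threshold are arbitrary choices for illustration.

```python
from collections import Counter


def vote_fields(samples, min_agreement=0.6):
    """Keep each field's most common value if enough samples agree; flag the rest.

    `samples` is a list of flat dicts (one per model run) with hashable values.
    """
    accepted, disputed = {}, []
    all_keys = {key for sample in samples for key in sample}
    for key in sorted(all_keys):
        values = [sample[key] for sample in samples if key in sample]
        value, count = Counter(values).most_common(1)[0]
        # Agreement is measured against all runs, so a field that only
        # shows up occasionally is treated as a likely hallucination.
        if count / len(samples) >= min_agreement:
            accepted[key] = value
        else:
            disputed.append(key)
    return accepted, disputed


runs = [
    {"total": "12.50", "store": "Walmart"},
    {"total": "12.50", "store": "Walmart", "tax": "0.99"},
    {"total": "12.80", "store": "Walmart"},
]
print(vote_fields(runs))
```

Fields that land in the disputed list are the ones worth sending to a judge model or a human.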
Already a fan of your videos and the way you explain things. Can you please tell me which LLM is good for PDF documents only? I want to run it locally. Unstructured didn't help. Even after converting PDF to image, Pixtral didn't work either. I want perfect accuracy.
Use a dedicated OCR model like Tesseract or Azure Document Intelligence if you want to increase accuracy. Vision models should not be used for OCR at this point in the technology, at least not where accuracy matters.
How many languages are supported?
What are the benefits of using a giant LLM for something as simple as OCR?
They can get better results than things like Tesseract. You don't have to use a huge model like the 90B; you can often get very good results with a much smaller model.
Does it help with extracting content from complex layouts, at a semantic level?
After downloading tons of agents, I found out the hard way that if you are using ChatGPT or Claude, agents are 100% useless and will give you worse results in real-life applications. It's too early to adopt them.
I think an agent should actually be an LLM in one very specific field. For example, an agent that only knows how to do math, or only codes in JS, beats the o1 model by a margin and doesn't know anything else.
OCR is not simple, and quality can be really bad. It also doesn't preserve original layout since it really just looks at characters in isolation.
You can get way above 90% accuracy from models with fewer than 25 million parameters. As for extracting from arbitrary layouts, that remains hard, hence my follow-up question.
Has anyone tried integrating it with n8n?
This seems objectively bad at the job.
The Walmart receipt just flat out ignored the whole central column of numbers.
Reordering sections of text...
Not seeing its usefulness at this level of error and garbling things.
What about a mix of Tesseract plus an LLM to correct it?
Yes, this is why I talked about the Regions of Interest concept, but I personally wouldn't use Tesseract for this. Also, fine-tuning the model for the kind of OCR that you want will help it get much better as well.
@samwitteveenai Is there a way to generate synthetic scans at scale based on a certain structure? I think you mentioned using a tool to create the scan.
Submitting your data to a 3rd party is not PRIVATE!
All the models that I showed here can be run locally. Most people won't have the GPUs to do it for the 90B, though.
Qwen VL is better than Llama 3.2 at OCR.
Besides the huggingface leaderboards, do you have a live production example proving this?
Try its Space on Hugging Face.
It's not better for Thai license plates; I tested it.
Using a vision model for OCR is way too prone to hallucinations for anything critical. There are dedicated OCR tools that provide way more accuracy. At this point in the technology, I’d only use vision models for describing images, and only if they were not critical.
I don’t plan to use it in production. Unfortunately, of all the free OCR libraries available for Python, none of them worked well enough for license plate reading, even with post-processing.
If the model can return the coordinates, that will be great, and there will be no point in using the OCR services from Microsoft and Google anymore.
👀👀
What a weird wrapper project. Just use Llama vision and say:
`Convert the provided image into Markdown format. Ensure that all content from the page is included, such as headers, footers, subtexts, images (with alt text if possible), tables, and any other elements.
Requirements:
- Output Only Markdown: Return solely the Markdown content without any additional explanations or comments.
- No Delimiters: Do not use code fences or delimiters like \`\`\`markdown.
- Complete Content: Do not omit any part of the page, including headers, footers, and subtext.
`;
'Cause literally that's all this project is doing.
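One practical note on the "No Delimiters" requirement in that prompt: models do not always obey it, so it can help to strip an outer code fence from the response yourself. Below is a minimal sketch under that assumption; the helper name is made up, and it only handles a single fence wrapping the whole response.

```python
import re

TICKS = "`" * 3  # built this way so the example itself contains no literal fence
_OUTER_FENCE = re.compile(
    r"^" + TICKS + r"[\w-]*[ \t]*\n(.*?)\n?" + TICKS + r"[ \t]*$",
    re.DOTALL,
)


def strip_outer_fence(text):
    """Remove one code fence wrapping the whole response, if present."""
    stripped = text.strip()
    match = _OUTER_FENCE.match(stripped)
    # Leave the text alone when it is not wrapped in a single outer fence.
    return match.group(1) if match else stripped


response = TICKS + "markdown\n# Title\n\nBody\n" + TICKS
print(strip_outer_fence(response))
```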
PS: you need 64 GB of RAM to run this version... not a very good script.
@orangehatmusic225 Use Google Colab.
Not sure of the usefulness of this. You can always use Lens, and it runs on a mobile phone ;)
There is Tika for that. Stop presenting AI as the answer to already solved problems.
You do realize Tika uses deep learning… which is what fundamentally makes LLMs.