LlamaOCR - Building your Own Private OCR System

Sam Witteveen

มุมมอง 40 463

เพิ่มลงใน
- เพลย์ลิสต์ของฉัน
- ดูภายหลัง
แชร์

แชร์

ฝัง

ขนาดวิดีโอ:

แสดงแผงควบคุมโปรแกรมเล่น

เล่นอัตโนมัติ

เล่นใหม่

เผยแพร่เมื่อ 29 ธ.ค. 2024

ความคิดเห็น • 57

@jameswagstaff1962 หลายเดือนก่อน ⁺²
I just tried this, it is very simple to use but it is basically just a wrapper for the together-ai package. All this is doing is restricting configurability! But thank you very much for the video and pointing me to this project. I was surprised at how accurate it is
@Charles-Darwin หลายเดือนก่อน ⁺¹
Vision models be mysterious wizardry. They make me the most excited out of all bc I firmly believe a future conscious 'model' could be iterated from vision models (not new, but not mentioned enough i think). If there were a way to keep the vision model exclusively in virtual space... a whole wealth of experimentation could open up with visualizing things, it might even turn hallucinations into useful features.
@WhyitHappens-911 หลายเดือนก่อน ⁺⁴
Nice! Any difference with docling or llamaparse solutions?
@bzmrgonz หลายเดือนก่อน ⁺¹
I'm gonna suggest this video to PAPERLESS-NGX, I think this needs to be a MUST feature on that project.
@gotonethatcansee 29 วันที่ผ่านมา
there used to be a chrome extension that made any img text editable , where is it
@victorkarlsson5183 หลายเดือนก่อน ⁺³
I'd be super interested in knowing the process of training on object detection / region of interest. Anyone have pointers where I can read up on this?
@KEKW-lc4xi หลายเดือนก่อน ⁺³
I've done it before using YOLOv7 (don't use v8 that requires you to use some cringe website)
And then for labeling images I used CVAT. CVAT will let you label and store your images and then save to yolo format and then it's a matter of piping it through to YOLOv7 framework for training.
@seadude หลายเดือนก่อน
Hm…I’d rather use Python to crop the image to a given region, then feed the entire cropped image to the vision model. Not sure why / if you can train a “general vision model” to only look at certain regions of an image…could be interesting but doesn’t that turn the model into a more traditional supervised model at that point?
@murattosundan หลายเดือนก่อน ⁺¹
Can it recognize license plates in non latin alphabets?
@bzmrgonz หลายเดือนก่อน
Question @Sam, so would the design of forms, documents etc to assist OCR help? For example delimiting label:data with a colon(:). Assuming colons have no reason to exist in text. In your opinon what works best? delimiters, color contrast?
@ifeanyinnaemego หลายเดือนก่อน ⁺¹
Can it capture handwritten text perfectly
@darkreader01 หลายเดือนก่อน ⁺³
does it work with handwritten text?
@gurupartapkhalsa6565 หลายเดือนก่อน
No, but you can train your own to work on your own handwriting specifically, without too much difficulty.
@seadude หลายเดือนก่อน
GPT-4o is surprisingly good at handwriting OCR, but as with all GenAI output, you must validate before using it for anything critical.
@Piotr_Sikora หลายเดือนก่อน ⁺⁵
Doing simple OCR via LLM is shut fly using bazooka.
@_PataNahi หลายเดือนก่อน ⁺²
I think they have the capability to understand the context of the information of the input. If there is any mistakes like simple letter mistakes, there maybe could be a feature to automatically correct those. There could also be a slider to adjust between more most original to most sensible. Without any of these, its just like any other model I guess.
@IoT_ หลายเดือนก่อน ⁺¹
Actually, it can be even worse than the specialized models like YOLO, Tesseract ,Paddle ,etc.
For instance if you have custom ASCII symbols no LLM can provide a good recognition pattern like fine-tuned OCR library can
@SDAravind หลายเดือนก่อน
Can we get Bounding boxes using this model?
@KleiAliaj หลายเดือนก่อน
Is it possible to do it in javascript ?
@beingalien6394 21 วันที่ผ่านมา
How can i convert op to required op as json
@TheRealChrisVeal หลายเดือนก่อน ⁺¹
exciting!
@itsbhardwaj1677 หลายเดือนก่อน
when you are integrating it with Agents ?
หลายเดือนก่อน
how to get rid of hallucination especially in this kind of project? i json a good ouptu format?
@ivan007230 หลายเดือนก่อน ⁺²
I would say that on its own json output alone won’t help. It is only helpful if you know the structure of the data that is to be extracted (say, every document has a title, table with certain columns, etc). Then specifying json schema (expected output format) should help
@coredog64 หลายเดือนก่อน ⁺³
A few things that have helped me: Use a temperature at/near zero. If you have the potential for empty data, prompt to leave it out rather than give empty values.
@sandorkonya หลายเดือนก่อน
@@coredog64 leaving out is def. a good strategy. it even saves tokens.
@samwitteveenai หลายเดือนก่อน ⁺⁴
Another trick if latency isn't an issue is the sample multiple times and use an LLM as a judge to look for what is consistent and what just gets hallucinated occasionally
@nirmesh44 หลายเดือนก่อน
already fan of your videos the way you explain. Can you Please tell only for pdf document which llm model is good? i want to use locally. unstructured didn't help. even after pdf to image pixtral also didnt work. i want perfect accuracy.
@seadude หลายเดือนก่อน
Use a dedicated OCR model like tesseract or Azure Document Intel if you want to increase accuracy. Vision models should not be used for OCR at this point in the technology, at least not where accuracy matters.
@minhsenma หลายเดือนก่อน
How many languages supposed?
@el_arte หลายเดือนก่อน
What are the benefits of using a giant LLM for something as simple as OCR?
@samwitteveenai หลายเดือนก่อน
They can get better results than things like Tesseract. You don't have to use a huge model like the 90b you can often get very good results as a much smaller model
@el_arte หลายเดือนก่อน
@ Does it help with extracting content from complex layouts? At a semantic level.
@hqcart1 หลายเดือนก่อน
after downloading tons of agents, i found out the hardwaym if you are using chatgpt or claud, agents are 100% useless and will give you worse results in real life applications, it's too early to adapt them.
i think agents should actually be an LLM but in a very specific field, for example, an agent just know how to do math, or codes just in js, beats o1 model by a margine, and doesn't know anything else.
@daarrrkko หลายเดือนก่อน ⁺¹
OCR is not simple, and quality can be really bad. It also doesn't preserve original layout since it really just looks at characters in isolation.
@el_arte หลายเดือนก่อน ⁺¹
@ You can get way above 90% accuracy from models with less than 25 million parameters. As for extracting from arbitrary layouts, that remains hard, hence my follow up question.
@staticalmo หลายเดือนก่อน
did someone try to integrate it in n8n?
@alogghe หลายเดือนก่อน
This seems objectively bad at the job.
The Walmart receipt just flat out ignored the whole central column of numbers.
Reordering sections of text...
Not seeing its usefulness at this level of error and garbling things.
What about a mixed tesseract + LLM to correct it?
@samwitteveenai หลายเดือนก่อน
yes this is why I talked about the Regions of Interests concept but I personally wouldn't use Tesseract for this. Also fine tuning the model for the kind of OCR that you want will halp it get much better as well.
@daarrrkko หลายเดือนก่อน
@@samwitteveenaiis there a way to generate synthetic scans at scale based on a certain structure? I think you mentioned using a tool to create the scan.
@OnePlusky หลายเดือนก่อน ⁺³
Submitting your data to 3rd party is not PRIVATE !
@samwitteveenai หลายเดือนก่อน ⁺²
All the models that I showed here can be run locally, most people wont have the GPUs to do it for the 90b though
@viky2002 หลายเดือนก่อน ⁺³
Qwen vl is better than llama 3.2 on ocr
@choiswimmer หลายเดือนก่อน ⁺¹
Besides the huggingface leaderboards, do you have a live production example proving this?
@zmeta8 หลายเดือนก่อน
try the space of it on hf
@murattosundan หลายเดือนก่อน
Its not better for thai license plates, i tested it.
@seadude หลายเดือนก่อน
Using a vision model for OCR is way too prone to hallucinations for anything critical. There are dedicated OCR tools that provide way more accuracy. At this point in the technology, I’d only use vision models for describing images, and only if they were not critical.
@murattosundan หลายเดือนก่อน ⁺¹
@ I don’t plan to use it in production. Unfortunately, of all the free ocrs available to python, none of them worked well enough for license plate reading even with post processing.
@wangbei9 หลายเดือนก่อน
If the model can return the coordinates, then it will be great and no point to use the OCR service from Microsoft and google anymore.
@ShresthShukla-h9n หลายเดือนก่อน
👀👀
@orangehatmusic225 หลายเดือนก่อน
What a weird wrapper project. Just use llama vision and say :
`Convert the provided image into Markdown format. Ensure that all content from the page is included, such as headers, footers, subtexts, images (with alt text if possible), tables, and any other elements.
Requirements:
- Output Only Markdown: Return solely the Markdown content without any additional explanations or comments.
- No Delimiters: Do not use code fences or delimiters like \`\`\`markdown.
- Complete Content: Do not omit any part of the page, including headers, footers, and subtext.
`;
cause literally that's all this project is doing.
@orangehatmusic225 หลายเดือนก่อน ⁺¹
PS you need 64gb ram to run this version... not a very good script.
@suryakantbrewr หลายเดือนก่อน
@@orangehatmusic225use google colab
@nikosterizakis หลายเดือนก่อน
Not sure of the usefulness of this. You can always use Lens and runs on a mobile phone ;)
@greendsnow หลายเดือนก่อน
There is Tika for that. Stop showing AI as the address to solved problems
@erniea5843 หลายเดือนก่อน ⁺³
You do realize Tika uses deep learning… which is what fundamentally makes LLMs.

ต่อไป

เล่นอัตโนมัติ