Best Way to OCR a PDF in Python - spaCy Layout
- Published on Feb 9, 2025
- In this video, I'm going to show you the best way to OCR a PDF in Python with the new spaCy Layout package. The best part about this package is that it gives you access to all the important metadata generated from a spaCy pipeline alongside layout detection and OCR. This means you will have bounding boxes for the labeled regions of text on a given image. You can also do table detection.
spaCy Layout: github.com/exp...
GitHub Repo: github.com/wjb...
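For readers who want to see what the description means in code, here is a minimal sketch. It assumes the `spacy` and `spacy-layout` packages are installed and that the layout regions are exposed under `doc.spans["layout"]`, with each span carrying a label and a box with `x`, `y`, `width`, `height`, and `page_no`, as the spaCy Layout README describes. The helper names `load_pdf` and `describe_regions` are made up for illustration and are not part of the package.

```python
def load_pdf(pdf_path: str):
    """Run spaCy Layout on a PDF and return a spaCy Doc (sketch)."""
    # Imported here so the rest of the module works without these packages.
    import spacy
    from spacy_layout import spaCyLayout

    nlp = spacy.blank("en")       # a blank English pipeline is enough for layout
    layout = spaCyLayout(nlp)     # wraps the pipeline with PDF layout processing
    return layout(pdf_path)       # returns a Doc with layout spans attached


def describe_regions(doc) -> list[dict]:
    """Collect label, text, and bounding box for every layout region."""
    regions = []
    for span in doc.spans["layout"]:
        box = span._.layout       # assumed attributes: x, y, width, height, page_no
        regions.append(
            {
                "label": span.label_,
                "text": span.text,
                "page": box.page_no,
                "bbox": (box.x, box.y, box.width, box.height),
            }
        )
    return regions
```

Something like `describe_regions(load_pdf("my_scan.pdf"))` (the file name is a placeholder) would then give one dictionary per labeled region. According to the README, detected tables are also available through `doc._.tables`, though that is not shown here.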
If there's a specific video you would like to see or a tutorial series, let me know in the comments and I will try and make it.
If you liked this video, check out www.PythonHumanities.com, where I have Coding Exercises, Lessons, on-site Python shells where you can experiment with code, and a text version of the material discussed here.
You can follow me at:
/ wjb_mattingly
Man, this is an amazing video. So helpful, a big THANK YOU. A Table video would also be fantastic, and thanks in advance for that!😉
Interesting. Waiting for the video with tables!)
Thanks! I'll work on that table video in the near future. As for the math formulae, I don't work with those often, but I have seen some promising models, specifically fine-tunes of tf-id
Thanks; I have been looking for such a tool.
Glad I could help!
Very interesting. Thank you.
Glad you liked it!
Very interesting, thank you! How would you go about improving the accuracy and training your models to work on specific types of documents? What are the main steps using these new capabilities?
I'm working on a small academic helper chatbot. Can I use this to prepare my documents which are just scans of textbooks? I'll be using the output in the RAG workflow.
Would this be able to support extracting mathematical formulae?
Good question! Formula is one of the labels. There are a lot of quality models that can convert formulae to LaTeX, so even if the OCR is bad, you could use the bboxes and feed that image to a better-quality model for formulae.
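The bounding-box idea in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions: the box is taken to be in PDF points (72 per inch) with a top-left origin, and the page is assumed to have been rendered to an image at a known DPI. `Region` and `to_pixel_box` are hypothetical names, not part of spaCy Layout.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # PDF user-space units per inch


@dataclass
class Region:
    """Hypothetical stand-in for a layout box (assumed PDF points, top-left origin)."""
    x: float
    y: float
    width: float
    height: float


def to_pixel_box(region: Region, dpi: int = 150, pad: int = 0) -> tuple[int, int, int, int]:
    """Convert a layout box to a (left, top, right, bottom) pixel box.

    `dpi` must match the resolution the page image was rendered at.
    `pad` adds a margin in pixels so symbols at the edges are not clipped.
    """
    scale = dpi / POINTS_PER_INCH
    left = max(0, round(region.x * scale) - pad)
    top = max(0, round(region.y * scale) - pad)
    right = round((region.x + region.width) * scale) + pad
    bottom = round((region.y + region.height) * scale) + pad
    return (left, top, right, bottom)
```

The resulting tuple has the same shape that Pillow's `Image.crop` expects, so a page image rendered at the same DPI could be cropped with it and the crop handed to whichever formula-to-LaTeX model you prefer.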
I am struggling to extract tilted and vertical text from PDF documents and embed it back into the PDF so that it is searchable. Do you have a solution for that? The OCRmyPDF library doesn't help. Would spaCy and CV help with this?
Can you make a table video?
I definitely will!
Use Gemini OCR with a good prompt.
Thanks for the comment! That's a good suggestion for some use cases, but not all. If bounding boxes and labels are important, then this is better, assuming you have standard typed text. Also, this approach is faster and local. It also handles aligning the output as a spaCy Doc, which gives you linguistic analysis too.