Text recognition (OCR) with Tesseract and Python

  • Published on Jul 4, 2024
  • In this tutorial we’re going to see how to use Tesseract to recognize text from an image.
    Tesseract is one of the most popular OCR (optical character recognition) engines. It is open source, and Google has sponsored its development since 2006.
    In this tutorial we will see:
    1. How to install Tesseract (on Windows, Mac or Linux)
    2. How to read text from an image
    3. How to tune Tesseract to improve text recognition
    Instructions and source code: pysource.com/2020/04/23/text-...
    ➤ Full Videocourses:
    Object Detection: pysource.com/object-detection...
    ➤ Follow me on:
    Instagram: / pysource7
    LinkedIn: / pysource
    ➤ For business inquiries:
    pysource.com/contact
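
The three steps listed in the description can be sketched roughly as follows. This is a minimal sketch, not the video's source code: it assumes the `opencv-python` and `pytesseract` packages are installed and the Tesseract binary is on the PATH, and the function names `build_config` and `read_text` as well as the threshold values are illustrative assumptions.

```python
def build_config(psm=6, oem=3):
    """Build a Tesseract config string (engine mode + page segmentation mode)."""
    return f"--oem {oem} --psm {psm}"


def read_text(image_path, lang="eng", psm=6):
    """OCR an image after grayscale conversion and adaptive thresholding."""
    # Imported lazily so build_config() works without OpenCV installed.
    import cv2
    import pytesseract

    img = cv2.imread(image_path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")

    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Block size 85 and offset 11 are illustrative values, not tuned ones.
    thresh = cv2.adaptiveThreshold(
        gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 85, 11
    )
    return pytesseract.image_to_string(thresh, lang=lang, config=build_config(psm))
```

On Windows, `pytesseract.pytesseract.tesseract_cmd` may also need to point at the installed `tesseract.exe` if it is not on the PATH.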

Comments • 79

  • @michaelchung9119
    @michaelchung9119 3 years ago +2

    This is a really good tutorial. Thank you

  • @Benm2793
    @Benm2793 3 years ago

    Adaptive threshold is a great tip. Thanks!

  • @andreadotta6653
    @andreadotta6653 4 years ago

    Hi Sergio,
    Thanks for the video! I have a question. Do you have any ideas for converting raster images into vector images (e.g. JPG to SVG)?

  • @elieferland8561
    @elieferland8561 3 years ago +4

    Thank you for all your good tutorials! Could you make a video on natural scene text detection using OpenCV and EAST one day?

  • @ahmedhelal920
    @ahmedhelal920 2 years ago

    Very good introduction to OCR. Thanks 😊

  • @RapidView
    @RapidView 4 years ago +2

    Thanks a lot. What would the processing be when you get dynamic images?

  • @marwen2594
    @marwen2594 4 years ago +5

    Thanks for the tutorial.
    I have a question, please: can Tesseract OCR detect handwriting?
    If so, can you make another tutorial about that?

  • @wimr.9672
    @wimr.9672 2 years ago

    Thanks for the tutorial! It helped me a lot.

  • @ForMyOwn_1
    @ForMyOwn_1 8 months ago

    It was a very informative and helpful lesson. Thanks

  • @len5204
    @len5204 2 years ago

    Hi, thanks for the tutorial. I was wondering how it would work with a photo upload feature, so we don't have to change the image filename in the code every time. Is that possible?
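
One simple way to avoid editing the filename in the script is to pass it on the command line. The sketch below is an assumption about how that could look, not something shown in the video; `parse_args` and `main` are illustrative names, and a real upload feature would sit in a web or GUI layer on top of the same call.

```python
import argparse


def parse_args(argv=None):
    """Read the image path (and optional language) from the command line."""
    parser = argparse.ArgumentParser(description="Run OCR on an image file.")
    parser.add_argument("image", help="path to the image to read")
    parser.add_argument("--lang", default="eng", help="Tesseract language code")
    return parser.parse_args(argv)


def main(argv=None):
    # Imported lazily so parse_args() works without OpenCV installed.
    import cv2
    import pytesseract

    args = parse_args(argv)
    img = cv2.imread(args.image)
    if img is None:
        raise SystemExit(f"Could not read image: {args.image}")
    print(pytesseract.image_to_string(img, lang=args.lang))
```

Calling `main()` at the bottom of a script (here hypothetically saved as `ocr_cli.py`) lets it be run as `python ocr_cli.py receipt.png --lang eng`.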

  • @ryanmoye9189
    @ryanmoye9189 3 years ago +4

    One issue with the Chinese text is that the first two characters are traditional Chinese and the second two are simplified. So the first time you ran it and it gave you 汉语,汉语, that was correct: it converted the traditional characters into simplified because you used chi_sim.

  • @ok-kp5jn
    @ok-kp5jn 4 years ago

    Great content!!

  • @heinhane
    @heinhane 4 years ago

    This is very helpful.
    Thanks

  • @tushargawade2045
    @tushargawade2045 4 years ago

    Really great and helpful!
    If we train our own model using a CNN, will it increase accuracy?

  • @r-beanmondy6203
    @r-beanmondy6203 3 years ago

    If I want to use it on live video, which part of my code should I change?
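
For live video, the main change is usually to replace the single `cv2.imread` call with a capture loop. The following is a rough sketch under that assumption: `should_ocr` and `ocr_live` are illustrative names, and running OCR only on every Nth frame is a design choice made here because Tesseract is typically too slow to run on every frame.

```python
def should_ocr(frame_index, every_n=15):
    """Run OCR only on every Nth frame to keep the loop responsive."""
    return frame_index % every_n == 0


def ocr_live(camera_index=0, every_n=15):
    # Imported lazily so should_ocr() works without OpenCV installed.
    import cv2
    import pytesseract

    cap = cv2.VideoCapture(camera_index)
    frame_index = 0
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break  # camera unavailable or stream ended
            if should_ocr(frame_index, every_n):
                gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
                print(pytesseract.image_to_string(gray).strip())
            cv2.imshow("frame", frame)
            if cv2.waitKey(1) & 0xFF == 27:  # Esc key stops the loop
                break
            frame_index += 1
    finally:
        cap.release()
        cv2.destroyAllWindows()
```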

  • @firegames2741
    @firegames2741 3 years ago

    Thanks for such a useful video.
    I need your help: can you convert a CAPTCHA image to text? I'm trying, but it is not converting properly.

  • @kascesar
    @kascesar 4 years ago

    Hello, I'm looking for an R-CNN for this task. Do you know a good one?

  • @eksimedya4659
    @eksimedya4659 4 years ago

    Thank you very much, my brother

  • @thesupremeeagle2146
    @thesupremeeagle2146 3 years ago

    Do you have another way to use Tesseract? I want to turn my program into an executable for people to download, but to make it work they need the tesseract-ocr files. So I put them into the download, but the Tesseract files are too heavy! I don't want my program to weigh 1 GB because of Tesseract! Please help :c

  • @dukeofminecraft
    @dukeofminecraft 3 years ago

    Can you train pytesseract on handwriting data and return the string data?

  • @bibhutirajansingh
    @bibhutirajansingh 3 years ago +2

    Can a multi-page PDF be OCRed this way?
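
A multi-page PDF can generally be handled by rendering each page to an image and running OCR page by page. The sketch below assumes the third-party `pdf2image` package (which itself needs Poppler installed) alongside `pytesseract`; neither the package choice nor the names `join_pages` and `ocr_pdf` come from the video.

```python
def join_pages(page_texts):
    """Join per-page OCR results, labelling each page."""
    return "\n".join(
        f"--- page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    )


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    # pdf2image is a third-party package and requires Poppler.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL images
    return join_pages(
        pytesseract.image_to_string(page, lang=lang) for page in pages
    )
```

For a digital PDF that already has a text layer, extracting that text directly is likely to be faster and more accurate than OCR.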

  • @maloukemallouke9735
    @maloukemallouke9735 4 years ago +1

    Thank you for sharing,
    I am wondering whether OCR is a heavy process for finding digits in large images?

    • @AI_CANISTER
      @AI_CANISTER 3 years ago

      It's simple. To get only the digits in a large image:
      config = r'--oem 3 --psm 6 outputbase digits'
      digits = pytesseract.image_to_string(img, config=config)
      When you print digits, you get only the digits.

    • @maloukemallouke9735
      @maloukemallouke9735 3 years ago

      @AI_CANISTER Thanks, but I already tested that and it does not work.
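
An alternative that may be worth trying for digits is a character whitelist combined with a post-filter on the output. This is only a sketch under assumptions: `tessedit_char_whitelist` is a real Tesseract variable, but whether the engine honours it reportedly depends on the Tesseract version and engine mode, which is why the regular-expression filter is applied as well. The function names are illustrative.

```python
import re


def digits_config(psm=6):
    """Config string asking Tesseract to restrict output to digits."""
    return f"--psm {psm} -c tessedit_char_whitelist=0123456789"


def keep_digits(text):
    """Return the runs of digits found in OCR output."""
    return re.findall(r"\d+", text)


def read_digits(img, psm=6):
    # Imported lazily so the helpers above work without pytesseract.
    import pytesseract

    raw = pytesseract.image_to_string(img, config=digits_config(psm))
    # Filter again in Python in case the whitelist was ignored.
    return keep_digits(raw)
```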

  • @LuisGarcia-tb9po
    @LuisGarcia-tb9po 3 years ago

    Tesseract has been adding arrows to each cell in my Excel spreadsheets. Does anyone know why that might be? It recognizes every word and number correctly but adds some kind of "illegal" character code that Excel displays as arrows, then boxes with a question mark inside.

  • @chakrabmonoj
    @chakrabmonoj 3 years ago

    Hi, thanks for the excellent tutorial. This really helps a lot. When I run your code on an image that I extracted from my PDF document, instead of printing the text in the image file, it prints the following:
    code:
    image = cv2.imread("page-12.png")
    cv2.waitKey(0)
    Output:
    -1
    code:
    text = pytesseract.image_to_string(image)
    print(text)
    output:
    DF Metadata Extraction with Python

    file = PyPDr2
    xnp = pdf_file.getimphetadatal
    dict = 0
    4 in mp methods
    xnp_dict (i)
    xnp_dict (i)
    pp.pprint xmp dict)

    anytestringobject if pyPDr2 is unable to decode the string (Fenniak, 2016b). Putting
    these two methods together yields a custom function that can be used to extract document
    information dictionary metadata from PDF files (sce

    igure 9). The resulting

    DocumentInformation object which generated by the custom get_doc_info()
    function contains a dictionary with five key:value pairs (see Figure 10). This extracted data
    ‘matches the raw metadata located in the Document Information Dictionary abject located

    at 208 0 obj inthe file (see Figure 3).
    (© 2019 The SANS institute Author retains ful ight
    Am I getting something wrong? Unfortunately I am unable to show you how the input looks.
    Any help much appreciated.

    • @chakrabmonoj
      @chakrabmonoj 3 years ago

      Hi, I ran your code again and it is working, so you may ignore the above comment. The problem is that my file has a dark background, and the values you showed for the adaptive threshold do not seem to convert the background to white, so print(text) still does not show the text embedded in the picture.
      My picture file is a PNG. Will that require a different adjustment to your code and the threshold values?
      Thanks, really appreciate it.
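
For light text on a dark background, one common approach is to invert the image before thresholding, since Tesseract tends to do better with dark text on a light background. The sketch below is an assumption about how that could be done; the cutoff of 127 and the names `should_invert` and `prepare_for_ocr` are illustrative, and Otsu thresholding is swapped in here instead of the adaptive threshold used in the video.

```python
def should_invert(mean_intensity, cutoff=127):
    """Guess that the background is dark when the average pixel is dark."""
    return mean_intensity < cutoff


def prepare_for_ocr(image_path):
    # Imported lazily so should_invert() works without OpenCV installed.
    import cv2

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(f"Could not read image: {image_path}")
    if should_invert(float(gray.mean())):
        gray = cv2.bitwise_not(gray)  # dark background becomes light
    # Otsu picks a global threshold automatically from the histogram.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

OpenCV loads PNG and JPEG files into the same array layout, so the file type alone should not require different threshold values.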

  • @ashishzarekar9599
    @ashishzarekar9599 3 years ago

    Could you please help with how to implement this for scanned and digital PDFs?

  • @juanpajaro4084
    @juanpajaro4084 4 years ago

    Thanks, friend, you cleared up a lot of my doubts.

  • @luismata4086
    @luismata4086 3 years ago

    How can I make the algorithm recognize LaTeX? ---> pytesseract.image_to_string(img, lang = '?'). What do I have to use for the "lang" parameter?

  • @vinsmokearifka
    @vinsmokearifka 3 years ago

    Thank you. How do I get only the title?

  • @revudevendraswamy6632
    @revudevendraswamy6632 4 years ago

    Does it work for handwritten data?

  • @dimitheodoro
    @dimitheodoro 3 years ago

    Thanks a lot!!!

  • @NicolaMastrandrea
    @NicolaMastrandrea 4 years ago

    Thank you 😊

  • @davidralte4572
    @davidralte4572 2 years ago

    Thank you for your help. May God bless you.

    • @sauravsinha8746
      @sauravsinha8746 1 year ago

      Can you tell us how we can get this data in CSV format?

  • @marienoellevandervlugt9183
    @marienoellevandervlugt9183 4 years ago

    I love your tutorial. I'm French and English is difficult for me, but I want to learn Python.

  • @nikolaydd6219
    @nikolaydd6219 4 years ago

    Thanks

  • @gowthamns8228
    @gowthamns8228 4 years ago

    Wow, very good, but the problem is that the output is correct only if the text is very clear and crisp. I want to know what to do if the image contains more than just text, for example bills, a photo of a calendar, or any other kind of image. How do I print the string from that? I tried it myself and it does not print anything. Any ideas?

    • @AI_CANISTER
      @AI_CANISTER 3 years ago

      It can detect digits, and since a date is mostly digits I think it will work well.
      config = r'--oem 3 --psm 6 outputbase digits'
      digits = pytesseract.image_to_string(img, config=config)
      When you print digits, you get only the digits.

  • @jay4866
    @jay4866 3 months ago

    Hi, can you do the same thing using a Raspberry Pi or Arduino?

  • @sauravsinha8746
    @sauravsinha8746 1 year ago

    How can we save the output in CSV format, please?
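
One possible way to get the OCR result into a CSV file is to use `pytesseract.image_to_data`, which returns one entry per detected word with its position and confidence, and write those rows with the standard `csv` module. The sketch below assumes that word-level rows are what is wanted; the column choice and the names `word_rows` and `save_words_csv` are illustrative.

```python
import csv


def word_rows(data, min_conf=0):
    """Yield (text, conf, left, top, width, height) for recognized words."""
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])  # non-word entries carry conf -1
        if text.strip() and conf >= min_conf:
            yield (text, conf, data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])


def save_words_csv(img, csv_path):
    # Imported lazily so word_rows() works without pytesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(img, output_type=Output.DICT)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text", "conf", "left", "top", "width", "height"])
        writer.writerows(word_rows(data))
```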

  • @sauravsinha8746
    @sauravsinha8746 1 year ago

    Can we save to CSV format?

  • @aichamahfoudh2451
    @aichamahfoudh2451 2 years ago

    How can we use this on google colab?

  • @sarthakgarg6531
    @sarthakgarg6531 3 years ago

    How can we read different fonts? For example, if the image has an italic font, how can we do that?

    • @AI_CANISTER
      @AI_CANISTER 3 years ago

      I've tried the same method and it worked, I'm sure it will work for you too

  • @hemantchauhan6437
    @hemantchauhan6437 3 months ago

    NEED HELP! I am making a website where a user can upload a PDF, but I want the upload to succeed only if the PDF contains images of only HANDWRITTEN text. Thank you for reading.

  • @senpaikun5947
    @senpaikun5947 3 years ago

    Hey... I'm not able to print the Chinese letters in my output... can someone help me out, please?

  • @sauravsinha8746
    @sauravsinha8746 1 year ago

    How can we get this data into CSV format?

  • @NoamHarel-Google-Is-The-Best
    @NoamHarel-Google-Is-The-Best 3 years ago

    I need some help. It writes this line:
    You need configured Python 2 SDK to render Epydoc docstrings
    Thanks a lot.

  • @towhidurrahman8202
    @towhidurrahman8202 4 years ago

    Is number plate recognition possible with this code, with Bengali as the language?

    • @AI_CANISTER
      @AI_CANISTER 3 years ago

      It can't recognize a number plate by itself, but it can extract the digits and letters after the number plate has been detected with a different method.

  • @lakshmitejaswi7832
    @lakshmitejaswi7832 3 years ago

    Make a video on how to build a custom OCR

  • @opendllmaster5125
    @opendllmaster5125 2 years ago

    Isn't it possible to train Tesseract to improve the reading?

  • @user-zo2gm1bh1f
    @user-zo2gm1bh1f 3 years ago

    Please, sir, what is the lang value for Arabic? text = pytesseract.image_to_string(adaptive_threshold, config=config, lang="arbic")
    "arb", "ar", or what?
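
Tesseract language codes are mostly three-letter codes, and the Arabic data file is named `ara.traineddata`, so `lang="ara"` is the value to try. To check which codes are actually installed, recent pytesseract releases expose `pytesseract.get_languages`; the sketch below assumes such a release, and `first_installed` is an illustrative helper name.

```python
def first_installed(candidates, installed):
    """Return the first candidate language code that is installed, or None."""
    return next((code for code in candidates if code in installed), None)


def installed_languages():
    # Imported lazily; assumes a pytesseract release that has get_languages().
    import pytesseract

    return pytesseract.get_languages(config="")
```

For example, `first_installed(["arb", "ar", "ara"], installed_languages())` would be expected to return `"ara"` once the Arabic language data is installed.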

  • @riyajagtap5006
    @riyajagtap5006 1 year ago

    Can it work for 7-segment LED displays?

  • @abdelrhmanshokr7546
    @abdelrhmanshokr7546 3 years ago

    This helped a lot, but there is still a date value that it doesn't seem to get. I don't know why, to be honest.

  • @rajeevkalaskar6373
    @rajeevkalaskar6373 3 years ago

    I want to make a desktop application for this. How can I do it? Need help 🆘

    • @pysource-com
      @pysource-com 3 years ago

      For commercial projects/consulting services you can contact me here: pysource.com/services

  • @gawaderajesh
    @gawaderajesh 3 years ago

    I am getting the error below. Please help.
    raise TesseractError(proc.returncode, get_errors(error_string))
    pytesseract.pytesseract.TesseractError: (1, 'Error opening data file C:\\Program Files\\Tesseract-OCR/tessdata/chi_sim.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'chi_sim\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
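
The error message says that `chi_sim.traineddata` could not be opened in the `tessdata` folder, which usually means the language data for that code was not installed. A small check like the one below can confirm which files are missing before calling Tesseract. It is a sketch only: `missing_languages` is an illustrative name, and the folder path has to be adapted to the local installation.

```python
from pathlib import Path


def missing_languages(tessdata_dir, langs):
    """Return the codes from an 'a+b' language string that lack a data file."""
    folder = Path(tessdata_dir)
    return [
        code
        for code in langs.split("+")
        if not (folder / f"{code}.traineddata").is_file()
    ]


# Example call (path taken from the error message above):
# missing_languages(r"C:\Program Files\Tesseract-OCR\tessdata", "chi_sim")
```

If the file is reported missing, placing the matching `.traineddata` file from the official Tesseract language data into that folder is the usual remedy.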

  • @lukmanchaiyarab1451
    @lukmanchaiyarab1451 3 years ago

    Can you please show digit recognition? Thanks in advance.

    • @AI_CANISTER
      @AI_CANISTER 3 years ago +1

      config = r'--oem 3 --psm 6 outputbase digits'
      digits = pytesseract.image_to_string(img, config=config)
      When you print digits, you get only the digits.

  • @AKKJ420
    @AKKJ420 2 years ago

    Do an ANPR mate

  • @ankursoni8060
    @ankursoni8060 3 years ago

    How to detect a text from a particular co-ordinate of an image?

    • @pysource-com
      @pysource-com 3 years ago

      First you need to cut out that region. Check my YouTube videos on "Crop images" and you'll know how to do that.
      Once you have cut out the portion, you can pass it to the OCR.
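
The crop-then-OCR idea from the reply can be sketched as below. It assumes the image is a NumPy array as returned by `cv2.imread` and that the region is given as a top-left corner plus width and height; `clamp_box` and `read_region` are illustrative names, not functions from the video.

```python
def clamp_box(x, y, w, h, img_w, img_h):
    """Clip a box (x, y, w, h) to the image bounds; returns (x0, y0, x1, y1)."""
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(img_w, x + w), min(img_h, y + h)
    return x0, y0, x1, y1


def read_region(img, x, y, w, h):
    # Imported lazily so clamp_box() works without pytesseract installed.
    import pytesseract

    img_h, img_w = img.shape[:2]
    x0, y0, x1, y1 = clamp_box(x, y, w, h, img_w, img_h)
    roi = img[y0:y1, x0:x1]  # NumPy slicing: rows (y) first, then columns (x)
    return pytesseract.image_to_string(roi)
```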

  • @AneleMbabela
    @AneleMbabela 4 years ago

    Instructions and source code link is broken.

    • @pysource-com
      @pysource-com 4 years ago +2

      I've just fixed it, thanks for pointing that out

    • @AneleMbabela
      @AneleMbabela 4 years ago +1

      @pysource-com Thanks for the work you've been putting out. It's really making a difference. God bless you, brother.

  • @nuwanthajayasinghe115
    @nuwanthajayasinghe115 3 years ago

    I installed Tesseract and tried to work in VS Code. But when programming Python in VS Code, tesseract could not be imported. Can you tell me how to import tesseract in VS Code?

    • @jeffu73
      @jeffu73 3 years ago

      What is he using, VS Code or something else?

  • @Terminator-lx5jx
    @Terminator-lx5jx 3 years ago

    You didn't solve the text recognition after preprocessing, though.

  • @jeffu73
    @jeffu73 3 years ago

    For me, the image is not getting recognized.

    • @raquelcosta2730
      @raquelcosta2730 2 years ago

      Same for me :( I have tried different clear images and it's not working. Any tips?

  • @rohitnara6738
    @rohitnara6738 4 years ago

    cv2.imread("img.gif") is not working. How can we read text from a .gif file? Please tell.
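
`cv2.imread` has historically not decoded GIF files and returns `None` for them, so one workaround is to open the GIF with Pillow and hand the first frame to pytesseract, which accepts PIL images as well as NumPy arrays. The sketch below assumes Pillow is installed; `is_gif` and `load_for_ocr` are illustrative names.

```python
from pathlib import Path


def is_gif(path):
    """True when the file extension indicates a GIF."""
    return Path(path).suffix.lower() == ".gif"


def load_for_ocr(path):
    """Return an image object that pytesseract.image_to_string accepts."""
    if is_gif(path):
        from PIL import Image  # Pillow, imported lazily

        with Image.open(path) as gif:
            return gif.convert("RGB")  # first frame only, palette to RGB
    import cv2

    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(f"Could not read image: {path}")
    return img
```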

  • @kisamesafe
    @kisamesafe 3 years ago

    it could be shorter

  • @jasonlo3429
    @jasonlo3429 3 years ago

    The Chinese words are traditional and not simplified. Change the language to chi_tra and it should work better
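
Switching the language is a one-argument change, and Tesseract also accepts several language codes joined with `+`, which may help when an image mixes traditional and simplified characters. The sketch below assumes both `chi_tra` and `chi_sim` language data are installed; `join_langs` and `read_chinese` are illustrative names.

```python
def join_langs(*codes):
    """Combine language codes into Tesseract's 'a+b' form."""
    return "+".join(codes)


def read_chinese(img):
    # Imported lazily so join_langs() works without pytesseract installed.
    import pytesseract

    # Traditional first, then simplified; the order is an arbitrary choice here.
    return pytesseract.image_to_string(img, lang=join_langs("chi_tra", "chi_sim"))
```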