Training Tesseract 5 for a New Font

  • Published on Sep 25, 2022
  • Build Tesseract from source video:
    • Building Tesseract 5 f...
    GitHub repository link:
    github.com/astutejoe/tesserac...
    Training command:
    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000
    Correction: I believe the box file contains the bounding box (OBB) coordinates of the character within the image
  • Science & Technology
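
To make the correction about box files concrete: each line of a Tesseract box file holds a symbol followed by left, bottom, right and top pixel coordinates and a page number, with the origin at the bottom-left corner of the image. The parser below is only a minimal sketch of that layout; `BoxEntry` and `parse_box_line` are hypothetical names chosen for illustration and are not part of Tesseract or the linked repository.

```python
from typing import NamedTuple


class BoxEntry(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> BoxEntry:
    """Parse '<symbol> <left> <bottom> <right> <top> <page>' into a BoxEntry."""
    # Split from the right so that a symbol which is itself a space survives.
    symbol, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return BoxEntry(symbol, left, bottom, right, top, page)


if __name__ == "__main__":
    print(parse_box_line("A 12 8 40 52 0"))
```

A malformed line (fewer than six fields) raises `ValueError`, which can help locate the file behind a "Failed to read boxes" message.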

Comments • 164

  • @taylorbarnes6151
    @taylorbarnes6151 1 year ago +13

    God I love you. I just recently started messing with OCR's, specifically Tesseract, and I was reading through some documentation on the steps and after a few hours just wanted to end my life hahahaha. Thank you for this, this is extremely encouraging. I can't wait to try this!

  • @buny0n
    @buny0n 4 months ago +6

    Tesseract's documentation is abysmal.

    • @nikolaikrot8516
      @nikolaikrot8516 3 months ago

      I tend to think of the Tesseract documentation as the Augean Stables.

  • @45545videos
    @45545videos 1 year ago +2

    Haven't watched the video yet, but if this works, you'll have my eternal gratitude

  • @wojd_
    @wojd_ 11 months ago

    Great tutorial. Using WSL I was constantly getting new errors. Switching to an OS installed in VirtualBox solved it. I was able to train on my dataset; it's surprisingly easy.

    • @heetshah9394
      @heetshah9394 8 months ago

      Could you help me with the directory structure? I am a bit confused about how it is laid out.

  • @fivalt126
    @fivalt126 2 months ago

    I was racking my brain trying to understand the official tutorial, and you explain it in a simple way. I'm your subscriber number 666. Thank you very much.

  • @yichenyao5927
    @yichenyao5927 3 months ago +2

    I think the word error rate is high because the font doesn't distinguish uppercase from lowercase (it's all uppercase), while the ground truth labels do distinguish between the two.
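
One way to test this hypothesis is to score the same OCR output twice, once case-sensitively and once with case folded. The sketch below uses a plain word-level Levenshtein distance; it illustrates the idea only and is not how Tesseract computes its own BWER figure. The function names are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str, ignore_case: bool = False) -> float:
    """Word errors divided by the number of reference words."""
    if ignore_case:
        reference, hypothesis = reference.casefold(), hypothesis.casefold()
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    truth = "Apex Legends rocks"
    ocr = "APEX LEGENDS ROCKS"
    print(word_error_rate(truth, ocr))                    # case-sensitive
    print(word_error_rate(truth, ocr, ignore_case=True))  # case-insensitive
```

If the case-insensitive score is much lower, upper-casing the ground truth text before training would be one thing to try.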

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    If I have line-wise handwritten images for a language, with bounding boxes and the corresponding words, can I train this LSTM network on them? Will it work? Also, could you share your thoughts on the backbone of the LSTM architecture, ideally with a flow diagram, and on how fonts help with the training data?

  • @3ombieautopilot
    @3ombieautopilot 1 year ago +1

    Thank you for making this video. But I can't wrap my head around where to put all those data files. I'm trying to fine-tune variations of letters with accents, and I'm helpless.

  • @user-wi7pn5mw1c
    @user-wi7pn5mw1c 4 months ago

    Thank you for doing this tutorial. Can I use the text2image approach to generate box files and tif files to train a new font for Tesseract 4.0?

  • @akshatjain2925
    @akshatjain2925 5 months ago +1

    Hi, when you say we are using text2image and nothing AI, text2image must also be some kind of model, right?

  • @wonkduck4759
    @wonkduck4759 8 months ago

    Hi Gabriel! Thank you so much for the video. One question: where did you put your Apex Legends ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.

    • @rcraftg4mer42
      @rcraftg4mer42 6 months ago

      Did you find any answers?

  • @gyeongwango5434
    @gyeongwango5434 7 months ago

    I want to train tesseract with an image file I have (consisting of several lines of text), but I'm not sure how to go about it, starting with creating the train data. I'd really appreciate your tips (URLs for reference, etc).

  • @madhavpandey30
    @madhavpandey30 1 year ago +1

    Hey Gabriel, I am following your steps to train my model on handwritten text, but it always fails with this error:
    unicharset_extractor --output_unicharset "data/Apex/my.unicharset" --norm_mode 2 "data/Apex/all-gt"
    Failed to read data from: data/Apex/all-gt
    Wrote unicharset file data/Apex/my.unicharset
    Can you please help me here? I am stuck. Thanks!

  • @ConfusedProgrammer
    @ConfusedProgrammer 4 months ago +2

    I've been experimenting with this tutorial for three days. The file structure in the video and the GitHub repo don't quite match; can you please update the repo if possible? I am running into too many folder inconsistencies when trying to connect the dots here, as this part was brushed over really quickly. Thank you :)

  • @snoopi6243
    @snoopi6243 1 year ago

    Is there any way to fine-tune RTL languages/fonts on Windows just like this?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    The title says "new font"; can I take it to apply to a new language as well, using TIFF?

  • @umandadikwatta178
    @umandadikwatta178 1 year ago +1

    Thank you very much for this. One question: can we train Tesseract with non-Unicode fonts using the same process?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      I'm pretty sure you can, as long as text2image works correctly. If text2image doesn't work correctly, you can either come up with other clever ways (like Python scripts) of automatically generating the ground truth data (.gt.txt, .box and .tif files) or, worst case, create them manually.
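
As a rough idea of what such a Python script could look like, the sketch below writes one `.gt.txt` file per line of training text and builds a `text2image` call that renders each line into a `.tif`/`.box` pair. It is a simplified illustration, not the script from the video: the function names are hypothetical, the font name and fonts directory in the usage example are placeholders, and only a handful of `text2image` options are used.

```python
import pathlib
import subprocess
from typing import Iterable, List


def write_ground_truth(lines: Iterable[str], out_dir: pathlib.Path,
                       prefix: str = "eng") -> List[pathlib.Path]:
    """Write one <prefix>_<n>.gt.txt file per non-empty line of training text."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    non_empty = (line.strip() for line in lines if line.strip())
    for index, text in enumerate(non_empty):
        gt_file = out_dir / f"{prefix}_{index}.gt.txt"
        gt_file.write_text(text + "\n", encoding="utf-8")
        written.append(gt_file)
    return written


def text2image_command(gt_file: pathlib.Path, font: str, fonts_dir: str) -> List[str]:
    """Build the text2image call that renders gt_file next to itself."""
    output_base = str(gt_file)[: -len(".gt.txt")]
    return [
        "text2image",
        f"--font={font}",
        f"--fonts_dir={fonts_dir}",
        f"--text={gt_file}",
        f"--outputbase={output_base}",
        "--max_pages=1",
        "--strip_unrenderable_words",
    ]


def render_all(gt_files: Iterable[pathlib.Path], font: str, fonts_dir: str) -> None:
    """Run text2image once per ground truth file; requires text2image on PATH."""
    for gt_file in gt_files:
        subprocess.run(text2image_command(gt_file, font, fonts_dir), check=True)


if __name__ == "__main__":
    # Placeholder values: adjust the font name and directories to your setup.
    files = write_ground_truth(["HELLO WORLD", "SECOND LINE"],
                               pathlib.Path("data/Apex-ground-truth"))
    print(text2image_command(files[0], "Apex", "/usr/share/fonts"))
```

Passing `--fonts_dir` explicitly is one way to point `text2image` at a font that is not found by name, which several comments below run into.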

  • @Leo-hk7kk
    @Leo-hk7kk 7 months ago

    I want to custom-train Tesseract 5 to read car license plates that are detected using a YOLO model. How can I do this, given that I have a couple of thousand images? Help.
    What are the steps I need to follow?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +1

    I tried this with a font for the Hindi language (Kruti Dev 010, and I even tried Kruti Dev 016), but it shows: Error: Call PrepareToWrite before WriteTesseractBoxFile!!

  • @azadehpedram7215
    @azadehpedram7215 3 months ago

    I have a bunch of plates with some text on them, and the goal is to convert the images to text. I trained a special font, but it is not effective. How can I train for a better result? Thanks for the help.

  • @PratibhaVaradkar
    @PratibhaVaradkar 1 year ago

    Hi Gabriel (@AstuteJoe), thank you for the elaborate tutorial.
    I have a doubt, though. Once I followed the tutorial and generated the .tif, .gt.txt and .box files manually, my training quits with a zero error rate before the max iterations. But when I use the generated traineddata file, it gives the error "Error: Tesseract (legacy) engine requested, but components are not present in /use/share/tesseract-ocr/5/tessdata/lang_name.traineddata!! Failed loading language 'lang_name' Tesseract couldn't load any languages! Could not initialize tesseract."
    Can you please suggest what I missed?

  • @nilor7550
    @nilor7550 1 year ago

    I didn't understand how to run the training command after downloading the two folders from GitHub. I am on Windows.

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago +1

    The .lstmf files are missing. Please help me find where I went wrong.

  • @Bobo-wl6bs
    @Bobo-wl6bs 10 months ago

    Hi Gabriel. I came across Tesseract today. I'm curious: will I be able to train it to learn an Arabic font? I have a bunch of PDFs which are written in an indigenous language. The idea here is to train it on some sample pages so that it will be able to read them. The text includes diacritics, so I'm not sure if it will work.

    • @AstuteJoe
      @AstuteJoe 10 months ago

      Check the comments; a bunch of people train it for this exact purpose.

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    Can we train Tesseract without any font? If not, why can't we?

  • @ivanmongebadilla9454
    @ivanmongebadilla9454 1 year ago +1

    Thanks for the tutorial Gabriel. I wanted to ask how could I do this process if I have the images in text? I guess I need to do the .txt file and the .box file and then just run the training command.
    Do you know any software that I could use to create the .box file from the images I have?
    Thanks in advance!

    • @AstuteJoe
      @AstuteJoe 1 year ago

      I have seen people use the jTessBoxEditor: vietocr.sourceforge.net/training.html

    • @ivanmongebadilla9454
      @ivanmongebadilla9454 1 year ago

      @AstuteJoe One more question: how would you use the newly trained model in Python?
      Thank you

    • @AstuteJoe
      @AstuteJoe 1 year ago +1

      @ivanmongebadilla9454 I think you just pass a parameter lang='your_new_model_name', as long as the new model is in the tessdata folder.
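
The `lang=` parameter mentioned above matches the `lang` argument that the pytesseract wrapper accepts in `image_to_string`. To avoid depending on any wrapper, the sketch below calls the `tesseract` command line tool directly with `-l` for the model name and, optionally, `--tessdata-dir` for the folder holding the `.traineddata` file. The function names and the example paths are hypothetical.

```python
import subprocess
from typing import List, Optional


def tesseract_command(image_path: str, model_name: str,
                      tessdata_dir: Optional[str] = None,
                      psm: Optional[int] = None) -> List[str]:
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", image_path, "stdout", "-l", model_name]
    if tessdata_dir is not None:
        cmd += ["--tessdata-dir", tessdata_dir]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def recognize(image_path: str, model_name: str,
              tessdata_dir: Optional[str] = None,
              psm: Optional[int] = None) -> str:
    """Run tesseract and return its text output; requires tesseract on PATH."""
    result = subprocess.run(
        tesseract_command(image_path, model_name, tessdata_dir, psm),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths: "Apex" assumes Apex.traineddata exists in the tessdata folder.
    print(tesseract_command("sample.png", "Apex", "../tesseract/tessdata", psm=7))
```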

    • @heetshah9394
      @heetshah9394 8 months ago

      Does the box file need an entry for each character, or is it okay to have one word per bounding box?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    I tried with your font but it didn't work; it throws: Could not find font named 'Arial Unicode MS Regular'.
    Pango suggested font 'Liberation Mono'. I tried with Arial, but that didn't work either.

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago

    When Tesseract training starts, it shows the warning below:
    Can't encode transcription: 'पिए वई। ज़ख़मनि जो सूर वधंदो वियो हू चीखन्दो for Sindhi
    How can I handle this problem?

  • @hoangcuong9521
    @hoangcuong9521 4 months ago

    Thank you for making this video. It helps me a lot. But I have a problem: when I copy the script and replace the paths to the save directory and the language_code.training_text file, all of the generated images come out as blank white images. Please help me out with this.

  • @ManuthVANN
    @ManuthVANN 5 months ago

    Thank you so much, sir, for your clear explanation and code.

  • @eusebiosouza2252
    @eusebiosouza2252 8 months ago

    Great video!
    I'm getting this error when I try to run the training command:
    "Failed to read boxes from data/FE_Font-ground-truth/eng_16.tif"
    The file eng_16.tif does not seem to be empty, and it is very similar to all the other training files. I'm running with MAX_ITERATIONS=100, and when I delete the file that seems to be the problem, Tesseract throws the same error but with a different file. Could anyone please help me?

  • @listentomusicfeellikehome
    @listentomusicfeellikehome 1 month ago +1

    Hi. I tried this on Colab. I installed Tesseract and went on to run split_training_text.py, and I get this error: FileNotFoundError: [Errno 2] No such file or directory: 'text2image'. Is there a solution?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago +2

    One last question: I guess Tesseract is basically not trained on handwritten text, but on lines of system (typed) text that are converted to images line by line for training. Is my assumption true?

    • @dhirazz
      @dhirazz 1 year ago

      Hey, it seems like you were also looking to train Tesseract on handwritten text. Did you do it? If so, please shed some light; I am so lost.

    • @ganeshrajv130
      @ganeshrajv130 1 year ago

      @dhirazz Training is not an easy thing, as you need a huge amount of data, and they (Google) have also clearly said that training is not going to make much sense. Hence, if you want to try adjusting the parameters, dive deep into the C++ code.

  • @AmphibianDev
    @AmphibianDev 1 year ago +1

    Hi, I am having issues with the last make training command. It throws an error: "No module named 'PIL'".
    I have the Pillow library installed, but the error is still there. I have been trying to solve this issue for a long, long time.
    If you know something, I would appreciate the help. I wanted to link to my GitHub issue, but I am afraid YouTube doesn't allow links.

    • @mohammadmn7364
      @mohammadmn7364 5 months ago

      Hey, a long time has passed, but for others having the same issue: creating a virtual env and then installing requirements.txt (from the tesstrain repo) in it may fix the issue; at least it worked for me! Also check whether all the .txt files have matching .box files.
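
The last suggestion can be automated. The sketch below scans a ground truth folder and reports every sample that is missing one of its companion files. It assumes the layout produced in the video, where each line has a `.tif`, a `.box` and a `.gt.txt` file sharing the same stem inside `data/<MODEL_NAME>-ground-truth`; the function name is hypothetical.

```python
import pathlib
from typing import Dict, List, Set

# Assumption: every training line should come with these three files.
REQUIRED_SUFFIXES = (".tif", ".box", ".gt.txt")


def find_incomplete_samples(ground_truth_dir: pathlib.Path) -> Dict[str, List[str]]:
    """Map each sample stem to the list of companion suffixes it is missing."""
    found: Dict[str, Set[str]] = {}
    for path in ground_truth_dir.iterdir():
        for suffix in REQUIRED_SUFFIXES:
            if path.name.endswith(suffix):
                stem = path.name[: -len(suffix)]
                found.setdefault(stem, set()).add(suffix)
    required = set(REQUIRED_SUFFIXES)
    return {
        stem: sorted(required - suffixes)
        for stem, suffixes in sorted(found.items())
        if suffixes != required
    }


if __name__ == "__main__":
    folder = pathlib.Path("data/Apex-ground-truth")  # placeholder model name
    if folder.is_dir():
        for stem, missing in find_incomplete_samples(folder).items():
            print(f"{stem}: missing {', '.join(missing)}")
```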

  • @NotFlashYT
    @NotFlashYT 1 year ago

    How do you get suggestions in your terminal for auto-completion of commands?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      fishshell.com/

  • @adityanjsg99
    @adityanjsg99 1 year ago +1

    So far, this is the only tutorial on Tesseract 5. The old way of training with bash scripts has been abandoned since December 2022.

    • @faint.2396
      @faint.2396 1 year ago

      So, are you saying this video is no longer useful at all?

  • @IshaqKhan010
    @IshaqKhan010 1 year ago

    Brother, could you train it for the Urdu Nastaliq font? There is no accurate trained data on the net, please.

  • @KINGERTADC_yay
    @KINGERTADC_yay 1 year ago

    Hey Gabriel, nice vid. I am actually using it to train Tesseract on the Aurebesh font/language from Star Wars (look it up, it would explain a lot); each letter has a corresponding English letter. I have collected roughly 100,000 sentences using your program and trained with the command you provided, but when I run it on a 6-letter word it completely melts down and just outputs the wrong answer. I have tried both small and large iteration counts, but no luck. I am wondering if you can help me or point me in the right direction. Thanks a lot.

    • @ganeshrajv130
      @ganeshrajv130 1 year ago +1

      Hey, you collected the font, but what is the training text data? Is it in Aurebesh?

    • @kinderpinguiin7064
      @kinderpinguiin7064 1 year ago

      Hi! I don't know if you have resolved your issue in the past month, but don't forget to set a large MAX_ITERATIONS for make training. I personally set it to 10000 and it was noticeably better; that might well be enough for you if you have 100000 sentences. If you want to follow the results, check the log while the model is training, for example:
      At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, BCER train=97.817000%,
      BWER train=100.000000%, skip ratio=0.000000%, New best BCER = 97.817000 wrote checkpoint.
      BCER is the error rate for characters and BWER the error rate for words, you can see that at iteration 7800 it was higher than 95% and after the 9500th iteration I got several improvements.
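
To follow these numbers over a long run, the progress lines can be pulled out of the training log with a small parser. The sketch below is written against the log format quoted above and nothing more; the pattern and the names `TrainingProgress` and `parse_progress` are illustrative assumptions, and only the first of the three iteration counters is kept.

```python
import re
from typing import NamedTuple, Optional

# Matches lines such as:
# "At iteration 7800/7800/7800, Mean rms=..., BCER train=97.817000%, BWER train=100.000000%, ..."
LOG_PATTERN = re.compile(
    r"At iteration (?P<iteration>\d+)/\d+/\d+,"
    r".*?BCER train=(?P<bcer>[\d.]+)%,"
    r"\s*BWER train=(?P<bwer>[\d.]+)%"
)


class TrainingProgress(NamedTuple):
    iteration: int
    bcer: float  # character error rate on the training set, in percent
    bwer: float  # word error rate on the training set, in percent


def parse_progress(line: str) -> Optional[TrainingProgress]:
    """Return the progress figures from one log line, or None if it has none."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return TrainingProgress(
        int(match["iteration"]), float(match["bcer"]), float(match["bwer"])
    )


if __name__ == "__main__":
    sample = (
        "At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, "
        "BCER train=97.817000%, BWER train=100.000000%, skip ratio=0.000000%, "
        "New best BCER = 97.817000 wrote checkpoint."
    )
    print(parse_progress(sample))
```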

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago

    When I run the training command, it gives me the error below:
    Segmentation fault (core dumped) tesseract "data/Apex-ground-truth/eng_62.tif" data/Apex-ground-truth/eng_62 --psm 13 lstm.train
    Makefile:262: recipe for target 'data/Apex-ground-truth/eng_62.lstmf' failed
    make: *** [data/Apex-ground-truth/eng_62.lstmf] Error 139
    Can you help me to fix this?

    • @xzerozdead
      @xzerozdead 1 year ago

      Your folder was probably named "Apex" and not "Apex-ground-truth"

  • @shadyas.1571
    @shadyas.1571 10 months ago +2

    Hi Gabriel.
    Thank you for this tutorial.
    I was trying to run the code but I'm receiving this error:
    Fontconfig error: Cannot load default config file: No such file: (null)
    This error appears to be font-related. I've experimented with several fonts but I'm unable to resolve this issue.
    Could you help me please?

    • @kavachek2
      @kavachek2 8 months ago

      Same problem here.

    • @pauliusliaudenskas9269
      @pauliusliaudenskas9269 5 months ago

      Have you been able to figure it out? I'm having the same problem

    • @kavachek2
      @kavachek2 5 months ago

      @pauliusliaudenskas9269 Unfortunately, I couldn't. I don't understand how to do it.

  • @farazsoftinfo
    @farazsoftinfo 1 year ago +1

    Hi Gabriel,
    Thanks for making this tutorial, I was waiting for it.
    I will start training my model soon. 😍
    But how can we fine-tune a model?
    Can you please show me how I can combine this newly trained file with another model?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Glad you liked it! In this tutorial you can see I actually fine-tuned, I started on the eng.traineddata file from Tesseract and trained it further on a new font, this should be enough for most cases.

    • @farazsoftinfo
      @farazsoftinfo 1 year ago

      @AstuteJoe Hi Gabriel, when I fine-tune I get a very bad result. I just want to add some new words and some characters, but the final file that I get is worse than the original traineddata file.
      I'm trying to fine-tune an RTL language.
      Thanks a lot.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @farazsoftinfo That's a very different rabbit hole; that's down to ML technique. You might be overfitting (training too much) or underfitting (training too little) your model. Have you tried generating all the 193k PDFs to train on and leaving it to train for a bit?

    • @gabriel2011gabriel
      @gabriel2011gabriel 1 year ago

      @farazsoftinfo I'm trying to do the same thing and the result is a bunch of "mmmoooomom...". Is yours the same?

    • @farazsoftinfo
      @farazsoftinfo 1 year ago +1

      @gabriel2011gabriel I tried it for Persian, but I couldn't get a good result. The stock models are still better than what I got. When I try to add some new words and fonts, I get a worse model. Maybe I should look into it more to figure out the best settings for RTL languages.

  • @aayushjain7793
    @aayushjain7793 1 year ago +3

    While running the script 'split_training_text.py', I am getting the following error:
    Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
    Could you help me how to resolve this?

    • @jayrigger7508
      @jayrigger7508 1 year ago

      I am also getting this. Running as sudo helped a bit, but I am still getting this: "Unable to open '../tmp/fonts.conf' for writing: No such file or directory"

    • @jayrigger7508
      @jayrigger7508 1 year ago

      Just to add: I am getting eng_XX.box, eng_XX.tiff and eng_xx.gt.txt

    • @aayushjain7793
      @aayushjain7793 1 year ago

      @jayrigger7508 I resolved the issue by just changing the --font flag to /usr/share/fonts

  • @Ethiopic
    @Ethiopic 10 months ago

    Thank you for this video. I am now able to train Tesseract to OCR my language data on the Mac. This is working great on both Linux and the Mac. (However, I am unable to do so on Windows, because I get the error "tessdata_prefix not recognized".)

    • @wonkduck4759
      @wonkduck4759 8 months ago

      Hello, I am currently stuck. Where did you put your new font ttf file in the code directory, i.e. where should it be placed? I have a custom font ttf file that I want to train on.

    • @alirezanadafy9267
      @alirezanadafy9267 7 months ago

      Hi
      Just run:
      set TESSDATA_PREFIX="../tesseract/tessdata"
      and then run the text2image....

  • @umandadikwatta178
    @umandadikwatta178 1 year ago

    Hello, can you please explain how to debug the Tesseract code, to get an idea of how the code works?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Honestly, I think your best bet is cloning the GitHub repo, reading the docs and then delving into the code. Just read it; eventually you'll get better at knowing where to look, and after trying hard you might get comfortable and understand it. I'm also pretty sure the docs describe how to dump and inspect debug files for some intermediate steps. Finally, be sure to run it in verbose mode, probably -v. Ah, and you can compile it with debugging symbols too, which should help if you want to set breakpoints etc.

  • @sebastianorzechowski4613
    @sebastianorzechowski4613 2 months ago

    Hello, is there anyone who has tried to teach Tesseract Polish characters? I have adjusted split_training_text for Tesseract 5.0 to create lines from a Polish text set and then train Tesseract. I think the problem is with the font type, because it should know how to render those special characters:
    Stripped 4 unrenderable word(s): 'unieważnienie SZKOŁAMI NADZIEJĘ, | '
    I can share my adjusted script for generating those lines if you want. I will try with another font. I tried HvDTrial Fabrikat Mono.

  • @kallemyllynen9571
    @kallemyllynen9571 4 months ago

    Running this on Windows, I had to modify the Makefile to make it work.

  • @monctrikblitz5674
    @monctrikblitz5674 3 months ago

    When running your python script, an error occurs:
    Fontconfig error: Cannot load default config file
    Fontconfig error: Cannot load default config file
    Could not find font named 'Waukegan LDO Bold'.
    Please correct --font arg.
    How can I solve this error? I need to use my unique font "Waukegan LDO Bold.ttf"
    I hope you can help me to solve this problem, thank you in advance.

    • @sebastianorzechowski4613
      @sebastianorzechowski4613 2 months ago

      I think you should install this font on your system first :)

  • @insidethoughts502
    @insidethoughts502 1 year ago

    Can Tesseract 5 help with detecting only the bold text in images?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Only experimentation will tell, but Tesseract 5 does perform better sometimes.

  • @Bengeljo
    @Bengeljo 4 months ago +1

    I always get an error when I want to use a font, even though it is installed, Windows can find it, and looking it up works perfectly. When I run split_training_text.py I get the following error:
    Fontconfig error: Cannot load default config file: No such file: (null)
    Fontconfig error: Cannot load default config file: No such file: (null)
    Could not find font named 'Quadrant'.
    Pango suggested font 'Cascadia Code'.
    Please correct --font arg.
    I want to train the model on Quadrat-Serial-Regular.ttf, but it just won't recognize it. I tried to look it up but can't find it. Modifying the font flag doesn't help, since it wants a name but can't find it even though it is there; to be honest, I don't know where it is searching for the fonts.
    The folder is located on the SSD E: and the operating system is on C:, but Tesseract and Python are on the path on C:, so they should have access to it. Please help.

    • @TheComputerChip
      @TheComputerChip 3 months ago +1

      Having the same problem. Still trying to understand what it is looking for...

    • @Bengeljo
      @Bengeljo 3 months ago +1

      @TheComputerChip I gave up and looked at another method that uses Google Colab to create my own model; there it works pretty well. I don't remember which video it was, because between then and now I have probably watched approximately 250 videos. Not kidding, I don't have a life.

    • @TheComputerChip
      @TheComputerChip 3 months ago +2

      @Bengeljo Hahaha, no worries. I actually ended up getting this to work. Oddly enough, the error doesn't seem to affect the output. As long as it finds the font, everything still runs. Currently waiting as my PC generates the images, and then I'll sleep as it trains. On video #3 since starting the image creation! lol

    • @ROHIT_S_Patil
      @ROHIT_S_Patil 3 days ago

      @Bengeljo Can you share the Google Colab workflow you followed to create your model?

  • @DalvinderKaur-iz5sn
    @DalvinderKaur-iz5sn 1 year ago

    Thanks for the tutorial, sir. I get an error after running the training command: TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000. The error is:
    "CMakefile:325: recipe for target 'data/foo/checkpoints/foo_checkpoint' failed". And: Encoding of string failed! Failure bytes.... ..Can't encode transcription: ..... Can you please help me with these issues?

  • @cryptoplusone3850
    @cryptoplusone3850 1 year ago

    Does this also work on Windows, or do I have to use a different method?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      I believe it works, but definitely not every step exactly as in the video. As far as I remember, though, the Tesseract maintainers highly recommend Linux instead.

    • @focusofLandD
      @focusofLandD 1 year ago

      I tried on Windows and it is not working very well. Please let me know if you are able to solve it.

  • @hugolearn
    @hugolearn 28 days ago

    So I actually followed this through, handy scripts.. However
    Seems to have seriously overfit my data. No augmentations? No variance in font size or spacing?
    I notice in this video you only actually evaluate your trained model against a ground truth image. This all looks technically correct but as it stands still kinda useless for any practical application?
    How's the output if you generate a new text without text2image and run it against that ?

    • @AstuteJoe
      @AstuteJoe 28 days ago

      I imagine you could edit the text2image utility's source code to introduce the variance you need; Tesseract is open source.

  • @legendevent3911
    @legendevent3911 1 year ago

    Hey Gabriel, I have a training_text file with just digits like 1,234,567 in a variety of combinations. The problem is that when I try to start your script, I get the following error message:
    python3 split_training_text.py
    Traceback (most recent call last):
    File "split_training_text.py", line 12, in
    for line in input_file.readlines():
    File "/usr/lib/python3.8/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
    Could you help me resolve this? I'm a newbie in Python.
    The tutorial was great!
    Edit: When I change the script to: with open(training_text_file, 'rb') I get a new error: TypeError: write() argument must be str, not bytes

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Can you send me the whole file? Pastebin or GitHub will do. I believe I know exactly how to fix it, but I need the whole file to send you the fixed version.

    • @abdeldjalilchougui
      @abdeldjalilchougui 11 months ago

      Did you solve the problem? If yes, could you share the solution with me, please?

    • @abdeldjalilchougui
      @abdeldjalilchougui 11 months ago

      @AstuteJoe Did you solve the problem? If yes, could you share the solution with me, please?

    • @sebastianorzechowski4613
      @sebastianorzechowski4613 3 months ago

      I think you have to pass encoding='utf-8' inside the open function:
      with open(training_text_file,'r',encoding='utf-8') as input_file:
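
One caveat on the traceback above: it shows the UTF-8 codec already in use, and a 0xff byte at position 0 is typical of a file saved as UTF-16 with a byte order mark, so forcing `utf-8` may not be enough. The sketch below is one possible workaround under that assumption: it inspects the first bytes and picks the decoder accordingly. `read_text_guessing_bom` is a hypothetical helper, not part of the tutorial's script.

```python
import codecs


def read_text_guessing_bom(path: str) -> str:
    """Read a text file, choosing the decoder from its byte order mark if present."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # Check UTF-32 first: its little-endian BOM begins with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return raw.decode("utf-32")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")
    # No BOM: fall back to plain UTF-8, as the original script expects.
    return raw.decode("utf-8")
```

The returned string can then be split with `.splitlines()` in place of `input_file.readlines()`. Re-saving the training text as UTF-8 in a text editor would be another way around the problem.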

  • @PsychologicalHeat
    @PsychologicalHeat 1 year ago +1

    I am receiving this error when I try to run your command:
    Failed to read boxes from data/myFont-ground-truth/eng_45.tif
    Error during processing.
    make: *** [data/myFont-ground-truth/eng_45.lstmf] Error 1
    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=myFont START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
    I have added eng.traineddata to tessdata. Can you help me fix it, please?

    • @AstuteJoe
      @AstuteJoe 1 year ago +1

      Did you generate the .box files successfully?

    • @PsychologicalHeat
      @PsychologicalHeat 1 year ago

      @AstuteJoe I cleaned the box files, but now I get a different error.
      Here is my output:
      + tesseract data/myFont-ground-truth/eng_2.tif data/myFont-ground-truth/eng_2 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_0.tif data/myFont-ground-truth/eng_0 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_5.tif data/myFont-ground-truth/eng_5 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_7.tif data/myFont-ground-truth/eng_7 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_3.tif data/myFont-ground-truth/eng_3 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      + tesseract data/myFont-ground-truth/eng_1.tif data/myFont-ground-truth/eng_1 --psm 13 lstm.train
      read_params_file: Can't open lstm.train
      find -L data/myFont-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/myFont/all-lstmf"
      Error: missing ground truth for training
      make: *** [data/myFont/list.train] Error 1
      Your help would be much appreciated 🙂

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @PsychologicalHeat Did you generate the .gt.txt files? Those are text files with the actual text in them.

    • @PsychologicalHeat
      @PsychologicalHeat 1 year ago

      @AstuteJoe Yes, I have all the .gt.txt, .box, and .tiff files.
      I think the problem might be that I want the OCR to read only uppercase letters?
      I have made a custom training_text file and it only has numbers, '-' and uppercase letters.
      I played around with it and now this is the output:
      find -L data/myFont-ground-truth -name '*.gt.txt' | xargs paste -s > "data/myFont/all-gt"
      unicharset_extractor --output_unicharset "data/myFont/unicharset" --norm_mode 2 "data/myFont/all-gt"
      Bad box coordinates in boxfile string! 36-XR-34928-PN-54460-TN-50758-XB-02919-JP-10263-DG-99350-MF-07358-PK-31144-MB-35731-ZX-758
      Extracting unicharset from plain text file data/myFont/all-gt
      Other case x of X is not in unicharset
      Other case r of R is not in unicharset
      Other case p of P is not in unicharset
      Other case n of N is not in unicharset
      Other case t of T is not in unicharset
      Other case b of B is not in unicharset
      Other case j of J is not in unicharset
      Other case d of D is not in unicharset
      Other case g of G is not in unicharset
      Other case m of M is not in unicharset
      Other case f of F is not in unicharset
      Other case k of K is not in unicharset
      Other case z of Z is not in unicharset
      Wrote unicharset file data/myFont/unicharset
      make: *** No rule to make target `data/myFont-ground-truth/myFont_1.lstmf', needed by `data/myFont/all-lstmf'. Stop.

  • @ikedoriens6149
    @ikedoriens6149 1 year ago

    Jesus. Isn't there just a command-line option like in Tesseract 4.0?
    This seems a bit complicated for someone who's not into programming.

  • @3ombieautopilot
    @3ombieautopilot 1 year ago +1

    Hello! Can you make a video about how to make Tesseract recognize a character that is not in eng.traineddata, like ± or Ó, mixed with some English text?

    • @adityanjsg99
      @adityanjsg99 1 year ago

      Train it and then use it.

  • @asiburrahman3623
    @asiburrahman3623 1 year ago

    I didn't get the font part. Where did you put the font?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      It has to be installed on your system; each OS has a different way of doing it.

    • @asiburrahman3623
      @asiburrahman3623 1 year ago +1

      @AstuteJoe I'm using Ubuntu. Is there any way to specify the directory?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @asiburrahman3623 askubuntu.com/questions/3697/how-do-i-install-fonts

    • @asiburrahman3623
      @asiburrahman3623 1 year ago +2

      @AstuteJoe I have installed the font, but this error still shows:
      Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
      Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
      Could not find font named 'Apex'.

    • @kannapatudompant8535
      @kannapatudompant8535 1 year ago

      @asiburrahman3623 I also had the same problem.
      I tried adding '--fontconfig_tmpdir={fontconf_dir}'; the default is /tmp, which doesn't have our font directory in it.
      fonts.conf is usually located in etc/share/fonts.
      Now I can create the .box and .tif files.
      I hope this solution solves your issue too.

  • @TuanLe-ve7lm
    @TuanLe-ve7lm 1 year ago

    Hi Gabo, may I please see your fonts.conf file?

    • @AstuteJoe
      @AstuteJoe 1 year ago

      I'm not even sure what this file is right now, but here you go; this one is in my home folder:
      /home/gabri/tesseract_training/apex_legends.otf

    • @AstuteJoe
      @AstuteJoe 1 year ago

      This one is in the Tesseract project folder:

    • @TuanLe-ve7lm
      @TuanLe-ve7lm 1 year ago

      I made good progress today: I am able to train the Apex font. However, when I switch to another font, Noto Sans, it is able to generate the box and tif files, but it shows an error while training: "Makefile:219: *** found no data/Noto Sans-ground-truth/*.gt.txt for Sans/all-gt. Stop." It seems it does not accept a font name with a space in the middle.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @@TuanLe-ve7lm That could definitely be it. Spaces in file names and Linux (or Windows) tooling don't mix well.
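
One possible workaround, sketched under the assumption that only MODEL_NAME and the ground-truth directory need to be space-free: the "for Sans/all-gt" part of the message suggests make split the name at the space. The helper `safe_model_name` is hypothetical; the font name passed to text2image (`--font`) can stay "Noto Sans".

```python
import re


def safe_model_name(font_name):
    """Collapse runs of whitespace to underscores so make sees a single word."""
    return re.sub(r"\s+", "_", font_name.strip())


model = safe_model_name("Noto Sans")
# Directory the Makefile looks in by default: data/<MODEL_NAME>-ground-truth
ground_truth_dir = f"data/{model}-ground-truth"
```

The training command would then use `MODEL_NAME=Noto_Sans`, with the generated .tif and .gt.txt files written to `data/Noto_Sans-ground-truth`.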

  • @blndazeez1973
    @blndazeez1973 1 year ago

    Hi Gabriel,
    Great video! One question: when I try to retrain the Arabic model using this command
    "TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=ara TESSDATA=../tesseract/tessdata MAX_ITERATIONS=200"
    it gives me the error below:
    "Error opening data file ../tesseract/tessdata/eng.traineddata"
    The problem is that I am not using the English model.
    Thanks for the video again!

    • @AstuteJoe
      @AstuteJoe 1 year ago

      That's really odd. I see you changed START_MODEL, so it should work. I'm not sure what's going on.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      Do you have ara.traineddata in the tessdata folder?

    • @blndazeez1973
      @blndazeez1973 1 year ago

      @@AstuteJoe Yes, I have it; I double-checked a couple of times.

    • @AstuteJoe
      @AstuteJoe 1 year ago

      @@blndazeez1973 Maybe it's because the Apex model had already been created when you were trying it out, and it was already built on top of the eng traineddata?

    • @blndazeez1973
      @blndazeez1973 1 year ago +1

      @@AstuteJoe I redid the steps with a different model name, but it gives me the same error. That is strange.
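
One possible explanation, not confirmed in this thread: the `lstm.train` step that appears in other logs on this page (`tesseract ... --psm 13 lstm.train`) is run without `-l`, and tesseract falls back to `eng` when no language is given. If so, `eng.traineddata` would be needed in the tessdata folder even when START_MODEL is `ara`. A small pre-flight check under that assumption (the helper name `missing_traineddata` is hypothetical):

```python
from pathlib import Path


def missing_traineddata(tessdata_dir, start_model):
    """Return the .traineddata files that are absent from tessdata_dir."""
    # Assumption: the default-language model (eng) is needed in addition
    # to the chosen start model.
    needed = {f"{start_model}.traineddata", "eng.traineddata"}
    root = Path(tessdata_dir)
    return sorted(name for name in needed if not (root / name).is_file())
```

Calling `missing_traineddata("../tesseract/tessdata", "ara")` before `make training` returns an empty list when both files are in place.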

  • @rcraftg4mer42
    @rcraftg4mer42 6 months ago

    i love you

    • @AstuteJoe
      @AstuteJoe 6 months ago

      lol i love you too

  • @datarkmveri2228
    @datarkmveri2228 1 year ago

    please help

  • @_nom_
    @_nom_ 1 year ago

    No rule to make target 'data/eng-ground-truth/eng.training_text.lstmf'

  • @user-of2lm9ii5g
    @user-of2lm9ii5g 1 year ago

    Hello, how do I fix this?
    Failed to read data from: data/langdata/Apex/Apex.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

    • @user-of2lm9ii5g
      @user-of2lm9ii5g 1 year ago +4

      Solved it: you need to run these in the tesstrain folder:
      make leptonica tesseract
      make tesseract-langdata

    • @user-yj8eh5ft9m
      @user-yj8eh5ft9m 1 year ago

      thanks
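
To see whether the langdata step is what is missing before re-running training, a quick check along these lines may help. It is only a sketch: the two file names come from the error messages in this thread, and the helper `missing_langdata` is hypothetical.

```python
from pathlib import Path

# Files the error messages above complain about.
REQUIRED_LANGDATA = ("radical-stroke.txt", "Latin.unicharset")


def missing_langdata(langdata_dir="data/langdata"):
    """List the required langdata files that are not present."""
    root = Path(langdata_dir)
    return [name for name in REQUIRED_LANGDATA if not (root / name).is_file()]
```

Run from the tesstrain folder, `missing_langdata()` returning an empty list suggests `make tesseract-langdata` has already populated `data/langdata`.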

  • @datarkmveri2228
    @datarkmveri2228 1 year ago +1

    Hi,
    When I try to run the training command, it gives an error. Can you please help me?
    Config file is optional, continuing...
    Failed to read data from: data/langdata/Apex/Apex.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1

    • @datarkmveri2228
      @datarkmveri2228 1 year ago +2

      command : TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
      combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Apex

    • @datarkmveri2228
      @datarkmveri2228 1 year ago

      tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
      + tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
      python3 shuffle.py 0 "data/Apex/all-lstmf"
      + head -n 90 data/Apex/all-lstmf
      + tail -n 10 data/Apex/all-lstmf
      combine_lang_model \
      --input_unicharset data/Apex/unicharset \
      --script_dir data/langdata \
      --numbers data/Apex/Apex.numbers \
      --puncs data/Apex/Apex.punc \
      --words data/Apex/Apex.wordlist \
      --output_dir data \
      \
      --lang Apex
      Failed to read data from: data/Apex/Apex.wordlist
      Failed to read data from: data/Apex/Apex.punc
      Failed to read data from: data/Apex/Apex.numbers
      Loaded unicharset of size 113 from file data/Apex/unicharset
      Setting unichar properties
      Other case É of é is not in unicharset
      Other case FI of fi is not in unicharset
      Setting script properties
      Failed to load script unicharset from:data/langdata/Latin.unicharset
      Warning: properties incomplete for index 3 = C
      Warning: properties incomplete for index 4 = H
      Warning: properties incomplete for index 5 = E
      Warning: properties incomplete for index 6 = S
      Warning: properties incomplete for index 7 = -
      Warning: properties incomplete for index 8 = R
      Warning: properties incomplete for index 9 = I
      Warning: properties incomplete for index 10 = K
      Warning: properties incomplete for index 11 = N
      Warning: properties incomplete for index 12 = G
      Warning: properties incomplete for index 13 = B
      Warning: properties incomplete for index 14 = 8
      Warning: properties incomplete for index 15 = 5

    • @user-of2lm9ii5g
      @user-of2lm9ii5g 1 year ago

      @@datarkmveri2228 Solved it: you need to run these in the tesstrain folder:
      make leptonica tesseract
      make tesseract-langdata

  • @Kronzplayz.
    @Kronzplayz. 1 year ago

    Kindly help, I'm getting an error while training, please @AstuteJoe
    Failed to read data from: data/OCRA/OCRA.wordlist
    Failed to read data from: data/OCRA/OCRA.punc
    Failed to read data from: data/OCRA/OCRA.numbers
    Loaded unicharset of size 112 from file data/OCRA/unicharset
    Setting unichar properties
    Other case É of é is not in unicharset
    Setting script properties
    Failed to load script unicharset from:data/langdata/Latin.unicharset
    Config file is optional, continuing...
    Failed to read data from: data/langdata/OCRA/OCRA.config
    Failed to read data from: data/langdata/radical-stroke.txt
    Error reading radical code table data/langdata/radical-stroke.txt
    make: *** [Makefile:293: data/OCRA/OCRA.traineddata] Error 1

    • @Kronzplayz.
      @Kronzplayz. 1 year ago

      I solved this issue 😅

    • @enriqueortiz5875
      @enriqueortiz5875 1 year ago

      @@Kronzplayz. How did you solve it? I got the same issue.

    • @user-of2lm9ii5g
      @user-of2lm9ii5g 1 year ago

      @@enriqueortiz5875 Solved it: you need to run these in the tesstrain folder:
      make leptonica tesseract
      make tesseract-langdata

  • @focusofLandD
    @focusofLandD 1 year ago

    Hi Gabriel, I am getting this error at the last training step when I try to train a new font called Bender:
    Failed to read data from : data/bender/bender.wordlist
    Failed to read data from : data/bender/bender.punc
    Failed to read data from : data/bender/bender.numbers
    Failed to read data from : data/bender/bender.config
    Invalid format in radical table at line 0: 19886 3 23 6 3

    • @notAvn
      @notAvn 1 year ago

      Did you manage to train Tesseract for Bender yet?

  • @ganeshrajv130
    @ganeshrajv130 1 year ago

    read_params_file: Can't open make
    read_params_file: Can't open training
    read_params_file: Can't open MODEL_NAME=nakula_hin
    read_params_file: Can't open START_MODEL=hin
    read_params_file: Can't open TESSDATA=/usr/local/share/tessdata/
    read_params_file: Can't open MAX_ITERATIONS=10
    Error, cannot read input file TESSDATA_PREFIX: No such file or directory
    Error during processing.
    This is the error I get even though I followed your steps.
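
The `read_params_file: Can't open make` lines suggest the whole line was handed to the `tesseract` binary as arguments instead of being run as a `make` command (a guess from the log, not something confirmed in the thread). A minimal sketch of launching the same training step from Python, with TESSDATA_PREFIX set as an environment variable; the helper name `training_invocation` is hypothetical and nothing is executed by the block itself.

```python
import os


def training_invocation(model_name, start_model, tessdata, max_iterations):
    """Return (argv, env) for the tesstrain 'make training' step."""
    argv = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    # TESSDATA_PREFIX is an environment variable, not a make argument.
    env = dict(os.environ, TESSDATA_PREFIX=str(tessdata))
    return argv, env


argv, env = training_invocation(
    "nakula_hin", "hin", "/usr/local/share/tessdata/", 10
)
```

To actually start training, pass the pair to `subprocess.run(argv, env=env, cwd="tesstrain", check=True)`, where `cwd` is assumed to be the tesstrain checkout.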

  • @faint.2396
    @faint.2396 1 year ago

    Hi, I'm getting this error:
    Traceback (most recent call last):
    File "C:\Users\HAVASIZ\Desktop\tesseract_tutorial\split_training_text.py", line 34, in <module>
    subprocess.run([
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
    File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
    FileNotFoundError: [WinError 2]

    • @TuanLe-ve7lm
      @TuanLe-ve7lm 1 year ago

      Same for me. Have you found a solution yet?

    • @faint.2396
      @faint.2396 1 year ago

      @@TuanLe-ve7lm No, sadly I gave up on training Tesseract 5. I'm going to try to learn how to train Tesseract 4 because there are a lot more videos about it on YouTube.

    • @faint.2396
      @faint.2396 1 year ago

      @@TuanLe-ve7lm I actually fixed the issue by using Linux. But now I get other errors lol

    • @abdeldjalilchougui
      @abdeldjalilchougui 11 months ago

      @@faint.2396 Did you fix your problem?

    • @sebastianorzechowski4613
      @sebastianorzechowski4613 2 months ago

      I think it could be related to text2image itself. You have to provide the path to text2image.exe, which is generally located in the Tesseract installation folder.
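
A sketch of that suggestion, assuming the script starts text2image via subprocess: `[WinError 2]` from `CreateProcess` usually means the executable was not found, so resolving the path up front gives a clearer failure. The helper `find_tool` and the fallback directory are hypothetical.

```python
import shutil
from pathlib import Path


def find_tool(name="text2image", fallback_dir=None):
    """Return the full path to an executable, or None if it cannot be found."""
    found = shutil.which(name)  # searches PATH (and PATHEXT on Windows)
    if found:
        return found
    if fallback_dir is not None:
        base = Path(fallback_dir)
        for candidate in (base / name, base / f"{name}.exe"):
            if candidate.is_file():
                return str(candidate)
    return None
```

In the script, the first element of the `subprocess.run([...])` list can then be replaced with the returned path, for example `find_tool("text2image", fallback_dir=r"C:\path\to\tesseract")`, where the directory is whatever the local install uses.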

  • @utkarshmishra6194
    @utkarshmishra6194 11 months ago

    Hi Gabriel, hope you're doing well.
    I ran this command:
    TESSDATA_PREFIX=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata MAX_ITERATIONS=400
    But I am getting this error:
    Failed to read data from: data/Apex/Apex.wordlist
    Failed to read data from: data/Apex/Apex.punc
    Failed to read data from: data/Apex/Apex.numbers
    Failed to read data from: data/langdata/Apex/Apex.config
    Null char=2
    lstmtraining \
    --debug_interval 0 \
    --traineddata data/Apex/Apex.traineddata \
    --old_traineddata /mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata/eng.traineddata \
    --continue_from data/eng/Apex.lstm \
    --learning_rate 0.0001 \
    --model_output data/Apex/checkpoints/Apex \
    --train_listfile data/Apex/list.train \
    --eval_listfile data/Apex/list.eval \
    --max_iterations 1000 \
    --target_error_rate 0.01
    Failed to load list of training filenames from data/Apex/list.train
    make: *** [Makefile:319: data/Apex/checkpoints/Apex_checkpoint] Error 1

    • @nithyavenugopal6834
      @nithyavenugopal6834 9 months ago

      Hi, were you able to solve this error? If so, how?
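
Nobody in the thread confirmed a cause for "Failed to load list of training filenames". One plausible one is that no training files were produced because the ground-truth folder (by default `data/<MODEL_NAME>-ground-truth` in tesstrain) is empty or has unmatched files. A minimal sketch of a check, with a hypothetical helper name:

```python
from pathlib import Path


def ground_truth_report(gt_dir):
    """Count matching .tif / .gt.txt pairs and list the unmatched stems."""
    root = Path(gt_dir)
    suffix = ".gt.txt"
    texts = {p.name[: -len(suffix)] for p in root.glob(f"*{suffix}")}
    images = {p.stem for p in root.glob("*.tif")}
    return {
        "pairs": len(texts & images),
        "text_only": sorted(texts - images),
        "image_only": sorted(images - texts),
    }
```

For the command above, `ground_truth_report("data/Apex-ground-truth")` reporting zero pairs would point at the data generation step rather than at `lstmtraining`.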

  • @athosmba1766
    @athosmba1766 10 months ago

    When I run the command TESSDATA_PREFIX=.../tesseract/tessdata make training model_NAME=Apex Start_MODEL=eng TESSDATA=.../tesseract/tessdata MAX_INTERATION=100 it doesn't work; it gives an error about the command TESSDATA=........

    • @athosmba1766
      @athosmba1766 10 months ago

      Can someone help me?

    • @Ethiopic
      @Ethiopic 10 months ago

      Are you getting a "not recognized" error? I am getting the same error on Windows. The exact same command works fine on the Mac. Very strange. Did you find a solution?
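
Two things may be worth checking here, neither confirmed in the thread. First, the `VAR=value command` prefix is POSIX shell syntax, which the Windows command prompt does not understand; that would fit a "not recognized" error. Second, make variable names are case-sensitive, and the command as typed above differs from the one in the video description. A small sketch that flags such names (the helper `check_assignments` is hypothetical, and the list only covers the variables used in the video's command):

```python
# Variables that appear in the training command from the video description.
KNOWN_VARS = ("TESSDATA_PREFIX", "MODEL_NAME", "START_MODEL",
              "TESSDATA", "MAX_ITERATIONS")


def check_assignments(tokens):
    """Report NAME=value tokens whose NAME is mis-cased or unknown."""
    problems = []
    for token in tokens:
        if "=" not in token:
            continue  # plain words such as "make" or "training"
        name = token.split("=", 1)[0]
        if name in KNOWN_VARS:
            continue
        match = next((k for k in KNOWN_VARS if k.lower() == name.lower()), None)
        if match:
            problems.append(f"{name}: wrong case, expected {match}")
        else:
            problems.append(f"{name}: not a variable used in the video's command")
    return problems


typed = ("TESSDATA_PREFIX=.../tesseract/tessdata make training model_NAME=Apex "
         "Start_MODEL=eng TESSDATA=.../tesseract/tessdata MAX_INTERATION=100")
issues = check_assignments(typed.split())
```

For the command as typed, `issues` flags `model_NAME`, `Start_MODEL`, and `MAX_INTERATION`.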

  • @vishnubalaji9500
    @vishnubalaji9500 1 year ago +2

    Understood jack shit from this video; it needs more dumbing down.

    • @faint.2396
      @faint.2396 1 year ago +4

      For real, I did every step the same and I'm getting errors. Why isn't training Tesseract 5 as simple as Tesseract 4? And the thing is, there's only ONE video on how to train Tesseract 5, and it's this one.

  • @sayantanbiswas9702
    @sayantanbiswas9702 2 months ago

    TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=coc START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000

  • @sayantanbiswas9702
    @sayantanbiswas9702 2 months ago

    tesseract data/coc-ground-truth/eng_2.tif stdout --tessdata-dir /home/godmode2/tesseract_tutorial/tesstrain/data --psm 7 -l coc --loglevel ALL