Training Tesseract 5 for a New Font

Gabriel Garcia

48,304 views

Published on Jan 6, 2025

Comments

@taylorbarnes6151 1 year ago ⁺¹⁸
God I love you. I just recently started messing with OCR, specifically Tesseract, and I was reading through some documentation on the steps and after a few hours just wanted to end my life hahahaha. Thank you for this, this is extremely encouraging. I can't wait to try this!
@xbelanch 6 days ago ⁺¹
Tesseract docs are hell. The Lord knows it, and for that reason, after dividing the light from the night, He gave it to Gabriel Garcia to make this blessed tutorial.
@yichenyao5927 9 months ago ⁺³
I think the reason the word error rate is high is that the font doesn't distinguish uppercase from lowercase (it's all uppercase), but the ground truth labels do distinguish between the two.
@nobafan7515 22 days ago
Is there a setting to get it to recognize them?
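One possible workaround, offered as an assumption rather than something shown in the video: if the font truly has no lowercase glyphs, the training text can be uppercased before the line images are generated, so the labels match what is actually rendered. A minimal sketch, where `uppercase_training_text` and both file paths are hypothetical names:

```python
from pathlib import Path


def uppercase_training_text(src: str, dst: str) -> None:
    """Write an upper-cased copy of the training text at src to dst.

    Assumption: this runs BEFORE the line images, .box and .gt.txt files
    are generated, so every label agrees with the rendered glyphs.
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(text.upper(), encoding="utf-8")
```

The generation script would then be pointed at the uppercased copy instead of the original training text.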
@buny0n 11 months ago ⁺¹⁸
Tesseract's documentation is abysmal.
@nikolaikrot8516 9 months ago ⁺¹
I tend to think of the Tesseract documentation as the Augean Stables.
@AchievementHuntGuru 5 months ago ⁺¹
This video on training is the only source you can follow and actually achieve results! Many thanks for this video!
@donjuanpond1 5 months ago ⁺¹
thank you so much man. I've been looking everywhere for a tesseract tutorial, it all just points to the shitty unreadable docs. Without you I don't know where I'd be
@fivalt126 8 months ago ⁺¹
I was racking my brain trying to understand the official tutorial, and you explain it in a simple way. I'm your subscriber number 666. Thank you very much.
@madhavpandey30 1 year ago ⁺³
Hey Gabriel, I am following your steps to train my model on handwritten text, but it always fails with this error:
unicharset_extractor --output_unicharset "data/Apex/my.unicharset" --norm_mode 2 "data/Apex/all-gt"
Failed to read data from: data/Apex/all-gt
Wrote unicharset file data/Apex/my.unicharset
Can you please help me here? I am stuck. Thanks!
@ConfusedProgrammer 11 months ago ⁺²
I've been experimenting with this tutorial for three days. The file structure and the GitHub repo don't quite match; can you please update the repo if possible? I am running into too many folder inconsistencies when trying to connect the dots here, as it was brushed over really quickly. Thank you :)
@45545videos 1 year ago ⁺²
Haven't watched the video yet, but if this works, you'll have my eternal gratitude
@aayushjain7793 2 years ago ⁺³
While running the script 'split_training_text.py', I am getting the following error:
Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
Could you help me resolve this?
@jayrigger7508 2 years ago
I am also getting this. Running as sudo helped a bit, but I'm still getting this: "Unable to open '../tmp/fonts.conf' for writing: No such file or directory"
@jayrigger7508 2 years ago
Just to add: I am getting eng_XX.box, eng_XX.tiff and eng_xx.gt.txt
@aayushjain7793 2 years ago
@jayrigger7508 I resolved the issue by just changing the --font flag to /usr/share/fonts
@goksel9908 4 months ago
@aayushjain7793 You mean,
'--font= Apex',
you changed this to
'--font= /usr/share/fonts/Apex',
this?
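For readers unsure which flag takes what: text2image separates the font family name (`--font`) from the directory it searches for font files (`--fonts_dir`), so a directory path most likely belongs in the latter. The sketch below only assembles an argument list; `build_text2image_cmd` and the example paths are hypothetical, and whether this resolves the warning above is an assumption.

```python
from typing import List, Optional


def build_text2image_cmd(
    font_name: str,
    fonts_dir: str,
    text_file: str,
    output_base: str,
    fontconfig_tmpdir: Optional[str] = None,
) -> List[str]:
    """Build a text2image argument list (nothing is executed here)."""
    cmd = [
        "text2image",
        f"--font={font_name}",       # font family name, not a file path
        f"--fonts_dir={fonts_dir}",  # directory searched for font files
        f"--text={text_file}",
        f"--outputbase={output_base}",
    ]
    if fontconfig_tmpdir:
        # Directory where text2image writes its temporary fonts.conf.
        cmd.append(f"--fontconfig_tmpdir={fontconfig_tmpdir}")
    return cmd
```

The resulting list can be passed to `subprocess.run`. Note that there is no space after the `=` in each flag.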
@ganeshrajv130 1 year ago ⁺¹
I tried with this font for the Hindi language (Kruti Dev 010), and even tried Kruti Dev 016, but it shows: Error: Call PrepareToWrite before WriteTesseractBoxFile!!
@shadyas.1571 1 year ago ⁺²
Hi Gabriel.
Thank you for this tutorial.
I was trying to run the code but I'm receiving this error:
Fontconfig error: Cannot load default config file: No such file: (null)
This error appears to be font-related. I've experimented with several fonts but I'm unable to resolve this issue.
Could you help me please?
@kavachek2 1 year ago
Same problem
@pauliusliaudenskas9269 11 months ago
Have you been able to figure it out? I'm having the same problem
@kavachek2 11 months ago
@pauliusliaudenskas9269 Unfortunately, I couldn't. I don't understand how to do it.
@nobafan7515 22 days ago
Hi. There's a font used in a game that I would like to prepare for training. Is all I need to do screen-capture the words used in that font as you describe, or do I need a different approach?
@DalvinderKaur-iz5sn 1 year ago
When Tesseract training starts, it shows the warning below:
Can't encode transcription: 'पिए वई। ज़ख़मनि जो सूर वधंदो वियो हू चीखन्दो for Sindhi
How can I handle this problem?
@Leo-hk7kk 1 year ago
I want to custom-train Tesseract 5 to read car license plates that are detected using a YOLO model. How can I do this, given that I have a couple of thousand images? Help
What steps do I need to follow?
@Bengeljo 10 months ago ⁺¹
I always get an error when I want to use a font. It is installed and can be found by Windows, and even looking it up works perfectly. When I run split_training_text.py I get the following error:
Fontconfig error: Cannot load default config file: No such file: (null)
Fontconfig error: Cannot load default config file: No such file: (null)
Could not find font named 'Quadrant'.
Pango suggested font 'Cascadia Code'.
Please correct --font arg.
I want to train the model on Quadrat-Serial-Regular.ttf but it just won't recognize it. I tried to look it up but can't find it. Modifying the font flag doesn't help since it wants a name, but it can't find it even though it is there; to be honest, I don't know where it is searching for the fonts.
The folder is located on SSD E: and the operating system is on C:, but tesseract and python are on the PATH on C:, so they should have access to it. Please help
@TheComputerChip 9 months ago ⁺¹
Having the same problem. Still trying to understand what it is looking for...
@Bengeljo 9 months ago ⁺¹
@TheComputerChip I gave up, looked at another method that uses Google Colab, and created my own model there; it works pretty well. I don't remember the video anymore because between then and now I've probably watched about 250 videos. Not kidding, I don't have a life
@TheComputerChip 9 months ago ⁺²
@Bengeljo hahaha no worries. I actually ended up getting this to work. The error doesn't seem to affect the output, oddly enough. As long as it finds the font, everything still runs. Currently waiting as my PC generates the images and then I'll sleep as it trains. On video #3 since starting the image creation! lol
@ROHIT_S_Patil 6 months ago
@Bengeljo Can you share the Google Colab workflow you followed to create your model?
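When text2image reports "Could not find font named ...", it can help to ask the tool which font names it actually sees in a given directory, since `--font` expects the family name exactly as text2image reports it. A minimal sketch using the `--list_available_fonts` flag; the function names are hypothetical, and the output format is not parsed here because it is not confirmed in this thread.

```python
import subprocess
from typing import List


def list_fonts_cmd(fonts_dir: str) -> List[str]:
    """Argument list asking text2image which fonts it can see in fonts_dir."""
    return ["text2image", "--list_available_fonts", f"--fonts_dir={fonts_dir}"]


def list_available_fonts(fonts_dir: str) -> str:
    """Run text2image and return its raw output (stdout plus stderr)."""
    result = subprocess.run(
        list_fonts_cmd(fonts_dir),
        capture_output=True,
        text=True,
        check=False,  # let the caller inspect the output even on failure
    )
    return result.stdout + result.stderr
```

Calling `print(list_available_fonts("/usr/share/fonts"))` (an example path) should show the names to copy into `--font`.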
@mukilanru 5 months ago
I want to be able to OCR '±' which is being detected as '+'.
tesseract 5.4.0.20240606
pytesseract 0.3.10
python 3.12
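A model can only output characters that are in its unicharset, so a first check is whether '±' is there at all. The sketch below assumes the unicharset has already been extracted from the traineddata file (for example with `combine_tessdata -u`, which appears elsewhere in this thread) and assumes the usual layout: a count on the first line, then one entry per line beginning with the character. The function names are hypothetical.

```python
from pathlib import Path
from typing import List, Set


def read_unicharset(path: str) -> Set[str]:
    """Return the set of characters listed in an extracted unicharset file."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Assumed layout: line 0 is the entry count; each later line starts
    # with the character, followed by a space and its properties.
    return {line.split(" ", 1)[0] for line in lines[1:] if line.strip()}


def missing_characters(wanted: str, unicharset_path: str) -> List[str]:
    """List the characters from wanted that the unicharset does not contain."""
    known = read_unicharset(unicharset_path)
    return [ch for ch in wanted if ch not in known]
```

If '±' comes back as missing, the model would need to be trained or fine-tuned with text that contains it.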
@DalvinderKaur-iz5sn 2 years ago ⁺¹
The .lstmf files are missing. Please help me find where I went wrong.
@listentomusicfeellikehome 8 months ago ⁺¹
Hi. I tried this on Colab. I installed tesseract and went on to run split_training_text.py, and got this error: FileNotFoundError: [Errno 2] No such file or directory: 'text2image'. Is there a solution?
@3ombieautopilot 2 years ago ⁺¹
Thank you for making this video. But I can't wrap my head around where to put all those data files. I'm trying to fine-tune on variations of letters with accents, and I'm helpless.
@ganeshrajv130 1 year ago ⁺²
One last question: I guess Tesseract is basically not trained on handwritten text; it is trained on lines of computer-rendered text that are converted to images line by line for training. Is my assumption true?
@dhirazz 1 year ago
Hey, it seems like you were also looking to train Tesseract on handwritten text. Did you do it? If so, please shed some light; I am so lost
@ganeshrajv130 1 year ago
@dhirazz Training is not easy, as you need a huge amount of data, and they (Google) have also clearly said that training is not going to make much sense. So if you want to try adjusting the parameters, deep-dive into the C++.
@ganeshrajv130 1 year ago ⁺¹
The title says new font; can I take it to mean a new language as well? Using TIFF.
@wojd_ 1 year ago
Great tutorial. Using WSL I was constantly getting new errors. Switching to an OS installed on VirtualBox solved it. I was able to train on my dataset; it's surprisingly easy.
@heetshah9394 1 year ago
Could you help me with the directory structure? I am a bit confused about how it is set up.
@azadehpedram7215 10 months ago
I have a bunch of plates with some text on them; the goal is to convert the image to text. I trained on the special font, but it is not effective. How can I train for a better result? Thanks for the help
@Ayaangaddam 10 months ago
Thank you for doing this tutorial. Can I use the text2image approach to generate box files and tif files to train a new font for Tesseract 4.0?
@YashhBhushan 6 months ago
Buddy, I need help. I need to learn this software but I'm absolutely clueless. Are there any sources, tutorials, and videos I can watch?
@eusebiosouza2252 1 year ago
Great video!
I'm getting this error when I try to run the training command:
"Failed to read boxes from data/FE_Font-ground-truth/eng_16.tif"
The file eng_16.tif does not seem to be empty and it's very similar to all the other training files. I'm running with MAX_ITERATIONS=100, and when I delete the file that seems to be the problem, tesseract throws the same error but with a different file. Could anyone please help me?
@ganeshrajv130 1 year ago
I tried with your font but it didn't work; it throws: Could not find font named 'Arial Unicode MS Regular'.
Pango suggested font 'Liberation Mono'. I tried with Arial but that didn't work
@adityanjsg99 2 years ago ⁺¹
So far, this is the only tutorial on Tesseract 5; the old bash-based training method has been abandoned since December 2022
@faint.2396 2 years ago
So, are you saying this video is now not useful at all?
@ganeshrajv130 1 year ago ⁺¹
If I have line-wise handwritten images for any language, with bounding boxes and the words, can I train on them with this LSTM network? Will it work? And could you share your thoughts on the backbone of the LSTM architecture, with a flow diagram showing how fonts help with the training data?
@hoangcuong9521 10 months ago
Thank you for making this video. It helps me a lot. But I have a problem: when I copy it and replace the path to the save dir or language_code..training_text, all of the generated images come out as blank white images. Please help me out with this.
@gyeongwango5434 1 year ago
I want to train tesseract with an image file I have (consisting of several lines of text), but I'm not sure how to go about it, starting with creating the train data. I'd really appreciate your tips (URLs for reference, etc).
@sebastianorzechowski4613 9 months ago
Hello, is there anyone who has tried to teach Tesseract Polish characters? I have adjusted split_training_text for Tesseract 5.0 to create lines from a Polish set and then train Tesseract. I think the problem is with the font type, because it should know how to recognize those special characters:
Stripped 4 unrenderable word(s): 'unieważnienie SZKOŁAMI NADZIEJĘ, | '
I can share my adjusted script for generating those lines if you want. I will try another font. I tried HvDTrial Fabrikat Mono
@AmphibianDev 2 years ago ⁺¹
Hi, I am having issues with the last make training command. It throws an error: "No module named 'PIL'".
I have the Pillow library installed but the error is still there. I have been trying to solve this issue for a long, long time.
If you know something, I would appreciate the help. I wanted to link to my GitHub issue but I am afraid YouTube doesn't allow links.
@mohammadmn7364 1 year ago
Hey, a long time has passed, but for others having the same issue: creating a virtual env and then installing requirements.txt (from the tesstrain repo) in it may fix the issue; at least it worked for me! Also check whether all txt files have matching box files!
@PsychologicalHeat 2 years ago ⁺¹
I am receiving this error when I try to run your command:
Failed to read boxes from data/myFont-ground-truth/eng_45.tif
Error during processing.
make: *** [data/myFont-ground-truth/eng_45.lstmf] Error 1
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=myFont START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
I have added eng.traineddata to tessdata. Can you help me fix it, please?
@AstuteJoe 2 years ago ⁺¹
Did you generate the .box files successfully?
@PsychologicalHeat 2 years ago
@AstuteJoe I cleaned the box files but now I get a different error
Here is my output:
+ tesseract data/myFont-ground-truth/eng_2.tif data/myFont-ground-truth/eng_2 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_0.tif data/myFont-ground-truth/eng_0 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_5.tif data/myFont-ground-truth/eng_5 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_7.tif data/myFont-ground-truth/eng_7 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_3.tif data/myFont-ground-truth/eng_3 --psm 13 lstm.train
read_params_file: Can't open lstm.train
+ tesseract data/myFont-ground-truth/eng_1.tif data/myFont-ground-truth/eng_1 --psm 13 lstm.train
read_params_file: Can't open lstm.train
find -L data/myFont-ground-truth -name '*.lstmf' | python3 shuffle.py 0 > "data/myFont/all-lstmf"
Error: missing ground truth for training
make: *** [data/myFont/list.train] Error 1
Your help will be very appreciated 🙂
@AstuteJoe 2 years ago
@PsychologicalHeat Did you generate the .gt.txt files? Those are text files with the actual text in them
@PsychologicalHeat 2 years ago
@AstuteJoe Yes, I have all the gt.txt, .box, and .tiff files
I think the problem is that I want the ocr to read only uppercase letters?
I have made a custom training_text file and it only has numbers, '-' and uppercase letters.
I played around with it and now this is the output:
find -L data/myFont-ground-truth -name '*.gt.txt' | xargs paste -s > "data/myFont/all-gt"
unicharset_extractor --output_unicharset "data/myFont/unicharset" --norm_mode 2 "data/myFont/all-gt"
Bad box coordinates in boxfile string! 36-XR-34928-PN-54460-TN-50758-XB-02919-JP-10263-DG-99350-MF-07358-PK-31144-MB-35731-ZX-758
Extracting unicharset from plain text file data/myFont/all-gt
Other case x of X is not in unicharset
Other case r of R is not in unicharset
Other case p of P is not in unicharset
Other case n of N is not in unicharset
Other case t of T is not in unicharset
Other case b of B is not in unicharset
Other case j of J is not in unicharset
Other case d of D is not in unicharset
Other case g of G is not in unicharset
Other case m of M is not in unicharset
Other case f of F is not in unicharset
Other case k of K is not in unicharset
Other case z of Z is not in unicharset
Wrote unicharset file data/myFont/unicharset
make: *** No rule to make target `data/myFont-ground-truth/myFont_1.lstmf', needed by `data/myFont/all-lstmf'. Stop.
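Several errors in this thread ("missing ground truth", "Failed to read boxes", "No rule to make target ... .lstmf") are consistent with ground-truth files that do not line up by name or extension. A quick consistency check before running make can narrow this down. This is a sketch under assumptions: `find_incomplete_samples` is a hypothetical helper, and it assumes each sample is a `NAME.gt.txt` with a `NAME.tif` and a `NAME.box` next to it.

```python
from pathlib import Path
from typing import Dict, List


def find_incomplete_samples(gt_dir: str, image_ext: str = ".tif") -> Dict[str, List[str]]:
    """Map each sample name to its missing or empty companion files."""
    problems: Dict[str, List[str]] = {}
    for gt_file in sorted(Path(gt_dir).glob("*.gt.txt")):
        stem = gt_file.name[: -len(".gt.txt")]
        issues: List[str] = []
        if gt_file.stat().st_size == 0:
            issues.append(".gt.txt (empty)")
        for ext in (image_ext, ".box"):
            companion = gt_file.with_name(stem + ext)
            # A companion counts as missing if absent or zero bytes long.
            if not companion.is_file() or companion.stat().st_size == 0:
                issues.append(ext)
        if issues:
            problems[stem] = issues
    return problems
```

For example, `find_incomplete_samples("data/myFont-ground-truth")` returns an empty dict when every sample is complete. Images saved as `.tiff` rather than `.tif` would show up here as missing.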
@nilor7550 1 year ago
I didn't understand how to run the training command after downloading the two folders from GitHub. I have a Windows system
@physicfor 5 months ago
It will never work on Windows
@akshatjain2925 11 months ago ⁺¹
Hi, when you say we are using text2image and nothing AI, text2image must also be some kind of model, right?
@rabbitpiet7182 5 months ago
This comment isn't AI
@rabbitpiet7182 5 months ago
I mean it's not rendered with AI
@kallemyllynen9571 11 months ago
Running this on Windows I had to modify the Makefile to make it work
@Schwartz999 9 months ago
When running your python script, an error occurs:
Fontconfig error: Cannot load default config file
Fontconfig error: Cannot load default config file
Could not find font named 'Waukegan LDO Bold'.
Please correct --font arg.
How can I solve this error? I need to use my unique font "Waukegan LDO Bold.ttf"
I hope you can help me to solve this problem, thank you in advance.
@sebastianorzechowski4613 9 months ago
I think that you should install this font in your system first :)
@umandadikwatta178 2 years ago ⁺¹
Thank you very much for this. One question: can we train Tesseract with non-Unicode fonts using the same process?
@AstuteJoe 2 years ago
I'm pretty sure you can, as long as text2image works correctly. If text2image doesn't work correctly you can either come up with other clever ways (like Python scripts) of automatically generating ground truth data (.gt.txt, .box and .tif files), or, worst case, create them manually.
@umandadikwatta178 1 year ago
Hello, Can you please explain how to debug the Tesseract code, to get an idea on how the code works ?
@AstuteJoe 1 year ago
Honestly, I think your best bet is cloning the GitHub repo, reading the docs, and then delving into the code and just reading it. Eventually you'll get better at knowing where to look, and after trying hard you might be comfortable with it and understand it. I'm pretty sure the docs describe how to dump and inspect debug files for some intermediate steps. Finally, be sure to run it in verbose mode, probably -v. Ah, and you can compile it with debugging symbols too, which should help if you want to set breakpoints etc.
@DalvinderKaur-iz5sn 2 years ago
When I run the training command, it gives me the error below:
Segmentation fault (core dumped) tesseract "data/Apex-ground-truth/eng_62.tif" data/Apex-ground-truth/eng_62 --psm 13 lstm.train
Makefile:262: recipe for target 'data/Apex-ground-truth/eng_62.lstmf' failed
make: *** [data/Apex-ground-truth/eng_62.lstmf] Error 139
Can you help me to fix this?
@xzerozdead 1 year ago
Your folder was probably named "Apex" and not "Apex-ground-truth"
@IshaqKhan010 1 year ago
Brother, could you train for the Urdu Nastaliq font? There is no accurate trained data on the net, please
@ivanmongebadilla9454 2 years ago ⁺¹
Thanks for the tutorial, Gabriel. I wanted to ask how I could do this process if I have the text in images. I guess I need to create the .txt file and the .box file and then just run the training command.
Do you know any software that I could use to create the .box file from the images I have?
Thanks in advance!
@AstuteJoe 2 years ago
I have seen people use the jTessBoxEditor: vietocr.sourceforge.net/training.html
@ivanmongebadilla9454 2 years ago
@AstuteJoe One more question: how would you use the newly trained model in Python?
Thank you
@AstuteJoe 2 years ago ⁺¹
@ivanmongebadilla9454 I think it's just a parameter, lang='your_new_model_name', as long as the new model is in the tessdata folder
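The reply above can be sketched with pytesseract, which accepts the model name through `lang` and extra command-line options through `config`. This assumes pytesseract and Pillow are installed; `run_custom_model`, `tessdata_config`, and the paths in the usage note are hypothetical.

```python
from typing import Optional


def tessdata_config(tessdata_dir: Optional[str]) -> str:
    """Build the extra config string that points Tesseract at a tessdata folder."""
    return f'--tessdata-dir "{tessdata_dir}"' if tessdata_dir else ""


def run_custom_model(image_path: str, model_name: str,
                     tessdata_dir: Optional[str] = None) -> str:
    """OCR one image with a custom model, e.g. model_name='Apex'."""
    # Imported lazily so the helper above works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(
            img, lang=model_name, config=tessdata_config(tessdata_dir)
        )
```

A call such as `run_custom_model("plate.png", "Apex", "data")` would expect `data/Apex.traineddata` to exist. Without `tessdata_dir`, the model has to be in Tesseract's default tessdata folder.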
@heetshah9394 1 year ago
Does the box file need one bounding box per character, or is one bounding box per word okay?
@NotFlashYT 1 year ago
How do you get suggestions in your terminal for auto-completion of commands?
@AstuteJoe 1 year ago
fishshell.com/
@wonkduck4759 1 year ago
Hi Gabriel! Thank you so much for the video. One question: where did you put your Apex Legends ttf file in the code directory? Where should it be placed? I have a custom font ttf file that I want to train on
@rcraftg4mer42 1 year ago
Did you find any answers?
@datarkmveri2228 2 years ago ⁺¹
Hi,
When I try to run the training command it gives an error. Can you please help me?
Config file is optional, continuing...
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1
@datarkmveri2228 2 years ago ⁺²
command : TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=100
combine_tessdata -u ../tesseract/tessdata/eng.traineddata data/eng/Apex
@datarkmveri2228 2 years ago
tesseract "data/Apex-ground-truth/eng_44.tif" data/Apex-ground-truth/eng_44 --psm 13 lstm.train
+ tesseract data/Apex-ground-truth/eng_44.tif data/Apex-ground-truth/eng_44 --psm 13 lstm.train
python3 shuffle.py 0 "data/Apex/all-lstmf"
+ head -n 90 data/Apex/all-lstmf
+ tail -n 10 data/Apex/all-lstmf
combine_lang_model \
--input_unicharset data/Apex/unicharset \
--script_dir data/langdata \
--numbers data/Apex/Apex.numbers \
--puncs data/Apex/Apex.punc \
--words data/Apex/Apex.wordlist \
--output_dir data \
\
--lang Apex
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Loaded unicharset of size 113 from file data/Apex/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Other case FI of fi is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Warning: properties incomplete for index 3 = C
Warning: properties incomplete for index 4 = H
Warning: properties incomplete for index 5 = E
Warning: properties incomplete for index 6 = S
Warning: properties incomplete for index 7 = -
Warning: properties incomplete for index 8 = R
Warning: properties incomplete for index 9 = I
Warning: properties incomplete for index 10 = K
Warning: properties incomplete for index 11 = N
Warning: properties incomplete for index 12 = G
Warning: properties incomplete for index 13 = B
Warning: properties incomplete for index 14 = 8
Warning: properties incomplete for index 15 = 5
@АлексейПетров-ч1и5д 2 years ago
@datarkmveri2228 Solved it: you need to run this in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
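The failure above names two files under `data/langdata`. A small preflight check can confirm whether they are present before starting a long training run. This is a sketch: `missing_langdata` is a hypothetical helper, and the file list is taken only from the error messages quoted in this thread.

```python
from pathlib import Path
from typing import List


def missing_langdata(data_dir: str = "data", script: str = "Latin") -> List[str]:
    """Return the langdata files named in the error log that are not present."""
    langdata = Path(data_dir) / "langdata"
    required = [
        langdata / "radical-stroke.txt",
        langdata / f"{script}.unicharset",
    ]
    return [str(path) for path in required if not path.is_file()]
```

If the returned list is not empty, running `make tesseract-langdata` in the tesstrain folder, as suggested above, is what fixed it for the commenter.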
@saviomilbratz 5 months ago ⁺¹
Training Tesseract is almost an impossible task.
There could be an easier way just using Python or something simpler.
For a regular Windows user like me, this task is almost impossible.
@DalvinderKaur-iz5sn 2 years ago
Thanks for the tutorial, Sir. I get an error after running the training command: TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000. The error is:
"CMakefile:325: recipe for target 'data/foo/checkpoints/foo_checkpoint' failed". Also: "Encoding of string failed! Failure bytes..." and "Can't encode transcription: ...". Can you please help me with these issues?
@DalvinderKaur-iz5sn 2 years ago
MODEL_NAME=foo
@ganeshrajv130 1 year ago
Can we train Tesseract without any font? If not, why can't we?
@snoopi6243 2 years ago
Is there any way to fine-tune for RTL languages/fonts on Windows just like this?
@physicfor 5 months ago
On Windows, text2image will never find the font name, so better install a Linux virtual machine
@Bobo-wl6bs 1 year ago
Hi Gabriel. I came across Tesseract today. I'm curious: will I be able to train it to learn an Arabic font? I have a bunch of PDFs which are written in an indigenous language. The idea here is to train it on some sample pages so that it will be able to read it. It includes diacritics, so I'm not sure if it will work.
@AstuteJoe 1 year ago
Check the comments, a bunch of people have trained it for this exact purpose
@legendevent3911 2 years ago
Hey Gabriel, I have a training_text file with just digits like 1,234,567 in a variety of combinations. The problem is that when I try to start your script I get the following error message:
python3 split_training_text.py
Traceback (most recent call last):
File "split_training_text.py", line 12, in
for line in input_file.readlines():
File "/usr/lib/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
Could you help me resolve this? I'm a newbie in Python.
The tutorial was great!
Edit: When I change the script to: with open(training_text_file, 'rb') I get a new error: TypeError: write() argument must be str, not bytes
@AstuteJoe 2 years ago
Can you send me the whole file? Pastebin or GitHub will do. I believe I know exactly how to fix it, but I need the whole file to send you the fixed version
@abdeldjalilchougui 1 year ago
Did you solve the problem? If yes, could you share it with me, please?
@abdeldjalilchougui 1 year ago
@AstuteJoe Did you solve the problem? If yes, could you share it with me, please?
@sebastianorzechowski4613 9 months ago
I think you have to add encoding='utf-8' inside the open function:
with open(training_text_file,'r',encoding='utf-8') as input_file:
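A byte 0xff at position 0 is consistent with a UTF-16 byte order mark, which some Windows editors write when saving as "Unicode". In that case forcing UTF-8 would still fail. The sketch below, with a hypothetical `read_training_text` helper, checks for a BOM and decodes accordingly; that the commenter's file is really UTF-16 is an assumption.

```python
import codecs


def read_training_text(path: str) -> str:
    """Read a training text file, honoring a UTF-8 or UTF-16 byte order mark."""
    with open(path, "rb") as handle:
        raw = handle.read()
    if raw.startswith(codecs.BOM_UTF8):
        return raw.decode("utf-8-sig")  # drops the UTF-8 BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")     # BOM selects the byte order
    return raw.decode("utf-8")          # no BOM: assume plain UTF-8
```

The returned string can be split into lines as before. Re-saving the file as UTF-8 in the editor would avoid the problem entirely.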
@Ethiopic 1 year ago
Thank you for this video. I am now able to train Tesseract to OCR my language data on the Mac. This works great on both Linux and the Mac. (But I am unable to do so on Windows because I get the error "tessdata_prefix not recognized".)
@wonkduck4759 1 year ago
Hello, I am currently stuck. Where did you put your new font ttf file in the code directory? Where should it be placed? I have a custom font ttf file that I want to train on.
@alirezanadafy9267 1 year ago
Hi
Just run:
set TESSDATA_PREFIX="../tesseract/tessdata"
and then run the text2image....
@KINGERTADC_yay 1 year ago
Hey Gabriel, nice vid. I am actually using it to train Tesseract on the Aurebesh font/language from Star Wars (look it up, it would explain a lot); each letter has a corresponding English letter. I have collected roughly 100,000 sentences using your program and trained with the command you provided, but when I run it on a 6-letter word it completely melts down and just outputs the wrong answer. I have tried both small and large iteration counts, but no luck. I am wondering if you can help me or point me in the right direction. Thanks a lot
@ganeshrajv130 1 year ago ⁺¹
Hey, you collected the font, but what is the training text data? Is it in Aurebesh?
@kinderpinguiin7064 1 year ago
Hi! I don't know if you've resolved your issue in the past month, but don't forget to set a large MAX_ITERATIONS for make training. I personally set it to 10000 and it was quite a bit better; it might well be enough for you if you have 100000 sentences. If you want to follow the results, check the log while the model is training, for example:
At iteration 7800/7800/7800, Mean rms=5.642000%, delta=49.022000%, BCER train=97.817000%,
BWER train=100.000000%, skip ratio=0.000000%, New best BCER = 97.817000 wrote checkpoint.
BCER is the character error rate and BWER is the word error rate. You can see that at iteration 7800 it was still higher than 95%; after the 9500th iteration I got several improvements.
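To track these numbers over a long run, the progress lines can be parsed from the log. A minimal sketch, assuming the log format is exactly as quoted above; `parse_progress` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Matches the iteration counter plus the BCER and BWER percentages.
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/\d+/\d+.*?BCER train=([\d.]+)%.*?BWER train=([\d.]+)%",
    re.DOTALL,  # the BWER part may be wrapped onto the next line
)


def parse_progress(log_text: str) -> Optional[Dict[str, float]]:
    """Extract iteration, BCER and BWER from one progress entry, if present."""
    match = PROGRESS_RE.search(log_text)
    if match is None:
        return None
    return {
        "iteration": int(match.group(1)),
        "bcer": float(match.group(2)),
        "bwer": float(match.group(3)),
    }
```

Applied to each entry of the training log, this gives a series that shows whether the error rates are still falling when MAX_ITERATIONS is reached.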
@prakashchavda2813 1 month ago
I guess a Linux machine is a must for training Tesseract 5, because it's not working on Windows.
@insidethoughts502 2 years ago
Can Tesseract 5 help with detecting only bold text in images?
@AstuteJoe 2 years ago
Only experimentation will tell, but Tesseract 5 does perform better sometimes
@3ombieautopilot 2 years ago ⁺¹
Hello! Can you make a video about how to make Tesseract recognize a character that is not in eng.traineddata? Like ± or Ó mixed with some English text
@adityanjsg99 2 years ago
Train it and then use it
@farazsoftinfo 2 years ago ⁺¹
Hi Gabriel,
Thanks for making this tutorial, I was waiting for it.
I will start training my model soon. 😍
But how can we fine-tune a model?
Can you please show me how I can combine this newly trained file with another model?
@AstuteJoe 2 years ago
Glad you liked it! In this tutorial you can see I actually fine-tuned, I started on the eng.traineddata file from Tesseract and trained it further on a new font, this should be enough for most cases.
@farazsoftinfo 2 years ago
@AstuteJoe Hi Gabriel, when I fine-tune I get a very bad result. I just want to add some new words and some characters, but the final file that I get is worse than the main traineddata file.
I'm trying to fine-tune an RTL language.
Thanks a lot.
@AstuteJoe 2 years ago
@farazsoftinfo That's a very different rabbit hole, that's ML technique territory. You might be overfitting (training too much) or underfitting (training too little) your model. Have you tried generating all the 193k PDFs to train on and leaving it to train for a bit?
@gabriel2011gabriel 2 years ago
@farazsoftinfo I'm trying to do the same thing and the result is a bunch of "mmmoooomom...". Is yours the same?
@farazsoftinfo 2 years ago ⁺¹
@gabriel2011gabriel I tried it for Persian, but I couldn't get a good result. The main models are still better than what I got. When I try to add some new words and fonts I get a worse model. Maybe I should look into it more to figure out the best settings for RTL languages.
@blndazeez1973 2 years ago
Hi Gabriel,
Great video! One question: when I try to retrain the Arabic model using this command
"TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=ara TESSDATA=../tesseract/tessdata MAX_ITERATIONS=200"
it gives me the error below:
"Error opening data file ../tesseract/tessdata/eng.traineddata"
The problem is that I am not using the English model.
Thanks for the video again!
@AstuteJoe 2 years ago
That's really odd, I see you changed the START_MODEL so it should work, not super sure now
@AstuteJoe 2 years ago
Do you have ara.traineddata in the tessdata folder?
@blndazeez1973 2 years ago
@AstuteJoe Yes, I have it, and I made sure of it a couple of times
@AstuteJoe 2 years ago
@blndazeez1973 Maybe it's because the Apex model was already created when you were trying it out? And it's already on top of the eng trained data?
@blndazeez1973 2 years ago ⁺¹
@AstuteJoe I redid the steps with a different model name but it gives me the same error; that is strange.
@TuanLe-ve7lm 2 years ago
Hi Gabo, may I please see your fonts.conf file?
@AstuteJoe 2 years ago
I'm not even sure what this file is now, but here you go. This one is in my home folder:
/home/gabri/tesseract_training/apex_legends.otf
@AstuteJoe 2 years ago
This one is in the tesseract project folder:
@TuanLe-ve7lm 2 years ago
I have made good progress today: I am able to train the Apex font. However, when I switch to another font, Noto Sans, it is able to generate the box and tif files but shows an error while training: "Makefile:219: *** found no data/Noto Sans-ground-truth/*.gt.txt for Sans/all-gt. Stop." It seems it does not accept a font name with a space in the middle.
@AstuteJoe 2 years ago
@TuanLe-ve7lm That could definitely be it, spaces and Linux (or Windows) don't mix well
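Make splits its arguments on whitespace, which matches the error above where "Noto Sans-ground-truth" was cut at the space. One way around it, sketched here as an assumption, is to keep the real font name for text2image's `--font` flag but derive a space-free name for MODEL_NAME and the ground-truth directory. `safe_model_name` is a hypothetical helper.

```python
import re


def safe_model_name(font_name: str) -> str:
    """Turn a font name such as 'Noto Sans' into a Make-friendly model name."""
    # Replace every run of characters other than letters, digits and
    # underscores with a single underscore, then trim stray underscores.
    name = re.sub(r"[^A-Za-z0-9_]+", "_", font_name).strip("_")
    if not name:
        raise ValueError(f"cannot derive a model name from {font_name!r}")
    return name
```

With this, "Noto Sans" becomes "Noto_Sans", so the ground truth would live in `data/Noto_Sans-ground-truth` and training would use `MODEL_NAME=Noto_Sans`.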
@_nom_ 1 year ago
No rule to make target 'data/eng-ground-truth/eng.training_text.lstmf'
@ManuthVANN 11 months ago
Thanks so much, sir, for your clear explanation and code
@milankoncsard5701 3 months ago
Were you able to find a fix for this?
@ikedoriens6149 2 years ago
Jesus. Isn't there just a command-line option like in Tesseract 4.0?
This seems a bit complicated for someone who's not into programming.
@cryptoplusone3850 2 years ago
Does this also work on Windows, or do I have to use a different method?
@AstuteJoe 2 years ago
I believe it works, but definitely not every step exactly as in the video. As far as I remember, the Tesseract maintainers highly recommend Linux instead
@focusofLandD 2 years ago
I tried on Windows and it's not working very well. Please let me know if you are able to solve it
@asiburrahman3623 2 years ago
I didn't get the font part. Where did you put the font?
@AstuteJoe 2 years ago
It has to be installed on your system, each OS will have a different way of doing it
@asiburrahman3623 2 years ago ⁺¹
@AstuteJoe I'm using Ubuntu. Is there any way to specify the directory?
@AstuteJoe 2 years ago
@asiburrahman3623 askubuntu.com/questions/3697/how-do-i-install-fonts
@asiburrahman3623 2 years ago ⁺²
@AstuteJoe I have installed the font but this error still shows:
Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
Fontconfig warning: "/tmp/fonts.conf", line 4: empty font directory name ignored
Could not find font named 'Apex'.
@kannapatudompant8535 2 years ago
@asiburrahman3623 I also had the same problem.
I tried adding '--fontconfig_tmpdir={fontconf_dir}'. The default is /tmp, which doesn't have our font directory in it.
fonts.conf is usually located in etc/share/fonts.
Now I can create the .box and .tif files.
I hope this solution solves your issue too.
@kurobane_sama 5 months ago
Impossible to use a language other than English :(
@ahmetfatih4121 2 months ago ⁺¹
I can feel your pain bro, my heart breaks every time your voice breaks :( Dealing with all those endless instructions, terminal commands designed by some d*ck head to make life miserable for all of us, and just all kinds of bullshit. You have my sympathy.
@datarkmveri2228 2 years ago
please help
@Kronzplayz. 2 years ago
Kindly help, I'm getting an error while training, please @AstuteJoe
Failed to read data from: data/OCRA/OCRA.wordlist
Failed to read data from: data/OCRA/OCRA.punc
Failed to read data from: data/OCRA/OCRA.numbers
Loaded unicharset of size 112 from file data/OCRA/unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Failed to load script unicharset from:data/langdata/Latin.unicharset
Config file is optional, continuing...
Failed to read data from: data/langdata/OCRA/OCRA.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/OCRA/OCRA.traineddata] Error 1
@Kronzplayz. 2 years ago
I solved this issue 😅
@enriqueortiz5875 2 years ago
@@Kronzplayz. How did you solve it? I got the same issue.
@АлексейПетров-ч1и5д 2 years ago
@@enriqueortiz5875 Solved it: you need to run these in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
@rcraftg4mer42 1 year ago
i love you
@AstuteJoe 1 year ago
lol i love you too
@АлексейПетров-ч1и5д 2 years ago
Hello, how do I fix this?
Failed to read data from: data/langdata/Apex/Apex.config
Failed to read data from: data/langdata/radical-stroke.txt
Error reading radical code table data/langdata/radical-stroke.txt
make: *** [Makefile:293: data/Apex/Apex.traineddata] Error 1
@АлексейПетров-ч1и5д 2 years ago ⁺⁴
Solved it: you need to run these in the tesstrain folder:
make leptonica tesseract
make tesseract-langdata
@tsaitsai6666 1 year ago
thanks
@PratibhaVaradkar 1 year ago
Hi Gabriel (@AstuteJoe), thank you for the detailed tutorial.
I have a question, though. I followed the tutorial and generated the .tif, .gt.txt and .box files manually. My training quits with a zero error rate before reaching the max iterations. But when I use the generated traineddata file, it gives the error "Error: Tesseract (legacy) engine requested, but components are not present in /use/share/tesseract-ocr/5/tessdata/lang_name.traineddata!! Failed loading language 'lang_name' Tesseract couldn't load any languages! Could not initialize tesseract."
Can you please suggest what I missed?
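One possible cause, offered as a guess rather than a diagnosis: a model produced by LSTM training contains only the LSTM component, so any call that requests the legacy engine cannot load it. The sketch below builds a tesseract command that selects the LSTM engine explicitly with --oem 1. The helper name, image path, and tessdata directory are hypothetical placeholders.

```python
def lstm_only_command(image, lang, tessdata_dir, psm=7):
    """Build a tesseract command line that requests the LSTM engine only.

    Hypothetical helper: it returns the argument list without running it.
    """
    return [
        "tesseract",
        str(image),
        "stdout",
        "--tessdata-dir", str(tessdata_dir),
        "--psm", str(psm),
        # OCR engine mode 1 = neural-net (LSTM) engine only, no legacy engine.
        "--oem", "1",
        "-l", lang,
    ]


cmd = lstm_only_command(
    image="sample_line.tif",                    # placeholder image
    lang="lang_name",                           # model name from the comment above
    tessdata_dir="/path/to/your/tessdata",      # placeholder directory
)
print(" ".join(cmd))
```

If a wrapper library is in use, the same setting would have to be passed through whatever option it exposes for the engine mode.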
@focusofLandD 2 years ago
Hi Gabriel, I am getting this error at the last training step when trying to train a new font called Bender:
Failed to read data from : data/bender/bender.worldlist
Failed to read data from : data/bender/bender.punc
Failed to read data from : data/bender/bender.numbers
Failed to read data from : data/bender/bender.config
Invalid format in radical table at line 0: 19886 3 23 6 3
@notAvn 1 year ago
Did you manage to train Tesseract for Bender yet?
@faint.2396 2 years ago
Hi I'm getting this error:
Traceback (most recent call last):
  File "C:\Users\HAVASIZ\Desktop\tesseract_tutorial\split_training_text.py", line 34, in <module>
    subprocess.run([
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 501, in run
    with Popen(*popenargs, **kwargs) as process:
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 969, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\HAVASIZ\AppData\Local\Programs\Python\Python310\lib\subprocess.py", line 1438, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2]
@TuanLe-ve7lm 2 years ago
Same for me. Have you found a solution yet?
@faint.2396 2 years ago
@@TuanLe-ve7lm No, sadly I gave up on training Tesseract 5. I'm going to try to learn how to train Tesseract 4 because there are a lot more videos about it on YouTube.
@faint.2396 1 year ago
@@TuanLe-ve7lm I actually fixed the issue by using Linux. But now I get other errors lol
@abdeldjalilchougui 1 year ago
@@faint.2396 Did you fix your problem?
@sebastianorzechowski4613 8 months ago
I think it could be related to text2image itself. You have to provide the path to text2image.exe, which is generally located in the Tesseract installation directory.
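The suggestion above can be sketched as a small lookup that runs before the subprocess call: use an explicit path if one is given, otherwise search PATH, and fail with a readable message instead of a bare WinError 2. The function name resolve_tool is hypothetical, and the demo line uses the Python interpreter only as a stand-in for a file that is known to exist.

```python
import shutil
import sys
from pathlib import Path


def resolve_tool(name, explicit_path=None):
    """Return a usable path for an external tool or raise a clear error.

    Hypothetical helper: explicit_path wins if given, otherwise PATH is searched.
    """
    if explicit_path is not None:
        candidate = Path(explicit_path)
        if candidate.is_file():
            return str(candidate)
        raise FileNotFoundError(f"{name} not found at {candidate}")
    found = shutil.which(name)  # returns None when the tool is not on PATH
    if found is None:
        raise FileNotFoundError(
            f"{name} is not on PATH; pass the full path to the executable instead"
        )
    return found


# Demo with a file that certainly exists; for the tutorial script you would
# pass the full path to your own text2image executable here.
print(resolve_tool("python", explicit_path=sys.executable))
```

The returned string can replace the bare 'text2image' entry at the start of the argument list passed to subprocess.run.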
@ganeshrajv130 1 year ago
read_params_file: Can't open make
read_params_file: Can't open training
read_params_file: Can't open MODEL_NAME=nakula_hin
read_params_file: Can't open START_MODEL=hin
read_params_file: Can't open TESSDATA=/usr/local/share/tessdata/
read_params_file: Can't open MAX_ITERATIONS=10
Error, cannot read input file TESSDATA_PREFIX: No such file or directory
Error during processing.
This is the error I get even though I followed your steps.
@athosmba1766 1 year ago
When I use the command TESSDATA_PREFIX=.../tesseract/tessdata make training model_NAME=Apex Start_MODEL=eng TESSDATA=.../tesseract/tessdata MAX_INTERATION=100, it doesn't work and gives an error about the command TESSDATA=........
@athosmba1766 1 year ago
Can someone help me?
@Ethiopic 1 year ago
Are you getting a "not recognized" error? I am getting the same error on Windows. The exact same command works fine on the Mac. Very strange. Did you find a solution?
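Two things may explain the failures in this thread. Make variable names are case-sensitive, so model_NAME, Start_MODEL and MAX_INTERATION are not the same variables as the MODEL_NAME, START_MODEL and MAX_ITERATIONS used in other comments on this page. The leading VAR=value prefix is also POSIX shell syntax, which the Windows command prompt does not understand. The sketch below sidesteps both by building the command in Python and passing TESSDATA_PREFIX through the environment. The function name is hypothetical and the tessdata path is a placeholder taken from another comment here.

```python
import os


def training_command(model_name, start_model, tessdata, max_iterations):
    """Build the tesstrain 'make training' call plus its environment.

    Hypothetical helper: returns (argv, env) without running anything.
    """
    # Copy the current environment and add TESSDATA_PREFIX to it.
    env = dict(os.environ, TESSDATA_PREFIX=str(tessdata))
    argv = [
        "make",
        "training",
        f"MODEL_NAME={model_name}",
        f"START_MODEL={start_model}",
        f"TESSDATA={tessdata}",
        f"MAX_ITERATIONS={max_iterations}",
    ]
    return argv, env


argv, env = training_command("Apex", "eng", "../tesseract/tessdata", 100)
print(" ".join(argv))
# To actually run it from inside the tesstrain folder:
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```

This still requires make and the Tesseract training tools to be installed, which is one reason Linux or WSL tends to be the easier route.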
@utkarshmishra6194 1 year ago
Hi Gabriel, hope you're doing well.
I ran this command:
TESSDATA_PREFIX=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata make training MODEL_NAME=Apex START_MODEL=eng TESSDATA=/mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata MAX_ITERATIONS=400
But I am getting this error:
Failed to read data from: data/Apex/Apex.wordlist
Failed to read data from: data/Apex/Apex.punc
Failed to read data from: data/Apex/Apex.numbers
Failed to read data from: data/langdata/Apex/Apex.config
Null char=2
lstmtraining \
--debug_interval 0 \
--traineddata data/Apex/Apex.traineddata \
--old_traineddata /mnt/c/Users/Asus/PycharmProjects/tesseract_tutorial/tesseract/tessdata/eng.traineddata \
--continue_from data/eng/Apex.lstm \
--learning_rate 0.0001 \
--model_output data/Apex/checkpoints/Apex \
--train_listfile data/Apex/list.train \
--eval_listfile data/Apex/list.eval \
--max_iterations 1000 \
--target_error_rate 0.01
Failed to load list of training filenames from data/Apex/list.train
make: *** [Makefile:319: data/Apex/checkpoints/Apex_checkpoint] Error 1
@nithyavenugopal6834 1 year ago
Hi, were you able to solve this error? If so, how?
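One possible cause of "Failed to load list of training filenames" is that no training pairs were found, so list.train was never filled. The sketch below assumes the layout seen elsewhere on this page (data/eng-ground-truth, data/coc-ground-truth), that is, a data/Apex-ground-truth folder holding one .gt.txt file per .tif image. The function name is hypothetical.

```python
from pathlib import Path


def check_ground_truth(ground_truth_dir):
    """Compare .tif and .gt.txt files by base name.

    Returns (paired, images_without_text, texts_without_image) as sorted lists.
    A missing directory simply yields three empty lists.
    """
    root = Path(ground_truth_dir)
    images = {p.name[: -len(".tif")] for p in root.glob("*.tif")}
    texts = {p.name[: -len(".gt.txt")] for p in root.glob("*.gt.txt")}
    return (
        sorted(images & texts),
        sorted(images - texts),
        sorted(texts - images),
    )


if __name__ == "__main__":
    # Assumed location, following the data/<MODEL_NAME>-ground-truth pattern.
    paired, no_text, no_image = check_ground_truth("data/Apex-ground-truth")
    print(f"{len(paired)} complete pairs")
    print(f"{len(no_text)} images without a .gt.txt file")
    print(f"{len(no_image)} .gt.txt files without an image")
```

If the script reports zero complete pairs, the files are probably in a different folder or named differently from what the Makefile expects.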
@vishnubalaji9500 2 years ago ⁺²
Understood jack shit from this video; it needs more dumbing down.
@faint.2396 2 years ago ⁺⁵
fr, and I did every step the same and I'm getting errors. Why isn't training Tesseract 5 as simple as Tesseract 4? And the thing is, there's only ONE video on how to train Tesseract 5 and it's this one.
@sayantanbiswas9702 8 months ago
tesseract data/coc-ground-truth/eng_2.tif stdout --tessdata-dir /home/godmode2/tesseract_tutorial/tesstrain/data --psm 7 -l coc --loglevel ALL
@sayantanbiswas9702 8 months ago
TESSDATA_PREFIX=../tesseract/tessdata make training MODEL_NAME=coc START_MODEL= eng TESSDATA=../tesseract/tessdata MAX_ITERATIONS=10000