Videos: 2
Views: 55,404

Creating Tesseract OCR using Python: part-1 installing and getting started with Tesseract

Tesseract OCR - Lesson 2: Training Tesseract for new font

jTessBox Editor: sourceforge.net/projects/vietocr/files/jTessBoxEditor/
Step 1: Make box files for the images that we want to train on
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] batch.nochop makebox
Eg: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox
{*Note: After making the box files, we have to correct any wrongly identified characters in them.}
Step 2: Create the .tr file (combining the image file and the box file)
Syntax: tesseract [langname].[fontname].[expN].[file-extension] [langname].[fontname].[expN] box.train
Eg: tesseract train.my.exp0.tif train.my.exp0 box.train
Step 3: Extract the charset from the box files (Output for this command i...
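The naming convention used in Steps 1 and 2 can be sketched in Python. This is a minimal sketch that only builds the argument lists shown in the examples above; the helper names `base_name`, `makebox_cmd` and `boxtrain_cmd` are hypothetical, and actually running the lists (for example with `subprocess.run`) assumes that `tesseract` is on the PATH.

```python
def base_name(lang, font, exp):
    """Return the [langname].[fontname].exp[N] stem shared by both steps."""
    return f"{lang}.{font}.exp{exp}"


def makebox_cmd(lang, font, exp, ext="tif"):
    """Step 1: argument list that asks Tesseract to write a box file."""
    stem = base_name(lang, font, exp)
    return ["tesseract", f"{stem}.{ext}", stem, "batch.nochop", "makebox"]


def boxtrain_cmd(lang, font, exp, ext="tif"):
    """Step 2: argument list that produces the .tr file from image and box file."""
    stem = base_name(lang, font, exp)
    return ["tesseract", f"{stem}.{ext}", stem, "box.train"]


if __name__ == "__main__":
    # Reproduces the two example commands for langname "train", fontname "my".
    print(" ".join(makebox_cmd("train", "my", 0)))
    print(" ".join(boxtrain_cmd("train", "my", 0)))
```

Keeping the stem in one helper makes it harder for the image name and the output base to drift apart between the two steps.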

Videos

Creating Tesseract OCR using Python: part-1 installing and getting started with Tesseract

4:14

5K views, 3 years ago

This video is the first part of the series to create an OCR using Tesseract in Python. In this video, I will be teaching how to download and install Tesseract and get started with it. Index of Tesseract: digi.bib.uni-mannheim.de/tesseract/

@shivmangalyadav4427 26 days ago
Can you share your training dataset?
@rahulpandey-bj7ps 1 month ago
We can use this command to create the box file instead of an external tool: tesseract /Users/sonam/Downloads/train_sample.png train_sample batch.nochop makebox
@computerz009 1 month ago
this was very helpful thank you very much!
@dikyarif8027 1 month ago
damn you're my hero
@jessd.294 4 months ago
This worked perfectly for me! I trained a model to decipher text from the Gravity Falls ARG (I didn't want to do the soul contract by hand). It needs a little fine tuning, but in the end, it gave me the majority of the text correctly! Thank you!
@CookWithKuroOfficial 5 months ago
good job bro (y)
@davidwebchile 8 months ago
Thanks! Please upload part 3
@madhurgoel66 8 months ago
Very good video. Please continue your channel and make more such videos.
@MohammadJAbuNasserMAAN 8 months ago
Thank you for the video. What if I want to train on multiple images and get a single trained file as the result?
@rockntt783 9 months ago
I run the mftraining command and it only says there is no shape table file, and then nothing happens
@banguncool 8 months ago
I'm facing the same issue.
@youtubehodol3386 11 months ago
Facing error: 'tesseract' is not recognized as an internal or external command, operable program or batch file.
@dalinsixtus6752 11 months ago
Set the path correctly: search for "path" in Windows search, then in the environment variables open the Path entry and create a new path (e.g. c:/programfiles/tesseractocr)
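Whether the PATH fix described above has taken effect can be checked from Python before running any training command. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` and the idea of passing an extra install directory are assumptions for illustration, not part of the video.

```python
import os
import shutil


def find_executable(name="tesseract", extra_dir=None):
    """Return the full path of `name`, or None if it cannot be found.

    If extra_dir is given, it is appended to the search path for this
    lookup only; the real PATH of the process is left untouched.
    """
    search_path = os.environ.get("PATH", "")
    if extra_dir:
        search_path = search_path + os.pathsep + extra_dir
    return shutil.which(name, path=search_path)


if __name__ == "__main__":
    location = find_executable()
    if location is None:
        print("tesseract is not on the PATH; add its install folder and reopen the terminal")
    else:
        print("tesseract found at", location)
```

A result of `None` corresponds to the "'tesseract' is not recognized" error quoted above.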
@smklearn-hy9me 11 months ago
Hey, how can I combine two traineddata files into a single traineddata file?
@cuongo1094 1 year ago
Thank you very much
@Leo-hk7kk 1 year ago
How can I use this custom-trained Tesseract model with YOLOv8 to recognize license plate numbers? Please help
@dalinsixtus6752 11 months ago
Did you find the solution?
@Leo-hk7kk 11 months ago
@dalinsixtus6752 No Sir
@Kenbreg 1 year ago
Is this some sort of joke? You downloaded jTessBoxEditor and then did the whole process in a command line. What the hell is the purpose of jTessBoxEditor then??
@jagajitrabha1276 11 months ago
To edit the bounding boxes. You can add bounding boxes wherever necessary when training for new languages.
@larabassabah202 10 months ago
You need jTessBoxEditor to correct the data, because if you train before correcting it, training will fail
@logeshpaul 1 year ago
This helped a lot in understanding the generation process of traineddata. Thank you!
@saviomilbratz 1 year ago
Thanks a lot for the video! Did you give up on making part 3?! You should do it! Congratulations!
@Rocketos 1 year ago
Where is part 3?
@HocineFerradj 1 year ago
you saved my code & my day ... thanks ( stdout is a masterpiece )
@shikhugupta2703 1 year ago
How can we train the model with a specific user's handwritten data?
@MishaDisable 1 year ago
Good tutorial, one of the best, thanks!
@meggiotto 1 year ago
Note: if the shapetable file wasn't created, you need to run the shapeclustering command to generate it. Example: shapeclustering -F <font_properties file created previously> -U <unicharset created previously> <tr file created previously> or, on Windows: shapeclustering.exe -F <font_properties file created previously> -U <unicharset created previously> <tr file created previously>
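The shapeclustering invocation from the comment above can be assembled in Python so that the three file arguments stay in the right order. This is a minimal sketch: the helper name `shapeclustering_cmd` is hypothetical, the `-F` and `-U` flags are taken from the comment itself, and the file names in the usage lines are assumptions based on the names used elsewhere on this page.

```python
def shapeclustering_cmd(font_properties, unicharset, tr_files, exe="shapeclustering"):
    """Build the argument list: -F font properties, -U unicharset, then the .tr files."""
    if not tr_files:
        raise ValueError("at least one .tr file is required")
    return [exe, "-F", font_properties, "-U", unicharset, *tr_files]


if __name__ == "__main__":
    # Assumed file names; substitute the ones created in your own earlier steps.
    cmd = shapeclustering_cmd("font_properties", "unicharset", ["train.my.exp0.tr"])
    print(" ".join(cmd))
    # On Windows the comment uses shapeclustering.exe as the program name.
    print(" ".join(shapeclustering_cmd("font_properties", "unicharset",
                                       ["train.my.exp0.tr"], exe="shapeclustering.exe")))
```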
@samuelbastias3752 1 year ago
Hey, thanks for your contribution! I still haven't been able to finish the process because, even after running your command, shapetable doesn't seem to generate. It's only generated after I run the next command (step 5), but the other two files in the video are not created. When I try to run the command again, I get an error saying "Failed to read shape table shapetable". Do you know why this may be?
@TuanLe-ve7lm 2 years ago
great video, waiting for Lesson 3
@jehangirkhankhattak8002 2 years ago
When I copy and paste this command into cmd: tesseract train.my.exp0.tif train.my.exp0 batch.nochop makebox, it says that it doesn't recognize it
@adityanjsg99 2 years ago
You saved me like a week's worth of work!
@gokulkumar8232 2 years ago
I am trying to train Tesseract on a Linux machine, and I am getting a segmentation fault in Step 5.
@subramanyagopalbellary9845 2 years ago
Hi, I'm getting the error "APPLY_BOXES: boxfile line 6/25 ((421,1325),(494,1378)): FAILURE! Couldn't find a matching blob" while creating the .tr file. If anyone knows how to solve it, please provide a solution
@TuanLe-ve7lm 2 years ago
do you have an answer to it?
@RZ-iv6qn 2 years ago
Hello I have a business inquiry. Please DM me.
@人榜鄭 2 years ago
Thank you, hope there will be a lesson 3~~
@shrutisalunkhe1873 2 years ago
Thank you for your video. It was very useful. Can you please share the next part too?
@hritiktyagi7043 1 year ago
Hey! Have you finished your work on Tesseract, or are you still working on it?
@waynewu7763 2 years ago
I get an error at the last step, when using it to read the image. It says: error opening data file. make sure tessdata_prefix environment variable is set to tessdata directory. But I already put the program file\Tesseract-OCR into my path environment variable. Can you help with this?
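The message quoted above refers to a separate environment variable, not to PATH: PATH lets the shell find the program, while the prefix variable tells Tesseract where its data files are. The following is a minimal sketch, assuming the variable is spelled `TESSDATA_PREFIX` and should point at the tessdata directory as the message says (other Tesseract versions may expect a different directory level). The helper name `tessdata_env` and the directory in the usage lines are hypothetical.

```python
import os


def tessdata_env(tessdata_dir, lang):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    Raises FileNotFoundError if <tessdata_dir>/<lang>.traineddata is missing,
    which is worth ruling out before blaming the environment variable.
    """
    model = os.path.join(tessdata_dir, f"{lang}.traineddata")
    if not os.path.isfile(model):
        raise FileNotFoundError(model)
    env = dict(os.environ)
    env["TESSDATA_PREFIX"] = tessdata_dir
    return env


if __name__ == "__main__":
    # Hypothetical install location; adjust to wherever tessdata actually lives.
    try:
        env = tessdata_env(r"C:\Program Files\Tesseract-OCR\tessdata", "train")
        print("TESSDATA_PREFIX =", env["TESSDATA_PREFIX"])
    except FileNotFoundError as missing:
        print("traineddata file not found:", missing)
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` so the setting applies only to that one call.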
@promaster6310 2 years ago
I'm trying to follow this video, and in step 5 I get the error: "Warning: No shape table file present: shapetable". What is happening?
@samuelbastias3752 1 year ago
Hey, did you ever figure it out? I'm getting the same error message.
@Faruk_ck 3 months ago
@samuelbastias3752 I think running the commands with administrator permissions and deleting the older files will fix your issues
@siux94 2 years ago
This is the old way, pre-Tesseract 4, not for the LSTM network. Classical Indian youtuber
@hasanaqbayli149 2 years ago
Thanks @The Code... not all files were generated! What could be the issue?
@lucaavitabile7687 2 years ago
Thanks a lot! Very useful tutorial, and thanks for the material too!
@adamchochowski5357 2 years ago
Hi Man, awesome tutorial. Quick question: I'm struggling with step 5. My Tesseract creates only one file (train.unicharset) instead of four as in your tutorial (missing: inttemp, pffmtable, normproto), so I get this in cmd:
Warning: No shape table file present: shapetable
Reading train.my.exp0.tr ...
Flat shape table summary: Number of shapes = 11 max unichars = 1 number with multiple unichars = 0
At 04:41 I can see that you get 3 more lines from cmd. Maybe you can give me some advice?
@adamchochowski5357 2 years ago
The issue occurred on Tesseract 5.X; after installing Tesseract 4.1 the issue is not present
@samuelbastias3752 1 year ago
@adamchochowski5357 Thank you so much for following up with the solution! MVP
@roshanacharya8054 2 years ago
For multiple images, should I create multiple traineddata files or only a single one? If a single one, how do I train on multiple images?
@d0ugparker 2 years ago
Excellent, thank you. At 1:16, an incidental note on pronunciation: the “v” in “converting” is a voiced “f” sound, rather than any “w”-related sound. “v” is positioned next to “w” but that's misleading; they don't sound alike. Their sound production is different. “v” is more closely related to “f”. Say the word “fee.” Make and hold the “f” sound. Then, while holding the “f” sound, hum. “v” is a vibrating “f”. Regards
@techgalaxy100 2 years ago
Thanks for the tutorial. How do I train data for the Urdu and Arabic languages? What would the font properties be? I have an Urdu font and hundreds of Urdu samples in jpg format. No clue where or how to start.
@EvonOSmith1 2 years ago
Thanks for this. I was able to duplicate the process in Linux. However, there was zero improvement in the recognition of my handwriting. I don't know if I did something wrong or Tesseract is that bad lol. Thanks again.
@aaronm4523 2 years ago
This is what I get when I test the png. What are these errors?
C:\Users\Laser\Desktop\Tesseract>tesseract HONEYBEE FONT.png stdout -l train
read_params_file: Can't open stdout
read_params_file: Can't open l
read_params_file: Can't open train
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
Error in fopenReadStream: file not found
Error in findFileFormat: image file not found
Error during processing.
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 0138B858 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-punc-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 0138B908 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-word-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 013AC150 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatalstm-number-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 04899FC0 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatapunc-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 013B52A0 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddataword-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 03646348 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatanumber-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 04D43150 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatabigram-dawg)
ObjectCache(6AAB5A88)::~ObjectCache(): WARNING! LEAK! object 04D43788 still has count 1 (id \Program Files (x86)\Tesseract-OCR\tessdata/eng.traineddatafreq-dawg)
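One plausible reading of the log above is that the space in the file name splits it into two arguments, so Tesseract looks for an image called HONEYBEE and cannot find it. The sketch below illustrates that splitting with Python's `shlex`, which follows POSIX shell rules; the Windows command prompt differs in details but also separates unquoted arguments at spaces. The helper name `split_command` is hypothetical.

```python
import shlex


def split_command(command_line):
    """Split a command line into arguments using POSIX shell rules."""
    return shlex.split(command_line)


unquoted = split_command("tesseract HONEYBEE FONT.png stdout -l train")
quoted = split_command('tesseract "HONEYBEE FONT.png" stdout -l train')

if __name__ == "__main__":
    # The argument right after the program name is what gets read as the image.
    print("unquoted image argument:", unquoted[1])
    print("quoted image argument:  ", quoted[1])
    print("argument counts:", len(unquoted), "vs", len(quoted))
```

Quoting the file name, or renaming the file so that it contains no space, keeps it as a single argument.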
@Dailythingsx 2 years ago
Hello, can you please upload part 2 on how to prepare images for better accuracy?
@aaronm4523 2 years ago
What does this mean?
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
libpng warning: iCCP: known incorrect sRGB profile
@oguzhanylmaz4586 2 years ago
It cannot find letters on geometric shapes. How can I solve this?
@eltradermexicano 2 years ago
What is your Tesseract version?
@professionalgambling6783 2 years ago
4.0
@nurahmadmiftahudin8314 3 years ago
Why is my Tesseract just reading the .tr file but not writing pffmtable, inttemp, and normproto?
@haziqidrose4336 3 years ago
have u found the solution bro?
@haziqidrose4336 3 years ago
i'm having the same problem
@nurahmadmiftahudin8314 3 years ago
Yes, I use Tesseract v4.0.0 and it works fine
@aiesewss6659 2 years ago
Use Tesseract v4.0.0 and ensure the eng.traineddata file is present in the tessdata folder.
@DiscworldZA 2 years ago
I tried running mftraining but it never ends. Any fix for this?
@vijayk7819 3 years ago
Nice explanation, the steps were easily understood. Can you share content or a video on training for and using GD&T (mechanical characters)?
@viteralex 2 years ago
Hi, did you find some good examples with GD&T?
@meve404 3 years ago
Thank you! Finally, I found somebody that explains this for beginners!
@marsmediainfo 3 years ago
It appears that you need Tesseract 4.1 for this tutorial, as with 5.0-alpha I couldn't get past the last steps
@professionalgambling6783 2 years ago
that's true
@professionalgambling6783 2 years ago
@Devdevdevdev idk, they probably can, but you will need a lot of samples to train that thing
@professionalgambling6783 2 years ago
@Devdevdevdev how many pages do you train with?
@professionalgambling6783 2 years ago
@Devdevdevdev yes you can train more, and you probably should
@professionalgambling6783 2 years ago
@Devdevdevdev I didn't post any kind of script; I think you are mistaking me for someone else. You should watch a tutorial on how to generate the training data. First of all, you should have a font. If you don't have a font, which is obviously the case with handwritten material, then the only way to generate 5, 10, or 50+ pages would be to write software that cuts out the letters at predefined rectangle positions and then generates a page of randomly spread letters, with the predefined rectangles recording which letter each one is. If you can program, that shouldn't be hard; then generate many pages containing the letters.
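The page-generation idea described above can be sketched as a layout step that decides where each letter goes and writes the matching box-file lines. This is a minimal sketch under stated assumptions: the function name `layout_page` is hypothetical, letters are placed on a fixed grid so they never overlap, and each line follows the box-file layout of character, left, bottom, right, top, page, with coordinates measured from the bottom-left corner. Pasting the cut-out letter images onto the page would need an imaging library and is not shown.

```python
import random


def layout_page(glyphs, page_w, page_h, cell, count, seed=0):
    """Place `count` randomly chosen glyphs on distinct grid cells.

    Returns box-file lines in the form 'char left bottom right top page'.
    """
    rng = random.Random(seed)
    cols, rows = page_w // cell, page_h // cell
    cells = [(col, row) for row in range(rows) for col in range(cols)]
    if count > len(cells):
        raise ValueError("page too small for the requested number of glyphs")
    lines = []
    # Sampling without replacement guarantees that no two glyphs share a cell.
    for col, row in rng.sample(cells, count):
        char = rng.choice(glyphs)
        left, bottom = col * cell, row * cell
        lines.append(f"{char} {left} {bottom} {left + cell} {bottom + cell} 0")
    return lines


if __name__ == "__main__":
    for line in layout_page("abc", page_w=200, page_h=200, cell=40, count=5):
        print(line)
```

Calling the function with different seeds yields different pages, which matches the suggestion to generate many pages from the same set of letters.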
@devartimahakalkar8822 3 years ago
Hi, I am getting an error while training the data. Could you please tell me which Tesseract version you are using?
@professionalgambling6783 2 years ago
it's in the video, it's 4.0

The Code

Comments