Training UiPath Document Understanding ML Models - Data Manager - Part 2 | RPA

  • Published: 17 Oct 2024
  • This part of the video focuses on how we use the Data Manager with training data generated through the Document Understanding workflow.
    Topics covered in the video:
    Use of Machine Learning Extractor Trainer
    Generating training data through the data validation process of the DU workflow
    Using training data in Data Manager
    Use of single Data Manager session for multiple training batches
    Training ML Model
    In the next video, we will look at how to use the generic Document Understanding model to build an ML model for documents other than invoices/purchase orders
    Download the source code from here:
    drive.google.c...
    #UiPath #MachineLearning #DocumentUnderstanding #ArtificialIntelligence #RPA #UiPathCommunity
  • Science

Comments • 64

  • @akshaykulkarni1771
    @akshaykulkarni1771 4 months ago +1

    Hi @Lahiru, thank you so much for the video series. I have one question: in the UiPath documentation it is recommended to retrain the model with the base version (version 0) as the minor version. Is there any particular reason you chose the retrained model version as the minor version while running the pipeline at 44:30?

    • @LahiruFernando
      @LahiruFernando  4 months ago +1

      Hello Akshay,
      Thank you for the comment and a great question.
      Yes, the recommended approach is to always use the base version (version 0) for the minor version. It was a small mistake on my side when I created the video. Unfortunately I cannot edit that part in this one, but I published a separate video after this explaining which one to go with. So you are correct: it should always be zero for the minor version.

  • @donyc
    @donyc 3 years ago +1

    Lahiru, this is an invaluable resource, and your hard work is truly appreciated! Wonderful work, excellent pace, and clear as well. Thank you very much.

    • @LahiruFernando
      @LahiruFernando  3 years ago

      Wow.. Thank you so much for your thoughts my friend!! This really means a lot!! Thanks again!

  • @bhavanakoturu597
    @bhavanakoturu597 9 months ago +1

    Sir, after retraining I can see all the data fields in the ML skill instead of only the fields mentioned in labelling. Could you please help me; am I missing anything here?

    • @LahiruFernando
      @LahiruFernando  9 months ago

      Hello,
      In the Machine Learning Extractor activity, have you retrieved the latest fields from the model?
      You can do this by clicking Configure Extractors on the Data Extraction Scope and then clicking the settings button for the ML extractor. There you’ll see a button called Get Capabilities

    • @bhavanakoturu597
      @bhavanakoturu597 9 months ago +1

      @@LahiruFernando Thanks for your quick reply sir.. after retraining the model I am able to see the fields

    • @LahiruFernando
      @LahiruFernando  9 months ago

      @@bhavanakoturu597 Awesome... So it's sorted :)

  • @anilkumarandra7557
    @anilkumarandra7557 3 years ago +1

    Thank you so much sir, the video is very helpful. Please provide the continuation video so we can gain more knowledge on DU

    • @LahiruFernando
      @LahiruFernando  3 years ago

      Thank you so much for your thoughts... I am planning to publish part 3 soon.. stay tuned!

  • @jacobchiengsh2210
    @jacobchiengsh2210 2 years ago +1

    Hi @Lahiru,
    I am quite impressed by this video and it gave me ideas on how I can train my own machine learning models. I have a few queries on the content, as below:
    1. May I know why you needed to create two datasets in the video? Is it because of different types of invoices? If I have a lot of invoices with different structures, and different details are needed for each type of invoice, do I need to create a few datasets in this case, or will one dataset do?
    2. We would need to train the model in AI Center and then do the same in the UiPath workflow (after the Validation Station). May I know what the difference is between these two trainings? If I only train in AI Center, is it sufficient?
    3. From the video, we can make the model train automatically. Does that mean it trains on the same documents, or would we need to upload different documents for every training? If it trains on the same documents, why would we need to train it every time?
    Many thanks again for the great video tutorial.

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hey.. Great questions again :)
      Here are the answers:
      1. Just one dataset is enough for each Data Manager session. The same dataset will contain the input data and the output generated by the Data Manager. However, if you have several AI models, make sure to have separate datasets for each model so that a model does not get confused by different types of documents.
      2. Training in the Data Manager (AI Center) is very important. It generates a lot more data compared to the retraining data generated by the Validation Station. Data Manager training is the foundation for the model to learn, whereas the data generated by the Validation Station is just an improvement on the foundation you created. I explained this in the following video.. Have a look :)
      Link: ruclips.net/video/DWAa6XPeOrE/видео.html
      3. Once you enable automatic training, it always gets trained on the old dataset + the new dataset (if available). This way, every training run will train the model on a larger dataset than the previous one. This approach helps to improve the model, as it works on large datasets. Again, this concept is explained in very good detail in the link I shared just above :)
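The cumulative behaviour described in point 3 can be sketched in a few lines of Python. This is a purely conceptual illustration (the class and function names are invented for this sketch, not UiPath's actual API): each scheduled run trains on everything exported so far, so the effective training set only grows.

```python
# Illustrative sketch (not UiPath's API): automatic training always runs
# on the full accumulated dataset - old batches plus any new export.
class TrainingDataset:
    def __init__(self):
        self.documents = []  # everything exported so far

    def add_export(self, docs):
        """A new export from the Validation Station / Data Manager."""
        self.documents.extend(docs)

def run_training_pipeline(dataset):
    # The pipeline sees the whole dataset, not just the newest batch.
    return len(dataset.documents)

ds = TrainingDataset()
ds.add_export(["invoice_001", "invoice_002"])
first_run = run_training_pipeline(ds)    # trains on 2 documents
ds.add_export(["invoice_003"])
second_run = run_training_pipeline(ds)   # trains on 3 documents
assert second_run > first_run            # every run uses a larger set
```

The point of the sketch is just the growth invariant: a new export never replaces the old data, it is appended to it.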

  • @cjgomezr
    @cjgomezr 2 years ago +1

    Thanks Lahiru, really good video. I have a question: do I always have to import the data into the Data Manager and then export it to run the pipeline and train the model? Or is there a way in which I just run the process in Studio and it loads the data into the dataset and the pipeline runs automatically? Thank you.

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hi Camilo,
      Thank you so much for your feedback..
      About the question: yes, this is possible. If you have a look at the Part 3 video of this series, I have explained how to auto-export and schedule the training runs on AI Center :)
      In case you are unable to find it, here is the link:
      ruclips.net/video/0pnEpfTjbr0/видео.html

  • @Danpm658
    @Danpm658 2 years ago +1

    Hello, absolutely great tutorial! I do have a question after it. After the Machine Learning Extractor Trainer, you uploaded the data into Document Manager, corrected the fields, and exported the data into the dataset. Isn't it the same thing if you publish the data directly from the Machine Learning Extractor Trainer and make a run in the pipeline? I mean, it's the same corrected data that gets into the dataset with or without the Document Manager. I might be missing something, but I am looking forward to your answer. Cheers!

    • @Danpm658
      @Danpm658 2 years ago

      Actually, there is no need for an answer here; you showed the steps in part 3. Great job, you earned a subscriber

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hey Mihai,
      Nice point man.. it is the same thing bro. You can manually add the data to Document Manager (which is the old way). But now, as you said, we can directly push the validated data (coming from the Validation Station/Action) to the dataset. If you have the export scheduled in Document Manager, it will automatically perform this step and get the data ready for your pipeline. You can also schedule the pipeline to run on the latest data.
      Hope this helps.. Let me know if anything is not clear brother..

    • @Danpm658
      @Danpm658 2 years ago +1

      I do have one more question though. What if we want to extract a custom field (item tax percentage, for example) that is not in the schema.json or in the invoice package? Would the skill learn to extract it by indicating it at first?

    • @LahiruFernando
      @LahiruFernando  2 years ago +1

      @@Danpm658 Yes.. you can add the new field in the Data Labeling session, start training your documents with those values, and export it. It will eventually learn. However, if you have a lot of columns that do not exist in the current model and what you expect is totally different, the recommendation would be to train your own model using the generic Document Understanding model. That model can be customized to fit any document type you have.

  • @yashobantadash6670
    @yashobantadash6670 2 years ago +1

    Great video bro! 😊😊🙌🙌 One question on my mind, bro: why do we need to create a separate batch in the Document Manager session when we can use the existing batch to upload the documents?

    • @LahiruFernando
      @LahiruFernando  1 year ago +1

      Hey bro...
      It depends on the need. If you want to add the files to the same batch that you already have, you can use that name. However, if you need to be able to identify the files separately, you can use a different batch name. It is all about grouping..

    • @yashobantadash6670
      @yashobantadash6670 1 year ago +1

      @@LahiruFernando thanks a lot bro! 🙌🙌😍😍

  • @sruthi5287
    @sruthi5287 3 years ago +1

    Thank you for this informative and great tutorial.

  • @drGacayte
    @drGacayte 3 years ago +1

    Hi Lahiru, thanks a lot for your videos, they are really helpful and informative. Question: I have been training a batch of invoices (13 invoices from one company). I trained about 2 times, both using the Data Manager as in part 1 and using Studio and then exporting the training set as in part 2. BTW, I am using the invoice framework I found on the UiPath website. My question is: every time I run the batch through Studio, confidence is low and it keeps bringing up the Validation Station. How do I ensure that this batch is trained and confidence is high enough that it outputs the data and does not keep bringing up the Validation Station? Thank you

    • @LahiruFernando
      @LahiruFernando  3 years ago +1

      Hello
      Thank you so much for your thoughts, this is a great question.
      Here is my answer:
      I believe your ML model is still new and has done limited training with a limited number of documents (13 docs). I have faced the same scenario. The reason is that it requires more training and more sample documents to train with. You can achieve this as follows:
      1. Rerun the training pipeline a few more times on the same data you already trained it with.
      2. If you have new documents available, feed them into the DU workflow and get the data uploaded into the dataset as I explained in Part 3. Then perform multiple training runs on the current and the new data to improve the accuracy.
      It will take some time, and you will need to run it multiple times to get a better result. This is quite common in the initial stages. However, over time it will learn and will reach a higher confidence level.
      In my case, I did the same thing. That improved the accuracy, and eventually it reduced the need for manual verification.
      In short, keep doing more training and it will improve :)
      Let me know if you have any questions or issues my friend. Happy to help any time.
      Have a great day!

    • @drGacayte
      @drGacayte 3 years ago +1

      @@LahiruFernando Thank you so much sir

  • @andreaskurz6166
    @andreaskurz6166 3 years ago +1

    Hi Lahiru
    Great tutorial as usual! A short question on the retraining and minor version (around 44:00): on the UiPath Forum it was mentioned not to retrain on a retrained version, and hence to increase the minor version when running training pipelines. Is this still a best practice? Or is it advised, as shown by you, to continue training on retrained minor versions?

    • @LahiruFernando
      @LahiruFernando  3 years ago

      I have also seen this.. I'm not too sure why that is the recommended way to train. But in my case I have trained on retrained versions, and that seems to be working fine for me..
      I will also explore this a bit to see what else I can find and will post my findings here.. :)

  • @shreyjain7959
    @shreyjain7959 2 years ago +1

    You have rerun the training multiple times on the same dataset. How do I do that in order to improve accuracy? I have already done the training run once.

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hello Shrey,
      You can do that by creating multiple training pipelines pointing to the same dataset.
      When creating the pipeline, always make sure to set the minor version to 0 so that it trains on the base version. This also generates better accuracy.

    • @shreyjain7959
      @shreyjain7959 2 years ago +2

      @@LahiruFernando But for the subsequent versions I should select the latest versions, right? For example, I trained once and upgraded from 14.0 to 14.1. Now for the next training run I should select the major version as 14 and the minor version as 1, right? So that my latest model will be 14.2

    • @LahiruFernando
      @LahiruFernando  2 years ago

      @Shrey Jain you can train it like that too.. but that is not recommended, because if you do that you train on a smaller dataset.
      The major version can be the highest (14 in your case).
      But setting the minor version to 0 makes sure the training done by UiPath (if pretrained) plus your previous training knowledge is used for the new training, which means you train on a much larger dataset.
      This setting will also ensure it upgrades to 14.2 next time.
      The concept is explained in more detail in this video too..
      Link: ruclips.net/video/DWAa6XPeOrE/видео.html
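The versioning rule in this reply can be made concrete with a tiny sketch (hypothetical helper names, not UiPath's API): the recommended pipeline starts from minor version 0, the base model with the full dataset, while the newly trained model still receives the next minor version number.

```python
# Hypothetical sketch of the minor-version rule discussed above.
def choose_training_input(major, latest_minor, retrain_from_base=True):
    """Pick which model version a training pipeline starts from."""
    minor = 0 if retrain_from_base else latest_minor
    return (major, minor)

def next_model_version(major, latest_minor):
    """The newly trained model always gets the next minor number."""
    return (major, latest_minor + 1)

# Latest model is 14.1. Recommended: train from 14.0 (base, full dataset)...
assert choose_training_input(14, 1) == (14, 0)
# ...and the output is still published as 14.2.
assert next_model_version(14, 1) == (14, 2)
```

So "train from 0" and "upgrade to 14.2" are not in conflict: the starting point and the published version number are chosen independently.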

    • @shreyjain7959
      @shreyjain7959 2 years ago

      @@LahiruFernando Ok Lemme watch the video

    • @shreyjain7959
      @shreyjain7959 2 years ago

      @@LahiruFernando Ok I watched the video and understood. You just earned a new sub brother!

  • @shreyjain7959
    @shreyjain7959 2 years ago +1

    After validating the documents you import them into a separate Data Manager. Shouldn't you import them into the same Data Manager you created last time, so that the training data increases and hence gives good results? Kindly correct me if I am wrong!
    Also, you mention in the best practices video that the minor version should be 0 so that a larger dataset is available for training. But here you select the minor version as 4. Please resolve the confusion!

    • @shreyjain7959
      @shreyjain7959 2 years ago +1

      Kindly reply

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hello Shrey,
      Sorry for the late reply. Your comment did not show up for some reason, and luckily I found it. Hence, replying to you now..
      Yes, in the video I used it a bit differently, as it was a demo to understand the concept. In the real scenario, what you mentioned is correct. So, to answer your questions:
      1. Yes, we have to use the same Data Manager session we used for the initial training to import the data and enhance the accuracy through live running data. Your understanding is correct.
      2. In the video, I used a later minor version (instead of zero) to run on a smaller dataset and record the video faster. But in the real scenario (as explained in the best practices video), we always have to select 0 as the minor version, because then we use a large dataset for the training. Version 0 always represents the base version. So every time we train on version 0, we use the initial training data plus all the additional data for the training. The larger the dataset, the higher the accuracy becomes..
      Hope this resolves the confusion you had.. :)

    • @shreyjain7959
      @shreyjain7959 2 years ago +1

      @@LahiruFernando Thank you for replying and yes you solved the confusion. Thanks a lot!

    • @thebujoco9381
      @thebujoco9381 2 years ago +1

      @@LahiruFernando Hi, I tried to deploy the ML skill with minor version 0 but it's failing. However, when I deployed it on minor version 1 it was successful.
      Do you happen to know why it would be failing with a Kubernetes error?
      Also, is there any issue with deploying minor version 1?
      How many document pages would you recommend we train on before we can skip presenting the validation data?

    • @LahiruFernando
      @LahiruFernando  2 years ago

      Hello @@thebujoco9381
      Very good questions.. Let me answer...
      When deploying an ML skill, it is preferred to always use the latest minor and major versions. The reason is that deploying an ML model requires a trained and enhanced version. Minor version 0 sometimes means it is not trained, or is trained by UiPath and not retrainable. So if you are using 0 for a model that is not trained, it will fail. So do the training and use the latest minor version for deployment as a skill..
      I believe this is the reason for the error you got. Maybe you can share the error so I can be more specific, if you need?
      About the number of pages needed -> this actually depends on several factors, such as:
      1. How good the initial training is
      2. The number of documents used for the initial training
      3. The documents you get in the real DU process to validate through the Validation Station (are those similar to the previously trained ones, or totally new layouts?)
      In general, if it's an invoice processing job, we request at least about 50 documents per vendor. If we cannot get 50, get as many as possible, and stay above 10 at least. This way we can get a good accuracy level.
      Hope this helps... Feel free to ask anything :)

  • @SavitabhKumar
    @SavitabhKumar 1 year ago

    Hi, I am using the ML skill for generic documents. It is trained on 10 docs, but still no data is extracted in the Validation Station. Please help with this

    • @LahiruFernando
      @LahiruFernando  1 year ago

      Hello,
      Have you enabled the extractor to extract the fields through the Configure Extractors option in the Data Extraction Scope?
      If that doesn't work, check whether it works on a document it was trained on. If not, you may need more sample documents to train the model better.
      Let me know if this helps

    • @savitabhkumar5714
      @savitabhkumar5714 1 year ago

      Yes sir, I have configured the extractor, and on Get Capabilities I am getting all the required fields

    • @savitabhkumar5714
      @savitabhkumar5714 1 year ago

      On the invoice-type document it is working, but on the generic one it is not working. I have 20 sample documents and I only used 10, so let me add a few more and check

  • @j.v.manojkumar2673
    @j.v.manojkumar2673 10 months ago

    Very helpful video. I followed the same approach you suggested, but when I try to retrain the ML skill I get the error: "some documents contain fields which are not found in current schema". Do you have any idea about this error?

    • @LahiruFernando
      @LahiruFernando  10 months ago

      Hello..
      That’s a weird error.. Have you used the schema from the out-of-the-box model, or did you create it from scratch?

    • @j.v.manojkumar2673
      @j.v.manojkumar2673 10 months ago

      @@LahiruFernando I used the India invoice schema from the out-of-the-box model

    • @LahiruFernando
      @LahiruFernando  10 months ago

      @@j.v.manojkumar2673 Can you try the Normal Invoice model and see if that works?

    • @j.v.manojkumar2673
      @j.v.manojkumar2673 10 months ago

      @@LahiruFernando Sure, I will try it and let you know. Thanks for responding.

  • @prabhuteja7621
    @prabhuteja7621 3 years ago +1

    Hi sir, the new Machine Learning Extractor Trainer does not have those Project and Dataset options. How do I do this with the new one, sir?

    • @LahiruFernando
      @LahiruFernando  3 years ago

      Hi Prabhu,
      Are you sure you are using the latest versions of the IntelligentOCR and Document Understanding activities? It is available in the most recent ones; I used it a couple of days ago too :)
      Can you quickly verify that you are on the latest versions and let me know how it goes?

    • @drGacayte
      @drGacayte 3 years ago +1

      @@LahiruFernando I don't see the versions you are using; the newest versions I see are v4.13.2 for OCR and v1.7.0 for DocumentUnderstanding. Any idea why?

    • @LahiruFernando
      @LahiruFernando  3 years ago +1

      @@drGacayte Very good morning
      I am using the DU.ML.Activities 1.7.0 (latest stable version).
      For OCR.Activities, I believe what you see is the latest stable version, which was released a couple of weeks ago. It was not available at the time I did the recording :) So in the video, you see the latest preview version, which is somewhat similar to the latest.
      In case you want to try the preview versions, you need to do the following:
      - Go to Manage Packages
      - On the top filter, click the filter icon and enable include pre-releases
      This will allow you to use the preview versions, including the one I used.
      Hope this helps :)
      Have a great day!

    • @drGacayte
      @drGacayte 3 years ago +1

      @@LahiruFernando Thanks a lot mate. Also, by any chance do you still have your source code? Would you mind sharing it? Thanks

    • @LahiruFernando
      @LahiruFernando  3 years ago +1

      @@drGacayte Yeah.. luckily I did not delete it :)
      I uploaded it to Google Drive. You can get the download link from the video description. Let me know if you are not able to access it by any chance.

  • @TharaRaman
    @TharaRaman 3 years ago +1

    Thank you Sir.