How does Automated Machine Learning work?

Katya Tompoidi
Apr 22, 2021 10:17:26 AM

Automated Machine Learning (autoML) is a recent trend in the world of Machine Learning that aims to train and optimize ML models with as little human intervention as possible. In this article, I will introduce the main concepts of autoML and explain how exactly Doculayer.ai integrates autoML.

Let us first have a look at the machine learning lifecycle. Machine learning can be represented as a step-by-step process. This process starts with the collection and annotation of raw data, where a user with domain knowledge assigns the data to the correct categories. After being annotated, the data need to be preprocessed in such a way that only the essential information in each document is kept. For instance, very common words or numbers are usually removed or replaced by placeholders to restrict the feature space and speed up the training process. Next, a machine learning algorithm is selected based on the problem the data are meant to solve; usually, this is either a classification or a regression problem. After this, the data are combined with the algorithm in the training step.
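To make the preprocessing step more concrete, here is a minimal sketch in Python, assuming scikit-learn is available and English-language documents. The example documents, labels and the placeholder convention are made up for illustration and are not tied to any particular product.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer

def preprocess(text: str) -> str:
    """Lowercase the text and replace numbers with a placeholder to keep the feature space small."""
    return re.sub(r"\d+", "<num>", text.lower())

# Illustrative raw documents and their annotated categories.
documents = [
    "Yearly income overview for 2020: total income 54,000 EUR",
    "Contract signed on 12 March between the two parties",
    "Property overview: apartment of 85 m2 located downtown",
]
labels = ["income_overview", "contract", "property_overview"]

# TF-IDF vectorization with common English stop words removed keeps only
# the more informative terms as features.
vectorizer = TfidfVectorizer(preprocessor=preprocess, stop_words="english")
features = vectorizer.fit_transform(documents)

print(features.shape)  # (number of documents, size of the vocabulary)
```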

Thorough testing & monitoring of the model is crucial
Training requires a specialist to monitor the process and, if needed, intervene to tune the hyperparameters of the algorithm. The result of training is a model, which in turn needs to be thoroughly tested before being integrated into production environments. If the model is not thoroughly tested, it may generalize poorly: it performs well on the data it was trained on but fails badly when confronted with unseen data. Once the model’s performance has been evaluated on unseen data, it can be deployed into production. The next step consists of using the model in a production setup and monitoring its performance over time, to make sure the model is replaced with a better one in the long term.
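The generalization problem described above is usually caught by holding out part of the annotated data as a test set. Below is a minimal sketch, continuing from the preprocessing example and again using scikit-learn; with only three toy documents the numbers are meaningless, but the mechanics are the same with a realistic dataset.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Keep 20% of the annotated data aside; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# A large gap between training and test accuracy signals poor generalization.
print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy: ", accuracy_score(y_test, model.predict(X_test)))
```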

Now that we know the steps in the machine learning lifecycle, autoML can be defined as a process in which all those steps are performed automatically. 
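In its simplest form, this automation can be pictured as a search over candidate algorithms and their hyperparameters, where the best combination is selected by cross-validation. The sketch below is a generic illustration using scikit-learn's GridSearchCV, not the mechanism Doculayer.ai uses internally; the documents and parameter grids are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy annotated dataset: two examples per document type.
documents = [
    "yearly income overview total income 54000",
    "income overview salary statement for the year",
    "contract signed between the two parties",
    "employment contract terms and conditions",
    "property overview apartment located downtown",
    "property overview house with garden and garage",
]
labels = ["income", "income", "contract", "contract", "property", "property"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),  # placeholder, swapped out by the grid below
])

# Each dict is one candidate algorithm with its own hyperparameter grid.
param_grid = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [MultinomialNB()], "clf__alpha": [0.1, 0.5, 1.0]},
]

# Cross-validation picks the best algorithm/hyperparameter combination automatically.
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(documents, labels)

print("best algorithm and settings:", search.best_params_)
print("best cross-validated score:", search.best_score_)
```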

The use case below illustrates the ML lifecycle steps with a concrete example. If everything is clear so far, feel free to jump to the next chapter to see how autoML is featured in Doculayer.ai.

A use case 
Imagine you work in a company that processes a lot of financial documents. This company has, among others, three types of documents: ‘Yearly income overview’, ‘Contracts’ and ‘Property overview’. These three types are the most common in one department, so the employees there would like a system that helps them keep the documents properly grouped without putting too much effort into it.

One of the employees comes up with the idea to keep each document in a folder named after the document’s type. The next day, the same employee is assigned the task of grouping all the documents into the corresponding folders. This task takes a day and requires the employee to open each document and check for words or other patterns that reveal the type the document belongs to.

Over the next days the employee is busy with a different assignment, so she has no time to sort the new documents reaching her department into the correct folders. As a result, after a month she is faced with the same tedious task of classifying a pile of documents by type and storing them in the correct folders. To spare the employee from repeating the same task, the department’s manager decides to ask a data scientist from a different department to find and configure an algorithm that could learn to classify those three types of documents automatically.

After a week of work, the data scientist comes back with a seemingly perfect model that classifies 95% of the documents in the training set correctly. But this model needs to be integrated into the document management system so that it can automatically assign each new document to the correct folder when it is uploaded into the system. So, the department’s manager also asks an engineer to integrate the model into the production environment.

After another month the model is integrated into production, but the employees don’t seem to be very happy with it. The model constantly makes mistakes with documents of the type ‘Yearly income overview’. So, the data scientist is asked to check what is going on. A comparison of the training data and the production data shows that documents of the type ‘Yearly income overview’ come in many more formats than those in the training set. As a result, the data scientist has to keep part of the data out of the training process to see how the model performs on data it has not encountered before (the test set). Many more modifications to the training process are required before the model reaches 95% accuracy on the test set and can be taken into production again.

Half a year later, a new regulation is introduced that changes the structure of the documents of the type ‘Contracts’. As a result, the model is no longer very confident about the type it assigns when classifying a document. This leads to many mistakes, and thus to extra manual work for the employees, so over time more employees start complaining about the model’s efficacy. The department’s manager has to ask the data scientist to figure out what is wrong with the model. After identifying the change, the data scientist follows the same steps as before to retrain the model on the new and old data.

Once the new model is ready, the engineer has to replace the old model with the new one, but also extend the system with a database where the classification scores are stored, so that the data scientist can check whether the model’s confidence is dropping. By monitoring these confidence fluctuations, the data scientist can decide when to retrain the model in order to keep its efficacy high.
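The monitoring the story ends with can be as simple as tracking the classifier's confidence (the highest predicted probability per document) over a window of recent documents and raising a flag when it drops. A minimal sketch, assuming a fitted model with a predict_proba method; the window size and threshold are arbitrary example values.

```python
from collections import deque

class ConfidenceMonitor:
    """Tracks the model's confidence on recent documents and flags when retraining looks needed."""

    def __init__(self, window_size: int = 200, threshold: float = 0.7):
        self.scores = deque(maxlen=window_size)  # confidences of the most recent documents
        self.threshold = threshold

    def record(self, probabilities) -> None:
        # Confidence = highest predicted class probability for this document.
        self.scores.append(max(probabilities))

    def retraining_suggested(self) -> bool:
        # Suggest retraining once the average confidence over the window drops too low.
        if not self.scores:
            return False
        return sum(self.scores) / len(self.scores) < self.threshold

# Example usage with any scikit-learn-style classifier:
# monitor = ConfidenceMonitor()
# for document in incoming_documents:
#     probabilities = model.predict_proba(vectorizer.transform([document]))[0]
#     monitor.record(probabilities)
#     if monitor.retraining_suggested():
#         print("Average confidence dropped; consider retraining the model.")
```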

AutoML within Doculayer.ai
At Doculayer.ai we understand that things change fast, so no model trained today will remain relevant forever. That is why Doculayer.ai supports autoML and offers an ecosystem in which:

  • The user can upload and annotate the documents; 
  • Data preprocessing is done automatically;
  • The best matching algorithm is discovered and configured automatically;
  • The model using the best matching algorithm is trained automatically;
  • The model is tested to provide the user with an expectation of the model’s performance in production;
  • The best model is instantly deployed into production;
  • The model’s confidence is monitored;
  • Re-training of the model is automatically suggested to the user.

With Doculayer.ai’s autoML, training and optimization of the ML models in use are highly automated, which gives better results with less effort.

Want to know more about autoML with Doculayer.ai?
If you would like to know more about the benefits of this feature, do not hesitate to contact one of our sales representatives.
