AI is a hot topic, and a lot of executives are eager to jump on the highest AI wave and ride it as fast as possible. The problem is, they don’t realize that it takes time to prepare on land before heading into the water.
Quite often, when we talk to prospects, we see that they get very excited about Doculayer and the technologies we use. They get enthusiastic at the very thought that some of their challenges can be solved with AI. But as we start a more in-depth investigation of their specific case, we find that some companies have only a vague understanding of how AI works and what it takes to get ready for it.
With this article, we want to help you answer 2 fundamental questions before initiating any AI-related projects:
Do you really need AI?
Are you ready for AI?
But let’s start with the basics.
Why is AI such a big deal?
There is a lot of buzz around AI these days. Gartner predicts that by 2021, 70% of organizations will employ AI in one way or another to boost employee productivity.
But why is AI gaining so much attention?
Artificial Intelligence allows machines to perform human-like tasks, continuously learn from experience, and improve their performance. It’s a powerful technology that drastically boosts our productivity and thrives on jobs that we, humans, find tedious.
Here are just a few things that AI does better than we do:
Imagine you have to go through more than 10 000 contracts, classify them manually, extract the start and end date, and assign each contract to the right account manager. How long will it take you to finish this task? How many mistakes will you make? How soon will you get sick and tired of reading all these contracts?
Tools powered by AI, on the other hand, can accomplish this task in a matter of minutes with little to no errors. On top of that, no complaints and no coffee breaks.
Companies have so much data at their disposal nowadays that there is no way for a human to go through it all. Even the brightest data scientist cannot compete with AI’s capacity to absorb huge volumes of data, analyze them, and distill them into small bites of information. AI can see patterns in historical data and unlock tons of opportunities that we would otherwise overlook.
When it comes to solving complex problems, AI also outperforms us. It can make unbiased, unemotional decisions based only on data and statistics. No human being can be as data-driven as AI.
AI-powered solutions can be real game-changers in today's business world. They complement our daily jobs and help us overcome the natural limitations of our brains.
Only one slight remark.
Real AI with super-human intelligence doesn’t exist yet.
Tools that software providers usually wrap in a shiny AI package are mathematics, statistics, machine learning (ML), deep learning (DL), natural language processing (NLP), and neural nets.
We are all guilty of using these terms interchangeably, but as long as you understand the differences and say “AI” intentionally, it’s not a big deal.
In this article, we will also drop “AI” now and then, but you should read it as “innovative technologies that work with big data.”
So, now that we’ve made that clear, let’s continue with the more fundamental questions.
Do you really need AI?
A company reached out to us recently, seeking help with document classification.
Document classification, if done manually, is a tiresome and error-prone task, but you cannot neglect it. It’s a necessary step before you can glean any insights from documents. Therefore, a lot of companies try to take this burden off employees’ shoulders and automate document classification.
From the Machine Learning perspective, this task is rather trivial and can be easily solved. The entire process consists of 2 steps: data preparation and ML training.
Step 1: Data preparation
While this step is the most time-consuming, it’s also the most significant one. It entails collecting, consolidating, cleaning, and labeling data. Proper data preparation produces a well-curated training set and results in more accurate outcomes.
Think of converting pdf to machine-readable text files and assigning them to a specific category, for example, “Invoice” or “NDA.”
Step 2: ML training
The next step is feeding this clean, labeled training set into the ML engine. During this process, we teach an algorithm to tell apart the categories to which documents belong, based on the sample data.
The algorithm performs mathematical computations on the training set to find an optimal function capable of predicting the correct category of documents it has never seen before.
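The two steps above can be sketched in miniature. The toy classifier below is a plain-Python illustration, not Doculayer’s actual engine: it builds a word-frequency profile per category from a small labeled training set (Step 1’s output) and predicts the category whose profile best matches a new document (Step 2):

```python
from collections import Counter

def train(samples):
    """Step 2 in miniature: build per-category word-frequency profiles
    from (text, label) pairs produced by data preparation."""
    profiles = {}
    for text, label in samples:
        profiles.setdefault(label, Counter()).update(text.lower().split())
    return profiles

def classify(profiles, text):
    """Predict the category whose word profile best overlaps the document."""
    words = text.lower().split()
    def score(label):
        total = sum(profiles[label].values())
        return sum(profiles[label][w] / total for w in words)
    return max(profiles, key=score)

# Step 1 in miniature: a tiny, cleaned, labeled training set.
training_set = [
    ("invoice number total amount due payment", "Invoice"),
    ("please pay the invoice total by the due date", "Invoice"),
    ("confidential information disclosure agreement parties", "NDA"),
    ("the parties agree to keep information confidential", "NDA"),
]
model = train(training_set)
print(classify(model, "amount due on this invoice"))  # Invoice
```

Real engines use far richer features and models, but the shape is the same: labeled examples in, a predictive function out.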
Doculayer can solve document classification problems easily, so without any delay, we decided to get our hands on the new case.
However, when we received the sample documents from the prospect, we realized that we might need to go back to the meeting table and start our conversation all over again.
The documents we received were messy. The quality of scans was very low, and the diversity of data was enormous. They were in 5 different languages, with different layouts and structures. Some documents didn’t have any text layer on top of them, others were partially converted into fully searchable documents.
It turned out that they had 7 different categories of documents. But the volume for such a diverse group of documents was very low: 1000 in total. Fair enough, classifying 1000 documents manually is challenging. But for ML, it’s not enough data to learn how to differentiate them.
While considering AI technologies, it’s important to understand upfront the trade-off between time invested and the business value gained. In this particular case, the efforts that we would have put into training the ML engine would have never paid off. In fact, we were even unable to guarantee that AI results would be accurate enough. This challenge was simply not suitable for AI.
What is necessary to apply AI?
Two things are necessary to benefit from a Machine Learning application: a sufficient amount of data and high data quality. That’s exactly what we were missing in the above-mentioned case.
Machine Learning, as we said earlier, is the process during which, given a training set, an algorithm tries to find a mathematical formula that describes that data. In the case of document classification, this mathematical formula predicts the document category.
To understand why you need more data, you need to know the concept of overfitting in ML. When the mathematical formula, which from now on we will call a model, is deduced from a very small amount of data, chances are that at some point there will be a case that differs entirely from the data instances the model was trained on.
If we want to do it right, we need a representative dataset that includes most of the possible options. You can never have it all, and, as it usually happens, over time things become outdated or irrelevant. Still, your training dataset should contain as many instances as possible. That way, the model can generalize sufficiently to documents it has never been trained on. If that’s not possible, overfitting happens.
In simple words, overfitting is the problem of Machine Learning models that succeed tremendously at classifying sample documents during training but fail just as spectacularly when encountering new, slightly different documents.
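A caricature makes the point. The “model” below simply memorizes its training documents: it scores perfectly on them and fails on any new wording. Real overfitted models are not this extreme, but they drift toward exactly this behavior:

```python
def memorize(samples):
    """An extreme overfit: the 'model' is just a lookup of training documents."""
    return dict(samples)

def predict(model, text, default="unknown"):
    """Only documents seen verbatim during training are recognized."""
    return model.get(text, default)

train_docs = [("invoice total due", "Invoice"), ("mutual nda terms", "NDA")]
model = memorize(train_docs)

print(predict(model, "invoice total due"))           # Invoice: 100% on training data
print(predict(model, "total amount due on invoice")) # unknown: fails on new wording
```

A larger, more varied training set is what forces a model to learn general patterns instead of memorizing examples.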
Now let’s consider data quality, which has 2 aspects.
First, quality has to do with the type of data. When solving a document classification problem, we deal with textual data. These text files can be saved in various formats, such as .txt, .doc, .xls, and .pdf. The last one is tricky.
A pdf file can be created digitally or from scanned files. While digitally created pdfs have a text layer, scans might not, leaving us with an image of a document. With modern technologies such as OCR, it’s possible to add a text layer on top of the image. However, the results are never 100% accurate: OCR usually produces errors and misinterprets characters. To boost the quality of the outcome, we apply our unique Post-OCR processing step, in which we fix errors that crept into the text file. However, if the scan is distorted or the text is blurred, even Post-OCR might miss the errors. Poor quality of scanned documents is one aspect of data quality.
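To illustrate the general idea behind post-OCR correction (a toy sketch, not Doculayer’s actual Post-OCR step), one simple tactic is to map commonly confused characters back to their likely originals and check the result against a known vocabulary:

```python
# Assumed toy vocabulary and confusion map, for illustration only.
VOCAB = {"invoice", "total", "amount", "due", "payment"}
SWAPS = {"0": "o", "1": "l", "5": "s"}  # digits OCR often mistakes for letters

def fix_word(word):
    """Keep known words; otherwise try undoing common OCR confusions."""
    if word in VOCAB:
        return word
    candidate = "".join(SWAPS.get(c, c) for c in word)
    return candidate if candidate in VOCAB else word

def post_ocr(text):
    """Apply the word-level fix across a whole OCR'd line."""
    return " ".join(fix_word(w) for w in text.lower().split())

print(post_ocr("t0tal am0unt due"))  # total amount due
```

Production systems use language models and context rather than a fixed swap table, but the goal is the same: repair characters the OCR engine misread.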
Second, quality has to do with how balanced the data is. Balanced data means that the data instances per category are evenly distributed, not skewed. For example, suppose we have 2 categories: category 1 has 5000 data instances, and category 2 has 200. During training, chances are the model will learn to output only category 1 as a prediction. We will see high accuracy and be happy with it, but category 2 will never be predicted correctly.
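The arithmetic behind this accuracy trap is easy to verify with the numbers from the example above:

```python
# Skewed dataset from the example: 5000 docs in category 1, 200 in category 2.
n_cat1, n_cat2 = 5000, 200

# A lazy model that always predicts category 1 still scores high accuracy...
accuracy = n_cat1 / (n_cat1 + n_cat2)
print(f"accuracy: {accuracy:.1%}")  # 96.2%

# ...while it never gets a single category-2 document right.
recall_cat2 = 0 / n_cat2
print(f"category-2 recall: {recall_cat2:.0%}")  # 0%
```

This is why accuracy alone is a misleading metric on imbalanced data; per-category measures such as recall expose the failure.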
To be fair, there are augmentation and oversampling techniques to fight skewness in a dataset. Still, those techniques are only partially effective. They can prove insufficient if the sampled data is not representative enough of the real distribution.
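As an illustration of one such technique, random oversampling duplicates minority-class examples until the classes are level. A minimal sketch, assuming the data is a list of `(text, label)` pairs:

```python
import random
from collections import Counter

def oversample(samples, seed=0):
    """Randomly duplicate minority-class items until every class
    matches the size of the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append((text, label))
    target = max(len(items) for items in by_label.values())
    balanced = []
    for items in by_label.values():
        balanced.extend(items)
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced

# A skewed toy set: 5 "Invoice" docs vs 2 "NDA" docs.
skewed = [("doc%d" % i, "Invoice") for i in range(5)] + \
         [("nda%d" % i, "NDA") for i in range(2)]
counts = Counter(label for _, label in oversample(skewed))
print(counts)  # both classes now have 5 instances
```

Note the limitation the text describes: duplicating the same few minority documents adds no new information, so if those documents aren’t representative, the model still won’t generalize.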
Now you may ask: “Is it even possible to have distribution-representative data when things are constantly changing, and there will always be documents that differ entirely from the rest?” It’s acceptable for the model to make incorrect predictions. However, if the percentage of mistakes is too high, it’s a clear sign that the data is of low quality. Keep in mind that models get outdated over time and need to be updated again and again (see Interactive Learning).
Data quality and data volume are the 2 necessary elements of any AI-related project.
So, the answer to the question of whether you really need AI depends on whether you have enough data and whether its quality is good enough to train an ML algorithm.
If you still think that AI is an excellent fit for your challenge, then keep on reading. In the second part of this series, we will help you assess your readiness for AI technologies based on your organization’s maturity model.
About the co-author:
Katya is an Artificial Intelligence specialist on Doculayer’s Machine Learning team. She applies Deep Learning in the domains of Computer Vision and NLP to solve various content management challenges.