In this article you will find out how the most advanced data classification technology on the market works: the kind boosted by Artificial Intelligence and Machine Learning. You will also learn what this technology offers organisations that want to improve the security of their data, especially their most sensitive and confidential information.
TABLE OF CONTENTS
- 1. The current data classification solutions
- 2. Disadvantages of the current data classifiers
- 3. AI and Machine Learning applied to Information Classification
- 4. Improvement of the model
- 5. Different data classification dimensions and the need for a flexible approach
- 6. Benefits of AI and Machine Learning applied to Data Classification
- 7. SealPath Data Classification powered by Getvisibility boosted by AI and Machine Learning
The current data classification solutions
The process of information labelling allows organisations to identify and establish the nature of their data in terms of sensitivity, that is, the degree of damage the organisation could suffer if the data were extracted and disseminated. This enables the right information to reach the right people at the time they need it, and keeps sensitive information out of the hands of those who should not have access to it.
The classification process itself is quite simple: when a document or email is created, its owner assigns the appropriate confidentiality level. This level indicates how widely the data may be distributed: Public, Internal, Confidential or Secret. It identifies and communicates the level of protection the data should have and the audience that may consume them. A confidential document, for example, should not be distributed publicly.
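As a minimal sketch of the idea above, the four levels can be modelled as an ordered scale and checked against the audience a document is about to reach. The audience names and the mapping between audiences and levels are hypothetical, chosen only for illustration; a real policy would define its own.

```python
from enum import IntEnum


class Level(IntEnum):
    """The four confidentiality levels named above, ordered by sensitivity."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    SECRET = 3


# Hypothetical audiences and the highest level each one may receive.
MAX_LEVEL_FOR_AUDIENCE = {
    "anyone": Level.PUBLIC,
    "employees": Level.INTERNAL,
    "project_team": Level.CONFIDENTIAL,
    "named_individuals": Level.SECRET,
}


def may_distribute(doc_level: Level, audience: str) -> bool:
    """Return True if a document at doc_level may be sent to the audience."""
    return doc_level <= MAX_LEVEL_FOR_AUDIENCE[audience]
```

With this mapping, `may_distribute(Level.CONFIDENTIAL, "anyone")` is false, which matches the rule that a confidential document should not be distributed publicly.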
Disadvantages of the current data classifiers
Many organisations use classification systems and have established very precise policies, but in reality these are usually difficult to implement. Information is normally classified according to a theoretical model rather than one that takes into account the specific context and is adapted to the reality of the organisation and its day-to-day work. This leads to confusion when applying classification levels and sharing information. Many users ask: Is it internal or confidential? Can I share this document if I label it as confidential?
User errors can lead to the exposure of critical information and open the door to exfiltration by malware, ransomware or malicious agents. Despite the effort the organisation has invested, the result can be the opposite of the improved information security that data classification policies are introduced to achieve.
AI and Machine Learning applied to Information Classification
Artificial Intelligence and machine learning techniques are of great value: they give security analysts a faster and more efficient way of evaluating potential threats. Applied to security, they make it possible to detect strange patterns or unusual user behaviour and so help anticipate attacks against our information.
Machine learning algorithms also make this advanced data classification technology possible, raising the accuracy with which it identifies the characteristics that make the specific contents of a file confidential. The models are trained in advance, over several years, on data containing personal, medical, financial and other information. This prior learning helps predict the sensitivity level of data that have not previously been labelled.
The machine learning algorithms used may include support vector machines (SVM), neural networks, logistic regression, linear regression, decision trees and natural language processing (NLP) techniques, among others.
[Figure: model example]
Once trained, the Machine Learning system decides how a document or email should be classified by running inference over a set of parameters extracted from it. The trained model, together with powerful specialised data classification software, helps minimise human error, cost and time in labelling corporate information.
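To make the train-then-infer idea concrete, here is a minimal sketch using logistic regression, one of the algorithm families listed above, over simple word counts. It is an illustration under stated assumptions, not the model used by any product: the training snippets are made up, the two classes (sensitive or not) are a simplification, and the function names are hypothetical.

```python
import math
from collections import Counter

# Toy, made-up training snippets: label 1 = sensitive, 0 = not sensitive.
TRAIN = [
    ("patient diagnosis and medical record", 1),
    ("credit card number and bank account", 1),
    ("salary details and passport number", 1),
    ("lunch menu for the office canteen", 0),
    ("public press release about the new office", 0),
    ("agenda for the team social event", 0),
]


def features(text):
    """Bag-of-words parameters of a document: token -> count."""
    return Counter(text.lower().split())


def predict_proba(weights, bias, text):
    """Inference: probability that the text is sensitive."""
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in features(text).items())
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=200, lr=0.5):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for text, label in samples:
            error = predict_proba(weights, bias, text) - label
            for tok, n in features(text).items():
                weights[tok] = weights.get(tok, 0.0) - lr * error * n
            bias -= lr * error
    return weights, bias


weights, bias = train(TRAIN)
```

After training, `predict_proba(weights, bias, text)` returns a score between 0 and 1 for any new text; a score above 0.5 would be read as "sensitive". A production system would use far richer parameters and much more data than this word-count toy.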
Improvement of the model
The accuracy of the AI system's verdicts depends heavily on the data the model has been trained on. The parameters that define sensitivity in a financial services company may differ significantly from those in other sectors, such as the industrial sector. Hence the importance of a properly trained model: an improperly trained one can lead to erroneous evaluations of the sensitivity level.
Continuously training these models with sectorial data, specific to the organisation's sector or activity and the particular type of data it creates, improves the accuracy of the decision systems even further. As well as being continuously fed sectorial and regulatory data, the models receive feedback from user verdicts on their classifications and improve accordingly. This makes it possible to correct inaccurate inferences for a specific organisation over successive iterations. Thanks to AI, the system can suggest classification types to users, so they do not need to be trained to classify documents, as they do with more rudimentary data classification systems. Accuracy of 97% can be achieved when identifying intellectual property.
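The suggest-and-correct loop described above can be sketched as follows. This is a deliberately simple stand-in, assuming a token-overlap score in place of a real trained model; the class name `FeedbackClassifier` and its methods are hypothetical and do not correspond to any product API.

```python
from collections import Counter, defaultdict


class FeedbackClassifier:
    """Suggests a label and learns from each user verdict (hypothetical design)."""

    def __init__(self):
        # label -> token -> how often the token appeared under that label
        self.token_counts = defaultdict(Counter)

    def suggest(self, text):
        """Return the label whose confirmed documents share the most tokens, or None."""
        tokens = text.lower().split()
        scores = {
            label: sum(counts[tok] for tok in tokens)
            for label, counts in self.token_counts.items()
        }
        if not scores or max(scores.values()) == 0:
            return None  # nothing learned yet that matches this text
        return max(scores, key=scores.get)

    def record_verdict(self, text, confirmed_label):
        """Fold the user's accepted or corrected label back into the model."""
        self.token_counts[confirmed_label].update(text.lower().split())


clf = FeedbackClassifier()
clf.record_verdict("quarterly revenue forecast", "Confidential")
clf.record_verdict("canteen lunch menu", "Public")
suggestion = clf.suggest("revenue forecast draft")
```

Each call to `record_verdict` is one iteration of feedback: the user either confirms the suggestion or supplies the correct label, and later suggestions for similar documents reflect that verdict.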
Figure 3 shows how accuracy for new types of documents can improve after a few scans, reaching more than 90%. The software learns and adapts to different types of documents without requiring a thorough review and classification by staff.
With staff assistance and authorisation, anonymised vector signatures of files can be used to improve accuracy. With staff assistance, the improvement in accuracy can be accelerated by approximately 8% for every hour a user spends reviewing classifications (depending on the variety of files scanned).
Different data classification dimensions and the need for a flexible approach
When data are classified, the following dimensions can be distinguished:
- Data sensitivity: determines the level of damage the data can cause to the organisation if they fall into the wrong hands. Typical categories include: highly confidential, internal confidential, public.
- Associated regulation: some documents can be classified according to the regulation that applies to the information they contain, for example EU-GDPR for personal data or PCI for payment and credit card data.
- Data types: a document can contain personal data, financial data, health data and so on. These types of data are often directly related to regulations, such as personal data and EU-GDPR.
- Reach of dissemination: establishes the extent to which a specific critical document can be distributed: internal dissemination, suppliers, etc.
Restricting classification to a scheme based exclusively on the sensitivity of the information is not always the most appropriate approach. With rudimentary classification tools, users may be left wondering which sensitivity level to assign to a document. Some organisations have well-defined, rule-based classification schemes built on the sensitivity or criticality of the information (e.g., the NATO classification scheme), but this is not the norm in most companies and organisations.
In regulated sectors such as finance, it can be more useful to label data as subject to PCI regulation than to use labels based on criticality or dissemination scope, since this is much easier for corporate users to identify. Classification based on AI and Machine Learning can adapt flexibly to all of these dimensions rather than being limited to a single classification based on sensitivity level. It can also take all dimensions into account when making suggestions to the user.
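One way to picture a label that is not limited to sensitivity is a small record carrying all four dimensions discussed in this section. The sketch below is an assumption about how such a label could be structured, not a description of any product's data model; the field names and the `REGULATION_FOR_TYPE` mapping are hypothetical, with values taken from the examples in the text.

```python
from dataclasses import dataclass, field


@dataclass
class ClassificationLabel:
    """One label carrying the four dimensions discussed in this section."""
    sensitivity: str                               # e.g. "highly confidential"
    regulations: set = field(default_factory=set)  # e.g. {"EU-GDPR", "PCI"}
    data_types: set = field(default_factory=set)   # e.g. {"personal", "financial"}
    dissemination: str = "internal"                # e.g. "internal", "suppliers"


# Assumed mapping from data type to related regulation, based on the examples above.
REGULATION_FOR_TYPE = {"personal": "EU-GDPR", "payment card": "PCI"}


def with_implied_regulations(label):
    """Return a copy of the label with regulations implied by its data types added."""
    implied = {
        REGULATION_FOR_TYPE[t] for t in label.data_types if t in REGULATION_FOR_TYPE
    }
    return ClassificationLabel(
        label.sensitivity,
        label.regulations | implied,
        set(label.data_types),
        label.dissemination,
    )


label = ClassificationLabel("highly confidential", data_types={"personal", "financial"})
enriched = with_implied_regulations(label)
```

Here a user only needs to say that a document holds personal and financial data; the regulation dimension is then filled in from the data types, which is the kind of cross-dimension suggestion the text describes.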
Benefits of AI and Machine Learning applied to Data Classification
Thanks to this technology, organisations can understand their data footprint with greater accuracy. They can fully trust the classification and know where they currently stand with regard to the risk of the data they manage. This is a very important point, since knowing the risk attached to all their data enables proactive management and security.
Another key point is to make the classification process easier for users, so that errors with potentially serious consequences are avoided. Users will not need training to classify documents; anyone can do it correctly. Technologies must be easy to implement and use, but they must also be effective, so that adopting them does not incur high costs or waste time without delivering fully reliable results. This is exactly what this innovative technology based on AI and ML is intended to avoid.
SealPath Data Classification powered by Getvisibility boosted by AI and Machine Learning
SealPath is a leading provider of data-centric security and digital rights management that applies the latest Artificial Intelligence and Machine Learning technology in its “SealPath Data Classification powered by Getvisibility” solution.
SealPath Data Classification provides an advanced solution for visibility, protection, control and dynamic understanding of data as they are created. This innovative tool for AI-enhanced data classification and automatic protection of tagged information provides the technology that corporate customers need to classify data securely and precisely throughout their life cycle. In this way, organisations in any sector gain the capacity to avoid data leaks and achieve compliance with the most stringent data protection regulations.
With SealPath Data Classification, the user receives suggestions about the classification level when creating and editing a document. The software learns and adapts to different types of documents, continuously improving its accuracy through AI, and allows organisations to classify unstructured information with unprecedented confidence.
SealPath’s information protection tool is fully integrated with the intelligent data classification system, so that files classified at a certain level or subject to a specific regulation can also be protected automatically, without user intervention, by SealPath’s digital rights management solution.
SealPath’s protection, together with the classification system boosted by AI and Machine Learning, helps organisations avoid data classification errors quickly and cost-effectively.