Patrick Naim, risk modelling expert.
Published September 24, 2019
This article is a summary of the preliminary results of a research project on the use of neural networks for learning the language of risk. This project was carried out last summer with 3 brilliant students (1) from Mahindra Ecole Centrale in India.
We (2) belong to the previous generation of neural network researchers, from a time when you had to watch the error signal slowly decrease for days on a text console on your PC. Several years ago we set neural networks aside to focus on Bayesian networks and their application in the field of risk modelling.
For several months now we have been following the growing interest of the risk management community in neural networks - now called AI in a somewhat exaggerated way. We are generally doubtful about the use of neural networks in the field of risk. Indeed, the new generation of neural networks has proven very effective on applications with a large number of so-called "labelled" examples, i.e. examples for which we know the answer. If you have ever identified bridges or fire hydrants for a captcha, you know how Google makes you work to train its neural networks.
Risks, and in particular major risks, are totally unsuited to this supervised learning framework. For a neural network to learn to recognize a trader who could become the next Kerviel, it would have to be shown many such traders. Fortunately for banks, and unfortunately for neural networks, there are at most one or two per year.
Neural networks are the subject of very intensive research for image recognition. The main application envisaged is the autonomous car. This also explains the images of bridges, traffic lights, and fire hydrants.
The other success of neural networks, perhaps a little more discreet, but potentially just as disruptive, can be observed in language processing, and particularly in translation. An article by Yoshua Bengio published in 2006 laid the foundations for the use of a neural network to represent, and ultimately "understand", text.
Unlike an image, which can be understood by looking at the position of its pixels, a word cannot be understood by looking at its letters. Neither the letters of a word nor its pronunciation give any information about its meaning. It is therefore necessary to find a representation of a word that conveys its meaning. Bengio, with other researchers, proposed the idea of representing a word by its context. This is somewhat like what we do when we try to follow a conversation in a foreign language in which we are not very fluent. By catching one or two words of context, we can remove the ambiguity of other words and understand the overall meaning.
To put this idea into practice, Yoshua Bengio (3) had the idea of training a neural network to learn the next word in a sequence of words. Let us explain this in more detail. We start by constructing a trivial representation for each word. Assume we have a vocabulary of 10,000 words. The words are ordered, and each word is represented by a vector with 10,000 coordinates. All the coordinates are set to 0, except one, which corresponds to the word's position in this ordering. For instance, if the word "fraud" is the 2,327th word in the vocabulary, it will be represented by a vector with 2,326 "0"s, a single "1", then another 7,673 "0"s.
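To make this concrete, here is a minimal sketch in Python of such a one-hot encoding. It uses a toy four-word vocabulary instead of 10,000 words; the word list and indices are ours, for illustration only:

```python
import numpy as np

vocabulary = ["the", "fraud", "bank", "risk"]   # toy stand-in for 10,000 words
word_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word, vocab_size=len(vocabulary)):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = np.zeros(vocab_size)
    vector[word_index[word]] = 1.0
    return vector

print(one_hot("fraud"))  # [0. 1. 0. 0.]
```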
Then we build a neural network to perform the following task: predict the next word of a sequence using the preceding words as inputs. If we use a window of, say, five words, the input layer will have 50,000 input neurons, and the output layer will have 10,000 neurons. The trick is to force the learning to take place through a compact representation of words, for instance using a vector of only 100 dimensions, as shown in the graph below.
On the internal layers, each word is represented as a vector of 100 neurons. This internal vector of 100 neurons is obtained by applying the same mapping, with shared weights, to the vector of 10,000 neurons initially representing each word. Then, the 5 vectors of 100 neurons corresponding to the 5 input words are connected to another vector of 100 neurons, which is trained to represent the following word.
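For illustration, here is a sketch of such a fixed-window network in Keras (a modern library; the original work used bespoke code, so this is our reconstruction, not Bengio's implementation). The sizes follow the text: a 10,000-word vocabulary, a 5-word window, and 100-dimensional internal vectors. Note that feeding integer word indices to an Embedding layer is equivalent to multiplying one-hot vectors by a shared weight matrix:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 10_000   # vocabulary size
WINDOW = 5            # number of preceding words used as input
EMBED_DIM = 100       # size of the internal word representation

inputs = layers.Input(shape=(WINDOW,), dtype="int32")       # 5 word indices
# The Embedding layer is the shared mapping: the same weight matrix
# turns every word into its 100-dimensional representation.
embedded = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)  # shape (5, 100)
merged = layers.Flatten()(embedded)                          # 500 values
hidden = layers.Dense(EMBED_DIM, activation="tanh")(merged)  # next-word vector
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```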
The training is done on a corpus of text that is as large as possible, by "scrolling" this corpus in front of the neural network and training it to provide the word most likely to follow the previous 5 words. Once this learning is completed, the vector of 100 neurons representing a word is a "semantic" representation of that word, in the sense that two words with close meanings will have close representations. Once you have a representation for a word, you can, by various methods, neural or not, generate a representation for a sentence and for a text. The discussion on this subject is too long to be addressed here. Geoffrey Hinton, another neural network "guru", has summarized the function of these networks as creating "Thought Vectors" containing the meaning of a text (4).
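As an illustrative check of this "close meaning, close representation" property, one can compare two learned word vectors with cosine similarity. The matrix and word indices below are random stand-ins for trained weights, for the sake of a runnable sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(10_000, 100))  # stand-in for trained weights
word_index = {"fraud": 2326, "theft": 2327}        # hypothetical indices

def cosine(u, v):
    """Cosine similarity: close to 1.0 for words with close meanings."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(embedding_matrix[word_index["fraud"]],
             embedding_matrix[word_index["theft"]]))
```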
The principle of this method is used by Google Translate and other translation applications. At this very moment, and out of laziness, I am writing this article in French and having it translated live by DeepL, another online neural translator.
To apply this in the area of risk, we wanted to answer 2 questions:
1) Is it possible to automatically classify a text to assign it a risk category?
2) Does the use of a vector representation of words learned from a text corpus specialized in risk management produce better results?
We used a corpus of 1,500 articles or press releases related to Conduct Risk cases, to be classified into 8 main categories, as defined in our book Operational Risk Modeling in Financial Services: The Exposure, Occurrence, Impact Method.
To answer question 1, we created a Thought Vector representing each article simply by averaging the Thought Vectors of each of its words. We used off-the-shelf Thought Vector mappings - also called embeddings - as provided by Google, Facebook, Stanford, etc. (5). Then we taught a small neural network to associate the Thought Vector of a text with the matching category. The performance obtained was already excellent, as the best model had only a 5% error on the training set and 12% on the out-of-sample test.
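A sketch of this pipeline, under our assumptions (embeddings loaded into a Python dict of 100-dimensional vectors, 8 categories as in the text); the function names and the classifier shape are illustrative, not our exact model:

```python
import numpy as np
from tensorflow.keras import layers, models

def document_vector(text, embeddings, dim=100):
    """Average the embeddings of the words found in the text."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

# Small classifier: one document vector in, one of 8 categories out.
classifier = models.Sequential([
    layers.Dense(64, activation="relu", input_shape=(100,)),
    layers.Dense(8, activation="softmax"),   # 8 conduct-risk categories
])
classifier.compile(optimizer="adam",
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# classifier.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=20)
```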
To answer question 2, we trained a specialized Thought Vector representation from the same corpus, which included about 15,000 unique words used in a risk description context. This work was much longer and more complex, since we had to recreate a neural network similar to the one proposed by Bengio. And the answer was again positive! Using a specialized "risk language" improved performance, as the best model had less than 1% error on the training set and 8% error on the out-of-sample test.
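For readers who want to try this themselves, a library such as gensim offers word2vec, a descendant of the Bengio approach, as a shortcut; the sketch below is illustrative and is not the custom network we rebuilt. The two-sentence corpus is a toy stand-in for our 1,500 tokenized articles:

```python
from gensim.models import Word2Vec

# Toy stand-in for the tokenized risk corpus:
corpus = [["bank", "fined", "for", "mis-selling", "of", "products"],
          ["trader", "charged", "with", "fraud", "at", "the", "bank"]]

# 100-dimensional vectors, 5-word window, as in the text.
w2v = Word2Vec(sentences=corpus, vector_size=100, window=5,
               min_count=1, workers=1)
print(w2v.wv.most_similar("bank"))   # nearest words in the learned space
```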
You can have a look at our results on the following website http://risk2vec.mstar.org.uk/
The first page shows a 2D mapping of the articles used. Each point is a projection of the 100-dimensional vector onto its 2 most significant dimensions. One already sees that the Thought Vectors did a good job of separating examples, but that some boundaries are fuzzier (mis-selling retail vs mis-selling wholesale, mis-selling retail vs improper loan management, etc.).
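Such a projection can be sketched with PCA, assuming a matrix of article Thought Vectors; the data below is a random stand-in so the sketch runs on its own:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Stand-ins for the (n_articles, 100) Thought Vectors and their categories:
rng = np.random.default_rng(0)
X = rng.normal(size=(1500, 100))
labels = rng.integers(0, 8, size=1500)

# Keep the two most significant dimensions and plot one point per article.
points = PCA(n_components=2).fit_transform(X)
plt.scatter(points[:, 0], points[:, 1], c=labels, cmap="tab10", s=10)
plt.title("Articles projected onto the two most significant dimensions")
plt.show()
```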
You may also want to test our classifier on a new article. Enter the URL of an article about a recent case of antitrust, mis-selling, or financial crime, for example, and observe the conclusion of the network. Here are some URLs that we have tested and that work well. They are very recent and have therefore not been used in training. Just copy the URL and enter it in the input box provided in the demo:
The following one is interesting. It is an article from Reuters discussing a case of customer overcharging at Danske Bank. However, the article talks more about the recent money laundering allegations at Danske Bank. You will see that the classification reflects this. This also shows that press articles are more difficult to classify than official regulators' press releases.
You can also test our neural network on the paper presenting the top 10 op risks for 2019 as defined by Risk.net:
Interestingly, 2 conduct risks are listed in the top 10: Regulatory Risk (#7) and Mis-selling (#10). Both are shown in the neural network output.
Don't hesitate to use your own links, and let us know if you find that it doesn't work well! It is a work in progress, which will improve, and for which we have many applications in mind.
(1) Hemanth Chaturvedula, Shreyas Rajesh and Niraj Srinivas
(2) Laurent Condamin and Patrick Naim
(3) See for instance Neural Probabilistic Language Models (Bengio et al.)
(4) See for instance How Google Converted Language Translation Into a Problem of Vector Space Mathematics
(5) For Stanford (GloVe) representations: https://nlp.stanford.edu/projects/glove/ To be precise, we used the Wikipedia 2014 + Gigaword 5 (6B tokens, 400K vocab) embeddings to compare our results.
For Google (word2vec) representations: https://code.google.com/archive/p/word2vec/ The embeddings we used can be found in GoogleNews-vectors-negative300.bin.gz.
For Facebook (fastText) representations: https://fasttext.cc/docs/en/support.html The embeddings we used are in wiki-news-300d-1M.vec.zip on the webpage https://fasttext.cc/docs/en/english-vectors.html