"How Does A.I. Diagnose Diabetes?"
JAMA published research demonstrating the efficacy of a deep learning algorithm. They were able to train a deep learning neural network to recapitulate the majority decision of 7 or 8 US board-certified ophthalmologists in the task of grading for diabetic retinopathy. The type of deep learning algorithm used to detect diabetic retinopathy in that study is called a Convolutional Neural Network, or CNN. CNNs enable computer systems to analyze and classify data. When applied to images, CNNs can recognize that an image shows a dog rather than a cat. They can recognize the dog whether it's a small part or a large part of the picture - size doesn't matter for this technique.
They can also classify the dog by breed. CNN systems have also been developed to help clinicians do their work, including selecting cellular elements on pathological slides, correctly identifying the spatial orientation of chest radiographs, and, as Dr. Peng mentioned, automatically grading retinal images for diabetic retinopathy. So let's open the deep learning black box to understand how this works. First, a CNN is not one process.
It's actually a complex network of interconnected processes, organized in layers. With each layer, the CNN can detect higher-level, more abstract features. When the CNN is identifying these features, it uses something called a filter. Here's how Larry Carin, one of the authors of a JAMA Guide to Statistics and Methods article on CNNs, describes a filter: So, we think about a medical image, a medical image in radiology or ophthalmology or dermatology is characterized by local structure, could be textures, it could be edges, it could be curves, corners, etc. And what these filters are doing are constituting little miniature versions of each of these little building blocks. And the way that the CNN looks for these building blocks is the C in CNN, and it stands for convolution. It's a mathematical operation that looks pretty complex. But, actually, it's very simple.
It's a very simple concept. It's kind of like you've got this filter, and you're walking to every part of the image, and you're just asking the question, how much does this image look like that filter? Think of it like this: you have a drawing, that's the image, and you have a stencil, that's the filter. You take that stencil and pass that stencil over that drawing that you have, and as you do that you will see that some parts of the drawing become more visible than others as you do that, right? And that process of sliding that stencil across this drawing is essentially the process of convolution. Now that we've explained what a filter is and introduced the concept of convolution, let's use an analogy to understand the relationship between the filters and the hierarchical structure of the layers in a CNN. The analogy is a written document. In order to communicate through writing, we organize a document as a series of paragraphs, which are composed of sentences; those sentences are composed of words, and the words of letters. So reading a document requires assessing the relationship of letters to one another in increasing layers of complexity, which is a kind of "deep" hierarchy, like the hierarchy in image analysis.
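The stencil-sliding description of convolution above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `slide_filter`, the 4x4 toy image, and the 2x2 diagonal filter are all hypothetical, not taken from the JAMA study. Strictly speaking, the sketch computes a cross-correlation (the filter is not flipped), which is the operation CNN software usually calls convolution.

```python
def slide_filter(image, filt):
    """Slide `filt` over `image`; each output value scores how much the
    patch under the filter looks like the filter (sum of products)."""
    fh, fw = len(filt), len(filt[0])
    out_h = len(image) - fh + 1
    out_w = len(image[0]) - fw + 1
    feature_map = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            score = sum(
                image[r + i][c + j] * filt[i][j]
                for i in range(fh)
                for j in range(fw)
            )
            row.append(score)
        feature_map.append(row)
    return feature_map


# Hypothetical toy "drawing": a single diagonal stroke.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]

# Hypothetical "stencil": a short piece of diagonal line.
diagonal_filter = [
    [1, 0],
    [0, 1],
]

feature_map = slide_filter(image, diagonal_filter)
for row in feature_map:
    print(row)
```

The largest scores appear exactly where the stencil lies on top of the stroke, which is the sense in which some parts of the drawing "become more visible" than others.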
Continuing with our analogy, let's say we're looking for the phrase Ada Lovelace in a paragraph. Ada Lovelace was a mathematician and writer who lived in the 19th century. And she holds the honor of having published the very first algorithm intended to be used by a machine to perform calculations, which makes her the first ever computer programmer. In the first layer of the network, a CNN looks for the basic building blocks of an image. The basic building blocks of written language are letters. So in this analogy, the filters the CNN uses in the first layer would be letters. Let's take on the word "Ada." Here is what the convolution process would look like for the letter A. When the "A" filter overlies the letter "A" in the original image, the convolution output would generate a strong signal. This signal would then be mapped onto something called a feature map. The feature map represents how well elements in the image align with the filter. If something is there, the signal outputs white. If nothing is there, the signal outputs black. CNNs generate a feature map for every filter.
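To make the letter analogy concrete, here is a minimal sketch of a first layer that builds one feature map per letter filter. It assumes a deliberately simplified setup that is not part of the original explanation: the "image" is a plain string, each filter is a single letter, and the signal is 1 (white) on a match and 0 (black) otherwise. The names `letter_feature_map` and `first_layer` are hypothetical.

```python
def letter_feature_map(text, letter):
    """Return 1 (white) at each position where `letter` appears in `text`
    and 0 (black) everywhere else. Matching ignores case."""
    return [1 if ch == letter else 0 for ch in text.upper()]


text = "Ada Lovelace"

# One feature map per letter filter, as in the first layer of the analogy.
first_layer = {
    letter: letter_feature_map(text, letter) for letter in "ACDELOV"
}

for letter, feature_map in first_layer.items():
    print(letter, feature_map)
```

A real CNN learns its filters from data and produces graded rather than all-or-nothing signals; the 0/1 values here only mirror the black-and-white picture used in the analogy.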
So in our analogy, there would be a feature map for every letter. These feature maps would then become the input for the second layer. In this layer, the CNN would spatially align and "stack" all those maps from the previous layer. This would allow the CNN to then look for short, specific sequences of letters in all the feature maps simultaneously. So the CNN would use a new set of filters to look for specific letters that are adjacent to one another in particular sequences. In our analogy, the second layer would look for places where the letters A, D, and A are in sequence together making the word "ADA". It would also look for places where letters A, C, E, L, O and V are adjacent to one another using filters for LOVE and LACE. The output of the second layer would be the feature maps for those three sequences of letters.
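The second-layer step described above, stacking the letter maps and looking across all of them for short sequences, might be sketched as follows. The code is a toy under stated assumptions: `letter_maps` and `sequence_map` are hypothetical helpers, and the rule "subtract a bias, then clip at zero" is a simple stand-in for the learned weights and nonlinearity of a real CNN.

```python
def letter_maps(text, alphabet):
    """First layer: one 0/1 feature map per letter, all the same length."""
    upper = text.upper()
    return {
        letter: [1 if ch == letter else 0 for ch in upper]
        for letter in alphabet
    }


def sequence_map(stacked, sequence):
    """Second layer: slide a multi-letter filter across the stacked maps.

    At each start position, count how many letters of `sequence` line up,
    subtract a bias of len(sequence) - 1, and clip at zero, so only a
    complete match produces a signal.
    """
    length = len(next(iter(stacked.values())))
    k = len(sequence)
    out = []
    for start in range(length - k + 1):
        hits = sum(
            stacked[letter][start + offset]
            for offset, letter in enumerate(sequence)
        )
        out.append(max(0, hits - (k - 1)))
    return out


text = "Ada Lovelace"
stacked = letter_maps(text, "ACDELOV")
second_layer = {
    seq: sequence_map(stacked, seq) for seq in ("ADA", "LOVE", "LACE")
}

for seq, feature_map in second_layer.items():
    print(seq, feature_map)
```

Each second-layer map is shorter than the text because a longer filter has fewer positions it can occupy; the position of the strong signal marks where the sequence starts.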
In other words, in those feature maps, strong signals would be present where the sequences ADA, LOVE and LACE are located in the original paragraph. In the third layer, the CNN would stack and align these three new maps and perform more convolutions, this time identifying where longer words and groups of words are located. So the CNN could at this point identify where in the original paragraph the sequences of letters and words making the phrase "ADA LOVELACE" are located. In our analogy, we were looking for a phrase consisting of only two words. Had we been looking for a longer sentence or even a paragraph, the CNN would deal with the greater complexity by having more layers. We've omitted quite a few details about CNNs for simplicity, but this captures the essence of the model. But what does this look like for actual images, like identifying diabetic retinopathy from an ocular photograph? Images are made out of pixels rather than letters. In a digital context, a pixel is the smallest, controllable unit of an image represented on a display. Each pixel is a representation of a tiny portion of the original image. Think about pixels like creating a drawing with dots where every dot has a color value and an intensity.
The more dots used, the clearer the image becomes. The filters a CNN uses in that first layer are small squares of pixels that correspond to things like textures, contrast between two colors, or edges. These are the image analysis-equivalents of the letters used in our analogy. And as a CNN goes up in the hierarchy, it looks for combinations of these filters, getting more and more complex with each layer. As the complexity increases, the CNN gets closer to identifying what it's looking for. So the specific features analyzed at each layer help put the whole thing together. So, for example, some of the earlier work showed that some layers tend to be better at extracting, sort of like, edge-like information. Meaning that, for example, if you combine different kinds of horizontal edges, we might get a continuous line that resembles the retinal blood vessels. And as you combine more of those and start to encode more higher-level concepts such as, you know, is there a micro-aneurysm here, is there bleeding over here, is there other lesions in the image? And right at the very end is where these, after these multiple layers, the network will try to then condense all of that information down into a final prediction. In this case, severe diabetic retinopathy.
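As a rough illustration of the path from pixels to a condensed signal, the sketch below applies a small edge filter to a toy grayscale image, keeps only positive responses, and then reduces the whole feature map to one number. Everything here is an assumption made for illustration: the pixel values, the hand-written `horizontal_edge` filter, and the helper names are hypothetical, and a trained network would learn many filters and use further layers to reach an actual prediction.

```python
def convolve(image, filt):
    """Slide a small square of weights over the pixel grid (no padding)."""
    fh, fw = len(filt), len(filt[0])
    return [
        [
            sum(
                image[r + i][c + j] * filt[i][j]
                for i in range(fh)
                for j in range(fw)
            )
            for c in range(len(image[0]) - fw + 1)
        ]
        for r in range(len(image) - fh + 1)
    ]


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def global_max(feature_map):
    """Condense a whole feature map into its single strongest response."""
    return max(max(row) for row in feature_map)


# Hypothetical 4x4 grayscale image: a bright horizontal band on a dark background.
image = [
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
    [0, 0, 0, 0],
]

# Hand-written filter that responds to a dark-to-bright step going downward.
horizontal_edge = [
    [-1, -1],
    [1, 1],
]

edges = convolve(image, horizontal_edge)
activated = relu(edges)
summary = global_max(activated)

print(edges)
print(activated)
print(summary)
```

The top edge of the band gives a strong positive response, the flat interior gives none, and the bottom edge gives a negative response that the clipping step discards.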
Developing a CNN to help identify diabetic retinopathy was motivated by the fact that many patients with diabetes are not getting screened frequently enough. We have to screen diabetic patients once a year or we should, and there are some barriers to getting that done. Some of it is just, you know, not having enough trained professionals to do that task. It's also not having that expertise available where the patient is. It's not that, you know, there aren't retina specialists in a metropolitan city four hours away, it's that there isn't a retina specialist at your grocery store. And CNNs could facilitate the integration of diabetic retinopathy and other screening programs into primary care. But before that happens, more research, especially prospective clinical trials, is needed. The way we do approach these things is really the way that medicine usually works, which is to say, "let's do validations of the method again and again and again until we're sure, we're reasonably confident that it really works on many kinds of images, in many settings for, you know, many different patient populations." And so from my perspective that's really at the end of the day what's most important: does it work on real patients and is it reliable? The excitement generated by early results has already spurred several research groups to look into the efficacy of CNNs in clinical practice, which could potentially finally get CNNs from the bench to the bedside. I think we're on the third or fourth technological revolution where neural networks are coming to the forefront, and I really hope that this time we'll get it right.
But there were failures in the past where people used the technology in suboptimal ways and we don't want it to happen again. One has to make sure that we have appropriate and sufficient data for development, validation and testing, and that we're solving actual clinical problems. At the end of the day, one thing to take away is that even if, as a clinician, it can be hard to understand exactly how a CNN arrives at its diagnosis, it can still be a useful tool. And this is similar to how many clinicians use other widely adopted technologies. Consider antibodies: You know, as a clinician I may not know exactly where that part of an antibody kind of binds to, but after looking at some of this clinical validation of using Lucentis, for example, for an injection... This is kind of like any new breakthrough technology: needs validation and needs transparency, but the medical community in general responds very well to new technologies that have been validated.