DTSA 5511 - Cancer Detection with Convolutional Neural Networks
Millions of people develop cancer each year. In most cases, imaging is used to help detect potentially cancerous masses. As more and more people are screened for cancer each year, deep learning and convolutional neural networks can help ease the burden on doctors to correctly identify cancerous masses. Even a ‘pre-human’ detection program that classifies images, and possibly highlights the suspicious mass, can help doctors and practitioners screen more people for cancer.
import numpy as np
import pandas as pd
import re
import string
from os import listdir
# Plotting Images
import matplotlib.pyplot as plt
# Image Processing
from PIL import Image
# Metrics
from sklearn.metrics import accuracy_score, auc, f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Tensorflow
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras import datasets, layers, models
Data Exploration
As the data are in an image format (TIF to be exact), we must first create a dataset of the transformed image data.
path_train = '/kaggle/input/histopathologic-cancer-detection/train/'
path_test = '/kaggle/input/histopathologic-cancer-detection/test/'
images_train = [f for f in listdir(path_train)]
images_test= [f for f in listdir(path_test)]
print('Train Images:', len(images_train))
print('Test Images:', len(images_test))
print('Split: %2.f' %(len(images_test)/(len(images_test)+len(images_train))*100),'%')
Train Images: 220025
Test Images: 57458
Split: 21 %
There are 220,025 images in the training set and 57,458 images in the testing set, so the test set makes up about 21% of all images.
We can now explore a single image and its characteristics.
im = Image.open('/kaggle/input/histopathologic-cancer-detection/train/00001b2b5609af42ab0ab276dd4cd41c3e7745b5.tif')
print(im.format, im.size, im.mode)
plt.imshow(im)
TIFF (96, 96) RGB
<matplotlib.image.AxesImage at 0x7f502a366ec0>

We can see that this looks like a medical image. I have no medical training, so I cannot comment on exactly what is in the picture, but I can see various differently colored circles that a model, or an untrained person like myself, might classify as cancer. The image is 96x96 pixels and is in the RGB colorspace.
We can now explore the training labels.
train_labels = pd.read_csv('/kaggle/input/histopathologic-cancer-detection/train_labels.csv')
train_labels['path'] = train_labels['id'] + '.tif'
train_labels
| | id | label | path |
|---|---|---|---|
| 0 | f38a6374c348f90b587e046aac6079959adf3835 | 0 | f38a6374c348f90b587e046aac6079959adf3835.tif |
| 1 | c18f2d887b7ae4f6742ee445113fa1aef383ed77 | 1 | c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif |
| 2 | 755db6279dae599ebb4d39a9123cce439965282d | 0 | 755db6279dae599ebb4d39a9123cce439965282d.tif |
| 3 | bc3f0c64fb968ff4a8bd33af6971ecae77c75e08 | 0 | bc3f0c64fb968ff4a8bd33af6971ecae77c75e08.tif |
| 4 | 068aba587a4950175d04c680d38943fd488d6a9d | 0 | 068aba587a4950175d04c680d38943fd488d6a9d.tif |
| ... | ... | ... | ... |
| 220020 | 53e9aa9d46e720bf3c6a7528d1fca3ba6e2e49f6 | 0 | 53e9aa9d46e720bf3c6a7528d1fca3ba6e2e49f6.tif |
| 220021 | d4b854fe38b07fe2831ad73892b3cec877689576 | 1 | d4b854fe38b07fe2831ad73892b3cec877689576.tif |
| 220022 | 3d046cead1a2a5cbe00b2b4847cfb7ba7cf5fe75 | 0 | 3d046cead1a2a5cbe00b2b4847cfb7ba7cf5fe75.tif |
| 220023 | f129691c13433f66e1e0671ff1fe80944816f5a2 | 0 | f129691c13433f66e1e0671ff1fe80944816f5a2.tif |
| 220024 | a81f84895ddcd522302ddf34be02eb1b3e5af1cb | 1 | a81f84895ddcd522302ddf34be02eb1b3e5af1cb.tif |
220025 rows × 3 columns
The label data itself is minimal. It contains the id, which is the file name without the ‘.tif’ extension, and the label itself.
print('Training Label Size:', len(train_labels['id']))
print('Missing Labels:', len(train_labels[train_labels['label'].isna()==True]))
print('Bad Labels:', len(train_labels[(train_labels['label']>1)|(train_labels['label']<0)]))
Training Label Size: 220025
Missing Labels: 0
Bad Labels: 0
fig, ax = plt.subplots()
df = train_labels['label'].value_counts().reset_index()
df
ax.pie(df['count'], labels=df['label'], autopct='%1.1f%%')
[Figure: pie chart of training label counts, 59.5% label 0 and 40.5% label 1]

Compared with the outputs above, the number of training labels matches the total number of training images. There are no missing labels and no labels outside of 0 and 1. The training data is moderately imbalanced, with 59.5% negative (0) and 40.5% positive (1) instances.
We can now explore some images from each label.
n = 10
label = 0
fig, ax = plt.subplots(1,n,figsize=(20,2))
df = train_labels[train_labels['label']==label].head(n)['path'].values
for i in range(len(df)):
    im = Image.open(path_train+df[i])
    ax[i].imshow(im)
fig.suptitle('Label: '+str(label))
plt.show()

n = 10
label = 1
fig, ax = plt.subplots(1,n,figsize=(20,2))
df = train_labels[train_labels['label']==label].head(n)['path'].values
for i in range(len(df)):
    im = Image.open(path_train+df[i])
    ax[i].imshow(im)
fig.suptitle('Label: '+str(label))
plt.show()

Looking at these two sets of images, to an untrained eye there is not much that distinguishes one label from the other. It will be interesting to see how well the CNN model does.
Dataset Creation and Split
We must now create our dataset from the images and labels. We can use TensorFlow's ImageDataGenerator to generate the training and validation data. Unfortunately, after repeated testing, there were major performance concerns with using the entire training set of over 220,000 images. We will instead use a subset of the first 10,000 images (8,500 for training and 1,500 for validation), which makes both the ImageDataGenerator calls that create the two subsets and the model training itself faster.
We need to choose the training and validation subsets explicitly and set shuffle = False, so that predictions come back in the same order as the labels for post-fit analysis.
# takes a bit to run
train_labels['label_s'] = train_labels['label'].apply(lambda x: str(x))
batch_size = 100
# data_gen = ImageDataGenerator(rescale=1/255, validation_split =0.15)
data_gen = ImageDataGenerator(rescale=1/255)
training = data_gen.flow_from_dataframe(
    dataframe = train_labels.iloc[0:8500],
    directory = path_train,
    x_col = 'path',
    y_col = 'label_s',
    batch_size = batch_size,
    shuffle = False,
    class_mode = 'binary', # binary labels need to be strings
    target_size = (96,96),
    # subset = 'training',
    seed = 101,
    validate_filenames = True
)
validation = data_gen.flow_from_dataframe(
    dataframe = train_labels.iloc[8500:10000],
    directory = path_train,
    x_col = 'path',
    y_col = 'label_s',
    batch_size = batch_size,
    shuffle = False,
    class_mode = 'binary', # binary labels need to be strings
    target_size = (96,96),
    # subset = 'validation',
    seed = 101,
    validate_filenames = True
)
Found 8500 validated image filenames belonging to 2 classes.
Found 1500 validated image filenames belonging to 2 classes.
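The generator sizes above determine how many batches each pass takes, which is the step count shown in the training and prediction logs later in this notebook (85, 15 and 1796). A minimal sketch of that arithmetic follows. The helper name `steps_per_epoch` is made up for illustration, and the batch size of 32 for the test generator is an assumption based on the Keras default, since the test generator does not set `batch_size`.

```python
import math

def steps_per_epoch(n_images: int, batch_size: int) -> int:
    """Batches needed to cover n_images once; the last batch may be partial."""
    return math.ceil(n_images / batch_size)

print(steps_per_epoch(8500, 100))   # training generator
print(steps_per_epoch(1500, 100))   # validation generator
print(steps_per_epoch(57458, 32))   # test generator, assuming the default batch size of 32
```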
Modeling
We will start with a simple model to train and validate. We can also make sure the size of data used in the training and validation is small enough to use.
# adapted from https://www.tensorflow.org/tutorials/keras/text_classification
def plot_func(history):
    history_dict = history.history
    history_dict.keys()
    acc = history_dict['accuracy']
    val_acc = history_dict['val_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']
    epochs = range(1, len(acc) + 1)
    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # b is for "solid blue line"
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    plt.show()
    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')
    plt.show()
The first few models are commented out to reduce the notebook's submission runtime.
# first_model = tf.keras.Sequential([
# layers.Conv2D(16, kernel_size=(2,2), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(2,2), strides=2, padding='same'),
# layers.Flatten(),
# layers.Dense(16, activation='relu'),
# layers.Dense(1, activation='sigmoid')
# ])
# first_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate=0.0001))
# first_model.summary()
# history = first_model.fit(x=training, validation_data=validation, epochs=5)
# plot_func(history)
From the above plots, the model appears to be leveling off, especially in the training accuracy over epochs. We can try building and training a more complex model with more convolutional layers.
# second_model = tf.keras.Sequential([
# layers.Conv2D(32, kernel_size=(4,4), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(4,4), strides=2, padding='same'),
# layers.Conv2D(32, kernel_size=(3,3), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
# layers.Conv2D(32, kernel_size=(2,2), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(2,2), strides=2, padding='same'),
# layers.Flatten(),
# layers.Dense(16, activation='relu'),
# layers.Dense(1, activation='sigmoid')
# ])
# second_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate=0.0001))
# second_model.summary()
# history = second_model.fit(x=training, validation_data=validation, epochs=5)
# plot_func(history)
This model clearly achieves a better accuracy after one epoch, but it does not dramatically increase the validation accuracy. We can try increasing the number of filters in the CNN layers.
# third_model = tf.keras.Sequential([
# layers.Conv2D(64, kernel_size=(4,4), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(4,4), strides=2, padding='same'),
# layers.Conv2D(32, kernel_size=(3,3), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
# layers.Conv2D(16, kernel_size=(2,2), padding='same', activation='relu', input_shape=(96,96,3)),
# layers.MaxPooling2D(pool_size=(2,2), strides=2, padding='same'),
# layers.Flatten(),
# layers.Dense(16, activation='relu'),
# layers.Dense(1, activation='sigmoid')
# ])
# third_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate=0.0001))
# third_model.summary()
# history = third_model.fit(x=training, validation_data=validation, epochs=10)
# plot_func(history)
This model looks slightly better than the previous model. We’ll try one more model with the same number of filters per layer, but with all kernel sizes set to (3,3) and the dense layer reduced to 8 units.
fourth_model = tf.keras.Sequential([
    layers.Conv2D(64, kernel_size=(3,3), padding='same', activation='relu', input_shape=(96,96,3)),
    layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
    layers.Conv2D(32, kernel_size=(3,3), padding='same', activation='relu', input_shape=(96,96,3)),
    layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
    layers.Conv2D(16, kernel_size=(3,3), padding='same', activation='relu', input_shape=(96,96,3)),
    layers.MaxPooling2D(pool_size=(3,3), strides=2, padding='same'),
    layers.Flatten(),
    layers.Dense(8, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
fourth_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate=0.0001))
fourth_model.summary()
Model: "sequential_1"
_________________________________________________________________
Layer (type)                    Output Shape          Param #
=================================================================
conv2d_3 (Conv2D)               (None, 96, 96, 64)    1792
max_pooling2d_3 (MaxPooling2D)  (None, 48, 48, 64)    0
conv2d_4 (Conv2D)               (None, 48, 48, 32)    18464
max_pooling2d_4 (MaxPooling2D)  (None, 24, 24, 32)    0
conv2d_5 (Conv2D)               (None, 24, 24, 16)    4624
max_pooling2d_5 (MaxPooling2D)  (None, 12, 12, 16)    0
flatten_1 (Flatten)             (None, 2304)          0
dense_2 (Dense)                 (None, 8)             18440
dense_3 (Dense)                 (None, 1)             9
=================================================================
Total params: 43329 (169.25 KB)
Trainable params: 43329 (169.25 KB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
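The parameter counts and output shapes in the summary can be checked by hand. The sketch below recomputes them in plain Python. A Conv2D layer with bias has (kernel height x kernel width x input channels + 1) x filters parameters, a Dense layer has (inputs + 1) x units, and a stride-2 pooling layer with 'same' padding halves each side, rounding up. The helper names are made up for illustration and are not part of Keras.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # One weight per kernel cell per input channel, plus one bias, for each filter
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    # One weight per input plus one bias, for each output unit
    return (in_units + 1) * out_units

def same_pool_output(size, stride):
    # With 'same' padding the output side is ceil(size / stride)
    return -(-size // stride)

side, channels = 96, 3
counts = []
for filters in (64, 32, 16):
    counts.append(conv2d_params(3, 3, channels, filters))
    side = same_pool_output(side, 2)   # 96 -> 48 -> 24 -> 12
    channels = filters

flat = side * side * channels          # size of the Flatten output
counts.append(dense_params(flat, 8))
counts.append(dense_params(8, 1))

print(counts)       # per-layer parameter counts, pooling and flatten layers excluded
print(flat)         # flattened feature vector length
print(sum(counts))  # total trainable parameters
```

The results match the summary: 1792, 18464, 4624, 18440 and 9 parameters, a flattened length of 2304, and 43329 parameters in total. Most of the remaining weights sit in the second convolution and the first dense layer.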
history = fourth_model.fit(x=training, validation_data=validation, epochs=10)
Epoch 1/10
85/85 [==============================] - 111s 1s/step - loss: 0.6565 - accuracy: 0.6029 - val_loss: 0.6384 - val_accuracy: 0.5960
Epoch 2/10
85/85 [==============================] - 107s 1s/step - loss: 0.5843 - accuracy: 0.6885 - val_loss: 0.5221 - val_accuracy: 0.7700
Epoch 3/10
85/85 [==============================] - 110s 1s/step - loss: 0.4860 - accuracy: 0.7824 - val_loss: 0.4879 - val_accuracy: 0.7607
Epoch 4/10
85/85 [==============================] - 107s 1s/step - loss: 0.4732 - accuracy: 0.7816 - val_loss: 0.4914 - val_accuracy: 0.7667
Epoch 5/10
85/85 [==============================] - 108s 1s/step - loss: 0.4664 - accuracy: 0.7875 - val_loss: 0.4851 - val_accuracy: 0.7640
Epoch 6/10
85/85 [==============================] - 108s 1s/step - loss: 0.4673 - accuracy: 0.7901 - val_loss: 0.4847 - val_accuracy: 0.7633
Epoch 7/10
85/85 [==============================] - 108s 1s/step - loss: 0.4614 - accuracy: 0.7899 - val_loss: 0.4746 - val_accuracy: 0.7760
Epoch 8/10
85/85 [==============================] - 109s 1s/step - loss: 0.4589 - accuracy: 0.7964 - val_loss: 0.4738 - val_accuracy: 0.7753
Epoch 9/10
85/85 [==============================] - 108s 1s/step - loss: 0.4552 - accuracy: 0.7944 - val_loss: 0.4685 - val_accuracy: 0.7767
Epoch 10/10
85/85 [==============================] - 109s 1s/step - loss: 0.4567 - accuracy: 0.7936 - val_loss: 0.4700 - val_accuracy: 0.7733
plot_func(history)


Once again, this model is slightly better, but not by much. Due to the time it takes to train these models, we will use the final model above trained on 10 epochs to predict the final test dataset and investigate the validation data.
val_pred = fourth_model.predict(validation)
val_pred
15/15 [==============================] - 5s 309ms/step
array([[0.06299686],
[0.18402255],
[0.13164014],
...,
[0.8321301 ],
[0.5108832 ],
[0.81643224]], dtype=float32)
We now have to convert these values from probabilities into 1s or 0s.
# val_y_batch = [int(i) for i in val_y_batch]
val_pred = val_pred.reshape(1,-1)[0]
val_pred[val_pred >= 0.5] = '1'
val_pred[val_pred < 0.5] = '0'
print('Lengths:', len(val_pred), len(train_labels.iloc[8500:10000]['label_s'].values))
val_pred
Lengths: 1500 1500
array([0., 0., 0., ..., 1., 1., 1.], dtype=float32)
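The in-place assignment above works, but it overwrites the probabilities and leaves the labels stored as floats. A possible alternative, shown here as a sketch on made-up probabilities rather than the notebook's actual predictions, is to build a separate integer array from a boolean comparison:

```python
import numpy as np

# Made-up sigmoid outputs with the same (n, 1) shape that model.predict returns
probs = np.array([[0.06], [0.51], [0.83], [0.5]], dtype=np.float32)

# Flatten to 1-D, compare against the 0.5 threshold, and cast True/False to 1/0
labels = (probs.ravel() >= 0.5).astype(int)

print(labels)
print(probs.shape)  # the original probabilities are left untouched
```

Keeping the probabilities around also makes it easy to try thresholds other than 0.5 later.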
val_label = train_labels.iloc[8500:10000]['label'].values
cm = confusion_matrix(val_label, val_pred)
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()
print('Accuracy: %.2f' %(accuracy_score(val_label, val_pred)))
print('F1: %.2f' %(f1_score(val_label, val_pred)))
print('Precision: %.2f' %(precision_score(val_label, val_pred)))
print('Recall: %.2f' %(recall_score(val_label, val_pred)))
Accuracy: 0.77
F1: 0.73
Precision: 0.71
Recall: 0.75

Looking at the confusion matrix and the metrics, the model predicts reasonably well, with an accuracy of 0.77. Fortunately, it makes more false positives (predicting cancer when there is none) than false negatives (predicting no cancer when there is cancer), although neither error is ideal.
We can now predict the test images and submit to the Kaggle competition.
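The reported metrics can be cross-checked against each other. F1 is the harmonic mean of precision and recall, and the number of false positives and false negatives per true positive follows directly from precision and recall. The sketch below uses the rounded values printed above (0.71 and 0.75), so the results are approximate. The function names are made up for illustration.

```python
def f1_from(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

def errors_per_true_positive(precision, recall):
    # precision = TP / (TP + FP)  =>  FP / TP = 1 / precision - 1
    # recall    = TP / (TP + FN)  =>  FN / TP = 1 / recall - 1
    return 1 / precision - 1, 1 / recall - 1

f1 = f1_from(0.71, 0.75)
fp_per_tp, fn_per_tp = errors_per_true_positive(0.71, 0.75)

print(round(f1, 2))
print(round(fp_per_tp, 2), round(fn_per_tp, 2))
```

The recomputed F1 rounds to the reported 0.73. Because precision is lower than recall, there are more false positives than false negatives per true positive (roughly 0.41 versus 0.33).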
test_df = pd.DataFrame(images_test, columns=['path'])
test_df['id'] = test_df['path'].apply(lambda x: x.replace('.tif',''))
test = data_gen.flow_from_dataframe(
    dataframe = test_df,
    directory = path_test,
    x_col = 'path',
    shuffle = False,
    target_size = (96,96),
    class_mode=None, # testing df
    classes=None,
    validate_filenames = True
)
Found 57458 validated image filenames.
test_pred = fourth_model.predict(test)
1796/1796 [==============================] - 183s 102ms/step
test_pred = test_pred.reshape(1,-1)[0]
test_pred[test_pred >= 0.5] = '1'
test_pred[test_pred < 0.5] = '0'
print('Lengths:', len(test_pred), len(test_df))
Lengths: 57458 57458
test_df['label'] = test_pred
test_df['label'] = test_df['label'].astype(int)
test_df[['id','label']]
| | id | label |
|---|---|---|
| 0 | a7ea26360815d8492433b14cd8318607bcf99d9e | 0 |
| 1 | 59d21133c845dff1ebc7a0c7cf40c145ea9e9664 | 0 |
| 2 | 5fde41ce8c6048a5c2f38eca12d6528fa312cdbb | 0 |
| 3 | bd953a3b1db1f7041ee95ff482594c4f46c73ed0 | 1 |
| 4 | 523fc2efd7aba53e597ab0f69cc2cbded7a6ce62 | 0 |
| ... | ... | ... |
| 57453 | 7907c88a7f5f9c8ca5b2df72c1e6ff9650eea22b | 0 |
| 57454 | 2a6fc1ed16fa94d263efab330ccbeb1906cbd421 | 0 |
| 57455 | 6bb5c0611c0ccf4713e0ccbc0e8c54bcb498ef14 | 1 |
| 57456 | f11e7c9e77cbc1ec916a52e6b871a293ee1bb928 | 0 |
| 57457 | 66d529ceeb28e822fac5e1378cc5702194532127 | 0 |
57458 rows × 2 columns
test_df[['id','label']].to_csv('submission.csv', index=False)
Conclusions
The final submission for this project had an accuracy of 0.724, which was below the final model's validation accuracy of 0.77. There are two specific ways that the final model could be improved: training on more images and augmenting the training images. Due to time and processing limitations, the model was only trained and validated on the first 10,000 images. This is a lot of images, but it is only 4.5% of the overall training image data, and it is also far fewer than the 57,458 images in the test set. Including more images would likely improve the model. In addition, ImageDataGenerator, which was used to load the images, supports manipulating them as they are loaded, for example by rotating and resizing. This can help keep the model from overfitting to certain types of images.
Overall, I think this model did well given the number of images used for training and the lack of training image manipulation.
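As a rough illustration of what image manipulation could look like, the sketch below applies a random 90-degree rotation and an optional horizontal flip to a single tile using NumPy. This is not what the notebook ran and it is not the Keras API (ImageDataGenerator accepts arguments such as rotation_range and horizontal_flip for this purpose). The `augment` helper and the random stand-in tile are assumptions made for the example.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly rotated (multiple of 90 degrees) and possibly flipped view."""
    k = int(rng.integers(0, 4))               # number of quarter turns, 0 to 3
    out = np.rot90(image, k=k, axes=(0, 1))   # rotate height/width, keep channels last
    if rng.random() < 0.5:
        out = np.flip(out, axis=1)            # horizontal flip
    return out

rng = np.random.default_rng(101)
patch = rng.random((96, 96, 3))               # stand-in for one rescaled 96x96 RGB tile
augmented = augment(patch, rng)

print(augmented.shape)
```

Rotations and flips only rearrange pixels, so the label of a tile stays valid while the model sees it in a new orientation.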
