Twitter, or as it is now named X, is a social media platform where anyone can ‘tweet’ about anything. Tweets have historically had a character limit, and while users can thread multiple tweets on a topic, most write short tweets (under 200 characters).

As the effects of climate change worsen, so do natural disasters. People often flock to social media during a disaster, some to inform others and some to signal that they are in the affected area. Because of this, it is important to be able to quickly classify tweets into those about actual natural disasters and those that are not. Such filtering can help connect rescue workers with people in danger, or amplify information about the conditions and spread of the disaster.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

import re
import string

# Metrics
from sklearn.metrics import accuracy_score, auc, f1_score
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Text Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

# NLP
import spacy
nlp = spacy.load("en_core_web_sm")

# NN
import keras
import tensorflow as tf
from keras import layers
from tensorflow.keras import utils
import tensorflow_hub as hub

from keras import Sequential
from keras.layers import Dense, Dropout, Flatten, Embedding, LSTM, TextVectorization
from keras.optimizers import Adam

Load Data

train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
print('Train Length:', len(train))
print('Test Length:', len(test))
print('Split: %.2f' %(len(test)/(len(test)+len(train))*100),'%')
Train Length: 7613
Test Length: 3263
Split: 30.00 %
train['target'].value_counts()
target
0    4342
1    3271
Name: count, dtype: int64

The training data is reasonably balanced between the two classes: 4342 non-disaster tweets (0) and 3271 disaster tweets (1), roughly a 57/43 split.
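This class balance implies a useful sanity check: a classifier that always predicts the majority class (non-disaster) would already score about 57% accuracy, so any model should beat that. A minimal calculation, using the counts from the `value_counts()` output above:

```python
# Majority-class baseline: always predict the more common label.
counts = {0: 4342, 1: 3271}  # from train['target'].value_counts()
total = sum(counts.values())
baseline_acc = max(counts.values()) / total
print(f"Majority-class baseline accuracy: {baseline_acc:.2%}")  # 57.03%
```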

train
id keyword location text target
0 1 NaN NaN Our Deeds are the Reason of this #earthquake M... 1
1 4 NaN NaN Forest fire near La Ronge Sask. Canada 1
2 5 NaN NaN All residents asked to 'shelter in place' are ... 1
3 6 NaN NaN 13,000 people receive #wildfires evacuation or... 1
4 7 NaN NaN Just got sent this photo from Ruby #Alaska as ... 1
... ... ... ... ... ...
7608 10869 NaN NaN Two giant cranes holding a bridge collapse int... 1
7609 10870 NaN NaN @aria_ahrary @TheTawniest The out of control w... 1
7610 10871 NaN NaN M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt... 1
7611 10872 NaN NaN Police investigating after an e-bike collided ... 1
7612 10873 NaN NaN The Latest: More Homes Razed by Northern Calif... 1

7613 rows × 5 columns

Looking at the dataframe itself, there are two additional columns, keyword and location. From the dataframe preview, some of their values are NaN. We can count the non-NaN values below.

print('Non-NAN keyword:', train['keyword'].notna().sum())
print('Non-NAN location:', train['location'].notna().sum())
Non-NAN keyword: 7552
Non-NAN location: 5080
np.unique(train[train['keyword'].isna()==False]['keyword'])
array(['ablaze', 'accident', 'aftershock', 'airplane%20accident',
       'ambulance', 'annihilated', 'annihilation', 'apocalypse',
       'armageddon', 'army', 'arson', 'arsonist', 'attack', 'attacked',
       'avalanche', 'battle', 'bioterror', 'bioterrorism', 'blaze',
       'blazing', 'bleeding', 'blew%20up', 'blight', 'blizzard', 'blood',
       'bloody', 'blown%20up', 'body%20bag', 'body%20bagging',
       'body%20bags', 'bomb', 'bombed', 'bombing', 'bridge%20collapse',
       'buildings%20burning', 'buildings%20on%20fire', 'burned',
       'burning', 'burning%20buildings', 'bush%20fires', 'casualties',
       'casualty', 'catastrophe', 'catastrophic', 'chemical%20emergency',
       'cliff%20fall', 'collapse', 'collapsed', 'collide', 'collided',
       'collision', 'crash', 'crashed', 'crush', 'crushed', 'curfew',
       'cyclone', 'damage', 'danger', 'dead', 'death', 'deaths', 'debris',
       'deluge', 'deluged', 'demolish', 'demolished', 'demolition',
       'derail', 'derailed', 'derailment', 'desolate', 'desolation',
       'destroy', 'destroyed', 'destruction', 'detonate', 'detonation',
       'devastated', 'devastation', 'disaster', 'displaced', 'drought',
       'drown', 'drowned', 'drowning', 'dust%20storm', 'earthquake',
       'electrocute', 'electrocuted', 'emergency', 'emergency%20plan',
       'emergency%20services', 'engulfed', 'epicentre', 'evacuate',
       'evacuated', 'evacuation', 'explode', 'exploded', 'explosion',
       'eyewitness', 'famine', 'fatal', 'fatalities', 'fatality', 'fear',
       'fire', 'fire%20truck', 'first%20responders', 'flames',
       'flattened', 'flood', 'flooding', 'floods', 'forest%20fire',
       'forest%20fires', 'hail', 'hailstorm', 'harm', 'hazard',
       'hazardous', 'heat%20wave', 'hellfire', 'hijack', 'hijacker',
       'hijacking', 'hostage', 'hostages', 'hurricane', 'injured',
       'injuries', 'injury', 'inundated', 'inundation', 'landslide',
       'lava', 'lightning', 'loud%20bang', 'mass%20murder',
       'mass%20murderer', 'massacre', 'mayhem', 'meltdown', 'military',
       'mudslide', 'natural%20disaster', 'nuclear%20disaster',
       'nuclear%20reactor', 'obliterate', 'obliterated', 'obliteration',
       'oil%20spill', 'outbreak', 'pandemonium', 'panic', 'panicking',
       'police', 'quarantine', 'quarantined', 'radiation%20emergency',
       'rainstorm', 'razed', 'refugees', 'rescue', 'rescued', 'rescuers',
       'riot', 'rioting', 'rubble', 'ruin', 'sandstorm', 'screamed',
       'screaming', 'screams', 'seismic', 'sinkhole', 'sinking', 'siren',
       'sirens', 'smoke', 'snowstorm', 'storm', 'stretcher',
       'structural%20failure', 'suicide%20bomb', 'suicide%20bomber',
       'suicide%20bombing', 'sunk', 'survive', 'survived', 'survivors',
       'terrorism', 'terrorist', 'threat', 'thunder', 'thunderstorm',
       'tornado', 'tragedy', 'trapped', 'trauma', 'traumatised',
       'trouble', 'tsunami', 'twister', 'typhoon', 'upheaval',
       'violent%20storm', 'volcano', 'war%20zone', 'weapon', 'weapons',
       'whirlwind', 'wild%20fires', 'wildfire', 'windstorm', 'wounded',
       'wounds', 'wreck', 'wreckage', 'wrecked'], dtype=object)
np.unique(train[train['location'].isna()==False]['location'])
array(['  ', '  Glasgow ', '  Melbourne, Australia', ...,
       'å¡å¡Midwest \x89Û¢\x89Û¢', 'åÊ(?\x89Û¢`?\x89Û¢å«)??',
       'åø\\_(?)_/åø'], dtype=object)

Looking at the actual contents of these two columns, keyword holds keywords, likely drawn from the tweet text, that relate to natural and other disasters. We will leave this column out of the training data to start, as it would need to be categorically encoded.

The location column clearly holds locations, but is far less consistent than the keyword column (note the free-text and mis-encoded values above). For this reason, the location column will not be used in modeling.
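One detail worth noting: the `%20` sequences in the keyword values above are URL-encoded spaces. If we later decide to use this column, they can be decoded with the standard library. A small sketch, using keyword values taken from the array above:

```python
from urllib.parse import unquote

# Decode URL-encoded keywords such as 'airplane%20accident'.
raw_keywords = ['airplane%20accident', 'body%20bags', 'earthquake']
decoded = [unquote(k) for k in raw_keywords]
print(decoded)  # ['airplane accident', 'body bags', 'earthquake']
```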

df = train[train['keyword'].isna()==False]['keyword'].value_counts().reset_index()
df.head(20)
keyword count
0 fatalities 45
1 deluge 42
2 armageddon 42
3 sinking 41
4 damage 41
5 harm 41
6 body%20bags 41
7 outbreak 40
8 evacuate 40
9 fear 40
10 collided 40
11 siren 40
12 twister 40
13 windstorm 40
14 sinkhole 39
15 sunk 39
16 hellfire 39
17 weapon 39
18 weapons 39
19 famine 39
df.tail(20)
keyword count
201 bombing 29
202 obliteration 29
203 sirens 29
204 snowstorm 29
205 desolate 29
206 seismic 29
207 first%20responders 29
208 rubble 28
209 demolished 28
210 deluged 27
211 volcano 27
212 battle 26
213 bush%20fires 25
214 war%20zone 24
215 rescue 22
216 forest%20fire 19
217 epicentre 12
218 threat 11
219 inundation 10
220 radiation%20emergency 9

The keywords also seem fairly well distributed: the most frequent keyword appears 45 times, and even the least frequent has 9 observations.

df = train
df['Word Count'] = df['text'].apply(lambda x: len(x.split(sep=' ')))
fig, ax = plt.subplots()
ax.hist(df['Word Count'], bins=50)
ax.set_ylabel('Tweets')
ax.set_xlabel('Word Count')
ax.set_title('Histogram of Text Word Length')
ax.vlines(df['Word Count'].quantile(0.5), 0, 900, color='black')
ax.vlines(df['Word Count'].quantile(0.25), 0, 900, color='black', linestyle='dotted')
ax.vlines(df['Word Count'].quantile(0.75), 0, 900, color='black', linestyle='dotted')
fig.show()

png

print('Median: ', df['Word Count'].quantile(0.5))
print('Q1: ', df['Word Count'].quantile(0.25))
print('Q3: ', df['Word Count'].quantile(0.75))
Median:  15.0
Q1:  11.0
Q3:  19.0

Looking at the word counts, most tweets have fewer than 25 words, with the median around 15. This should let a sequential neural network run reasonably well, as there are usually fewer than 30 ‘sequence’ steps the model needs to run through.

n = 10
for i in range(n):
    print('Tweet ',i,':',train.iloc[i]['text'])
Tweet  0 : Our Deeds are the Reason of this #earthquake May ALLAH Forgive us all
Tweet  1 : Forest fire near La Ronge Sask. Canada
Tweet  2 : All residents asked to 'shelter in place' are being notified by officers. No other evacuation or shelter in place orders are expected
Tweet  3 : 13,000 people receive #wildfires evacuation orders in California 
Tweet  4 : Just got sent this photo from Ruby #Alaska as smoke from #wildfires pours into a school 
Tweet  5 : #RockyFire Update => California Hwy. 20 closed in both directions due to Lake County fire - #CAfire #wildfires
Tweet  6 : #flood #disaster Heavy rain causes flash flooding of streets in Manitou, Colorado Springs areas
Tweet  7 : I'm on top of the hill and I can see a fire in the woods...
Tweet  8 : There's an emergency evacuation happening now in the building across the street
Tweet  9 : I'm afraid that the tornado is coming to our area...

Looking at the first 10 tweets, we can see that tweets often contain special characters, such as hashtags. These will need to be cleaned before Natural Language Processing.
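As a minimal illustration of the kind of cleaning needed, the snippet below strips URLs and the `#` symbol from hashtags using only the standard `re` module (the full pipeline below also uses spaCy; the example tweet is adapted from the data above):

```python
import re

def basic_clean(text):
    text = re.sub(r'http\S+', '', text)      # remove URLs
    text = re.sub(r'#', '', text)            # keep hashtag words, drop the symbol
    return re.sub(r' +', ' ', text).strip()  # collapse repeated spaces

tweet = "#RockyFire Update => California Hwy. 20 closed http://t.co/abc123"
print(basic_clean(tweet))  # RockyFire Update => California Hwy. 20 closed
```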

Natural Language Processing

Similar to the BBC News Classification Project in DTSA 5509, we can use spaCy to help clean these tweets. We will remove the common English stop words, emojis, and URLs, and then strip punctuation.

# SpaCy Stop Words
print(len(nlp.Defaults.stop_words), list(nlp.Defaults.stop_words))
326 ['else', 'namely', 'nowhere', 'still', 'his', 'they', 'much', 'between', 'meanwhile', 'such', 'somewhere', 'seeming', "'ll", 'really', 'onto', 'doing', 'mine', 'every', 'everything', 'all', 'latter', 'over', 'some', 'were', 'alone', 'whither', 'before', 'whereafter', 'wherever', 'should', 'must', 'as', 'i', 'neither', 'yourselves', 'ten', 'does', 'one', 'your', 'yourself', 'beside', 'either', 'though', 'hereafter', 'had', 'below', 'show', 'against', 'where', 'whom', 'there', 'nothing', 'thereafter', 'since', 'along', 'thru', 'not', 'are', 'part', 'which', 'afterwards', 'a', 'well', 'six', 'also', 'during', 'too', 'out', 'throughout', 'indeed', 'quite', 'anyway', '‘re', 'herein', 'less', 'few', 'keep', 'front', 'upon', 'others', 'side', 'that', 'to', 'through', 'anywhere', 'put', 'beforehand', 'top', 'always', 'did', 'by', 'serious', 'whenever', 'beyond', 'three', 'even', 'twelve', 'their', 'because', 'himself', 'both', 'third', 'whereupon', '‘d', 'how', 'after', 'latterly', 'using', 'among', 'anything', 'once', 'what', 'somehow', 'been', 'fifty', 'until', 'someone', 'about', 'elsewhere', 'five', '’m', 'move', 'her', 'itself', 'two', 'whatever', 're', 'only', 'various', 'can', '’re', 'toward', 'the', 'another', 'up', 'is', 'across', 'in', 'will', 'done', 'make', 'at', 'other', 'sometime', 'seem', 'call', 'my', 'go', 'of', 'behind', 'give', 'none', 'and', 'already', 'used', 'those', 'has', 'if', 'you', 'name', 'hereupon', 'more', 'nine', 'back', 'becomes', 'whether', 'be', 'from', 'thence', 'nor', 'who', 'its', 'might', 'almost', '’ve', 'myself', 'last', 'former', 'here', '’ll', 'perhaps', 'ourselves', 'anyone', 'made', 'without', 'n‘t', 'own', 'just', 'forty', 'for', 'n’t', '‘s', 'me', 'ours', 'do', 'cannot', 'into', 'each', 'unless', 'hereby', 'whole', "'re", 'again', '’d', 'may', 'thereupon', '‘ve', 'via', 'fifteen', 'many', 'herself', 'anyhow', 'being', 'very', 'become', 'yours', 'our', 'on', 'whereas', 'per', 'but', 'something', 'hence', 'yet', "'s", 
'please', 'whereby', 'towards', '‘ll', 'it', 'everywhere', 'due', 'although', 'becoming', 'no', 'noone', 'four', 'when', 'see', 'sixty', "'d", 'least', "'ve", 'eleven', 'formerly', 'otherwise', 'these', 'have', 'often', 'thereby', 'twenty', 'except', 'whence', 'around', '’s', 'would', 'while', 'hundred', 'amount', 'everyone', 'ca', 'mostly', 'down', 'we', 'never', 'sometimes', 'with', 'full', 'most', 'than', 'any', 'this', 'get', 'them', 'wherein', 'she', 'amongst', 'same', 'several', 'whoever', 'thus', "n't", 'bottom', 'could', 'became', 'why', 'moreover', 'eight', 'next', 'rather', 'regarding', 'so', 'was', 'an', 'within', 'enough', 'off', 'above', 'nevertheless', 'take', 'he', 'empty', 'themselves', 'seems', 'under', 'therefore', 'or', 'ever', 'whose', 'first', 'say', 'us', 'together', 'hers', 'further', "'m", 'besides', '‘m', 'nobody', 'seemed', 'him', 'am', 'therein', 'however', 'now', 'then']
train_clean = train.copy()

# Remove Words Function
def remove_words(text):
    doc = nlp(text)
    # word.text forces each Token back to a string; the stop-word check uses
    # Token.is_stop, since comparing a raw Token against a set of strings never matches
    n_text = [word.text for word in doc if not word.is_stop]
    return ' '.join(n_text)  # join tokens back into full text

import re
emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, arrows & misc symbols
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)
def remove_emojis(data):
    return re.sub(emoj, '', data)

def remove_urls(text):
    return re.sub(r'http\S+', '', text, flags=re.MULTILINE)

train_clean['text'] = train_clean['text'].apply(remove_words)
train_clean['text'] = train_clean['text'].apply(remove_emojis)
train_clean['text'] = train_clean['text'].apply(remove_urls)

# Remove Punctuation (after removal of stop words with punctuation)
train_clean['text'] = train_clean['text'].apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
train_clean['text'] = train_clean['text'].apply(lambda x: re.sub(' +', ' ', x))
n = 32 #tweet 31 has URLs
for i in range(n):
    print('Tweet ',i,':',train_clean.iloc[i]['text'])
Tweet  0 : Our Deeds are the Reason of this earthquake May ALLAH Forgive us all
Tweet  1 : Forest fire near La Ronge Sask Canada
Tweet  2 : All residents asked to shelter in place are being notified by officers No other evacuation or shelter in place orders are expected
Tweet  3 : 13000 people receive wildfires evacuation orders in California
Tweet  4 : Just got sent this photo from Ruby Alaska as smoke from wildfires pours into a school
Tweet  5 :  RockyFire Update California Hwy 20 closed in both directions due to Lake County fire CAfire wildfires
Tweet  6 :  flood disaster Heavy rain causes flash flooding of streets in Manitou Colorado Springs areas
Tweet  7 : I m on top of the hill and I can see a fire in the woods 
Tweet  8 : There s an emergency evacuation happening now in the building across the street
Tweet  9 : I m afraid that the tornado is coming to our area 
Tweet  10 : Three people died from the heat wave so far
Tweet  11 : Haha South Tampa is getting flooded hah WAIT A SECOND I LIVE IN SOUTH TAMPA WHAT AM I GONNA DO WHAT AM I GONNA DO FVCK flooding
Tweet  12 :  raining flooding Florida TampaBay Tampa 18 or 19 days I ve lost count
Tweet  13 :  Flood in Bago Myanmar We arrived Bago
Tweet  14 : Damage to school bus on 80 in multi car crash BREAKING
Tweet  15 : What s up man 
Tweet  16 : I love fruits
Tweet  17 : Summer is lovely
Tweet  18 : My car is so fast
Tweet  19 : What a goooooooaaaaaal 
Tweet  20 : this is ridiculous 
Tweet  21 : London is cool 
Tweet  22 : Love skiing
Tweet  23 : What a wonderful day 
Tweet  24 : LOOOOOOL
Tweet  25 : No way I ca nt eat that shit
Tweet  26 : Was in NYC last week 
Tweet  27 : Love my girlfriend
Tweet  28 : Cooool 
Tweet  29 : Do you like pasta 
Tweet  30 : The end 
Tweet  31 : bbcmtd Wholesale Markets ablaze 

Train Test Split

We must now split the data into training and testing subsets. Since the competition's test set lives in a separate dataframe with unknown labels, this split effectively acts as a train/validation split; we will not know the final model's true performance until we submit the generated labels to the competition.
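We will use scikit-learn's `train_test_split` with stratification, which keeps the class proportions of `target` roughly equal in both subsets. The idea behind stratification can be sketched with the standard library alone (class counts taken from the training data; the function below is illustrative, not scikit-learn's implementation):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=101):
    """Split indices so each class keeps roughly the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y, idx in by_class.items():
        rng.shuffle(idx)
        cut = int(len(idx) * test_size)  # ~20% of each class to the test split
        test_idx.extend(idx[:cut])
        train_idx.extend(idx[cut:])
    return train_idx, test_idx

labels = [1] * 3271 + [0] * 4342  # class counts from the training data
train_idx, test_idx = stratified_split(labels)
pct = sum(labels[i] for i in test_idx) / len(test_idx) * 100
print('Percent disaster in test split: %.2f' % pct)  # ~42.97, matching the full data
```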

from sklearn.model_selection import train_test_split

X_train, X_test = train_test_split(train_clean, test_size=0.2, stratify = train_clean['target'], random_state =101)

print('Train Length:',len(X_train), '\t Percent Disaster: %.2f' %(len(X_train[X_train['target']==1])/len(X_train)*100))
print('Test Length:',len(X_test), '\t Percent Disaster: %.2f' %(len(X_test[X_test['target']==1])/len(X_test)*100))

# Force to Tensor Datasets
tensor_train_text = tf.convert_to_tensor(X_train['text'].values)
X_train_t = tf.data.Dataset.from_tensor_slices((tensor_train_text, X_train['target']))

tensor_test_text = tf.convert_to_tensor(X_test['text'].values)
X_test_t = tf.data.Dataset.from_tensor_slices((tensor_test_text, X_test['target']))
Train Length: 6090 	 Percent Disaster: 42.97
Test Length: 1523 	 Percent Disaster: 42.94

With the data converted to TensorFlow datasets, we can use built-in functions to tokenize and vectorize the tweet texts. We start by shuffling the data and batching it; a sample batch is shown below.

BATCH_SIZE = 64

train_dataset = X_train_t.shuffle(len(X_train_t), seed=101).batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
test_dataset = X_test_t.batch(BATCH_SIZE).prefetch(tf.data.AUTOTUNE)
for example, label in train_dataset.take(1):
    print('texts: ', example.numpy()[:3])
    print()
    print('labels: ', label.numpy()[:3])
texts:  [b'Police Officer Wounded Suspect Dead After Exchanging Shots ABC News AN247'
 b'chromsucks do nt drown'
 b'Survival Kit Whistle Fire Starter Wire Saw Cree Torch Emergency Blanket S knife Full re\xc2\x89\xc3\x9b ']

labels:  [1 0 1]

Text Vectorization

We can now apply a semi-custom standardization, which lowercases the tweet text and strips punctuation and special characters. We then create the vectorization layer and adapt it on the training text only. Code adapted from this tutorial.
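The effect of this standardization can be previewed in plain Python: lowercase the text, then delete every punctuation character. This is a rough stdlib equivalent of the TensorFlow version, shown for illustration only:

```python
import string

def standardize(text):
    # Lowercase and strip punctuation, mirroring the TF standardization step.
    text = text.lower().replace('<br />', ' ')
    return text.translate(str.maketrans('', '', string.punctuation))

print(standardize("Forest fire near La Ronge Sask. Canada!"))
# forest fire near la ronge sask canada
```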

max_features = 10000
max_tweet_len = max(len(tweet.split(' ')) for tweet in train_clean['text'])

def custom_standardization(input_data):
    lowercase = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, '<br />', ' ')
    return tf.strings.regex_replace(stripped_html,
                                  '[%s]' % re.escape(string.punctuation),
                                  '')

vectorize_layer = layers.TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    output_mode='int',
    output_sequence_length=max_tweet_len)

train_text = train_dataset.map(lambda x, y: x) # just the text
vectorize_layer.adapt(train_text) # train vectorization
def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label

text_batch, label_batch = next(iter(train_dataset))
first_review, first_label = text_batch[0], label_batch[0]
print("Review:", str(first_review))
print("Label:", int(first_label))
print("Vectorized review", vectorize_text(first_review, first_label))
Review: tf.Tensor(b'USGS reports a M194 earthquake 5 km S of Volcano Hawaii on 8615 10401 UTC quake', shape=(), dtype=string)
Label: 1
Vectorized review (<tf.Tensor: shape=(1, 38), dtype=int64, numpy=
array([[1330,  845,    3, 2654,  241,  146,  650,   12,    6,  487, 1399,
          13, 5739, 5803, 1797, 4452,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0]])>, <tf.Tensor: shape=(), dtype=int64, numpy=1>)

We can see from the output that this tweet was “USGS reports a M194 earthquake 5 km S of Volcano Hawaii on 8615 10401 UTC quake,” which does sound like a potential disaster, and its label is 1, meaning it was labeled a disaster tweet. The vectorized form maps each word to a positive integer index into the learned vocabulary, and the vector is zero-padded to the length of the longest cleaned tweet.
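The integer mapping works like a lookup table: each word in the learned vocabulary gets an index (Keras's `TextVectorization` reserves 0 for padding and 1 for out-of-vocabulary tokens), and short sequences are padded with zeros to the fixed length. A stdlib sketch of the idea, with a hypothetical tiny vocabulary:

```python
# Toy version of TextVectorization: word -> index, then zero-pad.
vocab = ['', '[UNK]', 'earthquake', 'fire', 'forest']  # index 0 = pad, 1 = OOV
word_to_idx = {w: i for i, w in enumerate(vocab)}

def vectorize(text, seq_len=8):
    ids = [word_to_idx.get(w, 1) for w in text.lower().split()]
    return ids + [0] * (seq_len - len(ids))  # pad to fixed length

print(vectorize('forest fire near la ronge'))  # [4, 3, 1, 1, 1, 0, 0, 0]
```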

train_ds = train_dataset.map(vectorize_text)
val_ds = test_dataset.map(vectorize_text)

Sequential Neural Network

We can now build and train a Sequential Neural Network. To verify that the cleaning and vectorization above are correct, we start with a simple model: an embedding layer, a GlobalAveragePooling1D layer, and a single dense layer with a sigmoid activation function. Much of the first model and evaluation code is adapted from this tutorial.

def plot_func(history):
    history_dict = history.history

    acc = history_dict['accuracy']
    val_acc = history_dict['val_accuracy']
    loss = history_dict['loss']
    val_loss = history_dict['val_loss']

    epochs = range(1, len(acc) + 1)

    # "bo" is for "blue dot"
    plt.plot(epochs, loss, 'bo', label='Training loss')
    # b is for "solid blue line"
    plt.plot(epochs, val_loss, 'b', label='Validation loss')
    plt.title('Training and validation loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.show()

    plt.plot(epochs, acc, 'bo', label='Training acc')
    plt.plot(epochs, val_acc, 'b', label='Validation acc')
    plt.title('Training and validation accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend(loc='lower right')

    plt.show()
# First model
first_model = tf.keras.Sequential([
    layers.Embedding(max_features ,64),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

first_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])
first_model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, None, 64)          640000    
                                                                 
 dropout (Dropout)           (None, None, 64)          0         
                                                                 
 global_average_pooling1d (  (None, 64)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dropout_1 (Dropout)         (None, 64)                0         
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
=================================================================
Total params: 640065 (2.44 MB)
Trainable params: 640065 (2.44 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
history = first_model.fit(x=train_ds, validation_data=val_ds, epochs=10)
plot_func(history)
Epoch 1/10
96/96 [==============================] - 1s 6ms/step - loss: 0.6828 - accuracy: 0.5701 - val_loss: 0.6786 - val_accuracy: 0.5706
Epoch 2/10
96/96 [==============================] - 0s 4ms/step - loss: 0.6755 - accuracy: 0.5703 - val_loss: 0.6718 - val_accuracy: 0.5706
Epoch 3/10
96/96 [==============================] - 0s 5ms/step - loss: 0.6678 - accuracy: 0.5706 - val_loss: 0.6627 - val_accuracy: 0.5712
Epoch 4/10
96/96 [==============================] - 0s 5ms/step - loss: 0.6571 - accuracy: 0.5768 - val_loss: 0.6500 - val_accuracy: 0.6014
Epoch 5/10
96/96 [==============================] - 0s 5ms/step - loss: 0.6441 - accuracy: 0.6174 - val_loss: 0.6361 - val_accuracy: 0.6316
Epoch 6/10
96/96 [==============================] - 0s 5ms/step - loss: 0.6284 - accuracy: 0.6662 - val_loss: 0.6203 - val_accuracy: 0.6861
Epoch 7/10
96/96 [==============================] - 0s 5ms/step - loss: 0.6125 - accuracy: 0.7044 - val_loss: 0.6047 - val_accuracy: 0.7104
Epoch 8/10
96/96 [==============================] - 0s 5ms/step - loss: 0.5953 - accuracy: 0.7273 - val_loss: 0.5897 - val_accuracy: 0.7269
Epoch 9/10
96/96 [==============================] - 0s 5ms/step - loss: 0.5791 - accuracy: 0.7409 - val_loss: 0.5759 - val_accuracy: 0.7459
Epoch 10/10
96/96 [==============================] - 0s 5ms/step - loss: 0.5643 - accuracy: 0.7540 - val_loss: 0.5636 - val_accuracy: 0.7433

png

png

From these two graphs, there is clearly room for improvement with more epochs: neither the training and validation losses nor the accuracies have leveled off yet. We can try again with additional epochs.

# clears first model's prior training
first_model = tf.keras.Sequential([
    layers.Embedding(max_features ,64),
    layers.Dropout(0.2),
    layers.GlobalAveragePooling1D(),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

first_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'])

epochs = 25
history = first_model.fit(x=train_ds, validation_data=val_ds, epochs=epochs)
plot_func(history)
Epoch 1/25
96/96 [==============================] - 1s 6ms/step - loss: 0.6836 - accuracy: 0.5650 - val_loss: 0.6781 - val_accuracy: 0.5706
Epoch 2/25
96/96 [==============================] - 0s 4ms/step - loss: 0.6751 - accuracy: 0.5703 - val_loss: 0.6710 - val_accuracy: 0.5706
Epoch 3/25
96/96 [==============================] - 0s 5ms/step - loss: 0.6670 - accuracy: 0.5704 - val_loss: 0.6611 - val_accuracy: 0.5712
Epoch 4/25
96/96 [==============================] - 0s 5ms/step - loss: 0.6554 - accuracy: 0.5826 - val_loss: 0.6488 - val_accuracy: 0.5955
Epoch 5/25
96/96 [==============================] - 0s 5ms/step - loss: 0.6419 - accuracy: 0.6271 - val_loss: 0.6339 - val_accuracy: 0.6428
Epoch 6/25
96/96 [==============================] - 0s 5ms/step - loss: 0.6258 - accuracy: 0.6708 - val_loss: 0.6181 - val_accuracy: 0.6914
Epoch 7/25
96/96 [==============================] - 0s 5ms/step - loss: 0.6100 - accuracy: 0.7108 - val_loss: 0.6070 - val_accuracy: 0.6697
Epoch 8/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5928 - accuracy: 0.7236 - val_loss: 0.5881 - val_accuracy: 0.7347
Epoch 9/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5772 - accuracy: 0.7456 - val_loss: 0.5773 - val_accuracy: 0.7190
Epoch 10/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5617 - accuracy: 0.7516 - val_loss: 0.5615 - val_accuracy: 0.7485
Epoch 11/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5475 - accuracy: 0.7617 - val_loss: 0.5503 - val_accuracy: 0.7538
Epoch 12/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5331 - accuracy: 0.7668 - val_loss: 0.5409 - val_accuracy: 0.7584
Epoch 13/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5213 - accuracy: 0.7745 - val_loss: 0.5304 - val_accuracy: 0.7571
Epoch 14/25
96/96 [==============================] - 0s 5ms/step - loss: 0.5084 - accuracy: 0.7816 - val_loss: 0.5214 - val_accuracy: 0.7577
Epoch 15/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4960 - accuracy: 0.7906 - val_loss: 0.5144 - val_accuracy: 0.7571
Epoch 16/25
96/96 [==============================] - 0s 4ms/step - loss: 0.4860 - accuracy: 0.7910 - val_loss: 0.5058 - val_accuracy: 0.7649
Epoch 17/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4729 - accuracy: 0.8030 - val_loss: 0.4971 - val_accuracy: 0.7676
Epoch 18/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4629 - accuracy: 0.8079 - val_loss: 0.4897 - val_accuracy: 0.7787
Epoch 19/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4521 - accuracy: 0.8144 - val_loss: 0.4883 - val_accuracy: 0.7741
Epoch 20/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4407 - accuracy: 0.8192 - val_loss: 0.4778 - val_accuracy: 0.7925
Epoch 21/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4305 - accuracy: 0.8271 - val_loss: 0.4754 - val_accuracy: 0.7833
Epoch 22/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4217 - accuracy: 0.8296 - val_loss: 0.4668 - val_accuracy: 0.7978
Epoch 23/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4112 - accuracy: 0.8333 - val_loss: 0.4620 - val_accuracy: 0.8024
Epoch 24/25
96/96 [==============================] - 0s 5ms/step - loss: 0.4022 - accuracy: 0.8366 - val_loss: 0.4597 - val_accuracy: 0.8011
Epoch 25/25
96/96 [==============================] - 0s 5ms/step - loss: 0.3924 - accuracy: 0.8433 - val_loss: 0.4538 - val_accuracy: 0.8063

png

png

This time the validation loss and accuracy do appear to level off. An accuracy of about 0.80 is good, but rather than running even more epochs, we can try to improve the model itself.

The next model swaps the pooling layer for an LSTM. As its training log will show, validation accuracy peaks after only a few epochs, at which point validation loss begins a steep climb, a sign of overfitting.
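A common guard against this pattern is early stopping: monitor validation loss and halt training once it has not improved for a set number of epochs (Keras provides this as the `tf.keras.callbacks.EarlyStopping` callback, with `patience` and `restore_best_weights` options). The rule itself is simple; below is a stdlib sketch with hypothetical loss values:

```python
def early_stop_epoch(val_losses, patience=3):
    """Return the 1-based epoch at which training would stop, or None."""
    best, best_epoch = float('inf'), 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # no improvement for `patience` epochs
    return None

# Hypothetical validation losses that peak early, then degrade:
losses = [0.68, 0.46, 0.47, 0.48, 0.60, 0.89, 0.81]
print(early_stop_epoch(losses))  # 5
```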

second_model = tf.keras.Sequential([
    layers.Embedding(max_features,200),
    layers.LSTM(100),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

second_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam())  # sigmoid output yields probabilities, so from_logits stays False
second_model.summary()
Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_2 (Embedding)     (None, None, 200)         2000000   
                                                                 
 lstm (LSTM)                 (None, 100)               120400    
                                                                 
 dropout_4 (Dropout)         (None, 100)               0         
                                                                 
 dense_2 (Dense)             (None, 1)                 101       
                                                                 
=================================================================
Total params: 2120501 (8.09 MB)
Trainable params: 2120501 (8.09 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
history = second_model.fit(x=train_ds, validation_data=val_ds, epochs=10)
plot_func(history)
Epoch 1/10
96/96 [==============================] - 7s 53ms/step - loss: 0.6848 - accuracy: 0.5698 - val_loss: 0.6813 - val_accuracy: 0.5706
Epoch 2/10
96/96 [==============================] - 5s 50ms/step - loss: 0.5054 - accuracy: 0.7842 - val_loss: 0.4628 - val_accuracy: 0.7965
Epoch 3/10
96/96 [==============================] - 5s 51ms/step - loss: 0.3167 - accuracy: 0.8856 - val_loss: 0.4700 - val_accuracy: 0.8043
Epoch 4/10
96/96 [==============================] - 5s 48ms/step - loss: 0.2150 - accuracy: 0.9268 - val_loss: 0.4848 - val_accuracy: 0.7984
Epoch 5/10
96/96 [==============================] - 5s 49ms/step - loss: 0.1726 - accuracy: 0.9465 - val_loss: 0.6023 - val_accuracy: 0.7833
Epoch 6/10
96/96 [==============================] - 5s 49ms/step - loss: 0.1220 - accuracy: 0.9596 - val_loss: 0.8937 - val_accuracy: 0.7748
Epoch 7/10
96/96 [==============================] - 5s 49ms/step - loss: 0.1028 - accuracy: 0.9662 - val_loss: 0.8089 - val_accuracy: 0.7367
Epoch 8/10
96/96 [==============================] - 5s 48ms/step - loss: 0.0987 - accuracy: 0.9663 - val_loss: 0.7211 - val_accuracy: 0.7853
Epoch 9/10
96/96 [==============================] - 5s 49ms/step - loss: 0.0803 - accuracy: 0.9716 - val_loss: 0.7955 - val_accuracy: 0.7781
Epoch 10/10
96/96 [==============================] - 5s 48ms/step - loss: 0.0707 - accuracy: 0.9713 - val_loss: 1.1622 - val_accuracy: 0.7538

*(training/validation accuracy and loss plots)*

third_model = tf.keras.Sequential([
    layers.Embedding(max_features,128),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.2),
    layers.Dense(1, activation='sigmoid')
])

third_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam()) # sigmoid output is a probability, so from_logits stays False (the default)
third_model.summary()
Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 128)         1280000   
                                                                 
 bidirectional (Bidirection  (None, 128)               98816     
 al)                                                             
                                                                 
 dropout_5 (Dropout)         (None, 128)               0         
                                                                 
 dense_3 (Dense)             (None, 1)                 129       
                                                                 
=================================================================
Total params: 1378945 (5.26 MB)
Trainable params: 1378945 (5.26 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
history = third_model.fit(x=train_ds, validation_data=val_ds, epochs=10)
plot_func(history)
Epoch 1/10
96/96 [==============================] - 8s 51ms/step - loss: 0.5708 - accuracy: 0.7026 - val_loss: 0.4532 - val_accuracy: 0.7958
Epoch 2/10
96/96 [==============================] - 4s 45ms/step - loss: 0.3425 - accuracy: 0.8611 - val_loss: 0.4571 - val_accuracy: 0.8043
Epoch 3/10
96/96 [==============================] - 4s 42ms/step - loss: 0.2215 - accuracy: 0.9190 - val_loss: 0.4909 - val_accuracy: 0.7912
Epoch 4/10
96/96 [==============================] - 4s 42ms/step - loss: 0.1568 - accuracy: 0.9470 - val_loss: 0.5707 - val_accuracy: 0.7892
Epoch 5/10
96/96 [==============================] - 4s 42ms/step - loss: 0.1124 - accuracy: 0.9627 - val_loss: 0.7384 - val_accuracy: 0.7676
Epoch 6/10
96/96 [==============================] - 4s 42ms/step - loss: 0.0886 - accuracy: 0.9686 - val_loss: 0.7901 - val_accuracy: 0.7728
Epoch 7/10
96/96 [==============================] - 4s 45ms/step - loss: 0.0725 - accuracy: 0.9749 - val_loss: 0.9381 - val_accuracy: 0.7538
Epoch 8/10
96/96 [==============================] - 4s 42ms/step - loss: 0.0630 - accuracy: 0.9762 - val_loss: 0.8841 - val_accuracy: 0.7761
Epoch 9/10
96/96 [==============================] - 4s 41ms/step - loss: 0.0524 - accuracy: 0.9787 - val_loss: 1.1961 - val_accuracy: 0.7511
Epoch 10/10
96/96 [==============================] - 4s 42ms/step - loss: 0.0518 - accuracy: 0.9787 - val_loss: 1.0500 - val_accuracy: 0.7630

*(training/validation accuracy and loss plots)*

Adding an LSTM layer (with or without bidirectionality) and a dropout layer reaches a higher training accuracy than the first model, but does not improve validation performance overall. Both variants cap out at a validation accuracy of about 0.80, which is decent but leaves room for improvement.
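As a sanity check on the summaries above, the LSTM parameter counts can be reproduced by hand: an LSTM has four gates, each with a weight matrix over the input, a recurrent weight matrix over the hidden state, and a bias vector, and a bidirectional wrapper doubles the total:

```python
def lstm_params(input_dim, units, bidirectional=False):
    """4 gates x (input weights + recurrent weights + biases) per direction."""
    per_direction = 4 * (input_dim * units + units * units + units)
    return per_direction * (2 if bidirectional else 1)

print(lstm_params(200, 100))                     # 120400, matches second_model
print(lstm_params(128, 64, bidirectional=True))  # 98816, matches third_model
```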

BERT Text Vectorization

We can now try a different type of text vectorization. Previously, we used a simple word-to-vector embedding learned from the words in our dataset. We will now use BERT, a pre-trained model trained on Wikipedia text (model link).
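BERT does not embed whole words directly: its preprocessor first splits text into subword units (WordPiece) and maps them to vocabulary ids, so rare words decompose into known pieces. Here is a toy sketch of the greedy longest-match-first idea, using a made-up miniature vocabulary (real BERT uses a roughly 30k-entry vocabulary plus special tokens such as `[CLS]` and `[SEP]`):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split; '##' marks continuations."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
            end -= 1
        else:
            return ["[UNK]"]  # no vocabulary piece matched
    return pieces

vocab = {"flood", "wild", "##fire", "storm", "##s"}
print(wordpiece("wildfire", vocab))  # ['wild', '##fire']
print(wordpiece("storms", vocab))    # ['storm', '##s']
print(wordpiece("tsunami", vocab))   # ['[UNK]']
```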

# code taken from documentation
import tensorflow_text as text # Registers the ops.
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
preprocessor = hub.KerasLayer(
    "https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
encoder_inputs = preprocessor(text_input)
encoder = hub.KerasLayer(
    "https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/bert-en-uncased-l-4-h-128-a-2/versions/2",
    trainable=True)
outputs = encoder(encoder_inputs)
pooled_output = outputs["pooled_output"]      # [batch_size, 128].
sequence_output = outputs["sequence_output"]  # [batch_size, seq_length, 128].

We can test what an example output of the model looks like.

# code taken from documentation
preprocessor_model = hub.KerasLayer(preprocessor)
text_preprocessed = preprocessor_model(['hello world'])
bert_model = hub.KerasLayer(encoder)
bert_model(text_preprocessed)
{'encoder_outputs': [<tf.Tensor: shape=(1, 128, 128), dtype=float32, numpy=
  array([[[-0.28209957,  0.16127399, -0.35150453, ...,  0.5077009 ,
            0.21340457,  1.9965067 ],
          [-1.5265744 ,  3.378838  ,  1.6338611 , ...,  0.12054971,
            1.1757679 ,  3.0933146 ],
          [ 0.08307905,  0.3280999 ,  0.07436036, ...,  2.4959233 ,
            0.99645245,  3.7378154 ],
          ...,
          [ 0.0132023 ,  0.61902344, -0.2620767 , ...,  0.02692477,
            0.39911288,  4.7912455 ],
          [ 0.16357262,  0.50296265,  0.23560297, ...,  0.01681199,
            0.02894995,  4.681708  ],
          [ 0.07864648, -0.0484555 ,  0.44076368, ..., -0.10352544,
           -0.49364454,  4.559942  ]]], dtype=float32)>,
  <tf.Tensor: shape=(1, 128, 128), dtype=float32, numpy=
  array([[[ 0.0516504 ,  0.6599202 , -0.9217181 , ...,  0.30025643,
            0.43955854,  1.8802348 ],
          [ 0.6656793 ,  2.8988056 ,  1.0965407 , ...,  0.55948937,
            1.1671605 ,  2.225875  ],
          [ 0.38766512,  1.0334573 ,  0.04376348, ...,  2.0849016 ,
            0.84792554,  3.0966625 ],
          ...,
          [ 0.27386662,  0.9967978 , -0.00480282, ...,  0.5122368 ,
            0.5424355 ,  2.7287927 ],
          [ 0.38234806,  0.9611372 ,  0.3953129 , ...,  0.5155247 ,
            0.2566129 ,  2.678214  ],
          [ 0.27309746,  0.45097864,  0.5349711 , ...,  0.4195987 ,
           -0.18874912,  2.5692964 ]]], dtype=float32)>,
  <tf.Tensor: shape=(1, 128, 128), dtype=float32, numpy=
  array([[[-0.21175407,  0.8670821 , -0.5636373 , ...,  1.2098176 ,
            0.6358854 ,  1.2231494 ],
          [ 1.0007379 ,  2.7127125 ,  1.8869675 , ...,  1.4713739 ,
            1.4528961 ,  1.5206515 ],
          [ 0.15607256,  0.7841349 ,  0.3155134 , ...,  2.5778186 ,
            1.3517176 ,  1.8645667 ],
          ...,
          [-0.08595099,  1.5815217 ,  0.27560985, ...,  1.8091085 ,
            0.5119754 ,  2.1717086 ],
          [ 0.01241413,  1.4937234 ,  0.7201883 , ...,  1.7614553 ,
            0.21611717,  2.158123  ],
          [-0.15097669,  0.8030128 ,  1.064392  , ...,  1.485912  ,
           -0.30418843,  2.0891755 ]]], dtype=float32)>,
  <tf.Tensor: shape=(1, 128, 128), dtype=float32, numpy=
  array([[[ 0.6211376 ,  1.3517511 , -0.45417488, ...,  2.0057793 ,
            0.0606292 ,  2.484833  ],
          [ 0.65974665,  2.0155602 ,  1.994353  , ...,  0.9720714 ,
            0.559381  ,  1.8422587 ],
          [ 0.35356337,  0.63893765,  0.8127144 , ...,  2.18133   ,
            0.19869362,  2.4446754 ],
          ...,
          [ 0.20386173,  1.724275  ,  0.73474425, ...,  1.6744812 ,
           -0.41350663,  2.608837  ],
          [ 0.3207193 ,  1.680187  ,  1.0361519 , ...,  1.6692009 ,
           -0.5980547 ,  2.6772265 ],
          [ 0.18855697,  1.3109277 ,  1.2347947 , ...,  1.4752425 ,
           -0.8028536 ,  2.6195338 ]]], dtype=float32)>],
 'sequence_output': <tf.Tensor: shape=(1, 128, 128), dtype=float32, numpy=
 array([[[ 0.6211376 ,  1.3517511 , -0.45417488, ...,  2.0057793 ,
           0.0606292 ,  2.484833  ],
         [ 0.65974665,  2.0155602 ,  1.994353  , ...,  0.9720714 ,
           0.559381  ,  1.8422587 ],
         [ 0.35356337,  0.63893765,  0.8127144 , ...,  2.18133   ,
           0.19869362,  2.4446754 ],
         ...,
         [ 0.20386173,  1.724275  ,  0.73474425, ...,  1.6744812 ,
          -0.41350663,  2.608837  ],
         [ 0.3207193 ,  1.680187  ,  1.0361519 , ...,  1.6692009 ,
          -0.5980547 ,  2.6772265 ],
         [ 0.18855697,  1.3109277 ,  1.2347947 , ...,  1.4752425 ,
          -0.8028536 ,  2.6195338 ]]], dtype=float32)>,
 'default': <tf.Tensor: shape=(1, 128), dtype=float32, numpy=
 array([[ 0.07584224,  0.9836907 , -0.2630973 , -0.97893184,  0.743709  ,
          0.0090101 , -0.14360051,  0.8566583 , -0.00537995, -0.33194244,
         -0.9988629 , -0.47580013,  0.9926781 ,  0.98163223, -0.51875824,
          0.8436456 , -0.9282018 , -0.93616855,  0.9788719 , -0.9985078 ,
         -0.42084292,  0.9939893 ,  0.43430611,  0.883811  ,  0.9943545 ,
         -0.15068094, -0.9983269 ,  0.05422748,  0.7614241 ,  0.9365335 ,
          0.05953573,  0.99065554,  0.07227435, -0.8536249 , -0.999306  ,
          0.82566214,  0.18399948, -0.08194325, -0.03688229, -0.949369  ,
         -0.85700506, -0.26376978,  0.9706242 ,  0.9703997 ,  0.77763283,
          0.9936656 ,  0.44197613, -0.98407215, -0.02194322,  0.99992055,
         -0.96972024, -0.97639036, -0.7846998 , -0.50623816,  0.8963169 ,
         -0.07334638,  0.76571584, -0.82761514,  0.95137453, -0.9821145 ,
          0.992975  ,  0.03632161,  0.09662161, -0.99999493,  0.14631358,
          0.9070506 , -0.9502659 ,  0.99737585, -0.54943436, -0.9307689 ,
          0.99105525,  0.9982059 ,  0.80663604,  0.01156592,  0.7350246 ,
          0.18778832, -0.0662348 , -0.8304772 , -0.0237916 ,  0.02730418,
          0.7108603 ,  0.936256  ,  0.99855816,  0.6254929 ,  0.34067762,
          0.91873646, -0.9985096 , -0.9998925 ,  0.98237705, -0.83680135,
         -0.38750505, -0.99742264, -0.9745682 ,  0.10217275, -0.99289674,
         -0.99141985,  0.84903777,  0.9685095 , -0.575578  ,  0.9401124 ,
         -0.95453966,  0.93488944,  0.97211385,  0.14320926,  0.95782334,
         -0.4970999 ,  0.9970584 ,  0.9890446 , -0.9881183 ,  0.9631142 ,
         -0.9962711 , -0.04193094,  0.906351  , -0.94458044,  0.99867475,
         -0.99285704,  0.9965014 ,  0.99486065, -0.05255312,  0.9156404 ,
          0.09295443, -0.9775028 ,  0.99608743,  0.02004856,  0.9997972 ,
          0.9552411 ,  0.8401237 ,  0.930562  ]], dtype=float32)>,
 'pooled_output': <tf.Tensor: shape=(1, 128), dtype=float32, numpy=
 array([[ 0.07584224,  0.9836907 , -0.2630973 , -0.97893184,  0.743709  ,
          0.0090101 , -0.14360051,  0.8566583 , -0.00537995, -0.33194244,
         -0.9988629 , -0.47580013,  0.9926781 ,  0.98163223, -0.51875824,
          0.8436456 , -0.9282018 , -0.93616855,  0.9788719 , -0.9985078 ,
         -0.42084292,  0.9939893 ,  0.43430611,  0.883811  ,  0.9943545 ,
         -0.15068094, -0.9983269 ,  0.05422748,  0.7614241 ,  0.9365335 ,
          0.05953573,  0.99065554,  0.07227435, -0.8536249 , -0.999306  ,
          0.82566214,  0.18399948, -0.08194325, -0.03688229, -0.949369  ,
         -0.85700506, -0.26376978,  0.9706242 ,  0.9703997 ,  0.77763283,
          0.9936656 ,  0.44197613, -0.98407215, -0.02194322,  0.99992055,
         -0.96972024, -0.97639036, -0.7846998 , -0.50623816,  0.8963169 ,
         -0.07334638,  0.76571584, -0.82761514,  0.95137453, -0.9821145 ,
          0.992975  ,  0.03632161,  0.09662161, -0.99999493,  0.14631358,
          0.9070506 , -0.9502659 ,  0.99737585, -0.54943436, -0.9307689 ,
          0.99105525,  0.9982059 ,  0.80663604,  0.01156592,  0.7350246 ,
          0.18778832, -0.0662348 , -0.8304772 , -0.0237916 ,  0.02730418,
          0.7108603 ,  0.936256  ,  0.99855816,  0.6254929 ,  0.34067762,
          0.91873646, -0.9985096 , -0.9998925 ,  0.98237705, -0.83680135,
         -0.38750505, -0.99742264, -0.9745682 ,  0.10217275, -0.99289674,
         -0.99141985,  0.84903777,  0.9685095 , -0.575578  ,  0.9401124 ,
         -0.95453966,  0.93488944,  0.97211385,  0.14320926,  0.95782334,
         -0.4970999 ,  0.9970584 ,  0.9890446 , -0.9881183 ,  0.9631142 ,
         -0.9962711 , -0.04193094,  0.906351  , -0.94458044,  0.99867475,
         -0.99285704,  0.9965014 ,  0.99486065, -0.05255312,  0.9156404 ,
          0.09295443, -0.9775028 ,  0.99608743,  0.02004856,  0.9997972 ,
          0.9552411 ,  0.8401237 ,  0.930562  ]], dtype=float32)>}

Both of these layers can be used within a model to convert the ‘raw’ (here, cleaned) text data into token encodings and then contextual embeddings.

def build_bert_model():
    # reload due to weight changing
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    preprocessor = hub.KerasLayer(
        "https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
    encoder_inputs = preprocessor(text_input)
    encoder = hub.KerasLayer(
        "https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/bert-en-uncased-l-4-h-128-a-2/versions/2",
        trainable=True)
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = layers.Dropout(0.5)(net)
    net = layers.Dense(1, activation='sigmoid')(net)
    return tf.keras.Model(text_input, net)

bert_model = build_bert_model()
bert_model.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate=3e-5))
bert_model.summary()
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 input_2 (InputLayer)        [(None,)]                    0         []                            
                                                                                                  
 keras_layer_4 (KerasLayer)  {'input_word_ids': (None,    0         ['input_2[0][0]']             
                             128),                                                                
                              'input_type_ids': (None,                                            
                             128),                                                                
                              'input_mask': (None, 128)                                           
                             }                                                                    
                                                                                                  
 keras_layer_5 (KerasLayer)  {'sequence_output': (None,   4782465   ['keras_layer_4[0][0]',       
                              128, 128),                             'keras_layer_4[0][1]',       
                              'encoder_outputs': [(None              'keras_layer_4[0][2]']       
                             , 128, 128),                                                         
                              (None, 128, 128),                                                   
                              (None, 128, 128),                                                   
                              (None, 128, 128)],                                                  
                              'pooled_output': (None, 1                                           
                             28),                                                                 
                              'default': (None, 128)}                                             
                                                                                                  
 dropout_6 (Dropout)         (None, 128)                  0         ['keras_layer_5[0][5]']       
                                                                                                  
 dense_4 (Dense)             (None, 1)                    129       ['dropout_6[0][0]']           
                                                                                                  
==================================================================================================
Total params: 4782594 (18.24 MB)
Trainable params: 4782593 (18.24 MB)
Non-trainable params: 1 (1.00 Byte)
__________________________________________________________________________________________________
# this takes a while...
history = bert_model.fit(x=train_dataset, validation_data=test_dataset, epochs=5)
Epoch 1/5
96/96 [==============================] - 76s 708ms/step - loss: 0.6622 - accuracy: 0.6433 - val_loss: 0.4823 - val_accuracy: 0.7682
Epoch 2/5
96/96 [==============================] - 66s 687ms/step - loss: 0.5253 - accuracy: 0.7506 - val_loss: 0.4432 - val_accuracy: 0.7958
Epoch 3/5
96/96 [==============================] - 64s 670ms/step - loss: 0.4853 - accuracy: 0.7805 - val_loss: 0.4382 - val_accuracy: 0.8050
Epoch 4/5
96/96 [==============================] - 65s 682ms/step - loss: 0.4551 - accuracy: 0.7998 - val_loss: 0.4271 - val_accuracy: 0.8089
Epoch 5/5
96/96 [==============================] - 65s 677ms/step - loss: 0.4385 - accuracy: 0.8123 - val_loss: 0.4378 - val_accuracy: 0.8017
plot_func(history)

*(training/validation accuracy and loss plots)*

This model performs about as well as the previous models. We can try adding additional layers; adding an LSTM layer makes the model a recurrent neural network.

def build_bert_model2():
    # take vars from original load
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    preprocessor = hub.KerasLayer(
        "https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
    encoder_inputs = preprocessor(text_input)
    encoder = hub.KerasLayer(
        "https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/bert-en-uncased-l-4-h-128-a-2/versions/2",
        trainable=True)
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = layers.Dropout(0.5)(net)
    net = layers.Reshape((-1,1))(net)
    net = layers.LSTM(16,return_sequences=True)(net)
    net = layers.GlobalMaxPooling1D()(net) # forcing to the right output shape
    net = layers.Dense(1, activation='sigmoid')(net)
    return tf.keras.Model(text_input, net)

bert_model2 = build_bert_model2()
bert_model2.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate = 0.0001))
bert_model2.summary()
Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                Output Shape                 Param #   Connected to                  
==================================================================================================
 input_4 (InputLayer)        [(None,)]                    0         []                            
                                                                                                  
 keras_layer_7 (KerasLayer)  {'input_word_ids': (None,    0         ['input_4[0][0]']             
                             128),                                                                
                              'input_mask': (None, 128)                                           
                             , 'input_type_ids': (None,                                           
                              128)}                                                               
                                                                                                  
 keras_layer_8 (KerasLayer)  {'default': (None, 128),     4782465   ['keras_layer_7[0][0]',       
                              'pooled_output': (None, 1              'keras_layer_7[0][1]',       
                             28),                                    'keras_layer_7[0][2]']       
                              'encoder_outputs': [(None                                           
                             , 128, 128),                                                         
                              (None, 128, 128),                                                   
                              (None, 128, 128),                                                   
                              (None, 128, 128)],                                                  
                              'sequence_output': (None,                                           
                              128, 128)}                                                          
                                                                                                  
 dropout_7 (Dropout)         (None, 128)                  0         ['keras_layer_8[0][5]']       
                                                                                                  
 reshape (Reshape)           (None, 128, 1)               0         ['dropout_7[0][0]']           
                                                                                                  
 lstm_2 (LSTM)               (None, 128, 16)              1152      ['reshape[0][0]']             
                                                                                                  
 global_max_pooling1d (Glob  (None, 16)                   0         ['lstm_2[0][0]']              
 alMaxPooling1D)                                                                                  
                                                                                                  
 dense_5 (Dense)             (None, 1)                    17        ['global_max_pooling1d[0][0]']
                                                                                                  
==================================================================================================
Total params: 4783634 (18.25 MB)
Trainable params: 4783633 (18.25 MB)
Non-trainable params: 1 (1.00 Byte)
__________________________________________________________________________________________________
history = bert_model2.fit(x=train_dataset, validation_data=test_dataset, epochs=5)
Epoch 1/5
96/96 [==============================] - 79s 735ms/step - loss: 0.7132 - accuracy: 0.4297 - val_loss: 0.6782 - val_accuracy: 0.4353
Epoch 2/5
96/96 [==============================] - 68s 712ms/step - loss: 0.6669 - accuracy: 0.6300 - val_loss: 0.6384 - val_accuracy: 0.7997
Epoch 3/5
96/96 [==============================] - 69s 717ms/step - loss: 0.6024 - accuracy: 0.8074 - val_loss: 0.5737 - val_accuracy: 0.7978
Epoch 4/5
96/96 [==============================] - 69s 719ms/step - loss: 0.5154 - accuracy: 0.8307 - val_loss: 0.5436 - val_accuracy: 0.7820
Epoch 5/5
96/96 [==============================] - 70s 728ms/step - loss: 0.4462 - accuracy: 0.8545 - val_loss: 0.4942 - val_accuracy: 0.8083
plot_func(history)

*(training/validation accuracy and loss plots)*

We can see that this model also tops out at the same ~0.80 validation accuracy the other models reached. We can try increasing the number of units in the LSTM layer and checking the performance.

def build_bert_model3():
    # take vars from original load
    text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
    preprocessor = hub.KerasLayer(
        "https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
    encoder_inputs = preprocessor(text_input)
    encoder = hub.KerasLayer(
        "https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/bert-en-uncased-l-4-h-128-a-2/versions/2",
        trainable=True)
    outputs = encoder(encoder_inputs)
    net = outputs['pooled_output']
    net = layers.Dropout(0.5)(net)
    net = layers.Reshape((-1,1))(net)
    net = layers.LSTM(100,return_sequences=True)(net)
    net = layers.GlobalMaxPooling1D()(net) # forcing to the right output shape
    net = layers.Dense(1, activation='sigmoid')(net)
    return tf.keras.Model(text_input, net)

bert_model3 = build_bert_model3()
bert_model3.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate = 0.0001))
history = bert_model3.fit(x=train_dataset, validation_data=test_dataset, epochs=5)
Epoch 1/5
96/96 [==============================] - 85s 792ms/step - loss: 0.6771 - accuracy: 0.5596 - val_loss: 0.6340 - val_accuracy: 0.5706
Epoch 2/5
96/96 [==============================] - 75s 778ms/step - loss: 0.5929 - accuracy: 0.5938 - val_loss: 0.5739 - val_accuracy: 0.7019
Epoch 3/5
96/96 [==============================] - 73s 766ms/step - loss: 0.5330 - accuracy: 0.7854 - val_loss: 0.5117 - val_accuracy: 0.7807
Epoch 4/5
96/96 [==============================] - 74s 770ms/step - loss: 0.4863 - accuracy: 0.8025 - val_loss: 0.4913 - val_accuracy: 0.8011
Epoch 5/5
96/96 [==============================] - 74s 770ms/step - loss: 0.4365 - accuracy: 0.8320 - val_loss: 0.4707 - val_accuracy: 0.8083
plot_func(history)

*(training/validation accuracy and loss plots)*

This model’s performance is slightly better. The network could potentially be trained for more epochs and reach a higher validation accuracy, but it looks to be leveling off by around the 5th epoch.

Of all the models, the last one had the best validation accuracy, about 0.8083 at epoch 5. We can use that model to further investigate the validation predictions and to predict the final test data.

# model used in first submission
# def build_bert_model3():
#     # take vars from original load
#     text_input = tf.keras.layers.Input(shape=(), dtype=tf.string)
#     preprocessor = hub.KerasLayer(
#         "https://kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/en-uncased-preprocess/versions/3")
#     encoder_inputs = preprocessor(text_input)
#     encoder = hub.KerasLayer(
#         "https://www.kaggle.com/models/tensorflow/bert/frameworks/TensorFlow2/variations/bert-en-uncased-l-4-h-128-a-2/versions/2",
#         trainable=True)
#     outputs = encoder(encoder_inputs)
#     net = outputs['pooled_output']
#     net = layers.Dropout(0.5)(net)
#     net = layers.Reshape((-1,1))(net)
#     net = layers.LSTM(100,return_sequences=True)(net)
#     net = layers.GlobalMaxPooling1D()(net) # forcing to the right output shape
#     net = layers.Dense(1, activation='sigmoid')(net)
#     return tf.keras.Model(text_input, net)

# bert_model3 = build_bert_model3()
# bert_model3.compile(loss=tf.keras.losses.BinaryCrossentropy(), metrics=['accuracy'], optimizer=tf.optimizers.Adam(learning_rate = 0.0001))
# history = bert_model3.fit(x=train_dataset, validation_data=test_dataset, epochs=3)
bert_model3.evaluate(test_dataset)
bert_pred = bert_model3.predict(test_dataset)
24/24 [==============================] - 6s 238ms/step - loss: 0.4707 - accuracy: 0.8083
24/24 [==============================] - 6s 234ms/step

We can now look at what the model predicted for the validation set.

test_y = []
for i in list(test_dataset): #getting all labels from the tensor df
    for j in i[1]:
        test_y.append(int(j))

# turning into 1/0 predictions
bert_pred[bert_pred >= 0.5] = 1
bert_pred[bert_pred < 0.5] = 0 
bert_pred = bert_pred.reshape(1,-1)[0]
print('Lengths:', len(bert_pred), len(test_y))
Lengths: 1523 1523
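The nested label-collection loop above can be written more compactly by concatenating the label batches; the same pattern works on a batched `tf.data.Dataset` yielding `(text_batch, label_batch)` pairs, sketched here with plain NumPy batches standing in for `test_dataset`:

```python
import numpy as np

# Stand-in for a batched (features, labels) dataset
batches = [(["tweet a", "tweet b"], np.array([1, 0])),
           (["tweet c"], np.array([1]))]

# Equivalent to the nested loop: concatenate every label batch
test_y = np.concatenate([y for _, y in batches]).astype(int).tolist()
print(test_y)  # [1, 0, 1]
```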
cm = confusion_matrix(test_y, bert_pred)
cmd = ConfusionMatrixDisplay(cm)
cmd.plot()
print('Accuracy: %.2f' %(accuracy_score(test_y, bert_pred)))
print('F1: %.2f' %(f1_score(test_y, bert_pred)))
print('Precision: %.2f' %(precision_score(test_y, bert_pred)))
print('Recall: %.2f' %(recall_score(test_y, bert_pred)))
Accuracy: 0.81
F1: 0.77
Precision: 0.79
Recall: 0.75

*(confusion matrix plot)*

Looking at the confusion matrix alongside the scores, recall (0.75) is lower than precision (0.79), so the model misses more actual disaster tweets (false negatives) than it mislabels non-disaster tweets as disasters (false positives).
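The raw counts behind these rates can be read directly off the matrix with `ravel()` (scikit-learn orders a binary confusion matrix as tn, fp, fn, tp). A self-contained toy example with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = disaster tweet, 0 = not
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# sklearn's binary confusion matrix flattens as tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```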

We can now export the final predictions for the model and see how we did!

test_df = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
test_clean = test_df.copy()
test_clean['text'] = test_clean['text'].apply(remove_words)
test_clean['text'] = test_clean['text'].apply(remove_emojis)
test_clean['text'] = test_clean['text'].apply(remove_urls)

# Remove Punctuation (after removal of stop words with punctuation)
test_clean['text'] = test_clean['text'].apply(lambda x: x.translate(str.maketrans('','',string.punctuation)))
test_clean['text'] = test_clean['text'].apply(lambda x: re.sub(' +', ' ', x))

tensor_finaltest_text = tf.convert_to_tensor(test_clean['text'].values)
ids = test_clean['id'].values
final_test_pred = bert_model3.predict(tensor_finaltest_text)
final_test_pred = final_test_pred.reshape(1,-1)[0]
final_test_pred[final_test_pred >= 0.5] = 1
final_test_pred[final_test_pred < 0.5] = 0 
final_df = pd.DataFrame(ids, columns=['id'])
final_df['target'] = final_test_pred
final_df['target'] = final_df['target'].astype('int')
final_df
102/102 [==============================] - 13s 126ms/step
id target
0 0 1
1 2 1
2 3 1
3 9 1
4 11 1
... ... ...
3258 10861 1
3259 10865 1
3260 10868 1
3261 10874 1
3262 10875 1

3263 rows × 2 columns
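As an aside, the two in-place comparisons used above to binarize the predicted probabilities can be collapsed into a single vectorized expression (a stylistic alternative with the same behavior; the probabilities below are illustrative):

```python
import numpy as np

probs = np.array([[0.91], [0.12], [0.50], [0.49]])  # model output, shape (n, 1)
# Flatten, threshold at 0.5, and cast to int labels in one step
labels = (probs.reshape(-1) >= 0.5).astype(int)
print(labels.tolist())  # [1, 0, 1, 0]
```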

# Checking to make sure our model didn't just predict 1s or 0s
final_df['target'].value_counts()
target
0    1994
1    1269
Name: count, dtype: int64
# final_df.to_csv('/kaggle/working/submission.csv', index=False)

Conclusion

This was an interesting project to work on. It’s striking how many different ways there are to vectorize a sentence, and it was helpful to draw inspiration for cleaning tweets from other notebooks in this competition. There is definitely still room in my code to further reduce ‘bad’ tweet text, such as newline markers and other artifacts I wasn’t sure how to handle.

The submitted model’s predictions scored 0.786, which is better than a random guess; the model performed fairly well. The test-set score was only slightly below the final model’s validation accuracy, which is a good sign that the model did not overfit the training/validation data.

References