Tweet Sentiments: Analyzing Airline Complaints with RNN

In this project, a Recurrent Neural Network (RNN) model was developed to classify tweets as complaints or non-complaints directed towards airlines. The project involved several steps, starting with data acquisition, where tweets were downloaded and extracted. Preprocessing steps included cleaning the text by removing URLs, converting mentions and hashtags, eliminating numbers and punctuation, and applying lemmatization. The data was then tokenized and sequenced for model input.

The model, built using TensorFlow and Keras, featured an embedding layer, a global average pooling layer, and dense layers, including one for binary classification with a sigmoid activation function. Training was conducted over 30 epochs, showing a final test accuracy of approximately 83.8%. The project also explored different thresholds for classifying a tweet as a complaint to optimize the balance between identifying actual complaints and reducing false positives.

The outcomes allowed a deeper understanding of customer sentiments towards airlines, highlighting the potential of RNNs in extracting meaningful insights from unstructured text data in social media.

Data Preparation and Preprocessing

The project starts with collecting tweets categorized into complaints and non-complaints. Data manipulation is performed using the Pandas library, a powerful tool for data analysis in Python.

import pandas as pd
complaints = pd.read_csv('complaint.txt', sep="\t", header=None)
noncomplaints = pd.read_csv('noncomplaint.txt', sep="\t", header=None)

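Here, we load the complaint and non-complaint datasets from text files using Pandas. Each file is read as tab-separated text (sep="\t") with no header row (header=None), so Pandas assigns integer column labels starting at 0 instead of treating the first tweet as column names.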
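The two frames still need a label before they can be used for training. A minimal sketch of one way to combine them follows. The column names `text` and `label`, the 1/0 encoding, and the in-memory sample rows are assumptions for illustration, since the contents of the original files are not shown.

```python
import io

import pandas as pd

# Hypothetical stand-ins for complaint.txt and noncomplaint.txt (one tweet per line).
complaint_file = io.StringIO("my flight was delayed again\nlost my luggage\n")
noncomplaint_file = io.StringIO("great crew today\n")

complaints = pd.read_csv(complaint_file, sep="\t", header=None)
noncomplaints = pd.read_csv(noncomplaint_file, sep="\t", header=None)

# With header=None the first column is labeled 0; rename it and attach a binary label.
complaints = complaints[[0]].rename(columns={0: "text"}).assign(label=1)
noncomplaints = noncomplaints[[0]].rename(columns={0: "text"}).assign(label=0)

# Stack both classes into one frame with a fresh 0..n-1 index.
tweets = pd.concat([complaints, noncomplaints], ignore_index=True)
```

Keeping both classes in a single frame makes it straightforward to shuffle and split them together later.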

Key Preprocessing Steps:

The preprocessing aims to clean and prepare the text for modeling:

Text Normalization and Cleaning: Convert the text to lowercase to standardize it, then remove URLs, mentions, and numbers, which contribute little to sentiment analysis.

The cleaning function applies regular expressions to remove unwanted characters and words, then lemmatizes the remaining tokens to reduce them to their base forms, so that variations of the same word are treated as one.
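The cleaning steps described above can be sketched as follows. This is not the project's original function: the name `clean_tweet`, the exact regular expressions, and the choice to keep a hashtag's word while dropping the `#` are assumptions. Lemmatization is left as an optional callable so the sketch runs without extra downloads; a lemmatizer such as NLTK's `WordNetLemmatizer().lemmatize` could be passed in.

```python
import re

# Assumed patterns; the original expressions are not shown in the write-up.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # also drops non-ASCII letters


def clean_tweet(text, lemmatize=None):
    """Lowercase a tweet and strip URLs, mentions, digits and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @user mentions
    text = text.replace("#", " ")        # keep the hashtag word, drop the symbol
    text = NON_LETTER_RE.sub(" ", text)  # remove digits and punctuation
    tokens = text.split()
    if lemmatize is not None:
        # Optional: map each token to its base form.
        tokens = [lemmatize(token) for token in tokens]
    return " ".join(tokens)


example = clean_tweet("@united My flight UA123 was DELAYED 3 hours!! http://t.co/abc #fail")
# example == "my flight ua was delayed hours fail"
```

Applied to a Pandas column, this could be used as `tweets["text"].apply(clean_tweet)`, where `tweets` is a hypothetical frame holding the raw text.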

Model Building

A Sequential model is constructed using TensorFlow and Keras. The Sequential API is well suited to architectures in which layers are stacked linearly, each feeding the next.

The Embedding layer transforms each word index into a dense vector of fixed size and is trained with the model to better capture the semantics of the words. GlobalAveragePooling1D averages these vectors over the sequence dimension, producing one fixed-length vector per tweet regardless of its length; the layer has no trainable weights and keeps the parameter count of the following layers small. Dense layers then learn to classify the sentiment based on the features extracted by the previous layers, with a final sigmoid unit producing the complaint probability.
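A minimal sketch of such an architecture in Keras is shown below. The vocabulary size, embedding dimension, hidden layer width, optimizer, and loss are assumptions chosen for illustration, not the values used in the project.

```python
import tensorflow as tf

# Assumed hyperparameters; the write-up does not state the original values.
VOCAB_SIZE = 10_000   # number of distinct token ids
EMBED_DIM = 16        # size of each word vector
HIDDEN_UNITS = 16     # width of the intermediate dense layer


def build_model():
    """Embedding -> average pooling -> dense -> sigmoid output."""
    model = tf.keras.Sequential([
        # Maps each integer token id to a trainable EMBED_DIM-sized vector.
        tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=EMBED_DIM),
        # Averages the word vectors over the sequence axis: one vector per tweet.
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dense(HIDDEN_UNITS, activation="relu"),
        # Single sigmoid unit: estimated probability that the tweet is a complaint.
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```

Binary cross-entropy is the usual pairing for a single sigmoid output, which is why it is assumed here.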

Training and Evaluation

The model is trained with 80% of the data reserved for training and the remaining 20% for validation to monitor overfitting.
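The 80/20 split can be sketched with NumPy as shown below. The function name, the fixed seed, and the use of a random shuffle are assumptions; the write-up does not say how the split was produced.

```python
import numpy as np


def split_train_test(X, y, test_fraction=0.2, seed=42):
    """Shuffle the rows once, then hold out `test_fraction` of them."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))               # shuffled row indices
    n_test = int(round(len(X) * test_fraction))   # size of the held-out set
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


# Toy data: 10 padded sequences of length 2 with alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)
X_train, X_test, y_train, y_test = split_train_test(X_demo, y_demo)
```

scikit-learn's `train_test_split` offers the same behavior, with an additional option to stratify by label.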

history = model.fit(X_train, y_train, epochs=30, validation_data=(X_test, y_test), verbose=2)

Training runs for 30 epochs, so the model makes 30 full passes over the training data. Setting verbose=2 prints one summary line per epoch, with the loss and metrics for both the training and validation data, instead of a progress bar.

Results Visualization

To understand the training dynamics, we plot accuracy and loss for both training and validation phases.

The plotting function draws accuracy and loss for each epoch, providing visual feedback on how well the model is learning and generalizing. The red line represents training metrics and the blue line represents validation metrics. A widening gap between the two suggests overfitting, while poor values on both suggest underfitting.
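A plotting helper along these lines could look like the sketch below. The function name `plot_history` is an assumption, as are the metric keys (`accuracy`, `val_accuracy`, `loss`, `val_loss`), which match what Keras records when the model is compiled with `metrics=["accuracy"]`.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot training (red) and validation (blue) accuracy and loss per epoch."""
    fig, axes = plt.subplots(1, 2, figsize=(12, 4))
    for ax, metric in zip(axes, ("accuracy", "loss")):
        epochs = range(1, len(history_dict[metric]) + 1)
        ax.plot(epochs, history_dict[metric], "r-", label=f"training {metric}")
        ax.plot(epochs, history_dict[f"val_{metric}"], "b-", label=f"validation {metric}")
        ax.set_xlabel("epoch")
        ax.set_ylabel(metric)
        ax.legend()
    fig.tight_layout()
    return fig
```

With the object returned by `model.fit`, the call would be `plot_history(history.history)` followed by `plt.show()`.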

Practical Application

Finally, the trained model is used to predict sentiments on new data, illustrating its potential utility.

The predict method outputs the probability of each tweet being a complaint. A threshold of 0.65 is used to classify tweets as complaints, balancing sensitivity and specificity based on the model's probabilistic outputs.
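Applying the threshold to the predicted probabilities can be sketched as follows. The helper names and the sample probabilities are hypothetical, and treating a probability exactly equal to the threshold as a complaint is an assumption.

```python
import numpy as np


def label_complaints(probabilities, threshold=0.65):
    """Return a boolean array: True where a tweet is classified as a complaint."""
    # model.predict returns shape (n, 1); flatten it to (n,).
    probs = np.asarray(probabilities, dtype=float).reshape(-1)
    return probs >= threshold


def complaint_share(probabilities, threshold=0.65):
    """Fraction of tweets classified as complaints at the given threshold."""
    return float(label_complaints(probabilities, threshold).mean())


# Hypothetical model outputs for four tweets.
demo_probs = np.array([[0.9], [0.65], [0.3], [0.7]])
demo_labels = label_complaints(demo_probs)   # [True, True, False, True]
demo_share = complaint_share(demo_probs)     # 0.75
```

Raising the threshold classifies fewer tweets as complaints, which reduces false positives at the cost of missing some real complaints. On the sample above, a threshold of 0.8 leaves only one of the four tweets flagged.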

Results:

135,650 of the 157,209 scored tweets (about 86.3%) were classified as complaints.

21,559 of the tweets (about 13.7%) were classified as non-complaints.

The non-complaint fraction is 21559 / 157209 = 0.1371359146104867.

Conclusion and Future Directions

This sentiment analysis project demonstrates that a compact neural network can extract useful signals from tweet text. The model described here relies on an embedding layer and global average pooling, which ignores word order. Future improvements could include adding recurrent layers such as LSTM or GRU, which might offer better performance due to their ability to capture word order and longer dependencies in text.