import numpy as np
import pandas as pd
import seaborn as sns
import calendar
import matplotlib.pyplot as plt
import yfinance as yf
import re
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
import torch
from torch.utils.data import DataLoader, Dataset
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR
import torch.nn as nn
from transformers import AutoTokenizer

sns.set_theme(style="whitegrid")

SEED = 3
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
Introduction
Stock trading, as a critical component of the modern economic framework, profoundly impacts almost every aspect of our daily lives. The stock market itself serves as a vital barometer of economic trends, reflecting the trajectories of companies, regions, nations, and even the global economy. Stock prices fluctuate every second, with international news often playing a significant role in these changes. In the past, people relied on newspapers the next day to access news, but today, information is transmitted instantaneously to electronic devices worldwide through the internet.
In this project, we aim to train a model capable of predicting the rise or fall of the S&P 500 index immediately upon receiving news headlines. This tool can empower individuals to make more informed financial decisions, even if they lack expertise in the relevant industry or region covered by the news. For this purpose, we collected daily news headlines from CNBC, The Guardian, and Reuters from 2018 to 2020, along with daily trading data for the S&P 500 during the same period.
Methods
Exploratory Data Analysis
News Headlines Datasets
The three news datasets were obtained through Kaggle. The author states that the data were scraped from the official websites of CNBC, the Guardian, and Reuters; the headlines in these datasets provide a daily overview of the U.S. economy and stock market over roughly the preceding one to two years.
The timeframes of the data:
Data scraped from CNBC contains the headlines, last updated date, and the preview text of articles from the end of December 2017 to July 19th, 2020.
Data scraped from the Guardian Business contains the headlines and last updated date of articles from the end of December 2017 to July 19th, 2020 since the Guardian Business does not offer preview text.
Data scraped from Reuters contains the headlines, last updated date, and the preview text of articles from the end of March 2018 to July 19th, 2020.
Code
cnbc_data = pd.read_csv('dataset/cnbc_headlines.csv')
# There are empty (",,") lines in CNBC, drop them
cnbc_data.dropna(subset=['Time'], inplace=True)
guardian_data = pd.read_csv('dataset/guardian_headlines.csv')
reuters_data = pd.read_csv('dataset/reuters_headlines.csv')
Code
dfs = [cnbc_data, guardian_data, reuters_data]
df_names = ['CNBC', 'Guardian', 'Reuters']
non_null_counts = [df.dropna().shape[0] for df in dfs]

plt.figure(figsize=(5, 5))
plt.bar(df_names, non_null_counts)
plt.title("Number of Non-Null Entries in DataFrames")
plt.xlabel("DataFrames")
plt.ylabel("Non-Null Count")
plt.show()
We can see that the three news datasets have varying numbers of data points, ranging from roughly 2,800 to 32,700. This is not a problem for us, since our question focuses on the impact of news headlines in general on the S&P 500, so all of this data will be combined into one larger dataset ordered by the date of the headline. We can also see there is null data within the CNBC dataset, which will be removed.
These histograms display the number of headlines of a given length. Disregarding the bar at length 0 in the CNBC graph (an artifact of the null data), headlines from all three outlets center around 60-70 characters, with maxima of roughly 100 for CNBC and Reuters and roughly 120 for the Guardian.
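The histogram cell itself is not reproduced here; a minimal sketch of how these plots could be generated, assuming length is measured in characters, is below.
Code
# Sketch (assumption): character-length histograms of headlines for each outlet
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, df, name in zip(axes, [cnbc_data, guardian_data, reuters_data],
                        ['CNBC', 'Guardian', 'Reuters']):
    df['Headlines'].fillna("").str.len().hist(bins=30, ax=ax)
    ax.set_title(f"{name} Headline Lengths")
    ax.set_xlabel("Length (characters)")
    ax.set_ylabel("Number of Headlines")
plt.tight_layout()
plt.show()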
Headline Distribution
To find out how headlines are distributed throughout the timeframe, we generated a graph with headlines colored by month of the year.
Code
# Extra cleaning for CNBC
cnbc_data['Time'] = (
    cnbc_data['Time']
    .str.replace(r"ET", "", regex=True)
    .str.strip()
    .str.replace(r"\s+", " ", regex=True)
)
Code
cnbc_data['Time'] = pd.to_datetime(
    cnbc_data['Time'], format="mixed", errors='coerce'
)
# For GUARDIAN
guardian_data['Time'] = pd.to_datetime(
    guardian_data['Time'], format='%d-%b-%y', errors='coerce'
)
# For REUTERS
reuters_data['Time'] = pd.to_datetime(
    reuters_data['Time'], format='%b %d %Y', errors='coerce'
)

# Adding additional columns for time analysis
for df in [cnbc_data, guardian_data, reuters_data]:
    # Extract date parts for time-based analysis
    df['Year'] = df['Time'].dt.year
    df['Month'] = df['Time'].dt.month

# Frequency of headlines by year and month for each dataset
cnbc_yearly_counts = cnbc_data.groupby(['Year', 'Month']).size().unstack(fill_value=0)
guardian_yearly_counts = guardian_data.groupby(['Year', 'Month']).size().unstack(fill_value=0)
reuters_yearly_counts = reuters_data.groupby(['Year', 'Month']).size().unstack(fill_value=0)
Code
fig, ax = plt.subplots(3, 1, figsize=(14, 12), sharex=True)

# Set a colormap to represent months consistently
month_colors = plt.colormaps["tab20"]

# CNBC dataset with month colors
cnbc_yearly_counts.plot(kind="bar", stacked=True, ax=ax[0],
                        color=[month_colors(i) for i in range(12)], legend=False)
ax[0].set_title("CNBC Headlines Frequency by Year and Month")
ax[0].set_ylabel("Number of Headlines")

# Guardian dataset with month colors
guardian_yearly_counts.plot(kind="bar", stacked=True, ax=ax[1],
                            color=[month_colors(i) for i in range(12)], legend=False)
ax[1].set_title("Guardian Headlines Frequency by Year and Month")
ax[1].set_ylabel("Number of Headlines")

# Reuters dataset with month colors
reuters_yearly_counts.plot(kind="bar", stacked=True, ax=ax[2],
                           color=[month_colors(i) for i in range(12)], legend=False)
ax[2].set_title("Reuters Headlines Frequency by Year and Month")
ax[2].set_ylabel("Number of Headlines")
ax[2].set_xlabel("Year")

# Adding a single legend for the months
month_names = [calendar.month_name[i] for i in range(1, 13)]
fig.legend(month_names, loc="upper right", title="Months")
plt.tight_layout(rect=[0, 0, 0.85, 1])  # Adjust layout to fit the legend
plt.show()
These graphs depict the number of headlines per month per year. With this, we can see that the earlier months of the year seem to have a higher concentration of headlines.
Code
for df in [cnbc_data, guardian_data, reuters_data]:
    df.drop(columns=['Year', 'Month'], inplace=True)
Word Frequency
A short analysis of word frequency. We used the stopword list in nltk to help filter out words like "a" and "the".
# Stopword list from nltk (also used later for the transformer experiments)
stop_words = set(stopwords.words('english'))

# Function to clean and process headlines for meaningful word frequencies
def process_and_plot(data, title, start=0, end=15, stopwords=None):
    if stopwords is None:
        stopwords = stop_words
    combined_string = ' '.join(data['Headlines'])  # Combine all headlines
    word_list = combined_string.split()  # Split into words
    word_list = [word.lower().strip(",.!?()[]") for word in word_list
                 if word.lower() not in stopwords]

    # Calculate word frequencies
    word_count = Counter(word_list)

    # Sort words by frequency
    sorted_words = word_count.most_common()
    top_words = sorted_words[start:end]  # Select words from the specified range

    # Create lists of words and their counts
    words = [word for word, count in top_words]
    counts = [count for word, count in top_words]

    # Plot the bar chart
    plt.figure(figsize=(10, 4))
    sns.barplot(x=counts, y=words, hue=words, palette="Blues_d", orient="h")
    plt.xlabel('Counts')
    plt.ylabel('Words')
    plt.title(title)
    plt.tight_layout()
    plt.show()

# Plot for each dataset excluding common stopwords
process_and_plot(guardian_data, 'Guardian: Top 15 Meaningful Words Frequency')
process_and_plot(cnbc_data, 'CNBC: Top 15 Meaningful Words Frequency')
process_and_plot(reuters_data, 'Reuters: Top 15 Meaningful Words')
Most of the words are meaningful, but who is "cramer" in the CNBC dataset? It turns out Jim Cramer is the host of various financial programs on CNBC. We will prune him out of the CNBC dataset later.
Statistical Analysis
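The cell that downloads the S&P 500 trading data (data_hlcv, used below) is not shown; a minimal sketch, assuming it is pulled from yfinance over the headline timeframe (the ticker and date range are assumptions based on the summary that follows):
Code
# Sketch (assumption): daily S&P 500 data covering the headline timeframe
data_hlcv = yf.download("^GSPC", start="2017-12-01", end="2020-08-01")
data_hlcv = data_hlcv[["High", "Low", "Close", "Volume"]]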
Code
data_hlcv.describe()
       High         Low          Close        Volume
count  669.000000   669.000000   669.000000   6.690000e+02
mean   2883.568308  2849.033738  2867.277964  3.965081e+09
std    198.117998   206.057495   202.285824   1.154337e+09
min    2300.729980  2191.860107  2237.399902  1.296530e+09
25%    2739.189941  2709.540039  2724.439941  3.300220e+09
50%    2856.669922  2825.389893  2843.489990  3.635780e+09
75%    2999.149902  2970.090088  2984.870117  4.156640e+09
max    3393.520020  3378.830078  3386.149902  9.053950e+09
The S&P 500 dataset from December 1, 2017, to July 31, 2020, contains 669 daily records with columns for High, Low, Close, and Volume, with no missing values. The average ‘High’, ‘Low’, and ‘Close’ prices are around 2883, 2849, and 2867, respectively, with standard deviations near 200 points, indicating moderate volatility. The ‘Volume’ data, averaging 3.97 billion shares, shows considerable variability, ranging from 1.3 billion to 9.05 billion, reflecting spikes in trading activity during certain market events.
To prepare for analysis, normalization or standardization may be beneficial to handle the scale differences, particularly between price and volume data. This initial overview confirms a relatively stable daily distribution, setting up further analysis on trends, volatility, and potential event impacts on S&P 500 performance.
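As a sketch of that preprocessing idea (not applied in the analysis below), the price and volume columns could be standardized to zero mean and unit variance:
Code
# Sketch (assumption): z-score standardization to put prices and volume on a comparable scale
scaled = data_hlcv[["High", "Low", "Close", "Volume"]].copy()
scaled = (scaled - scaled.mean()) / scaled.std()
print(scaled.describe().loc[["mean", "std"]])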
This is the stock price of the S&P 500 over the range of dates covered by the news headlines. One noticeable feature of this graph is the large dip during early 2020 caused by COVID-19. This will have an interesting impact on our model: the news did play a big role in the COVID-19 scare, but the fact that the dip was caused by a global pandemic may skew the embeddings of other words.
Data Preprocessing
Clean out NaT values in the Time column of the three datasets.
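The cell performing this step is not shown; a minimal sketch of what it presumably does:
Code
# Sketch (assumption): drop rows whose Time failed to parse (NaT) in all three datasets
for df in [cnbc_data, guardian_data, reuters_data]:
    df.dropna(subset=['Time'], inplace=True)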
Here we did our first cleaning by converting all characters to lowercase and removing extra spaces, quotation marks, and other unwanted characters. We also remove Jim Cramer, as well as his show Mad Money, from the CNBC dataset.
Code
def clean_headlines(text):
    # Convert text to lowercase
    text = text.lower()
    # Remove special characters except hyphens and spaces
    text = re.sub(r"[^\w\s\-]", "", text)
    words = word_tokenize(text)
    cleaned_text = " ".join(words)
    return cleaned_text

guardian_data['Headlines'] = guardian_data['Headlines'].apply(clean_headlines)
cnbc_data['Headlines'] = cnbc_data['Headlines'].apply(clean_headlines)
cnbc_data['Description'] = cnbc_data['Description'].apply(clean_headlines)
reuters_data['Headlines'] = reuters_data['Headlines'].apply(clean_headlines)
reuters_data['Description'] = reuters_data['Description'].apply(clean_headlines)

def remove_jim(text):
    words_to_remove = ['jim', 'cramer', 'mad money']
    pattern = r'\b(' + '|'.join(words_to_remove) + r')\b'
    cleaned = re.sub(pattern, '', text, flags=re.IGNORECASE)
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    return cleaned

cnbc_data['Headlines'] = cnbc_data['Headlines'].apply(remove_jim)
cnbc_data['Description'] = cnbc_data['Description'].apply(remove_jim)
cnbc_data.head()
   Headlines                                           Time                 Description
0  a better way to invest in the covid-19 vaccine...  2020-07-17 19:51:00  host recommended buying four companies that ar...
1  cramers lightning round i would own teradyne       2020-07-17 19:33:00  host rings the lightning round bell which mean...
3  cramers week ahead big week for earnings even ...  2020-07-17 19:25:00  well pay more for the earnings of the non-covi...
4  iq capital ceo keith bliss says tech and healt...  2020-07-17 16:24:00  keith bliss iq capital ceo joins closing bell ...
5  wall street delivered the kind of pullback ive...  2020-07-16 19:36:00  look for the stocks of high-quality companies ...
Add Prediction Target
Since our goal is to relate news outlets to the S&P 500, part of our project focuses on predicting the trend of future S&P 500 price changes. To that end, we created a binary column trend_up, which is True if the closing price on the current trading date is lower than tomorrow's.
Code
stock_data = data_hlcv.reset_index()[['Date', 'Close']]
# Flatten the column headers if they are multi-level
stock_data.columns = stock_data.columns.map(lambda x: x[1] if isinstance(x, tuple) else x)
stock_data.rename(columns={stock_data.columns[0]: 'Date', stock_data.columns[1]: 'Close'}, inplace=True)
stock_data['trend_up'] = stock_data['Close'].shift(-1) > stock_data['Close']
stock_data.head()
   Date        Close        trend_up
0  2017-12-01  2642.219971  False
1  2017-12-04  2639.439941  False
2  2017-12-05  2629.570068  False
3  2017-12-06  2629.270020  True
4  2017-12-07  2636.979980  True
We also want to make sure that the proportions of True and False are balanced.
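A quick check of the class balance (a minimal sketch; the original cell is not shown):
Code
# Sketch: proportion of up days vs. down days in the prediction target
print(stock_data['trend_up'].value_counts(normalize=True))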
For our first model, we ignore the temporal relationship by treating every headline as an independent data point. We merged all datasets into one, along with the prediction target.
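The merging cell is not reproduced here; a minimal sketch, assuming headlines are matched to trading days on the calendar date of their timestamp (the name first_model_data comes from the cell below):
Code
# Sketch (assumption): combine the three outlets and attach the trend_up label by date
all_news = pd.concat([
    cnbc_data[['Time', 'Headlines']],
    guardian_data[['Time', 'Headlines']],
    reuters_data[['Time', 'Headlines']],
], ignore_index=True)
all_news['Date'] = all_news['Time'].dt.normalize()
first_model_data = (all_news
                    .merge(stock_data[['Date', 'trend_up']], on='Date', how='inner')
                    .sort_values('Date')
                    .reset_index(drop=True))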
We are using logistic regression with TF-IDF features as our base model.
Individual Headline Model
For this model, we treat every headline as an individual data point. This is almost guaranteed to fail because a single news title contains far too little information, and there will be too much noise.
Code
data = first_model_data[['Headlines', 'trend_up']].copy()
vectorizer = TfidfVectorizer(max_features=300)
X = vectorizer.fit_transform(data['Headlines']).toarray()
y = data['trend_up']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=SEED)

model = LogisticRegression(random_state=SEED, max_iter=1500)
model.fit(X_train, y_train)
y_pred_test = model.predict(X_test)
y_pred_train = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_pred_train)
test_accuracy = accuracy_score(y_test, y_pred_test)

data.to_csv("./dataset/single_dataset.csv", index=False)

# Display results
print("Test Accuracy:", test_accuracy)
print("Train Accuracy:", train_accuracy)
print("Report on test dataset:")
pd.DataFrame(classification_report(y_test, y_pred_test, output_dict=True)).transpose()
Test Accuracy: 0.5147572199301809
Train Accuracy: 0.5766158891357241
Report on test dataset:
              precision  recall    f1-score  support
False         0.443804   0.033861  0.062921  4548.000000
True          0.517461   0.960652  0.672614  4905.000000
accuracy      0.514757   0.514757  0.514757  0.514757
macro avg     0.480633   0.497257  0.367768  9453.000000
weighted avg  0.482023   0.514757  0.379281  9453.000000
Joint Headline Model
By joining all headlines of the same day into one sentence, we hope that TF-IDF can capture more information than in our previous model.
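The cell that builds this model is not shown; a minimal sketch, assuming the same TF-IDF and logistic-regression pipeline as above applied to day-level joined headlines (the feature count here is a placeholder, not the setting that produced the numbers below):
Code
# Sketch (assumption): join all headlines of the same trading day, then refit the baseline
data = (first_model_data
        .groupby('Date')['Headlines']
        .apply(' '.join)
        .reset_index()
        .merge(stock_data[['Date', 'trend_up']], on='Date', how='inner'))
data.to_csv("./dataset/grouped_dataset.csv", index=False)  # assumed source of the file reused later

vectorizer = TfidfVectorizer(max_features=300)
X = vectorizer.fit_transform(data['Headlines']).toarray()
y = data['trend_up']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=SEED)

model = LogisticRegression(random_state=SEED, max_iter=1500).fit(X_train, y_train)
print("Test Accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Train Accuracy:", accuracy_score(y_train, model.predict(X_train)))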
Test Accuracy: 0.5461538461538461
Train Accuracy: 0.6421663442940039
Report on test dataset:
              precision  recall    f1-score  support
False         0.333333   0.035088  0.063492  57.000000
True          0.556452   0.945205  0.700508  73.000000
accuracy      0.546154   0.546154  0.546154  0.546154
macro avg     0.444892   0.490147  0.382000  130.000000
weighted avg  0.458623   0.546154  0.421201  130.000000
We also experimented with the number of features TF-IDF should use in order to obtain the best performance.
Code
complexities = [10, 50, 100, 200, 400, 600, 800, 1000, 1200]

# Lists to store accuracies
train_accuracies = []
test_accuracies = []

for max_features in complexities:
    vectorizer = TfidfVectorizer(max_features=max_features)
    X = vectorizer.fit_transform(data['Headlines']).toarray()
    y = data['trend_up']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False, random_state=SEED)
    model = LogisticRegression(random_state=SEED, max_iter=1500)
    model.fit(X_train, y_train)
    train_accuracy = accuracy_score(y_train, model.predict(X_train))
    test_accuracy = accuracy_score(y_test, model.predict(X_test))
    train_accuracies.append(train_accuracy)
    test_accuracies.append(test_accuracy)

plot_data = pd.DataFrame({
    'Complexity': complexities * 2,
    'Accuracy': train_accuracies + test_accuracies,
    'Type': ['Training'] * len(complexities) + ['Test'] * len(complexities)
})

plt.figure(figsize=(10, 6))
sns.lineplot(data=plot_data, x='Complexity', y='Accuracy', hue='Type', marker='o')
plt.axvline(x=complexities[np.argmax(test_accuracies)], linestyle='--', color='gray', label="Optimal Complexity")
plt.title("Fitting Graph: Training & Test Accuracy")
plt.xlabel("Model Complexity (max_features)")
plt.ylabel("Accuracy")
plt.legend()
plt.grid()
plt.show()
As expected, we do not need very high complexity for TF-IDF, as increasing it will overfit our training dataset. A complexity around 800 yields the best result on the test set, with an accuracy of about 55%.
Transformer Model
We will be using our joined dataset, which concatenates all news from the same day into one line. The main reason is that if we used individual headlines as data points, there would be too much noise in the dataset and the model could not learn any useful features. Notice that shuffle is set to False in our split: the stock and news data are time series, so we must not inform our model about the future.
Code
data = pd.read_csv("./dataset/grouped_dataset.csv")
train_dataset, test_dataset = train_test_split(data, test_size=0.2, shuffle=False, random_state=SEED)
Preparation
We used a tokenizer from HuggingFace, which maps each word (or sub-word) to a unique token ID.
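The cell that loads the tokenizer is not shown; a minimal sketch, assuming a standard pretrained checkpoint (the checkpoint name is an assumption):
Code
# Sketch (assumption): load a pretrained word-piece tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer("stocks rally as markets rebound")["input_ids"])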
Below is the Dataset class used to create PyTorch's data loaders.
Code
# Class that containerizes the dataset
class NewsDataset(Dataset):
    def __init__(self, headlines, labels, tokenizer, max_length):
        self.headlines = headlines
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.headlines)

    def __getitem__(self, idx):
        # Tokenize individual headline
        text = self.headlines[idx]
        tokens = self.tokenizer(
            text,
            padding="max_length",
            truncation=True,
            max_length=self.max_length,
            return_tensors="pt"
        )
        return tokens['input_ids'].squeeze(0), torch.tensor(self.labels[idx], dtype=torch.float)

# Create the test and train data loaders from data
def createDataLoader(train_data, test_data, tokenizer, MAX_LENGTH, BATCH_SIZE):
    train_PYdataset = NewsDataset(
        headlines=train_data['Headlines'].tolist(),
        labels=train_data['trend_up'].astype(int).tolist(),
        tokenizer=tokenizer,
        max_length=MAX_LENGTH
    )
    train_dataloader = DataLoader(
        train_PYdataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )
    test_PYdataset = NewsDataset(
        headlines=test_data['Headlines'].tolist(),
        labels=test_data['trend_up'].astype(int).tolist(),
        tokenizer=tokenizer,
        max_length=MAX_LENGTH
    )
    test_dataloader = DataLoader(
        test_PYdataset,
        batch_size=BATCH_SIZE,
        shuffle=True,
    )
    return train_dataloader, test_dataloader
The module is a custom classifier built with PyTorch. We utilize TransformerEncoderLayer, since our classification task only requires the encoder. After the encoder, we added a fully connected layer so the output is num_output logits.
Code
# An untrained transformer class with multiple hyperparameters
class UntrainedTransformerClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_layers, max_length, num_output, dropout=0.2):
        super(UntrainedTransformerClassifier, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.positional_encoding = nn.Embedding(max_length, embed_dim)
        # Transformer encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim,
            nhead=num_heads,
            dim_feedforward=embed_dim * 4,
            dropout=dropout,
            batch_first=True,
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.linear = nn.Linear(embed_dim, num_output)
        self.dropout = nn.Dropout(dropout)

    def forward(self, input_ids):
        seq_length = input_ids.size(1)
        # Token embedding + positional encoding
        x = self.embedding(input_ids) + self.positional_encoding(torch.arange(seq_length, device=input_ids.device))
        x = self.transformer(x)
        x = self.dropout(x)
        x = x.max(dim=1)[0]
        out = self.linear(x)
        return out
Below are some helper methods for calculating evaluation metrics and graphs.
Code
# Evaluation method for single class
def evaluate(model, dataloader, device):
    model.eval()
    correct = 0
    total_samples = 0
    with torch.no_grad():
        for X, Y in dataloader:
            input_ids = X.to(device)
            labels = Y.to(device)
            logits = model(input_ids)
            outputs = torch.sigmoid(logits).squeeze(1)
            preds = (outputs >= 0.5).float()
            correct += (preds == labels).sum().item()
            total_samples += labels.size(0)
    accuracy = correct / total_samples
    model.train()
    return accuracy

# Evaluation method for two classes
def evaluate_old(model, dataloader, device):
    model.eval()
    correct = 0
    total_samples = 0
    with torch.no_grad():  # Disable gradient computation
        for X, Y in dataloader:
            input_ids = X.to(device)  # Move to device
            labels = Y.to(device)  # Move to device
            logits = model(input_ids)
            # Calculate predictions
            _, preds = torch.max(logits.data, 1)
            correct += (preds == labels).sum().item()
            total_samples += labels.size(0)
    accuracy = correct / total_samples
    model.train()
    return accuracy

# Generates the fitting graph given test and train error
def generateFittingGraph(train_error, test_error, epochs):
    x = range(1, epochs + 1)
    plt.plot(x, train_error, label="Train Error")
    plt.plot(x, test_error, label="Test Error")
    plt.xlabel("Epochs")
    plt.ylabel("Predictive Error")
    plt.title("Fitting Graph")
    plt.legend()
    plt.show()
Base Classifier Transformer Model
To start, we trained the base model with an initial set of hyperparameters.
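The hyperparameter cell and training loop are not reproduced here; a minimal sketch of the setup this section describes, with placeholder values for every hyperparameter (none of the numbers below are necessarily the exact settings used):
Code
# Sketch (assumption): two-class base classifier and its training loop
MAX_LENGTH = 512          # placeholder hyperparameters, not the exact settings used
BATCH_SIZE = 16
EPOCHS = 25
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

train_dataloader, test_dataloader = createDataLoader(train_dataset, test_dataset, tokenizer, MAX_LENGTH, BATCH_SIZE)

model = UntrainedTransformerClassifier(
    vocab_size=tokenizer.vocab_size, embed_dim=128, num_heads=8,
    num_layers=2, max_length=MAX_LENGTH, num_output=2,
).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(EPOCHS):
    for X, Y in train_dataloader:
        input_ids, labels = X.to(device), Y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(input_ids), labels.long())
        loss.backward()
        optimizer.step()
    if (epoch + 1) % 5 == 0:
        print(f"Epoch [{epoch + 1}/{EPOCHS}], Loss: {loss.item():.4f}, "
              f"Train Acc: {evaluate_old(model, train_dataloader, device):.4f}, "
              f"Test Acc: {evaluate_old(model, test_dataloader, device):.4f}")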
Epoch [5/25], Loss: 0.6124, Train Acc: 0.6615, Test Acc: 0.5000
Epoch [10/25], Loss: 0.5535, Train Acc: 0.5919, Test Acc: 0.5615
Epoch [15/25], Loss: 0.5094, Train Acc: 0.9439, Test Acc: 0.4769
Epoch [20/25], Loss: 0.3591, Train Acc: 0.9884, Test Acc: 0.5000
Epoch [25/25], Loss: 0.3440, Train Acc: 0.9923, Test Acc: 0.5385
As we can tell, the test accuracy is around 50%, which means the model did not learn anything useful and ended up guessing randomly. We do not want a model that does nothing more than coin-flipping, so we did more hyperparameter tuning.
Changing to Singular Class
Since we are doing binary classification, there is no need for a two-class output; we can merge it into a single output class.
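A minimal sketch of that change, reusing the assumed setup above; only the output size, loss, and evaluation function change:
Code
# Sketch (assumption): single-logit output with a sigmoid/BCE loss instead of a two-class softmax
model = UntrainedTransformerClassifier(
    vocab_size=tokenizer.vocab_size, embed_dim=128, num_heads=8,
    num_layers=2, max_length=MAX_LENGTH, num_output=1,
).to(device)
criterion = nn.BCEWithLogitsLoss()          # labels stay as floats in {0.0, 1.0}
optimizer = optim.Adam(model.parameters(), lr=1e-4)
# The training loop is unchanged except for the loss call and evaluation:
#   loss = criterion(model(input_ids).squeeze(1), labels)
#   accuracy = evaluate(model, dataloader, device)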
Epoch [5/25], Loss: 0.7740, Train Acc: 0.5629, Test Acc: 0.5538
Epoch [10/25], Loss: 0.7029, Train Acc: 0.6402, Test Acc: 0.4923
Epoch [15/25], Loss: 0.6256, Train Acc: 0.6983, Test Acc: 0.5000
Epoch [20/25], Loss: 0.6154, Train Acc: 0.7737, Test Acc: 0.4769
Epoch [25/25], Loss: 0.6474, Train Acc: 0.8162, Test Acc: 0.5231
Removing Stop Words
During our data analysis, we discovered that the most frequent words in the headlines are stop words. Though they may serve a purpose in long paragraphs, we think they are uninformative in compact sentences like headlines, so we removed them using the stopwords corpus (a WordListCorpusReader) from nltk.
Code
data_clean = data.copy()
data_clean['Headlines'] = data_clean['Headlines'].apply(word_tokenize)
stop_words = set(stopwords.words('english'))
data_clean['Headlines'] = data_clean['Headlines'].apply(lambda x: [word for word in x if word not in stop_words])

i = 0
for headline in data_clean['Headlines']:
    new_headline = ' '.join(headline)
    data_clean.loc[i, 'Headlines'] = new_headline
    i += 1
Epoch [5/25], Loss: 0.7025, Train Acc: 0.5725, Test Acc: 0.6154
Epoch [10/25], Loss: 0.6257, Train Acc: 0.6518, Test Acc: 0.6231
Epoch [15/25], Loss: 0.5229, Train Acc: 0.6905, Test Acc: 0.6000
Epoch [20/25], Loss: 0.7097, Train Acc: 0.7079, Test Acc: 0.6077
Epoch [25/25], Loss: 0.5669, Train Acc: 0.7427, Test Acc: 0.6077
Removing Specific Words (remix, cramers lightning round)
Though we removed most of the words directly related to Mr. Cramer, there are still relevant words left in the CNBC dataset, so we tried removing more words related to him.
Epoch [5/25], Loss: 0.6566, Train Acc: 0.5725, Test Acc: 0.5385
Epoch [10/25], Loss: 0.6646, Train Acc: 0.5977, Test Acc: 0.5692
Epoch [15/25], Loss: 0.6579, Train Acc: 0.6170, Test Acc: 0.5615
Epoch [20/25], Loss: 0.7210, Train Acc: 0.6480, Test Acc: 0.5692
Epoch [25/25], Loss: 0.5859, Train Acc: 0.7292, Test Acc: 0.5462
Increasing Number of Heads
Since our model is processing a long sequence of concatenated sentences, increasing the number of heads in our transformer might capture more relationships, and hopefully let it learn that there are different sources in the input.
Epoch [5/25], Loss: 0.6907, Train Acc: 0.5706, Test Acc: 0.5538
Epoch [10/25], Loss: 0.7336, Train Acc: 0.6499, Test Acc: 0.5077
Epoch [15/25], Loss: 0.7047, Train Acc: 0.7060, Test Acc: 0.5385
Epoch [20/25], Loss: 0.5875, Train Acc: 0.7776, Test Acc: 0.5077
Epoch [25/25], Loss: 0.6778, Train Acc: 0.8472, Test Acc: 0.4769
Increasing Number of Heads with LR Decay
There was a significant overfitting problem when we set NUM_HEAD = 32. To address this, we introduced learning rate decay, which multiplies the learning rate by gamma=0.9 after each epoch.
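A minimal sketch of this configuration using PyTorch's ExponentialLR; the model and loop reuse the assumed setup above:
Code
# Sketch (assumption): 32 attention heads plus per-epoch exponential learning-rate decay
model = UntrainedTransformerClassifier(
    vocab_size=tokenizer.vocab_size, embed_dim=128, num_heads=32,
    num_layers=2, max_length=MAX_LENGTH, num_output=1,
).to(device)
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = ExponentialLR(optimizer, gamma=0.9)

for epoch in range(EPOCHS):
    for X, Y in train_dataloader:
        input_ids, labels = X.to(device), Y.to(device)
        optimizer.zero_grad()
        loss = criterion(model(input_ids).squeeze(1), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()  # multiply the learning rate by gamma after each epoch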
Epoch [5/25], Loss: 0.6720, Train Acc: 0.6054, Test Acc: 0.4923
Epoch [10/25], Loss: 0.6274, Train Acc: 0.6190, Test Acc: 0.5231
Epoch [15/25], Loss: 0.7014, Train Acc: 0.6402, Test Acc: 0.5000
Epoch [20/25], Loss: 0.6835, Train Acc: 0.6422, Test Acc: 0.5077
Epoch [25/25], Loss: 0.5104, Train Acc: 0.6499, Test Acc: 0.5077
Conclusion and Best Model
Our best result was about 60% accuracy, obtained when we removed stop words. For context analysis we usually should not remove stop words, but it may be that stop words in news headlines are uninformative compared with the other words.
Our model did relatively well at prediction, with very few false negatives, which we can take advantage of.
Results
Performance Summary of Models
We evaluated multiple machine learning models to predict the daily movement of the S&P 500 index based on news headlines. The table below summarizes the training and testing accuracies of the models used in this study:
Model                                                 Train Accuracy (%)   Test Accuracy (%)
Transformer Classifier (Increased Heads + LR Decay)   65.0                 50.8
Model Performance Analysis
The Logistic Regression model with TF-IDF features served as a baseline for our analysis. Performance improved when headlines were aggregated daily rather than analyzed individually. This finding highlights the importance of incorporating richer contextual information to improve predictions.
Transformer-based models demonstrated varied performance based on preprocessing and hyperparameter tuning. The Transformer Classifier with stop word removal achieved the highest test accuracy of 61.5%, showcasing the significance of reducing noise in textual data. However, the other configurations of Transformer models, such as increasing the number of attention heads or applying learning rate decay, failed to generalize effectively, often resulting in overfitting.
Overall, while some models outperformed random guessing, the relatively modest accuracy highlights the challenges inherent in predicting stock market movements based solely on news headlines. These results reflect the complexity of financial markets, which depend on multiple interrelated factors beyond news sentiment.
Discussion
Quantitative Trading Strategy
Before we formulate our strategy, let's visualize how our best model (the Transformer with stop words removed) performs on the training dataset.
Code
model = model_rm1

def visualize_predictions(date, predictions):
    chart_data = pd.DataFrame({
        "Date": date,
        "Prediction": [1 if pred else -1 for pred in predictions]  # +1 for True, -1 for False
    })
    chart_data["Date"] = pd.to_datetime(chart_data["Date"])
    chart_data.sort_values("Date", inplace=True)
    plt.figure(figsize=(12, 4))
    plt.bar(chart_data["Date"], chart_data["Prediction"],
            color=chart_data["Prediction"].map({1: "green", -1: "red"}),
            width=1.0)  # Adjust width for better appearance
    plt.xlabel("Date")
    plt.ylabel("Prediction")
    plt.title("Predictions Over Time")
    plt.axhline(0, color="black", linewidth=0.8, linestyle="--")  # Add a horizontal line at 0
    plt.xticks(rotation=45)  # Rotate date labels for readability
    plt.tight_layout()
    plt.show()

train_dataset_quant = NewsDataset(
    headlines=train_dataset['Headlines'].tolist(),
    labels=train_dataset['trend_up'].astype(int).tolist(),
    tokenizer=tokenizer,
    max_length=MAX_LENGTH
)
train_dataloader_quant = DataLoader(
    train_dataset_quant,
    batch_size=BATCH_SIZE,
    shuffle=False,
)

model.eval()
correct = 0
total_samples = 0
FP = 0  # False Positives
FN = 0  # False Negatives
TP = 0  # True Positives
TN = 0  # True Negatives
predictions = []
with torch.no_grad():
    for X, Y in train_dataloader_quant:
        input_ids = X.to(device)
        labels = Y.to(device)
        logits = model(input_ids)
        outputs = torch.sigmoid(logits).squeeze(1)
        preds = (outputs >= 0.5)
        predictions.extend(preds.cpu().numpy().astype(bool))

visualize_predictions(train_dataset["Date"], predictions)
From the graph above, we can see that most of the time our model predicts true. We can exploit this because, from the previous section, we know that our model has very few false negatives, so when it does predict a downturn in the index, it is usually right. Our strategy is therefore:
Hold as long as the model predicts a bull market.
When the model predicts a bear market, sell 80% of holdings and open a three-day short position with the proceeds.
We simulated our strategy on the test dataset, using Vanguard's S&P 500 ETF (VOO) as our trading vehicle, since the ETF tracks the index with very short lag, which is sufficient given that we are not doing high-frequency trading.
We deploy our strategy and compare it with a baseline strategy of holding the ETF for the entire period. The results are the following:
Code
holdings = quant_df.iloc[0]["Close"] * 100
returns = []
for i in range(len(quant_df)):
    if i <= len(quant_df) - 4:
        today = quant_df.iloc[i]
        tmr = quant_df.iloc[i + 1]
        day3 = quant_df.iloc[i + 3]
        if today["Predictions"] == True:  # Bull market
            holdings *= (tmr["Close"] / today["Close"])
        elif today["Predictions"] == False:  # Bear market
            if holdings > 0:
                # Sell 80% of holdings and short with the proceeds for 3 days
                shortCapital = holdings * 0.8
                holdings *= 0.2
                holdings = holdings * (tmr["Close"] / today["Close"]) + shortCapital * (today["Close"] / day3["Close"])
    returns.append(holdings)

# Add returns to the DataFrame
quant_df["Portfolio Value"] = np.array(returns) / 100

print(f"Portfolio is {quant_df.iloc[-1]['Portfolio Value'] / quant_df.iloc[0]['Portfolio Value'] * 100:.4f}% after half a year.")
print(f"Long holding is {quant_df.iloc[-1]['Close'] / quant_df.iloc[0]['Close'] * 100:.4f}% after half a year.")

# Plot results
plt.figure(figsize=(12, 6))
plt.plot(quant_df["Date"], quant_df["Portfolio Value"], label="Portfolio Value")
plt.plot(quant_df["Date"], quant_df["Close"], label="Long Hold Value", alpha=0.7)
plt.title("Portfolio Performance")
plt.xlabel("Date")
plt.ylabel("Value")
plt.legend()
plt.show()
Portfolio is 111.7785% after half a year.
Long holding is 97.8522% after half a year.
The conclusion is that our portfolio with this quant strategy surpassed the baseline significantly: we made over 10% gross profit within 6 months, while the base strategy barely held its value.
Summary
Initially, we believed news sentiment combined with market data would be a strong predictor. Early efforts, like the TF-IDF-based model, set a baseline but highlighted the limitations of simple feature extraction. The Transformer classifier offered modest improvements, achieving a 61.5% test accuracy after tuning. However, the gains were incremental, suggesting the model struggled with the high noise and complexity inherent in financial data.
One major challenge was the dataset itself. Stock movements depend on a mix of news, macroeconomic factors, and investor behavior, making it difficult for any single model to perform well. Additionally, the binary classification approach may have oversimplified the problem. A regression model could provide more nuanced predictions, reflecting the continuous nature of market changes.
Our results are moderately believable—they outperform random guessing but remain far from reliable for decision-making. This reflects both the unpredictable nature of financial markets and the limitations of current modeling approaches.
Conclusion
Overall, our second model performed worse than our expectations. Although it beat the first model, the improvement was marginal relative to the amount of hyperparameter tuning that was done. One possible improvement is to create an ensemble with a time-series model and reframe the problem as regression. We believe there is too much noise in the data for simple classification, so reworking the problem may make the best use of the model. These changes could improve the model; however, we believe the improvements would be limited due to the nature of the model and the data itself.
Future Models. Another direction we plan to look into is LSTMs. These models perform well in both NLP and time-series tasks, and since our problem depends heavily on both, LSTMs could be a good fit. We have also already begun working with BERT, because it is bidirectional and well suited to sentiment analysis. We believe that these properties, combined with attention, can boost the accuracy on this dataset.