Stock Market Forecasting using Time Series Analysis with ARIMA Model

Hardikkumar Last Updated : 31 May, 2024

Introduction

The stock market is a marketplace where shares of publicly listed companies are bought and sold. Each stock exchange publishes its own stock index, a value computed from the prices of a selected group of equities. The index summarizes the overall market and gives a reference point for tracking market movement over time. Because the stock market can have a significant impact on individuals and on the economy as a whole, forecasts of market trends that turn out to be accurate can help reduce the risk of loss and improve returns.

In this tutorial, we will use the ARIMA model to forecast the stock price of ARCH CAPITAL GROUP. The walkthrough covers loading the price data, testing it for stationarity, selecting the model parameters, and evaluating the forecast, keeping in mind the low predictability and high volatility of financial markets.

Learning Objectives

Learn how the Autoregressive Integrated Moving Average (ARIMA) model utilizes historical data to forecast future stock market prices and stock returns.
Gain practical experience in applying ARIMA methodology to real-world stock data to identify trends and seasonal patterns in stock market movements.
Develop skills to assess the accuracy of ARIMA model predictions using common statistical metrics like MSE, MAE, RMSE, and MAPE, so that you can make better-informed trading decisions.

This article was published as a part of the Data Science Blogathon.

What is ARIMA Model?

The Autoregressive Integrated Moving Average (ARIMA) model is a widely used forecasting tool in time series analysis. It handles non-stationary data by differencing the series until it is stationary, a necessary step before the autoregressive and moving-average terms can be fitted. ARIMA predicts future values from a series' own historical values, which has made it popular in fields such as finance and economics. By regressing on past values and past forecast errors, ARIMA is typically used to forecast short-term movements in stock prices and stock returns.

ARIMA’s Role in Forecasting Market Prices

ARIMA is applied to the stock market by analyzing historical prices to predict future ones, which can inform short-term investment decisions. It combines three components: Autoregression (AR), Integration through differencing (I), and Moving Average (MA). The AR component models the relationship between a stock's current price and its own past values. Differencing replaces each observation with its change from the previous one, which removes trends and helps make the series stationary. The MA component models the current value as a function of past forecast errors, capturing short-lived shocks. Together, these components let ARIMA capture the trends and autocorrelation patterns in time series such as stock prices and returns.
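To make the three components concrete, here is a minimal sketch that simulates an ARIMA(1,1,1)-style series by hand with NumPy. The coefficients `phi` and `theta`, the drift `c`, and the noise scale are illustrative assumptions, not values estimated from the ACGL data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
phi, theta, c = 0.5, 0.3, 0.01          # hypothetical AR coefficient, MA coefficient, drift
eps = rng.normal(scale=0.02, size=n)    # white-noise shocks

# AR and MA parts act on the differenced series w_t:
#   w_t = c + phi * w_{t-1} + eps_t + theta * eps_{t-1}
w = np.zeros(n)
for t in range(1, n):
    w[t] = c + phi * w[t - 1] + eps[t] + theta * eps[t - 1]

# "Integrated" part (d = 1): the observed series is the running sum of w_t
y = np.cumsum(w)

# Differencing once undoes the integration and recovers w_t
recovered = np.diff(y)
print(np.allclose(recovered, w[1:]))
```

The series `y` trends upward because of the drift, while its first difference `w` fluctuates around a constant mean, which is the reason ARIMA differences a series before fitting the AR and MA terms.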

We will use the ARIMA model to forecast the stock price of ARCH CAPITAL GROUP in this tutorial.

Load Required Libraries

!pip install pmdarima
import os
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from pmdarima.arima import auto_arima
from sklearn.metrics import mean_squared_error, mean_absolute_error
import math

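# Load the daily price history; the Date column becomes a DatetimeIndex.
# (The date_parser argument is deprecated in recent pandas versions and
# is not needed for ISO-formatted dates.)
stock_data = pd.read_csv(
    'acgl.us.txt',
    sep=',',
    index_col='Date',
    parse_dates=['Date'],
).fillna(0)  # caution: a 0 in 'Close' would break the log transform used later
stock_data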

Output:

Visualize the Stock’s Daily Closing Price

#plot close price
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Date')
plt.ylabel('Close Prices')
plt.plot(stock_data['Close'])
plt.title('ARCH CAPITAL GROUP closing price')
plt.show()

Output:

We can also use a probability distribution to visualize the data in our series.

#Distribution of the dataset
df_close = stock_data['Close']
df_close.plot(kind='kde')

Output:

Test for Stationarity

A time series is usually described as having three systematic components (level, trend, and seasonality) and one non-systematic component, noise. They are defined as follows:

The level is the average value of the series.
The trend is the increasing or decreasing movement of the series over time.
Seasonality is a cycle that repeats in the series at a fixed, short-term interval.
Noise is the random variation in the series.
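The four components can be illustrated with a small synthetic series. This is a sketch under assumed values: the level of 50, the slope of 0.1, the 12-step cycle, and the noise scale are all made up for illustration and are unrelated to the ACGL data.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(240)                                 # 240 hypothetical time steps

level = 50.0                                       # average value of the series
trend = 0.1 * t                                    # steady upward movement
seasonality = 2.0 * np.sin(2 * np.pi * t / 12)     # cycle repeating every 12 steps
noise = rng.normal(scale=0.5, size=t.size)         # random variation

series = level + trend + seasonality + noise       # additive combination

# The seasonal term cancels over each whole cycle, so the mean of every
# 12-step block tracks level + trend (plus a little averaged noise)
block_means = series.reshape(-1, 12).mean(axis=1)
print(block_means[:3])
```

A multiplicative series would multiply the components instead of adding them; taking the log of a multiplicative series turns it into an additive one.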

Because ARIMA and many other time series models assume stationary data, we must first determine whether the series is stationary.

A time series is stationary when its statistical properties, such as mean and variance, do not change over time. This stability matters because most forecasting models require a stationary series to produce reliable results. Non-stationary series, which show trends or seasonal variations, often need adjustments such as differencing or transformation to become stationary.

ADF (Augmented Dickey-Fuller) Test

The Augmented Dickey-Fuller (ADF) test is one of the most widely used statistical tests for stationarity. It checks whether a series has a unit root; a series with a unit root is non-stationary. The test's null and alternative hypotheses are:

Null Hypothesis: The series has a unit root (a = 1, where a is the coefficient on the previous value in y_t = a * y_(t-1) + error).
Alternative Hypothesis: The series has no unit root.

If the null hypothesis cannot be rejected, the series is treated as non-stationary. A non-stationary series may be trend-stationary (stationary once a deterministic trend is removed) or difference-stationary (stationary once differenced).

As a visual check, the series looks stationary when its rolling mean and rolling standard deviation are roughly flat lines (constant mean and constant variance).
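The idea behind the test can be sketched with NumPy alone. The helper `dickey_fuller_stat` below is a hypothetical name; it implements the simple (non-augmented) Dickey-Fuller regression with a constant and omits the lagged-difference terms that statsmodels' `adfuller` adds. The 5% critical value is the one reported in this article's test output and is only approximate for other sample sizes.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of gamma in: diff(y)_t = alpha + gamma * y_{t-1} + e_t.

    gamma = a - 1, so the unit-root null (a = 1) corresponds to gamma = 0.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])    # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])    # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)              # OLS covariance matrix
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(42)
shocks = rng.normal(size=1000)
random_walk = np.cumsum(shocks)    # built with a unit root (a = 1)
white_noise = shocks               # stationary by construction

CRITICAL_5PCT = -2.862445          # 5% critical value from this article's ADF output

for name, s in [("random walk", random_walk), ("white noise", white_noise)]:
    stat = dickey_fuller_stat(s)
    verdict = "reject unit root" if stat < CRITICAL_5PCT else "cannot reject unit root"
    print(f"{name}: statistic = {stat:.2f} -> {verdict}")
```

The statistic is compared against Dickey-Fuller critical values rather than the usual t-distribution, which is why a strongly negative value is needed before the unit root is rejected.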

# Test for stationarity
def test_stationarity(timeseries):
    # Determine rolling statistics over a 12-observation window
    rolmean = timeseries.rolling(12).mean()
    rolstd = timeseries.rolling(12).std()
    # Plot rolling statistics
    plt.plot(timeseries, color='blue', label='Original')
    plt.plot(rolmean, color='red', label='Rolling Mean')
    plt.plot(rolstd, color='black', label='Rolling Std')
    plt.legend(loc='best')
    plt.title('Rolling Mean and Standard Deviation')
    plt.show(block=False)
    print("Results of dickey fuller test")
    adft = adfuller(timeseries, autolag='AIC')
    # adfuller returns an unlabeled tuple, so we label the first four values
    # and then add the critical values from the dictionary at index 4
    output = pd.Series(adft[0:4], index=['Test Statistics', 'p-value', 'No. of lags used', 'Number of observations used'])
    for key, values in adft[4].items():
        output['critical value (%s)' % key] = values
    print(output)

test_stationarity(df_close)

Output:

Figure: rolling mean and standard deviation of the closing price

Results of dickey fuller test
Test Statistics                   1.374899
p-value                           0.996997
No. of lags used                  5.000000
Number of observations used    3195.000000
critical value (1%)              -3.432398
critical value (5%)              -2.862445
critical value (10%)             -2.567252
dtype: float64

The rolling-statistics plot shows the mean and standard deviation increasing over time, indicating that our series is not stationary.

We cannot reject the null hypothesis because the p-value (0.997) is greater than 0.05. In addition, the test statistic (1.37) is greater than all of the critical values. The data is therefore non-stationary.

Eliminate Trend and Seasonality

Seasonality and trend may need to be separated from our series before we can undertake a time series analysis. Separating them out helps make the resulting series stationary.

Let's separate the trend and seasonality from the time series.

# To separate the trend and the seasonality from a time series,
# we can decompose the series using the following code.
# Recent statsmodels versions take `period`; older ones used `freq`.
result = seasonal_decompose(df_close, model='multiplicative', period=30)
fig = result.plot()
fig.set_size_inches(16, 9)

Output:

Figure: trend and seasonality

To reduce the magnitude of the values and dampen the growing trend in the series, we first take the log of the series. We then calculate the rolling average of the logged series. Here the rolling average uses a window of 12 observations: at each point in the series, it is the mean of the current value and the 11 values before it. Since the data is daily, that window covers 12 trading days.
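A toy example shows what a 12-observation rolling window computes. The dates and prices below are hypothetical and serve only to make the window arithmetic visible.

```python
import numpy as np
import pandas as pd

# Hypothetical prices on 15 consecutive business days
dates = pd.bdate_range("2020-01-01", periods=15)
prices = pd.Series(np.arange(1.0, 16.0), index=dates)

log_prices = np.log(prices)
rolling_mean = log_prices.rolling(12).mean()

# The first 11 entries are NaN because a full 12-row window is not yet available
print(rolling_mean.isna().sum())

# From the 12th row on, each value is the mean of that row and the 11 before it
print(np.isclose(rolling_mean.iloc[11], log_prices.iloc[:12].mean()))
```

The window counts rows, not calendar time, so `rolling(12)` on daily data spans 12 trading days regardless of weekends or holidays.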

# If the series is not stationary, eliminate the trend
from pylab import rcParams
rcParams['figure.figsize'] = 10, 6
df_log = np.log(df_close)
moving_avg = df_log.rolling(12).mean()
std_dev = df_log.rolling(12).std()
plt.title('Moving Average')
plt.plot(std_dev, color="black", label="Standard Deviation")
plt.plot(moving_avg, color="red", label="Mean")
plt.legend(loc='best')  # call legend after plotting so the labels exist
plt.show()

Output:

Split Data into Training and Test Sets

Next we will build an ARIMA model and train it on the log closing prices in the training set. First, let's split the data into training and test sets and visualize the split.

# Split data into training and test sets (first 90% / last 10%, in time order)
train_data, test_data = df_log[3:int(len(df_log)*0.9)], df_log[int(len(df_log)*0.9):]
plt.figure(figsize=(10,6))
plt.grid(True)
plt.xlabel('Dates')
plt.ylabel('Closing Prices')
plt.plot(train_data, 'green', label='Train data')  # plot only the training part in green
plt.plot(test_data, 'blue', label='Test data')
plt.legend()
plt.show()

Output:

Figure: train and test data
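The split is chronological rather than random, so the model never sees observations from the period it is asked to forecast. Below is a minimal sketch of that idea; `chronological_split` is a hypothetical helper, and the demo series only assumes the length of 3201 rows implied by the ADF output above (3195 observations used, plus 5 lags, plus 1).

```python
import numpy as np
import pandas as pd

def chronological_split(series, train_fraction=0.9):
    """Split a time-ordered series without shuffling: earliest rows train, latest rows test."""
    cut = int(len(series) * train_fraction)
    return series.iloc[:cut], series.iloc[cut:]

# Hypothetical stand-in series with 3201 business-day observations
idx = pd.bdate_range("2005-01-03", periods=3201)
demo = pd.Series(np.linspace(1.0, 2.0, 3201), index=idx)

train, test = chronological_split(demo)
print(len(train), len(test))
print(train.index.max() < test.index.min())
```

With 3201 rows the test set has 321 observations, which matches the 321-step forecast made later in the article.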

It's time to choose the ARIMA model's p, d, and q parameters. Last time, we chose the values of p, d, and q by looking at the ACF and PACF charts; this time we will use Auto ARIMA to find the best parameters without inspecting those graphs.

To clarify, the p parameter is the number of lag observations included in the model; it defines the autoregressive part, which predicts future values from past values. The d parameter is the degree of differencing, that is, the number of times previous observations are subtracted from current ones to remove trend and make the data stationary. Lastly, q is the order of the moving-average part: the number of lagged forecast errors included in the model. Understanding these parameters matters because they directly affect the model's ability to capture the underlying patterns in the time series.

Auto ARIMA: Find the Best Parameters

The auto_arima function searches for the best parameters for an ARIMA model and returns the fitted model. It is based on the widely used R function forecast::auto.arima.

The auto_arima function works by performing differencing tests (e.g., Kwiatkowski–Phillips–Schmidt–Shin, Augmented Dickey-Fuller, or Phillips–Perron) to determine the order of differencing, d, and then fitting models within the start_p, max_p, start_q, and max_q ranges. If the seasonal option is enabled, auto_arima also conducts the Canova-Hansen test to determine the optimal order of seasonal differencing, D, and then seeks the optimal P and Q hyper-parameters.
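The core idea of fitting several candidate orders and keeping the one with the best information criterion can be sketched in a few lines. This is not pmdarima's stepwise algorithm: the hypothetical helper `fit_ar_aic` only tries pure AR(p) models fitted by least squares on simulated data, and it scores them with a Gaussian AIC.

```python
import numpy as np

def fit_ar_aic(x, p, max_p):
    """Fit AR(p) with a constant by least squares and return its AIC.

    The first max_p observations are held back for every p so that all
    candidate models are scored on the same sample.
    """
    y = x[max_p:]
    cols = [np.ones(len(y))] + [x[max_p - k:len(x) - k] for k in range(1, p + 1)]
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n = len(y)
    sigma2 = resid @ resid / n
    # Gaussian AIC up to an additive constant: n*log(sigma^2) + 2*(number of parameters)
    return n * np.log(sigma2) + 2 * (p + 1)

# Simulate a stationary AR(2) process with hypothetical coefficients 0.6 and -0.3
rng = np.random.default_rng(7)
n = 2000
e = rng.normal(size=n)
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + e[t]

max_p = 3
aic = {p: fit_ar_aic(x, p, max_p) for p in range(max_p + 1)}
best_p = min(aic, key=aic.get)
print(aic)
print("selected p:", best_p)
```

Lower AIC is better: the criterion rewards a smaller residual variance and penalizes each extra parameter, which is how an automated search avoids always picking the largest model.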

model_autoARIMA = auto_arima(train_data, start_p=0, start_q=0,
                      test='adf',       # use adftest to find optimal 'd'
                      max_p=3, max_q=3, # maximum p and q
                      m=1,              # frequency of series
                      d=None,           # let model determine 'd'
                      seasonal=False,   # No Seasonality
                      start_P=0, 
                      D=0, 
                      trace=True,
                      error_action='ignore',  
                      suppress_warnings=True, 
                      stepwise=True)
print(model_autoARIMA.summary())
model_autoARIMA.plot_diagnostics(figsize=(15,8))
plt.show()

Output:

So, how should the plot diagnostics be interpreted?

Top left: The residual errors appear to fluctuate around a mean of zero with a uniform variance.

Top right: The density plot suggests a normal distribution with a mean of zero.

Bottom left: The dots should fall along the red line. Any significant deviations would indicate a skewed distribution.

Bottom right: The correlogram, also known as the ACF plot, shows that the residual errors are not autocorrelated. Any autocorrelation would imply that the residual errors contain a pattern the model does not explain, in which case you would need to add more Xs (predictors) to the model.
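The correlogram check can be reproduced numerically. In the sketch below, `sample_acf` is a hypothetical helper and the residuals are simulated white noise standing in for the model's residuals; the band of 1.96 divided by the square root of n is the usual approximate 95% bound for an uncorrelated series.

```python
import numpy as np

def sample_acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = x @ x
    return np.array([1.0 if k == 0 else (x[k:] @ x[:-k]) / denom
                     for k in range(max_lag + 1)])

rng = np.random.default_rng(3)
residuals = rng.normal(size=1000)            # stand-in for model residuals
acf = sample_acf(residuals, max_lag=20)
bound = 1.96 / np.sqrt(len(residuals))       # approximate 95% band for white noise

# Lags (excluding lag 0) whose autocorrelation falls outside the band
outside = np.flatnonzero(np.abs(acf[1:]) > bound) + 1
print("lags outside the band:", outside)
```

With a 95% band, roughly one lag in twenty is expected to fall outside by chance, so an occasional excursion is not a concern; many lags outside the band would point to structure the model has missed.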

The Auto ARIMA model therefore assigned the values 1, 1, and 2 to p, d, and q, respectively.

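# Modeling
# Build the model with the order selected by Auto ARIMA
model = ARIMA(train_data, order=(1,1,2))
fitted = model.fit()  # fit() in statsmodels.tsa.arima.model.ARIMA takes no `disp` argument
print(fitted.summary())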

Output:

Modeling and Forecasting

Let’s now begin forecasting stock prices on the test dataset with a 95% confidence level.

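# Forecast one step for every observation in the test set (321 steps)
forecast = fitted.get_forecast(steps=len(test_data))
fc = np.asarray(forecast.predicted_mean)
conf = np.asarray(forecast.conf_int(alpha=0.05))  # 95% confidence interval

# Make as pandas series
fc_series = pd.Series(fc, index=test_data.index)
lower_series = pd.Series(conf[:, 0], index=test_data.index)
upper_series = pd.Series(conf[:, 1], index=test_data.index)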
# Plot
plt.figure(figsize=(10,5), dpi=100)
plt.plot(train_data, label='training data')
plt.plot(test_data, color = 'blue', label='Actual Stock Price')
plt.plot(fc_series, color = 'orange',label='Predicted Stock Price')
plt.fill_between(lower_series.index, lower_series, upper_series, 
                 color='k', alpha=.10)
plt.title('ARCH CAPITAL GROUP Stock Price Prediction')
plt.xlabel('Time')
plt.ylabel('ARCH CAPITAL GROUP Stock Price')
plt.legend(loc='upper left', fontsize=8)
plt.show()

Output:
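Because the model was trained on the log of the closing price, its forecast and confidence bounds are on the log scale. A minimal sketch of mapping them back to prices is shown below; `to_price_scale` is a hypothetical helper and the numbers are made up for illustration.

```python
import numpy as np

def to_price_scale(log_forecast, log_conf_int):
    """Map a point forecast and its interval from log prices back to prices."""
    log_forecast = np.asarray(log_forecast, dtype=float)
    log_conf_int = np.asarray(log_conf_int, dtype=float)
    lower = np.exp(log_conf_int[:, 0])
    upper = np.exp(log_conf_int[:, 1])
    return np.exp(log_forecast), lower, upper

# Hypothetical log-scale forecast and 95% bounds for three steps
fc_log = np.log([80.0, 81.0, 82.0])
conf_log = np.log([[78.0, 82.0], [77.0, 85.0], [76.0, 88.0]])

price, lower, upper = to_price_scale(fc_log, conf_log)
print(price)
print(lower)
print(upper)
```

Exponentiating preserves the interval bounds, but the exponentiated point forecast corresponds to the median rather than the mean of the price distribution, which is worth keeping in mind when reporting it.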

Evaluate Model Performance

Our model performed well, as the plot shows. Let's take a look at some of the most common accuracy metrics for evaluating forecast results:

# report performance
mse = mean_squared_error(test_data, fc)
print('MSE: '+str(mse))
mae = mean_absolute_error(test_data, fc)
print('MAE: '+str(mae))
rmse = math.sqrt(mean_squared_error(test_data, fc))
print('RMSE: '+str(rmse))
mape = np.mean(np.abs(fc - test_data)/np.abs(test_data))
print('MAPE: '+str(mape))

Output:

With a MAPE of around 2.5%, the forecasts deviate from the actual log prices by about 2.5% on average over the 321 test observations, which can loosely be read as 97.5% accuracy.
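The four metrics are simple functions of the forecast errors, as the following self-contained sketch shows. `forecast_metrics` is a hypothetical helper, and the two-point example uses made-up numbers so the arithmetic can be checked by hand.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return MSE, MAE, RMSE, and MAPE for a forecast."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    err = predicted - actual
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(mse),
        # MAPE is returned as a fraction; multiply by 100 for a percentage
        "MAPE": np.mean(np.abs(err) / np.abs(actual)),
    }

# Errors are +10 and -10, so MSE = 100, MAE = 10, RMSE = 10,
# and MAPE = (10/100 + 10/200) / 2 = 0.075
metrics = forecast_metrics(actual=[100.0, 200.0], predicted=[110.0, 190.0])
print(metrics)
```

MSE and RMSE penalize large errors more heavily than MAE, while MAPE is scale-free, which makes it convenient for comparing forecasts across series with different price levels.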

Conclusion

Python provides a solid framework for stock market forecasting with the ARIMA model. In this tutorial we loaded the price data, tested it for stationarity, selected the model order with Auto ARIMA, and forecast the test period with a MAPE of around 2.5%. Forecasts like this can support investment strategies and show the value of time series analytics in modern financial markets.

Key Takeaways

ARIMA models are powerful for forecasting stock market trends by analyzing historical data and identifying potential future price movements.
The performance of the ARIMA model can be evaluated using metrics like MSE, MAE, RMSE, and MAPE, which quantify how accurate the stock price predictions are.
The effectiveness of ARIMA models in predicting short-term market movements supports their use in developing informed trading strategies and can help manage investment risk.

Frequently Asked Questions

Q1. How do deep learning models compare to ARIMA in stock market forecasting?

A. Deep learning models, especially those using recurrent neural networks (RNNs) like LSTM (Long Short-Term Memory), often outperform ARIMA when handling fluctuations in big data sets due to their ability to capture complex dependencies in the data over time.

Q2. Can neural networks be applied to technical analysis of stocks?

A. Yes, neural networks, particularly artificial neural networks (ANNs) and deep learning structures, are increasingly used in technical analysis to predict stock trend movements and valuation changes by learning from historical price data.

Q3. What is the significance of incorporating big data in financial valuation using methodology like ARIMA?

A. Utilizing big data allows ARIMA and similar methodologies to enhance forecast accuracy by analyzing a more comprehensive range of dependencies, such as GDP fluctuations and other macroeconomic factors, which significantly impact stock prices and market trends.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
