Time Series Analysis - Part I

Madhu Ramiah
5 min read · Aug 20, 2019


Time series analysis is used for a wide variety of applications like stock market analysis, sales forecasting, economic forecasting, budget forecasting, etc. In all the previous blogs, we read about a variety of classification and regression techniques. Now comes a question: is time series a classification algorithm, a regression algorithm, or neither? To answer this, let's look at the data below.

Here we can see only 2 fields: one is the date and the other is the number of passengers. We want to predict the number of passengers for the month 1949–12. Since the prediction is a numeric value, this is not a classification problem. For a regression problem, there should be a dependent variable and at least one independent variable. Here we have only one variable, so this cannot be a regression problem either. This is a time series problem. In a time series, we have only one predictor variable, observed over a specific time period. The time interval can be hourly, daily, weekly, bi-weekly, monthly, quarterly, half-yearly or yearly.
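
As a quick reference, here is a minimal sketch of loading such a dataset with pandas. The file name and column names ('AirPassengers.csv', 'Month', '#Passengers') are assumptions; adjust them to whatever the downloaded file uses.

import pandas as pd

# Load the monthly data; file and column names are assumptions.
df = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
print(df.head())   # a date index and a single numeric column of passenger counts
print(df.shape)    # number of monthly observations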

When is the data a time series and when is it not?

The data is a time series only if the predictor variable depends on itself over a period of time. For example, when forecasting the temperature at a particular place, if today's temperature (t) depends on the temperature of the previous day (t-1) and the days before that (t-2), (t-3), …, then it is a time series problem. But if the temperature depends on other factors like wind, humidity, etc., then it is not a time series problem; it becomes a regression problem (with dependent and independent variables). Now, let us see how to solve this type of problem.
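
One quick way to see this self-dependence in practice is to check the correlation of a series with lagged copies of itself; high correlations at small lags suggest the variable depends on its own past values. A small sketch using the passenger data (file and column names are assumptions):

import pandas as pd

# Correlation of the series with its own values 1, 2 and 3 months earlier.
passengers = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],
                         index_col='Month')['#Passengers']
for lag in (1, 2, 3):
    print('lag', lag, 'autocorrelation:', round(passengers.autocorr(lag=lag), 3))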

ARIMA Model:

The most commonly used model for solving a time series problem is the ARIMA model. But to use the ARIMA model, there are a few pre-requisites, listed below.

  1. Check if the data is stationary
  2. Make the data stationary if it is non-stationary
  3. Find the moving-average order (q) from the auto-correlation function and the auto-regressive order (p) from the partial auto-correlation function
  4. Plug the above ‘p’ and ‘q’ into the ARIMA model and forecast
  5. Convert the data back into the original scale

In this blog, we will discuss how to check whether the data is stationary. We will cover the remaining topics in subsequent blog posts.

Air Passenger Data:

In the graph below, you can see the number of airline passengers from 1949 to 1960. You can download this data from here. Here we have only 2 columns: Date and number of passengers. This is time series data, where the predictor variable #Passengers depends on the same variable across a given time period.

Airline passenger data

Here we can see that the same pattern repeats every year: the spikes and drops in the data occur during the same period each year. This is called Seasonality. You can also see that, year over year, the number of passengers keeps increasing. This is called Trend. A trend can be increasing, decreasing, or a mix of both, and sometimes there is no trend at all. The trend can be estimated using the moving average of the past ’n’ months; I used n=12 for this data set. In the graph below, the orange line denotes the trend. Here the trend is upwards, indicating an increase in the number of passengers year over year.
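
A minimal sketch of computing and plotting such a 12-month moving average, assuming the series is loaded as in the earlier snippet:

import pandas as pd
import matplotlib.pyplot as plt

passengers = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],
                         index_col='Month')['#Passengers']  # names assumed

# A 12-month moving average smooths out the seasonality and shows the trend.
rolling_mean = passengers.rolling(window=12).mean()

plt.plot(passengers, label='#Passengers')
plt.plot(rolling_mean, label='12-month moving average (trend)')
plt.legend()
plt.show()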

Airline passengers trend

The next step is to check whether the data is stationary. There are a few ways to check stationarity:

  1. Look at the histogram of the data: if it looks roughly like a normal distribution, the data is likely stationary; otherwise it is not. The histogram below does not look normally distributed, which suggests the data is not stationary.
Histogram of #passengers
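
For reference, a sketch of drawing this histogram with matplotlib (file and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

passengers = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],
                         index_col='Month')['#Passengers']  # names assumed

# A clearly skewed histogram like this one hints that the series is not stationary.
passengers.hist(bins=20)
plt.xlabel('#Passengers')
plt.ylabel('Frequency')
plt.show()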

2. Constant mean and variance: stationary data has a mean and variance that remain constant over time. Doing the calculations on the earlier and later parts of the series, the two mean values are clearly different, so the mean is not constant over time.

Mean of #passengers in 1st 5.5 years is  181.95714285714286
Mean of #passengers in last 5.5 years is 373.3243243243243

This again shows us that the data is not stationary.
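
A sketch of this comparison; splitting the series exactly in half is only an approximation of the first/last 5.5-year split used for the numbers above:

import pandas as pd

passengers = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],
                         index_col='Month')['#Passengers']  # names assumed

# Compare the average level of the earlier and later parts of the series.
half = len(passengers) // 2
print('Mean of #passengers in the first half:', passengers.iloc[:half].mean())
print('Mean of #passengers in the last half: ', passengers.iloc[half:].mean())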

3. Augmented Dickey-Fuller test: the Augmented Dickey-Fuller (ADF) test is a statistical hypothesis test for stationarity in time series data. Here the null hypothesis (H0) is that the data is not stationary (it is time dependent). The alternate hypothesis is that the data is stationary (not time dependent).

If we reject the null hypothesis, the data is stationary. If we fail to reject the null hypothesis, the data is not stationary. We interpret this using the p-value from the test: if the p-value is below the 1% or 5% significance threshold, we reject the null hypothesis; if the p-value is greater than the threshold, we fail to reject it.

p-value ≤ 0.05: we reject the null hypothesis and the data is stationary

p-value > 0.05: we fail to reject the null hypothesis and the data is non-stationary
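
A minimal sketch of running the test with statsmodels, assuming the series is loaded as before; output of the form shown below is what this kind of call prints:

import pandas as pd
from statsmodels.tsa.stattools import adfuller

passengers = pd.read_csv('AirPassengers.csv', parse_dates=['Month'],
                         index_col='Month')['#Passengers']  # names assumed

result = adfuller(passengers)          # test statistic, p-value, lags used, ...
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():   # critical values at the 1%, 5% and 10% levels
    print('%s: %.3f' % (key, value))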

ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
1%: -3.482
5%: -2.884
10%: -2.579

Here the p-value > 0.05, so we fail to reject the null hypothesis, and the data is non-stationary.

Since the data is not stationary, we need to make it stationary. We will look at ways to do this in my next blog.

Thanks for reading through. If you liked my blog, click the clap icon! Leave your comments below or contact me via LinkedIn.
