Notebook: Financial Analytics

PADS - Programa Avançado em Data Science

Insper

Paloma Vaissman Uribe

1. Exploring VAR - Vector Autoregression Models

Based on https://www.statsmodels.org/dev/vector_ar.html

Code

import numpy as np
import pandas
import statsmodels.api as sm
from statsmodels.tsa.api import VAR
from statsmodels.tsa.base.datetools import dates_from_str

mdata = sm.datasets.macrodata.load_pandas().data

# prepare the dates index
dates = mdata[['year', 'quarter']].astype(int).astype(str)
quarterly = dates["year"] + "Q" + dates["quarter"]
quarterly = dates_from_str(quarterly)
mdata = mdata[['realgdp','realcons','realinv']]
mdata.index = pandas.DatetimeIndex(quarterly)
data = np.log(mdata).diff().dropna()

# make a VAR model
model = VAR(data)
Code
data
realgdp realcons realinv
1959-06-30 0.024942 0.015286 0.080213
1959-09-30 -0.001193 0.010386 -0.072131
1959-12-31 0.003495 0.001084 0.034425
1960-03-31 0.022190 0.009534 0.102664
1960-06-30 -0.004685 0.012572 -0.106694
... ... ... ...
2008-09-30 -0.006781 -0.008948 -0.017836
2008-12-31 -0.013805 -0.007843 -0.069165
2009-03-31 -0.016612 0.001511 -0.175598
2009-06-30 -0.001851 -0.002196 -0.067561
2009-09-30 0.006862 0.007265 0.020197

202 rows × 3 columns

Code
results = model.fit(1)
results.summary()
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Fri, 28, Nov, 2025
Time:                     22:40:39
--------------------------------------------------------------------
No. of Equations:         3.00000    BIC:                   -27.7388
Nobs:                     201.000    HQIC:                  -27.8562
Log likelihood:           1963.94    FPE:                7.37174e-13
AIC:                     -27.9360    Det(Omega_mle):     6.94859e-13
--------------------------------------------------------------------
Results for equation realgdp
==============================================================================
                 coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------
const               0.003580         0.000911            3.928           0.000
L1.realgdp         -0.338056         0.172084           -1.964           0.049
L1.realcons         0.746283         0.130411            5.723           0.000
L1.realinv          0.057939         0.025349            2.286           0.022
==============================================================================

Results for equation realcons
==============================================================================
                 coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------
const               0.006286         0.000775            8.114           0.000
L1.realgdp         -0.134053         0.146285           -0.916           0.359
L1.realcons         0.327751         0.110859            2.956           0.003
L1.realinv          0.042521         0.021549            1.973           0.048
==============================================================================

Results for equation realinv
==============================================================================
                 coefficient       std. error           t-stat            prob
------------------------------------------------------------------------------
const              -0.015808         0.004756           -3.324           0.001
L1.realgdp         -2.220857         0.898064           -2.473           0.013
L1.realcons         4.585966         0.680582            6.738           0.000
L1.realinv          0.300989         0.132290            2.275           0.023
==============================================================================

Correlation matrix of residuals
             realgdp  realcons   realinv
realgdp     1.000000  0.607456  0.759286
realcons    0.607456  1.000000  0.142791
realinv     0.759286  0.142791  1.000000
Code
# selection of orders of VAR: optimizing using information criteria
model.select_order(15)
results = model.fit(maxlags=15, ic='aic')
lag_order = results.k_ar
print("chosen VAR order is = " + str(lag_order))
chosen VAR order is = 3
Code
# forecasting: must pass the last lag_order observations as initial values
results.forecast(data.values[-lag_order:], 10)
array([[ 0.00616044,  0.00500006,  0.00916198],
       [ 0.00427559,  0.00344836, -0.00238478],
       [ 0.00416634,  0.0070728 , -0.01193629],
       [ 0.00557873,  0.00642784,  0.00147152],
       [ 0.00626431,  0.00666715,  0.00379567],
       [ 0.00651738,  0.00764936,  0.00198474],
       [ 0.00690284,  0.00760909,  0.00548495],
       [ 0.00714517,  0.00775843,  0.00620737],
       [ 0.007258  ,  0.0079993 ,  0.00602391],
       [ 0.00738975,  0.00804623,  0.00690895]])
Code
results.plot_forecast(10)

Impulse response function: describes how a system responds over time to a sudden change, or impulse, in one of its inputs. In economics, and especially in contemporary macroeconomic modeling, impulse response functions describe how the economy reacts over time to exogenous impulses, which economists usually call shocks; they are often estimated in the context of a vector autoregression.

Code
# impulse response function
irf = results.irf(10)
irf.plot(orth=False)

2. Exploring cointegration

Based on:

  • https://github.com/KidQuant/Pairs-Trading-With-Python/blob/master/PairsTrading.ipynb
  • https://www.programcreek.com/python/example/122726/statsmodels.tsa.stattools.coint
  • https://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.coint.html#
  • https://quantopian-archive.netlify.app/notebooks/notebooks/quantopian_notebook_145.html

Code
import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import coint
from pandas_datareader import data as pdr
import datetime
import yfinance as yf
import matplotlib.pyplot as plt

A note on how to detect cointegration

Steps:

  1. Test for a unit root in each component series individually, using univariate unit root tests such as the ADF, KPSS, or PP tests.

  2. If we cannot reject the null hypothesis of a unit root for each series, apply the Engle-Granger test:

It works in two steps: first, a regression is estimated between the variables and the residuals are extracted. Second, a unit root test, such as the Augmented Dickey-Fuller (ADF) test, is applied to the residuals to check whether they are stationary. If the residuals are stationary, the original variables are considered cointegrated.

The coint function from statsmodels.tsa.stattools performs the Engle-Granger two-step cointegration test.

Simulating some example data

  • Simulating prices using cumulative returns;
  • Now we generate Y. Remember that Y is supposed to have a deep economic link to X, so the price of Y should vary pretty similarly. We model this by taking X, shifting it up and adding some random noise drawn from a normal distribution.
Code
X_returns = np.random.normal(0, 1, 100) # generating random normal var
X = pd.Series(np.cumsum(X_returns), name='X') + 50
X.plot()

Code
some_noise = np.random.normal(0, 1, 100)
Y = X + 5 + some_noise
Y.name = 'Y'
pd.concat([X, Y], axis=1).plot()

Testing for cointegration. But first, let us compare cointegration with correlation.

Code
X.corr(Y)
np.float64(0.9273678099454663)
Code
# compute the p-value of the cointegration test
# H0: no cointegration (the residuals of the cointegrating regression have a unit root)
# a low p-value tells us the spread between the two time series is stationary
# around its mean
score, pvalue, _ = coint(X,Y)
print(pvalue)
1.5239505499170159e-09

With such a low p-value, we reject the null hypothesis of no cointegration; that is, the residuals are I(0) (stationary) and the two series are cointegrated.

Code
plt.figure(figsize=(12,6))
(Y - X).plot() # Plot the spread
plt.axhline((Y - X).mean(), color='red', linestyle='--') # Add the mean
plt.xlabel('Time')
plt.xlim(0,99)
plt.legend(['Price Spread', 'Mean']);

Correlation is not cointegration:

Another simulated example is the following:

Code
X_returns = np.random.normal(1, 1, 100)
Y_returns = np.random.normal(2, 1, 100)

X_diverging = pd.Series(np.cumsum(X_returns), name='X')
Y_diverging = pd.Series(np.cumsum(Y_returns), name='Y')


pd.concat([X_diverging, Y_diverging], axis=1).plot(figsize=(12,6));
plt.xlim(0, 99)

Code
print('Correlation: ' + str(X_diverging.corr(Y_diverging)))
score, pvalue, _ = coint(X_diverging,Y_diverging)
print('Cointegration test p-value: ' + str(pvalue))
Correlation: 0.9936099205123949
Cointegration test p-value: 0.7523500720747303

Real example:

Now, a real example of detecting candidate pairs for a pairs-trading strategy.

Code
start = datetime.datetime(2013, 1, 1)
end = datetime.datetime(2018, 1, 1)

ticker_list = ['AAPL', 'ADBE', 'ORCL', 'EBAY', 'MSFT', 'QCOM', 'HPQ', 'AMD', 'IBM', 'SPY']

df = yf.download(ticker_list, start=start, end=end)['Close']
Code
df
Ticker AAPL ADBE AMD EBAY HPQ IBM MSFT ORCL QCOM SPY
Date
2013-01-02 16.612209 38.340000 2.53 20.143671 4.630522 113.857239 22.242884 28.839972 45.717426 116.953865
2013-01-03 16.402523 37.750000 2.49 19.715164 4.667517 113.231010 21.944916 28.524071 45.505615 116.689621
2013-01-04 15.945637 38.130001 2.59 19.839211 4.667517 112.488739 21.534203 28.773472 44.834850 117.202103
2013-01-07 15.851841 37.939999 2.67 20.113605 4.676767 111.995857 21.493935 28.623833 45.194942 116.881821
2013-01-08 15.894504 38.139999 2.67 19.801622 4.744590 111.839302 21.381193 28.632137 45.124340 116.545525
... ... ... ... ... ... ... ... ... ... ...
2017-12-22 40.985916 175.000000 10.54 33.723557 16.513344 102.822487 78.645721 42.007748 52.838741 236.793823
2017-12-26 39.946098 174.440002 10.46 33.884304 16.490046 103.045013 78.544533 42.069832 52.487728 236.510590
2017-12-27 39.953129 175.360001 10.53 33.589596 16.521111 103.247269 78.829651 42.025471 52.683643 236.625656
2017-12-28 40.065544 175.550003 10.55 33.866447 16.427904 103.860840 78.838844 42.149654 52.553032 237.112488
2017-12-29 39.632286 175.240005 10.28 33.705700 16.319162 103.442802 78.673302 41.936779 52.259182 236.218445

1259 rows × 10 columns

Code
# function to build log returns from a dataframe
def log_returns(df):
    """Compute log returns for every column of a price dataframe."""
    df_ = df.copy()
    for stock in df.columns:
        df_[stock] = np.log(df_[stock]).diff()
    return df_.dropna()
Code
# applying the function
daily_returns = log_returns(df)
daily_returns
Ticker AAPL ADBE AMD EBAY HPQ IBM MSFT ORCL QCOM SPY
Date
2013-01-03 -0.012703 -0.015508 -0.015937 -0.021502 0.007958 -0.005515 -0.013487 -0.011014 -0.004644 -0.002262
2013-01-04 -0.028250 0.010016 0.039375 0.006272 0.000000 -0.006577 -0.018893 0.008706 -0.014850 0.004382
2013-01-07 -0.005900 -0.004995 0.030421 0.013736 0.001980 -0.004391 -0.001872 -0.005214 0.007999 -0.002736
2013-01-08 0.002688 0.005258 0.000000 -0.015633 0.014398 -0.001399 -0.005259 0.000290 -0.001563 -0.002881
2013-01-09 -0.015752 0.013542 -0.015095 0.001517 0.029452 -0.002856 0.005634 0.000581 0.015063 0.002539
... ... ... ... ... ... ... ... ... ... ...
2017-12-22 0.000000 0.002517 -0.032667 -0.001323 0.000470 0.006579 0.000117 0.001691 0.005267 -0.000262
2017-12-26 -0.025697 -0.003205 -0.007619 0.004755 -0.001412 0.002162 -0.001287 0.001477 -0.006665 -0.001197
2017-12-27 0.000176 0.005260 0.006670 -0.008736 0.001882 0.001961 0.003623 -0.001055 0.003726 0.000486
2017-12-28 0.002810 0.001083 0.001898 0.008208 -0.005658 0.005925 0.000117 0.002951 -0.002482 0.002055
2017-12-29 -0.010873 -0.001767 -0.025926 -0.004758 -0.006641 -0.004033 -0.002102 -0.005063 -0.005607 -0.003778

1258 rows × 10 columns

Code
# matrix correlation of log-returns

import seaborn as sns
correlation_matrix = daily_returns.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Log-Returns')
plt.show()

Function to detect cointegrated pairs:

Code
def find_cointegrated_pairs(data):
    n = data.shape[1]
    score_matrix = np.zeros((n, n))
    pvalue_matrix = np.ones((n, n))
    keys = data.keys()
    pairs = []
    for i in range(n):
        for j in range(i+1, n):
            S1 = data[keys[i]]
            S2 = data[keys[j]]
            result = coint(S1, S2)
            score = result[0]
            pvalue = result[1]
            score_matrix[i, j] = score
            pvalue_matrix[i, j] = pvalue
            if pvalue < 0.05:
                pairs.append((keys[i], keys[j]))
    return score_matrix, pvalue_matrix, pairs
Code
scores, pvalues, pairs = find_cointegrated_pairs(df)
pairs
[('AAPL', 'ORCL'),
 ('AAPL', 'SPY'),
 ('ADBE', 'EBAY'),
 ('ADBE', 'MSFT'),
 ('EBAY', 'MSFT'),
 ('HPQ', 'ORCL')]
Code
# picking one example
S1 = df['ADBE']
S2 = df['MSFT']

score, pvalue, _ = coint(S1, S2)
pvalue
np.float64(0.003089062812445259)
Code
S1 = sm.add_constant(S1)
results = sm.OLS(S2, S1).fit()
S1 = S1['ADBE']
b = results.params['ADBE']

spread = S2 - b * S1
spread.plot(figsize=(12,6))
plt.axhline(spread.mean(), color='black')
plt.xlim('2013-01-01', '2018-01-01')
plt.legend(['Spread']);

Explanation of the meaning of “trading the spread”:

In a cointegrated pair trading strategy, ‘trading the spread’ refers to exploiting the long-term, statistically stable relationship between the prices of two cointegrated assets. Here’s a breakdown:

Cointegrated Pair: When two (or more) non-stationary time series are cointegrated, it means they move together in the long run, and a linear combination of them is stationary. This stationary linear combination is often referred to as the ‘spread’ or ‘error correction term’. Even though individual asset prices might wander, their relationship tends to revert to a mean.

The Spread: The ‘spread’ is typically defined as the difference between the price of one asset and a scaled price of the other asset (e.g., Spread = Price_Y - beta * Price_X, where beta is the hedge ratio obtained from a regression, as we saw with b for ADBE and MSFT). Because the assets are cointegrated, this spread is expected to be stationary, meaning it will fluctuate around a constant mean.

Trading the Spread Strategy (Mean Reversion):

Execution: When the spread deviates significantly above its historical mean (meaning one asset is relatively overvalued compared to the other), you sell the overvalued asset and buy the undervalued one. For example, if Price_Y - beta * Price_X is high, you would sell Y and buy X. Conversely, when the spread falls significantly below its mean, you would buy Y and sell X.

Closing Position: The goal is to profit when the spread reverts to its mean. Once the spread returns to its average, you would close both positions, realizing a profit.
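The execution rule above is often implemented with a z-score of the spread. A minimal sketch on a simulated mean-reverting spread; the entry and exit thresholds (2.0 and 0.5) are illustrative choices, not recommendations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# hypothetical stationary spread: an AR(1) process fluctuating around zero
eps = rng.normal(0, 1, 500)
spread = np.zeros(500)
for t in range(1, 500):
    spread[t] = 0.9 * spread[t - 1] + eps[t]
spread = pd.Series(spread)

# standardize the spread against its own history
z = (spread - spread.mean()) / spread.std()

entry, exit_ = 2.0, 0.5          # illustrative thresholds
short_spread = z > entry         # spread too high: sell Y, buy beta * X
long_spread = z < -entry         # spread too low: buy Y, sell beta * X
close = z.abs() < exit_          # spread back near its mean: close both legs
```

In practice the mean and standard deviation would be computed on a rolling window rather than the full sample, to avoid look-ahead bias.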
