Statistical Arbitrage Trading Pairs in Python: Using Correlation, Cointegration, and the Engle-Granger Approach

This is the first iteration of my exploration into pairs trading. Pairs trading is a type of statistical arbitrage that attempts to take advantage of mispriced assets in the marketplace.


Arbitrage is a ‘risk-free’ trading strategy that attempts to exploit inefficiencies in a market environment. One classic example of technology-driven arbitrage is ETF arbitrage. ETFs (Exchange Traded Funds) bundle a basket of securities into a single fund, which lets traders invest in one instrument while remaining well diversified across an entire sector. ETFs can represent sectors (XLK: Technology), bonds (HYG: High Yield Bonds), and even the major indexes (SPY: S&P 500).

If a trader holds the underlying stocks in the correct proportions, he can go to the ETF manager and exchange them for shares of the ETF. Likewise, a trader who owns the ETF can go to the fund manager and redeem it for the underlying stocks. So if an ETF was made up of 1 GOOG, 2 AAPL, and 5 IBM, a trader could either deliver all the parts and receive an ETF, or redeem an ETF for the equities. In theory the price of the ETF and the combined price of the underlying should always be equal, so an arbitrage opportunity occurs whenever the two drift apart. In practice the discrepancies are small and short-lived, and it is market makers who profit from them. Market making firms like Jane Street Capital dedicate significant resources to developing the most sophisticated hardware to exploit the tiny arbitrage opportunities that exist in this space.

The table below shows how an ETF of Google, Apple, and IBM might be mispriced against what the individual stocks are trading at. You can see that the ETF is worth less than the individual stocks combined. This means you could have a risk-free trade if you buy the ETF and sell the individual stocks at the exact same time.

Table: ETF price versus the prices of its underlying stocks (GOOG, AAPL, IBM)
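To make the arithmetic concrete, here is a minimal sketch of the comparison for the 1 GOOG, 2 AAPL, 5 IBM basket described earlier. The prices are made up for illustration (they are not the figures from the table), and the sketch ignores transaction costs and the time it takes to execute both legs of the trade.

```python
# Hypothetical composition and prices, for illustration only.
basket = {"GOOG": 1, "AAPL": 2, "IBM": 5}          # shares per ETF unit
stock_prices = {"GOOG": 790.0, "AAPL": 116.0, "IBM": 167.0}
etf_price = 1855.0

# Value of the underlying basket: 1*790 + 2*116 + 5*167
basket_value = sum(shares * stock_prices[ticker] for ticker, shares in basket.items())

# Positive spread means the ETF is cheap relative to its parts.
spread = basket_value - etf_price

if spread > 0:
    action = "buy the ETF, sell the stocks"
elif spread < 0:
    action = "sell the ETF, buy the stocks"
else:
    action = "no trade"

print(f"basket value: {basket_value:.2f}, ETF price: {etf_price:.2f}")
print(f"spread: {spread:.2f} -> {action}")
```

With these made-up numbers the ETF trades 2.00 below the value of its parts, which is the situation described above: buy the ETF and sell the individual stocks.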

Opportunities like this can only be taken advantage of by professional market makers with advanced formulae and a strong technology infrastructure. These sorts of trades happen in milliseconds and don’t provide a viable trading opportunity for retail traders. Because of this, we will focus on Statistical Arbitrage, which plays out over weeks or months.

Statistical Arbitrage

Statistical Arbitrage is not so ‘risk-free.’ Instead of exploiting market inefficiencies, you make certain assumptions about how prices should move relative to each other. For example, assume two companies, Ford (F) and General Motors (GM), have had similar price moves over the last few years.

Financial Data from

You can make an assumption that if these two stocks diverge, they should eventually re-converge. Statistical Arbitrage focuses on this idea. It could be described as a Mean Reversion strategy, which assumes stock prices tend to revert back to the mean. Pairs Trading is the idea that there exists some subset of pairs of stocks that tend to converge and meet their theoretical equilibrium over time. The goal of Pairs Trading is to monitor stocks that trend together, and identify when they begin to diverge. By buying the undervalued equity and selling the overvalued equity, you hope to capture the convergence back to equilibrium. I will describe the methods I used to uncover correlated stocks in the market, as well as examine some methods for trading on the pairs.

Identifying Correlated Stocks

Pearson’s Coefficient

Before you can begin to use Statistical Arbitrage to conduct Pairs Trading, you must identify a set of stocks that move together. There are several methods for searching for correlated stocks. In this section I will look at one method of identifying correlation in stock price moves. The method uses historical data to compute a Pearson Coefficient that represents how correlated two stocks have been in the past. The general approach is to measure how consistently the two price series sit above and below their own averages at the same time, and to condense that into a single score between -1.0 and 1.0. The Pearson Coefficient is calculated as shown below.

Pearson’s Coefficient Formula

In this formula, X and Y are the price series of two different stocks, and r is Pearson’s Coefficient. The closer r is to 1.0, the more correlated the two stocks are. Conversely, the closer r is to -1.0, the more inversely correlated they are (X goes up, Y goes down).
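Written out in plain text, the standard definition is r = sum((X_i - mean(X)) * (Y_i - mean(Y))) / sqrt(sum((X_i - mean(X))^2) * sum((Y_i - mean(Y))^2)). A minimal sketch of that calculation in plain Python follows; the function name pearson and the toy series are assumptions made for illustration, not real price data.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences, computed from the definition."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: how the two series deviate from their means together.
    co_deviation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator: the size of each series' own deviations.
    dev_x = sum((a - mean_x) ** 2 for a in x)
    dev_y = sum((b - mean_y) ** 2 for b in y)
    return co_deviation / math.sqrt(dev_x * dev_y)

# Toy series (hypothetical): one moves with x, one moves against it.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
moves_together = [2.0, 4.0, 6.0, 8.0, 10.0]
moves_opposite = [10.0, 8.0, 6.0, 4.0, 2.0]

r_together = pearson(x, moves_together)
r_opposite = pearson(x, moves_opposite)
print(r_together, r_opposite)
```

The first pair is a perfect straight-line relationship, so r comes out at 1.0; the second is the same relationship flipped, so r comes out at -1.0.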

Correlation Matrix

In order to identify correlated stocks, you have to compute the Pearson Coefficient for every combination of stock pairs in the market and compare the results. This is difficult, unless you are fluent in Python! (Don’t worry, all the source is below.) I began by producing a correlation matrix that would highlight stocks that were highly correlated. Below is an example of a visualization using Pearson’s Coefficient to compare correlations between stocks. These values were calculated using stock data from Jan 1, 2014 to Dec 1, 2016.

Correlation Matrix including various Consumer Discretionary and Technology stocks

If you take a closer look at the figure you will begin to notice some interesting results. First of all, GOOG and GOOGL have a very high correlation with each other. This makes intuitive sense because GOOG and GOOGL are equity in the same company. You can also see that FB, MSFT, and GOOG/GOOGL have very high correlation factors. It’s also interesting to see that AAPL doesn’t fit into the club of correlated stocks. Another interesting result is that IBM has a negative correlation with the rest of the stocks picked. I chose the next four images to compare the performance of GOOGL, MSFT, AAPL, and IBM versus GOOG. Try to compare the correlation value calculated in the correlation matrix with the stock patterns below.



This illustrates in more detail how GOOG/GOOGL are very correlated. We will later develop a strategy that watches these pairs and trades when they diverge. GOOG/AAPL and GOOG/IBM have lower correlation coefficients and it shows in the images above. Apple and Google converge occasionally, but the timing is so irregular that the correlation stays low. GOOG and IBM almost seem to move opposite to each other. As IBM moves up in the early stages, GOOG goes down. This also continues in the more recent months.

The correlation matrix above includes 14 securities. Obviously there are more than 14 equities on the exchange. This function can be run on my entire dataset (approximately 3000 equities), and it provides a 3000×3000 correlation matrix. It’s hard to display the results visually, but the source will be included in this post. It allows you to provide a list of tickers, and it will return all stock pairs that have a correlation rating above a provided threshold. This is a good method for quickly searching for all correlated stocks. For the purposes of this blog, I will focus on only a small subset of the potential trading pairs.
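The search described above can be sketched in a few lines with NumPy. This is an illustrative version rather than the source linked at the end of the post: the function name correlated_pairs, the ticker labels, and the synthetic prices are all assumptions made for the example.

```python
import numpy as np

def correlated_pairs(tickers, prices, threshold=0.9):
    """Return (ticker_a, ticker_b, r) for every pair with r >= threshold.

    prices is a 2-D array with one row per day and one column per ticker,
    in the same order as tickers.
    """
    corr = np.corrcoef(prices, rowvar=False)          # N x N correlation matrix
    rows, cols = np.triu_indices(len(tickers), k=1)   # each pair once, no diagonal
    pairs = [
        (tickers[i], tickers[j], float(corr[i, j]))
        for i, j in zip(rows, cols)
        if corr[i, j] >= threshold
    ]
    return sorted(pairs, key=lambda p: p[2], reverse=True)

# Synthetic prices (hypothetical tickers): AAA and BBB follow the same trend,
# CCC moves against it.
rng = np.random.default_rng(0)
trend = np.linspace(100.0, 120.0, 250)
prices = np.column_stack([
    trend + rng.normal(scale=0.5, size=250),
    2.0 * trend + rng.normal(scale=0.5, size=250),
    200.0 - trend + rng.normal(scale=0.5, size=250),
])
tickers = ["AAA", "BBB", "CCC"]

result = correlated_pairs(tickers, prices, threshold=0.9)
for a, b, r in result:
    print(f"{a}/{b}: r = {r:.3f}")
```

Only the AAA/BBB pair clears the threshold here, because CCC is strongly negatively correlated with the other two. The same function works unchanged on a few thousand tickers, since the matrix is computed in one call and each pair is visited once.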


Cointegration is an approach that attempts to model stationary processes. Stationarity describes processes that fluctuate around a constant level rather than trending up or down. Harris & Sollis state that a process y is stationary if and only if all of the following conditions are satisfied:

  1. E[y_t] is constant for all t
  2. Var[y_t] is constant for all t
  3. Covar[y_t, y_{t+n}] is constant for all t

If a process has these properties, we can model it as stationary. Obviously stock prices do not move in a stationary path. However, if a pair of stocks moves together, some linear combination of their prices (in the simplest case, the difference between them) should be stationary. This is how cointegration can be applied to pairs trading.
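One rough way to check whether a series looks stationary is a Dickey-Fuller style regression: regress the daily change in the series on its previous level and look at the t-statistic of the slope. A strongly negative value means the series keeps getting pulled back toward its mean. The sketch below is a simplified version with a constant and no lagged terms, run on simulated data; the function name and the data are assumptions for illustration, and a library routine such as statsmodels' adfuller would be the usual choice for real work.

```python
import numpy as np

def dickey_fuller_stat(y):
    """t-statistic of rho in: y[t] - y[t-1] = a + rho * y[t-1] + e[t]."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])   # constant and lagged level
    beta, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ beta
    sigma2 = resid @ resid / (len(dy) - X.shape[1])   # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)             # OLS coefficient covariance
    return float(beta[1] / np.sqrt(cov[1, 1]))

# Simulated data: a random walk (like a price) and a mean-reverting AR(1)
# series (like a well-behaved spread), built from the same shocks.
rng = np.random.default_rng(0)
shocks = rng.normal(size=1000)
random_walk = np.cumsum(shocks)
mean_reverting = np.zeros(1000)
for t in range(1, 1000):
    mean_reverting[t] = 0.5 * mean_reverting[t - 1] + shocks[t]

stat_walk = dickey_fuller_stat(random_walk)
stat_reverting = dickey_fuller_stat(mean_reverting)

# For this constant-only form the 5% critical value is roughly -2.86;
# statistics below it point toward stationarity.
print(f"random walk: {stat_walk:.2f}, mean-reverting: {stat_reverting:.2f}")
```

The mean-reverting series produces a statistic far below the critical value, while the random walk's statistic is much closer to zero, which matches the intuition that a single stock price is not stationary but a spread between two related stocks can be.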

The standard method of evaluating cointegration is to estimate the linear relationship between the two stock prices using a linear regression. If the pair is cointegrated, the relationship between the log prices should be linear with stationary errors, so given one price you should be able to estimate the other based on past performance. The regression model follows the form:

ln(P^A_t) = µ + γ * ln(P^B_t) + ε_t

Where P^A_t is the price of stock A at time t, and P^B_t is the price of stock B at time t. γ is called the cointegration coefficient. It represents the slope of the regression, or the percentage stock A moves per one percent move in stock B. ε_t is the residual error at time t. If the relationship held perfectly, ε_t would be zero for all t. Whenever ε_t moves away from zero, it is an indication that the stock pair is diverging. We can rewrite this equation to isolate this value and come up with an indicator for divergence.

S_t = µ + γ * ln(P^B_t) - ln(P^A_t)

Where S_t is a zero-mean random variable measuring how far the pair has moved from its equilibrium relationship. Note that S_t = -ε_t, so it rises when stock B becomes expensive relative to stock A. Below is a plot of S_t computed from the formula above. Notice how even though it moves up and down, the process is fairly mean reverting, or at least much more mean reverting than a stock price in isolation. I also include the total mean over the last two years. Compare the residual graph with the normalized price graph to see how the two relate to each other.
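The regression and the divergence indicator can be sketched with NumPy as follows. The function name spread_from_regression and the simulated prices are assumptions for illustration (the true coefficients are set to µ = 1.0 and γ = 0.8 so the estimate can be checked); this is not the code behind the plots in this post.

```python
import numpy as np

def spread_from_regression(price_a, price_b):
    """Fit ln(A) = mu + gamma * ln(B) + eps by least squares.

    Returns (mu, gamma, spread) where spread = mu + gamma * ln(B) - ln(A).
    """
    log_a = np.log(np.asarray(price_a, dtype=float))
    log_b = np.log(np.asarray(price_b, dtype=float))
    gamma, mu = np.polyfit(log_b, log_a, 1)   # degree 1: slope first, then intercept
    spread = mu + gamma * log_b - log_a
    return float(mu), float(gamma), spread

# Simulated pair: stock B follows a random walk in log price, and stock A
# tracks it with small stationary noise.
rng = np.random.default_rng(1)
log_b = np.log(60.0) + np.cumsum(rng.normal(scale=0.02, size=750))
log_a = 1.0 + 0.8 * log_b + rng.normal(scale=0.01, size=750)
price_a, price_b = np.exp(log_a), np.exp(log_b)

mu, gamma, spread = spread_from_regression(price_a, price_b)
print(f"mu = {mu:.3f}, gamma = {gamma:.3f}, mean of spread = {spread.mean():.2e}")
```

Because the regression includes an intercept, the fitted spread averages out to zero over the sample, which is why it can be read directly as distance from equilibrium.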



I have indicated the two peaks on the Residuals graph. As the price of stock B increases relative to stock A, S_t increases, and vice versa. Stock B in this scenario is MSFT, so when S_t rises far above the mean, you can conclude that Microsoft may be overvalued compared to Google. This would provide you a trading opportunity: you can short MSFT and long GOOGL. You can see that these two stocks converge after the peaks arise, which suggests there might be a viable strategy in pairs trading after all.

As of writing this, 12/22/2016, Microsoft is trading at $63.78, and Google (GOOGL) is trading at $809.62. It appears these two stocks have begun to diverge, and a convergence should occur soon. Timing is the hardest part about the stock market, but this strategy suggests that MSFT doesn’t offer the same upside potential as GOOGL. A trader using the pairs trading strategy would want to long GOOGL stock today, and either short or be flat on MSFT, depending on the investor’s risk preferences.

Entry and Exit Signals

So far we have identified correlated stock pairs, and built a model to show us overvalued and undervalued stocks relative to each other. Now, the most important part: how can we trade on this? This strategy can be used on any correlated pairs, but some pairs are more correlated than others. If you look at GOOG and GOOGL, these are highly correlated, but very rarely do they ever diverge. However, if you do spot a divergence you should definitely enter the trade.

The recommended way to trade this is to define a threshold and enter when your residual plot breaks out of it. Remember, the larger the residual, the more overvalued stock B is relative to stock A. So if your residuals chart moves above your upper threshold, you would want to short stock B and long stock A. Likewise, if the residuals move below your lower threshold, stock A is overvalued relative to stock B. In this case, you would want to short A and long B to capture the re-convergence.

Coming up with the threshold is difficult and may vary based on the types of stock you are trading. Your goal is to capture the most profitable time to enter a trade. Below is an example of one way to identify potential breakout points. I have placed grid lines for standard deviation markers. This is a way to quantify how much of an outlier a specific move is. The further away from the mean (the more standard deviations), the more likely we are experiencing a divergence.


This gives us one possible approach to finding buy and sell points. Look for residual charts that are more than 1.5 standard deviations away from the mean. Because we are assuming a mean reverting process, the further away the residuals get, the more likely they are to move back toward the mean. You can see that this might be a good opportunity to enter a position either shorting MSFT or buying GOOGL.
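The 1.5 standard deviation rule can be sketched as a small function that turns a residual series into entry signals. The function name, the signal encoding (+1 and -1), and the toy data are assumptions for illustration. Note that it uses the mean and standard deviation of the whole sample, as the charts here do, so it is a way to study past data rather than a live trading rule.

```python
import numpy as np

def entry_signals(spread, entry=1.5):
    """Flag days where the residual is more than `entry` standard deviations out.

    Returns (z, signals) where signals is:
      +1 -> residual high, B looks overvalued: long A, short B
      -1 -> residual low, A looks overvalued: short A, long B
       0 -> inside the band, no new position
    """
    spread = np.asarray(spread, dtype=float)
    z = (spread - spread.mean()) / spread.std()   # distance from mean in std devs
    signals = np.zeros(len(spread), dtype=int)
    signals[z > entry] = 1
    signals[z < -entry] = -1
    return z, signals

# Toy residual series (hypothetical): flat, then one spike up and one spike down.
toy_spread = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0, -5.0]
z, signals = entry_signals(toy_spread, entry=1.5)
print(signals.tolist())
```

In the toy series only the two spikes fall outside the band, so they are the only days flagged: the upward spike as a long A, short B entry, and the downward spike as the reverse.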

Further Examples

In order to viably automate this, I would want to compare a lot more parameters: things like market risk, P/E, dividends, earnings reports, etc. I haven’t implemented a backtesting model, but I will provide some examples of other stocks I have found to move with high correlation. Some correlations make sense, and others might just be coincidence, since searching thousands of pairs will always turn up a few that look correlated by chance. Some of the larger cap stocks simply move with the market, so you will get some correlation from the market itself, as big stocks move together when the market is trending.




MMM and CSCO are both used in a lot of major market tracking indexes. Although 3M and Cisco are in totally different industries, the charts suggest that they move with each other, most likely due to general market trends. The reason they correlate doesn’t matter as much, as long as they remain mean reverting. As you can see, when the residuals chart crosses above the 1.5 standard deviation line, the prices above snap back together right after.




Source Code

I think this blog post is too long to include a detailed step-by-step outline of the source code, so I will instead just link you to the GitHub. I may go over how to reproduce the results, but if you understand Python, you should be able to read it quite easily.

For anyone not capable of reproducing these results with Python you can email me with any questions about the process at

Concluding Remarks

This is an introduction into Pairs Trading that I put together while I was on break from school. I am interested in pursuing quantitative analysis on stock market trends professionally and hope to end up in New York or Chicago working on algorithmic models. I have some plans to go deeper into other Pairs Trading algorithms in the future. If you made it this far, thank you for reading, and if you have any questions you can comment below or send me an email at


Kyle Franz

