Z-Score Pairs Trading Fails on Every Exchange Tested: 7-Exchange Backt

Figure: CAGR comparison bar chart. Z-score pairs trading returned negative CAGR on all 7 exchanges tested, while the S&P 500 returned 9.81%.

We tested active z-score pairs trading on 7 major exchanges from 2005 to 2024. Convergence rates ranged from 77% to 87%. Every exchange lost money. Canada lost the most (-2.84% CAGR). Taiwan came closest to flat (-0.09% CAGR). The underlying cause is the same everywhere: transaction costs absorb the mean-reversion gain.

Contents

  1. Method
  2. What We Found
  3. Universal failure
  4. The convergence rate doesn't predict performance
  5. Germany: 40% cash years
  6. Canada: worst avg trade return
  7. Taiwan: the near-zero outlier
  8. Why It Fails Everywhere
  9. What Would Fix It
  10. Limitations
  11. Run It Yourself
  12. Takeaway
  13. References

Method

Same strategy across all exchanges. Annual pair formation (same sector, correlation > 0.70, AR(1) half-life 5-60 days, top 20 pairs). Daily z-score monitoring with 40-day rolling window. Entry at |z| > 2.0, exit at |z| < 0.5 (convergence), 60-day time stop, or -5% loss stop. Transaction costs at ~0.1% per leg (4 legs per trade).
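The entry and exit rules above can be sketched as a small state machine. This is a minimal illustration, not the repository's implementation: the function names `rolling_zscore` and `simulate`, the use of a sample standard deviation, the inclusion of the current day in the rolling window, and measuring the loss stop on the spread in log units are all assumptions. P&L here is gross of transaction costs.

```python
from statistics import mean, stdev

WINDOW = 40         # rolling z-score window in days
ENTRY_Z = 2.0       # enter when |z| > 2.0
EXIT_Z = 0.5        # convergence exit when |z| < 0.5
TIME_STOP = 60      # maximum holding period in days
LOSS_STOP = -0.05   # exit at a 5% loss (assumed to be measured on the spread)


def rolling_zscore(spread, t, window=WINDOW):
    """z-score of spread[t] against the trailing window ending at day t."""
    if t + 1 < window:
        return None                      # not enough history yet
    hist = spread[t + 1 - window:t + 1]
    sd = stdev(hist)
    return None if sd == 0 else (spread[t] - mean(hist)) / sd


def simulate(spread):
    """Return (entry_day, exit_day, reason, gross_pnl) for each closed trade."""
    trades, pos = [], None
    for t in range(len(spread)):
        z = rolling_zscore(spread, t)
        if z is None:
            continue
        if pos is None:
            if abs(z) > ENTRY_Z:
                side = -1 if z > 0 else 1   # short a rich spread, long a cheap one
                pos = (t, side, spread[t])
            continue
        entry_day, side, entry_level = pos
        pnl = side * (spread[t] - entry_level)   # gross, in log-spread units
        if pnl <= LOSS_STOP:
            reason = "loss_stop"
        elif abs(z) < EXIT_Z:
            reason = "converged"
        elif t - entry_day >= TIME_STOP:
            reason = "time_stop"
        else:
            continue
        trades.append((entry_day, t, reason, pnl))
        pos = None
    return trades


# Toy spread: small oscillation, one 2% dislocation on day 45, then back to normal.
noise = [0.001 * (-1) ** i for i in range(45)]
spread = noise + [0.02] + [0.001 * (-1) ** i for i in range(30)]
trades = simulate(spread)
print(trades)
```

On the toy series the dislocation triggers a single short-spread entry that exits by convergence the next day. Real spreads revert far more slowly, which is where the time stop and loss stop come into play.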

Exchanges excluded from this comparison due to data quality issues:

  • South Africa (JNB): 2006: +331%, 2011-2013: 100-300% annually. Implausible for a market-neutral strategy. Cause: thin universe (73 large-cap stocks) and apparent data gaps in 2004-2005 that corrupt beta estimates.
  • India (BSE+NSE): 2005: +50%, 2006: +111%. FMP data coverage for India is sparse in 2004. Formation-year beta estimates are unstable.
  • Korea (KSC): 2005: +66%, 2008: +56%. Same data warmup issue, plus the 2008 KRW currency crisis disrupts spread estimates.
  • Sweden (STO): 2005: +54%. Warmup artifact.
  • China (SHZ+SHH): 2005: +54%. Warmup artifact.

All five excluded exchanges show implausibly large returns in their early years (2005 for India, Korea, Sweden, and China; 2006 for South Africa), and most settle into normal-range returns after that. The common cause is FMP's sparse 2004 data coverage for these markets, which corrupts the formation-year OLS beta. These results aren't representative, so they are left out of the comparison.

The seven clean exchanges:

Exchange            Universe
NYSE+NASDAQ+AMEX    US large caps
JPX                 Japan large caps
LSE                 UK large caps
HKSE                Hong Kong large caps
TAI+TWO             Taiwan large caps
XETRA               Germany large caps
TSX                 Canada large caps

What We Found

Universal failure

Every exchange posted negative CAGR over 20 years. The range was -0.09% (Taiwan) to -2.84% (Canada). Convergence rates ranged from 77.7% (Canada) to 86.6% (US).

Exchange                CAGR     vs SPY    Sharpe   Max DD    Cash%   Conv%   Avg Trade
TAI+TWO (Taiwan)        -0.09%   -9.91%    -0.173   -23.52%   20%     80.4%   +0.107%
HKSE (Hong Kong)        -0.88%   -10.69%   -1.913   -17.31%   5%      82.6%   -0.143%
JPX (Japan)             -0.92%   -10.73%   -0.623   -17.43%   0%      85.4%   -0.139%
LSE (UK)                -0.92%   -10.74%   -1.224   -24.49%   5%      81.4%   -0.053%
NYSE+NASDAQ+AMEX (US)   -1.22%   -11.03%   -2.750   -21.81%   0%      86.6%   -0.181%
XETRA (Germany)         -1.38%   -11.19%   -0.951   -24.31%   40%     80.9%   -0.261%
TSX (Canada)            -2.84%   -12.65%   -1.714   -44.26%   5%      77.7%   -0.454%

SPY benchmark: 9.81% CAGR (2005-2024).
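To put the CAGR figures in terms of ending capital, the sketch below compounds a few of them over the 20-year window. The helper name `terminal_value` is ours, and the calculation assumes a constant annual rate, which is what CAGR means by definition but not how the yearly returns actually arrived.

```python
def terminal_value(cagr_pct, years=20, start=1.0):
    """Ending capital per unit invested, compounding a constant annual rate."""
    return start * (1 + cagr_pct / 100) ** years


# CAGR figures taken from the results table and the SPY benchmark line.
for name, cagr in [("TAI+TWO", -0.09), ("TSX", -2.84), ("SPY", 9.81)]:
    print(f"{name}: {terminal_value(cagr):.2f}x starting capital")
```

Under these inputs, the worst exchange (TSX) ends with roughly 0.56x of starting capital, while SPY ends near 6.5x.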

The convergence rate doesn't predict performance

Taiwan has an 80.4% convergence rate and a -0.09% CAGR. The US has an 86.6% convergence rate and a -1.22% CAGR. Higher convergence rate doesn't produce better returns.

The chart of convergence rate vs average trade return across all seven exchanges illustrates this directly. Taiwan and the UK have the avg trade returns closest to zero (+0.107% and -0.053% respectively) on mid-range convergence rates (80.4% and 81.4%). The US has the highest convergence rate (86.6%) yet ranks fifth of seven on avg trade return (-0.181%). There is no reliable positive relationship between convergence rate and profitability across these exchanges.
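As a rough sanity check on that claim, the figures in the results table can be run through a plain Pearson correlation. This is only a sketch: seven points are far too few for inference, and the `pearson` helper and the leave-one-out comparison are our additions, not part of the backtest.

```python
from math import sqrt

# (convergence rate %, avg trade return %) copied from the results table
table = {
    "TAI+TWO": (80.4, 0.107),
    "HKSE": (82.6, -0.143),
    "JPX": (85.4, -0.139),
    "LSE": (81.4, -0.053),
    "US": (86.6, -0.181),
    "XETRA": (80.9, -0.261),
    "TSX": (77.7, -0.454),
}


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) tuples."""
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


r_all = pearson(list(table.values()))
r_ex_tsx = pearson([v for k, v in table.items() if k != "TSX"])
print(f"all seven: r = {r_all:.2f}; excluding TSX: r = {r_ex_tsx:.2f}")
```

The full-sample coefficient comes out weakly positive (roughly 0.27), but dropping TSX flips the sign (roughly -0.40). The direction of the relationship hinges on a single exchange, which is consistent with treating convergence rate as uninformative about profitability here.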

The explanation is mechanical. Convergence rate measures how often the z-score returns to below 0.5 standard deviations from the mean. That's a mild condition. The spread only needs to travel from |z|=2.0 to |z|<0.5, which corresponds to a small move in log-price space. The gain from that travel, after four transaction legs, often doesn't survive.
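A back-of-envelope version of that argument: travelling from |z|=2.0 to |z|=0.5 is 1.5 rolling standard deviations of the spread. The sketch below assumes the rolling mean and standard deviation stay fixed during the trade, that the exit happens exactly at |z|=0.5, and that the spread move maps one-to-one onto percentage return; none of these hold exactly in the backtest. `sigma_pct` (the spread's rolling standard deviation, in percent) is a hypothetical input.

```python
ENTRY_Z, EXIT_Z = 2.0, 0.5
COST_PER_LEG_PCT, LEGS = 0.1, 4          # ~0.1% per leg, four legs per trade


def net_gain_pct(sigma_pct):
    """Net gain on a converging trade for a given spread standard deviation."""
    gross = (ENTRY_Z - EXIT_Z) * sigma_pct   # 1.5 standard deviations captured
    return gross - COST_PER_LEG_PCT * LEGS   # minus ~0.4% round-trip cost


# Spread volatility at which a converging trade only breaks even.
breakeven_sigma = COST_PER_LEG_PCT * LEGS / (ENTRY_Z - EXIT_Z)
print(f"break-even spread sigma: {breakeven_sigma:.3f}%")
```

Under these assumptions, a pair whose 40-day spread standard deviation is below about 0.27% cannot cover its costs even when it converges, and that is before any losing trades are counted.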

Germany: 40% cash years

Germany (XETRA) has 40% cash years, meaning 8 out of 20 years the strategy found fewer than 3 active pairs and held cash instead. XETRA's same-sector large-cap universe is smaller than the US or Japan. Fewer qualifying pairs means fewer z-score crossings means fewer trades. The strategy under-earns its cost of capital even when it's invested, and then holds cash for much of the time.

Canada: worst avg trade return

Canada (TSX) has the worst per-trade performance at -0.454% avg trade return and the lowest convergence rate at 77.7%. TSX large caps tend toward resource and financial stocks, which are concentrated in a few sectors. Same-sector correlation is high but the pairs are highly correlated to commodity cycles. When oil prices or metals prices shift, many TSX pairs diverge simultaneously and stay diverged, causing loss stops across multiple pairs in the same year. The 2016 (-7.08%) and 2020 (-6.76%) years are examples of this pattern.

Taiwan: the near-zero outlier

Taiwan (TAI+TWO) comes closest to zero (-0.09% CAGR) with a positive avg trade return (+0.107%). The strategy was in cash 20% of the time (4 years). Taiwan's semiconductor-heavy large caps form tight within-sector pairs in tech and electronic components. When they do form valid pairs, the spreads are tight enough that the cost drag is small. But the universe is thin enough that 4 of the 20 years produced fewer than the minimum number of pairs required.

The net effect: Taiwan is the least-bad result, not a success. The positive avg trade return is offset by the cash years, and the 20-year result is still a small loss.


Why It Fails Everywhere

The root cause is the same on every exchange. The mean-reversion signal works mechanically: spreads that diverge do tend to revert. But:

  1. The gain per converging trade is small. Entering at |z|=2.0 and exiting at |z|=0.5 captures only a fraction of the spread's volatility. Typical gross gain per converging trade: 0.5-1.5% on the spread.
  2. Transaction costs are fixed. Four legs × ~0.1% = ~0.4% per trade, regardless of whether the spread produces 0.3% or 3.0% gross.
  3. Loss stops are real losses. 8-16% of trades hit the loss stop. These typically lose 3-5% before the stop triggers. The converging trades need to make enough profit to cover these.

The blended result: converging trades barely cover their cost, loss stops create real losses, time stops close unprofitable positions at zero, and the net average is negative. This mechanism operates identically across developed markets.
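The blend can be written as a simple expected value per trade. The inputs below are midpoints of the ranges quoted above (an 80% convergence rate, 12% loss stops, 1.0% gross gain, 4% loss), chosen for illustration rather than taken from any one exchange. Treating the stop loss as gross of costs and the time-stop exit as exactly zero gross are assumptions.

```python
def expected_net_pct(p_conv, p_loss, gain_conv_pct, loss_stop_pct,
                     time_stop_pct=0.0, cost_pct=0.4):
    """Expected net return per trade (%), given outcome shares and gross P&L."""
    p_time = 1.0 - p_conv - p_loss           # remainder exits on the time stop
    if p_time < 0:
        raise ValueError("outcome shares exceed 100%")
    gross = (p_conv * gain_conv_pct
             + p_loss * loss_stop_pct
             + p_time * time_stop_pct)
    return gross - cost_pct                   # every trade pays four legs


ev = expected_net_pct(p_conv=0.80, p_loss=0.12,
                      gain_conv_pct=1.0, loss_stop_pct=-4.0)
print(f"expected net per trade: {ev:+.2f}%")
```

With these illustrative inputs the expected net return is about -0.08% per trade: 0.80% from converging trades, minus 0.48% from loss stops, minus 0.40% in costs. That is the same order of magnitude as the avg trade returns in the results table.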

The 2008 and 2022 years show the strategy's one genuine property: market neutrality. In 2008, all seven exchanges outperformed SPY (which fell 36%). The portfolio has near-zero market beta by construction. But market neutrality with a negative carry isn't a useful investment property.


What Would Fix It

The math requires either lower transaction costs or larger spreads at entry.

Lower costs: feasible for prime brokerage clients with negotiated commissions, near-zero for market makers. Out of reach for most investors.

Wider entry threshold: raising the entry from |z|=2.0 to |z|=2.5 or |z|=3.0 would increase the gross gain per trade but reduce trade frequency. At very high thresholds, the strategy is essentially in cash most of the time.
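The trade-off can be made concrete under a deliberately crude assumption: that the z-score is standard normal. Real rolling z-scores are autocorrelated and fat-tailed, and time spent beyond a threshold is not the same as the number of trades, so the sketch below shows the direction of the effect rather than a forecast.

```python
from math import erf, sqrt


def tail_prob(k):
    """P(|Z| > k) for a standard normal Z."""
    return 1.0 - erf(k / sqrt(2.0))


EXIT_Z = 0.5
for entry_z in (2.0, 2.5, 3.0):
    captured = entry_z - EXIT_Z               # standard deviations travelled
    print(f"entry |z|={entry_z}: captures {captured:.1f} sd, "
          f"P(|z| beyond entry) = {tail_prob(entry_z):.4f}")
```

Moving the entry from 2.0 to 3.0 raises the captured distance from 1.5 to 2.5 standard deviations, but under the normal assumption the share of days beyond the threshold falls from about 4.6% to about 0.3%.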

Better pair selection: using stricter cointegration tests (ADF, KPSS) rather than the AR(1) half-life filter might select pairs where the spread volatility is higher and the per-trade gain is larger. This is worth testing.
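For reference, the AR(1) half-life filter mentioned here is usually defined as follows: fit spread[t] = c + phi * spread[t-1] by OLS, then compute half-life = -ln(2) / ln(phi). The sketch below uses that textbook definition. Whether the repository estimates phi in exactly this way is an assumption, and `ar1_phi` and `half_life` are our names.

```python
from math import log


def ar1_phi(spread):
    """OLS slope of spread[t] on spread[t-1], with an intercept."""
    x, y = spread[:-1], spread[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def half_life(phi):
    """Days for a deviation to decay by half; None if not mean-reverting."""
    if not 0.0 < phi < 1.0:
        return None
    return -log(2.0) / log(phi)


# Noise-free example: a deviation that decays by 10% per day.
spread = [0.9 ** t for t in range(60)]
phi = ar1_phi(spread)
hl = half_life(phi)
passes_filter = hl is not None and 5 <= hl <= 60   # the Method's 5-60 day band
print(f"phi = {phi:.3f}, half-life = {hl:.1f} days, passes = {passes_filter}")
```

A half-life filter says how fast a spread reverts but nothing about how far it moves, which is why a test that also screens on spread volatility might select more profitable pairs.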

Alternative timing: the academic literature (Gatev et al., 2006; Do and Faff, 2010) documents that pairs trading was profitable in the 1962-2002 period. The mechanism worked before widespread algorithmic execution compressed the edge. In the 2005-2024 window used here, the edge is gone or too small to capture after costs.


Limitations

Short-selling assumed everywhere. Short-selling restrictions in practice vary by market and change over time. In Taiwan and Hong Kong, short-selling of specific stocks can be restricted or expensive. The backtest doesn't model borrow costs.
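A rough way to see how much this omission matters is to add a borrow fee on the short leg to the per-trade cost. The annual borrow rates below are hypothetical placeholders, not market data, and the simple day-count accrual is an assumption.

```python
def borrow_cost_pct(annual_borrow_rate_pct, holding_days, day_count=365):
    """Borrow fee on the short leg as a % of that leg's notional."""
    return annual_borrow_rate_pct * holding_days / day_count


TRADING_COST_PCT = 0.4                      # four legs at ~0.1% each
for rate in (0.5, 2.0, 8.0):                # hypothetical annual borrow rates, %
    for days in (10, 60):                   # short hold vs the 60-day time stop
        total = TRADING_COST_PCT + borrow_cost_pct(rate, days)
        print(f"borrow {rate}%/yr, {days} days: total cost {total:.3f}%")
```

Even a modest borrow fee adds to a cost base that the average trade already fails to cover, so including it could only make the reported results worse.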

Fixed formation calendar. Pairs form once per year. If a pair's cointegrating relationship breaks mid-year (merger announcement, spinoff, earnings shock), the strategy continues trading it until year-end or a stop is hit. Rolling formation would catch this faster.

FMP data quality. Five exchanges were excluded due to 2004 data sparsity corrupting beta estimates. The seven clean exchanges in this comparison have full coverage from 2004 onward.


Run It Yourself

git clone https://github.com/ceta-research/backtests.git
cd backtests

export CR_API_KEY="your_key_here"

# Run all exchanges
python3 pairs-zscore/backtest.py --global --output results/exchange_comparison.json

# Run a specific exchange
python3 pairs-zscore/backtest.py --preset japan

# Current live signals
python3 pairs-zscore/screen.py --preset japan

Takeaway

Active z-score pairs trading, with realistic parameters and transaction costs, produced negative returns on every exchange tested over 20 years. Convergence rates of 77-87% look like a working signal, and they are one. The problem isn't the signal. It is the economics: mean-reversion gains are too small to cover four-leg transaction costs.

The strategy has real properties worth keeping. Market neutrality. Very low volatility. No correlation to SPY. These are valuable in a portfolio context if the carry is positive. Here it isn't.

The pairs trading methodology works. This particular implementation doesn't survive typical trading costs. The 2006-vintage literature found that pairs trading did; the 2005-2024 data says it doesn't.


References

  • Gatev, E., Goetzmann, W. & Rouwenhorst, K. (2006). "Pairs Trading: Performance of a Relative-Value Arbitrage Rule." Review of Financial Studies, 19(3), 797-827.
  • Do, B. & Faff, R. (2010). "Does Simple Pairs Trading Still Work?" Financial Analysts Journal, 66(4), 83-95.
  • Do, B. & Faff, R. (2012). "Are Pairs Trading Profits Robust to Trading Costs?" Journal of Financial Research, 35(2), 261-287.
  • Krauss, C. (2017). "Statistical Arbitrage Pairs Trading Strategies: Review and Outlook." Journal of Economic Surveys, 31(2), 513-545.

Data: Ceta Research (FMP data warehouse). 7 exchanges, 2005-2024. All exchanges: large caps only, same formation and trading parameters. Note: Past performance doesn't guarantee future results. This is educational content, not investment advice.
