Z-Score Pairs Trading Fails on Every Exchange Tested: 7-Exchange Backtest

We tested active z-score pairs trading on 7 major exchanges from 2005 to 2024. Convergence rates ranged from 80% to 89%. Every exchange returned negative CAGR. The same mechanism that fails in the US also fails in Japan, the UK, Germany, Hong Kong, Taiwan, and Canada.

Figure: CAGR comparison bar chart. Z-score pairs trading returned negative CAGR on all 7 exchanges tested, while the S&P 500 returned 9.81%.

CAGR ranged from -1.23% (US) to -2.66% (UK). The underlying cause is the same everywhere: transaction costs absorb the mean-reversion gain.

Contents

  1. Method
  2. What We Found
  3. Universal failure
  4. The convergence rate doesn't predict performance
  5. Germany: smaller universe, mid-tier losses
  6. Canada: worst avg trade return
  7. Taiwan: thin universe, mid-tier losses
  8. Why It Fails Everywhere
  9. What Would Fix It
  10. Limitations
  11. Run It Yourself
  12. Takeaway
  13. References

Method

Same strategy across all exchanges. Annual pair formation (same sector, correlation > 0.70, AR(1) half-life 5-60 days, top 20 pairs). Daily z-score monitoring with 40-day rolling window. Entry at |z| > 2.0, exit at |z| < 0.5 (convergence), 60-day time stop, or -5% loss stop. Transaction costs at ~0.1% per leg (4 legs per trade).
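The entry and exit rules above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's code: the function names, the choice to include the current day in the rolling window, and the order in which exit conditions are checked are all assumptions.

```python
from statistics import mean, stdev

def rolling_zscore(spread, window=40):
    """z-score of each value against its trailing window (current day included).

    Returns None where the window is incomplete or has zero variance.
    """
    out = [None] * len(spread)
    for i in range(window - 1, len(spread)):
        w = spread[i - window + 1 : i + 1]
        sd = stdev(w)
        out[i] = (spread[i] - mean(w)) / sd if sd > 0 else None
    return out

def entry_side(z, z_entry=2.0):
    """Open a position only when |z| exceeds the entry threshold."""
    if z > z_entry:
        return "short_spread"   # spread is rich: short leg A, long leg B
    if z < -z_entry:
        return "long_spread"    # spread is cheap: long leg A, short leg B
    return None

def exit_reason(z, pnl, days_held, z_exit=0.5, max_days=60, loss_stop=-0.05):
    """Return why an open trade should close, or None to keep holding.

    The check order (loss stop, then convergence, then time stop) is assumed.
    """
    if pnl <= loss_stop:
        return "loss_stop"
    if abs(z) < z_exit:
        return "converged"
    if days_held >= max_days:
        return "time_stop"
    return None
```

Each round trip touches four legs (two to open, two to close), which is where the ~0.4% cost per trade comes from.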

Exchanges excluded from this comparison due to data quality issues:

  • South Africa (JNB): 2006: +331%, 2011-2013: 100-300% annually. Implausible for a market-neutral strategy. Cause: thin universe (73 large-cap stocks) and apparent data gaps in 2004-2005 that corrupt beta estimates.
  • India (BSE+NSE): 2005: +50%, 2006: +111%. FMP data coverage for India is sparse in 2004. Formation-year beta estimates are unstable.
  • Korea (KSC): 2005: +66%, 2008: +56%. Same data warmup issue, plus the 2008 KRW currency crisis disrupts spread estimates.
  • Sweden (STO): 2005: +54%. Warmup artifact.
  • China (SHZ+SHH): 2005: +54%. Warmup artifact.

The excluded exchanges share a pattern: implausibly large returns, mostly concentrated in 2005-2006, followed by normal-range returns. The common cause is FMP's sparse 2004 data coverage for these markets, which corrupts the formation-year OLS beta. These results are not representative.

The seven clean exchanges:

Exchange          Universe
NYSE+NASDAQ+AMEX  US large caps
JPX               Japan large caps
LSE               UK large caps
HKSE              Hong Kong large caps
TAI+TWO           Taiwan large caps
XETRA             Germany large caps
TSX               Canada large caps

What We Found

Universal failure

Every exchange posted negative CAGR over 20 years. The range was -1.23% (US, best) to -2.66% (UK, worst). Convergence rates ranged from 79.7% (Canada) to 88.9% (US).

Exchange               CAGR    vs SPY   Sharpe  Max DD   Cash%  Conv%  Avg Trade
NYSE+NASDAQ+AMEX (US)  -1.23%  -11.04%  -1.658  -22.62%     0%  88.9%  -0.183%
JPX (Japan)            -1.43%  -11.24%  -0.872  -25.01%     0%  86.2%  -0.218%
HKSE (Hong Kong)       -1.47%  -11.28%  -1.937  -28.97%     5%  83.7%  -0.257%
TAI+TWO (Taiwan)       -2.36%  -12.17%  -0.731  -38.00%    15%  82.3%  -0.295%
TSX (Canada)           -2.52%  -12.33%  -1.425  -41.36%     5%  79.7%  -0.405%
XETRA (Germany)        -2.58%  -12.39%  -1.474  -40.66%    15%  83.2%  -0.394%
LSE (UK)               -2.66%  -12.47%  -2.022  -41.71%     5%  82.4%  -0.334%

SPY benchmark: 9.81% CAGR (2005-2024). Note: The "vs SPY" column uses the US market as a universal cross-exchange reference. Against local market benchmarks, the strategy also trails in every market, by 2.6 to 10.1 percentage points of CAGR: Japan -7.8% vs Nikkei (6.3% CAGR), Hong Kong -2.6% vs Hang Seng (1.1%), Germany -10.1% vs DAX (7.5%), Canada -7.0% vs TSX Composite (4.5%), Taiwan -9.3% vs TAIEX (6.9%), UK -4.5% vs FTSE (1.9%).
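The "vs SPY" figures in the table are consistent with a simple difference between strategy CAGR and the 9.81% benchmark CAGR. A short sketch that recomputes the column from the published numbers (the dictionary keys and function name are illustrative, not taken from the repository):

```python
# Strategy CAGR in percent, copied from the results table.
strategy_cagr = {
    "US": -1.23, "Japan": -1.43, "Hong Kong": -1.47, "Taiwan": -2.36,
    "Canada": -2.52, "Germany": -2.58, "UK": -2.66,
}
SPY_CAGR = 9.81  # percent, 2005-2024

def excess_cagr(strategy_pct, benchmark_pct):
    """Arithmetic difference in annualized return, in percentage points."""
    return round(strategy_pct - benchmark_pct, 2)

vs_spy = {name: excess_cagr(c, SPY_CAGR) for name, c in strategy_cagr.items()}
for name, gap in vs_spy.items():
    print(f"{name:10s} {gap:+.2f}")
```

The same function applies to a local benchmark by passing that index's CAGR instead of `SPY_CAGR`.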

The convergence rate doesn't predict performance

The US has the highest convergence rate (88.9%) and the least negative CAGR (-1.23%), yet it still loses money. Canada has the lowest convergence rate (79.7%) but not the worst CAGR (-2.52%): the UK (82.4% convergence) and Germany (83.2%) both did worse. A higher convergence rate does not reliably produce better returns, and no convergence rate in the sample was high enough to produce a positive one.

The chart showing convergence rate vs average trade return across all seven exchanges illustrates this directly. The US has the highest convergence rate and the least negative avg trade return (-0.183%), which is still a loss. Germany and Taiwan converge at nearly the same rate (83.2% and 82.3%) but differ in avg trade return by about 0.1 percentage points (-0.394% vs -0.295%). Convergence rate alone does not determine profitability.

The explanation is mechanical. Convergence rate measures how often the z-score returns to below 0.5 standard deviations from the mean. That's a mild condition. The spread only needs to travel from |z|=2.0 to |z|<0.5, which corresponds to a small move in log-price space. The gain from that travel, after four transaction legs, often doesn't survive.
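A rough way to see the scale involved: if the rolling mean and standard deviation stay fixed while a trade is open, a move from |z|=2.0 to |z|=0.5 is worth 1.5 spread standard deviations. The sketch below turns that into a break-even spread volatility. The fixed-window simplification and the function names are assumptions for illustration, and trade return is approximated by the move in the log spread.

```python
def gross_gain(spread_sigma, z_entry=2.0, z_exit=0.5):
    """Approximate gross return of a converging trade.

    Assumes the rolling mean and sigma do not move during the trade,
    so the captured move is (z_entry - z_exit) * sigma.
    """
    return (z_entry - z_exit) * spread_sigma

def breakeven_sigma(z_entry=2.0, z_exit=0.5, cost_per_leg=0.001, legs=4):
    """Spread sigma at which a converging trade only just pays its costs."""
    return (cost_per_leg * legs) / (z_entry - z_exit)

base = breakeven_sigma()             # entry at |z| = 2.0
wide = breakeven_sigma(z_entry=3.0)  # entry at |z| = 3.0
print(f"break-even sigma at z=2.0: {base:.4%}")
print(f"break-even sigma at z=3.0: {wide:.4%}")
```

Under these assumptions a spread with a 40-day standard deviation near 0.27% only breaks even on a converging trade at the default thresholds, before any loss-stop or time-stop trades are counted.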

Germany: smaller universe, mid-tier losses

Germany (XETRA) returned -2.58% CAGR with 15% cash years (3 out of 20). XETRA's same-sector large-cap universe is smaller than the US or Japan. Avg trade return of -0.394% is the second-worst among the seven exchanges, just ahead of Canada (-0.405%). The combination of a thin universe (occasional cash years) and poor per-trade economics makes Germany one of the worse-performing exchanges in the set.

Canada: worst avg trade return

Canada (TSX) has the worst per-trade performance at -0.405% avg trade return and the lowest convergence rate at 79.7%. TSX large caps skew toward financials and resource stocks, with the resource names concentrated in energy and materials. Same-sector correlation is high, but the pairs are also heavily exposed to commodity cycles. When oil or metals prices shift, many TSX pairs diverge simultaneously and stay diverged, triggering loss stops across multiple pairs in the same year. 2022 (-9.0%) is the clearest example: commodity volatility drove multiple divergences.

Taiwan: thin universe, mid-tier losses

Taiwan (TAI+TWO) returned -2.36% CAGR with 15% cash years (3 out of 20). Taiwan's semiconductor-heavy large caps form same-sector pairs in tech and electronic components, but the universe is thin. The strategy was in cash for 2007, 2018, and 2020 (years with fewer than 3 qualifying pairs). The avg trade return of -0.295% is mid-tier in the set. Taiwan's result is similar to the other non-US exchanges: negative returns driven by transaction costs overwhelming mean-reversion gains.


Why It Fails Everywhere

The root cause is the same on every exchange. The mean-reversion signal works mechanically: spreads that diverge do tend to revert. But:

  1. The gain per converging trade is small. The entry at |z|=2.0 and exit at |z|=0.5 captures a fraction of the spread's volatility. Typical gross gain per converging trade: 0.5-1.5% on the spread.
  2. Transaction costs are fixed. Four legs × ~0.1% = ~0.4% per trade, regardless of whether the spread produces 0.3% or 3.0% gross.
  3. Loss stops are real losses. 6-14% of trades hit the loss stop (US: 5.9%, Japan: 8.3%, Canada: 13.6%). These typically lose 3-5% before the stop triggers. The converging trades need to make enough profit to cover these.

The blended result: converging trades barely cover their cost, loss stops create real losses, time stops close unprofitable positions at zero, and the net average is negative. This mechanism operates identically across developed markets.
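The blend can be written as a probability-weighted average over the three exit types. The sketch below uses the US convergence rate (88.9%) and loss-stop rate (5.9%) from this article; the gross returns per exit type are placeholders picked from the ranges quoted above, and treating every remaining trade as a time stop is an assumption.

```python
def expected_trade_return(p_converge, p_loss_stop,
                          gross_converge, gross_loss_stop,
                          gross_time_stop=0.0, cost=0.4):
    """Average net return per trade, in percent.

    All trades that neither converge nor hit the loss stop are
    assumed to exit on the time stop.
    """
    p_time_stop = 1.0 - p_converge - p_loss_stop
    return (p_converge * (gross_converge - cost)
            + p_loss_stop * (gross_loss_stop - cost)
            + p_time_stop * (gross_time_stop - cost))

# US exit mix from the article; gross figures are illustrative placeholders.
us_example = expected_trade_return(
    p_converge=0.889, p_loss_stop=0.059,
    gross_converge=0.5,    # low end of the 0.5-1.5% range
    gross_loss_stop=-4.0,  # midpoint of the 3-5% loss range
)
print(f"expected net return per trade: {us_example:.4f}%")
```

With these placeholder inputs the result is roughly -0.19% per trade. It illustrates the mechanism rather than reproducing the backtest: a converging trade nets about 0.1%, so each loss-stop trade wipes out dozens of winners.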

The years 2008 and 2022 show the strategy's one genuine property: market neutrality. In 2008, the strategy outperformed SPY (which fell 36%) on all seven exchanges. The portfolio has near-zero market beta by construction. But market neutrality with a negative carry is not a useful investment property.


What Would Fix It

The math requires either lower transaction costs or larger spreads at entry.

Lower costs: feasible for prime brokerage clients with negotiated commissions, near-zero for market makers. Out of reach for most investors.

Wider entry threshold: raising the entry from |z|=2.0 to |z|=2.5 or |z|=3.0 would increase the gross gain per trade but reduce trade frequency. At very high thresholds, the strategy is essentially in cash most of the time.

Better pair selection: using stricter cointegration tests (ADF, KPSS) rather than the AR(1) half-life filter might select pairs where the spread volatility is higher and the per-trade gain is larger. This is worth testing.
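For reference, the AR(1) half-life filter that such tests would replace can be sketched in plain Python. This is one common definition (fit s[t] = a + phi * s[t-1] by least squares, then half-life = -ln 2 / ln phi). The exact estimator used in the backtest is not shown in this article, so treat the function below as an assumption.

```python
import math

def ar1_half_life(spread):
    """Half-life in periods of an AR(1) fit to the spread.

    Returns None when the fitted phi is outside (0, 1), i.e. the
    series does not look mean-reverting under this model.
    """
    x, y = spread[:-1], spread[1:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    var = sum((a - mx) ** 2 for a in x)
    if var == 0:
        return None
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    phi = cov / var  # OLS slope of s[t] on s[t-1]
    if not 0.0 < phi < 1.0:
        return None
    return -math.log(2) / math.log(phi)

def passes_filter(spread, lo=5.0, hi=60.0):
    """Keep pairs whose half-life falls in the 5-60 day band."""
    hl = ar1_half_life(spread)
    return hl is not None and lo <= hl <= hi

# Noise-free decay with phi = 0.9, used only as a sanity check.
decay = [0.9 ** t for t in range(100)]
print(ar1_half_life(decay), passes_filter(decay))
```

A half-life filter only measures the speed of reversion. It says nothing about how wide the spread is, which is why a filter that also favors higher spread volatility might change the per-trade economics.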

Alternative timing: the academic literature (Gatev et al., 2006; Do and Faff, 2010) documents that pairs trading was profitable in the 1962-2002 period. The mechanism worked before widespread algorithmic execution compressed the edge. In the 2005-2024 window used here, the edge is gone or too small to capture after costs.


Limitations

Short-selling assumed everywhere. Short-selling restrictions in practice vary significantly by market and change over time. In Taiwan and Hong Kong, short-selling of specific stocks can be restricted or expensive. The backtest doesn't model borrow costs.

Fixed formation calendar. Pairs form once per year. If a pair's cointegrating relationship breaks mid-year (merger announcement, spinoff, earnings shock), the strategy continues trading it until year-end or a stop is hit. Rolling formation would catch this faster.

FMP data quality. Five exchanges were excluded due to 2004 data sparsity corrupting beta estimates. The seven clean exchanges in this comparison have full coverage from 2004 onward.


Run It Yourself

git clone https://github.com/ceta-research/backtests.git
cd backtests

export CR_API_KEY="your_key_here"

# Run all exchanges
python3 pairs-zscore/backtest.py --global --output results/exchange_comparison.json

# Run a specific exchange
python3 pairs-zscore/backtest.py --preset japan

# Current live signals
python3 pairs-zscore/screen.py --preset japan

Takeaway

Active z-score pairs trading, with realistic parameters and transaction costs, produced negative returns on every exchange tested over 20 years. Convergence rates of 80-89% look like a working signal, and they are one. The problem is not the signal. It is the economics: mean-reversion gains are too small to cover four-leg transaction costs.

The strategy has real properties worth keeping: market neutrality, low volatility, and near-zero market beta. These are valuable in a portfolio context if the carry is positive. Here it isn't.

The pairs trading methodology works. This particular implementation doesn't survive typical trading costs. The 2006-vintage literature found pairs trading profitable. The 2005-2024 data says it no longer is.


References

  • Gatev, E., Goetzmann, W. & Rouwenhorst, K. (2006). "Pairs Trading: Performance of a Relative-Value Arbitrage Rule." Review of Financial Studies, 19(3), 797-827.
  • Do, B. & Faff, R. (2010). "Does Simple Pairs Trading Still Work?" Financial Analysts Journal, 66(4), 83-95.
  • Do, B. & Faff, R. (2012). "Are Pairs Trading Profits Robust to Trading Costs?" Journal of Financial Research, 35(2), 261-287.
  • Krauss, C. (2017). "Statistical Arbitrage Pairs Trading Strategies: Review and Outlook." Journal of Economic Surveys, 31(2), 513-545.

Data: Ceta Research (FMP data warehouse). 7 exchanges, 2005-2024. All exchanges: large caps only, same formation and trading parameters. Note: Past performance does not guarantee future results. This is educational content, not investment advice.