Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

## Computation of the Loss Distribution not only for Operational Risk Managers

In operational risk management, given a set of risk types and business-line combinations, the task boils down to providing the risk management board with an estimate of the losses the bank (or any other financial institution, hedge fund, etc.) may suffer. If you think about it for a second, the spectrum of things that can go wrong is wide, e.g. the failure of a computer system, internal or external fraud, issues with clients, products, and business practices, damage to physical goods, and so on. Blended with the business lines, e.g. corporate finance, trading and sales, retail banking, commercial banking, payment and settlement, agency services, asset management, or retail brokerage, they yield over 50 combinations of operational risk factors that need to be considered. Separately and carefully. And it's a tough job.

Why is it tough? For two main reasons. First, for an operational risk manager the sample of data describing a given risk is usually insufficient (statistically speaking, the sample is small over the lifetime of the financial organisation). Secondly, when something goes wrong, the next event of the same kind may take place in the not-too-distant future or far away in time. The biggest problem an operational risk manager faces in daily work is the prediction of any and all losses due to operational failures. The timing of the (next) event therefore enters the picture as its own variable, captured by the loss frequency distribution. The second player in the game is the loss severity distribution: if the worst strikes, how much might the bank, financial body, investor, or trader lose?

From a trader's perspective we know well that Value-at-Risk (VaR) and Expected Shortfall are two quantitative risk measures that address similar questions. From the viewpoint of operational risk, however, the estimation of losses requires a different approach.
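As a quick illustration of those two trader-side measures, both can be estimated directly from a return series. The snippet below is not part of the operational-risk algorithm presented in this post; the synthetic return sample and the 95% level are purely illustrative:

```python
import numpy as np

# a minimal sketch: historical VaR and Expected Shortfall at the 95% level,
# computed for an illustrative sample of daily returns (stand-in for real data)
np.random.seed(1)
ret = np.random.standard_t(df=4, size=1000) * 0.01   # hypothetical daily returns

alpha = 0.05
var95 = -np.percentile(ret, 100*alpha)        # Value-at-Risk (as a positive loss)
es95 = -np.mean(ret[ret <= -var95])           # Expected Shortfall (mean loss beyond VaR)

print("95%% 1-day VaR = %.2f%%, ES = %.2f%%" % (var95*100, es95*100))
```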

In this post, following Hull (2015), we present a Python algorithm for the computation of the loss distribution given the best estimates of the loss frequency and loss severity distributions. Though designed with operational risk analysts in mind, in the end we argue its usefulness for any algo trader and/or portfolio risk manager.

1. Operational Losses: Case Study of the Vanderloo Bank

Access to operational loss data is much harder to obtain than, say, price data for stocks traded on an exchange. Such records usually stay within the walls of the bank, accessible internally only. A recommended practice for operational risk managers around the world is to share those unique data despite confidentiality concerns. Only then can we build a broader knowledge and understanding of the risks and losses incurred due to operational activities.

Let's consider a case study of a hypothetical Vanderloo Bank. The bank was founded in 1988 in the Netherlands, and its main line of business was concentrated around building unique customer relationships and providing loans to small local businesses. Despite a vivid vision and firmly set goals for the future, Vanderloo Bank could not avoid a number of operational roadblocks that led to substantial operational losses:

```
    Year    Month  Day   Business Line           Risk Category         Loss ($M)
 0  1989.0  1.0    13.0  Trading and Sales       Internal Fraud        0.530597
 1  1989.0  2.0    9.0   Retail Brokerage        Process Failure       0.726702
 2  1989.0  4.0    14.0  Trading and Sales       System Failure        1.261619
 3  1989.0  6.0    11.0  Asset Managment         Process Failure       1.642279
 4  1989.0  7.0    23.0  Corporate Finance       Process Failure       1.094545
 5  1990.0  10.0   21.0  Trading and Sales       Employment Practices  0.562122
 6  1990.0  12.0   24.0  Payment and Settlement  Process Failure       4.009160
 7  1991.0  8.0    23.0  Asset Managment         Business Practices    0.495025
 8  1992.0  1.0    28.0  Asset Managment         Business Practices    0.857785
 9  1992.0  3.0    14.0  Commercial Banking      Damage to Assets      1.257536
10  1992.0  5.0    26.0  Retail Banking          Internal Fraud        1.591007
11  1992.0  8.0    9.0   Corporate Finance       Employment Practices  0.847832
12  1993.0  1.0    11.0  Corporate Finance       System Failure        1.314225
13  1993.0  1.0    19.0  Retail Banking          Internal Fraud        0.882371
14  1993.0  2.0    24.0  Retail Banking          Internal Fraud        1.213686
15  1993.0  6.0    12.0  Commercial Banking      System Failure        1.231784
16  1993.0  6.0    16.0  Agency Services         Damage to Assets      1.316528
17  1993.0  7.0    11.0  Retail Banking          Process Failure       0.834648
18  1993.0  9.0    21.0  Retail Brokerage        Process Failure       0.541243
19  1993.0  11.0   11.0  Asset Managment         Internal Fraud        1.380636
20  1994.0  11.0   22.0  Retail Banking          External Fraud        1.426433
21  1995.0  2.0    14.0  Commercial Banking      Process Failure       1.051281
22  1995.0  11.0   21.0  Commercial Banking      External Fraud        2.654861
23  1996.0  8.0    17.0  Agency Services         Process Failure       0.837237
24  1997.0  7.0    13.0  Retail Brokerage        Internal Fraud        1.107019
25  1997.0  7.0    24.0  Agency Services         External Fraud        1.513146
26  1997.0  8.0    8.0   Retail Banking          Process Failure       1.002040
27  1997.0  9.0    2.0   Agency Services         Damage to Assets      0.646596
28  1997.0  9.0    12.0  Retail Banking          Employment Practices  0.966086
29  1998.0  1.0    8.0   Retail Banking          Internal Fraud        0.938803
30  1998.0  1.0    12.0  Retail Banking          System Failure        0.922069
31  1998.0  2.0    5.0   Asset Managment         Process Failure       1.042259
32  1998.0  4.0    18.0  Commercial Banking      External Fraud        0.969562
33  1998.0  5.0    12.0  Retail Banking          External Fraud        0.683715
34  1999.0  1.0    3.0   Trading and Sales       Internal Fraud        2.035785
35  1999.0  4.0    27.0  Retail Brokerage        Business Practices    1.074277
36  1999.0  5.0    8.0   Retail Banking          Employment Practices  0.667655
37  1999.0  7.0    10.0  Agency Services         System Failure        0.499982
38  1999.0  7.0    17.0  Retail Brokerage        Process Failure       0.803826
39  2000.0  1.0    26.0  Commercial Banking      Business Practices    0.714091
40  2000.0  7.0    23.0  Trading and Sales       System Failure        1.479367
41  2001.0  6.0    16.0  Retail Brokerage        System Failure        1.233686
42  2001.0  11.0   5.0   Agency Services         Process Failure       0.926593
43  2002.0  5.0    14.0  Payment and Settlement  Damage to Assets      1.321291
44  2002.0  11.0   11.0  Retail Banking          External Fraud        1.830254
45  2003.0  1.0    14.0  Corporate Finance       System Failure        1.056228
46  2003.0  1.0    28.0  Asset Managment         System Failure        1.684986
47  2003.0  2.0    28.0  Commercial Banking      Damage to Assets      0.680675
48  2004.0  1.0    11.0  Asset Managment         Process Failure       0.559822
49  2004.0  6.0    19.0  Commercial Banking      Internal Fraud        1.388681
50  2004.0  7.0    3.0   Retail Banking          Internal Fraud        0.886769
51  2004.0  7.0    21.0  Retail Brokerage        Employment Practices  0.606049
52  2004.0  7.0    27.0  Asset Managment         Employment Practices  1.634348
53  2004.0  11.0   26.0  Asset Managment         Damage to Assets      0.983355
54  2005.0  1.0    9.0   Corporate Finance       Damage to Assets      0.969710
55  2005.0  9.0    17.0  Commercial Banking      System Failure        0.634609
56  2006.0  2.0    24.0  Agency Services         Business Practices    0.637760
57  2006.0  3.0    21.0  Retail Banking          Employment Practices  1.072489
58  2006.0  6.0    25.0  Payment and Settlement  System Failure        0.896459
59  2006.0  12.0   25.0  Trading and Sales       Process Failure       0.731953
60  2007.0  6.0    9.0   Commercial Banking      System Failure        0.918233
61  2008.0  1.0    5.0   Corporate Finance       External Fraud        0.929702
62  2008.0  2.0    14.0  Retail Brokerage        System Failure        0.640201
63  2008.0  2.0    14.0  Commercial Banking      Internal Fraud        1.580574
64  2008.0  3.0    18.0  Corporate Finance       Process Failure       0.731046
65  2009.0  2.0    1.0   Agency Services         System Failure        0.630870
66  2009.0  2.0    6.0   Retail Banking          External Fraud        0.639761
67  2009.0  4.0    14.0  Payment and Settlement  Internal Fraud        1.022987
68  2009.0  5.0    25.0  Retail Banking          Business Practices    1.415880
69  2009.0  7.0    8.0   Retail Banking          Business Practices    0.906526
70  2009.0  12.0   26.0  Agency Services         System Failure        1.463529
71  2010.0  2.0    13.0  Asset Managment         Damage to Assets      0.664935
72  2010.0  3.0    24.0  Payment and Settlement  Process Failure       1.848318
73  2010.0  10.0   16.0  Commercial Banking      External Fraud        1.020736
74  2010.0  12.0   27.0  Retail Banking          Employment Practices  1.126265
75  2011.0  2.0    5.0   Retail Brokerage        Process Failure       1.549890
76  2011.0  6.0    24.0  Corporate Finance       Damage to Assets      2.153238
77  2011.0  11.0   6.0   Asset Managment         System Failure        0.601332
78  2011.0  12.0   1.0   Payment and Settlement  External Fraud        0.551183
79  2012.0  2.0    21.0  Corporate Finance       External Fraud        1.866740
80  2013.0  4.0    22.0  Retail Brokerage        External Fraud        0.672756
81  2013.0  6.0    27.0  Payment and Settlement  Employment Practices  1.119233
82  2013.0  8.0    17.0  Commercial Banking      System Failure        1.034078
83  2014.0  3.0    1.0   Asset Managment         Employment Practices  2.099957
84  2014.0  4.0    4.0   Retail Brokerage        External Fraud        0.929928
85  2014.0  6.0    5.0   Retail Banking          System Failure        1.399936
86  2014.0  11.0   17.0  Asset Managment         Process Failure       1.299063
87  2014.0  12.0   3.0   Agency Services         System Failure        1.787205
88  2015.0  2.0    2.0   Payment and Settlement  System Failure        0.742544
89  2015.0  6.0    23.0  Commercial Banking      Employment Practices  2.139426
90  2015.0  7.0    18.0  Trading and Sales       System Failure        0.499308
91  2015.0  9.0    9.0   Retail Banking          Employment Practices  1.320201
92  2015.0  9.0    18.0  Corporate Finance       Business Practices    2.901466
93  2015.0  10.0   21.0  Commercial Banking      Internal Fraud        0.808329
94  2016.0  1.0    9.0   Retail Banking          Internal Fraud        1.314893
95  2016.0  3.0    28.0  Asset Managment         Business Practices    0.702811
96  2016.0  3.0    25.0  Payment and Settlement  Internal Fraud        0.840262
97  2016.0  4.0    6.0   Retail Banking          Process Failure       0.465896
```

Having a record of 98 events (indexed 0 to 97), we can now begin building a statistical picture of the loss frequency and loss severity distributions.

2. Loss Frequency Distribution

For the loss frequency, the natural probability distribution to use is the Poisson distribution. It assumes that losses happen randomly through time, so that in any short period of time $\Delta t$ there is a probability of $\lambda \Delta t$ of a loss occurring. The probability of $n$ losses in time $T$ [years] is:
$$ \mbox{Pr} = \exp{(-\lambda T)} \frac{(\lambda T)^n}{n!} $$
where the parameter $\lambda$ can be estimated as the average number of losses per year (Hull 2015).
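As a side note, this formula is exactly what scipy.stats.poisson implements. A minimal sanity check (the values of $\lambda = 3.6$ and $T = 1$ year are illustrative here and anticipate the estimate derived below) might look like this:

```python
import numpy as np
from math import factorial
from scipy.stats import poisson

# a minimal check that the Poisson formula above matches scipy.stats.poisson.pmf
# (lam and T are illustrative; lam = 3.6 anticipates the estimate derived below)
lam, T = 3.6, 1.0

for n in range(6):
    pr_formula = np.exp(-lam*T) * (lam*T)**n / factorial(n)
    pr_scipy = poisson.pmf(n, lam*T)
    print("n = %d  Pr = %.4f  (scipy: %.4f)" % (n, pr_formula, pr_scipy))
```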
Given our table in the Python pandas DataFrame format, `df`, we code:

```python
# Computation of the Loss Distribution not only for Operational Risk Managers
# (c) 2016 QuantAtRisk.com, Pawel Lachowicz

from scipy.stats import lognorm, norm, poisson
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd

# reading Vanderloo Bank operational loss data
df = pd.read_hdf('vanderloo.h5', 'df')

# count the number of loss events in a given year
fre = df.groupby("Year").size()
print(fre)
```

where the last operation groups and displays the number of losses in each year:

```
Year
1989.0    5
1990.0    2
1991.0    1
1992.0    4
1993.0    8
1994.0    1
1995.0    2
1996.0    1
1997.0    5
1998.0    5
1999.0    5
2000.0    2
2001.0    2
2002.0    2
2003.0    3
2004.0    6
2005.0    2
2006.0    4
2007.0    1
2008.0    4
2009.0    6
2010.0    4
2011.0    4
2012.0    1
2013.0    3
2014.0    5
2015.0    6
2016.0    4
dtype: int64
```

The estimation of Poisson's $\lambda$ requires solely the computation of:

```python
# estimate lambda parameter
lam = np.sum(fre.values) / (df.Year[df.shape[0]-1] - df.Year[0])
print(lam)
```

```
3.62962962963
```

which informs us that during the 1989-2016 period, i.e. over the past 27 years, there were on average $\lambda = 3.6$ losses per year. Assuming the Poisson distribution as the best descriptor of the loss frequency distribution, we model the probability of operational losses of the Vanderloo Bank in the following way:

```python
# draw random variables from a Poisson distribution with lambda=lam
prvs = poisson.rvs(lam, size=(10000))

# plot the pdf (loss frequency distribution)
h = plt.hist(prvs, bins=range(0, 11))
plt.close("all")
y = h[0]/np.sum(h[0])
x = h[1]

plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y, width=0.7, align='center', color="#2c97f1")
plt.xlim([-1, 11])
plt.ylim([0, 0.25])
plt.ylabel("Probability", fontsize=12)
plt.title("Loss Frequency Distribution", fontsize=14)
plt.savefig("f01.png")
```

revealing:

3. Loss Severity Distribution

The data collected in the last column of `df` allow us to plot and estimate the best fit of the loss severity distribution. In the practice of operational risk managers, the lognormal distribution is a common choice:

```python
c = .7, .7, .7  # define grey color

plt.figure(figsize=(10, 6))
plt.hist(df["Loss ($M)"], bins=25, color=c, normed=True)
plt.xlabel("Incurred Loss ($M)", fontsize=12)
plt.ylabel("N", fontsize=12)
plt.title("Loss Severity Distribution", fontsize=14)

x = np.arange(0, 5, 0.01)
sig, loc, scale = lognorm.fit(df["Loss ($M)"])
pdf = lognorm.pdf(x, sig, loc=loc, scale=scale)
plt.plot(x, pdf, 'r')
plt.savefig("f02.png")

print(sig, loc, scale)  # lognormal pdf's parameters
```

```
0.661153638163 0.328566816132 0.647817560825
```

where the lognormal probability density function (pdf) we use, in SciPy's parameterisation, is given by:
$$ p(y;\ \sigma, loc, scale) = \frac{1}{scale} \cdot \frac{1}{x\sigma\sqrt{2\pi}} \exp{ \left[ -\frac{1}{2} \left(\frac{\log{x}}{\sigma} \right)^2 \right] } $$
where $x = (y - loc)/scale$ and the leading $1/scale$ factor accounts for the change of variables. The red curve in the figure above shows the resulting fit of the pdf to the data.
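As a quick sanity check (not part of the original fitting code), one can confirm that this expression reproduces SciPy's lognorm.pdf for the fitted parameters printed above:

```python
import numpy as np
from scipy.stats import lognorm

# a quick check that the hand-written pdf matches scipy's lognorm.pdf
# (sig, loc, scale are assumed to be the fitted values printed above)
sig, loc, scale = 0.661153638163, 0.328566816132, 0.647817560825

y = np.arange(0.4, 5, 0.01)               # loss grid (kept above loc so that x > 0)
x = (y - loc) / scale                      # standardised variable
pdf_hand = 1/(x*sig*np.sqrt(2*np.pi)) * np.exp(-0.5*(np.log(x)/sig)**2) / scale
pdf_scipy = lognorm.pdf(y, sig, loc=loc, scale=scale)

print(np.allclose(pdf_hand, pdf_scipy))    # True
```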

4. Loss Distribution

The loss frequency distribution must be combined with the loss severity distribution for each risk type/business line combination in order to determine the loss distribution. The most common assumption here is that loss severity is independent of loss frequency. Hull (2015) suggests the following steps for building the Monte Carlo simulation that models the loss distribution:

1. Sample from the frequency distribution to determine the number of loss events ($n$)
2. Sample $n$ times from the loss severity distribution to determine the loss experienced
for each loss event ($L_1, L_2, …, L_n$)
3. Determine the total loss experienced ($=L_1 + L_2 + … + L_n$)

When many simulation trials are used, we obtain a total distribution for losses of the type being considered. In Python we code those steps in the following way:

```python
def loss(r, loc, sig, scale, lam):
    X = []
    for x in range(11):  # up to 10 loss events considered
        if(r < poisson.cdf(x, lam)):  # x denotes a loss number
            out = 0
        else:
            out = lognorm.rvs(s=sig, loc=loc, scale=scale)
        X.append(out)
    return np.sum(X)  # = L_1 + L_2 + ... + L_n


# run 1e5 Monte Carlo simulations
losses = []
for _ in range(100000):
    r = np.random.random()
    losses.append(loss(r, loc, sig, scale, lam))

h = plt.hist(losses, bins=range(0, 16))
_ = plt.close("all")
y = h[0]/np.sum(h[0])
x = h[1]

plt.figure(figsize=(10, 6))
plt.bar(x[:-1], y, width=0.7, align='center', color="#ff5a19")
plt.xlim([-1, 16])
plt.ylim([0, 0.20])
plt.title("Modelled Loss Distribution", fontsize=14)
plt.xlabel("Loss ($M)", fontsize=12)
plt.ylabel("Probability of Loss", fontsize=12)
plt.savefig("f03.png")
```

revealing:

The loss function has been designed so that it considers up to 10 loss events. We run $10^5$ simulations. In each trial we first draw a random number r from a uniform distribution. If it is less than the value of the Poisson cumulative distribution function (with $\lambda = 3.6$) for loss number x ($x = 0, 1, ..., 10$), we assume a zero loss for that slot. Otherwise, we draw a random variable from the lognormal distribution (given by the parameters found via the fitting procedure a few lines earlier). Simple as that. The resultant loss distribution, as shown in the chart above, describes the expected severity of future losses (due to "fatal" operational events at Vanderloo Bank) together with the corresponding probabilities.

5. Beyond Operational Risk Management

A natural extension of the numerical procedure applied here is the modelling of, e.g., the anticipated (predicted) loss distribution for any portfolio of N assets. One can estimate it based on the track record of losses incurred in trading to date. By doing so, we gain an additional tool in our arsenal of quantitative risk measures and modelling. Stay tuned, as a new post will illustrate that case.

DOWNLOAD
vanderloo.h5

REFERENCES
Hull, J., 2015, Risk Management and Financial Institutions, 4th Ed.

## Probability of Black Swan Events at NYSE

Featured by the Russian website Financial One (Apr 26, 2016)

The prediction of extreme rare events (EREs) in the financial markets remains one of the toughest problems, first of all because of the very limited knowledge we have of their distribution and of the underlying correlations across the markets. Literally, we walk in the dark, hoping it won't happen today, not to the stocks we hold in our portfolios. But is that darkness really so dark? In this post we will use textbook knowledge of classical statistics in order to analyse the universe of 2500+ stocks traded at the New York Stock Exchange (NYSE) that experienced the most devastating daily loss in their history (one data point per stock). Based on that, we will try to shed new light on the most mind-boggling questions: when will the next ERE happen, how long do we need to wait for it, and if it hits, what will its magnitude "of devastation" be?

1. Extreme Rare Events of NYSE

An ERE meeting certain criteria has its own name in the financial world: a black swan. We will define it in the next Section. Historically speaking, at some point people were convinced that only white swans existed, until someone noticed a population of black swans in Western Australia. The impossible became possible. The same happens in the stock markets.
Black Monday on Oct 19, 1987 is no doubt the greatest historical event where the markets suffered 20%+ losses. The event was so rare that no risk analyst could predict it or protect against it. History repeated itself in 2000 when the Internet bubble caused many IT stocks to lose heavily over a period of a few weeks. The events of 9/11 in New York forced the closure of the US stock exchange for 6 days, and on Sep 17, as a consequence of fear, about 25 stocks at NYSE lost more than 15% in one day. The uncertainty cannot be avoided, but can it be minimised? The financial crisis of 2007-2008 triggered a significant number of black swans to fly again, and fly high... Did we learn enough from that? Do we protect ourselves more efficiently against black swans after the 2009 market recovery? Will you be surprised if I tell you "no"?!

Let's conduct a data analysis of the most extreme EREs ever spotted at NYSE among all its stocks. To do that, we will use Python and the Yahoo! Finance database. The data quality is not required to be superbly accurate, and we will download only adjusted-close prices with the corresponding dates. A list of companies (inter alia their tickers) traded at NYSE can be extracted from a .xls file available for download at http://www.nasdaq.com/screening/company-list.aspx. It contains 3226 tickers, while, as we will see below, Yahoo! Finance recognises only 2716 of them. Since 2716 is a huge data set, meeting many of the requirements for statistical significance, I'm more than happy to go along with it.

During the stock data download, we look for the most extreme daily loss per stock and collect them all in the pandas DataFrame dfblack. It will contain a Date, the magnitude of the ERE (Return), and Delta0. The latter we calculate on the spot as the number of business days between some initial point in time (here, 1900-01-01) and the day of the ERE occurrence. The processing may take up to 45 min depending on your Internet speed. Be patient, it will be worth waiting that long. Using HDF5 we save the dfblack DataFrame to disk.

```python
# Probability of Black Swan Events at NYSE
# (c) 2016 by Pawel Lachowicz, QuantAtRisk.com

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader.data as web
import datetime as dt
from scipy.stats import gumbel_l, t, norm, expon, poisson
from math import factorial

# Constants
c = (.7, .7, .7)  # grey color

col = ['Date', 'Return', 'Delta0']
dfblack = pd.DataFrame(columns=col)  # create an empty DataFrame

i = n = 0
for lstline in open("nyseall.lst", 'r').readlines():
    ticker = lstline[:-1]
    i += 1
    print(i, ticker)
    try:
        ok = True
        data = web.DataReader(str(ticker), data_source='yahoo',
                              start='1900-01-01', end='2016-04-14')
    except:
        ok = False
    if(ok):
        n += 1
        data['Return'] = data['Adj Close'] / data['Adj Close'].shift(1) - 1
        data = data[1:]  # skip first row
        df = data['Return'].copy()
        dfs = df.sort_values()
        start = dt.date(1900, 1, 1)
        end = dt.date(dfs.index[0].year, dfs.index[0].month, dfs.index[0].day)
        delta0 = np.busday_count(start, end)
        tmin = end
        rmin = dfs[0]
        dfblack.loc[len(dfblack)] = [tmin, rmin, delta0]  # add new row to dfblack
    else:
        print(' not found')

dfblack.set_index(dfblack.Date, inplace=True)  # set index by Date
dfblack.sort_index(inplace=True)  # sort DataFrame by Date
del dfblack['Date']  # remove Date column

print("No. of stocks with valid data: %g" % n)

# saving to file
store = pd.HDFStore('dfblack.h5')
store['dfblack'] = dfblack
store.close()

# reading from file
store = pd.HDFStore('dfblack.h5')
dfblack = pd.read_hdf('dfblack.h5', 'dfblack')
store.close()

dfblack0 = dfblack.copy()  # a backup copy
```

where nyseall.lst is a plain text file listing all NYSE tickers:

```
DDD
MMM
WBAI
WUBA
AHC
...
```

The first instinct of a savvy risk analyst would be to plot the distribution of the magnitudes of EREs as extracted for the NYSE universe. We achieve that by:

```python
plt.figure(num=1, figsize=(9, 5))
plt.hist(dfblack.Return, bins=50, color='grey')  # a histogram
plt.xlabel('Daily loss', fontsize = 12)
plt.xticks(fontsize = 12)
plt.savefig('fig01.png', format='png')
```

which reveals a distribution peaked around -18% with a long left tail of losses reaching nearly -98%. Welcome to the devil's playground! But that's just the beginning. The information on the dates corresponding to the occurrence of those EREs allows us to build a 2D picture:

```python
# Black Monday data sample
df_bm = dfblack[dfblack.index == dt.date(1987, 10, 19)]

# Reopening of NYSE after 9/11 data sample
df_wtc = dfblack[dfblack.index == dt.date(2001, 9, 17)]

# Financial Crisis of 2008 data sample
df_cre = dfblack[(dfblack.index >= dt.date(2008, 9, 19)) &
                 (dfblack.index <= dt.date(2009, 3, 6))]

# Drawdown of 2015/2016 data sample
df_16 = dfblack[(dfblack.index >= dt.date(2015, 12, 1)) &
                (dfblack.index <= dt.date(2016, 2, 11))]

plt.figure(figsize=(13, 6))
plt.plot(dfblack.Return*100, '.k')
plt.plot(df_bm.Return*100, '.r')
plt.plot(df_wtc.Return*100, '.b')
plt.plot(df_cre.Return*100, '.m')
plt.plot(df_16.Return*100, '.y')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], dfblack.index[-1]), fontsize=12)
plt.ylabel('Magnitude [%]', fontsize=12)
plt.title('Time Distribution of the Most Extreme Daily Losses of %g NYSE Stocks (1 point per stock)' % (len(dfblack)), fontsize=13)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
plt.xlim([dfblack.index[0], dt.date(2016, 4, 14)])
plt.savefig('fig02.png', format='png')
```

delivering:

Keeping in mind that every stock (out of 2716) is represented by only one data point, the figure reveals a dramatic state of play: a significantly lower number of EREs before 1995 and a progressive increase after that, with a surprisingly huge number in the past 5 years! As discussed in my previous post, Detecting Human Fear in Electronic Trading, the surge in EREs could be explained by the domination of algorithmic trading over manual trades. It may also suggest that the algorithmic risk management for many stocks is either not as good as we want it to be or very sensitive to the global stock buying/selling pressure across the markets (the butterfly effect). On the other hand, the number of new stocks being introduced to NYSE may force them "to fight for life", and those most vulnerable, short-lived, of unestablished position, reputation, financial history, or least liquidity are more likely to suffer an ERE. Regardless of the primary cause, there are more and more "sudden drops in altitude" now than ever. I find it alarming. For clarity, in the above figure the historical events of Black Monday 1987-10-19 (89 points), the reopening of NYSE after 9/11 (25 points), the Financial Crisis of 2007-2008 (858 points), and the drawdown of 2015/2016 (256 points!) have been marked in red, blue, purple, and dark yellow, respectively.
2. Black Swans of NYSE

The initial distribution of Extreme Rare Events of NYSE (Figure 1) provides us with an opportunity to define a black swan event. There is a statistical theory, Extreme Value Theory (EVT), that delivers an appealing explanation of the behaviour of rare events. We have discussed it in detail in the Extreme VaR for Portfolio Managers and Black Swan and Extreme Loss Modeling articles. Thanks to that, we gain the knowledge and tools needed to describe what we observe. As previously, in this post we will only be considering the Gumbel distribution $G$ with the corresponding probability density function (pdf) $g$ given as:
$$ G(z;\ a,b) = e^{-e^{-z}} \ \ \mbox{for}\ \ z=\frac{x-a}{b}, \ x\in\Re $$
and
$$ g(z;\ a,b) = b^{-1} e^{-z}e^{-e^{-z}} \ , $$
where $a$ and $b$ are the location parameter and scale parameter, respectively. You can convince yourself that fitting the $g(z;\ a,b)$ function truly works in practice:

```python
# fitting
locG, scaleG = gumbel_l.fit(dfblack.Return)  # location, scale parameters

dx = 0.0001
x = [dx*i for i in range(-int(1/dx), int(1/dx)+1)]
x2 = x.copy()

plt.figure(num=3, figsize=(9, 4))
plt.hist(dfblack.Return, bins=50, normed=True, color=c, label="NYSE Data of 2716 EREs")
pdf1 = gumbel_l.pdf(x2, loc=locG, scale=scaleG)
y = pdf1.copy()
plt.plot(x2, y, 'r', linewidth=2, label="Gumbel PDF fit")
plt.xlim([-1, 0])
plt.xlabel('Daily loss [%]', fontsize = 12)
plt.xticks(fontsize = 12)
plt.legend(loc=2)
plt.savefig('fig03.png', format='png')
```

By deriving:

```python
print(locG, scaleG, locG-scaleG, locG+scaleG)

# Pr(Loss < locG-scaleG)
Pr_bs = gumbel_l.cdf(locG-scaleG, loc=locG, scale=scaleG)
print(Pr_bs)
```

we see that

```
-0.151912998711 0.0931852781844 -0.245098276896 -0.058727720527
0.307799372445
```

i.e. the Gumbel pdf's location parameter (peak) is at -15.2% with a scale of 9.3%. Using an analogy to the Normal distribution, we define two brackets, namely:
$$ a-b \ \ \mbox{and} \ \ a+b $$
to be -24.5% and -5.9%, respectively. Given that, we define the black swan event as a daily loss of
$$ L < a-b $$
magnitude. Based on the NYSE data sample we find that 30.8% of EREs (i.e. losses greater than 24.5%) meet that criterion. We re-plot their time-magnitude history as follows:

```python
# skip data after 2015-04-10 (last 252 business days)
dfb = dfblack0[dfblack0.index < dt.date(2015, 4, 10)]

# re-fit Gumbel (to remain politically correct in-sample ;)
locG, scaleG = gumbel_l.fit(dfb.Return)
print(locG, locG-scaleG)  # -0.16493276305 -0.257085820659

# extract black swans from EREs data set
dfb = dfb[dfb.Return < locG-scaleG]
dfb0 = dfb.copy()

plt.figure(figsize=(13, 6))
plt.plot(dfb.Return*100, '.r')
plt.xlabel('Time Interval [%s to %s]' % (dfb.index[0], dfb.index[-1]), fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Black Swans of %g NYSE Stocks (L < %.1f%%)' % (len(dfb), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.ylim([-100, 0])
plt.xlim([dfb.index[0], dt.date(2016, 4, 14)])
plt.savefig('fig04.png', format='png')
```

where we deliberately skipped the last 252+ data points to allow ourselves some backtesting later on (out-of-sample). That reduces the number of black swans from 807 down to 627 (in-sample).

3. Modelling the Probability of an Event Based on Observed Occurrences

Let's be honest. It's nearly impossible to predict in advance the exact day when the next black swan will appear, for which stocks, or whether it will be isolated or trigger other stocks to fall on the same day.
Therefore, what follows from now on you should consider as common sense in the quantitative modelling of black swans, backed by our knowledge of statistics. And yes, despite the opinions of Taleb (2010) in his book The Black Swan: The Impact of the Highly Improbable, we might derive some risk uncertainties and become more aware of the impact if the worst strikes.

The classical guideline can be found in Vose's book Quantitative Risk Analysis: A Guide to Monte Carlo Simulation Modelling (1996). David delivers not only solid explanations of many statistical distributions but also excellent hints on their practical usage for risk analysts. Since I discovered this book in 2010 in Singapore, it has been my favourite reference manual of all time.

The modelling of the probability of an event based on observed occurrences may follow two distinct processes, analogous to the difference between discrete and continuous distributions: (1) an event that may occur only among a set of specific discrete opportunities, and (2) an event that may occur over a continuum of opportunity. Considering black swan events, we are not limited to a finite number of instances; therefore we will employ continuous exposure probability modelling. A continuous exposure process is characterised by the mean interval between events (MIBE) $\beta$. The underlying assumption is that the uncertainty displays the properties of a Poisson process, i.e. the probability of an event is independent of however many events have occurred in the past or how recently. And this is our starting point towards a better understanding of NYSE black swans.

3.1. Mean Interval Between Black Swan Events (MIBE)

The MIBE is the average interval between $n$ observed occurrences of a black swan event. Its true value can be estimated from the observed occurrences using the central limit theorem:
$$ \mbox{MIBE}\ \beta = \mbox{Normal} \left( \bar{t}, \frac{\sigma}{\sqrt{n-1}} \right) $$
where $\bar{t}$ is the average of the $n-1$ observed intervals $t_i$ between the $n$ observed contiguous black swan events and $\sigma$ is the standard deviation of the $t_i$ intervals. The larger the value of $n$, the greater our confidence in knowing its true value.

First, we estimate $\bar{t}$ from the data. Using NumPy's histogram function, as an extra check, we confirm that the plain mean is the same as the expected value, $E(\Delta t)$, derived from the histogram,
$$ E(\Delta t) = \sum_{i} p_i x_i \ , $$
i.e. $\bar{t} = E(\Delta t)$, where $p_i$ is the probability of observing $x_i$. In Python we execute that check as follows:

```python
deltat = np.diff(dfb.Delta0)  # time differences between points
tbar = np.mean(deltat)

print('tbar = %.2f [days]' % tbar)
print('min, max = %.2f, %.2f [days]' % (np.min(deltat), np.max(deltat)))
print()

# Confirm the mean using histogram data (mean = the expected value)
y, x = np.histogram(deltat, bins=np.arange(np.ceil(np.max(deltat)+2)), density=True)
hmean = np.sum(x[:-1]*y)
print('E(dt) = %.2f [days]' % hmean)

# plot histogram with a fitted Exp PDF
plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c)
plt.xlim([0, 20])
plt.title('Time Differences between Black Swan Events', fontsize=12)
plt.xlabel('Days', fontsize=12)
plt.savefig('fig05.png', format='png')
```

which returns

```
tbar = 17.38 [days]
min, max = 0.00, 950.00 [days]
E(dt) = 17.38 [days]
```

What is worth noticing is the fact that the largest recorded separation among historical black swans at NYSE was 950 days, i.e. 2.6 years.
By plotting the distribution of time differences, we get a histogram that can be read out more clearly if we type, e.g.:

```python
# the first 5 data points from the histogram
print(y[0:5])
print(x[0:5])
```

returning

```
[ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]
[ 0.  1.  2.  3.  4.]
```

which means that our data suggest there is a 31.0% probability of observing black swan events on the same day, 12.3% with a 1-day gap, 8.5% with a 2-day separation, and so on.

Since MIBE $\beta$ is not given by a single number, the recipe we have for its estimation (see the formula above) allows us only to model the Mean Interval Between Black Swan Events, as follows. First, we calculate the corresponding standard deviation:

```python
ne = len(dfb)                 # 627, the number of black swans in the data sample
sig = np.std(deltat, ddof=1)  # standard deviation

print(tbar, sig)
print(tbar, sig/np.sqrt(ne-1))
```

```
15.8912 57.5940828519
15.8912 2.30192251209
```

and next

```python
betaMIBE_sample = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=10000)

plt.figure(figsize=(9, 5))
plt.hist(betaMIBE_sample, bins=25, color=c)
plt.xlabel("MIBE beta [days]", fontsize=12)
plt.title("Mean Interval Between Black Swan Events", fontsize=12)
plt.savefig('fig06.png', format='png')
```

3.2. Time Till Next Black Swan Event

Once we have estimated the MIBE $\beta$, it is possible to use its value in order to estimate the time till the next black swan event. It obeys
$$ \mbox{TTNE} = \mbox{Expon}(\beta) $$
i.e. the exponential distribution given by the probability density function of:
$$ f(x; \beta) = \lambda \exp(-\lambda x) = \frac{1}{\beta} \exp\left(-\frac{x}{\beta}\right) \ . $$
In Python we model TTNE in the following way:

```python
betaMIBE_one = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1)

# The time till next event = Expon(beta)
ttne = expon.rvs(loc=0, scale=betaMIBE_one, size=100000)

y, x = np.histogram(ttne, bins=np.arange(int(np.max(ttne))), density=True)

plt.figure(figsize=(9, 5))
plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
plt.xlim([0, 120])
plt.title("Time Till Next Black Swan Event", fontsize=12)
plt.xlabel("Days", fontsize=12)
plt.savefig('fig07.png', format='png')
```

where by checking

```python
print(y[1:5])
print(x[1:5])
```

we get

```
[0.05198052 0.0498905  0.04677047 0.04332043]
[1 2 3 4]
```

i.e. there is a 5.20% probability that the next black swan event at NYSE will occur on the next day, 4.99% in 2 days, 4.68% in 3 days, etc.

3.3. The Expected Number of Black Swans per Business Day

The distribution of the number of black swan events occurring at NYSE per unit time (here, per business day) can be modelled using the Poisson distribution:
$$ \mbox{Poisson}(\lambda) \ \ \mbox{where} \ \ \lambda = \frac{1}{\beta} \ . $$
The Poisson distribution models the number of occurrences of an event in a time $T$ when the time between successive events follows a Poisson process. If $\beta$ is the mean time between events, as used by the Exponential distribution, then $\lambda = T/\beta$. We will use it in the "out-of-sample" backtesting in Section 4.
Let's turn words into action by executing the following Python code:

```python
# generate a set of Poisson rvs
nep = poisson.rvs(1/betaMIBE_one, size=100000)

y, x = np.histogram(nep, bins=np.arange(np.max(nep)+1), density=True)

plt.figure(figsize=(9, 5))
plt.title("The Expected Number of Black Swans per Business Day", fontsize=12)
plt.xlabel("N", fontsize=12)
plt.ylabel("Probability", fontsize=12)
plt.xticks(x)
plt.xlim([-0.1, 2.1])
plt.bar(x[:-1], y, width=0.05, color='r', edgecolor='r', align='center')
plt.savefig('fig08.png', format='png')
```

The derived probabilities:

```python
print(y)
print(x[:-1])
```

```
[0.95465 0.04419 0.00116]
[0 1 2]
```

inform us that there is a 95.5% chance of no black swan event on a given business day, 4.4% that we will record only one, and 0.12% that we may find 2 out of 2500+ NYSE stocks suffering a loss greater than 25.7% on the next business day (unit time).

3.4. Probability of the Occurrence of Several Events in an Interval

The Poisson$(1/\beta)$ distribution calculates the distribution of the number of events that will occur in a single unit interval. The probability of exactly $k$ black swan events in such an interval is given by:
$$ \mbox{Pr}[X = k] = \frac{\lambda^k}{k!} \exp(-\lambda) = \frac{1}{\beta^k k!} \exp\left(-\frac{1}{\beta}\right) $$
with $\beta$ rescaled to reflect the period of exposure, e.g.:
$$ \beta_{yr} = \beta/252 $$
where 252 stands for the number of business days in a calendar year. Therefore, the number of black swan events in the next business year can be modelled by:
$$ N = \mbox{Poisson}(\lambda) = \mbox{Poisson}(1/\beta_{yr}) = \mbox{Poisson}(252/\mbox{Normal}(\bar{t}, \sigma/\sqrt{n-1})) $$
which we can achieve in Python by coding:

```python
inNdays = 252

N = poisson.rvs(inNdays/norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1)), size=100000)
exN = np.round(np.mean(N))
stdN = np.round(np.std(N, ddof=1))

y, x = np.histogram(N, bins=np.arange(int(np.max(N))), density=True)

_ = plt.figure(figsize=(9, 5))
_ = plt.title("The Expected Number of Black Swans in next %g Business Days" % inNdays, fontsize=12)
_ = plt.xlabel("N", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
_ = plt.bar(x[:-1], y, width=1, color=c)
plt.savefig('fig09.png', format='png')

# probability of observing <N> events in next 'inNdays' days
print('E(N) = %g' % exN)
print('stdN = %g' % stdN)

tmp = [(i,j) for i, j in zip(x,y) if i == exN]
print('Pr(X=%g) = %.4f' % (exN, tmp[0][1]))
tmp = [(i,j) for i, j in zip(x,y) if i == 0]
print('Pr(X=%g) = %.4f' % (0, tmp[0][1]))
```

Our Monte Carlo simulation may return, for instance, the chart above, where the calculated measures are:

```
E(N) = 16
stdN = 4
Pr(X=16) = 0.0992
Pr(X=0) = 0.0000
```

The interpretation is straightforward. On average, we expect to record $16\pm 4$ black swan events in the next 252 business days among the 2500+ stocks traded at NYSE as of Apr 10, 2015 COB. The probability of observing exactly 16 black swans is 9.92%. It is natural, based on our data analysis, that the resultant probability of the "extreme luck" of not having any black swan at NYSE, Pr$(X=0)$, in the following trading year is zero. But, miracles happen! The 9.92% is derived based on a single run of the simulation.
Using the analytical formula for Pr$(X=N)$ given above, we compute the uncertainty of
$$ \mbox{Pr}[X = (N=16)] = \frac{1}{(\beta_{yr})^{16}\ 16!} \exp\left(-\frac{1}{\beta_{yr}}\right) \ , $$

```python
Pr = []
for nsim in range(100000):
    betayr = norm.rvs(loc=tbar, scale=sig/np.sqrt(ne-1), size=1) / inNdays
    p = 1/(betayr**exN * factorial(int(exN))) * np.exp(-1/betayr)
    Pr.append(p)

print('Pr[X = E(N) = %g] = %.2f +- %.2f' % (exN, np.mean(Pr), np.std(Pr, ddof=1)))
```

which yields

```
Pr[X = E(N) = 16] = 0.08 +- 0.02
```

and stays in agreement with our result.

4. Prediction and Out-of-Sample Backtesting

All right, time to verify the theoretical statistics in practice. Our goal is to analyse the sample of black swan data skipped so far but available to us. This procedure is usually known as "out-of-sample" backtesting, i.e. we know "the future" as it has already happened, but we pretend it is unknown. In the first step, we look for a suitable time interval of exactly 252 business days "in the future". It turns out to be:

```python
print(np.busday_count(dt.date(2015, 4, 10), dt.date(2016, 3, 29)))  # 252
```

Next, we extract the out-of-sample (oos) DataFrame and illustrate it in a similar fashion as previously for the in-sample data:

```python
oos = dfblack0[(dfblack0.index >= dt.date(2015, 4, 10)) &
               (dfblack0.index <= dt.date(2016, 3, 29))]

# extract black swans
oos = oos[oos.Return < locG-scaleG]

plt.figure(figsize=(13, 6))
plt.plot(oos.Return*100, '.b')
plt.xlabel('Time Interval [%s to %s]' % (dfblack.index[0], oos.index[-1]), fontsize=14)
plt.ylabel('Black Swan Magnitude [%]', fontsize=14)
plt.title('Time Distribution of Out-of-Sample Black Swans of %g NYSE Stocks (L < %.1f%%)' % (len(oos), (locG-scaleG)*100), fontsize=14)
plt.xticks(fontsize = 14)
plt.yticks(fontsize = 14)
plt.ylim([-100, 0])
plt.xlim([dfblack.index[0], oos.index[-1]])
plt.savefig('fig10.png', format='png')
```

Based on that, we compare the MIBE as predicted with what the next 252 business days reveal:

```python
deltat_oos = np.diff(oos.Delta0)  # time differences between points
tbar_oos = np.mean(deltat_oos)

print('MIBE (predicted)   = %.2f +- %.2f [days]' % (tbar, sig/np.sqrt(ne-1)))
print('MIBE out-of-sample = %.2f [days]' % tbar_oos)
print()

y, x = np.histogram(deltat_oos, bins=np.arange(np.ceil(np.max(deltat)+2)), density=True)
print(y[0:5])  # [ 0.3814433  0.20618557  0.17525773  0.06185567  0.05154639]
print(x[0:5])  # [ 0.  1.  2.  3.  4.]

# Predicted probabilities were:
# [ 0.30990415  0.12300319  0.08466454  0.04472843  0.03354633]
```

The outcome surprises, i.e.

```
MIBE (predicted)   = 17.38 +- 2.30 [days]
MIBE out-of-sample =  2.32 [days]
```

clearly indicating that the Apr 2015 to Apr 2016 period was nearly 7.5 times more "active" than the "all-time" NYSE black swan data would predict. Have I already said "it's alarming"?! Well, at least scary enough for all who trade on a daily basis holding portfolios rich in NYSE assets. The probability of having black swans on the same day increased from 31% to 38%. A new, pure definition of sophisticated gambling. A verification of the time till the next black swan event requires a gentle modification of the oos DataFrame.
Namely, we need to eliminate all data points with the same Delta0's:

```python
oos2 = oos.Delta0.drop_duplicates(inplace=False)
tdiff = np.diff(oos2)

y, x = np.histogram(tdiff, bins=np.arange(int(np.max(tdiff))), density=True)

_ = plt.figure(figsize=(9, 5))
_ = plt.bar(x[:-1], y, width=1, color=c, edgecolor=c)
_ = plt.xlim([1, 30])
_ = plt.title("Time Till Next Black Swan Event (Out-of-Sample)", fontsize=12)
_ = plt.xlabel("Days", fontsize=12)
_ = plt.ylabel("Probability", fontsize=12)
plt.savefig('fig11.png', format='png')

print(y[1:5])
print(x[1:5])
```

Again, from our modelling of the time till the next event, Expon($\beta$), we expected about a 5% chance of waiting a day, two, or three for the next black swan event. The out-of-sample data suggest,

```
[0.33898305 0.28813559 0.10169492 0.08474576]
[1 2 3 4]
```

that we are 6.3 times more likely to record a black swan event at NYSE on the next business day. Terrifying.

The calculation of the expected number of black swans per business day in the out-of-sample data I'm leaving to you as homework in Python programming. Our earlier prediction of those events in the next 252 business days returned $16\pm4$. As we have already seen, in the considered time interval after Apr 10, 2015 we recorded 98 swans, which is $6.1^{+2.1}_{-1.2}$ times more than forecast based on the broad sample.

In the next post of this series, we will try to develop a methodology for a quantitative derivation of the probability of a black swan for an $N$-asset portfolio.

Reflection

Investors invest in stocks with the anticipation of earning profit. As we have witnessed above, the birds have become a serious threat to flying high, with a significant increase in this investment air traffic. Your plane (stock) can be grounded and your airline (portfolio) can suffer huge losses. Hoping is not a solution. Better ways of protecting against the unavoidable should be implemented while your money is in the game. But what should you do when a black swan strikes? Denzel Washington once said: If you pray for rain, you gotta deal with the mud too.

## Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)

This part is awesome. Trust me!

Previously, in Part 1, we examined two independent methods in order to estimate the probability of a very rare event (heavy or extreme loss) that an asset could experience on the next day. The first one was based on the computation of the tail probability, i.e.: given the upper threshold $L_{thr}$, find the corresponding $\alpha$ such that $Pr(L < -L_{thr}) = \alpha$, where the random variable $L$ referred to a daily asset return (in percent). In this case, we showed that this approach was strongly dependent on the number of similar events in the past that would contribute to the changes of $\alpha$'s in time. The second method was based on the classical concept of conditional probability. For example, given that the index (benchmark, etc.) drops by $L'$, what is the probability that an asset will lose $L\le L'$? We found that this method fails if the benchmark never experienced a loss of $L'$. In both cases, the lack of information on the occurrence of a rare event was the major problem. Therefore, is there any tool we can use to predict an event that has not happened yet?! This post addresses an appealing solution and delivers a black box that every portfolio holder can use as an integrated part of a risk management computer system.
1. Bayesian Conjugates

Bayesian statistics treats the problem of finding the probability of an event a little bit differently than what we know from school. For example, if we flip a fair coin, in the long run we expect the same proportion of heads and tails. Formally, this is known as the frequentist interpretation. The Bayesian interpretation assumes nothing about the coin (whether it is fair or weighted) and, with every toss of the coin, builds a grander picture of the underlying probability distribution. And it's fascinating, because we can start with the assumption that the coin is made such that we should observe heads more frequently, whereas after $N$ tosses we may find that the coin is fair. Let me skip the theoretical part on that aspect of Bayesian statistics, which has been beautifully explained by Mike Halls-Moore within his two articles (Bayesian Statistics: A Beginner's Guide and Bayesian Inference Of A Binomial Proportion - The Analytical Approach) that I recommend you study before reading further.

Looking along the return series of any asset (spanning $M$ days), we may denote by $x_i = 1$ ($i=1,...,M$) a day when a rare event took place (e.g. the asset lost more than 16% or so) and by $x_i = 0$ the lack of such an event. Treating days as a series of Bernoulli trials with unknown probability $p$, if we start with an uninformed prior, such as Beta($\alpha, \beta$) with $\alpha=\beta=1$, then the concept of Bayesian conjugates can be applied in order to update the prior belief. The update is given by:
$$ \alpha' = \alpha + n_e = \alpha + \sum_{i=1}^{N\le M} x_i, \qquad \beta' = \beta + N - n_e = \beta + N - \sum_{i=1}^{N\le M} x_i $$
which simply means that $N$ days after some initial moment in the past, having observed a rare event $n_e$ times, the probability distribution is given by Beta($\alpha', \beta'$). Therefore, the posterior predictive mean for the Bernoulli likelihood can be interpreted as the new probability of an event taking place on the next day:
$$ \mbox{Pr}(L < -L_{thr}) = \frac{\alpha'}{\alpha' + \beta'} $$
And that's the key concept we are going to use now for the estimation of the probability of heavy or extreme losses that an asset can experience in trading.

As a quick visualisation, imagine you analyse the last 252 trading days of the AAPL stock. The stock never lost more than 90% in one day. No such extreme event took place. According to our new method, the probability that AAPL will plummet 90% (or more) in the next trading session is not zero(!) but
$$ \mbox{Pr}(L < -90\%) = \frac{\alpha'}{\alpha' + \beta'} = \frac{1+0}{(1+0)+(1+252-0)} = 0.3937\% $$
with the 95% confidence interval:
$$ \left[ B^{-1}(0.05, \alpha', \beta');\ B^{-1}(0.95, \alpha', \beta') \right] $$
where $B^{-1}$ denotes the percent point function (quantile function) of the Beta distribution. The estimate of the number of events corresponding to a loss of $L < -L_{thr}$ during the next 252 trading days we find by:
$$ E(n_e) = \left\lceil 252 \cdot B^{-1}(0.95, \alpha', \beta') \right\rceil $$
where $\lceil \rceil$ denotes the operation of rounding.
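For the hypothetical AAPL example above, a minimal numeric check of these formulas with SciPy's beta distribution could read as follows (the 252-day window and the -90% threshold are taken directly from the example):

```python
import numpy as np
from scipy.stats import beta

# a minimal sketch verifying the hypothetical AAPL example above:
# N = 252 observed days, ne = 0 events of a loss worse than -90%
alpha0 = beta0 = 1                  # uninformed prior Beta(1, 1)
N, ne = 252, 0

alpha1 = alpha0 + ne                # posterior alpha' = 1
beta1 = beta0 + N - ne              # posterior beta'  = 253

pr = alpha1/(alpha1 + beta1)        # posterior predictive mean: 1/254, about 0.39%
cl1 = beta.ppf(0.05, alpha1, beta1)
cl2 = beta.ppf(0.95, alpha1, beta1)

print("Pr(L < -90%%) = %.4f%%  [%.4f%%, %.4f%%]" % (pr*100, cl1*100, cl2*100))
print("E(ne) over next 252 days = %g" % np.ceil(252*cl2))
```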
2. Analysis of Facebook since its IPO

Having a tool that incorporates the information whether an event took place or not is powerful. You may also notice that the more information we gather (i.e. $N\to\infty$), the closer we are to the "true" probability function. It also means that the spread around the posterior predictive mean shrinks to zero:
$$ \sigma = \sqrt{ \frac{\alpha' \beta'} { (\alpha'+\beta')^2 (\alpha'+\beta'+1) } } \ \to\ 0 \ . $$
It is logical that the analysis of an asset traded daily for the past 19 years will deliver a "more trustworthy" estimation of $\mbox{Pr}(L < -L_{thr})$ than a situation where we take into account only a limited period of time. However, what if the stock is a newbie in the market?

Let's investigate the case of Facebook (NASDAQ:FB). The record we have spans a bit over 3.5 years, back to Facebook's initial public offering (IPO) on May 18th, 2012. From that day until today (December 6, 2015) the return series (daily data) is $M=892$ points long. Using the analytical formulations given above and writing a Python code:

```python
# Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (2)
# (c) 2015 Pawel Lachowicz, QuantAtRisk.com

import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web

# Fetching Yahoo! Finance for Facebook
data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-04')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)

# Plotting FB price- and return-series
plt.figure(num=2, figsize=(9, 6))
plt.subplot(2, 1, 1)
plt.plot(cp)
plt.axis("tight")
plt.ylabel("FB Adj Close [USD]")
plt.subplot(2, 1, 2)
plt.plot(ret, color=(.6, .6, .6))
plt.axis("tight")
plt.ylabel("Daily Returns")
plt.xlabel("Time Period [days]")
#plt.show()

# provide a threshold
Lthr = -0.21

# how many events of L < -21% occurred?
ne = np.sum(ret < Lthr)
# of what magnitude?
ev = ret[ret < Lthr]
# avgloss = np.mean(ev)  # if ev is a non-empty array

# prior
alpha0 = beta0 = 1
# posterior
alpha1 = alpha0 + ne
beta1 = beta0 + N - ne
pr = alpha1/(alpha1+beta1)
cl1 = beta.ppf(0.05, alpha1, beta1)
cl2 = beta.ppf(0.95, alpha1, beta1)
ne252 = np.round(252/(1/cl2))

print("ne = %g" % ne)
print("alpha', beta' = %g, %g" % (alpha1, beta1))
print("Pr(L < %3g%%) = %5.2f%%\t[%5.2f%%, %5.2f%%]" % (Lthr*100, pr*100, cl1*100, cl2*100))
print("E(ne) = %g" % ne252)
```

first we visualise both the price and return series, and from the computations we find:

```
ne = 0
alpha', beta' = 1, 893
Pr(L < -21%) =  0.11%	[ 0.01%,  0.33%]
E(ne) = 1
```

There is no record, within the 892 days of FB trading, of an event where the stock lost 21% or more in a single day ($n_e = 0$). However, the probability that such an event may happen at the end of the next NASDAQ session is 0.11%. Non-zero, but nearly zero.

A quick verification of the probability estimate returned by our Bayesian method is obtained by changing the value of the threshold from -0.21 to Lthr = 0. Then we get:

```
alpha', beta' = 422, 472
Pr(L <   0%) = 47.20%	[44.46%, 49.95%]
E(ne) = 126
```

i.e. a 47.2% chance that on the next day the stock will close lower. It is in excellent agreement with what can be calculated directly from the return series: there were 421 days with negative daily returns, thus $421/892 = 0.4719$. Really cool, don't you think?!
Now, imagine for a second that starting a few days after the Facebook IPO we recalculate the posterior predictive mean day by day, eventually reaching the most current estimate of the probability that the stock may lose 21% or more at the end of the next trading day. Such a procedure allows us to plot the change of $\mbox{Pr}(L < -L_{thr})$ over the whole trading history of FB. We achieve that by running a modified version of the previous code:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web

data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)     # daily adj-close prices
returns = cp[1:]/cp[:-1] - 1   # compute daily returns

Lthr = -0.21

Pr = CL1 = CL2 = np.array([])
for i in range(2, len(returns)):
    # data window
    ret = returns[0:i]
    N = len(ret)
    ne = np.sum(ret < Lthr)
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    #
    Pr = np.concatenate([Pr, np.array([pr*100])])
    CL1 = np.concatenate([CL1, np.array([cl1*100])])
    CL2 = np.concatenate([CL2, np.array([cl2*100])])

plt.figure(num=1, figsize=(10, 5))
plt.plot(Pr, "k")
plt.plot(CL1, "r--")
plt.plot(CL2, "r--")
plt.axis("tight")
plt.ylim([0, 5])
plt.grid((True))
plt.show()
```

which displays the evolving estimate: after 892 days we reach the value of 0.11% as found earlier. The black line tracks the change of the probability estimate while both red dashed lines denote the 95% confidence interval. The black line is smooth: because of the lack of a -21% event in the past, the recalculated probability did not change its value "drastically". This is not the case when we consider the scenario of Lthr = -0.11, delivering a plot that stays in agreement with:

```
ne = 1
alpha', beta' = 2, 892
Pr(L < -11%) =  0.22%	[ 0.04%,  0.53%]
E(ne) = 1
```

and confirms that only one daily loss of $L < -11\%$ took place during the entire happy life of Facebook on NASDAQ so far.
Since the computer can find the number of losses falling within a certain interval, that gives us an opportunity to estimate:
$$ \mbox{Pr}(L_1 \le L < L_2) $$
in the following way:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt
import pandas_datareader.data as web

data = web.DataReader("FB", data_source='yahoo',
                      start='2012-05-18', end='2015-12-05')['Adj Close']
cp = np.array(data.values)  # daily adj-close prices
ret = cp[1:]/cp[:-1] - 1    # compute daily returns
N = len(ret)

dL = 2
for k in range(-50, 0, dL):
    ev = ret[(ret >= k/100) & (ret < (k/100+dL/100))]
    avgloss = 0
    ne = np.sum((ret >= k/100) & (ret < (k/100+dL/100)))
    #
    # prior
    alpha0 = beta0 = 1
    # posterior
    alpha1 = alpha0 + ne
    beta1 = beta0 + N - ne
    pr = alpha1/(alpha1+beta1)
    cl1 = beta.ppf(0.05, alpha1, beta1)
    cl2 = beta.ppf(0.95, alpha1, beta1)
    if(len(ev) > 0):
        avgloss = np.mean(ev)
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%] ne = %4g"
              " E(L) = %.2f%%" %
              (k, k+dL, pr*100, cl1*100, cl2*100, ne, avgloss*100))
    else:
        print("Pr(%3g%% < L < %3g%%) =\t%5.2f%%\t[%5.2f%%, %5.2f%%] ne = "
              "%4g" % (k, k+dL, pr*100, cl1*100, cl2*100, ne))
```

By doing so, we visualise the distribution of potential losses (that may occur on the next trading day) along with their magnitudes:

```
Pr(-50% < L < -48%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-48% < L < -46%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-46% < L < -44%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-44% < L < -42%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-42% < L < -40%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-40% < L < -38%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-38% < L < -36%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-36% < L < -34%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-34% < L < -32%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-32% < L < -30%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-30% < L < -28%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-28% < L < -26%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-26% < L < -24%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-24% < L < -22%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-22% < L < -20%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-20% < L < -18%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-18% < L < -16%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-16% < L < -14%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-14% < L < -12%) =	 0.11%	[ 0.01%,  0.33%] ne =    0
Pr(-12% < L < -10%) =	 0.34%	[ 0.09%,  0.70%] ne =    2 E(L) = -11.34%
Pr(-10% < L <  -8%) =	 0.67%	[ 0.29%,  1.17%] ne =    5 E(L) = -8.82%
Pr( -8% < L <  -6%) =	 0.89%	[ 0.45%,  1.47%] ne =    7 E(L) = -6.43%
Pr( -6% < L <  -4%) =	 2.91%	[ 2.05%,  3.89%] ne =   25 E(L) = -4.72%
Pr( -4% < L <  -2%) =	 9.84%	[ 8.26%, 11.53%] ne =   87 E(L) = -2.85%
Pr( -2% < L <   0%) =	33.11%	[30.54%, 35.72%] ne =  295 E(L) = -0.90%
```

For example, based on the complete trading history of FB, there is a probability of 0.34% that the stock will close with a loss between -12% and -10%, and if so, the "expected" loss would be -11.3% (estimated based on only 2 events). Again: probability, not certainty.

## Student t Distributed Linear Value-at-Risk

One of the most underestimated features of financial asset distributions is their kurtosis. A rough approximation of the asset return distribution by the Normal distribution often becomes an evident exaggeration or misinterpretation of the facts. And we know that. The problem arises when we investigate a Value-at-Risk (VaR) measure.
Within a standard approach, it is computed based on the analytical formula:
$$ \mbox{VaR}_{h, \alpha} = \Phi ^{-1} (1-\alpha)\sqrt{h}\sigma - h\mu $$
which expresses the $(1-\alpha)$ $h$-day VaR, where $\Phi ^{-1} (1-\alpha)$ denotes the standardised Normal distribution's $(1-\alpha)$ quantile and $h$ is the time horizon. The majority of stocks display significant excess kurtosis and negative skewness. In plain English that means that the distribution is peaked and likely to have a heavy left tail. In this scenario, the best fit of the Normal probability density function (pdf) to the asset return distribution underestimates the risk accumulated in far negative territory.

In this post we will introduce the concept of the Student t Distributed Linear VaR, i.e. an alternative method to measure VaR that takes into account "a correction" for the kurtosis of the asset return distribution. We will use the market stock data of IBM as an exemplary case study and investigate the difference between standard and non-standard VaR calculations based on the parametric models. It will also be an excellent opportunity to learn how to do it in Python, quickly and effectively.

1. Leptokurtic Distributions and Student t VaR

A leptokurtic distribution is one whose density function has a higher peak and greater mass in the tails than the normal density function of the same variance. In a symmetric unimodal distribution, i.e. one whose density function has only one peak, leptokurtosis is indicated by a positive excess kurtosis (Alexander 2008). The impact of leptokurtosis on VaR is non-negligible. Firstly, assuming high significance levels, e.g. $\alpha \le 0.01$, the estimation of VaR for the Normal and a leptokurtic distribution would return:
$$ \mbox{VaR}_{lepto,\ \alpha\le 0.01} \gt \mbox{VaR}_{Norm,\ \alpha\le 0.01} $$
which is fine; however, for low significance levels the situation may change:
$$ \mbox{VaR}_{lepto,\ \alpha\ge 0.05} \lt \mbox{VaR}_{Norm,\ \alpha\ge 0.05} \ . $$
A good example of a leptokurtic distribution is the Student t distribution. If it describes the data better and displays significant excess kurtosis, there is a good chance that VaR at high significance levels will be a much better measure of the tail risk. The Student t distribution is formally given by its pdf:
$$ f_\nu(x) = \left(\sqrt{\nu\pi}\ \Gamma\left(\frac{\nu}{2}\right)\right)^{-1} \Gamma\left(\frac{\nu+1}{2}\right) (1+\nu^{-1} x^2)^{-(\nu+1)/2} $$
where $\Gamma$ denotes the gamma function and $\nu$ the degrees of freedom. For $\nu\gt 2$ its variance is $\nu(\nu-2)^{-1}$, and the distribution has a finite excess kurtosis of $\kappa = 6(\nu-4)^{-1}$ for $\nu \gt 4$. It is obvious that for $\nu\to \infty$ the Student t pdf approaches the Normal pdf. Large values of $\nu$ for the fitted Student t pdf are unlikely in the case of financial asset daily return distributions, unless the Normal distribution is indeed the best pdf model (possible to obtain for very large samples). We find the $\alpha$ quantile of the Student t distribution (for $\mu=0$ and $\sigma=1$) by integration of the pdf, and we usually denote it as $\sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(\alpha)$, where $t_{\nu}^{-1}$ is the percent point function of the t distribution.
Accounting for the fact that in practice we deal with random variables of the form of$X = \mu + \sigma T$where$X$would describe the asset return given as a transformation of the standardised Student t random variable, we reach to the formal definition of the$(1-\alpha)h$-day Student t VaR given by: $$\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha, h} = \sqrt{\nu^{-1}(\nu-2)h}\ t_{\nu}^{-1}(1-\alpha)\sigma – h\mu$$ Having that, we are ready to use this knowledge in practice and to understand, based on data analysis, the best regimes where one should (and should not) apply the Student t VaR as a more “correct” measure of risk. 2. Normal and Student t VaR for IBM Daily Returns Within the following case study we will make of use of Yahoo! Finance data provider in order to fetch the price-series of the IBM stock (adjusted close) and transform it into return-series. As the first task we have to verify before heading further down this road will be the sample skewness and kurtosis. We begin, as usual: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # Student t Distributed Linear Value-at-Risk # The case study of VaR at the high significance levels # (c) 2015 Pawel Lachowicz, QuantAtRisk.com import numpy as np import math from scipy.stats import skew, kurtosis, kurtosistest import matplotlib.pyplot as plt from scipy.stats import norm, t import pandas_datareader.data as web # Fetching Yahoo! Finance for IBM stock data data = web.DataReader("IBM", data_source='yahoo', start='2010-12-01', end='2015-12-01')['Adj Close'] cp = np.array(data.values) # daily adj-close prices ret = cp[1:]/cp[:-1] - 1 # compute daily returns # Plotting IBM price- and return-series plt.figure(num=2, figsize=(9, 6)) plt.subplot(2, 1, 1) plt.plot(cp) plt.axis("tight") plt.ylabel("IBM Adj Close [USD]") plt.subplot(2, 1, 2) plt.plot(ret, color=(.6, .6, .6)) plt.axis("tight") plt.ylabel("Daily Returns") plt.xlabel("Time Period 2010-12-01 to 2015-12-01 [days]") where we cover the most recent past 5 years of trading of IBM at NASDAQ. Both series visualised look like: where, in particular, the bottom one reveals five days where the stock experienced 5% or higher daily loss. In Python, a correct computation of the sample skewness and kurtosis is possible thanks to the scipy.stats module. As a test you may verify that for a huge sample (e.g. of 10 million) of random variables$X\sim N(0, 1)$, the expected skewness and kurtosis, from scipy.stats import skew, kurtosis X = np.random.randn(10000000) print(skew(X)) print(kurtosis(X, fisher=False)) is indeed equal 0 and 3, 0.0003890933008049 3.0010747070253507 respectively. As we will convince ourselves in a second, this is not the case for our IBM return sample. Continuing, 31 32 33 34 35 36 37 38 39 40 41 print("Skewness = %.2f" % skew(ret)) print("Kurtosis = %.2f" % kurtosis(ret, fisher=False)) # H_0: the null hypothesis that the kurtosis of the population from which the # sample was drawn is that of the normal distribution kurtosis = 3(n-1)/(n+1) _, pvalue = kurtosistest(ret) beta = 0.05 print("p-value = %.2f" % pvalue) if(pvalue < beta): print("Reject H_0 in favour of H_1 at %.5f level\n" % beta) else: print("Accept H_0 at %.5f level\n" % beta) we compute the moments and run the (excess) kurtosis significance test. As already explained in the comment, we are interested whether the derived value of kurtosis is significant at$\beta=0.05$level or not? 
We find that for IBM, the results reveal: Skewness = -0.70 Kurtosis = 8.39 p-value = 0.00 Reject H_0 in favour of H_1 at 0.05000 level id est, we reject the null hypothesis that the kurtosis of the population from which the sample was drawn is that of the normal distribution kurtosis in favour of the alternative hypothesis ($H_1$) that it is not. And if not then we have work to do! In the next step we fit the IBM daily returns data sample with both Normal pdf and Student t pdf in the following way: 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 # N(x; mu, sig) best fit (finding: mu, stdev) mu_norm, sig_norm = norm.fit(ret) dx = 0.0001 # resolution x = np.arange(-0.1, 0.1, dx) pdf = norm.pdf(x, mu_norm, sig_norm) print("Integral norm.pdf(x; mu_norm, sig_norm) dx = %.2f" % (np.sum(pdf*dx))) print("Sample mean = %.5f" % mu_norm) print("Sample stdev = %.5f" % sig_norm) print() # Student t best fit (finding: nu) parm = t.fit(ret) nu, mu_t, sig_t = parm pdf2 = t.pdf(x, nu, mu_t, sig_t) print("Integral t.pdf(x; mu, sig) dx = %.2f" % (np.sum(pdf2*dx))) print("nu = %.2f" % nu) print() where we ensure that the complete integral over both pdfs return 1, as expected: Integral norm.pdf(x; mu_norm, sig_norm) dx = 1.00 Sample mean = 0.00014 Sample stdev = 0.01205 Integral t.pdf(x; mu, sig) dx = 1.00 nu = 3.66 The best fit of the Student t pdf returns the estimate of 3.66 degrees of freedom. From this point, the last step is the derivation of two distinct VaR measure, the first one corresponding to the best fit of the Normal pdf and the second one to the best fit of the Student t pdf, respectively. The trick is 61 62 63 64 65 66 67 68 69 # Compute VaR h = 1 # days alpha = 0.01 # significance level StudenthVaR = (h*(nu-2)/nu)**0.5 * t.ppf(1-alpha, nu)*sig_norm - h*mu_norm NormalhVaR = norm.ppf(1-alpha)*sig_norm - mu_norm lev = 100*(1-alpha) print("%g%% %g-day Student t VaR = %.2f%%" % (lev, h, StudenthVaR*100)) print("%g%% %g-day Normal VaR = %.2f%%" % (lev, h, NormalhVaR*100)) that in the calculation of the$(1-\alpha)1$-day Student t VaR ($h=1$) according to: $$\mbox{Student}\ t\ \mbox{VaR}_{\nu, \alpha=0.01, h=1} = \sqrt{\nu^{-1}(\nu-2)}\ t_{\nu}^{-1}(0.99)\sigma – \mu$$ we have to take$\sigma$and$\mu$given by sig_norm and mu_norm variable (in the code) and not the ones returned by the t.fit(ret) function in line #54. We find out that: 99% 1-day Student t VaR = 3.19% 99% 1-day Normal VaR = 2.79% i.e. VaR derived based on the Normal pdf best fit may underestimate the tail risk for IBM, as suspected in the presence of highly significant excess kurtosis. Since the beauty of the code is aesthetic, the picture is romantic. 
We visualise the results of our investigation, extending our code with: 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 plt.figure(num=1, figsize=(11, 6)) grey = .77, .77, .77 # main figure plt.hist(ret, bins=50, normed=True, color=grey, edgecolor='none') plt.hold(True) plt.axis("tight") plt.plot(x, pdf, 'b', label="Normal PDF fit") plt.hold(True) plt.axis("tight") plt.plot(x, pdf2, 'g', label="Student t PDF fit") plt.xlim([-0.2, 0.1]) plt.ylim([0, 50]) plt.legend(loc="best") plt.xlabel("Daily Returns of IBM") plt.ylabel("Normalised Return Distribution") # inset a = plt.axes([.22, .35, .3, .4]) plt.hist(ret, bins=50, normed=True, color=grey, edgecolor='none') plt.hold(True) plt.plot(x, pdf, 'b') plt.hold(True) plt.plot(x, pdf2, 'g') plt.hold(True) # Student VaR line plt.plot([-StudenthVaR, -StudenthVaR], [0, 3], c='g') # Normal VaR line plt.plot([-NormalhVaR, -NormalhVaR], [0, 4], c='b') plt.text(-NormalhVaR-0.01, 4.1, "Norm VaR", color='b') plt.text(-StudenthVaR-0.0171, 3.1, "Student t VaR", color='g') plt.xlim([-0.07, -0.02]) plt.ylim([0, 5]) plt.show() what brings us to: where in the inset, we zoomed in the left tail of the distribution marking both VaR results. Now, it is much clearly visible why at$\alpha=0.01$the Student t VaR is greater: the effect of leptokurtic distribution. Moreover, the fit of the Student t pdf appears to be a far better parametric model describing the central mass (density) of IBM daily returns. There is an ongoing debate which one out of these two risk measures should be formally applied when it comes to reporting of 1-day VaR. I do hope that just by presenting the abovementioned results I made you more aware that: (1) it is better to have a closer look at the excess kurtosis for your favourite asset (portfolio) return distribution, and (2) normal is boring. Someone wise once said: “Don’t be afraid of being different. Be afraid of being the same as everyone else.” The same means normal. Dare to fit non-Normal distributions! DOWNLOAD tStudentVaR.py REFERENCES Alexander, C., 2008, Market Risk Analysis. Volume IV., John Wiley & Sons Ltd ## How to Get a List of all NASDAQ Securities as a CSV file using Python? This post will be short but very informative. You can learn a few good Unix/Linux tricks on the way. The goal is well defined in the title. So, what’s the quickest solution? We will make use of Python in the Unix-based environment. As you will see, for any text file, writing a single line of Unix commands is more than enough to deliver exactly what we need (a basic text file processing). If you try to do the same in Windows.. well, good luck! In general, we need to get through the FTP gate of NASDAQ heaven. It is sufficient to log on as an anonymous user providing your password defined by your email. In fact, any fake email will do the job. Let’s begin coding in Python: 1 2 3 4 5 6 7 8 9 10 # How to Get a List of all NASDAQ Securities as a CSV file using Python? # +tested in Python 3.5.0b2, Mac OS X 10.10.3 # # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz import os os.system("curl --ftp-ssl anonymous:jupi@jupi.com " "ftp://ftp.nasdaqtrader.com/SymbolDirectory/nasdaqlisted.txt " "> nasdaq.lst") Here we use os module from the Python’s Standard Library and a Unix command of curl. The latter allows us to connect to FTS server of NASDAQ exchange, fetch the file of nasdaqlisted.txt to be usually stored in the SymbolDirectory directory and download it directly to our current folder under a given name of nasdaq.lst. 
During that process you will see the progress information displayed by Python, e.g.:  % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 162 100 162 0 0 125 0 0:00:01 0:00:01 --:--:-- 125 100 174k 100 174k 0 0 23409 0 0:00:07 0:00:07 --:--:-- 39237 Now, in order to inspect the content of the downloaded file we may run in Python an extra line of code, namely: 12 13 os.system("head -20 nasdaq.lst") print() which displays first 20 lines from the top: <html> <head><title>403 Forbidden</title></head> <body bgcolor="white"> <center><h1>403 Forbidden</h1></center> <hr><center>nginx</center> </body> </html> Symbol|Security Name|Market Category|Test Issue|Financial Status|Round Lot Size AAIT|iShares MSCI All Country Asia Information Technology Index Fund|G|N|N|100 AAL|American Airlines Group, Inc. - Common Stock|Q|N|N|100 AAME|Atlantic American Corporation - Common Stock|G|N|D|100 AAOI|Applied Optoelectronics, Inc. - Common Stock|G|N|N|100 AAON|AAON, Inc. - Common Stock|Q|N|N|100 AAPC|Atlantic Alliance Partnership Corp. - Ordinary Shares|S|N|N|100 AAPL|Apple Inc. - Common Stock|Q|N|N|100 AAVL|Avalanche Biotechnologies, Inc. - Common Stock|G|N|N|100 AAWW|Atlas Air Worldwide Holdings - Common Stock|Q|N|N|100 AAXJ|iShares MSCI All Country Asia ex Japan Index Fund|G|N|N|100 ABAC|Aoxin Tianli Group, Inc. - Common Shares|S|N|N|100 ABAX|ABAXIS, Inc. - Common Stock|Q|N|N|100 As you can see, we are not interested in first 8 lines of our file. Before cleaning that mess, let’s inspect the “happing ending” as well: 15 16 os.system("tail -5 nasdaq.lst") print() displaying ZVZZT|NASDAQ TEST STOCK|G|Y|N|100 ZWZZT|NASDAQ TEST STOCK|S|Y|N|100 ZXYZ.A|Nasdaq Symbology Test Common Stock|Q|Y|N|100 ZXZZT|NASDAQ TEST STOCK|G|Y|N|100 File Creation Time: 0624201511:02||||| Again, we notice that the last line does not make our housewarming party more merrier. Given that information, we employ heavy but smart one-liner making use of immortal Unix commands of cat and sed in the pipe (pipeline process). Therefore, the next calling in our Python code does 3 miracles all-in-one shot. Have a look: 18 19 os.system("tail -n +9 nasdaq.lst | cat | sed '$d' | sed 's/|/ /g' > " "nasdaq.lst2")

If you view the output file of nasdaq.lst2 you will see its content to be exactly as we wanted it to be, i.e.:

$echo; head nasdaq.lst2; echo "..."; tail nasdaq.lst2 AAIT iShares MSCI All Country Asia Information Technology Index Fund G N N 100 AAL American Airlines Group, Inc. - Common Stock Q N N 100 AAME Atlantic American Corporation - Common Stock G N D 100 AAOI Applied Optoelectronics, Inc. - Common Stock G N N 100 AAON AAON, Inc. - Common Stock Q N N 100 AAPC Atlantic Alliance Partnership Corp. - Ordinary Shares S N N 100 AAPL Apple Inc. - Common Stock Q N N 100 AAVL Avalanche Biotechnologies, Inc. - Common Stock G N N 100 AAWW Atlas Air Worldwide Holdings - Common Stock Q N N 100 AAXJ iShares MSCI All Country Asia ex Japan Index Fund G N N 100 ... ZNGA Zynga Inc. - Class A Common Stock Q N N 100 ZNWAA Zion Oil & Gas Inc - Warrants G N N 100 ZSAN Zosano Pharma Corporation - Common Stock S N N 100 ZSPH ZS Pharma, Inc. - Common Stock G N N 100 ZU zulily, inc. - Class A Common Stock Q N N 100 ZUMZ Zumiez Inc. - Common Stock Q N N 100 ZVZZT NASDAQ TEST STOCK G Y N 100 ZWZZT NASDAQ TEST STOCK S Y N 100 ZXYZ.A Nasdaq Symbology Test Common Stock Q Y N 100 ZXZZT NASDAQ TEST STOCK G Y N 100 The command of tail -n +9 nasdaq.lst lists all lines of the file skipping first nine at the beginning. Next we push in a pipe that output and list it as a whole using cat command. In next step that output is processed by sed command which (a) removes the last line first; (b) the second one replaces all “|” tokens with “empty space” token. Finally, the processed output is saved as a nasdaq.lst2 file. The power of Unix in a single line. After 15 years of using it I’m still smiling to myself doing that :) All right. What is left? Getting a list of tickers and storing it into a CSV file. Piece of cake. Here we employ the Unix command of awk in the following way: 21 22 os.system("awk '{print$1}' nasdaq.lst2 > nasdaq.csv") os.system("echo; head nasdaq.csv; echo '...'; tail nasdaq.csv")

which returns

AAIT AAL AAME AAOI AAON AAPC AAPL AAVL AAWW AAXJ ... ZNGA ZNWAA ZSAN ZSPH ZU ZUMZ ZVZZT ZWZZT ZXYZ.A ZXZZT

i.e. an isolated list of NASDAQ tickers stored in nasdaq.csv file. From this point, you can read it into Python’s pandas DataFrame as follows:

24 25 26 27 import pandas as pd data = pd.read_csv("nasdaq.csv", index_col=None, header=None) data.columns=["Ticker"] print(data)

displaying

 Ticker 0 AAIT 1 AAL 2 AAME 3 AAOI 4 AAON 5 AAPC ...   [3034 rows x 1 columns]

That’s it.
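If you work on Windows (or simply prefer to stay within Python), the same post-processing can be sketched without any Unix tools at all; a rough, hypothetical equivalent of the tail/sed/awk pipeline above, assuming nasdaq.lst has already been downloaded and that, as in our dump, the junk occupies the first 8 lines and the time stamp the last one:

with open("nasdaq.lst") as f:
    lines = f.read().splitlines()

rows = [ln.split("|") for ln in lines[8:-1] if ln.strip()]   # drop header junk and footer
tickers = [r[0] for r in rows]                               # first field = ticker symbol

with open("nasdaq.csv", "w") as f:
    f.write("\n".join(tickers) + "\n")

It is hard to beat the elegance of the one-liner, though.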

In the following post, I will make use of that list to fetch the stock trading data and analyse the distribution of extreme values–the gateway to prediction of extreme and heavy losses for every portfolio holder (part 2 out of 3). Stay tuned!

nasdaqtickers.py

## Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders (1)

The probability of improbable events. The simplicity amongst complexity. The purity in its best form. The ultimate cure for those who trade, for those who invest. Does it exist? Can we compute it? Is it really something impossible? In this post we challenge ourselves to the frontiers of accessible statistics and data analysis in order to find most optimal computable solutions to this enigmatic but mind-draining problem. Since there is no certainty, everything remains shrouded in the veil of probability. The probability we have to face and hope that something unlikely, determined to take place anyhow, eventually will not happen.

Our goal is to calculate the probability of a very rare event (e.g. a heavy and/or extreme loss) in the trading market (e.g. of a stock plummeting 5% or much more) in a specified time-horizon (e.g. on the next day, in one week, in one month, etc.). The probability. Not the certainty of that event.

In this Part 1, first, we look at the tail of an asset return distribution and compress our knowledge on Value-at-Risk (VaR) to extract the essence required to understand why VaR-stuff is not the best card in our deck. Next, we move to a classical Bayes’ theorem which helps us to derive a conditional probability of a rare event given… yep, another event that (hypothetically) will take place. Eventually, in Part 2, we will hit the bull between its eyes with an advanced concept taken from the Bayesian approach to statistics and map, in real-time, for any return-series its loss probabilities. Again, the probabilities, not certainties.

1. VaR (not) for Rare Events

In the framework of VaR we take into consideration $T$ days of trading history of an asset. Next, we derive a number (VaR) that describes a loss that is likely to take place with the probability of approximately $\alpha$. “To take place” does not mean here that it will take place. In this approach we try to provide some likelihood (a quantitative measure) of the rare event in a specified time-horizon (e.g. on the next day if daily return-series are under our VaR investigation; the scenario considered in this post).

If by $L$ we denote a loss (in percent) an asset can experience on the next day, then:
$$\mbox{Pr}(L \le -\mbox{VaR}_{1-\alpha}) = \alpha$$ is the probability that the dollar loss $L\times D$ will be at least as severe as $-\mbox{VaR}_{1-\alpha}\times D$, with, for instance, $\alpha=0.05$ (a measure also referred to as the $(1-\alpha)$ VaR; $\mbox{VaR}_{95}$, etc.), where $D$ denotes the position size (money invested in the asset in terms of physical currency, e.g. in dollars). In other words, the historical data can help us to find $\mbox{VaR}_{95}$ given $\alpha$, assuming a 5% chance that $\mbox{VaR}_{95}$ will be exceeded on the next day.
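Before we dive into the data, the essence of the computation can be sketched in a few lines: the empirical $(1-\alpha)$ VaR is simply the $\alpha$-quantile of the return sample. A minimal illustration on a synthetic (leptokurtic) return-series, not on real data:

import numpy as np

np.random.seed(7)
ret = 0.02*np.random.standard_t(df=4, size=2500)   # synthetic daily returns, illustration only

alpha = 0.05
VaR95 = np.percentile(ret, 100*alpha)               # the alpha-quantile of the sample
print("95%% 1-day VaR = %.2f%%" % (VaR95*100))

In the remainder of this Section we will derive the same kind of number from a normalised histogram instead, which additionally lets us mark the tail bins in the plots.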

In order to illustrate that case and its shortcomings when it comes to the analysis of rare events, let’s look at the 10 year trading history of two stocks in the NASDAQ market: highly volatile CAAS (China Automotive Systems, Inc.; of the market capital of 247M) and highly liquid AAPL (Apple Inc.; of the market capital of 750B). First, we fetch their adjusted close price-series from Yahoo! Finance and derive the corresponding daily return-series utilising Python:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 # Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz # # heavy1.py   import pandas.io.data as web import matplotlib.pyplot as plt import numpy as np from pyvar import findvar, findalpha     # ---1. Data Processing   # fetch and download daily adjusted-close price series for CAAS # and AAPL stocks using Yahoo! Finance public data provider caas = web.DataReader("CAAS", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close'] aapl = web.DataReader("AAPL", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close']   CAAScp = np.array(caas.values) AAPLcp = np.array(aapl.values)   f = file("data1.dat","wb") np.save(f, CAAScp) np.save(f, AAPLcp) f.close()   # read in the data from a file f = file("data1.dat","rb") CAAScp = np.load(f) AAPLcp = np.load(f) f.close()   # compute return-series retCAAS = CAAScp[1:]/CAAScp[:-1]-1 retAAPL = AAPLcp[1:]/AAPLcp[:-1]-1

The best way to understand the data is by plotting them:

39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 # plotting (figure #1) # adjusted-close price-series fig, ax1 = plt.subplots(figsize=(10, 6)) plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.plot(CAAScp, '-r', label="CAAS") plt.axis("tight") plt.legend(loc=(0.02, 0.8)) plt.ylabel("CAAS Adj Close Price (US$)") ax2 = ax1.twinx() plt.plot(AAPLcp, '-', label="AAPL") plt.legend(loc=(0.02, 0.9)) plt.axis("tight") plt.ylabel("AAPL Adj Close Price (US$)")   # plotting (figure #2) # daily return-series plt.figure(num=2, figsize=(10, 6)) plt.subplot(211) plt.grid(True) plt.plot(retCAAS, '-r', label="CAAS") plt.axis("tight") plt.ylim([-0.25,0.5]) plt.legend(loc="upper right") plt.ylabel("CAAS daily returns") plt.subplot(212) plt.grid(True) plt.plot(retAAPL, '-', label="AAPL") plt.legend(loc="upper right") plt.axis("tight") plt.ylim([-0.25,0.5]) plt.ylabel("AAPL daily returns") plt.xlabel("Trading days 13/05/2005-13/05/2015")

We obtain the price-series

and return-series

respectively. For the latter plot, by fixing the scaling of both $y$-axes we immediately gain a chance to inspect the number of trading days closing with heavy losses. Well, at least at first glance and for both directions of trading. In this post we will be considering the long positions only.

Having our data pre-processed, we may implement two different strategies to make use of the VaR framework in order to work out the probabilities for tail events. The first one is based on setting the $\alpha$ level and finding $-\mbox{VaR}_{1-\alpha}$. Let’s assume $\alpha=0.01$; then the following piece of code

73 74 75 76 77 78 79 80 # ---2. Compuation of VaR given alpha   alpha = 0.01   VaR_CAAS, bardata1 = findvar(retCAAS, alpha=alpha, nbins=200) VaR_AAPL, bardata2 = findvar(retAAPL, alpha=alpha, nbins=100)   cl = 100.*(1-alpha)

aims at computation of the corresponding numbers making use of the function:

def findvar(ret, alpha=0.05, nbins=100):
    # Function computes the empirical Value-at-Risk (VaR) for return-series
    # (ret) defined as NumPy 1D array, given alpha
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute a normalised histogram (\int H(x)dx = 1)
    # nbins: number of bins used (recommended nbins>50)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    wd = np.diff(bins)
    # cumulative sum from -inf to +inf
    cumsum = np.cumsum(hist * wd)
    # find an area of H(x) for computing VaR
    crit = cumsum[cumsum <= alpha]
    n = len(crit)
    # (1-alpha)VaR
    VaR = bins[n]
    # supplementary data of the bar plot
    bardata = hist, n, wd
    return VaR, bardata

Here, we create the histogram with a specified number of bins and $-\mbox{VaR}_{1-\alpha}$ is found in an empirical manner. For many reasons this approach is much better than fitting the Normal distribution to the data and finding VaR based on the integration of that continuous function. It is a well-known fact that such a function would underestimate the probabilities in the far tail of the return distribution. And the game is all about capturing what is going on out there, right?

The results of computation we display as follows:

82 83 print("%g%% VaR (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.)) print("%g%% VaR (AAPL) = %.2f%%\n" % (cl, VaR_AAPL*100.))

i.e.

99% VaR (CAAS) = -9.19% 99% VaR (AAPL) = -5.83%

In order to gain a good feeling of those numbers we display the left-tails of both return distributions

85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 # plotting (figure #3) # histograms of daily returns; H(x) # plt.figure(num=3, figsize=(10, 6)) c = (.7,.7,.7) # grey color (RGB) # # CAAS ax = plt.subplot(211) hist1, bins1 = np.histogram(retCAAS, bins=200, density=False) widths = np.diff(bins1) b = plt.bar(bins1[:-1], hist1, widths, color=c, edgecolor="k", label="CAAS") plt.legend(loc=(0.02, 0.8)) # # mark in red all histogram values where int_{-infty}^{VaR} H(x)dx = alpha hn, nb, _ = bardata1 for i in range(nb): b[i].set_color('r') b[i].set_edgecolor('k') plt.text(-0.225, 30, "VaR$_{%.0f}$ (CAAS) = %.2f%%" % (cl, VaR_CAAS*100.)) plt.xlim([-0.25, 0]) plt.ylim([0, 50]) # # AAPL ax2 = plt.subplot(212) hist2, bins2 = np.histogram(retAAPL, bins=100, density=False) widths = np.diff(bins2) b = plt.bar(bins2[:-1], hist2, widths, color=c, edgecolor="k", label="AAPL") plt.legend(loc=(0.02, 0.8)) # # mark in red all histogram bars where int_{-infty}^{VaR} H(x)dx = alpha hn, nb, wd = bardata2 for i in range(nb): b[i].set_color('r') b[i].set_edgecolor('k') plt.text(-0.225, 30, "VaR$_{%.0f}$ (AAPL) = %.2f%%" % (cl, VaR_AAPL*100.)) plt.xlim([-0.25, 0]) plt.ylim([0, 50]) plt.xlabel("Stock Returns (left tail only)") plt.show()

where we mark our $\alpha=$1% regions in red:

As you may notice, so far, we haven’t done much new. A classical textbook example coded in Python. However, the last figure reveals the main players of the game. For instance, there is only 1 event of a daily loss larger than 15% for AAPL, while CAAS experienced 4 such heavy losses. The much higher 99% VaR for CAAS takes those 4 historical events into account and puts more weight on the 1-day VaR as estimated in our calculation.

Imagine now that we monitor all those tail extreme/rare events day by day. It’s not too difficult to notice (based on the inspection of Figure #2) that in the case of AAPL, the stock recorded its first serious loss of $L \lt -10$% approximately 650 days after the beginning of our “monitoring”, which commenced on May 13, 2005. In contrast, CAAS was much more volatile and you needed to wait only ca. 100 days to record a loss of the same magnitude.

If something has not happened yet, e.g. a loss of $L \lt -10$%, the VaR-like measure is a highly inadequate measure of probabilities for rare events. Once the event takes place, the estimation of VaR changes (is updated) but then decreases in time until a new rare event occurs. Let’s illustrate it with Python. This is our second VaR strategy: finding $\alpha$ given the threshold for rare events. We write a simple function that does the job for us:

def findalpha(ret, thr=1, nbins=100):
    # Function computes the probability P(X<thr)=alpha given threshold
    # level (thr) and return-series (NumPy 1D array). X denotes the
    # returns as a rv and nbins is number of bins used for histogram
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    #
    # compute normalised histogram (\int H(x)dx=1)
    hist, bins = np.histogram(ret, bins=nbins, density=True)
    # compute a default histogram
    hist1, bins1 = np.histogram(ret, bins=nbins, density=False)
    wd = np.diff(bins1)
    x = np.where(bins1 < thr)
    y = np.where(hist1 != 0)
    z = list(set(x[0]).intersection(set(y[0])))
    crit = np.cumsum(hist[z]*wd[z])
    # find alpha
    try:
        alpha = crit[-1]
    except Exception as e:
        alpha = 0
    # count number of events falling into (-inf, thr] intervals
    nevents = np.sum(hist1[z])
    return alpha, nevents

We call it in our main program:

126 127 128 129 130 131 132 133 134 135 136 137 138 # ---3. Computation of alpha, given the threshold for rare events   thr = -0.10   alpha1, ne1 = findalpha(retCAAS, thr=thr, nbins=200) alpha2, ne2 = findalpha(retAAPL, thr=thr, nbins=200)   print("CAAS:") print(" Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha1*100.)) print(" %g historical event(s)" % ne1) print("AAPL:") print(" Pr( L < %.2f%% ) = %.2f%%" % (thr*100., alpha2*100.)) print(" %g historical event(s)" % ne2)

what returns the following results:

CAAS: Pr( L < -10.00% ) = 0.76% 19 historical event(s) AAPL: Pr( L < -10.00% ) = 0.12% 3 historical event(s)

And that’s great; however, all these numbers are given as a summary of 10 years of analysed data (May 13, 2005 to May 13, 2015 in our case). The final picture could be better understood if we could dynamically track in time the changes of both probabilities and the number of rare events. We achieve it with the following not-state-of-the-art code:

141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 # ---4. Mapping alphas for Rare Events   alphas = [] nevents = [] for t in range(1,len(retCAAS)-1): data = retCAAS[0:t] alpha, ne = findalpha(data, thr=thr, nbins=500) alphas.append(alpha) nevents.append(ne)   alphas2 = [] nevents2 = [] for t in range(1,len(retAAPL)-1): data = retAAPL[0:t] alpha, ne = findalpha(data, thr=thr, nbins=500) alphas2.append(alpha) nevents2.append(ne)   # plotting (figure #4) # running probability for rare events # plt.figure(num=4, figsize=(10, 6)) ax1 = plt.subplot(211) plt.plot(np.array(alphas)*100., 'r') plt.plot(np.array(alphas2)*100.) plt.ylabel("Pr( L < %.2f%% ) [%%]" % (thr*100.)) plt.axis('tight') ax2 = plt.subplot(212) plt.plot(np.array(nevents), 'r', label="CAAS") plt.plot(np.array(nevents2), label="AAPL") plt.ylabel("# of Events with L < %.2f%%" % (thr*100.)) plt.axis('tight') plt.legend(loc="upper left") plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

revealing the following picture:

It’s a great way of looking at and understanding the far left-tail volatility of the asset under your current investigation. The probability between two rare/extreme events decreases for the obvious reason: along the time axis we include more and more data therefore the return distribution evolves and shifts its mass to the right leaving left tail events less and less probable.

The question remains: if an asset has never experienced an extreme loss of a given magnitude, the probability of such an event, within our VaR framework, simply remains zero! For example, for losses $L \lt -20$%,

CAAS displays 2 extreme events while AAPL displays none! Therefore, our estimate of the superbly rare daily loss of -20% (or more) for the AAPL stock based on 10 years of data is zero or undefined or… completely unsound. Can we do better than this?

2. Classical Conditional Prediction for Rare Events

Let’s consider a case where we want to predict the probability of the asset/stock (CAAS; traded at NASDAQ exchange) falling down more than $-L$%. Previously we have achieved such estimation through the integration of its probability density function. An alternative way is via derivation of the conditional probability, i.e. that CAAS will lose more than $-L$% given that NASDAQ index drops more than $-L$% on the next day.

Formula? Well, Bayes’ formula. That all what we need:
$$\mbox{Pr}(B|R) = \frac{ \mbox{Pr}(R|B)\ \mbox{Pr}(B) } { \mbox{Pr}(R) } .$$
Great, now, the meaning of each term. Let $A$ denote the event of the stock’s daily return being between $-L$% and 0%. Let $B$ be the event of the stock’s return being $\lt -L$%. Let $R$ be the event of the NASDAQ daily return being $\lt -L$%. Given that, based on both CAAS and NASDAQ historical time-series, we are able to compute $\mbox{Pr}(A)$, $\mbox{Pr}(B)$, $\mbox{Pr}(R|A)$, $\mbox{Pr}(R|B)$, therefore
$$\mbox{Pr}(R) = \mbox{Pr}(R|A)\mbox{Pr}(A) + \mbox{Pr}(R|B)\mbox{Pr}(B)$$ as well. Here, $\mbox{Pr}(R|A)$ would stand for the probability of NASDAQ falling down more than $-L$% given the observation of the CAAS return to be in $(-L;0)$% interval and $\mbox{Pr}(R|B)$ would denote the probability of NASDAQ falling down more than $-L$% given the observation of the CAAS return also dropping more than $-L$% on the same day.
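To make the bookkeeping concrete, here is a toy numerical illustration of the formulas above; the counts are made up solely for this example and are not derived from the CAAS/NASDAQ data. Out of $n$ trading days with a negative CAAS return, $n_A$ fall in $(-L, 0)$% and $n_B$ fall below $-L$%; counting on how many of those days NASDAQ also fell below $-L$% yields the conditional probabilities:

# toy counts, for illustration only
nA, nB = 1200, 90                    # days with CAAS return in (-L,0)% and < -L%, respectively
n = nA + nB
nRA, nRB = 1, 2                      # of those days, how often NASDAQ also fell below -L%

pA, pB = nA/float(n), nB/float(n)
pRA, pRB = nRA/float(nA), nRB/float(nB)   # Pr(R|A) and Pr(R|B)
pR = pRA*pA + pRB*pB                      # total probability of the NASDAQ event
pBR = pRB*pB/pR                           # Bayes' formula: Pr(B|R)

print("Pr(R)   = %.5f%%" % (pR*100))
print("Pr(B|R) = %.5f%%" % (pBR*100))

In this toy setting NASDAQ crashed on 3 days in total and CAAS crashed on 2 of them, so $\mbox{Pr}(B|R)$ comes out as 2/3, as expected.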

With Bayes’ formula we inverse the engineering, aiming at providing the answer on $\mbox{Pr}(B|R)$ given all available data. Ready to code it? Awesome. Here we go:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 # Predicting Heavy and Extreme Losses in Real-Time for Portfolio Holders # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz # # heavy2.py   import pandas.io.data as web import matplotlib.pyplot as plt import numpy as np   # ---1. Data Processing   # fetch and download daily adjusted-close price series for CAAS stock # and NASDAQ index using Yahoo! Finance public data provider ''' caas = web.DataReader("CAAS", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close'] nasdaq = web.DataReader("^IXIC", data_source='yahoo', start='2005-05-13', end='2015-05-13')['Adj Close']   CAAScp = np.array(caas.values) NASDAQcp = np.array(nasdaq.values)   f = file("data2.dat","wb") np.save(f,CAAScp) np.save(f,NASDAQcp) f.close() '''   f = file("data2.dat","rb") CAAScp = np.load(f) NASDAQcp = np.load(f) f.close()   # compute the return-series retCAAS = CAAScp[1:]/CAAScp[:-1]-1 retNASDAQ = NASDAQcp[1:]/NASDAQcp[:-1]-1

The same code as used in Section 1 but instead of AAPL data, we fetch NASDAQ index daily close prices. Let’s plot the return-series:

37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 # plotting (figure #1) # return-series for CAAS and NASDAQ index # plt.figure(num=2, figsize=(10, 6)) plt.subplot(211) plt.grid(True) plt.plot(retCAAS, '-r', label="CAAS") plt.axis("tight") plt.ylim([-0.25,0.5]) plt.legend(loc="upper right") plt.ylabel("CAAS daily returns") plt.subplot(212) plt.grid(True) plt.plot(retNASDAQ, '-', label="NASDAQ") plt.legend(loc="upper right") plt.axis("tight") plt.ylim([-0.10,0.15]) plt.ylabel("NASDAQ daily returns") plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

i.e.,

where we observe a different trading dynamics for the NASDAQ index between the 750th and 1000th day (as counted from May 13, 2005). Interestingly, the heaviest losses of NASDAQ are not ideally correlated with those of CAAS. That makes this case study more exciting! Please also note that CAAS is not a component of the NASDAQ index. Bayes’ formula is designed around independent events and, within a fair approximation, we may think of CAAS as an asset ticking off that box here.

What remains and takes a lot of caution is the code that “looks at” the data in a desired way. Here is its final form:

58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 # ---2. Computations of Conditional Probabilities for Rare Events   # isolate return-series displaying negative returns solely # set 1 for time stamps corresponding to positive returns nretCAAS = np.where(retCAAS < 0, retCAAS, 1) nretNASDAQ = np.where(retNASDAQ < 0, retNASDAQ, 1)   # set threshold for rare events thr = -0.065   # compute the sets of events A = np.where(nretCAAS < 0, nretCAAS, 1) A = np.where(A >= thr, A, 1) B = np.where(nretCAAS < thr, retCAAS, 1) R = np.where(nretNASDAQ < thr, retNASDAQ, 1) nA = float(len(A[A != 1])) nB = float(len(B[B != 1])) n = float(len(nretCAAS[nretCAAS != 1])) # n must equal to nA + nB # (optional) print(nA, nB, n == (nA + nB)) # check, if True then proceed further print(len(A), len(B), len(R)) print   # compute the probabilities pA = nA/n pB = nB/n   # compute the conditional probabilities pRA = np.sum(np.where(R+A < 0, 1, 0))/n pRB = np.sum(np.where(R+B < 0, 1, 0))/n   pR = pRA*pA + pRB*pB   # display results print("Pr(A)\t = %5.5f%%" % (pA*100.)) print("Pr(B)\t = %5.5f%%" % (pB*100.)) print("Pr(R|A)\t = %5.5f%%" % (pRA*100.)) print("Pr(R|B)\t = %5.5f%%" % (pRB*100.)) print("Pr(R)\t = %5.5f%%" % (pR*100.))   if(pR>0): pBR = pRB*pB/pR print("\nPr(B|R)\t = %5.5f%%" % (pBR*100.)) else: print("\nPr(B|R) impossible to be determined. Pr(R)=0.")

Python’s NumPy library helps us in a tremendous way by its smartly designed function of where which we employ in lines #69-72 and #86-87. First we test a logical condition, if it’s evaluated to True we grab the right data (actual returns), else we return 1. That opens for us a couple of shortcuts in finding the number of specific events and making sure we are still on the right side of the force (lines #73-79).
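If the masking trick is new to you, here is a tiny, self-contained illustration of the pattern used above (toy returns and a hypothetical threshold):

import numpy as np

ret = np.array([0.021, -0.012, -0.071, 0.004, -0.087])
thr = -0.065

B = np.where(ret < thr, ret, 1)   # keep the returns below the threshold, mask the rest with 1
print(B)                          # [ 1.     1.    -0.071  1.    -0.087]
print(len(B[B != 1]))             # 2, i.e. the number of "rare" events in the toy series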

As you can see, in line #66 we specified our threshold level of $L = -6.5$%. Given that, we derive the probabilities:

(1216.0, 88.0, True) (2516, 2516, 2516)   Pr(A) = 93.25153% Pr(B) = 6.74847% Pr(R|A) = 0.07669% Pr(R|B) = 0.15337% Pr(R) = 0.08186%   Pr(B|R) = 12.64368%

The level of -6.5% is pretty arbitrary but delivers an interesting finding. Namely, based on 10 years of data there is a 12.6% chance that on May 14, 2015 CAAS will lose more than 6.5% if NASDAQ drops by the same amount. How much exactly? We don’t know. It’s not certain. Only probable at 12.6%.

The outcome of $\mbox{Pr}(R|B)$ being close to zero may support our assumption that CAAS has a negligible “influence” on NASDAQ itself. On the other side of the rainbow, it’s much more difficult to interpret $\mbox{Pr}(R)$, the probability of an “isolated” rare event (i.e., $L<-6.5$%), since its estimation is based solely on two players in the game: CAAS and NASDAQ. However, when computed for $N\ge 100$ stocks outside the NASDAQ index but traded at NASDAQ, such a distribution of $\mbox{Pr}(R)$’s would be of great value as another alternative estimator for rare events (I should write about it in a distinct post). Now, back to business. There is one problem with our conditional estimation of rare event probabilities as outlined within this Section. A quick check for $L = -10$% reveals:

Pr(A) = 98.69632% Pr(B) = 1.30368% Pr(R|A) = 0.00000% Pr(R|B) = 0.00000% Pr(R) = 0.00000%   Pr(B|R) impossible to be determined. Pr(R)=0.

Let’s recall that $R$ is the event of the NASDAQ daily return being $\lt -L$%. A quick look at the NASDAQ return-series tells us the whole story and answers the question that comes to your mind: why, why, why zero? Well, simply because there was no event in the past 10-year history of the index where it slid more than 10% in a single day. And we are dead in the water.

A visual representation of that problem can be obtained by executing the following piece of code:

104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 from pyvar import cpr # non-at-all PEP8 style ;)   prob = [] for t in range(2,len(retCAAS)-1): ret1 = retCAAS[0:t] ret2 = retNASDAQ[0:t] pBR, _ = cpr(ret1, ret2, thr=thr) prob.append(pBR)   # plotting (figure #2) # the conditional probability for rare events given threshold # plt.figure(num=2, figsize=(10, 6)) plt.plot(np.array(prob)*100., 'r') plt.ylabel("Pr(B|R) [%%]") plt.axis('tight') plt.title("L$_{thr}$ = %.2f%%" % (thr*100.)) plt.xlabel("Trading days 13/05/2005-13/05/2015") plt.show()

where we moved all Bayes’-based calculations (lines #58-102) into a function cpr (a part of the pyvar.py local library; see Download section below) and repeated our previous experiment, this time tracking how $\mbox{Pr}(B|R)$ changes in time given the threshold of $L=-6.5$%. The resulting plot would be:

Before ca. the 800th day the NASDAQ index was lacking any event of $L\lt -6.5$%, therefore $\mbox{Pr}(B|R)$ could not be determined. After that period we got some heavy hits and the conditional probability could be derived. The end value (as of May 13, 2015) is 12.64%.

Hope amongst Hopelessness?

There is no better title to summarise the outcomes we derived. Our goal was to come up with some (running) probability for very rare event(s) that, in our case, would be expressed as a 1-day loss of $-L$% in trading for a specified financial asset. In the first attempt we engaged the VaR-related framework and estimated the occurrence of the rare event based on the PDF integration. In our second attempt we sought an alternative solution making use of Bayes’ conditional prediction of the asset’s loss of $-L$% (or more) given the observation of the same event somewhere else (the NASDAQ index). For the former we ended up with a running probability of $\mbox{Pr}(L \le L_{thr})$, while for the latter with $\mbox{Pr}(B|R)$ given $L_{thr}$ for both $B$ and $R$ events.

Assuming $L_{thr}$ to be equal to -6.5%, we can present the corresponding computations in the following chart:

Well, the situation looks hopeless. It doesn’t get any better even if, as discussed earlier, the probability $\mbox{Pr}(R)$ is added. Both methodologies seem to deliver the answer to the same question; however, somehow, we are left confused, disappointed, misled, in tasteless despair…

Stay tuned as in Part 2 we will see the light at the end of the tunnel. I promise.

heavy1.py, heavy2.py, pyvar.py

Featured in: Data Science Weekly Newsletter, Issue 76 (May 7, 2015)

It has been over a year since I posted Hacking Google Finance in Real-Time for Algorithmic Traders article. Surprisingly, it became the number one URL of QaR that Google has been displaying as a result to various queries and the number two most frequently read post. Thank You! It’s my pleasure to provide quality content covering interesting topics that I find potentially useful.

You may be surprised how fast Python solutions have moved forward, facilitating the life of quants and algo traders. For instance, yesterday, haphazardly, I found a code that seems to work equally well compared to my first version and, in fact, is more flexible in the data content that can be retrieved. The idea stays the same as previously; however, our goal this time is to monitor changes of stock prices provided by Google Finance in real-time before the market opens.

Constructing Pre-Market Price-Series

The pre-market trading session typically occurs between 8:00am and 9:30am EDT each trading day though for some stocks we often observe frequent movements much earlier, e.g. at 6:00am. Many investors and traders watch the pre-market trading activity to judge the strength and direction of the market in anticipation for the regular trading session. Pre-market trading activity generally has limited volume and liquidity, and therefore, large bid-ask spreads are common. Many retail brokers offer pre-market trading, but may limit the types of orders that can be used during the pre-market period$^1$.

In Google Finance the stock price in pre-market is usually displayed right beneath the ticker, for example:

The price of the stock (here: AAPL) varies depending on interest, good/bad news, etc.

In Python we can fetch those changes (I adopt a code found on the Web) in the following way:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 import urllib2 # works fine with Python 2.7.9 (not 3.4.+) import json import time   def fetchPreMarket(symbol, exchange): link = "http://finance.google.com/finance/info?client=ig&q=" url = link+"%s:%s" % (exchange, symbol) u = urllib2.urlopen(url) content = u.read() data = json.loads(content[3:]) info = data[0] t = str(info["elt"]) # time stamp l = float(info["l"]) # close price (previous trading day) p = float(info["el"]) # stock price in pre-market (after-hours) return (t,l,p)     p0 = 0 while True: t, l, p = fetchPreMarket("AAPL","NASDAQ") if(p!=p0): p0 = p print("%s\t%.2f\t%.2f\t%+.2f\t%+.2f%%" % (t, l, p, p-l, (p/l-1)*100.)) time.sleep(60)

In this code we target Google to get every 60 seconds an update of the pre-market price (line #14). What we retrieve is a JSON file of the form:

// [ { "id": "22144" ,"t" : "AAPL" ,"e" : "NASDAQ" ,"l" : "125.80" ,"l_fix" : "125.80" ,"l_cur" : "125.80" ,"s": "1" ,"ltt":"4:02PM EDT" ,"lt" : "May 5, 4:02PM EDT" ,"lt_dts" : "2015-05-05T16:02:28Z" ,"c" : "-2.90" ,"c_fix" : "-2.90" ,"cp" : "-2.25" ,"cp_fix" : "-2.25" ,"ccol" : "chr" ,"pcls_fix" : "128.7" ,"el": "126.10" ,"el_fix": "126.10" ,"el_cur": "126.10" ,"elt" : "May 6, 6:35AM EDT" ,"ec" : "+0.30" ,"ec_fix" : "0.30" ,"ecp" : "0.24" ,"ecp_fix" : "0.24" ,"eccol" : "chg" ,"div" : "0.52" ,"yld" : "1.65" ,"eo" : "" ,"delay": "" ,"op" : "128.15" ,"hi" : "128.45" ,"lo" : "125.78" ,"vo" : "21,812.00" ,"avvo" : "46.81M" ,"hi52" : "134.54" ,"lo52" : "82.90" ,"mc" : "741.44B" ,"pe" : "15.55" ,"fwpe" : "" ,"beta" : "0.84" ,"eps" : "8.09" ,"shares" : "5.76B" ,"inst_own" : "62%" ,"name" : "Apple Inc." ,"type" : "Company" } ]

You can download it individually by executing the following query in the browser:

http://www.google.com/finance/info?infotype=infoquoteall&q=NASDAQ:AAPL

Some of that information you can easily decipher. For our task we only need: el (the asset price in pre-market or after-hours trading, a.k.a. extended hours trading); elt (the corresponding time stamp); and l (the most recent last price). This is what our Python code extracts for us in lines #12-14. Nice and smooth.
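For the record, if you prefer Python 3, only the urllib import and the bytes-to-string decoding change; a sketch of the same fetch, assuming the Google Finance endpoint still responds the way it did at the time of writing:

from urllib.request import urlopen   # Python 3 counterpart of urllib2
import json

def fetchPreMarket(symbol, exchange):
    url = "http://finance.google.com/finance/info?client=ig&q=%s:%s" % (exchange, symbol)
    content = urlopen(url).read().decode("utf-8")
    data = json.loads(content[3:])    # strip the leading "// " before parsing
    info = data[0]
    return (str(info["elt"]), float(info["l"]), float(info["el"]))

print(fetchPreMarket("AAPL", "NASDAQ"))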

When executed before 9.30am EDT (here for NASDAQ:AAPL), we may construct the pre-market price-series every time the price has been changed:

May 6, 6:35AM EDT 125.80 126.18 +0.38 +0.30% May 6, 6:42AM EDT 125.80 126.21 +0.41 +0.33% May 6, 6:45AM EDT 125.80 126.16 +0.36 +0.29% May 6, 6:46AM EDT 125.80 126.18 +0.38 +0.30% May 6, 6:49AM EDT 125.80 126.10 +0.30 +0.24% May 6, 6:51AM EDT 125.80 126.20 +0.40 +0.32% May 6, 6:57AM EDT 125.80 126.13 +0.33 +0.26% May 6, 7:00AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:01AM EDT 125.80 126.13 +0.33 +0.26% May 6, 7:07AM EDT 125.80 126.18 +0.38 +0.30% May 6, 7:09AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:10AM EDT 125.80 126.19 +0.39 +0.31% May 6, 7:10AM EDT 125.80 126.22 +0.42 +0.33% May 6, 7:12AM EDT 125.80 126.20 +0.40 +0.32% May 6, 7:22AM EDT 125.80 126.27 +0.47 +0.37% May 6, 7:28AM EDT 125.80 126.24 +0.44 +0.35% ... May 6, 9:02AM EDT 125.80 126.69 +0.89 +0.71% May 6, 9:03AM EDT 125.80 126.71 +0.91 +0.72% May 6, 9:04AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:08AM EDT 125.80 126.67 +0.87 +0.69% May 6, 9:09AM EDT 125.80 126.69 +0.89 +0.71% May 6, 9:10AM EDT 125.80 126.68 +0.88 +0.70% May 6, 9:13AM EDT 125.80 126.67 +0.87 +0.69% May 6, 9:14AM EDT 125.80 126.72 +0.92 +0.73% May 6, 9:16AM EDT 125.80 126.74 +0.94 +0.75% May 6, 9:17AM EDT 125.80 126.72 +0.92 +0.73% May 6, 9:18AM EDT 125.80 126.70 +0.90 +0.72% May 6, 9:19AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:20AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:21AM EDT 125.80 126.74 +0.94 +0.75% May 6, 9:21AM EDT 125.80 126.79 +0.99 +0.79% (*) May 6, 9:23AM EDT 125.80 126.78 +0.98 +0.78% May 6, 9:24AM EDT 125.80 126.71 +0.91 +0.72% May 6, 9:25AM EDT 125.80 126.73 +0.93 +0.74% May 6, 9:26AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:27AM EDT 125.80 126.70 +0.90 +0.72% May 6, 9:28AM EDT 125.80 126.75 +0.95 +0.76% May 6, 9:29AM EDT 125.80 126.79 +0.99 +0.79%

Since the prices in pre-market tend to vary slowly, 60 second time interval is sufficient to keep our eye on the stock. You can compare a live result retrieved using our Python code at 9:21am (*) with the above screenshot I took at the same time.

A simple joy of Python in action. Enjoy!

HOMEWORK
1. The code fails after 9.30am EST (NYC time). Modify it to catch this exception.
2. Modify the code (or write a new function) that works after 9.30am EDT.
3. It is possible to get $N\gt1$ queries for $N$ stocks by calling, for example:
NASDAQ:AAPL,NYSE:JNJ,… in line #7 of the code. Modify the program
to fetch pre-market time-series, $x_i(t)$ $(i=1,…,N)$, for $N$-asset portfolio.
Given that, compute a fractional root-mean-square volatility, $\sigma_{x_i(t)}/\langle x_i(t) \rangle$,
i.e. standard deviation divided by the mean, between 6am and 9.30am EDT
for each asset and check can you use it as an indicator for stock price movement
after 9.30am? Tip: the higher frms the more trading is expected in first 15 min
of a new session at Wall Street.
4. Modify the code to monitor after-hours trading till 4.30pm.

NetworkError.org, 2013, Google’s Undocumented Finance API

REFERENCES
$^1$Pre-Market, Investopedia, http://www.investopedia.com/terms/p/premarket.asp

## How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API

Quandl.com offers an easy solution to that task. Their WIKI database contains 3339 major stock tickers and corresponding company names. They can be fetched via secwiki_tickers.csv file. For a plain file of portfolio.lst storing the list of your tickers, for example:

AAPL IBM JNJ MSFT TXN

you can scan the .csv file for the company name coding in Python:

# How to Find a Company Name given a Stock Ticker Symbol utilising Quandl API
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz

import pandas as pd

df = pd.read_csv('secwiki_tickers.csv')
dp = pd.read_csv('portfolio.lst', names=['pTicker'])

pTickers = dp.pTicker.values  # converts into a list

tmpTickers = []

for i in range(len(pTickers)):
    test = df[df.Ticker == pTickers[i]]
    if not (test.empty):
        print("%-10s%s" % (pTickers[i], list(test.Name.values)[0]))

what returns:

AAPL Apple Inc. IBM International Business Machines Corporation JNJ Johnson & Johnson MSFT Microsoft Corporation TXN Texas Instruments Inc.

Please note that there is a possibility to combine more stocks from other Quandl’s resources in one file. For more information see Quandl.com‘s documentation online.
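The loop above is perfectly fine for a handful of tickers; for longer portfolios you may prefer a vectorised variant of the same lookup, e.g. via pandas’ merge. A sketch using the same two files:

import pandas as pd

df = pd.read_csv('secwiki_tickers.csv')
dp = pd.read_csv('portfolio.lst', names=['pTicker'])

# an inner join keeps only those portfolio tickers present in the WIKI database
out = dp.merge(df[['Ticker', 'Name']], left_on='pTicker', right_on='Ticker', how='inner')
print(out[['pTicker', 'Name']].to_string(index=False, header=False))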

I will dedicate a separate chapter to hacking financial websites in my next book, Python for Quants. Volume II., due in mid-2015.

## Walsh–Hadamard Transform and Tests for Randomness of Financial Return-Series

Randomness. A magic that happens behind the scene. An incomprehensible little blackbox that does the job for us. Quants. Many use it every day without thinking, without considering how those beautifully uniformly distributed numbers are drawn?! Why so fast? Why so random? Is randomness a byproduct of chaos and order tamed somehow? How trustful can we be placing this, of minor importance, command of random() as piece of our codes?

Randomness Built-in. Not only is that the name of a chapter in my latest book but also the main reason I’m writing this article: wandering sideways around the subject led me to new, undiscovered territories! In general the output, in the form of a random number, comes from a dedicated function. To be perfectly honest, it is like a refined gentleman of sophisticated quality, recognised by its efficiency when a great number of draws is requested. The numbers are not truly random. They are pseudo-random. That means the devil resides in the details. A typical pseudo-random number generator (PRNG) is designed for speed but defined by its underlying algorithm. In most computer languages the Mersenne Twister, developed by 松本 眞 (Makoto Matsumoto) and 西村 拓士 (Takuji Nishimura) in 1997, has become a standard. As one might suspect, an algorithm must repeat itself and, in case of our Japanese heroes, its period is superbly long, i.e. $2^{19937}-1$. A reliable implementation in C or C++ guarantees enormous speedups in pseudo-random number generation. Interestingly, the Mersenne Twister is not perfect. Its use has been discouraged when it comes to obtaining cryptographic random numbers. Can you imagine that?! But that’s another story. Ideal for a book chapter indeed!

In this post, I dare to present the very first, meaningful, and practical application of the Walsh–Hadamard Transform (WHT) in quantitative finance. Remarkably, this tool, of marginal use in digital signal processing, had been shown to serve as a great facility in testing any binary sequence for its statistically significant randomness!

Within the following sections we introduce the transform to finance. After Oprina et al. (2009) we encode the WHT Significance Tests in Python (see Download section) and verify the hypothesis of randomness of sequences generated by the Mersenne Twister PRNG against the alternative hypothesis at very high significance levels of $\alpha\le 0.00001$. Next, we show a practical application of the WHT framework in the search for randomness in any financial time-series, by the example of 30-min FX return-series, and compare the results to those obtained from the PRNG available in Python by default. Curious about the findings? Welcome to the world of magic!

1. Walsh–Hadamard Transform And Binary Sequences

I wish I had for You this great opening story on how Jacques Hadamard and Joseph L. Walsh teamed up with Jack Daniels on one Friday’s night in the corner pub somewhere in San Francisco coming up to a memorable breakthrough in theory of numbers. I wish I had! Well, not this time. However, as a make-up, below I will present in a nutshell a theoretical part on WHT I’ve been studying for past three weeks and found it truly mind-blowing because of its simplicity.

Let’s consider any real-valued discrete signal $X(t_i)$ where $i=0,1,…,N-1$. Its trimmed version, $x(t_i)$, of the total length of $n=2^M$ such that $2^M\le(N-1)$ and $M\in\mathbb{Z}^+$ at $M\ge 2$ is considered as an input signal for the Walsh–Hadamard Transform, the latter defined as:
$$WHT_n = \mathbf{x} \bigotimes_{i=1}^M \mathbf{H_2}$$ where the Hadamard matrix of order $n=2^M$ is obtainable recursively by the formula:
$$\mathbf{H_{2^M}} = \begin{pmatrix} H_{2^{M-1}} & H_{2^{M-1}} \\ H_{2^{M-1}} & -H_{2^{M-1}} \end{pmatrix} \ \ \ \ \ \text{therefore} \ \ \ \ \ \mathbf{H_2} = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}$$ and $\otimes$ denotes the Kronecker product between two matrices. Given that, $WHT_n$ is the dot product between the signal (1D array; vector) and resultant Kronecker multiplications of $\mathbf{H_2}$ $M$-times (Johnson & Puschel 2000).
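The recursion is literally two lines of NumPy; a quick sketch that builds $\mathbf{H}_{2^M}$ by repeated Kronecker products and applies it to a toy signal (the $1/2^M$ normalisation mirrors the convention adopted later in this post):

import numpy as np

H2 = np.array([[1, 1], [1, -1]])

M = 3
H = H2
for _ in range(M-1):            # H_{2^M} = H_2 (x) H_{2^{M-1}}
    H = np.kron(H2, H)

x = np.array([0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036])
wht = np.dot(H, x)/2**M         # Walsh-Hadamard Transform (Hadamard ordering)
print(wht)                      # first coefficient = mean of x = -0.023875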

1.1. Walsh Functions

The Walsh-Hadamard transform uses the orthogonal square-wave functions, $w_j(x)$, introduced by Walsh (1923), which have only two values $\pm 1$ in the interval $0\le x\lt 1$ and the value zero elsewhere. The original definition of the Walsh functions is based on the following recursive equations:
$$w_{2j}(x) = w_j(2x)+(-1)^j w_j (2x -1) \quad \mbox{for} \ \ j=1,2,… \\ w_{2j-1}(x) = w_{j-1}(2x)-(-1)^{j-1} w_{j-1} (2x -1) \quad \mbox{for} \ \ j=1,2,…$$ with the initial condition of $w_0(x)= 1$. You can meet with different ordering of Walsh functions in the literature but in general it corresponds to the ordering of the orthogonal harmonic functions $h_j(x)$, which are defined as:
$$h_{2j}(x) = \cos(2\pi j x)\quad \mbox{for} \ \ j=0,1,… \\ h_{2j-1}(x) = \sin(2\pi j x)\quad \mbox{for} \ \ j=1,2,…$$ on the interval $0\le x\lt 1$, and have zero value for all other values of $x$ outside this interval. A comparison of both function classes looks as follows:

where the first eight harmonic functions and Walsh functions are given in the left and right panels, respectively. Walsh functions with even and odd orders are called the cal and sal functions, respectively, and they correspond to the cosine and sine functions in Fourier analysis (Aittokallio et al. 2001).

1.2. From Hadamard to Walsh Matrix

In Python the Hadamard matrix of order $2^M$ is obtainable as a NumPy 2D array making use of SciPy module as follows:

from scipy.linalg import hadamard

M = 3; n = 2**M
H = hadamard(n)
print(H)

which in this case returns:

[[ 1 1 1 1 1 1 1 1] [ 1 -1 1 -1 1 -1 1 -1] [ 1 1 -1 -1 1 1 -1 -1] [ 1 -1 -1 1 1 -1 -1 1] [ 1 1 1 1 -1 -1 -1 -1] [ 1 -1 1 -1 -1 1 -1 1] [ 1 1 -1 -1 -1 -1 1 1] [ 1 -1 -1 1 -1 1 1 -1]]

Each row of the matrix corresponds to a Walsh function. However, the ordering is different (the Hadamard ordering). Therefore, to be able to “see” the shape of the Walsh functions as presented in the figure above, we need to rearrange their indexing. The resultant matrix is known as the Walsh matrix. We derive it in Python:

def Hadamard2Walsh(n): # Function computes both Hadamard and Walsh Matrices of n=2^M order # (c) 2015 QuantAtRisk.com, coded by Pawel Lachowicz, adopted after # au.mathworks.com/help/signal/examples/discrete-walsh-hadamard-transform.html import numpy as np from scipy.linalg import hadamard from math import log   hadamardMatrix=hadamard(n) HadIdx = np.arange(n) M = log(n,2)+1   for i in HadIdx: s=format(i, '#032b') s=s[::-1]; s=s[:-2]; s=list(s) x=[int(x) for x in s] x=np.array(x) if(i==0): binHadIdx=x else: binHadIdx=np.vstack((binHadIdx,x))   binSeqIdx = np.zeros((n,M)).T   for k in reversed(xrange(1,int(M))): tmp=np.bitwise_xor(binHadIdx.T[k],binHadIdx.T[k-1]) binSeqIdx[k]=tmp   tmp=np.power(2,np.arange(M)[::-1]) tmp=tmp.T SeqIdx = np.dot(binSeqIdx.T,tmp)   j=1 for i in SeqIdx: if(j==1): walshMatrix=hadamardMatrix[i] else: walshMatrix=np.vstack((walshMatrix,hadamardMatrix[i])) j+=1   return (hadamardMatrix,walshMatrix)

Therefore, by calling the function in an exemplary main program:

from WalshHadamard import Hadamard2Walsh import matplotlib.pyplot as plt import numpy as np   M=3; n=2**M (H,W)=Hadamard2Walsh(n) # display Hadamard matrix followed by Walsh matrix (n=8) print(H); print; print(W)   # a visual comparison of Walsh function (j=2) M=3; n=2**M _,W=Hadamard2Walsh(n) plt.subplot(2,1,1) plt.step(np.arange(n).tolist(),W[2],where="post") plt.xlim(0,n) plt.ylim(-1.1,1.1); plt.ylabel("order M=3")   M=8; n=2**M _,W=Hadamard2Walsh(n) plt.subplot(2,1,2) plt.step(np.arange(n).tolist(),W[2],where="post") plt.xlim(0,n) plt.ylim(-1.1,1.1); plt.ylabel("order M=8")   plt.show()

first, we display Hadamard and Walsh matrices of order $n=2^3$, respectively:

[[ 1 1 1 1 1 1 1 1] [ 1 -1 1 -1 1 -1 1 -1] [ 1 1 -1 -1 1 1 -1 -1] [ 1 -1 -1 1 1 -1 -1 1] [ 1 1 1 1 -1 -1 -1 -1] [ 1 -1 1 -1 -1 1 -1 1] [ 1 1 -1 -1 -1 -1 1 1] [ 1 -1 -1 1 -1 1 1 -1]]   [[ 1 1 1 1 1 1 1 1] [ 1 1 1 1 -1 -1 -1 -1] [ 1 1 -1 -1 -1 -1 1 1] [ 1 1 -1 -1 1 1 -1 -1] [ 1 -1 -1 1 1 -1 -1 1] [ 1 -1 -1 1 -1 1 1 -1] [ 1 -1 1 -1 -1 1 -1 1] [ 1 -1 1 -1 1 -1 1 -1]]

and next we visually verify that the shape of the third Walsh function ($j=2$) is preserved for two different orders, here $M=3$ and $M=8$, respectively:

The third possibility of ordering the Walsh functions is so that they are arranged in increasing order of their sequencies or number of zero-crossings: sequency order. However, we won’t be interested in it this time.

1.3. Signal Transformations

As we have seen in the beginning, the WHT is able to perform a signal transformation for any real-valued time-series. The sole requirement is that the signal ought to be of length $2^M$. Now, it should be obvious why that is so. When you think about the WHT for a second longer you may understand its uniqueness as contrasted with the Fourier transform. Firstly, the waveforms are much simpler. Secondly, the complexity of computation is significantly reduced. Finally, if the input signal is converted from its original form down to only two discrete values $\pm 1$, we end up with a bunch of trivial arithmetical calculations while performing the WHT.

The Walsh-Hadamard transform found its application in medical signal processing, audio/sound processing, signal and image compression, pattern recognition, and cryptography. In the most simplistic cases, one deals with the input signal to be of the binary form, e.g. 01001011010110. If we consider the binary function $f: \mathbb{Z}_2^n \rightarrow \mathbb{Z}_2$ then the following transformation is possible:
$$\bar{f}(\mathbf{x}) = 1 - 2f(\mathbf{x}) = (-1)^{f(\mathbf{x})}$$ therefore $\bar{f}: \mathbb{Z}_2^n \rightarrow \{-1,1\}$. What it does is perform the following conversion, for instance:
$$\{0,1,0,1,1,0,1,0,0,0,1,…\} \rightarrow \{-1,1,-1,1,1,-1,1,-1,-1,-1,1,…\}$$ of the initial binary time-series. However, what is more interesting for us, when it comes to financial return-series processing, is the transformation:
$$\bar{f}(\mathbf{x}) = \left\{ \begin{array}{rl} 1 & \text{if } f(\mathbf{x})\ge 0\\ -1 & \text{if } f(\mathbf{x}) \lt 0 \end{array} \right.$$ Given that, any real-valued return-series in the value interval $[-1,1]$ is transformed to the binary form of $\pm 1$, for example:
$$\{0.031,0.002,-0.018,-0.025,0.011,…\} \rightarrow \{1,1,-1,-1,1,…\} \ .$$ This simple signal transformation, in connection with the power of the Walsh-Hadamard Transform, opens new possibilities for analysing the underlying true signal. The WHT is all “made of” $\pm 1$ values sleeping in its Hadamard matrices. Coming into close contact with a signal of the same form, this is “where the rubber meets the road” (Durkee, Jacobs, & Meat Loaf 1995).
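For a NumPy array of returns, a minimal one-line sketch of this mapping (with zero returns mapped to $+1$, consistently with the convention above) is:

import numpy as np

r = np.array([0.031, 0.002, -0.018, -0.025, 0.011])
x = np.where(r >= 0, 1, -1)   # -> array([ 1,  1, -1, -1,  1])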

1.4. Discrete Walsh-Hadamard Transform in Python

The Walsh-Hadamard transform is an orthogonal transformation that decomposes a signal into a set of orthogonal, rectangular waveforms (Walsh functions). For any real-valued signal we derive WHT as follows:

def WHT(x):
    # Function computes (slow) Discrete Walsh-Hadamard Transform
    # for any 1D real-valued signal
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    # (assumes module-level imports in WalshHadamard.py:
    #  import numpy as np; from math import log, trunc)
    x=np.array(x)
    if(len(x.shape)<2):   # make sure x is 1D array
        if(len(x)>3):     # accept x of min length of 4 elements (M=2)
            # check length of signal, adjust to 2**m
            n=len(x)
            M=trunc(log(n,2))
            x=x[0:2**M]
            h2=np.array([[1,1],[1,-1]])
            for i in xrange(M-1):
                if(i==0):
                    H=np.kron(h2,h2)
                else:
                    H=np.kron(H,h2)

            return (np.dot(H,x)/2.**M, x, M)
        else:
            print("HWT(x): Array too short!")
            raise SystemExit
    else:
        print("HWT(x): 1D array expected!")
        raise SystemExit

Despite its simplicity, this solution slows down for signals of length $n\ge 2^{22}$ where, in case of my MacBook Pro, 16 GB of RAM is just not enough! Therefore, the mechanical derivation of the WHT making use of Kronecker products between matrices is often referred to as the Slow Walsh-Hadamard Transform. Obviously a Fast WHT exists, but its application for the purpose of this article (and research) is not required. Why? We will discuss it later on.
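For completeness only, a minimal sketch of a Fast WHT (the classical in-place butterfly scheme, Hadamard ordering; not required further in this article) could look as follows. It avoids building the Hadamard matrix explicitly and scales as $O(n\log n)$; a power-of-two signal length is assumed and the normalisation by $2^M$ mimics the WHT(x) function above:

import numpy as np

def FWHT(x):
    # Fast Walsh-Hadamard Transform (Hadamard ordering), a sketch;
    # assumes len(x) is a power of two
    y = np.array(x, dtype=float)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2*h):
            for j in range(i, i+h):
                a, b = y[j], y[j+h]
                y[j], y[j+h] = a+b, a-b
        h *= 2
    return y/n   # same 1/2^M normalisation as in WHT(x)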

We can see our Python WHT(x) function in action by coding, for instance:

from WalshHadamard import WHT
from random import randrange

x1=[0.123,-0.345,-0.021,0.054,0.008,0.043,-0.017,-0.036]
wht,_,_=WHT(x1)
print("x1 = %s" % x1)
print("WHT = %s" % wht)

x2=[randrange(-1,2,2) for i in xrange(2**4)]
wht,_,_=WHT(x2)
print; print("x2 = %s" % x2)
print("WHT = %s" % wht)

which returns

x1 = [0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036]
WHT = [-0.023875  0.047125 -0.018875  0.061125 -0.023375  0.051125 -0.044875  0.074625]

x2 = [1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1]
WHT = [ 0.375  0.125  0.125 -0.125 -0.125  0.125 -0.375 -0.125  0.375  0.125 -0.375  0.375 -0.125  0.125  0.125  0.375]

where we performed the WHT of a real-valued signal (e.g. a financial return-series) $x_1$ and of a random binary sequence $x_2$. Is it correct? You can always verify these results in Matlab (version used here: 2014b) by executing the corresponding function fwht from Matlab’s Signal Processing Toolbox:

>> x1=[0.123, -0.345, -0.021, 0.054, 0.008, 0.043, -0.017, -0.036] x1 = 0.1230 -0.3450 -0.0210 0.0540 0.0080 0.0430 -0.0170 -0.0360   >> y1 = fwht(x1,length(x1),'hadamard') y1 = -0.0239 0.0471 -0.0189 0.0611 -0.0234 0.0511 -0.0449 0.0746   >> x2=[1, -1, 1, 1, 1, 1, 1, 1, -1, 1, 1, -1, 1, 1, -1, -1] x2 = 1 -1 1 1 1 1 1 1 -1 1 1 -1 1 1 -1 -1   >> y2 = fwht(x2,length(x2),'hadamard')' y2 = 0.3750 0.1250 0.1250 -0.1250 -0.1250 0.1250 -0.3750 -0.1250 0.3750 0.1250 -0.3750 0.3750 -0.1250 0.1250 0.1250 0.3750

All nice, smooth, and in agreement. The difference between WHT(x) and Matlab’s fwht is that the former trims the signal down while the latter allows for padding with zeros. Just keep that in mind if you employ Matlab in your own research.
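If you prefer the Matlab-like behaviour, a possible sketch (a hypothetical helper, not part of WalshHadamard.py) of zero-padding the input up to the nearest power-of-two length before calling WHT(x) is:

import numpy as np

def pad2pow2(x):
    # zero-pad a 1D signal up to the nearest power-of-two length
    x = np.asarray(x, dtype=float)
    m = int(np.ceil(np.log2(len(x))))
    return np.concatenate([x, np.zeros(2**m-len(x))])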

2. Random Sequences and Walsh-Hadamard Transform Statistical Test

You may raise a question. Why do we need to introduce WHT at all, transform return-series into binary $\pm 1$ sequences, and what does it have in common with randomness? The answer is straightforward and my idea is simple, namely: we may test any financial return time-series for randomness using our fancy WHT approach. A bit of creativity before sundown.

The WHT of a binary signal returns a unique pattern. If the signal $x(t)$ is of a random nature, we should not expect to find any repeatable sub-strings within it. That is also the main motivation standing behind PRNGs: to imitate the true randomness met in nature.

In 2009 Oprina et al. (hereinafter Op09) proposed a statistical test based on results derived for any binary $\pm 1$ signal $x(t)$. But they were smart: instead of looking at the signal as a whole, they suggested analysing it chunk by chunk (via signal segmentation into equal-length blocks). The statistical test they designed aims at performing a class of autocorrelation tests with the correlation mask given by the rows of the Hadamard matrix. As a supplement to the methodology presented in NIST Special Publication 800-22 (Rukhin et al. 2010), which specifies 16 independent statistical tests for random and pseudorandom number generators for cryptographic applications, Op09 proposed another two methods, based on confidence intervals, which can detect a more general failure in random and pseudorandom generators. All 5 statistical tests of Op09 form the Walsh-Hadamard Transform Statistical Test Suite and, in what follows, we will encode them in Python, focusing both on their statistical meaning and application.

The guiding idea behind Op09’s suite was a generalised test suitable for different purposes: randomness testing, cryptographic design, cryptanalysis techniques, and steganographic detection. In this Section we first concentrate our effort on seeing how this new test suite works for the standard Mersenne Twister PRNG concealed in Python’s random function class. Next, we move to real-life financial data as an input, where we aim to verify whether any chunk (block) of the time-series can be regarded as being of a random nature at a given significance level of $\alpha$. If so, which one? If not, well, well, well… why not?! For the latter case, this test opens a new window of opportunities to look for the non-stochastic nature of, e.g., trading signals, and their repeatable properties (if any).

2.1. Signal Pre-Processing

Each test requires some input data. In case of the Op09 tests, we need to provide: a trimmed signal $x(t)$ of total length $n=2^M$; a chosen sequence (block) size of length $2^m$; a significance level $\alpha$ (rejection level); and a probability $p$ of occurrence of the digit 1. In step 1 we transform (if needed) $x(t_i)$ into a $\pm 1$ sequence as specified in Section 1.3. In step 2 we compute the lower and upper rejection limits of the test, $u_{\alpha/2}$ and $u_{1-\alpha/2}$. In step 3 we compute the number of sequences to be processed, $a=n/2^m$, and split $x(t)$ into $a$ adjacent blocks (sequences).
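Step 2, the rejection limits, reduces to two quantiles of the standard normal distribution. Assuming a confidence level cl expressed in per cent (the convention used in the code below), a quick sketch is:

from scipy.stats import norm

cl = 99.9999                    # confidence level in per cent
alpha = 1.-cl/100.              # significance (rejection) level
u_lo = norm.ppf(alpha/2.)       # u_{alpha/2}
u_hi = norm.ppf(1.-alpha/2.)    # u_{1-alpha/2}
print("%.5f %.5f" % (u_lo, u_hi))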

Since it sounds complicated, let’s see how easy it is in fact. We design a function that splits the signal $X(t)$ into $a$ blocks of length $b$:

def xsequences(x):
    x=np.array(x)   # X(t) or x(t)
    if(len(x.shape)<2):   # make sure x is 1D array
        if(len(x)>3):     # accept x of min length of 4 elements (M=2)
            # check length of signal, adjust to 2**M if needed
            n=len(x)
            M=trunc(log(n,2))
            x=x[0:2**M]
            a=(2**(M/2))   # a number of adjacent sequences/blocks
            b=2**M/a       # a number of elements in each sequence
            y=np.reshape(x,(a,b))
            #print(y)
            return (y,x,a,b,M)
        else:
            print("xsequences(x): Array too short!")
            raise SystemExit
    else:
        print("xsequences(x): 1D array expected!")
        raise SystemExit

where the conversion of a Python list’s (or NumPy 1D array’s) values into a $\pm 1$ signal is obtained by:

def ret2bin(x):
    # Function converts list/np.ndarray values into +/-1 signal
    # (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
    Y=[]; ok=False
    # print(str(type(x))); print(x)   # debug output (disabled)
    if('numpy' in str(type(x)) and 'ndarray' in str(type(x))):
        x=x.tolist()
        ok=True
    elif('list' in str(type(x))):
        ok=True
    if(ok):
        for y in x:
            if(y<0):
                Y.append(-1)
            else:
                Y.append(1)
        return Y
    else:
        print("Error: neither 1D list nor 1D NumPy ndarray")
        raise SystemExit

To be perfectly informed about what took place, we may wish to display a summary of the pre-processed signal as follows:

def info(X,xt,a,b,M):
    line()
    print("Signal\t\tX(t)")
    print(" of length\tn = %d digits" % len(X))
    print("trimmed to\tx(t)")
    print(" of length\tn = %d digits (n=2^%d)" % (a*b,M))
    print(" split into\ta = %d sub-sequences " % a)
    print("\t\tb = %d-digit long" % b)
    print

Now, let’s see how this section of our data processing works in practice. As an example, first, we generate a random signal $X(t)$ of length $N=2^{10}+3$ with values spread in the interval of $[-0.2,0.2]$. That should provide us with some sort of feel for a financial return time-series (e.g. collected daily based on stock or FX-series trading). Next, employing the function xsequences, we trim the signal $X(t)$ down to $x(t)$ of length $n=2^{10}$, converting the return-series first into a $\pm 1$ sequence denoted by $x'(t)$. Finally, we split $x(t)$ into $a$ sub-sequences of length $b$, stored in $x_{\rm seq}$. An exemplary main program executing those steps could be coded as follows:

from WalshHadamard import ret2bin, xsequences, info
from random import uniform, seed
import numpy as np

# generate a random signal X(t)
seed(4515)
X=[uniform(-0.2,0.2) for i in range(2**10)]; X=X+[0.12,-0.04,0.01]
Xorg=X

# convert its values into +/-1 sequence
X=ret2bin(X)   # x'(t)

# split x'(t) into a blocks; save result in xseq 2D array
(xseq,x,a,b,M) = xsequences(X)

print("X(t) =")
for i in xrange(0,len(Xorg)):
    print("%10.5f" % Xorg[i])
print("\nx'(t) =")
print(np.array(X))
print("\nx(t) =")
print(x)
print("\nxseq =")
print(xseq)
print

info(X,x,a,b,M)

returning:

X(t) = 0.17496 0.07144 -0.15979 0.11344 0.08134 -0.01725 ... -0.16005 0.01419 -0.08748 -0.03218 -0.07908 -0.02781 0.12000 -0.04000 0.01000   X'(t) = [ 1 1 -1 ..., 1 -1 1]   x(t) = [ 1 1 -1 ..., -1 -1 -1]   xseq = [[ 1 1 -1 ..., -1 -1 1] [-1 -1 -1 ..., 1 -1 1] [-1 -1 1 ..., -1 1 1] ..., [-1 1 1 ..., 1 -1 1] [-1 1 1 ..., -1 -1 -1] [ 1 -1 -1 ..., -1 -1 -1]]   ---------------------------------------------------------------------- Signal X(t) of length n = 1027 digits trimmed to x(t) of length n = 1024 digits (n=2^10) split into a = 32 sub-sequences b = 32-digit long

Having such a framework for the initial input data, we are ready to program the WHT Statistical Test based on the Op09 recipe.

2.2. Test Statistic

The 2D matrix $x_{\rm seq}$ holding our signal under investigation is the starting point for its tests of randomness. Op09’s test is based on the computation of the WHT for each row of $x_{\rm seq}$ and on the t-statistics, $t_{ij}$, as a test function based on the Walsh-Hadamard transformation of all sub-sequences of $x(t)$.

It is assumed that for any signal $y(t_i)$ where $i=0,1,…$ the WHT returns a sequence $\{w_i\}$ and: (a) for $w_0$ the mean value is $m_0=2^m(1-2p)$; the variance is given by $\sigma^2_0=2^{m+2}p(1-p)$ and the distribution of $(w_0-m_0)/\sigma_0 \sim N(0,1)$ for $m>7$; (b) for $w_i$ ($i\ge 1$) the mean value is $m_i=0$; the variance is $\sigma^2_i=2^{m+2}p(1-p)$ and the distribution of $(w_i-m_i)/\sigma_i \sim N(0,1)$ for $m>7$. Recalling that $p$ stands for the probability of occurrence of the digit 1 in $x_{{\rm seq},j}$, for $p = 0.5$ (our desired test probability) the mean value of $w_i$ is equal to 0 for every $i$.

In the $x_{\rm seq}$ array, for every $j=0,1,…,(a-1)$ and for every $i=0,1,…,(b-1)$, we compute the t-statistic as follows:
$$t_{ij} = \frac{w_{ij} - m_i}{\sigma_i}$$ where $w_{ij}$ is the $i$-th Walsh-Hadamard transform component of the block $j$. In addition, we convert all $t_{ij}$ into $p$-values:
$$p{\rm-value} = P_{ij} = {\rm Pr}(X\ge t_{ij}) = 1-\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_{ij}} e^{\frac{-x^2}{2}} dx$$ such that $t_{ij}\sim N(0,1)$, i.e. it has a normal distribution with zero mean and unit standard deviation. Are you still with me? Great! The Python code for this part of our analysis may look in the following way:

def tstat(x,a,b,M):
    # Function computes the t-statistics and p-values for the WHTs of all sub-sequences
    # (assumes module-level imports in WalshHadamard.py: numpy as np,
    #  sqrt from math, ndtr from scipy.special, and the WHT function above)
    # specify the probability of occurrence of the digit "1"
    p=0.5
    print("Computation of WHTs...")
    for j in xrange(a):
        hwt, _, _ = WHT(x[j])
        if(j==0):
            y=hwt
        else:
            y=np.vstack((y,hwt))   # WHT for xseq
    print(" ...completed")
    print("Computation of t-statistics..."),
    t=[]
    for j in xrange(a):       # over sequences/blocks (rows)
        for i in xrange(b):   # over sequence's elements (columns)
            if(i==0):
                if(p==0.5):
                    m0j=0
                else:
                    m0j=(2.**M/2.)*(1.-2.*p)
                sig0j=sqrt((2**M/2)*p*(1.-p))
                w0j=y[j][i]
                t0j=(w0j-m0j)/sig0j
                t.append(t0j)
            else:
                sigij=sqrt((2.**((M+2.)/2.))*p*(1.-p))
                wij=y[j][i]
                tij=wij/sigij
                t.append(tij)
    t=np.array(t)
    print("completed")
    print("Computation of p-values..."),
    # standardised t-statistics; t_{i,j} ~ N(0,1)
    t=(t-np.mean(t))/(np.std(t))
    # p-values = 1-[1/sqrt(2*pi)*integral[exp(-x**2/2),x=-inf..t]]
    P=1-ndtr(t)
    print("completed\n")
    return(t,P,y)

and it returns, as the output, two 1D arrays storing the $t_{ij}$ statistics for every $w_{ij}$ (variable t) and the corresponding $p$-values $P_{ij}$ (variable P), plus the 2D array of the WHTs themselves (variable y).

Continuing the above example, we extend its body by adding what follows:

from WalshHadamard import tstat
import matplotlib.pyplot as plt   # needed for the plot below

(t, P, xseqWHT) = tstat(xseq,a,b,M)

print("t =")
print(t)
print("\np-values =")
print(P)
print("\nw =")
print(xseqWHT)

# display WHTs for "a" sequences
plt.imshow(xseqWHT, interpolation='nearest', cmap='PuOr')
plt.colorbar()
plt.xlabel("Sequence's Elements"); plt.ylabel("Sequence Number")
plt.title("Walsh-Hadamard Transforms")
plt.show()

to derive:

Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   t = [ 0.26609308 0.72730106 -0.69960094 ..., -1.41305193 -0.69960094 0.01385006]   p-values = [ 0.39508376 0.23352077 0.75791172 ..., 0.92117977 0.75791172 0.4944748 ]   w = [[ 0.125 0.125 -0.125 ..., 0.125 -0.125 -0.125 ] [ 0.1875 -0.1875 -0.0625 ..., -0.0625 0.0625 -0.0625] [ 0.0625 0.0625 -0.4375 ..., -0.0625 -0.0625 -0.0625] ..., [-0.1875 -0.3125 0.1875 ..., 0.1875 0.1875 -0.1875] [-0.125 0.125 -0.125 ..., -0.125 -0.125 -0.375 ] [ 0. 0.125 -0.25 ..., -0.25 -0.125 0. ]]

and, at the end, we display the WHTs computed for every single (horizontal) sequence of the $x_{\rm seq}$ array, as stored in the $w$ matrix by the tstat function:

Again, every element (a square) in the above figure corresponds to a $w_{ij}$ value. I would also like to draw your attention to the fact that both arrays $t$ and $P$ are 1D vectors storing all t-statistics and $p$-values, respectively. In the following WHT tests we will make an additional effort to split them into $a\times b$ matrices corresponding exactly to the $w$-like arrangement (see the reshape sketch below). That’s an easy part so don’t bother too much about it.
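That splitting is a single reshape; a minimal sketch, assuming t, P, a and b as defined above:

import numpy as np

# row j = sub-sequence j, column i = its i-th WHT component
rt = np.reshape(t, (a, b))
rP = np.reshape(P, (a, b))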

2.3. Statistical Test Framework

In general, here, we are interested in verification of two opposite statistical hypotheses. The concept of hypothesis testing is given in every textbook on the subject. We test our binary signal $x(t)$ for randomness. Therefore, $H_0$: $x(t)$ is generated by a binary memory-less source i.e. the signal does not contain any predictable component; $H_1$: $x(t)$ is not produced by a binary memory-less source, i.e. the signal contains a predictable component.

Within the standard framework of hypothesis testing we can make two errors. The first one refers to $\alpha$ (so-called significance level) and denotes the probability of occurrence of a false positive result. The second one refers to $\beta$ and denotes the probability of the occurrence of a false negative result.

The testing procedure that can be applied here is: for a fixed value of $\alpha$ we find a confidence region for the test statistic and check whether the statistical test value lies in the confidence region. The confidence limits are computed using the quantiles $u_{\alpha/2}$ and $u_{1-\alpha/2}$ (unless specified otherwise in the text of the test). Alternatively, if an arbitrary $t_{\rm stat}$ is the value of the test statistic (test function), we may compare the $p$-value$\ ={\rm Pr}(X\ge t_{\rm stat})$ with $\alpha$ and decide in favour of randomness when $p$-value$\ \ge\alpha$.

2.4. Crude Decision (Test 1)

The first WHT test from Op09’s suite is a crude decision, or majority decision. For a chosen $\alpha$, with $u_\alpha$ denoting the quantile of order $\alpha$ of the standard normal distribution, if:
$$t_{ij} \notin [u_{\alpha/2}; u_{1-\alpha/2}]$$ then reject the hypothesis of randomness regarding the $i$-th test statistic of the signal $x(t)$ at the significance level of $\alpha$. Jot down both $j$ and $i$, corresponding to the sequence number and the sequence’s element, respectively. Op09 suggest that this test is suitable for small numbers $a<1/\alpha$, which is generally always fulfilled for our data. We code this test in Python as follows:

def test1(cl,t,a,b,otest):
    # (assumes module-level: from scipy.stats import norm)
    alpha=1.-cl/100.
    u1=norm.ppf(alpha/2.)
    u2=norm.ppf(1-alpha/2.)
    Results1=[]
    for l in t:
        if(l<u1 or l>u2):
            Results1.append(0)
        else:
            Results1.append(1)
    nfail=a*b-np.sum(Results1)
    print("Test 1 (Crude Decision)")
    print(" RESULT: %d out of %d test variables stand for " \
          "randomness" % (a*b-nfail,a*b))
    if((a*b-nfail)/float(a*b)>.99):
        print("\t Signal x(t) appears to be random")
    else:
        print("\t Signal x(t) appears to be non-random")
    otest.append(100.*(a*b-nfail)/float(a*b))   # gather per cent of positive results
    print("\t at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

In our view, Op09’s decision rule on the rejection of $H_0$ is too stiff. Therefore, in the function we calculate the number of $t_{ij}$’s falling outside the test interval and claim a lack of evidence for the randomness of $x(t)$ as a whole only if their fraction exceeds 1%.

2.5. Proportion of Sequences Passing a Test (Test 2)

Recall that for each row (sub-sequence of $x(t)$) and its elements we have computed both the $t_{ij}$’s and the $P_{ij}$ values. Let’s use the latter here. In this test, first, we count for every row of the (re-shaped) $P$ 2D array the number of $p$-values with $P_{ij}\lt\alpha$. If this number is greater than zero, the $j$-th sub-sequence of $x(t)$ fails the test at the significance level of $\alpha$. Over all $a$ sub-sequences we count the total number of those which did not pass the test, $n_2$. If:
$$n_2 \notin \left[ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{\alpha/2};\ a\alpha + \sqrt{a \alpha (1-\alpha)}\, u_{1-\alpha/2} \right]$$ then there is evidence that the signal $x(t)$ is non-random.

We code this test simply as:

def test2(cl,P,a,b,otest):
    alpha=1.-cl/100.
    u1=norm.ppf(alpha/2.)
    u2=norm.ppf(1-alpha/2.)
    Results2=[]
    rP=np.reshape(P,(a,b))   # turning P 1D-vector into (a x b) 2D array!
    for j in xrange(a):
        tmp=rP[j][(rP[j]<alpha)]
        #print(tmp)
        if(len(tmp)>0):
            Results2.append(0)   # fail for sub-sequence
        else:
            Results2.append(1)   # pass

    nfail2=a-np.sum(Results2)    # total number of sub-sequences which failed
    t2=nfail2/float(a)
    print("Test 2 (Proportion of Sequences Passing a Test)")
    b1=alpha*a+sqrt(a*alpha*(1-alpha))*u1
    b2=alpha*a+sqrt(a*alpha*(1-alpha))*u2
    if(t2<b1 or t2>b2):
        print(" RESULT: Signal x(t) appears to be non-random")
        otest.append(0.)
    else:
        print(" RESULT: Signal x(t) appears to be random")
        otest.append(100.)
    print("\t at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

This test is also described well by Rukhin et al. (2010) though we follow the method of Op09 adjusted for the proportion of $p$-values failing the test for randomness as counted sub-sequence by sub-sequence.

2.7. Uniformity of p-values (Test 3)

In this test, the distribution of $p$-values is examined to ensure uniformity. This may be visually illustrated using a histogram, whereby, the interval between 0 and 1 is divided into $K=10$ sub-intervals, and the $p$-values, i.e. in our case $P_{ij}$’s, that lie within each sub-interval are counted and displayed.

Uniformity may also be determined via an application of a $\chi^2$ test and the determination of a $p$-value corresponding to the Goodness-of-Fit Distributional Test on the $p$-values obtained for an arbitrary statistical test (i.e., a $p$-value of the $p$-values). We accomplish that via computation of the test statistic:
$$\chi^2 = \sum_{i=1}^{K} \frac{\left( F_i-\frac{a}{K} \right)^2}{\frac{a}{K}}$$ where $F_i$ is the number of $P_{ij}$ in the histogram’s bin of $i$, and $a$ is the number of sub-sequences of $x(t)$ we investigate.

We reject the hypothesis of randomness regarding $i$-th test statistic $t_{ij}$ of $x(t)$ at the significance level of $\alpha$ if:
$$\chi^2_i \notin [0; \chi^2(\alpha, K-1)] \ .$$ Here $\chi^2(\alpha, K-1)$ denotes the critical value of the $\chi^2(K-1)$ distribution at the significance level $\alpha$, i.e. its quantile of order $1-\alpha$. In Python we may calculate it in the following way:

from scipy.stats import chi2

alpha=0.001
K=10
# critical value chi^2(alpha, K-1), i.e. the quantile of order 1-alpha of chi^2(K-1)
print(chi2.ppf(1.-alpha, K-1))

If our Test 3 value $\chi_i^2\le \chi^2(\alpha, K-1)$, then we count the $i$-th statistic as not standing against the randomness of $x(t)$. This is equivalent to requiring that the $p$-value corresponding to $\chi^2_i$ satisfies $p\ge\alpha$. The latter can be computed in Python as:

from scipy.special import gammainc

alpha=0.001
K=10
# chi2_test: a previously derived chi^2 test value (cf. Test 3 below)
pvalue=1-gammainc((K-1)/2., chi2_test/2.)

where gammainc(a,x) stands for the regularised (lower) incomplete gamma function defined as:
$$P(a,x) = \frac{1}{\Gamma(a)} \int_0^x e^{-t} t^{a-1} dt$$ and $\Gamma(a)$ denotes the standard gamma function.
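As a quick sanity check, the $p$-value defined above is nothing else than the $\chi^2$ survival function with $K-1$ degrees of freedom, so both routes should agree; chi2_test below is a hypothetical test value used for illustration only:

from scipy.stats import chi2
from scipy.special import gammainc

K = 10
chi2_test = 12.3   # hypothetical chi^2 value
p1 = 1-gammainc((K-1)/2., chi2_test/2.)
p2 = chi2.sf(chi2_test, K-1)
print("%.6f %.6f" % (p1, p2))   # both agree to numerical precision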

Given that, we code Test 3 of Uniformity of $p$-values in the following way:

def test3(cl,P,a,b,otest):
    # (assumes module-level: matplotlib.pyplot as plt, gammainc from scipy.special;
    #  Chi2(alpha,dof) is the module's helper returning the chi^2 critical value
    #  discussed in the text above)
    alpha=1.-cl/100.
    rP=np.reshape(P,(a,b))
    rPT=rP.T
    Results3=0
    for i in xrange(b):
        (hist,bin_edges,_)=plt.hist(rPT[i], bins=list(np.arange(0.0,1.1,0.1)))
        F=hist
        K=len(hist)   # K=10 for bins as defined above
        S=a
        chi2=0
        for j in xrange(K):
            chi2+=((F[j]-S/K)**2.)/(S/K)
        pvalue=1-gammainc(9/2.,chi2/2.)
        if(pvalue>=alpha and chi2<=Chi2(alpha,K-1)):
            Results3+=1
    print("Test 3 (Uniformity of p-values)")
    print(" RESULT: %d out of %d test variables stand for randomness"\
          % (Results3,b))
    if((Results3/float(b))>.99):
        print("\t Signal x(t) appears to be random")
    else:
        print("\t Signal x(t) appears to be non-random")
    otest.append(100.*(Results3/float(b)))
    print("\t at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

where, again, we allow for up to 1% of all test variables to fail without rejecting $x(t)$ as a random signal.

Please note the transposition of the rP matrix. The reason for testing the WHT values (converted to $t_{ij}$’s or $P_{ij}$’s) this way is to detect autocorrelation patterns in the tested signal $x(t)$. The same approach has been applied by Op09 in Test 4 and Test 5, as discussed in the following sections, and it constitutes the core of the invention and creativity added to Rukhin et al. (2010)’s suite of 16 tests for signals generated by various PRNGs to meet the cryptographic levels of acceptance as close-to-truly random.

2.8. Maximum Value Decision (Test 4)

This test is again based on the confidence-levels approach. Let $T_i=\max_j t_{ij}$; then if:
$$T_i \notin \left[ u_{\left(\frac{\alpha}{2}\right)^{a^{-1}}}; u_{\left(1-\frac{\alpha}{2}\right)^{a^{-1}}} \right]$$ then reject the hypothesis of randomness (regarding the $i$-th test function) of the signal $x(t)$ at the significance level of $\alpha$. We encode it in Python as simply as that:

def test4(cl,t,a,b,otest):
    alpha=1.-cl/100.
    rt=np.reshape(t,(a,b))
    rtT=rt.T
    Results4=0
    for i in xrange(b):
        tmp=np.max(rtT[i])
        u1=norm.ppf((alpha/2.)**(1./a))
        u2=norm.ppf((1.-alpha/2.)**(1./a))
        if not(tmp<u1 or tmp>u2):
            Results4+=1
    print("Test 4 (Maximum Value Decision)")
    print(" RESULT: %d out of %d test variables stand for randomness" % (Results4,b))
    if((Results4/float(b))>.99):
        print("\t Signal x(t) appears to be random")
    else:
        print("\t Signal x(t) appears to be non-random")
    otest.append(100.*(Results4/float(b)))
    print("\t at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

Pay attention to how this test looks at the results derived from the WHTs: it is sensitive to the distribution of maximal values across the $i$-th elements of the $t$-statistics.

2.9. Sum of Square Decision (Test 5)

The final test makes use of the $C$-statistic defined as:
$$C_i = \sum_{j=0}^{a-1} t_{ij}^2 \ .$$ If $C_i \notin [0; \chi^2(\alpha, a)]$, we reject the hypothesis of randomness of $x(t)$ at the significance level of $\alpha$ regarding the $i$-th test function. In Python this test takes the form:

def test5(cl,t,a,b,otest):
    alpha=1.-cl/100.
    rt=np.reshape(t,(a,b))
    rtT=rt.T
    Results5=0
    for i in xrange(b):
        Ci=0
        for j in xrange(a):
            Ci+=(rtT[i][j])**2.
        if(Ci<=Chi2(alpha,a)):
            Results5+=1
    print("Test 5 (Sum of Square Decision)")
    print(" RESULT: %d out of %d test variables stand for randomness" % (Results5,b))
    if((Results5/float(b))>.99):
        print("\t Signal x(t) appears to be random")
    else:
        print("\t Signal x(t) appears to be non-random")
    otest.append(100.*(Results5/float(b)))
    print("\t at %.5f%% confidence level" % (100.*(1.-alpha)))
    print
    return(otest)

Again, we allow for 1% of false negative results.

2.10. The Overall Test for Randomness of Binary Signal

We accept signal $x(t)$ to be random if the average passing rate from all five WHT statistical tests is greater than 99%, i.e. 1% can be due to false negative results, at the significance level of $\alpha$.

def overalltest(cl,otest):
    alpha=1.-cl/100.
    line()
    print("THE OVERALL RESULT:")
    if(np.mean(otest)>=99.0):
        print("   Signal x(t) displays an evidence for RANDOMNESS"),
        T=1
    else:
        print("   Signal x(t) displays an evidence for NON-RANDOMNESS"),
        T=0
    print("at %.5f%% c.l." % (100.*(1.-alpha)))
    print("   based on Walsh-Hadamard Transform Statistical Test\n")
    return(T)

and we run all 5 tests by calling the following function:

def WHTStatTest(cl,X):
    (xseq,xt,a,b,M) = xsequences(X)
    info(X,xt,a,b,M)
    if(M<7):
        line()
        print("Error: Signal x(t) too short for WHT Statistical Test")
        print(" Acceptable minimal signal length: n=2^7=128\n")
    else:
        if(M>=7 and M<19):
            line()
            print("Warning: Statistically advisable signal length: n=2^19=524288\n")
        line()
        print("Test Name: Walsh-Hadamard Transform Statistical Test\n")
        (t, P, _) = tstat(xseq,a,b,M)
        otest=test1(cl,t,a,b,[])
        otest=test2(cl,P,a,b,otest)
        otest=test3(cl,P,a,b,otest)
        otest=test4(cl,t,a,b,otest)
        otest=test5(cl,t,a,b,otest)
        T=overalltest(cl,otest)
        return(T)   # 1 if x(t) is random

fed by a binary $\pm 1$ signal $X$ (see the example in Section 2.1). The last function returns the variable T storing 1 for the overall decision that $x(t)$ is random and 0 otherwise. It can be used to run the WHT test repeatedly for a great number of different signals in a loop, e.g. to determine the ratio of instances in which the WHT Statistical Test is passed (see the sketch below).
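For instance, a minimal sketch of such a loop (the number of trials and the signal length are arbitrary here), estimating the fraction of Mersenne Twister signals that pass the whole suite, might read:

from WalshHadamard import WHTStatTest
from random import randrange

cl = 99.9999
trials = 20
passed = 0
for k in range(trials):
    X = [randrange(-1, 2, 2) for i in xrange(2**16)]   # a fresh +/-1 signal
    passed += WHTStatTest(cl, X)                       # 1 if random, 0 otherwise
print("Pass ratio: %.2f" % (passed/float(trials)))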

3. Randomness of random()

I know what you think right now. I have just spent an hour reading all that stuff so far, how about some real-life tests? I’m glad you asked! Here we go! We start with Mersenne Twister algorithm being the Rolls-Royce engine of Python’s random() function (and its derivatives). The whole fun of the theoretical part given above comes down to a few lines of code as given below.

Let’s see our WHT Suite of 5 Statistical Tests in action for a very long (of length $n=2^{21}$) random binary signal of $\pm 1$ form. Let’s run the exemplary main program calling the test:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# Mersenne Twister PRNG test using WHT Statistical Test

from WalshHadamard import WHTStatTest, line
from random import randrange

# define confidence level for WHT Statistical Test
cl=99.9999

# generate random binary signal X(t)
X=[randrange(-1,2,2) for i in xrange(2**21)]
line()
print("X(t) =")
for i in range(20):
    print(X[i]),
print("...")

WHTStatTest(cl,X)

which returns lovely results, for instance:

---------------------------------------------------------------------- X(t) = -1 -1 1 1 1 -1 1 -1 1 1 1 1 1 -1 -1 1 1 -1 1 1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 2097152 digits trimmed to x(t) of length n = 2097152 digits (n=2^21) split into a = 1024 sub-sequences b = 2048-digit long   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 2097149 out of 2097152 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 2045 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 2047 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 2048 out of 2048 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

where we assumed the significance level of $\alpha=0.000001$. Impressive indeed!

At the same $\alpha$, however, for a shorter signal and sub-sequences ($a=256$; $n=2^{16}$), we still get a significant number of passed tests supporting the randomness of random(), for instance:

---------------------------------------------------------------------- X(t) = -1 -1 1 -1 1 1 1 1 1 -1 -1 -1 -1 -1 -1 1 1 -1 1 -1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 65536 digits trimmed to x(t) of length n = 65536 digits (n=2^16) split into a = 256 sub-sequences b = 256-digit long   ---------------------------------------------------------------------- Warning: Statistically advisable signal length: n=2^19=524288   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 65536 out of 65536 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 254 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 255 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 256 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

For this random binary signal $x(t)$, the Walsh-Hadamard Transforms for the first 64 signal sub-sequences reveal pretty “random” distributions of $w_{ij}$ values:

4. Randomness of Financial Return-Series

Eventually, we stand face to face in the grand finale with the question: can financial return time-series be of a random nature? More precisely, if we convert any return-series (regardless of time step, frequency of trading, etc.) to the binary $\pm 1$ signal, do the corresponding positive and negative returns occur randomly or not? Good question.

Let’s see, using the example of the 30-min FX return-series of the USDCHF currency pair traded between Sep/2009 and Nov/2014, how our WHT test for randomness works. We use tick-data downloaded from Pepperstone.com and rebinned up to an evenly distributed 30-min price series, as described in my earlier post Rebinning Tick-Data for FX Algo Traders. We read the data in Python and convert the price-series into a return-series.

What follows is similar to what we have done in the previous examples. First, we convert the return-series into a binary signal. As we will see, the signal $X(t)$ is 65732 points long and can be split into 256 sub-sequences, each 256 points long. Therefore $n=2^{16}$ stands for the trimmed signal $x(t)$. For the sake of clarity, we plot the first 64 segments (256 points long) of the USDCHF price-series, marking all segments with vertical lines. Next, we run the WHT Statistical Test for all $a=256$ sequences, but we display the WHTs for only the first 64 blocks, as visualised for the price-series. The main code takes the form:

# Walsh-Hadamard Transform and Tests for Randomness of Financial Return-Series
# (c) 2015 QuantAtRisk.com, by Pawel Lachowicz
#
# 30-min FX time-series (USDCHF) traded between 9/2009 and 11/2014

from WalshHadamard import WHTStatTest, ret2bin, line as Line
import matplotlib.pyplot as plt
import numpy as np
import csv

# define confidence level for WHT Statistical Test
cl=99.9999

# open the file and read in the 30min price of USD/CHF
P=[]
with open("USDCHF.30m") as f:
    c = csv.reader(f, delimiter=' ', skipinitialspace=True)
    for line in c:
        price=line[6]
        P.append(price)

x=np.array(P,dtype=np.float128)   # convert to numpy array
r=x[1:]/x[0:-1]-1.                # get a return-series
Line()
print("r(t) =")
for i in range(7):
    print("%8.5f" % r[i]),
print("...")

X=ret2bin(r)
print("X(t) =")
for i in range(7):
    print("%8.0f" % X[i]),
print("...")

plt.plot(P)
plt.xlabel("Time (from 1/05/2009 to 27/09/2010)")
plt.ylabel("USDCHF (30min)")
#plt.xlim(0,len(P))
plt.ylim(0.95,1.2)
plt.xlim(0,64*256)
plt.gca().xaxis.set_major_locator(plt.NullLocator())
for x in range(0,256*265,256):
    plt.hold(True)
    plt.plot((x,x), (0,10), 'k-')
plt.show()
plt.close("all")

WHTStatTest(cl,X)

and displays both plots as follows: (a) USDCHF clipped price-series,

and (b) the WHTs for the first 64 sequences, 256 points long,

From the comparison of both figures one can understand the level of detail at which the WHT results are derived. Interestingly, for the FX return-series the WHT picture seems to be quite non-uniform, suggesting that our USDCHF return-series is random. Is it so? The final answer is delivered by the results of the WHT statistical test, summarised by our program as follows:

---------------------------------------------------------------------- r(t) = -0.00092 -0.00033 -0.00018 0.00069 -0.00009 -0.00003 -0.00086 ... X(t) = -1 -1 -1 1 -1 -1 -1 ... ---------------------------------------------------------------------- Signal X(t) of length n = 65731 digits trimmed to x(t) of length n = 65536 digits (n=2^16) split into a = 256 sub-sequences b = 256-digit long   ---------------------------------------------------------------------- Warning: Statistically advisable signal length: n=2^19=524288   ---------------------------------------------------------------------- Test Name: Walsh-Hadamard Transform Statistical Test   Computation of WHTs... ...completed Computation of t-statistics... completed Computation of p-values... completed   Test 1 (Crude Decision) RESULT: 65536 out of 65536 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 2 (Proportion of Sequences Passing a Test) RESULT: Signal x(t) appears to be random at 99.99990% confidence level   Test 3 (Uniformity of p-values) RESULT: 250 out of 256 test variables stand for randomness Signal x(t) appears to be non-random at 99.99990% confidence level   Test 4 (Maximum Value Decision) RESULT: 255 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   Test 5 (Sum of Square Decision) RESULT: 256 out of 256 test variables stand for randomness Signal x(t) appears to be random at 99.99990% confidence level   ---------------------------------------------------------------------- THE OVERALL RESULT: Signal x(t) displays an evidence for RANDOMNESS at 99.99990% c.l. based on Walsh-Hadamard Transform Statistical Test

Nice!! Wasn’t it worth all the effort to see this result?!

I’ll leave you with the pure joy of using the software I created. There is a lot to say and a lot to verify. Even the code itself can be amended a bit and adjusted for different numbers of sequences $a$ and block lengths $b$. If you discover strong evidence for non-randomness, post it in the comments or drop me an e-mail. I can’t do your homework. It’s your turn. I need to sleep sometimes… ;-)

REFERENCES

## Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As of now, they offer tick-data sets of the 15 most frequently traded currency pairs since May 2009. Some of the unzipped files (one month of data) reach over 400 MB in size, i.e. they store 8.5+ million lines with a tick resolution for both bid and ask “prices”. A good thing is you can download them all free of charge and their quality is regarded as very high. A bad thing is that there is a 3-month delay in data accessibility.

Rebinning the tick-data up to coarser time bins is a different story and the subject of this post. We will see how efficiently you can turn Pepperstone’s Tick-Data set(s) into 5-min time-series as an example. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

Data Structure

You can download Pepperstone’s historical tick-data from here, month by month, pair by pair. Their inner structure follows the same pattern, namely:

$ head AUDUSD-2014-09.csv
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296

The columns, from left to right, represent respectively: the pair name, the date and tick-time, the bid price, and the ask price.

Pre-Processing

Here, for each .csv file, we aim to split the date into year, month, and day separately, and to remove commas and colons in order to get raw data ready to be read in as a matrix (array) using any other programming language (e.g. Matlab or Python). The matrix is a mathematically intuitive data structure, therefore making a direct reference to any specific column of it keeps any backtesting engine running at full thrust.

Let’s play with the AUDUSD-2014-09.csv data file. Working in the same directory where the file is located, we begin with writing a bash script (pp.scr) that contains:

1  # pp.scr
2  # Rebinning Pepperstone.com Tick-Data for FX Algo Traders
3  # (c) 2014 QuantAtRisk, by Pawel Lachowicz
4
5  clear
6  echo "..making a sorted list of .csv files"
7  for i in $1-*.csv; do echo ${i##$1-} $i ${i##.csv}; done | sort -n | \
8      awk '{print $2}' > $1.lst
9
10 python pp.py
11 head AUDUSD.pp

that you run in Terminal:

$ chmod +x pp.scr
$ ./pp.scr AUDUSD

where the first command makes sure the script becomes executable (you need to perform this task only once). Lines #7-8 of our script, in fact, look for all .csv data files in the local directory starting with the AUDUSD- prefix and create their list in the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

$cat AUDUSD.lst AUDUSD-2014-09.csv as expected. Next, we utilise the power and flexibility of Python in the following way: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 # pp.py import csv fnlst="AUDUSD.lst" fnout="AUDUSD.pp" for lstline in open(fnlst,'r').readlines(): fncur=lstline[:-1] #print(fncur) with open(fnout,'w') as f: writer=csv.writer(f,delimiter=" ") i=1 # counts a number of lines with tick-data for line in open(fncur,'r').readlines(): if(i<=5200): # replace with (i>0) to process an entire file #print(line) year=line[8:12] month=line[12:14] day=line[14:16] hh=line[17:19] mm=line[20:22] ss=line[23:29] bidask=line[30:] writer.writerow([year,month,day,hh,mm,ss,bidask]) i+=1 It is a pretty efficient way to open really a big file and process its information line by line. Just for further purpose of display, in the code we told computer to process only first 5,200 of lines. The output of lines #10-11 of pp.scr is the following: 2014 09 01 00 00 01.323 "0.93289,0.93297 " 2014 09 01 00 00 02.138 "0.9329,0.93297 " 2014 09 01 00 00 02.156 "0.9329,0.93298 " 2014 09 01 00 00 02.264 "0.9329,0.93297 " 2014 09 01 00 00 02.265 "0.9329,0.93293 " since we allowed Python to save bid and ask information as one string (due to a variable number of decimal digits). In order to clean this mess we continue: 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 # pp.scr (continued) echo "..removing token: comma" sed 's/,/ /g' AUDUSD.pp >$1.tmp rm AUDUSD.pp   echo "..removing token: double quotes" sed 's/"/ /g' $1.tmp >$1.tmp2 rm $1.tmp echo "..removing empty lines" sed -i '/^[[:space:]]*$/d' $1.tmp2 mv$1.tmp2 AUDUSD.pp   echo "head..." head AUDUSD.pp echo "tail..." tail AUDUSD.pp

which brings us to the pre-processed data:

..removing token: comma ..removing token: double quotes ..removing empty lines head... 2014 09 01 00 00 01.323 0.93289 0.93297 2014 09 01 00 00 02.138 0.9329 0.93297 2014 09 01 00 00 02.156 0.9329 0.93298 2014 09 01 00 00 02.264 0.9329 0.93297 2014 09 01 00 00 02.265 0.9329 0.93293 2014 09 01 00 00 02.265 0.93289 0.93293 2014 09 01 00 00 02.268 0.93289 0.93295 2014 09 01 00 00 02.277 0.93289 0.93296 2014 09 01 00 00 02.278 0.9329 0.93296 2014 09 01 00 00 02.297 0.93288 0.93296 tail... 2014 09 02 00 54 39.324 0.93317 0.93321 2014 09 02 00 54 39.533 0.93319 0.93321 2014 09 02 00 54 39.543 0.93318 0.93321 2014 09 02 00 54 39.559 0.93321 0.93321 2014 09 02 00 54 39.784 0.9332 0.93321 2014 09 02 00 54 39.798 0.93319 0.93321 2014 09 02 00 54 39.885 0.93319 0.93325 2014 09 02 00 54 39.886 0.93319 0.93321 2014 09 02 00 54 40.802 0.9332 0.93321 2014 09 02 00 54 48.829 0.93319 0.93321

Personally, I love that part, as you can learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to repeat the same in Microsoft Windows without spending more than 30 seconds on it.

Rebinning: 5-min Data

Rebinning has many schools. It’s an art for some people. We just want to have the job done. I opt for simplicity and understanding of the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

We want to derive the closest possible (or fairest) price estimation every 5 min, denoted in the sketch above by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick data points from the left and from the right. That creates an under- or overestimation of the mid-price.

If we trade live, every 5 min we receive information on the last tick point before the minute hits 5, and we wait for the next tick point after 5 (blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we look at here is sometimes $10^{-5}$. It is not of much significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The cons of the old-school approach: a possibly high volatility among all tick-data within the last 5 minutes that we would neglect.
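As a side note, the same last-tick-before/first-tick-after idea can be sketched in a few lines of pandas. The DataFrame df below, with a DatetimeIndex and bid/ask columns, is purely hypothetical and serves as an illustration only; it is not the pp2.py code that follows:

import pandas as pd

# hypothetical tick data with a DatetimeIndex and 'bid'/'ask' columns
idx = pd.to_datetime(["2014-09-01 00:04:58.3", "2014-09-01 00:05:01.2",
                      "2014-09-01 00:09:59.1", "2014-09-01 00:10:00.8"])
df = pd.DataFrame({"bid": [0.93289, 0.93290, 0.93301, 0.93299],
                   "ask": [0.93297, 0.93298, 0.93308, 0.93306]}, index=idx)

mid = (df["bid"] + df["ask"])/2.0
last_before = mid.resample("5min").last()             # last tick within each 5-min bin
first_after = mid.resample("5min").first().shift(-1)  # first tick of the following bin
# each estimate refers to the right edge of its bin, hence the shift by one bin
price5 = ((last_before + first_after)/2.0).shift(1).dropna()
print(price5)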

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 # pp2.py import csv import numpy as np   def convert(data): tempDATA = [] for i in data: tempDATA.append([float(j) for j in i.split()]) return np.array(tempDATA).T   fname="AUDUSD.pp"   with open(fname) as f: data = f.read().splitlines()   #print(data)   i=1 for d in data: list=[s for s in d.split(' ')] #print(list) # remover empty element in the list dd=[x for x in list if x] #print(dd) tmp=convert(dd) #print(tmp) if(i==1): a=tmp i+=1 else: a = np.vstack([a, tmp]) i+=1   N=i-1 #print("N = %d" % N)   # print the first line tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2]) print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6])) m=tmp   # check the boundary conditions (5 min bins) for i in xrange(2,N-1): if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0)):   # BLUE MARKER No. 1 # (print for i-1) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],a[i-1][5],a[i-1][6],a[i-1][7])) b1=a[i-1][6] b2=a[i][6] a1=a[i-1][7] a2=a[i][7] # mid-price, and new date for 5 min bin bm=(b1+b2)/2 am=(a1+a2)/2 Ym=a[i][0] Mm=a[i][1] Dm=a[i][2] Hm=a[i][3] MMm=a[i][4] Sm=0.0 # set seconds to zero   # RED MARKER print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" % (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2)) tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2]) m=np.vstack([m, tmp])   # BLUE MARKER No. 2 # (print for i) #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" % # (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],a[i][5],a[i][6],a[i][7]))

which you run from the pp.scr file as:

31 # pp.scr (continued)
32
33 python pp2.py > AUDUSD.dat

in order to get 5-min rebinned FX time-series as follows:

$head AUDUSD.dat 2014 9 1 0 0 0.000 0.932935 2014 9 1 0 5 0.000 0.933023 2014 9 1 0 10 0.000 0.932917 2014 9 1 0 15 0.000 0.932928 2014 9 1 0 20 0.000 0.932937 2014 9 1 0 25 0.000 0.933037 2014 9 1 0 30 0.000 0.933075 2014 9 1 0 35 0.000 0.933070 2014 9 1 0 40 0.000 0.933092 2014 9 1 0 45 0.000 0.933063 That concludes our efforts. Happy rebinning! ## GPU-Accelerated Finance in Python with NumbaPro Library. Really? When in 2010 I lived in Singapore I crossed my life path with some great guys working in the field of High-Performance Computing: Łukasz Orłowski, Marek T. Michalewicz, Iain Bell from Quadrant Capital. They both pointed my attention towards GPU computations utilizing Nvidia CUDA architecture. There was one problem. Everything was wrapped up with C syntax with a promise to do it more efficiently in C++. Soon. So I waited and studied C/C++ at least at the level allowing me to understand some CUDA codes. Years were passing by until the day when I discovered an article of Mark Harris, NumbaPro: High-Performance Python with CUDA Acceleration, delivering Python-friendly CUDA solutions to all my nightmares involving C/C++ coding. At this point you just need to understand one thing: not every quant or algo trader is a fan of C/C++ the same as some people prefer Volvo to Audi, including myself ;) Let’s have a sincere look at what the game is about. It is more than tempting to put your hands on a piece of code that allows you to speed-up some quantitative computations in Python making use of a new library. Accelerate your Python I absolutely love Continuum Analytics for the mission they stand for: making Python language easily accessible and used by everyone worldwide! It is a great language with a great syntax, easy to pick up, easy to be utilised in the learning process of the fundamentals of programming. Thanks to them, now you can download and install Python’s distribution of Anaconda for Windows, Mac OS X, or Linux just in few minutes (see my earlier post on Setting up Python for Quantitative Analysis in OS X 10.10 Yosemite as an example). When you visit their webpage you can spot Anaconda’s Add-Ons, three additional software packages to their Python distribution. Among them, they offer Accelerate module containing NumbaPro library. Once you read the description, and I quote, Accelerate is an add-on to Continuum’s free enterprise Python distribution, Anaconda. It opens up the full capabilities of your GPU or multi-core processor to Python. Accelerate includes two packages that can be added to your Python installation: NumbaPro and MKL Optimizations. MKL Optimizations makes linear algebra, random number generation, Fourier transforms, and many other operations run faster and in parallel. NumbaPro builds fast GPU and multi-core machine code from easy-to-read Python and NumPy code with a Python-to-GPU compiler. NumbaPro Features – NumbaPro compiler targets multi-core CPU and GPUs directly from simple Python syntax – Easily move vectorized NumPy functions to the GPU – Multiple CUDA device support – Bindings for CUDA libraries, including cuBlas, cuRand, cuSparse, and cuFFT – Support for array slicing and fast array math – Use multiple threads without worrying about the GIL – Supported on NVIDIA CUDA-enabled GPUs with compute capability 2.0 or above on Intel/AMD (x86) processors. your blood pressure increases and the level of endorphins skyrockets. Why? Simply because of the promise to do some tasks faster utilising GPU in a parallel mode! 
If you are new to GPU or CUDA I recommend you to read some well written posts on Mike’s website, for instance, Installing Nvidia CUDA on Mac OSX for GPU-based Parallel Computing or Monte Carlo Simulations in CUDA – Barrier Option Pricing. You will grasp the essence of what is all about. In general, much ado about CUDA is still around making use of your GPU and proving this extra upmhhh in speedup. If you have any quantitative problem in mind and it can be executed in the parallel mode, NumbaPro is a tool you need to look at but – not every engine sounds the same. Hold on till the end of this post. It will be worth it. Selling the Speed When you approach a new concept or a new product and someone tries to sell it to you, he needs to impress you to win your attention and boost your curiosity. Imagine for a moment that you have no idea about GPU or CUDA and you want to add two vectors. In Python you can do it as follows: import numpy as np from timeit import default_timer as timer def VectorAdd(a,b,c): for i in xrange(a.size): c[i]=a[i]+b[i] def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() VectorAdd(A,B,C) totaltime=timer()-start print("\nCPU time: %g sec" % totaltime) if __name__ == '__main__': main() We all know that once you run the code, Python does not compile it, it goes line by line and interprets what it reads. So, we aim at adding two vectors,$A$and$B$, containing 32 millions of elements. I get$C=A+B$matrix on my MacBook Pro (2.6 GHz Intel Core i7, 16 GB 1600 MHz DDR3 RAM, NVIDIA GeForce GT 650M 1GB) after: CPU time: 9.89753 sec Can we do it better? With NumbaPro the required changes to the code itself are minor. All we need to add is a function decorator that tells how and where the function should be executed. In fact, what NumbaPro does is that it “compiles” VectorAdd function on-the-go and deploys computations to the GPU unit: import numpy as np from timeit import default_timer as timer from numbapro import vectorize @vectorize(["float32(float32,float32)"], target="gpu") def VectorAdd(a,b): return a+b def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() C=VectorAdd(A,B) totaltime=timer()-start print("\nGPU time: %g sec" % totaltime) if __name__ == '__main__': main() We get GPU time: 0.286101 sec i.e. 34.6x speed-up. Not bad, right?! Not bad if you’re a sale person indeed! But, hey, what’s that?: import numpy as np from timeit import default_timer as timer def main(): N=32000000 A=np.ones(N, dtype=np.float32) B=np.ones(N, dtype=np.float32) C=np.zeros(N, dtype=np.float32) start=timer() C=A+B totaltime=timer()-start print("\nCPU time: %g sec" % totaltime) if __name__ == '__main__': main() Run it to discover that: CPU time: 0.0592878 sec i.e. 4.82x faster than using GPU. Oh, boy! CUDA:NumPy (0:1). Perfect Pitch When Nvidia introduced CUDA among some exemplary C codes utilising CUDA programming we could find an immortal Black-Scholes model for option pricing. 
In this Nobel-prize winning solution, we derive a call option price for non-dividend-paying underlying stock: $$C(S,t) = N(d_1)S – N(d_2)Ke^{-r(T-t)}$$ where$(T-t)$is the time to maturity (scalar),$r$is the risk free rate (scalar),$S$is the spot price of the underlying asset,$K$is the strike price, and$\sigma$is the volatility of returns of the underlying asset.$N(\dot)$is the cumulative distribution function (cnd) of the standard normal distribution and has an analytical form. A classical way to code it in Python is: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 import numpy as np import time RISKFREE = 0.02 VOLATILITY = 0.30 def cnd(d): A1 = 0.31938153 A2 = -0.356563782 A3 = 1.781477937 A4 = -1.821255978 A5 = 1.330274429 RSQRT2PI = 0.39894228040143267793994605993438 K = 1.0 / (1.0 + 0.2316419 * np.abs(d)) ret_val = (RSQRT2PI * np.exp(-0.5 * d * d) * (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))) return np.where(d > 0, 1.0 - ret_val, ret_val) def black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears, Riskfree, Volatility): S = stockPrice X = optionStrike T = optionYears R = Riskfree V = Volatility sqrtT = np.sqrt(T) d1 = (np.log(S / X) + (R + 0.5 * V * V) * T) / (V * sqrtT) d2 = d1 - V * sqrtT cndd1 = cnd(d1) cndd2 = cnd(d2) expRT = np.exp(- R * T) callResult[:] = (S * cndd1 - X * expRT * cndd2) def randfloat(rand_var, low, high): return (1.0 - rand_var) * low + rand_var * high def main (*args): OPT_N = 4000000 iterations = 10 if len(args) >= 2: iterations = int(args[0]) callResult = np.zeros(OPT_N) stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0) optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0) optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0) time0 = time.time() for i in range(iterations): black_scholes(callResult, putResult, stockPrice, optionStrike, optionYears, RISKFREE, VOLATILITY) time1 = time.time() print("Time: %f msec per option" % ((time1-time0)/iterations/OPT_N*1000)) if __name__ == "__main__": import sys main(*sys.argv[1:]) what returns Time: 0.000192 msec per option The essence of this code is to derive 4 million independent results based on feeding the function with random stock prices, option strike prices, and times to maturity. They enter the game under cover as row vectors with randomised values (see lines #45-47). 
Anaconda Accelerate’s CUDA solution for the same code is: import numpy as np import math import time from numba import * from numbapro import cuda from blackscholes import black_scholes # save the previous code as # black_scholes.py RISKFREE = 0.02 VOLATILITY = 0.30 A1 = 0.31938153 A2 = -0.356563782 A3 = 1.781477937 A4 = -1.821255978 A5 = 1.330274429 RSQRT2PI = 0.39894228040143267793994605993438 @cuda.jit(argtypes=(double,), restype=double, device=True, inline=True) def cnd_cuda(d): K = 1.0 / (1.0 + 0.2316419 * math.fabs(d)) ret_val = (RSQRT2PI * math.exp(-0.5 * d * d) * (K * (A1 + K * (A2 + K * (A3 + K * (A4 + K * A5)))))) if d > 0: ret_val = 1.0 - ret_val return ret_val @cuda.jit(argtypes=(double[:], double[:], double[:], double[:], double[:], double, double)) def black_scholes_cuda(callResult, putResult, S, X, T, R, V): # S = stockPrice # X = optionStrike # T = optionYears # R = Riskfree # V = Volatility i = cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x if i >= S.shape[0]: return sqrtT = math.sqrt(T[i]) d1 = (math.log(S[i] / X[i]) + (R + 0.5 * V * V) * T[i]) / (V * sqrtT) d2 = d1 - V * sqrtT cndd1 = cnd_cuda(d1) cndd2 = cnd_cuda(d2) expRT = math.exp((-1. * R) * T[i]) callResult[i] = (S[i] * cndd1 - X[i] * expRT * cndd2) def randfloat(rand_var, low, high): return (1.0 - rand_var) * low + rand_var * high def main (*args): OPT_N = 4000000 iterations = 10 callResultNumpy = np.zeros(OPT_N) putResultNumpy = -np.ones(OPT_N) stockPrice = randfloat(np.random.random(OPT_N), 5.0, 30.0) optionStrike = randfloat(np.random.random(OPT_N), 1.0, 100.0) optionYears = randfloat(np.random.random(OPT_N), 0.25, 10.0) callResultNumba = np.zeros(OPT_N) putResultNumba = -np.ones(OPT_N) callResultNumbapro = np.zeros(OPT_N) putResultNumbapro = -np.ones(OPT_N) # Numpy ---------------------------------------------------------------- time0 = time.time() for i in range(iterations): black_scholes(callResultNumpy, putResultNumpy, stockPrice, optionStrike, optionYears, RISKFREE, VOLATILITY) time1 = time.time() dtnumpy = ((1000 * (time1 - time0)) / iterations)/OPT_N print("\nNumpy Time %f msec per option") % (dtnumpy) # CUDA ----------------------------------------------------------------- time0 = time.time() blockdim = 1024, 1 griddim = int(math.ceil(float(OPT_N)/blockdim[0])), 1 stream = cuda.stream() d_callResult = cuda.to_device(callResultNumbapro, stream) d_putResult = cuda.to_device(putResultNumbapro, stream) d_stockPrice = cuda.to_device(stockPrice, stream) d_optionStrike = cuda.to_device(optionStrike, stream) d_optionYears = cuda.to_device(optionYears, stream) time2 = time.time() for i in range(iterations): black_scholes_cuda[griddim, blockdim, stream]( d_callResult, d_putResult, d_stockPrice, d_optionStrike, d_optionYears, RISKFREE, VOLATILITY) d_callResult.to_host(stream) d_putResult.to_host(stream) stream.synchronize() time3 = time.time() dtcuda = ((1000 * (time3 - time2)) / iterations)/OPT_N print("Numbapro CUDA Time %f msec per option (speed-up %.1fx) \n") % (dtcuda, dtnumpy/dtcuda) # print(callResultNumbapro) if __name__ == "__main__": import sys main(*sys.argv[1:]) returning Numpy Time 0.000186 msec per option Numbapro CUDA Time 0.000024 msec per option (speed-up 7.7x) In order to understand why CUDA wins over NumPy this time is not so difficult. First we have a programmable analytical form of the problem. We deploy it to GPU and perform exhaustive calculations involving cnd_cuda function for the estimation of the cumulative distribution function of the standard normal distribution. 
Splitting the task into many concurrently running threats on GPU reduces the time. Again, it’s possible because all option prices can be computed independently. CUDA:NumPy (1:1). Multiplied Promises In finance, the concept of portfolio optimization is well established (see my ebook on that, Applied Portfolio Optimization with Risk Management, as an example). The idea standing behind is to find such a vector of weights,$w$, for all assets that the derived estimated portfolio risk ($\sigma_P$) and return ($\mu_P$) meets our needs or expectations. An alternative (but not greatly recommended) approach would involve optimization through randomisation of$w$vectors. We could generate a big number of them, say$N$, in order to obtain: $$\mu_{P,i} = m w_i^T \ \ \ \mbox{and} \ \ \ \sigma_{P,i} = w_iM_2w_i^T$$ for$i=1,…,N$. Here,$m$is a row-vector holding estimated expected returns for all assets in portfolio$P$and based on return-series we end up with$M_2$covariance matrix$(MxM$where$M$is a number of assets in$P$). In the first case, we aim at the multiplication of row-vector with a transposed row-vector of weight whereas for the latter we perform the multiplication of the row-vector with the square matrix (as the first operation). If$N$is really big, say a couple of millions, the computation could be somehow accelerated using GPU. Therefore, we need a code for matrix multiplication on GPU in Python. Let’s consider a more advanced concept of ($K\times K$)$\times$($K\times K$) matrix multiplication. If that works faster, than our random portfolios problem should be even faster. Continuum Analytics provides with a ready-to-use solution: import numpy as np from numbapro import cuda import numba from timeit import default_timer as timer from numba import float32 bpg = 32 tpb = 32 n = bpg * tpb shared_mem_size = (tpb, tpb) griddim = bpg, bpg blockdim = tpb, tpb @numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])") def naive_matrix_mult(A, B, C): x, y = cuda.grid(2) if x >= n or y >= n: return C[y, x] = 0 for i in range(n): C[y, x] += A[y, i] * B[i, x] @numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])") def optimized_matrix_mult(A, B, C): # Declare shared memory sA = cuda.shared.array(shape=shared_mem_size, dtype=float32) sB = cuda.shared.array(shape=shared_mem_size, dtype=float32) tx = cuda.threadIdx.x ty = cuda.threadIdx.y x, y = cuda.grid(2) acc = 0 for i in range(bpg): if x < n and y < n: # Prefill cache sA[ty, tx] = A[y, tx + i * tpb] sB[ty, tx] = B[ty + i * tpb, x] # Synchronize all threads in the block cuda.syncthreads() if x < n and y < n: # Compute product for j in range(tpb): acc += sA[ty, j] * sB[j, tx] # Wait until all threads finish the computation cuda.syncthreads() if x < n and y < n: C[y, x] = acc # Prepare data on the CPU A = np.array(np.random.random((n, n)), dtype=np.float32) B = np.array(np.random.random((n, n)), dtype=np.float32) print "(%d x %d) x (%d x %d)" % (n, n, n, n) # Prepare data on the GPU dA = cuda.to_device(A) dB = cuda.to_device(B) dC = cuda.device_array_like(A) # Time the unoptimized version s = timer() naive_matrix_mult[griddim, blockdim](dA, dB, dC) numba.cuda.synchronize() e = timer() unopt_ans = dC.copy_to_host() tcuda_unopt = e - s # Time the optimized version s = timer() optimized_matrix_mult[griddim, blockdim](dA, dB, dC) numba.cuda.synchronize() e = timer() opt_ans = dC.copy_to_host() tcuda_opt = e - s assert np.allclose(unopt_ans, opt_ans) print "CUDA without shared memory:", "%.2f" % tcuda_unopt, "s" print "CUDA with shared memory :", 
"%.2f" % tcuda_opt, "s" s = timer() np.dot(A,B) e = timer() npt=e-s print "NumPy dot product :", "%.2f" % npt, "s" what returns (1024 x 1024) x (1024 x 1024) CUDA without shared memory: 0.76 s CUDA with shared memory : 0.25 s NumPy dot product : 0.06 s and leads to CUDA:NumPy (1:2) score of the game. The natural questions arise. Is it about the matrix size? Maybe it is too simple problem we try to solve it with an improper tool? Or the way how we approach matrix allocation and deployment to GPU itself? The last question made me digging deeper. In Python you can create one big matrix holding a number of smaller matrices. The following code tries to perform 4 million$(2\times 2)$matrix multiplications where matrix$B$is randomised every single time (see our random portfolio problem). In Anaconda Accelerate we achieve it as follows: import numbapro import numba.cuda import numpy as np from timeit import default_timer as timer # Use the builtin matrix_multiply in NumPy for CPU test import numpy.core.umath_tests as ut @numbapro.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'], '(m, n),(n, p)->(m, p)', target='gpu') def batch_matrix_mult(a, b, c): for i in range(c.shape[0]): for j in range(c.shape[1]): tmp = 0 for n in range(a.shape[1]): tmp += a[i, n] * b[n, j] c[i, j] = tmp def main(): n = 4000000 dim = 2 sK=0 KN=10 for K in range(KN): # Matrix Multiplication: c = a x b a = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) c = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) # NUMPY ------------------------------------------------------------- start = timer() b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) d=ut.matrix_multiply(a, b) np_time=timer()-start # CUDA -------------------------------------------------------------- dc = numba.cuda.device_array_like(c) da = numba.cuda.to_device(a) start = timer() b = np.random.random(n*dim*dim).astype(np.float32).reshape(n,dim,dim) db = numba.cuda.to_device(b) batch_matrix_mult(da, db, out=dc) numba.cuda.synchronize() dc.copy_to_host(c) cuda_time=timer()-start sK += np_time/cuda_time del da, db print("\nThe average CUDA speed-up: %.5fx") % (sK/KN) if __name__ == '__main__': main() leading us to The average CUDA speed-up: 0.79003x i.e. deceleration. Playing with the sizes of matrices and their number may result in error caused by GPU memory required for matrix$A$and$B$allocation on GPU. It seems we have CUDA:NumPy (1:3). Black Magic in Black Box I approached NumbaPro solution as a complete rookie. I spent a considerable amount of time searching for Anaconda Accelerate’s GPU codes demonstrating massive speed-ups as promised. I found different fragments in different places across the Web. With the best method known among all beginners, namely, copy and paste, I re-ran what I found. Then modified and re-ran again. And again, and again. I felt the need, the need for speed! But failed finding my tail wind. This post may demonstrate my lack of understanding of what is going on or reveal a blurred picture standing behind: a magic that works if you know all the tricks. I hope that at least you enjoyed the show! ## Covariance Matrix for N-Asset Portfolio fed by Quandl in Python A construction of your quantitative workshop in Python requires a lot of coding or at least spending a considerable amount of time assembling different blocks together. There are many simple fragments of code reused many times. 
The calculation of covariance matrix is not a problem once NumPy is engaged but the meaning is derived once you add some background idea what you try to achieve. Let’s see in this lesson of Accelerated Python for Quants tutorial how to use Quandl.com data provider in construction of any$N$-Asset Portfolio based on SEC securities and for return-series we may calculate a corresponding covariance matrix. Quandl and SEC Stock List Quandl is a great source of data. With their ambition to become the largest data provider on the planet free of charge, no doubt they do an amazing job. You can use their Python API to feed your code directly with Open, High, Low, Close for any SEC stock. In the beginning we will need a list of companies (tickers) and, unfortunately, the corresponding internal call-tickers as referred to by Quandl. The .csv file containing all information you can download from this website or directly here: secwiki_tickers.csv. The file contains the following header: Ticker Name Sector Industry Price Collection where we are interested in matching Ticker with Price field. The latter for AAPL stock displays “WIKI/AAPL” code. That’s all we need for now to grab. Our Portfolio Let’s say we have a freedom of choice to select any of 2277 stock data from our SEC universe (provided in secwiki_tickers.csv file). For the sake of simplicity I’ll select ony three and save them in a plain text file of portfolio.lst containing: AAPL IBM TXN I do it on purpose in order to show you how easily we can read in this list from a file in Python. As usual, we start our adventure from data pre-processing part: 1 2 3 4 5 6 7 8 9 10 11 # Covariance Matrix for N-Asset Portfolio fed by Quandl in Python # (c) 2014 QuantAtRisk, by Pawel Lachowicz import Quandl import numpy as np import pandas as pd df=pd.read_csv('secwiki_tickers.csv') dp=pd.read_csv('portfolio.lst',names=['pTicker']) pTickers=dp.pTicker.values # converts into a list In order to install Quandl module in your Mac/Linux environment simply type pip install Quandl (for more information see here). In the code above, we employ pandas’ read_csv function both for reading in the data from .csv file as well as from a plain text file. For the latter, line #10, we add a name of the column, pTicker, to point at portfolio tickers, and next we convert pandas’ DataFrame object into a Python list. Now, we gonna use the power of Python over Matlab in search for the corresponding Quandl (Price) ticker code: 13 14 15 16 17 18 19 20 21 tmpTickers=[] for i in range(len(pTickers)): test=df[df.Ticker==pTickers[i]] if not(test.empty): tmp=test.Price.values+'.4' # of <type 'numpy.ndarray'> tmp2=tmp.tolist() tmpTickers.append(tmp2) print(tmpTickers) This is executed in line #15 almost automatically. The result of the search is the DataFrame record (empty or containing the corresponding information on the stock as read out from (df) .csv data source. Simply, if the ticker of the security in our portfolio.lst file does not exist, it is skipped. Quandl allows you to retrieve information on stock’s Open, High, Low, Close, Volume by calling (Price) Quandl ticker in the following form, respectively: 'WIKI/AAPL.1' 'WIKI/AAPL.2' 'WIKI/AAPL.3' 'WIKI/AAPL.4' 'WIKI/AAPL.5' Below we will stick to Close prices. That is why, in line #17, we add ‘.4′ to the string. Please note that we should get the same data by calling ‘GOOG/NASDAQ_AAPL.4′ and ‘WIKI/AAPL.4′ Quandl tickers. Because a variable tmp is of numpy.ndarray type, in line #18, we convert it into a list type. 
The final list of tmpTickers contains therefore all corresponding Quandl tickers in the form: [['WIKI/AAPL.4'], ['WIKI/IBM.4'], ['WIKI/TXN.4']] and before they can be used for return-series retrieval from the Quandl database, we need to scalp each item a bit in the following way: 23 24 25 26 27 28 29 30 tmp=[] for i in range(len(tmpTickers)): tmp2=str(tmpTickers[i]).strip('[]') print(tmp) tmp.append(str(tmp2).strip('\'\'')) QuandlTickers=tmp print(QuandlTickers) what returns: ['WIKI/AAPL.4', 'WIKI/IBM.4', 'WIKI/TXN.4'] Return-Series Download We fetch the data pretty easily: 32 33 34 35 36 37 data= Quandl.get(QuandlTickers, authtoken='YourAuthToken', \ trim_start='2014-10-01', trim_end='2014-11-04', \ transformation='rdiff') d=data.values.T print(d) where rdiff enforces price-series transformation into return-series expressed as row-on-row % change, y'[t] = (y[t] – y[t-1])/y[t-1] (see more here: Using the Quandl API). By default we download daily returns: [[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 nan nan -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]] You may notice some missing data in the form denoted by nan. It may happen for any dataset. Life is not perfect. Quandl is only an option. In the first approximation we may fill the missing returns with zeros using pandas: 39 40 41 42 43 44 for i in range(d.shape[0]): df=pd.DataFrame(d[i]) df.fillna(0,inplace=True) d[i]=df.values.T print(d) what works quickly and efficiently: [[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 0. 0. -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]] Having data is better than missing them. It is important to check how many nans every return-series has. In our case of TXN missing information constitutes 8% at 25 data points. A good indicator is to have less than 3%. Otherwise an impact on any further calculations may be significantly propagated. In this point it is good to recall Bruce Lee who said: “Empty your mind, be formless, shapeless, like water. If you put water into a cup, it becomes the cup. If you put water in a bottle, it becomes the bottle. 
If you put it in a teacup, it becomes the teacup. Now, water can flow, and it can crush. Be water, my friend.” Two added zeros contribute to the TXN distribution of returns in its central part, therefore the more we “fill in” with zeros, a fit of normal distribution flattens the tails (a distribution becomes more peaked). Therefore, in such case, I would agree with Bruce only on the “crush” part of his wisdom. Covariance Matrix Keeping above short note on some dirty tricks in mind, we obtain the desired covariance matrix simply and painfully, 46 47 covmat=np.cov(d) print(covmat) id est, [[ 1.44660397e-04 -2.13905277e-05 1.31742330e-04] [ -2.13905277e-05 3.12211511e-04 -5.62146677e-05] [ 1.31742330e-04 -5.62146677e-05 5.44348868e-04]] what accomplishes our efforts. Recommended Reading: Applied Portfolio Optimization with Risk Management using Matlab ## GARCH(1,1) Model in Python In my previous article GARCH(p,q) Model and Exit Strategy for Intraday Algorithmic Traders we described the essentials of GARCH(p,q) model and provided an exemplary implementation in Matlab. In general, we apply GARCH model in order to estimate the volatility one time-step forward, where: $$\sigma_t^2 = \omega + \alpha r_{t-1}^2 + \beta \sigma_{t-1}^2$$ based on the most recent update of$r$and$\sigma$, where$r_{t-1} = \ln({P_{t-1}}/{P_{t-2}})$and$P$corresponds to an asset price. For any financial time-series,$\{r_j\}$, the estimation of$(\omega,\alpha,\beta)$parameters can be conducted utilising the maximum likelihood method. The latter is an iterative process by looking for the maximum value of the sum among all sums defined as: $$\sum_{i=3}^{N} \left[ -\ln(\sigma_i^2) – \frac{r_i^2}{\sigma_i^2} \right]$$ where$N$denotes the length of the return series$\{r_j\}$($j=2,…,N$) under study. Let’s assume we have a test array of input data,$\{r_j\}$, stored in Python variable of r, and we write a function, GARCH11_logL, that will be used in the optimisation process. It contains two input parameters where parma is an 3-element array with some trial values corresponding to$(\omega,\alpha,\beta)$and u denotes the return-series. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 # GARCH(1,1) Model in Python # uses maximum likelihood method to estimate (omega,alpha,beta) # (c) 2014 QuantAtRisk, by Pawel Lachowicz; tested with Python 3.5 only import numpy as np from scipy import optimize import statistics as st r = np.array([0.945532630498276, 0.614772790142383, 0.834417758890680, 0.862344782601800, 0.555858715401929, 0.641058419842652, 0.720118656981704, 0.643948007732270, 0.138790608092353, 0.279264178231250, 0.993836948076485, 0.531967023876420, 0.964455754192395, 0.873171802181126, 0.937828816793698]) def GARCH11_logL(param, r): omega, alpha, beta = param n = len(r) s = np.ones(n)*0.01 s[2] = st.variance(r[0:3]) for i in range(3, n): s[i] = omega + alpha*r[i-1]**2 + beta*(s[i-1]) # GARCH(1,1) model logL = -((-np.log(s) - r**2/s).sum()) return logL In this point it is important to note that in line #32 we multiply the sum by$-1$in order to find maximal value of the expression. Why? This can be understood as we implement optimize.fmin function from Python’s optimize module. 
Therefore, we seek for best estimates as follows: 34 o = optimize.fmin(GARCH11_logL,np.array([.1,.1,.1]), args=(r,), full_output=1) and we display the results, 36 37 38 R = np.abs(o[0]) print() print("omega = %.6f\nbeta = %.6f\nalpha = %.6f\n" % (R[0], R[2], R[1])) what, in case of the array of r as given in the code, returns the following results: Optimization terminated successfully. Current function value: 14.705098 Iterations: 88 Function evaluations: 162 omega = 0.788244 beta = 0.498230 alpha = 0.033886 Python vs. Matlab Solution Programming requires caution. It is always a good practice to test the outcome of one algorithm against alternative solutions. Let’s run the GARCH(1,1) model estimation for the same input array and compare Python and Matlab results: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 % GARCH(1,1) Model in Matlab 2013a % (c) 2014 QuantAtRisk, by Pawel Lachowicz clear all; close all; clc; r=[0.945532630498276, ... 0.614772790142383, ... 0.834417758890680, ... 0.862344782601800, ... 0.555858715401929, ... 0.641058419842652, ... 0.720118656981704, ... 0.643948007732270, ... 0.138790608092353, ... 0.279264178231250, ... 0.993836948076485, ... 0.531967023876420, ... 0.964455754192395, ... 0.873171802181126, ... 0.937828816793698]'; % GARCH(p,q) parameter estimation model = garch(1,1) % define model [fit,VarCov,LogL,Par] = estimate(model,r); % extract model parameters parC=Par.X(1); % omega parG=Par.X(2); % beta (GARCH) parA=Par.X(3); % alpha (ARCH) % estimate unconditional volatility gamma=1-parA-parG; VL=parC/gamma; volL=sqrt(VL); % redefine model with estimatated parameters model=garch('Constant',parC,'GARCH',parG,'ARCH',parA) what returns: model = GARCH(1,1) Conditional Variance Model: -------------------------------------- Distribution: Name = 'Gaussian' P: 1 Q: 1 Constant: NaN GARCH: {NaN} at Lags [1] ARCH: {NaN} at Lags [1] ____________________________________________________________ Diagnostic Information Number of variables: 3 Functions Objective: @(X)OBJ.nLogLikeGaussian(X,V,E,Lags,1,maxPQ,T,nan,trapValue) Gradient: finite-differencing Hessian: finite-differencing (or Quasi-Newton) Constraints Nonlinear constraints: do not exist Number of linear inequality constraints: 1 Number of linear equality constraints: 0 Number of lower bound constraints: 3 Number of upper bound constraints: 3 Algorithm selected sequential quadratic programming ____________________________________________________________ End diagnostic information Norm of First-order Iter F-count f(x) Feasibility Steplength step optimality 0 4 1.748188e+01 0.000e+00 5.758e+01 1 27 1.723863e+01 0.000e+00 1.140e-03 6.565e-02 1.477e+01 2 31 1.688626e+01 0.000e+00 1.000e+00 9.996e-01 1.510e+00 3 35 1.688234e+01 0.000e+00 1.000e+00 4.099e-02 1.402e+00 4 39 1.686305e+01 0.000e+00 1.000e+00 1.440e-01 8.889e-01 5 44 1.685246e+01 0.000e+00 7.000e-01 2.379e-01 5.088e-01 6 48 1.684889e+01 0.000e+00 1.000e+00 9.620e-02 1.379e-01 7 52 1.684835e+01 0.000e+00 1.000e+00 2.651e-02 2.257e-02 8 56 1.684832e+01 0.000e+00 1.000e+00 8.389e-03 7.046e-02 9 60 1.684831e+01 0.000e+00 1.000e+00 1.953e-03 7.457e-02 10 64 1.684825e+01 0.000e+00 1.000e+00 7.888e-03 7.738e-02 11 68 1.684794e+01 0.000e+00 1.000e+00 3.692e-02 7.324e-02 12 72 1.684765e+01 0.000e+00 1.000e+00 1.615e-01 5.862e-02 13 76 1.684745e+01 0.000e+00 1.000e+00 7.609e-02 8.429e-03 14 80 1.684740e+01 0.000e+00 1.000e+00 2.368e-02 4.072e-03 15 84 1.684739e+01 0.000e+00 1.000e+00 1.103e-02 3.142e-03 16 88 1.684739e+01 
                       0.000e+00     1.000e+00    1.183e-03     2.716e-04
   17      92    1.684739e+01    0.000e+00     1.000e+00    9.913e-05     1.378e-04
   18      96    1.684739e+01    0.000e+00     1.000e+00    6.270e-05     2.146e-06
   19      97    1.684739e+01    0.000e+00     7.000e-01    4.327e-07     2.146e-06

Local minimum possible. Constraints satisfied.

fmincon stopped because the size of the current step is less than
the default value of the step size tolerance and constraints are
satisfied to within the selected value of the constraint tolerance.

<stopping criteria details>

   GARCH(1,1) Conditional Variance Model:
   ----------------------------------------
   Conditional Probability Distribution: Gaussian

                                 Standard          t
    Parameter       Value          Error       Statistic
   -----------   -----------   ------------   -----------
    Constant       0.278061      26.3774       0.0105417
    GARCH{1}       0.457286      49.4915       0.0092397
    ARCH{1}       0.0328433      1.65576       0.0198358

model =

    GARCH(1,1) Conditional Variance Model:
    --------------------------------------
    Distribution: Name = 'Gaussian'
               P: 1
               Q: 1
        Constant: 0.278061
           GARCH: {0.457286} at Lags [1]
            ARCH: {0.0328433} at Lags [1]

id est $$(\omega,\beta,\alpha)_{\rm Matlab} = (0.278061, 0.457286, 0.0328433) \ .$$ This differs slightly from the Python solution, which was $$(\omega,\beta,\alpha)_{\rm Py} = (0.788244, 0.498230, 0.033886) \ .$$ At this stage it is difficult to assess which solution is "better". Both algorithms and the applied methodologies may simply be different, which is usually the case. Having that in mind, further extensive tests are required, for example, of the dependence of the Python solution on the trial input $(\omega,\alpha,\beta)$ values as displayed in line #33 of the Python code.

## Deriving Limits in Python

Lesson 6>>

Some quant problems require intensive work with mathematical (time-)series given initial conditions. In this short lesson on Python, let's consider the following problem and solve it analytically and with the aid of Python.
Problem Given a series$\{a_n\}$, where$n\in N^{+}$, and$a_1=\sqrt{2}$and$a_{n+1} = \sqrt{2}^{\log_2 {a_n}}$, solve: $$\lim_{n \to \infty} b_n \ \ \ \ \mbox{where} \ \ b_n = a_1\cdot a_2\cdot … \cdot a_n \ .$$ Solution By employing $$a^{\log_a b} = b \ \ \ \ \mbox{where} \ \ a\in \Re^{+} – \{1\} \ \ \ \mbox{and} \ \ \ b\in \Re^{+}$$and taking into account the initial conditions for our problem, we derive: $$a_1 = \sqrt{2} = 2^{\frac{1}{2}} \\ a_2 = \sqrt{2}^{\log_2 a_1} = 2^{\frac{1}{2}\log_2 2^{\frac{1}{2}}} = 2^{\log_2 2^\frac{1}{4}} = 2^\frac{1}{2^2} \\ a_3 = \sqrt{2}^{\log_2 a_2} = 2^{\frac{1}{2}\log_2 2^{\frac{1}{2^2}}} = 2^\frac{1}{2^3} \\ … \\ a_n = \sqrt{2}^{\log_2 a_{n-1}} = 2^\frac{1}{2^n}$$ therefore $$a_n = 2^\frac{1}{2^n} \ .$$ We find $$b_n = a_1\cdot a_2\cdot … \cdot a_n = 2^{\frac{1}{2}} \cdot 2^\frac{1}{2^2} \cdot … \cdot 2^\frac{1}{2^n} =$$ $$= 2^{\frac{1}{2}+\frac{1}{4}+…+\frac{1}{2^n}} = 2^{1-\left( \frac{1}{2} \right)^n} \ .$$ Having that, we might code in Python the limit for$\{b_n\}$-series in an iterative way: 1 2 3 4 5 6 7 # Deriving Limits in Python, Accelerated Python for Quants Tutorial, Lesson 5 # (c) 2014 QuantAtRisk from sympy import * for m in xrange(1,40): print 2**(1-0.5**m) what results in the output: 1.41421356237 1.68179283051 1.83400808641 1.91520656140 1.95714412418 1.97845602639 1.98919884697 1.99459211217 1.99729422578 1.99864665501 1.99932321299 1.99966157786 1.99983078177 1.99991538910 1.99995769410 1.99997884694 1.99998942344 1.99999471171 1.99999735586 1.99999867793 1.99999933896 1.99999966948 1.99999983474 1.99999991737 1.99999995869 1.99999997934 1.99999998967 1.99999999484 1.99999999742 1.99999999871 1.99999999935 1.99999999968 1.99999999984 1.99999999992 1.99999999996 1.99999999998 1.99999999999 1.99999999999 2.0 while more elegant solution for$\{b_n\}$we may obtain as follows: 9 10 11 12 n=Symbol('n') expr=2**(1-0.5**n) a=limit(expr, n, oo) print a revealing 2 confirmed by an analytical solution: $$\lim_{n \to \infty} b_n = \lim_{n \to \infty} 2^{1-\left( \frac{1}{2} \right)^n} = 2 \ \ .$$ Please note that Python’s command limit requires explicit definition of the variable$n$to be symbolically assigned before calculation, and sympy package needs to be uploaded. ## Asset Allocation for Tangent Portfolio with Risk-Free Asset in Python Lesson 5>> Studying a new programming language is an adventurous journey. It takes time in case of Python. Even for me. Not every concept is easy to grasp and testing a piece of code with its endless modifications may take a longer while. We have two choices: either to get a good book like upcoming Python for Quants or sit down and apply a copy-and-paste-and-test method. The latter approach is much longer but rewarding in the middle of our learning process. Hit and try. So, why not to try? You cannot write the code which performs complex operations omitting necessary set of functions. There is no need to reinvent the wheel. In Python, the community works hard to deliver best programming solutions. In this lesson, we will accelerate by conducting an investigation of Python code aimed at finding optimised weights for a tangent portfolio problem. In QaR ebook on Applied Portfolio Optimization with Risk Management using Matlab we discussed in great detail the theory and practical calculations for various cases of portfolios with different objective functions. 
Markowitz in 1952 underlined that the goal of portfolio choice is either to look for such a portfolio that delivers maximum return at a given level of risk, or minimum risk for a given level of return. Based on the theoretical works of Sharpe in 1964, Lintner in 1965 and Tobin in 1958, the risk-free asset in the portfolio has been shown to equip the investor with better control over risk. We can designate a fraction of our capital for an investment in the risk-free option (e.g. the savings account in a bank) while the rest is delegated to other assets with diversified risk levels. It was shown that for any portfolio with a risk-free component, the expected return is: $$R_P = mw^T + (1-{\bf 1}w^T)r_f \ \ \ \mbox{at} \ \ \ \sigma_P^2 = wCw^T$$ where $C$, $m$, and $w$ denote the covariance matrix, the (row) vector of individual expected asset returns, and the (row) vector of asset weights, respectively. If we aim at the variance, $\sigma_P^2$, to be minimised: $$\min_{w} \ \ wCw^T$$ subject to $mw^T + (1-{\bf 1}w^T)r_f = r_{\rm target}$, we formulate the minimum variance portfolio optimization problem.

It turns out that all minimum variance portfolios are a combination of the risk-free asset and a given risky portfolio. The latter is often called the tangent portfolio, and it has been shown that it must contain all assets available to investors (each held in proportion to its market value relative to the total market value of all assets). That gives the tangent portfolio its second name: the market portfolio. The objective function of the form: $$\max_{w} \ \ \frac{mw^T-r_f}{\sqrt{wCw^T}}$$ subject to ${\bf 1}w^T=1$, i.e. the Sharpe ratio, corresponds to the market portfolio directly. The risk-free asset is connected with the tangent portfolio by a straight line, and therefore provides an investor with a good blend of risk-controlled portfolios. Modern Portfolio Theory tells us that the tangent portfolio is given by: $${\bf w} = C^{-1}({\bf m} - {\bf 1}r_f)$$ (up to a normalisation making the weights sum to one), where the vector ${\bf w}$ stores the computed weights for each asset in portfolio $P$. Since finding the tangent portfolio given $N$ assets is, luckily, an analytical problem (i.e. one that does not require numerical solvers), this task is straightforward to code in Python as follows.

1  # Asset Allocation for Tangent Portfolio with Risk-Free Asset in Python
2  # Accelerated Python for Quants Tutorial, Lesson 4
3  # (c) 2014 QuantAtRisk
4  
5  from numpy import matrix, power
6  from math import sqrt
7  
8  def TangentPortfolio(m,C,rf):
9      # find number of rows and columns for the covariance matrix
10     (nr,nc)=C.shape
11     A=matrix([[0.0] for r in xrange(nr)])
12     A=(1/C)*(m-rf)    # note: with numpy matrices 1/C is element-wise, not the inverse C.I
13     (nr,nc)=A.shape
14     A=A/sum(A[r,0] for r in xrange(nr))   # normalise weights to sum to one
15     w=[A[r,0] for r in xrange(nr)]
16     pret=m.T*A                            # expected portfolio return (argument m, not the global mu)
17     prsk=power(A.T*(C*A),0.5)             # portfolio risk
18     return matrix(w),pret,prsk
The dimensions of any matrix M can be return into tuple as shown in line #10. Please note how Python eases our life. In line #15 we create a new one-row vector (matrix) referring directly to certain elements of other matrix by putting for…in loop inside the matrix of w itself. How brilliant it is! Lastly, using a function of power we take its first argument to the power of 1/2, i.e. we compute a square root. To see some action, let us first define an exemplary covariance matrix and vector with expected returns corresponding to 3-assets in the portfolio: 20 21 22 23 24 cov=matrix([[0.04, 0.004, 0.02],[0.004, 0.09, 0.09],[0.02,0.09,0.16]]) mu=matrix([[0.13],[0.11],[0.19]]) rf=0.05 w,ret,rsk=TangentPortfolio(mu,cov,rf) where investing at the risk-free rate of 5% has been added to complete the grand picture of the problem we discuss here. Line #24 reveals the way how we call our function and assign calculated values within the function to outer variables (their names can be different). We display the results on the screen by typing: 26 27 28 29 30 print("Portfolio weights") print(w.T) print("Expected Portfolio Return and Risk") print ret,rsk what returns: Portfolio weights [[ 0.46364368] [ 0.4292997 ] [ 0.10705661]] Expected Portfolio Return and Risk [[ 0.1278374]] [[ 0.19715402]] what simply communicates that the expected portfolio return equals 12.8% at 19.7% of risk if we allocate 46%, 42%, and 10% in asset number 1, 2, and 3, respectively. We find that Sharpe ratio, 32 33 sharpe=(ret-rf)/rsk print(sharpe) equals 0.3948 which corresponds to the Sharpe ratio of the first asset: 35 for r in xrange(3): print((mu[r,0]-rf)/sqrt(cov[r,r])) 0.4 0.2 0.35 Finally, if we denote by$\zeta$a fraction of capital we want to invest in risky assets, leaving$(1-\zeta)$in the bank at$r_f=5\%rate, then the expected portfolio return will be: $$\zeta wm+(1-\zeta)r_r \ \ \ \mbox{at} \ \ \ \zeta\sqrt{wCw^T}$$ of risk, therefore for two different cases, for example: 37 38 39 40 41 42 43 alpha=0.7 print(((matrix(alpha)*w)*mu)+(1-alpha)*rf) print(matrix(alpha)*power(w*cov*w.T,1)) alpha=0.25 print(((matrix(alpha)*w)*mu)+(1-alpha)*rf) print(matrix(alpha)*power(w*cov*w.T,1)) we get [[ 0.10448618]] [[ 0.0272088]] [[ 0.06945935]] [[ 0.00971743]] what confirms that by putting 70% of our capital, for instance, into three stocks should result in 10.4% gain at 2.7% rate of risk, while an allocation of 75% of the capital in the bank promises 6.9% return, i.e. approaching earlier defined risk-free rate of 5% pa. Hungry of Python? Want some more? Stay Tuned! A new ebook Python for Quants is coming this July! Don’t miss the full introductory course to Python with many other practical examples ready-to-rerun and use for your projects: ## Hacking Google Finance in Real-Time for Algorithmic Traders Forecasting risk in algorithmic stock trading is of paramount importance for everyone. You should always look for the ways how to detect sudden price changes and take immediate actions to protect your investments. Imagine you opened a new long position last Wednesday for NASDAQ:NVDA buying 1500 shares at the market price of USD16.36. On the next day price goes down to USD15.75 at the end of the session. You are down 3.87% or almost a grand in one day. If you can handle that, it’s okay but if the drop were more steep? Another terrorist attack, unforeseen political event, North Korea nuclear strike? Then what? You need to react! If you have information, you have options in your hands. 
In this post we will see how one can use real-time data of stock prices displayed on Google Finance website, fetch and record them on your computer. Having them, you can build your own warning system for sudden price swings (risk management) or run the code in the background for a whole trading session (for any stock, index, etc. accessible through Google Finance) and capture asset prices with an intraday sampling (e.g. every 10min, 30min, 1h, and so on). From this point only your imagination can stop you from using all collected data. Hacking with Python If you ever dreamt of becoming a hacker, this is your chance to shine! I have got my inspiration after reading the book of Violent Python: A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers by TJ O’Connor. A powerful combination of the beauty and the beast. The core of our code will be contained in a small function which does the job. For a specified Google-style ticker (query), it fetches the data directly from the server returning the most current price of an asset: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 # Hacking Google Finance in Real-Time for Algorithmic Traders # # (c) 2014 QuantAtRisk.com, by Pawel Lachowicz import urllib, time, os, re, csv def fetchGF(googleticker): url="http://www.google.com/finance?&q=" txt=urllib.urlopen(url+googleticker).read() k=re.search('id="ref_(.*?)">(.*?)<',txt) if k: tmp=k.group(2) q=tmp.replace(',','') else: q="Nothing found for: "+googleticker return q Just make sure that a Google ticker is correctly specified (as will see below). Next, let’s display on the screen our local time and let’s force a change of the system time to the one corresponding to New York City, NY. The latter assumption we make as we would like to track the intraday prices of stock(s) traded at NYSE or NASDAQ. However, if you are tracking FTSE 100 index, the Universal Time (UTC) of London is advisable as an input parameter. 18 19 20 21 22 23 24 25 26 27 # display time corresponding to your location print(time.ctime()) print # Set local time zone to NYC os.environ['TZ']='America/New_York' time.tzset() t=time.localtime() # string print(time.ctime()) print Having that, let us define a side-function combine which we will use to glue all fetched data together into Python’s list variable: 29 30 31 32 33 34 def combine(ticker): quote=fetchGF(ticker) # use the core-engine function t=time.localtime() # grasp the moment of time output=[t.tm_year,t.tm_mon,t.tm_mday,t.tm_hour, # build a list t.tm_min,t.tm_sec,ticker,quote] return output As an input, we define Google ticker of our interest: 36 ticker="NASDAQ:AAPL" for which we open a new text file where all queries will be saved in real-time: 39 40 41 42 # define file name of the output record fname="aapl.dat" # remove a file, if exist os.path.exists(fname) and os.remove(fname) Eventually, we construct the final loop over trading time. Here, we fetch the last data at 16:00:59 New York time. The key parameter in the game is freq variable where we specify the intraday sampling (in seconds). From my tests, using a private Internet provider, I have found that the most optimal sampling was 600 sec (10 min). Somehow, for shorter time intervals, Google Finance detected too frequent queries sent from my IP address. This test succeed from a different IP location, therefore, feel free to play with your local Internet network to find out what is the lowest available sampling time for your geolocation. 
43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 freq=600 # fetch data every 600 sec (10 min) with open(fname,'a') as f: writer=csv.writer(f,dialect="excel") #,delimiter=" ") while(t.tm_hour<=16): if(t.tm_hour==16): while(t.tm_min<01): data=combine(ticker) print(data) writer.writerow(data) # save data in the file time.sleep(freq) else: break else: for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) # save data in the file time.sleep(freq) f.close() To see how the above code works in practice, I conducted a test on Jan/9 2014, starting at 03:31:19 Sydney/Australia time, corresponding to 11:31:19 New York time. Setting the sampling frequency to 600 sec, I was able to fetch the data in the following form: Thu Jan 9 03:31:19 2014 Wed Jan 8 11:31:19 2014 [2014, 1, 8, 11, 31, 19, '543.71'] [2014, 1, 8, 11, 41, 22, '543.66'] [2014, 1, 8, 11, 51, 22, '544.22'] [2014, 1, 8, 12, 1, 23, '544.80'] [2014, 1, 8, 12, 11, 24, '544.32'] [2014, 1, 8, 12, 21, 25, '544.86'] [2014, 1, 8, 12, 31, 27, '544.47'] [2014, 1, 8, 12, 41, 28, '543.76'] [2014, 1, 8, 12, 51, 29, '543.86'] [2014, 1, 8, 13, 1, 30, '544.00'] [2014, 1, 8, 13, 11, 31, 'Nothing found for: NASDAQ:AAPL'] [2014, 1, 8, 13, 21, 33, '543.32'] [2014, 1, 8, 13, 31, 34, '543.84'] [2014, 1, 8, 13, 41, 36, '544.26'] [2014, 1, 8, 13, 51, 37, '544.10'] [2014, 1, 8, 14, 1, 39, '544.30'] [2014, 1, 8, 14, 11, 40, '543.88'] [2014, 1, 8, 14, 21, 42, '544.29'] [2014, 1, 8, 14, 31, 45, '544.15'] ... As you can notice, they were displayed on the screen (line #59 in the code) in the form of Python’s list. It is important to note that the time we make an effort to capture and associate it with fetched asset price (query) is the computer’s system time, therefore please don’t expect regular time intervals as one may get from a verified market data providers. We are hacking in real-time! However, if you think about the data themselves, this time precision is not of great importance. As long as we fetch the data every freq seconds, that sufficiently allows us to build a risk management system or even to measure a rolling volatility of an asset. Your trading model will benefit anyway. Have also a note that if our Internet connection fails or there are some disturbances of a different kind, we will miss the data in a sent query as visible in the example above. Looks exciting? Give me High Five! and say Hell Yeah! Code Modification: Portfolio of Assets The presented Python code can be very easily modified if you wish to try fetching data for a couple of assets concurrently every freq seconds. Simply extend and amend all the lines starting at row #36, for example in the following form: 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 tickers=["NASDAQ:AAPL","NASDAQ:GOOG","NASDAQ:BIDU","NYSE:IBM", \ "NASDAQ:INTC","NASDAQ:MSFT","NYSEARCA:SPY"] # define the name of an output file fname="portfolio.dat" # remove a file, if exist os.path.exists(fname) and os.remove(fname) freq=600 # fetch data every 600 sec (10 min) with open(fname,'a') as f: writer=csv.writer(f,dialect="excel") #,delimiter=" ") while(t.tm_hour<=16): if(t.tm_hour==16): while(t.tm_min<01): #for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) time.sleep(freq) else: break else: for ticker in tickers: data=combine(ticker) print(data) writer.writerow(data) time.sleep(freq) f.close() That’s it! 
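Once such a file starts filling up, it takes only a few extra lines to turn it into a simple warning system. The sketch below is my own illustration, not part of the original code; it assumes the aapl.dat layout shown above (last column holding the quote) and a purely hypothetical 1% alert threshold:

# a naive intraday price-swing alarm built on the recorded file;
# file name and the 1% threshold are illustrative assumptions
import csv

fname = "aapl.dat"
threshold = 0.01            # alert if |return| between two samples exceeds 1%

prices = []
with open(fname) as f:
    for row in csv.reader(f):
        try:
            prices.append(float(row[-1]))   # last column holds the quote
        except ValueError:
            pass                            # skip "Nothing found for: ..." rows

for p0, p1 in zip(prices[:-1], prices[1:]):
    r = p1/p0 - 1.0
    if abs(r) > threshold:
        print("ALERT: price moved %.2f%% between samples" % (100.0*r))

Run periodically (or inside the main fetching loop), a check like this is the first building block of the warning system for sudden price swings mentioned at the beginning of this post.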
For the sake of real-time verification, here is a screenshot how does it work: Thu Jan 9 07:01:43 2014 Wed Jan 8 15:01:43 2014 [2014, 1, 8, 15, 1, 44, 'NASDAQ:AAPL', '543.55'] [2014, 1, 8, 15, 1, 44, 'NASDAQ:GOOG', '1140.30'] [2014, 1, 8, 15, 1, 45, 'NASDAQ:BIDU', '182.65'] [2014, 1, 8, 15, 1, 45, 'NYSE:IBM', '187.97'] [2014, 1, 8, 15, 1, 46, 'NASDAQ:INTC', '25.40'] [2014, 1, 8, 15, 1, 47, 'NASDAQ:MSFT', '35.67'] [2014, 1, 8, 15, 1, 47, 'NYSEARCA:SPY', '183.43'] [2014, 1, 8, 15, 11, 48, 'NASDAQ:AAPL', '543.76'] [2014, 1, 8, 15, 11, 49, 'NASDAQ:GOOG', '1140.06'] [2014, 1, 8, 15, 11, 49, 'NASDAQ:BIDU', '182.63'] [2014, 1, 8, 15, 11, 50, 'NYSE:IBM', '187.95'] [2014, 1, 8, 15, 11, 51, 'NASDAQ:INTC', '25.34'] [2014, 1, 8, 15, 11, 52, 'NASDAQ:MSFT', '35.67'] [2014, 1, 8, 15, 11, 53, 'NYSEARCA:SPY', '183.34'] ... where we can see that we were able to grab the prices of 6 stocks and 1 ETF (Exchange Trading Fund tracking S&P500 Index) every 10 min. Reflection You may wonder whether hacking is legal or not? The best answer I find in the words of Gordon Gekko: Someone reminded me I once said “Greed is good”, ## Retrieve the Data of Fund Performance utilizing Google and Python Do you remember my post on Get the Data of Fund Performance directly into Excel utilizing VBA and Google? If not, have a look as this time we will do the same but in Python. Shortly, given a list of APIR codes (referring to different investment option performance) we want to fetch the publicly available data at InvestSMART.com.au website. Previously we built a code in Excel/VBA. It works but it is slow as the process makes use of Excel’s plug-ins for establishing connection with external websites and downloading the data into Excel’s workbooks. We can improve it significantly using Python language. The second reason for doing it in Python is purely educational. We will see below how we can read the data from CSV file, send a query from Python directly to Google Search, retrieve a list of responses, get the URL of the page we are interested in, download it as a HTML code, parse it to get the content of the webpage without HTML ornaments, scan the content for required information (in our case: the most actual fund performance figures available online), and finally to save all in a file. This is our roadmap for today. Are you ready for the journey? In Search of Sunrise To get the job done, we will need to utilize some existing modules for Python. The most important one is google module available to download from here. We also will need mechanize module, a stateful programmatic web browsing in Python. If you need to install BeautifulSoap, here is all what you look for. Okay then, let’s roll it: 1 2 3 4 5 6 7 8 9 10 # Retrieve the Data of Fund Performance utilizing Google and Python # # (c) 2014 QuantAtRisk.com, by Pawel Lachowicz import mechanize, csv, os, unicodedata from google import search as gs from bs4 import BeautifulSoup # remove file, if exists os.path.exists("investsmart.csv") and os.remove("investsmart.csv") where the last two lines refer to the comma-separated-values file where we will save all fetched data. Sometimes it’s a good practice to remove it if it exists in a local directory. As you will see in a moment, we do it at the beginning as whatever will be retrieved by us from the web and filtered out for content we will be appending to the output file of investsmart.csv in a loop. 
We feed our algo with a list of APIR codes stored in a text file of codes.csv of the exemplary structure: AMP1232AU AMP1524AU AMP0167AU AMP1196AU ... In order to read each line from this file, we commence with: 12 13 14 15 16 17 18 19 with open('test.csv', 'rb') as f: reader = csv.reader(f) # reads the lines as a list for irow in reader: apircode=irow[0] print(apircode) # define query for Google Search phrase="investsmart com au "+apircode where line #14 iterates through file line-by-line reading in the APIR code (line #15) as the first element from the irow list. Having code in the variable of apircode we define a query for search engine (line #19) and open an inner operational loop, here referring to the output file: 21 22 23 24 25 26 27 28 29 30 with open('investsmart.csv', 'a') as csvfile: writer=csv.writer(csvfile, dialect="excel") i=0 # Search in Google for InvestSMART APIR code for url in gs(phrase, stop=1): i+=1 if(i==1): link=url # store the first URL from Google results print("%s\n" % link) The whole action starts from line #24 where we execute the first procedure of sending query to Google (command gs in line #26). As you look at this creature, it is, in fact, the masterpiece, beauty, and Python’s elegance all in one. Not only we hit the search engine with our phrase but we filter the incoming data from Google Search according to URLs list and extract URL corresponding to the first one found by Google (line #29). Next, we browse through the first link, 32 33 34 35 br=mechanize.Browser() # open a virtual browser resp=br.open(link) content=resp.get_data() #print content and download the HTML code of the website (line #34). To have a better feeling what the variable content contains for the first APIR code from our codes.csv input file, just uncomment and execute code in line #35. For AMP1232AU investment option, the content displays the website HTML code starting from: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]--> <!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]--> <!--[if IE 8]> <html class="no-js lt-ie9" lang="en"> <![endif]--> <!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]--> <html> <head> <title>AMP SigSup - AMP Super Cash - Managed fund Profile - Managed funds - InvestSMART - managed funds, shares and investment news</title> <meta http-equiv="X-UA-Compatible" content="IE=9" /> <meta name="verify-v1" content="xgkff+3TBcugNz7JE2NiJoqkiVs1PHybWgFkaBuhblI=" /> <meta http-equiv="Content-Type" content="text/html;charset=utf-8" /> <meta name="title" content="Discount access and research on more than 1,000 top performing managed funds" /> <meta name="description" content="Rebate of entry fees on the majority of Australian managed funds. Research on managed funds and shares." /> ... and containing (somewhere in the middle) the information we would like to extract: ... 
<!-- ****************************** --> <!-- *** Performance Table *** --> <!-- ****************************** --> <br /><br /> <table class="Container" width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td class="ContainerHeader" align="left"><b>Fund Performance</b> (as at 30th Nov 2013)&nbsp;<a href="javascript:PopupWindow('/education/GlossaryPopup.asp?id=133', '_blank', 700, 450)" class="glossaryLnk" title="More information...">&nbsp;</a></td> <td class="ContainerHeader" align="right">NOTE : returns for periods greater than 1 year are annualised</td> </tr> <tr> <td colspan="2" width="100%" align="left" valign="top" class="ContainerBody"> <table width="100%" cellpadding="0" cellspacing="0" border="0" class="DataTable"> <tr class="DataTableHeader"> <td align="left" valign="top" nowrap>&nbsp;</td> <td align="right" valign="top" nowrap>1 Month<br />%</td> <td align="right" valign="top" nowrap>3 Month<br />%</td> <td align="right" valign="top" nowrap>6 Month<br />%</td> <td align="right" valign="top" nowrap>1 Year<br />% p.a.</td> <td align="right" valign="top" nowrap>2 Year<br />% p.a.</td> <td align="right" valign="top" nowrap>3 Year<br />% p.a.</td> <td align="right" valign="top" nowrap>5 Year<br />% p.a.</td> <!--<td align="right" valign="top" nowrap>7 Year<br />% p.a.</td>--> <td align="right" valign="top" nowrap>10 Year<br />% p.a.</td> </tr> <tr class="DataTableRow" onMouseOver="this.className = 'DataTableRowHighlight';" onMouseOut ="this.className = 'DataTableRow';"> <td align="left" valign="top"><b>Total Return</b></td> <td align="right" valign="top"><b>0.12</b></td> <td align="right" valign="top"><b>0.37</b></td> <td align="right" valign="top"><b>0.8</b></td> <td align="right" valign="top"><b>1.77</b></td> <td align="right" valign="top"><b>2.24</b></td> <td align="right" valign="top"><b>2.62</b></td> <td align="right" valign="top"><b>-</b></td> <!--<td align="right" valign="top"><b>-</b></td>--> <td align="right" valign="top"><b>-</b></td> </tr> ... As we can see the HTML code is pretty nasty and digging up numbers from a subsection “Total Returns” seems to be troublesome. Well, not for Python. Just have a look how easily you can strip HTML from all those Christmas Tree ornaments: 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 # transform code into beautiful soup ;-) soup = BeautifulSoup(content) # perform post-processing for stripped data outputline=[] row=0 roww=0 for string in soup.stripped_strings: if(string=="Fund Performance"): print(string) j=1 row=1 if(row!=0): row+=1 if(row==3): updated=string print(string) if(string=="Total Return") and (j==1): print(string) roww=1 j=0 if(roww!=0): roww+=1 if(roww>=3) and (roww<11): if(roww==3): print(string) s=string.lstrip('(').rstrip(')').lstrip('as at ') else: s=string outputline.append(s) print(string) In line #38 we make use of BeautifulSoup – a Python library designed for quick turnaround projects like screen-scraping. The transformed data we can scan line-by-line (the loop starting in line #44) where the outer function of .stripped_strings makes all HTML bumpers and stickers vanish leaving us with pure text! From that point, all following lines of Python code (#45-67) are designed to extract specific information, print it on the screen, and append to Python’s list of outputline. 
For AMP1232AU investment option from our query, Python displays: AMP1232AU http://www.investsmart.com.au/managed-funds/profile.asp?function=print&FundID=16654 Fund Performance (as at 30th Nov 2013) Total Return 0.12 0.12 0.37 0.8 1.77 2.24 2.62 - - If Google returns the first link to be something completely different than InvestSMART webpage with requested APIR code, 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 # check validity of desired data for APIR code if(len(outputline)>0): rowout=[] outputline.insert(0,apircode) # add APIR code to the list outputline.append(updated) # add the performance numbers for item in outputline: rowout.append(item.encode('ascii')) # convert <type:unicode> # to <type:string> print(rowout) # show final version of the list on the screen if(len(rowout)>0): writer.writerow(rowout) # if non-zero, save/append the list # with data in "investsmart.csv" # output file, otherwise ignore # this query's results csvfile.close() f.close() the outputline string should be of zero length. With extra few lines for post-processing, the output file stores the subject of our small endeavour, i.e.: pawels-mbp:Py pawel cat investsmart.csv AMP1232AU,0.12,0.37,0.8,1.77,2.24,2.62,-,-,(as at 30th Nov 2013) AMP1524AU,-2.47,0.14,-1.58,8.63,14.09,9.42,-,-,(as at 30th Nov 2013) AMP0167AU,0.39,1.96,3.18,8.45,7.72,5.82,5.09,4.89,(as at 30th Nov 2013) ...

which ends our journey In Search of Sunrise. The total time of scanning 1000 APIR codes is cut down by 60% when using Python instead of VBA. That brings to my mind an AirAustral commercial that I saw a year ago: Arrive fast, then slow down.

## First Contact with Python

Python does miracles. Really. The more I dive into the abilities of this programming language, the more I'm left speechless. It is the sort of feeling you get when you can already swim among the sharks in the ocean, and then one day you jump out of a plane and discover skydiving for the very first time. You are flying! Sounds appealing enough? Great! Welcome onboard!

Before the air of Python coding becomes your second nature, you need to start your experience from the basics. They are not so exciting, but they provide you with the tastes and flavors of the language itself. With the promise to fly anywhere, and a motivation standing behind that, it is solely up to you how hungry you are! How hungry are you to leave the comfort zone of your current programming habits in Java, C++ or Matlab, and step into the zone of magic?! This is a proving ground, the judgment day of your character: are you strong enough to accelerate your life, to gain a new skill, and to do something for yourself that the world will thank you for tomorrow?!

Knockin’ on Heaven’s Door

If you went through my previous post on Python, you learned that there are at least two ways of running Python in a Unix/Linux/MacOSX environment: an interactive method (using Terminal) or an active method (using IDLE, an integrated development environment). The latter solution utilizes a text editor to keep a track record of your code, while the former acts upon what you type and forgets about it as soon as you exit Python. To visualize both cases, simply see the clip at the end of this post where I provide a quick guide to running Python code for the aforementioned two scenarios.

Now, we will compress Python's fundamentals into the following couple of paragraphs, referring to code executable from the text-editor level. Let the guys from MIT not think they are the best among the rest.

Python works by interpreting every single line of code, line by line. It assumes you are responsible for what you wish to do or calculate, so it doesn't check your syntax in the first place, flash a red light, and warn you. It trusts you! But as it goes through your code and finds typos, unknown variables, wrong types, or unrecognizable keywords or functions, it will halt. Pretty fair deal. The rules of the game are easy to grasp quickly. Let's see the basics.
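Before that, a tiny illustration (my own, not from the original lesson) of the halting behaviour just described; the script below runs happily until it meets a name that has never been defined:

a = 2
print(a)   # executed, displays 2
print(b)   # halts here with: NameError: name 'b' is not defined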

That’s one small Step for a Man, …

It is too much of a hassle to install and use Python as a simple calculator. But you can do it. Just add, subtract, do anything that follows the general rules of arithmetic:

1  # Accelerated Python for Quants, Part 3
2  # (c) 2013 Pawel Lachowicz, QuantAtRisk.com
3  
4  # Variables (BTW, the comments in Python start with #)
5  x=2
6  y=9
7  z=13.513
8  
9  a=x+x
10 b=(x-3*x)*7.32+x/y
11 c=y**(1/x)
12 txt= 'is derived'   # string
13 
14 print a
15 print "b= ", b
16 print("The value of a equals to %g where c= %f %s" %(a,c,txt))

where we define three variables $x, y$ and $z$, assign some values to them, and perform simple operations. The numerical results are assigned to three new variables, namely $a, b$, and $c$, with the intention of displaying them on the screen using the Python print command. Running the code returns:

4
b=  -29.28
The value of a equals to 4 where c= 1.000000 is derived
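It is worth tracing the value of $b$ by hand (a small check added here for clarity): $$b = (x-3x)\cdot 7.32 + x/y = (2-6)\cdot 7.32 + 2/9 = -29.28 + 0 = -29.28 \ ,$$ where the integer division $2/9$ contributes exactly $0$ rather than $0.2222$; with true floating-point division the result would have been about $-29.0578$. This is precisely the type trap discussed next.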

Now, this is your first lesson on what Python assumes about your style of talking to it. It resembles a never-ending discussion among all women: does the size matter? I don’t know. Just ask them! But all I know is that for Python the type of the variables matters! Our variable of $x$ has been defined as an integer so $a=x+x=4$ as expected. But if you had tried to calculate:

print y/x

you would get:

4

instead of 4.5. To tell Python what you want, simply make (at least one of) the variables a so-called float to ensure that the calculation will be undertaken at the floating-point level by the processor. The following operations will force the outcome to be of the float type. Just check them out:

print 9.0/2.0, 9.0/2, 9./2, 9/2.0, 9/2.

Use the type command to verify your variables anytime, for instance:

print(type(a))
print(type(b))
print(type(c))

should display

<type 'int'>
<type 'float'>
<type 'int'>

Note that in line #11 of the code, the value of $c$ is an integer despite the fact that we tried to perform the calculation of
$$c=y^{1/x} \equiv \sqrt[x]{y} = 3 \ .$$ Moreover, it delivered the outcome of $c=1$, which is far away from 3, right? This is all because of the difference in type, and you need to be aware of it while coding in Python. How to fix it? If your code churns out lots of variables of different kinds, you need to keep a track record of all of them and make sure what you want to obtain as a result. A simple trick to force floating-point calculations is to apply the float() function directly to a variable or expression, for example:

c=float(y**(1/float(x)))
print c
c_i=int(c)
print c_i

will give you:

3.0
3

where the last line shows how to convert the now correctly derived value of $c$, equal to 3, back into an integer type, assigned to a new variable $c_i$.

Take also a closer look at lines #14-16 of the code, where I showed you how differently you can display information on the screen. Line #16 is an old-school way in which you display a whole sentence as a string that contains some text of yours and includes the values of two variables, namely $a$ and $c$. Inside the string you place markers of the form %_, where %g, %f, and %s inform Python that it should display the variables a, c, and txt in general-number, float, and string format, respectively.
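As a quick side illustration (not part of the original listing), the same number rendered with a few different markers makes the distinction clear:

x = 4.0/3
print("%d | %g | %f | %s" % (x, x, x, x))

which prints the value truncated to an integer, in compact general format, with six decimal places, and via its default string conversion, respectively.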

Drilling the topic one level deeper, in your code you can check the type of any variable by conditionally verifying expressions that evaluate to the boolean type, i.e. True (=1) or False (=0).
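
A quick sketch confirming that claim (a side note, not part of the numbered listing):

print(True==1)      # True
print(False==0)     # True
print(True+True)    # 2, because booleans behave like integers in arithmetic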

So, here we go with a new Python structure of if-elif-else:

18  test=False
19  c=str(float(y**(1/float(x))))
20  if isinstance(c, int):
21      print 'c is an integer'
22  elif(isinstance(c, str)):
23      print 'c is a string'
24      test=True
25  else:
26      print "c is probably of a float-type"

resulting in:

c is a string

where we formulated a logical, English-language-style question: first checking the condition whether $c$ is an integer; if not, then checking another condition of $c$ being of the string type; and if that also failed, the code would display a message letting us know that $c$ is probably a float. Easy? At least extremely intuitive.

Notice some differences in notation. Python accepts strings encapsulated in single or double quotation marks. That's very kind of it: whatever you like, you can use in Python! Secondly, white space doesn't matter unless it violates the syntax or spoils the output you desire to get. Thirdly, note the colons (:) at the end of lines #20, #22, and #25. Forgetting about them is like jotting down an integral $\int x^2$ without $dx$ at the end. Fourthly, we can notice in lines #19, #20, and #22 how we apply the idea of a function acting on its inner arguments, which are evaluated first in the order of nesting: inside$\rightarrow$out.
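
A tiny, hypothetical snippet illustrating both points, the interchangeable quotation marks and the inside-out evaluation of nested functions:

s1='single quotes'
s2="double quotes"
print s1, s2
print len(str(123.456))    # str() is evaluated first, then len(), giving 7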

Lastly, in Python we use indentation for fixed language structures like the if-elif-else mentioned above. Indentation means putting 4 white spaces (never a Tab!) before typing the nested statement (see e.g. line #21). Why is it so important? Studies of how programmers read code suggest that our brains are nearly blind to braces but very sensitive to how and where lines of code are visually nested, which greatly improves readability. That is why the people who oversee Python's evolution adopted 4-space indentation as the standard (see PEP 8). Keep that story in mind.
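
A minimal sketch of the rule on a hypothetical snippet (not part of the numbered listing): the body of an if-block must be shifted by one consistent level, conventionally 4 spaces.

x=2
if x>0:
    print 'x is positive'    # indented by 4 spaces -- OK
# if x>0:
# print 'x is positive'      # no indentation -- Python halts with an IndentationError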

At the end we can make use of the variable test that we brought partially to life in line #18. It appears again inside the if-elif-else block (line #24) and changes its value to True if $c$ turns out to be of the string type. Since in our code this is exactly what happens, we send an additional order to Houston, where the guys from NASA commence the countdown of the Python rocket using another important language structure, the while-loop, as follows:

28  x = 10
29  while x > 0:
30      print(x)
31      x -= 1
32      if(x==0):
33          print("Lift off!")

sending it into space when the launch status is reached:

10
9
8
7
6
5
4
3
2
1
Lift off!

HOMEWORK

Write code in Python for the following problems, making use of the knowledge from this lesson. Submit your results as a comment to this post. Challenge your brain. We will accelerate next time, Warp 2, so buckle up!

(a) Given $|x-3|>21$ solve for $x$.
(b) The monthly rate of return for the cash investment option is $r=0.367$% and is fixed over the course of the next 12 months. Find the annualized rate of return for that option. Hint: use the following formula in your code, but apply the while structure:
$$R = \left[ \prod_{i=1}^{12} (1+r) \right] -1$$
(c) A grandpa and a grandma are together 147 years old. The grandpa is twice as old as the grandma was when he was as old as the grandma is right now. How old are the grandpa and the grandma today?

VIDEO

A quick guide to how one can run Python code in a MacOS/Unix-like environment:

## Talking to Python: TextMate Code Editor

Programming in Python requires a decent text editor, and "decent" is a powerful word here. With the dramatic improvement in this field over the past decade, anyone who wishes to start their adventure with coding should consider what sort of Integrated Development Environment would be most suitable for their needs. It's pretty difficult to address all related questions when you are a beginner to programming in Python. A quick scan of the Web suggests one of the following solutions for Mac: Vico, TextWrangler, Emacs, TextMate, Eclipse, etc. It is also possible to run a Python project under Xcode (see here).

Within this Python for Quants course we will make use of the fairly simple but dynamically evolving code editor TextMate. There are no special criteria behind this choice, but it does a hell of a job for us as we gain our expertise in Python. The best way to summarize its capability is to quote James Gray: TextMate is a full-featured text editor available for Mac OS X that can greatly enhance your text manipulation skills; it is actually a thin shell over a personalized team of robot ninjas ready to do your bidding. It comes with built-in automations for Python. You can manage all the files in your projects, fly through your files with easy navigation techniques, and master quick and dirty text editing with strong regular-expression integration.

Sound appealing, convincing? Well, you need to start somewhere and this is a good place to start.

Installation

One sentence should be enough at this point: visit TextMate's website and proceed with the download and installation. But one sentence is never quite enough. If you have already migrated to OS X Mavericks, the team recommends v2.0-alpha for this platform. You can download and start using it for free in a stripped-down configuration. This will be more than sufficient for us at the beginning. But if you think more seriously about coding as your new life's calling, you should consider going for an upgrade.

Editor

We will need a plain sheet of electronic paper to write the code on. As soon as we start writing Python code, a good practice is to save the file right at the beginning. A world-wide convention among coders is to name all Python source listings with the extension .py. To convince yourself of TextMate's capabilities across different languages, explore the Bundles entry in the main menu:

Now, if you type into the editor the simplest two lines of Python:

# Your very first script

print 'Hello, world!'

and press ⌘R, you should be able to execute the code. A side window pops up with the desired outcome:

Please note that code execution in TextMate is not available for interactive commands like input in the following example:

foo=input('Please enter a value:')

It may sound like a terrible limitation, but professional coding nowadays reduces such interaction to the absolute minimum. You can always do it from the command line by running python in the Terminal:
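
A sketch of what such an interactive session might look like (the entered value 3.14 is just an example):

$ python
>>> foo=input('Please enter a value: ')
Please enter a value: 3.14
>>> print foo
3.14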

If you followed the configuration steps from the previous Part 1 and installed all additional Python libraries in a folder of your choice, you may meet some difficulties in reaching them from within TextMate. It took me 15 minutes to find a way to correct that inconvenience, which was not so obvious at the beginning.

Firstly, make sure the PATH is set properly and includes access to the python executables. Select Preferences… from the menu, go to the ‘Variables’ tab, and edit it accordingly, for instance:

Secondly, you will need to amend one line in the TextMate script responsible for running your code when you press ⌘R. Go to Bundles in the menu, then to Select Bundle Item…, and in the search box of the new window type ‘Run Script’. You should see a short list of options pointing at the Python reference. A double click will take you to the editor of the ⌘R command.

Around line #13, by default TextMate looks for python installed in your OS X system and tries to execute:

TextMate::Executor.run(ENV["TM_PYTHON"] || "python", "-u", ENV["TM_FILEPATH"] ...

line. We just need to eliminate the inner reference to the variables ENV["TM_PYTHON"] || "python" and replace it with a path to your python, e.g.:

TextMate::Executor.run("~/env/bash27/bin/python", "-u", ENV["TM_FILEPATH"] ...

That should resolve all major issues related to getting access to the additional libraries you installed manually.

Stay Tuned!

A new ebook on Python for Quants is coming this July! Don't miss the full introductory course to Python with 300+ practical examples ready to rerun and use for your own code building:

## Setting up Python for Quantitative Analysis in OS X 10.9 Mavericks

Welcome!

Someone, not so long ago, turned my attention from Matlab coding towards Python programming. He used a strong argument about the enormous flexibility of the language and its equally dynamic capabilities as contrasted with Matlab, not to mention the zero-dollar cost and the power behind the art of computing. That got my attention. When I did my own in-depth research on Python's applications to quantitative finance and risk modelling, I came back to my friend saying: pal, you had my attention, but now you have my curiosity. It's difficult to divide a heart between two and love them both the same way: Matlab and Python. However, since Python enters the salons of financial coding in full splendour, it deserves more respect than one would initially grant it.

With this article, my wish is to commence a new thread at Quant At Risk: a tutorial for all quants and financial risk managers wishing to pick up the language's fundamentals, tastes, and ingredients, backed by a number of useful examples along the way. As this is a new path of wisdom for me too, I hope we will enjoy this journey together.

Assuming that Apple, Inc. and its powerful computers will keep becoming increasingly popular for financial applications, let me add one extra reason behind the motivation for this post: we will set up a complete quant programming Python environment in Apple's OS X 10.9 Mavericks operating system.

Steps to Success

Let me provide you with a complete list of the steps you need to take to download, install, and set up Python on a Mac. I will limit all descriptions to the required minimum. The whole process should take no longer than thirty minutes, depending on your Internet connection.

1. Download the .dmg file with Python 2.7.5 for Mac OS X 64-bit/32-bit from here.
Execute it and install Python on your hard drive.
2. XCode
If you don’t have it, just visit the Mac App Store for a free download. It may require you to register first as a developer, but that comes with all the privileges for your programming career in the future. Once installed, open XCode, go to Preferences… and click on the Downloads tab. We will need Command Line Tools downloaded and installed, therefore if you don’t have them, follow that path.
Open Terminal for next steps.
3. sudo easy_install virtualenv
This command installs virtualenv, which allows us to set up a separate, self-contained Python environment assisting with the further installation process.

4. mkdir -p ~/Python/env; virtualenv ~/Python/env
5. pip install mpmath
6. pip install numpy
7. pip install scipy
8. /usr/bin/ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"
9. brew install pkg-config
10. brew install freetype
11. brew install libpng
12. brew install ffmpeg
13. pip install matplotlib
14. brew install zeromq
15. pip install pyzmq
18. pip install azure
19. pip install curses
20. pip install cython
21. pip install jinja2
22. pip install pexpect
23. pip install pygments
24. pip install pymongo
25. pip install sphinx
26. pip install sqlite3
27. pip install wx
28. pip install zmq
29. pip install sympy
30. pip install patsy
31. pip install scikit_learn
32. pip install statsmodels
33. pip install pandas
34. pip install ipython
We will be using all of these packages extensively, separately or in combination. They supplement each other and equip us with a powerful arsenal for sophisticated coding in Python.
35. export PY_ENV_DIR=~/Python/env
Add this line to your ~/.bash_profile file.
By typing the command which python we should get: ~/Python/env/bin/python. If so, it looks like we went through the storm smoothly and we are go for launch!
In Terminal, type and execute the python command; it should give us access to Python coding.
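
As an extra, optional sanity check, a minimal sketch assuming the installations above completed cleanly, you can import a few of the core scientific packages and print their versions from a short script (saved e.g. as sanity_check.py, a hypothetical file name):

# sanity_check.py -- a post-install check; run it with: python sanity_check.py
import numpy, scipy, matplotlib, pandas

for pkg in (numpy, scipy, matplotlib, pandas):
    print("%-12s %s" % (pkg.__name__, pkg.__version__))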
In the next post, we will set up a nice and cosy Integrated Development Environment (IDE) which should be sufficient for our early stages of scripting in Python.