A construction of your quantitative workshop in Python requires a lot of coding or at least spending a considerable amount of time assembling different blocks together. There are many simple fragments of code reused many times. The calculation of covariance matrix is not a problem once NumPy is engaged but the meaning is derived once you add some background idea what you try to achieve.
Let’s see in this lesson of Accelerated Python for Quants tutorial how to use Quandl.com data provider in construction of any $N$-Asset Portfolio based on SEC securities and for return-series we may calculate a corresponding covariance matrix.
Quandl and SEC Stock List
Quandl is a great source of data. With their ambition to become the largest data provider on the planet free of charge, no doubt they do an amazing job. You can use their Python API to feed your code directly with Open, High, Low, Close for any SEC stock. In the beginning we will need a list of companies (tickers) and, unfortunately, the corresponding internal call-tickers as referred to by Quandl. The .csv file containing all information you can download from this website or directly here: secwiki_tickers.csv. The file contains the following header:
Ticker Name Sector Industry Price Collection
where we are interested in matching Ticker with Price field. The latter for AAPL stock displays “WIKI/AAPL” code. That’s all we need for now to grab.
Let’s say we have a freedom of choice to select any of 2277 stock data from our SEC universe (provided in secwiki_tickers.csv file). For the sake of simplicity I’ll select ony three and save them in a plain text file of portfolio.lst containing:
AAPL IBM TXN
I do it on purpose in order to show you how easily we can read in this list from a file in Python. As usual, we start our adventure from data pre-processing part:
1 2 3 4 5 6 7 8 9 10 11
# Covariance Matrix for N-Asset Portfolio fed by Quandl in Python # (c) 2014 QuantAtRisk, by Pawel Lachowicz import Quandl import numpy as np import pandas as pd df=pd.read_csv('secwiki_tickers.csv') dp=pd.read_csv('portfolio.lst',names=['pTicker']) pTickers=dp.pTicker.values # converts into a list
In order to install Quandl module in your Mac/Linux environment simply type pip install Quandl (for more information see here). In the code above, we employ pandas’ read_csv function both for reading in the data from .csv file as well as from a plain text file. For the latter, line #10, we add a name of the column, pTicker, to point at portfolio tickers, and next we convert pandas’ DataFrame object into a Python list.
Now, we gonna use the power of Python over Matlab in search for the corresponding Quandl (Price) ticker code:
13 14 15 16 17 18 19 20 21
tmpTickers= for i in range(len(pTickers)): test=df[df.Ticker==pTickers[i]] if not(test.empty): tmp=test.Price.values+'.4' # of <type 'numpy.ndarray'> tmp2=tmp.tolist() tmpTickers.append(tmp2) print(tmpTickers)
This is executed in line #15 almost automatically. The result of the search is the DataFrame record (empty or containing the corresponding information on the stock as read out from (df) .csv data source. Simply, if the ticker of the security in our portfolio.lst file does not exist, it is skipped. Quandl allows you to retrieve information on stock’s Open, High, Low, Close, Volume by calling (Price) Quandl ticker in the following form, respectively:
'WIKI/AAPL.1' 'WIKI/AAPL.2' 'WIKI/AAPL.3' 'WIKI/AAPL.4' 'WIKI/AAPL.5'
Below we will stick to Close prices. That is why, in line #17, we add ‘.4′ to the string. Please note that we should get the same data by calling ‘GOOG/NASDAQ_AAPL.4′ and ‘WIKI/AAPL.4′ Quandl tickers. Because a variable tmp is of numpy.ndarray type, in line #18, we convert it into a list type. The final list of tmpTickers contains therefore all corresponding Quandl tickers in the form:
[['WIKI/AAPL.4'], ['WIKI/IBM.4'], ['WIKI/TXN.4']]
and before they can be used for return-series retrieval from the Quandl database, we need to scalp each item a bit in the following way:
23 24 25 26 27 28 29 30
tmp= for i in range(len(tmpTickers)): tmp2=str(tmpTickers[i]).strip('') print(tmp) tmp.append(str(tmp2).strip('\'\'')) QuandlTickers=tmp print(QuandlTickers)
['WIKI/AAPL.4', 'WIKI/IBM.4', 'WIKI/TXN.4']
We fetch the data pretty easily:
32 33 34 35 36 37
data= Quandl.get(QuandlTickers, authtoken='YourAuthToken', \ trim_start='2014-10-01', trim_end='2014-11-04', \ transformation='rdiff') d=data.values.T print(d)
where rdiff enforces price-series transformation into return-series expressed as row-on-row % change, y'[t] = (y[t] – y[t-1])/y[t-1] (see more here: Using the Quandl API). By default we download daily returns:
[[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 nan nan -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]]
You may notice some missing data in the form denoted by nan. It may happen for any dataset. Life is not perfect. Quandl is only an option. In the first approximation we may fill the missing returns with zeros using pandas:
39 40 41 42 43 44
for i in range(d.shape): df=pd.DataFrame(d[i]) df.fillna(0,inplace=True) d[i]=df.values.T print(d)
what works quickly and efficiently:
[[-0.01558313 0.00725953 -0.0028028 0. -0.00873319 0.02075949 0.00218254 -0.00287072 -0.00913333 -0.01062018 -0.01225316 -0.01312282 0.01464783 0.02139859 0.0271652 0.00507466 0.01786581 0.00372031 -0.00104543 0.01550756 0.00562114 -0.00335383 0.00953449 0.01296296 -0.00731261] [-0.01401254 -0.00138911 0.0094163 0.0019611 -0.01761532 0.0196543 -0.01552598 -0.00262847 -0.01296187 0.00152572 -0.01115343 -0.01050894 0.0122887 -0.0711343 -0.03471319 -0.00882191 0.00241053 -0.0006166 -0.00129566 0.01068759 -0.00085575 0.00544476 0.00030423 -0.00024331 -0.01040399] [-0.01698469 0. 0. -0.00416489 -0.01319035 0.020213 -0.01959949 -0.07127336 -0.0189518 0.00906272 0.01063578 0.01941066 0.00183528 0.01694527 0.05314118 -0.00320718 0.00815101 0.01212766 0.00798823 0.01147028 -0.00350515 -0.01655287 0.0448138 0.00845751 0.00638978]]
Having data is better than missing them. It is important to check how many nans every return-series has. In our case of TXN missing information constitutes 8% at 25 data points. A good indicator is to have less than 3%. Otherwise an impact on any further calculations may be significantly propagated. In this point it is good to recall Bruce Lee who said: “Empty your mind, be formless, shapeless, like water. If you put water into a cup, it becomes the cup. If you put water in a bottle, it becomes the bottle. If you put it in a teacup, it becomes the teacup. Now, water can flow, and it can crush. Be water, my friend.” Two added zeros contribute to the TXN distribution of returns in its central part, therefore the more we “fill in” with zeros, a fit of normal distribution flattens the tails (a distribution becomes more peaked). Therefore, in such case, I would agree with Bruce only on the “crush” part of his wisdom.
Keeping above short note on some dirty tricks in mind, we obtain the desired covariance matrix simply and painfully,
[[ 1.44660397e-04 -2.13905277e-05 1.31742330e-04] [ -2.13905277e-05 3.12211511e-04 -5.62146677e-05] [ 1.31742330e-04 -5.62146677e-05 5.44348868e-04]]
what accomplishes our efforts.
Applied Portfolio Optimization with Risk Management using Matlab