Quantitative Analysis, Risk Management, Modelling, Algo Trading, and Big Data Analysis

## Rebinning Tick-Data for FX Algo Traders

If you work or intend to work with FX data in order to build and backtest your own FX models, the Historical Tick-Data of Pepperstone.com is probably the best place to kick off your algorithmic experience. As of now, they offer tick-data sets for the 15 most frequently traded currency pairs, going back to May 2009. Some of the unzipped files (one month of data) exceed 400 MB in size, i.e. they store 8.5+ million lines of tick-resolution bid and ask prices. The good thing is that you can download them all free of charge and their quality is regarded as very high. The bad thing is that there is a 3-month delay in data accessibility.

Rebinning that tick-data, however, is a different story, and the subject of this post. We will see how efficiently you can turn Pepperstone's tick-data set(s) into a 5-min time-series as an example. We will make use of scripting in bash (Linux/OS X) supplemented with data processing in Python.

### Data Structure

You can download Pepperstone's historical tick-data from their website, month by month, pair by pair. The inner structure of each file follows the same pattern, namely:

```
$ head AUDUSD-2014-09.csv
AUD/USD,20140901 00:00:01.323,0.93289,0.93297
AUD/USD,20140901 00:00:02.138,0.9329,0.93297
AUD/USD,20140901 00:00:02.156,0.9329,0.93298
AUD/USD,20140901 00:00:02.264,0.9329,0.93297
AUD/USD,20140901 00:00:02.265,0.9329,0.93293
AUD/USD,20140901 00:00:02.265,0.93289,0.93293
AUD/USD,20140901 00:00:02.268,0.93289,0.93295
AUD/USD,20140901 00:00:02.277,0.93289,0.93296
AUD/USD,20140901 00:00:02.278,0.9329,0.93296
AUD/USD,20140901 00:00:02.297,0.93288,0.93296
```

The columns, from left to right, represent: the pair name, the date and tick-time, the bid price, and the ask price.

### Pre-Processing

Here, for each .csv file, we aim to split the date into year, month, and day separately, and to remove commas and colons, so that the raw data are ready to be read in as a matrix (array) in any other programming language (e.g. Matlab or Python). A matrix is a mathematically intuitive data structure, and direct referencing of any of its columns lets a backtesting engine run at full thrust.

Let's play with the AUDUSD-2014-09.csv data file. Working in the same directory where the file is located, we begin by writing a bash script (pp.scr) that contains:

```bash
# pp.scr
# Rebinning Pepperstone.com Tick-Data for FX Algo Traders
# (c) 2014 QuantAtRisk, by Pawel Lachowicz

clear
echo "..making a sorted list of .csv files"
for i in $1-*.csv; do
    echo ${i##$1-} $i ${i##.csv}
done | sort -n | awk '{print $2}' > $1.lst

python pp.py
head AUDUSD.pp
```

that you run in Terminal:

```
$ chmod +x pp.scr
$ ./pp.scr AUDUSD
```

where the first command makes the script executable (you need to perform this task only once). The for loop in the script looks for all .csv data files in the local directory starting with the AUDUSD- prefix and saves their sorted list in the AUDUSD.lst file. Since we work with the AUDUSD-2014-09.csv file only, the AUDUSD.lst file will contain:

```
$ cat AUDUSD.lst
AUDUSD-2014-09.csv
```

as expected. Next, we utilise the power and flexibility of Python in the following way:

```python
# pp.py
import csv

fnlst="AUDUSD.lst"
fnout="AUDUSD.pp"

for lstline in open(fnlst,'r').readlines():
    fncur=lstline[:-1]
    #print(fncur)
    with open(fnout,'w') as f:
        writer=csv.writer(f,delimiter=" ")
        i=1   # counts the number of lines with tick-data
        for line in open(fncur,'r').readlines():
            if(i<=5200):   # replace with (i>0) to process an entire file
                #print(line)
                year=line[8:12]
                month=line[12:14]
                day=line[14:16]
                hh=line[17:19]
                mm=line[20:22]
                ss=line[23:29]
                bidask=line[30:]
                writer.writerow([year,month,day,hh,mm,ss,bidask])
            i+=1
```

It is a pretty efficient way to open a really big file and process its information line by line. For display purposes only, the code above processes just the first 5,200 lines. The output of the closing `python pp.py` and `head AUDUSD.pp` commands of pp.scr is the following:

```
2014 09 01 00 00 01.323 "0.93289,0.93297 "
2014 09 01 00 00 02.138 "0.9329,0.93297 "
2014 09 01 00 00 02.156 "0.9329,0.93298 "
2014 09 01 00 00 02.264 "0.9329,0.93297 "
2014 09 01 00 00 02.265 "0.9329,0.93293 "
```

since we allowed Python to save the bid and ask information as one string (due to a variable number of decimal digits). In order to clean up this mess we continue:

```bash
# pp.scr (continued)

echo "..removing token: comma"
sed 's/,/ /g' AUDUSD.pp > $1.tmp
rm AUDUSD.pp

echo "..removing token: double quotes"
sed 's/"/ /g' $1.tmp > $1.tmp2
rm $1.tmp

echo "..removing empty lines"
sed -i '/^[[:space:]]*$/d' $1.tmp2
mv $1.tmp2 AUDUSD.pp

echo "head..."
head AUDUSD.pp
echo "tail..."
tail AUDUSD.pp
```

which brings us to the pre-processed data:

```
..removing token: comma
..removing token: double quotes
..removing empty lines
head...
2014 09 01 00 00 01.323 0.93289 0.93297
2014 09 01 00 00 02.138 0.9329 0.93297
2014 09 01 00 00 02.156 0.9329 0.93298
2014 09 01 00 00 02.264 0.9329 0.93297
2014 09 01 00 00 02.265 0.9329 0.93293
2014 09 01 00 00 02.265 0.93289 0.93293
2014 09 01 00 00 02.268 0.93289 0.93295
2014 09 01 00 00 02.277 0.93289 0.93296
2014 09 01 00 00 02.278 0.9329 0.93296
2014 09 01 00 00 02.297 0.93288 0.93296
tail...
2014 09 02 00 54 39.324 0.93317 0.93321
2014 09 02 00 54 39.533 0.93319 0.93321
2014 09 02 00 54 39.543 0.93318 0.93321
2014 09 02 00 54 39.559 0.93321 0.93321
2014 09 02 00 54 39.784 0.9332 0.93321
2014 09 02 00 54 39.798 0.93319 0.93321
2014 09 02 00 54 39.885 0.93319 0.93325
2014 09 02 00 54 39.886 0.93319 0.93321
2014 09 02 00 54 40.802 0.9332 0.93321
2014 09 02 00 54 48.829 0.93319 0.93321
```
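Once cleaned up like this, each line is just whitespace-separated numbers, so the file can be read straight into a numeric matrix. A minimal pure-Python sketch (the `parse_ticks` helper is my own illustration, not part of pp.scr; for large files one would reach for numpy.loadtxt instead):

```python
def parse_ticks(lines):
    """Turn cleaned tick lines into a list of float rows:
    [year, month, day, hour, minute, second, bid, ask]."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields:                          # skip empty lines
            rows.append([float(x) for x in fields])
    return rows

# Two ticks in the cleaned format shown above:
sample = [
    "2014 09 01 00 00 01.323 0.93289 0.93297",
    "2014 09 01 00 00 02.138 0.9329  0.93297",
]
rows = parse_ticks(sample)
print(rows[0])   # [2014.0, 9.0, 1.0, 0.0, 0.0, 1.323, 0.93289, 0.93297]
```

With the data in this shape, referencing "the bid column" is just `row[6]`, which is exactly the matrix-style access the pre-processing was aiming for.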

Personally, I love this part: you can learn how to do simple but necessary text-file operations by typing single lines of Unix/Linux commands. Good luck to those who try to do the same in Microsoft Windows in under 30 seconds.

### Rebinning: 5-min Data

There are many schools of rebinning; for some people it is an art. We just want to get the job done. I opt for simplicity and for understanding the data we deal with. Imagine we have two adjacent 5-min bins with a tick history of trading:

We want to derive the closest possible (or fairest) price estimation every 5 min, denoted in the figure above by a red marker. The old-school approach is to take the average over a number (larger than 5) of tick-data points to the left and to the right of the bin boundary. That creates an under- or overestimation of the mid-price.

If we trade live, every 5 min we receive information on the last tick point before the minute hits 5, and we wait for the next tick point after it (blue markers). Taking the average of their prices (the mid-price) makes the most sense. The precision we look at here is sometimes $10^{-5}$. It is of little significance if our position is small, but if it is not, the mid-price may start playing a crucial role.

The con of the old-school approach is that it neglects the possibly high volatility among all tick-data within the last 5 minutes.
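The live-trading rule above boils down to plain arithmetic: average the two bracketing bid quotes, average the two bracketing ask quotes, and take the mid of the two. A toy sketch (function name and quote values are hypothetical):

```python
def boundary_price(bid_before, ask_before, bid_after, ask_after):
    """Price estimate at a 5-min bin boundary: the mid of the mean
    bid and the mean ask of the two ticks bracketing the boundary."""
    bm = (bid_before + bid_after) / 2.0   # mean bid
    am = (ask_before + ask_after) / 2.0   # mean ask
    return (bm + am) / 2.0

# Hypothetical ticks bracketing the 00:05:00 boundary:
p = boundary_price(0.93289, 0.93297, 0.93291, 0.93297)
print(round(p, 6))   # 0.932935
```

Note that only four numbers enter the estimate; everything that happened inside the bin is deliberately ignored, which is exactly the trade-off discussed above.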

The following Python code (pp2.py) performs 5-min rebinning for our pre-processed AUDUSD-2014-09 file:

```python
# pp2.py
import csv
import numpy as np

def convert(data):
    tempDATA = []
    for i in data:
        tempDATA.append([float(j) for j in i.split()])
    return np.array(tempDATA).T

fname="AUDUSD.pp"

with open(fname) as f:
    data = f.read().splitlines()

#print(data)

i=1
for d in data:
    list=[s for s in d.split(' ')]
    #print(list)
    # remove empty elements in the list
    dd=[x for x in list if x]
    #print(dd)
    tmp=convert(dd)
    #print(tmp)
    if(i==1):
        a=tmp
        i+=1
    else:
        a = np.vstack([a, tmp])
        i+=1

N=i-1
#print("N = %d" % N)

# print the first line
tmp=np.array([a[1][0],a[1][1],a[1][2],a[1][3],a[1][4],0.0,(a[1][6]+a[1][7])/2])
print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
      (tmp[0],tmp[1],tmp[2],tmp[3],tmp[4],tmp[5],tmp[6]))
m=tmp

# check the boundary conditions (5 min bins)
for i in xrange(2,N-1):
    if( (a[i-1][4]%5!=0.0) and (a[i][4]%5==0.0) ):

        # BLUE MARKER No. 1 (the last tick before the boundary, i-1)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i-1][0],a[i-1][1],a[i-1][2],a[i-1][3],a[i-1][4],
        #       a[i-1][5],a[i-1][6],a[i-1][7]))
        b1=a[i-1][6]
        b2=a[i][6]
        a1=a[i-1][7]
        a2=a[i][7]
        # mid-price, and new date for the 5-min bin
        bm=(b1+b2)/2
        am=(a1+a2)/2
        Ym=a[i][0]
        Mm=a[i][1]
        Dm=a[i][2]
        Hm=a[i][3]
        MMm=a[i][4]
        Sm=0.0   # set seconds to zero

        # RED MARKER (the rebinned 5-min price)
        print("%.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f" %
              (Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2))
        tmp=np.array([Ym,Mm,Dm,Hm,MMm,Sm,(bm+am)/2])
        m=np.vstack([m, tmp])

        # BLUE MARKER No. 2 (the first tick after the boundary, i)
        #print(" %.0f %2.0f %2.0f %2.0f %2.0f %6.3f %10.6f %10.6f" %
        #      (a[i][0],a[i][1],a[i][2],a[i][3],a[i][4],
        #       a[i][5],a[i][6],a[i][7]))
```

which you run from the pp.scr file as:

```bash
# pp.scr (continued)

python pp2.py > AUDUSD.dat
```

in order to get the 5-min rebinned FX time-series as follows:

```
$ head AUDUSD.dat
2014  9  1  0  0  0.000   0.932935
2014  9  1  0  5  0.000   0.933023
2014  9  1  0 10  0.000   0.932917
2014  9  1  0 15  0.000   0.932928
2014  9  1  0 20  0.000   0.932937
2014  9  1  0 25  0.000   0.933037
2014  9  1  0 30  0.000   0.933075
2014  9  1  0 35  0.000   0.933070
2014  9  1  0 40  0.000   0.933092
2014  9  1  0 45  0.000   0.933063
```

That concludes our efforts. Happy rebinning!

## Hacking Google Finance in Real-Time for Algorithmic Traders

Forecasting risk in algorithmic stock trading is of paramount importance for everyone. You should always look for ways to detect sudden price changes and take immediate action to protect your investments.

Imagine you opened a new long position last Wednesday, buying 1500 shares of NASDAQ:NVDA at the market price of USD 16.36. The next day the price dropped to USD 15.75 by the end of the session. You are down 3.87%, or almost a grand, in one day. If you can handle that, it's okay, but what if the drop were steeper? Another terrorist attack, an unforeseen political event, a North Korean nuclear strike? Then what? You need to react! If you have information, you have options in your hands.

In this post we will see how one can use the real-time stock prices displayed on the Google Finance website, fetching and recording them on your computer. Having them, you can build your own warning system for sudden price swings (risk management) or run the code in the background for a whole trading session (for any stock, index, etc. accessible through Google Finance) and capture asset prices with an intraday sampling (e.g. every 10 min, 30 min, 1 h, and so on). From this point, only your imagination can stop you from using all the collected data.

### Hacking with Python

If you ever dreamt of becoming a hacker, this is your chance to shine! I got my inspiration after reading the book Violent Python: A Cookbook for Hackers, Forensic Analysts, Penetration Testers and Security Engineers by TJ O'Connor. A powerful combination of the beauty and the beast.
The core of our code is a small function that does the job. For a specified Google-style ticker (query), it fetches the data directly from the server, returning the most current price of the asset:

```python
# Hacking Google Finance in Real-Time for Algorithmic Traders
#
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz

import urllib, time, os, re, csv

def fetchGF(googleticker):
    url="http://www.google.com/finance?&q="
    txt=urllib.urlopen(url+googleticker).read()
    k=re.search('id="ref_(.*?)">(.*?)<',txt)
    if k:
        tmp=k.group(2)
        q=tmp.replace(',','')
    else:
        q="Nothing found for: "+googleticker
    return q
```

Just make sure that the Google ticker is correctly specified (as we will see below). Next, let's display our local time on the screen and force a change of the system time to the one corresponding to New York City, NY. We make the latter assumption as we would like to track the intraday prices of stock(s) traded on the NYSE or NASDAQ. However, if you are tracking the FTSE 100 index, the Universal Time (UTC) of London is advisable as an input parameter.
```python
# display the time corresponding to your location
print(time.ctime())
print

# set the local time zone to NYC
os.environ['TZ']='America/New_York'
time.tzset()
t=time.localtime()
print(time.ctime())
print
```

Having that, let us define a side-function combine which we will use to glue all fetched data together into a Python list variable:

```python
def combine(ticker):
    quote=fetchGF(ticker)  # use the core-engine function
    t=time.localtime()     # grasp the moment of time
    output=[t.tm_year,t.tm_mon,t.tm_mday,t.tm_hour,  # build a list
            t.tm_min,t.tm_sec,ticker,quote]
    return output
```

As an input, we define the Google ticker of our interest:

```python
ticker="NASDAQ:AAPL"
```

for which we open a new text file where all queries will be saved in real-time:

```python
# define the file name of the output record
fname="aapl.dat"
# remove the file, if it exists
os.path.exists(fname) and os.remove(fname)
```

Eventually, we construct the final loop over trading time. Here, we fetch the last data at 16:00:59 New York time. The key parameter in the game is the freq variable, where we specify the intraday sampling (in seconds). From my tests, using a private Internet provider, I found that the most optimal sampling was 600 sec (10 min). Somehow, for shorter time intervals, Google Finance detected too-frequent queries sent from my IP address. The same test succeeded from a different IP location; therefore, feel free to play with your local Internet network to find the lowest available sampling time for your geolocation.
```python
freq=600  # fetch data every 600 sec (10 min)

with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel")  #,delimiter=" ")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                data=combine(ticker)
                print(data)
                writer.writerow(data)  # save data in the file
                time.sleep(freq)
                t=time.localtime()     # refresh the time stamp
            else:
                break
        else:
            data=combine(ticker)
            print(data)
            writer.writerow(data)  # save data in the file
            time.sleep(freq)
            t=time.localtime()     # refresh the time stamp
f.close()
```

To see how the above code works in practice, I conducted a test on Jan 9, 2014, starting at 03:31:19 Sydney/Australia time, corresponding to 11:31:19 New York time. Setting the sampling frequency to 600 sec, I was able to fetch the data in the following form:

```
Thu Jan  9 03:31:19 2014
Wed Jan  8 11:31:19 2014

[2014, 1, 8, 11, 31, 19, '543.71']
[2014, 1, 8, 11, 41, 22, '543.66']
[2014, 1, 8, 11, 51, 22, '544.22']
[2014, 1, 8, 12, 1, 23, '544.80']
[2014, 1, 8, 12, 11, 24, '544.32']
[2014, 1, 8, 12, 21, 25, '544.86']
[2014, 1, 8, 12, 31, 27, '544.47']
[2014, 1, 8, 12, 41, 28, '543.76']
[2014, 1, 8, 12, 51, 29, '543.86']
[2014, 1, 8, 13, 1, 30, '544.00']
[2014, 1, 8, 13, 11, 31, 'Nothing found for: NASDAQ:AAPL']
[2014, 1, 8, 13, 21, 33, '543.32']
[2014, 1, 8, 13, 31, 34, '543.84']
[2014, 1, 8, 13, 41, 36, '544.26']
[2014, 1, 8, 13, 51, 37, '544.10']
[2014, 1, 8, 14, 1, 39, '544.30']
[2014, 1, 8, 14, 11, 40, '543.88']
[2014, 1, 8, 14, 21, 42, '544.29']
[2014, 1, 8, 14, 31, 45, '544.15']
...
```

As you can notice, the records were displayed on the screen (by the print(data) line in the code) in the form of a Python list. It is important to note that the time we capture and associate with the fetched asset price (query) is the computer's system time; therefore, please don't expect perfectly regular time intervals as one may get from verified market-data providers. We are hacking in real-time! However, if you think about the data themselves, this time precision is not of great importance.
As long as we fetch the data every freq seconds, that is sufficient to build a risk-management system or even to measure the rolling volatility of an asset. Your trading model will benefit anyway. Note also that if our Internet connection fails, or there are disturbances of a different kind, we will miss the data of a sent query, as visible in the example above. Looks exciting? Give me a high five and say Hell Yeah!

### Code Modification: Portfolio of Assets

The presented Python code can be very easily modified if you wish to fetch data for a number of assets concurrently every freq seconds. Simply extend and amend the code from the ticker definition downwards, for example in the following form:

```python
tickers=["NASDAQ:AAPL","NASDAQ:GOOG","NASDAQ:BIDU","NYSE:IBM", \
         "NASDAQ:INTC","NASDAQ:MSFT","NYSEARCA:SPY"]

# define the name of the output file
fname="portfolio.dat"
# remove the file, if it exists
os.path.exists(fname) and os.remove(fname)

freq=600  # fetch data every 600 sec (10 min)

with open(fname,'a') as f:
    writer=csv.writer(f,dialect="excel")  #,delimiter=" ")
    while(t.tm_hour<=16):
        if(t.tm_hour==16):
            while(t.tm_min<1):
                for ticker in tickers:
                    data=combine(ticker)
                    print(data)
                    writer.writerow(data)
                time.sleep(freq)
                t=time.localtime()  # refresh the time stamp
            else:
                break
        else:
            for ticker in tickers:
                data=combine(ticker)
                print(data)
                writer.writerow(data)
            time.sleep(freq)
            t=time.localtime()  # refresh the time stamp
f.close()
```

That's it!
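Once a session has been recorded, the file can be read back for analysis with the same csv module. A Python 3 sketch (the `prices_by_ticker` helper is my own illustration, not part of the post's code; the row layout follows what combine saves):

```python
import csv
from collections import defaultdict
from io import StringIO

def prices_by_ticker(csvfile):
    """Group recorded quotes by ticker. Rows follow the layout saved
    above: [year, mon, day, hour, min, sec, ticker, quote]. Rows whose
    quote is not numeric (e.g. 'Nothing found for: ...') are skipped."""
    series = defaultdict(list)
    for row in csv.reader(csvfile):
        if len(row) == 8:
            try:
                series[row[6]].append(float(row[7]))
            except ValueError:
                pass  # a failed query; ignore it
    return series

# Two hypothetical recorded rows, in place of an open('portfolio.dat') handle:
sample = StringIO(
    "2014,1,8,15,1,44,NASDAQ:AAPL,543.55\r\n"
    "2014,1,8,15,11,48,NASDAQ:AAPL,543.76\r\n"
)
print(prices_by_ticker(sample)["NASDAQ:AAPL"])   # [543.55, 543.76]
```

From such per-ticker lists, a rolling volatility or a simple price-swing alarm is only a few lines away.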
For the sake of real-time verification, here is how it works:

```
Thu Jan  9 07:01:43 2014
Wed Jan  8 15:01:43 2014

[2014, 1, 8, 15, 1, 44, 'NASDAQ:AAPL', '543.55']
[2014, 1, 8, 15, 1, 44, 'NASDAQ:GOOG', '1140.30']
[2014, 1, 8, 15, 1, 45, 'NASDAQ:BIDU', '182.65']
[2014, 1, 8, 15, 1, 45, 'NYSE:IBM', '187.97']
[2014, 1, 8, 15, 1, 46, 'NASDAQ:INTC', '25.40']
[2014, 1, 8, 15, 1, 47, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 1, 47, 'NYSEARCA:SPY', '183.43']
[2014, 1, 8, 15, 11, 48, 'NASDAQ:AAPL', '543.76']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:GOOG', '1140.06']
[2014, 1, 8, 15, 11, 49, 'NASDAQ:BIDU', '182.63']
[2014, 1, 8, 15, 11, 50, 'NYSE:IBM', '187.95']
[2014, 1, 8, 15, 11, 51, 'NASDAQ:INTC', '25.34']
[2014, 1, 8, 15, 11, 52, 'NASDAQ:MSFT', '35.67']
[2014, 1, 8, 15, 11, 53, 'NYSEARCA:SPY', '183.34']
...
```

where we can see that we were able to grab the prices of 6 stocks and 1 ETF (an Exchange-Traded Fund tracking the S&P 500 index) every 10 min.

### Reflection

You may wonder whether hacking is legal or not. The best answer I find in the words of Gordon Gekko: Someone reminded me I once said "Greed is good".

## Retrieve the Data of Fund Performance utilizing Google and Python

Do you remember my post on Get the Data of Fund Performance directly into Excel utilizing VBA and Google? If not, have a look, as this time we will do the same but in Python. In short: given a list of APIR codes (referring to different investment-option performances), we want to fetch the publicly available data from the InvestSMART.com.au website.

Previously we built the code in Excel/VBA. It works, but it is slow, as the process makes use of Excel's plug-ins for establishing connections with external websites and downloading the data into Excel workbooks. We can improve it significantly using the Python language. The second reason for doing it in Python is purely educational.
We will see below how we can read the data from a CSV file, send a query from Python directly to Google Search, retrieve a list of responses, get the URL of the page we are interested in, download its HTML code, parse it to get the content of the webpage without the HTML ornaments, scan the content for the required information (in our case: the most recent fund-performance figures available online), and finally save it all to a file. This is our roadmap for today. Are you ready for the journey?

### In Search of Sunrise

To get the job done, we will need to utilize some existing Python modules. The most important one is the google module. We will also need the mechanize module, a stateful programmatic web browser for Python, as well as BeautifulSoup for HTML parsing. Okay then, let's roll it:

```python
# Retrieve the Data of Fund Performance utilizing Google and Python
#
# (c) 2014 QuantAtRisk.com, by Pawel Lachowicz

import mechanize, csv, os, unicodedata
from google import search as gs
from bs4 import BeautifulSoup

# remove the file, if it exists
os.path.exists("investsmart.csv") and os.remove("investsmart.csv")
```

where the last two lines refer to the comma-separated-values file in which we will save all the fetched data. It is good practice to remove it if it already exists in the local directory: as you will see in a moment, whatever we retrieve from the web and filter for content is appended to the investsmart.csv output file in a loop.

We feed our algo with a list of APIR codes stored in a text file, codes.csv, of the exemplary structure:

```
AMP1232AU
AMP1524AU
AMP0167AU
AMP1196AU
...
```
In order to read each line from this file, we commence with:

```python
with open('codes.csv', 'rb') as f:
    reader = csv.reader(f)  # reads the lines as lists
    for irow in reader:
        apircode=irow[0]
        print(apircode)
        # define the query for Google Search
        phrase="investsmart com au "+apircode
```

where the for loop iterates through the file line by line, reading in the APIR code as the first element of the irow list. Having the code in the apircode variable, we define a query for the search engine and open an inner operational loop, here referring to the output file:

```python
        with open('investsmart.csv', 'a') as csvfile:
            writer=csv.writer(csvfile, dialect="excel")
            i=0
            # search in Google for the InvestSMART APIR code
            for url in gs(phrase, stop=1):
                i+=1
                if(i==1):
                    link=url  # store the first URL from Google results
            print("%s\n" % link)
```

The whole action starts where we send the query to Google (the gs command). As you look at this creature, it is, in fact, a masterpiece: beauty and Python's elegance in one. Not only do we hit the search engine with our phrase, but we also filter the incoming Google Search results and extract the URL corresponding to the first hit. Next, we browse through the first link,

```python
            br=mechanize.Browser()  # open a virtual browser
            resp=br.open(link)
            content=resp.get_data()
            #print content
```

and download the HTML code of the website. To get a better feeling of what the content variable holds for the first APIR code from our codes.csv input file, just uncomment and execute the last (print) line.
For the AMP1232AU investment option, content displays the website's HTML code, starting from:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7" lang="en"> <![endif]-->
<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8" lang="en"> <![endif]-->
<!--[if IE 8]>    <html class="no-js lt-ie9" lang="en"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en"> <!--<![endif]-->
<html>
<head>
<title>AMP SigSup - AMP Super Cash - Managed fund Profile - Managed funds -
 InvestSMART - managed funds, shares and investment news</title>
<meta http-equiv="X-UA-Compatible" content="IE=9" />
<meta name="verify-v1" content="xgkff+3TBcugNz7JE2NiJoqkiVs1PHybWgFkaBuhblI=" />
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="title" content="Discount access and research on more than 1,000 top
 performing managed funds" />
<meta name="description" content="Rebate of entry fees on the majority of Australian
 managed funds. Research on managed funds and shares." />
...
```

and containing (somewhere in the middle) the information we would like to extract:
```html
<!-- ****************************** -->
<!-- ***   Performance Table   *** -->
<!-- ****************************** -->
<br /><br />
<table class="Container" width="100%" cellpadding="0" cellspacing="0" border="0">
<tr>
 <td class="ContainerHeader" align="left"><b>Fund Performance</b> (as at 30th Nov 2013)&nbsp;
  <a href="javascript:PopupWindow('/education/GlossaryPopup.asp?id=133', '_blank', 700, 450)"
     class="glossaryLnk" title="More information...">&nbsp;</a></td>
 <td class="ContainerHeader" align="right">NOTE : returns for periods greater than
  1 year are annualised</td>
</tr>
<tr>
 <td colspan="2" width="100%" align="left" valign="top" class="ContainerBody">
 <table width="100%" cellpadding="0" cellspacing="0" border="0" class="DataTable">
 <tr class="DataTableHeader">
  <td align="left" valign="top" nowrap>&nbsp;</td>
  <td align="right" valign="top" nowrap>1 Month<br />%</td>
  <td align="right" valign="top" nowrap>3 Month<br />%</td>
  <td align="right" valign="top" nowrap>6 Month<br />%</td>
  <td align="right" valign="top" nowrap>1 Year<br />% p.a.</td>
  <td align="right" valign="top" nowrap>2 Year<br />% p.a.</td>
  <td align="right" valign="top" nowrap>3 Year<br />% p.a.</td>
  <td align="right" valign="top" nowrap>5 Year<br />% p.a.</td>
  <!--<td align="right" valign="top" nowrap>7 Year<br />% p.a.</td>-->
  <td align="right" valign="top" nowrap>10 Year<br />% p.a.</td>
 </tr>
 <tr class="DataTableRow" onMouseOver="this.className = 'DataTableRowHighlight';"
     onMouseOut ="this.className = 'DataTableRow';">
  <td align="left" valign="top"><b>Total Return</b></td>
  <td align="right" valign="top"><b>0.12</b></td>
  <td align="right" valign="top"><b>0.37</b></td>
  <td align="right" valign="top"><b>0.8</b></td>
  <td align="right" valign="top"><b>1.77</b></td>
  <td align="right" valign="top"><b>2.24</b></td>
  <td align="right" valign="top"><b>2.62</b></td>
  <td align="right" valign="top"><b>-</b></td>
  <!--<td align="right" valign="top"><b>-</b></td>-->
  <td align="right" valign="top"><b>-</b></td>
 </tr>
...
```
As we can see, the HTML code is pretty nasty, and digging the numbers out of the "Total Return" subsection seems troublesome. Well, not for Python. Just have a look at how easily you can strip the HTML of all those Christmas-tree ornaments:

```python
# transform the code into a beautiful soup ;-)
soup = BeautifulSoup(content)

# perform post-processing of the stripped data
outputline=[]
row=0
roww=0
for string in soup.stripped_strings:
    if(string=="Fund Performance"):
        print(string)
        j=1
        row=1
    if(row!=0):
        row+=1
    if(row==3):
        updated=string
        print(string)
    if(string=="Total Return") and (j==1):
        print(string)
        roww=1
        j=0
    if(roww!=0):
        roww+=1
    if(roww>=3) and (roww<11):
        if(roww==3):
            print(string)
            s=string.lstrip('(').rstrip(')').lstrip('as at ')
        else:
            s=string
        outputline.append(s)
        print(string)
```

Here we make use of BeautifulSoup, a Python library designed for quick turnaround projects like screen-scraping. We scan the transformed data string by string, where the .stripped_strings generator makes all the HTML bumpers and stickers vanish, leaving us with pure text! From that point, the remaining lines of the code extract the specific information, print it on the screen, and append it to the outputline list.
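BeautifulSoup is a third-party library; just to illustrate the idea behind stripped_strings, here is a rough stand-in built on the standard library's html.parser (a sketch for intuition only, not a replacement for bs4):

```python
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Collect non-empty text fragments, dropping all tags --
    a crude stand-in for BeautifulSoup's stripped_strings."""
    def __init__(self):
        super().__init__()
        self.strings = []
    def handle_data(self, data):
        s = data.strip()
        if s:
            self.strings.append(s)

p = TextOnly()
p.feed('<td align="left"><b>Total Return</b></td>'
       '<td align="right"><b>0.12</b></td>')
print(p.strings)   # ['Total Return', '0.12']
```

With the markup gone, locating "Fund Performance" and "Total Return" and reading off the numbers that follow is exactly the positional bookkeeping the loop above performs.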
For the AMP1232AU investment option from our query, Python displays:

```
AMP1232AU
http://www.investsmart.com.au/managed-funds/profile.asp?function=print&FundID=16654

Fund Performance
(as at 30th Nov 2013)
Total Return
0.12
0.12
0.37
0.8
1.77
2.24
2.62
-
-
```

If Google returns a first link that is something completely different than the InvestSMART webpage for the requested APIR code, the outputline list should be of zero length. With a few extra lines of post-processing we check the validity of the desired data and save the results:

```python
        # check the validity of the fetched data for the APIR code
        if(len(outputline)>0):
            rowout=[]
            outputline.insert(0,apircode)  # add the APIR code to the list
            outputline.append(updated)     # add the 'as at' date stamp
            for item in outputline:
                rowout.append(item.encode('ascii'))  # convert <type:unicode>
                                                     # to <type:string>
            print(rowout)  # show the final version of the list on the screen
            if(len(rowout)>0):
                writer.writerow(rowout)  # if non-zero, save/append the list
                                         # with data in the "investsmart.csv"
                                         # output file; otherwise ignore
                                         # this query's results
        csvfile.close()
f.close()
```

The output file then stores the subject of our small endeavour, i.e.:

```
pawels-mbp:Py pawel$ cat investsmart.csv
AMP1232AU,0.12,0.37,0.8,1.77,2.24,2.62,-,-,(as at 30th Nov 2013)
AMP1524AU,-2.47,0.14,-1.58,8.63,14.09,9.42,-,-,(as at 30th Nov 2013)
AMP0167AU,0.39,1.96,3.18,8.45,7.72,5.82,5.09,4.89,(as at 30th Nov 2013)
...
```

which ends our journey In Search of Sunrise. The total time for scanning 1,000 APIR codes is cut down by 60% by using Python instead of VBA. That brings to my mind an Air Austral commercial I saw a year ago: Arrive fast, then slow down.

## Pre-Processing of Asset Price Series for Portfolio Optimization

Portfolio optimization is a significant component of Matlab's Financial Toolbox. It provides a ready-to-use solution for finding the optimal weights of the assets we consider for trading, deriving them from the historical asset performance. From a practical point of view, we can include it in our algorithmic trading strategy and backtest its applicability under different initial conditions. That is the subject of my next, upcoming post. However, before we can enjoy the view from the peak, we need to climb the mountain first.

In Matlab, the portfolio is created as a dedicated object of the same name. It doesn't read raw stock data; we need to feed that beast. Two major ingredients satisfy the input: a vector of expected asset returns and a covariance matrix. Matlab helps us estimate these moments, but first we need to deliver the asset data in a digestible form.
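For intuition, given a matrix of asset returns both ingredients are one-liners. A NumPy sketch with hypothetical returns (the Matlab equivalents are the mean and cov functions):

```python
import numpy as np

# Hypothetical (N-1) x M matrix of returns for M=2 assets:
R = np.array([[ 0.01, 0.02],
              [-0.01, 0.00],
              [ 0.02, 0.01]])

mu = R.mean(axis=0)              # vector of expected asset returns
Sigma = np.cov(R, rowvar=False)  # M x M covariance matrix

print(mu)            # per-asset mean return
print(Sigma.shape)   # (2, 2)
```

These two objects, the mean vector and the covariance matrix, are precisely what the portfolio object expects to be fed with.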

In this post we will see how one can quickly download stock data from the Internet based on our own stock selection and pre-process it for solving the portfolio optimization problem in Matlab.

### Initial Setup for Portfolio Object

Let's say that at any point of time you have your own list of stocks you wish to buy. For simplicity, let's also assume that the list contains only stocks traded on the NYSE or NASDAQ. Since you have been a great fan of this game, you are now almost ready to buy what you jotted down in your ShoppingList.lst. Here is an example of 10 tech stocks:

```
AAPL
AOL
BIDU
GOOG
HPQ
IBM
INTC
MSFT
NVDA
TXN
```

They will constitute your portfolio of stocks. The problem of portfolio optimization requires a look back in time at the returns obtained in trading by each stock. Based on them, the Return Proxy and Risk Proxy can be found.

Matlab requires as an input the return matrix $R$ of dimensions $(N-1)\times M$, where $N$ stands for the number of historical prices (e.g. sampled daily, monthly, etc.) and $M$ for the number of stocks in our portfolio. We will see how it works in the next post. For now, let's focus solely on the creation of this matrix.
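Forming $R$ from an $N\times M$ price matrix is a single vectorized operation. A NumPy sketch with hypothetical prices (the Matlab analogue would be `R = P(2:end,:)./P(1:end-1,:) - 1`; log-returns are an equally common choice):

```python
import numpy as np

# Hypothetical close prices: N=4 days, M=2 stocks
P = np.array([[100.0, 50.0],
              [101.0, 49.0],
              [ 99.0, 50.5],
              [102.0, 51.0]])

# Simple (arithmetic) returns, shape (N-1) x M:
R = P[1:] / P[:-1] - 1.0
print(R.shape)   # (3, 2)
```

Each row of $R$ holds the one-period returns of all $M$ stocks, which is exactly the layout the portfolio object consumes.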

In the article Create a Portfolio of Stocks based on Google Finance Data fed by Quandl I discussed Quandl.com as an attractive data provider for US stocks. Here, we will follow this solution making use of Quandl resources to pull out the stock price series for our shopping list. Ultimately, we aim at building a function, here: QuandlForPortfolio, that does the job for us:

```matlab
% Pre-Processing of Asset Price Series for Portfolio Optimization in Matlab
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz

clear all; close all; clc;

% Input Parameters
n=1*365;
tickers='ShoppingList.lst';
qcodes='QuandlStockCodeListUS.xlsx';

[X,Y,R,AssetList] = QuandlForPortfolio(n,tickers,qcodes);
```

We call this function with three input parameters. The first one, $n$, denotes the number of calendar days from today (counting backwards) for which we wish to retrieve the stock data. Usually, 365 calendar days correspond to about 250$-$252 trading days. The second parameter is the path/file name of our stock list (the stocks to be taken into account in the portfolio-optimization process), while the last input defines the path/file name of the file storing the stocks' tickers and the associated Quandl Price Codes (see the Quandl article mentioned above for more details).

### Feeding the Beast

The QuandlForPortfolio Matlab function is an extended version of the previously discussed solution. It contains an important correcting procedure for the data fetched from the Quandl servers. First, let’s have a closer look at the function itself:

1  % Function assists in fetching Google Finance data from the Quandl.com
2  % server for a given list of tickers of stocks traded on NYSE or
3  % NASDAQ. Data are retrieved for last 'n' days with daily sampling.
4  %
5  % INPUT
6  %   n       : number of calendar days from 'today' (e.g. 365 would
7  %             correspond to about 252 business days)
8  %   tickers : a path/file name of a text file listing tickers
9  %   qcodes  : a path/file name of Excel workbook (.xlsx) containing a list
10 %             of tickers and Quandl Price Codes in the format of
11 %             [Ticker,Stock Name,Price Code,Ratios Code,In Market?]
12 % OUTPUT
13 %   X0 : [Nx1] column vector with days
14 %   Y0 : [NxM] matrix with Close Prices for M stocks
15 %   R0 : [(N-1)xM] matrix of Returns
16 %   AssetList : a list of tickers (cell array)
17 %
18 % (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
19
20 function [X0,Y0,R0,AssetList0] = QuandlForPortfolio(n,tickers,qcodes)
21 fileID = fopen(tickers);
22 tmp = textscan(fileID, '%s');
23 fclose(fileID);
24 AssetList=tmp{1};  % a list as a cell array
25
26 % Read in the list of tickers and internal Quandl codes
27 %
28 [~,text,~] = xlsread(qcodes);
29 quandlc=text(:,1);     % again, as a list in a cell array
30 quandlcode=text(:,3);  % corresponding Quandl's Price Code
31
32 date1=datestr(today-n,'yyyy-mm-dd');  % from
33 date2=datestr(today,'yyyy-mm-dd');    % to
34
35 % Fetch the data from Quandl.com
36 %
37 QData={};
38 for i=1:length(AssetList)
39     for j=1:length(quandlc)
40         if(strcmp(AssetList{i},quandlc{j}))
41             fprintf('%4.0f %s\n',i,quandlc{j});
42             fts=0;
43             [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
44                 'authcode','x',...
45                 'start_date',date1,'end_date',date2,'collapse','daily');
46             QData{i}=fts;
47         end
48     end
49 end
50
51 % Post-Processing of Fetched Data
52 %
53 % create a list of days across all tickers
54 TMP=[];
55 for i=1:length(QData)
56     tmp=fts2mat(QData{i},1);
57     tmp=tmp(:,1);
58     TMP=[TMP; tmp];
59 end
60 ut=unique(TMP);
61 % use that list to find these days that are not present
62 % among all data sets
63 TMP=[];
64 for i=1:length(QData)
65     tmp=fts2mat(QData{i},1);
66     tmp=tmp(:,1);
67     TMP=[TMP; setdiff(ut,tmp)];
68 end
69 ut=unique(TMP);
70
71 % finally, extract Close Prices from FTS object and store them
72 % in Y0 matrix, plus corresponding days in X0
73 X0=[]; Y0=[];
74 for i=1:length(QData)
75     tmp=fts2mat(QData{i},1);
76     cp=[];
77     for j=1:size(tmp,1)
78         [r,~,~]=find(ut==tmp(j,1));
79         if(isempty(r))
80             cp=[cp; tmp(j,5)];  % column 5 corresponds to Close Price
81             if(i<2)
82                 % create a time column vector listing days
83                 % common among all data sets
84                 X0=[X0; tmp(j,1)];
85             end
86         end
87     end
88     Y0=[Y0 cp];
89 end
90 % transform Close Prices into Returns, R(i)=cp(i)/cp(i-1)-1
91 R0=tick2ret(Y0);
92 AssetList0=AssetList';
93 end

The main bottleneck comes from the fact that Matlab’s portfolio object demands an equal number of historical returns ($N-1$) in the matrix $R$ for all $M$ assets. We design the function so that it establishes a common timeframe for all stocks on our shopping list. Of course, we ensure that all stocks were traded in the markets for about the last $n$ days (a rough estimate).

Now, the timeframe of the last $n$ days should be understood as a first approximation. We fetch the data from Quandl (numeric date, Open, High, Low, Close, Volume) and save them in the cell array QData (lines #37-49) for each stock separately as FTS objects (Financial Time-Series objects; see Financial Toolbox). However, it may happen that not every stock fetched provides the same amount of data. That is why we need to investigate for which days and which stocks the data are missing. We achieve that by scanning each FTS object and creating a unique list of all days for which we have data (lines #54-60).

Next, we loop again over the same data sets but now we compare that list with a list of all dates for each stock individually (lines #63-69), capturing (line #67) those dates that are missing. Their complete list is stored as a vector in line #69. Eventually, given that, we are able to compile the full data set (e.g. Close Prices; here line #80) for all stocks in our portfolio ensuring that we will include only those dates for which we have prices across all $M$ assets (lines #70-91).
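The date-alignment logic above can be sketched compactly in Python on hypothetical data (the tickers and day numbers below are invented for illustration):

```python
# Keep only those days present across ALL fetched series: build the union
# of observed days, collect days missing from at least one stock, and
# subtract them, mirroring the two unique/setdiff passes in the Matlab code.
dates_per_stock = {
    "AAPL": {1, 2, 3, 4, 5},
    "IBM":  {1, 2, 4, 5},      # day 3 missing for this stock
    "MSFT": {1, 2, 3, 4, 5},
}

# union of all days observed anywhere
all_days = set().union(*dates_per_stock.values())

# days missing from at least one stock
missing = set()
for days in dates_per_stock.values():
    missing |= all_days - days

# common timeframe: include only days with prices across all assets
common = sorted(all_days - missing)
print(common)  # [1, 2, 4, 5]
```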

Beast Unleashed

We test our data pre-processing simply by running the block of code listed above, which calls the QuandlForPortfolio function, and we check the results in Matlab’s command window as follows:

>> whos X Y R AssetList
  Name           Size      Bytes  Class     Attributes

  AssetList      1x10       1192  cell
  R            250x10      20000  double
  X            251x1        2008  double
  Y            251x10      20080  double

which confirms that the dimensions are as expected.

At this stage, the aforementioned function can be used two-fold. First, we are interested in the portfolio optimisation and we look back at the last $n$ calendar days from the most current one (today). The second usage is handy too. We consider the stocks on our shopping list and fetch their data for the last, say, $n=7\times365$ days. If all stocks were traded over the past 7 years, we should be able to collect a rich data set. If not, the function will adjust the beginning and end dates to meet the initial time constraints as required for the construction of the $R$ matrix. In the latter case, we can use the 7-year data sample for direct backtesting of algo models utilizing Portfolio Optimization.

Stay tuned as we will rock this land in the next post!

Any Questions?

Share them across QuantCove.com – the official Forum of QuantAtRisk.

## Create a Portfolio of Stocks based on Google Finance Data fed by Quandl

There is an old saying: Be careful what you wish for. Many quants and traders wished for a long time for better quality of publicly available data. It wasn’t an easy task to accomplish but, with a minimum of effort, it has become possible to get all you truly dreamt of. When I started my experience with daily backtesting of my algo strategies, I initially relied on Yahoo! data. Yeah, easily accessible but of low quality. Since the inception of Google Finance the same dream had been reborn. More precise data, covering more aspects of trading, platforms, markets, instruments. But how to get them all?

Quandl.com provides us with enormous access to an enormous database. Just visit the webpage to find out what I mean by barely putting my thoughts all together in one sentence. In fact, the portal’s accessibility via different programming languages (e.g. Matlab, C++, Python, etc.) makes that database more alive than ever. The space of parameters the quants get access to becomes astronomical in number!

Alleluia! Let’s ride this horse. Enough of watching the online charts changing the direction of our investing moods with a one-minute time resolution. We were born to be smarter than the stochastic processes taking control over our daily strategies. In this short post, making use of the Matlab R2013a platform, I want to show you how quickly you can construct any portfolio of US stocks taken from the Dow Jones Index universe (as an example). We will use the Google Finance stock data fed by Quandl.com.

Google Finance Stock Data for Model Feeding

Similar to Yahoo! Finance’s resources, Quandl.com has its own system of coding the names of all US stocks. Therefore, we need to know it before taking any further step. The .csv file containing all of the codes for Quandl’s stock data can be obtained via the API for Stock Data webpage. The file contains a header of the form:

Ticker Stock Name Price Code Ratios Code In Market?

linking Google Finance ticker with Quandl’s code (Price Code). Doing some extra job with data filtering, I created the pre-processed set of all tickers excluding those non-active in the markets and removing all invalid data. You can download it here QuandlStockCodeListUS.xlsx as an Excel workbook.

The second list we need is a list of the potential stocks we want to deal with in our portfolio. As for today, you can download the current list of the Dow Jones Index’s components here: dowjones.lst. Let’s start with it:

1  % Create a Portfolio from Dow Jones Stocks based on Google Finance
2  %  Data fed by Quandl.com
3  %
4  % (c) 2013 QuantAtRisk.com, by Pawel Lachowicz
5
6
7  close all; clear all; clc;
8
9  % Read the list of Dow Jones components
10 fileID = fopen('dowjones.lst');
11 tmp = textscan(fileID, '%s');
12 fclose(fileID);
13 dowjc=tmp{1};  % a list as a cell array

The guys at Mathworks suggest adopting the good practice of using the textscan command (lines 10-12) instead of dataread for reading data from text files. The latter will be removed in a future Matlab release.

Next, we need to import the current list of tickers for US stocks and their corresponding codes in Quandl’s database. We will make use of my pre-processed data set QuandlStockCodeListUS.xlsx, as mentioned earlier, in the following way:

15 % Read in the list of tickers and internal code from Quandl.com
16 [ndata, text, alldata] = xlsread('QuandlStockCodeListUS.xlsx');
17 quandlc=text(:,1);  % again, as a list in a cell array

For the simplicity of our example, we will consider all 30 stocks as the complete sample set of stocks in our portfolio. Portfolio construction is not solely a selection of stocks; it usually involves a time period of the past performance of each individual stock. In the following move, we will collect daily stock returns over the last 365 calendar days (corresponding to about 252 trading days).

This framework is pretty handy. Why? If you build or run a model live, most probably your wish is to update your database when US markets are closed. You query Quandl.com for data. Now, if your model employs risk management or portfolio optimisation as an integrated building block, it’s very likely you are interested in risk(-return) estimation, e.g. by calculating Value-at-Risk (VaR) or similar risk measures. Of course, the choice of the time-frame is up to you, based on your strategy.

Having that said,

19 % fetch stock data for last 365 days
20 date2=datestr(today,'yyyy-mm-dd')      % to
21 date1=datestr(today-365,'yyyy-mm-dd')  % from

This employs Matlab’s improved datestr command with the format as a parameter. It’s really useful when it comes to fetching data from Quandl.com, as the date format they like to see in your code is yyyy-mm-dd. The above two commands return in Matlab the following values:

date2 =
2013-11-03

date1 =
2012-11-03
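For readers working outside Matlab, the same yyyy-mm-dd strings for a 365-day lookback can be produced in Python (an illustration with a hypothetical, fixed "today"):

```python
from datetime import date, timedelta

# Build the 'yyyy-mm-dd' strings Quandl expects for a 365-day lookback
today = date(2013, 11, 3)  # hypothetical 'today', fixed for reproducibility
date2 = today.strftime("%Y-%m-%d")                           # to
date1 = (today - timedelta(days=365)).strftime("%Y-%m-%d")   # from

print(date2)  # 2013-11-03
print(date1)  # 2012-11-03
```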

Excellent! Let’s catch some fishes in the pond! Before doing so, we need to make sure that Quandl’s software is installed and linked to our Matlab console. In order to download the Quandl Matlab package, just visit this website and follow the installation procedure at the top of the page. Add the proper path in the Matlab environment too. The best practice is to register yourself on Quandl. That allows you to raise the daily number of data calls from 50 to 500. If you think you will need more data requests per day, you can send your request from the same webpage.

You will need to replace ‘yourauthenticationtoken’ in the Matlab code (see line 35) with your token in order to get the data.

Fetching Stock Data

From this point on, the road is straightforward. For all 30 stocks belonging to the current composition of the Dow Jones Index, we get their Open, High, Low, Close prices (as retrieved from Google Finance) in the following way:

23 % create an empty array for storing stock data
24 stockd={};
25 % scan the Dow Jones tickers and fetch the data from Quandl.com
26 for i=1:length(dowjc)
27     for j=1:length(quandlc)
28         if(strcmp(dowjc{i},quandlc{j}))
29             fprintf('%4.0f %s\n',i,quandlc{j});
30             % fetch the data of the stock from Quandl
31             % using recommended Quandl's command and
32             % saving them directly into Matlab's FTS object (fts)
33             fts=0;
34             [fts,headers]=Quandl.get(quandlcode{j},'type','fints', ...
35                 'authcode','yourauthenticationtoken',...
36                 'start_date',date1,'end_date',date2);
37             stockd{i}=fts;  % entire FTS object in an array's cell
38         end
39     end
40 end
41 % use 'save' command to save all actual variables from memory on the disc,
42 % if you need it for multiple code running
43 % save data

The code in lines 35 to 37 is an example of how you call Quandl for data and fetch them into a FINTS object (financial time-series object from the Financial Toolbox in Matlab). So easy. If you wish to change the time frequency, apply a transformation, or get the data in .csv format rather than FINTS, all options are available as parameters of the Quandl.get() command (explore all possibilities here).

Having that, you can easily plot the Google Finance data, for instance, of IBM stock straight away:

% display High-Low Chart for IBM
highlow(stockd{11})
title('IBM');
ylabel('Stock Price ($)');

to get:

or prepare a return-risk plot for your further portfolio optimisation:

ret=[];
for i=1:length(stockd)
    % extract the Close Price
    tmp=fts2mat(stockd{i}.Close,0);
    % calculate a vector of daily returns
    r=tmp(2:end)./tmp(1:end-1)-1;
    ret=[ret r];
end
% calculate the annualized mean returns for 30 stocks
% (the mean scales with the number of trading days, 250)
mret=mean(ret)*250;
% and corresponding annualized standard deviations
% (volatility scales with the square root of time)
msig=std(ret)*sqrt(250);
% plot return-risk diagram
p=100*[msig; mret]'
plot(p(:,1),p(:,2),'b.');
xlim([0 30]);
xlabel('Annualized Risk of Stock (%)')
ylabel('Annualized Mean Return (%)');

providing us with:

Forum: Further Discussions

You can discuss the current post with other quants within the new Quant At Risk Forum (Data Handling and Pre-Processing section).

## Rebinning of Financial Time-Series

Working with financial time-series, especially in trading and the subsequent analysis of data trends, we often wish to rebin (resample) our data into a new time-series that provides some sort of information on the average value of the underlying records, which are characterized by high volatility in time. The following algorithm in Matlab does the job pretty well:

% Function rebins any time-series [x,y] where x is a time vector
% and y stores time-series values from the current (regular or
% irregular sampling). A new time-series sampling is dt.
%
% Input: x  : time vector [1xN]
%        y  : data vector [1xN]
%        dy : data error vector [1xN]
%        dt : new rebinning time (dt > current sampling)
%
% (c) 2013, QuantAtRisk.com, by Pawel Lachowicz
%
% Example: x=1:1:1001;          % assume 1001 trading days
%          y=rand(1,1001);      % simulate pseudo data
%          dy=0.1*rand(1,1001); % simulate pseudo data errors (if required)
%          dt=50;  % rebin data from 1 day sampling down to 50 day intervals
%          [rx,ry]=rebin(x,y,dy,dt);  % or [rx,ry]=rebin(x,y,[],dt);
%          plot(x,y,'color',[0.7 0.7 0.7]);  % plot original time-series
%          hold on; plot(rx,ry,'ro-');       % plot down-rebinned time-series

function [rx,ry]=rebin(x,y,dy,dt)
    if(isempty(dy))
        dy=ones(1,length(y));
    end
    rdata=[];
    j=1; k=1;
    t2=x(1);
    while(j<=length(x))
        i=j;
        if((x(i)+dt)>x(end))
            break
        else
            t2=t2+dt;
        end
        i=j;
        sa=0; wa=0; ea=0; il=0;
        while(x(i)<t2)
            sa=sa+(y(i)/dy(i)^2);
            wa=wa+(1/dy(i)^2);
            i=i+1;
            il=il+1;
        end
        ry=sa/wa;
        rx=t2-dt;
        if(il>=1)
            rdata=[rdata; rx ry];
        end
        j=j+il;
    end
    rx=rdata(:,1);
    ry=rdata(:,2);
end

Example

As an example we will use the daily rates of return from trading AAPL (Apple Inc.) from Jan 1984 to Apr 2011 (download Matlab’s aaplr.mat M-file). Below, a short set of command lines executes the rebinning process of the return series:

clear all; close all; clc;

load aaplr.mat
plot(aaplR)

x=1:length(aaplR);  % 6855 days
y=aaplR;
plot(x,y,'color',[0.7 0.7 0.7]);

% rebin the data with dt=25 days step (one trading month)
dt=25;
dy=[];  % we have no information on data errors
[rx,ry]=rebin(x,y,dy,dt);

% overplot results and display first 500 days
hold on; plot(rx,ry,'ro-');
xlim([0 500]); ylim([-0.125 0.15]);

and plot both time-series, original and rebinned with a new bin time of 25 days, for the first 500 days of trading, where a red line denotes the new rebinned time-series with a binning time of 25 trading days. The function (the algorithm) computes a simple weighted mean based on the data points falling into each interval of $\langle t,t+dt \rangle$.
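For readers outside Matlab, the same weighted-mean rebinning can be sketched in Python (a minimal sketch on made-up data; equal weights reduce it to a simple mean, as noted below):

```python
# Weighted-mean rebinning: within each interval [t, t+dt) the output is
# sum(y/dy^2) / sum(1/dy^2), mirroring the Matlab rebin function above.
def rebin(x, y, dy, dt):
    if dy is None:
        dy = [1.0] * len(y)   # no errors given -> equal weights
    rx, ry = [], []
    j = 0
    t2 = x[0]
    while j < len(x):
        if x[j] + dt > x[-1]:
            break             # last (incomplete) bin is dropped
        t2 += dt
        sa = wa = 0.0
        il = 0
        i = j
        while i < len(x) and x[i] < t2:
            sa += y[i] / dy[i] ** 2   # weighted sum
            wa += 1.0 / dy[i] ** 2    # sum of weights
            i += 1
            il += 1
        if il >= 1:
            rx.append(t2 - dt)        # left edge of the bin
            ry.append(sa / wa)        # weighted mean in the bin
        j += il
    return rx, ry

x = list(range(1, 11))           # 10 'days'
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rx, ry = rebin(x, y, None, 5)    # 5-day bins, equal weights -> simple mean
print(ry)  # [3.0]
```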
If we do not specify the input data error vector $dy$, as in the example above, we obtain a simple mean as a consequence.

## Accessing a SQL Server database in Matlab and the use of FTS objects

Feeding our models with data for backtesting purposes requires pulling data from a server. If we do our modelling in Matlab under Linux and an MS SQL Server database is our target, the quickest way to establish good connectivity is to make use of a Java Database Connectivity (JDBC) driver. First, visit the following web page to download the current driver (.tar.gz) and then follow the installation steps provided by Mathworks.

In the following example we will try to connect to our database called ‘XDataBase’ and retrieve stock prices of Apple Inc. (NASDAQ ticker: AAPL) from Feb 2, 1998 onwards.

1  ticker=['AAPL'];
2  start=['02-Feb-1998'];
3
4  % --Connect to your local SQL Server and fetch the data for a ticker
5  %
6  % define connection
7  conn=database('XDataBase','matlab','m4tl4b', ...
8       'com.microsoft.sqlserver.jdbc.SQLServerDriver', ...
9       'jdbc:sqlserver://163.118.2.7:1533;database=XDataBase');
10 % specify return format
11 setdbprefs('DataReturnFormat','structure');
12 % define your SQL query, as for example:
13 QueryString = ['SELECT sec.tic, dp.datadate, dp.prchd As PriceH, dp.prcld As PriceL, dp.prccd As PriceC, dp.prccd/dp.ajexdi As PriceCA, dp.cshtrd*dp.ajexdi As Vol FROM Xpressfeed.dbo.security AS sec INNER JOIN Xpressfeed.dbo.sec_dprc AS dp ON sec.gvkey = dp.gvkey AND sec.iid = dp.iid AND dp.datadate >=''',start,''' WHERE (sec.tic IN (''',ticker,'''))'];
14
15 % execute connection, fetch data, and assign them to a Matlab variable
16 cur = exec(conn, QueryString);
17 myData = fetch(cur);
18 secData = myData.data;

In the above example, our structure variable secData is composed of the secData.PriceH, secData.PriceL, secData.PriceC, secData.PriceCA, and secData.Vol arrays, containing information about AAPL’s High, Low, Close, and Adjusted Close prices and traded Volume, respectively. Making use of Matlab’s Financial Toolbox, one can store all these data in a useful text-like file format. We do it by constructing a Financial Time-Series (FTS) object:

20 % verify the existence of data for a ticker
21 skip=false;
22 try
23     n=length(secData.PriceC);  % a random test
24 catch
25     skip=true;
26 end
27 if(~skip)
28     dates=secData.datadate;
29     dates=datenum(dates);
30     data=[zeros(n,1) secData.PriceH secData.PriceL ...
31           secData.PriceC secData.PriceCA secData.Vol];
32     colheads={'DATES';'OPEN';'HIGH';'LOW';'CLOSE';'CLOSEA';'VOLUME'};
33     desc=ticker;
34     filename=[ticker,'.dat'];
35     fts2ascii(filename,dates,data,colheads,desc);  % save into AAPL.dat
36
37     % construct FTS object by reading the data from the file
38     fts=ascii2fts(filename,1,2);
39     data=fts2mat(fts,1);
40     % and plot the Adjusted Close Price (column 6 = CLOSEA)
41     scrsz = get(0,'ScreenSize');
42     hh=figure('Position',[70 scrsz(4)/2 scrsz(3)/1.1 scrsz(4)/2],'Toolbar','none');
43     plot(data(:,1),data(:,6));
44     xlim([min(data(:,1)) max(data(:,1))]);
45     title(ticker);
46     xlabel('Time [days]'); ylabel('Adjusted Close Price [US$]');
47 end

As a result of executing code lines 28 to 35, we were able to retrieve the AAPL stock data and save them into the AAPL.dat file. The first ten lines of the file are as follows:

$ head -10 AAPL.dat
AAPL
DATES        OPEN      HIGH       LOW        CLOSE      CLOSEA    VOLUME
02-Feb-1998  0.000000  18.500000  17.375000  17.688000  4.422000  22740000
03-Feb-1998  0.000000  18.625000  17.688000  18.313000  4.578250  14380000
04-Feb-1998  0.000000  18.500000  18.000000  18.250000  4.562500  6078400
05-Feb-1998  0.000000  18.500000  18.000000  18.313000  4.578250  8508000
06-Feb-1998  0.000000  18.688000  18.250000  18.500000  4.625000  7228000
09-Feb-1998  0.000000  19.500000  18.375000  19.188000  4.797000  17668000
10-Feb-1998  0.000000  19.563000  19.063000  19.438000  4.859500  15072000
11-Feb-1998  0.000000  19.500000  18.875000  19.000000  4.750000  7560000

In line 38 we show how to construct the FTS object:

>> whos fts
  Name      Size      Bytes   Class   Attributes

  fts       3366x6    190016  fints

whereas line 39 provides us with a transformation of the object’s data into matrix format that we may use to plot AAPL’s Adjusted Close Prices.

It is of great importance to remember that the Financial Toolbox allows us to express a date (e.g. 11-Feb-1998) as a continuous number:

>> datenum('02-Feb-1998')
ans =
      729788

or reversely:

>> datestr(ans)
ans =
02-Feb-1998

## Extracting Time-Series from Tick-Data

The tick-data provided in the .csv (comma separated values) file format sometimes may be a real problem to handle quickly, especially when the total size starts to count in hundreds of GB. If your goal is to extract a time-series with, say, hourly time resolution only, this article will provide you with some fresh and constructive guidelines on how to do it smartly in the Linux environment.

First, let’s have a closer look at the data. Say, we have a collection of 2148 .csv files hosting the FX trading history of the AUDUSD pair, covering nearly 10 years between 2000 and 2010. Each file is 7.1 MB large, which leaves us with approximately 15 GB of data to process.
Having a look into a randomly selected file, we can identify the header and the data themselves:

$ head -10 audusd_216.csv
Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size
632258349000000015,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000016,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000
632258349000000017,2004-07-19 11:55:00.000,0.7329,1000000,0.7333,1000000
632258349000000018,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000019,2004-07-19 11:55:00.000,0.7327,1000000,0.7333,1000000
632258349000000020,2004-07-19 11:55:00.000,0.7328,1000000,0.7333,1000000
632258349000000021,2004-07-19 11:55:00.000,0.7328,1000000,0.7334,1000000
632258349600000000,2004-07-19 11:56:00.000,0.7328,1000000,0.7334,1000000
632258349600000001,2004-07-19 11:56:00.000,0.7328,1000000,0.7336,1000000

Our aim will be to extract the Bid and Ask Price time-series. We will make use of a few standard Linux tools, e.g. sed and awk, supplemented with extra f77 codes. This also demonstrates how useful shell programming can be, while giving us an opportunity to explore the enigmatic syntax of its programs. Generally, we will be writing a shell script, executable for any FX pair name, e.g. gbpnzd, eurjpy, and so on.

In the first step of the script we create a list of all files. This is tricky in Linux, as the standard command ‘ls -lotr’ returns the desired list but also all the details on file size, attributes, etc., which we simply do not want. Lines 9 and 10 solve the problem:

1  # Extracting Time-Series from Tick-Data .csv files
2  # (c) Quant at Risk, 2012
3  #
4  # Exemplary usage: ./script.src audusd
5
6  #!/bin/bash
7
8  echo "..making a sorted list of .csv files"
9  for i in $1_*.csv; do echo ${i##$1_} $i ${i##.csv}; done \
10   | sort -n | awk '{print $2}' > $1.lst

and the newly created file \$1.lst (note: \$1 corresponds in the shell script to the parameter we called the script with, e.g. audusd; therefore \$1.lst physically means audusd.lst) contains the list:

audusd_1.csv
audusd_2.csv
audusd_3.csv
...
audusd_2148.csv
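The rename-and-sort trick in lines 9 and 10 can be tried in isolation. A minimal demonstration on three hypothetical file names (note that a plain lexicographic sort would put audusd_10.csv before audusd_2.csv):

```shell
# Prefix each name with its bare number, sort numerically, keep the name only
printf '%s\n' audusd_10.csv audusd_2.csv audusd_1.csv |
while read -r i; do echo "${i#audusd_}" "$i"; done |
sort -n | awk '{print $2}'
# prints: audusd_1.csv, audusd_2.csv, audusd_10.csv (one per line)
```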

We create one data file from all 2148 pieces by creating and executing an in-line script:

12 echo "..creating one data file"
13 awk '{print "cat",$1," >> tmp.lst"}' $1.lst > tmp.cmd
14 chmod +x tmp.cmd
15 ./tmp.cmd
16 rm tmp.cmd
17 mv tmp.lst $1.tmp

Now, \$1.tmp is a 15 GB file and we may wish to remove some unnecessary comments and tokens:

19 echo "..removing comments"
20 sed 's/Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size//g' $1.tmp > $1.tmp2
21 rm $1.tmp
22
23 echo "..removing empty lines"
24 sed '/^$/d' $1.tmp2 > $1.tmp
25 rm $1.tmp2
26
27 echo "..removing token ,"
28 sed 's/,/ /g' $1.tmp > $1.tmp2
29 rm $1.tmp
30
31 echo "..removing token :"
32 sed 's/:/ /g' $1.tmp2 > $1.tmp
33 rm $1.tmp2
34
35 echo "..removing token -"
36 sed 's/-/ /g' $1.tmp > $1.tmp2
37 rm $1.tmp
38
39 echo "..removing column with ticks and ask/bid size"
40 awk '{print $2,$3,$4,$5,$6,$7,$8,$10}' $1.tmp2 > $1.tmp
41 rm $1.tmp2

In order to convert the time information into a continuous measure of time, we modify the f77 code for our task as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_getmjd.for
c Aim: removes ticks and converts trade time into MJD time [day]
c Input data format: YYYY MM DD HH MM SS.SSS BID BID_Vol ASK ASK_Vol

      implicit double precision (a-z)
      integer y,m,d,hh,mm,jd
      integer*8 bidv,askv
      character zb*50

      call getarg(1,zb)
      open(1,file=zb)
      do i=1.d0, 500.d6
        read(1,*,end=1) y,m,d,hh,mm,ss,bid,ask
        jd= d-32075+1461*(y+4800+(m-14)/12)/4+367*(m-2-(m-14)/12*12)
     _      /12-3*((y+4900+(m-14)/12)/100)/4
        mjd=(jd+(hh-12.d0)/24.d0+mm/1440.d0+ss/86400.d0)-2400000.5d0
        mjd=mjd-51544.d0   ! T_0 = 2000.01.01 00:00
        abr=ask/bid
        write(*,2) mjd,bid,ask,abr
      enddo
 1    close(1)
 2    format(F15.8,F8.4,F8.4,F12.6)
      end

and execute it in our script:

43 echo "..changing a date to MJD"
44 fx_getmjd $1.tmp > $1.dat
45 rm $1.tmp
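As a side note, each of the five separate sed passes above rewrites the full 15 GB file; they could be merged into a single invocation, reading and writing the data only once. A sketch on one sample line:

```shell
# One-pass alternative: chain all substitutions in a single sed call
echo '632258349000000015,2004-07-19 11:55:00.000,0.7329,1000000,0.7334,1000000' |
sed -e 's/Ticks,TimeStamp,Bid Price,Bid Size,Ask Price,Ask Size//g' \
    -e '/^$/d' -e 's/,/ /g' -e 's/:/ /g' -e 's/-/ /g'
# prints: 632258349000000015 2004 07 19 11 55 00.000 0.7329 1000000 0.7334 1000000
```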

In the aforementioned f77 code, we set the zero time point (MJD=0.00) at Jan 1, 2000 00:00. Since that day, our time is expressed as a single column measuring time progress in days, with fractional parts tracking hours and minutes.
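This convention is easy to cross-check in Python (using a hypothetical tick timestamp taken from the sample file above):

```python
from datetime import datetime

# Cross-check of the fx_getmjd convention: time measured in days since
# 2000-01-01 00:00, with the fractional part encoding hours and minutes
T0 = datetime(2000, 1, 1)
t = datetime(2004, 7, 19, 11, 55, 0)   # a tick from the sample file
days = (t - T0).total_seconds() / 86400.0
print(round(days, 8))  # 1661.49652778
```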

We may split the data into two separate time-series containing Bid and Ask Prices at the tick-data level:

47 echo "..splitting into bid/ask/abr files"
48 awk '{print $1,$2}' $1.dat > $1.bid
49 awk '{print $1,$3}' $1.dat > $1.ask

A quick inspection of both files reveals we deal with nearly $500\times 10^6$ lines! Before we reach our chief aim, i.e. rebinning the series with 1-hour time resolution, there is, unfortunately, a need to separate the input into 5 parts, each of at most $100\times 10^6$ lines. The latter may vary depending on the RAM available and, if sufficient, this step can even be skipped. We proceed:

51 echo "..splitting bid/ask/abr into separate files"
52 fx_splitdat $1 1
53 fx_splitdat $1 2
54 fx_splitdat $1 3
55 fx_splitdat $1 4
56 fx_splitdat $1 5

where the fx_splitdat.for code is given as follows:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_splitdat.for
c Exemplary usage: ./fx_splitdat audusd [1,2,3,4,5]

      implicit double precision (a-z)
      integer nc
      character*6 zb
      character*10 zbask,zbbid
      character*16 zb1ask,zb2ask,zb3ask,zb4ask,zb5ask
      character*16 zb1bid,zb2bid,zb3bid,zb4bid,zb5bid
      character*1 par2,st

c     zb - name of length equal 6 characters only
      call getarg(1,zb)
      call getarg(2,par2)   ! case
      write(st,'(a1)') par2
      read(st,'(i1)') nc

      zbask=zb(1:6)//'.ask'
      zbbid=zb(1:6)//'.bid'
      zb1ask=zb(1:6)//'.ask.part1'
      zb2ask=zb(1:6)//'.ask.part2'
      zb3ask=zb(1:6)//'.ask.part3'
      zb4ask=zb(1:6)//'.ask.part4'
      zb5ask=zb(1:6)//'.ask.part5'
      zb1bid=zb(1:6)//'.bid.part1'
      zb2bid=zb(1:6)//'.bid.part2'
      zb3bid=zb(1:6)//'.bid.part3'
      zb4bid=zb(1:6)//'.bid.part4'
      zb5bid=zb(1:6)//'.bid.part5'

      open(11,file=zbask)
      open(12,file=zbbid)

      if(nc.eq.1) then
        open(21,file=zb1ask)
        open(22,file=zb1bid)
        do i=1.d0, 100.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=1.0).and.(i<100000001.d0)) then
            write(21,2) mjd_ask,dat_ask
            write(22,2) mjd_bid,dat_bid
          endif
        enddo
      endif
      if(nc.eq.2) then
        open(31,file=zb2ask)
        open(32,file=zb2bid)
        do i=1.d0, 200.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=100000001.d0).and.(i<200000001.d0)) then
            write(31,2) mjd_ask,dat_ask
            write(32,2) mjd_bid,dat_bid
          endif
        enddo
      endif
      if(nc.eq.3) then
        open(41,file=zb3ask)
        open(42,file=zb3bid)
        do i=1.d0, 300.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=200000001.d0).and.(i<300000001.d0)) then
            write(41,2) mjd_ask,dat_ask
            write(42,2) mjd_bid,dat_bid
          endif
        enddo
      endif
      if(nc.eq.4) then
        open(51,file=zb4ask)
        open(52,file=zb4bid)
        do i=1.d0, 400.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=300000001.d0).and.(i<400000001.d0)) then
            write(51,2) mjd_ask,dat_ask
            write(52,2) mjd_bid,dat_bid
          endif
        enddo
      endif
      if(nc.eq.5) then
        open(61,file=zb5ask)
        open(62,file=zb5bid)
        do i=1.d0, 500.d6
          read(11,*,end=1) mjd_ask,dat_ask
          read(12,*,end=1) mjd_bid,dat_bid
          if((i>=400000001.d0).and.(i<500000001.d0)) then
            write(61,2) mjd_ask,dat_ask
            write(62,2) mjd_bid,dat_bid
          endif
        enddo
      endif
 1    close(1)
 2    format(F15.8,F8.4)
      stop
      end

compiled as usual:

f77 fx_splitdat.for -o fx_splitdat

Finally, we can extract the rebinned Bid and Ask Price time-series with a bin time of 1 hour, i.e. $dt=0.041666667$ d, making use of the following f77 code:

c Extracting Time-Series from Tick-Data .csv files
c (c) Quant at Risk, 2012
c
c Program name: fx_rebin.for
c Exemplary usage: ./fx_rebin audusd 2

      implicit double precision (a-z)
      parameter (dim=100.d6)
      double precision f(dim), mjd(dim), step
      character*50 par1, par2, st

      call getarg(1,par1)   ! file name
      call getarg(2,par2)   ! binning [d]
      write(st,'(a20)') par2
      read(st,'(f20)') step

c     reading data
      open(1,file=par1)
      do i=1,100.d6
        read(1,*,end=1)
     _    mjd(i),f(i)
      enddo
 1    close(1)
      n=i-1.d0

c     main loop
      j=1.d0
      k=1.d0
      t2=0.
      t2=dint(mjd(j))
      do while (j.lt.n)
        i=j
        if ((mjd(i)+step).gt.(mjd(n))) then
          print*
          stop
        else
          t2=t2+step
        endif
        i=j
        il=0.d0
        s=0.d0
        do while (mjd(i).lt.t2)
          s=s+f(i)
          i=i+1.d0
          il=il+1.d0   ! how many points in segment
        enddo
        av=s/il
        day=t2-step
        if (il.ge.1.d0) then
          write(*,3) day,av
        endif
        j=j+il
      enddo
 2    format(f30.7,2f30.6)
 3    format(f20.8,f8.4)
 10   stop
      end

executed in our script for all five parts of both tick-data time-series:

58 echo "..rebinning with dt = 1 h"
59 dt=0.041666667
60 fx_rebin $1.ask.part1 $dt > $1.ask.part1.1h
61 fx_rebin $1.ask.part2 $dt > $1.ask.part2.1h
62 fx_rebin $1.ask.part3 $dt > $1.ask.part3.1h
63 fx_rebin $1.ask.part4 $dt > $1.ask.part4.1h
64 fx_rebin $1.ask.part5 $dt > $1.ask.part5.1h
65 fx_rebin $1.bid.part1 $dt > $1.bid.part1.1h
66 fx_rebin $1.bid.part2 $dt > $1.bid.part2.1h
67 fx_rebin $1.bid.part3 $dt > $1.bid.part3.1h
68 fx_rebin $1.bid.part4 $dt > $1.bid.part4.1h
69 fx_rebin $1.bid.part5 $dt > $1.bid.part5.1h
70
71 echo "..appending rebinned files"
72 cat $1.ask.part1.1h $1.ask.part2.1h $1.ask.part3.1h $1.ask.part4.1h $1.ask.part5.1h > $1.ask.1h.tmp
73 cat $1.bid.part1.1h $1.bid.part2.1h $1.bid.part3.1h $1.bid.part4.1h $1.bid.part5.1h > $1.bid.1h.tmp
74 rm *part*
75
76 echo "..removing empty lines in rebinned files"
77 sed '/^$/d' $1.ask.1h.tmp > $1.ask.1h
78 rm $1.ask.1h.tmp
79 sed '/^$/d' $1.bid.1h.tmp > $1.bid.1h
80 rm $1.bid.1h.tmp
81
82 echo "..done!"

As the final product we obtain two files, say, audusd.bid.1h and audusd.ask.1h, with the content:

$ tail -5 audusd.bid.1h
3772.62500304 0.9304
3772.66666970 0.9283
3772.70833637 0.9266
3772.75000304 0.9263
3772.79166970 0.9253

where the first column is time (MJD) with 1-hour resolution and the second column contains a simple average of all tick prices falling into each 1-hour time bin. For dessert, we plot both Bid and Ask Price time-series, which accomplishes our efforts:
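Conceptually, the whole pipeline boils down to averaging all tick prices falling into each 1-hour bin ($dt = 1/24$ day). A pure-Python sketch on a handful of hypothetical (time-in-days, bid) ticks:

```python
import math

# Average all tick prices falling into each 1-hour bin (dt = 1/24 day)
ticks = [(0.01, 0.7329), (0.02, 0.7331),                  # first hour bin
         (0.05, 0.7330), (0.06, 0.7340), (0.07, 0.7342)]  # second hour bin

dt = 1.0 / 24.0
bins = {}
for t, p in ticks:
    k = math.floor(t / dt)          # index of the 1-hour bin
    bins.setdefault(k, []).append(p)

# (bin start time, mean price) pairs, sorted by time
series = [(k * dt, sum(v) / len(v)) for k, v in sorted(bins.items())]
for t, p in series:
    print(f"{t:.8f} {p:.4f}")
```

The heavy shell/Fortran machinery above exists because 500 million ticks do not fit comfortably in memory; the binning logic itself is this simple.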

## Continuous Measure of Time

Dealing with financial data sets in raw format, we sometimes find ourselves in need of converting a specific date and time into a continuous measure of time expressed by a single number.

The following Fortran code reads from a text file the six first columns containing year, month, day, hour, minute, and second, respectively. As output it returns the Modified Julian Date (MJD), a solution often used where there is a need to express the time axis in a computer-friendly format.
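The conversion is easy to sanity-check in Python, since MJD is simply the number of days elapsed since the epoch 1858-11-17 00:00 (the timestamp below is the first row of the exemplary data):

```python
from datetime import datetime

# MJD = days since the MJD epoch, 1858-11-17 00:00
epoch = datetime(1858, 11, 17)
t = datetime(2012, 1, 1, 12, 53, 1)
mjd = (t - epoch).total_seconds() / 86400.0
print(round(mjd, 8))  # 55927.53681713
```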

1  c getmjd.for
2        implicit double precision (a-z)
3        integer y,m,d,hh,mm,jd
4        character zb*50
5
6        call getarg(1,zb)   ! read the text file name as an argument
7        open(1,file=zb)
8        do i=1.d0, 1.d6
9          read(1,*,end=1) y,m,d,hh,mm,ss
10         jd= d-32075+1461*(y+4800+(m-14)/12)/4+367*(m-2-(m-14)/12*12)
11      _      /12-3*((y+4900+(m-14)/12)/100)/4
12         mjd=(jd+(hh-12.d0)/24.d0+mm/1440.d0+ss/86400.d0)-2400000.5d0
13         write(*,'(F20.8)') mjd
14       enddo
15  1    close(1)
16
17       end

Exemplary Data

2012 1 1 12 53 01
2012 1 2 13 28 57
2012 1 3 14 53 06
...

Compilation and Execution

f77 getmjd.for -o getmjd
./getmjd data.dat

Output

55927.53681713
55928.56177083
55929.62020833
...

## Bank format of numbers in Matlab

The following Matlab function allows for displaying any double-type number using a nice bank format that includes comma separators and two-digit precision (without rounding). The number $num$ is converted into a new variable of string type.

function [str]=bankformat(num)
    num2=floor(num*1000)/1000;
    r=int32(100*(num-floor(num)));
    str = num2str(num2);
    k=find(str == '.');
    if(isempty(k))
        str=[str,'.00'];
    end
    FIN = min(length(str),find(str == '.')-1);
    for i = FIN-2:-3:2
        str(i+1:end+1) = str(i:end);
        str(i) = ',';
    end
    x=mod(r,10);
    if(x==0)
        str=[str,'0'];
    end
    k=find(str == '.');
    d=length(str)-k;
    if(d>2)
        str=str(1:end-(d-2));
    end
end

Example:

>> n=70149506.5;
>> s=bankformat(n);
>> fprintf('\nInvested capital: £%1s\n',s)

Invested capital: £70,149,506.50
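For comparison, a compact Python analogue (an illustration, not the author’s Matlab code) achieves the same effect by truncating to two decimals before formatting with comma separators:

```python
import math

def bankformat(num):
    """Bank format: truncate (not round) to two decimals, add comma separators."""
    truncated = math.floor(num * 100) / 100.0
    return f"{truncated:,.2f}"

print(bankformat(70149506.5))   # 70,149,506.50
print(bankformat(1234.567))     # 1,234.56  (truncated, not rounded)
```

The truncation step matters: a plain `f"{num:,.2f}"` would round 1234.567 up to 1,234.57, which the original Matlab function deliberately avoids.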