MDFA-DeepLearning Package: Hybrid MDFA-RNN networks for machine learning in multivariate time series

Overview

MDFA-DeepLearning is a library for building machine learning applications on large collections of multivariate time series, with a heavy emphasis on noisy (non)stationary data. The goal of MDFA-DeepLearning is to learn underlying patterns, signals, and regimes in multivariate time series and to detect, predict, or forecast them in real-time with the aid of both a real-time feature extraction system based on the multivariate direct filter approach (MDFA) and deep recurrent neural networks (RNNs). The feature extraction system utilizes the MDFA-Toolkit to construct K multivariate signals in real-time (the features), where each of the K features targets a certain frequency range in the underlying time series. Furthermore, each (or some) of these features can also be forecasted multiple steps ahead, or smoothed, creating many possibilities for signal or regime learning in time series.

For the deep learning components, in this package we focus on two network structures, namely a recurrent weighted average network (RWA; Ostmeyer and Cowell) and a standard long short-term memory (LSTM) network. The RWA cell is a type of RNN cell that computes a recurrent weighted average over every past processing timestep, unlike standard RNN cells, which only process information from the previous timestep.
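
As a point of reference, the RWA recurrence can be sketched as follows (our notation, paraphrasing the Ostmeyer and Cowell paper; the cell keeps a running numerator n_t and denominator d_t so the weighted average over all past timesteps is updated in constant time per step):

u_t = W_u x_t + b_u
g_t = W_g [x_t, h_{t-1}] + b_g
a_t = W_a [x_t, h_{t-1}]
n_t = n_{t-1} + (u_t \odot \tanh(g_t)) \odot \exp(a_t)
d_t = d_{t-1} + \exp(a_t)
h_t = f(n_t / d_t)

Here \odot denotes elementwise multiplication and f is the cell activation (tanh in the original paper).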

The overall architecture of the proposed network is given in Figure 1 for the case of an RWA network, which we will discuss in more detail below. For a given sequence of N multivariate time series values which have been transformed appropriately into a stationary sequence, which we denote Y_1, Y_2, …, Y_N, a real-time feature extraction process is applied at each observation, and the result is used as input to an RWA (or LSTM) network, where the univariate output is a targeted signal value (regression) or a regime value (classification).


Figure 1: Proposed network design using RWA cells to learn from the real-time feature extractor. The Y_t values are the (multivariate) transformed time series values, and the S_t are univariate outputs describing a target signal or regime.

Why Use MDFA-DeepLearning

One might want to develop predictive models in multivariate time series data using MDFA-DeepLearning if the time series exhibit any of the following properties:

  • High-Dimensionality (many (un)correlated nonstationary time series)
  • Difficult to forecast using traditional model-based methods (VARIMA/GARCH) or traditional deep learning methods (RNN/LSTM, component decomposition, etc)
  • Emphasis needed on out-of-sample real-time signal extraction and forecasting
  • Regime changing in the underlying dynamics (or data generating process) of the time series is a common occurrence

The MDFA-DeepLearning approach differs from most machine learning methods in time series analysis in its emphasis on real-time feature extraction, where the feature extractors are built using the multivariate direct filter approach. The motivation behind this coupling of MDFA with machine learning is that, while many time series decomposition methodologies exist (from empirical mode decomposition to stochastic component analysis methods), all of these rely on in-sample decompositions of historical data (of no use for future data) and/or assumptions about the boundary values, neither of which is attractive when fast, real-time, out-of-sample prediction is the emphasis. Furthermore, simply applying standard recurrent neural networks for step-ahead forecasting or signal extraction directly on the original noisy data is a fruitless exercise: the recurrent networks typically learn only noise, producing signals and forecasts of little to no value (in most cases, the latter).

As mentioned, the back-end used for the novel feature extraction is the multivariate direct filter approach (MDFA), and is used to extract both local (higher-frequency) and global (low-frequency) features in real-time, out-of-sample, and output these features in a multivariate time series as inputs into an RWA or LSTM recurrent neural network. Thus the package is divided into essentially four different components which all need to be defined properly in order to produce predictive models for time series data:

  • Labeling interface
  • Feature extractors
  • DataSetIterator interface
  • Learning interface

Labeling interface

The package includes an interface for labeling time series. The labeling process takes segments of historical data, and labels each time series observation in some manner. There are three types of labels that can be used:

  • Observational labeling: every time series observation is labeled by a signal value (for example a target value computed by a symmetric target filter). This is sequence-to-sequence labeling for time series regression.
  • Fixed Period labeling: every period (day, week, etc) is labeled, typically by a one-hot vector. This is sequence-to-value labeling. The end of the period is labeled and the rest of the values are not (masked by nonvalues in the code).
  • Regime labeling: every value in a specific regime is labeled, either by a one-hot vector (for example, long (1,0), short (0,1), neutral (0,0), or trend (1,0) and mean-reverting (0,1)). This is another example of sequence-to-sequence labeling, but using one-hot vectors and now in the form of sequence classification.

Other labeling strategies can certainly be used, but these are the three most common. We will give an outline on how to create a custom labeling strategy in a future article.
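
As a purely illustrative sketch (this is not the package's labeling interface; regimeOf below is a hypothetical user-defined rule), one-hot regime labeling of a series could look like the following:

/* Hypothetical sketch of one-hot regime labeling: long = (1,0), short = (0,1), neutral = (0,0).
   regimeOf(...) stands in for any user-defined rule mapping an observation to a regime. */
double[][] labelRegimes(double[] series) {
    double[][] labels = new double[series.length][2];
    for (int t = 0; t < series.length; t++) {
        int regime = regimeOf(series, t);   // hypothetical: 1 = long, -1 = short, 0 = neutral
        if (regime == 1)       { labels[t][0] = 1.0; }
        else if (regime == -1) { labels[t][1] = 1.0; }
        /* neutral: leave the row as (0,0) */
    }
    return labels;
}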

Feature Extractors

The package contains a feature extraction class called MDFAFeatureExtraction which, when instantiated, is used as the input to a DataSetIterator. The MDFAFeatureExtraction contains a default automated feature extraction builder, where K gives the number of features and a lag value indicates the number of smoothing or forecasting steps ahead.

One application of the MDFA feature extraction tool is to decompose a multivariate time series into K components in real-time which are close to being “orthogonal”, meaning in this sense that the frequency information of the components is relatively disjoint. A precise mathematical formulation of this property and examples of MDFAFeatureExtraction will follow. Another example, used for turning-point detection in trends, is to decompose the multivariate series into K low-frequency components with different speeds and forecast/smoothing characteristics.
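
To give one plausible (and deliberately informal) reading of this property while the precise formulation is deferred: think of the default construction as assigning each feature k its own target passband, with the K passbands partitioning the frequency interval [0, \pi]:

[0, \pi) = [\omega_0, \omega_1) \cup [\omega_1, \omega_2) \cup \cdots \cup [\omega_{K-1}, \omega_K), \quad 0 = \omega_0 < \omega_1 < \cdots < \omega_K = \pi

so that the target of feature k is \Gamma_k(\omega) = 1 for \omega \in [\omega_{k-1}, \omega_k) and 0 otherwise, and “relatively disjoint” means \Gamma_k \Gamma_{k'} \approx 0 for k \neq k'.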

DataSetIterator

The DataSetIterator is an interface from ND4J, the library that handles fast n-dimensional array manipulation in Java, akin to NumPy in Python. More specifically, a DataSetIterator handles traversing a dataset and preparing the data for a recurrent neural network. In this package, the datasets are the outputs of the TimeSeries passed through the MDFAFeatureExtraction objects, which then become the input to the RNN. The DataSetIterator also applies the labeling and determines how the output is arranged. It is thus the bridge from the underlying time series to the extraction process, and from there to the input and output of the RNN. The package provides two example DataSetIterators, one for regression and one for classification, which will be described in more detail in a later article.
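
To fix ideas, the canonical dl4j pattern for consuming any DataSetIterator (including the ones in this package) looks roughly like the minimal sketch below; the iterator hands the network one mini-batch of features and labels at a time.

import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;

/* Minimal sketch: traverse a DataSetIterator one mini-batch at a time.
   trainIter can be any implementation, e.g. an MDFARegressionDataSetIterator. */
void traverse(DataSetIterator trainIter) {
    while (trainIter.hasNext()) {
        DataSet batch = trainIter.next();   // one mini-batch of (features, labels)
        // batch.getFeatures() holds the feature array (for RNNs: [miniBatch, nFeatures, timeSteps]),
        // batch.getLabels() holds the corresponding labels (and masks, if any).
    }
    trainIter.reset();                      // rewind before the next epoch
}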

Learning interface

Finally, the learning interface is where the final network is defined: all the parameters of the network, the activation and loss functions, and the number and type of layers (LSTM, FeedForward, etc). The underlying computational framework for this component is DeepLearning4J.

Requirements and Example

MDFA-DeepLearning requires both the MDFA-Toolkit package for constructing the time series feature extractors and the Eclipse Deeplearning4j (dl4j) library for the deep recurrent neural network constructors. The dl4j library is freely available at github.com/deeplearning4j, but is included in the build of this package using Gradle as the dependency management tool.

The back-end for the dl4j package will depend on your computational hardware, but it is available on a local native basis using CPUs, or can take advantage of GPUs using CUDA libraries (CUDA 8.0 was used to test the current version of MDFA-DeepLearning). In this package I have included a reference to both versions (assuming a standard linux64 architecture).

The back-end used for the novel feature extraction technique, as mentioned, is the MDFA-Toolkit (available here), which runs on the ND4J package. The feature extraction begins by defining K MDFA objects, called MDFABase in the MDFA-Toolkit, with fixed parameters set for each MDFA object. For example, here we define K=4 MDFABase objects that will be used to extract different types of trends at different speeds in a fractionally differenced time series. Please refer to the MDFA-Toolkit documentation for more information on the definition of each MDFA parameter.

MDFABase[] anyMDFAs = new MDFABase[4];

/* Slow trend: lowpass cutoff pi/8, i1 constraint, forecast 3 steps ahead (lag -3) */
anyMDFAs[0] = (new MDFABase()).setLowpassCutoff(Math.PI/8.0)
                              .setI1(1)
                              .setSmooth(.2)
                              .setLag(-3)
                              .setLambda(2.0)
                              .setAlpha(2.0)
                              .setFilterLength(5);

/* Slower trend: lowpass cutoff pi/10, forecast 2 steps ahead */
anyMDFAs[1] = (new MDFABase()).setLowpassCutoff(Math.PI/10.0)
                              .setLag(-2)
                              .setAlpha(2.0)
                              .setFilterLength(5);

/* Faster trend: lowpass cutoff pi/4, regularized coefficient decay, forecast 1 step ahead */
anyMDFAs[2] = (new MDFABase()).setLowpassCutoff(Math.PI/4.0)
                              .setDecayStart(.1)
                              .setDecayStrength(.2)
                              .setLag(-1)
                              .setFilterLength(5);

/* Slowest trend: lowpass cutoff pi/14, smoothing only (no lag set) */
anyMDFAs[3] = (new MDFABase()).setLowpassCutoff(Math.PI/14.0)
                              .setSmooth(.2)
                              .setDecayStart(.1)
                              .setDecayStrength(.1)
                              .setFilterLength(5);

More concrete, in-depth, step-by-step examples and tutorials will be given in the source code on GitHub and in this blog, but here we just give a brief overview of an example main program using these features.


/* Define the .csv data file from which we build the train/test DataSetIterators */
String[] dataFiles = new String[]{"AAPL.daily.csv"};

/* Information about the .csv timeseries file */
TimeSeriesFile fileInfo = new TimeSeriesFile("yyyy-MM-dd", "Index", "Open");

/* Define network parameters */
int miniBatchSize = 100;
int totalTrainExamples = 1500;
int totalTestExamples = 300;
int timeStepLength = 60;
int nHiddenLayers = 2;
int nHidden = 216;
int nEpochs = 400;
int seed = 123;
int iterations = 40;
double learningRate = .001;
double gradientNormThreshold = 10.0;

IUpdater updater = new Nesterovs(learningRate, .4);

/* Instantiate Feature Extractors as an array of MDFABase objects */
MDFAFeatureExtraction features = new MDFAFeatureExtraction(anyMDFAs); 

/* Instantiate a new RecurrentMdfaRegression network using the features defined above */
RecurrentMdfaRegression myNet = new RecurrentMdfaRegression(features);

/* Set the data and the DataIterator parameters */
myNet.setTrainingTestData(dataFiles, fileInfo, miniBatchSize, totalTrainExamples, totalTestExamples, timeStepLength);

/* Usually a good idea to normalize the data */
myNet.normalizeData();

/* Build the LSTM (default network) layers */
myNet.buildNetworkLayers(nHiddenLayers, nHidden,
			RecurrentMdfaRegression.setNeuralNetConfiguration(seed, iterations, learningRate, gradientNormThreshold, 0, updater));

/* An optional dl4j control panel viewable in the browser */
myNet.setupUserInterface();

/* Train on the number of Epochs */
myNet.train(nEpochs);

/* Print/plot results and stats */
myNet.printPredicitions();
myNet.plotBatches(10);

The main points here are that essentially three components need to be defined:

  1. The .csv time series data file from which the DataIterator will extract the time series data for both labeling and learning. Two data sets will be created from this, a train set and a test set. Referencing multiple files from which to extract training and test sets is also possible. In dl4j, training and test data is built in the form of a DataSetIterator interface (org.nd4j.linalg.dataset.api.iterator). In the package, we have defined an MDFADataSetIterator and an MDFARegressionDataSetIterator. More DataSetIterators for various applications will be added on an ongoing basis.
  2. The network RecurrentMdfaRegression is initiated and needs to contain the feature signal extractors. Any set of feature extractors can be added; here we use the ones defined above as an example.
  3. Finally, the LSTM (or recurrent weighted average) network parameters need to be defined; these will then be used to construct the layers of the recurrent network.

With these three steps defined the network should be ready to train and test. The challenge is of course defining the feature extraction parameters. In later articles, we will give tips and tricks into what works best for what type of learning applications in large time series.

 


MDFA-Toolkit: A JAVA package for real-time signal extraction in large multivariate time series

The multivariate direct filter approach (MDFA) is a generic real-time signal extraction and forecasting framework endowed with a richly parameterized interface allowing for adaptive and fully-regularized data analysis in large multivariate time series. The methodology is based primarily in the frequency domain, where all the optimization criteria are defined, from regularization, to forecasting, to filter constraints. For an in-depth tutorial on the mathematical formulation, the reader is invited to check out any of the many publications or tutorials on the subject from blog.zhaw.ch.

This MDFA-Toolkit (clone here) provides a fast, modularized, and adaptive framework in JAVA for doing such real-time signal extraction for a variety of applications. Furthermore, we have developed several components of the package featuring streaming time series data analysis tools not known to be available anywhere else. Such new features include:

  • A fractional differencing optimization tool for transforming nonstationary time-series into stationary time series while preserving memory (inspired by Marcos Lopez de Prado’s recent book on Advances in Financial Machine Learning, Wiley 2018).
  • Easy to use interface to four different signal generation outputs:
    Univariate series -> univariate signal
    Univariate series -> multivariate signal
    Multivariate series -> univariate signal
    Multivariate series -> multivariate signal
  • Generalization of optimization criterion for the signal extraction. One can use a periodogram, or a model-based spectral density of the data, or anything in between.
  • Real-time adaptive parameterization control – make slight adjustments to the filter process parameterization effortlessly
  • Build a filtering process from simpler user-defined filters, applying customization and reducing degrees of freedom.

This package also provides an API to three other real-time data analysis frameworks that are now available or soon will be:

  • iMetricaFX – An app written entirely in JavaFX for doing real-time time series data analysis with MDFA
  • MDFA-DeepLearning – A new recurrent neural network methodology for learning in large noisy time series
  • MDFA-Tradengineer – An automated algorithmic trading platform combining MDFA-Toolkit, MDFA-DeepLearning, and Esper – a library for complex event processing (CEP) and streaming analytics

To start the most basic signal extraction process using MDFA-Toolkit, three things need to be defined.

  1. The data streaming process which determines from where and what kind of data will be streamed
  2. A transformation of the data, which includes any logarithmic transform, normalization, and/or (fractional) differencing
  3. A signal extraction definition which is defined by the MDFA parameterization

Data streaming

In the current version, time series data is provided by a streaming CSVReader, where the time series index is given by a String DateTime stamp in the first column and the value(s) are given in the following columns. For multivariate data, two options are available for streaming data: 1) a multiple-column .csv file, with each value of the time series in a separate column, or 2) multiple referenced single-column time-stamped .csv files. In the latter case, the time series DateTime stamps will be checked for agreement; if they do not agree, an exception will be thrown. More sophisticated multivariate time series data streamers which account for missing values will soon be available.

Transforming the data

Depending on the type of time series data and the application or objectives of the real-time signal extraction process, transforming the data in real-time might be an attractive feature. The transformation of the data can include (but is not limited to) the following:

  • A Box-Cox transform, one of the more common transformations in financial and other non-stationary time series.
  • (Fractional) differencing, defined by a value d in [0,1]; when d=1, standard first-order differencing is applied (see the sketch after this list).
  • For stationary series, standard mean-variance normalization or a more exotic GARCH normalization which attempts to model the underlying volatility is also available.
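
As a rough sketch of what fractional differencing computes (following the standard expanding-window weight recursion popularized by Lopez de Prado; the toolkit's exact implementation and truncation scheme may differ):

/* Sketch of fractional differencing with parameter d in [0,1]:
   weights w_0 = 1, w_k = -w_{k-1} * (d - k + 1) / k, truncated at 'window' lags.
   With d = 1 this reduces to standard first-order differencing. */
double[] fracDiff(double[] x, double d, int window) {
    double[] w = new double[window];
    w[0] = 1.0;
    for (int k = 1; k < window; k++) {
        w[k] = -w[k - 1] * (d - k + 1) / k;
    }
    double[] out = new double[x.length];
    for (int t = window - 1; t < x.length; t++) {
        double sum = 0.0;
        for (int k = 0; k < window; k++) {
            sum += w[k] * x[t - k];    // weighted sum of the current and past values
        }
        out[t] = sum;
    }
    return out;
}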

Signal extraction definition

Once the data streaming and transformation procedures have been defined, the signal extraction parameters can then be set in a univariate or multivariate setting. (Multiple signals can be constructed as well, so that the output is a multivariate signal.) A signal extraction process is defined by an MDFABase object (or an array of MDFABase objects in the multivariate signal case). The parameters to be defined are as follows:

  • Filter length: the length L in number of lags of the resulting filter
  • Low-pass/band-pass frequency cutoffs: which frequency range is to be filtered from the time-series data
  • In-sample data length: how much historical data is needed to construct the MDFA filter
  • Customization: α (smoothness) and λ (timeliness) emphasize, respectively, the smoothness of the filter, by mollifying high-frequency noise, and the timeliness of the filter, by penalizing the phase-delay error in the frequency domain
  • Regularization parameters: controls the decay rate and strength, smoothness of the (multivariate) filter coefficients, and cross-series similarity in the multivariate case
  • Lag: controls the forecasting (negative values) or smoothing (positive values)
  • Filter constraints i1 and i2: constrain the filter coefficients to sum to one (i1) and/or force the dot product of the coefficients with (0, 1, …, L) to equal the phase shift (i2), where L is the filter length (written out after this list).
  • Phase-shift: the derivative of the frequency response function at the zero frequency.
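
Written out with filter coefficients b_0, …, b_{L-1} and a desired shift s, the two constraints described above read as follows (this follows the description given here; consult the MDFA-Toolkit documentation for the exact indexing convention used):

i1: \sum_{j=0}^{L-1} b_j = 1
i2: \sum_{j=0}^{L-1} j \, b_j = s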

All these parameters are controlled in an MDFABase object, which holds all the information associated with the filtering process. It includes its own interface which ensures the MDFA filter coefficients are updated automatically any time the user changes a parameter in real-time.


Figure 1: Overview of the main module components of MDFA-Toolkit and how they are connected

Figure 1 shows the main components that need to be defined in order to build a signal extraction process in MDFA-Toolkit. The signal extraction process begins with a handle on the data streaming process, which in this article we will demonstrate using a simple CSV market file reader that is included in the package. The CSV file should contain the raw time series data and, ideally, a time (or date) stamp column. If there is no time stamp column, a stamp will simply be generated for each value.

Once the data stream has been defined, it is passed into a time series transformation process, which automatically handles all the data transformations as new data is streamed. As we'll see, the TargetSeries object defines these transformations, and all streaming data is added directly to the TargetSeries object. A MultivariateFXSeries is then initiated with references to each TargetSeries object. The MDFABase objects contain the MDFA parameters and are added to the MultivariateFXSeries to produce the final signal extraction output.

To demonstrate these components and how they come together, we illustrate the package with a simple example where we wish to extract three independent signals from AAPL daily open prices from the past 5 years. We also do this in a multivariate setting, to see how all the components interact, yielding a multivariate series -> multivariate signal.


//Define three data source files, the first one will be the target series
String[] dataFiles = new String[]{"AAPL.daily.csv", "QQQ.daily.csv", "GOOG.daily.csv"};

//Create a CSV market feed, where Index is the Date column and Open is the data
CsvFeed marketFeed = new CsvFeed(dataFiles, "Index", "Open");

/* Create three independent signal extraction definitions using MDFABase:
One lowpass filter with cutoff PI/20 and two bandpass filters
*/
MDFABase[] anyMDFAs = new MDFABase[3];
anyMDFAs[0] = (new MDFABase()).setLowpassCutoff(Math.PI/20.0)
                              .setI1(1)
                              .setHybridForecast(.01)
                              .setSmooth(.3)
                              .setDecayStart(.1)
                              .setDecayStrength(.2)
                              .setLag(-2.0)
                              .setLambda(2.0)
                              .setAlpha(2.0)
                              .setSeriesLength(400); 

anyMDFAs[1] = (new MDFABase()).setLowpassCutoff(Math.PI/10.0)
                              .setBandPassCutoff(Math.PI/15.0)
                              .setSmooth(.1)
                              .setSeriesLength(400); 

anyMDFAs[2] = (new MDFABase()).setLowpassCutoff(Math.PI/5.0)
                              .setBandPassCutoff(Math.PI/10.0)
                              .setSmooth(.1)
                              .setSeriesLength(400);

/*
Instantiate a multivariate series, with the MDFABase definitions,
and the Date format of the CSV market feed
*/
MultivariateFXSeries fxSeries = new MultivariateFXSeries(anyMDFAs, "yyyy-MM-dd");

/*
Now add the three series, each one a TargetSeries representing the series
we will receive from the csv market feed. The TargetSeries
defines the data transformation. Here we use differencing of order 1
with a log-transform applied
*/
fxSeries.addSeries(new TargetSeries(1.0, true, "AAPL"));
fxSeries.addSeries(new TargetSeries(1.0, true, "QQQ"));
fxSeries.addSeries(new TargetSeries(1.0, true, "GOOG"));
/*
Now start filling the fxSeries with data; we begin with
the first 600 observations from the market feed
*/
for(int i = 0; i < 600; i++) {
   TimeSeriesEntry observation = marketFeed.getNextMultivariateObservation();
   fxSeries.addValue(observation.getDateTime(), observation.getValue());
}

//Now compute the filter coefficients with the current data
fxSeries.computeAllFilterCoefficients();

//You can also chop off some of the data; here we chop off the first 70 observations
fxSeries.chopFirstObservations(70);

//Plot the data so far
fxSeries.plotSignals("Original");


Figure 2: Output of the three signals on the target series (red) AAPL

In the first line, we reference three data sources (AAPL, QQQ, and GOOG daily open prices), where all signals are constructed for the target series, which is by default the first series referenced in the data market feed. The other two series act as explanatory series. The filter coefficients are computed using the latest 400 observations, since in this example 400 was used as the in-sample setSeriesLength value for all signals. As a side note, different in-sample lengths can be used for each signal, which allows one to study the effects of in-sample data size on signal output quality. Figure 2 shows the resulting in-sample signals created from the latest 400 observations.

We now add 600 more observations out-of-sample, chop off the first 400, and then see how one can change a couple of parameters on the first signal (first MDFABase object).


for(int i = 0; i < 600; i++) {
   TimeSeriesEntry observation = marketFeed.getNextMultivariateObservation();
   fxSeries.addValue(observation.getDateTime(), observation.getValue());
}

fxSeries.chopFirstObservations(400);
fxSeries.plotSignals("New 400");

/* Now change the lowpass cutoff to PI/6
   and the lag to -3.0 in the first signal (index 0) */
fxSeries.getMDFAFactory(0).setLowpassCutoff(Math.PI/6.0);
fxSeries.getMDFAFactory(0).setLag(-3.0);

/* Recompute the filter coefficients with new parameters */
fxSeries.computeFilterCoefficients(0);
fxSeries.plotSignals("Changed first signal");


Figure 3: Signal values after adding 600 new out-of-sample observations

After adding the 600 values out-of-sample and plotting, we then proceed to change the lowpass cutoff of the first signal to PI/6, and the lag to -3.0 (forecasting three steps ahead). This is done by accessing the MDFAFactory, getting a handle on the first signal (index 0), and setting the new parameters. The filter coefficients are then recomputed on the newest 400 values (but now all signal values are in-sample).

In the MDFA-Toolkit, plotting is done using JFreeChart; however, iMetricaFX provides an app for building signal extraction pipelines with this toolkit as the backend, where all the automated plotting, analysis, and graphics are handled in JavaFX, creating a much more interactive signal extraction environment. Many more features are constantly being added to the MDFA-Toolkit, especially features boosting applications in machine learning, as we will see in the next article.

Big Data analytics in time series

We also implement in MDFA-Toolkit an interface to Apache Spark-TS, which provides a Spark RDD for time series objects, geared towards high-dimensional multivariate time series. Large-scale time series data shows up across a variety of domains. Distributed as the spark-ts package and developed by Cloudera’s Data Science team, the library enables analysis of data sets comprising millions of time series, each with millions of measurements, and runs atop Apache Spark. A tutorial on creating an Apache Spark-TS connection with MDFA-Toolkit is currently being developed.

iMetrica for Linux Ubuntu 64 now available

The MDFA real-time signal extraction module

My first open-source release of iMetrica for Linux Ubuntu 64 can now be downloaded at my Github, with a Windows 64 version soon to follow. iMetrica is a fast, interactive, GUI-oriented software suite for predictive modeling, multivariate time series analysis, real-time signal extraction, Bayesian financial econometrics, and much more.

The principal use of iMetrica is to provide an interactive environment for the numerical and visual analysis of (multivariate) time series modeling, real-time filtering, and signal extraction. The interactive features in iMetrica boast a modeling and graphics environment for analysts, practitioners, and students of econometrics, finance, and real-time data analysis where no coding or modeling experience is necessary. All the system needs is data, which can be piped into the system in many forms, including .csv, .txt, Google/Yahoo Finance, Quandl, .RData, and more. A module for connecting to MySQL databases is currently being developed. One can also simulate one's own data from one or a combination of several popular data-generating models.

With the design intending to be interactive and self-enclosed, one can change modeling data/parameter inputs and see the effects in both graphical and numerical form automatically. This feature is designed to help understand the underlying mechanics of the modeling or filtering process. One can test many attributes of the modeling or filtering process this way both visually and numerically such as sensitivity, nonlinearity, goodness-of-fit, any overfitting issues, stability, etc.

All the computational libraries were written in GNU C and/or Fortran and have been provided as Native libraries to the Java platform via JNI, where Java provides the user-interface, control, graphics, and several other components in a module format, and where each module specializes in a different data analysis paradigm. The modules available in this open-source version of iMetrica are as follows:

1) Data simulation, modeling and fitting using several popular econometric models

  • (S)ARIMA, (E)GARCH, (Multivariate) Factor models, Stochastic Volatility, High-frequency volatility models, Cycles/Trends, and more
  • Random number generators from several different types of parameterized distributions to create shocks, outliers, regression components, etc.
  • Visualize in real-time all components of the modeling process

2) An interactive GUI for multivariate real-time signal extraction using the multivariate direct filter approach (MDFA)

  • Construct multivariate MA filter designs, classical ARMA ZPA filtering designs, or hybrid filtering designs.
  • Analyze all components of the filtering and signal extraction process, from time-delay and smoothing control, to regularization.
  • Adaptive real-time filtering
  • Construct financial trading signals and forecasts
  • Includes a real-time/frequency analysis module using MDFA

3) An interactive GUI for X-13-ARIMA-SEATS called uSimX13

  • Perform automatic seasonal adjustment on thousands of economic time series
  • Compare SARIMA model choices using several different novel signal extraction diagnostics and tools available only in iMetrica
  • Visualize in real-time several components of modeling process
  • Analyze forecasts and compare with other models
  • All of the most important features of X-13-ARIMA-SEATS included

4) An interactive GUI for RegComponent (State Space and Unobserved Component Models)

  • Construct unobserved signal components and time-varying regression components
  • Obtain forecasts automatically and compare with other forecasting models

5) Empirical Mode Decomposition

  • Applies a fast adaptive EMD algorithm to decompose nonlinear, nonstationary data into a trend and intrinsic modes.
  • Visualize all time-frequency components with automatically generated 2D heat maps.

6) Bayesian Time Series Modeling of ARIMA, (E)GARCH, Multivariate Stochastic Volatility, HEAVY models

  • Compute and visualize posterior distributions for all modeling parameters
  • Easily compare different model dimensions

7) Financial Trading Strategy Engineering with MDFA

  • Construct financial trading signals in the MDFA module and backtest the strategies on any frequency of data
  • Perform analysis of the strategies using forward-walk schemes
  • Automatically optimize certain components of the signal extraction on in-sample data.
  • Features a toolkit for minimizing probability of backtest overfit

Tutorials on how to use iMetrica can be found on this blog and will be added on a weekly basis, with new tools, features, and modules being added and improved on a consistent basis.

Please send any bug reports, comments, complaints, to clisztian@gmail.com.

Realizing the Future with iMetrica and HEAVY Models

In this article we steer away from multivariate direct filtering and signal extraction in financial trading and briefly indulge ourselves in the world of analyzing high-frequency financial data, an always hot topic with the ever increasing availability of tick data in computationally convenient formats. Not only has high-frequency intraday data been the basis of higher-frequency risk monitoring and forecasting, but it also provides access to building ‘smarter’ volatility prediction models using so-called realized measures of intraday volatility. These realized measures have been shown in numerous studies over the past five years or so to provide a considerably more robust indicator of daily volatility. While daily returns only capture close-to-close volatility, leaving much to be said about the actual volatility of the asset that was witnessed during the day, realized measures of volatility using higher-frequency data, such as second or minute data, provide a much clearer picture of open-to-close variation in trading.

In this article, I briefly describe a new type of volatility model that takes these realized measures into account, called High frEquency bAsed VolatilitY (HEAVY) models, developed and pioneered by Shephard and Sheppard (2009). These models take as input both close-to-close daily returns r_t and daily realized measures to yield better forecasting dynamics. The models have been shown to be endowed with the ability not only to track momentum in volatility, but also to adjust for mean-reversion effects and to adjust quickly to structural breaks in the level of the volatility process. As the authors state in their original paper, the focus of these models is on predictive properties, rather than on non-parametric measurement of volatility. Furthermore, HEAVY models are much easier and more robust to estimate than single-source equations (GARCH, stochastic volatility), as they bring two sources of volatility information to identify a longer-term component of volatility.

The goal of this article is three-fold. Firstly, I briefly review these HEAVY models and give some numerical examples of the model in action using a gnu-c library and Java package called heavy_model that I developed last year for the iMetrica software. The heavy_model package is available for download (either by this link or by e-mailing me) and features many options that are not available in the MATLAB code provided by Sheppard (bootstrapping methods, Bayesian estimation, track reparameterization, among others). I will then demonstrate the seamless ability to model volatility with HEAVY models using iMetrica, where I also provide code for computing realized measures of volatility in Java with the help of an R package called highfrequency (Boudt, Cornelissen, and Payseur 2012).

HEAVY Model Definition

Let’s denote the daily returns as r_1, r_2, \ldots, r_T, where T is the total number of days in the sample we are working with. In the HEAVY model, we supplement the daily returns with a so-called realized measure of intraday volatility based on higher-frequency data, such as second, minute, or hourly data. The measures are called daily realized measures and we will denote them as RM_1, RM_2, \ldots, RM_T for the T days in the sample. We can think of these daily realized measures as an average of variance autocorrelations during a single day. They are supposed to provide a better snapshot of the ‘true’ volatility for a specific day t. Although there are numerous ways of computing a realized measure, the easiest is the realized variance computed as RM_t = \sum_j (X_{t+t_{j,t}} - X_{t+t_{j-1,t}})^2 where t_{j,t} are the normalized times of trades on day t. Other methods for providing realized measures include kernel-based methods, which we will discuss later in this article (see for example http://papers.ssrn.com/sol3/papers.cfm?abstract_id=927483).
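
As a concrete illustration of the realized-variance formula above (a plain sum of squared intraday increments; the kernel-based estimators mentioned refine this), a minimal Java sketch might look like:

/* Minimal sketch: realized variance for one day from a vector of intraday log-prices,
   RM = sum of squared increments, matching RM_t = sum_j (X_{t_j} - X_{t_{j-1}})^2. */
double realizedVariance(double[] intradayLogPrices) {
    double rm = 0.0;
    for (int j = 1; j < intradayLogPrices.length; j++) {
        double increment = intradayLogPrices[j] - intradayLogPrices[j - 1];
        rm += increment * increment;
    }
    return rm;
}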

Once the realized measures have been computed for T days, the HEAVY model is given by:

Var(r_t | \mathcal{F}_{t-1}^{HF}) = h_t = \omega_1 + \alpha RM_{t-1} + \beta h_{t-1} + \lambda r^2_{t-1}

E(RM_t | \mathcal{F}_{t-1}^{HF}) = \mu_t = \omega_2+ \alpha_R RM_{t-1} + \beta_R \mu_{t-1},

where the stability constraints are  \alpha, \omega_1 \geq 0, \beta \in [0,1] and \omega_2, \alpha_R \geq 0 with \lambda + \beta \in [0,1] and \beta_R + \alpha_R \in [0,1]. Here, the \mathcal{F}_{t-1}^{HF} denotes the high-frequency information from the previous day t-1. The first equation models the close-to-close conditional variance and is akin to a GARCH type model, whereas the second equation models the conditional expectation of the open-to-close variation.  

With the formulation above, one can easily see that slight variations of the model are perfectly plausible. For example, one could consider additional lags in either the realized measure RM_t (akin to adding additional moving average parameters) or the conditional mean/variance variable (akin to adding autoregression parameters). One could also leave out the dependence on the lagged squared returns r^2_{t-1} by setting \lambda to zero, which is what the original authors recommended. A third variation is adding yet another equation that models a realized measure taking into account negative and positive momentum, to yield possibly better forecasts as it tracks both losses and gains. In this case, one would add the third component by introducing a new equation for a realized semivariance to parametrically model statistical leverage effects, where falls in asset prices are associated with increases in future volatility. With realized semivariance computed for the T days as RMS_1, \ldots, RMS_T, the third equation becomes

E(RMS_t | \mathcal{F}_{t-1}^{HF}) = \phi_t = \omega_3 + \alpha_{RS} RMS_{t-1} + \beta_{RS} \phi_{t-1}

where \alpha_{RS} + \beta_{RS} < 1 and both positive.

HEAVY modeling in C and Java

To incorporate these HEAVY models into iMetrica, I began by writing a gnu-c library providing a fast and efficient framework for both quasi-likelihood evaluation and a posteriori analysis of the models. The structure of estimating the models follows the original MATLAB code provided by Sheppard very closely. However, in the c library I’ve added a few more useful tools for forecasting and distribution analysis. The Java code is essentially a wrapper for the c heavy_model library, providing a much cleaner approach to modeling and analyzing the HEAVY data, such as the parameters and forecasts. While there are many ways to declare, implement, and analyze HEAVY models using the c/java toolkit I provide, the most basic steps involved are as follows.

heavyModel heavy = new heavyModel();                                    // declare a HEAVY model
heavy.setForecastDimensions(n_forecasts, n_steps);                      // number of forecast samples and steps ahead
heavy.setParameterValues(w1, w2, alpha, alpha_R, lambda, beta, beta_R); // initial parameter values for the QMLE routine
heavy.setTrackReparameter(0);                                           // optional reparameterization toggle (0 = off, 1 = on)
heavy.setData(n_obs, n_series, series);                                 // column-wise data: returns r_t, then realized measures RM_t
heavy.estimateHeavyModel();                                             // estimate the model
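
As an aside on the setData call above: the series argument is a single column-wise double array. A hedged, purely illustrative sketch of packing it (assuming the column-major layout explained in the next paragraph, with all returns first and all realized measures second; the helper name is hypothetical) might look like:

/* Illustrative sketch: pack returns r_t and realized measures RM_t into one column-wise
   array of length n_obs * n_series (here n_series = 2), assuming column-major layout. */
double[] packHeavyData(double[] returns, double[] realizedMeasures) {
    int nObs = returns.length;
    double[] series = new double[2 * nObs];
    System.arraycopy(returns, 0, series, 0, nObs);              // first column: r_t
    System.arraycopy(realizedMeasures, 0, series, nObs, nObs);  // second column: RM_t
    return series;
}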

The first line declares a HEAVY model in java, while the second line sets the number of forecasts samples to compute and how many forecast steps to take. Forecasted values are provided for both the return variable r_t (using a bootstrapping methodology) and the h_t, \mu_t variables. In the next line, the parameter values for the HEAVY model are initialized. These are the initial points that are utilized in the quasi-maximum likelihood optimization routine and can be set to any values that satisfy the model constraints.   Here, w1 = \omega_1, w2 = \omega_2.

The fourth line is completely optional and is used for toggling (0=off, 1=on) a reparameterization of the HEAVY model so the intercepts of both equations in the HEAVY model are explicitly related to the unconditional mean of squared returns r^2 and realized measures RM_t. The reparameterization of the model has the advantage that it eliminates the estimation of \omega_1, \omega_2 and instead uses the unconditional means, leaving two fewer degrees of freedom in the optimization. See page 12 of the Shephard and Sheppard 2009 paper for a detailed explanation of the reparameterization. After setting the initial values, the data is set for the model by inputting the total number of observations T, the number of series (normally set to 2), and the data in column-wise format (namely a double array of length n_obs x n_series, where the first column is the return data r_t and the second column is the daily realized measure data). Finally, with the data set and the parameters initialized, we estimate the model in the 6th line. Once the model is finished estimating (it should take a few seconds, depending on the number of observations), the heavyModel java object stores the parameter values, forecasts, model residuals, likelihood values, and more. For example, one can print out the estimated model parameters and plot the forecasts of h_t using the following:


heavy.printModelParameters();
heavy.plotForecasts();
Output:
w_1 = 0.063 w_2 = 0.053
beta = 0.855 beta_R = 0.566
alpha = 0.024 alpha_R = 0.375
lambda = 0.087

Figure 1 shows the plot of the filtered h_t, \mu_t values for 300 trading days from June 2011 to June 2012 of AAPL with the final 20 points being the forecasted values. Notice that the multistep ahead forecast shows momentum which is one of the attractive characteristics of the HEAVY models as mentioned in the original paper by Shephard and Sheppard.

Figure 1: Plots of the filtered returns and realized measures with 20 step forecasts for Verizon for 300 trading days.

We can also easily plot the estimated joint distribution function F_{\zeta, \eta} by simply using the filtered h_t, \mu_t and computing the devolatilized values \zeta_t = r_t/ \sqrt{h_t}, \eta_t = (RM_t/\mu_t)^{1/2}, leading to the innovations for the model for t = 2,\ldots,T.
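
Computing the devolatilized pairs is a one-liner per observation; a minimal sketch using the filtered h_t and \mu_t values:

/* Sketch: devolatilized innovations zeta_t = r_t / sqrt(h_t) and eta_t = sqrt(RM_t / mu_t),
   for t = 2,...,T (index 1 onward in 0-based arrays), matching the definitions above. */
double[][] devolatilize(double[] r, double[] rm, double[] h, double[] mu) {
    int T = r.length;
    double[][] out = new double[T][2];
    for (int t = 1; t < T; t++) {
        out[t][0] = r[t] / Math.sqrt(h[t]);     // zeta_t
        out[t][1] = Math.sqrt(rm[t] / mu[t]);   // eta_t
    }
    return out;
}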

Figure 2 below shows the empirical distribution of F_{\zeta, \eta} for 600 days (nearly two years of daily observations from AAPL). The \zeta_t sequence should be roughly a martingale difference sequence with unit variance, and the \eta_t sequence should have unit conditional mean and, of course, be uncorrelated. The empirical results validate the theoretical values.

Figure 2: Scatter plot of the empirical distribution of devolatilized values for h and mu.

In order to compile and run the heavy_model library and the accompanying Java wrapper, one must first be sure to meet the requirements for installation. The programs were extensively tested on a 64-bit Linux machine running Ubuntu 12.04. The heavy_model library written in c uses the GNU Scientific Library (GSL) for the matrix-vector routines, along with a statistical package in gnu-c called apophenia (Klemens, 2012) for the optimization routine. I’ve also included a wrapper for the GSL library called multimin.c which enables using the optimization routines from the GSL library, but these were not heavily tested. The first version (version 00) of the heavy_model library and Java wrapper can be downloaded at sourceforge.net/projects/highfrequency. As a precautionary warning, I must confess that none of the files are heavily commented in any way, as this is still a project in progress. Improvements in code, efficiency, and documentation will be continuously coming.

After downloading the .tar.gz package, first ensure that GSL and Apophenia are properly installed and that the libraries are installed to the appropriate path for your gnu c compiler. Second, to compile the .c code, copy the makefile.test file to Makefile and then type make. To compile the heavyModel library and utilize the java heavyModel wrapper (recommended), copy makefile.lib to Makefile, then type make. After it constructs libheavy.so, compile the heavyModel.java file by typing javac heavyModel.java. Note that the java files were compiled successfully using the Oracle Java 7 SDK. If you have any questions about this or any of the c or java files, feel free to contact me. All the files were written by me (except for the optional multimin.c/h files for the optimization) and some of the subroutines (such as the HEAVY model simulation) are based on the MATLAB code by Sheppard. Even though I fully tested and reproduced the results found in other experiments exploring HEAVY models, there still could be bugs in the code. I have not fully tested every aspect (especially the Bayesian estimation components, an ongoing effort) and if anyone would like to add, edit, test, or comment on any of the routines involved in either the c or java code, I’d be more than happy to welcome it.

HEAVY Modeling in iMetrica

The Java wrapper to the gnu-c heavy_model library was installed in the iMetrica software package and can be used for GUI-style modeling of high-frequency volatility. The HEAVY modeling environment is a feature of the BayesCronos module in iMetrica, which also features other stochastic models for capturing and forecasting volatility such as (E)GARCH, stochastic volatility, multivariate stochastic factor modeling, and ARIMA modeling, all using either standard (Q)MLE model estimation or a Bayesian estimation interface (with histograms showing the MCMC results of the parameter chains).

Modeling volatility with HEAVY models is done by first uploading the data into the BayesCronos module (shown in Figure 3) through the use of either the BayesCronos Menu (featured on the top panel) or by using the Data Control Panel (see my previous article on Data Control).

Figure 3: BayesCronos interface in iMetrica for HEAVY modeling.

In the BayesCronos control panel shown above, we estimate a HEAVY model for the uploaded data (600 observations of r_t, RM_t) that were simulated from a model with omega_1 = 0.05, omega_2 = 0.10, beta = 0.8, beta_R = 0.3, alpha = 0.02, alpha_R = 0.3 (the simulation was done in the Data Control Module).

The model type is selected in the panel under the Model combobox. The number of forecasting steps and forecasting samples (for the r_t variable) are selected in the Forecasting panel. Once those values are set, the model estimates are computed by pressing the “MLE” button in the bottom lower left corner. After the computing is done, all the available plots to analyze the HEAVY model are available by simply clicking the appropriate plotting checkboxes directly below the plotting canvas. This includes up to 5 forecasts, the original data, the filtered h_t, \mu_t values, the residuals/empirical distributions of the returns and realized measures, and the pointwise likelihood evaluations for each observation. To see the estimated parameter values, simply click the “Parameter Values” button in the “Model and Parameters” panel and a pop-up control panel will appear showing the estimated values for all the parameters.

Realized Measures in iMetrica

Figure 4: Computing Realized measures in iMetrica using a convenient realized measure control panel.

Importing and computing realized volatility measures in iMetrica is accomplished by using the control panel shown in Figure 4. With access to high frequency data, one simply types in the ticker symbol in the “Choose Instrument” box, sets the starting and ending date in the standard CCYY-MM-DD format, and then selects the kernel used for assembling the intraday measurements. The Time Scale sets the frequency of the data (seconds, minutes, hours) and the period scrollbar sets the alignment of the data. The Lags combo box determines the bandwidth of the kernel measuring the volatility. Once all the options have been set, clicking on the “Compute Realized Volatility” button will then produce three data sets for the time period given between the start date and end date: 1) the daily log-returns of the asset r_1, \ldots, r_T, 2) the log-price of the asset, and 3) the realized volatility measure RM_1, \ldots, RM_T. Once the Java-R highfrequency routine has finished computing the realized measures, the data sets are automatically available in the Data Control Module of iMetrica. From here, one can annualize the realized measures using the weight adjustments in the Data Control Module (see Figure 5). Once content with the weighting, the data can then be exported to the MDFA module or the BayesCronos module for estimating and forecasting the volatility of GOOG using HEAVY models.

Figure 5: The log-return data (blue) and the (annualized) realized measure data using 5 minute returns (pink) for Google from 1-1-2011 to 6-19-2012.

The Realized Measure uploading in iMetrica utilizes a fantastic R package for studying and working with high frequency financial data called highfrequency (Boudt, Cornelissen, and Payseur 2012). To handle the analysis of high frequency financial data in Java, I began by writing a Java wrapper to the functions of the highfrequency R package to enable the GUI interaction shown above, in order to download the data into Java and then iMetrica. The Java environment uses a library called RCaller that opens a live R kernel in the Java runtime environment, from which I can call R routines and directly load the data into Java. The initializing sequence looks like this.


caller.getRCode().addRCode("require (Runiversal)");
caller.getRCode().addRCode("require (FinancialInstrument)");
caller.getRCode().addRCode("require (highfrequency)");
caller.getRCode().addRCode("loadInstruments('/HighFreqDataDirectoryHere/Market/instruments.rda')");
caller.getRCode().addRCode("setSymbolLookup.FI('/HighFreqDataDirectoryHere/Market/sec',use_identifier='X.RIC',extension='RData')");

Here, I’m declaring the R packages that I will be using (first three lines) and then I declare where my high-frequency financial data symbol lookup directory is on my computer (next two lines). This then enables me to extract high-frequency tick data directly into Java. After loading in the desired instrument ticker symbol names, I then proceed to extract the daily log-returns for the given time frame, and then compute the realized measures of each asset using the rKernelCov function in the highfrequency R package. This looks something like

for (i = 0; i < n_assets; i++)
{
    /* Restrict each instrument's tick data to regular market hours */
    String mark = instrum[i] + "<-" + instrum[i] + "['T09:30/T16:00',]";
    caller.getRCode().addRCode(mark);

    /* Build the rKernelCov call with the user-defined kernel, lags, alignment, and period */
    String rv = "rv"+i+"<-rKernelCov("+instrum[i]+"$Trade.Price,kernel.type ="+kernels[kern]+", kernel.param="+lags+",kernel.dofadj = FALSE, align.by ="+frequency[freq]+", align.period="+period+", cts=TRUE, makeReturns=TRUE)";
    caller.getRCode().addRCode(rv);

    /* Name the result and convert it to a plain list that Java can read back */
    caller.getRCode().addRCode("names(rv"+i+")<-'rv"+i+"'");
    rvs[i] = "rv_list"+i;
    caller.getRCode().addRCode("rv_list"+i+"<-lapply(as.list(rv"+i+"), coredata)");
}

In the first line, I’m looping through all the asset symbols (I create Java strings to load into RCaller as commands). The second line effectively retrieves the data during market hours only (America/New_York time); then I create a string to call the rKernelCov function in R, passing all the user-defined parameters as strings as well. Finally, in the last two lines, I extract the results and put them into an R list from which the Java runtime environment will read.

Conclusion

In this article I discussed a recently introduced high-frequency based volatility model by Shephard and Sheppard and gave an introduction to three different high-performance tools beyond MATLAB and R that I’ve developed for analyzing these new stochastic models. The heavyModel c/java package that I made available for download gives a workable starting point for experimenting, in a fast and efficient framework, with the benefits of using high-frequency financial data, and most notably realized measures of volatility, to produce better forecasts. The package will continuously be updated with improvements in documentation, bug fixes, and overall presentation. Finally, the use of the R package highfrequency embedded in Java and then utilized in iMetrica gives a fully GUI experience for stochastic modeling of high-frequency financial data that is both conveniently easy to use and fast.

Happy Extracting and Volatilitizing!

Model comparison with data sweeps


A useful exercise in modeling economic time series is to perform a “sliding window” analysis of the data that computes models on subsets of the data and tests the robustness of signal extractions, forecasts, and parameter variance relative to a growing subset of the data. For instance, for a time series of length 300, one could estimate a model on a shorter subset of the data, say the first 200 observations, and then increase the number of observations, re-estimate, and see how the model parameter values change as the number of observations or data subset increases. One can also see how the signal extractions and forecasts change with additional data. Ideally, if the model is specified correctly for the data, there should be a very small variance in the estimated parameters as more data is added to the time series; this signifies the stability of the model selection. Normally, such an exercise would be tedious to carry out with X-13ARIMA-SEATS, or any other software such as MATLAB or R, as scripts or spec files would have to be written for each individual re-estimation and then re-plotted. In the uSimX13 module of iMetrica, however, this task has been rendered an easy one with the addition of a sliding windows tool. In this blog entry, we describe this so-called “sliding windows” process and show just how fast and seamless it is to perform model choice robustness checks and comparisons in iMetrica.

We begin by describing the sliding span/window tool in the iMetrica-uSimX13 module. Once time series data has been loaded into the uSimX13 module from either the uSimX13 main menu or imported from the Data Control module, the uSimX13 computation engine must first be turned on from the uSimX13 menu. Then to access the sliding windows interface,  simply click on the “Sliding Span/Window Activate” check box in the main uSimX13 menu (see Figure 1).

Figure 1. Main drop down menu for the uSimX13 module, showing the “Sliding Span/Window Activate” check box.

Once clicked, the entire plotting canvas will turn to a dark shade of blue, which indicates the windowed region in which model estimation occurs. To control the sliding window, place the mouse cursor along one of the edges of the canvas and slowly glide the mouse with the left-mouse button held down either left or right, depending on which edge of the plot canvas you are on. Moving to the left or right with the left mouse button held down, the windowed area will shrink or expand. The model parameters are estimated instantaneously as the window adjusts and in effect, all the available model statistics, diagnostics, signals, and forecasts are computed as well. For example, as the window expands or shrinks, the trend, seasonally adjusted data, and 24-step ahead forecasts can be plotted and viewed in real-time as the window changes (see Figure 2). One can also slide the window to the left or right by placing the mouse anywhere inside the blue-windowed region, holding down the left mouse button and moving along the time domain. This way, the window length will remain fixed, but the window center will move along different subsets of the data. This can be useful for seeing how model parameters can change within regions of data that exhibit regime changes, namely a sequence in the series that suddenly changes in seasonal or cyclical structure after a certain time observation. The data can now be modeled in both sections before and after the regime change occurs in order to compare the estimated parameter values.

Figure 2. The window sliding across different subsets of the data. The signal extractions, forecast, and model parameters are recomputed automatically as the window changes. Forecast comparisons with the real data as the window span moves is now trivial. Here, the plot in cyan represents the original time series data in-sample and the 24 step forecast out-of-sample, and the light green plot is the time series data adjusted for outliers, as indicated in the model box. One can select the plots using the “series components” plot box. The data in gray represents the time series data not used in the model estimation.

Data Sweep

With the ability to seamlessly capture partitions of the data and model within a given partition using the sliding window, a natural extension of this mouse-on-canvas utility is to employ it in comparing different models of the time series data. We call this method of model comparison time series data sweeping (or simply data sweeping), and it involves selecting an initial window of data from the first observation to the n-th observation, where n is some number much less than the total number of observations N in the data set (say, one third the amount). The data sweep then computes the sliding window from n as the final observation all the way to N, in increments of one (see Figure 3). At each addition to the length of the window, the forecast is computed for up to 24 steps ahead. Of course, since the true time series data is known in the out-of-sample region of computation, we can compute the forecast error for up to h \leq 24 steps ahead and sum up these errors as n increases to N. We can do this data sweep for several models, computing the aggregate forecast errors over time. The idea is that the best model for the data will ideally have the smallest forecast error, and thus comparing this forecast error across several models will identify the model with the best overall forecasting ability.
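
One natural way to write the aggregate error that the sweep accumulates for a horizon h is the following (a sketch; the exact error measure used internally, e.g. squared versus absolute errors, may differ):

E(h) = \sum_{m=n}^{N-h} \sum_{j=1}^{h} \left( \hat{y}_{m+j \mid m} - y_{m+j} \right)^2

where \hat{y}_{m+j \mid m} denotes the j-step-ahead forecast computed from the window ending at observation m.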

To access the data sweep, simply go to the main uSimX13 menu, shown in Figure 1, and click “Sweep Time Series Control Panel”. This brings up the main interface for the data sweep (shown in Figures 4-6). To begin the sweep, first select the model and regressors desired for the data inside the model selection panel of the main uSimX13 interface. Then choose the observation at which you’d like to begin the data sweep (observation n = 60 is the default). Lastly, select how many forecast steps you’d like to use in computing the forecast error (1-24). Once content with the settings, click the “Compute time series sweep” button and watch as the window span increases from n to N, recomputing parameters, signals, and forecasts at each step (see the slideshow at the top of the post). Once the sweep is complete, the parameter statistics, the mean Ljung-Box values at two different lags, and the total forecast error are displayed in the control panel. To compare this with another model, save the results of the sweep by clicking “save parameters” in the uSimX13 menu, then choose another model and recompute (using the same settings as the previous sweep, of course).

To give an example of this process, we begin by simulating a time series data set of length N = 300 from a SARIMA model of dimension (0,1,2)(0,1,1)_{12}, namely a seasonal auto-regressive integrated moving-average process with two non-seasonal moving-average parameters and one seasonal moving-average parameter. The data sweep is performed on the simulated data with a forecast error horizon of length 23 using three different SARIMA models: (a) (0,1,1)(0,1,1)_{12}, (b) (1,1,0)(0,1,1)_{12}, and (c) (0,1,2)(0,1,1)_{12}, the true model. Figures 4-6 below show the data sweep results, including the estimated parameter means and standard deviations, the average Ljung-Box statistics at lags 12 and 0, and the total forecast error for each model. Notice that the forecast error for the true model (c) (Figure 6) is the lowest, followed by model (b) (Figure 5) and then model (a) (Figure 4), which is exactly what we would want.
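For readers who want to set up a similar experiment outside iMetrica, the sketch below simulates a series of length N = 300 from a (0,1,2)(0,1,1)_{12} process directly with numpy; the moving-average parameter values, burn-in length, and seed are illustrative assumptions, not the values used to produce Figures 4-6.

```python
import numpy as np

def simulate_sarima_012_011_12(N=300, theta=(0.4, 0.2), Theta=0.6, seed=42):
    """Simulate y_t with (1-B)(1-B^12) y_t = (1 + th1 B + th2 B^2)(1 + TH B^12) e_t."""
    rng = np.random.default_rng(seed)
    burn = 60
    e = rng.normal(size=N + burn + 14)
    th1, th2 = theta
    # MA part of the doubly differenced series:
    # w_t = e_t + th1 e_{t-1} + th2 e_{t-2} + TH e_{t-12} + TH th1 e_{t-13} + TH th2 e_{t-14}
    w = (e[14:] + th1 * e[13:-1] + th2 * e[12:-2]
         + Theta * e[2:-12] + Theta * th1 * e[1:-13] + Theta * th2 * e[:-14])
    # Invert the regular and seasonal differencing:
    # y_t = y_{t-1} + y_{t-12} - y_{t-13} + w_t
    y = np.zeros(len(w))
    for t in range(13, len(y)):
        y[t] = y[t - 1] + y[t - 12] - y[t - 13] + w[t]
    return y[-N:]          # drop the burn-in transient

y_sim = simulate_sarima_012_011_12()
```

The simulated series can then be passed to an expanding-window comparison such as the sweep sketched earlier.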

Figure 4. Model (a) and the parameter statistics, forecast error, and data sweep controls.

Figure 5. Model (b) and the parameter statistics, forecast error, and data sweep controls.

Figure 6. Model (c) (the true model) and the parameter statistics, forecast error, and data sweep controls.

iMetrica and Hybridometrics: Introduction

The high-frequency Financial Trading interface of iMetrica. Easily construct in-sample trading strategies with an array of optimizers unique to iMetrica and then employ the strategies out-of-sample to test and fine-tune the trading performance.

This blog serves as an introduction and tutorial to Hybridometrics using iMetrica. Hybridometrics is a term used to describe the analysis, modeling, signal extraction, and forecasting of univariate and multivariate financial and economic time series data using a combination of model-based and non-model-based methodologies. Ideal combinations of computational paradigms and methodologies used in hybridometrics include, but are not limited to, traditional stochastic models such as (S)ARIMA models, GARCH models, and multivariate stochastic volatility models combined with empirical mode decomposition techniques and the multivariate direct filter approach (MDFA). The goal of hybridometric modeling is to obtain signal extractions and forecasts, for applications ranging from official and government use all the way to building high-frequency financial trading strategies, that perform better than those obtained using model-based or non-model-based methods alone. In other words, hybridometrics seeks to combine the advantages of different paradigms to outperform traditional approaches to time series modeling. The iMetrica software package offers the most versatile and computationally efficient portal to this newly proposed time series modeling paradigm, all while remaining surprisingly easy to use.

The iMetrica software package is a unique system of econometric and financial trading tools that focuses on speed, user interaction, visualization, and point-and-click simplicity in building models for time series data of all types. Written entirely in GNU C and Fortran with a rich interactive interface written in Java, iMetrica offers an abundance of econometric tools for signal extraction and forecasting in multivariate time series, all easily accessible with the click of a mouse and fast, with results computed and plotted instantaneously without the need to create output data files or call external plotting programs.

One powerful feature unique to the iMetrica software is the innate capability of easily combining both model-based and non-model-based methodologies when designing data forecasts, signal extraction filters, or high-frequency financial trading strategies. Furthermore, the strategies can be computed and tested both in-sample and out-of-sample using an easy-to-use built-in data partitioner that splits the data into an in-sample portion, where models and filters are computed, and an out-of-sample portion, where new data is applied to the in-sample strategy to test for robustness, overfitting, and many other desired properties. This gives the user complete liberty in creating a fast and efficient test bed for implementing signal extractions, forecasting regimes, or financial trading strategies.
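As a toy illustration of the in-sample/out-of-sample idea (and not of iMetrica's actual partitioner), the sketch below fits a simple polynomial trend on the in-sample portion and compares in-sample and out-of-sample errors; a sharp deterioration out-of-sample is the basic symptom of overfitting that such a partition is designed to expose. The split fraction, polynomial degree, and placeholder data are assumptions for illustration.

```python
import numpy as np

def partition_and_check(y, in_frac=0.7, degree=3):
    """Fit a polynomial trend in-sample and compare in- and out-of-sample MSE."""
    n_in = int(len(y) * in_frac)                           # in-/out-of-sample boundary
    t = np.arange(len(y))
    coeffs = np.polyfit(t[:n_in], y[:n_in], deg=degree)    # fit on in-sample data only
    fitted = np.polyval(coeffs, t)
    mse_in = np.mean((y[:n_in] - fitted[:n_in]) ** 2)
    mse_out = np.mean((y[n_in:] - fitted[n_in:]) ** 2)
    return mse_in, mse_out        # a much larger mse_out suggests an overfit in-sample model

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=200))                        # placeholder data
print(partition_and_check(y))
```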

The iMetrica software environment includes five interacting time series analysis modules for building hybrid forecasts, signal extractions, and trading strategies.

  • uSimX13 – A computational environment for univariate seasonal auto-regressive integrated moving-average (SARIMA) modeling and simulation using X-13ARIMA-SEATS. Features an interactive approach to modeling seasonal economic time series with SARIMA models, automatic outlier detection, and trading-day and holiday regressor effects. Also includes a suite of model comparison tools using both modern signal extraction diagnostics and goodness-of-fit statistics.
  • BayesCronos – An interactive time series module for signal extraction and forecasting of multivariate economic and financial time series, focusing on Bayesian computation and simulation. This module includes a multitude of models, including ARIMA, GARCH, EGARCH, Stochastic Volatility, Multivariate Factor Stochastic Volatility, Dynamic Factor, and Multivariate High-Frequency-Based Volatility (HEAVY) models, with more continuously being added. For most of the featured models, one can compute the model fit using either a Metropolis-Hastings Markov chain Monte Carlo approach (Bayesian) or a quasi-maximum-likelihood (QMLE) formulation for the model parameter estimates. Using a convenient model selection panel interface, one has seamless access to the model type, the model parameter dimensions, and the prior distribution parameters. In the case of Bayesian estimation, one has complete control over the prior distributions of the model parameters, and the module offers interactive visualization of the Markov chain Monte Carlo parameter samples. For each model, up to 10 sample 36-step-ahead forecasts can be produced and visualized instantaneously, along with other important model features such as model residuals, computed volatility, forecasted volatility, factor models, and more. The results can then easily be exported to other modules in iMetrica for additional filtering and/or modeling.
  • MDFA – An interactive interface to the most comprehensive multivariate real-time direct filter analysis and computation environment in the world. Build real-time filters using both I-MDFA and Zero-Pole Combination (ZPC) filter constructions. The module includes interactive access to timeliness, smoothing, and accuracy controls for filter customization along with parameters for filter regularization to control overfitting. More advanced features include an interface for building adaptive filters, and many controls for filter optimization, customization, data forecasting, and target filter construction.
  • State Space Modeling – A module for building observed component ARIMA and regression models for univariate economic time series. Similar to the uSimX13 module, the State Space Modeling environment focuses on modeling and forecasting economic time series data, but with much more generality than SARIMA models. An aggregation of stochastic components in the form of ARIMA models (for example, trend + seasonal + irregular) is stipulated for the time series data, and regression components modeling outliers, holiday, and trading-day effects are then added to the stochastic components, giving ultimate flexibility in model building. The module uses regCMPNT, a suite of Fortran code written at the US Census Bureau, for the maximum likelihood and Kalman filter computational routines.
  • EMD – The EMD module offers an environment for the time-frequency analysis of time series data. The module includes both the original empirical mode decomposition technique of Huang et al., based on cubic splines, and an adaptive approach using reproducing kernels and direct filtering. This empirical decomposition technique decomposes nonlinear and nonstationary time series into amplitude-modulated and frequency-modulated (AM-FM) components and then computes the intrinsic phase and instantaneous frequency from the FM components (a minimal sketch of this pipeline is given after this list). All plots of the components as well as the time-frequency heat maps are generated instantaneously.
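The following is a minimal sketch of that AM-FM pipeline in Python, using the third-party PyEMD (package EMD-signal) and SciPy as stand-ins for iMetrica's EMD engine; the toy chirp signal, sampling rate, and noise level are assumptions for illustration.

```python
import numpy as np
from scipy.signal import hilbert
from PyEMD import EMD            # third-party package "EMD-signal" (assumption)

# Toy nonstationary signal: a chirp plus a slow oscillation and noise
fs = 200.0
t = np.arange(0, 10, 1 / fs)
s = np.sin(2 * np.pi * (2 + 0.5 * t) * t) + 0.5 * np.sin(2 * np.pi * 0.3 * t)
s += 0.1 * np.random.default_rng(0).normal(size=t.size)

imfs = EMD().emd(s)              # intrinsic mode functions (AM-FM components)

for k, imf in enumerate(imfs):
    analytic = hilbert(imf)                        # analytic signal of the component
    phase = np.unwrap(np.angle(analytic))          # intrinsic phase
    inst_freq = np.diff(phase) * fs / (2 * np.pi)  # instantaneous frequency in Hz
    print(f"IMF {k}: mean instantaneous frequency {inst_freq.mean():.2f} Hz")
```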

Along with these modules, there is also a data control module that handles all aspects of time series data input and export. Within this main data control hub, one can import multivariate time series data from a multitude of file formats, as well as download financial time series data directly from Yahoo! Finance or another source, such as Reuters, for higher-frequency financial data. Once the data is loaded, it can be normalized, scaled, demeaned, and/or log-transformed with simple slider and button controls, with the effects plotted on the graphic canvas instantaneously.

Another great feature of the iMetrica software is the ability to learn more about time series modeling through the use of data simulators. The data control module includes an array of data simulation panels for simulating data from a multitude of both univariate and multivariate time series models. With control over the number of observations, the random seed for the innovation process, the innovation process distribution, and the model parameters, simulated data can be constructed for any type of economic or financial time series imaginable. The different types of models include (S)ARIMA models, GARCH models, correlated cycle models, trend models, multivariate factor stochastic volatility models, and HEAVY models. By simulating data and toggling the parameters, one can instantly visualize the effect of each parameter on the simulated data. The data can then be exported to any of the modules for practicing and honing one’s skills in hybrid modeling, signal extraction, and forecasting.
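As an analogue of what the simulation panels expose, here is a small sketch of a GARCH(1,1) simulator in which the number of observations, random seed, innovation distribution, and model parameters are explicit inputs; the function name and the parameter values are illustrative assumptions, not iMetrica defaults.

```python
import numpy as np

def simulate_garch11(n_obs=500, omega=0.1, alpha=0.1, beta=0.85,
                     seed=7, dist="normal"):
    """Simulate r_t = sigma_t * z_t with
    sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    z = rng.standard_t(df=5, size=n_obs) if dist == "t" else rng.normal(size=n_obs)
    r = np.zeros(n_obs)
    sigma2 = np.full(n_obs, omega / (1.0 - alpha - beta))  # start at the unconditional variance
    r[0] = np.sqrt(sigma2[0]) * z[0]
    for t in range(1, n_obs):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
        r[t] = np.sqrt(sigma2[t]) * z[t]
    return r, sigma2

returns, variance = simulate_garch11(n_obs=1000, dist="t")   # heavier-tailed innovations
```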

Keep visiting this blog for continuous updates, tutorials, and proposals in the field of econometrics, signal extraction, forecasting, and high-frequency financial trading using hybridometrics and iMetrica.