Data description

The objective of the competition is to verify whether experts’ recommendations can serve as a reliable basis for making informed decisions regarding investments in a stock market. The task for participants is to devise an algorithm that accurately predicts the return-rate class of an investment in a stock over the next three months, using only tips given by financial experts. All the recommendations in the data were provided by Tipranks – a hub for major analysts, hedge fund managers, bloggers and financial reporters who bring the most accurate and accountable financial advice to the general public.

Data description and format: The data sets for this competition are provided in a tabular format. The training data set, namely ismis17_trainingData.csv, contains in consecutive lines 12,234 records that correspond to recommendations for stock symbols at different points in time. These time points will be referred to as decision dates. Each data record is composed of three columns, separated by semicolons. The first column gives an internal identifier of a stock symbol (true symbols are hidden). The second column of a record stores an ordered list of recommendations issued by experts for a given stock during the two months before the decision date. The third column gives information about the true return class of the stock, computed over the period of three months after the decision date. It may take one of three values: ‘Buy’, ‘Hold’ or ‘Sell’, which correspond to considerably positive, close to zero, and considerably negative returns, respectively.

In each record, the list of recommendations consists of one or more tips from financial experts. Any single recommendation is expressed using four values and put between ‘{}’ brackets. The first value is an identifier of an expert. The second value gives the class of the stock predicted by the expert (‘Buy’, ‘Hold’ or ‘Sell’), and the third value expresses the expert’s expectations regarding the future return rate of the stock. It needs to be stressed that information regarding the expected return rates may sometimes be inconsistent and is generally less reliable than the predicted rating, due to differing interpretations of stock quotes by experts (e.g. not considering splits and/or dividends). Moreover, some experts do not share their expectations about the returns. Such situations are denoted by NA values in the data. The fourth value in each recommendation quantifies the time distance to the decision date (in days), e.g. if this value is 5, it means that the recommendation was published five days before the decision date. The list of recommendations in each record is sorted by the time distances, thus it can be regarded as a time series.
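The record layout above can be parsed with a few lines of Python. Note that the exact separator between the four values inside the ‘{}’ brackets is not specified here; the sketch below assumes commas, and the sample record is invented for illustration.

```python
import re

# A hypothetical training record in the described format:
# stock id ; ordered list of {expert, class, expected return, days} ; true class
record = "stock_42;{expert_7,Buy,12.5,5}{expert_19,Hold,NA,33};Buy"

stock_id, tips_field, true_class = record.split(";")

tips = []
for chunk in re.findall(r"\{([^}]*)\}", tips_field):
    expert, pred, expected_return, days = chunk.split(",")
    tips.append({
        "expert": expert,
        "prediction": pred,
        # NA marks experts who did not share a return expectation
        "expected_return": None if expected_return == "NA" else float(expected_return),
        "days_before_decision": int(days),
    })
```

Since the list is already sorted by time distance, `tips` can be fed directly into time-series-oriented models.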

In order to additionally enrich the competition data, we provide a table that groups experts by companies for which they work (the file named company_expert.csv). In total, the data consist of recommendations from 2,832 experts who are employed in 228 different financial institutions.

The test data file, namely ismis17_testData.csv, consists of 7,555 records. It has the same format as the training data, however, it does not contain the third column with true return classes. The task for participants is to predict the labels for the test cases. It is important to note that the training and test data sets correspond to different time periods and the records in both sets are given in a random order.

The format of submissions: The participants of the competition are asked to predict return classes of the records from the test set and send us their predictions using the submission system. Each solution should be sent in a single text file containing exactly 7,555 lines (files with an additional empty last line will also be accepted). Each consecutive line of this file should contain exactly one class label from the set {‘Buy’, ‘Hold’, ‘Sell’}. Solutions containing any other labels, or with a different number of lines, will be rejected with an evaluation error.
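A valid submission file can be produced directly from a list of predicted labels. The sketch below uses a placeholder three-element list; a real submission must contain exactly 7,555 entries, and the output file name is an assumption.

```python
# Placeholder predictions from a participant's model; a real submission
# needs exactly 7,555 entries, one per test record, in the test-file order.
predictions = ["Buy", "Hold", "Sell"]

# Guard against labels that would make the submission fail validation.
assert all(p in {"Buy", "Hold", "Sell"} for p in predictions)

with open("submission.csv", "w") as f:
    f.write("\n".join(predictions) + "\n")  # trailing newline is accepted
```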

Evaluation of results: The submitted solutions will be evaluated on-line and preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a subset of the test set consisting of 1,000 records, fixed for all participants. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. Moreover, in order to claim the awards, winners will have to provide source codes that allow reproducing their final solution (in any programming language). All the winners will be officially announced during a special session devoted to this competition, which will be organized at the ISMIS’17 conference.

The assessment of solutions will be done using the accuracy (ACC) measure with an additional cost/reward matrix. For a confusion matrix X, obtained from a vector of predictions preds, and the cost matrix C displayed below, the accuracy is computed as: \[ACC(preds) = \frac{\sum_{i=1}^{3} X_{i,i} \cdot C_{i,i}}{\sum_{i=1}^{3}\sum_{j=1}^{3} X_{i,j} \cdot C_{i,j}}.\]

The cost matrix C used for evaluation of submissions (rows correspond to predictions, columns to true classes):

preds \ truth   Buy   Hold   Sell
Buy              8     4      8
Hold             1     1      1
Sell             8     4      8
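The cost-weighted accuracy above can be sketched in a few lines of Python, following the formula and the cost matrix directly (the function and variable names are illustrative, not part of the official evaluation code):

```python
LABELS = ["Buy", "Hold", "Sell"]

# Cost matrix C: rows index the prediction, columns the true class.
C = [[8, 4, 8],
     [1, 1, 1],
     [8, 4, 8]]

def acc(preds, truths):
    # Build the confusion matrix X: X[i][j] counts records predicted
    # as LABELS[i] whose true class is LABELS[j].
    X = [[0] * 3 for _ in range(3)]
    for p, t in zip(preds, truths):
        X[LABELS.index(p)][LABELS.index(t)] += 1
    # Numerator rewards correct predictions; denominator weights all cells.
    num = sum(X[i][i] * C[i][i] for i in range(3))
    den = sum(X[i][j] * C[i][j] for i in range(3) for j in range(3))
    return num / den
```

Note the effect of the weights: misclassifying a ‘Buy’ or ‘Sell’ stock is far more costly than a wrong ‘Hold’, so a constant ‘Hold’ strategy does not trivially score well.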

For convenience of participants, we provide an exemplary solution file, exemplary_solution.csv, as a reference.

Last modified: Tuesday, 22 November 2016, 7:20 PM