Task description

Our team has obtained a data set containing nearly 260,000 reports from the EWID system, which correspond to actions carried out by the Polish State Fire Service within the city of Warsaw and its surroundings in the years 1992-2011. We preprocessed a subset of this data and transformed it into a table in which each report is described by over 10,000 attributes. Additionally, we have distinguished three target attributes that indicate whether the described incident involved casualties among firefighters, children or other involved people, respectively. The task in this competition is to identify attributes that can be used to robustly assign the reports to the corresponding decision labels. Solutions submitted by participants will be assessed by measuring the quality of a classification obtained using a simple classifier ensemble constructed from the indicated sets of attributes. We hope that participants will come up with solutions that improve our understanding of the risk factors associated with various types of accidents.

Data format: The training data set is provided in two formats. The first is a traditional tabular representation, a comma-separated values file named trainingData.csv. Each row of this file represents a single EWID report, and its consecutive columns contain the values of the report's characteristics. The attributes in this table fall into two groups. The first group contains features extracted from the quantitative part of the report, and the second corresponds to a document-term matrix obtained from the natural language description sections. In total, the training data available to participants describe 50,000 incident reports by 11,852 attributes. All the conditional attributes are discrete and only a few have more than two possible values. For the convenience of participants, the same data set is also available in a sparse matrix format as an EAV file named trainingData.eav. Every row of this file contains exactly three integers: an object identifier, an attribute identifier and the corresponding value.

Each report is also assigned values of three binary decision attributes. These values for the training data are stored in the file decisionLabels.csv, which is available to all participants. The first decision attribute indicates incidents in which a firefighter or a member of the rescue team was seriously injured or killed. The second indicates cases in which there were children among the injured, and the third identifies situations in which civilians were hurt.

The nature of the problem implies that the provided data set is high-dimensional: the total number of conditional attributes equals the number of distinct words in the textual part of the reports (after lemmatization) plus several hundred attributes from the quantitative part of the reports. The data is also sparse, since only a small fraction of the attributes have a non-zero value for any particular report. In addition, all three decision attributes are highly imbalanced, since the positive classes correspond to relatively rare events.

There is also a separate test data set, which will be used to evaluate submissions. It has characteristics similar to the training data, but it will not be made available to participants of the competition.
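As an illustration of the EAV layout described above, the following minimal Python sketch collects (object, attribute, value) rows into a nested dictionary. The function name `parse_eav`, the sample rows, and the separator handling are assumptions made for this example: the description only states that each row holds three integers, so the sketch accepts commas or whitespace and assumes the file has no header row.

```python
import re
from collections import defaultdict


def parse_eav(lines):
    """Build {object_id: {attribute_id: value}} from EAV rows.

    Attributes that never appear for an object are implicitly zero,
    which is what makes this representation sparse.
    """
    table = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: integers are separated by commas and/or whitespace.
        fields = [f for f in re.split(r"[,\s]+", line) if f]
        if len(fields) != 3:
            raise ValueError(f"expected 3 integers, got: {line!r}")
        obj_id, attr_id, value = map(int, fields)
        table[obj_id][attr_id] = value
    return dict(table)


# Hypothetical rows, not taken from trainingData.eav.
sample_rows = ["1,5,1", "1,17,2", "2,5,1"]
reports = parse_eav(sample_rows)
```

For the real file, the same function could be fed an open file handle, for example `parse_eav(open("trainingData.eav"))`, provided the assumptions about separators and the missing header hold.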

Format of submissions: Participants are asked to indicate sets of attributes that allow the incidents to be classified accurately and to send us their solutions using the submission system. Each solution should be sent as a single text file containing exactly ten lines. Each line should contain at least three integers indicating attributes from the training data set, separated by commas and without any spaces. There is no upper limit on the number of attributes indicated in a single line; however, the evaluation system will penalize solutions that use a large number of features.
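The formatting rules above can be checked before uploading. The sketch below is one possible way to do so in Python; the function name `format_submission` and the attribute numbers in the example are hypothetical, and the sketch makes no assumption about whether attribute indices start at 0 or 1.

```python
def format_submission(attribute_sets):
    """Return the text of a submission file for ten attribute sets.

    Enforces the stated rules: exactly ten lines, at least three
    integers per line, comma-separated, no spaces.
    """
    if len(attribute_sets) != 10:
        raise ValueError("a submission must contain exactly ten lines")
    lines = []
    for attrs in attribute_sets:
        if len(attrs) < 3:
            raise ValueError("each line needs at least three attributes")
        lines.append(",".join(str(int(a)) for a in attrs))
    return "\n".join(lines) + "\n"


# Hypothetical attribute indices, for illustration only.
example_sets = [[1, 2, 3]] * 9 + [[4, 5, 6, 7]]
submission_text = format_submission(example_sets)
```

The returned string can then be written to a text file and uploaded through the submission system.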

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants, which corresponds to approximately 10% of the test data. The final evaluation will be performed after the competition ends, using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams that submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during the CEIM'14 workshop (https://fedcsis.org/ceim) at the FedCSIS'14 conference.

The quality of the submissions will be assessed by measuring the performance of a classifier ensemble composed of Naive Bayes models. Those models will be constructed using the attribute sets indicated in the submitted solution, separately for each decision attribute. The output of the ensemble will be computed by averaging the probabilities of the positive class returned by the individual Naive Bayes models. All training data will be used to construct the models, and testing will be performed on a separate data set that is not available to participants. The performance of the ensemble will be measured by taking the average Area Under the ROC Curve (AUC) over the probability predictions for each decision attribute, decreased by a penalty for using a large number of conditional attributes. Namely, if we denote by:

$$ \begin{array}{ccl} s & - & \textrm{a submitted solution}, \\ |s| & - & \textrm{a total number of attributes used in the solution (with repetitions)}, \\ AUC_i(s) & - & \textrm{Area Under the ROC Curve (AUC) of a classifier ensemble for the i-th decision attribute}, \end{array} $$

then the quality measure used for the assessment of submissions can be expressed as:

\[score(s) = F \left(\frac{1}{3}\sum\limits_{i = 1}^3 AUC_i(s) - penalty(s)\right)\]

where the penalty is equal to:

\[penalty(s) = \left(\frac{|s| - 30}{1000}\right)^2\]

and the function F is defined as: $$F(x) = \begin{cases} x & \textrm{for } x > 0, \\ 0 & \textrm{otherwise}. \end{cases}$$
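To make the scoring concrete, here is a small Python sketch of the averaging step, a rank-based AUC, and the score formula above. It is not the organizers' evaluation code: the function names and all numbers are made up for illustration, and the AUC is computed by plain pairwise comparison, which is only practical for small samples. Since a valid submission has ten lines with at least three attributes each, |s| is at least 30, so the penalty is zero only for the smallest possible solution.

```python
def ensemble_average(model_probs):
    """Average positive-class probabilities over models, per report."""
    n_models = len(model_probs)
    return [sum(column) / n_models for column in zip(*model_probs)]


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs both classes")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def penalty(num_attributes):
    """penalty(s) = ((|s| - 30) / 1000)^2, with |s| counted with repetitions."""
    return ((num_attributes - 30) / 1000) ** 2


def score(aucs, num_attributes):
    """score(s) = F(mean AUC - penalty(s)), where F clips negatives to 0."""
    mean_auc = sum(aucs) / len(aucs)
    return max(0.0, mean_auc - penalty(num_attributes))


# Made-up probabilities from two models for four reports.
labels = [0, 0, 1, 1]
probs = ensemble_average([[0.1, 0.4, 0.35, 0.8], [0.3, 0.2, 0.45, 0.6]])
toy_auc = auc(labels, probs)

# Made-up AUC values for the three decision attributes, |s| = 130.
example_score = score([0.90, 0.85, 0.95], 130)
```

In the last line the mean AUC is 0.90 and the penalty is (100/1000)^2 = 0.01, so the score is about 0.89. The same 130 attributes with a mean AUC below 0.01 would be clipped to 0 by F.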

An exemplary solution: We prepared a simple solution as an example of a correctly formatted submission file. It is available here. The attributes in the example were selected based on their correlation with the decisions. The preliminary evaluation score of this solution is 0.9119; it is displayed on the leaderboard as the baseline_solution score. In case of any questions, please post on the forum or write us an email: AAIA14Contest@mimuw.edu.pl

Last modified: Tuesday, 5 May 2015, 2:48 PM