Data for the competition were provided by Research and Development Centre EMAG. They come from an active Polish coal mine. The task is to build a prediction model that can be effectively applied to foresee warning levels of methane concentration at three methane meters placed in a longwall of the mine. The data files include the mining scheme used in the mine, which shows the placement of all sensors used for monitoring the mining process. The sensors are described in a separate file. The cutter loader moves along the longwall between the sensors MM262 and MM264. The higher the currents drawn by the cutter loader, the more efficient its mining work, which in theory results in more methane being emitted into the air. The arrows on the provided scheme show the directions of air flow (and with it methane flow) in the roads. If the methane concentration measured by any of the sensors reaches the alarm level, the cutter loader is switched off automatically. However, if the warning methane concentrations could be predicted ahead of time, the speed of the cutter loader could be reduced to give the methane more time to disperse before it becomes necessary to switch off the whole production line.

Data format: The time series data sets for this competition are provided in a tabular format. For the convenience of participants, the training data set was divided into five smaller chunks, namely trainingData1.csv, ..., trainingData5.csv. Those files were compressed into a single archive, trainingData.7z, which can be downloaded from the Data files section after successful enrollment in the competition. In total, the files contain sensor readings for 51,700 time periods, each 10 minutes long, with measurements taken every second (600 values for every sensor in a single series). Values for each time period are stored in a separate row of the data. The data include readings from 28 different sensors; thus, every row consists of 16,800 values stored in consecutive columns and separated by commas. Names of the data columns, which make it possible to identify the sensors, are provided in a separate file, namely column_names.txt. Descriptions of the types of sensors used for monitoring the mining process are given in sensor_descriptions.txt, and their placement in the corridors of the mine is indicated on the provided mining process scheme (mining_process_scheme.png). The time periods in the training data overlap and are given in chronological order.
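As an illustration of the row layout described above, the following minimal sketch splits one comma-separated row of 16,800 values into 28 series of 600 readings. It assumes that the columns are grouped sensor by sensor (all 600 readings of the first sensor, then all 600 of the second, and so on); this ordering is an assumption and should be checked against column_names.txt. The function names split_row and read_rows are hypothetical, and the example runs on synthetic values rather than on the competition files.

```python
import csv
import io

N_SENSORS = 28      # sensors per row, as stated in the data description
SERIES_LEN = 600    # one reading per second over 10 minutes


def split_row(values, n_sensors=N_SENSORS, series_len=SERIES_LEN):
    """Split a flat row into one list of readings per sensor.

    ASSUMPTION: columns are grouped sensor by sensor. Verify this
    against column_names.txt before relying on it.
    """
    if len(values) != n_sensors * series_len:
        raise ValueError("unexpected number of values in row: %d" % len(values))
    return [values[i * series_len:(i + 1) * series_len] for i in range(n_sensors)]


def read_rows(handle):
    """Yield the per-sensor series for every non-empty row of a CSV stream."""
    for record in csv.reader(handle):
        if record:
            yield split_row([float(x) for x in record])


# Synthetic row standing in for one line of trainingData1.csv.
synthetic_row = ",".join(str(float(i)) for i in range(N_SENSORS * SERIES_LEN))
series = next(read_rows(io.StringIO(synthetic_row + "\n")))
print(len(series), len(series[0]))
```

Reading the files row by row in this way avoids loading a whole chunk into memory at once, which may matter given the width of each row.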

Labels in the data indicate whether a warning threshold has been reached in the period between three and six minutes after the end of the training period, for three methane meters: MM263, MM264 and MM256. In particular, if a given row corresponds to a period between $$t_{-599}$$ and $$t_{0}$$, then the label for a methane meter MM in this row is 'warning' if and only if $$\max(MM(t_{181}), ..., MM(t_{360})) \geq 1.0$$. The labels for the training data are provided separately, in trainingLabels.7z. The test data file, testData.7z, is in the same format as the training data set; however, the labels for the test series are hidden from participants. It is important to note that the time periods in the test data do not overlap and are given in a random order.
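The labelling rule can be written down directly. The sketch below is only an illustration of the condition stated above; the function name is_warning and the convention that future_readings[k - 1] holds the reading at $$t_{k}$$ are assumptions introduced for this example.

```python
def is_warning(future_readings, threshold=1.0, start=181, end=360):
    """Return True if the maximum reading over t_start..t_end reaches the threshold.

    ASSUMPTION: future_readings[k - 1] is the methane meter reading at t_k,
    i.e. k seconds after the end of the 10-minute input period.
    """
    window = future_readings[start - 1:end]
    if len(window) != end - start + 1:
        raise ValueError("not enough future readings to cover the label window")
    return max(window) >= threshold


# Synthetic readings: six minutes of low methane concentration.
readings = [0.2] * 360
print(is_warning(readings))   # no reading reaches 1.0

readings[180] = 1.0           # the reading at t_181, first second of the window
print(is_warning(readings))
```

Note that a high reading at $$t_{180}$$ or earlier does not affect the label, since it falls outside the window.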

Format of submissions: The participants of the competition are asked to predict the likelihood of the label 'warning' for each time series from the test set and to send us their solutions using the submission system. Each solution should be sent in a single text file containing exactly 5,076 lines (files with an additional empty last line will also be accepted). Each consecutive line of this file should contain exactly three real numbers, one for each target methane meter, separated by commas. The values do not need to be in a particular range; however, higher numerical values should indicate a higher chance of the label 'warning'.
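A submission file in this format could be produced along the following lines. This is a minimal sketch: the function name format_submission is hypothetical, and the column order MM263, MM264, MM256 is an assumption taken from the order in which the target sensors are listed in this description.

```python
def format_submission(scores, expected_lines=5076):
    """Render prediction scores as submission text, one test series per line.

    ASSUMPTION: each item of `scores` holds three numbers in the order
    MM263, MM264, MM256.
    """
    if len(scores) != expected_lines:
        raise ValueError("expected %d rows, got %d" % (expected_lines, len(scores)))
    lines = []
    for row in scores:
        if len(row) != 3:
            raise ValueError("each row needs exactly three values")
        lines.append(",".join(repr(float(v)) for v in row))
    return "\n".join(lines) + "\n"


# Synthetic scores standing in for real model output.
submission_text = format_submission([(0.1, 0.2, 0.3)] * 5076)
print(submission_text.splitlines()[0])
```

The returned text can then be written to a file with the usual file-writing calls.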

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary score will be computed on a random subset of the test set, fixed for all participants. It will correspond to approximately 20% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during a special session devoted to this competition, which will be organized at the IJCRS'15 conference (http://kbigdata.or.kr/IJCRS2015/).

The assessment of solutions will be done using the Area Under the ROC Curve (AUC) measure. It will be computed separately for each of the three target sensors. The final score in the competition will correspond to the average AUC for those three sets of predictions. Namely, if for a submitted solution $$s$$ we denote by: $$\begin{array}{ccl} AUC_{MM263}(s) & - & \textrm{AUC of predictions for the sensor MM263}, \\ AUC_{MM264}(s) & - & \textrm{AUC of predictions for the sensor MM264}, \\ AUC_{MM256}(s) & - & \textrm{AUC of predictions for the sensor MM256}, \end{array}$$ then the final score in the competition for the solution $$s$$ will be computed as: $$score(s) = \left(AUC_{MM263}(s) + AUC_{MM264}(s) + AUC_{MM256}(s)\right)/3.$$
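For local validation, the score defined above can be approximated as follows. This is a sketch and not the organizers' evaluation code: it computes AUC as the fraction of (warning, non-warning) pairs that are ranked correctly, counting ties as one half, which is equivalent to the area under the ROC curve. The function names auc and competition_score and the toy labels and scores are assumptions made for the example.

```python
def auc(labels, scores):
    """AUC as the probability that a positive is scored above a negative.

    Uses a simple pairwise comparison (quadratic time), which is adequate
    for illustration; ties between a positive and a negative count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("AUC is undefined without both classes present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def competition_score(labels_by_sensor, scores_by_sensor):
    """Average the per-sensor AUC over the three target methane meters."""
    sensors = ("MM263", "MM264", "MM256")
    return sum(auc(labels_by_sensor[m], scores_by_sensor[m]) for m in sensors) / 3


# Toy data: 1 marks 'warning', 0 marks any other label.
toy_scores = [0.9, 0.1, 0.8, 0.3]
toy_labels = {
    "MM263": [1, 0, 1, 0],
    "MM264": [1, 0, 1, 0],
    "MM256": [1, 1, 0, 0],
}
score = competition_score(toy_labels, {m: toy_scores for m in toy_labels})
print(round(score, 4))
```

Because AUC depends only on the ordering of the scores, any monotone rescaling of a sensor's predictions leaves its AUC unchanged, which is why the submitted values need not lie in a particular range.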

The baseline solution: We prepared an example solution as a reference for participants. It is displayed on the leaderboard as the baseline_solution score. This solution was obtained using two popular algorithms derived from the theory of rough sets. Namely, a discretization method based on the maximum discernibility heuristic [3] was used in combination with the LEM2 algorithm [4] for decision rule induction. Both algorithms were implemented in the RoughSets package for the R system [5].