Task description

The data set for this competition was provided by FPT Group - the leading information and communication technology enterprise in Vietnam. It was obtained from simulations of product viewing activities for users with known gender, and it closely follows the real-life distribution in that regard. The task in this competition is to reconstruct information about a user’s gender from product viewing logs. The most successful method, if it turns out to be sufficiently accurate, will provide a guideline for real-life applications in e-Commerce big data analytics.

Data format: The data for participants were divided into separate training and test sets - trainingData.csv and testData.csv, respectively. Each of these files contains 15,000 records which correspond to product viewing logs. A single log is composed of four columns, separated by commas. The first column is a session ID. The second and third columns are the session start time and session end time, respectively. The last column contains a list of the product IDs viewed during the session, in the order in which they were viewed. Consecutive product IDs are separated by semicolons. A third file, trainingLabels.csv, contains labels identifying the true gender of the users whose sessions are described in the training data set.
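
As an illustration of the record layout described above, the following Python sketch splits one log line into its four columns. The sample line, its timestamp format, and the names SessionLog and parse_log_line are assumptions made for this example only; they are not taken from the competition files.

```python
from typing import List, NamedTuple


class SessionLog(NamedTuple):
    """One product viewing log (hypothetical container for this sketch)."""
    session_id: str
    start_time: str   # kept as text; the timestamp format is not specified here
    end_time: str
    product_ids: List[str]


def parse_log_line(line: str) -> SessionLog:
    # Split on the first three commas only, so the record always yields
    # exactly four columns: ID, start time, end time, product list.
    session_id, start_time, end_time, products = line.rstrip("\n").split(",", 3)
    # Product IDs are separated by semicolons; viewing order is preserved.
    product_ids = [p for p in products.split(";") if p]
    return SessionLog(session_id, start_time, end_time, product_ids)


# Made-up sample record, for illustration only.
example = "s1,2014-11-14 00:02:14,2014-11-14 00:02:20,A1/B1/C1/D1;A1/B2/C3/D4"
log = parse_log_line(example)
print(log.session_id, len(log.product_ids))
```

The timestamps are left as strings on purpose, since their exact format should be checked against the real files before parsing them into date objects.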

Since the distribution of unique product IDs in the data is very sparse, the IDs contain additional information about the product category hierarchy. Each product ID can be decomposed into four different IDs, which are separated by slashes. The IDs starting with the letter ‘A’ are the most general categories, and those starting with ‘D’ correspond to individual products. The IDs starting with ‘B’ and ‘C’ are associated with subcategories and sub-subcategories, respectively.
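
The decomposition of a product ID into its hierarchy levels can be sketched in Python as follows. The function name split_product_id and the sample ID are hypothetical, and the sketch assumes only what is stated above: four slash-separated parts whose first letters are ‘A’ to ‘D’.

```python
from typing import Dict


def split_product_id(product_id: str) -> Dict[str, str]:
    """Map each hierarchy level letter ('A'..'D') to its component ID."""
    levels = {}
    for part in product_id.split("/"):
        if part:  # tolerate a stray leading or trailing slash
            levels[part[0]] = part
    return levels


# Made-up product ID, for illustration only.
levels = split_product_id("A00002/B00003/C00007/D28437")
print(levels["A"], levels["D"])
```

Because individual products (level ‘D’) are sparse, the more general levels ‘A’ to ‘C’ obtained this way are natural candidates for aggregated features.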

Format of submissions: The participants of the competition are asked to predict the gender of users from the test data and send us their solutions using the submission system. Each solution should be sent in a single text file containing exactly 15,000 lines. The format of submitted files should follow the format of trainingLabels.csv. Each consecutive line of the file should contain a single label identifying the gender of the user who generated the corresponding session log in the test set.
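
A minimal Python sketch of producing such a file is shown below. The label strings "male" and "female" and the helper name format_submission are assumptions made for this example; the actual label spelling should be copied from trainingLabels.csv.

```python
from typing import Iterable

EXPECTED_LINES = 15_000  # number of records in testData.csv


def format_submission(labels: Iterable[str],
                      expected_lines: int = EXPECTED_LINES) -> str:
    """Return the submission text: one label per line, in test-set order."""
    labels = list(labels)
    if len(labels) != expected_lines:
        raise ValueError(
            f"expected {expected_lines} labels, got {len(labels)}"
        )
    return "\n".join(labels) + "\n"


# Tiny demonstration with two lines instead of 15,000.
text = format_submission(["male", "female"], expected_lines=2)
print(text, end="")
```

The returned string can then be written to a text file with the standard open(path, "w") call; the length check guards against a submission with a missing or extra line.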

Evaluation of results: The submitted solutions will be evaluated on-line and the preliminary results will be published on the competition leaderboard. The preliminary results will be computed on approximately 20% of the test data. The final evaluation will be performed after completion of the competition using the remaining part of the test data. Those results will also be published on-line. It is important to note that only teams which submit a short report describing their approach before the end of the contest will qualify for the final evaluation. The winning teams will be officially announced during a special session at the PAKDD'15 conference (http://www.pakdd2015.jvn.edu.vn/).

Since the distribution of labels in the data is not balanced, solutions will be assessed using the balanced accuracy measure, which is defined as the average of the accuracies within the decision classes. Namely, for a vector of predictions preds and a vector of true gender labels genders, we define the balanced accuracy as:
\[ACC_{m}(preds, genders) = \frac{|\{j : preds_{j} = genders_{j} = \text{male}\}|}{|\{j : genders_{j} = \text{male}\}|}\]
\[ACC_{f}(preds, genders) = \frac{|\{j : preds_{j} = genders_{j} = \text{female}\}|}{|\{j : genders_{j} = \text{female}\}|}\]
\[BAC(preds, genders) = \left(ACC_{f}(preds, genders) + ACC_{m}(preds, genders)\right)/2\]
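
For local validation, the measure defined above can be computed with a few lines of Python. This is a sketch of the definition as stated, not the organizers' evaluation code; the label strings "male" and "female" and the toy vectors are assumptions made for the example.

```python
from typing import Sequence


def balanced_accuracy(preds: Sequence[str], genders: Sequence[str]) -> float:
    """Average of the per-class accuracies ACC_f and ACC_m."""
    if len(preds) != len(genders):
        raise ValueError("preds and genders must have the same length")

    def class_accuracy(label: str) -> float:
        # Denominator: number of users whose true gender is `label`.
        total = sum(1 for g in genders if g == label)
        if total == 0:
            raise ValueError(f"no true labels equal to {label!r}")
        # Numerator: users of that class that were predicted correctly.
        correct = sum(1 for p, g in zip(preds, genders) if p == g == label)
        return correct / total

    return (class_accuracy("female") + class_accuracy("male")) / 2


# Toy example: 2 male and 4 female users, one male misclassified.
genders = ["male", "male", "female", "female", "female", "female"]
preds = ["male", "female", "female", "female", "female", "female"]
bac = balanced_accuracy(preds, genders)
print(bac)  # ACC_m = 1/2, ACC_f = 4/4, so BAC = 0.75
```

In this toy example the plain accuracy would be 5/6, while the balanced accuracy is 0.75, which shows how errors on the minority class weigh more heavily under this measure.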

Last modified: Friday, 10 April 2015, 3:26 PM