Research:Wiki Participation Challenge prognoZit
About the Algorithm
Authors: Ben Roth and Fridolin Roth
Team Name: prognoZit
Source code: dumps.wikimedia.org
Ranking in Wikichallenge: 1st
Dependencies
- Python
- Octave
Basic Approach
We obtain our prediction vector y, i.e. column 2 of the submitted data, by y = B·w, where
- B ∈ M_{n,m} is a real matrix with one row for each of the n editors, m being the number of features we use for the prediction (e.g. the number of edits in a certain time interval)
- w ∈ ℝ^m is the weight vector
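As a minimal NumPy sketch of this step (the array contents here are placeholders; only the shapes matter):

 import numpy as np

 # Hypothetical sizes: n editors, m features; B would be filled with the
 # interval counts described in the next section.
 n, m = 4, 13
 B = np.random.rand(n, m)      # feature matrix, one row per editor
 w = np.random.rand(m)         # weight vector estimated from the training data

 y = B @ w                     # one prediction per editor
 print(y.shape)                # (4,)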
Features extracted from the data set
The columns of the matrix B are as follows:
- Column 1: Sum of the deltas of edits from 2001-11-01 and before 2010-04-01
- Column 2: Sum of the deltas of edits from 2010-04-01 and before 2010-09-01
- Column 3: Number of reverted edits from 2001-11-01 and before 2010-04-01
- Column 4: Number of reverted edits from 2010-04-01 and before 2010-09-01
- Column 5: Number of edits from 2001-11-01 and before 2006-05-01
- Column 6: Number of edits from 2006-05-01 and before 2008-11-01
- Column 7: Number of edits from 2008-11-01 and before 2009-12-01
- Column 8: Number of edits from 2009-12-01 and before 2010-03-01
- Column 9: Number of edits from 2010-03-01 and before 2010-06-01
- Column 10: Number of edits from 2010-06-01 and before 2010-07-01
- Column 11: Number of edits from 2010-07-01 and before 2010-08-01
- Column 12: Number of edits from 2010-08-01 and before 2010-08-15
- Column 13: Number of edits from 2010-08-15 and before 2010-09-01
An extra column of ones is added (as an offset for the linear regression, see Determination of the weight vector). These are the features used for testing (the submission, matrix B). For training (determining the weight vector w) the same features are used, but the time periods are shifted backwards by 5 months. We call this training matrix A.
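A sketch of how such interval counts can be assembled; the helper name and the per-editor list of edit timestamps are illustrative assumptions, not the authors' extraction code:

 import numpy as np

 # Boundaries of the edit-count intervals (columns 5-13 above).
 bounds = np.array(["2001-11-01", "2006-05-01", "2008-11-01", "2009-12-01",
                    "2010-03-01", "2010-06-01", "2010-07-01", "2010-08-01",
                    "2010-08-15", "2010-09-01"], dtype="datetime64[D]")

 def edit_count_features(edit_dates):
     """Count an editor's edits in each half-open interval [bounds[k], bounds[k+1])."""
     d = np.array(edit_dates, dtype="datetime64[D]")
     return np.array([np.sum((d >= lo) & (d < hi))
                      for lo, hi in zip(bounds[:-1], bounds[1:])])

 # Example: one editor with three edits; the training matrix A would be built the
 # same way with every boundary shifted back by five months.
 print(edit_count_features(["2009-01-15", "2010-06-10", "2010-08-20"]))  # [0 0 1 0 0 1 0 0 1]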
Determination of the weight vector
To determine the weight vector w we use the time-shifted matrix A and the vector z containing the number of edits each editor made within the five months prior to the five months to be predicted. The weight vector is then obtained as the least-squares linear regression solution, computed with Octave's left-division (backslash) operator, w = A \ z.
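Equivalently in NumPy (a sketch with random placeholder data; in Octave the fit is literally w = A \ z):

 import numpy as np

 rng = np.random.default_rng(0)
 n, m = 1000, 13
 A = rng.poisson(2.0, size=(n, m)).astype(float)   # time-shifted training features
 A = np.hstack([np.ones((n, 1)), A])               # prepend the column of ones (offset)
 z = rng.poisson(3.0, size=n).astype(float)        # edits in the following five months

 # Ordinary least squares: minimises ||A*w - z||^2, the same solution Octave's
 # backslash operator returns for an overdetermined system.
 w, *_ = np.linalg.lstsq(A, z, rcond=None)
 print(w.shape)                                    # (14,)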
Working in the log space
We have obtained our solution after transforming all the data to the log space, i.e. we have modified all entries by the function x ↦ log(x + 1) before estimating the weight vector. Accordingly, the solution was retransformed by the function y ↦ max(0, exp(y) − 1).
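A sketch of the two transforms in NumPy (log1p and expm1 are just numerically safer spellings of log(x + 1) and exp(y) − 1):

 import numpy as np

 def to_log_space(x):
     """x -> log(x + 1), applied entrywise to the feature matrices and the target."""
     return np.log1p(x)

 def from_log_space(y):
     """y -> max(0, exp(y) - 1), mapping log-space predictions back to edit counts."""
     return np.maximum(0.0, np.expm1(y))

 counts = np.array([0.0, 1.0, 10.0, 250.0])
 print(from_log_space(to_log_space(counts)))   # [  0.   1.  10. 250.]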
Partitioning the set of editors
We have obtained our solution by applying the approach described above not to the whole set of editors at once but to two subsets of editors separately. These subsets are of equal size and are obtained by grouping the editors whose data appear in the even and odd lines, respectively, of the provided data set.
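A sketch of the split and the two independent fits (0-based rows here, so row 0 corresponds to line 1 of the data set; the function names are mine):

 import numpy as np

 def fit_and_predict(A_part, z_part, B_part):
     """Least-squares fit on the training part, then predict for the matching test rows."""
     w, *_ = np.linalg.lstsq(A_part, z_part, rcond=None)
     return B_part @ w

 def predict_partitioned(A, z, B):
     """Fit editors from odd and even data-set lines separately and interleave the results."""
     n = B.shape[0]
     odd = np.arange(n) % 2 == 0     # data-set lines 1, 3, ... (0-based rows 0, 2, ...)
     even = ~odd                     # data-set lines 2, 4, ...
     y = np.empty(n)
     y[odd] = fit_and_predict(A[odd], z[odd], B[odd])
     y[even] = fit_and_predict(A[even], z[even], B[even])
     return y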
Data used
We have only used the data set provided by the organizers of the competition:
- Number of edits in certain time intervals prior to 2010-09-01
- “Deltas” in certain time intervals prior to 2010-09-01
- Number of reverts in certain time intervals prior to 2010-09-01
Pseudo-code explanation
 function training_features(i, j)
     Return the count (of edits, deltas or reverts, depending on feature j)
     for editor i in the corresponding training time interval of the provided dataset.
 endfunction

 function testing_features(i, j)
     Return the count (of edits, deltas or reverts, depending on feature j)
     for editor i in the corresponding testing time interval.
 endfunction

 n <- number of editors
 m <- number of features

 Let A, B be n x (m+1) matrices.
 for i = 1 to n:
     A(i,1) <- 1                       # column of ones (regression offset)
     B(i,1) <- 1
     for j = 2 to m + 1:
         A(i,j) <- training_features(i, j)
         B(i,j) <- testing_features(i, j)

 Take the logarithm of all entries of A and B:
     A(i,j) <- log(A(i,j) + 1)
     B(i,j) <- log(B(i,j) + 1)

 Let z be an n-vector containing the number of edits done in the 5 months prior
 to the prediction interval (= the 5 months after the last training time interval).
 z <- log(z + 1)

 Split the rows according to even and odd lines of the data set:
 for i = 1 to n / 2:
     A_even(i,:) <- A(2i,:)
     A_odd(i,:)  <- A(2i - 1,:)
     B_even(i,:) <- B(2i,:)
     B_odd(i,:)  <- B(2i - 1,:)
     z_even(i)   <- z(2i)
     z_odd(i)    <- z(2i - 1)

 Solve the least-squares linear regressions for the even and odd indices separately:
     w_even <- A_even \ z_even
     w_odd  <- A_odd \ z_odd

 Calculate the prediction vector in the log space:
 for i = 1 to n / 2:
     y(2i)     <- B_even(i,:) * w_even
     y(2i - 1) <- B_odd(i,:) * w_odd

 Retransform to the non-log space:
     y <- max(0, exp(y) - 1)

 return y
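Putting the pieces together, a compact NumPy rendering of the pseudo-code above (the function name and exact array layout are mine; it assumes A and B already contain the raw counts plus the leading column of ones):

 import numpy as np

 def prognozit_predict(A, B, z):
     """A, B: n x (m+1) raw-count feature matrices (train/test); z: length-n target."""
     A, B, z = np.log1p(A), np.log1p(B), np.log1p(z)        # move everything to log space
     y = np.empty(B.shape[0])
     for rows in (slice(0, None, 2), slice(1, None, 2)):    # odd / even data-set lines
         w, *_ = np.linalg.lstsq(A[rows], z[rows], rcond=None)
         y[rows] = B[rows] @ w
     return np.maximum(0.0, np.expm1(y))                    # back to edit counts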