WSoR datasets/user first msg
Explain what the dataset is and what it is useful for. Make sure to *at least* include what you used it for.
Location
editdb42:staeiou.user_first_msg
mysql> describe staeiou.user_first_msg_final; select * from staeiou.user_first_msg_final limit 1000,6; +-----------------------------+-----------------+------+-----+---------+-------+ | Field | Type | Null | Key | Default | Extra | +-----------------------------+-----------------+------+-----+---------+-------+ | talk_page_id | int(6) signed | yes | PRI | 0 | | | talk_page_title | varbinary(143) | NO | | | | | recipient_user_id | int(5) unsigned | YES | | 0 | | | recipient_name | varbinary(143) | YES | | NULL | | | rev_comment | varbinary(143) | YES | | NULL | | | sender_user_id | int(5) unsigned | YES | | 0 | | | sender_user_name | varbinary(143) | YES | | | | | rev_timestamp | datetime | YES | | NULL | | | recipient_user_registration | datetime | YES | | NULL | | | datediff_msg_reg | int(7) | YES | | NULL | | | timediff_msg_reg | time | YES | | NULL | | | user_reg_ymd | varchar(10) | YES | | NULL | | | first_msg_ymd | varchar(10) | YES | | NULL | | | bot | int(1) | YES | | NULL | | | self | int(1) | YES | | NULL | | | msg_type | varbinary(20) | YES | | NULL | | | msg_type_sub | varbinary(20) | YES | | NULL | | | tool | varbinary(50) | YES | | NULL | | | recipient_edit_count | int(11) | YES | | NULL | | | first_msg_year | int(4) | YES | | NULL | | | block_date | datetime | NO | | NULL | | | is_anon | int(1) | YES | | NULL | | | edits_after_msg | int(11) | YES | | NULL | | | edits_before_msg | int(11) | YES | | NULL | | +-----------------------------+-----------------+------+-----+---------+-------+ 24 rows in set (0.00 sec) +--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+ | talk_page_id | talk_page_title | recipient_user_id | recipient_name | rev_comment | sender_user_id | sender_user_name | rev_timestamp | recipient_user_registration | datediff_msg_reg | timediff_msg_reg | user_reg_ymd | first_msg_ymd | bot | self | msg_type | msg_type_sub | tool | recipient_edit_count | first_msg_year | block_date | is_anon | edits_after_msg | edits_before_msg | +--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+ | 152291 | Wishu Burnstar | 5814 | Wishu Burnstar | Greetings! | 62 | Mav | 2002-11-27 07:45:18 | 2002-11-27 04:47:55 | 0 | 02:57:23 | 2002-11-27 | 2002-11-27 | 0 | 0 | other | NULL | NULL | 19 | 2002 | NULL | 0 | NULL | NULL | | 152730 | Talshiarr | 5363 | Talshiarr | Welcome! | 1078 | Toby Bartels | 2002-11-28 18:59:30 | 2002-11-09 08:34:11 | 19 | 466:25:19 | 2002-11-09 | 2002-11-28 | 0 | 0 | welcome | NULL | NULL | 256 | 2002 | NULL | 0 | NULL | NULL | | 152927 | MartinHarper | 5862 | MartinHarper | experimenting | 5862 | MartinHarper | 2002-11-29 10:56:11 | 2002-11-29 10:38:06 | 0 | 00:18:05 | 2002-11-29 | 2002-11-29 | 0 | 1 | other | NULL | self | 26089 | 2002 | NULL | 0 | NULL | NULL | +--------------+-----------------+-------------------+----------------+---------------+----------------+------------------+---------------------+-----------------------------+------------------+------------------+--------------+---------------+------+------+----------+--------------+------+----------------------+----------------+------------+---------+-----------------+------------------+ 3 rows in set (0.00 sec)
Each row represents a user who has at least one live revision to their user talk page. There is a row in this table for the first message sent to each registered user.
talk_page_id
: enwiki.page.page_id of the receiving user's talk pagetalk_page_title
: enwiki.page.page_title of the receiving user's talk page (is close to, often exactly the user's user_name)recipient_user_id
: enwiki.revision.rev_user of the editor who made the first edit to the user's talk pagerecipient_name
: enwiki.revision.rev_user_text of the editor who made the first edit to the user's talk pagerev_comment
: enwiki.revision.rev_comment of the first edit to the user's talk pagesender_user_id
: enwiki.revision.rev_user of the user who made the talk page editsender_user_name
: enwiki.revision.rev_user_text of the user who made the talk page editrev_timestamp
: enwiki.revision.rev_timestamp of the first edit to the user's talk pagerecipient_user_registration
: enwiki.user.user_registration date of the user whose talk page was editeddatediff_msg_reg
: days between registration and first message, using DATEDIFF(rev_timestamp,user_registration); ~5000 are negative due to bad registration data pre-2004timediff_msg_reg
: time between registration and first message, using TIMEDIFF(rev_timestamp,user_registration); due to the variable type, values are out of bound for > 838 hoursuser_reg_ymd
: text field containing the year, month, and day of the user's registration (used for indexing)first_msg_ymd
: text field containing the year, month, and day of the user's first message (used for indexing)bot
: boolean value as to whether the user who left the first message was a bot (from halfak.bots?)self
: boolean value as to whether the first user_id who left the first message was the user_id who received the message (edits own talk page)msg_type
: message type; can be deletion, notification, talkback, warning, welcomemsg_type_sub
: sub message type; can be blocked, csd, listing, prop, tag, level1, level2, level3, level4tool
: bot or automated tool usedrecipient_edit_count
: enwiki.user.user_editcount (approximate edit count of the user who received the message as of July 2011)first_msg_year
: year of the first message sent (for indexing)block_date
: enwiki.logging.log_timestamp of the first time the user who received the message was blocked (NULL if never been blocked)is_anon
: whether the username is an IP addressedits_after_msg
: number of edits to all namespaces before the message was receivededits_before_msg
: number of edits to all namespaces after the message was received
Reproduction
editPseudocode (will be expanded):
- Get every distinct user_name from enwiki.user
- Join enwiki.user.user_name with enwiki.page.page_title to get all user talk pages
- Join enwiki.revision.rev_page with enwiki.page.page_id to get all the first revisions to a user talk page
- Parse rev_comment to determine tool and message type (see section below)
- Parse rev_user and user_id to determine metadata
Parsing rev_comment for tool/bot used
editUPDATE user_first_msg as ufm, halfak.bot_20110711 as b SET ufm.bot = b.user_id WHERE ufm.sender_user_id = b.user_id; UPDATE user_first_msg SET bot = 1 WHERE bot > 0; UPDATE rev_table SET tool = 'huggle' WHERE rev_comment LIKE '%WP:HG%' OR rev_comment LIKE '%WP:HUGGLE%' OR rev_comment RLIKE "(Reverted ([0-9]+ )?edits by \[\[Special:Contributions/[^\|]+\|[^]]+\]\] \(\[\[User talk:[^\|]+\|talk\]\]\) to last version by .+)|(Message re\. \[\[[^]]+\]\])|(Level [0-9]+ warning re\. \[\[[^]]+\]\])" UPDATE rev_table SET tool = 'twinkle' WHERE rev_comment LIKE '%WP:TW%'; UPDATE rev_table SET tool = 'friendly' WHERE rev_comment LIKE '%WP:FRIENDLY%' OR rev_comment LIKE '%WP:Friendly%'; UPDATE rev_table SET tool = 'vandalproof' WHERE rev_comment LIKE '%WP:VPRF%' OR rev_comment LIKE '%WP:VandalProof%' OR rev_comment LIKE '%VandalProof|VandalProof%' OR rev_comment LIKE '%WP:VP2%' OR rev_comment LIKE '%WP:VandalProof%'; UPDATE rev_table SET tool = 'stiki' WHERE rev_comment LIKE '%|STiki]]%'; UPDATE rev_table SET tool = 'npwatcher' WHERE rev_comment LIKE '%|NPWatcher%'; UPDATE rev_table SET tool = 'vandalsniper' WHERE rev_comment LIKE '%|VandalSniper%'; UPDATE rev_table SET tool = 'wikimonitor' WHERE rev_comment LIKE '%m:WikiMonitor%'; UPDATE rev_table SET tool = 'mwt' WHERE rev_comment LIKE '%MWT|MWT]]%'; UPDATE rev_table SET tool = 'awb' WHERE rev_comment LIKE '%AWB|AWB]]%' OR rev_comment LIKE '%AutoWikiBrowser%'; UPDATE rev_table SET tool = 'cluebot' WHERE lower(cast(rev_user_text as CHAR)) LOWER LIKE 'cluebot%'; UPDATE rev_table SET tool = 'antivandalbot' WHERE rev_user_text = 'AntiVandalBot'; UPDATE rev_table SET tool = 'orphanbot' WHERE rev_user_text = 'OrphanBot'; UPDATE rev_table SET tool = 'pseudobot' WHERE rev_user_text = 'PseudoBot'; UPDATE rev_table SET tool = 'voabot' WHERE rev_user_text LIKE 'VoABot%'; UPDATE rev_table SET tool = 'martinbot' WHERE rev_user_text = 'MartinBot'; UPDATE rev_table SET tool = 'stbot' WHERE rev_user_text LIKE 'STBot%'; UPDATE rev_table SET tool = 'sqlbot' WHERE rev_user_text = 'SQLbot'; UPDATE rev_table SET tool = 'tawkerbot' WHERE rev_user_text LIKE 'Tawkerbot%'; UPDATE rev_table SET tool = 'sinebot' WHERE rev_user_text = 'SineBot'; UPDATE rev_table SET tool = 'csdwarnbot' WHERE rev_user_text = 'CSDWarnBot'; UPDATE rev_table SET tool = 'antispambot' WHERE rev_user_text = 'AntiSpamBot'; UPDATE rev_table SET tool = 'imagetaggingbot' WHERE rev_user_text = 'ImageTaggingBot'; UPDATE rev_table SET tool = 'dashbot' WHERE rev_user_text = 'DASHBot'; UPDATE rev_table SET tool = 'bjbot' WHERE rev_user_text = 'BJBot';
Parsing rev_comment for message type
editUPDATE rev_table SET msg_type = 'warning', msg_type_sub='level1' WHERE rev_comment LIKE 'Message re.%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level2' WHERE rev_comment LIKE 'Level 2 warning%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level3' WHERE rev_comment LIKE 'Level 3 warning%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level4' WHERE rev_comment LIKE 'Level 4 warning%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level1' WHERE rev_comment LIKE 'General note:%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level2' WHERE rev_comment LIKE 'Caution:%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level3' WHERE rev_comment LIKE 'Warning:%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'warning', msg_type_sub='level4' WHERE rev_comment LIKE 'Final warning:%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'csd' WHERE rev_comment LIKE 'Notification: speedy deletion%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'tag' WHERE rev_comment LIKE 'Notification: tagging for deletion%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'prop' WHERE rev_comment LIKE 'Notification: proposed deletion%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'listing' WHERE rev_comment LIKE 'Notification: listing at%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'talkback' WHERE rev_comment LIKE 'Talkback%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'You have been%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'Your IP address has been blocked%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'welcome' WHERE rev_comment LIKE 'Added welcome%' AND tool = 'twinkle'; UPDATE rev_table SET msg_type = 'welcome' WHERE rev_comment LIKE 'Welcoming%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'csd' WHERE rev_comment LIKE 'Notification: Speedy deletion%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'deletion', msg_type_sub = 'prop' WHERE rev_comment LIKE 'Notification: Proposed deletion%' AND tool = 'huggle'; UPDATE rev_table SET msg_type = 'notification', msg_type_sub = 'blocked' WHERE rev_comment LIKE 'Notification: Blocked%' AND tool = 'huggle';
Notes
editIs the dataset incomplete? Are there known issues? What kind of things should someone who uses this dataset be aware of?