Research:Teahouse group dynamics/Dataset

Schema for Teahouse replies dataset


Replies by Wikipedians to questions at the Teahouse. One reply per row.

  • id: sequential record ID. Records are sorted by author and then (usually) chronologically by timestamp.
  • threadid: sequential thread ID. Threads are counted from the first thread in the first Teahouse archive page, so threadid 1 corresponds to the thread "en:Wikipedia:Teahouse/Questions/Archive_1#References/Sources_for_Television_Shows_and_Books".
  • archive: sequential archive ID, so archive 1 corresponds to Wikipedia:Teahouse/Questions/Archive_1.
  • title: canonical title of the thread, as listed in the archive. May differ from the original thread title if that title was edited before the thread was archived.
  • replyno: position of the reply within the thread.
  • isreply: whether the post is a reply or not (non-replies are often the question that initiated the thread).
  • timestamp: timestamp in UTC.
  • welcome: whether the reply starts with a welcome (1) or not (0).
  • selfreply: whether the reply is to a thread started by the same author.
  • isfirstreply: whether the post is the first reply in the thread.
  • policies: list of tuples of the form (PAGENAME_IN_CAPS, page_type).
  • policycount: number of policies linked in the post.
  • postlength: number of characters in the post.
  • welcome_prev_page: proportion of first replies in threads on the previous archive page that started with a welcome.
  • welcome_2prev_page: proportion of first replies in threads on the archive page preceding the previous archive page that started with a welcome.
  • policyprop_prev_page: ratio of policy links to total words in first replies on the previous archive page.
  • policyprop_2prev_page: ratio of policy links to total words in first replies on the archive page preceding the previous archive page.
  • welcomenotfirst_prev_page: proportion of subsequent replies in threads on the previous archive page that started with a welcome.
  • welcomenotfirst_2prev_page: proportion of subsequent replies in threads on the archive page preceding the previous archive page that started with a welcome.
  • duplicated: whether this post is a duplicate of another post in the archive. Accidental duplication of threads is rare in the dataset.
  • uname_parsed: username parsed from the archived thread. May differ from current user_name. Should match author.
  • user_name: if different from uname_parsed, the post author changed their username between the time the thread was archived and the time the dataset was generated.
  • user_id: user ID of the author.
  • rev_id: if the author is a host, the revision ID of their host profile creation.
  • rev_timestamp: if the author is a host, the UTC timestamp of their host profile creation.
  • is_host: whether the author is a host ("Host") or not ("Non Host").
  • age: number of days between the date of the author's first Teahouse edit and the date of the current post.
  • poster_age:
  • helpdesk_count:
  • helpdesk_start:
  • is_helpdesk:
  • teahouse_count:
  • wikipedia_count:
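
The lagged fields such as welcome_prev_page and welcome_2prev_page can be illustrated with a short sketch. This is not the code used to build the dataset. It assumes that archive is an integer, that welcome and isfirstreply are encoded as 1/0, and that each row is available as a dictionary keyed by column name. The function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def welcome_rate_by_archive(rows):
    """Return {archive: proportion of first replies that start with a welcome}."""
    first_replies = defaultdict(int)
    welcomes = defaultdict(int)
    for row in rows:
        # Only first replies count toward welcome_prev_page-style fields.
        if int(row["isfirstreply"]) == 1:
            archive = int(row["archive"])
            first_replies[archive] += 1
            welcomes[archive] += int(row["welcome"])
    return {a: welcomes[a] / first_replies[a] for a in first_replies}


def lagged_rate(rates, archive, lag=1):
    """Rate for the archive page `lag` pages earlier, or None if unavailable."""
    return rates.get(archive - lag)


# Toy rows (invented for illustration, not taken from the dataset).
sample_rows = [
    {"archive": 1, "isfirstreply": 1, "welcome": 1},
    {"archive": 1, "isfirstreply": 1, "welcome": 0},
    {"archive": 1, "isfirstreply": 0, "welcome": 1},  # subsequent reply, ignored
    {"archive": 2, "isfirstreply": 1, "welcome": 1},
    {"archive": 3, "isfirstreply": 1, "welcome": 0},
]

rates = welcome_rate_by_archive(sample_rows)
prev_page = lagged_rate(rates, archive=3, lag=1)   # analogue of welcome_prev_page
prev2_page = lagged_rate(rates, archive=3, lag=2)  # analogue of welcome_2prev_page
```

Under the same assumptions, the welcomenotfirst_* fields would follow by filtering on isfirstreply equal to 0 instead. How the dataset treats archive pages with no qualifying replies is not stated here, so the sketch simply returns None in that case.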