Research talk:Automated classification of article importance/Work log/2017-03-16
Thursday, March 16, 2017
Today I'll continue where I left off yesterday by training the GBM classifier and inspecting misclassified articles.
Gradient Boost Model
We train a GBM in much the same way as we did for our Random Forests and SVMs yesterday. First we use 10-fold cross-validation on the 1,000 item training set with varying minimum node sizes (equivalent to the RF's "terminating node size") to identify the best-performing minimum node size and forest size. Using an iterative approach we find that for the initial case, with number of views and number of links from all other Wikipedia articles as predictors, a minimum node size of 8 and a forest with 1,221 trees is preferred. This configuration is then used to predict the articles in the test set, where we get an overall accuracy of 53.75% and the following confusion matrix:
True \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 28 | 10 | 2 | 0 |
High | 13 | 16 | 7 | 4 |
Mid | 2 | 9 | 16 | 13 |
Low | 0 | 1 | 13 | 26 |
This model performs well on Top- and Low-importance articles, but does not do so well on High- and Mid-importance articles. Given the confusion between Top- and High-importance articles in yesterday's models, it is not surprising to see the GBM struggle with these as well.
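The overall accuracy and the per-class picture described above can be rechecked directly from the confusion matrix. The following is a minimal sketch, assuming rows are true ratings and columns are predicted ratings (each row sums to 40, which is consistent with a balanced 160-article test set); the function and variable names are illustrative, not taken from the original analysis.

```python
# Assumption: rows are true ratings, columns are predicted ratings.
LABELS = ["Top", "High", "Mid", "Low"]
CONFUSION = [
    [28, 10, 2, 0],
    [13, 16, 7, 4],
    [2, 9, 16, 13],
    [0, 1, 13, 26],
]


def overall_accuracy(matrix):
    """Share of all articles that sit on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def per_class_recall(matrix, labels):
    """For each true class, the share of its articles predicted correctly."""
    return {label: matrix[i][i] / sum(matrix[i]) for i, label in enumerate(labels)}


accuracy = overall_accuracy(CONFUSION)
recall = per_class_recall(CONFUSION, LABELS)
print(f"accuracy: {accuracy:.2%}")
for label, value in recall.items():
    print(f"{label}: {value:.0%}")
```

Under that assumption, 70% of Top- and 65% of Low-importance articles are recovered, against 40% each for High and Mid.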
Next we try using project-internal links as the second predictor, and again tune the minimum node size and forest size parameters, finding that 16 and 1,435 respectively have the best performance. Applying those to the test set gives an overall accuracy of 59.38%. The improved performance comes in the Mid-importance class, where 26 articles were correctly predicted, compared to 16 in the previous model. Performance in the other classes is roughly the same.
Lastly we try all three predictors. Here we find that a minimum node size of 32 and a forest with 1,830 trees have the best cross-validation performance. Running this configuration on the test set we get an overall accuracy of 60%. The results are very similar to those using just project-internal links; having both link counts together does not seem to add much information.
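The tuning step repeated for each predictor set above (10-fold cross-validation over candidate minimum node sizes and forest sizes) can be outlined as follows. This is only a sketch of the general procedure: the log does not say which software or candidate values were used, so `kfold_indices`, `grid_search`, the `fit_and_score` callback and the values in `GRID` are all hypothetical, and the GBM fit itself is left to the caller.

```python
import random
from itertools import product
from statistics import mean


def kfold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists that split range(n_items) into k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != held_out for j in fold]
        yield train, test


def grid_search(n_items, grid, fit_and_score, k=10):
    """Return (params, mean accuracy) for the best combination in `grid`.

    `fit_and_score(params, train, test)` is a caller-supplied function that
    fits a model on the `train` indices and returns its accuracy on `test`.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(
            fit_and_score(params, train, test)
            for train, test in kfold_indices(n_items, k)
        )
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Candidate values are illustrative assumptions, not the grid used in this log.
GRID = {"min_node_size": [4, 8, 16, 32, 64], "n_trees": [500, 1000, 2000]}
folds = list(kfold_indices(1000, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
```

Keeping the model behind a callback means the same loop could serve the Random Forest, SVM and GBM runs, with only the fitting function changing.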
2,000 item training set
We then redo the process on the 2,000 item training set. First using global links and article views, we find that a minimum node size of 64 and a forest with 2,278 trees have the best cross-validation performance. Applying this to the test set we get an overall accuracy of 54.38%. This is slightly higher than the 53.75% reported with the smaller training set.
Using project-internal wikilinks we find that a minimum node size of 4 and a forest with 2,486 trees have the best cross-validation performance. On the test set, this model achieves 51.88% accuracy, slightly below what we saw previously.
Lastly, using all three predictors we find that the minimum node size should be 32 and the forest size 2,497 trees, as that has the best cross-validation performance. On the test set, this model achieves the same overall accuracy as the previous model, at 51.88%. This is much lower than the 60% we saw with the smaller training dataset, and in that way similar to how the training/testing performance of the Random Forest classifier turned out. So for this project-specific classifier as well, we find that the SVM is the higher performer.
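Because the test set is small, it can help to turn the reported percentages back into article counts. The sketch below does that for the six GBM runs described above, assuming the same 160-article test set (the size given later in this log) was used for every run; the dictionary keys are illustrative labels only.

```python
TEST_SET_SIZE = 160  # assumed to be the same for all six runs

# (training set size, predictors) -> reported overall accuracy in percent
REPORTED = {
    (1000, "views + global links"): 53.75,
    (1000, "views + project links"): 59.38,
    (1000, "all three"): 60.00,
    (2000, "views + global links"): 54.38,
    (2000, "views + project links"): 51.88,
    (2000, "all three"): 51.88,
}

# Number of correctly predicted test articles implied by each percentage.
correct_counts = {
    key: round(pct / 100 * TEST_SET_SIZE) for key, pct in REPORTED.items()
}
for (train_size, predictors), n in correct_counts.items():
    print(f"{train_size:>5} {predictors:<22} {n}/{TEST_SET_SIZE}")
```

Seen this way, the gap between 59.38% and 60% is a single article (95 versus 96 correct), while the drop to 51.88% corresponds to 83 correct.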
Classification errors
We select the highest-performing SVM classifier (using all three predictors and trained on the 1,000 item training set) and then examine the articles it misclassified. We are primarily interested in predictions that are far away from the actual rating, meaning that we disregard errors into a neighboring class (e.g. Top-importance articles being predicted High-importance). Here is a confusion matrix which lists the interesting articles; columns are predicted ratings and rows are true ratings.
True \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | ||||
High | ||||
Mid | ||||
Low |
I suspect that the first thing WPMED is going to ask for is a list across all of their articles, since our test dataset only contains 160 articles. In the whole dataset, the distribution of article predictions is as follows:
True \ Predicted | Top | High | Mid | Low |
---|---|---|---|---|
Top | 72 | 16 | 1 | 1 |
High | 240 | 520 | 177 | 78 |
Mid | 325 | 2,129 | 4,099 | 2,412 |
Low | 70 | 1,339 | 4,632 | 13,251 |
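To see how many articles sit in each far-off cell of the matrix above, the cells at a distance of two or more classes from the diagonal can be tallied as in this sketch. Rows are taken as true ratings and columns as predicted ratings; `far_misses` and `min_distance` are illustrative names, not part of the original analysis.

```python
LABELS = ["Top", "High", "Mid", "Low"]
# Rows are true ratings, columns are predicted ratings.
FULL = [
    [72, 16, 1, 1],
    [240, 520, 177, 78],
    [325, 2129, 4099, 2412],
    [70, 1339, 4632, 13251],
]


def far_misses(matrix, labels, min_distance=2):
    """Map (true, predicted) to a count for cells at least min_distance apart."""
    cells = {}
    for i, true_label in enumerate(labels):
        for j, predicted_label in enumerate(labels):
            if abs(i - j) >= min_distance:
                cells[(true_label, predicted_label)] = matrix[i][j]
    return cells


misses = far_misses(FULL, LABELS)
# Print the cells from smallest to largest.
for (true_label, predicted_label), n in sorted(misses.items(), key=lambda kv: kv[1]):
    print(f"{true_label} predicted {predicted_label}: {n}")
```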
Some of these categories are clearly too big to list completely. For instance, if we use a distance of two classes as our threshold, we need to list 1,339 Low-importance articles predicted to be High-importance. That would be counterproductive. Instead we focus on the somewhat smaller classes, and list them individually:
- Top-importance predicted to be Low-importance (1 article)
- Top-importance predicted to be Mid-importance (1 article)
- High-importance predicted to be Low-importance (78 articles):
- 1852–60 cholera pandemic
- 1881–96 cholera pandemic
- Adherence (medicine)
- Anna Suk-Fong Lok
- Asclepius
- Autosomal dominant polycystic kidney disease
- Basic symptoms of schizophrenia
- Benjamin Spock
- Breast cancer management
- Children's hospice
- Christine Williams (nutritionist)
- Chukwuedu Nwokolo
- Cocaine (data page)
- Complications of hypertension
- Continuous passive motion
- Coronary ischemia
- Darbepoetin alfa
- Destination therapy
- Dimitrios Trichopoulos
- Discovery and development of nucleoside and nucleotide reverse-transcriptase inhibitors
- Drug use
- Drugs for acid-related disorders
- Ear pain
- Effects of parasitic worms on the immune system
- Eleanor Montague
- Elisa Oricchio
- Elisabeth Binder
- Elixir sulfanilamide
- Environmental enteropathy
- Facial skeleton
- FOLFOXIRI
- François Fournier de Pescay
- Gaseous signaling molecules
- Glucose cycle
- Google Flu Trends
- Gross Motor Function Classification System
- Health care fraud
- Health effects of salt
- Hepatitis C and HIV coinfection
- HIV and pregnancy
- Hua Eleanor Yu
- Integrated Management of Childhood Illness
- Karen C. Johnson
- Kathleen I. Pritchard
- Lea test
- Legionellales
- List of Legionnaires' disease outbreaks
- List of man-made mass poisoning incidents
- List of unsolved problems in medicine
- Maria Abbracchio
- Medical advice
- Medical privacy
- Mineral (nutrient)
- Mouth infection
- National Center for Immunization and Respiratory Diseases
- Pathophysiology of hypertension
- Precautionary principle
- Procedural sedation and analgesia
- Rat Genome Database
- Reading disability
- Resource-based relative value scale
- Reuptake inhibitor
- Root sheath
- Rotaviral enteritis
- Self-disorder
- Social immunity
- Tainted blood scandal (United Kingdom)
- Tebello Nyokong
- Testicular self-examination
- Transcultural nursing
- Translators Without Borders
- Tuberculosis in India
- Tuberculosis in relation to HIV
- Ulysses syndrome
- Uroscopy
- Weaning
- Weight management
- WHO Surgical Safety Checklist
- Mid-importance predicted to be Top-importance (325 articles):
- Abortion
- ACE inhibitor
- Acetazolamide
- Acetylcholinesterase inhibitor
- Acupuncture
- Adipose tissue
- Agoraphobia
- Alanine transaminase
- Allopurinol
- Alopecia areata
- Amiodarone
- Amitriptyline
- Amoxicillin/clavulanic acid
- Ampicillin
- Anabolic steroid
- Androgen
- Anesthesiologist
- Anesthetic
- Angina pectoris
- Anthrax
- Anti-inflammatory
- Anticoagulant
- Antigen
- Aphasia
- Aripiprazole
- Arousal
- Artificial insemination
- Asphyxia
- Assisted suicide
- Ataxia
- Atrial fibrillation
- Atropine
- Autism spectrum
- Azathioprine
- Betamethasone
- Bevacizumab
- Bile acid
- Biotin
- Bisphenol A
- Blood plasma
- Blood sugar
- Bloodletting
- Bone
- Botulinum toxin
- Brain death
- Brain–computer interface
- BRCA1
- Brucellosis
- Bubonic plague
- Budesonide
- Buprenorphine
- Calcium
- Cannabidiol
- Carcinogen
- Cat-scratch disease
- Catatonia
- Catecholamine
- Celecoxib
- Cephalosporin
- Cetirizine
- Chelation therapy
- Chikungunya
- Child development
- Chloramphenicol
- Chlorhexidine
- Chlorpromazine
- Cholecalciferol
- Ciclosporin
- Cisplatin
- Clarithromycin
- Clindamycin
- Clonazepam
- Clotrimazole
- Clozapine
- Codeine
- Color blindness
- Corticosteroid
- COX-2 inhibitor
- Crohn's disease
- CT scan
- Cyclophosphamide
- Decompression sickness
- Delusion
- Depression (mood)
- Dexamethasone
- Dextroamphetamine
- Dichlorodiphenyltrichloroethane
- Diltiazem
- Diphenhydramine
- Dissociative identity disorder
- Diuretic
- Donepezil
- Dopamine
- Down syndrome
- Doxorubicin
- Doxycycline
- Doxylamine
- Dwarfism
- Edema
- Effects of cannabis
- Electric shock
- Electroconvulsive therapy
- Enalapril
- Encephalitis
- Ephedrine
- Epidemiology
- Epigenetics
- Erysipelas
- Erythropoietin
- Escitalopram
- Ethanol
- Flatulence
- Fluorouracil
- Folic acid
- Follicle-stimulating hormone
- Fragile X syndrome
- Furosemide
- Gabapentin
- Gastroesophageal reflux disease
- Gender dysphoria
- Germ theory of disease
- Glucagon
- Glucose tolerance test
- Gonorrhea
- Graves' disease
- Guaifenesin
- Guillain–Barré syndrome
- Gynaecology
- Hair loss
- Hallucination
- Haloperidol
- Hangover
- Hantavirus
- Heart
- Heart rate
- Hematopoietic stem cell
- Heparin
- Herpes simplex virus
- Homeopathy
- Hydrocephalus
- Hydrochlorothiazide
- Hydrocortisone
- Hydromorphone
- Hyoscine
- Hypercapnia
- Hypericum perforatum
- Hyperthermia
- Hypertrophic cardiomyopathy
- Hypothermia
- Ibuprofen
- Imatinib
- In vitro fertilisation
- Infant
- Intellectual disability
- Intensive care unit
- Interferon
- Intersex
- Intravenous therapy
- Ipratropium bromide
- Isotretinoin
- Jaundice
- Karyotype
- Kava
- Ketamine
- Ketoconazole
- Lactobacillus
- Lamotrigine
- Laryngitis
- Levothyroxine
- Lidocaine
- Life expectancy
- Linezolid
- Lisdexamfetamine
- Lithium (medication)
- Long-term effects of cannabis
- Loperamide
- Loratadine
- Lorazepam
- Losartan
- Luteinizing hormone
- Lysergic acid diethylamide
- Macrobiotic diet
- Macular degeneration
- Mania
- Mannitol
- MDMA
- Median lethal dose
- Medicaid
- Medical cannabis
- Melanin
- Melatonin
- Memory
- Menstruation
- Mesalazine
- Methadone
- Methanol
- Methicillin-resistant Staphylococcus aureus
- Methotrexate
- Methylene blue
- Methylphenidate
- Methylprednisolone
- Metoprolol
- Metronidazole
- Midazolam
- Midwife
- Mifepristone
- Misoprostol
- Modafinil
- Monoclonal antibody
- Morphine
- Motion sickness
- Multiple birth
- Multiple myeloma
- Muscle contraction
- Muscular dystrophy
- Mycobacterium tuberculosis
- Naloxone
- Narcolepsy
- Necrosis
- Neuroplasticity
- Neurosis
- Neutrophil
- Niacin
- Nicotine
- Nifedipine
- Nitric oxide
- Norepinephrine
- Obsessive–compulsive disorder
- Occupational therapy
- Olfaction
- Oncogene
- Ondansetron
- Opioid
- Optometry
- Otorhinolaryngology
- Oxycodone
- Paclitaxel
- Paracetamol
- Paranoid schizophrenia
- Paraplegia
- Paroxetine
- Patient Protection and Affordable Care Act
- Peritonitis
- Personality disorder
- Phobia
- Physical therapy
- Placebo
- Plague (disease)
- Pleurisy
- Polio vaccine
- Positron emission tomography
- Potassium iodide
- Prednisolone
- Pregabalin
- Propofol
- Propranolol
- Psychiatrist
- Psychoactive drug
- Psychotherapy
- Puberty
- Pulmonary edema
- Pulse
- Quality of life
- Quinine
- Randomized controlled trial
- Ranitidine
- Renin
- Repetitive strain injury
- Residency (medicine)
- Retinol
- Rheumatism
- Riboflavin
- Salbutamol
- Sarcoidosis
- Savant syndrome
- Self-harm
- Sense
- Serotonin–norepinephrine reuptake inhibitor
- Sertraline
- Sexual intercourse
- Simvastatin
- Speech-language pathology
- Spironolactone
- Staphylococcus aureus
- Stillbirth
- Strabismus
- Streptococcus
- Streptomycin
- Strychnine
- Stuttering
- Substance abuse
- Sulfonamide (medicine)
- Syncope (medicine)
- T cell
- Tamoxifen
- Tendinitis
- Testicular cancer
- Tetrodotoxin
- Thiamine
- Thyroid cancer
- Topiramate
- Tourette syndrome
- Trastuzumab
- Tretinoin
- Trichinosis
- Trichotillomania
- Ultrasound
- Universal health care
- Urination
- Urine
- Valsartan
- Vancomycin
- Vasopressin
- Verapamil
- Vitamin C
- Vitamin D
- Vitamin E
- Vomiting
- Warfarin
- William Harvey
- X chromosome
- Yellow fever
- Zolpidem
- Zoonosis
- Zygote
- Low-importance predicted to be Top-importance (70 articles):
- Albinism
- Biofeedback
- Brain natriuretic peptide
- Buspirone
- Cadaver
- Carbon tetrachloride
- Chloroform
- Cholecystokinin
- Desomorphine
- Dextromethorphan
- Dominance (genetics)
- Edward Jenner
- Electrolyte
- Euphoria
- Flow cytometry
- Fungus
- Galen
- Ghrelin
- Gluconeogenesis
- Gut flora
- Head transplant
- Health care in the United States
- Health insurance in the United States
- Health Insurance Portability and Accountability Act
- Hippocrates
- Homocysteine
- Human leukocyte antigen
- Humorism
- Hypersexuality
- Lecithin
- Leptin
- Life extension
- Low-density lipoprotein
- Magnesium
- Medicare (United States)
- Merck & Co.
- Mutagen
- Mycoplasma
- Naltrexone
- National Health Service
- Nicotine poisoning
- Nurse practitioner
- Oliver Sacks
- Out-of-body experience
- Paraphilia
- Pentoxifylline
- Permethrin
- Pesticide
- Phentermine
- Phosgene
- Piracetam
- Polychlorinated biphenyl
- Protozoa
- Ramipril
- Reactive oxygen species
- Robert Koch
- Rudolf Virchow
- Sarin
- Scar
- Sexual addiction
- Sleepwalking
- Somatostatin
- Stem cell
- Sugar substitute
- Tissue plasminogen activator
- Tocopherol
- Vitamin A
- VX (nerve agent)
- Working memory
- Y chromosome