Research talk:Automated classification of draft quality/Work log/2016-12-03
Saturday, December 3, 2016
Hey folks! I have been working on this on and off for a couple of days. I haven't had a good opportunity to sit down and focus on it so I didn't pull up the log.
Here's the gist: Last time, I was looking at the distribution of sentence scores and the difference between scores from the 4 different PCFGs. In this log, I'll document what I learned from looking into the model to see what it differentiates. In this set of examples, I'm using the training data as a sort of exploratory mechanism to learn what exactly these models are and are not able to catch. This analysis doesn't substitute for a real evaluation where we score sentences that we'd previously withheld, but it will give us a sense of the extent to which we can differentiate between FA, Spam, Vandalism and Attack at all.
Spam that we think looks like FA content
scores.normalized.own_model[ quality == "spam" & model == "FA" & productions > 1,][order(log_proba_diff, decreasing=T)]
- Not only will it stink, it's going to shorten the lifetime of the device.
- We not too long ago had to have this carried out at our home; the basement drain was clogged; the plumber finally pulled out a mass of tree roots the scale of a volleyball!
- Battle through hordes of undead, skeletons, orcs, goblin and monsters.
Spam that we're really sure is not FA
scores.normalized.own_model[ quality == "spam" & model == "FA" & log_proba_diff > -2.7 & productions > 1,][order(log_proba_diff)][1:10]
- Intuit QuickBooks Tech Support Phone\t1 800 903 7315\t\u00a024*7 Support \u00a0\n...
- b.er\n http://upstart.ubuntu.com/wiki/Obama%2B1888%20624%204666%20Turbotax%20Helpdesk%20number,%20helpdesk%20phone%20number.
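For anyone who wants to poke at the same pattern without the real data, here is a rough base-R sketch of the filter-and-sort that the queries in this log share: keep one labeled class scored by one model, drop sentences with a single production, and sort by log_proba_diff. The toy `scores` data frame, its values, and the `rank_by_diff` helper are all hypothetical stand-ins and not part of the actual analysis. It relies only on what the ordering in the queries implies, namely that a higher log_proba_diff means "looks more like FA". The extra log_proba_diff threshold used in the second spam query is left out.

```r
# Toy stand-in for scores.normalized.own_model. Every value below is invented
# for illustration and is NOT taken from the real dataset.
scores <- data.frame(
  sentence = c("sentence A", "sentence B", "sentence C", "sentence D", "sentence E"),
  quality = c("spam", "spam", "spam", "vandalism", "spam"),
  model = c("FA", "FA", "FA", "FA", "spam"),
  productions = c(12, 1, 30, 8, 12),
  log_proba_diff = c(-0.4, 0.9, -3.1, -1.2, 0.0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep sentences of one labeled class scored by one model,
# drop sentences with productions <= 1, sort by log_proba_diff, return the top n.
rank_by_diff <- function(scores, quality_label, model_name, decreasing = TRUE, n = 10) {
  keep <- scores$quality == quality_label &
    scores$model == model_name &
    scores$productions > 1
  kept <- scores[keep, , drop = FALSE]
  sorted <- kept[order(kept$log_proba_diff, decreasing = decreasing), , drop = FALSE]
  head(sorted, n)
}

# Spam that the FA model scores most favorably (largest log_proba_diff first).
looks_like_fa <- rank_by_diff(scores, "spam", "FA", decreasing = TRUE)

# Spam that the FA model scores least favorably (smallest log_proba_diff first).
not_like_fa <- rank_by_diff(scores, "spam", "FA", decreasing = FALSE)

print(looks_like_fa$sentence)
print(not_like_fa$sentence)
```

Swapping "spam" for "vandalism" or "attack" gives the analogue of the other sections. The real queries use data.table chaining rather than this base-R helper.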
Vandalism content that looks like FA content
scores.normalized.own_model[ quality == "vandalism" & model == "FA" & productions > 1,][order(log_proba_diff, decreasing=T)][1:10]
- p. 288.
- Sodium chloride supplies essential ions.
- The temple was destroyed in the VII century, during the Byzantine invasion.
Vandalism that we're really sure is not FA
scores.normalized.own_model[ quality == "vandalism" & model == "FA" & productions > 1,][order(log_proba_diff)][1:10]
- mynamenickpang
- HEAVEN ON TEARS HEAVEN ON TEARS <repeat 100 times>
- UNCYCLOPEDIA IS SHIT
- Meowwwwwwwwwwwwwww
- charlie is jesus jesus is charlie charlie is jesus jesus is charlie charlie is jesus <repeat 100 times>
- r.t quickbooks T.
Attack content that looks like FA content
scores.normalized.own_model[ quality == "attack" & model == "FA",][order(log_proba_diff, decreasing=T)][1:10]
- Colonel G.
- p. 251.
- Hitler was a decorated veteran of World War I.
- 98 mm long.
Attack that we're really sure is not FA
scores.normalized.own_model[ quality == "attack" & model == "FA" & productions > 1,][order(log_proba_diff)][1:10]
- POO....POO....
- Chris Mostly Bums Dwarves Chris Mostly Bums Dwarves <repeat 20 times>
- h.. h..
- c.. c..
- BEEF! BEEF!
- REDIRECT Donald Trump
- Integer euismod lacus luctus magna.
All in all, this looks pretty good. I think we're ready to start applying this to new data. --EpochFail (talk) 20:15, 3 December 2016 (UTC)