47 thoughts on “Anti-Learning (So Bad, it's Good) – Computerphile”

  1. I like the video, well done. Looking at some of the comments below: it is not easy to explain in words concepts that are naturally captured by a few algebraic equations. The idea of anti-learning is clearly stated in Uwe's presentation. However, it interferes with attempts by listeners, perhaps subconscious, to build a simple analytical model of it. Even the simplest algebraic equation expressed in words is a mess until you have a proper model in your mind. Personally, I would like to see, in addition, yet another 2-dimensional version of XOR which I used in my explanations of anti-learning. Maybe it is time for another YouTube video. In more general terms, I see here a need for formal maths, which we cannot escape. At this moment I appreciate the legacy of René Descartes, who introduced coordinates to geometry, so that we can now perform very complicated derivations on space and time, the universe, etc. Before that, geometry was really hard to deal with and reduced in scope.

  2. 1:36 "One summer, I somehow agreed to supervise this medical student on a master's project, I don't know why I agreed to this." Well, he provided insight into a problem that the domain experts had failed to solve, because he brought a different perspective to the data set. If I were you, I would agree to supervise a few more master's students.

  3. If they were getting 45% accuracy on the test set, they must've messed something up. Given a balanced dataset, even if the network can't learn anything useful from the data, it can predict one single outcome every time and still get 50% accuracy; given an unbalanced dataset, it can do even better (a quick sketch below puts numbers on this). How did they work on this for 2 years and achieve 45% accuracy? That's insane.

    Also, he is under the very false impression that 500 samples is a decent-sized dataset, which is not true. MNIST, a dataset of handwritten digits, has 60,000 training samples. And that's just to recognize digits. He wants to train a network to essentially choose the appropriate treatment for cancer patients using only 500 samples? That's just absurd. Also, like other comments said, simply inverting the prediction doesn't actually improve accuracy.
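
    A toy simulation of that baseline point (hypothetical labels, nothing from the video's dataset):

        import numpy as np

        rng = np.random.default_rng(0)

        # Balanced case: always predicting the majority class already gives ~50%,
        # so a trained model at 45% does worse than a constant guess.
        y = rng.integers(0, 2, size=1000)
        majority = np.bincount(y).argmax()
        print((y == majority).mean())      # ~0.5

        # Unbalanced case (say 70/30): the same constant guess does even better.
        y = rng.random(1000) < 0.3         # True for the 30% minority class
        print((~y).mean())                 # ~0.7 by always predicting the majority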

    This guy is the Head of Computing & IT at his university. That's actually a little sad.

  4. I really, really, really can't stand hearing about these "medical applications" anymore. Clueless, evil doctors who want a program to decide whether to operate on someone because, as I said, they are clueless and evil.

    Important newsflash: Doctors make up diseases because there's not enough work for them in reality.

  5. Explanation: "I've learned a simple way of choosing an answer to a complicated problem involving 2 choices. I know that I'm wrong more often than right. So, I'll choose the opposite of the answer I would normally pick. Now I'm right more often than not."
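
    That logic comes down to one subtraction (toy simulation with a made-up 70% error rate):

        import numpy as np

        rng = np.random.default_rng(0)

        # A guesser on yes/no questions that happens to be wrong 70% of the time.
        truth = rng.integers(0, 2, size=10_000)
        wrong = rng.random(10_000) < 0.7           # which guesses come out wrong
        guess = np.where(wrong, 1 - truth, truth)

        print((guess == truth).mean())             # ~0.30 if you keep the guess
        print((1 - guess == truth).mean())         # ~0.70 if you pick the opposite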

  6. He could have used the blank side of the paper for that simple graphic… or divided a single side into four; that works too…

  7. Wouldn't it be more efficient to use pattern recognition on the data sets? Split the cases into A-B groups, "healthy" and "not healthy", have humans categorize them, then have the machine look for commonalities. Then split each group again, re-run the algorithms, split again, re-run, etc. That way you'd end up with a machine that can not only do the initial A-B grouping, but also potentially tease out new patterns that the humans hadn't noticed before. It sounds like they were trying to have the machine do all the work unaided.

  8. Am I the only one who shudders at the waste of paper? He's only drawing on every other sheet, and his examples are simple enough that you could fit many of them onto a single one.

  9. I guess what he meant by "reversing" is that they used machine learning to find the wrong answers, since those are easier to find. That way, with enough of them, you will eventually arrive at the right answers.

  10. If you included a third dimension of time (for example, from the medical data mentioned in the video), might you be able to fit a parabolic arc that captures the function you are looking to compute?

  11. This video is bad because it has no context. You just jump in and assume that people know what the topic is, what you're going to be talking about, and where this discussion sits in the grand body of knowledge a person may learn in life. I expect better than this, because the other videos on this channel are typically better.

  12. (?) A transistor is stop or go, and a neuron can do that or sort inputs into categories, up to a full reflection. So if you set up a circuit to sort data like a set of partial mirrors, would that do this job?

    …Inspiration from an organic process that cycles absorb/exclude over and over.

  13. This doesn't make any sense. You can't reduce entropy (improve predictability) by inverting your guess at the end. See the data processing inequality:

    https://en.wikipedia.org/wiki/Data_processing_inequality
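
    For reference, the inequality says that for a Markov chain $X \to Y \to Z$ (i.e., $Z$ is computed from $Y$ alone), post-processing cannot create information:

        $$ I(X;Z) \le I(X;Y) $$

    Worth noting, though: inverting a binary guess is an invertible map, so it leaves the mutual information unchanged; the inequality constrains information, not 0/1 accuracy.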

  14. As many others have pointed out in the comments, this video is quite uninformative as to what "Anti-Learning" is actually supposed to be. So I've looked at a couple of the papers by Aickelin and Roadknight, and even though those are not particularly clear on what they're doing either, it seems like things are being misrepresented here.

    So, first of all, what they call "Anti-Learning" is not the reversal of the classifier's predictions, but the phenomenon of a classifier achieving less-than-chance performance on the test set (or sets, as they are cross-validating). They claim that, rather than being a sign of overfitting to the training data, this situation arises because the structure of the population is such that many dissimilar cases are summarized under a single label. The sample is then bound to misrepresent the population, analogous to an XOR situation in which one of the four combinations (00, 01, 10, 11) is missing from the sample, which necessarily leads to an incorrect classification of new data points showing that combination.
    They substantiate the distinction between anti-learning and overfitting by showing that, for a learnable data set, a neural network's performance on both the training AND test sets increases as the model grows more flexible (more hidden nodes), but only up to the point where test performance starts decreasing again (as the model starts overfitting). For a data set that results in anti-learning, by contrast, test performance stays below 50% throughout, despite an increase in training-set performance. (And for a random data set, test performance stays around 50%.)
    (I don't really find this convincing. The classifier picks up on distinctions suggested by the sample that don't hold up in the general population – that's overfitting to me, and this is just a particular case where the sample is systematically unrepresentative.)

    The classifiers they tried were not only linear ones: they used Bayesian networks, logistic regression, naive Bayes, classification-tree approaches, MLPs and SVMs (though I don't understand why this list doesn't include KNN), all of which performed poorly when trained on sets of 35, 45 or 55 features. As stages I and IV could be classified reliably, their analyses focus on distinguishing stages II and III – so it's a binary classification problem. This is relevant because their trick of flipping the classification wouldn't work otherwise: reversing the predictions only makes sense when the reverse is actually specified (0 instead of 1 and vice versa), which it wouldn't be were there four classes.

    Lastly, though, I don't see where the 78% accuracy he reports comes from. From their 2015 paper, all I see is the accuracy they get when they use an ensemble of different classifiers (half of them are trained to perform well, while the other half is trained to perform terribly and gets reversed), and – and this is what he really should have mentioned if that's what he is talking about – they only get this higher accuracy in a subset of the cases, namely those where the respective ensemble agrees. So they get the highest accuracy (~90%) for the cases where all six classifiers give the same label, but that is also a very small subset of the sample (29 data points).
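
    A minimal sketch of that missing-XOR-combination story (toy Gaussian clusters with a kNN stand-in classifier; this is not the papers' data or method):

        import numpy as np
        from sklearn.neighbors import KNeighborsClassifier

        rng = np.random.default_rng(0)

        def cluster(cx, cy, label, n=50):
            # A small Gaussian blob around (cx, cy) carrying one fixed label.
            return rng.normal((cx, cy), 0.1, size=(n, 2)), np.full(n, label)

        # XOR labelling: (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0,
        # but the (1,1) combination never appears in training.
        parts = [cluster(0, 0, 0), cluster(0, 1, 1), cluster(1, 0, 1)]
        X_train = np.vstack([p[0] for p in parts])
        y_train = np.concatenate([p[1] for p in parts])
        X_test, y_test = cluster(1, 1, 0)

        knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
        pred = knn.predict(X_test)
        print((pred == y_test).mean())       # ~0.0: systematically below chance
        print((1 - pred == y_test).mean())   # ~1.0 once the labels are reversed

    The nearest training clusters to (1,1) both carry label 1, so the classifier is consistently wrong there; that is the below-chance behaviour the papers call anti-learning.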

  15. I would be curious how "anti-learning" works precisely, and how well it generalizes compared to adding quadratic interaction terms or using a simple three-layer neural network.

  16. I'm not a data scientist… it sounded like it was hard for them to weave a line between all 200 dimensions of data points when they were trying to sort patients into 2 groups. I think the solution they went with was to instead concentrate on producing the worst result… and then just flip it. It's basically like saying: we want to toss a biased coin and produce heads. If we know the coin is biased to produce tails 80% of the time, then all we need to do is toss the coin and flip the answer = an 80% chance of heads. The thing I don't understand is why it is easier to classify the data wrongly, i.e. why it is easier to wrongly conclude that a sick patient is healthy, or vice versa, when looking at those 200 data points (and get that wrong answer 80% of the time), than to find the right answer 80% of the time without the "flipping". Perhaps that's where I need to learn the maths 😉

  17. Sir James Goldsmith used this technique back in the '70s. He'd phone up a few stockbrokers and ask their opinion. If they all said, "Sell, sell! Get out!" he'd buy, and if they all shouted, "Buy, buy! It's going through the roof!" he'd sell.

  18. I can see how an SVM might fail in the example he gave, because it tries to linearly separate the data, but wouldn't a simple decision tree be able to learn data grouped in that way? Of course, I realize the 2D example is illustrative and that the actual dataset is higher-dimensional.
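
    A quick check suggests the commenter is right in the 2D toy case (scikit-learn assumed; whether this carries over to the real 200-feature data is another question):

        import numpy as np
        from sklearn.svm import LinearSVC
        from sklearn.tree import DecisionTreeClassifier

        rng = np.random.default_rng(1)

        # Four Gaussian clusters with XOR labels: opposite corners share a class.
        centers = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
        X = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in centers])
        y = np.repeat([0, 1, 1, 0], 100)

        for clf in (LinearSVC(max_iter=10_000), DecisionTreeClassifier(max_depth=2)):
            print(type(clf).__name__, round(clf.fit(X, y).score(X, y), 2))
        # The linear SVM stays near chance (no single line separates XOR),
        # while the depth-2 tree fits it almost perfectly with one split per axis.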

  19. Great story about how research works in real life. Thanks, I appreciate it.

    I'm reading Rojas' "Neural Networks: A Systematic Introduction"; it discusses the XOR problem in the context of neural networks, McCulloch-Pitts units and perceptrons, and frames it with Minsky's research contributions, et cetera. It's a good book, check it out, people!

  20. This is the first time I've disliked a video. I have decent machine-learning knowledge; not explaining the actual method named in the video's title is just disappointing.

  21. There are seven and a half minutes of video explaining a problem, leading into a 10-second overview of how it was solved. It is still completely unclear to me what this video is trying to explain.

    Was there more to this interview that was edited out for some reason?
