Is there a better way to choose the threshold?
Created by: kdlin
According to the documentation, the threshold is determined by the "n_samples * contamination" most abnormal samples in "decision_scores_". While this is useful in some applications, I wonder if one can do better for general use. In other words, simply picking the X samples with the highest decision_scores_ and ignoring the scores themselves may not be the best idea.
As an example, say we have 1,000 samples and the contamination is 0.1. It looks like it will pick 100 samples (1,000 * 0.1) as the outliers, regardless of the actual decision_scores_ of these 1,000 samples. If 999 of them have low scores and only 1 has a high score, it still picks 100 samples.
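
To make the concern concrete, here is a minimal NumPy-only sketch. It assumes the threshold is taken as the (1 - contamination) percentile of decision_scores_, which is my reading of the documentation rather than something I have verified in the code; the helper name `contamination_threshold` and the synthetic scores are made up for illustration.

```python
import numpy as np

def contamination_threshold(scores, contamination):
    # Hypothetical helper: cut at the (1 - contamination) percentile of the
    # scores, so roughly n_samples * contamination samples end up above it.
    return np.percentile(scores, 100 * (1 - contamination))

rng = np.random.default_rng(0)

# 999 samples with low scores in [0, 1) and a single clearly high score.
scores = rng.uniform(0.0, 1.0, size=1000)
scores[0] = 50.0

threshold = contamination_threshold(scores, contamination=0.1)
labels = (scores > threshold).astype(int)  # 1 = flagged as outlier
n_flagged = int(labels.sum())

print(f"threshold = {threshold:.3f}, flagged = {n_flagged}")
```

Under that assumption the number of flagged samples is fixed by contamination alone: the one sample with a score of 50.0 is flagged together with 99 samples whose scores are below 1.0.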