Spam filtering, Part I. Spam filters are built on principles similar to those used in logistic regression. We fit a model for the probability that each message is spam. We have several email variables for this problem: to multiple, cc, attach, dollar, winner, inherit, password, format, re subj, exclaim subj, and sent email. For brevity, we won't describe what each variable means here, but each is either a numerical or an indicator variable.
(a) For variable selection, we fit the full model, which includes all variables, and then fit each model in which exactly one of the variables has been dropped. The AIC value for each of these models is reported below. Based on these results, which variable, if any, should we drop as part of model selection? Explain.
Variable Dropped    AIC
None Dropped        1863.50
to multiple         2023.50
cc                  1863.18
attach              1871.89
dollar              1879.70
winner              1885.03
inherit             1865.55
password            1879.31
format              2008.85
re subj             1904.60
exclaim subj        1862.76
sent email          1958.18
(b) Consider the following model selection stage. Here again we've computed the AIC for each leave-one-variable-out model. Based on the results, which variable, if any, should we drop as part of model selection? Explain.
Variable Dropped    AIC
None Dropped        1862.41
to multiple         2019.55
attach              1871.17
dollar              1877.73
winner              1884.95
inherit             1864.52
password            1878.19
format              2007.45
re subj             1902.94
sent email          1957.56
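One way to check the comparisons in parts (a) and (b) is to code a single backward-elimination step. The sketch below assumes the usual AIC rule (lower is better: drop the variable whose removal gives the lowest AIC, but only if that AIC is below the full model's). The function name `backward_step` and the dictionary layout are illustrative choices, not part of the exercise; the AIC values are copied from the two tables above.

```python
def backward_step(aic_by_dropped):
    """One backward-elimination step using AIC (lower is better).

    aic_by_dropped maps the name of the dropped variable to the AIC of the
    resulting model; the key "None Dropped" holds the AIC of the full model.
    Returns (variable_to_drop, aic), or (None, full_aic) if no single
    removal lowers the AIC.
    """
    full_aic = aic_by_dropped["None Dropped"]
    candidates = {k: v for k, v in aic_by_dropped.items() if k != "None Dropped"}
    best_var = min(candidates, key=candidates.get)  # removal with the lowest AIC
    if candidates[best_var] < full_aic:
        return best_var, candidates[best_var]
    return None, full_aic


# AIC values from part (a)
stage_a = {
    "None Dropped": 1863.50, "to multiple": 2023.50, "cc": 1863.18,
    "attach": 1871.89, "dollar": 1879.70, "winner": 1885.03,
    "inherit": 1865.55, "password": 1879.31, "format": 2008.85,
    "re subj": 1904.60, "exclaim subj": 1862.76, "sent email": 1958.18,
}

# AIC values from part (b)
stage_b = {
    "None Dropped": 1862.41, "to multiple": 2019.55, "attach": 1871.17,
    "dollar": 1877.73, "winner": 1884.95, "inherit": 1864.52,
    "password": 1878.19, "format": 2007.45, "re subj": 1902.94,
    "sent email": 1957.56,
}

for label, stage in [("(a)", stage_a), ("(b)", stage_b)]:
    variable, aic = backward_step(stage)
    print(label, "drop:", variable, "AIC:", aic)
```

A result of `None` for the variable means that every single-variable removal raises the AIC, so the current model is kept.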
Spam filtering, Part II. In Part I, we encountered a data set where we applied logistic regression to aid in spam classification for individual emails. In this exercise, we've taken a small set of these variables and fit a formal model with the following output:
              Estimate   Std. Error   z value   Pr(>|z|)
(Intercept)    -0.8124       0.0870     -9.34     0.0000
to multiple    -2.6351       0.3036     -8.68     0.0000
winner          1.6272       0.3185      5.11     0.0000
format         -1.5881       0.1196    -13.28     0.0000
re subj        -3.0467       0.3625     -8.40     0.0000
(a) Write down the model using the coefficients from the model fit.
(b) Suppose we have an observation where to multiple = 0, winner = 1, format = 0, and re subj = 0. What is the predicted probability that this message is spam?
(c) Put yourself in the shoes of a data scientist working on a spam filter. For a given message, how high must the probability that the message is spam be before you think it would be reasonable to put it in a spambox (which the user is unlikely to check)? What tradeoffs might you consider? Any ideas about how you might make your spam-filtering system even better from the perspective of someone using your email service?
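For parts (a) and (b), the arithmetic can be checked with a short script. This is a minimal sketch that assumes the standard logistic regression form, log(p / (1 - p)) = b0 + b1 x1 + ... + bk xk, with the estimates copied from the output table above; the names `coef` and `predicted_probability` are illustrative.

```python
import math

# Estimates copied from the model output above
coef = {
    "(Intercept)": -0.8124,
    "to multiple": -2.6351,
    "winner": 1.6272,
    "format": -1.5881,
    "re subj": -3.0467,
}


def predicted_probability(x):
    """Inverse logit of the linear predictor for one observation.

    x maps each variable name to its observed value.
    """
    # Linear predictor: intercept plus coefficient times value for each variable
    eta = coef["(Intercept)"] + sum(coef[name] * value for name, value in x.items())
    return 1.0 / (1.0 + math.exp(-eta))


# Observation from part (b)
obs = {"to multiple": 0, "winner": 1, "format": 0, "re subj": 0}
p_spam = predicted_probability(obs)
print(round(p_spam, 3))
```

Changing the values in `obs` shows how each indicator moves the predicted probability, which can help when weighing the threshold question in part (c).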