Posts

[ML Question] Handling Data Skew

Assume that we have a dataset consisting of two classes, A and B, whose samples are in the ratio 1:9. We want to train a logistic regression model on this dataset. Would you downsample B or upsample A? Are the two options equivalent? If they are not equivalent, what are the advantages and disadvantages of each?
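
To make the two options concrete, here is a minimal sketch in plain Python. It assumes 100 samples of class A and 900 of class B (any 1:9 split would do) and uses random sampling; the helper names `downsample_majority` and `upsample_minority` are illustrative, not taken from any library.

```python
import random

def downsample_majority(minority, majority, seed=0):
    """Keep every minority sample and a random subset of the majority of equal size."""
    rng = random.Random(seed)
    return minority, rng.sample(majority, len(minority))

def upsample_minority(minority, majority, seed=0):
    """Keep every sample and add randomly repeated minority samples until the classes match."""
    rng = random.Random(seed)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return minority + extra, majority

# Hypothetical 1:9 dataset: 100 samples of class A, 900 of class B.
class_a = [("A", i) for i in range(100)]
class_b = [("B", i) for i in range(900)]

down_a, down_b = downsample_majority(class_a, class_b)
up_a, up_b = upsample_minority(class_a, class_b)

print(len(down_a), len(down_b))  # balanced training set of 200 samples
print(len(up_a), len(up_b))      # balanced training set of 1800 samples
print(len(set(up_a)))            # number of distinct A samples after upsampling
```

Both results are balanced, but they are not the same training set: downsampling discards 800 of the 900 B samples, while upsampling keeps every sample and only repeats existing A samples without adding new information.
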
Recent posts

[ML Questions] Situation of Missing Data

Assume that you have built a logistic regression model for a binary classification problem: predicting network intrusion. The ratio of positive-class samples (there is a network intrusion) to negative-class samples (there is no network intrusion) was 1:99 in the training data. A large number of features (approx. 1000) were used to train the model. All the conventional ML wisdom was applied during training, including proper scaling of the training data, cost-sensitive training, feature normalization, etc. After deploying the system to production, you observe that a large percentage (approx. 10%) of the actual instances have missing feature values, whereas the training data had 100% coverage for every feature. Relabeling the new samples with missing features will take at least 15 days. What will you do in the meantime, while you wait for the relabeled data?
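
One possible stopgap, sketched here as an assumption rather than as the intended answer, is to fill each missing feature with its mean from the training data so that the existing model can keep scoring incoming instances. The helper names `column_means` and `impute_missing`, the toy two-feature data, and the use of `None` to mark a missing value are all illustrative.

```python
def column_means(rows):
    """Per-feature means computed from fully observed training rows."""
    n_features = len(rows[0])
    return [sum(row[j] for row in rows) / len(rows) for j in range(n_features)]

def impute_missing(row, means):
    """Replace missing values (marked as None) with the training mean of that feature."""
    return [means[j] if value is None else value for j, value in enumerate(row)]

# Toy training data with full feature coverage (two features instead of ~1000).
train = [[1.0, 10.0], [3.0, 30.0]]
means = column_means(train)

# A production instance whose first feature is missing.
filled = impute_missing([None, 25.0], means)
print(means, filled)
```

Whether mean imputation is acceptable depends on why the values are missing; if the missingness is itself related to intrusions, a constant fill may hide that signal.
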

[ML Question] Error of a linear regression line

We are given one-dimensional data. The value of the single feature lies in the closed interval \([-2, 2]\). The target value associated with each data point also lies in the closed interval \([-2, 2]\), such that if we plot the feature value on the x-axis and the target value on the y-axis, we get a circle. Assume that we fit a linear regression line to the data; we can get infinitely many regression lines passing through the origin. Can we derive a formula for the error? Will the error be the same for every regression line passing through the origin?
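
One way to explore the question numerically is sketched below. It assumes the circle is centered at the origin with radius 2 (consistent with both intervals being \([-2, 2]\)), that the data points are evenly spaced on it, and that a line through the origin is written as \(y = mx\); the function names are illustrative. Under those assumptions the mean squared vertical residual works out to \(2(1 + m^2)\), which depends on the slope, while the mean squared perpendicular distance to the line is 2 for every slope.

```python
import math

def circle_points(n=360, radius=2.0):
    """n evenly spaced points on a circle of the given radius centered at the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / n), radius * math.sin(2 * math.pi * k / n))
        for k in range(n)
    ]

def vertical_mse(points, slope):
    """Mean squared vertical residual of the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in points) / len(points)

def perpendicular_mse(points, slope):
    """Mean squared perpendicular distance from the points to the line y = slope * x."""
    return vertical_mse(points, slope) / (1 + slope ** 2)

points = circle_points()
for m in (0.0, 1.0, -3.0):
    print(m, vertical_mse(points, m), 2 * (1 + m * m), perpendicular_mse(points, m))
```

So the answer depends on which error is meant: with the usual vertical residuals the error grows with \(|m|\) and is smallest for the horizontal line \(m = 0\), whereas with perpendicular distances every line through the origin is equally good.
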