Predictive modeling with missing data

[This article was first published on Jason Bryer, and kindly contributed to R-bloggers].


Most predictive modeling strategies require that there be no missing data for model estimation. When there are missing data, there are generally two strategies for working with them: 1.) exclude the variables (columns) or observations (rows) where data are missing; or 2.) impute the missing data. However, data are often missing in systematic ways. Excluding data from training ignores potentially predictive information, and for many imputation procedures the missing completely at random (MCAR) assumption is violated. The medley package implements a solution for modeling when there are systematic patterns of missingness.

A working example of predicting student retention, drawn from a larger study of the Diagnostic Assessment and Achievement of College Skills (DAACS), will be explored. In this study, demographic data were collected from all students at enrollment, and students then completed diagnostic assessments in self-regulated learning (SRL), writing, mathematics, and reading during their first few weeks of the semester. Although all students were expected to complete DAACS, there was no consequence for not doing so, and therefore a large percentage of students completed none or only some of the assessments. The resulting dataset has three dominant response patterns: 1.) students who completed all four assessments; 2.) students who completed only the SRL assessment; and 3.) students who did not complete any of the assessments.

The purpose of the medley algorithm is to take advantage of missing data patterns. For this example, the medley algorithm trains three predictive models: 1.) demographics plus all four assessments; 2.) demographics plus the SRL assessment; and 3.) demographics only. For both training and prediction, the model used for each student is based on which data are available. That is, if a student completed only the SRL assessment, model 2 would be used. The medley algorithm can be used with most statistical models. Both logistic regression and random forest are used in this study.

The accuracy of the medley algorithm was 3.5% better than using only the complete data and 3.1% better than using a dataset where missing data were imputed using the mice package. The medley package provides an approach to predictive modeling using the same training and prediction framework R users are accustomed to. There are numerous parameters that can be modified, including which underlying statistical models are used for training. Additional diagnostic functions are available to explore missing data patterns.
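The pattern-based approach described above can be sketched in base R. This is only a minimal illustration, not the medley package's implementation or API: the simulated data, the column names (`age`, `srl`, `math`, `retained`), and the helper functions (`observed_pattern`, `fit_by_pattern`, `predict_by_pattern`) are all hypothetical. It also assumes that each student contributes only to the model matching their own response pattern, which is one reading of the description above.

```r
# Simulated data: every name here is hypothetical and for illustration only.
set.seed(2025)
n <- 600
students <- data.frame(
  age  = round(rnorm(n, mean = 30, sd = 8)),
  srl  = rnorm(n),
  math = rnorm(n)
)
students$retained <- rbinom(
  n, 1,
  plogis(-0.2 + 0.02 * (students$age - 30) + 0.6 * students$srl + 0.8 * students$math)
)

# Impose three systematic response patterns (assumed, mirroring the example).
group <- sample(1:3, n, replace = TRUE)
students$math[group >= 2] <- NA  # completed SRL only, or nothing
students$srl[group == 3]  <- NA  # completed no assessments

predictors <- c("age", "srl", "math")

# Label each row by the set of predictors that are observed for it.
observed_pattern <- function(data, predictors) {
  apply(!is.na(data[, predictors, drop = FALSE]), 1, function(observed) {
    paste(predictors[observed], collapse = "+")
  })
}

# Fit one logistic regression per pattern, using only the observed predictors.
fit_by_pattern <- function(data, outcome, predictors) {
  patterns <- observed_pattern(data, predictors)
  lapply(split(data, patterns), function(rows) {
    available <- predictors[!is.na(unlist(rows[1, predictors]))]
    glm(reformulate(available, response = outcome),
        data = rows, family = binomial())
  })
}

# Predict each row with the model that matches its own pattern.
predict_by_pattern <- function(models, newdata, predictors) {
  patterns <- observed_pattern(newdata, predictors)
  predicted <- rep(NA_real_, nrow(newdata))
  for (p in intersect(unique(patterns), names(models))) {
    rows <- patterns == p
    predicted[rows] <- predict(models[[p]],
                               newdata = newdata[rows, , drop = FALSE],
                               type = "response")
  }
  predicted
}

# Train on a random subset and evaluate on the rest.
train_rows <- sample(n, size = 400)
train <- students[train_rows, ]
test  <- students[-train_rows, ]

models <- fit_by_pattern(train, "retained", predictors)
test$predicted <- predict_by_pattern(models, test, predictors)
accuracy <- mean((test$predicted > 0.5) == (test$retained == 1))

# How many students fall into each missing data pattern.
print(table(observed_pattern(students, predictors)))
print(accuracy)
```

Any row whose pattern was never seen in training keeps a prediction of `NA` in this sketch; a fuller implementation would need a fallback, such as the demographics-only model.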

To register for the conference, go to https://user2025.r-project.org

Session schedule: https://user2025.r-project.org/program/in-person/

For more information about the project, go to: https://github.com/jbryer/medley
