Eliminating NAs in matched data in R
This was bugging the shit out of me in our is271B (quant methods) final. I had three arrays with NAs in different rows of each, and I needed to get rid of the NAs while preserving the pairwise matching of the observations. In the R code for my final I did it by brute force: eliminate the NAs in one array, drop the corresponding datapoints from the other two, and repeat for each array in turn, like so:
> climaten <- climate[!is.na(climate)]
> urbann <- urban[!is.na(climate)]
> cropgrown <- cropgrow[!is.na(climate)]
> climatenn <- climaten[!is.na(urbann)]
> urbannn <- urbann[!is.na(urbann)]
> cropgrownn <- cropgrown[!is.na(urbann)]
> climaten <- climatenn[!is.na(cropgrownn)]
> urbann <- urbannn[!is.na(cropgrownn)]
> cropgrown <- cropgrownn[!is.na(cropgrownn)]
As you can see, if you're dealing with a ginormous dataset with many variables, this becomes hella tedious hella fast.
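For what it's worth, the same brute-force idea can be squeezed into one pass by building a single logical mask that is TRUE only where every vector has a value. This is just a sketch: the toy values below are made up and stand in for the real world95 variables, and `keep` is a name I picked for illustration.

```r
# Toy vectors standing in for the real variables (made-up values)
climate  <- c(1, NA, 3, 2, 1)
urban    <- c(40, 55, NA, 70, 62)
cropgrow <- c(10, 12, 9, NA, 15)

# TRUE only at positions where all three vectors are non-NA
keep <- !is.na(climate) & !is.na(urban) & !is.na(cropgrow)

# Apply the same mask to each vector so the observations stay matched
climaten  <- climate[keep]
urbann    <- urban[keep]
cropgrown <- cropgrow[keep]
```

It still means typing every variable name twice, so it doesn't scale much better once the variable list gets long.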
This bugged me sufficiently that I figured out how real R users do it. They put all the variables in question in an R data frame and then use the function na.omit() to remove every row that contains an NA (for more complex kinds of extraction there's the subset() function). Here's how to do it in R:
> library("foreign")
> dat <- read.dta("world95.dta")
>
> #attach (make available) data variables
> attach(dat)
>
> #put vars in question in data frame to clean NAs
> dattmp <- data.frame(climate,urban,cropgrow)
>
> #this removes rows with any NAs
> datclean <- na.omit(dattmp)
>
> #take vars, cleaned of NAs, back from data frame to array
> #(the column is named cropgrow, same as the variable that went in)
> climaten <- datclean$climate
> urbann <- datclean$urban
> cropgrown <- datclean$cropgrow
>
> #a quick boxplot
> boxplot(urbann ~ climaten)
>
> #... we can continue from here.
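One thing to watch: na.omit() drops a row if any column in the data frame has an NA, which is why the variables get copied into their own small data frame first. A rough alternative, sketched here on a made-up toy data frame (the `other` column and all values are hypothetical), is to use complete.cases() on just the columns you care about, and subset() if you also want to filter on a condition.

```r
# Hypothetical toy data frame; values are made up for illustration
dat <- data.frame(
  climate  = c(1, NA, 3, 2, 1),
  urban    = c(40, 55, NA, 70, 62),
  cropgrow = c(10, 12, 9, NA, 15),
  other    = c(NA, 5, 6, 7, 8)   # a column we are not analysing
)

# na.omit() on the whole frame also drops row 1, because `other` is NA there
all_clean <- na.omit(dat)

# complete.cases() restricted to the columns of interest keeps row 1
vars <- c("climate", "urban", "cropgrow")
datclean <- dat[complete.cases(dat[, vars]), ]

# subset() combines a row condition with a column selection
urban_hi <- subset(datclean, urban > 50, select = c(climate, urban, cropgrow))
```

Keeping everything in `datclean` and referring to columns as `datclean$urban` also avoids juggling separate cleaned vectors.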