Skip to content

mkinlan/saveR

Repository files navigation

saveR

The goal of saveR is to provide the user with more flexibility during EDA by removing rows or columns from a data frame based on the percentage of data present in the row or column.

Installation

You can install the development version of saveR from GitHub with:

# install.packages("pak")
pak::pak("mkinlan/saveR")

Example

Say you have a data set with nulls (aka, every data set) about products:

library(saveR)
## 
df <- data.frame(
  ID    = c(1234,2145,3468,2487,1222),
  Product_Name = c("ProdA", "ProdB", "ProdC", "ProdD", "ProdE"),
  Weight      = c(NA, 30, 22, 28, 35),
  Shipping_Zip  = c("12345", "23456", "34567", "45678", "56789"),
  Cost    = c(50000, NA, 55000, NA, NA),
  Phase    = c("Dev", "Prod", "Dev", "Obsolete", "Prod"),
  stringsAsFactors = FALSE  
)

df
#>     ID Product_Name Weight Shipping_Zip  Cost    Phase
#> 1 1234        ProdA     NA        12345 50000      Dev
#> 2 2145        ProdB     30        23456    NA     Prod
#> 3 3468        ProdC     22        34567 55000      Dev
#> 4 2487        ProdD     28        45678    NA Obsolete
#> 5 1222        ProdE     35        56789    NA     Prod

You can easily clean a data set like this using a number of methods to remove nulls throughout the df:

df1<-na.omit(df)
df1
#>     ID Product_Name Weight Shipping_Zip  Cost Phase
#> 3 3468        ProdC     22        34567 55000   Dev

or

df2<-df %>% filter(is.na(df$Cost))
df2
#> Time Series:
#> Start = 1 
#> End = 5 
#> Frequency = 1 
#>   [,1] [,2] [,3] [,4] [,5] [,6]
#> 1   NA   NA   NA   NA   NA   NA
#> 2   NA   NA   NA   NA   NA   NA
#> 3 5866    7   NA    7   NA    6
#> 4   NA   NA   NA   NA   NA   NA
#> 5   NA   NA   NA   NA   NA   NA
df3<-df %>% tidyr::drop_na()
df3
#>     ID Product_Name Weight Shipping_Zip  Cost Phase
#> 1 3468        ProdC     22        34567 55000   Dev

or

#select rows that are complete for column "Income"
df4<-df[complete.cases(df[ , 5]),]
df4
#>     ID Product_Name Weight Shipping_Zip  Cost Phase
#> 1 1234        ProdA     NA        12345 50000   Dev
#> 3 3468        ProdC     22        34567 55000   Dev

But, maybe you don’t want to get rid of all of the nulls in a row or column. For me, this is often because I’m exploring and cleaning a wide data set, and sometimes there so many columns that the easiest thing to do is to cut the whole thing down by chopping out the columns with blanks.

But, by doing that, one risks the possibility of losing columns that are majority non-null, simply because they contain some null values. In situations like this, there might be some value in filtering out percentages of nulls. Then, the remaining data frame could be further assessed to determine if it would be worth it to take steps to try and fill in the remaining missing data.

For example, in the data set above, Cost is missing a number of data points, but Weight is only missing one. Additionally, Weight likely refers to a product weight, which is probably standardized. It’s likely that there is a table of product weights somewhere, and this could be used to fill in missing data.

Using the save_cols function from saveR would let us easily locate columns like Weight, where filling in missing data might be a fairly easy fix:

result<-save_cols(df,0.75)
result
#>     ID Product_Name Weight Shipping_Zip    Phase
#> 1 1234        ProdA     NA        12345      Dev
#> 2 2145        ProdB     30        23456     Prod
#> 3 3468        ProdC     22        34567      Dev
#> 4 2487        ProdD     28        45678 Obsolete
#> 5 1222        ProdE     35        56789     Prod

The resulting data frame has lost the Cost column, which was majority blank, but kept the Weight column, because 75% of the column contained non-null data.

About

A package to save good data from messy data frames!

Resources

License

Unknown, MIT licenses found

Licenses found

Unknown
LICENSE
MIT
LICENSE.md

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages