[Getting and Cleaning data] Week 3

  • Week 3
  • Subsetting and Sorting
  • Summarizing data
  • Creating new variables
  • Reshaping data
  • Managing data frame with dplyr package Introduction
  • Managing data frames with dplyr package Basic tools
  • Merging data


  • For more details, see the html file here.

    Week 3


    Subsetting and Sorting


    Once you have loaded you data into R, what you might want to do is manipulate that data, and so you can set it to be a tidy data set: variables in the columns and observations in the rows, and only the observations that you want to be able to analysis.
  • Step 1: Subsetting - quich review
  • set.seed(13435)
    X "var1" = sample(1:5), "var2" = sample(6:10), "var3" = sample(11:15))
    X 1:5), ]
    X$var2[c(1,3)] # select the first column 
    X[ ,1]
    X[ ,"var1"]
    X$var1
    # select both column and rows
    X[1:2, 2]
    X[1:2, "var2"]
    X[1:2, ]$var2
  • Step 2: Logicals ands and ors
  • X[(X$var1 <= 3 & X$var3 > 11), ]
    X[(X$var1 <= 3 | X$var3 > 15), ]
  • Step 3: Dealing with missing value “NA”
  • X[which(X$var2 > 8), ]
    X[X$var2 > 8, ] 

    From the second line, we can see that if the column has NAs, we cannot use logical statement to select observations, and we need to use which command.
  • Step 4: Sorting
  • sort(X$var1)
    sort(X$var1, decreasing = TRUE)
    # Put NA last
    sort(X$var2, na.last = TRUE) 
    # Put NA first
    sort(X$var2, na.last = FALSE) 
    # NA removes
    sort(X$var2)
    # order data frame by variable "var1"
    X[order(X$var1), ] 
    # order data frame by multiple variable
    X[order(X$var1, X$var3), ] 
  • Step 5: Ordering with plyr package
  • library(plyr)
    arrange(X, var1) # 
    arrange(X, desc(var1))
  • Step 6: Adding rows and columns
  • # one way to add column
    X$var4 5) 
    X
    # the other way to add column
    Y 5))
    Y

    Summarizing data

  • Step 1: Getting the data from the website
  • if(!file.exists("./data")) dir.create("./data")
    fileUrl "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
    download.file(fileUrl, destfile = "./data/restaurants.csv")
    restData "./data/restaurants.csv")
  • Step 2: Look at a bit of the data
  • # top 3 rows
    head(restData, 3)
    # last 3 rows
    tail(restData, 3)
  • Step 3: Make summary
  • summary(restData)
  • Step 4: More in depth information
  • str(restData)
  • Step 5: Quantiles of quantitative variable
  • quantile(restData$councilDistrict, na.rm = TRUE)
    quantile(restData$councilDistrict, probs = c(0.5, 0.75, 0.9))
  • Step 6: Make table
  • # useNA = "ifany" gives you the NO.of NAs, the default removing NAs
    table(restData$zipCode, useNA = "ifany")
    # two dimensional table
    table(restData$councilDistrict, restData$zipCode)
  • Step 7: Check for missing values
  • sum(is.na(restData$councilDistrict))
    any(is.na(restData$councilDistrict))
    all(restData$zipCode > 0)
  • Step 8: Row and column sums
  • colSums(is.na(restData))
    all(colSums(is.na(restData)) == 0)
  • Step 9: Values with specific charactor
  • table(restData$zipCode %in% c("21212"))
    table(restData$zipCode %in% c("21212", "21213"))
    head(restData[restData$zipCode %in% c("21212", "21213"), ], 3)
  • Step 10: Cross tables
  • data("UCBAdmissions")
    DF as.data.frame(UCBAdmissions)
    summary(DF)
    # entry stands for sum of Freq
    xt # this is different table: entry stands for NO. of cross observation
    with(DF, table(Gender, Admit))
  • Step 11: Flat tables
  • warpbreaks$replicate 1:9, len = 54)
    xt 
  • Step 12: Size of a data size
  • fakeData 1e5)
    object.size(fakeData)
    print(object.size(fakeData), units = "MB")

    Creating new variables


    Why create new variables?
  • Ofen the raw data won’t have a value you are looking for.
  • You will need to transform the data to get the values you would like.
  • Usually you will add those values to the data frame you are working with.
  • Common variables to create:
  • Missingness indicators.
  • “Cutting up” quantitative variabels.
  • Applying transforms

  • Step 1: Getting data from website
  • if(!file.exists("./data")) dir.create("./data")
    fileUrl "https://data.baltimorecity.gov/api/views/k5ry-ef3g/rows.csv?accessType=DOWNLOAD"
    download.file(fileUrl, destfile = "./data/restaurants.csv")
    restData "./data/restaurants.csv")
  • Step 2: Creating sequences
  • # creating sequence by 
    s1 1, 10, by = 2)
    s1
    # creasting sequence length
    s2 1, 10, length = 3)
    s2
    # creating sequence base on a vactor
    x 1, 3, 8, 25, 100)
    seq(along = x)
  • Step 3: Sebsetting variable
  • restData$nearMe in% c("Roland Park", "Homeland")
    table(restData$nearMe)
  • Step 4: Creating binary variable
  • restData$zipWrong 0, TRUE, FALSE)
    table(restData$zipWrong, restData$zipCode<0)
  • Step 5: Creating categorical variables
  • restData$zipGroups 
  • Step 6: Easier cutting
  • library(Hmisc)
    restData$zipGroups 4)
    table(restData$zipGroups)
  • Step 7: Creating factor variables
  • restData$zcf #restData$zcf
    class(restData$zcf)
  • Step 8: Levels of factor variables
  • yesno "yes", "no"), size = 10, replace = TRUE)
    yesnofac "yes", "no"))
    yesnofac
    relevel(yesnofac, ref = "no")
    as.numeric(yesnofac) # name the ref level be 1, the other be 2
    as.numeric(relevel(yesnofac, ref = "no"))
  • Step 9: Cutting produces factor variables
  • class(restData$zipGroups)
  • Step 10: Using the mutate function
  • library(Hmisc)  # cut2
    library(plyr)  # mutate
    restData2 4))
    table(restData2$zipGroups)
  • Step 11: Common transforms
  • abs(x) absolute value
  • sqrt(x) square root
  • ceiling(x) ceiling(3.14) = 4
  • floor(x) floor(3.14) = 3
  • round(x, digits = n) round(3.1415926, digits = 2) = 3.14
  • signif(x, digits = n) signif(3.1415926, digits = 2) = 3.1
  • cos(x) or sin(x) etc.
  • log(x) natural logarithm
  • log2(x) or log10(x) other common logs.
  • exp(x) exponentiating x


  • Reshaping data


    The goal is tidy data: - Each variable forms a column - Each observation forms a row - Each table/file stores data about one kind of observation.
  • Step 1: Start with reshaping
  • library(reshape2)
    head(mtcars)
  • Step 2: Melting data frames
  • mtcars$carname # make multiple columns to one variable
    carMelt "carname", "gear", "cyl"), measure.vars = c("mpg", "hp") )
    # carMelt: a tall and skinny data set
    head(carMelt, 3)
    tail(carMelt, 3)
  • Step 3: Casting data frame
  • # work on the melt data set
    clyData 
  • Step 4: Averaging values
  • head(InsectSprays)
    tapply(InsectSprays$count, InsectSprays$spray, sum)
  • Step 5: Another way: split
  • spIns 
  • Step 6: Another way - combine
  • unlist(sprCount)
    sprCount 
  • Step 9: Another way - plyr package
  • library(plyr)
    ## ddply: Split data frame, apply function, and return results in a data frame.
    #ddply(InsectSprays, .(spray), summarize, sum = sum(count))
    # Error in .fun(piece, ...) : argument "by" is missing, with no default
    ddply(InsectSprays, .(spray), plyr::summarize, sum = sum(count))
  • Step 10: Creating a new variable
  • spraySums 

    Managing data frame with dplyr package – Introduction


    The data frame is a key data structure in statistics and in R.
  • There is one observarion per row.
  • Each column represents a variable or measure or characteristic.
  • Primary implementation that you will use is the default R implementation.
  • Other implementations, particularly relational databases systems.
  • dplyr
  • Developed by hadley Wickham of Rstudio.
  • An optimized and distilled version of plyr package.
  • Does not provide any “new” functionality per se, but greatly simplifies existing functionality in R.
  • Provides a “grammar” (in particular verbs) for data manipulation.
  • Is very fast, as many key operation are coded in C++.
  • dplyr Verbs:
  • select : return a subsets of columns of a data frame.
  • filter : extracts a subsets of rows of a data frame based on a logical condition.
  • arrange : reorder rows of a data frame.
  • rename : rename variables in a data frame.
  • mutate : add new variable/columns or transforming existing variables.
  • summarize/ summarise : generate summary statistics of different variables in the data frame, possible within strata.

  • There is also a handy print method that prevents you from printing a lot of data to the console.

    Managing data frames with dplyr package – Basic tools

  • Step 1: download data and load data into R
  • library(dplyr)
    options(width = 105)
    # download data and load data into R
    fileUrl "https://raw.github.com/DataScienceSpecialization/courses/master/03_GettingData/dplyr/chicago.rds"
    if(!file.exists("./data")) dir.create("./data")
    download.file(fileUrl, destfile = "./data/chicago.rds", mode = "wb")
    chicago "./data/chicago.Rds")
  • Step 2: Summary statistics
  • dim(chicago)
    str(chicago)
    names(chicago)
  • Step 3: Select columns using select command.
  • head(select(chicago, city:dptp))
    head(select(chicago, -(city:dptp)))
    i "city", names(chicago))
    j "dptp", names(chicago))
    head(chicago[, -(i:j)])
  • Step 4: Select rows using filter command.
  • chic.f  30)
    head(chic.f, 3)
    chic.f 30 & tmpd > 80)
    head(chic.f, 3)
    chicago 2)
    tail(chicago, 2)
  • Step 5: Reorder data frame using arrange command.
  • chicago 2)
    tail(chicago, 2)
  • Step 6: Rename variables using rename command.
  • chicago 3)
  • Step 7: Adding new variable using mutate command.
  • chicago 3)
    chicago 1*(tmpd >80), labels = c("cold", "hot")))
  • Step 8: Summarize grouped data using summarize command.
  • hotcold as.POSIXlt(date)$year+1900)
    years 
  • Step 9: Making chain command using %>% command.
  • chicago %>% mutate(month = as.POSIXlt(date)$mon + 1) %>% group_by(month) %>% summarize(pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), mo2 = median(no2tmean2))
    # =
    summarize(group_by(mutate(chicago, month = as.POSIXlt(date)$mon+1), month), pm25 = mean(pm25, na.rm = TRUE), o3 = max(o3tmean2), mo2 = median(no2tmean2))

    Merging data

  • Step 1: Peer review data
  • if(!file.exists("./data")) dir.create("./data")
    # download data set
    fileUrl1 "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
    fileUrl2 "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
    download.file(fileUrl1, destfile = "./data/reviews.csv")
    download.file(fileUrl2, destfile = "./data/solution.csv")
    # load data set
    reviews "./data/reviews.csv")
    solutions "./data/solution.csv")
    # view data set
    head(reviews, 2)
    head(solutions, 2)
  • Step 2: Marging data
  • Merges data frames
  • Important paraeters: x, y, by, by.x, by.y, all

  • names(reviews)
    names(solutions)
    mergeData "solution_id", by.y = "id", all = TRUE)
    head(mergeData)
  • Step 3: Default - merge all common column names
  • intersect(names(solutions), names(reviews))
    mergeData2 2)
  • Step 4: Using join in the plyr package.
  • df1 1:10), x = rnorm(10))
    df2 1:10), y = rnorm(10))
    arrange(join(df1, df2), id)
  • Step 5: If you have multiple data frames
  • df3 1:10), z = rnorm(10))
    dfList = list(df1, df2, df3)
    join_all(dfList)

    좋은 웹페이지 즐겨찾기