[Getting and Cleaning Data] Week 4

  • Week 4
  • Editing text variables
  • Regular expressions
  • Working with dates


  • More details can be found in the HTML file here

    Week 4


    Editing text variables


    Important points about text in data sets
  • Names of variables should be
  • All lower case when possible
  • Descriptive (Diagnosis versus Dx)
  • Not duplicated
  • Free of underscores, dots, and white space

  • Variables with character values
  • Should usually be made into factor variables (depending on the application)
  • Should be descriptive (use TRUE/FALSE instead of 0/1, and Male/Female instead of 0/1 or M/F)

  • Step 1: Fixing character vectors with the toupper() and tolower() functions.
  • if(!file.exists("./data")) dir.create("./data")
    fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
    download.file(fileUrl, destfile = "./data/cameras.csv")
    cameraData <- read.csv("./data/cameras.csv")
    names(cameraData)
    tolower(names(cameraData))
  • Step 2: Fixing character vectors with the strsplit() function.
  • Good for automatically splitting variable names.
  • Important parameters: x and split

  • splitNames <- strsplit(names(cameraData), "\\.")
    splitNames[[5]]
    splitNames[[6]]
  • Step 3: Quick aside on lists.
  • myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))
    head(myList)
  • Step 4: Fixing character vectors with sapply().
  • Applies a function to each element of a vector or list.
  • Important parameters: X and FUN

  • splitNames[[6]][1]
    firstElement <- function(x) x[1]
    sapply(splitNames, firstElement)
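
    Steps 2 and 4 depend on the downloaded camera file, so here is a minimal sketch of the same idea on a small vector of made-up column names. The names in colNames are hypothetical and only stand in for names(cameraData).

```r
# Hypothetical column names standing in for names(cameraData)
colNames <- c("address", "direction", "Location.1", "Count.2")

# Split each name on a literal dot; "\\." escapes the regex metacharacter "."
parts <- strsplit(colNames, "\\.")

# Keep only the piece before the first dot, then lower-case the result
firstElement <- function(x) x[1]
cleaned <- tolower(sapply(parts, firstElement))
print(cleaned)
```

    strsplit() always returns a list with one element per input string, which is why sapply() is needed to pull one piece out of each element.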
  • Step 5: Peer review data.
  • if(!file.exists("./data")) dir.create("./data")
    # download data set
    fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
    fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
    download.file(fileUrl1, destfile = "./data/reviews.csv")
    download.file(fileUrl2, destfile = "./data/solution.csv")
    # load data set
    reviews <- read.csv("./data/reviews.csv")
    solutions <- read.csv("./data/solution.csv")
    # view data set
    head(reviews, 2)
    head(solutions, 2)
  • Step 6: Fixing character vectors with sub() (replaces the first match only).
  • names(reviews)
    sub("_", "", names(reviews))
  • Step 7: Fixing character vectors with gsub() (replaces every match).
  • testName <- "this_is_a_test"
    sub("_", "", testName)
    gsub("_", "", testName)
  • Step 8: Finding values with the grep() and grepl() functions.
  • grep("Alameda", cameraData$intersection) # returns the indices of matching elements
    table(grepl("Alameda", cameraData$intersection)) # returns TRUE or FALSE for each element
    cameraData2 <- cameraData[!grepl("Alameda", cameraData$intersection), ] # rows that do not match "Alameda"
  • Step 9: More on grep().
  • grep("Alameda", cameraData$intersection, value = TRUE) # returns the values containing "Alameda"
    grep("JeffStreet", cameraData$intersection)
    length(grep("JeffStreet", cameraData$intersection))
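
    The difference between the return values of grep() and grepl() is easy to check without the download. The sketch below uses a hypothetical vector of intersection names, not values taken from the Baltimore data set.

```r
# Hypothetical intersections; not taken from the Baltimore data set
intersection <- c("Caton Ave & Benson Ave", "The Alameda & 33rd St",
                  "Harford Rd & The Alameda", "Pulaski Hwy & Moravia Park Dr")

idx     <- grep("Alameda", intersection)                # positions of the matches
hits    <- grep("Alameda", intersection, value = TRUE)  # the matching strings themselves
flags   <- grepl("Alameda", intersection)               # one logical per element
noMatch <- grep("JeffStreet", intersection)             # empty integer vector when nothing matches

print(idx)
print(flags)
print(length(noMatch))
```

    Checking length(grep(...)) == 0 is a simple way to test whether a pattern occurs at all; the logical vector from grepl() is the more convenient form for subsetting rows.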
  • Step 10: More useful string functions
  • library(stringr)
    nchar("Jeffrey Leek")
    substr("Jeffrey Leek", 1, 7)
    paste("Jeffrey", "Leek")
    paste0("Jeffrey", "Leek")
    str_trim("Jeff    ")

    Regular expressions


    Regular expressions:
  • A ‘regular expression’ is a pattern that describes a set of strings. R uses two types of regular expressions: extended regular expressions (the default) and Perl-like regular expressions, enabled with perl = TRUE. There is also fixed = TRUE, which treats the pattern as a literal string. Here we consider the extended regular expressions used in grep, grepl, regexpr, gregexpr, sub, gsub and strsplit.
  • Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ? , but note that whether these have a special meaning depends on the context.
  • Positions
  • 1: ^ matches the beginning of the string.
  • 2: $ matches the end of the string.
  • 3: \b matches the empty string at either edge of a word.
  • 4: \B matches the empty string provided it is not at an edge of a word.
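
    A minimal sketch of the position anchors, using a made-up vector of strings. Note that the backslash has to be doubled inside an R string literal, so \b is written "\\b".

```r
# Hypothetical strings for illustration
x <- c("cat", "concat", "cat food", "tomcat")

startsWithCat <- grepl("^cat", x)       # "cat" at the beginning of the string
endsWithCat   <- grepl("cat$", x)       # "cat" at the end of the string
wholeWordCat  <- grepl("\\bcat\\b", x)  # "cat" with a word edge on both sides

print(startsWithCat)
print(endsWithCat)
print(wholeWordCat)
```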

  • Quantifiers
  • 1: * matches at least 0 times.
  • 2: + matches at least 1 time.
  • 3: ? matches at most 1 time.
  • 4: {m} matches exactly m times.
  • 5: {m,} matches at least m times.
  • 6: {n,m} matches between n and m times.
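
    The quantifiers can be compared side by side on a small made-up vector. Each pattern below is anchored with ^ and $ so that the whole string has to match, which makes the counts visible.

```r
# Hypothetical strings: zero to three "a" characters followed by "b"
x <- c("b", "ab", "aab", "aaab")

zeroOrMore <- grepl("^a*b$", x)      # any number of "a", including none
oneOrMore  <- grepl("^a+b$", x)      # at least one "a"
atMostOne  <- grepl("^a?b$", x)      # zero or one "a"
exactlyTwo <- grepl("^a{2}b$", x)    # exactly two "a"
twoOrMore  <- grepl("^a{2,}b$", x)   # at least two "a"
oneToTwo   <- grepl("^a{1,2}b$", x)  # between one and two "a"

print(rbind(zeroOrMore, oneOrMore, atMostOne, exactlyTwo, twoOrMore, oneToTwo))
```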

  • Others:
  • 1: [ ] matches any character appearing in [] . ex: [a-z]
  • 2: [^ ] matches any character not appearing in [ ] .
  • 3: . matches any character.
  • 4: | separates alternatives; the pattern matches if either side matches.
  • 5: \ suppresses the special meaning of a metacharacter in a regular expression.
  • 6: () groups expressions.

  • Character classes (the [:name:] forms are only valid inside a bracket expression, e.g. [[:digit:]] ; the ranges shown are the ASCII equivalents):
  • 1: [:digit:] or \d equivalent to [0-9] .
  • 2: [:lower:] equivalent to [a-z] .
  • 3: [:upper:] equivalent to [A-Z] .
  • 4: [:alpha:] equivalent to [a-zA-Z] or [[:lower:][:upper:]] .
  • 5: [:alnum:] equivalent to [A-Za-z0-9] or [[:digit:][:alpha:]] .
  • 6: \w equivalent to [[:alnum:]_] or [A-Za-z0-9_] .
  • 7: \W equivalent to [^A-Za-z0-9_] .
  • 8: [:xdigit:] matches 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f .
  • 9: [:blank:] matches space or tab.
  • 10: [:space:] matches tab, newline, vertical tab, form feed, carriage return, space.
  • 11: \s matches a whitespace character.
  • 12: \S matches a non-whitespace character.
  • 13: [:punct:] matches ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ .
  • 14: [:graph:] equivalent to [[:alnum:][:punct:]] .
  • 15: [:print:] equivalent to [[:alnum:][:punct:] ] (the graphical characters plus space).
  • 16: [:cntrl:] control characters, like \n or \r , [\x00-\x1F\x7F] .
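
    A short sketch of the POSIX classes in use, on a made-up string. The point to notice is the double bracket: the class name goes inside a bracket expression, so "digits" is written [[:digit:]] and "anything except digits" is [^[:digit:]].

```r
# Hypothetical input string
x <- "Room 42, Floor 3!"

digitsOnly <- gsub("[^[:digit:]]", "", x)       # delete everything that is not a digit
noPunct    <- gsub("[[:punct:]]", "", x)        # delete punctuation characters
pieces     <- strsplit(x, "[[:space:]]+")[[1]]  # split on runs of whitespace

print(digitsOnly)
print(noPunct)
print(pieces)
```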


  • R function summary:
  • 1: Identify match to a pattern: grep(..., value = FALSE) , grepl() , stringr::str_detect() .
  • 2: Extract match to a pattern: grep(..., value = TRUE) , stringr::str_extract() , stringr::str_extract_all() .
  • 3: Locate a pattern within a string, i.e. give the start position of each match: regexpr() , gregexpr() , stringr::str_locate() , stringr::str_locate_all() .
  • 4: Replace a pattern: sub() , gsub() , stringr::str_replace() , stringr::str_replace_all() .
  • 5: Split a string using a pattern: strsplit() , stringr::str_split() .
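
    The five tasks in the summary can be sketched with base R alone on a made-up vector. One assumption beyond the list above: regmatches() is used here as a base R way to extract the matched text, since grep(..., value = TRUE) returns the whole matching element rather than only the matched part.

```r
# Hypothetical strings for illustration
x <- c("apple pie", "banana split", "cherry tart")

detected  <- grepl("an", x)                             # 1: identify a match
firstWord <- regmatches(x, regexpr("[[:alpha:]]+", x))  # 2: extract the matched text
startPos  <- as.integer(regexpr("an", x))               # 3: start of the first match, -1 if none
firstOnly <- sub("a", "A", x)                           # 4: replace the first match
allOfThem <- gsub("a", "A", x)                          # 4: replace every match
pieces    <- strsplit(x, " ")                           # 5: split on a pattern

print(detected)
print(firstWord)
print(startPos)
print(allOfThem)
```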

    Working with dates

  • Step 1: Starting simple. date() returns a character string giving the current date and time.
  • d1 <- date()
    class(d1)
  • Step 2: Date class.
  • d2 <- Sys.Date()
    class(d2)
  • Step 3: Formatting dates.
  • %d = day of the month as a number (01-31).
  • %a = abbreviated weekday.
  • %A = unabbreviated weekday.
  • %m = month as a number (01-12).
  • %b = abbreviated month.
  • %B = unabbreviated month.
  • %y = 2 digit year.
  • %Y = 4 digit year.

  • format(d2, "%a %b %d")
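
    To make the format codes concrete, here is a small sketch with a fixed, hypothetical date instead of Sys.Date(), so the result does not change from day to day. Only the numeric codes are used because the output of %a, %A, %b and %B depends on the locale.

```r
# Hypothetical fixed date, used in place of d2 <- Sys.Date()
d <- as.Date("2014-01-12")

dayMonthYear <- format(d, "%d/%m/%Y")  # zero-padded day and month, 4-digit year
shortYear    <- format(d, "%y")        # 2-digit year

# The same codes describe the input layout when parsing text back into a Date
roundTrip <- as.Date(dayMonthYear, format = "%d/%m/%Y")

print(dayMonthYear)
print(shortYear)
print(roundTrip == d)
```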
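  • Step 4: Creating dates.
  • # if as.Date() returns NA, switch LC_TIME to the C locale first
    lct <- Sys.getlocale("LC_TIME")
    Sys.setlocale("LC_TIME", "C")
    x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
    z <- as.Date(x, "%d%b%Y")
    z
    z[1] - z[2]
    as.numeric(z[1] - z[2])
    Sys.setlocale("LC_TIME", lct) # restore the original locale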
  • Step 5: Converting to Julian.
  • weekdays(d2)
    months(d2)
    julian(d2)
  • Step 6: lubridate package.
  • library(lubridate)
    ymd("20140108")
    mdy("08/04/2013")
    dmy("03-04-2013")
  • Step 7: Dealing with time.
  • ymd_hms("2011-08-03 10:15:03")
    ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")
  • Step 8: Some functions have slightly different syntax.
  • x <- dmy(c("1jan2013", "2jan2013", "31mar2013", "30Jul2013"))
    wday(x[1])
    wday(x[1], label = TRUE)
    ymd("1989 May 17")
    mdy("March 12 1975")
    dmy(25081985)
    ymd("1920/1/2")
    ymd_hms(now())
    hms("03:22:14")
  • Step 9: Dealing with vector of dates.
  • dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
    ymd(dt2)
