[Getting and Cleaning data] Week 4
More details can be found in the HTML file here.
Week 4
Editing text variables
Important points about text in data sets

toupper() and tolower() functions.

if (!file.exists("./data")) dir.create("./data")
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
tolower(names(cameraData))

strsplit() function.

# split each variable name on a literal dot ("\\." escapes the metacharacter ".")
splitNames <- strsplit(names(cameraData), "\\.")
splitNames[[5]]
splitNames[[6]]

Lists.

myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))
head(myList)

sapply() function.

splitNames[[6]][1]
# helper that keeps only the first piece of each split name
firstElement <- function(x) x[1]
sapply(splitNames, firstElement)
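
The name-cleaning steps above can be tried without downloading anything. The sketch below is only an illustration: `colNames` is a made-up vector standing in for `names(cameraData)`, not the real header of the Baltimore file.

```r
# Hypothetical column names, standing in for names(cameraData)
colNames <- c("Address", "Direction", "Location.1", "Cross.Street")

# Step 1: make everything lower case
lowerNames <- tolower(colNames)

# Step 2: split on a literal dot; "\\." escapes the regex metacharacter "."
splitNames <- strsplit(lowerNames, "\\.")

# Step 3: keep only the piece before the first dot
firstElement <- function(x) x[1]
cleanNames <- sapply(splitNames, firstElement)

print(cleanNames)  # "address" "direction" "location" "cross"
```

Because `strsplit()` returns a list with one character vector per name, `sapply()` is a convenient way to pull the same position out of every element.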
if (!file.exists("./data")) dir.create("./data")
# download data set
fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv")
download.file(fileUrl2, destfile = "./data/solution.csv")
# load data set
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solution.csv")
# view data set
head(reviews, 2)
head(solutions, 2)

sub() (replaces the first match).

names(reviews)
sub("_", "", names(reviews))

gsub() (replaces all matches).

testName <- "this_is_a_test"
sub("_", "", testName)
gsub("_", "", testName)

grep() and grepl() functions.

grep("Alameda", cameraData$intersection) # returns the indices of the matching elements
table(grepl("Alameda", cameraData$intersection)) # grepl() returns TRUE or FALSE for each element
cameraData2 <- cameraData[!grepl("Alameda", cameraData$intersection), ] # drop the matching rows

grep() with value = TRUE.

grep("Alameda", cameraData$intersection, value = TRUE) # returns the elements containing "Alameda"
grep("JeffStreet", cameraData$intersection)
length(grep("JeffStreet", cameraData$intersection)) # 0 when nothing matches

library(stringr)
nchar("Jeffrey Leek")
substr("Jeffrey Leek", 1, 7)
paste("Jeffrey", "Leek")
paste0("Jeffrey", "Leek")
str_trim("Jeff ")
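
The search, replace and string helpers above can be compared side by side on a few in-memory strings. This is a rough sketch: `intersections` is a made-up vector standing in for `cameraData$intersection`, and base R's `trimws()` is swapped in for `stringr::str_trim()` so that no package is needed.

```r
# Hypothetical intersection labels, standing in for cameraData$intersection
intersections <- c("Caton Ave & Benson Ave", "The Alameda & 33rd St",
                   "E 33rd St & The Alameda", "Erdman Ave & Macon St")

hits      <- grep("Alameda", intersections)                 # positions: 2 3
flags     <- grepl("Alameda", intersections)                # FALSE TRUE TRUE FALSE
matched   <- grep("Alameda", intersections, value = TRUE)   # the matching strings
noAlameda <- intersections[!flags]                          # everything else

testName  <- "this_is_a_test"
firstOnly <- sub("_", "", testName)     # only the first "_" removed: "thisis_a_test"
allGone   <- gsub("_", "", testName)    # every "_" removed: "thisisatest"

fullName  <- paste("Jeffrey", "Leek")   # "Jeffrey Leek" (paste0 would give "JeffreyLeek")
nameLen   <- nchar(fullName)            # 12
firstName <- substr(fullName, 1, 7)     # "Jeffrey"
trimmed   <- trimws("Jeff   ")          # base-R counterpart of str_trim(): "Jeff"
```

`grepl()` is usually the better fit for subsetting, because negating an empty `grep()` result with `-` would select nothing rather than everything.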
Regular expressions:
Functions that use regular expressions: grep, grepl, regexpr, gregexpr, sub, gsub and strsplit.
Metacharacters: . \ | ( ) [ { ^ $ * + ? (whether these have a special meaning depends on the context).

^ matches the beginning of the string.
$ matches the end of the string.
\b matches the empty string at either edge of a word.
\B matches the empty string provided it is not at an edge of a word.
* matches the preceding item zero or more times.
+ matches the preceding item one or more times.
? matches the preceding item at most once.
{m} matches exactly m times.
{m,} matches at least m times.
{n,m} matches from n to m times.
[ ] matches any character appearing inside the brackets, e.g. [a-z].
[^ ] matches any character not appearing inside the brackets.
. matches any character.
| separates alternatives.
\ suppresses the special meaning of a metacharacter.
( ) groups an expression.

Character classes. The [:name:] classes are only valid inside a bracket expression, so the digit class is written [[:digit:]]. In an R string each backslash must be doubled, e.g. "\\d".

[:digit:] or \d is equivalent to [0-9].
[:lower:] is equivalent to [a-z].
[:upper:] is equivalent to [A-Z].
[:alpha:] is equivalent to [a-zA-Z] or [[:lower:][:upper:]].
[:alnum:] is equivalent to [a-zA-Z0-9] or [[:digit:][:alpha:]].
\w is equivalent to [[:alnum:]_] or [a-zA-Z0-9_].
\W is equivalent to [^a-zA-Z0-9_].
[:xdigit:] matches 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f.
[:blank:] matches space or tab.
[:space:] matches tab, newline, vertical tab, form feed, carriage return and space.
\s matches a whitespace character.
\S matches a non-whitespace character.
[:punct:] matches ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
[:graph:] is equivalent to [[:alnum:][:punct:]].
[:print:] is equivalent to [:alnum:], [:punct:] and space.
[:cntrl:] matches control characters, like \n or \r, [\x00-\x1F\x7F].
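
A few of the metacharacters and character classes listed above are sketched below with base R only. The strings in `x` are made up purely for illustration, and each pattern is just one possible way to express the stated intent.

```r
# Made-up strings for illustration only
x <- c("apple123", "Banana", "cherry 7", "  date", "fig_9!")

startsUpper <- grepl("^[[:upper:]]", x)      # ^ anchors at the start: only "Banana"
endsDigits  <- grepl("[[:digit:]]+$", x)     # + and $: "apple123" and "cherry 7"
leadSpace   <- grepl("^[[:space:]]", x)      # only "  date"
wholeWord   <- grepl("\\bcherry\\b", x)      # \b marks word edges: only "cherry 7"
threeLower  <- grepl("^[a-z]{3}_", x)        # {3}: exactly three letters, then "_": only "fig_9!"
either      <- grepl("Banana|date", x)       # | separates alternatives

# [^ ] negates a class: drop everything that is not a letter or digit
lettersDigits <- gsub("[^[:alnum:]]", "", x)

# ( ) groups can be reused in the replacement as \\1, \\2
swapped <- sub("([[:alpha:]]+)([[:digit:]]+)", "\\2\\1", x[1])   # "123apple"

# regexpr() locates the first match; regmatches() extracts it
firstNumber <- regmatches(x, regexpr("[0-9]+", x))               # "123" "7" "9"
```

Note that `regmatches()` silently skips elements without a match, so its result can be shorter than the input vector.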
R function summary:
Detect a match: grep(..., value = FALSE), grepl(), stringr::str_detect().
Return the match: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all().
Locate the match: regexpr(), gregexpr(), stringr::str_locate(), stringr::str_locate_all().
Replace the match: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all().
Split on the match: strsplit(), stringr::str_split().

Working with dates
date() returns a character string that gives you the current date and time; Sys.Date() returns a Date object.

d1 <- date()
class(d1)
d2 <- Sys.Date()
class(d2)

Format codes:
%d = day as a number (01-31).
%a = abbreviated weekday.
%A = unabbreviated weekday.
%m = month as a number (01-12).
%b = abbreviated month.
%B = unabbreviated month.
%y = 2-digit year.
%Y = 4-digit year.

format(d2, "%a %b %d")

# If as.Date() below returns NA, the month abbreviations are being read in a
# non-English locale; save the current locale and switch LC_TIME to "C" first
lct <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")
x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
z <- as.Date(x, "%d%b%Y")
z
z[1] - z[2]
as.numeric(z[1] - z[2])

weekdays(d2)
months(d2)
julian(d2)

lubridate package.

library(lubridate)
ymd("20140108")
mdy("08/04/2013")
dmy("03-04-2013")

ymd_hms("2011-08-03 10:15:03")
ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")

x <- dmy(c("1jan2013", "2jan2013", "31mar2013", "30Jul2013"))
wday(x[1])
wday(x[1], label = TRUE)
ymd("1989 May 17")
mdy("March 12 1975")
dmy(25081985)
ymd("1920/1/2")
ymd_hms(now())
hms("03:22:14")

dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
ymd(dt2)
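
The base-R part of this section (format codes, `as.Date()` and date arithmetic) can be checked end to end without any package. The sketch below reuses the 1960 strings from the notes; the variable names `oldLocale`, `gap`, `longForm` and `shortForm` are introduced here only for illustration.

```r
# Month and weekday names are locale-dependent, so work in the C locale
oldLocale <- Sys.getlocale("LC_TIME")
Sys.setlocale("LC_TIME", "C")

x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
z <- as.Date(x, "%d%b%Y")     # %d day, %b abbreviated month, %Y 4-digit year

gap <- z[2] - z[1]            # subtracting Dates gives a difftime
gapDays <- as.numeric(gap)    # 1

longForm  <- format(z[3], "%A, %d %B %Y")   # "Thursday, 31 March 1960"
shortForm <- format(z[4], "%y-%m-%d")       # "60-07-30"

# julian() counts days since 1970-01-01, so earlier dates are negative
daysSinceEpoch <- as.numeric(julian(z[1]))  # -3653

# Restore the locale that was active before
Sys.setlocale("LC_TIME", oldLocale)
```

With lubridate loaded, `dmy(x)` should parse the same strings without a format string.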