[Getting and Cleaning data] Week 4


Week 4

Editing text variables

Regular expressions

Working with dates

More details can be found in the HTML file here.

Week 4

Editing text variables

Important points about text in data sets

Names of variables should be

All lower case when possible

Descriptive (Diagnosis versus Dx)

Not duplicated

Not have underscores or dots or white spaces

Variables with character values

Should usually be made into factor variables (depending on the application)

Should be descriptive (use TRUE/FALSE instead of 0/1, and Male/Female instead of 0/1 or M/F)

Step 1: Fixing character vectors with the toupper() and tolower() functions.

if(!file.exists("./data")) dir.create("./data")
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/cameras.csv")
cameraData <- read.csv("./data/cameras.csv")
names(cameraData)
tolower(names(cameraData))
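The same case conversion can be tried without downloading anything. The sketch below uses a made-up vector of column names (an assumption for illustration, not the actual columns of the camera data):

```r
# Hypothetical column names standing in for names(cameraData)
colNames <- c("Address", "Direction", "Street", "crossStreet", "Location.1")

lowerNames <- tolower(colNames)  # every letter converted to lower case
upperNames <- toupper(colNames)  # every letter converted to upper case

print(lowerNames)
print(upperNames)
```

Only letters are affected; digits and punctuation such as the dot in "Location.1" are left as they are.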

Step 2: Fixing character vectors with the strsplit() function.

Good for automatically splitting variable names.

Important parameters: x and split

splitNames <- strsplit(names(cameraData), "\\.")
splitNames[[5]]
splitNames[[6]]
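As a minimal sketch of what strsplit() returns, assuming a small made-up vector of names in place of names(cameraData):

```r
# Hypothetical names; only the last one contains a dot
colNames <- c("address", "intersection", "Location.1")

# The split argument is a regular expression, so the dot must be escaped
splitNames <- strsplit(colNames, "\\.")

splitNames[[1]]  # "address" (nothing to split)
splitNames[[3]]  # "Location" "1"

# fixed = TRUE treats the split string literally, so no escaping is needed
splitFixed <- strsplit(colNames, ".", fixed = TRUE)
```

The result is a list with one character vector per input string, which is why the elements are accessed with double brackets.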

Step 3: A quick aside on lists

myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))
head(myList)
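Since strsplit() returns a list, it may help to see the different ways of reaching into one. The sketch below builds a small list (the element names are chosen for illustration) and accesses it by name and by position:

```r
# A list can mix element types; the third element is left unnamed on purpose
myList <- list(letters = c("A", "b", "c"), numbers = 1:3, matrix(1:15, 5))

byName     <- myList$letters  # access by name
byPosition <- myList[[2]]     # double brackets return the element itself
subList    <- myList[2]       # single brackets return a list of length one

names(myList)     # "letters" "numbers" ""
dim(myList[[3]])  # 5 3
```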

Step 4: Fixing character vectors with sapply()

Applies a function to each element in a vector or list.

Important parameters: X and FUN

splitNames[[6]][1]
firstElement <- function(x) x[1]
sapply(splitNames, firstElement)
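A self-contained sketch of the same idea, assuming a hand-written list in place of the strsplit() result:

```r
# Hypothetical result of splitting three names on a dot
splitNames <- list("address", "intersection", c("Location", "1"))

# Keep only the first piece of each split name
firstElement <- function(x) x[1]
cleanNames <- sapply(splitNames, firstElement)
print(cleanNames)  # "address" "intersection" "Location"

# vapply() does the same but checks that each result is a single string
cleanNamesChecked <- vapply(splitNames, firstElement, character(1))
```

vapply() is a stricter alternative rather than a requirement; it fails loudly if the function ever returns something other than one character value.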

Step 5: Peer review data

if(!file.exists("./data")) dir.create("./data")
# download data set
fileUrl1 <- "https://dl.dropbox.com/u/7710864/data/reviews-apr29.csv"
fileUrl2 <- "https://dl.dropbox.com/u/7710864/data/solutions-apr29.csv"
download.file(fileUrl1, destfile = "./data/reviews.csv")
download.file(fileUrl2, destfile = "./data/solution.csv")
# load data set
reviews <- read.csv("./data/reviews.csv")
solutions <- read.csv("./data/solution.csv")
# view data set
head(reviews, 2)
head(solutions, 2)

Step 6: Fixing character vectors with sub() (replaces the first match)

names(reviews)
sub("_", "", names(reviews))

Step 7: Fixing character vectors with gsub() (replaces all matches)

testName <- "this_is_a_test"
sub("_", "", testName)
gsub("_", "", testName)
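Putting the pieces together, one possible way to bring a set of names in line with the naming rules at the top of these notes is to strip underscores and dots with gsub() and then lower-case the result. The names below are invented for the example:

```r
# Hypothetical raw variable names
rawNames <- c("Reviewer_ID", "time_left", "Location.1")

# Inside a bracket expression the dot is literal, so "[_.]" matches "_" or "."
cleanNames <- tolower(gsub("[_.]", "", rawNames))
print(cleanNames)  # "reviewerid" "timeleft" "location1"
```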

Step 8: Finding values with the grep() and grepl() functions

grep("Alameda", cameraData$intersection) # return index
table(grepl("Alameda", cameraData$intersection)) # return true or false
cameraData2 <- cameraData[!grepl("Alameda", cameraData$intersection), ] # keep rows that do not match

Step 9: More on grep()

grep("Alameda", cameraData$intersection, value = TRUE) # return the values containing "Alameda"
grep("JeffStreet", cameraData$intersection)
length(grep("JeffStreet", cameraData$intersection))
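The difference between the two functions is easier to see on a short vector. The intersections below are made up for illustration and are not taken from the camera data:

```r
# Hypothetical intersection names
intersections <- c("Caton Ave & Benson Ave", "The Alameda & 33rd St",
                   "E 33rd & The Alameda", "Erdman & Macon St")

idx     <- grep("Alameda", intersections)                # positions of matches: 2 3
flags   <- grepl("Alameda", intersections)               # one TRUE/FALSE per element
matched <- grep("Alameda", intersections, value = TRUE)  # the matching strings
kept    <- intersections[!grepl("Alameda", intersections)]  # everything else

# With no match, grep() returns an empty vector, so its length is 0
noMatch <- length(grep("JeffStreet", intersections))
```

Because grepl() always returns a logical vector of the same length as its input, it is the safer choice for subsetting rows; negating the empty index from grep() would not behave the same way.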

Step 10: More useful string functions

library(stringr)
nchar("Jeffrey Leek")
substr("jeffrey Leek", 1, 7)
paste("Jeffrey", "Leek")
paste0("Jeffrey", "Leek")
str_trim("Jeff    ")

Regular expressions

Regular expressions:

A ‘regular expression’ is a pattern that describes a set of strings. Two types of regular expressions are used in R: extended regular expressions (the default) and Perl-like regular expressions, enabled with perl = TRUE. There is also fixed = TRUE, which treats the pattern as a literal string. Here we consider the extended regular expressions used in grep, grepl, regexpr, gregexpr, sub, gsub and strsplit.

Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ?, but note that whether these have a special meaning depends on the context. Inside an R string the backslash itself must be doubled, which is why a literal dot is written "\\." in the code above.

Positions

1: ^ matches the beginning.

2: $ matches the end.

3: \b matches the empty string at either edge of a word.

4: \B matches the empty string provided it is not at an edge of a word.
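A small sketch of the position anchors, using grepl() on an invented vector of strings:

```r
x <- c("cat", "concat", "category", "the cat sat")

startsWithCat <- grepl("^cat", x)       # TRUE FALSE TRUE FALSE
endsWithCat   <- grepl("cat$", x)       # TRUE TRUE FALSE FALSE
wholeWordCat  <- grepl("\\bcat\\b", x)  # TRUE FALSE FALSE TRUE

# \B requires that the match does NOT start at a word edge
insideWord <- grepl("\\Bcat", x)
print(insideWord)
```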

Quantifiers

1: * matches at least 0 times.

2: + matches at least 1 time.

3: ? matches at most 1 time.

4: {m} matches exactly m times.

5: {m,} matches at least m times.

6: {n,m} matches between n and m times.
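The quantifiers can be compared side by side on strings that differ only in how many times "b" is repeated. This is a minimal sketch; the anchors ^ and $ are added so that the whole string has to match:

```r
x <- c("ac", "abc", "abbc", "abbbc")

zeroOrMore <- grepl("^ab*c$", x)      # TRUE TRUE TRUE TRUE
oneOrMore  <- grepl("^ab+c$", x)      # FALSE TRUE TRUE TRUE
atMostOne  <- grepl("^ab?c$", x)      # TRUE TRUE FALSE FALSE
exactlyTwo <- grepl("^ab{2}c$", x)    # FALSE FALSE TRUE FALSE
twoOrMore  <- grepl("^ab{2,}c$", x)   # FALSE FALSE TRUE TRUE
oneToTwo   <- grepl("^ab{1,2}c$", x)  # FALSE TRUE TRUE FALSE
```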

Others:

1: [ ] matches any character appearing in [ ], e.g. [a-z].

2: [^ ] matches any character not appearing in [ ] .

3: . matches any character.

4: | separates alternatives: it matches either the expression before it or the expression after it.

5: \ suppress the special meaning of metacharacters in regular expression.

6: () groups expression.
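The operators in this list can be sketched on four similar strings (chosen only for illustration):

```r
x <- c("gray", "grey", "groy", "gr.y")

inSet      <- grepl("gr[ae]y", x)        # TRUE TRUE FALSE FALSE
notInSet   <- grepl("gr[^ae]y", x)       # FALSE FALSE TRUE TRUE
anyChar    <- grepl("gr.y", x)           # TRUE TRUE TRUE TRUE
literalDot <- grepl("gr\\.y", x)         # FALSE FALSE FALSE TRUE
eitherWord <- grepl("^(gray|grey)$", x)  # TRUE TRUE FALSE FALSE
```

The parentheses in the last pattern limit the scope of |, so the anchors apply to both alternatives.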

Character classes:

1: [:digit:] or \d equivalent to [0-9] .

2: [:lower:] equivalent to [a-z] .

3: [:upper:] equivalent to [A-Z] .

4: [:alpha:] equivalent to [a-zA-Z] or [[:lower:][:upper:]] .

5: [:alnum:] equivalent to [a-zA-Z0-9] or [[:digit:][:alpha:]] .

6: \w equivalent to [[:alnum:]_] or [a-zA-Z0-9_] .

7: \W equivalent to [^[:alnum:]_] or [^a-zA-Z0-9_] .

8: [:xdigit:] matches 0 1 2 3 4 5 6 7 8 9 A B C D E F a b c d e f .

9: [:blank:] matches space or tab.

10: [:space:] matches tab, newline, vertical tab, form feed, carriage return, space.

11: \s matches a whitespace character.

12: \S matches a non-whitespace character.

13: [:punct:] matches ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~ .

14: [:graph:] equivalent to [[:alnum:][:punct:]] .

15: [:punct:] plus [:alnum:] plus space gives [:print:], i.e. [[:alnum:][:punct:] ] .

16: [:cntrl:] matches control characters such as \n or \r , i.e. [\x00-\x1F\x7F] .
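One detail worth spelling out: the POSIX classes such as [:digit:] are only valid inside a bracket expression, so in practice they are written with double brackets, e.g. [[:digit:]]. A minimal sketch on an invented string:

```r
x <- "Room 42B, floor 3!"

noDigits    <- gsub("[[:digit:]]", "", x)    # "Room B, floor !"
alnumOnly   <- gsub("[^[:alnum:]]", "", x)   # "Room42Bfloor3"
noPunct     <- gsub("[[:punct:]]", "", x)    # "Room 42B floor 3"
underscored <- gsub("[[:space:]]+", "_", x)  # "Room_42B,_floor_3!"
startsUpper <- grepl("^[[:upper:]]", x)      # TRUE
```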

R function summary:

1: Identify a match to a pattern: grep(..., value = FALSE), grepl(), stringr::str_detect().

2: Extract a match to a pattern: grep(..., value = TRUE), stringr::str_extract(), stringr::str_extract_all().

3: Locate a pattern within a string, i.e. give the start position of matched patterns: regexpr(), gregexpr(), stringr::str_locate(), stringr::str_locate_all().

4: Replace a pattern: sub(), gsub(), stringr::str_replace(), stringr::str_replace_all().

5: Split a string using a pattern: strsplit(), stringr::str_split().
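The locating functions are the least intuitive of the group, so here is a short base R sketch on made-up strings. It also uses regmatches(), a base R helper not listed above, to turn the positions back into the matched text:

```r
x <- c("order 66", "no digits", "7 of 9")

# regexpr() gives the start of the first match in each string, -1 if none
firstPos <- regexpr("[0-9]+", x)
as.integer(firstPos)  # 7 -1 1

# regmatches() extracts the matched text; strings without a match are dropped
firstMatch <- regmatches(x, firstPos)  # "66" "7"

# gregexpr() finds every match, giving one list element per input string
allMatches <- regmatches(x, gregexpr("[0-9]+", x))
```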

Working with dates

Step 1: Starting simple. date() returns a character string that gives you the date and time.

d1 <- date()
class(d1)

Step 2: Date class.

d2 <- Sys.Date()
class(d2)

Step 3: Formatting dates.

%d = day of the month as a number (01-31).

%a = abbreviated weekday.

%A = unabbreviated weekday.

%m = month as a number (01-12).

%b = abbreviated month.

%B = unabbreviated month.

%y = 2 digit year.

%Y = 4 digit year.

format(d2, "%a %b %d")
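A short sketch of the format codes, assuming a fixed date instead of today's date so that the results are reproducible:

```r
# A fixed, arbitrary date (d2 in the notes is whatever day the code is run)
d <- as.Date("2014-01-12")

numericForm <- format(d, "%d/%m/%Y")  # "12/01/2014"
shortYear   <- format(d, "%y")        # "14"
isoForm     <- format(d, "%Y-%m-%d")  # "2014-01-12"

# Weekday and month names depend on the LC_TIME locale, so no output is shown
print(format(d, "%A, %d %B %Y"))
```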

Step 4: Creating dates.

# if as.Date() returns NA, switch LC_TIME to the C locale first
lct <- Sys.getlocale("LC_TIME")  # remember the current setting
Sys.setlocale("LC_TIME", "C")
x <- c("1jan1960", "2jan1960", "31mar1960", "30Jul1960")
z <- as.Date(x, "%d%b%Y")
z
z[1] - z[2]
as.numeric(z[1] - z[2])
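Subtracting two Date objects gives a difftime object rather than a plain number. The sketch below uses numeric months in the input strings (an assumption made here to avoid any dependence on the locale) to show the arithmetic:

```r
x <- c("01/01/1960", "02/01/1960", "31/03/1960")
z <- as.Date(x, format = "%d/%m/%Y")

gaps <- diff(z)                      # successive differences, in days
gapDays <- as.numeric(gaps)          # 1 89
totalDays <- as.numeric(z[3] - z[1]) # 90 (1960 is a leap year)

# difftime() lets the unit be chosen explicitly
difftime(z[3], z[1], units = "weeks")
```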

Step 5: Converting to Julian.

weekdays(d2)
months(d2)
julian(d2)
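In base R, julian() returns the number of days since an origin, which defaults to 1970-01-01. A minimal sketch with a fixed date chosen for illustration:

```r
d <- as.Date("1970-01-10")

sinceEpoch <- as.numeric(julian(d))  # 9 days after 1970-01-01
attr(julian(d), "origin")            # the origin is stored as an attribute

# A different origin can be supplied explicitly
sinceCustom <- as.numeric(julian(d, origin = as.Date("1969-12-31")))  # 10
```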

Step 6: lubridate package.

library(lubridate)
ymd("20140108")
mdy("08/04/2013")
dmy("03-04-2013")

Step 7: Dealing with time.

ymd_hms("2011-08-03 10:15:03")
ymd_hms("2011-08-03 10:15:03", tz = "Pacific/Auckland")

Step 8: Some functions have slightly different syntax.

x <- dmy(c("1jan2013", "2jan2013", "31mar2013", "30Jul2013"))
wday(x[1])
wday(x[1], label = TRUE)
ymd("1989 May 17")
mdy("March 12 1975")
dmy(25081985)
ymd("1920/1/2")
ymd_hms(now())
hms("03:22:14")

Step 9: Dealing with vector of dates.

dt2 <- c("2014-05-14", "2014-09-22", "2014-07-11")
ymd(dt2)
