"Learning R" Notes, Chapter 13 Cleaning Data: Cleaning Strings

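Data cleaning is the most tedious and troublesome part of data analysis.
String cleaning
Base R functions
grep, grepl, and regexpr are three string-matching functions built into R; sub, listed with them below, replaces a match.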

grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE,
     fixed = FALSE, useBytes = FALSE, invert = FALSE)
grepl(pattern, x, ignore.case = FALSE, perl = FALSE,
      fixed = FALSE, useBytes = FALSE)
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
    fixed = FALSE, useBytes = FALSE)
regexpr(pattern, text, ignore.case = FALSE, perl = FALSE,
        fixed = FALSE, useBytes = FALSE)

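grep: returns the elements that match pattern. By default (value = FALSE) the result is a vector of indices.
grepl: returns whether each element matches pattern. The result has class logical.
sub: returns a character vector of the same length as the input, with the first match of pattern in each element replaced by replacement.
regexpr: returns an integer vector of the same length as the input. Each element gives the starting position of the first match of pattern, or -1 if there is no match.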
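As a quick check of the four descriptions above, here is a minimal sketch using only base R. The vector x is hypothetical sample data, not taken from the book.

```r
# Hypothetical sample data, used only for illustration.
x <- c("apple", "banana", "pear")

idx   <- grep("an", x)                # indices of the matching elements
vals  <- grep("an", x, value = TRUE)  # the matching elements themselves
hits  <- grepl("an", x)               # one logical value per element
first <- sub("a", "A", x)             # replaces only the first match per element
pos   <- as.integer(regexpr("an", x)) # start of the first match, -1 if none

print(idx)
print(vals)
print(hits)
print(first)
print(pos)
```

Note that regexpr also attaches attributes such as match.length to its result; as.integer is used here only to drop them for printing.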
stringr package
stringr provides a set of wrappers that make strings easier to manipulate.
modifier functions
In stringr, pattern is interpreted as a regular expression (regex) by default. To change how it is interpreted, stringr provides four modifier functions. The ignore_case argument is the switch for ignoring case.
fixed: Compare literal bytes in the string. This is very fast, but not usually what you want for non-ASCII character sets.

fixed(pattern, ignore_case = FALSE)

coll: Compare strings respecting standard collation rules.

coll(pattern, ignore_case = FALSE, locale = "en", ...)

regex: The default. Uses ICU regular expressions.

regex(pattern, ignore_case = FALSE, multiline = FALSE, comments = FALSE,
  dotall = FALSE, ...)

boundary: Match boundaries between things.

boundary(type = c("character", "line_break", "sentence", "word"),
  skip_word_none = NA, ...)
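To see how the modifiers change the way the same pattern argument is interpreted, here is a small sketch. It assumes the stringr package is installed. The vector s and the sentence passed to str_count are hypothetical sample data.

```r
library(stringr)  # assumes stringr is installed

s <- c("Apple", "a.b", "banana")  # hypothetical sample data

# regex() is the default interpretation; ignore_case = TRUE ignores case.
ci <- str_detect(s, regex("^a", ignore_case = TRUE))

# fixed() treats the pattern literally, so "." matches only a real dot.
lit <- str_detect(s, fixed("."))

# Without a modifier, "." is a regex that matches any character.
any_char <- str_detect(s, ".")

# coll() compares using collation rules; here a case-insensitive comparison.
coll_hit <- str_detect("I", coll("i", ignore_case = TRUE))

# boundary("word") matches word boundaries, so str_count counts words.
n_words <- str_count("data cleaning is tedious", boundary("word"))

print(ci)
print(lit)
print(any_char)
print(coll_hit)
print(n_words)
```

The contrast between lit and any_char is the usual reason to reach for fixed(): characters such as "." stop being regex metacharacters.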

str_detect (grepl)
str_detect() corresponds to grepl and returns a logical vector. pattern can itself be a vector.

str_detect(string, pattern)
> fruit <- c("apple", "banana", "pear", "pinapple")
> str_detect(fruit, "^a")
[1]  TRUE FALSE FALSE FALSE
> str_detect("aecfg", letters[1:6])
[1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE

str_split (strsplit)
str_split corresponds to base R's strsplit. It takes a character input and returns a list of the split pieces. If every result must have the same length, use str_split_fixed, which returns a matrix.

str_split(string, pattern, n = Inf, simplify = FALSE)
str_split_fixed(string, pattern, n) # n: number of pieces to return
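The difference between the list and matrix return values is easier to see on inputs that split into different numbers of pieces. The following is a minimal sketch, assuming stringr is installed. The vector dates is hypothetical sample data.

```r
library(stringr)  # assumes stringr is installed

dates <- c("2015-01-31", "2016-12")  # hypothetical sample data

# str_split returns a list with one character vector per input string.
pieces <- str_split(dates, "-")

# str_split_fixed returns a character matrix with n columns;
# missing pieces are filled with empty strings.
m <- str_split_fixed(dates, "-", 3)

# n limits the number of pieces; the last piece keeps the remainder.
two <- str_split("a-b-c", "-", n = 2)

print(pieces)
print(m)
print(two)
```

Here pieces has elements of length 3 and 2, while m always has 3 columns, which is why the fixed variant is convenient when the result goes into a data frame.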

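str_count
str_count returns the number of matches of pattern in each string, as an integer vector. pattern defaults to the empty string, in which case the result is the number of characters in each string.

str_count(string, pattern = "")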
> str_count(fruit)
[1] 5 6 4 8
> str_count(fruit, c("a", "b", "p", "p"))
[1] 1 1 1 3 # pattern is a vector: each element is matched against the corresponding string

str_replace (sub)
str_replace corresponds to base R's sub and replaces only the first match within each string. str_replace_all, by contrast, replaces every match.

str_replace(string, pattern, replacement)
str_replace_all(string, pattern, replacement)
> str_replace(fruit, "[aeiou]", "-")
[1] "-pple"    "b-nana"   "p-ar"     "p-napple"
> str_replace_all(fruit, "[aeiou]", "-")
[1] "-ppl-"    "b-n-n-"   "p--r"     "p-n-ppl-"
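The correspondence with base R can be checked directly. This sketch assumes stringr is installed and that fruit is the four-element vector implied by the outputs above. Pairing str_replace_all with gsub is an addition of mine; the notes themselves only mention sub.

```r
library(stringr)  # assumes stringr is installed

# Assumed to be the same vector that produced the outputs above.
fruit <- c("apple", "banana", "pear", "pinapple")

# Base R takes the pattern first; stringr takes the string first.
first_base <- sub("[aeiou]", "-", fruit)
first_str  <- str_replace(fruit, "[aeiou]", "-")

all_base <- gsub("[aeiou]", "-", fruit)
all_str  <- str_replace_all(fruit, "[aeiou]", "-")

print(identical(first_base, first_str))
print(identical(all_base, all_str))
```

The two pairs agree for this simple pattern. Base R and stringr use different regex engines, so they can differ on more exotic patterns.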

The str_replace_na function is a special wrapper that converts NA into the string "NA".

str_replace_na(string, replacement = "NA")
> str_replace_na(c(NA, "abc", "def"))
[1] "NA"  "abc" "def"
