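Problem

How would you import an Excel file like the one below into R?

The cells of the table contain no text data, but their fill colors carry meaning.
The readxl package is not much help here, for a couple of reasons:

The data is not “tidy”. The merged cells on the left carry the category information.

The actual information lives in the formatting of the cells.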

Solution

Enter the tidyxl and unpivotr packages.
The tidyxl package imports each cell of an Excel file as one row of a data frame, with columns describing the cell's position, content, and formatting.
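
As a rough illustration of that one-row-per-cell layout, the sketch below reads the example workbook bundled with tidyxl (an assumption made for illustration; it is not the file used in this article) and inspects a few of the columns that `xlsx_cells()` returns.

```r
library(tidyxl)

# Example workbook shipped with tidyxl (not the file from this article)
path <- system.file("extdata/examples.xlsx", package = "tidyxl")

# One row per cell, across all sheets of the workbook
cells <- xlsx_cells(path)

# Position, data type and formatting key of the first few cells
head(cells[, c("sheet", "address", "row", "col", "data_type", "local_format_id")])

# Formats are stored once per workbook; `local_format_id` indexes into them
formats <- xlsx_formats(path)
fill_rgb <- formats$local$fill$patternFill$fgColor$rgb

# Fill color (ARGB hex string, or NA when no fill is set) of the first few cells
head(fill_rgb[cells$local_format_id])
```

The point to notice is that formatting is not stored on the cell rows themselves: each cell only carries an id, and the colors are looked up in a separate structure.
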
The unpivotr package takes the data frame produced by tidyxl and helps reshape it into tidy form.
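
To get a feel for what unpivotr does, here is a minimal sketch on a hypothetical toy table (the data frame `toy` and its contents are made up for illustration). It assumes a version of unpivotr that accepts the direction names "left" and "up" used later in this article.

```r
library(unpivotr)

# Hypothetical toy table: column 1 holds row labels, row 1 holds month names
toy <- data.frame(
  X1 = c("Produto", "Abiu", "Caju"),
  X2 = c("Jan", "x", "y"),
  X3 = c("Fev", "z", "w"),
  stringsAsFactors = FALSE
)

# One row per cell, with `row`, `col`, `data_type` and a value column (`chr`)
cells <- as_cells(toy)

# Strip the leftmost column and attach its values as a `produce` column
cells <- behead(cells, "left", produce)

# Strip the top row and attach its values as a `month` column
tidy <- behead(cells, "up", month)

tidy[, c("produce", "month", "chr")]
```

Each call to `behead()` removes one layer of headers and turns it into an ordinary column, so only the four data cells remain, each labeled with its produce and month.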

First, I converted the original PDF file to Excel using a popular online service. The converted Excel file can be downloaded here.

With the packages mentioned above, reading the table shown earlier becomes as simple as this:

library(dplyr)
library(purrr)
library(tidyxl)
library(unpivotr)
library(here)

filename <- here("raw-data/produtos_epoca-converted.xlsx")

# The workbook contains several sheets. We first import all tables to a
# list
tables_names <-
  c("Table 1", "Table 3", "Table 4", "Table 5", "Table 6", "Table 7")
tables_to_read <- map(tables_names, xlsx_cells, path = filename)

# We create a function to read each sheet
import_table <- function(df) {
  # Each fill color represents a different piece of information. First, we
  # create a palette of the fill colors in the workbook that can be indexed
  # by the `local_format_id` of a given cell to get that cell's fill color.
  # Formats are stored per workbook, so `xlsx_formats()` takes only the path.
  fill_color_palette <-
    xlsx_formats(filename)$local$fill$patternFill$fgColor$rgb

  # Since the table has different headings, we have to filter out these
  # headings in order to keep only the cells with data. Then, we create a
  # new column for the fill colors by looking up the `local_format_id` of
  # each cell in the palette. After that, we create another column where
  # we encode this information.
  availability <-
    df %>%
    filter(row >= 2, col >= 3) %>%   # filter out headers
    mutate(fill_color = fill_color_palette[local_format_id]) %>%
    mutate(
      availability = case_when(
        fill_color == "FFFF7F7F" ~ "Low",
        fill_color == "FFFFFFCC" ~ "Medium",
        fill_color == "FFCCFFCC" ~ "High"
      )
    ) %>%
    select(availability)

  # We now turn all the headings into columns so that we get tidy data
  df %>%
    behead("left-up", category) %>%
    behead("left", produce) %>%
    behead("up", month) %>%
    bind_cols(availability) %>%
    select(category, produce, month, availability)
}

# Let's apply the function to our list of sheets
availability_ceagesp <- map_dfr(tables_to_read, import_table)

Done!
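
The palette lookup inside `import_table()` can be seen in isolation in the sketch below. The palette and the cells are hypothetical toy values; only the three color codes are taken from the code above.

```r
library(dplyr)

# Hypothetical palette: element i is the fill color of format id i
fill_color_palette <- c("FFFFFFFF", "FFFF7F7F", "FFFFFFCC", "FFCCFFCC")

# Hypothetical cells, each pointing at a format through its id
cells <- tibble(
  address = c("C2", "D2", "E2", "F2"),
  local_format_id = c(2L, 4L, 3L, 1L)
)

coded <- cells %>%
  mutate(
    # Index the palette by format id to get each cell's fill color
    fill_color = fill_color_palette[local_format_id],
    # Translate the color into a label; unmatched colors become NA
    availability = case_when(
      fill_color == "FFFF7F7F" ~ "Low",
      fill_color == "FFFFFFCC" ~ "Medium",
      fill_color == "FFCCFFCC" ~ "High"
    )
  )

coded
```

Because `case_when()` has no fallback branch, a cell with any other fill (here the white cell F2) ends up with `NA`, which makes unexpected colors easy to spot.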

Result

So we have imported an Excel file that contains untidy data and formatting that carries meaning.

The data is now ready for further processing and can be explored as needed.

availability_ceagesp %>%
  filter(month == "Set",
         availability == "High")

## # A tibble: 89 x 4
##    category produce                 month availability
##    <chr>    <chr>                   <chr> <chr>       
##  1 Frutas   Abacate Breda/Margarida Set   High        
##  2 Frutas   Abiu                    Set   High        
##  3 Frutas   Acerola                 Set   High        
##  4 Frutas   Banana Maçã             Set   High        
##  5 Frutas   Banana Prata            Set   High        
##  6 Frutas   Caju                    Set   High        
##  7 Frutas   Graviola                Set   High        
##  8 Frutas   Jabuticaba              Set   High        
##  9 Frutas   Kiwi Estrangeiro        Set   High        
## 10 Frutas   Laranja Lima            Set   High        
## # ... with 79 more rows

Reference

Original article: Reading complex Excel files, https://dev.to/leonardoshibata/reading-complex-excel-files-2fjk
