Modeling human/computer interactions on Star Trek 🖖


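This is the latest in my series of screencasts demonstrating how to use the tidymodels packages, from just getting started to tuning more complex models. Today's screencast covers a more advanced topic: how to evaluate multiple combinations of feature engineering and modeling approaches with workflowsets, using this week's #TidyTuesday dataset on Star Trek human/computer interactions.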

Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Explore data

Our modeling goal is to predict which computer interactions from Star Trek were spoken by a person and which were spoken by the computer.

library(tidyverse)
computer_raw <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-17/computer.csv")

computer_raw %>%
  distinct(value_id, .keep_all = TRUE) %>%
  count(char_type)


## # A tibble: 2 × 2
## char_type n
## <chr> <int>
## 1 Computer 178
## 2 Person 234

Which words are more likely to be spoken by the computer vs. by a person?

library(tidytext)
library(tidylo)

computer_counts <-
  computer_raw %>%
  distinct(value_id, .keep_all = TRUE) %>%
  unnest_tokens(word, interaction) %>%
  count(char_type, word, sort = TRUE)

computer_counts %>%
  bind_log_odds(char_type, word, n) %>%
  filter(n > 10) %>%
  group_by(char_type) %>%
  slice_max(log_odds_weighted, n = 10) %>%
  ungroup() %>%
  ggplot(aes(log_odds_weighted,
    fct_reorder(word, log_odds_weighted),
    fill = char_type
  )) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(vars(char_type), scales = "free_y") +
  labs(y = NULL)

Notice that stop words are among the words with the highest weighted log odds; they are very informative in this situation.
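To make the idea of "log odds" a bit more concrete, here is a minimal base R sketch. It computes a plain, add-one-smoothed log odds ratio, which is a simpler quantity than the weighted log odds that `bind_log_odds()` reports, and the word counts are made up purely for illustration (they are not taken from the Star Trek data).

```r
# Hypothetical word counts per speaker type (illustrative only, not real data)
counts <- data.frame(
  word     = c("computer", "the", "is", "locate"),
  Computer = c(2, 40, 30, 1),
  Person   = c(60, 20, 5, 25)
)

# Odds of each word within one group, with add-one smoothing so that
# a zero count never produces an infinite log odds
odds <- function(n) {
  p <- (n + 1) / (sum(n) + length(n))
  p / (1 - p)
}

# Positive values: relatively more typical of the computer
# Negative values: relatively more typical of a person
counts$log_odds_ratio <- log(odds(counts$Computer) / odds(counts$Person))

counts[order(counts$log_odds_ratio), ]
```

A word like "is" can end up with a large positive value here even though it is a stop word, which is the same kind of pattern the plot above shows.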

Build and compare models

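Let's start our modeling by setting up our "data budget." This is a very small data set, so we shouldn't expect amazing results from our model, but it is fun and a nice way to demonstrate some of these concepts.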

library(tidymodels)

set.seed(123)

comp_split <-
  computer_raw %>%
  distinct(value_id, .keep_all = TRUE) %>%
  select(char_type, interaction) %>%
  initial_split(prop = 0.8, strata = char_type)

comp_train <- training(comp_split)
comp_test <- testing(comp_split)

set.seed(234)
comp_folds <- bootstraps(comp_train, strata = char_type)
comp_folds


## # Bootstrap sampling using stratification 
## # A tibble: 25 × 2
## splits id         
## <list> <chr>      
## 1 <split [329/118]> Bootstrap01
## 2 <split [329/128]> Bootstrap02
## 3 <split [329/134]> Bootstrap03
## 4 <split [329/124]> Bootstrap04
## 5 <split [329/118]> Bootstrap05
## 6 <split [329/116]> Bootstrap06
## 7 <split [329/106]> Bootstrap07
## 8 <split [329/124]> Bootstrap08
## 9 <split [329/121]> Bootstrap09
## 10 <split [329/121]> Bootstrap10
## # … with 15 more rows
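The assessment set sizes above differ from one bootstrap to the next because each resample draws 329 rows with replacement and holds out whichever rows were never drawn. The sketch below shows that mechanism in base R only; it ignores stratification, so it is an approximation of what `bootstraps()` does rather than a reimplementation.

```r
set.seed(234)
n <- 329  # number of training rows, as in the output above

# One bootstrap resample: draw n row indices with replacement
in_bag <- sample(n, size = n, replace = TRUE)

# Rows never drawn form the out-of-bag (assessment) set
out_of_bag <- setdiff(seq_len(n), in_bag)

length(in_bag)      # always n
length(out_of_bag)  # varies from resample to resample

# Expected out-of-bag size: each row is missed with probability (1 - 1/n)^n
expected_oob <- n * (1 - 1 / n)^n
expected_oob
```

The expected out-of-bag size is roughly 37% of the training rows, which is in line with the assessment sizes printed above.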

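When it comes to feature engineering, we don't know ahead of time whether we should remove stop words, center and scale the predictors, or balance the classes. Let's create feature engineering recipes that do all of these things so we can compare how they perform.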

library(textrecipes)
library(themis)

rec_all <-
  recipe(char_type ~ interaction, data = comp_train) %>%
  step_tokenize(interaction) %>%
  step_tokenfilter(interaction, max_tokens = 80) %>%
  step_tfidf(interaction)

rec_all_norm <-
  rec_all %>%
  step_normalize(all_predictors())

rec_all_smote <-
  rec_all_norm %>%
  step_smote(char_type)

## we can `prep()` just to check if it works
prep(rec_all_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for tfidf_interaction_a, ... [trained]
## SMOTE based on char_type [trained]
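For intuition about what the tf-idf step produces, here is a small base R sketch on a toy corpus. It uses a common textbook definition (term frequency as a within-document proportion, inverse document frequency as `log(n_docs / doc_freq)`); the defaults of `step_tfidf()` may differ in smoothing and normalization, so treat this as an illustration of the idea, not of the exact numbers the recipe computes. The three example sentences are invented.

```r
# Toy corpus (invented examples, already lowercased)
docs <- c(
  "computer locate commander data",
  "locate commander riker",
  "computer end program"
)

tokens <- strsplit(docs, " ", fixed = TRUE)
vocab <- sort(unique(unlist(tokens)))

# Document-term count matrix: one row per document, one column per term
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Term frequency: share of each document's tokens taken up by each term
tf <- counts / rowSums(counts)

# Inverse document frequency: rarer terms get larger weights
idf <- log(length(docs) / colSums(counts > 0))

# tf-idf: scale each term's column by its idf
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 3)
```

Terms that appear in only one document, such as "data", receive the highest weights, while terms shared across documents are down-weighted.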

Now let's do the same thing, but with stop words removed.

rec_stop <-
  recipe(char_type ~ interaction, data = comp_train) %>%
  step_tokenize(interaction) %>%
  step_stopwords(interaction) %>%
  step_tokenfilter(interaction, max_tokens = 80) %>%
  step_tfidf(interaction)

rec_stop_norm <-
  rec_stop %>%
  step_normalize(all_predictors())

rec_stop_smote <-
  rec_stop_norm %>%
  step_smote(char_type)

## again, let's check it
prep(rec_stop_smote)


## Data Recipe
## 
## Inputs:
## 
## role #variables
## outcome 1
## predictor 1
## 
## Training data contained 329 data points and no missing data.
## 
## Operations:
## 
## Tokenization for interaction [trained]
## Stop word removal for interaction [trained]
## Text filtering for interaction [trained]
## Term frequency-inverse document frequency with interaction [trained]
## Centering and scaling for 80 items [trained]
## SMOTE based on char_type [trained]

Let's try out two kinds of models that often work well for text data: a support vector machine and a naive Bayes model.

library(discrim)

nb_spec <-
  naive_Bayes() %>%
  set_mode("classification") %>%
  set_engine("naivebayes")

nb_spec


## Naive Bayes Model Specification (classification)
## 
## Computational engine: naivebayes


svm_spec <-
  svm_linear() %>%
  set_mode("classification") %>%
  set_engine("LiblineaR")

svm_spec


## Linear Support Vector Machine Specification (classification)
## 
## Computational engine: LiblineaR

Now we can put all of these together in a workflowset.

comp_models <-
  workflow_set(
    preproc = list(
      all = rec_all,
      all_norm = rec_all_norm,
      all_smote = rec_all_smote,
      stop = rec_stop,
      stop_norm = rec_stop_norm,
      stop_smote = rec_stop_smote
    ),
    models = list(nb = nb_spec, svm = svm_spec),
    cross = TRUE
  )

comp_models


## # A workflow set/tibble: 12 × 4
## wflow_id info option result    
## <chr> <list> <list> <list>    
## 1 all_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 2 all_svm <tibble [1 × 4]> <opts[0]> <list [0]>
## 3 all_norm_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 4 all_norm_svm <tibble [1 × 4]> <opts[0]> <list [0]>
## 5 all_smote_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 6 all_smote_svm <tibble [1 × 4]> <opts[0]> <list [0]>
## 7 stop_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 8 stop_svm <tibble [1 × 4]> <opts[0]> <list [0]>
## 9 stop_norm_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 10 stop_norm_svm <tibble [1 × 4]> <opts[0]> <list [0]>
## 11 stop_smote_nb <tibble [1 × 4]> <opts[0]> <list [0]>
## 12 stop_smote_svm <tibble [1 × 4]> <opts[0]> <list [0]>
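With `cross = TRUE`, every preprocessor is paired with every model, which is why six recipes and two models give twelve workflows. The base R sketch below reproduces only that crossing and the naming pattern seen in the output above; it does not call workflowsets.

```r
preproc <- c("all", "all_norm", "all_smote", "stop", "stop_norm", "stop_smote")
models  <- c("nb", "svm")

# expand.grid() varies its first argument fastest, so listing the models
# first keeps both models for a given recipe next to each other
grid <- expand.grid(model = models, preproc = preproc, stringsAsFactors = FALSE)

wflow_id <- paste(grid$preproc, grid$model, sep = "_")
wflow_id
length(wflow_id)
```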

None of these models have any tuning parameters, so next let's use fit_resamples() to evaluate how each combination of feature engineering recipe and model specification performs on our bootstrap resamples.

set.seed(123)
doParallel::registerDoParallel()

computer_rs <-
  comp_models %>%
  workflow_map(
    "fit_resamples",
    resamples = comp_folds,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

We can make a quick, high-level visualization of these results.

autoplot(computer_rs)

All of the SVMs did better than all of the naive Bayes models, at least as far as overall accuracy goes. We can also dig deeper and explore the results further.

rank_results(computer_rs) %>%
  filter(.metric == "accuracy")


## # A tibble: 12 × 9
## wflow_id .config .metric mean std_err n preprocessor model rank
## <chr> <chr> <chr> <dbl> <dbl> <int> <chr> <chr> <int>
## 1 all_svm Preprocess… accuracy 0.679 0.00655 25 recipe svm_l… 1
## 2 all_norm_… Preprocess… accuracy 0.658 0.00756 25 recipe svm_l… 2
## 3 stop_svm Preprocess… accuracy 0.652 0.00700 25 recipe svm_l… 3
## 4 all_smote… Preprocess… accuracy 0.650 0.00611 25 recipe svm_l… 4
## 5 stop_norm… Preprocess… accuracy 0.646 0.00753 25 recipe svm_l… 5
## 6 stop_smot… Preprocess… accuracy 0.632 0.00914 25 recipe svm_l… 6
## 7 all_norm_… Preprocess… accuracy 0.589 0.00678 25 recipe naive… 7
## 8 all_smote… Preprocess… accuracy 0.575 0.0115 25 recipe naive… 8
## 9 stop_smot… Preprocess… accuracy 0.573 0.00971 25 recipe naive… 9
## 10 stop_norm… Preprocess… accuracy 0.571 0.00950 25 recipe naive… 10
## 11 all_nb Preprocess… accuracy 0.570 0.0102 25 recipe naive… 11
## 12 stop_nb Preprocess… accuracy 0.559 0.0120 25 recipe naive… 12

Some interesting things to note are:

how balancing the classes via SMOTE does in fact change sensitivity and specificity the way we would expect

that removing stop words looks like mostly a bad idea!
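As a rough picture of why SMOTE shifts sensitivity and specificity: it adds synthetic minority-class rows by interpolating between an existing minority point and one of its nearest minority neighbors, so the model sees more examples of that class. The base R sketch below shows only this interpolation idea on random two-dimensional data; the helper `smote_one()` and the choice of `k = 3` are hypothetical, and `step_smote()` itself has its own defaults and implementation details.

```r
set.seed(1)

# Ten made-up minority-class points with two numeric predictors
minority <- matrix(rnorm(20), ncol = 2)

# Generate one synthetic point (hypothetical helper, for illustration only)
smote_one <- function(x_mat, k) {
  i <- sample(nrow(x_mat), 1)

  # Euclidean distance from point i to every minority point
  d <- sqrt(rowSums(sweep(x_mat, 2, x_mat[i, ])^2))
  d[i] <- Inf  # a point is not its own neighbor

  # Pick one of the k nearest neighbors at random
  nn <- order(d)[seq_len(k)]
  j <- nn[sample.int(k, 1)]

  # Interpolate a random fraction of the way from point i to neighbor j
  u <- runif(1)
  x_mat[i, ] + u * (x_mat[j, ] - x_mat[i, ])
}

synthetic <- t(replicate(5, smote_one(minority, k = 3)))
dim(synthetic)
```

Because each synthetic row lies on a segment between two real minority points, it always stays inside the range of the original minority data.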

Train and evaluate final model

Let's say we want to keep overall accuracy high, so we pick rec_all and svm_spec. We can use last_fit() to fit one time to all the training data and evaluate one time on the testing data.

comp_wf <- workflow(rec_all, svm_spec)

comp_fitted <-
  last_fit(
    comp_wf,
    comp_split,
    metrics = metric_set(accuracy, sensitivity, specificity)
  )

comp_fitted


## # Resampling results
## # Manual resampling 
## # A tibble: 1 × 6
## splits id .metrics .notes .predictions .workflow
## <list> <chr> <list> <list> <list> <list>   
## 1 <split [329/83]> train/test split <tibble [… <tibble … <tibble [83 … <workflo…

How did that turn out?

collect_metrics(comp_fitted)


## # A tibble: 3 × 4
## .metric .estimator .estimate .config             
## <chr> <chr> <dbl> <chr>               
## 1 accuracy binary 0.735 Preprocessor1_Model1
## 2 sens binary 0.611 Preprocessor1_Model1
## 3 spec binary 0.830 Preprocessor1_Model1

We can also look at the predictions and, for example, make a confusion matrix.

collect_predictions(comp_fitted) %>%
  conf_mat(char_type, .pred_class) %>%
  autoplot()

It was easier to identify people talking to computers than the other way around.
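To connect the confusion matrix to the metrics reported above, the sketch below recomputes accuracy, sensitivity, and specificity from a 2 × 2 table. The counts are an assumption: they are not read from the plot but inferred as one whole-number split of the 83 test rows (36 Computer, 47 Person) that is consistent with the reported metrics, with Computer treated as the event level.

```r
# Inferred counts (an assumption, not output from the post).
# matrix() fills column by column: first the truth = Computer column,
# then the truth = Person column.
conf <- matrix(
  c(22, 14, 8, 39),
  nrow = 2,
  dimnames = list(
    prediction = c("Computer", "Person"),
    truth      = c("Computer", "Person")
  )
)

# Share of all test rows predicted correctly
acc <- sum(diag(conf)) / sum(conf)

# Share of true Computer rows predicted as Computer
sens <- conf["Computer", "Computer"] / sum(conf[, "Computer"])

# Share of true Person rows predicted as Person
spec <- conf["Person", "Person"] / sum(conf[, "Person"])

round(c(accuracy = acc, sens = sens, spec = spec), 3)
```

Under these assumed counts, specificity is higher than sensitivity, meaning lines spoken by a person were recognized more reliably than lines spoken by the computer.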

Since this is a linear model, we can also look at the coefficients for words in the model, for example the terms with the largest effect sizes in each direction.

extract_workflow(comp_fitted) %>%
  tidy() %>%
  group_by(estimate > 0) %>%
  slice_max(abs(estimate), n = 10) %>%
  ungroup() %>%
  mutate(term = str_remove(term, "tfidf_interaction_")) %>%
  ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate > 0)) +
  geom_col(alpha = 0.8) +
  scale_fill_discrete(labels = c("people", "computer")) +
  labs(y = NULL, fill = "More from...")

Reference

Original article (Modeling human/computer interactions on Star Trek 🖖): https://dev.to/juliasilge/modeling-human-computer-interactions-on-star-trek-4l6b
