NLP: Preprocessing Chinese Text Data

This post walks through the preprocessing steps for Chinese text data in detail, with a code example for each step.

Loading the data (plain CSV format)

import pandas as pd
datas = pd.read_csv("./test.csv", header=0, index_col=0)  # DataFrame
n_datas = datas.to_numpy()  # ndarray: the DataFrame converted to a NumPy array

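Removing blank lines

def delete_blank_lines(sentences):
    # Keep only the strings that contain at least one non-whitespace character.
    return [s for s in sentences if s.split()]

# to_numpy() returns a 2-D array; flatten it so that each item is one string
# (this assumes the CSV holds a single text column with no empty cells).
no_line_datas = delete_blank_lines(n_datas.ravel())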
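One detail that matters for Chinese text: `str.split()` called without arguments splits on any Unicode whitespace, so a line made up only of full-width (ideographic) spaces, U+3000, is also treated as blank. A minimal sketch with made-up sample lines:

```python
def delete_blank_lines(sentences):
    # Keep only the strings that contain at least one non-whitespace character.
    return [s for s in sentences if s.split()]

# Made-up samples: an empty string, ASCII whitespace, and a full-width space.
samples = ["今天天气很好。", "", "  \t", "\u3000", "我喜欢自然语言处理。"]
cleaned = delete_blank_lines(samples)
print(cleaned)
```

Only the two real sentences should survive the filter.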

Removing digits

import re

DIGIT_RE = re.compile(r'\d+')  # one or more consecutive digits
def delete_digit(sentences):
    return [DIGIT_RE.sub('', s) for s in sentences]

no_digit_datas = delete_digit(no_line_datas)
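For reference, in Python 3 the `\d` class in a `str` pattern matches any Unicode decimal digit, so full-width digits such as `１２３` are removed as well, while Chinese numerals such as `三` are ordinary characters and stay. A small sketch with made-up sentences:

```python
import re

DIGIT_RE = re.compile(r'\d+')

def delete_digit(sentences):
    # Strip every run of decimal digits from each sentence.
    return [DIGIT_RE.sub('', s) for s in sentences]

# Made-up samples: ASCII digits, full-width digits, and a Chinese numeral.
samples = ["我有3个苹果", "订单号１２３４５已发货", "一共三本书"]
result = delete_digit(samples)
print(result)
```

If Chinese numerals also need to be removed, they would have to be listed explicitly in a separate pattern.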

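Determining the sentence type (simple or complex)

STOPS = ['。', '.', '?', '？', '!', '！']  # sentence-ending punctuation marks
def is_sample_sentence(sentence):
    # A sentence counts as simple if it contains at most one ending mark.
    count = 0
    for word in sentence:
        if word in STOPS:
            count += 1
            if count > 1:
                return False
    return True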
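The check above can be exercised as follows. This is a compact restatement of the same logic (it counts ending marks and compares against one), with made-up sample sentences:

```python
STOPS = ['。', '.', '?', '？', '!', '！']

def is_sample_sentence(sentence):
    # Simple means at most one sentence-ending mark anywhere in the string.
    return sum(ch in STOPS for ch in sentence) <= 1

one_clause = is_sample_sentence("今天天气很好。")
two_clauses = is_sample_sentence("你好吗？我很好。")
with_decimal = is_sample_sentence("价格是3.5元。")
print(one_clause, two_clauses, with_decimal)
```

Note that the ASCII period is in `STOPS`, so a decimal point such as the one in `3.5` is counted as an ending mark. Running this check after digits have been removed, or dropping `'.'` from the list, are two possible ways around that.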

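Removing Chinese punctuation

from string import punctuation
import re

# ASCII punctuation plus common Chinese punctuation marks
punc = punctuation + '.,;《》？！“”‘’@#￥%…&×（）——+【】{};；●，。&～、|:：'
# re.escape keeps characters such as ] \ ^ - literal inside the character class;
# \s additionally strips whitespace, as in the original pattern.
PUNC_RE = re.compile(r"[{}\s]+".format(re.escape(punc)))
def delete_punc(sentences):
    return [PUNC_RE.sub('', s) for s in sentences]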

Removing English (keeping only Chinese characters)

ENGLISH_RE = re.compile(r'[a-zA-Z]+')
def delete_e_word(sentences):
    return [ENGLISH_RE.sub('', s) for s in sentences]
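The function above only strips Latin letters, so digits and punctuation remain. If the goal is literally to keep nothing but Chinese characters, one possible variant is to delete everything outside the U+4E00 to U+9FA5 range. The helper name `keep_chinese_only` is hypothetical and not part of the original post:

```python
import re

# Hypothetical helper: matches runs of anything that is NOT in U+4E00..U+9FA5.
NON_CHINESE_RE = re.compile(r'[^\u4e00-\u9fa5]+')

def keep_chinese_only(sentences):
    # Delete letters, digits, punctuation, and whitespace in one pass.
    return [NON_CHINESE_RE.sub('', s) for s in sentences]

# Made-up samples mixing Chinese, English, digits, and punctuation.
samples = ["我在用Python3做NLP!", "Hello，世界"]
result = keep_chinese_only(samples)
print(result)
```

This is stricter than `delete_e_word`, and it also drops Chinese characters that fall outside that basic range, so it fits best when later steps expect pure Chinese text.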

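Removing punctuation and special symbols

Use a regular expression to strip out punctuation and other symbols that carry no useful information.

# Matches runs of characters that are not word characters, whitespace,
# or Chinese characters (U+4E00 to U+9FA5).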
SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')
def delete_special_symbol(sentences):
    return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]
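To make the behavior of this pattern concrete, here is a small sketch with made-up strings. In Python 3, `\w` in a `str` pattern already matches Chinese characters as well as letters, digits, and the underscore, so all of those survive while punctuation and decorative symbols are removed:

```python
import re

SPECIAL_SYMBOL_RE = re.compile(r'[^\w\s\u4e00-\u9fa5]+')

def delete_special_symbol(sentences):
    # Remove everything that is not a word character, whitespace, or Chinese.
    return [SPECIAL_SYMBOL_RE.sub('', s) for s in sentences]

# Made-up samples: Chinese punctuation, decorative symbols, ASCII operators.
samples = ["你好，世界！★☆", "a_b=张三#1"]
result = delete_special_symbol(samples)
print(result)

# \w on its own already covers Chinese characters in str patterns.
w_matches_chinese = bool(re.fullmatch(r'\w+', "中文"))
print(w_matches_chinese)
```

Because digits, Latin letters, and underscores are kept, this step complements the digit and English removal steps rather than replacing them.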

Chinese word segmentation

# Using jieba
import jieba
def seg_sentences(sentences):
    cut_words = map(lambda s: list(jieba.cut(s)), sentences)
    return list(cut_words)

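# Using pyltp
from pyltp import Segmentor

def seg_sentences(sentences):
    segmentor = Segmentor()
    segmentor.load('./cws.model')  # load the word segmentation model
    seg_sents = [list(segmentor.segment(sent)) for sent in sentences]
    segmentor.release()  # free the model once segmentation is done
    return seg_sents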

Removing stop words

# The stop word list; fill it in or load it from a file
stopwords = []
def delete_stop_word(sentences):
    return [[word for word in s if word not in stopwords] for s in sentences]
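The stop word list above is left empty. One way to fill it, sketched here under assumptions, is to read a UTF-8 text file with one word per line into a set, which also makes the membership test faster than a list. The helper `load_stopwords`, the extra `stopwords` parameter, and the tiny word list are all hypothetical and not from the original post:

```python
import os
import tempfile

def load_stopwords(path):
    # Hypothetical helper: one stop word per line, blank lines ignored.
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}

def delete_stop_word(sentences, stopwords):
    # Same filtering as above, with the stop word set passed in explicitly.
    return [[word for word in s if word not in stopwords] for s in sentences]

# Demo only: write a tiny made-up stop word file, load it, then clean up.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write("的\n了\n是\n")
    stopword_path = tmp.name

stopwords = load_stopwords(stopword_path)
os.remove(stopword_path)

# Made-up, already segmented sentences.
segmented = [["我", "的", "书"], ["他", "是", "学生", "了"]]
filtered = delete_stop_word(segmented, stopwords)
print(filtered)
```

In practice the path would point at whichever published stop word list suits the task.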

References

https://www.cnblogs.com/lookfor404/p/9784630.html
https://blog.csdn.net/hfutdog/article/details/86495574


You are free to share or copy this text, but please keep this document's URL as the reference URL.

Licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
