How to Extract Sentences with spaCy


One part of NLP is extracting words from simple and complex sentences by using their dependency information (nsubj, VERB, dobj, and so on).

For example:

Simple Sentence
- Subject <- VERB -> dobj

Complex Sentence (a complex sentence has many layers of dependency)
# has multiple subjects
- (nsubj -> nsubj) <- VERB -> dobj
# has multiple verbs
- (nsubj -> nsubj) <- VERB -> dobj, VERB -> VERB
# has multiple objects
- (nsubj -> nsubj) <- VERB -> dobj, VERB -> VERB -> dobj
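
As a rough illustration of the patterns above, the sketch below hand-builds the dependency tree of one simple sentence and lists every head-to-child arc. The `Node` class and the example sentence are assumptions made for illustration only; in practice spaCy's parser produces these arcs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a parsed token; this is not spaCy's Token class."""
    text: str
    dep: str
    children: List["Node"] = field(default_factory=list)


def arcs(head):
    """Return (head, dep, child) triples for every arc below head, depth first."""
    found = []
    for child in head.children:
        found.append((head.text, child.dep, child.text))
        found.extend(arcs(child))
    return found


# "The system sends a notification": Subject <- VERB -> dobj
simple = Node("sends", "ROOT", [
    Node("system", "nsubj", [Node("The", "det")]),
    Node("notification", "dobj", [Node("a", "det")]),
])

simple_arcs = arcs(simple)
for head, dep, child in simple_arcs:
    print(f"{head} -{dep}-> {child}")
```

A complex sentence has the same shape with more layers, for example a second verb attached to the first one through a conj arc.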

This post shows how to extract sentences with spaCy in Python, and how to build step-by-step word-extraction rules that analyze the grammatical structure.

Step 1: Import spaCy.

Step 2: Create a Phrases() class for sentence extraction. You can follow along with the code below.

import spacy

class Phrases():
   def __init__(self, sentence):
       self.nlp = spacy.load('en_core_web_sm')
       self.sentence = str(sentence)
       self.doc = self.nlp(self.sentence)
       self.sequence = 0
       self.svos = []

Step 3: Create a method that merges the noun phrases of the sentence into single tokens. You can follow along with the code below.

def merge_phrases(self):
    with self.doc.retokenize() as retokenizer:
        for np in list(self.doc.noun_chunks):
            attrs = {
                "tag": np.root.tag_,
                "lemma": np.root.lemma_,
                "ent_type": np.root.ent_type_,
            }
            retokenizer.merge(np, attrs=attrs)
    return self.doc

Step 4: Create a method that merges punctuation into the preceding word. You can follow along with the code below.

def merge_punct(self):
        spans = []
        for word in self.doc[:-1]:
            if word.is_punct or not word.nbor(1).is_punct:
                continue
            start = word.i
            end = word.i + 1
            while end < len(self.doc) and self.doc[end].is_punct:
                end += 1
            span = self.doc[start:end]
            spans.append((span, word.tag_, word.lemma_, word.ent_type_))
        with self.doc.retokenize() as retokenizer:
            for span, tag, lemma, ent_type in spans:
                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
                retokenizer.merge(span, attrs=attrs)
        return self.doc
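
To make the span-collection logic easier to follow, here is a minimal sketch of the same loop on a plain list of strings, so it runs without spaCy. The `punct_spans` helper and the word list are hypothetical, and punctuation is detected with `string.punctuation` rather than spaCy's `is_punct`.

```python
import string


def punct_spans(words):
    """Return (start, end) index pairs covering a word plus the punctuation run after it."""
    is_punct = [all(ch in string.punctuation for ch in w) for w in words]
    spans = []
    for i in range(len(words) - 1):
        # Skip punctuation itself and words that are not followed by punctuation.
        if is_punct[i] or not is_punct[i + 1]:
            continue
        end = i + 1
        # Extend the span over every consecutive punctuation mark.
        while end < len(words) and is_punct[end]:
            end += 1
        spans.append((i, end))
    return spans


words = ["Call", "the", "help", "desk", ",", "please", "!", "!"]
spans = punct_spans(words)
merged = ["".join(words[start:end]) for start, end in spans]
print(spans)
print(merged)
```

The method above does the same thing on a Doc, then passes each collected span to the retokenizer so the word and its trailing punctuation become one token.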

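Step 5: Create a method that extracts the grammar of the sentence. As a first step, it takes the text from the sentence parameter, runs the NLP pipeline on it, and keeps the parsed result in a doc object. You can follow along with the code below.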

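def get_svo(self, sentence):
   #parse the sentence and keep the result on self.doc, which the merge methods work on
   self.doc = self.nlp(sentence)
   #merge noun phrases, then merge punctuation
   doc = self.merge_phrases()
   doc = self.merge_punct()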

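From here the method works on the doc object. It checks whether the sentence is passive or active, finds every main verb and the children of each verb, finds the subject of each main verb, finds subjects joined by a conjunction, finds the objects of the main verb, and finds prepositional modifiers attached to the main verb. It then extracts S, V, and O and appends each SVO to the result list.

The first part, checking for passive or active voice, takes the tokens from doc and checks whether any of them marks the sentence as passive.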

Simple Sentence
#Subject <- VERB -> dobj

In the passive sample sentence, we find "auxpass" among the dependencies of the sentence.
#Subject <- auxpass <- VERB -> dobj

def is_passive(self, tokens):
   for tok in tokens:
      if tok.dep_ == "auxpass":
        return True
   return False
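
The check can be tried without loading a model. The sketch below uses a hypothetical `Tok` stand-in and hand-written dependency labels for one active and one passive sentence; the labels are what one would typically expect from spaCy's English parser, not actual parser output.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; only the fields the check reads."""
    text: str
    dep_: str


def is_passive(tokens):
    # A single "auxpass" dependency is enough to call the sentence passive.
    return any(tok.dep_ == "auxpass" for tok in tokens)


# "The system sends a notification."
active = [Tok("The", "det"), Tok("system", "nsubj"), Tok("sends", "ROOT"),
          Tok("a", "det"), Tok("notification", "dobj")]

# "A notification is sent by the system."
passive = [Tok("A", "det"), Tok("notification", "nsubjpass"), Tok("is", "auxpass"),
           Tok("sent", "ROOT"), Tok("by", "agent"), Tok("the", "det"),
           Tok("system", "pobj")]

print(is_passive(active), is_passive(passive))
```

Because the check runs over the whole doc, a sentence that mixes an active clause and a passive clause is treated as passive for every verb in it.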

The verb-finding method collects all the main verbs. You can follow along with the code below.

def _is_verb(self, token):
   return token.dep_ in ["ROOT", "xcomp", "appos", "advcl", "ccomp", "conj"] and token.tag_ in ["VB", "VBZ", "VBD", "VBN", "VBG", "VBP"]

def find_verbs(self, tokens):
   verbs = [tok for tok in tokens if self._is_verb(tok)]
   return verbs
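
The filter requires both a clause-level dependency and a verbal tag. A small sketch with hand-labelled tokens shows which ones pass; the `Tok` class and the labels are assumptions for illustration, not spaCy output.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; only the fields the filter reads."""
    text: str
    dep_: str
    tag_: str


VERB_DEPS = {"ROOT", "xcomp", "appos", "advcl", "ccomp", "conj"}
VERB_TAGS = {"VB", "VBZ", "VBD", "VBN", "VBG", "VBP"}


def is_verb(token):
    # Both conditions must hold: a clause-level dependency and a verbal tag.
    return token.dep_ in VERB_DEPS and token.tag_ in VERB_TAGS


tokens = [
    Tok("Call", "ROOT", "VB"),       # kept: ROOT and a verbal tag
    Tok("desk", "dobj", "NN"),       # dropped: neither condition holds
    Tok("make", "conj", "VB"),       # kept: conjoined verb
    Tok("assigned", "amod", "VBN"),  # dropped: verbal tag, but used as a modifier
    Tok("request", "ROOT", "NN"),    # dropped: ROOT of a fragment, but a noun
]

verbs = [tok.text for tok in tokens if is_verb(tok)]
print(verbs)
```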

The subject-finding method collects all the subjects attached to a verb. You can follow along with the code below.

def get_all_subs(self, v):
   #get all subjects: left children of the verb with a subject dependency and a noun tag
   subs = [tok for tok in v.lefts if tok.dep_ in ["ROOT", "nsubj", "nsubjpass"] and tok.tag_ in ["NN" , "NNS", "NNP"]]
   if len(subs) == 0:
     #get all subjects from the left of verb ("nsubj" <= "preconj" <= VERB)
     subs = [tok for tok in v.lefts if tok.dep_ in ["preconj"]]
     for sub in subs:
        rights = list(sub.rights)
        right_dependency = [tok.lower_ for tok in rights]
        if len(right_dependency) > 0:
           #wrap in a list so the method always returns a list
           subs = [right_dependency[0]]
   return subs
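
The sketch below exercises only the first branch of the method, the filter on the verb's left children, using a hypothetical `Tok` stand-in with hand-written labels. It also shows one consequence of the tag list: a pronoun subject tagged PRP is not returned.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; lefts holds the left children."""
    text: str
    dep_: str
    tag_: str
    lefts: List["Tok"] = field(default_factory=list)


SUBJ_DEPS = {"ROOT", "nsubj", "nsubjpass"}
SUBJ_TAGS = {"NN", "NNS", "NNP"}


def subjects(verb):
    # Keep left children that are subjects and carry one of the listed noun tags.
    return [tok for tok in verb.lefts
            if tok.dep_ in SUBJ_DEPS and tok.tag_ in SUBJ_TAGS]


# "CSO handles ..." : proper-noun subject
handles = Tok("handles", "ROOT", "VBZ", [Tok("CSO", "nsubj", "NNP")])
# "It sends ..." : pronoun subject, tagged PRP
sends = Tok("sends", "ROOT", "VBZ", [Tok("It", "nsubj", "PRP")])

noun_subjects = [tok.text for tok in subjects(handles)]
pronoun_subjects = [tok.text for tok in subjects(sends)]
print(noun_subjects, pronoun_subjects)
```

If pronoun or plural proper-noun subjects matter for your text, extending the tag list (for example with PRP or NNPS) would be one option to consider.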

The object-finding method collects all the objects attached to a verb. You can follow along with the code below.

def get_all_objs(self, v, is_pas):
   #get list the right of dependency with VERB (VERB => "dobj" or "pobj")
   rights = list(v.rights)
   objs = [tok for tok in rights if tok.dep_ in ["dobj", "dative", "attr", "oprd", "pobj"] or (is_pas and tok.dep_ == 'pobj')]
   #get all objects from the right of dependency (VERB => "dobj" or "pobj")
   for obj in objs:
      #on the right of dependency, you can get objects from prepositions (VERB => "dobj" => "prep" => "pobj")
      rights = list(obj.rights) 
      objs.extend(self._get_objs_from_prepositions(rights, is_pas))
   return v, objs

** You can also get objects from prepositions **

def _get_objs_from_prepositions(self, deps, is_pas):
   objs = []
   for dep in deps:
      if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
         objs.extend([tok for tok in dep.rights if tok.dep_  in ["dobj", "dative", "attr", "oprd", "pobj"] or (tok.pos_ == "PRON" and tok.lower_ == "me") or (is_pas and tok.dep_ == 'pobj')])
   return objs
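
To see how the two object methods cooperate, here is a simplified sketch of the VERB => "dobj" => "prep" => "pobj" walk on a hand-built tree. The `Tok` class, the function names, and the tree are assumptions for illustration, and the special case for the pronoun "me" is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; rights holds the right children."""
    text: str
    dep_: str
    pos_: str = ""
    rights: List["Tok"] = field(default_factory=list)


OBJ_DEPS = {"dobj", "dative", "attr", "oprd", "pobj"}


def objs_from_prepositions(deps, is_pas):
    objs = []
    for dep in deps:
        # Follow prepositions, and the "agent" preposition of a passive sentence.
        if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
            objs.extend(tok for tok in dep.rights if tok.dep_ in OBJ_DEPS)
    return objs


def all_objs(verb, is_pas=False):
    objs = [tok for tok in verb.rights if tok.dep_ in OBJ_DEPS]
    # Extending the list while iterating lets the loop also visit the objects it just found.
    for obj in objs:
        objs.extend(objs_from_prepositions(obj.rights, is_pas))
    return objs


# "sends a notification to the purchasing department", with noun phrases already merged
department = Tok("the purchasing department", "pobj", "NOUN")
to = Tok("to", "prep", "ADP", [department])
notification = Tok("a notification", "dobj", "NOUN", [to])
sends = Tok("sends", "ROOT", "VERB", [notification])

found = [tok.text for tok in all_objs(sends)]
print(found)
```

The loop terminates because each newly found object sits deeper in the tree, and the tree is finite.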

Finally, the get_svo method carries out the SVO extraction step by step.

def get_svo(self, sentence):
   #parse the sentence and keep the result on self.doc, which the merge methods work on
   self.doc = self.nlp(sentence)
   #merge noun phrases, then merge punctuation
   doc = self.merge_phrases()
   doc = self.merge_punct()

   #check passive and active sentence
   is_pas = self.is_passive(doc)

   #find the main verbs and the children of each verb
   verbs = self.find_verbs(doc)

   #a sentence can have more than one verb
   for verb in verbs:
      self.sequence += 1

      #find the subject with the main verb
      subject = self.get_all_subs(verb)

      #find the object with the main verb
      verb, obj = self.get_all_objs(verb, is_pas)

      #find prepositional modifier with the main verb
      #(main_get_to_pobj is not defined in this article)
      to_pobj = self.main_get_to_pobj(verb)

      #store the SVO result, with the prepositional modifier if there is one
      if to_pobj is not None:
        self.svos.append((self.sequence, subject, verb, obj, to_pobj))
      else:
        self.svos.append((self.sequence, subject, verb, obj, ""))

   #return the collected SVO tuples
   return self.svos
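
The method calls `main_get_to_pobj`, which this article never defines. The sketch below is only a guess at what such a helper might do, not the author's implementation: it collects the objects of a "to" preposition attached to the verb and returns None when there are none, which matches the `is not None` check above. The `Tok` class, the name `get_to_pobj`, and the list-of-strings return shape are all assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tok:
    """Hypothetical stand-in for a spaCy token; rights holds the right children."""
    text: str
    dep_: str
    rights: List["Tok"] = field(default_factory=list)


def get_to_pobj(verb) -> Optional[List[str]]:
    """Guessed behavior: texts of the pobj tokens under a 'to' preposition of the verb."""
    found = []
    for child in verb.rights:
        if child.dep_ == "prep" and child.text.lower() == "to":
            found.extend(tok.text for tok in child.rights if tok.dep_ == "pobj")
    # Return None rather than an empty list so the caller can test "is not None".
    return found or None


# "returns the goods to the vendor", with noun phrases already merged
vendor = Tok("the vendor", "pobj")
to = Tok("to", "prep", [vendor])
goods = Tok("the goods", "dobj")
returns = Tok("returns", "ROOT", [goods, to])

# "handles the assigned incident": no preposition under the verb
handles = Tok("handles", "ROOT", [Tok("the assigned incident", "dobj")])

print(get_to_pobj(returns), get_to_pobj(handles))
```

A real version would be a method on the class and would read spaCy tokens; depending on the parse, the preposition may also carry a label other than prep, so the condition would need checking against actual parser output.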

Final class

import spacy

class Phrases():
    def __init__(self, sentence):
       self.nlp = spacy.load('en_core_web_sm')
       self.sentence = str(sentence)
       self.doc = self.nlp(self.sentence)
       self.sequence = 0
       self.svos = []

    def merge_phrases(self):
        with self.doc.retokenize() as retokenizer:
            for np in list(self.doc.noun_chunks):
                attrs = {
                    "tag": np.root.tag_,
                    "lemma": np.root.lemma_,
                    "ent_type": np.root.ent_type_,
                }
                retokenizer.merge(np, attrs=attrs)
        return self.doc

    def merge_punct(self):
        spans = []
        for word in self.doc[:-1]:
            if word.is_punct or not word.nbor(1).is_punct:
                continue
            start = word.i
            end = word.i + 1
            while end < len(self.doc) and self.doc[end].is_punct:
                end += 1
            span = self.doc[start:end]
            spans.append((span, word.tag_, word.lemma_, word.ent_type_))
        with self.doc.retokenize() as retokenizer:
            for span, tag, lemma, ent_type in spans:
                attrs = {"tag": tag, "lemma": lemma, "ent_type": ent_type}
                retokenizer.merge(span, attrs=attrs)
        return self.doc

    def is_passive(self, tokens):
        for tok in tokens:
            if tok.dep_ == "auxpass":
                return True
        return False

    def _is_verb(self, token):
        return token.dep_ in ["ROOT", "xcomp", "appos", "advcl", "ccomp", "conj"] and token.tag_ in ["VB", "VBZ", "VBD", "VBN", "VBG", "VBP"]

    def find_verbs(self, tokens):
        verbs = [tok for tok in tokens if self._is_verb(tok)]
        return verbs

    def get_all_subs(self, v):
        #get all subjects
        subs = [tok for tok in v.lefts if tok.dep_ in ["ROOT", "nsubj", "nsubjpass"] and tok.tag_ in ["NN" , "NNS", "NNP"]]
        if len(subs) == 0:
            #get all subjects from the left of verb ("nsubj" <= "preconj" <= VERB)
            subs = [tok for tok in v.lefts if tok.dep_ in ["preconj"]]
            for sub in subs:
                rights = list(sub.rights)
                right_dependency = [tok.lower_ for tok in rights]
                if len(right_dependency) > 0:
                    #wrap in a list so the method always returns a list
                    subs = [right_dependency[0]]
        return subs

    def get_all_objs(self, v, is_pas):
        #get list the right of dependency with VERB (VERB => "dobj" or "pobj")
        rights = list(v.rights)
        objs = [tok for tok in rights if tok.dep_ in ["dobj", "dative", "attr", "oprd", "pobj"] or (is_pas and tok.dep_ == 'pobj')]
        #get all objects from the right of dependency (VERB => "dobj" or "pobj")
        for obj in objs:
            #on the right of dependency, you can get objects from prepositions (VERB => "dobj" => "prep" => "pobj")
            rights = list(obj.rights) 
            objs.extend(self._get_objs_from_prepositions(rights, is_pas))
        return v, objs

    def _get_objs_from_prepositions(self, deps, is_pas):
        objs = []
        for dep in deps:
            if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
                objs.extend([tok for tok in dep.rights if tok.dep_  in ["dobj", "dative", "attr", "oprd", "pobj"] or (tok.pos_ == "PRON" and tok.lower_ == "me") or (is_pas and tok.dep_ == 'pobj')])
        return objs

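    def get_svo(self, sentence):
        #parse the sentence and keep the result on self.doc, which the merge methods work on
        self.doc = self.nlp(sentence)
        #merge noun phrases, then merge punctuation
        doc = self.merge_phrases()
        doc = self.merge_punct()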

        #check passive and active sentence
        is_pas = self.is_passive(doc)

        #find the main verb and child of a verb
        verbs = self.find_verbs(doc) 

        #more than verb
        for verb in verbs:
            self.sequence += 1

            #find the subject with the main verb
            subject = self.get_all_subs(verb)

            #find the object with the main verb                
            verb, obj = self.get_all_objs(verb, is_pas)

            #find prepositional modifier with the main verb
            #(main_get_to_pobj is not defined in this article)
            to_pobj = self.main_get_to_pobj(verb)

            #You can continue by creating more methods for word extraction ...

            #finally, store the result with the prepositional modifier if there is one
            if to_pobj is not None:
                #result SVO
                self.svos.append((self.sequence, subject, verb, obj, to_pobj))
            else:
                #result SVO
                self.svos.append((self.sequence, subject, verb, obj, ""))

        #return the collected SVO tuples
        return self.svos

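Test it with some sentences.

First sentence: CSO, the workflow controller, handles the assigned incident.
Extraction result: [(1, [the workflow controller], handles, [the assigned incident], '')]

Second sentence: Call the help desk and make a request.
Extraction result: [(1, [], Call, [the help desk], ''), (2, [], make, [a request], '')]

Third sentence: The receiving department returns the goods to the vendor, and the system sends a notification to the purchasing department.
Extraction result: [(1, [The receiving department], returns, [the goods], ['the vendor']), (2, [], sends, [a notification, the purchasing department], '')]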

Reference

Original article (How to extract sentence from spaCy): https://dev.to/suttipongk/how-to-extract-sentence-from-spacy-hdh

The text may be freely shared or copied, but please keep the URL of this document as the reference URL.
