[AIFFEL] 22.Feb.17, Exploration_BERT

Today's learning list

Python also has a standalone module called random, separate from NumPy and TensorFlow.
How to set the random seed

import random
import numpy as np
import tensorflow as tf

random_seed = 1234
random.seed(random_seed)
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

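Why repeated calls still return different values even after the random seed is set
- "Setting the seed means the next random call is the same; it sets the sequence of random numbers such that any code that produces or uses random numbers (with NumPy) will now produce the same sequence of numbers."
- In other words, the seed fixes the whole sequence, not each individual call. In the example below the four calls return four different arrays, but after re-seeding with the same value the same four arrays come back in the same order.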

>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])

>>> np.random.seed(2021)
>>> np.random.rand(4)
array([0.60597828, 0.73336936, 0.13894716, 0.31267308])
>>> np.random.rand(4)
array([0.99724328, 0.12816238, 0.17899311, 0.75292543])
>>> np.random.rand(4)
array([0.66216051, 0.78431013, 0.0968944 , 0.05857129])
>>> np.random.rand(4)
array([0.96239599, 0.61655744, 0.08662996, 0.56127236])
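
The same behavior can be checked with Python's built-in random module instead of NumPy. This is only a minimal sketch; the seed value 2021 is reused from the example above, and the variable names are made up for illustration.

```python
import random

random.seed(2021)
first_run = [random.random() for _ in range(4)]    # four successive draws

random.seed(2021)                                   # re-seeding restarts the sequence
second_run = [random.random() for _ in range(4)]

# Within one run every draw is a new value, but the two runs match element by element.
print(len(set(first_run)) == 4)
print(first_run == second_run)
```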

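I finally understand what __future__ is (it is a module, not a function).
- __future__ is a built-in module in Python that is used to inherit new features that will be available in the new Python versions.
- So, for example, if I am on Python 2 (the old version), a line such as "from __future__ import print_function" lets me use the Python 3 behavior there. Note that it switches on newer language features; it does not let me import arbitrary Python 3 modules, which is what I first assumed.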
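
As a small illustration, the __future__ module itself records when each feature could first be enabled and when it became the default. The sketch below only inspects that metadata; print_function is picked as an example feature, and the variable names are made up.

```python
import __future__

# Each feature object stores two release tuples: (major, minor, micro, level, serial).
feature = __future__.print_function
optional = feature.getOptionalRelease()     # first release where the import is accepted
mandatory = feature.getMandatoryRelease()   # release where the behavior is built in

print("can be enabled from Python", optional[:2])
print("default behavior from Python", mandatory[:2])
```

In real code the feature is enabled with a line such as "from __future__ import print_function", which must appear at the very top of the file.
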
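The recursive-function pattern shows up again.
- A recursive function seems to need a condition inside its own body (a base case) that stops it from calling itself forever.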

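def print_json_tree(data, indent=""):
    """Recursively print the keys of a nested dict."""
    for key, value in data.items():
        if isinstance(value, list):     # for list-valued items, print only the first item
            print(f'{indent}- {key}: [{len(value)}]')
            # base case: stop when the list is empty or does not hold dicts
            if value and isinstance(value[0], dict):
                print_json_tree(value[0], indent + "  ")
        else:
            print(f'{indent}- {key}: {value}')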

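The JSON file format
- It looks like a dictionary.
- But it is a separate format from Python, so reading a JSON file in Python is deserialization (JSON text to Python objects) and writing one is serialization (Python objects to JSON text).
- Each JSON type is mapped to a corresponding Python type.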
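
A minimal sketch of that type mapping using the standard json module. The sample text and its keys are made up for illustration and are not taken from the actual dataset.

```python
import json

# Made-up JSON text covering the basic JSON types
text = '{"title": "sample", "count": 2, "score": 0.5, "ok": true, "missing": null, "items": [1, 2]}'

data = json.loads(text)           # deserialization: JSON text -> Python objects
for key, value in data.items():
    # object -> dict, array -> list, string -> str, number -> int/float,
    # true/false -> bool, null -> None
    print(key, type(value).__name__)

round_trip = json.dumps(data)     # serialization: Python objects -> JSON text
```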

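For Korean, a word-level vocabulary gets far too large, because one verb appears in many surface forms (읽다, 읽고, 읽어서, 읽었는데, 읽다가... are all forms of "to read").
A dataset is data + labels. To pack them together zip-style, it seems you pass a tuple of (inputs, labels), and each side can be a dictionary when there are several named inputs or outputs.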

dataset = tf.data.Dataset.from_tensor_slices((
    {
        'inputs': questions,
        'dec_inputs': answers[:, :-1]
    },
    {
        'outputs': answers[:, 1:]
    },
))
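
The two slices in the snippet above shift the answer sequence by one position, so that at each step the target is the token that follows the decoder input. Here is a plain-Python sketch of that shift; the token IDs, and the assumption that 1 and 2 are start and end tokens and 0 is padding, are made up for illustration.

```python
# Hypothetical token IDs: 1 = start token, 2 = end token, 0 = padding
answers = [
    [1, 11, 12, 2, 0],
    [1, 21, 22, 23, 2],
]

dec_inputs = [row[:-1] for row in answers]   # same idea as answers[:, :-1]
outputs = [row[1:] for row in answers]       # same idea as answers[:, 1:]

for dec_in, out in zip(dec_inputs, outputs):
    # out[i] is the token that comes right after dec_in[i]
    print(dec_in, "->", out)
```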

"The main reason to subclass tf.keras.layers.Layer instead of using a Lambda layer is saving and inspecting a Model. Lambda layers are saved by serializing the Python bytecode, which is fundamentally non-portable. They should only be loaded in the same environment where they were saved. Subclassed layers can be saved in a more portable way by overriding their get_config method. Models that rely on subclassed Layers are also often easier to visualize and reason about."
- https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda
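- More computer-science talk shows up again.
- I came across tf.keras.layers.Lambda (I can tell intuitively what it does) and looked it up,
- and the reason to prefer a subclassed layer turns out to be about serialization....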
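
To get a feel for the difference, here is a plain-Python analogy rather than actual Keras code. The Scale class is a hypothetical stand-in for a subclassed layer, and marshal is used only to show what serialized bytecode looks like. A config dictionary holds nothing but plain data, while bytecode is tied to the interpreter that produced it.

```python
import json
import marshal


class Scale:
    """Hypothetical stand-in for a subclassed layer; not a Keras class."""

    def __init__(self, factor):
        self.factor = factor

    def __call__(self, x):
        return x * self.factor

    def get_config(self):
        # Only plain data is stored, so any environment that has the class can rebuild it.
        return {"factor": self.factor}

    @classmethod
    def from_config(cls, config):
        return cls(**config)


config_text = json.dumps(Scale(3.0).get_config())       # readable, portable text
restored = Scale.from_config(json.loads(config_text))

scale_fn = lambda x: x * 3.0
code_bytes = marshal.dumps(scale_fn.__code__)            # bytecode; format depends on the Python version

print(config_text)
print(restored(2.0))
print(type(code_bytes).__name__, len(code_bytes) > 0)
```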

Mini project

Let's build a model that can do Korean question answering using a BERT model.

Everything is hard today.....
- But the preprocessing part seems to be the hardest. (Or is it..?)
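Data preprocessing
- To summarize for now,
  1. Split every word into subwords.
    - The reason for splitting into subwords is 'to store 읽다, 읽고, 읽었는데, ... efficiently in the vocabulary'.
    - Fortunately, this is done with an existing package that is already well pre-trained(?) on Korean morphemes.
  2. Assign a unique integer to each subword.
  3. Besides the unique integer for each subword, the position (index) of each character within the sentence is also kept separately.
  4. The subword integers and the character indices have to be linked to each other.
    - That way you can answer 'I want to find a certain word; where is it in the sentence?'
  5. So one more list is needed that records, for each character, which word (by its unique integer) it belongs to.
  6. Anything else...?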
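
Steps 2 to 5 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: whitespace-separated words stand in for real subwords, the integer IDs come from a toy vocabulary, and the function and variable names are made up rather than taken from the project code.

```python
def tokenize_whitespace(text):
    """Split on whitespace and record, for every character, the index of its word."""
    words, char_to_word = [], []
    prev_is_space = True
    for ch in text:
        if ch.isspace():
            prev_is_space = True
        else:
            if prev_is_space:
                words.append(ch)       # first character of a new word
            else:
                words[-1] += ch        # extend the current word
            prev_is_space = False
        # a space is assigned to the word before it (-1 if the text starts with a space)
        char_to_word.append(len(words) - 1)
    return words, char_to_word


context = "BERT reads Korean text"
words, char_to_word = tokenize_whitespace(context)

# Toy vocabulary: one made-up integer ID per distinct word
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(words))}
word_ids = [vocab[word] for word in words]

answer_start = context.find("Korean")        # character index where the answer begins
answer_word = char_to_word[answer_start]     # index of the word that contains that character
print(words[answer_word], word_ids[answer_word])
```

The char_to_word list is what ties a character position, such as the start of an answer in the passage, back to the token that contains it.
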
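Train an LSTM model and visualize the loss
Train a BERT model (architecture only, no pre-trained weights) for a short while and visualize the loss
Fine-tune a pre-trained BERT model (transfer learning) and visualize the loss
- Run inference on questions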

Source

[AIFFEL] 22.Feb.17, Exploration_BERT_KorQuAD: https://velog.io/@moondeokjong/AIFFEL-22.Feb.17-ExplorationBERTKorQuAD
