Implementation) DyRep: Learning Representations over Dynamic Graphs

Previous Review

Paper Review)

https://velog.io/@rlawlsgus117/Paper-Review-DyRep-Learning-Representations-over-Dynamic-Graphs

Code Review_1)

https://velog.io/@rlawlsgus117/Code-Review-DyRep-Learning-Representations-over-Dynamic-Graphs

Original Source Code (unofficial)

https://github.com/code-ball/dynamic-embedding

Dataset

Using Reddit post and comment data, DyRep is trained to predict whether a comment will be created on a test post.

Issue


Issue #001 : [FIX] Changed Dataset and train_data, test_data

Before : (u, v, t, k) -> k = whether a comment exists at the present time

Aim to : (u, v, t, k) -> k = whether a comment occurs at the next time step

data_preprocessing

...

comment_dict_link = defaultdict()

# Build a dict keyed by each comment's link key (the thread it belongs to);
# the value stored under each key is the list of that thread's comment_key values.

        if link_key not in comment_dict_link:
            comment_dict_link[link_key] = [comment_key]
        else:
            comment_dict_link[link_key] += [comment_key]
...

def makeTrainData(subreddit):

    for key in post_train_dict:
        post_id = int(post_train_dict[key][0])
        post_time = int(post_train_dict[key][8])
        
        # data for the (initial) post creation event
        
        # the post has no comments at all
        if key not in comment_dict_link: 
            train_data.append([0, post_id, post_time, 0])
        # the post has at least one comment
        else:
            train_data.append([0, post_id, post_time, 1])
            
    for key in comment_dict_link:
        for comment_key in comment_dict_link[key]:
            comment_id = int(comment_dict[comment_key][0])
            parent_key = comment_dict[comment_key][5]

            if subreddit == 'news':
                comment_time = int(comment_dict[comment_key][8])
            else:
                comment_time = int(comment_dict[comment_key][7])

            # the comment replies directly to the post and the parent post is valid
            if parent_key == key and parent_key in post_train_dict:
                parent_id = int(post_train_dict[key][0])
                train_data.append([parent_id, comment_id, comment_time, 1])
                
            # the comment replies to another comment and the parent comment is valid
            elif parent_key != key and parent_key in comment_dict:
                parent_id = int(comment_dict[parent_key][0])
                train_data.append([parent_id, comment_id, comment_time, 1])
        
        # the last comment event of each post gets k = 0 (no comment follows it)
        train_data[-1][-1] = 0
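
As a small illustration of the resulting labels (hypothetical ids and times; post rows and comment rows are actually appended in separate loops): a post P created at t0 with comments c1 at t1 and c2 at t2, and no further activity, contributes

# hypothetical ids/times for one post with two comments
P, c1, c2 = 1, 2, 3
t0, t1, t2 = 100, 110, 120

rows = [
    [0, P, t0, 1],    # post creation; a comment exists at a later time -> k = 1
    [P, c1, t1, 1],   # c1 is followed by another comment -> k = 1
    [P, c2, t2, 0],   # c2 is the post's last comment event -> k = 0
]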

Issue #002 : [FIX] Changed post_id, comment_id

Problem

Post and comment embeddings are stored in separate tensors, but the dataset stores u and v only as id values, so there is no way to tell whether an id refers to a post or a comment.

Solved

Build a single tensor holding the text embeddings (converted by BERT) of every post and comment in each subreddit,
and look up the embeddings of u and v in the dataset by indexing into that tensor.

Aim to

When exporting the raw post/comment data, remap the ids so they never collide:
[odd] post_id = 2 * post_id - 1
[even] comment_id = 2 * comment_id
-> both can then be addressed by index in a single tensor (see the sketch below).
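
A minimal sketch of this remapping (remap_id is a hypothetical helper, not a function in the repository):

# posts get odd indices, comments get even indices, so both fit in one tensor
def remap_id(raw_id, is_post):
    return 2 * raw_id - 1 if is_post else 2 * raw_id

# e.g. post_id 3 -> 5, comment_id 3 -> 6; assuming raw ids start at 1,
# index 0 stays free for the source node of post-creation events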


Issue #003 : [FIX] Truncated texts to the maximum input length of 512

Problem

While running text -> BERT -> embedding, some texts exceed the maximum input length (512 tokens).

Solved

Texts longer than the maximum length are truncated to the maximum length.

BERT.py

text = text_dict[key][0]
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer_Bert.tokenize(marked_text)

# keep the first 511 tokens and re-append [SEP] so the sequence is exactly 512
if len(tokenized_text) > 512:
    tokenized_text = tokenized_text[:511]
    tokenized_text.append('[SEP]')
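
For reference, if tokenizer_Bert were a Hugging Face transformers tokenizer (an assumption; the snippet above adds [CLS]/[SEP] by hand), the same truncation could be delegated to the tokenizer itself:

# assumes a `transformers` BertTokenizer: [CLS]/[SEP] and truncation in one call
encoded = tokenizer_Bert(text, truncation=True, max_length=512, return_tensors='pt')
input_ids = encoded['input_ids']   # shape [1, <=512]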

Issue #004 : [FEAT] Randomly shuffled train & test data

dataloader.py

    # shuffle the event order, then drop the tail so the data splits evenly
    # into mini-batches of size `batch`
    train_data = np.random.permutation(train_data)
    train_data = torch.from_numpy(np.array(train_data, dtype=float))
    train_batch_num = train_data.size(0) // batch
    train_data = train_data.narrow(0, 0, train_batch_num * batch)
    train_data = train_data.view(train_batch_num, batch, -1)

[Pending] Issue #005 : [FIX] Added trapz of output_survival to survival_loss

main.py

Before

intensity_loss = torch.sum(torch.log(output_intensity))
survival_loss = torch.sum(output_survival, dim=1)
mini_batch_loss = -intensity_loss + survival_loss

Solved

intensity_loss = torch.sum(torch.log(output_intensity))
survival_loss = torch.sum(torch.trapz(output_survival, dim=1))
mini_batch_loss = -intensity_loss + survival_loss
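
My reading of why this change is needed (not verified against the official implementation): the survival term of the DyRep loss approximates an integral of the intensity over sampled non-event pairs, so a plain sum corresponds to a rectangle rule with unit step, while torch.trapz applies the trapezoidal rule along dim=1. A minimal sketch with hypothetical tensor shapes:

import torch

# hypothetical shape: [batch, n_samples] intensities at sampled non-event pairs
output_survival = torch.rand(4, 10)

rect = torch.sum(output_survival, dim=1)     # plain sum ~ rectangle rule, dx = 1
trap = torch.trapz(output_survival, dim=1)   # trapezoidal rule, dx = 1

survival_loss = torch.sum(trap)              # scalar survival term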

Issue #006 : RuntimeError: CUDA out of memory

Problem

Number of texts to embed with BERT for the news subreddit = 1,218,865

CUDA out of memory...

Not solved
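
One workaround worth trying (a sketch, not the repository's code; it assumes a Hugging Face transformers tokenizer and BertModel rather than the BERT.py pipeline above): run BERT in inference mode, embed the texts in small chunks, and move each chunk's embeddings back to the CPU so GPU memory is released between chunks.

import torch
from transformers import BertModel, BertTokenizer  # assumed available

def embed_in_chunks(texts, tokenizer, model, device='cuda', chunk_size=64):
    """Embed a list of texts chunk by chunk to bound GPU memory use."""
    model.eval().to(device)
    chunks = []
    with torch.no_grad():
        for i in range(0, len(texts), chunk_size):
            batch = texts[i:i + chunk_size]
            enc = tokenizer(batch, padding=True, truncation=True,
                            max_length=512, return_tensors='pt').to(device)
            out = model(**enc)
            # mean-pool the last hidden state -> [chunk_size, 768], kept on CPU
            chunks.append(out.last_hidden_state.mean(dim=1).cpu())
    return torch.cat(chunks, dim=0)

# usage sketch:
# tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# model = BertModel.from_pretrained('bert-base-uncased')
# features = embed_in_chunks(all_texts, tokenizer, model)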


Issue #007 : Resizing embedding Tensor

Problem

The text_embedding extracted from BERT has shape torch.Size([1, n, 768]).

The embedding fed into the model must have shape torch.Size([1, 1, em_size]) (default = 32).

The tensor size therefore has to adapt flexibly to the chosen embedding size.
The initial embeddings are reduced to [1, 768] with mean(dim=1) and saved as a pickle.

When the model loads them, the embeddings still have to be reshaped to the requested em_size.
-> costs time

Solved

def LoadEmbeddings(subreddit, em_size):
    with open('./data/feature_dict_' + subreddit + '.pickle', 'rb') as f:
        data = pickle.load(f)

        # [N, 768] -> [N, 768 // em_size, em_size], then mean over dim=1 -> [N, em_size]
        # (768 must be divisible by em_size)
        embedding = data.unsqueeze(1)
        embedding = torch.reshape(embedding, [len(data), -1, em_size])
        embedding = torch.mean(embedding, dim=1)

    return embedding
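
A usage sketch, assuming the pickle stores an [N, 768] tensor (so 768 must be divisible by em_size):

embeddings = LoadEmbeddings('news', em_size=32)   # torch.Size([N, 32])
u_emb = embeddings[5]                             # node features looked up by index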

Issue #008 : psi

Problem

psi = Variable(torch.ones(num_dynamics, 1))
Needs verification.
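
What the check will likely turn up (my reading, to be confirmed): torch.autograd.Variable is deprecated, and as written psi stays a constant all-ones tensor, whereas in the paper psi_k is a trainable scalar per dynamic process used in the intensity f_k(g) = psi_k * log(1 + exp(g / psi_k)). A sketch of registering it as a learnable parameter (the module and method names are hypothetical):

import torch
import torch.nn as nn

class DyRepModule(nn.Module):              # hypothetical module name
    def __init__(self, num_dynamics):
        super().__init__()
        # one trainable psi_k per dynamic process, visible to the optimizer
        self.psi = nn.Parameter(torch.ones(num_dynamics, 1))

    def intensity(self, g, k):
        # f_k(g) = psi_k * log(1 + exp(g / psi_k))  (softplus scaled by psi_k)
        psi_k = self.psi[k]
        return psi_k * torch.log(1.0 + torch.exp(g / psi_k))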


Issue #009 : batch

Problem

default batch : 300

Avg.Loss : 800~850

changed batch : 30

Avg.Loss : 70~75
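
One likely explanation (not verified): mini_batch_loss sums the intensity and survival terms over every event in the mini-batch, so the reported average scales roughly linearly with the batch size (800~850 / 300 ≈ 2.7 per event versus 70~75 / 30 ≈ 2.4 per event). Normalizing by the batch size would make runs with different batch sizes comparable:

# divide the summed loss by the number of events in the mini-batch
mini_batch_loss = (-intensity_loss + survival_loss) / batch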
