Implementation) DyRep: Learning Representations over Dynamic Graphs
Previous Review
Paper Review)
https://velog.io/@rlawlsgus117/Paper-Review-DyRep-Learning-Representations-over-Dynamic-Graphs
Code Review_1)
https://velog.io/@rlawlsgus117/Code-Review-DyRep-Learning-Representations-over-Dynamic-Graphs
Original Source Code (not official)
https://github.com/code-ball/dynamic-embedding
Dataset
Using Reddit posts and comments data, DyRep is trained to predict whether a comment will be generated for a test post.
Issue
Issue #001 : [FIX] Changed Dataset and train_data, test_data
Before : (u, v, t, k) -> k = whether a comment exists at the current time
Aim to : (u, v, t, k) -> k = whether a comment will exist at the next time
data_preprocessing
...
comment_dict_link = defaultdict()
# Build a dict keyed by each comment's linked (parent) key;
# the data assigned to each key is a list of comment_keys.
if link_key not in comment_dict_link:
    comment_dict_link[link_key] = [comment_key]
else:
    comment_dict_link[link_key] += [comment_key]
...
def makeTrainData(subreddit):
    for key in post_train_dict:
        post_id = int(post_train_dict[key][0])
        post_time = int(post_train_dict[key][8])
        # data for the (initial) post generation event
        # case: the post has no comments at all
        if key not in comment_dict_link:
            train_data.append([0, post_id, post_time, 0])
        # case: the post has at least one comment
        else:
            train_data.append([0, post_id, post_time, 1])
    for key in comment_dict_link:
        for comment_key in comment_dict_link[key]:
            comment_id = int(comment_dict[comment_key][0])
            parent_key = comment_dict[comment_key][5]
            if subreddit == 'news':
                comment_time = int(comment_dict[comment_key][8])
            else:
                comment_time = int(comment_dict[comment_key][7])
            # case: the comment replies directly to the post and the parent post is valid
            if parent_key == key and parent_key in post_train_dict:
                parent_id = int(post_train_dict[key][0])
                train_data.append([parent_id, comment_id, comment_time, 1])
            # case: the comment replies to another comment and the parent comment is valid
            elif parent_key != key and parent_key in comment_dict:
                parent_id = int(comment_dict[parent_key][0])
                train_data.append([parent_id, comment_id, comment_time, 1])
        # the last comment event of each post gets k = 0
        train_data[-1][-1] = 0
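As an illustration of the resulting (u, v, t, k) format with dummy ids and timestamps: the post-generation event carries k = 1 when at least one comment will follow, and only the last comment event of a post carries k = 0 (no further comment at the next time):

train_data = [
    [0, 1, 100, 1],   # post 1 created at t=100; a comment will follow       -> k = 1
    [1, 2, 110, 1],   # comment 2 on post 1; another comment follows         -> k = 1
    [1, 3, 120, 0],   # comment 3 on post 1; last comment event of this post -> k = 0
]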
Issue #002 : [FIX] Changed post_id, comment_id
Problem
Post and comment embeddings are stored in separate tensors, but the Dataset stores u and v only as id values, so it is impossible to tell whether an id refers to a post or a comment.
Solved
Build a single tensor containing the embeddings (converted by BERT) of the post and comment texts of each subreddit,
and access the embeddings of u and v in the Dataset by tensor index.
Aim to
When generating the raw data for posts and comments, export ids as
[odd] post_id = 2 * (post_id) - 1
[even] comment_id = 2 * (comment_id)
so that posts and comments are separated and can be looked up by index in a single tensor (see the sketch below).
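A minimal sketch of this id remapping, assuming 1-based post and comment ids; the helper names are hypothetical and not part of the repository:

def to_unified_id(raw_id, is_post):
    # posts land on odd indices, comments on even indices
    return 2 * raw_id - 1 if is_post else 2 * raw_id

def from_unified_id(unified_id):
    # returns (raw_id, is_post)
    if unified_id % 2 == 1:
        return (unified_id + 1) // 2, True
    return unified_id // 2, False

assert to_unified_id(3, is_post=True) == 5    # post 3    -> index 5 (odd)
assert to_unified_id(3, is_post=False) == 6   # comment 3 -> index 6 (even)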
Issue #003 : [FIX] Truncated texts to the maximum input length of 512
Problem
While running text -> BERT -> embedding, some texts exceed the maximum length (512).
Solved
Texts longer than the maximum length are cut off at the maximum length.
BERT.py
text = text_dict[key][0]
marked_text = "[CLS] " + text + " [SEP]"
tokenized_text = tokenizer_Bert.tokenize(marked_text)
if len(tokenized_text) > 512:
    # keep the first 511 tokens and re-append [SEP] -> exactly 512 tokens
    tokenized_text = tokenized_text[:511]
    tokenized_text.append('[SEP]')
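For reference, if the tokenizer followed the current Hugging Face transformers interface, the built-in truncation option could achieve the same thing; this is only a sketch and the checkpoint name is an assumption, not the repository's code:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")   # assumed checkpoint
# truncation=True caps the sequence (including [CLS]/[SEP]) at max_length tokens
input_ids = tokenizer.encode("some very long text ...", truncation=True, max_length=512)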
Issue #004 : [FEAT] Randomly shuffled train & test data
dataloader.py
train_data = np.random.permutation(train_data)                  # shuffle events
train_data = torch.from_numpy(np.array(train_data, dtype=float))
train_batch_num = train_data.size(0) // batch
train_data = train_data.narrow(0, 0, train_batch_num * batch)   # drop the remainder
train_data = train_data.view(train_batch_num, batch, -1)        # [num_batches, batch, 4]
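A small illustration of how the reshaped tensor could then be consumed per mini-batch (a sketch with dummy data, not the repository's training loop):

import torch

train_batch_num, batch = 3, 30
train_data = torch.zeros(train_batch_num, batch, 4)   # dummy [u, v, t, k] events
for b in range(train_batch_num):
    u, v, t, k = train_data[b].unbind(dim=1)          # each of shape [batch]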
[On hold] Issue #005 : [FIX] Added trapz of output_survival to survival_loss
main.py
Before
intensity_loss = torch.sum(torch.log(output_intensity))
survival_loss = torch.sum(output_survival, dim=1)
mini_batch_loss = -intensity_loss + survival_loss
Solved
intensity_loss = torch.sum(torch.log(output_intensity))
survival_loss = torch.sum(torch.trapz(output_survival, dim=1))
mini_batch_loss = -intensity_loss + survival_loss
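The intended effect, assuming output_survival holds sampled intensity values of shape [batch, n_samples], is to approximate the survival integral of lambda(t) with the trapezoidal rule before summing; a minimal sketch with dummy values:

import torch

output_survival = torch.rand(4, 10)                       # dummy sampled intensities
survival_per_event = torch.trapz(output_survival, dim=1)  # trapezoidal integral (unit spacing), shape [4]
survival_loss = torch.sum(survival_per_event)             # scalar survival term of the NLL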
Issue #006 : RuntimeError: CUDA out of memory
Problem
Number of texts to run through BERT embedding in the news subreddit = 1,218,865
CUDA out of memory...
Not solved
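One common mitigation (not applied in this repository, and assuming the Hugging Face transformers interface) is to embed the texts in small batches under torch.no_grad() and move each chunk to the CPU immediately; all names below are assumptions:

import torch

def embed_in_batches(texts, tokenizer, bert_model, device="cuda", batch_size=64):
    bert_model.eval().to(device)
    chunks = []
    with torch.no_grad():                                  # no gradient buffers -> far less GPU memory
        for i in range(0, len(texts), batch_size):
            enc = tokenizer(texts[i:i + batch_size], padding=True, truncation=True,
                            max_length=512, return_tensors="pt").to(device)
            out = bert_model(**enc).last_hidden_state.mean(dim=1)   # [batch_size, 768]
            chunks.append(out.cpu())                       # free GPU memory right away
    return torch.cat(chunks, dim=0)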
Issue #007 : Resizing embedding Tensor
Problem
The text_embedding extracted from BERT has shape torch.Size([1, n, 768]),
while the embedding the model expects has shape torch.Size([1, 1, em_size]) (default = 32),
so the tensor size needs to adapt flexibly to the chosen embedding size.
The initial embedding is averaged with mean(dim=1) into [1, 768] and saved as a pickle;
when loading it for the model, it must be reshaped to match the requested em_size.
-> this costs time
Solved
def LoadEmbeddings(subreddit, em_size):
    with open('./data/feature_dict_' + subreddit + '.pickle', 'rb') as f:
        data = pickle.load(f)
    embedding = data.unsqueeze(1)                                   # [N, 1, 768]
    embedding = torch.reshape(embedding, [len(data), -1, em_size])  # [N, 768 // em_size, em_size]
    embedding = torch.mean(embedding, dim=1)                        # [N, em_size]
    return embedding
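A shape trace with dummy data for em_size = 32 (note this only works when 768 is divisible by em_size):

import torch

data = torch.rand(100, 768)                                # pooled BERT embeddings, one row per text
embedding = data.unsqueeze(1)                              # [100, 1, 768]
embedding = torch.reshape(embedding, [len(data), -1, 32])  # [100, 24, 32]
embedding = torch.mean(embedding, dim=1)                   # [100, 32] -> fed to the model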
Issue #008 : psi
Problem
psi = Variable(torch.ones(num_dynamics, 1))
Needs verification.
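For reference, a minimal sketch of how psi typically enters the DyRep-style softplus intensity, lambda_k(t) = psi_k * log(1 + exp(g_k(t) / psi_k)); since torch.autograd.Variable is deprecated, psi is written here as a learnable nn.Parameter. This is an assumption about intent, not the repository's code:

import torch
import torch.nn as nn

num_dynamics = 2
psi = nn.Parameter(torch.ones(num_dynamics, 1))    # learnable scale per dynamic k

def intensity(g, k):
    # scaled softplus: lambda_k = psi_k * log(1 + exp(g / psi_k))
    return psi[k] * torch.log(1 + torch.exp(g / psi[k]))

lam = intensity(torch.tensor(0.5), k=0)            # shape [1], positive intensity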
Issue #009 : batch
Problem
default batch = 300 -> Avg. Loss: 800~850
changed batch = 30 -> Avg. Loss: 70~75
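The roughly 10x drop in average loss with a 10x smaller batch suggests the loss is summed over the events in a mini-batch rather than averaged; dividing by the batch size would make the numbers comparable across batch sizes. A sketch with dummy values, not the repository's code:

import torch

batch = 30
output_intensity = torch.rand(batch)                    # dummy per-event intensities
output_survival = torch.rand(batch, 10)                 # dummy sampled survival intensities
intensity_loss = torch.sum(torch.log(output_intensity))
survival_loss = torch.sum(torch.trapz(output_survival, dim=1))
mini_batch_loss = -intensity_loss + survival_loss
loss_per_event = mini_batch_loss / batch                # comparable across batch sizes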
Author And Source
This post on (Implementation) DyRep: Learning Representations over Dynamic Graphs is based on the original article at https://velog.io/@rlawlsgus117/Code-Review2-DyRep-Learning-Representations-over-Dynamic-Graphs. Author attribution: the original author's information is contained in the original URL, and the copyright belongs to the original author.