diff --git a/.gitignore b/.gitignore
index 3df43c8..8e4d678 100644
--- a/.gitignore
+++ b/.gitignore
@@ -2,11 +2,15 @@
 models
 output
 wandb
+src
+output
+#files
+.args.py
 #add when you clone this file
-conf.yml
+# conf.yml
 # README.md
-.gitignore
+# .gitignore
 NanumGothic.ttf
 #etc
 /.ipynb_checkpoints
diff --git a/README.md b/README.md
index 95d154b..b1176e6 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,32 @@
-# pstage_04_dkt
+# pstage_04_dkt (Deep Knowledge Tracing)
+- Period: 2021.05.24–2021.06.15
+- Task: track each student's knowledge state and predict whether the last problem in their problem list is answered correctly (AUC: 0.8362, final rank 7 of 15 teams)
+![task img](https://user-images.githubusercontent.com/52443401/126865028-66d9f100-e1c3-4633-8790-86c1f7d84f47.JPG)
+- Summary: recognizing that a student's ability differs per subject and per test sheet, we split one student into several students when extracting statistics, and focused on short-term signals
+- Models used: LGBM, LSTM, LSTM with attention, Bert, Saint, LastQuery
+### Important Techniques
+- k-fold (with user split)
+- more convenient experimentation via the config.yml file
+- NN models adapted to accept arbitrary categorical/continuous features
+- solve_time-related feature generation
+- user_month_split
+
+### Important Feature
+- user's last order time
+![fi image](https://user-images.githubusercontent.com/52443401/126864608-e6af562b-e2b0-4ad7-9c2f-7a86bbac5b98.png)
+
+
 ## Running via the config file
-### 1. config setting
-choose the model and hyperparameters
+### 1. config.yml setting
+choose the model, hyperparameters, and other technique options
 ### 2. $ python3 train / inference .py
 same as before
 ### 3. $ python3 whole-in-one.py
 runs training and inference in one shot
+note: lgbm needs no separate inference step; everything is handled during train
+on execution, the hyperparameters and features used for training are saved to the folder as json
 ### 4. $ python3 submit.py
-enter a key and a file path to submit directly from the server without downloading
-
-## merging lgbm
-### 1. __feature_engineering
\ No newline at end of file
+enter a key and a file path and the submission csv is sent straight from the server, with no need to download it
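The README's config-driven workflow (conf.yml read by train / inference / whole-in-one) can be pictured with a small sketch. This merge helper is an assumption for illustration only, not the repo's actual loader; it assumes nothing beyond `parse_args` from args.py below and flat top-level conf.yml keys such as `model` and `lr`:

```python
# Hypothetical sketch: overlay conf.yml values onto the argparse namespace,
# so config settings win over the CLI defaults defined in args.py.
import yaml
from args import parse_args

def load_config(args, path='conf.yml'):
    with open(path) as f:
        conf = yaml.safe_load(f)      # e.g. {'model': 'lstm', 'lr': 0.0001, ...}
    for key, value in conf.items():
        setattr(args, key, value)     # config overrides the argparse default
    return args

args = load_config(parse_args())
```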
diff --git a/args.py b/args.py
index 6984cf0..1bf351d 100644
--- a/args.py
+++ b/args.py
@@ -2,27 +2,34 @@ import argparse
-
 def parse_args(mode='train'):
     parser = argparse.ArgumentParser()
+
-    parser.add_argument('--task_name', default='lstm_attn', type=str, help='task_name')
     parser.add_argument('--seed', default=42, type=int, help='seed')
-    parser.add_argument('--device', default='gpu', type=str, help='cpu or gpu')
+    parser.add_argument('--device', default='cpu', type=str, help='cpu or gpu')
+
     parser.add_argument('--data_dir', default='/opt/ml/input/data/train_dataset', type=str, help='data directory')
     parser.add_argument('--asset_dir', default='asset/', type=str, help='data directory')
+    parser.add_argument('--infer', default=False, help='inference or not')
+    parser.add_argument('--Tfixup', default=True, help='tfix or not')
+
     parser.add_argument('--file_name', default='train_data.csv', type=str, help='train file name')
+    # parser.add_argument('--file_name', default='test_jongho.csv', type=str, help='train file name')
+
     parser.add_argument('--model_dir', default='models/', type=str, help='model directory')
     parser.add_argument('--model_name', default='model.pt', type=str, help='model file name')
     parser.add_argument('--output_dir', default='output/', type=str, help='output directory')
     parser.add_argument('--test_file_name', default='test_data.csv', type=str, help='test file name')
+    # parser.add_argument('--test_file_name', default='test2.csv', type=str, help='test file name')
-    parser.add_argument('--max_seq_len', default=20, type=int, help='max sequence length')
+    parser.add_argument('--max_seq_len', default=24, type=int, help='max sequence length')
+
     parser.add_argument('--num_workers', default=1, type=int, help='number of workers')
     # model
@@ -42,9 +49,10 @@ def parse_args(mode='train'):
     parser.add_argument('--log_steps', default=50, type=int, help='print log per n steps')
-    ### important ###
-    parser.add_argument('--model', default='lstmattn', type=str, help='model type')
-    parser.add_argument('--optimizer', default='adam', type=str, help='optimizer type')
+
+    parser.add_argument('--model', default='lstm', type=str, help='model type')
+    parser.add_argument('--optimizer', default='adamP', type=str, help='optimizer type')
+
     parser.add_argument('--scheduler', default='plateau', type=str, help='scheduler type')
     args = parser.parse_args()
diff --git a/conf.yaml b/conf.yaml
new file mode 100644
index 0000000..c1d62f3
--- /dev/null
+++ b/conf.yaml
@@ -0,0 +1,90 @@
+model : bert # {lstm, lstmattn, bert, lgbm, lstmroberta, lastquery, saint, lstmalbertattn}
+
+# (non-generalized): uses base features only
+# - lstm, lstmattn, bert
+# (generalized): also uses the added columns as categorical features
+
+wandb :
+  using: True
+  project: DKT
+
+  ## put your own wandb id here
+  entity: vail131
+  tags:
+  - baseline
+
+## main params
+task_name: bert_seo1try_user_split
+seed: 42
+device: cuda
+
+data_dir: /opt/ml/input/data/train_dataset
+
+# For training files
+file_name: train_time_finalfix.csv
+test_train_file_name : test_time_finalfix2.csv
+
+# For predicting files
+test_file_name: test_time_finalfix.csv
+
+asset_dir: asset/
+model_dir: models/
+output_dir: output/
+
+max_seq_len: 128
+num_workers: 1
+
+## K-fold params
+use_kfold : False # run k-fold with n folds
+use_stratify : False
+n_fold : 4
+split_by_user : False # split the k-fold dataset by user
+user_split_augmentation : True
+use_total_data : True
+## model
+hidden_dim : 512
+n_layers : 2
+n_heads : 2
+drop_out: 0.2
+
+# train
+n_epochs: 20
+batch_size: 128
+lr: 0.0001
+clip_grad : 10
+patience : 5
+log_steps : 50
+split_ratio : 0.8
+
+# important
+optimizer : adamW
+scheduler: plateau
+
+
+# use only in lgbm
+lgbm:
+  model_params: {
+    'objective': 'binary', # binary classification
+    'boosting_type': 'gbdt',
+    'metric': 'auc', # evaluation metric
+    'feature_fraction': 0.4, # originally 0.8, feature sampling ratio
+    'bagging_fraction': 0.6, # originally 0.8, data sampling ratio
+    'bagging_freq': 1,
+    'n_estimators': 10000, # number of trees
+    'early_stopping_rounds': 100,
+    'seed': 42,
+    'verbose': -1,
+    'n_jobs': -1,
+  }
+
+  verbose_eval : 100 # originally 100
+  num_boost_round : 500
+  early_stopping_rounds : 100
+
+
+## args for LGBM feature engineering
+make_sharing_feature : True # extract statistics features from train + test (except last rows)
+use_test_data : True # use test_data for training
+
+# use distance features
+use_distance : False
\ No newline at end of file
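For the `use_kfold` / `split_by_user` options above, a user-level split means one user's rows never land in both a train fold and its validation fold. A minimal sketch, assuming only a DataFrame with a `userID` column; GroupKFold is one way to realize this, and the repo's own splitter may differ:

```python
# Sketch of user-level k-fold: fold membership is decided per userID,
# so a user's interaction sequence never leaks across folds.
import pandas as pd
from sklearn.model_selection import GroupKFold

def user_kfold(df: pd.DataFrame, n_fold: int = 4):
    gkf = GroupKFold(n_splits=n_fold)
    for train_idx, valid_idx in gkf.split(df, groups=df['userID']):
        yield df.iloc[train_idx], df.iloc[valid_idx]
```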
diff --git a/conf.yml b/conf.yml
deleted file mode 100644
index ebae3e4..0000000
--- a/conf.yml
+++ /dev/null
@@ -1,68 +0,0 @@
-model : lstm # {lstm, lstmattn, bert, lgbm, lstmroberta, lastquery, saint}
-
-wandb :
-  using: True
-  project: DKT
-
-  ## put your own wandb id here
-  entity: pomdeyo
-  tags:
-  - baseline
-
-## main params
-task_name: lstm_test
-seed: 42
-device: cuda
-
-data_dir: /opt/ml/input/data/train_dataset
-file_name: train_data.csv
-test_file_name: test_data.csv
-
-asset_dir: asset/
-model_dir: models/
-output_dir: output/
-
-max_seq_len: 20
-num_workers: 1
-use_stratify : True
-n_fold : 8
-
-## model
-hidden_dim : 128
-n_layers : 4
-n_heads : 2
-drop_out: 0.2
-
-# train
-n_epochs: 200
-batch_size: 64
-lr: 0.0001
-clip_grad : 10
-patience : 20
-log_steps : 50
-
-# important
-optimizer : adamW
-scheduler: plateau
-
-
-# use only in lgbm
-lgbm:
-  model_params: {
-    'objective': 'binary', # binary classification
-    'boosting_type': 'gbdt',
-    'metric': 'auc', # evaluation metric
-    'feature_fraction': 0.8, # originally 0.8, feature sampling ratio
-    'bagging_fraction': 0.8, # originally 0.8, data sampling ratio
-    'bagging_freq': 1,
-    'n_estimators': 10000, # number of trees
-    'early_stopping_rounds': 100,
-    'seed': 42,
-    'verbose': -1,
-    'n_jobs': -1,
-  }
-
-  verbose_eval : 100 # originally 100
-  num_boost_round : 500
-  early_stopping_rounds : 100
-
\ No newline at end of file
diff --git a/dkt/__init__.py b/dkt/__init__.py
new file mode 100644
index 0000000..6fd3c37
--- /dev/null
+++ b/dkt/__init__.py
@@ -0,0 +1 @@
+from .metric import *
\ No newline at end of file
diff --git a/dkt/criterion.py b/dkt/criterion.py
index 3d46a7f..b17d731 100644
--- a/dkt/criterion.py
+++ b/dkt/criterion.py
@@ -4,4 +4,6 @@
 def get_criterion(pred, target):
     loss = nn.BCELoss(reduction="none")
+    # loss = nn.CrossEntropyLoss(reduction="none")
+
     return loss(pred, target)
\ No newline at end of file
diff --git a/dkt/dataloader.py b/dkt/dataloader.py
index b7de475..183c95d 100644
--- a/dkt/dataloader.py
+++ b/dkt/dataloader.py
@@ -7,8 +7,15 @@ from sklearn.preprocessing import LabelEncoder
 import numpy as np
 import torch
+from makefeature import *
+from .features import Features as fe
+
+n_test_level_diff=10000
+n_unique = 0
+
 from lgbm_utils import *
+
 class Preprocess:
     def __init__(self,args):
         self.args = args
@@ -22,13 +29,14 @@
     def get_train_data(self):
         return self.train_data
     def get_test_data(self):
         return self.test_data
+
     def split_data(self, data, ratio=0.7, shuffle=True, seed=42):
         """
         split data into two parts with a given ratio.
         """
         # lgbm case
         if self.args.model=='lgbm':
-            return lgbm_split_data(data,ratio)
+            return lgbm_split_data(data,ratio,seed)
         # non-lgbm case
         if shuffle:
@@ -42,18 +50,32 @@
         return data_1, data_2
     def __save_labels(self, encoder, name):
-        le_path = os.path.join(self.args.asset_dir, name + '_classes.npy')
+        le_path = os.path.join(self.args.asset_dir, name + '_classes.npy') # the classes are saved as numpy
         np.save(le_path, encoder.classes_)
     def __preprocessing(self, df, is_train = True):
-        cate_cols = ['assessmentItemID', 'testId', 'KnowledgeTag']
+        # pick out the categorical columns
+        cate_cols=[]
+        cont_cols=[]
+        for column in df.columns:
+            if df[column].dtype==object:
+                cate_cols.append(column)
+            else :
+                cont_cols.append(column)
+
+        # store the feature-name lists on the conf/args
+        self.args.cate_feats=cate_cols
+        self.args.cont_feats=cont_cols
+
+        print(f"{len(cate_cols)} categorical and {len(cont_cols)} continuous features")
+        print(f'categorical: {cate_cols}')
+        print(f'continuous: {cont_cols}')
         if not os.path.exists(self.args.asset_dir):
             os.makedirs(self.args.asset_dir)
-        for col in cate_cols:
-
-
+        for col in cate_cols[1:]:
             le = LabelEncoder()
             if is_train:
                 #For UNKNOWN class
                 a = df[col].unique().tolist() + ['unknown']
                 le.fit(a)
                 self.__save_labels(le, col)
             else:
                 label_path = os.path.join(self.args.asset_dir,col+'_classes.npy')
                 le.classes_ = np.load(label_path)
                 df[col] = df[col].apply(lambda x: x if x in le.classes_ else 'unknown')
-            # assume every column is categorical
             df[col]= df[col].astype(str)
             test = le.transform(df[col])
             df[col] = test
-
-        def convert_time(s):
-            timestamp = time.mktime(datetime.strptime(s, '%Y-%m-%d %H:%M:%S').timetuple())
-            return int(timestamp)
-
-        df['Timestamp'] = df['Timestamp'].apply(convert_time)
-
+        # store cate feat names / unique-value counts on the conf as a dict
+        self.args.cate_feat_dict=dict(zip(cate_cols,[len(df[col].unique()) for col in cate_cols]))
         return df
     def __feature_engineering(self, df):
         #TODO
         if self.args.model=='lgbm':
-            return make_lgbm_feature(df)
+            return make_lgbm_feature(self.args, df)
         else:
+            # the non-lgbm models need their own FE
+            # df = fe.feature_engineering_03(df) # Jongho's features must come first.
+            df=make_feature(self.args,df)
+            # df = df.merge(fe.feature_engineering_06(pd.DataFrame(df)), left_index=True,right_index=True, how='left')
+            print(f'columns after FE : {df.columns}')
+            print(df.columns)
+
+            print('inspect dataframe')
+            print(df)
+
+            # drop_cols = ['_',"index","point","answer_min_count","answer_max_count","user_count",'sec_time'] # columns to drop
+            # for col in drop_cols:
+            #     if col in df.columns:
+            #         df.drop([col],axis=1, inplace=True)
+            # print(f"after drop : {df.columns}")
+
+            delete_feats=['Timestamp','sec_time']
+            df=df.drop(columns=delete_feats)
+            features = df.columns
+            print(f"after drop : {len(features)} features, {features}")
             return df
-
     def load_data_from_file(self, file_name, is_train=True):
         csv_file_path = os.path.join(self.args.data_dir, file_name)
-        df = pd.read_csv(csv_file_path)#, nrows=100000)
-        df = self.__feature_engineering(df)
-
-
+        print(f'csv_file_path : {csv_file_path}')
+        df = pd.read_csv(csv_file_path)
+        print("user count before load data", len(df['userID'].unique()))
         if self.args.model=='lgbm':
-            # sort as below to respect the per-user sequence order
+
             df.sort_values(by=['userID','Timestamp'], inplace=True)
             return df
+
+        if is_train and self.args.user_split_augmentation:
+            if self.args.use_total_data:
+                csv_file_path = os.path.join(self.args.data_dir, 'test_time_finalfix2.csv')
+                print(f'csv_file_path : {csv_file_path}')
+                tdf = pd.read_csv(csv_file_path)
+                df=pd.concat([df,tdf],ignore_index=True)
+            # Jongho's user split augmentation
+            df['Timestamp']=pd.to_datetime(df['Timestamp'].values)
+            df['month'] = df['Timestamp'].dt.month
+            df['userID'] = (df['userID'].map(str)+df['month'].apply(lambda x: ('0'+str(x)) if x<10 else str(x))).astype('int32')
+            # df['userID'] = df['userID'].map(str)+'-'+df['month'].map(str)
+            df.drop(columns=['month'],inplace=True)
+            print("user count after user_augmentation", len(df['userID'].unique()))

-        df = self.__preprocessing(df, is_train)
-
-        # used later to size the embedding_layer input when embedding features
-        self.args.n_questions = len(np.load(os.path.join(self.args.asset_dir,'assessmentItemID_classes.npy')))
-        self.args.n_test = len(np.load(os.path.join(self.args.asset_dir,'testId_classes.npy')))
-        self.args.n_tag = len(np.load(os.path.join(self.args.asset_dir,'KnowledgeTag_classes.npy')))
+        # these two columns are ints, so cast them up front so they get classified as cate_cols
+        df['userID']=df['userID'].astype(str)
+        df['KnowledgeTag']=df['KnowledgeTag'].astype(str)
+
+        col_cnt = len(df.columns)
+        df = self.__feature_engineering(df)
+        df = self.__preprocessing(df, is_train)
+
+        df['userID']=df['userID'].astype(int)
+        df['KnowledgeTag']=df['KnowledgeTag'].astype(int)
-        df = df.sort_values(by=['userID','Timestamp'], axis=0)
-        columns = ['userID', 'assessmentItemID', 'testId', 'answerCode', 'KnowledgeTag']
+        # columns are cate_feats followed by cont_feats; the first cate feat is userID
+        # and the first cont feat is answerCode
+        columns=self.args.cate_feats+self.args.cont_feats
+        print(columns)
+        # exclude the user from the existing features
+        ret = columns[1:]
+        # pop answerCode, the first continuous feature,
+        ret.pop(len(self.args.cate_feats)-1)
+        # and append it at the very end
+        ret.append('answerCode')
+        print("ret", ret)
+        print("answerCode moved to the back", ret)
         group = df[columns].groupby('userID').apply(
-            lambda r: (
-                r['testId'].values,
-                r['assessmentItemID'].values,
-                r['KnowledgeTag'].values,
-                r['answerCode'].values
-            )
+            lambda r: tuple([r[i].values for i in ret])
         )
+        print(group)
+        print(f"users {len(group)}, features {len(group.iloc[0])}, problems solved {len(group.iloc[0][0])}")
+        print(f'group.values->{len(group.values)}')
+        print("user count after load data", len(df['userID'].unique()))
+        return group.values, pd.DataFrame(df['userID'].unique(), columns=['userID'])
-        return group.values
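The user_split_augmentation block above appends a zero-padded month to userID, turning one long-history user into several short-term pseudo-users (the user_month_split technique from the README). A toy illustration with made-up data:

```python
# Toy illustration of the month-based user split performed above.
import pandas as pd

df = pd.DataFrame({
    'userID': [7, 7, 7],
    'Timestamp': pd.to_datetime(['2020-03-01', '2020-03-15', '2020-11-02']),
})
df['month'] = df['Timestamp'].dt.month
df['userID'] = (df['userID'].map(str)
                + df['month'].apply(lambda x: f'{x:02d}')).astype('int32')
print(df['userID'].tolist())  # [703, 703, 711]: one user becomes two pseudo-users
```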
     def load_train_data(self, file_name):
         self.train_data = self.load_data_from_file(file_name)
     def load_test_data(self, file_name):
-        self.test_data = self.load_data_from_file(file_name, is_train= False)
+        self.test_data = self.load_data_from_file(file_name,is_train=False)
+
+class MyDKTDataset(torch.utils.data.Dataset):
+    def __init__(self,data, args):
+        self.data = data
+        self.args = args
+
+    def __getitem__(self,index):
+        row = self.data[index]
+
+        # sequence length of each sample
+        seq_len = len(row[0])
+        # print(f'row size : {len(row)}')
+
+        # test, question, tag, correct, solve_time...etc
+        columns = [row[i] for i in range(len(row))]
+
+        # if longer than max seq len, truncate; otherwise leave as is
+        if seq_len > self.args.max_seq_len:
+            for i, col in enumerate(columns):
+                columns[i] = col[-self.args.max_seq_len:]
+            mask = np.ones(self.args.max_seq_len, dtype=np.int16)
+        else:
+            mask = np.zeros(self.args.max_seq_len, dtype=np.int16)
+            mask[-seq_len:] = 1
+
+        # include the mask in the columns list as well
+        columns.append(mask)
+
+        # convert np.array -> torch.tensor
+        for i, col in enumerate(columns):
+            columns[i] = torch.tensor(col)
+
+        return columns
+
+    def __len__(self):
+        return len(self.data)
+
+class TestDKTDataset(torch.utils.data.Dataset):
+    def __init__(self,data, args):
+        self.data = data
+        self.args = args
+
+    def __getitem__(self,index):
+        row = self.data[index]
+
+        # sequence length of each sample
+        seq_len = len(row[0])
+        print(f'row size : {len(row)}')
+
+        # test, question, tag, correct, solve_time
+        cate_cols = [row[i] for i in range(len(row))]
+
+
+        # if longer than max seq len, truncate; otherwise leave as is
+        if seq_len > self.args.max_seq_len:
+            for i, col in enumerate(cate_cols):
+                cate_cols[i] = col[-self.args.max_seq_len:]
+            mask = np.ones(self.args.max_seq_len, dtype=np.int16)
+        else:
+            mask = np.zeros(self.args.max_seq_len, dtype=np.int16)
+            mask[-seq_len:] = 1
+
+        # include the mask in the columns list as well
+        cate_cols.append(mask)
+
+        # convert np.array -> torch.tensor
+        for i, col in enumerate(cate_cols):
+            cate_cols[i] = torch.tensor(col.astype(int))
+
+        return cate_cols
+
+    def __len__(self):
+        return len(self.data)
+
 class DKTDataset(torch.utils.data.Dataset):
     def __init__(self, data, args):
         self.data = data
-        self.args=args
+        self.args = args
+
     def __getitem__(self, index):
         row = self.data[index]
@@ -147,9 +283,11 @@
         # if longer than max seq len, truncate; otherwise leave as is
         if seq_len > self.args.max_seq_len:
             for i, col in enumerate(cate_cols):
+                # keep the most recent entries, discard the old records
                 cate_cols[i] = col[-self.args.max_seq_len:]
             mask = np.ones(self.args.max_seq_len, dtype=np.int16)
         else:
+            # pad with zeros
             mask = np.zeros(self.args.max_seq_len, dtype=np.int16)
             mask[-seq_len:] = 1
@@ -171,18 +309,22 @@
 def collate(batch):
     col_n = len(batch[0])
     col_list = [[] for _ in range(col_n)]
+    # max_seq_len taken from the mask length
     max_seq_len = len(batch[0][-1])
     # group the batch values per column
     for row in batch:
         for i, col in enumerate(row):
+            # pad at the front so the interactions are learned sequentially
             pre_padded = torch.zeros(max_seq_len)
             pre_padded[-len(col):] = col
             col_list[i].append(pre_padded)
     for i, _ in enumerate(col_list):
+        # stack concatenates the feature tensors along a new dim (cf. torch.cat)
+        # per batch: shape(len(feature), max_seq_len) -> shape(len(feature), 1, max_seq_len)
         col_list[i] =torch.stack(col_list[i])
     return tuple(col_list)
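To make the pre-padding in collate concrete, here is a toy run of the same logic on one short feature column (illustrative values only):

```python
# Toy demo of the pre-padding used in collate(): short sequences are
# left-padded with zeros so the most recent interaction stays last.
import torch

max_seq_len = 4
col = torch.tensor([5., 6.])          # a length-2 feature column
pre_padded = torch.zeros(max_seq_len)
pre_padded[-len(col):] = col
print(pre_padded)                      # tensor([0., 0., 5., 6.])
```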
@@ -190,16 +332,79 @@
 def get_loaders(args, train, valid):
-    pin_memory = False
+    pin_memory = True
     train_loader, valid_loader = None, None
     if train is not None:
-        trainset = DKTDataset(train, args)
+        # trainset = DKTDataset(train, args)
+        # trainset = DevDKTDataset(train,args)
+        # trainset = TestDKTDataset(train,args)
+        trainset = MyDKTDataset(train,args)
         train_loader = torch.utils.data.DataLoader(trainset, num_workers=args.num_workers, shuffle=True,
                                                    batch_size=args.batch_size, pin_memory=pin_memory, collate_fn=collate)
     if valid is not None:
-        valset = DKTDataset(valid, args)
+        valset = MyDKTDataset(valid,args)
         valid_loader = torch.utils.data.DataLoader(valset, num_workers=args.num_workers, shuffle=False,
                                                    batch_size=args.batch_size, pin_memory=pin_memory, collate_fn=collate)
-    return train_loader, valid_loader
\ No newline at end of file
+    return train_loader, valid_loader
+
+
+def slidding_window(data, args):
+    window_size = args.max_seq_len
+    stride = args.stride
+
+    augmented_datas = []
+    for row in data:
+        seq_len = len(row[0])
+
+        # if seq len is equal to or shorter than the window size, do not augment
+        if seq_len <= window_size:
+            augmented_datas.append(row)
+        else:
+            total_window = ((seq_len - window_size) // stride) + 1
+
+            # apply the sliding window from the front
+            for window_i in range(total_window):
+                # list collecting the windowed slices
+                window_data = []
+                for col in row:
+                    window_data.append(col[window_i*stride:window_i*stride + window_size])
+
+                # Shuffle
+                # do not shuffle the last window
+                if args.shuffle and window_i + 1 != total_window:
+                    shuffle_datas = shuffle(window_data, window_size, args)
+                    augmented_datas += shuffle_datas
+                else:
+                    augmented_datas.append(tuple(window_data))
+
+            # if the sliding window misses the tail of the sequence, add it
+            total_len = window_size + (stride * (total_window - 1))
+            if seq_len != total_len:
+                window_data = []
+                for col in row:
+                    window_data.append(col[-window_size:])
+                augmented_datas.append(tuple(window_data))
+
+
+    return augmented_datas
+
+def shuffle(data, data_size, args):
+    shuffle_datas = []
+    for i in range(args.shuffle_n):
+        # reshuffle the window shuffle_n times, adding each shuffle as new data
+        shuffle_data = []
+        random_index = np.random.permutation(data_size)
+        for col in data:
+            shuffle_data.append(col[random_index])
+        shuffle_datas.append(tuple(shuffle_data))
+    return shuffle_datas
+
+
+def data_augmentation(data, args):
+    if args.window == True:
+        data = slidding_window(data, args)
+
+    return data
+
diff --git a/dkt/features.py b/dkt/features.py
new file mode 100644
index 0000000..42ceb59
--- /dev/null
+++ b/dkt/features.py
@@ -0,0 +1,336 @@
+import os
+from datetime import datetime
+import time
+import tqdm
+import pandas as pd
+import random
+from sklearn.preprocessing import LabelEncoder
+import numpy as np
+import torch
+
+
+class Features:
+    # base_line
+    def feature_engineering_01(df):
+        #TODO
+        # sort as below to respect the per-user sequence order
+        df.sort_values(by=['userID','Timestamp'], inplace=True)
+
+        # cumulative per-user counts of problems solved, correct answers, and accuracy over time
+        df['user_correct_answer'] = df.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        df['user_total_answer'] = df.groupby('userID')['answerCode'].cumcount()
+        df['user_acc'] = df['user_correct_answer']/df['user_total_answer']
+
+        # overall accuracy of testId and KnowledgeTag is computed in one pass
+        # this data is reused for the submission dataset as well
+        correct_t = df.groupby(['testId'])['answerCode'].agg(['mean', 'sum'])
+        correct_t.columns = ["test_mean", 'test_sum']
+        correct_k = df.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum'])
+        correct_k.columns = ["tag_mean", 'tag_sum']
+
+        df = pd.merge(df, correct_t, on=['testId'], how="left")
+        df = pd.merge(df, correct_k, on=['KnowledgeTag'], how="left")
+        return df
+
+    def feature_engineering_02(df):
+
+        return df
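A hedged usage sketch for data_augmentation above. It only assumes args carries the fields the function actually reads (max_seq_len, stride, window, shuffle, shuffle_n); the driver itself is illustrative, not repo code:

```python
# Illustrative driver for the sliding-window augmentation defined above.
from types import SimpleNamespace
import numpy as np

args = SimpleNamespace(max_seq_len=24, stride=12,   # window size / hop
                       window=True,
                       shuffle=False, shuffle_n=2)  # optional in-window shuffle

# one user: three feature columns of length 60
row = tuple(np.arange(60) for _ in range(3))
augmented = data_augmentation([row], args)
print(len(augmented))  # ((60-24)//12)+1 = 4 windows; a tail window is appended only if the stride misses it
```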
+    # Jongho's data (not planned to be used)
+    def feature_engineering_03(df):
+        df = pd.read_csv('/opt/ml/input/data/train_dataset/test_jongho.csv')
+        return df
+
+    # Seoil's features
+    def feature_engineering_04(df):
+
+        # testId_mean_sum = df_train_ori.groupby(['testId'])['answerCode'].agg(['mean','sum']).to_dict()
+        # assessmentItemID_mean_sum = df_train_ori.groupby(['assessmentItemID'])['answerCode'].agg(['mean', 'sum']).to_dict()
+        # KnowledgeTag_mean_sum = df_train_ori.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum']).to_dict()
+
+        # detect test sheets whose items are missing in the middle (1,2,3,,5)
+        # def assessmentItemID2item(x):
+        #     return int(x[-3:]) - 1 # so it starts from 0
+        # df['item'] = df.assessmentItemID.map(assessmentItemID2item)
+
+        # item_size = df[['assessmentItemID', 'testId']].drop_duplicates().groupby('testId').size()
+        # testId2maxlen = item_size.to_dict() # to handle sheets that get solved repeatedly
+
+        # item_max = df.groupby('testId').item.max()
+        # print(len(item_max[item_max + 1 != item_size]), ' test sheets have missing middle items; item_order is the correct order') # item_max starts from 0, hence + 1
+        # shit_index = item_max[item_max +1 != item_size].index
+        # shit_df = df.loc[df.testId.isin(shit_index),['assessmentItemID', 'testId']].drop_duplicates().sort_values('assessmentItemID')
+        # shit_df_group = shit_df.groupby('testId')
+
+        # shitItemID2item = {}
+        # for key in shit_df_group.groups:
+        #     for i, (k,_) in enumerate(shit_df_group.get_group(key).values):
+        #         shitItemID2item[k] = i
+
+        # def assessmentItemID2item_order(x):
+        #     if x in shitItemID2item:
+        #         return int(shitItemID2item[x])
+        #     return int(x[-3:]) - 1 # so it starts from 0
+        # df['item_order'] = df.assessmentItemID.map(assessmentItemID2item_order)
+
+
+
+        # # sort as below to respect the per-user sequence order
+        # df.sort_values(by=['userID','Timestamp'], inplace=True)
+
+        # # for each sheet a user solved: total correct / attempt count / accuracy (3 solves count 3x)
+        # df_group = df.groupby(['userID','testId'])['answerCode']
+        # df['user_total_correct_cnt'] = df_group.transform(lambda x: x.cumsum().shift(1))
+        # df['user_total_ans_cnt'] = df_group.cumcount()
+        # df['user_total_acc'] = df['user_total_correct_cnt'] / df['user_total_ans_cnt']
+
+        # # per-sheet solve order for the user (does not accumulate across repeats)
+        # # how many times a given sheet was repeated (solved twice -> retest == 1)
+        # df['test_size'] = df.testId.map(testId2maxlen)
+        # df['retest'] = df['user_total_ans_cnt'] // df['test_size']
+        # df['user_test_ans_cnt'] = df['user_total_ans_cnt'] % df['test_size']
+
+        # # per-sheet user accuracy
+        # df['user_test_correct_cnt'] = df.groupby(['userID','testId','retest'])['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        # df['user_acc'] = df['user_test_correct_cnt']/df['user_test_ans_cnt']
+
+        # # these features keep the values obtained from train as-is.
+        # df["test_mean"] = df.testId.map(testId_mean_sum['mean'])
+        # df['test_sum'] = df.testId.map(testId_mean_sum['sum'])
+        # df["ItemID_mean"] = df.assessmentItemID.map(assessmentItemID_mean_sum['mean'])
+        # df['ItemID_sum'] = df.assessmentItemID.map(assessmentItemID_mean_sum['sum'])
+        # df["tag_mean"] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['mean'])
+        # df['tag_sum'] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['sum'])
+
+        return df
+    def feature_engineering_05(df):
+
+        # sort as below to respect the per-user sequence order
+        df.sort_values(by=['userID','Timestamp'], inplace=True)
+
+        # cumulative per-user counts of problems solved, correct answers, and accuracy over time
+        df['user_correct_answer'] = df.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        df['user_total_answer'] = df.groupby('userID')['answerCode'].cumcount()
+        df['user_acc'] = df['user_correct_answer']/df['user_total_answer']
+
+        # overall accuracy of testId and KnowledgeTag is computed in one pass
+        # this data is reused for the submission dataset as well
+        correct_t = df.groupby(['testId'])['answerCode'].agg(['mean', 'sum'])
+        correct_t.columns = ["test_mean", 'test_sum']
+        correct_k = df.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum'])
+        correct_k.columns = ["tag_mean", 'tag_sum']
+
+        df = pd.merge(df, correct_t, on=['testId'], how="left")
+        df = pd.merge(df, correct_k, on=['KnowledgeTag'], how="left")
+
+
+        # last_Tag ans rate
+        last_tag_ans_rate = Features.last_tag_rate(df)
+        df = pd.merge(df, last_tag_ans_rate, on=['userID'], how='left')
+        # df=df.drop(columns=['_'])
+        print(f'inspect df')
+        print('answer rate : '+str(df['ans_rate'].nunique()))
+        print('tag_sum uniques : '+str(df['tag_sum'].nunique()))
+        print('tag_mean uniques : '+str(df['tag_mean'].nunique()))
+        print(f'tag_sum max : '+str(df['tag_sum'].max()))
+
+        _max_sum = df['tag_sum'].max()
+        _max_mean = df['tag_mean'].max()
+
+        df['tag_sum'] = df['tag_sum'].apply(lambda x : x/(10*len(str(_max_sum))))
+        df['tag_mean'] = df['tag_mean'].apply(lambda x : x/(10*len(str(_max_mean))))
+
+        return df[['ans_rate','tag_sum','tag_mean']]
+
+    # build the answer-rate function for the last tag
+    def last_tag_rate(df):
+        df = df.copy()
+
+        last_tag = df.groupby(['userID']).tail(1)
+        last_tag_1 = last_tag.loc[:, ['userID','KnowledgeTag']]
+
+        user_tag_ans = last_tag_1.merge(df, on=['userID','KnowledgeTag'], how='left')
+
+        user_tag_ans['count'] = user_tag_ans.groupby(['userID','KnowledgeTag'])['KnowledgeTag'].transform('count')
+        user_tag_ans['tag_cnt'] = user_tag_ans['count']-1
+        user_tag_ans['ans_cnt'] = user_tag_ans.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1))
+
+        user_tag_ans_rate = user_tag_ans.groupby('userID').tail(1)
+        user_tag_ans_rate['ans_cnt'] = user_tag_ans_rate['ans_cnt'].fillna(0)
+        user_tag_ans_rate['ans_rate'] = round(user_tag_ans_rate['ans_cnt']/ user_tag_ans_rate['tag_cnt'],2)
+        user_tag_ans_rate['ans_rate'] = user_tag_ans_rate['ans_rate'].fillna(0.00)
+
+        return user_tag_ans_rate.loc[:,['userID','ans_rate']]
+
+    # new: Chaewon's features
+    def feature_engineering_06(df):
+
+        def test_rate(df):
+            test_df = df.groupby(['testId'])['testId'].count().reset_index(name='test_cnt')
+            test_ans_df = df.groupby(['testId'])['answerCode'].sum().reset_index(name='test_ans_cnt')
+            test_df = test_df.merge(test_ans_df, on ='testId', how='left')
+            test_df['test_rate'] = round(test_df['test_ans_cnt'] / test_df['test_cnt'], 2)
+            test_df['test_rate'] = test_df['test_rate'].fillna(0.00)
+            return test_df.loc[:,['testId','test_rate']]
+        def que_rate(df):
+            que_df = df.groupby(['assessmentItemID'])['assessmentItemID'].count().reset_index(name='que_cnt')
+            que_ans_df = df.groupby(['assessmentItemID'])['answerCode'].sum().reset_index(name='que_ans_cnt')
+            que_df = que_df.merge(que_ans_df, on ='assessmentItemID', how='left')
+            que_df['que_rate'] = round(que_df['que_ans_cnt'] / que_df['que_cnt'], 2)
+            que_df['que_rate'] = que_df['que_rate'].fillna(0.00)
+            return que_df.loc[:,['assessmentItemID','que_rate']]
+
+        def user_test_rate(df):
+            df['index'] = df.index
+            u_test_cnt = df.groupby(['userID','testId'])['testId'].cumcount().reset_index(name='u_test_cnt')
+            user_test = df.merge(u_test_cnt, on='index', how='left')
+            u_test_cnt= user_test.groupby(['userID','testId'])['answerCode'].transform(lambda x: x.cumsum().shift(1)).reset_index(name='test_ans_cnt')
+            user_test_ans_sum = user_test.merge(u_test_cnt, on='index', how='left')
+            user_test_ans_sum['test_ans_cnt'] = user_test_ans_sum['test_ans_cnt'].fillna(0.0)
+
+            def rating_1(user_test_ans_sum):
+                if user_test_ans_sum['u_test_cnt'] == 0:
+                    return 0.50
+                else :
+                    return round(user_test_ans_sum['test_ans_cnt']/user_test_ans_sum['u_test_cnt'],2)
+
+            user_test_ans_sum['test_ans_rate'] = user_test_ans_sum.apply(rating_1, axis=1)
+
+            return user_test_ans_sum.loc[:,['index','u_test_cnt','test_ans_cnt','test_ans_rate']]
+
+        # apply the per-user tag answer rate sequentially
+        def ut_ans_rate(df) :
+            df['index'] = df.index
+
+            u_t_cnt = df.groupby(['userID','KnowledgeTag'])['KnowledgeTag'].cumcount().reset_index(name='u_tag_cnt')
+            user_tag = df.merge(u_t_cnt, on='index', how='left')
+            #user_tag['tag_cnt_1'] = user_tag['u_tag_cnt']+1
+
+            u_t_cnt= user_tag.groupby(['userID','KnowledgeTag'])['answerCode'].transform(lambda x: x.cumsum().shift(1)).reset_index(name='ans_cnt')
+            user_tag_ans_sum = user_tag.merge(u_t_cnt, on='index', how='left')
+            user_tag_ans_sum['ans_cnt'] = user_tag_ans_sum['ans_cnt'].fillna(0.0)
+
+            def rating(user_tag_ans_sum):
+                if user_tag_ans_sum['u_tag_cnt'] == 0:
+                    return 0.50
+                else :
+                    return round(user_tag_ans_sum['ans_cnt']/user_tag_ans_sum['u_tag_cnt'],2)
+            user_tag_ans_sum['tag_ans_rate'] = user_tag_ans_sum.apply(rating, axis=1)
+
+            return user_tag_ans_sum.loc[:,['index','u_tag_cnt','ans_cnt','tag_ans_rate']]
+
+        df['index'] = df.index
+
+        # sort as below to respect the per-user sequence order
+        df.sort_values(by=['userID','Timestamp'], inplace=True)
+
+        # cumulative per-user counts of problems solved, correct answers, and accuracy over time
+        df['user_correct_answer'] = df.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        df['user_total_answer'] = df.groupby('userID')['answerCode'].cumcount()
+        df['user_acc'] = df['user_correct_answer']/df['user_total_answer']
+
+        # overall accuracy of testId and KnowledgeTag is computed in one pass
+        # this data is reused for the submission dataset as well
+        correct_t = df.groupby(['testId'])['answerCode'].agg(['mean', 'sum'])
+        correct_t.columns = ["test_mean", 'test_sum']
+        correct_k = df.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum'])
+        correct_k.columns = ["tag_mean", 'tag_sum']
+
+        df = pd.merge(df, correct_t, on=['testId'], how="left")
+        df = pd.merge(df, correct_k, on=['KnowledgeTag'], how="left")
+
+
+        # # last_Tag ans rate
+        # last_tag_ans_rate = last_tag_rate(df)
+        # df = pd.merge(df, last_tag_ans_rate, on=['userID'], how='left')
+
+        # per-test answer rate
+        all_test_rate = test_rate(df)
+        df = pd.merge(df, all_test_rate, on='testId', how='left')
+
+
+        ########## added this part ##########
+        # per-question answer rate
+        all_que_rate = que_rate(df)
+        df = pd.merge(df, all_que_rate, on='assessmentItemID', how='left')
+        #####################################
+
+
+        # user_test_ans_rate
+        user_test_ans_rate = user_test_rate(df)
+        df = pd.merge(df, user_test_ans_rate, on='index', how='left')
+        # user_tag_seq_ans_rate
+        user_tag_ans_rate = ut_ans_rate(df)
+        df = pd.merge(df, user_tag_ans_rate, on='index', how='left')
+
+        for i in df.columns:
+            print(f'unique values in column {i} : {str(df[i].nunique())}')
+
+        tmp_list = ['user_correct_answer','user_total_answer'] # columns to squash into decimals
+        for i in tmp_list:
+            m = df[i].max()
+            print(f'max value : {m}')
+            df[i]=df[i].fillna(0).apply(lambda x : x/(10**len(str(int(m)))))
+        df['user_acc'] = df['user_acc'].fillna(0).apply(lambda x : x)
+        return df[['user_acc','user_correct_answer', 'user_total_answer']]
+
+    # per-group cumulative sum of correct answers
+    def feature_engineering_07(df):
+        df['test_user_correct_answer'] = df.groupby(['userID','testId'])['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        m = df['test_user_correct_answer'].max()
+        df['test_user_correct_answer'] = df['test_user_correct_answer'].fillna(0).apply(lambda x : x/(10**len(str(int(m)))))
+        return df
+
+    # per-group cumulative count of problems solved
+    def feature_engineering_08(df):
+        df['test_user_total_answer'] = df.groupby(['userID','testId'])['answerCode'].cumcount()
+        m = df['test_user_total_answer'].max()
+
+        df['test_user_total_answer'] = df['test_user_total_answer'].fillna(0).apply(lambda x : x/(10**len(str(int(m)))))
+        return df
+
+    # cumulative accuracy
+    def feature_engineering_09(df):
+        df['test_user_correct_answer'] = df.groupby(['userID','testId'])['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        df['test_user_total_answer'] = df.groupby(['userID','testId'])['answerCode'].cumcount()
+        df['test_user_acc'] = df['test_user_correct_answer']/df['test_user_total_answer']
+
+        return df.fillna(0.5)
+
+    # cumulative sum of correct answers (by tag)
+    def feature_engineering_10(df):
+        df['tag_user_correct_answer_tag'] = df.groupby(['userID','KnowledgeTag'])['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        m = df['tag_user_correct_answer_tag'].max()
+        df['tag_user_correct_answer_tag'] = df['tag_user_correct_answer_tag'].fillna(0).apply(lambda x : x/(10**len(str(int(m)))))
+        return df
+
+    # cumulative count of problems solved (by tag)
+    def feature_engineering_11(df):
+        df['tag_user_total_answer_tag'] = df.groupby(['userID','KnowledgeTag'])['answerCode'].cumcount()
+        m = df['tag_user_total_answer_tag'].max()
+        df['tag_user_total_answer_tag'] = df['tag_user_total_answer_tag'].fillna(0).apply(lambda x : x/(10**len(str(int(m)))))
+        return df
+
+    # cumulative accuracy (by tag)
+    def feature_engineering_12(df):
+        df['tag_user_correct_answer_tag'] = df.groupby(['userID','KnowledgeTag'])['answerCode'].transform(lambda x: x.cumsum().shift(1))
+        df['tag_user_total_answer_tag'] = df.groupby(['userID','KnowledgeTag'])['answerCode'].cumcount()
+        df['tag_user_acc_tag'] = df['tag_user_correct_answer_tag']/df['tag_user_total_answer_tag']
+
+        return df.fillna(0.5)
+
+    # problem solve time (by Dohun), per user (current time_stamp - previous time_stamp)
+    def feature_engineering_13(df):
+        df = pd.read_csv('/opt/ml/input/data/train_dataset/time_train.csv')
+        df=df.sort_values(by=["userID","sec_time"],ascending=True)
+        df['time_diff'] = df.groupby('userID')['sec_time'].apply(lambda x: (x-x.shift(1))).fillna(0).apply(lambda x : min(int(x),3600*24*3))
+        _max = int(df['time_diff'].max())
+        df['time_diff'] = df['time_diff'].apply(lambda x:x/(10**(len(f'{_max}'))))
+        _max = int(df['solve_time'].max())
+        df['solve_time'] = df['solve_time'].apply(lambda x:x/(10**(len(f'{_max}'))))
+
+        return df
+
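Several of the features above are squashed with the same digit-count trick: dividing a non-negative column by 10**len(str(int(max))) maps it into [0, 1). A tiny worked example with toy numbers:

```python
# Worked example of the digit-count scaling used repeatedly above.
m = 734                          # column max
scale = 10 ** len(str(int(m)))   # 3 digits -> 10**3 = 1000
print(734 / scale, 12 / scale)   # 0.734 0.012 -> always below 1.0
```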
diff --git a/dkt/metric.py b/dkt/metric.py
index 9ffe2f2..644ab72 100644
--- a/dkt/metric.py
+++ b/dkt/metric.py
@@ -1,8 +1,11 @@
-from sklearn.metrics import roc_auc_score, accuracy_score
+from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, roc_curve
 import numpy as np
 def get_metric(targets, preds):
     auc = roc_auc_score(targets, preds)
     acc = accuracy_score(targets, np.where(preds >= 0.5, 1, 0))
-
-    return auc, acc
\ No newline at end of file
+    precision=precision_score(targets, np.where(preds >= 0.5, 1, 0))
+    recall=recall_score(targets, np.where(preds >= 0.5, 1, 0))
+    f1=f1_score(targets, np.where(preds >= 0.5, 1, 0))
+
+    return auc, acc, precision, recall, f1
\ No newline at end of file
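With the change above, get_metric now returns five metrics, the last four at a fixed 0.5 threshold. A short usage sketch with toy arrays:

```python
# Toy call of the extended get_metric (AUC, accuracy, precision, recall, F1).
import numpy as np
from dkt.metric import get_metric

targets = np.array([1, 0, 1, 1, 0])
preds   = np.array([0.9, 0.2, 0.4, 0.8, 0.6])
auc, acc, precision, recall, f1 = get_metric(targets, preds)
```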
diff --git a/dkt/model.py b/dkt/model.py
index e475abc..e7b7383 100644
--- a/dkt/model.py
+++ b/dkt/model.py
@@ -1,3 +1,5 @@
+from operator import index
+from numpy.lib.function_base import select
 import torch
 import torch.nn as nn
 import torch.nn.functional as F
@@ -9,6 +11,7 @@
 from torch.nn.modules import dropout
 from torchsummary import summary
+from transformers.utils.dummy_pt_objects import AlbertModel
 try:
     from transformers.modeling_bert import BertConfig, BertEncoder, BertModel
@@ -17,96 +20,17 @@
 from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel
 from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel
+from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel
+from transformers.models.albert.configuration_albert import AlbertConfig
 from transformers import BertPreTrainedModel
+import re
-class LSTM(nn.Module):
-
-    def __init__(self, args):
-        super(LSTM, self).__init__()
-        self.args = args
-        self.device = args.device
-
-        self.hidden_dim = self.args.hidden_dim
-        self.n_layers = self.args.n_layers
-
-        # Embedding
-        # interaction currently consists of correct: correct(1, 2) + padding(0)
-        self.args.n_questions = len(np.load(os.path.join(args.asset_dir,'assessmentItemID_classes.npy')))
-        self.args.n_test = len(np.load(os.path.join(args.asset_dir,'testId_classes.npy')))
-        self.args.n_tag = len(np.load(os.path.join(args.asset_dir,'KnowledgeTag_classes.npy')))
-
-        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
-        self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
-        self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
-        self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
-
-        # embedding combination projection
-        self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
-
-        self.lstm = nn.LSTM(self.hidden_dim,
-                            self.hidden_dim,
-                            self.n_layers,
-                            batch_first=True)
-
-        # Fully connected layer
-        self.fc = nn.Linear(self.hidden_dim, 1)
-
-        self.activation = nn.Sigmoid()
-
-    def init_hidden(self, batch_size):
-        h = torch.zeros(
-            self.n_layers,
-            batch_size,
-            self.hidden_dim)
-        h = h.to(self.device)
-
-        c = torch.zeros(
-            self.n_layers,
-            batch_size,
-            self.hidden_dim)
-        c = c.to(self.device)
-
-        return (h, c)
-
-    def forward(self, input):
-
-        test, question, tag, _, mask, interaction, _ = input
-
-        batch_size = interaction.size(0)
-
-        # Embedding
-
-        embed_interaction = self.embedding_interaction(interaction)
-        embed_test = self.embedding_test(test)
-        embed_question = self.embedding_question(question)
-        embed_tag = self.embedding_tag(tag)
-
-        embed = torch.cat([embed_interaction,
-                           embed_test,
-                           embed_question,
-                           embed_tag,], 2)
-
-        X = self.comb_proj(embed)
-
-        hidden = self.init_hidden(batch_size)
-        out, hidden = self.lstm(X, hidden)
-        out = out.contiguous().view(batch_size, -1, self.hidden_dim)
-
-        out = self.fc(out)
-        preds = self.activation(out).view(batch_size, -1)
-
-        return preds
-
-
-
-class LSTMATTN(nn.Module):
-
+class TestLSTMConvATTN(nn.Module):
     def __init__(self, args):
-        super(LSTMATTN, self).__init__()
+        super(TestLSTMConvATTN, self).__init__()
         self.args = args
         self.device = args.device
@@ -115,22 +39,38 @@
         self.n_heads = self.args.n_heads
         self.drop_out = self.args.drop_out
+        #dev
+        self.n_other_features = self.args.n_other_features
+        print(self.n_other_features)
+
         # Embedding
         # interaction currently consists of correct: correct(1, 2) + padding(0)
         self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
         self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
         self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
         self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
+        # other feature
+        self.f_cnt = len(self.n_other_features) # number of extra features
+        # self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+self.f_cnt), self.hidden_dim)
-        # embedding combination projection
-        self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
+        self.comb_proj = nn.Sequential(
+            nn.Linear((self.hidden_dim//3)*(4), self.hidden_dim),
+            nn.LayerNorm(self.hidden_dim)
+        )
         self.lstm = nn.LSTM(self.hidden_dim,
                             self.hidden_dim,
                             self.n_layers,
                             batch_first=True)
-
-        self.config = BertConfig(
+        # self.embedding_test = nn.Embedding(100,self.hidden_dim//3)
+
+        # continuous-feature embedding
+        self.embedding_other_cont = nn.Sequential(
+            nn.Linear(self.f_cnt*self.args.max_seq_len,self.hidden_dim//3),
+            nn.LayerNorm(self.hidden_dim//3)
+        )
+
+        self.config = ConvBertConfig(
             3, # not used
             hidden_size=self.hidden_dim,
             num_hidden_layers=1,
@@ -139,7 +79,7 @@
             hidden_dropout_prob=self.drop_out,
             attention_probs_dropout_prob=self.drop_out,
         )
-        self.attn = BertEncoder(self.config)
+        self.attn = ConvBertEncoder(self.config)
         # Fully connected layer
         self.fc = nn.Linear(self.hidden_dim, 1)
@@ -162,30 +102,55 @@
     def init_hidden(self, batch_size):
         return (h, c)
     def forward(self, input):
+        # print(f'input length : {len(input)}')
+
+        # input order is test, question, tag, _, mask, interaction, (...other features), gather_index (unused)
-        test, question, tag, _, mask, interaction, _ = input
+        # for i,e in enumerate(input):
+        #     print(f'{i}-th : {e[i].shape}')
+        test = input[0] # [64,24]
+        question = input[1]
+        tag = input[2]
-        batch_size = interaction.size(0)
+        mask = input[4]
+        interaction = input[5]
+
+        other_features = [input[i] for i in range(6,len(input)-1)]
+
         batch_size = interaction.size(0)
+        # Embedding
-
+        print(f'interaction_embedding shape : {self.embedding_interaction(interaction).shape}')
         embed_interaction = self.embedding_interaction(interaction)
         embed_test = self.embedding_test(test)
         embed_question = self.embedding_question(question)
         embed_tag = self.embedding_tag(tag)
-
-
-        embed = torch.cat([embed_interaction,
+        print(f'interaction_embed_after : {embed_interaction.shape}')
+        # dev
+        other_features = [input[i] for i in range(6,len(input)-1)]
+        embed_others = self.embedding_other_cont(other_features[0])
+        cat_list = [embed_interaction,
                            embed_test,
                            embed_question,
-                           embed_tag,], 2)
-
+                           embed_tag,
+                           ]
+
+        embed = torch.cat(cat_list, 1)
+        print(f'embed : {embed.shape}')
+        embed = embed.view(batch_size, self.args.max_seq_len*4, -1)
+        print(f'embed : {embed.shape}')
         X = self.comb_proj(embed)
+        print(f'X_shape : {X.shape}') # [64,96,42] [64,24]
+        print(f'embed_others.shape : {embed_others.shape}')
+        X = torch.cat([X,embed_others],2)
         hidden = self.init_hidden(batch_size)
+        # print(f'{hidden[0].shape}, {hidden[1].shape}')
         out, hidden = self.lstm(X, hidden)
+        # print(out.shape)
         out = out.contiguous().view(batch_size, -1, self.hidden_dim)
-
+        # print(out.shape)
+
         extended_attention_mask = mask.unsqueeze(1).unsqueeze(2)
         extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)
         extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
@@ -200,206 +165,242 @@
         return preds
+class PositionalEncoding(nn.Module):
+    def __init__(self, d_model, dropout=0.1, max_len=1000):
+        super(PositionalEncoding, self).__init__()
+        self.dropout = nn.Dropout(p=dropout)
+        self.scale = nn.Parameter(torch.ones(1))
+
+        pe = torch.zeros(max_len, d_model)
+        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
+        div_term = torch.exp(torch.arange(
+            0, d_model, 2).float() * (-math.log(10000.0) / d_model))
+        pe[:, 0::2] = torch.sin(position * div_term)
+        pe[:, 1::2] = torch.cos(position * div_term)
+        pe = pe.unsqueeze(0).transpose(0, 1)
+        self.register_buffer('pe', pe)
+
+    def forward(self, x):
+        x = x + self.scale * self.pe[:x.size(0), :]
+        return self.dropout(x)
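For reference, the sinusoidal table built by PositionalEncoding above follows the standard closed form, with an extra learnable scalar `scale` applied at add time:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad
x \leftarrow x + \text{scale}\cdot PE
```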
-class LSTMRobertaATTN(nn.Module):
-    def __init__(self, args):
-        super(LSTMRobertaATTN, self).__init__()
+class TfixupSaint(nn.Module):
+
+    def __init__(self, args,Tfixup=True):
+        super(TfixupSaint, self).__init__()
         self.args = args
         self.device = args.device
         self.hidden_dim = self.args.hidden_dim
-        self.n_layers = self.args.n_layers
-        self.n_heads = self.args.n_heads
-        self.drop_out = self.args.drop_out
+        # self.dropout = self.args.dropout
+        self.dropout = 0.
+
+        ### Embedding
+        # ENCODER embedding
-        # Embedding
-        # interaction currently consists of correct: correct(1, 2) + padding(0)
-        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
         self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
         self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
         self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
-        # embedding combination projection
-        self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
+        self.n_other_features = self.args.n_other_features
+        print(self.n_other_features)
+
+        # encoder combination projection
+        self.enc_comb_proj = nn.Linear((self.hidden_dim//3)*(3+len(self.n_other_features)), self.hidden_dim)
-        self.lstm = nn.LSTM(self.hidden_dim,
-                            self.hidden_dim,
-                            self.n_layers,
-                            batch_first=True)
+        # DECODER embedding
+        # interaction currently consists of correct: correct(1, 2) + padding(0)
+        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
-        self.config = RobertaConfig(
-            3, # not used
-            hidden_size=self.hidden_dim,
-            num_hidden_layers=1,
-            num_attention_heads=self.n_heads,
-            intermediate_size=self.hidden_dim,
-            hidden_dropout_prob=self.drop_out,
-            attention_probs_dropout_prob=self.drop_out,
-        )
-        self.attn = RobertaEncoder(self.config)
-
-        # Fully connected layer
-        self.fc = nn.Linear(self.hidden_dim, 1)
+        # decoder combination projection
+        self.dec_comb_proj = nn.Linear((self.hidden_dim//3)*(4+len(self.n_other_features)), self.hidden_dim)
+
+        # Positional encoding
+        self.pos_encoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len)
+        self.pos_decoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len)
+
+        # # other feature
+        self.f_cnt = len(self.n_other_features) # number of extra features
+        self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)]
+
+
+        self.transformer = nn.Transformer(
+            d_model=self.hidden_dim,
+            nhead=self.args.n_heads,
+            num_encoder_layers=self.args.n_layers,
+            num_decoder_layers=self.args.n_layers,
+            dim_feedforward=self.hidden_dim,
+            dropout=self.dropout,
+            activation='relu')
+        self.fc = nn.Linear(self.hidden_dim, 1)
         self.activation = nn.Sigmoid()
-    def init_hidden(self, batch_size):
-        h = torch.zeros(
-            self.n_layers,
-            batch_size,
-            self.hidden_dim)
-        h = h.to(self.device)
+        self.enc_mask = None
+        self.dec_mask = None
+        self.enc_dec_mask = None
-        c = torch.zeros(
-            self.n_layers,
-            batch_size,
-            self.hidden_dim)
-        c = c.to(self.device)
+        # T-Fixup
+        if self.args.Tfixup:
+
+            # Initialization
+            self.tfixup_initialization()
+            print("T-Fixupbb Initialization Done")
+
+            # Scaling
+            self.tfixup_scaling()
+            print(f"T-Fixup Scaling Done")
+
+    def tfixup_initialization(self):
+        # we standardize the padding idx to 0 everywhere
+        padding_idx = 0
+        print(self.named_parameters)
+        for name, param in self.named_parameters():
+            print(f'name : {name}')
+            if re.match(r'^embedding*', name):
+                nn.init.normal_(param, mean=0, std=param.shape[1] ** -0.5)
+                nn.init.constant_(param[padding_idx], 0)
+            elif re.match(r'.*Norm.*', name) or re.match(r'.*norm*.*',name):
+                continue
+            elif re.match(r'.*weight*', name):
+                # nn.init.xavier_uniform_(param)
+                nn.init.xavier_normal_(param)
+
+
+    def tfixup_scaling(self):
+        temp_state_dict = {}
+
+        # scale the values of specific layers
+        for name, param in self.named_parameters():
+
+            # TODO: if the module names inside the model change, edit this
+            # so that those modules still get scaled
+            # print(name)
+
+            if re.match(r'^embedding*', name):
+                temp_state_dict[name] = (9 * self.args.n_layers) ** (-1 / 4) * param
+            elif re.match(r'.*Norm.*', name) or re.match(r'.*norm*.*',name):
+                continue
+            elif re.match(r'encoder.*dense.*weight$|encoder.*attention.output.*weight$', name):
+                temp_state_dict[name] = (0.67 * (self.args.n_layers) ** (-1 / 4)) * param
+            elif re.match(r"encoder.*value.weight$", name):
+                temp_state_dict[name] = (0.67 * (self.args.n_layers) ** (-1 / 4)) * (param * (2**0.5))
+
+        # the remaining layers keep their original values
+        for name in self.state_dict():
+            if name not in temp_state_dict:
+                temp_state_dict[name] = self.state_dict()[name]
+
+        self.load_state_dict(temp_state_dict)
+
+    def get_mask(self, seq_len):
+        mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1))
+
+        return mask.masked_fill(mask==1, float('-inf'))
     def forward(self, input):
+        # test, question, tag, _, mask, interaction, _ = input
-        test, question, tag, _, mask, interaction, _ = input
+        # # print(f'input length : {len(input)}')
+
+        # # input order is test, question, tag, _, mask, interaction, (...other features), gather_index (unused)
+        # # for i,e in enumerate(input):
+        # #     print(f'{i}-th : {e[i].shape}')
+        test = input[0]
+        question = input[1]
+        tag = input[2]
+
+        mask = input[4]
+        interaction = input[5]
+
+        other_features = [input[i] for i in range(6,len(input)-1)]
+
         batch_size = interaction.size(0)
+        seq_len = interaction.size(1)
-        # Embedding
+
-        embed_interaction = self.embedding_interaction(interaction)
+        # the fun embedding part
+        # ENCODER
         embed_test = self.embedding_test(test)
         embed_question = self.embedding_question(question)
         embed_tag = self.embedding_tag(tag)
-
-        embed = torch.cat([embed_interaction,
+        # # dev
+        embed_other_features =[]
+
+        for i,e in enumerate(self.embedding_other_features):
+            # print(f'{i}-th : {e}')
+            # print(f'max (before) : {torch.max(other_features[i])}')
+            # print(f'min (before) : {torch.min(other_features[i])}')
+            embed_other_features.append(e(other_features[i]))
+            # print(f'max (after) : {torch.max(other_features[i])}')
+            # print(f'min (after) : {torch.min(other_features[i])}')
+
+        cat_list = [
+            # embed_interaction,
             embed_test,
             embed_question,
-            embed_tag,], 2)
-
-        X = self.comb_proj(embed)
-
-        hidden = self.init_hidden(batch_size)
-        # print(f'{hidden[0].shape}, {hidden[1].shape}')
-        out, hidden = self.lstm(X, hidden)
-        # print(out.shape)
-        out = out.contiguous().view(batch_size, -1, self.hidden_dim)
-        # print(out.shape)
+            embed_tag,
+        ]
+        cat_list.extend(embed_other_features)
+        embed_enc = torch.cat(cat_list, 2)
-        extended_attention_mask = mask.unsqueeze(1).unsqueeze(2)
-        extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)
-        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
-        head_mask = [None] * self.n_layers
-
-        encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask)
-        sequence_output = encoded_layers[-1]
+        embed_enc = self.enc_comb_proj(embed_enc)
-        out = self.fc(sequence_output)
-
-        preds = self.activation(out).view(batch_size, -1)
-
-        return preds
-
+        # DECODER
+        embed_test = self.embedding_test(test)
+        embed_question = self.embedding_question(question)
+        embed_tag = self.embedding_tag(tag)
-class Bert(nn.Module):
+        embed_interaction = self.embedding_interaction(interaction)
-    def __init__(self, args):
-        super(Bert, self).__init__()
-        self.args = args
-        self.device = args.device
+        cat_list = [
+
+            embed_test,
+            embed_question,
+            embed_tag,
+            embed_interaction,
+        ]
+        cat_list.extend(embed_other_features)
+        embed_dec = torch.cat(cat_list, 2)
-        # Defining some parameters
-        self.hidden_dim = self.args.hidden_dim
-        self.n_layers = self.args.n_layers
+        embed_dec = self.dec_comb_proj(embed_dec)
-        # Embedding
-        # interaction currently consists of correct: correct(1, 2) + padding(0)
-        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
-        self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
-        self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
-        self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
-
-        # embedding combination projection
-        self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
-
-        # # Embedding
-        # # interaction currently consists of correct: correct(1, 2) + padding(0)
-        # self.embedding_interaction = nn.Embedding(3, 1)
-        # self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim)
-        # self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//2)
-        # self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//2)
-
-        # # embedding combination projection
-        # self.comb_proj = nn.Linear((self.hidden_dim * 2) + 1, self.hidden_dim)
-
-        # Bert config
-        self.config = BertConfig(
-            3, # not used
-            hidden_size=self.hidden_dim,
-            num_hidden_layers=self.args.n_layers,
-            num_attention_heads=self.args.n_heads,
-            max_position_embeddings=self.args.max_seq_len
-        )
-
-        # Defining the layers
-        # Bert Layer
-        self.encoder = BertModel(self.config)
-
-        # Fully connected layer
-        self.fc = nn.Linear(self.args.hidden_dim, 1)
-
-        self.activation = nn.Sigmoid()
-
-
-    def forward(self, input):
-        test, question, tag, _, mask, interaction, _ = input
-
-        batch_size = interaction.size(0)
-
-        # the fun embedding part
+        # build the ATTENTION MASKs
+        # the encoder and decoder masks have identical height and width,
+        # so there is really no need to split them into three like this
+        if self.enc_mask is None or self.enc_mask.size(0) != seq_len:
+            self.enc_mask = self.get_mask(seq_len).to(self.device)
+
+        if self.dec_mask is None or self.dec_mask.size(0) != seq_len:
+            self.dec_mask = self.get_mask(seq_len).to(self.device)
+
+        if self.enc_dec_mask is None or self.enc_dec_mask.size(0) != seq_len:
+            self.enc_dec_mask = self.get_mask(seq_len).to(self.device)
+
+
+        embed_enc = embed_enc.permute(1, 0, 2)
+        embed_dec = embed_dec.permute(1, 0, 2)
-        embed_interaction = self.embedding_interaction(interaction)
-        embed_test = self.embedding_test(test)
-        embed_question = self.embedding_question(question)
-        embed_tag = self.embedding_tag(tag)
+        # Positional encoding
+        embed_enc = self.pos_encoder(embed_enc)
+        embed_dec = self.pos_decoder(embed_dec)
+        out = self.transformer(embed_enc, embed_dec,
+                               src_mask=self.enc_mask,
+                               tgt_mask=self.dec_mask,
+                               memory_mask=self.enc_dec_mask)
-        embed = torch.cat([embed_interaction,
-                           embed_test,
-                           embed_question,
-                           embed_tag,], 2)
-
-        X = self.comb_proj(embed)
-
-        # Bert
-        encoded_layers = self.encoder(inputs_embeds=X, attention_mask=mask)
-        out = encoded_layers[0]
+        out = out.permute(1, 0, 2)
         out = out.contiguous().view(batch_size, -1, self.hidden_dim)
         out = self.fc(out)
-        preds = self.activation(out).view(batch_size, -1)
-
-        return preds
-
-### Saint model
-class PositionalEncoding(nn.Module):
-    def __init__(self, d_model, dropout=0.1, max_len=1000):
-        super(PositionalEncoding, self).__init__()
-        self.dropout = nn.Dropout(p=dropout)
-        self.scale = nn.Parameter(torch.ones(1))
-
-        pe = torch.zeros(max_len, d_model)
-        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
-        div_term = torch.exp(torch.arange(
-            0, d_model, 2).float() * (-math.log(10000.0) / d_model))
-        pe[:, 0::2] = torch.sin(position * div_term)
-        pe[:, 1::2] = torch.cos(position * div_term)
-        pe = pe.unsqueeze(0).transpose(0, 1)
-        self.register_buffer('pe', pe)
-
-    def forward(self, x):
-        x = x + self.scale * self.pe[:x.size(0), :]
-        return self.dropout(x)
+        preds = self.activation(out).view(batch_size, -1)

+        return preds
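get_mask above (used by both TfixupSaint and Saint) builds the usual causal no-peeking attention mask. A quick check of its shape and values for seq_len = 3:

```python
# Quick check of the causal mask produced by get_mask(3):
# zeros on/below the diagonal, -inf strictly above it.
import numpy as np
import torch

mask = torch.from_numpy(np.triu(np.ones((3, 3)), k=1))
print(mask.masked_fill(mask == 1, float('-inf')))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]], dtype=torch.float64)
```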
class Saint(nn.Module):
@@ -409,28 +410,35 @@
    def __init__(self, args):
        self.device = args.device
        self.hidden_dim = self.args.hidden_dim
-        # self.dropout = self.args.dropout
-        self.dropout = 0.
+        self.dropout = self.args.drop_out
        ### Embedding
        # ENCODER embedding
+
        self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
        self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
        self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
+
+        self.n_other_features = self.args.n_other_features
+        print(self.n_other_features)
        # encoder combination projection
-        self.enc_comb_proj = nn.Linear((self.hidden_dim//3)*3, self.hidden_dim)
+        self.enc_comb_proj = nn.Linear((self.hidden_dim//3)*(2+len(self.n_other_features)), self.hidden_dim)
        # DECODER embedding
        # interaction currently consists of correct: correct(1, 2) + padding(0)
        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
        # decoder combination projection
-        self.dec_comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
+        self.dec_comb_proj = nn.Linear((self.hidden_dim//3)*(3+len(self.n_other_features)), self.hidden_dim)
        # Positional encoding
        self.pos_encoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len)
        self.pos_decoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len)
+
+        # # other feature
+        self.f_cnt = len(self.n_other_features) # number of extra features
+        self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)]
        self.transformer = nn.Transformer(
@@ -455,34 +463,72 @@
    def get_mask(self, seq_len):
        return mask.masked_fill(mask==1, float('-inf'))
    def forward(self, input):
-        test, question, tag, _, mask, interaction, _ = input
+        # test, question, tag, _, mask, interaction, _ = input
+
+        # # print(f'input length : {len(input)}')
+
+        # # input order is test, question, tag, _, mask, interaction, (...other features), gather_index (unused)
+        # # for i,e in enumerate(input):
+        # #     print(f'{i}-th : {e[i].shape}')
+        test = input[0]
+        question = input[1]
+        tag = input[2]
+
+        mask = input[4]
+        interaction = input[5]
+
+        other_features = [input[i] for i in range(6,len(input)-1)]
+
        batch_size = interaction.size(0)
        seq_len = interaction.size(1)
+
+
        # the fun embedding part
        # ENCODER
        embed_test = self.embedding_test(test)
        embed_question = self.embedding_question(question)
-        embed_tag = self.embedding_tag(tag)
+        # embed_tag = self.embedding_tag(tag)
-        embed_enc = torch.cat([embed_test,
-                               embed_question,
-                               embed_tag,], 2)
+        # # dev
+        embed_other_features =[]
+
+        for i,e in enumerate(self.embedding_other_features):
+            # print(f'{i}-th : {e}')
+            # print(f'max (before) : {torch.max(other_features[i])}')
+            # print(f'min (before) : {torch.min(other_features[i])}')
+            embed_other_features.append(e(other_features[i]))
+            # print(f'max (after) : {torch.max(other_features[i])}')
+            # print(f'min (after) : {torch.min(other_features[i])}')
+
+        cat_list = [
+            # embed_interaction,
+            embed_test,
+            embed_question,
+            # embed_tag,
+        ]
+        cat_list.extend(embed_other_features)
+        embed_enc = torch.cat(cat_list, 2)
        embed_enc = self.enc_comb_proj(embed_enc)
        # DECODER
        embed_test = self.embedding_test(test)
        embed_question = self.embedding_question(question)
-        embed_tag = self.embedding_tag(tag)
+        # embed_tag = self.embedding_tag(tag)
        embed_interaction = self.embedding_interaction(interaction)
-        embed_dec = torch.cat([embed_test,
-                               embed_question,
-                               embed_tag,
-                               embed_interaction], 2)
+        cat_list = [
+
+            embed_test,
+            embed_question,
+            # embed_tag,
+            embed_interaction,
+        ]
+        cat_list.extend(embed_other_features)
+        embed_dec = torch.cat(cat_list, 2)
        embed_dec = self.dec_comb_proj(embed_dec)
@@ -517,16 +563,12 @@
self.activation(out).view(batch_size, -1) - return preds - - -""" -Encoder --> LSTM --> dense -""" + return preds +######## Post Padding -class Feed_Forward_block(nn.Module): +class Feed_Forward_block_Post(nn.Module): """ out = Relu( M_out*w1 + b1) *w2 + b2 """ @@ -538,28 +580,37 @@ def __init__(self, dim_ff): def forward(self,ffn_in): return self.layer2(F.relu(self.layer1(ffn_in))) -class LastQuery(nn.Module): +class LastQuery_Post_TEST(nn.Module): def __init__(self, args): - super(LastQuery, self).__init__() + super(LastQuery_Post_TEST, self).__init__() + self.args = args self.device = args.device self.hidden_dim = self.args.hidden_dim - # Embedding # interaction은 현재 correct으로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) - # embedding combination projection - self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim) + self.n_other_features = self.args.n_other_features + print(self.n_other_features) + + # encoder combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*(3+len(self.n_other_features)), self.hidden_dim) # 태그 제외 + + # # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + # 기존 keetar님 솔루션에서는 Positional Embedding은 사용되지 않습니다 # 하지만 사용 여부는 자유롭게 결정해주세요 :) - # self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) # Encoder self.query = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) @@ -568,7 +619,7 @@ def __init__(self, args): self.attn = nn.MultiheadAttention(embed_dim=self.hidden_dim, num_heads=self.args.n_heads) self.mask = None # last query에서는 필요가 없지만 수정을 고려하여서 넣어둠 - self.ffn = Feed_Forward_block(self.hidden_dim) + self.ffn = Feed_Forward_block_Post(self.hidden_dim) self.ln1 = nn.LayerNorm(self.hidden_dim) self.ln2 = nn.LayerNorm(self.hidden_dim) @@ -585,6 +636,29 @@ def __init__(self, args): self.activation = nn.Sigmoid() + + def get_mask(self, seq_len, index, batch_size): + """ + batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + + 참고로 (batch_size*self.args.n_heads, seq_len, seq_len) 가 아니라 + (batch_size*self.args.n_heads, 1, seq_len) 로 하는 이유는 + + last query라 output의 seq부분의 사이즈가 1이기 때문이다 + """ + # [[1], -> [1, 2, 3] + # [2], + # [3]] + index = index.view(-1) + + # last query의 index에 해당하는 upper triangular mask의 row를 사용한다 + mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1)) + mask = mask[index] + + # batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + mask = mask.repeat(1, self.args.n_heads).view(batch_size*self.args.n_heads, -1, seq_len) + return mask.masked_fill(mask==1, float('-inf')) + def get_pos(self, seq_len): # use sine positional embeddinds return torch.arange(seq_len).unsqueeze(0) @@ -600,43 +674,89 @@ def init_hidden(self, batch_size): self.args.n_layers, batch_size, self.args.hidden_dim) + c = c.to(self.device) return (h, c) def forward(self, input): - test, question, tag, _, mask, interaction, index = input + + # test, question, tag, _, mask, interaction, index = input + + # for i,e in enumerate(input): + # 
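print(f'{i}-th : {e[i].shape}')
+
+        # A sketch of get_mask above, assuming a hypothetical seq_len=4 and a
+        # sequence whose last valid step is index=2:
+        #   np.triu(np.ones((4, 4)), k=1)[2] == [0., 0., 0., 1.]
+        #   masked_fill(mask==1, -inf)       -> [0., 0., 0., -inf]
+        # so the single last query may attend to every step up to itself, never
+        # to the padded future; the row is then repeated to shape
+        # (batch_size*n_heads, 1, seq_len) because the query length is 1.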
+        test = input[0]
+        question = input[1]
+        tag = input[2]
+
+        mask = input[4]
+        interaction = input[5]
+
+        other_features = [input[i] for i in range(6, len(input)-1)]
+
+        index = input[len(input)-1]
+
        batch_size = interaction.size(0)
        seq_len = interaction.size(1)

        # Embedding
        embed_interaction = self.embedding_interaction(interaction)
        embed_test = self.embedding_test(test)
        embed_question = self.embedding_question(question)
        embed_tag = self.embedding_tag(tag)
-        embed = torch.cat([embed_interaction,
+        embed_other_features = []
+
+        for i, e in enumerate(self.embedding_other_features):
+            embed_other_features.append(e(other_features[i]))
+
+        cat_list = [embed_interaction,
                    embed_test,
                    embed_question,
-                   embed_tag,], 2)
+                   # embed_tag,
+                   ]
+        cat_list.extend(embed_other_features)
+
+        embed = torch.cat(cat_list, 2)

        embed = self.comb_proj(embed)

        # Positional Embedding
-        # position = self.get_pos(seq_len).to('cuda')
-        # embed_pos = self.embedding_position(position)
-        # embed = embed + embed_pos
+        # applied here, although the original last-query solution omitted it
+        position = self.get_pos(seq_len).to(self.args.device)
+        embed_pos = self.embedding_position(position)
+        embed = embed + embed_pos

        ####################### ENCODER #####################
-        q = self.query(embed)[:, -1:, :].permute(1, 0, 2)
+        q = self.query(embed)
+
+        # 3D gather of each sequence's last query (see the shape walk-through in LastQuery_Post below)
+        q = torch.gather(q, 1, index.repeat(1, self.hidden_dim).unsqueeze(1))
+        q = q.permute(1, 0, 2)
+
        k = self.key(embed).permute(1, 0, 2)
        v = self.value(embed).permute(1, 0, 2)

        ## attention
        # last query only
-        out, _ = self.attn(q, k, v)
+        self.mask = self.get_mask(seq_len, index, batch_size).to(self.device)
+        out, _ = self.attn(q, k, v, attn_mask=self.mask)

        ## residual + layer norm
        out = out.permute(1, 0, 2)
@@ -662,4 +782,1051 @@ def forward(self, input):
        return preds

+class LastQuery_Post(nn.Module):
+    def __init__(self, args):
+        super(LastQuery_Post, self).__init__()
+        self.args = args
+        self.device = args.device
+
+        self.hidden_dim = self.args.hidden_dim
+
+        # Embedding
+        # interaction is currently built from correctness: 
correct(1, 2) + padding(0) + + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + + self.n_other_features = self.args.n_other_features + print(self.n_other_features) + + # encoder combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+len(self.n_other_features)), self.hidden_dim) + + # # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + + # 기존 keetar님 솔루션에서는 Positional Embedding은 사용되지 않습니다 + # 하지만 사용 여부는 자유롭게 결정해주세요 :) + self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + + # Encoder + self.query = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.key = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.value = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + + self.attn = nn.MultiheadAttention(embed_dim=self.hidden_dim, num_heads=self.args.n_heads) + self.mask = None # last query에서는 필요가 없지만 수정을 고려하여서 넣어둠 + self.ffn = Feed_Forward_block_Post(self.hidden_dim) + + self.ln1 = nn.LayerNorm(self.hidden_dim) + self.ln2 = nn.LayerNorm(self.hidden_dim) + + # LSTM + self.lstm = nn.LSTM( + self.hidden_dim, + self.hidden_dim, + self.args.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def get_mask(self, seq_len, index, batch_size): + """ + batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + + 참고로 (batch_size*self.args.n_heads, seq_len, seq_len) 가 아니라 + (batch_size*self.args.n_heads, 1, seq_len) 로 하는 이유는 + + last query라 output의 seq부분의 사이즈가 1이기 때문이다 + """ + # [[1], -> [1, 2, 3] + # [2], + # [3]] + index = index.view(-1) + + # last query의 index에 해당하는 upper triangular mask의 row를 사용한다 + mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1)) + mask = mask[index] + + # batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + mask = mask.repeat(1, self.args.n_heads).view(batch_size*self.args.n_heads, -1, seq_len) + return mask.masked_fill(mask==1, float('-inf')) + + def get_pos(self, seq_len): + # use sine positional embeddinds + return torch.arange(seq_len).unsqueeze(0) + + def init_hidden(self, batch_size): + h = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + c = c.to(self.device) + + return (h, c) + + + def forward(self, input): + + # test, question, tag, _, mask, interaction, index = input + + # for i,e in enumerate(input): + # print(f'i 번째 : {e[i].shape}') + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + + index = input[len(input)-1] + + + batch_size = interaction.size(0) + seq_len = interaction.size(1) + + # 신나는 embedding + embed_interaction = self.embedding_interaction(interaction) + + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + # dev + embed_other_features =[] + + for i,e in 
enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : {torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + + + embed = torch.cat(cat_list, 2) + + embed = self.comb_proj(embed) + + # Positional Embedding + # last query에서는 positional embedding을 하지 않음 + position = self.get_pos(seq_len).to(self.args.device) + embed_pos = self.embedding_position(position) + embed = embed + embed_pos + + ####################### ENCODER ##################### + q = self.query(embed) + + # 이 3D gathering은 머리가 아픕니다. 잠시 머리를 식히고 옵니다. + q = torch.gather(q, 1, index.repeat(1, self.hidden_dim).unsqueeze(1)) + q = q.permute(1, 0, 2) + + k = self.key(embed).permute(1, 0, 2) + v = self.value(embed).permute(1, 0, 2) + + ## attention + # last query only + self.mask = self.get_mask(seq_len, index, batch_size).to(self.device) + out, _ = self.attn(q, k, v, attn_mask=self.mask) + + ## residual + layer norm + out = out.permute(1, 0, 2) + out = embed + out + out = self.ln1(out) + + ## feed forward network + out = self.ffn(out) + + ## residual + layer norm + out = embed + out + out = self.ln2(out) + + ###################### LSTM ##################### + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(out, hidden) + + ###################### DNN ##################### + + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + + return preds + +##### PrePadding +class Feed_Forward_block_Pre(nn.Module): + + """ + out = Relu( M_out*w1 + b1) *w2 + b2 + """ + def __init__(self, dim_ff): + super().__init__() + self.layer1 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + self.layer2 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + + def forward(self,ffn_in): + return self.layer2(F.relu(self.layer1(ffn_in))) + +class LastQuery_Pre(nn.Module): + def __init__(self, args): + super(LastQuery_Pre, self).__init__() + + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + + # Embedding + # interaction은 현재 correct으로 구성되어있다. 
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + + + # 기존 keetar님 솔루션에서는 Positional Embedding은 사용되지 않습니다 + # 하지만 사용 여부는 자유롭게 결정해주세요 :) + # self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + self.n_other_features = self.args.n_other_features + print(self.n_other_features) + + # encoder combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+len(self.n_other_features)), self.hidden_dim) + + # # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + # Encoder + self.query = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.key = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.value = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + + self.attn = nn.MultiheadAttention(embed_dim=self.hidden_dim, num_heads=self.args.n_heads) + self.mask = None # last query에서는 필요가 없지만 수정을 고려하여서 넣어둠 + self.ffn = Feed_Forward_block_Pre(self.hidden_dim) + + + self.ln1 = nn.LayerNorm(self.hidden_dim) + self.ln2 = nn.LayerNorm(self.hidden_dim) + + # LSTM + self.lstm = nn.LSTM( + self.hidden_dim, + self.hidden_dim, + self.args.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def get_pos(self, seq_len): + # use sine positional embeddinds + return torch.arange(seq_len).unsqueeze(0) + + def init_hidden(self, batch_size): + h = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + c = c.to(self.device) + + return (h, c) + + + def forward(self, input): + # test, question, tag, _, mask, interaction, index = input + + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + index = input[len(input)-1] + + batch_size = interaction.size(0) + seq_len = interaction.size(1) + + # 신나는 embedding + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + # dev + embed_other_features =[] + + for i,e in enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : {torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + + embed = torch.cat(cat_list, 2) + + embed = self.comb_proj(embed) + + # Positional Embedding + # last query에서는 positional embedding을 하지 않음 + # position = self.get_pos(seq_len).to('cuda') + # embed_pos = self.embedding_position(position) + # embed = embed + embed_pos + + ####################### 
ENCODER ##################### + q = self.query(embed)[:, -1:, :].permute(1, 0, 2) + k = self.key(embed).permute(1, 0, 2) + v = self.value(embed).permute(1, 0, 2) + + ## attention + # last query only + out, _ = self.attn(q, k, v) + + ## residual + layer norm + out = out.permute(1, 0, 2) + out = embed + out + out = self.ln1(out) + + ## feed forward network + out = self.ffn(out) + + ## residual + layer norm + out = embed + out + out = self.ln2(out) + + ###################### LSTM ##################### + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(out, hidden) + + ###################### DNN ##################### + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + # print(preds) + + return preds + +class MyLSTMConvATTN(nn.Module): + def __init__(self, args): + super(MyLSTMConvATTN, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.n_heads = self.args.n_heads + self.drop_out = self.args.drop_out + + #dev + self.n_other_features = self.args.n_other_features + print(self.n_other_features) + + # Embedding + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+self.f_cnt), self.hidden_dim) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + # self.embedding_test = nn.Embedding(100,self.hidden_dim//3) + self.config = ConvBertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = ConvBertEncoder(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + # print(f'input 길이 : {len(input)}') + + # input의 순서는 test, question, tag, _, mask, interaction, (...other features), gather_index(안 씀) + + # for i,e in enumerate(input): + # print(f'i 번째 : {e[i].shape}') + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + + batch_size = interaction.size(0) + + # Embedding + # print(interaction.shape) + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + # dev + embed_other_features =[] + + for i,e in enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # 
print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : {torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + embed = torch.cat(cat_list, 2) + + + X = self.comb_proj(embed) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds + + +class LSTM(nn.Module): + + def __init__(self, args): + super(LSTM, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.cont_cols=1 + + # Embedding + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + #continuous + self.cont_proj=nn.Linear(self.cont_cols,self.hidden_dim//2) + + # embedding combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim//2) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + test, question,tag, correct, mask, interaction, solve_time, gather_index=input + + # test, question, tag, _, mask,interaction,solve_time, _ = input + + batch_size = interaction.size(0) + + # Embedding + solve_time=solve_time.unsqueeze(-1) + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + embed_cont=self.cont_proj(solve_time) + # print(embed_cont.shape) + # print("-"*80) + embed = torch.cat([embed_interaction, + embed_test, + embed_question, + embed_tag,], 2) + + X = self.comb_proj(embed) + # print("범주형과 연속형의 shape: ",X.shape,embed_cont.shape) + X=torch.cat([X, embed_cont], 2) + # print("둘은 concat한 shape: ",X.shape) + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(X, hidden) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + + out = self.fc(out) + preds = self.activation(out).view(batch_size, -1) + + return preds + +class BiLSTMATTN(nn.Module): + + def __init__(self, args): + super(BiLSTMATTN, self).__init__() + self.args = args + self.device = args.device + + 
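+        # Note on the additive attention mask used in forward() below
+        # (HuggingFace-style convention; mask has shape (batch, seq_len),
+        # where 1 marks a real step and 0 padding):
+        #   mask.unsqueeze(1).unsqueeze(2)  -> (batch, 1, 1, seq_len)
+        #   (1.0 - mask) * -10000.0         -> 0 for real steps, -10000 for padding
+        # added to the attention logits before softmax, so padded positions
+        # receive ~zero attention weight.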
self.hidden_dim = self.args.hidden_dim
+        self.n_layers = self.args.n_layers
+        self.n_heads = self.args.n_heads
+        self.drop_out = self.args.drop_out
+
+        # Embedding
+        # interaction is currently built from correctness: correct(1, 2) + padding(0)
+        self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3)
+        self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3)
+        self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3)
+        self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3)
+
+        # embedding combination projection
+        self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim)
+
+        self.lstm = nn.LSTM(self.hidden_dim,
+                            self.hidden_dim//2,
+                            self.n_layers,
+                            batch_first=True,
+                            bidirectional=True)
+
+        self.config = RobertaConfig(
+            3, # not used
+            hidden_size=self.hidden_dim,
+            num_hidden_layers=1,
+            num_attention_heads=self.n_heads,
+            intermediate_size=self.hidden_dim,
+            hidden_dropout_prob=self.drop_out,
+            attention_probs_dropout_prob=self.drop_out,
+        )
+
+        self.attn = RobertaEncoder(self.config)
+        # self.attn = ConvBertEncoder(self.config)
+
+        # Fully connected layer
+        self.fc = nn.Linear(self.hidden_dim, 1)
+
+        self.activation = nn.Sigmoid()
+
+    def init_hidden(self, batch_size):
+        # hidden_dim is halved because one half runs forward and the other backward.
+        # n_layers is doubled because bidirectional=True stacks hidden vectors for both directions.
+        h = torch.zeros(
+            self.n_layers*2,
+            batch_size,
+            self.hidden_dim//2)
+        h = h.to(self.device)
+
+        c = torch.zeros(
+            self.n_layers*2,
+            batch_size,
+            self.hidden_dim//2)
+        c = c.to(self.device)
+
+        return (h, c)
+
+    def forward(self, input):
+
+        test, question, tag, _, mask, interaction, _ = input
+
+        batch_size = interaction.size(0)
+
+        # Embedding
+        embed_interaction = self.embedding_interaction(interaction)
+        embed_test = self.embedding_test(test)
+        embed_question = self.embedding_question(question)
+        embed_tag = self.embedding_tag(tag)
+
+        embed = torch.cat([embed_interaction,
+                           embed_test,
+                           embed_question,
+                           embed_tag,], 2)
+
+        X = self.comb_proj(embed)
+
+        hidden = self.init_hidden(batch_size)
+        out, hidden = self.lstm(X, hidden)
+        out = out.contiguous().view(batch_size, -1, self.hidden_dim)
+
+        extended_attention_mask = mask.unsqueeze(1).unsqueeze(2)
+        extended_attention_mask = extended_attention_mask.to(dtype=torch.float32)
+        extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
+        head_mask = [None] * self.n_layers
+
+        encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask)
+        sequence_output = encoded_layers[-1]
+
+        out = self.fc(sequence_output)
+
+        preds = self.activation(out).view(batch_size, -1)
+
+        return preds
+
+class AutoEncoderLSTMATTN(nn.Module):
+    def __init__(self, args):
+        super(AutoEncoderLSTMATTN, self).__init__()
+        self.args = args
+        self.device = args.device
+
+        self.hidden_dim = self.args.hidden_dim
+        self.n_layers = self.args.n_layers
+        self.n_heads = self.args.n_heads
+        self.drop_out = self.args.drop_out
+
+        # dev
+        self.n_other_features = self.args.n_other_features
+        print(f'other features count : {self.n_other_features}')
+
+        # Embedding
+        # interaction is currently built from correctness: 
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+self.f_cnt), self.hidden_dim) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + # self.embedding_test = nn.Embedding(100,self.hidden_dim//3) + if args.model.lower() == 'lstmconvattn' : + self.config = ConvBertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = ConvBertEncoder(self.config) + elif args.model.lower() == 'lstmrobertaattn': + self.config = RobertaConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = RobertaEncoder(self.config) + elif args.model.lower() == 'lstmalbertattn': + self.config = AlbertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + # self.attn = AlbertAttention(self.config) + # self.attn - AlbertModel(self.config) + self.attn = AlbertTransformer(self.config) + else: + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = BertEncoder(self.config) + + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + # print(f'input 길이 : {len(input)}') + + # input의 순서는 test, question, tag, _, mask, interaction, (...other features), gather_index(안 씀) + + # for i,e in enumerate(input): + # print(f'i 번째 : {e[i].shape}') + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + + batch_size = interaction.size(0) + + # Embedding + # print(interaction.shape) + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + # dev + embed_other_features =[] + + for i,e in enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : 
{torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + embed = torch.cat(cat_list, 2) + + + X = self.comb_proj(embed) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds + +# LSTMATTN +class LSTMATTN(nn.Module): + + def __init__(self, args): + super(LSTMATTN, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.n_heads = self.args.n_heads + self.drop_out = self.args.drop_out + self.cont_cols=1 + # Embedding + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + # #continuous + # self.cont_proj=nn.Sequential( + # nn.LayerNorm(self.hidden_dim//2), + # nn.Linear(self.cont_cols, self.hidden_dim//2), + # # nn.LayerNorm(self.hidden_dim//4) #layerNorm 순서 변경 + # ) + + # # embedding combination projection + # self.comb_proj = nn.Sequential( + # nn.Linear((self.hidden_dim//3)*4, self.hidden_dim//2), + # # nn.LayerNorm((self.hidden_dim//4)*3), + # ) + + self.cont_proj=nn.Linear(self.cont_cols,self.hidden_dim//2) + + # embedding combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim//2) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = BertEncoder(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + + test, question,tag, correct, mask, interaction, solve_time, gather_index=input + + batch_size = interaction.size(0) + + # Embedding shape(batch, max_seq_len,64) + solve_time=solve_time.unsqueeze(-1) + embed_interaction = self.embedding_interaction(interaction) #(batch, max_seq_len, 64) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + embed_cont=self.cont_proj(solve_time) + + embed = torch.cat([embed_interaction, + 
embed_test, + embed_question, + embed_tag,], 2) + + X = self.comb_proj(embed) + X=torch.cat([X, embed_cont], 2) #(batch,msl, 128) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds + +# BERT +class Bert(nn.Module): + + def __init__(self, args): + super(Bert, self).__init__() + self.args = args + self.device = args.device + + # Defining some parameters + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + + # Embedding + # interaction은 현재 correct으로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + # embedding combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim) + + # Bert config + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=self.args.n_layers, + num_attention_heads=self.args.n_heads, + max_position_embeddings=self.args.max_seq_len + ) + + # Defining the layers + # Bert Layer + self.encoder = BertModel(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.args.hidden_dim, 1) + + self.activation = nn.Sigmoid() + # self.activation=nn.Tanh() + + + def forward(self, input): + test, question, tag, _, mask, interaction, _ = input + batch_size = interaction.size(0) + + # 신나는 embedding + + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + embed = torch.cat([embed_interaction, + + embed_test, + embed_question, + + embed_tag,], 2) + + X = self.comb_proj(embed) + + # Bert + encoded_layers = self.encoder(inputs_embeds=X, attention_mask=mask) + out = encoded_layers[0] + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + preds = self.activation(out).view(batch_size, -1) + + return preds diff --git a/dkt/models_architecture/__init__.py b/dkt/models_architecture/__init__.py new file mode 100644 index 0000000..b8a5cef --- /dev/null +++ b/dkt/models_architecture/__init__.py @@ -0,0 +1,8 @@ +from .lstm import * +from .lstmattn import * +from .bert import * +from .saint import * +from .tfixupsaint import * +from .auto_encoder_lstmattn import * +from .lastquery_post import * +from .lastquery_pre import * \ No newline at end of file diff --git a/dkt/models_architecture/auto_encoder_lstmattn.py b/dkt/models_architecture/auto_encoder_lstmattn.py new file mode 100644 index 0000000..06db1a9 --- /dev/null +++ b/dkt/models_architecture/auto_encoder_lstmattn.py @@ -0,0 +1,197 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn 
as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + + +import re + + +class AutoEncoderLSTMATTN(nn.Module): + def __init__(self, args): + super(AutoEncoderLSTMATTN,self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.n_heads = self.args.n_heads + self.drop_out = self.args.drop_out + + #dev + self.n_other_features = self.args.n_other_features + print(f'other features cont : {self.n_other_features}') + + # Embedding + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + self.comb_proj = nn.Linear((self.hidden_dim//3)*(4+self.f_cnt), self.hidden_dim) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + # self.embedding_test = nn.Embedding(100,self.hidden_dim//3) + if args.model.lower() == 'lstmconvattn' : + self.config = ConvBertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = ConvBertEncoder(self.config) + elif args.model.lower() == 'lstmrobertaattn': + self.config = RobertaConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = RobertaEncoder(self.config) + elif args.model.lower() == 'lstmalbertattn': + self.config = AlbertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + # self.attn = AlbertAttention(self.config) + # self.attn - AlbertModel(self.config) + self.attn = AlbertTransformer(self.config) + else: + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = 
BertEncoder(self.config) + + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + # print(f'input 길이 : {len(input)}') + + # input의 순서는 test, question, tag, _, mask, interaction, (...other features), gather_index(안 씀) + + # for i,e in enumerate(input): + # print(f'i 번째 : {e[i].shape}') + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + + batch_size = interaction.size(0) + + # Embedding + # print(interaction.shape) + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + # dev + embed_other_features =[] + + for i,e in enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : {torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + embed = torch.cat(cat_list, 2) + + + X = self.comb_proj(embed) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/bert.py b/dkt/models_architecture/bert.py new file mode 100644 index 0000000..cf49e54 --- /dev/null +++ b/dkt/models_architecture/bert.py @@ -0,0 +1,130 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + + +import re + +class Bert(nn.Module): + + 
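+    # Sketch of the categorical/continuous embedding split used below, assuming
+    # hypothetical sizes hidden_dim=64, cate_len=4, cont_len=2:
+    #   per-cate embedding dim : (64//2)//4 = 8   (interaction shares the same 8)
+    #   cate_comb_proj input   : 8*(4+1) = 40 -> hidden_dim//2 = 32   (+1 for interaction)
+    #   cont_comb_proj input   : ((64//2)//2)*2 = 32 -> hidden_dim//2 = 32
+    #   final X = cat(embed_cate, embed_cont) -> 32 + 32 = 64 = hidden_dim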
def __init__(self, args): + super(Bert, self).__init__() + self.args = args + self.device = args.device + + # Defining some parameters + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + + # Embedding + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # cont Embedding + self.cont_embedding = nn.Sequential( + nn.Linear(1, (self.hidden_dim//2)//cont_len), + nn.LayerNorm((self.hidden_dim//2)//cont_len) + ) + + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2) + + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len) + + # Bert config + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=self.args.n_layers, + num_attention_heads=self.args.n_heads, + max_position_embeddings=self.args.max_seq_len + ) + + # Defining the layers + # Bert Layer + self.encoder = BertModel(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.args.hidden_dim, 1) + + self.activation = nn.Sigmoid() + # self.activation=nn.Tanh() + + + def forward(self, input): + #userID가 빠졌으므로 -1 + cate_feats=input[:len(self.args.cate_feats)-1] + # print("cate_feats개수",len(cate_feats)) + + #answercode가 없으므로 -1 + cont_feats=input[len(self.args.cate_feats)-1:-4] + # print("cont_feats개수",len(cont_feats)) + interaction=input[-4] + mask=input[-3] + gather_index=input[-2] + + batch_size = interaction.size(0) + # cate Embedding + cate_feats_embed=[] + embed_interaction = self.embedding_interaction(interaction) + cate_feats_embed.append(embed_interaction) + + # print(self.cate_embedding_list) + # print("cate shapes") + for i, cate_feat in enumerate(cate_feats): + cate_feats_embed.append(self.cate_embedding_list[i](cate_feat)) + + + # unsqueeze cont feats shape & embedding + cont_feats_embed=[] + for cont_feat in cont_feats: + cont_feat=cont_feat.unsqueeze(-1) + cont_feats_embed.append(self.cont_embedding(cont_feat)) + + + #concat cate, cont feats + embed_cate = torch.cat(cate_feats_embed, 2) + embed_cate=self.cate_comb_proj(embed_cate) + + embed_cont = torch.cat(cont_feats_embed, 2) + embed_cont=self.cont_comb_proj(embed_cont) + + + X = torch.cat([embed_cate,embed_cont], 2) + # print("cate와 cont를 concat한 shape : ", X.shape) + + # Bert + encoded_layers = self.encoder(inputs_embeds=X, attention_mask=mask) + out = encoded_layers[0] + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/lastquery_post.py b/dkt/models_architecture/lastquery_post.py new file mode 100644 index 0000000..85319bc --- /dev/null +++ b/dkt/models_architecture/lastquery_post.py @@ -0,0 +1,226 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + 
from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + +import re + +######## Post Padding + +class Feed_Forward_block_Post(nn.Module): + """ + out = Relu( M_out*w1 + b1) *w2 + b2 + """ + def __init__(self, dim_ff): + super().__init__() + self.layer1 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + self.layer2 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + + def forward(self,ffn_in): + return self.layer2(F.relu(self.layer1(ffn_in))) + +class LastQuery_Post(nn.Module): + def __init__(self, args): + super(LastQuery_Post, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + + # Embedding + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + + # Embedding + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len) + + # cont Embedding + self.cont_embedding = nn.Linear(1, (self.hidden_dim//2)//cont_len) + + + # 기존 keetar님 솔루션에서는 Positional Embedding은 사용되지 않습니다 + # 하지만 사용 여부는 자유롭게 결정해주세요 :) + self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2) + + # Encoder + self.query = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.key = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.value = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + + self.attn = nn.MultiheadAttention(embed_dim=self.hidden_dim, num_heads=self.args.n_heads) + self.mask = None # last query에서는 필요가 없지만 수정을 고려하여서 넣어둠 + self.ffn = Feed_Forward_block_Post(self.hidden_dim) + + self.ln1 = nn.LayerNorm(self.hidden_dim) + self.ln2 = nn.LayerNorm(self.hidden_dim) + + # LSTM + self.lstm = nn.LSTM( + self.hidden_dim, + self.hidden_dim, + self.args.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def get_mask(self, seq_len, index, batch_size): + """ + batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + + 참고로 (batch_size*self.args.n_heads, seq_len, seq_len) 가 아니라 + (batch_size*self.args.n_heads, 1, seq_len) 로 하는 이유는 + + last query라 output의 seq부분의 사이즈가 1이기 때문이다 + """ + # [[1], -> [1, 2, 3] + # [2], + # [3]] + index = index.view(-1) + + # last query의 index에 해당하는 upper triangular mask의 row를 사용한다 + mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1)) + mask = mask[index] + + # batchsize * n_head 수만큼 각 mask를 반복하여 증가시킨다 + mask = mask.repeat(1, self.args.n_heads).view(batch_size*self.args.n_heads, -1, 
seq_len)
+        return mask.masked_fill(mask==1, float('-inf'))
+
+    def get_pos(self, seq_len):
+        # position indices for the learned positional embedding
+        return torch.arange(seq_len).unsqueeze(0)
+
+    def init_hidden(self, batch_size):
+        h = torch.zeros(
+            self.args.n_layers,
+            batch_size,
+            self.args.hidden_dim)
+        h = h.to(self.device)
+
+        c = torch.zeros(
+            self.args.n_layers,
+            batch_size,
+            self.args.hidden_dim)
+        c = c.to(self.device)
+
+        return (h, c)
+
+
+    def forward(self, input):
+
+        # -1 because userID is excluded
+        cate_feats = input[:len(self.args.cate_feats)-1]
+
+        # -1 because answerCode is excluded
+        cont_feats = input[len(self.args.cate_feats)-1:-4]
+
+        interaction = input[-4]
+        mask = input[-3]
+        gather_index = input[-2]
+
+        batch_size = interaction.size(0)
+        seq_len = interaction.size(1)
+
+        # cate embedding
+        cate_feats_embed = []
+        embed_interaction = self.embedding_interaction(interaction)
+        cate_feats_embed.append(embed_interaction)
+
+        for i, cate_feat in enumerate(cate_feats):
+            cate_feats_embed.append(self.cate_embedding_list[i](cate_feat))
+
+        # unsqueeze cont feats shape & embedding
+        cont_feats_embed = []
+        for cont_feat in cont_feats:
+            cont_feat = cont_feat.unsqueeze(-1)
+            cont_feats_embed.append(self.cont_embedding(cont_feat))
+
+        # concat cate, cont feats
+        embed_cate = torch.cat(cate_feats_embed, 2)
+        embed_cate = self.cate_comb_proj(embed_cate)
+
+        embed_cont = torch.cat(cont_feats_embed, 2)
+        embed_cont = self.cont_comb_proj(embed_cont)
+
+        embed = torch.cat([embed_cate, embed_cont], 2)
+
+        # Positional Embedding
+        # applied here, although the original last-query solution omitted it
+        position = self.get_pos(seq_len).to(self.args.device)
+        embed_pos = self.embedding_position(position)
+        embed = embed + embed_pos
+
+        ####################### ENCODER #####################
+        q = self.query(embed)
+
+        # This 3D gathering is tricky; a shape walk-through follows.
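+        # Shape walk-through for the gather below, with B=batch_size, S=seq_len,
+        # H=hidden_dim, and assuming gather_index has shape (B, 1) holding each
+        # user's last valid position:
+        #   q                                        : (B, S, H)
+        #   gather_index.repeat(1, H).unsqueeze(1)   : (B, 1, H)
+        #   torch.gather(q, 1, ...)[b, 0, h] == q[b, gather_index[b, 0], h]
+        # i.e. each sequence keeps exactly one row of q: its last valid query.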
+ q = torch.gather(q, 1, gather_index.repeat(1, self.hidden_dim).unsqueeze(1)) + q = q.permute(1, 0, 2) + + k = self.key(embed).permute(1, 0, 2) + v = self.value(embed).permute(1, 0, 2) + + ## attention + # last query only + self.mask = self.get_mask(seq_len, gather_index, batch_size).to(self.device) + out, _ = self.attn(q, k, v, attn_mask=self.mask) + + ## residual + layer norm + out = out.permute(1, 0, 2) + out = embed + out + out = self.ln1(out) + + ## feed forward network + out = self.ffn(out) + + ## residual + layer norm + out = embed + out + out = self.ln2(out) + + ###################### LSTM ##################### + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(out, hidden) + + ###################### DNN ##################### + + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/lastquery_pre.py b/dkt/models_architecture/lastquery_pre.py new file mode 100644 index 0000000..44837f6 --- /dev/null +++ b/dkt/models_architecture/lastquery_pre.py @@ -0,0 +1,200 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + + +import re + +##### PrePadding + +class Feed_Forward_block_Pre(nn.Module): + + """ + out = Relu( M_out*w1 + b1) *w2 + b2 + """ + def __init__(self, dim_ff): + super().__init__() + self.layer1 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + self.layer2 = nn.Linear(in_features=dim_ff, out_features=dim_ff) + + def forward(self,ffn_in): + return self.layer2(F.relu(self.layer1(ffn_in))) + +class LastQuery_Pre(nn.Module): + def __init__(self, args): + super(LastQuery_Pre, self).__init__() + + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + + # Embedding + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # interaction은 현재 correct로 구성되어있다. 
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len) + + # cont Embedding + self.cont_embedding = nn.Linear(1, (self.hidden_dim//2)//cont_len) + + + # 기존 keetar님 솔루션에서는 Positional Embedding은 사용되지 않습니다 + # 하지만 사용 여부는 자유롭게 결정해주세요 :) + # self.embedding_position = nn.Embedding(self.args.max_seq_len, self.hidden_dim) + + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2) + + + # Encoder + self.query = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.key = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + self.value = nn.Linear(in_features=self.hidden_dim, out_features=self.hidden_dim) + + self.attn = nn.MultiheadAttention(embed_dim=self.hidden_dim, num_heads=self.args.n_heads) + self.mask = None # last query에서는 필요가 없지만 수정을 고려하여서 넣어둠 + self.ffn = Feed_Forward_block_Pre(self.hidden_dim) + + + self.ln1 = nn.LayerNorm(self.hidden_dim) + self.ln2 = nn.LayerNorm(self.hidden_dim) + + # LSTM + self.lstm = nn.LSTM( + self.hidden_dim, + self.hidden_dim, + self.args.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def get_pos(self, seq_len): + # use sine positional embeddinds + return torch.arange(seq_len).unsqueeze(0) + + def init_hidden(self, batch_size): + h = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.args.n_layers, + batch_size, + self.args.hidden_dim) + c = c.to(self.device) + + return (h, c) + + + def forward(self, input): + #userID가 빠졌으므로 -1 + cate_feats=input[:len(self.args.cate_feats)-1] + # print("cate_feats개수",len(cate_feats)) + + #answercode가 없으므로 -1 + cont_feats=input[len(self.args.cate_feats)-1:-4] + # print("cont_feats개수",len(cont_feats)) + interaction=input[-4] + mask=input[-3] + gather_index=input[-2] + + batch_size = interaction.size(0) + seq_len = interaction.size(1) + + # cate Embedding + cate_feats_embed=[] + embed_interaction = self.embedding_interaction(interaction) + cate_feats_embed.append(embed_interaction) + + for i, cate_feat in enumerate(cate_feats): + cate_feats_embed.append(self.cate_embedding_list[i](cate_feat)) + + # unsqueeze cont feats shape & embedding + cont_feats_embed=[] + for cont_feat in cont_feats: + cont_feat=cont_feat.unsqueeze(-1) + cont_feats_embed.append(self.cont_embedding(cont_feat)) + + + #concat cate, cont feats + embed_cate = torch.cat(cate_feats_embed, 2) + embed_cate=self.cate_comb_proj(embed_cate) + + embed_cont = torch.cat(cont_feats_embed, 2) + embed_cont=self.cont_comb_proj(embed_cont) + + + embed = torch.cat([embed_cate,embed_cont], 2) + + # Positional Embedding + # last query에서는 positional embedding을 하지 않음 + # position = self.get_pos(seq_len).to('cuda') + # embed_pos = self.embedding_position(position) + # embed = embed + embed_pos + + ####################### ENCODER ##################### + q = self.query(embed)[:, -1:, :].permute(1, 0, 2) + k = self.key(embed).permute(1, 0, 2) + v = self.value(embed).permute(1, 0, 2) + + ## attention + # last query only + out, _ = self.attn(q, k, v) + + ## residual + layer norm + out = out.permute(1, 0, 2) + out = embed + out + out = self.ln1(out) + + ## feed forward network + out = self.ffn(out) + + ## residual + layer norm + out = embed + out + out = 
self.ln2(out) + + ###################### LSTM ##################### + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(out, hidden) + + ###################### DNN ##################### + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + # print(preds) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/lstm.py b/dkt/models_architecture/lstm.py new file mode 100644 index 0000000..51505c6 --- /dev/null +++ b/dkt/models_architecture/lstm.py @@ -0,0 +1,122 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary + + +class LSTM(nn.Module): + + def __init__(self, args): + super(LSTM, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # cont Embedding + self.cont_embedding = nn.Sequential( + nn.Linear(1, (self.hidden_dim//2)//cont_len), + nn.LayerNorm((self.hidden_dim//2)//cont_len) + ) + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2) + + # interaction은 현재 correct로 구성되어있다. 
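LSTM, LSTMATTN and the LastQuery models all budget embedding widths the same way: half of hidden_dim is split evenly across the categorical features (plus interaction), the other half across the continuous features, and each half is projected back to hidden_dim//2 before concatenation. A toy sketch with made-up feature counts and cardinalities:

import torch
import torch.nn as nn

hidden_dim, cate_len, cont_len = 64, 3, 2
cate_cards = [10, 20, 30]                        # illustrative vocab sizes

cate_emb = nn.ModuleList(
    nn.Embedding(card + 1, (hidden_dim // 2) // cate_len) for card in cate_cards
)
inter_emb = nn.Embedding(3, (hidden_dim // 2) // cate_len)  # 0 pad, 1/2 wrong/right
cont_emb = nn.Linear(1, (hidden_dim // 2) // cont_len)

# +1 because interaction is appended to the categorical group
cate_proj = nn.Linear(((hidden_dim // 2) // cate_len) * (cate_len + 1), hidden_dim // 2)
cont_proj = nn.Linear(((hidden_dim // 2) // cont_len) * cont_len, hidden_dim // 2)

batch, seq = 4, 20
cates = [torch.randint(0, c + 1, (batch, seq)) for c in cate_cards]
inter = torch.randint(0, 3, (batch, seq))
conts = [torch.rand(batch, seq) for _ in range(cont_len)]

ec = cate_proj(torch.cat([inter_emb(inter)] + [e(x) for e, x in zip(cate_emb, cates)], dim=2))
eo = cont_proj(torch.cat([cont_emb(x.unsqueeze(-1)) for x in conts], dim=2))
embed = torch.cat([ec, eo], dim=2)               # (batch, seq, hidden_dim)
assert embed.shape == (batch, seq, hidden_dim)

Because of the integer division, any remainder of hidden_dim//2 not divisible by the feature count is absorbed by the comb projection rather than the embeddings.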
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len) + + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + # cate + cont + interaction + mask + gather_index + correct= input + # print('-'*80) + # print("forward를 시작합니다") + #userID가 빠졌으므로 -1 + cate_feats=input[:len(self.args.cate_feats)-1] + # print("cate_feats개수",len(cate_feats)) + + #answercode가 없으므로 -1 + cont_feats=input[len(self.args.cate_feats)-1:-4] + # print("cont_feats개수",len(cont_feats)) + interaction=input[-4] + mask=input[-3] + gather_index=input[-2] + + batch_size = interaction.size(0) + # cate Embedding + cate_feats_embed=[] + embed_interaction = self.embedding_interaction(interaction) + cate_feats_embed.append(embed_interaction) + + # print(self.cate_embedding_list) + # print("cate shapes") + for i, cate_feat in enumerate(cate_feats): + cate_feats_embed.append(self.cate_embedding_list[i](cate_feat)) + + + # unsqueeze cont feats shape & embedding + cont_feats_embed=[] + for cont_feat in cont_feats: + cont_feat=cont_feat.unsqueeze(-1) + cont_feats_embed.append(self.cont_embedding(cont_feat)) + + + #concat cate, cont feats + embed_cate = torch.cat(cate_feats_embed, 2) + embed_cate=self.cate_comb_proj(embed_cate) + + embed_cont = torch.cat(cont_feats_embed, 2) + embed_cont=self.cont_comb_proj(embed_cont) + + + X = torch.cat([embed_cate,embed_cont], 2) + # print("cate와 cont를 concat한 shape : ", X.shape) + + hidden = self.init_hidden(batch_size) + out, hidden = self.lstm(X, hidden) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + + out = self.fc(out) + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/lstmattn.py b/dkt/models_architecture/lstmattn.py new file mode 100644 index 0000000..2d7060e --- /dev/null +++ b/dkt/models_architecture/lstmattn.py @@ -0,0 +1,154 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + + +import re + +class LSTMATTN(nn.Module): + + def __init__(self, args): + super(LSTMATTN, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.n_heads = self.args.n_heads + 
self.drop_out = self.args.drop_out + + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim//2)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # cont Embedding + self.cont_embedding = nn.Sequential( + nn.Linear(1, (self.hidden_dim//2)//cont_len), + nn.LayerNorm((self.hidden_dim//2)//cont_len) + ) + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim//2)//cate_len)*(cate_len+1), self.hidden_dim//2) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Linear(((self.hidden_dim//2)//cont_len)*cont_len, self.hidden_dim//2) + + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim//2)//cate_len) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = BertEncoder(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + + #userID가 빠졌으므로 -1 + cate_feats=input[:len(self.args.cate_feats)-1] + # print("cate_feats개수",len(cate_feats)) + + #answercode가 없으므로 -1 + cont_feats=input[len(self.args.cate_feats)-1:-4] + # print("cont_feats개수",len(cont_feats)) + interaction=input[-4] + mask=input[-3] + gather_index=input[-2] + + batch_size = interaction.size(0) + + # cate Embedding + cate_feats_embed=[] + embed_interaction = self.embedding_interaction(interaction) + cate_feats_embed.append(embed_interaction) + + for i, cate_feat in enumerate(cate_feats): + cate_feats_embed.append(self.cate_embedding_list[i](cate_feat)) + + # unsqueeze cont feats shape & embedding + cont_feats_embed=[] + for cont_feat in cont_feats: + cont_feat=cont_feat.unsqueeze(-1) + cont_feats_embed.append(self.cont_embedding(cont_feat)) + + #concat cate, cont feats + embed_cate = torch.cat(cate_feats_embed, 2) + embed_cate=self.cate_comb_proj(embed_cate) + + embed_cont = torch.cat(cont_feats_embed, 2) + embed_cont=self.cont_comb_proj(embed_cont) + + X = torch.cat([embed_cate,embed_cont], 2) + # print("cate와 cont를 concat한 shape : ", X.shape) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/models_architecture/saint.py b/dkt/models_architecture/saint.py 
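The extended_attention_mask arithmetic in LSTMATTN.forward above is the standard BERT trick: the 0/1 padding mask is reshaped to broadcast over heads and query positions, then turned into an additive bias of -10000 on padded keys so softmax drives their weights to ~0. A small numeric check:

import torch

mask = torch.tensor([[1, 1, 1, 0, 0]])                  # (batch, seq), 0 = padding
ext = mask.unsqueeze(1).unsqueeze(2).to(torch.float32)  # (batch, 1, 1, seq)
ext = (1.0 - ext) * -10000.0

scores = torch.zeros(1, 1, 5, 5)                        # dummy pre-softmax logits
probs = torch.softmax(scores + ext, dim=-1)
print(probs[0, 0, 0])                                   # ~[0.333, 0.333, 0.333, 0.0, 0.0]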
new file mode 100644 index 0000000..58aee8e --- /dev/null +++ b/dkt/models_architecture/saint.py @@ -0,0 +1,187 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + + +class PositionalEncoding(nn.Module): + def __init__(self, d_model, dropout=0.1, max_len=1000): + super(PositionalEncoding, self).__init__() + self.dropout = nn.Dropout(p=dropout) + self.scale = nn.Parameter(torch.ones(1)) + + pe = torch.zeros(max_len, d_model) + position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) + div_term = torch.exp(torch.arange( + 0, d_model, 2).float() * (-math.log(10000.0) / d_model)) + pe[:, 0::2] = torch.sin(position * div_term) + pe[:, 1::2] = torch.cos(position * div_term) + pe = pe.unsqueeze(0).transpose(0, 1) + self.register_buffer('pe', pe) + + def forward(self, x): + x = x + self.scale * self.pe[:x.size(0), :] + return self.dropout(x) + + +class Saint(nn.Module): + + def __init__(self, args): + super(Saint, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + # self.dropout = self.args.drop_out + self.dropout =args.drop_out + + #userID때문에 하나 뺌 + cate_len=len(args.cate_feats)-1 + #answerCode 때문에 하나 뺌 + cont_len=len(args.cont_feats)-1 + + ### Embedding + # ENCODER embedding - for cate + # cate Embedding + self.cate_embedding_list = nn.ModuleList([nn.Embedding(max_val+1, (self.hidden_dim)//cate_len) for max_val in list(args.cate_feat_dict.values())[1:]]) + # interaction은 현재 correct로 구성되어있다. correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, (self.hidden_dim)//cate_len) + + + # DECODER embedding - for cont + # interaction은 현재 correct로 구성되어있다. 
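saint.py wires its feature split into a vanilla nn.Transformer: the categorical (exercise-side) embedding feeds the encoder and the continuous (response-side) embedding feeds the decoder, both in (L, N, E) layout. A minimal sketch with illustrative sizes, where random tensors stand in for the projected embeddings:

import torch
import torch.nn as nn

hidden_dim, n_heads, n_layers, batch, seq = 64, 2, 2, 4, 10
transformer = nn.Transformer(
    d_model=hidden_dim, nhead=n_heads,
    num_encoder_layers=n_layers, num_decoder_layers=n_layers,
    dim_feedforward=hidden_dim, dropout=0.1, activation="relu",
)
embed_enc = torch.randn(seq, batch, hidden_dim)   # (L, N, E): exercise features
embed_dec = torch.randn(seq, batch, hidden_dim)   # (L, N, E): response features
out = transformer(embed_enc, embed_dec)           # (seq, batch, hidden_dim)
assert out.shape == (seq, batch, hidden_dim)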
correct(1, 2) + padding(0) + # cont Embedding + self.cont_embedding = nn.Linear(1, (self.hidden_dim)//cont_len) + + # comb linear + self.cate_comb_proj = nn.Linear(((self.hidden_dim)//cate_len)*(cate_len+1), self.hidden_dim) #interaction을 나중에 더하므로 +1 + self.cont_comb_proj = nn.Sequential( + nn.Linear(((self.hidden_dim)//cont_len)*cont_len, self.hidden_dim), + nn.LayerNorm(self.hidden_dim) + ) + + # Positional encoding + self.pos_encoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len) + self.pos_decoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len) + + # # other feature + # self.f_cnt = len(self.n_other_features) # feature의 개수 + # self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + + self.transformer = nn.Transformer( + d_model=self.hidden_dim, + nhead=self.args.n_heads, + num_encoder_layers=self.args.n_layers, + num_decoder_layers=self.args.n_layers, + dim_feedforward=self.hidden_dim, + dropout=self.dropout, + activation='relu') + + self.fc = nn.Linear(self.hidden_dim, 1) + self.activation = nn.Sigmoid() + + self.enc_mask = None + self.dec_mask = None + self.enc_dec_mask = None + + def get_mask(self, seq_len): + mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1)) + + return mask.masked_fill(mask==1, float('-inf')) + + def forward(self, input): + #userID가 빠졌으므로 -1 + cate_feats=input[:len(self.args.cate_feats)-1] + # print("cate_feats개수",len(cate_feats)) + + #answercode가 없으므로 -1 + cont_feats=input[len(self.args.cate_feats)-1:-4] + # print("cont_feats개수",len(cont_feats)) + interaction=input[-4] + mask=input[-3] + gather_index=input[-2] + + batch_size = interaction.size(0) + seq_len = interaction.size(1) + + + # 신나는 embedding + # ENCODER + # cate Embedding + cate_feats_embed=[] + embed_interaction = self.embedding_interaction(interaction) + cate_feats_embed.append(embed_interaction) + + for i, cate_feat in enumerate(cate_feats): + cate_feats_embed.append(self.cate_embedding_list[i](cate_feat)) + + #concat cate for Encoder + embed_cate = torch.cat(cate_feats_embed, 2) + embed_enc=self.cate_comb_proj(embed_cate) + + + # DECODER + # # unsqueeze cont feats shape & embedding + cont_feats_embed=[] + for cont_feat in cont_feats: + cont_feat=cont_feat.unsqueeze(-1) + cont_feats_embed.append(self.cont_embedding(cont_feat)) + + embed_cont = torch.cat(cont_feats_embed, 2) + embed_dec=self.cont_comb_proj(embed_cont) + + # ATTENTION MASK 생성 + # encoder하고 decoder의 mask는 가로 세로 길이가 모두 동일하여 + # 사실 이렇게 3개로 나눌 필요가 없다 + if self.enc_mask is None or self.enc_mask.size(0) != seq_len: + self.enc_mask = self.get_mask(seq_len).to(self.device) + + if self.dec_mask is None or self.dec_mask.size(0) != seq_len: + self.dec_mask = self.get_mask(seq_len).to(self.device) + + if self.enc_dec_mask is None or self.enc_dec_mask.size(0) != seq_len: + self.enc_dec_mask = self.get_mask(seq_len).to(self.device) + + + embed_enc = embed_enc.permute(1, 0, 2) + embed_dec = embed_dec.permute(1, 0, 2)#shape(batch,msl,hidden_dim) -> shape(msl,batch,hidden_dim) + + # Positional encoding + embed_enc = self.pos_encoder(embed_enc) + embed_dec = self.pos_decoder(embed_dec) + + out = self.transformer(embed_enc, embed_dec, + src_mask=self.enc_mask, + tgt_mask=self.dec_mask, + memory_mask=self.enc_dec_mask) + + out = out.permute(1, 0, 2) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + + + return preds \ No newline at 
end of file diff --git a/dkt/models_architecture/tfixupsaint.py b/dkt/models_architecture/tfixupsaint.py new file mode 100644 index 0000000..b12dea7 --- /dev/null +++ b/dkt/models_architecture/tfixupsaint.py @@ -0,0 +1,243 @@ +from operator import index +from numpy.lib.function_base import select +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +import copy +import math +import os + +from torch.nn.modules import dropout + +from torchsummary import summary +from transformers.utils.dummy_pt_objects import AlbertModel + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + +from transformers.models.convbert.modeling_convbert import ConvBertConfig, ConvBertEncoder,ConvBertModel +from transformers.models.roberta.modeling_roberta import RobertaConfig,RobertaEncoder,RobertaModel +from transformers.models.albert.modeling_albert import AlbertAttention, AlbertTransformer, AlbertModel +from transformers.models.albert.configuration_albert import AlbertConfig +from transformers import BertPreTrainedModel + +class TfixupSaint(nn.Module): + + def __init__(self, args,Tfixup=True): + super(TfixupSaint, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + # self.dropout = self.args.dropout + self.dropout = 0. + + ### Embedding + # ENCODER embedding + + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + self.n_other_features = self.args.n_other_features + print(self.n_other_features) + + # encoder combination projection + self.enc_comb_proj = nn.Linear((self.hidden_dim//3)*(3+len(self.n_other_features)), self.hidden_dim) + + # DECODER embedding + # interaction은 현재 correct으로 구성되어있다. 
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + + # decoder combination projection + self.dec_comb_proj = nn.Linear((self.hidden_dim//3)*(4+len(self.n_other_features)), self.hidden_dim) + + # Positional encoding + self.pos_encoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len) + self.pos_decoder = PositionalEncoding(self.hidden_dim, self.dropout, self.args.max_seq_len) + + # # other feature + self.f_cnt = len(self.n_other_features) # feature의 개수 + self.embedding_other_features = [nn.Embedding(self.n_other_features[i]+1, self.hidden_dim//3) for i in range(self.f_cnt)] + + + self.transformer = nn.Transformer( + d_model=self.hidden_dim, + nhead=self.args.n_heads, + num_encoder_layers=self.args.n_layers, + num_decoder_layers=self.args.n_layers, + dim_feedforward=self.hidden_dim, + dropout=self.dropout, + activation='relu') + + self.fc = nn.Linear(self.hidden_dim, 1) + self.activation = nn.Sigmoid() + + self.enc_mask = None + self.dec_mask = None + self.enc_dec_mask = None + + # T-Fixup + if self.args.Tfixup: + + # 초기화 (Initialization) + self.tfixup_initialization() + print("T-Fixupbb Initialization Done") + + # 스케일링 (Scaling) + self.tfixup_scaling() + print(f"T-Fixup Scaling Done") + + def tfixup_initialization(self): + # 우리는 padding idx의 경우 모두 0으로 통일한다 + padding_idx = 0 + print(self.named_parameters) + for name, param in self.named_parameters(): + print(f'name : {name}') + if re.match(r'^embedding*', name): + nn.init.normal_(param, mean=0, std=param.shape[1] ** -0.5) + nn.init.constant_(param[padding_idx], 0) + elif re.match(r'.*Norm.*', name) or re.match(r'.*norm*.*',name): + continue + elif re.match(r'.*weight*', name): + # nn.init.xavier_uniform_(param) + nn.init.xavier_normal_(param) + + + def tfixup_scaling(self): + temp_state_dict = {} + + # 특정 layer들의 값을 스케일링한다 + for name, param in self.named_parameters(): + + # TODO: 모델 내부의 module 이름이 달라지면 직접 수정해서 + # module이 scaling 될 수 있도록 변경해주자 + # print(name) + + if re.match(r'^embedding*', name): + temp_state_dict[name] = (9 * self.args.n_layers) ** (-1 / 4) * param + elif re.match(r'.*Norm.*', name) or re.match(r'.*norm*.*',name): + continue + elif re.match(r'encoder.*dense.*weight$|encoder.*attention.output.*weight$', name): + temp_state_dict[name] = (0.67 * (self.args.n_layers) ** (-1 / 4)) * param + elif re.match(r"encoder.*value.weight$", name): + temp_state_dict[name] = (0.67 * (self.args.n_layers) ** (-1 / 4)) * (param * (2**0.5)) + + # 나머지 layer는 원래 값 그대로 넣는다 + for name in self.state_dict(): + if name not in temp_state_dict: + temp_state_dict[name] = self.state_dict()[name] + + self.load_state_dict(temp_state_dict) + + def get_mask(self, seq_len): + mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1)) + + return mask.masked_fill(mask==1, float('-inf')) + + def forward(self, input): + # test, question, tag, _, mask, interaction, _ = input + + # # print(f'input 길이 : {len(input)}') + + # # input의 순서는 test, question, tag, _, mask, interaction, (...other features), gather_index(안 씀) + + # # for i,e in enumerate(input): + # # print(f'i 번째 : {e[i].shape}') + test = input[0] + question = input[1] + tag = input[2] + + mask = input[4] + interaction = input[5] + + other_features = [input[i] for i in range(6,len(input)-1)] + + batch_size = interaction.size(0) + seq_len = interaction.size(1) + + + + # 신나는 embedding + # ENCODER + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + # 
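The two T-Fixup methods above implement the paper's recipe with regex matches over named_parameters(): Xavier-init non-norm weights, re-init embeddings to N(0, d^-1/2) with a zeroed padding row, then scale embeddings by (9 * n_layers)^(-1/4) and the encoder's value/output projections by 0.67 * n_layers^(-1/4) (values additionally by sqrt(2)), which is what lets the Transformer train without warmup or LayerNorm tricks. A compact sketch on a toy module; the module names here are made up:

import torch
import torch.nn as nn

n_layers = 2
toy = nn.ModuleDict({
    "embedding": nn.Embedding(100, 64, padding_idx=0),
    "encoder_value": nn.Linear(64, 64),
})

for name, p in toy.named_parameters():
    if "embedding" in name and p.dim() > 1:
        nn.init.normal_(p, mean=0, std=p.shape[1] ** -0.5)  # N(0, d^-1/2)
        with torch.no_grad():
            p[0].zero_()                                    # keep padding row at 0
    elif p.dim() > 1:
        nn.init.xavier_normal_(p)

with torch.no_grad():
    toy["embedding"].weight.mul_((9 * n_layers) ** (-1 / 4))
    toy["encoder_value"].weight.mul_(0.67 * n_layers ** (-1 / 4) * 2 ** 0.5)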
# dev + embed_other_features =[] + + for i,e in enumerate(self.embedding_other_features): + # print(f'{i}번째 : {e}') + # print(f'최댓값(전) : {torch.max(other_features[i])}') + # print(f'최솟값(전) : {torch.min(other_features[i])}') + embed_other_features.append(e(other_features[i])) + # print(f'최댓값(후) : {torch.max(other_features[i])}') + # print(f'최솟값(후) : {torch.min(other_features[i])}') + + cat_list = [ + # embed_interaction, + embed_test, + embed_question, + embed_tag, + ] + cat_list.extend(embed_other_features) + embed_enc = torch.cat(cat_list, 2) + + embed_enc = self.enc_comb_proj(embed_enc) + + # DECODER + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + embed_interaction = self.embedding_interaction(interaction) + + cat_list = [ + + embed_test, + embed_question, + embed_tag, + embed_interaction, + ] + cat_list.extend(embed_other_features) + embed_dec = torch.cat(cat_list, 2) + + embed_dec = self.dec_comb_proj(embed_dec) + + # ATTENTION MASK 생성 + # encoder하고 decoder의 mask는 가로 세로 길이가 모두 동일하여 + # 사실 이렇게 3개로 나눌 필요가 없다 + if self.enc_mask is None or self.enc_mask.size(0) != seq_len: + self.enc_mask = self.get_mask(seq_len).to(self.device) + + if self.dec_mask is None or self.dec_mask.size(0) != seq_len: + self.dec_mask = self.get_mask(seq_len).to(self.device) + + if self.enc_dec_mask is None or self.enc_dec_mask.size(0) != seq_len: + self.enc_dec_mask = self.get_mask(seq_len).to(self.device) + + + embed_enc = embed_enc.permute(1, 0, 2) + embed_dec = embed_dec.permute(1, 0, 2) + + # Positional encoding + embed_enc = self.pos_encoder(embed_enc) + embed_dec = self.pos_decoder(embed_dec) + + out = self.transformer(embed_enc, embed_dec, + src_mask=self.enc_mask, + tgt_mask=self.dec_mask, + memory_mask=self.enc_dec_mask) + + out = out.permute(1, 0, 2) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + + preds = self.activation(out).view(batch_size, -1) + + + return preds \ No newline at end of file diff --git a/dkt/new_model.py b/dkt/new_model.py new file mode 100644 index 0000000..6af15bf --- /dev/null +++ b/dkt/new_model.py @@ -0,0 +1,181 @@ +import torch +import torch.nn as nn + +try: + from transformers.modeling_bert import BertConfig, BertEncoder, BertModel +except: + from transformers.models.bert.modeling_bert import BertConfig, BertEncoder, BertModel + + +class LSTMATTN(nn.Module): + + def __init__(self, args): + super(LSTMATTN, self).__init__() + self.args = args + self.device = args.device + + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + self.n_heads = self.args.n_heads + self.drop_out = self.args.drop_out + + # Embedding + # interaction은 현재 correct로 구성되어있다. 
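As the comment in TfixupSaint.forward above notes, the three Transformer masks it builds are all the same causal mask: an upper-triangular matrix of -inf that stops position i from attending to any j > i. In isolation:

import numpy as np
import torch

seq_len = 5
mask = torch.from_numpy(np.triu(np.ones((seq_len, seq_len)), k=1))
mask = mask.masked_fill(mask == 1, float("-inf"))
print(mask)
# row 0 may attend only to position 0; row 4 may attend to positions 0..4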
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + # embedding combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim) + + self.lstm = nn.LSTM(self.hidden_dim, + self.hidden_dim, + self.n_layers, + batch_first=True) + + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=1, + num_attention_heads=self.n_heads, + intermediate_size=self.hidden_dim, + hidden_dropout_prob=self.drop_out, + attention_probs_dropout_prob=self.drop_out, + ) + self.attn = BertEncoder(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.hidden_dim, 1) + + self.activation = nn.Sigmoid() + + def init_hidden(self, batch_size): + h = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + h = h.to(self.device) + + c = torch.zeros( + self.n_layers, + batch_size, + self.hidden_dim) + c = c.to(self.device) + + return (h, c) + + def forward(self, input): + + test, question, tag, _, mask, interaction, _ = input + + batch_size = interaction.size(0) + + # Embedding + + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + + embed = torch.cat([embed_interaction, + embed_test, + embed_question, + embed_tag,], 2) + + X = self.comb_proj(embed) + + hidden = self.init_hidden(batch_size) + # print(f'{hidden[0].shape}, {hidden[1].shape}') + out, hidden = self.lstm(X, hidden) + # print(out.shape) + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + # print(out.shape) + + extended_attention_mask = mask.unsqueeze(1).unsqueeze(2) + extended_attention_mask = extended_attention_mask.to(dtype=torch.float32) + extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0 + head_mask = [None] * self.n_layers + + encoded_layers = self.attn(out, extended_attention_mask, head_mask=head_mask) + sequence_output = encoded_layers[-1] + + out = self.fc(sequence_output) + + preds = self.activation(out).view(batch_size, -1) + + return preds + + + + +class Bert(nn.Module): + + def __init__(self, args): + super(Bert, self).__init__() + self.args = args + self.device = args.device + + # Defining some parameters + self.hidden_dim = self.args.hidden_dim + self.n_layers = self.args.n_layers + + # Embedding + # interaction은 현재 correct으로 구성되어있다. 
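The Bert model being defined here skips BERT's own token-embedding lookup entirely: the concatenated feature projection is passed through inputs_embeds, with the padding mask as attention_mask, as its forward below shows. A toy-sized sketch of that call pattern (sizes are illustrative):

import torch
from transformers import BertConfig, BertModel

hidden_dim, batch, seq = 64, 2, 8
config = BertConfig(
    vocab_size=3,                 # unused when feeding inputs_embeds
    hidden_size=hidden_dim,
    num_hidden_layers=1,
    num_attention_heads=2,
    max_position_embeddings=seq,
)
encoder = BertModel(config)
X = torch.randn(batch, seq, hidden_dim)   # stands in for comb_proj output
mask = torch.ones(batch, seq)
out = encoder(inputs_embeds=X, attention_mask=mask)[0]
assert out.shape == (batch, seq, hidden_dim)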
correct(1, 2) + padding(0) + self.embedding_interaction = nn.Embedding(3, self.hidden_dim//3) + self.embedding_test = nn.Embedding(self.args.n_test + 1, self.hidden_dim//3) + self.embedding_question = nn.Embedding(self.args.n_questions + 1, self.hidden_dim//3) + self.embedding_tag = nn.Embedding(self.args.n_tag + 1, self.hidden_dim//3) + + # embedding combination projection + self.comb_proj = nn.Linear((self.hidden_dim//3)*4, self.hidden_dim) + + # Bert config + self.config = BertConfig( + 3, # not used + hidden_size=self.hidden_dim, + num_hidden_layers=self.args.n_layers, + num_attention_heads=self.args.n_heads, + max_position_embeddings=self.args.max_seq_len + ) + + # Defining the layers + # Bert Layer + self.encoder = BertModel(self.config) + + # Fully connected layer + self.fc = nn.Linear(self.args.hidden_dim, 1) + + self.activation = nn.Sigmoid() + # self.activation=nn.Tanh() + + + def forward(self, input): + test, question, tag, _, mask, interaction, _ = input + batch_size = interaction.size(0) + + # 신나는 embedding + + embed_interaction = self.embedding_interaction(interaction) + embed_test = self.embedding_test(test) + embed_question = self.embedding_question(question) + embed_tag = self.embedding_tag(tag) + + embed = torch.cat([embed_interaction, + + embed_test, + embed_question, + + embed_tag,], 2) + + X = self.comb_proj(embed) + + # Bert + encoded_layers = self.encoder(inputs_embeds=X, attention_mask=mask) + out = encoded_layers[0] + out = out.contiguous().view(batch_size, -1, self.hidden_dim) + out = self.fc(out) + preds = self.activation(out).view(batch_size, -1) + + return preds \ No newline at end of file diff --git a/dkt/optimizer.py b/dkt/optimizer.py index 1548373..e702f46 100644 --- a/dkt/optimizer.py +++ b/dkt/optimizer.py @@ -1,10 +1,19 @@ from torch.optim import Adam, AdamW +from adamp import AdamP, SGDP def get_optimizer(model, args): - if args.optimizer == 'adam': + if args.optimizer.lower() == 'adam': optimizer = Adam(model.parameters(), lr=args.lr, weight_decay=0.01) - if args.optimizer == 'adamW': + # if args.optimizer == 'adamW': + if args.optimizer.lower() == 'adamw': optimizer = AdamW(model.parameters(), lr=args.lr, weight_decay=0.01) + # if args.optimizer == 'adamP': + if args.optimizer == 'adamp': + optimizer = AdamP(model.parameters(), lr=args.lr, weight_decay=0.01) + # if args.optimizer == 'SGDP': + if args.optimizer == 'sgdp': + optimizer = SGDP(model.parameters(), lr=args.lr, weight_decay=0.01) + # 모든 parameter들의 grad값을 0으로 초기화 optimizer.zero_grad() diff --git a/dkt/trainer.py b/dkt/trainer.py index 61fd329..4b1de07 100644 --- a/dkt/trainer.py +++ b/dkt/trainer.py @@ -1,16 +1,21 @@ import os -import json +from numpy.lib.arraysetops import isin import torch import numpy as np - +import json +import gc from tqdm.auto import tqdm + from .dataloader import get_loaders from .optimizer import get_optimizer from .scheduler import get_scheduler from .criterion import get_criterion from .metric import get_metric -from .model import * +import wandb + +from .models_architecture import * from lgbm_utils import * +# from .new_model import Bert,LSTMATTN import lightgbm as lgb from sklearn.metrics import roc_auc_score @@ -22,20 +27,22 @@ def run(args, train_data, valid_data): lgbm_params=args.lgbm.model_params - + print(f'{args.model}모델을 사용합니다') + print('-'*80) train_loader, valid_loader = get_loaders(args, train_data, valid_data) - ### LGBM runner - + #lgbm은 학습과 추론을 함께함 if args.model=='lgbm': - #학습 - model,auc,acc=lgbm_train(args,train_data,valid_data) - 
wandb.log({"valid_auc":auc, "valid_acc":acc}) + print("k-fold를 사용하지 않습니다","-"*80) + model,auc,acc, precision, recall, f1=lgbm_train(args,train_data,valid_data) + if args.wandb.using: + wandb.log({"valid_auc": train_auc, "valid_acc":train_acc,"valid_precision":precision, "valid_recall":recall, "valid_f1": f1}) #추론준비 csv_file_path = os.path.join(args.data_dir, args.test_file_name) test_df = pd.read_csv(csv_file_path)#, nrows=100000) - test_df = make_lgbm_feature(test_df) + + test_df = make_lgbm_feature(args,test_df) #유저별 시퀀스를 고려하기 위해 아래와 같이 정렬 test_df.sort_values(by=['userID','Timestamp'], inplace=True) test_df=lgbm_make_test_data(test_df) @@ -43,38 +50,50 @@ def run(args, train_data, valid_data): lgbm_inference(args,model,test_df) return - # only when using warmup scheduler args.total_steps = int(len(train_loader.dataset) / args.batch_size) * (args.n_epochs) args.warmup_steps = args.total_steps // 10 - model = get_model(args,args.model) + model = get_model(args) +# model = get_model(args,args.model) optimizer = get_optimizer(model, args) scheduler = get_scheduler(optimizer, args) best_auc = -1 + best_acc = -1 + best_precision=-1 + best_recall=-1 + best_f1=-1 early_stopping_counter = 0 for epoch in range(args.n_epochs): print(f"Start Training: Epoch {epoch + 1}") ### TRAIN - train_auc, train_acc, train_loss = train(train_loader, model, optimizer, args) - + train_auc, train_acc, train_precision,train_recall,train_f1, train_loss = train(train_loader, model, optimizer, args) + ### VALID - auc, acc,_ , _ = validate(valid_loader, model, args) + auc, acc,precision,recall,f1, preds , _ = validate(valid_loader, model, args) ### TODO: model save or early stopping - wandb.log({"epoch": epoch, "train_loss": train_loss, "train_auc": train_auc, "train_acc":train_acc, - "valid_auc":auc, "valid_acc":acc}) + if args.wandb.using: + wandb.log({"epoch": epoch, "train_loss": train_loss, "train_auc": train_auc, "train_acc":train_acc, + "train_precision": train_precision,"train_recall":train_recall,"train_f1":train_f1, + "valid_auc":auc, "valid_acc":acc,"valid_precision":precision,"valid_recall":recall,"valid_f1":f1}) if auc > best_auc: best_auc = auc + best_acc = acc + best_precision=precision + best_recall=recall + best_f1=f1 + best_preds=preds # torch.nn.DataParallel로 감싸진 경우 원래의 model을 가져옵니다. 
model_to_save = model.module if hasattr(model, 'module') else model save_checkpoint({ 'epoch': epoch + 1, 'state_dict': model_to_save.state_dict(), }, + args.model_dir, f'{args.task_name}.pt', ) early_stopping_counter = 0 @@ -89,63 +108,118 @@ def run(args, train_data, valid_data): scheduler.step(best_auc) else: scheduler.step() + + print(f"best AUC : {best_auc:.5f}, accuracy : {best_acc:.5f}, precision : {best_precision:.5f}, recall : {best_recall:.5f}, f1 : {best_f1:.5f}") -def run_kfold(args, train_data): +def run_kfold(args, train_data, test_train_data ,train_uid_df): n_splits = args.n_fold - - if args.use_stratify == True: - kfold = StratifiedKFold(n_splits=n_splits, shuffle=True) - else: - kfold = KFold(n_splits=n_splits, shuffle=True) - + print("k-fold를 사용합니다","-"*80) ### LGBM runner if args.model=='lgbm': - lgbm_params=args.lgbm.model_params - - for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_data)): - train_data, valid_data = train_data[train_idx], train_data[valid_idx] - - #학습 - model,auc,acc=lgbm_train(args,train_data,valid_data) - wandb.log({"valid_auc":auc, "valid_acc":acc}) - #추론준비 - csv_file_path = os.path.join(args.data_dir, args.test_file_name) - test_df = pd.read_csv(csv_file_path)#, nrows=100000) - test_df = make_lgbm_feature(test_df) - #유저별 시퀀스를 고려하기 위해 아래와 같이 정렬 - test_df.sort_values(by=['userID','Timestamp'], inplace=True) - test_df=lgbm_make_test_data(test_df) - #추론 - lgbm_inference(args,model,test_df) + + csv_file_path = os.path.join(args.data_dir, args.file_name) + train_df = pd.read_csv(csv_file_path)#, nrows=100000) + + csv_file_path = os.path.join(args.data_dir, args.test_file_name) + test_df = pd.read_csv(csv_file_path)#, nrows=100000) + + if args.use_test_data:#test의 데이터까지 사용할 경우 + train_df=make_sharing_feature(args) + + if args.user_split_augmentation: + #종호님의 유저 split augmentation + train_df['Timestamp']=pd.to_datetime(train_df['Timestamp'].values) + train_df['month'] = train_df['Timestamp'].dt.month + # df['userID'] = (df['userID'].map(str)+'0'+df['month'].map(str)).astype('int32') + train_df['userID'] = (train_df['userID'].map(str)+'0'+train_df['month'].map(str)).astype('int32') + train_df.drop(columns=['month'],inplace=True) + print("user_augmentation 후 유저 수",len(train_df['userID'].unique())) + train_df=make_lgbm_feature(args,train_df) + + if args.use_distance: + test_df['distance']=np.load('/opt/ml/np_test_tag_distance_arr.npy') + + test_df=make_lgbm_feature(args,test_df) + + delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp','sec_time'] + features=list(set(test_df.columns)-set(delete_feats)) + + print(f'사용한 피처는 다음과 같습니다') + print(features) + + if args.split_by_user: #유저별로 train/valid set을 나눌 때 + y_oof,pred,fi,score,acc, precision, recall, f1=make_lgb_user_oof_prediction(args,train_df, test_df, features, categorical_features='auto', model_params=args.lgbm.model_params, folds=args.n_fold) + + else : #skl split라이브러리를 이용하여 유저 구분없이 나눌 때 + y_oof,pred,fi, score,acc, precision, recall, f1=make_lgb_oof_prediction(args,train_df, test_df, features, categorical_features='auto', model_params=args.lgbm.model_params, folds=args.n_fold) + + if args.wandb.using: + wandb.log({"valid_auc":score, "valid_acc":acc, "valid_precision": precision, "valid_recall": recall,"valid_f1": f1}) + new_output_path=f'{args.output_dir}{args.task_name}' + write_path = os.path.join(new_output_path, "output.csv") + if not os.path.exists(new_output_path): + os.makedirs(new_output_path) + with open(write_path, 'w', encoding='utf8') as w: + print("writing 
prediction : {}".format(write_path)) + w.write("id,prediction\n") + for id, p in enumerate(pred): + w.write('{},{}\n'.format(id,p)) + + print(f"lgbm의 예측파일이 {new_output_path}/{args.task_name}.csv 로 저장됐습니다.") + + save_path=f"{args.output_dir}{args.task_name}/feature{len(features)}_config.json" + json.dump( + features, + open(save_path, "w"), + indent=2, + ensure_ascii=False, + ) return + + + if args.use_stratify == True: + kfold = StratifiedKFold(n_splits=n_splits, shuffle=True) + else: + kfold = KFold(n_splits=n_splits, shuffle=True) target = get_target(train_data) - val_auc = 0 val_acc = 0 - + val_precision=0 + val_recall=0 + val_f1=0 oof = np.zeros(train_data.shape[0]) + oof_target = np.zeros(train_data.shape[0]) for fold, (train_idx, valid_idx) in enumerate(kfold.split(train_data, target)): + print(f'{fold}fold를 수행합니다') trn_data = train_data[train_idx] val_data = train_data[valid_idx] + if args.use_total_data == False: + print("validation에 test 파일유저를 넣지 않습니다.") + # test data 중 일부 추가 + trn_data = np.concatenate([trn_data, test_train_data]) + train_loader, valid_loader = get_loaders(args, trn_data, val_data) # only when using warmup scheduler args.total_steps = int(len(train_loader.dataset) / args.batch_size) * (args.n_epochs) args.warmup_steps = args.total_steps // 10 - model = get_model(args,args.model) + model = get_model(args) optimizer = get_optimizer(model, args) scheduler = get_scheduler(optimizer, args) best_auc = -1 best_acc = 0 + best_precision=-1 + best_recall=-1 + best_f1=-1 best_preds = None early_stopping_counter = 0 @@ -154,17 +228,23 @@ def run_kfold(args, train_data): print(f"Start Training: Epoch {epoch + 1}") ### TRAIN - train_auc, train_acc, train_loss = train(train_loader, model, optimizer, args) - + train_auc, train_acc, train_precision,train_recall,train_f1, train_loss = train(train_loader, model, optimizer, args) + ### VALID - auc, acc, preds , _ = validate(valid_loader, model, args) + auc, acc,precision,recall,f1, preds , targets = validate(valid_loader, model, args) ### TODO: model save or early stopping - wandb.log({"epoch": epoch, "train_loss": train_loss, "train_auc": train_auc, "train_acc":train_acc, - "valid_auc":auc, "valid_acc":acc}) + if args.wandb.using: + wandb.log({"epoch": epoch, "train_loss": train_loss, "train_auc": train_auc, "train_acc":train_acc, + "train_precision": train_precision,"train_recall":train_recall,"train_f1":train_f1, + "valid_auc":auc, "valid_acc":acc,"valid_precision":precision,"valid_recall":recall,"valid_f1":f1}) + if auc > best_auc: best_auc = auc best_acc = acc + best_precision=precision + best_recall=recall + best_f1=f1 best_preds = preds # torch.nn.DataParallel로 감싸진 경우 원래의 model을 가져옵니다. 
model_to_save = model.module if hasattr(model, 'module') else model @@ -186,25 +266,49 @@ def run_kfold(args, train_data): scheduler.step(best_auc) else: scheduler.step() - + + gc.collect() + val_auc += best_auc/n_splits val_acc += best_acc/n_splits + val_precision += best_precision/n_splits + val_recall += best_recall/n_splits + val_f1 += best_f1/n_splits + oof[valid_idx] = best_preds + oof_target[valid_idx] = targets - print(f'Valid AUC : {val_auc}, Valid ACC : {val_acc} \n') + oof_df = train_uid_df + oof_df['preds'] = oof + oof_df['target'] = oof_target + + new_output_path=f'{args.output_dir}/{args.task_name}' + write_path = os.path.join(new_output_path, "oof_preds.csv") + if not os.path.exists(new_output_path): + os.makedirs(new_output_path) + oof_df.to_csv(write_path, index=False) + + print(f"Valid AUC : {val_auc:.5f}, accuracy : {val_acc:.5f}, precision : {val_precision:.5f}, recall : {val_recall:.5f}, f1 : {val_f1:.5f}") def train(train_loader, model, optimizer, args): model.train() - print("start training--------------------------") + total_preds = [] total_targets = [] losses = [] - for step, batch in tqdm(enumerate(train_loader)): - input = process_batch(batch, args) + + for step, batch in enumerate(train_loader): + gc.collect() + if isinstance(model,Saint) or isinstance(model, LastQuery_Post) or isinstance(model,LastQuery_Pre)\ + or isinstance(model, TfixupSaint) or isinstance(model,LSTM) or isinstance(model, LSTMATTN): + input = process_batch_v2(batch, args) + else: + input = process_batch_v2(batch,args) + # print(f"input 텐서 사이즈 : {type(input)}, {len(input)}") preds = model(input) - targets = input[3] # correct + targets = input[-1] # correct loss = compute_loss(preds, targets) @@ -233,10 +337,10 @@ def train(train_loader, model, optimizer, args): total_targets = np.concatenate(total_targets) # Train AUC / ACC - auc, acc = get_metric(total_targets, total_preds) + auc, acc ,precision,recall,f1 = get_metric(total_targets, total_preds) loss_avg = sum(losses)/len(losses) - print(f'TRAIN AUC : {auc} ACC : {acc}') - return auc, acc, loss_avg + print(f'TRAIN AUC : {auc:.5f} ACC : {acc:.5f} precision : {precision:.5f} recall : {recall:.5f} f1 : {f1:.5f}') + return auc, acc ,precision,recall,f1, loss_avg def validate(valid_loader, model, args): @@ -245,10 +349,16 @@ def validate(valid_loader, model, args): total_preds = [] total_targets = [] for step, batch in enumerate(valid_loader): - input = process_batch(batch, args) + # input = process_batch(batch, args) + if isinstance(model,Saint) or isinstance(model, LastQuery_Post) or isinstance(model,LastQuery_Pre)\ + or isinstance(model, TfixupSaint) or isinstance(model, AutoEncoderLSTMATTN): + input = process_batch_v2(batch, args) + else: + input = process_batch_v2(batch,args) + preds = model(input) - targets = input[3] # correct + targets = input[-1] # correct # predictions @@ -269,11 +379,11 @@ def validate(valid_loader, model, args): total_targets = np.concatenate(total_targets) # Train AUC / ACC - auc, acc = get_metric(total_targets, total_preds) + auc, acc ,precision,recall,f1 = get_metric(total_targets, total_preds) - print(f'VALID AUC : {auc} ACC : {acc}\n') + print(f'VALID AUC : {auc:.5f} ACC : {acc:.5f} precision : {precision:.5f} recall : {recall:.5f} f1 : {f1:.5f}') - return auc, acc, total_preds, total_targets + return auc, acc ,precision,recall,f1, total_preds, total_targets @@ -285,12 +395,14 @@ def inference(args, test_data): model = load_model(args) model.eval() _, test_loader = get_loaders(args, None, test_data) - + 
print("test_loader에 대해") + print(len(test_loader)) + print(test_loader) total_preds = [] - - for step, batch in tqdm(enumerate(test_loader)): - input = process_batch(batch, args) + + for step, batch in enumerate(test_loader): + input = process_batch_v2(batch,args) preds = model(input) @@ -305,11 +417,12 @@ def inference(args, test_data): preds = preds.detach().numpy() total_preds+=list(preds) - + new_output_path=f'{args.output_dir}/{args.task_name}' write_path = os.path.join(new_output_path, "output.csv") if not os.path.exists(new_output_path): os.makedirs(new_output_path) + print("정답의 개수 :",len(total_preds)) with open(write_path, 'w', encoding='utf8') as w: print("writing prediction : {}".format(write_path)) w.write("id,prediction\n") @@ -317,8 +430,7 @@ def inference(args, test_data): w.write('{},{}\n'.format(id,p)) - -def inference_kfold(args, test_data): +def inference_kfold(args, test_data, test_uid_df): if args.model=='lgbm': return @@ -335,7 +447,7 @@ def inference_kfold(args, test_data): fold_preds = [] for step, batch in tqdm(enumerate(test_loader)): - input = process_batch(batch, args) + input = process_batch_v2(batch, args) preds = model(input) # predictions @@ -355,44 +467,105 @@ def inference_kfold(args, test_data): else: oof_pred += fold_pred / args.n_fold - + oof_df = test_uid_df + oof_df['preds'] = oof_pred new_output_path=f'{args.output_dir}/{args.task_name}' write_path = os.path.join(new_output_path, "output.csv") if not os.path.exists(new_output_path): os.makedirs(new_output_path) - with open(write_path, 'w', encoding='utf8') as w: - print("writing prediction : {}".format(write_path)) - w.write("id,prediction\n") - for id, p in enumerate(oof_pred): - w.write('{},{}\n'.format(id,p)) - + + oof_df.to_csv(write_path, index=False) -def get_model(args,model_name:str): +def get_model(args): """ Load model and move tensors to a given devices. 
""" - if model_name == 'lstm': model = LSTM(args) - if model_name == 'lstmattn': model = LSTMATTN(args) - if model_name == 'bert': model = Bert(args) - if model_name == 'lstmroberta' : model = LSTMRobertaATTN(args) - if model_name == 'lastquery': model = LastQuery(args) - if model_name == 'saint': model = Sain(args) - + if args.model.lower() == 'lstm': model = LSTM(args) + # if args.model.lower() == 'lstmattn': model = LSTMATTN(args) + if args.model.lower() == 'bert': model = Bert(args) + if args.model.lower() == 'bilstmattn': model = BiLSTMATTN(args) + if args.model.lower() == 'lstmattn': model = LSTMATTN(args) + # if args.model.lower() == 'lstmconvattn' or args.model.lower() == 'lstmrobertaattn' or args.model.lower() == 'lstmattn'\ + # or args.model.lower() == 'lstmalbertattn': + # model = AutoEncoderLSTMATTN(args) + if args.model.lower() == 'mylstmconvattn' : model = MyLSTMConvATTN(args) + if args.model.lower() == 'saint' : model = Saint(args) + if args.model.lower() == 'lastquery_post': model = LastQuery_Post(args) + if args.model.lower() == 'lastquery_pre' : model = LastQuery_Pre(args) + if args.model.lower() == 'lastquery_post_test' : model = LastQuery_Post_TEST(args) # 개발중(deprecated) + if args.model.lower() == 'tfixsaint' : model = TfixupSaint(args) # tfix-up을 적용한 Saint model.to(args.device) return model +# 배치 전처리 일반화 +def process_batch_v2(batch, args): -# 배치 전처리 -def process_batch(batch, args): - test, question, tag, correct, mask = batch + cate_cols=batch[:len(args.cate_feats)-1] #userID 빼고 + # print("cate_cols",cate_cols) + cont_cols=batch[len(args.cate_feats)-1:-2] #answercode, mask 빼고 + # print("cont_cols",cont_cols) + correct = batch[-2] + mask = batch[-1] + # change to float + mask = mask.type(torch.FloatTensor) + correct = correct.type(torch.FloatTensor) + # interaction을 임시적으로 correct를 한칸 우측으로 이동한 것으로 사용 + # saint의 경우 decoder에 들어가는 input이다 + interaction = correct + 1 # 패딩을 위해 correct값에 1을 더해준다. 
+ interaction = interaction.roll(shifts=1, dims=1) + # interaction[:, 0] = 0 # set padding index to the first sequence + interaction_mask = mask.roll(shifts=1, dims=1) + interaction_mask[:, 0] = 0 + interaction = (interaction * interaction_mask).to(torch.int64) + + # cate features masking + for i in range(len(args.cate_feats)-1): + batch[i] = ((batch[i]+1)*mask).to(torch.int64) + # cont features masking + for j in range(len(args.cate_feats)-1,len(batch)-2): + batch[j] = batch[j].type(torch.FloatTensor) + batch[j] = ((batch[j]+1)*mask).to(torch.float32) + + # gather index + # 마지막 sequence만 사용하기 위한 index + gather_index = torch.tensor(np.count_nonzero(mask, axis=1)) + gather_index = gather_index.view(-1, 1) - 1 + # feature들 device에 load + for i in range(len(batch)): + batch[i] = batch[i].to(args.device) + + correct = correct.to(args.device) + mask=mask.to(args.device) + interaction=interaction.to(args.device) + gather_index = gather_index.to(args.device) + #userID, answerCode 제거, answer는 이미 get_target에서 갖고갔음 + ret = batch[:len(args.cate_feats)-1]+batch[len(args.cate_feats)-1:len(batch)-2] + ret.append(interaction) + ret.append(mask) + ret.append(gather_index) + ret.append(correct) + # print("모델로 넘기는 ret 출력",ret) + # print(f"최댓값 : {m}") + return tuple(ret) #tuple(cate + cont + interaction + mask + gather_index + correct) + +# 배치 전처리(기본 feature만 쓸 때, baseline) +def process_batch(batch, args): + # print("배치",batch) + # print('process batch의 사이즈',len(batch)) + test, question, tag, solve_time,correct, mask = batch + + # print("시간",solve_time) + # test, question, tag, correct, mask = batch # base + # print(type(batch)) # change to float + solve_time = solve_time.type(torch.FloatTensor) mask = mask.type(torch.FloatTensor) correct = correct.type(torch.FloatTensor) @@ -400,7 +573,6 @@ def process_batch(batch, args): interaction에서 rolling의 이유 - 이전 time_step에서 푼 문제를 맞췄는지 틀렸는지를 현재 time step의 input으로 넣기 위해서 rolling을 사용한다. """ - # interaction을 임시적으로 correct를 한칸 우측으로 이동한 것으로 사용 # saint의 경우 decoder에 들어가는 input이다 interaction = correct + 1 # 패딩을 위해 correct값에 1을 더해준다. @@ -414,7 +586,7 @@ def process_batch(batch, args): test = ((test + 1) * mask).to(torch.int64) question = ((question + 1) * mask).to(torch.int64) tag = ((tag + 1) * mask).to(torch.int64) - + solve_time=((solve_time + 1) * mask).to(torch.float32) # gather index # 마지막 sequence만 사용하기 위한 index gather_index = torch.tensor(np.count_nonzero(mask, axis=1)) @@ -433,11 +605,13 @@ def process_batch(batch, args): interaction = interaction.to(args.device) gather_index = gather_index.to(args.device) - + solve_time=solve_time.to(args.device) + # dev + # return (test, question, + # tag, correct, mask, + # interaction, gather_index) # base return (test, question, - tag, correct, mask, - interaction, gather_index) - + tag, correct, mask, interaction, solve_time, gather_index) # loss계산하고 parameter update! def compute_loss(preds, targets): @@ -468,12 +642,11 @@ def save_checkpoint(state, model_dir, model_filename): torch.save(state, os.path.join(model_dir, model_filename)) - def load_model(args): model_path = os.path.join(args.model_dir, f'{args.task_name}.pt') print("Loading Model from:", model_path) load_state = torch.load(model_path) - model = get_model(args, args.model) + model = get_model(args) # 1. 
load model state model.load_state_dict(load_state['state_dict'], strict=True) @@ -482,12 +655,11 @@ def load_model(args): return model - def load_model_kfold(args, fold): model_path = os.path.join((args.model_dir + args.task_name), f'{args.task_name}_{fold+1}fold.pt') print("Loading Model from:", model_path) load_state = torch.load(model_path) - model = get_model(args, args.model) + model = get_model(args) # 1. load model state model.load_state_dict(load_state['state_dict'], strict=True) @@ -497,7 +669,11 @@ def load_model_kfold(args, fold): -def get_target(datas): + + + + +def get_target(datas): #처리하기 전에 get_data_from_file 에서 맨마지막 answer이므로 바뀔일 없음 targets = [] for data in datas: targets.append(data[-1][-1]) diff --git a/ensemble/.gitignore b/ensemble/.gitignore new file mode 100644 index 0000000..24ede34 --- /dev/null +++ b/ensemble/.gitignore @@ -0,0 +1,2 @@ +*.csv +.ipynb_checkpoints \ No newline at end of file diff --git a/inference.py b/inference.py index 8afcfe3..d4df2f8 100644 --- a/inference.py +++ b/inference.py @@ -1,35 +1,30 @@ import os -import yaml -import argparse -from attrdict import AttrDict - - +from args import parse_args from dkt.dataloader import Preprocess from dkt import trainer import torch - +import yaml +import argparse +from attrdict import AttrDict def main(args): device = "cuda" if torch.cuda.is_available() else "cpu" args.device = device - preprocess = Preprocess(args) +# args.infer = True preprocess.load_test_data(args.test_file_name) - test_data = preprocess.get_test_data() - + test_data, test_uid_df = preprocess.get_test_data() - trainer.inference_kfold(args, test_data) - + if args.use_kfold: + trainer.inference_kfold(args, test_data, test_uid_df) + else : + trainer.inference(args, test_data) + if __name__ == "__main__": - parser = argparse.ArgumentParser() - parser.add_argument('-c', '--conf', default='/opt/ml/git/p4-dkt-ollehdkt/conf.yml', help='wrtie configuration file root.') - parser.add_argument('-t', '--task', default='', help='wrtie task_dir root.') - - term_args = parser.parse_args() - with open(term_args.conf) as f: + with open('/opt/ml/code/conf.yml') as f: cf = yaml.load(f, Loader=yaml.FullLoader) args = AttrDict(cf) # args = parse_args(mode='train') diff --git a/lgbm_utils.py b/lgbm_utils.py index 1d07e7a..e41572f 100644 --- a/lgbm_utils.py +++ b/lgbm_utils.py @@ -1,37 +1,185 @@ +import warnings +warnings.filterwarnings('ignore') + +from matplotlib import rc +import matplotlib.pyplot as plt import json import pandas as pd -import os +import os,gc import random from attrdict import AttrDict # !pip install ligthgbm import lightgbm as lgb -from sklearn.metrics import roc_auc_score -from sklearn.metrics import accuracy_score +from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, roc_curve, accuracy_score +from sklearn.preprocessing import MinMaxScaler, LabelEncoder +from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold, GroupKFold +from sklearn.impute import SimpleImputer import numpy as np +from collections import defaultdict -def make_lgbm_feature(df): - #유저들의 문제 풀이수, 정답 수, 정답률을 시간순으로 누적해서 계산 - df['user_correct_answer'] = df.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1)) - df['user_total_answer'] = df.groupby('userID')['answerCode'].cumcount() - df['user_acc'] = df['user_correct_answer']/df['user_total_answer'] +def make_sharing_feature(args): + """[use train+test(except last row) get pre_processed feature] + + Args: + args 
([conf.yaml]): [args를 전달받는다] + """ + + csv_file_path = os.path.join(args.data_dir, args.file_name) + df = pd.read_csv(csv_file_path)#, nrows=100000) + csv_file_path = os.path.join(args.data_dir, args.test_file_name) + tdf = pd.read_csv(csv_file_path)#, nrows=100000) + + if args.use_distance: + df['distance']=np.load('/opt/ml/np_train_tag_distance_arr.npy') + tdf['distance']=np.load('/opt/ml/np_test_tag_distance_arr.npy') + + tdf=tdf[tdf['userID']==tdf['userID'].shift(-1)] + df=pd.concat([df,tdf],ignore_index=True) + return df + + +def get_sharing_feature(args): + if args.make_sharing_feature: + df=make_sharing_feature(args) + else : + csv_file_path = os.path.join(args.data_dir, args.file_name) + df = pd.read_csv(csv_file_path)#, nrows=100000) + # trian에서 각 문제 평균 뽑기 + testId_mean_sum = df.groupby(['testId'])['answerCode'].agg(['mean','sum']).to_dict() + assessmentItemID_mean_sum = df.groupby(['assessmentItemID'])['answerCode'].agg(['mean', 'sum']).to_dict() + KnowledgeTag_mean_sum = df.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum']).to_dict() + + # 시간 피처 + testId_time_agg = df.groupby(['testId'])['solve_time'].agg(['mean','std','skew']).to_dict() + assessment_time_agg=df.groupby(['assessmentItemID'])['solve_time'].agg(['mean','std','skew']).to_dict() + KnowledgeTag_time_agg = df.groupby(['KnowledgeTag'])['solve_time'].agg(['mean','std','skew']).to_dict() + #해당 문제를 맞은사람의 평균시간과 틀린사람의 평균시간 + a_t_rate_df=df.groupby(['assessmentItemID','answerCode']).agg({'solve_time':'mean'}).reset_index(drop=False) + assess_time_corNwrong_agg=a_t_rate_df.groupby('assessmentItemID')['solve_time'].agg(['first','last']).to_dict() + print("lgbm feature처럼 사용") + return testId_mean_sum, assessmentItemID_mean_sum, KnowledgeTag_mean_sum,testId_time_agg,assessment_time_agg,KnowledgeTag_time_agg,assess_time_corNwrong_agg + +def make_lgbm_feature(args, df,is_train=True): + testId_mean_sum, assessmentItemID_mean_sum, KnowledgeTag_mean_sum,testId_time_agg,assessment_time_agg,KnowledgeTag_time_agg,assess_time_corNwrong_agg=get_sharing_feature(args) + + #문제별 맞은사람과 틀린사람의 평균풀이시간 + df['wrongP_time']=df.assessmentItemID.map(assess_time_corNwrong_agg['first']) + df['correctP_time']=df.assessmentItemID.map(assess_time_corNwrong_agg['last']) + + item_size = df[['assessmentItemID', 'testId']].drop_duplicates().groupby('testId').size() + testId2maxlen = item_size.to_dict() # 중복해서 풀이할 놈들을 제거하기 위해 + + df["test_mean"] = df.testId.map(testId_mean_sum['mean']) + df['test_sum'] = df.testId.map(testId_mean_sum['sum']) + df["ItemID_mean"] = df.assessmentItemID.map(assessmentItemID_mean_sum['mean']) + df['ItemID_sum'] = df.assessmentItemID.map(assessmentItemID_mean_sum['sum']) + df["tag_mean"] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['mean']) + df['tag_sum'] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['sum']) + + df['test_t_mean']= df.testId.map(testId_time_agg['mean']) + df['test_t_std']= df.testId.map(testId_time_agg['std']) + df['test_t_skew']= df.testId.map(testId_time_agg['skew']) + df['assess_t_mean']= df.assessmentItemID.map(assessment_time_agg['mean']) + df['assess_t_std']= df.assessmentItemID.map(assessment_time_agg['std']) + df['assess_t_skew']= df.assessmentItemID.map(assessment_time_agg['skew']) + df['tag_t_mean']= df.KnowledgeTag.map(KnowledgeTag_time_agg['mean']) + df['tag_t_std']= df.KnowledgeTag.map(KnowledgeTag_time_agg['std']) + df['tag_t_skew']= df.KnowledgeTag.map(KnowledgeTag_time_agg['skew']) + ###서일님 피처 + # 유저가 푼 시험지에 대해, 유저의 전체 정답/풀이횟수/정답률 계산 (3번 풀었으면 3배) + df_group = 
df.groupby(['userID','testId'])['answerCode'] + df['user_total_correct_cnt'] = df_group.transform(lambda x: x.cumsum().shift(1)) + df['user_total_ans_cnt'] = df_group.cumcount() + df['user_total_acc'] = df['user_total_correct_cnt'] / df['user_total_ans_cnt'] + # 유저가 푼 시험지에 대해, 유저의 풀이 순서 계산 (시험지를 반복해서 풀었어도, 누적되지 않음) + # 특정 시험지를 얼마나 반복하여 풀었는지 계산 ( 2번 풀었다면, retest == 1) + df['test_size'] = df.testId.map(testId2maxlen) + df['retest'] = df['user_total_ans_cnt'] // df['test_size'] + df['user_test_ans_cnt'] = df['user_total_ans_cnt'] % df['test_size'] + + # 각 시험지 당 유저의 정확도를 계산 + df['user_test_correct_cnt'] = df.groupby(['userID','testId','retest'])['answerCode'].transform(lambda x: x.cumsum().shift(1)) + df['user_acc'] = df['user_test_correct_cnt']/df['user_test_ans_cnt'] + + ###서일님 피처 + + #sequential feature in here + #유저들의 문제 풀이수, 정답 수, 정답률을 시간순으로 누적해서 계산 + # df['user_correct_answer'] = df.groupby('userID')['answerCode'].transform(lambda x: x.cumsum().shift(1)) + # df['user_total_answer'] = df.groupby('userID')['answerCode'].cumcount() + # df['user_acc'] = df['user_correct_answer']/df['user_total_answer'] + #학생의 학년을 정하고 푼 문제지의 학년합을 구해본다 + df['test_level']=df['assessmentItemID'].apply(lambda x:int(x[2])) + #문제번호 + df['problem_number']=df['assessmentItemID'].apply(lambda x:int(x[-3:])) + #시간 + # df['year_month']=pd.to_datetime(df['Timestamp'], format="").dt.strftime('%Y-%m') + # print(time.dt.strftime('%Y-%m-%d')) + df['test_tag_cumsum']=df.groupby(['userID','testId']).agg({'solve_time':'cumsum'}) + # non-sequential feature in here # testId와 KnowledgeTag의 전체 정답률은 한번에 계산 # 아래 데이터는 제출용 데이터셋에 대해서도 재사용 - correct_t = df.groupby(['testId'])['answerCode'].agg(['mean', 'sum']) - correct_t.columns = ["test_mean", 'test_sum'] - correct_k = df.groupby(['KnowledgeTag'])['answerCode'].agg(['mean', 'sum']) - correct_k.columns = ["tag_mean", 'tag_sum'] + group_list=['userID'] + + #문제 태그로 groupby했을 때 적용할 함수들 + # tag_agg_dict={ + # # 'answerCode': ['count','mean', 'sum'], #태그개수,태그별 정답률, 해당 태그를 맞춘 개수 + # # 'userID' :['nunique'], #해당 태그를 풀이한 유저의 수(인기도), + # # 'assessmentItemID':['nunique'], #해당 태그가 얼마나 여러 문제번호에 분포돼있는지, 왜도(문제지의 어느부분인지) + # 'solve_time' :['mean','std','skew'], + # } + + #시험지번호로 groupby했을 때 적용할 함수들 + # test_agg_dict={ + # 'solve_time' :['mean','std','skew'], + # # 'answerCode': ['count','mean', 'sum'], #시험지별 제출개수 ,시험지별 정답률, 해당 시험지를 풀이하여 맞은 개수 + # # 'userID' :['nunique'], #해당 시험지를 풀이한 유저의 수(인기도), + # # 'assessmentItemID':['nunique'], #해당 시험지에 문제가 얼마나 분포돼있는지, 왜도(문제지의 어느부분인지) + + # } + + #사용자별로 groupby했을 때 적용할 함수들, answercode 관련 column은 위에서 이미 정의함 + uid_agg_dict={ + # 'assessmentItemID':['nunique'], #얼마나 많은 종류의 문제를 풀었는지, 왜도(문제지의 어느부분인지) + # 'problem_number':['skew'], + # 'year_month':[lambda x:x.value_counts().index[0]], + # 'Timestamp':['first'], + # 'test_level':['mean', 'sum','std'], + 'solve_time' :['mean','std','skew'], + } + + agg_dict_list=[uid_agg_dict] + + + for group, now_agg in zip(group_list,agg_dict_list): + grouped_df=df.groupby(group).agg(now_agg) + new_cols = [] + for col in now_agg.keys(): + for stat in now_agg[col]: + if type(stat) is str: + new_cols.append(f'{group}-{col}-{stat}') + else: + new_cols.append(f'{group}-{col}-mode') + grouped_df.columns = new_cols - df = pd.merge(df, correct_t, on=['testId'], how="left") - df = pd.merge(df, correct_k, on=['KnowledgeTag'], how="left") + grouped_df.reset_index(inplace = True) + df = df.merge(grouped_df, on=group, how='left') + #agg취한 값들이 feature가 된다 + delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp','sec_time'] + 
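make_lgbm_feature leans on one pattern throughout: aggregate once per key (testId, assessmentItemID, KnowledgeTag, userID), convert to a dict, and broadcast the statistics back onto every row with Series.map. Reduced to a toy frame:

import pandas as pd

df = pd.DataFrame({
    "testId": ["A", "A", "B", "B", "B"],
    "answerCode": [1, 0, 1, 1, 0],
})
test_stats = df.groupby("testId")["answerCode"].agg(["mean", "sum"]).to_dict()
df["test_mean"] = df.testId.map(test_stats["mean"])   # per-test accuracy on every row
df["test_sum"] = df.testId.map(test_stats["sum"])     # per-test correct count
print(df)

Mapping from a precomputed dict (rather than merging) keeps the same statistics reusable on the test set, which is why get_sharing_feature returns the dicts directly.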
+
+def lgbm_split_data(data,ratio,seed=42):
+    random.seed(seed)
-def lgbm_split_data(data,ratio):
-    random.seed(42)
     users = list(zip(data['userID'].value_counts().index, data['userID'].value_counts()))
     random.shuffle(users)
@@ -53,52 +201,84 @@ def lgbm_split_data(data,ratio):
     test = test[test['userID'] != test['userID'].shift(-1)]
     return train, test
 
+def lgbm_oof_split_data_withidx(args,data):
+    random.seed(42)
+    n_fold=args.n_fold
+
+    users = list(zip(data['userID'].value_counts().index, data['userID'].value_counts()))
+    random.shuffle(users)
+
+    user_id_dict=defaultdict(list)
+    user_count_dict=defaultdict(int)
+
+    # assign users to folds round-robin so fold sizes stay balanced
+    for idx, (user_id, count) in enumerate(users):
+        f_num=idx%n_fold
+        user_count_dict[f_num] += count
+        user_id_dict[f_num].append(user_id)
+
+    return user_id_dict,user_count_dict
+
+def get_fold_data(idx,data,user_id_dict,user_count_dict):
+
+    train = data[data['userID'].isin(user_id_dict[idx]) == False]
+    test = data[data['userID'].isin(user_id_dict[idx])]
+
+    # keep only each user's last interaction in the validation split
+    test = test[test['userID'] != test['userID'].shift(-1)]
+    return train, test
+
 def lgbm_make_test_data(data):
     data = data.drop(['answerCode'], axis=1)
     return data[data['userID'] != data['userID'].shift(-1)]
 
 def lgbm_train(args,train_data,valid_data):
     # choose the features to use
-    delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp']
+    delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp','sec_time']
     FEATS = list(set(train_data.columns)-set(delete_feats))
     print(f'using {len(FEATS)} features')
     print(FEATS)
     # FEATS = ['KnowledgeTag', 'user_correct_answer', 'user_total_answer',
     #          'user_acc', 'test_mean', 'test_sum', 'tag_mean','tag_sum']
-
+    
     # split X / y
     y_train = train_data['answerCode']
     train_data = train_data.drop(['answerCode'], axis=1)
 
     y_test = valid_data['answerCode']
     valid_data = valid_data.drop(['answerCode'], axis=1)
-
+    
     lgb_train = lgb.Dataset(train_data[FEATS], y_train)
     lgb_test = lgb.Dataset(valid_data[FEATS], y_test)
 
     model = lgb.train(
-#         {'objective': 'binary'},
         args.lgbm.model_params,
         lgb_train,
         valid_sets=[lgb_train, lgb_test],
-        verbose_eval=args.lgbm.verbose_eval, #ori 100
-        num_boost_round= args.lgbm.num_boost_round,
-        early_stopping_rounds=args.lgbm.early_stopping_rounds,
+        verbose_eval=args.model_params.verbose_eval,
+        num_boost_round=args.model_params.num_boost_round,
+        early_stopping_rounds=args.model_params.early_stopping_rounds,
     )
 
     preds = model.predict(valid_data[FEATS])
-    acc = accuracy_score(y_test, np.where(preds >= 0.5, 1, 0))
-    auc = roc_auc_score(y_test, preds)
+    auc, acc, precision, recall, f1 = get_metric(y_test, preds)
+    # acc = accuracy_score(y_test, np.where(preds >= 0.5, 1, 0))
+    # auc = roc_auc_score(y_test, preds)
 
-    print(f'VALID AUC : {auc} ACC : {acc}\n')
+    print(f'VALID AUC : {auc} ACC : {acc} precision : {precision} recall : {recall} f1 : {f1}')
 
-    _ = lgb.plot_importance(model)
+    ax = lgb.plot_importance(model,figsize=(12,7))
+    # save the feature-importance plot
+    new_output_path=f'{args.output_dir}{args.task_name}'
+    write_path = os.path.join(new_output_path, "feature_importance.png")
+    if not os.path.exists(new_output_path):
+        os.makedirs(new_output_path)
+    ax.figure.savefig(write_path,bbox_inches='tight', pad_inches=0.5)
 
-    return model,auc,acc
+    return model, auc, acc, precision, recall, f1
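All of the split helpers above rely on the same shift(-1) idiom to pull out each user's final interaction: in a frame ordered by user, a row whose userID differs from the next row's is that user's last record. A toy illustration, not repo code:

import pandas as pd

toy = pd.DataFrame({'userID': [7, 7, 7, 9, 9], 'answerCode': [1, 0, 1, 0, 0]})
last = toy[toy['userID'] != toy['userID'].shift(-1)]
print(last.index.tolist())  # [2, 4] -> the final interaction of users 7 and 9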
 
 def lgbm_inference(args,model, test_data):
-    delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp']
+    delete_feats=['userID','assessmentItemID','testId','answerCode','Timestamp','sec_time']
     FEATS = list(set(test_data.columns)-set(delete_feats))
 
     answer = model.predict(test_data[FEATS])
@@ -114,4 +294,283 @@ def lgbm_inference(args,model, test_data):
         for id, p in enumerate(answer):
             w.write('{},{}\n'.format(id,p))
-    print(f"lgbm의 예측파일이 {new_output_path}/{args.task_name}.csv 로 저장됐습니다.")
\ No newline at end of file
+    print(f"lgbm predictions saved to {new_output_path}/{args.task_name}.csv")
+
+    save_path=f"{args.output_dir}{args.task_name}/feature{len(FEATS)}_config.json"
+    json.dump(
+        FEATS,
+        open(save_path, "w"),
+        indent=2,
+        ensure_ascii=False,
+    )
+
+def lgbm_feature_preprocessing(train,features, do_imputing=True):
+    x_tr = train.copy()
+
+    # names of the categorical features
+    cate_cols = []
+
+    # label encoding
+    for f in features:
+        if x_tr[f].dtype.name == 'object': # label-encode object (str) columns
+            cate_cols.append(f)
+            le = LabelEncoder()
+            # fit the label encoder on the column
+            le.fit(list(x_tr[f].values))
+
+            # transform the column
+            x_tr[f] = le.transform(list(x_tr[f].values))
+
+
+    print('categorical feature:', cate_cols)
+
+    if do_imputing:
+        # impute missing values with the median
+        imputer = SimpleImputer(strategy='median')
+        x_tr[features] = imputer.fit_transform(x_tr[features])
+
+    return x_tr
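lgbm_feature_preprocessing fits the encoder and imputer on whatever frame it receives; when train and test are preprocessed separately, a variant like the hypothetical helper below (not in this repo) keeps the fit on train only, so no test statistics leak in. It assumes the test split shares the train split's category vocabulary.

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder

def fit_transform_pair(train_df, test_df, features):
    train_df, test_df = train_df.copy(), test_df.copy()
    for f in features:
        if train_df[f].dtype.name == 'object':
            le = LabelEncoder()
            le.fit(train_df[f].astype(str))
            train_df[f] = le.transform(train_df[f].astype(str))
            # raises ValueError on categories unseen in train
            test_df[f] = le.transform(test_df[f].astype(str))
    imputer = SimpleImputer(strategy='median')
    train_df[features] = imputer.fit_transform(train_df[features])
    test_df[features] = imputer.transform(test_df[features])
    return train_df, test_df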
print(f"Mean ACC = {acc}") # 폴드별 Validation 스코어 출력 + print(f"\nMean Precision = {precision}") # 폴드별 Validation 스코어 출력 + print(f"Mean Recall = {recall}") # 폴드별 Validation 스코어 출력 + print(f"Mean f1 = {f1}") # 폴드별 Validation 스코어 출력 + # print(f"OOF AUC = {roc_auc_score(y, y_oof)}") # Out Of Fold Validation 스코어 출력 + + # 폴드별 피처 중요도 평균값 계산해서 저장 + fi_cols = [col for col in fi.columns if 'fold_' in col] + fi['importance'] = fi[fi_cols].mean(axis=1) + + #add fig to mlflow + n=40 + color='blue' + figsize=(12,8) + + fi = fi.sort_values('importance', ascending = False).reset_index(drop = True) + + # 피처 중요도 정규화 및 누적 중요도 계산 + fi['importance_normalized'] = fi['importance'] / fi['importance'].sum() + fi['cumulative_importance'] = np.cumsum(fi['importance_normalized']) + + plt.rcParams['font.size'] = 12 + plt.style.use('fivethirtyeight') + # 피처 중요도 순으로 n개까지 바플롯으로 그리기 + fi.loc[:n, :].plot.barh(y='importance_normalized', + x='feature', color=color, + edgecolor='k', figsize=figsize, + legend=False) + + plt.xlabel('Normalized Importance', size=18); plt.ylabel(''); + plt.title(f'Top {n} Most Important Features', size=18) + plt.gca().invert_yaxis() + + new_output_path=f'{args.output_dir}{args.task_name}' + write_path = os.path.join(new_output_path, "feature_importance.png") + if not os.path.exists(new_output_path): + os.makedirs(new_output_path) + plt.savefig(write_path,bbox_inches='tight', pad_inches=0.5) + plt.close() + + return y_oof, test_preds, fi , score, acc, precision, recall, f1 + + +def make_lgb_oof_prediction(args,train, test, features, categorical_features='auto', model_params=None, folds=None): + x_train = train[features] + y =train['answerCode'] + test = test[test['userID'] != test['userID'].shift(-1)] + x_test = test[features] + + # 테스트 데이터 예측값을 저장할 변수 + test_preds = np.zeros(x_test.shape[0]) + + # Out Of Fold Validation 예측 데이터를 저장할 변수 + y_oof = np.zeros(x_train.shape[0]) + + # 폴드별 평균 Validation 스코어를 저장할 변수 + score = 0 + acc=0 + precision=0 + recall=0 + f1=0 + + # 피처 중요도를 저장할 데이터 프레임 선언 + fi = pd.DataFrame() + fi['feature'] = features + + # Stratified K Fold 선언 + skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=args.seed) + + for fold, (tr_idx, val_idx) in enumerate(skf.split(x_train, y)): + # train index, validation index로 train 데이터를 나눔 + x_tr, x_val = x_train.loc[tr_idx, features], x_train.loc[val_idx, features] + y_tr, y_val = y[tr_idx], y[val_idx] + + print(f'fold: {fold+1}, x_tr.shape: {x_tr.shape}, x_val.shape: {x_val.shape}') + + # LightGBM 데이터셋 선언 + dtrain = lgb.Dataset(x_tr, label=y_tr) + dvalid = lgb.Dataset(x_val, label=y_val) + + # LightGBM 모델 훈련 + clf = lgb.train( + model_params, + dtrain, + valid_sets=[dtrain, dvalid], # Validation 성능을 측정할 수 있도록 설정 + categorical_feature=categorical_features, + verbose_eval=args.lgbm.verbose_eval, + num_boost_round=args.lgbm.num_boost_round, + early_stopping_rounds=args.lgbm.early_stopping_rounds, + ) + + # Validation 데이터 예측 + val_preds = clf.predict(x_val) + + # Validation index에 예측값 저장 + y_oof[val_idx] = val_preds + + fold_auc, fold_acc ,fold_precision,fold_recall,fold_f1 = get_metric(y_val, list(map(round,val_preds))) + + # 폴드별 Validation 스코어 측정 + print(f"Fold {fold + 1} | AUC: {roc_auc_score(y_val, val_preds)} | ACC: {fold_acc} | Precision: {fold_precision} | Recall: {fold_recall} | f1: {fold_f1}") + print('-'*80) + + # score 변수에 폴드별 평균 Validation 스코어 저장 + score += fold_auc / folds + acc+=fold_acc / folds + precision=fold_precision / folds + recall = fold_recall / folds + f1 = fold_f1 / folds + # 테스트 데이터 예측하고 평균해서 저장 + test_preds += 
+
+
+def make_lgb_oof_prediction(args,train, test, features, categorical_features='auto', model_params=None, folds=None):
+    x_train = train[features]
+    y = train['answerCode']
+    test = test[test['userID'] != test['userID'].shift(-1)]
+    x_test = test[features]
+
+    # accumulator for the averaged test predictions
+    test_preds = np.zeros(x_test.shape[0])
+
+    # out-of-fold validation predictions
+    y_oof = np.zeros(x_train.shape[0])
+
+    # running means of the per-fold validation scores
+    score = 0
+    acc=0
+    precision=0
+    recall=0
+    f1=0
+
+    # frame for the per-fold feature importances
+    fi = pd.DataFrame()
+    fi['feature'] = features
+
+    # stratified K-fold splitter
+    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=args.seed)
+
+    for fold, (tr_idx, val_idx) in enumerate(skf.split(x_train, y)):
+        # split train data into train / validation folds
+        x_tr, x_val = x_train.loc[tr_idx, features], x_train.loc[val_idx, features]
+        y_tr, y_val = y[tr_idx], y[val_idx]
+
+        print(f'fold: {fold+1}, x_tr.shape: {x_tr.shape}, x_val.shape: {x_val.shape}')
+
+        # build the LightGBM datasets
+        dtrain = lgb.Dataset(x_tr, label=y_tr)
+        dvalid = lgb.Dataset(x_val, label=y_val)
+
+        # train the LightGBM model
+        clf = lgb.train(
+            model_params,
+            dtrain,
+            valid_sets=[dtrain, dvalid], # track validation performance
+            categorical_feature=categorical_features,
+            verbose_eval=args.lgbm.verbose_eval,
+            num_boost_round=args.lgbm.num_boost_round,
+            early_stopping_rounds=args.lgbm.early_stopping_rounds,
+        )
+
+        # predict on the validation fold
+        val_preds = clf.predict(x_val)
+
+        # store predictions at the validation indices
+        y_oof[val_idx] = val_preds
+
+        # get_metric expects raw probabilities (it rounds internally for the threshold metrics)
+        fold_auc, fold_acc, fold_precision, fold_recall, fold_f1 = get_metric(y_val, val_preds)
+
+        # per-fold validation scores
+        print(f"Fold {fold + 1} | AUC: {fold_auc} | ACC: {fold_acc} | Precision: {fold_precision} | Recall: {fold_recall} | f1: {fold_f1}")
+        print('-'*80)
+
+        # accumulate the per-fold scores into running means
+        score += fold_auc / folds
+        acc += fold_acc / folds
+        precision += fold_precision / folds
+        recall += fold_recall / folds
+        f1 += fold_f1 / folds
+        # predict on test and average across folds
+        test_preds += clf.predict(x_test) / folds
+
+        # store this fold's feature importances
+        fi[f'fold_{fold+1}'] = clf.feature_importance()
+
+        del x_tr, x_val, y_tr, y_val
+        gc.collect()
+
+    print(f"\nMean AUC = {score}") # mean validation scores across folds
+    print(f"Mean ACC = {acc}")
+    print(f"\nMean Precision = {precision}")
+    print(f"Mean Recall = {recall}")
+    print(f"Mean f1 = {f1}")
+    # print(f"OOF AUC = {roc_auc_score(y, y_oof)}") # out-of-fold validation score
+
+    # average the feature importances across folds
+    fi_cols = [col for col in fi.columns if 'fold_' in col]
+    fi['importance'] = fi[fi_cols].mean(axis=1)
+
+    #add fig to mlflow
+    n=40
+    color='blue'
+    figsize=(12,8)
+
+    fi = fi.sort_values('importance', ascending = False).reset_index(drop = True)
+
+    # normalized and cumulative feature importances
+    fi['importance_normalized'] = fi['importance'] / fi['importance'].sum()
+    fi['cumulative_importance'] = np.cumsum(fi['importance_normalized'])
+
+    plt.rcParams['font.size'] = 12
+    plt.style.use('fivethirtyeight')
+    # bar plot of the top-n features by importance
+    fi.loc[:n, :].plot.barh(y='importance_normalized',
+                            x='feature', color=color,
+                            edgecolor='k', figsize=figsize,
+                            legend=False)
+
+    plt.xlabel('Normalized Importance', size=18); plt.ylabel('');
+    plt.title(f'Top {n} Most Important Features', size=18)
+    plt.gca().invert_yaxis()
+
+    new_output_path=f'{args.output_dir}{args.task_name}'
+    write_path = os.path.join(new_output_path, "feature_importance.png")
+    if not os.path.exists(new_output_path):
+        os.makedirs(new_output_path)
+    plt.savefig(write_path,bbox_inches='tight', pad_inches=0.5)
+    plt.close()
+
+    return y_oof, test_preds, fi, score, acc, precision, recall, f1
+
+
+def get_metric(targets, preds):
+    auc = roc_auc_score(targets, preds)
+    acc = accuracy_score(targets, list(map(round,preds)))
+    precision = precision_score(targets, list(map(round,preds)))
+    recall = recall_score(targets, list(map(round,preds)))
+    f1 = f1_score(targets, list(map(round,preds)))
+
+    return auc, acc, precision, recall, f1
\ No newline at end of file
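One subtlety in get_metric: Python's built-in round() rounds halves to the nearest even integer, so a prediction of exactly 0.5 becomes class 0, whereas the np.where(preds >= 0.5, 1, 0) rule it replaced maps 0.5 to class 1. The two binarizations disagree only on that boundary value:

import numpy as np

print(round(0.5), round(1.5), round(0.51))           # 0 2 1  (banker's rounding)
print(np.where(np.array([0.5, 0.51]) >= 0.5, 1, 0))  # [1 1]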
diff --git a/makefeature.py b/makefeature.py
new file mode 100644
index 0000000..b73489a
--- /dev/null
+++ b/makefeature.py
@@ -0,0 +1,88 @@
+import warnings
+warnings.filterwarnings('ignore')
+
+from matplotlib import rc
+import matplotlib.pyplot as plt
+import json
+import pandas as pd
+import os,gc
+import random
+from attrdict import AttrDict
+# !pip install lightgbm
+import lightgbm as lgb
+from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, precision_recall_curve, roc_curve, accuracy_score
+from sklearn.preprocessing import MinMaxScaler, LabelEncoder
+from sklearn.model_selection import train_test_split, cross_val_score, KFold, StratifiedKFold, GroupKFold
+from sklearn.impute import SimpleImputer
+import numpy as np
+from collections import defaultdict
+
+from lgbm_utils import *
+
+def make_feature(args,df):
+    testId_mean_sum, assessmentItemID_mean_sum, KnowledgeTag_mean_sum,testId_time_agg,assessment_time_agg,KnowledgeTag_time_agg,assess_time_corNwrong_agg=get_sharing_feature(args)
+
+    # per-item mean solve time of users who got it wrong vs. right
+    df['wrongP_time']=df.assessmentItemID.map(assess_time_corNwrong_agg['first'])
+    df['correctP_time']=df.assessmentItemID.map(assess_time_corNwrong_agg['last'])
+
+    item_size = df[['assessmentItemID', 'testId']].drop_duplicates().groupby('testId').size()
+    testId2maxlen = item_size.to_dict() # test length, used to detect repeated attempts
+
+    df["test_mean"] = df.testId.map(testId_mean_sum['mean'])
+    df['test_sum'] = df.testId.map(testId_mean_sum['sum'])
+    df["ItemID_mean"] = df.assessmentItemID.map(assessmentItemID_mean_sum['mean'])
+    df['ItemID_sum'] = df.assessmentItemID.map(assessmentItemID_mean_sum['sum'])
+    df["tag_mean"] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['mean'])
+    df['tag_sum'] = df.KnowledgeTag.map(KnowledgeTag_mean_sum['sum'])
+
+    df['test_t_mean']= df.testId.map(testId_time_agg['mean'])
+    df['test_t_std']= df.testId.map(testId_time_agg['std'])
+    df['test_t_skew']= df.testId.map(testId_time_agg['skew'])
+    df['assess_t_mean']= df.assessmentItemID.map(assessment_time_agg['mean'])
+    df['assess_t_std']= df.assessmentItemID.map(assessment_time_agg['std'])
+    df['assess_t_skew']= df.assessmentItemID.map(assessment_time_agg['skew'])
+    df['tag_t_mean']= df.KnowledgeTag.map(KnowledgeTag_time_agg['mean'])
+    df['tag_t_std']= df.KnowledgeTag.map(KnowledgeTag_time_agg['std'])
+    df['tag_t_skew']= df.KnowledgeTag.map(KnowledgeTag_time_agg['skew'])
+    ### Seoil's features
+
+    df['test_tag_cumsum']=df.groupby(['userID','testId'])['solve_time'].cumsum()
+
+    # per (user, test): cumulative corrects / attempts / accuracy (counted again on each repeat, so 3 solves count 3x)
+    df_group = df.groupby(['userID','testId'])['answerCode']
+    df['user_total_correct_cnt'] = df_group.transform(lambda x: x.cumsum().shift(1))
+    df['user_total_ans_cnt'] = df_group.cumcount()
+    df['user_total_acc'] = df['user_total_correct_cnt'] / df['user_total_ans_cnt']
+
+    # grade level parsed from the assessment ID
+    df['test_level']=df['assessmentItemID'].apply(lambda x:str(x[2]))
+    # problem number
+    df['problem_number']=df['assessmentItemID'].apply(lambda x:str(x[-3:]))
+
+    group_list=['userID']
+    uid_agg_dict={
+        'solve_time' :['mean','std','skew'],
+    }
+
+    agg_dict_list=[uid_agg_dict]
+
+
+    for group, now_agg in zip(group_list,agg_dict_list):
+        grouped_df=df.groupby(group).agg(now_agg)
+        new_cols = []
+        for col in now_agg.keys():
+            for stat in now_agg[col]:
+                if type(stat) is str:
+                    new_cols.append(f'{group}-{col}-{stat}')
+                else:
+                    new_cols.append(f'{group}-{col}-mode')
+
+        grouped_df.columns = new_cols
+
+        grouped_df.reset_index(inplace = True)
+        df = df.merge(grouped_df, on=group, how='left')
+    #delete null
+    df.isnull().sum()
+    df = df.fillna(0)
+    return df
\ No newline at end of file
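The slicing in test_level and problem_number assumes the competition's fixed-width assessmentItemID layout; the ID below is an illustrative value rather than a row from the data. Under that assumption, the third character carries the level digit and the last three characters the problem number.

item_id = 'A080037001'  # illustrative assessmentItemID, not taken from the dataset
print(item_id[2])       # '8'   -> test_level
print(item_id[-3:])     # '001' -> problem_number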
diff --git a/pre_FE.py b/pre_FE.py
new file mode 100644
index 0000000..40fcea5
--- /dev/null
+++ b/pre_FE.py
@@ -0,0 +1,72 @@
+import time
+from collections import defaultdict
+from datetime import datetime
+
+import numpy as np
+import pandas as pd
+
+def user_tag_ansrate_feature(df):
+    tag_ansrate=[]
+    tag_len=[]
+    for uid in df['userID'].unique():
+        interactions=df[df['userID']==uid]
+        user_tag_dict=defaultdict(list)
+        for idx in range(len(interactions)):
+            tag=interactions.iloc[idx]['KnowledgeTag']
+            answer=interactions.iloc[idx]['answerCode']
+            if idx==0 or len(user_tag_dict[tag])==0:
+                tag_ansrate.append(0)
+            else:
+                tag_ansrate.append(sum(user_tag_dict[tag])/len(user_tag_dict[tag]))
+
+
+            tag_len.append(len(user_tag_dict[tag]))
+            user_tag_dict[tag].append(0 if answer==-1 else answer)
+
+    print(len(tag_ansrate))
+    print(len(tag_len))
+    # returns, per interaction, how many times the user has seen the current tag
+    # and the user's running answer rate on that tag
+    return tag_ansrate,tag_len
+
+def total_tag_ans_rate_feature(df):
+    # per-tag answer rate ('percentile' is an aggregation assumed to be defined elsewhere)
+    tag_groupby = df.groupby('KnowledgeTag').agg({
+        'answerCode': percentile
+    }).reset_index(drop=False)
+    tag_groupby
+    tag_ansrate=zip(tag_groupby['KnowledgeTag'],tag_groupby['answerCode'])
+    tag_ansrate_dict=dict(list(tag_ansrate))
+    return df['KnowledgeTag'].apply(lambda x:tag_ansrate_dict[x])
+
+# compute each interaction's solve time
+def make_solve_time(df):
+    def convert_time(s):
+        timestamp = time.mktime(datetime.strptime(s, '%Y-%m-%d %H:%M:%S').timetuple())
+        return int(timestamp)
+    df=df.copy()
+    df['sec_time']=df['Timestamp'].apply(convert_time)
+    df['solve_time']=df['sec_time']-df['sec_time'].shift(1)
+    return df
+
+
+# interpolate the first value of each group produced by agg
+def get_interpolate(s):
+    s=s.values
+#     print(s[0])
+    if s[0]>3600 or pd.isnull(s[0]) or s[0]<0:
+        s[0]=np.nan
+        s=pd.Series(s)
+        s=s[::-1].interpolate()[::-1]
+
+    return s[0]
+
+# add interpolated solve_time to df
+def make_timecv(df):
+    answer=pd.DataFrame()
+    for user in df['userID'].unique():
+        interactions=df[df['userID']==user]
+        interactions.sort_values(by=['testId','Timestamp'], inplace=True)
+        inter_df=interactions.groupby('testId').agg({'solve_time':lambda x: get_interpolate(x)}).reset_index(drop=False)
+#         print(inter_df)
+        # interpolated time per testId, stored as a dict
+        inter_time_dict=dict(zip(inter_df['testId'],inter_df['solve_time']))
+        need_inter_interactions=interactions[interactions['testId']!=interactions['testId'].shift(1)]
+        need_inter_interactions['solve_time']=need_inter_interactions['testId'].apply(lambda x:inter_time_dict[x])
+        under_interactions=interactions[interactions['testId']==interactions['testId'].shift(1)]
+        total_user=pd.concat([need_inter_interactions,under_interactions], ignore_index=False)
+        answer=pd.concat([answer,total_user], ignore_index=False)
+    answer.sort_values(by=['userID','Timestamp'], inplace=True)
+    return answer
\ No newline at end of file
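make_solve_time diffs sec_time over the whole frame, so the first row of each user inherits a gap measured against the previous user; get_interpolate and make_timecv exist largely to repair such boundary values afterwards. If one wanted the gap contained up front instead, a per-user variant could look like this sketch (hypothetical, not in the repo):

import pandas as pd

def make_solve_time_per_user(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # NaN at each user's first interaction instead of a cross-user gap
    df['solve_time'] = df.groupby('userID')['sec_time'].diff()
    return df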
diff --git a/requirements.txt b/requirements.txt
index 80d0982..5e9593d 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,5 +3,8 @@ pandas
 sklearn
 tqdm
 wandb
-transformers
-easydict
\ No newline at end of file
+transformers==4.4.1
+easydict
+seaborn
+matplotlib
+missingno
\ No newline at end of file
diff --git a/submit.py b/submit.py
index f5cd826..18d44d1 100644
--- a/submit.py
+++ b/submit.py
@@ -25,8 +25,10 @@ def submit(user_key='', file_path = ''):
     requests.post(url=submit_url, data=body, files={'file': open(file_path, 'rb')})
 
 if __name__ == "__main__":
-    test_dir='/opt/ml'#prediction folder path
-
+    test_dir='/opt/ml/code/output/ensemble'#prediction folder path
+    print(test_dir, "- submitted the prediction file from this folder")
     # find your own key via the post below
     # http://boostcamp.stages.ai/competitions/3/discussion/post/110
-    submit("Bearer 15bdf505e0902975b2e6f578148d22136b2f7717", os.path.join(test_dir, 'answer.csv'))
+
+    # desc = "a description attempt"
+    submit("Bearer 15bdf505e0902975b2e6f578148d22136b2f7717", os.path.join(test_dir, 'prediction.csv'))
diff --git a/train.py b/train.py
index cd773c8..4e1b5c7 100644
--- a/train.py
+++ b/train.py
@@ -1,41 +1,46 @@
 import os
-import yaml
-import json
-import argparse
-from attrdict import AttrDict
-
+from args import parse_args
 from dkt.dataloader import Preprocess
 from dkt import trainer
-from dkt.utils import setSeeds
-
 import torch
+from dkt.utils import setSeeds
 import wandb
+import yaml
+import json
+import argparse
+from attrdict import AttrDict
 
 def main(args):
-    wandb.init(project=args.wandb.project, entity=args.wandb.entity)
-    wandb.run.name = args.task_name
+    if args.wandb.using:
+        wandb.init(project=args.wandb.project, entity=args.wandb.entity)
+        wandb.run.name = args.task_name + wandb.util.generate_id()
 
     setSeeds(args.seed)
     device = "cuda" if torch.cuda.is_available() else "cpu"
     args.device = device
-
+    
     preprocess = Preprocess(args)
     preprocess.load_train_data(args.file_name)
-    train_data = preprocess.get_train_data()
-
-    # train_data, valid_data = preprocess.split_data(train_data)
-    # trainer.run(args, train_data, valid_data)
+    train_data, train_uid_df = preprocess.get_train_data()
 
-    trainer.run_kfold(args, train_data)
+    preprocess.load_train_data(args.test_train_file_name)
+    test_train_data, _ = preprocess.get_train_data()
+
+    if args.use_kfold:
+        trainer.run_kfold(args, train_data, test_train_data, train_uid_df)
+    else:
+        train_data, valid_data = preprocess.split_data(train_data, ratio=args.split_ratio, seed=args.seed)
+        trainer.run(args, train_data, valid_data)
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument('-c', '--conf', default='/opt/ml/git/p4-dkt-ollehdkt/conf.yml', help='wrtie configuration file root.')
-    term_args = parser.parse_args()
+    # parser = argparse.ArgumentParser()
+    # parser.add_argument('-c', '--conf', default='/opt/ml/code/conf.yml', help='write configuration file root.')
+    # term_args = parser.parse_args()
 
-    with open(term_args.conf) as f:
+    with open('/opt/ml/code/conf.yml') as f:
         cf = yaml.load(f, Loader=yaml.FullLoader)
     args = AttrDict(cf)
     # args = parse_args(mode='train')
@@ -43,10 +48,11 @@ def main(args):
     main(args)
 
     args.pop('wandb')
-
+    
     save_path=f"{args.output_dir}{args.task_name}/exp_config.json"
     if args.model=='lgbm':
         args=args.lgbm
+
     else:
         args.pop('lgbm')
 
     json.dump(
@@ -54,4 +60,4 @@
         open(save_path, "w"),
         indent=2,
         ensure_ascii=False,
-    )
\ No newline at end of file
+    )
diff --git a/whole-in-one.py b/whole-in-one.py
index b03ec7b..16eb769 100644
--- a/whole-in-one.py
+++ b/whole-in-one.py
@@ -15,14 +15,11 @@ from inference import main as i_main
 
 if __name__ == "__main__":
-    parser = argparse.ArgumentParser()
-    parser.add_argument('-c', '--conf', default='./conf.yml', help='wrtie configuration file root.')
-    term_args = parser.parse_args()
 
-    with open(term_args.conf) as f:
+    with open('/opt/ml/code/conf.yaml') as f:
         cf = yaml.load(f, Loader=yaml.FullLoader)
     args = AttrDict(cf)
-
+    
     # args = parse_args(mode='train')
 
     os.makedirs(args.model_dir, exist_ok=True)
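Both entry points read conf.yml into an AttrDict, so every args.x.y access in this diff maps to a nested YAML key. A minimal sketch of the shape they expect: the keys are gleaned from the code above, the values are illustrative, and the real config very likely carries more fields.

from attrdict import AttrDict

args = AttrDict({
    'task_name': 'lgbm_baseline',   # illustrative values throughout
    'seed': 42,
    'model': 'lgbm',
    'use_kfold': True,
    'split_ratio': 0.7,
    'model_dir': 'models/',
    'output_dir': 'output/',
    'wandb': {'using': False, 'project': 'dkt', 'entity': 'me'},
    'lgbm': {'verbose_eval': 100, 'num_boost_round': 500,
             'early_stopping_rounds': 100,
             'model_params': {'objective': 'binary'}},
})
print(args.wandb.using, args.lgbm.model_params['objective'])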