Improves pipeline (#19)

* docs: update README
* feat: update paths and step preprocess
* feat: update init_train
* feat: update main and query
sc0v0ne · Nov 19, 2023 · efa6b31
1 parent 8cbb37f
commit efa6b31

Showing 12 changed files with 95 additions and 60 deletions.
diff --git a/README.md b/README.md
@@ -1,10 +1,31 @@
 # Blueflix  Streamlit
 
-The idea of starting a project on [Streamlit](https://streamlit.io) came from wanting to study a new tool. With ease of streamlit provides for the development of projects is incredible. I had done a project in college of a jupyter notebook being for movie recommendations, after that I continued with [Kaggle](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-netflix). Wanting to learn Streamlit why not combine the two projects. So I developed this project to apply this knowledge. I hope to write an article about the development. For now I'm still in development, for new updates.
+When I was in college I developed a project in a Jupyter notebook that used data from the [Netflix Prime Video Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) dataset. The idea was to clean and analyze this data, then build a stage where I could recommend movies and TV shows.
 
-## Datasets:
+I was very happy with the result, but I wanted more: to take this notebook and turn it into an application where I could interact with the project. So I created a personal project where I can use what I have studied and learned over time.
 
-- [Disney+ Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows)
+But something was missing: how was I going to show the result of my project? During that time I discovered [Streamlit](https://streamlit.io). Incredible! It gave me a lot of flexibility, and I can also deploy on their platform, so I can show what I built.
+
+I want to thank **Kaggle - @shivamb** for making the datasets below available. In addition to the Netflix dataset, there are 3 more.
+
+- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)
 - [Hulu Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/hulu-movies-and-tv-shows)
 - [Disney+ Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows)
 - [Amazon Prime Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows)
+
+From these 4 datasets came the idea of combining them into a single one, to expand the data and produce more recommendations. Follow the link below.
+
+[4 Services Streaming Movies and Tv Shows](https://www.kaggle.com/datasets/sc0v1n0/4-services-streaming-movies-and-tv)
+
+If you want to understand the process better, I have a post and 4 notebooks where I explain the approach.
+
+- [Post - K-Means Recommend Movies and Tv Shows ](https://dev.to/sc0v0ne/k-means-recommend-movies-and-tv-shows-156m)
+- [Hulu Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-hulu)
+- [Amazon Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-amazon-prime)
+- [Disney Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-disney)
+- [Notebook Netflix](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-netflix)
+
+
+## Conclusion
+
+This personal project is a dream I am still developing, and I want to evolve it further with the skills I acquire along the way. I won't always be adding updates, because I have other project ideas I want to pursue, but I will keep an eye on it. I hope other developers can understand my code and that I can pass on what I learned along the way. I hope you enjoyed it. If you could leave a like on my post or on my notebooks, I would really appreciate it, so I know whether you liked it. Thank you for reading this far.
diff --git a/containers/Dockerfile.preprocess → containers/Dockerfile.pipe b/containers/Dockerfile.preprocess → containers/Dockerfile.pipe
@@ -1,6 +1,6 @@
 FROM python:3.10
 
-WORKDIR /preprocess
+WORKDIR /pipe
 
 RUN pip install --upgrade pip & \
     pip install \
@@ -9,4 +9,4 @@ RUN pip install --upgrade pip & \
     scikit-learn==1.3.0 \
     joblib==1.3.2
 
-COPY pipe /preprocess
+COPY pipe /pipe
diff --git a/containers/Dockerfile.streamlit b/containers/Dockerfile.streamlit
@@ -9,8 +9,6 @@ RUN pip install --upgrade pip & \
 WORKDIR /streamlit
 
 COPY /src /streamlit/src
-#COPY /.streamlit /streamlit/.streamlit
-#COPY /main.py /streamlit
 
 EXPOSE 7999
 

diff --git a/pipe/clusters.py b/pipe/clusters.py
@@ -5,33 +5,41 @@
 from sklearn.cluster import KMeans
 
 
-def init_train(path_input, data_train, all_data):
-    print('-' * 100)
+def init_train(data_train, all_data):
+
+    print('-' * 80)
     print('Initialize Train')
-    path_train = os.path.join(path_input, 'data', 'processed',  data_train)
+
+    PATTERN_PATH = os.path.join('/pipe', 'data',)
+    path_train = os.path.join(PATTERN_PATH, 'processed',  data_train)
     X_train = np.array(pd.read_csv(path_train))
 
-    kmeans_model = KMeans(n_clusters=277, random_state=123456)
+    kmeans_model = KMeans(n_clusters=277, random_state=123456, n_init='auto')
     y_clusters = kmeans_model.fit_predict(X_train)
-
+    print('-' * 80)
+    print()
     print('Prepare new Dataframe')
-    path_dataset = os.path.join(path_input, 'data', 'processed', all_data)
+    path_dataset = os.path.join(PATTERN_PATH, 'processed', all_data)
     dataset = pd.read_csv(path_dataset)
-    dataset['clusters_genre_type'] = y_clusters
-
+    dataset['clusters_gender'] = y_clusters
+    print('-' * 80)
+    print()
     print('Save outputs')
-    OUTPUT= os.path.join(path_input, 'data', 'final')
-    
+    OUTPUT= os.path.join(PATTERN_PATH, 'final')
+
     if not os.path.exists(OUTPUT):
         os.mkdir(OUTPUT)
-    dataset.to_csv('data/final/dataset_titles_final.csv', index=False)
+
+    path_dataset = os.path.join(OUTPUT, 'dataset_titles_final')
+    dataset.to_csv(f'{path_dataset}.csv', index=False)
     print('Sucefully data final')
 
-
-    OUTPUT_MODEL = os.path.join(path_input, 'data', 'models')
+    OUTPUT_MODEL = os.path.join(PATTERN_PATH, 'models')
 
     if not os.path.exists(OUTPUT_MODEL):
         os.mkdir(OUTPUT_MODEL)
 
-    model_path = os.path.join(OUTPUT_MODEL, 'model_kmeans_20231118.pkl')
-    joblib.dump(kmeans_model, model_path)
+    model_path = os.path.join(OUTPUT_MODEL, 'model_kmeans_20231119')
+    joblib.dump(kmeans_model, f'{model_path}.pkl')
+    print('Sucefully Model')
+    print('-' * 80)
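
Review note on `pipe/clusters.py`: `init_train` guards each output directory with `os.path.exists` followed by `os.mkdir`. `os.mkdir` fails if a parent directory is missing. The sketch below shows one possible alternative, assuming the only goal is to make sure the directory exists before writing. The helper name `ensure_dir` and the temporary base directory are hypothetical and stand in for the `/pipe/data` base used in the diff.

```python
import os
import tempfile


def ensure_dir(*parts):
    """Join path parts and create the directory (and parents) if missing."""
    path = os.path.join(*parts)
    # exist_ok=True makes the call a no-op when the directory already exists.
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical stand-in for the '/pipe' base directory used in the diff.
base = tempfile.mkdtemp()
output_model = ensure_dir(base, 'data', 'models')

# Build the model file path the same way the diff does, extension included.
model_path = os.path.join(output_model, 'model_kmeans_20231119.pkl')
```

Calling `ensure_dir` a second time with the same parts returns the same path without raising, so the explicit `os.path.exists` check would no longer be needed.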
diff --git a/pipe/pipeline.py b/pipe/pipeline.py
@@ -1,15 +1,15 @@
 from preprocess import preprocess
 from clusters import init_train
 import sys
-import os
+
+
 if __name__ == '__main__':
 
     NAME_INPUT_DIR = sys.argv[1]
 
     preprocess(NAME_INPUT_DIR)
-
-    PATH_PROCESSED = os.path.join('/preprocess')
+
     DATA_TRAIN = 'train_gender.csv'
     DATA_MOVIES_SERIES = 'data_titles_processed.csv'
 
-    init_train(PATH_PROCESSED, DATA_TRAIN, DATA_MOVIES_SERIES)
+    init_train(DATA_TRAIN, DATA_MOVIES_SERIES)
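
Review note on `pipe/pipeline.py`: the script reads `sys.argv[1]` directly, which raises `IndexError` when it is run without an argument. A minimal sketch of an alternative using the standard-library `argparse` module follows. The function name `parse_args` and the argument name `input_dir` are assumptions for illustration, not part of this commit.

```python
import argparse


def parse_args(argv=None):
    """Parse the single positional argument naming the input directory."""
    parser = argparse.ArgumentParser(
        description='Run the preprocess and clustering steps.')
    # Hypothetical argument name; the diff reads sys.argv[1] directly.
    parser.add_argument(
        'input_dir',
        help="sub-directory of the data folder to read, e.g. 'raw'")
    return parser.parse_args(argv)


# Passing a list simulates the command line `python pipeline.py raw`.
args = parse_args(['raw'])
name_input_dir = args.input_dir
```

With this shape, running the script without an argument prints a usage message instead of a traceback, and `scripts/init_pipe.sh` could keep passing `${NAME_INPUT_DIR}` unchanged.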
diff --git a/pipe/preprocess.py b/pipe/preprocess.py
@@ -2,14 +2,21 @@
 import pandas as pd
 
 
-def preprocess(path_input):
-    PATTERN_PATH = os.path.join('/preprocess', 'data', path_input)
-    datasets_names = os.listdir(PATTERN_PATH)
-    print(datasets_names, flush=True)
+def preprocess(input_dir):
+
+    print('Initialize preprocess')
+    print('-' * 80)
+
+    PATTERN_PATH = os.path.join('/pipe', 'data')
+
+    path_data = os.path.join(PATTERN_PATH,  input_dir)
+    datasets_names = os.listdir(path_data)
+
+    print('Data name: ', datasets_names)
 
     all_data = []
     for dir_ in datasets_names:
-        path_file_csv = os.path.join(PATTERN_PATH, dir_)
+        path_file_csv = os.path.join(path_data, dir_)
         read_pd = pd.read_csv(path_file_csv)
         read_pd['channel_streaming'] = dir_.split('_')[0]
         all_data.append(read_pd)
@@ -41,28 +48,27 @@ def preprocess(path_input):
         data['gender_type'].apply(lambda x: x.upper())
 
     for data in all_data:
-        print(data.shape, flush=True)
+        print('Shape data: ', data.shape)
 
     data_titles = pd.concat(all_data, axis=0)
 
     data_titles['gender_type'] = data_titles['gender_type'].str.lower()
 
-
     df_split = data_titles['gender_type'].str.split(',', expand=True)
 
     df_split = df_split.fillna('-')
-    path_input
+
     for x in df_split.columns:
         df_split[x] = df_split[x].apply(lambda i: i.strip())
-    
+
     group_dummies = [df_split[d] for d in df_split.columns]
-    
+
     for x in group_dummies:
-        print(type(x))
-    
+        print('Type dummies', type(x))
+
     group_dummies = [pd.get_dummies(d, dtype='int') for d in group_dummies]
 
-    print(len(group_dummies))
+    print('Amount dummies:', len(group_dummies))
 
     group_dummies = pd.concat(group_dummies, axis=1)
 
@@ -71,16 +77,18 @@ def preprocess(path_input):
     group_dummies.drop(columns=['-'], axis=1, inplace=True)
 
     data_titles['title'] = data_titles['title'].apply(lambda x: x.lower())
-
-
-    OUTPUT= os.path.join('/preprocess', 'data', 'processed')
+
+    OUTPUT= os.path.join(PATTERN_PATH, 'processed')
     if not os.path.exists(OUTPUT):
         os.mkdir(OUTPUT)
 
-    data_titles.to_csv('/preprocess/data/processed/data_titles_processed.csv', index=False)
-    print('Sucefully Data Titles')    
-    group_dummies.to_csv('/preprocess/data/processed/train_gender.csv',
-                         index=False)
+    path_data_titles = os.path.join(OUTPUT, 'data_titles_processed')
+    data_titles.to_csv(f'{path_data_titles}.csv', index=False)
+    print('Sucefully Data Titles')
+
+    path_data_dummies = os.path.join(OUTPUT, 'train_gender')
+    group_dummies.to_csv(f'{path_data_dummies}.csv', index=False)
     print('Sucefully group_dummies')
-    print('-' * 100)
+
+    print('-' * 80)
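
Review note on `pipe/preprocess.py`: the function lowercases `gender_type`, splits it on commas, strips whitespace, and builds dummy columns that are later written to `train_gender.csv`. The sketch below is a dependency-free illustration of that idea, not the pandas code from the diff. It assumes the intent is one 0/1 column per distinct genre. The function name `multi_hot` is hypothetical, and the input strings are the `gender_type` values of the three sample rows in `dataset_titles_final.csv`.

```python
def multi_hot(genre_strings):
    """Return (sorted genre names, one 0/1 row per input string)."""
    # Lowercase, split on commas, and strip whitespace around each genre.
    split_rows = [
        [g.strip() for g in s.lower().split(',') if g.strip()]
        for s in genre_strings
    ]
    # One column per distinct genre, in alphabetical order.
    genres = sorted({g for row in split_rows for g in row})
    rows = [[1 if g in row else 0 for g in genres] for row in split_rows]
    return genres, rows


genres, rows = multi_hot([
    'Comedy, Stand Up',
    'crime, drama, thriller',
    'action, thriller',
])
# genres == ['action', 'comedy', 'crime', 'drama', 'stand up', 'thriller']
# rows[2] == [1, 0, 0, 0, 0, 1]
```

Each row of 0/1 values is the kind of feature vector a clustering model such as `KMeans` can consume.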
 
diff --git a/scripts/build.sh b/scripts/build.sh
@@ -1,2 +1,3 @@
 #!/bin/bash
-docker build . -f ./containers/Dockerfile.streamlit -t streamlit_app:latest --rm
+docker build . -f ./containers/Dockerfile.streamlit \
+    -t streamlit_app:latest --rm
diff --git a/scripts/init_pipe.sh b/scripts/init_pipe.sh
@@ -2,8 +2,9 @@
 
 NAME_INPUT_DIR=raw
 
-docker build . -f containers/Dockerfile.preprocess -t container_preprocess
+docker build . -f containers/Dockerfile.pipe \
+    -t pipe_ai --rm
 docker run -it \
-    -v ${PWD}/src/data:/preprocess/data container_preprocess \
-    python /preprocess/pipeline.py \
+    -v ${PWD}/src/data:/pipe/data pipe_ai \
+    python /pipe/pipeline.py \
     ${NAME_INPUT_DIR}
diff --git a/src/components/query.py b/src/components/query.py
@@ -24,12 +24,12 @@ def recommends(
 
     extra_cols = [x for x, y in extra_cols.items() if y]
 
-    movie = dataset[dataset['title'] == rename][['clusters_genre_type']]
+    movie = dataset[dataset['title'] == rename][['clusters_gender']]
     reset_movie = movie.reset_index()
-    reset_movie = reset_movie.at[0, 'clusters_genre_type']
+    reset_movie = reset_movie.at[0, 'clusters_gender']
     k_id = int(reset_movie)
     cols_view = ['title', 'gender_type'] + extra_cols
-    result = dataset[dataset['clusters_genre_type'] == k_id][cols_view][:int(top_n)]
+    result = dataset[dataset['clusters_gender'] == k_id][cols_view][:int(top_n)]
     result.set_index('title')
 
     return response_recommends(result)
@@ -50,7 +50,7 @@ def recommender_by_gender(dataset, options: list, cols: list):
     pred_test = model.predict(y_input)
     pred_test[0]
 
-    result = dataset[dataset['clusters_genre_type'] == pred_test[0] ][['title', 'gender_type', 'channel_streaming']]
+    result = dataset[dataset['clusters_gender'] == pred_test[0] ][['title', 'gender_type', 'channel_streaming']]
     result.set_index('title')
 
     return response_recommends(result)
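
Review note on `src/components/query.py`: `recommends` looks up the cluster id of the requested title and returns rows from the same cluster, but `reset_movie.at[0, 'clusters_gender']` would fail when no row matches the title. The sketch below shows the same lookup on plain dictionaries with a guard for that case. It is an illustration under assumptions, not the project's implementation: `recommend_by_title` and the `'example thriller'` row are hypothetical, while the other two titles and their cluster ids come from the sample rows in `dataset_titles_final.csv`.

```python
def recommend_by_title(rows, title, top_n=5):
    """Return up to top_n rows that share the cluster of the given title."""
    match = next((r for r in rows if r['title'] == title), None)
    if match is None:
        # Unknown title: return an empty list instead of raising.
        return []
    k_id = match['clusters_gender']
    # Like the diff, this keeps the queried title itself in the result.
    return [r for r in rows if r['clusters_gender'] == k_id][:top_n]


catalog = [
    {'title': 'silent night', 'clusters_gender': 58},
    {'title': 'the marksman', 'clusters_gender': 18},
    # Hypothetical row added so the cluster has more than one member.
    {'title': 'example thriller', 'clusters_gender': 18},
]
same_cluster = recommend_by_title(catalog, 'the marksman', top_n=2)
missing = recommend_by_title(catalog, 'not in catalog')
```

Separately, `result.set_index('title')` in both functions returns a new DataFrame and the return value is discarded, so the index of `result` is unchanged when it reaches `response_recommends`.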
diff --git a/src/data/final/dataset_titles_final.csv b/src/data/final/dataset_titles_final.csv
@@ -1,4 +1,4 @@
-movie_or_serie,title,director,cast,country,date_added_platform,release_year,duration_seconds,gender_type,description,channel_streaming,clusters_genre_type
+movie_or_serie,title,director,cast,country,date_added_platform,release_year,duration_seconds,gender_type,description,channel_streaming,clusters_gender
 Movie,ricky velez: here's everything,uninformed director,uninformed cast,uninformed country,"October 24, 2021",2021,,"comedy, stand up",Comedian Ricky Velez bares it all with his honest lens and down to earth perspective in his first-ever HBO stand-up special.,hulu,8
 Movie,silent night,uninformed director,uninformed cast,uninformed country,"October 23, 2021",2020,94 min,"crime, drama, thriller","Mark, a low end South London hitman recently released from prison, tries to go straight for his daughter, but gets drawn back in by Alan, his former cellmate, to do one final job.",hulu,58
 Movie,the marksman,uninformed director,uninformed cast,uninformed country,"October 23, 2021",2021,108 min,"action, thriller",A hardened Arizona rancher tries to protect an 11-year-old migrant boy fleeing from a ruthless drug cartel.,hulu,18

diff --git a/src/data/models/model_kmeans_20231119.pkl b/src/data/models/model_kmeans_20231119.pkl
diff --git a/src/main.py b/src/main.py
@@ -67,12 +67,10 @@ def load_data(dir_data: str, name_dataset: str):
 
 st.markdown('------------------------')
 
-st.title(' Select By Gender 📽️🍿 ')
-
-
+st.title(' Select by Genre 📽️🍿 ')
 
 options = st.multiselect(
-    'Select by Gender',
+    'In the menu select up to 3 types of movie or TV show genres:',
     cols,
     max_selections=3,
     )