Improves pipeline (#19)
* docs: update README

* feat: update paths and step preprocess

* feat: update init_train

* feat: update main and query
sc0v0ne authored Nov 19, 2023
1 parent 8cbb37f commit efa6b31
Showing 12 changed files with 95 additions and 60 deletions.
27 changes: 24 additions & 3 deletions README.md
@@ -1,10 +1,31 @@
# Blueflix Streamlit

The idea of starting a project on [Streamlit](https://streamlit.io) came from wanting to study a new tool. The ease Streamlit provides for developing projects is incredible. In college I had done a movie recommendation project in a Jupyter notebook, and after that I continued it on [Kaggle](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-netflix). Since I wanted to learn Streamlit, why not combine the two projects? So I developed this project to apply this knowledge. I hope to write an article about the development. For now the project is still in development, so expect new updates.
When I was in college I developed a project in a Jupyter notebook that consumes data from the [Netflix Prime Video Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows) set. The idea was to take this data set, clean it, analyze it, and reach a stage where I could recommend movies and TV shows.

## Datasets:
I was very happy with the result, but I wanted more: I wanted to take this notebook and turn it into an application where I could interact with the project. So I created a personal project where I can use what I have studied and learned over time.

- [Disney+ Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows)
But something was missing: how was I going to show the result of my project? During that time I discovered [Streamlit](https://streamlit.io), ohhhhhhhhhh!!!!! Incredible!!! The flexibility I gained from using it was very good, and I can also deploy on their platform, which lets me show what I did.

I want to thank **Kaggle - @shivamb**, for making the sets below available. In addition to the Netflix set, there are 3 more.

- [Netflix Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/netflix-shows)
- [Hulu Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/hulu-movies-and-tv-shows)
- [Disney+ Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/disney-movies-and-tv-shows)
- [Amazon Prime Movies and TV Shows](https://www.kaggle.com/datasets/shivamb/amazon-prime-movies-and-tv-shows)

From these 4 sets came the idea of creating a single one, to expand the data and make more recommendations possible. Follow the link below.

[4 Services Streaming Movies and Tv Shows](https://www.kaggle.com/datasets/sc0v1n0/4-services-streaming-movies-and-tv)

If you want to understand the process better, I have a post and 4 notebooks where I explain what I created.

- [Post - K-Means Recommend Movies and Tv Shows ](https://dev.to/sc0v0ne/k-means-recommend-movies-and-tv-shows-156m)
- [Hulu Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-hulu)
- [Amazon Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-amazon-prime)
- [Disney Notebook](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-disney)
- [Notebook Netflix](https://www.kaggle.com/code/sc0v1n0/k-means-recommend-movies-and-tv-shows-netflix)


## Conclusion

This personal project is a dream that I am developing, and I want to evolve it further with the skills I acquire along the way. I won't always be adding updates, because I have ideas for other projects that I want to evolve, but I won't stop paying attention to this one. I hope that other developers understand my code and that I can pass on what I learned during this time. I hope you enjoyed it. If you could leave a like on my post or on my notebooks, I would really appreciate it, so I can know whether you liked it. Thank you for reading this far.
@@ -1,6 +1,6 @@
FROM python:3.10

WORKDIR /preprocess
WORKDIR /pipe

RUN pip install --upgrade pip & \
pip install \
@@ -9,4 +9,4 @@ RUN pip install --upgrade pip & \
scikit-learn==1.3.0 \
joblib==1.3.2

COPY pipe /preprocess
COPY pipe /pipe
2 changes: 0 additions & 2 deletions containers/Dockerfile.streamlit
@@ -9,8 +9,6 @@ RUN pip install --upgrade pip & \
WORKDIR /streamlit

COPY /src /streamlit/src
#COPY /.streamlit /streamlit/.streamlit
#COPY /main.py /streamlit

EXPOSE 7999

38 changes: 23 additions & 15 deletions pipe/clusters.py
@@ -5,33 +5,41 @@
from sklearn.cluster import KMeans


def init_train(path_input, data_train, all_data):
print('-' * 100)
def init_train(data_train, all_data):

print('-' * 80)
print('Initialize Train')
path_train = os.path.join(path_input, 'data', 'processed', data_train)

PATTERN_PATH = os.path.join('/pipe', 'data',)
path_train = os.path.join(PATTERN_PATH, 'processed', data_train)
X_train = np.array(pd.read_csv(path_train))

kmeans_model = KMeans(n_clusters=277, random_state=123456)
kmeans_model = KMeans(n_clusters=277, random_state=123456, n_init='auto')
y_clusters = kmeans_model.fit_predict(X_train)

print('-' * 80)
print()
print('Prepare new Dataframe')
path_dataset = os.path.join(path_input, 'data', 'processed', all_data)
path_dataset = os.path.join(PATTERN_PATH, 'processed', all_data)
dataset = pd.read_csv(path_dataset)
dataset['clusters_genre_type'] = y_clusters

dataset['clusters_gender'] = y_clusters
print('-' * 80)
print()
print('Save outputs')
OUTPUT= os.path.join(path_input, 'data', 'final')
OUTPUT= os.path.join(PATTERN_PATH, 'final')

if not os.path.exists(OUTPUT):
os.mkdir(OUTPUT)
dataset.to_csv('data/final/dataset_titles_final.csv', index=False)

path_dataset = os.path.join(OUTPUT, 'dataset_titles_final')
dataset.to_csv(f'{path_dataset}.csv', index=False)
print('Sucefully data final')


OUTPUT_MODEL = os.path.join(path_input, 'data', 'models')
OUTPUT_MODEL = os.path.join(PATTERN_PATH, 'models')

if not os.path.exists(OUTPUT_MODEL):
os.mkdir(OUTPUT_MODEL)

model_path = os.path.join(OUTPUT_MODEL, 'model_kmeans_20231118.pkl')
joblib.dump(kmeans_model, model_path)
model_path = os.path.join(OUTPUT_MODEL, 'model_kmeans_20231119')
joblib.dump(kmeans_model, f'{model_path}.pkl')
print('Sucefully Model')
print('-' * 80)
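
The hunk above guards each output directory with `os.path.exists` followed by `os.mkdir`. One possible simplification, offered as a sketch rather than as the commit's code, is `os.makedirs(..., exist_ok=True)`, which also creates missing parent directories. The helper name `ensure_dir`, the file name `model_kmeans.pkl`, and the temporary directory standing in for `/pipe/data` are assumptions made for illustration.

```python
import os
import tempfile


def ensure_dir(*parts):
    """Join path parts and create the directory (and parents) if missing."""
    path = os.path.join(*parts)
    os.makedirs(path, exist_ok=True)  # no error if it already exists
    return path


# Hypothetical stand-in for the '/pipe' root used in the diff.
root = tempfile.mkdtemp()
output = ensure_dir(root, 'data', 'final')
output_again = ensure_dir(root, 'data', 'final')  # second call changes nothing
model_path = os.path.join(ensure_dir(root, 'data', 'models'), 'model_kmeans.pkl')
```

Unlike `os.mkdir`, this does not raise if `data` itself is absent, which may matter when the volume mounted into the container starts out empty.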
8 changes: 4 additions & 4 deletions pipe/pipeline.py
@@ -1,15 +1,15 @@
from preprocess import preprocess
from clusters import init_train
import sys
import os


if __name__ == '__main__':

NAME_INPUT_DIR = sys.argv[1]

preprocess(NAME_INPUT_DIR)

PATH_PROCESSED = os.path.join('/preprocess')

DATA_TRAIN = 'train_gender.csv'
DATA_MOVIES_SERIES = 'data_titles_processed.csv'

init_train(PATH_PROCESSED, DATA_TRAIN, DATA_MOVIES_SERIES)
init_train(DATA_TRAIN, DATA_MOVIES_SERIES)
50 changes: 29 additions & 21 deletions pipe/preprocess.py
@@ -2,14 +2,21 @@
import pandas as pd


def preprocess(path_input):
PATTERN_PATH = os.path.join('/preprocess', 'data', path_input)
datasets_names = os.listdir(PATTERN_PATH)
print(datasets_names, flush=True)
def preprocess(input_dir):

print('Initialize preprocess')
print('-' * 80)

PATTERN_PATH = os.path.join('/pipe', 'data')

path_data = os.path.join(PATTERN_PATH, input_dir)
datasets_names = os.listdir(path_data)

print('Data name: ', datasets_names)

all_data = []
for dir_ in datasets_names:
path_file_csv = os.path.join(PATTERN_PATH, dir_)
path_file_csv = os.path.join(path_data, dir_)
read_pd = pd.read_csv(path_file_csv)
read_pd['channel_streaming'] = dir_.split('_')[0]
all_data.append(read_pd)
@@ -41,28 +48,27 @@ def preprocess(path_input):
data['gender_type'].apply(lambda x: x.upper())

for data in all_data:
print(data.shape, flush=True)
print('Shape data: ', data.shape)

data_titles = pd.concat(all_data, axis=0)

data_titles['gender_type'] = data_titles['gender_type'].str.lower()


df_split = data_titles['gender_type'].str.split(',', expand=True)

df_split = df_split.fillna('-')
path_input

for x in df_split.columns:
df_split[x] = df_split[x].apply(lambda i: i.strip())

group_dummies = [df_split[d] for d in df_split.columns]

for x in group_dummies:
print(type(x))
print('Type dummies', type(x))

group_dummies = [pd.get_dummies(d, dtype='int') for d in group_dummies]

print(len(group_dummies))
print('Amount dummies:', len(group_dummies))

group_dummies = pd.concat(group_dummies, axis=1)

@@ -71,16 +77,18 @@ def preprocess(path_input):
group_dummies.drop(columns=['-'], axis=1, inplace=True)

data_titles['title'] = data_titles['title'].apply(lambda x: x.lower())


OUTPUT= os.path.join('/preprocess', 'data', 'processed')

OUTPUT= os.path.join(PATTERN_PATH, 'processed')
if not os.path.exists(OUTPUT):
os.mkdir(OUTPUT)

data_titles.to_csv('/preprocess/data/processed/data_titles_processed.csv', index=False)
print('Sucefully Data Titles')
group_dummies.to_csv('/preprocess/data/processed/train_gender.csv',
index=False)
path_data_titles = os.path.join(OUTPUT, 'data_titles_processed')
data_titles.to_csv(f'{path_data_titles}.csv', index=False)
print('Sucefully Data Titles')

path_data_dummies = os.path.join(OUTPUT, 'train_gender')
group_dummies.to_csv(f'{path_data_dummies}.csv', index=False)
print('Sucefully group_dummies')
print('-' * 100)

print('-' * 80)
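
The preprocessing above lower-cases the comma-separated `gender_type` strings, splits and strips them, and turns them into 0/1 training columns with `pd.get_dummies`. The same idea can be sketched in plain Python without pandas. This is an illustration under assumptions, not the commit's implementation: the function name `genre_indicators` is hypothetical, and it emits one column per distinct genre, whereas the diff builds dummies per split position before concatenating them.

```python
def genre_indicators(genre_strings):
    """Turn comma-separated genre strings into 0/1 indicator rows."""
    # Normalise: lower-case, split on commas, strip whitespace, drop empties.
    split_rows = [
        [g.strip() for g in s.lower().split(',') if g.strip()]
        for s in genre_strings
    ]
    # One column per distinct genre, sorted so the layout is reproducible.
    columns = sorted({g for row in split_rows for g in row})
    rows = [[1 if col in row else 0 for col in columns] for row in split_rows]
    return columns, rows


# Input strings mirror the genre values of the sample CSV rows in this commit.
columns, rows = genre_indicators([
    'comedy, stand up',
    'crime, drama, thriller',
    'action, thriller',
])
```

Each resulting row is the kind of binary vector that `KMeans` receives as `X_train` in `init_train`.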

3 changes: 2 additions & 1 deletion scripts/build.sh
@@ -1,2 +1,3 @@
#!/bin/bash
docker build . -f ./containers/Dockerfile.streamlit -t streamlit_app:latest --rm
docker build . -f ./containers/Dockerfile.streamlit \
-t streamlit_app:latest --rm
7 changes: 4 additions & 3 deletions scripts/init_pipe.sh
@@ -2,8 +2,9 @@

NAME_INPUT_DIR=raw

docker build . -f containers/Dockerfile.preprocess -t container_preprocess
docker build . -f containers/Dockerfile.pipe \
-t pipe_ai --rm
docker run -it \
-v ${PWD}/src/data:/preprocess/data container_preprocess \
python /preprocess/pipeline.py \
-v ${PWD}/src/data:/pipe/data pipe_ai \
python /pipe/pipeline.py \
${NAME_INPUT_DIR}
8 changes: 4 additions & 4 deletions src/components/query.py
@@ -24,12 +24,12 @@ def recommends(

extra_cols = [x for x, y in extra_cols.items() if y]

movie = dataset[dataset['title'] == rename][['clusters_genre_type']]
movie = dataset[dataset['title'] == rename][['clusters_gender']]
reset_movie = movie.reset_index()
reset_movie = reset_movie.at[0, 'clusters_genre_type']
reset_movie = reset_movie.at[0, 'clusters_gender']
k_id = int(reset_movie)
cols_view = ['title', 'gender_type'] + extra_cols
result = dataset[dataset['clusters_genre_type'] == k_id][cols_view][:int(top_n)]
result = dataset[dataset['clusters_gender'] == k_id][cols_view][:int(top_n)]
result.set_index('title')

return response_recommends(result)
@@ -50,7 +50,7 @@ def recommender_by_gender(dataset, options: list, cols: list):
pred_test = model.predict(y_input)
pred_test[0]

result = dataset[dataset['clusters_genre_type'] == pred_test[0] ][['title', 'gender_type', 'channel_streaming']]
result = dataset[dataset['clusters_gender'] == pred_test[0] ][['title', 'gender_type', 'channel_streaming']]
result.set_index('title')

return response_recommends(result)
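
The lookup in `recommends` boils down to two steps: find the cluster id stored for the chosen title, then return the first `top_n` rows that carry the same id. A minimal plain-Python sketch of that logic follows, assuming rows are dictionaries rather than a DataFrame. The function name `recommend_by_title` and the sample titles and cluster ids are invented for the example.

```python
def recommend_by_title(rows, title, top_n=5):
    """Return up to top_n rows that share the cluster of the given title."""
    match = next((r for r in rows if r['title'] == title.lower()), None)
    if match is None:
        return []  # unknown title: nothing to recommend
    k_id = match['clusters_gender']
    return [r for r in rows if r['clusters_gender'] == k_id][:top_n]


# Hypothetical rows; only the column names come from the commit.
rows = [
    {'title': 'film a', 'gender_type': 'action, thriller', 'clusters_gender': 18},
    {'title': 'film b', 'gender_type': 'comedy', 'clusters_gender': 8},
    {'title': 'film c', 'gender_type': 'action, thriller', 'clusters_gender': 18},
]
result = recommend_by_title(rows, 'Film A')
```

As in the original query, the selected title itself is part of the result, because it belongs to its own cluster.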
2 changes: 1 addition & 1 deletion src/data/final/dataset_titles_final.csv
@@ -1,4 +1,4 @@
movie_or_serie,title,director,cast,country,date_added_platform,release_year,duration_seconds,gender_type,description,channel_streaming,clusters_genre_type
movie_or_serie,title,director,cast,country,date_added_platform,release_year,duration_seconds,gender_type,description,channel_streaming,clusters_gender
Movie,ricky velez: here's everything,uninformed director,uninformed cast,uninformed country,"October 24, 2021",2021,,"comedy, stand up",​Comedian Ricky Velez bares it all with his honest lens and down to earth perspective in his first-ever HBO stand-up special.,hulu,8
Movie,silent night,uninformed director,uninformed cast,uninformed country,"October 23, 2021",2020,94 min,"crime, drama, thriller","Mark, a low end South London hitman recently released from prison, tries to go straight for his daughter, but gets drawn back in by Alan, his former cellmate, to do one final job.",hulu,58
Movie,the marksman,uninformed director,uninformed cast,uninformed country,"October 23, 2021",2021,108 min,"action, thriller",A hardened Arizona rancher tries to protect an 11-year-old migrant boy fleeing from a ruthless drug cartel.,hulu,18
Binary file added src/data/models/model_kmeans_20231119.pkl
Binary file not shown.
6 changes: 2 additions & 4 deletions src/main.py
@@ -67,12 +67,10 @@ def load_data(dir_data: str, name_dataset: str):

st.markdown('------------------------')

st.title(' Select By Gender 📽️🍿 ')


st.title(' Select by Genre 📽️🍿 ')

options = st.multiselect(
'Select by Gender',
'In the menu select up to 3 types of movie or TV show genres:',
cols,
max_selections=3,
)