
[QST] Problem with defining input module, item embedding table. #773

Closed
Fluitketel0 opened this issue Apr 11, 2024 · 4 comments

Fluitketel0 commented Apr 11, 2024

❓ Questions & Help

When attempting to configure my model with TabularSequenceFeatures.from_schema(), I encounter an error, which I suspect is related to the setup of the item embedding table. Could anyone point out what I might be doing wrong?

Details

I'm working in the PyTorch 23.12 Docker image and most of my code came from trying to follow the End-to-end session-based recommendation notebook or the Model Architectures page.

Here is my nvt code:

# Load dataset
df = pq.read_table('/workspace/scriptie/data/processed/processedAndTruncated.parquet').to_pandas()
df['priceCategory'] = df['priceCategory'].astype(str)
df = df.rename(columns={'accommodationId': 'item_id'})


# Categorify categorical features
categ_feats = ['engagementType', 'periodId', 'country', 'item_id', 'aquaFun', 'adultOnly', 'forKids',
               'priceCategory']
categorify_op = categ_feats >> nvt.ops.Categorify()

userId = ['userId']
userId_op = userId >> nvt.ops.Categorify() >> nvt.ops.TagAsUserID()
# Define Groupby Workflow
groupby_feats = userId_op + categ_feats + ['engagementCountLog', 'itemRecencyLog', 'dateHoursLog', 'dayOfYearSin', 'dayOfYearCos']

# Step 2: Define groupby operation to create list columns
groupby_features =  groupby_feats >> nvt.ops.Groupby(
    groupby_cols=['userId'],
    sort_cols=['dateHoursLog'],
    aggs={
        'item_id': ['list', 'count'],
        'engagementType': ['list'],
        'periodId': ['list'],
        'country': ['list'],
        'aquaFun': ['list'],
        'adultOnly': ['list'],
        'forKids': ['list'],
        'priceCategory': ['list'],
        'dateHoursLog': ['list'],
        'itemRecencyLog': ['list'],
        'engagementCountLog': ['list'],
        'dayOfYearSin': ['list'],
        'dayOfYearCos': ['list']
    },
    name_sep='-'
)

# Adding metadata ops
metadata_features = groupby_features >> nvt.ops.AddMetadata(tags=['LIST'])

tagged_item_id = groupby_features['item_id-list'] >> nvt.ops.TagAsItemID() >> nvt.ops.AddMetadata(tags=['ITEM_ID', 'ITEM' ,'CATEGORICAL'])

cont_op = groupby_features['dateHoursLog-list', 'itemRecencyLog-list', 'engagementCountLog-list', 'dayOfYearSin-list', 'dayOfYearCos-list'] >> nvt.ops.AddMetadata(tags=[Tags.CONTINUOUS])

categ_op = groupby_features['engagementType-list', 'periodId-list', 'country-list', 'item_id-list', 'aquaFun-list', 'adultOnly-list', 'forKids-list', 'priceCategory-list', 'item_id-count'] >> nvt.ops.AddMetadata(tags=['CATEGORICAL'])

# add any other workflows
renamendUserId = groupby_features['userId'] >> nvt.ops.Rename(name ='user_id')

selected_features =  metadata_features + cont_op + categ_op + tagged_item_id 

# Filter out sessions with length 1
MINIMUM_SESSION_LENGTH = 2
final_workflow_ops = selected_features >> nvt.ops.Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)

# Create and apply the workflow
workflow = nvt.Workflow(final_workflow_ops)

# Apply the combined workflow in a single fit_transform call
dataset = nvt.Dataset(df)
workflow.fit(dataset)
transformed_dataset = workflow.transform(dataset) 

# Save the transformed dataset with metadata to parquet
transformed_dataset.to_parquet("/workspace/scriptie/data/processed/processed_with_metadata_nvt")
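As a sanity check, the Groupby-plus-Filter part of the workflow above can be sketched in plain Python (no NVTabular): group interactions per user, order them by timestamp, collect the item lists, and drop sessions shorter than MINIMUM_SESSION_LENGTH. Column names follow the snippet; the rows here are made-up toy data.

```python
# Toy interactions standing in for the parquet data (hypothetical values)
rows = [
    {"userId": 1, "item_id": 10, "dateHoursLog": 0.1},
    {"userId": 1, "item_id": 11, "dateHoursLog": 0.3},
    {"userId": 2, "item_id": 12, "dateHoursLog": 0.2},
]
MINIMUM_SESSION_LENGTH = 2

# Groupby(groupby_cols=['userId'], sort_cols=['dateHoursLog'], aggs={'item_id': ['list']})
sessions = {}
for r in sorted(rows, key=lambda r: r["dateHoursLog"]):
    sessions.setdefault(r["userId"], []).append(r["item_id"])

# Filter(f=lambda df: df["item_id-count"] >= MINIMUM_SESSION_LENGTH)
kept = {u: items for u, items in sessions.items()
        if len(items) >= MINIMUM_SESSION_LENGTH}
print(kept)  # user 2's single-interaction session is filtered out
```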

And here is my current model:

from transformers4rec.torch.ranking_metric import NDCGAt, RecallAt

dataset_schema = tr.Schema().from_proto_text("/workspace/scriptie/data/processed/processed_with_metadata_nvt/schema.pbtxt")

max_sequence_length, d_model = 20, 64

inputs = tr.TabularSequenceFeatures.from_schema(
        schema = dataset_schema,
        max_sequence_length= max_sequence_length,
        masking = 'causal',
        continuous_projection=64,
        aggregation="concat",
    )

# Define the config of the XLNet Transformer architecture
transformer_config = tr.XLNetConfig.build(
    d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length
)

body = tr.SequentialBlock(
    inputs,
    tr.TransformerBlock(
        transformer_config, masking = inputs.masking
    )

)
head = tr.Head(
    body,
    tr.NextItemPredictionTask(weight_tying=True,
                                # metrics=[RecallAt(top_ks=[1, 5, 10], labels_onehot=True),  
                                #         NDCGAt(top_ks=[5, 10], labels_onehot=True)]
                             ),
)
model = tr.Model(head)
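For context on `weight_tying=True` above: with weight tying, the output projection reuses the item embedding table instead of learning a separate output matrix, so each item's logit is the dot product of the transformer's hidden state with that item's embedding row. A plain-Python sketch with toy numbers (no torch, hypothetical values):

```python
# The shared item embedding table: one row per item, d_model columns.
item_embeddings = [
    [0.1, 0.2, 0.0, 0.5],
    [0.4, 0.1, 0.3, 0.2],
    [0.0, 0.0, 0.9, 0.1],
]
hidden = [1.0, 0.5, 0.25, 0.0]  # transformer output at the last position

# Tied projection: logits = E @ hidden, reusing the embedding rows as
# output weights -- no separate output weight matrix is learned.
logits = [sum(e * h for e, h in zip(row, hidden)) for row in item_embeddings]
print(len(logits))  # one next-item score per item in the catalogue
```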

Running the model code results in a KeyError: 'item_id-list'. Here is the complete error message:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[151], line 27
     16 transformer_config = tr.XLNetConfig.build(
     17     d_model=d_model, n_head=8, n_layer=2, total_seq_length=max_sequence_length
     18 )
     20 body = tr.SequentialBlock(
     21     inputs,
     22     tr.TransformerBlock(
   (...)
     25 
     26 )
---> 27 head = tr.Head(
     28     body,
     29     tr.NextItemPredictionTask(weight_tying=True,
     30                                 # metrics=[RecallAt(top_ks=[1, 5, 10], labels_onehot=True),  
     31                                 #         NDCGAt(top_ks=[5, 10], labels_onehot=True)]
     32                              ),
     33 )
     34 model = tr.Model(head)

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py:273, in Head.__init__(self, body, prediction_tasks, task_blocks, task_weights, loss_reduction, inputs)
    270     for task, val in zip(cast(List[PredictionTask], prediction_tasks), task_weights):
    271         self._task_weights[task.task_name] = val
--> 273 self.build(inputs=inputs, task_blocks=task_blocks)

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/base.py:299, in Head.build(self, inputs, device, task_blocks)
    297     if task_blocks and isinstance(task_blocks, dict) and name in task_blocks:
    298         task_block = task_blocks[name]
--> 299     task.build(self.body, input_size, inputs=inputs, device=device, task_block=task_block)
    300 self.input_size = input_size

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/model/prediction_task.py:386, in NextItemPredictionTask.build(self, body, input_size, device, inputs, task_block, pre)
    384 self.embeddings = inputs.categorical_module
    385 if not self.target_dim:
--> 386     self.target_dim = self.embeddings.item_embedding_table.num_embeddings
    387 if self.weight_tying:
    388     self.item_embedding_table = self.embeddings.item_embedding_table

File /usr/local/lib/python3.10/dist-packages/transformers4rec/torch/features/embedding.py:94, in EmbeddingFeatures.item_embedding_table(self)
     90 @property
     91 def item_embedding_table(self):
     92     assert self.item_id is not None
---> 94     return self.embedding_tables[self.item_id]

File /usr/local/lib/python3.10/dist-packages/torch/nn/modules/container.py:461, in ModuleDict.__getitem__(self, key)
    459 @_copy_to_script_wrapper
    460 def __getitem__(self, key: str) -> Module:
--> 461     return self._modules[key]

KeyError: 'item_id-list'

Calling inputs.item_embedding_table directly raises the same KeyError as above.
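The traceback bottoms out in a plain dictionary lookup: EmbeddingFeatures keeps one embedding table per column it recognised as categorical, keyed by column name, and item_embedding_table simply indexes that mapping with the item-id column. If that column never received categorical metadata (here: because Categorify was not actually applied), no table is registered under its name. A stripped-down pure-Python analogue, with hypothetical table contents:

```python
# Stand-in for EmbeddingFeatures.embedding_tables: one entry per column
# recognised as categorical when the input module was built.
# 'item_id-list' is absent because it never went through Categorify.
embedding_tables = {
    "engagementType-list": "<Embedding>",
    "country-list": "<Embedding>",
}

item_id_column = "item_id-list"  # the column TagAsItemID marks in the schema

def item_embedding_table(tables, item_id_col):
    # mirrors the failing line in embedding.py: a plain key lookup
    return tables[item_id_col]

try:
    item_embedding_table(embedding_tables, item_id_column)
except KeyError as exc:
    print("KeyError:", exc)  # the same failure mode as in the traceback
```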


rnyak commented Apr 12, 2024

I'm working in the PyTorch 23.12 Docker image

Is that the Merlin image? If not, can you please use the merlin-pytorch:23.08 image? It comes with everything installed; you don't need to install anything.

Fluitketel0 (Author) commented

Is that merlin image?

Thanks for the reply. Yes, that is the one I mean; I will try my code with merlin-pytorch:23.08 first thing Monday morning.

Fluitketel0 (Author) commented

Thank you for your suggestion @rnyak, but I still receive the same KeyError: 'item_id-list' error. Do you have any other ideas?

Fluitketel0 (Author) commented

I had a mistake in my code: I did not correctly apply categorify_op:

# Define Groupby Workflow
groupby_feats = userId_op + categ_feats + ...

I added the raw column list instead of the op's output. Thanks for your help @rnyak.
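For anyone hitting the same error: the selection passed to the Groupby must include the Categorify op's output, not the raw column-name list, so the item-id column carries categorical metadata into the schema. A sketch of the corrected line, using the variable names from the snippet above (a workflow fragment, not standalone code):

```python
# Corrected: feed categorify_op (the Categorify output) into the groupby
# selection instead of categ_feats (a plain list of column names).
groupby_feats = userId_op + categorify_op + ['engagementCountLog', 'itemRecencyLog',
                                             'dateHoursLog', 'dayOfYearSin', 'dayOfYearCos']
```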
