Preprocessing Issues #2
The issue with the preprocessed data is that if we adopt this dataset to train a model and then want to predict on a new data point, we have to preprocess that data point in the same way. For example, to apply min-max scaling as normalization, we need to subtract each feature's minimum over the original dataset and divide by max − min. The problem is that we do not have these values. I think this issue should be solved for anyone to truly benefit from this dataset. Thanks
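Concretely, scaling a new sample requires the per-feature statistics of the original (unreleased) raw data. A minimal sketch of the needed transformation, where the file names and arrays are purely illustrative assumptions and are not provided by this repository:

```python
import numpy as np

# Hypothetical illustration: 'train_features' stands in for the raw
# (unnormalized) feature matrix used to build the released dataset, and
# 'x_new' for a new account's raw feature vector. Neither file exists in
# this repository; the point is that the per-feature min and max of the
# original data are required to scale any new sample.
train_features = np.load("raw_train_features.npy")   # assumed file
x_new = np.load("new_account_features.npy")          # assumed file

feat_min = train_features.min(axis=0)
feat_max = train_features.max(axis=0)

# Min-max scale the new point with the *training* statistics,
# guarding against constant features (max == min).
denom = np.where(feat_max - feat_min == 0, 1.0, feat_max - feat_min)
x_new_scaled = (x_new - feat_min) / denom
```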
MGTAB is a standardized dataset. The code for standardizing the data is as follows:
The feature processing function is as follows:
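For reference, a minimal sketch of what column-wise z-score standardization looks like in Python; this is only an assumed illustration of the kind of processing described in the paper, not the repository's actual code:

```python
import numpy as np

def standardize(features: np.ndarray) -> np.ndarray:
    """Column-wise z-score standardization: (x - mean) / std per feature.

    Assumed reconstruction for illustration only; not the code actually
    used to build MGTAB.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return (features - mean) / std
```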
Or at least share the minimum and maximum of each feature in the dataset, for complete reproducibility.
Sorry for being persistent, but when you publicly release a dataset, your aim is surely that people benefit from it. Here is an example to clarify my point:
Now, with the model trained, suppose we want to predict the label of a new data point x_1. This data point must be min-max scaled with the same scaler that was fitted on the training data; otherwise we will get wrong results, since the new point has to be scaled with the same values used for the training data to be consistent (a minimal sketch of this is shown below).
I hope you take this into consideration; otherwise no one can benefit from the released dataset and part of your effort will have been in vain.
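A minimal sketch of this point using scikit-learn's MinMaxScaler; the arrays here are made up purely for illustration and stand in for the unreleased raw features:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Purely illustrative data standing in for the (unreleased) raw features.
X_train = np.array([[10.0, 200.0], [50.0, 800.0], [30.0, 500.0]])
x_new = np.array([[40.0, 650.0]])

scaler = MinMaxScaler().fit(X_train)      # statistics come from the training data
x_new_scaled = scaler.transform(x_new)    # consistent with how the model was trained

# Re-fitting a scaler on different data (or on the new point alone) yields a
# different representation, so the model's predictions would be meaningless.
```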
MGTAB is a normalized heterogeneous graph dataset with multiple relations, and effective feature extraction has been carried out. As you say, the original features are not visible, since we hope that readers can use the processed data directly. Part of the original data has been sent to your email; we hope it will be helpful to your research.
Thanks for sharing. But there is a win-win solution for both of us: please just share the minimum and the maximum of each numerical feature. That way, no user information is disclosed, and at the same time everyone can benefit properly and correctly from the dataset.
Hello, in the Appendix of your paper, section A.1, you mention that the min/max values of the features are made public in the repository, but I can't find them. Could you point me to them? If you haven't published them, then I'd agree with @msharara1998 that no one can benefit from your great work on new/other data!
Hello,
1) You mentioned in the paper that you've calculated the z-score of each feature. However, upon inspecting the dataset, I found that no feature has a value greater than one. To my knowledge, the z-score is calculated as z = (x − μ) / σ, where μ and σ are the feature's mean and standard deviation.
Have you standardized the data using the above z-score, or normalized it by dividing each column's values by the maximum value? (A quick way to check this against the released features is sketched at the end of this comment.)
It would also help to share the user_name feature, or at least the user ID, for easier reproducibility.
Several authors who released public datasets have shared the user IDs. I kindly request that you share the account IDs or usernames with me privately via my email ([email protected]). If you really cannot share them, please provide the preprocessing code for the entire dataset (especially the graph features).
Another related concern is which Twitter API endpoints I should use so that I can construct and preprocess a new data point identically to the dataset (especially the graph part). Sharing the code you used to go from raw Twitter API data to this dataset would therefore be extremely helpful.
Thank you in advance.
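As a way to distinguish the two possibilities in point 1), a quick check on the released feature matrix; the file name and tensor layout below are assumptions, so adjust them to however the processed features are actually stored:

```python
import torch

# Assumed path/format: MGTAB's processed node features saved as a tensor.
features = torch.load("features.pt").float()

col_mean = features.mean(dim=0)
col_std = features.std(dim=0)
col_absmax = features.abs().max(dim=0).values

# z-scored columns should have mean ~ 0 and std ~ 1 (with values often well
# above 1 in magnitude); max-divided columns are bounded by 1 and usually
# have a clearly non-zero mean.
print("mean range:   ", col_mean.min().item(), col_mean.max().item())
print("std range:    ", col_std.min().item(), col_std.max().item())
print("abs-max range:", col_absmax.min().item(), col_absmax.max().item())
```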