Map item catch #365
Conversation
…_item compatibility with dataset
Created `BasicProcessor` class methods. I did not combine these methods, as it seems unnecessary to check the compatibility of processor and dataset every time. Note:
Tested it on all the ZeeSchuimer datasources. Apparently there is some other oddity in LinkedIn. This update works as intended and notifies both us and the user.
`get_item_keys` would also raise an error on a non-CSV/NDJSON file. It will return the first item's keys even when `map_item` does not exist... but that is how it always functioned, so I am leaving it the same.
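Roughly, the new check looks like this (a sketch; only the `map_item_method_available` name is taken from the actual diff below, the body and the `extension` attribute here are illustrative assumptions):

```python
class BasicProcessor:
    """Illustrative stub; only map_item_method_available's name comes from the PR."""
    extension = "ndjson"  # assumed attribute: the file type this processor works with

    @classmethod
    def map_item_method_available(cls, dataset):
        """
        Sketch of the compatibility check: the processor must define map_item,
        and the dataset's file extension must match what the processor expects.
        """
        has_map_item = callable(getattr(cls, "map_item", None))
        same_extension = dataset.get_results_path().suffix.lower().lstrip(".") == cls.extension
        return has_map_item and same_extension
```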
…t keys which is done A LOT)
… removed/deprecated)
…is already in the log)
backend/abstract/processor.py
Outdated
""" | ||
# only run item mapper if extension of processor == extension of | ||
# data file, for the scenario where a csv file was uploaded and | ||
# converted to an ndjson-based data source, for example |
@dale-wahl is there any datasource that does this at the moment?
That's some old voodoo (I think it is yours, to be honest!)... I extracted that code and comment into a new method to use as a check to see if `map_item` should be run, but I did not modify the code. I believe it has to do with custom imports of, say, a Twitter dataset. You could convert the dataset type from custom to a Twitter datasource type, but it would not be an NDJSON as expected. That would also presumably apply to any ZeeSchuimer datasource (a custom Douyin/Instagram CSV upload would not be able to use `map_item`). In practice, I am not exactly sure how often we ran into this problem, since users cannot change data types, only admins can (am I wrong about that?).
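In other words (invented values, nothing from the codebase): a dataset whose type was switched to an NDJSON-based datasource can still have a CSV file on disk, which is what the extension comparison guards against.

```python
# Invented values purely to illustrate the mismatch described above.
processor_extension = "ndjson"   # what the datasource's map_item expects
results_file_suffix = ".csv"     # what was actually uploaded and sits on disk

# only run the item mapper if the two agree; otherwise fall back to raw items
can_map = results_file_suffix.lstrip(".").lower() == processor_extension
print(can_map)  # False: map_item would be skipped for this dataset
```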
# not a valid NDJSON file?
return []

if (self.get_results_path().suffix.lower() == ".csv") or (self.get_results_path().suffix.lower() == ".ndjson" and self.get_own_processor() is not None and self.get_own_processor().map_item_method_available(dataset=self)):
Are the extension checks necessary here? I think if there is a `map_item` there will be columns, and if not then not, regardless of extension.
(I know they were already there, but I think they might be a left-over from when we were less extension-agnostic)
I vaguely remember an issue here with something in the frontend: `get_item_keys` would fail because `get_own_processor` would return `None`. But perhaps the extension checks are redundant. I can test this.
CSV datasets do not have `map_item`, but do have columns we wish to return. NDJSON datasets should only return columns if they can be mapped (else keys are not consistent enough to be used as "columns"). Some datasets do not have processors (rare; from what I can tell only deprecated datasets), and a check here avoids an error being raised by `iterate_items`.
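Roughly, the logic I'm describing (a sketch with an invented helper name, not the exact code in the diff):

```python
def should_return_columns(dataset):
    """
    Sketch: CSV datasets always have usable columns; NDJSON datasets only do
    when their processor can map items; datasets without a processor
    (e.g. deprecated ones) return nothing rather than letting iterate_items raise.
    """
    extension = dataset.get_results_path().suffix.lower()
    if extension == ".csv":
        return True
    if extension == ".ndjson":
        processor = dataset.get_own_processor()
        return processor is not None and processor.map_item_method_available(dataset=dataset)
    return False
```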
Potentially we could combine the two Dataset methods `get_item_keys` and `get_columns`. They are almost the same thing. `get_item_keys` is used in various processors, while `get_columns` is used by the frontend, most often via `get_options`. Currently, `get_item_keys` takes advantage of `iterate_items` and grabs the first item's keys. `get_columns` was a copy of the code from `iterate_items`, which is the code that I'm modifying in this PR.
When attempting to consolidate the code, I realized that `get_columns` could not be completely replaced by `get_item_keys` because of the previously mentioned instances where we do not wish to return columns. Possibly we do not wish to return item keys either in these instances via `get_item_keys`, but I did not explore all usages of that function as no errors were occurring. Most likely, `get_columns` returns `False` via the frontend and so the backend never runs into a misuse of `get_item_keys`.
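A sketch of what that consolidation might look like (method bodies are assumptions, not the current implementation):

```python
class DataSet:
    """Illustrative stub; only the two methods under discussion are sketched."""

    def get_columns(self):
        # Sketch: columns for the frontend, or False when they cannot be
        # trusted (unmappable NDJSON, no processor, etc.).
        if not should_return_columns(self):  # hypothetical guard, see the sketch above
            return False
        return self.get_item_keys()

    def get_item_keys(self):
        # Sketch: reuse iterate_items and take the keys of the first item,
        # which is how the current behaviour is described above.
        for item in self.iterate_items():
            return list(item.keys())
        return []
```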
This works... but it is logging to the original dataset, not the new one being created (plus I'm warning via the database, since datasets do not have the 4CAT log). I think it may make more sense to wrap `map_item` at the processor level. But `map_item` is a static method and thus `get_mapped_item` (or whatever I call the new method) ought to be as well... but then I don't have a log or dataset or anything useful, really. Also, `iterate_items` is the main user of `map_item` and exists in the dataset. So perhaps I need a processor method to check whether `map_item` works and then use `get_mapped_item`.
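For discussion, a rough sketch of that wrapper idea (all names and the logging call are hypothetical assumptions, not the final implementation):

```python
class BasicProcessor:
    """Illustrative stub for the wrapper idea above; not the PR's actual code."""

    @classmethod
    def get_mapped_item(cls, item, dataset=None):
        # Hypothetical wrapper: call the (static) map_item, catch mapping
        # errors, and report them against whichever dataset was passed in,
        # since datasets have no 4CAT log of their own.
        try:
            return cls.map_item(item)
        except (KeyError, ValueError) as e:
            if dataset is not None:
                dataset.update_status("Could not map item: %s" % e)  # assumed logging hook
            return None


class ExampleProcessor(BasicProcessor):
    @staticmethod
    def map_item(item):
        # a datasource-specific mapper; raises KeyError on malformed items
        return {"id": item["id"], "body": item["text"]}


print(ExampleProcessor.get_mapped_item({"id": 1, "text": "hello"}))  # {'id': 1, 'body': 'hello'}
print(ExampleProcessor.get_mapped_item({"id": 2}))                   # None (warning would go to the dataset, if one were passed)
```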