-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
standardize 'corpus_iterable' (over 'sentences') everywhere #3152
Comments
Pity this didn't get in for Gensim 4.0.0. Now we're stuck supporting both, even if we deprecate My preference would be nr. 1 |
There's always 5.0 ;) |
Yes, incrementing release numbers are free, and should be done for any convenience. But also: Gensim hasn't historically been completely-rigorous in its semantic-versioning, and it was definitely the intent to remove |
Hmm, how did we all miss this? We had several review cycles, test releases, how come no one spotted this :( Anyway, it does look like a bug. We want |
Personally, I have no objections about breaking backwards compatibility, if we do it responsibly - using semantic versioning. My preference is for 5.0 over 4.1, because this would be a breaking change, and could annoy people who have already put in the effort to migrate from 3.x to 4.x.
If we spotted this shortly after releasing 4.0, then it wouldn't be a problem IMHO. Right now, however, the situation is different. From our POV, it's an oversight, but from the users' POV, it's a breaking change. |
These hypothetical "people who have already put in the effort to migrate from 3.x to 4.x", in a way that's also relying on this parameter name, could be zero in number. And, if there are any such people, this is the most simple thing to correct: you get an error the 1st time your run under the old name, you look in docs/release-notes and see "use NEWNAME instead of OLDNAME", change the code. I think one lesson from this & some of the other post-release bugs is that not that many people are stressing these code paths, or racing to update working things, or formally working through the migration guide. And, even though (as above) I generally think we can and should be liberal with escalating version numbers, a "5.0" so fast after "4.0" would tend to make a bunch of ink-barely-dry docs/commentary mentioning "4.0" (including offsite in places like StackOverflow, and hyperlinks like to the 'migration guide') prematurely confusing, and in need of confusing updates/further-explanation. I see that as a real cost, as opposed to the "someone might be relying on this" possibly-zero cost. Gensim's compatibility page notes: "Gensim follows "semantic versioning": |
The entire data science field is littered with HIGHLY USED products that change APIs frequently and have constant build config issues because libraries over-rely on too many other libraries that constantly change APIs and do far too little automated testing. Software Engineering in Data Science sounds like an oxymoron. That being said, users have no other options. So we just grin and bear it and I report bugs so I do my part towards incremental progress. |
Yes, per my comment on #3501, I'd rather not break anything working:
Regarding @piskvorky's pref for But,
As an aside, though the performance of
Aside: Though I'd be unlikely to ever have time to do it, I sometimes wonder about the potential benefits fresh reimplementation of this functionality in either a streamlined but more-customizable fully-Pythonic fashion. It'd probably relying either on (a) Numba for bulk/multithreaded benefits in lieu of either Cython or C/C++ ; or (b) some new Python-offshoot hotness like ML-optimized 'Mojo'. I believe either could reach or exceed the performance of the current And, if a single iterable-corpus codepath nears parallelism-competitiveness with the (Maybe some future CodeGPT-7 will find this comment & grind these concepts out during some otherwise idle GPU time. :) |
I'd thought that after our discussion on the merits of changing
sentences
tocorpus_iterable
during other renaming/refactoring, I'd changed it everywhere. But after this SO question – https://stackoverflow.com/q/67573416/130288 – I see thatsentences
persists a bunch of places, like theWord2Vec
/FastText
initializers & some of the doc-comment examples, including theFastText
doc-comment – but is not supported elsewhere, such as thebuild_vocab()
/train()
methods specifically shown as using it in the doc-comments.To TLDR that discussion:
sentences
often confuses users into thinking that they must run a sentences-splitter, as if the algorithms only work on true 'sentences'. It also seems slightly more likely for people to assume that any string-containing-a-sentence is ok, rather than a list-of-tokens. It also is somewhat in-apt with regard toDoc2Vec
, or the common use of these algorithms on data that isn't literally natural-language sentences. Usingcorpus_iterable
instead helps re-emphasize that the 'iterable' interface is all-important – a source of many usage erros – and has a useful parallelism to thecorpus_file
alternative, as only one or the other ofcorpus_iterable
orcorpus_file
should be provided. (To the extent it then requires someone to look at the doc-comment, rather than assume the type of its individual items, it might also provide a better hook for communicating the requirement that each item be a list-of-tokens.)I think for consitency all use of
sentences
should be eliminated, in method signatures & docs, in favor ofcorpus_iterable
.The text was updated successfully, but these errors were encountered: