-
Notifications
You must be signed in to change notification settings - Fork 358
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement Series.cov #1620
base: master
Are you sure you want to change the base?
Implement Series.cov #1620
Conversation
databricks/koalas/series.py
Outdated
Parameters | ||
---------- | ||
other : Series | ||
min_periods : int |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe min_periods
also can be an optional
?
- because It will be a zero when nothing is given for
min_periods
k_isnan = np.isnan(kser.cov(kser_other, 4)) | ||
p_isnan = np.isnan(pser.cov(pser_other, 4)) | ||
self.assert_eq(k_isnan, p_isnan) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have a test when each Series has a different index and an Exception case?
For example,
kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)
and
self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85])) # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90]))) # series are not aligned
return self._internal.spark_frame.cov(self.name, other.name) | ||
else: | ||
# if not on the same anchor calculate covariance manually | ||
return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(self.index)
is performed four times in this code.
What do you think about we assign a proper variable and reuse it?
(ex. len_index = len(self.index)
at the line above this variable is first used)
Could you add this to the docs also ?? It is placed at |
Otherwise, looks fine to me. Thanks, @lopez- |
if len(self.index) != len(other.index): | ||
raise ValueError("series are not aligned") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is this from? Seems like pandas works even with a different length of Series.
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I missed it. Thanks, @ueshin .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mmm this is interesting. Seems like pandas performs an alignment between the series before computing the covariance. So, this:
>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5
And this:
>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5
are equivalent... I believe this align
is not supported in Koalas today. If this is a blocker I could open an issue and wait until somebody implements this. Another option I can think of is to go ahead and have a slightly different behavior for this edge case while we wait for the align
implementation. Do you have any thoughts/preference on how to go about this @itholic @ueshin ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lopez- Could you file the issue for align
?
Also, is it possible for you to implement it?
kser = ks.Series([90, 91, 85]) | ||
pser = kser.to_pandas() | ||
kser_other = ks.Series([90, 91, 85]) | ||
pser_other = kser_other.to_pandas() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please define pandas object first. to_pandas()
invokes extra Spark jobs and it will take more time for tests.
pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)
databricks/koalas/series.py
Outdated
@@ -4858,6 +4858,54 @@ def mad(self): | |||
|
|||
return mad | |||
|
|||
def cov(self, other: "Series", min_periods: Optional[int] = None) -> float: | |||
""" | |||
Return the covariance between two series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we just copy the docstring from pandas' with a few modification of examples?
raise ValueError("series are not aligned") | ||
|
||
min_periods = 0 if min_periods is None else min_periods | ||
if len(self.index) < min_periods or len(self.index) <= 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also compare len(self.index)
with min_periods
?
>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan
|
||
if same_anchor(self, other): | ||
# if the have the same anchor use the more performant Spark native `cov` | ||
return self._internal.spark_frame.cov(self.name, other.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self._kdf._internal.resolved_copy.spark_frame.cov(
self._internal.data_spark_column_names[0],
other._internal.data_spark_column_names[0])
?
FYI: self.name
won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.
# if not on the same anchor calculate covariance manually | ||
return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should create a new DataFrame and use it, something like:
kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_period=min_period)
I haven't checked the code, so please modify as it works.
Btw, we should do this at the beginning of this method to avoid extra checks for length or something.
Any updates here ? Just confirming :) |
Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week. |
This PR proposes Series.cov