Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement Series.cov #1620

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

lopez-
Copy link

@lopez- lopez- commented Jun 30, 2020

This PR proposes Series.cov

>>> s1 = ks.Series([1, 2, 3, 4])
>>> s2 = ks.Series([5, 6, 7, 8])
>>> s1
0    1
1    2
2    3
3    4
Name: 0, dtype: int64

>>> s2
0    5
1    6
2    7
3    8
Name: 0, dtype: int64

>>> s1.cov(s2)
1.666666...

Parameters
----------
other : Series
min_periods : int
Copy link
Contributor

@itholic itholic Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe min_periods also can be an optional ?

  • because It will be a zero when nothing is given for min_periods

k_isnan = np.isnan(kser.cov(kser_other, 4))
p_isnan = np.isnan(pser.cov(pser_other, 4))
self.assert_eq(k_isnan, p_isnan)

Copy link
Contributor

@itholic itholic Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a test when each Series has a different index and an Exception case?

For example,

        kser = ks.Series([90, 91, 85], index=[1, 2, 3])
        pser = kser.to_pandas()
        kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
        pser_other = kser_other.to_pandas()

        self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)

and

    self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85]))  # 'other' must be a Series
    self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90])))  # series are not aligned

return self._internal.spark_frame.cov(self.name, other.name)
else:
# if not on the same anchor calculate covariance manually
return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)
Copy link
Contributor

@itholic itholic Jun 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(self.index) is performed four times in this code.

What do you think about we assign a proper variable and reuse it?
(ex. len_index = len(self.index) at the line above this variable is first used)

@itholic
Copy link
Contributor

itholic commented Jun 30, 2020

Could you add this to the docs also ??

It is placed at docs/source/reference/series.rst :)

@itholic
Copy link
Contributor

itholic commented Jun 30, 2020

Otherwise, looks fine to me.

Thanks, @lopez-

Comment on lines +4895 to +4896
if len(self.index) != len(other.index):
raise ValueError("series are not aligned")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this from? Seems like pandas works even with a different length of Series.

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, I missed it. Thanks, @ueshin .

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mmm this is interesting. Seems like pandas performs an alignment between the series before computing the covariance. So, this:

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

And this:

>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5

are equivalent... I believe this align is not supported in Koalas today. If this is a blocker I could open an issue and wait until somebody implements this. Another option I can think of is to go ahead and have a slightly different behavior for this edge case while we wait for the align implementation. Do you have any thoughts/preference on how to go about this @itholic @ueshin ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lopez- Could you file the issue for align?
Also, is it possible for you to implement it?

Comment on lines 952 to 955
kser = ks.Series([90, 91, 85])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85])
pser_other = kser_other.to_pandas()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please define pandas object first. to_pandas() invokes extra Spark jobs and it will take more time for tests.

pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)

@@ -4858,6 +4858,54 @@ def mad(self):

return mad

def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
"""
Return the covariance between two series.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we just copy the docstring from pandas' with a few modification of examples?

raise ValueError("series are not aligned")

min_periods = 0 if min_periods is None else min_periods
if len(self.index) < min_periods or len(self.index) <= 1:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also compare len(self.index) with min_periods?

>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan


if same_anchor(self, other):
# if the have the same anchor use the more performant Spark native `cov`
return self._internal.spark_frame.cov(self.name, other.name)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

self._kdf._internal.resolved_copy.spark_frame.cov(
    self._internal.data_spark_column_names[0],
    other._internal.data_spark_column_names[0])

?

FYI: self.name won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.

Comment on lines +4906 to +4907
# if not on the same anchor calculate covariance manually
return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should create a new DataFrame and use it, something like:

kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_period=min_period)

I haven't checked the code, so please modify as it works.

Btw, we should do this at the beginning of this method to avoid extra checks for length or something.

databricks/koalas/series.py Show resolved Hide resolved
@itholic
Copy link
Contributor

itholic commented Jan 11, 2021

Any updates here ? Just confirming :)

@xinrong-meng
Copy link
Contributor

Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants