Implement Series.cov #1620

lopez- · 2020-06-30T05:48:38Z

This PR proposes Series.cov:

>>> s1 = ks.Series([1, 2, 3, 4])
>>> s2 = ks.Series([5, 6, 7, 8])
>>> s1
0    1
1    2
2    3
3    4
Name: 0, dtype: int64

>>> s2
0    5
1    6
2    7
3    8
Name: 0, dtype: int64

>>> s1.cov(s2)
1.666666...
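
For reference, the value in the example above is the unbiased sample covariance (divisor n - 1). Below is a minimal pure-Python sketch of that formula; `sample_cov` is a hypothetical helper written only for illustration and is not part of Koalas or pandas.

```python
from math import fsum


def sample_cov(xs, ys):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(xs)
    if n != len(ys):
        raise ValueError("sequences must have the same length")
    if n < 2:
        # Assumption: fewer than two observations yields NaN.
        return float("nan")
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    # Sum of products of deviations, divided by n - 1.
    return fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


result = sample_cov([1, 2, 3, 4], [5, 6, 7, 8])
print(result)  # sum of products is 5.0, divided by 3
```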

itholic · 2020-06-30T05:57:15Z

databricks/koalas/series.py

+        Parameters
+        ----------
+        other : Series
+        min_periods : int


Maybe min_periods can be optional too?

It falls back to zero when nothing is given for min_periods.

itholic · 2020-06-30T06:03:08Z

databricks/koalas/tests/test_ops_on_diff_frames.py

+        k_isnan = np.isnan(kser.cov(kser_other, 4))
+        p_isnan = np.isnan(pser.cov(pser_other, 4))
+        self.assert_eq(k_isnan, p_isnan)
+


Can we have a test where the two Series have different indexes, plus an exception case?

For example,

kser = ks.Series([90, 91, 85], index=[1, 2, 3])
pser = kser.to_pandas()
kser_other = ks.Series([90, 91, 85], index=[-1, -2, -3])
pser_other = kser_other.to_pandas()
self.assert_eq(kser.cov(kser_other), pser.cov(pser_other), almost=True)

and

self.assertRaises(ValueError, lambda: kser.cov([90, 91, 85]))  # 'other' must be a Series
self.assertRaises(ValueError, lambda: kser.cov(ks.Series([90])))  # series are not aligned

itholic · 2020-06-30T06:15:58Z

databricks/koalas/series.py

+            return self._internal.spark_frame.cov(self.name, other.name)
+        else:
+            # if not on the same anchor calculate covariance manually
+            return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)


len(self.index) is evaluated four times in this code.

What do you think about assigning it to a variable and reusing it?
(e.g. len_index = len(self.index) on the line above its first use)
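
The suggestion above can be sketched as follows, assuming each `len()` call on the index is costly (in Koalas it is expected to run a Spark job). `CountingIndex` and `too_short` are hypothetical names used only to make the number of length computations visible.

```python
class CountingIndex:
    """Hypothetical stand-in for an index whose length is expensive to compute."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.len_calls = 0

    def __len__(self):
        # Count every length request so the caller's behavior can be inspected.
        self.len_calls += 1
        return len(self.labels)


def too_short(index, min_periods):
    """Return True when there are too few rows for a covariance."""
    len_index = len(index)  # computed once, then reused below
    return len_index < min_periods or len_index <= 1


index = CountingIndex([0, 1, 2])
flag = too_short(index, 2)
print(flag, index.len_calls)
```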

itholic · 2020-06-30T06:27:42Z

Could you also add this to the docs?

It is placed at docs/source/reference/series.rst :)

itholic · 2020-06-30T06:28:03Z

Otherwise, looks fine to me.

Thanks, @lopez-

ueshin · 2020-06-30T21:15:51Z

databricks/koalas/series.py

+        if len(self.index) != len(other.index):
+            raise ValueError("series are not aligned")


Where does this check come from? pandas seems to work even when the two Series have different lengths.

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

itholic · 2020-07-02

Oops, I missed it. Thanks, @ueshin.

lopez- · 2020-07-20

Hmm, this is interesting. It seems pandas aligns the two series before computing the covariance. So, this:

>>> pd.Series([1, 2, 3, 4]).cov(pd.Series([5, 6]))
0.5

And this:

>>> pd.Series([1, 2]).cov(pd.Series([5, 6]))
0.5

are equivalent. I believe this kind of alignment is not supported in Koalas today. If this is a blocker, I could open an issue and wait until somebody implements it. Another option is to go ahead and accept slightly different behavior for this edge case while we wait for the align implementation. Do you have any thoughts or a preference on how to go about this, @itholic @ueshin?

ueshin · 2020-07-22

@lopez- Could you file the issue for align?
Also, is it possible for you to implement it?
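
The alignment behavior discussed in this thread can be sketched in pure Python. This is only an illustration under stated assumptions: a Series is modeled as a dict from index label to value, missing values are not modeled, and `aligned_cov` is a hypothetical helper rather than the pandas or Koalas implementation.

```python
from math import fsum, isnan, nan


def aligned_cov(left, right):
    """Sample covariance over the index labels present in both inputs."""
    # Alignment step: keep only labels that appear on both sides.
    common = [label for label in left if label in right]
    n = len(common)
    if n < 2:
        # Assumption: fewer than two aligned pairs yields NaN.
        return nan
    xs = [left[label] for label in common]
    ys = [right[label] for label in common]
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    return fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


right = {0: 5, 1: 6}
cov_long = aligned_cov({0: 1, 1: 2, 2: 3, 3: 4}, right)  # labels 2 and 3 are dropped
cov_short = aligned_cov({0: 1, 1: 2}, right)
cov_disjoint = aligned_cov({1: 90, 2: 91, 3: 85}, {-1: 90, -2: 91, -3: 85})
print(cov_long, cov_short, cov_disjoint)
```

Because only the shared labels 0 and 1 survive alignment, the four-element and two-element inputs give the same result, which is why a plain length comparison rejects inputs that pandas accepts.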

ueshin · 2020-06-30T21:21:31Z

databricks/koalas/tests/test_ops_on_diff_frames.py

+        kser = ks.Series([90, 91, 85])
+        pser = kser.to_pandas()
+        kser_other = ks.Series([90, 91, 85])
+        pser_other = kser_other.to_pandas()


Please define the pandas object first. to_pandas() invokes extra Spark jobs, which makes the tests slower.

pser = pd.Series([90, 91, 85])
kser = ks.from_pandas(pser)

ueshin · 2020-06-30T21:26:39Z

databricks/koalas/series.py

@@ -4858,6 +4858,54 @@ def mad(self):

        return mad

+    def cov(self, other: "Series", min_periods: Optional[int] = None) -> float:
+        """
+        Return the covariance between two series.


Shall we just copy the docstring from pandas, with a few modifications to the examples?

ueshin · 2020-06-30T21:27:50Z

databricks/koalas/series.py

+            raise ValueError("series are not aligned")
+
+        min_periods = 0 if min_periods is None else min_periods
+        if len(self.index) < min_periods or len(self.index) <= 1:


We should also compare len(self.index) with min_periods?

>>> pd.Series([1, 2]).cov(pd.Series([5, 6, 7, 8]), min_periods=3)
nan
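
One way to read the example above is that min_periods is compared against the number of aligned pairs, not against the length of either input. Here is a minimal sketch of that check, assuming a Series is modeled as a dict from index label to value and that a missing min_periods behaves like 1; `cov_with_min_periods` is a hypothetical helper, not the library code.

```python
from math import fsum, isnan, nan


def cov_with_min_periods(left, right, min_periods=None):
    """Sample covariance over shared labels, or NaN if too few pairs remain."""
    common = [label for label in left if label in right]
    n = len(common)
    required = 1 if min_periods is None else min_periods  # assumed default
    if n < required or n < 2:
        # Not enough aligned observations for a valid result.
        return nan
    xs = [left[label] for label in common]
    ys = [right[label] for label in common]
    mean_x = fsum(xs) / n
    mean_y = fsum(ys) / n
    return fsum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


short = {0: 1, 1: 2}
long = {0: 5, 1: 6, 2: 7, 3: 8}
strict = cov_with_min_periods(short, long, min_periods=3)  # only 2 aligned pairs
relaxed = cov_with_min_periods(short, long)
print(strict, relaxed)
```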

ueshin · 2020-06-30T21:36:30Z

databricks/koalas/series.py

+
+        if same_anchor(self, other):
+            # if the have the same anchor use the more performant Spark native `cov`
+            return self._internal.spark_frame.cov(self.name, other.name)


self._kdf._internal.resolved_copy.spark_frame.cov(
    self._internal.data_spark_column_names[0],
    other._internal.data_spark_column_names[0],
)

?

FYI: self.name won't always be the same as the underlying Spark DataFrame column name. See the description of #1554.

ueshin · 2020-06-30T21:42:45Z

databricks/koalas/series.py

+            # if not on the same anchor calculate covariance manually
+            return (self - self.mean()).dot(other - other.mean()) / (len(self.index) - 1)


Maybe we should create a new DataFrame and use it, something like:

kdf = self._kdf.copy()
tmp_column = verify_temp_column_name(kdf, '__tmp_column__')
kdf[tmp_column] = other
return kdf._kser_for(self._column_label).cov(kdf._kser_for(tmp_column), min_periods=min_periods)

I haven't checked the code, so please adjust it as needed to make it work.

By the way, we should do this at the beginning of the method to avoid the extra length checks.

itholic · 2021-01-11T09:43:20Z

Any updates here? Just confirming :)

xinrong-meng · 2021-08-03T23:06:32Z

Hi @lopez- , since Koalas has been ported to Spark as pandas API on Spark, would you like to migrate this PR to the Spark repository? Here is the ticket https://issues.apache.org/jira/browse/SPARK-36401. Otherwise, I may do that for you next week.

set up cov with tests

cd9c1be

itholic reviewed Jun 30, 2020

ueshin reviewed Jun 30, 2020

lopez- added 2 commits July 19, 2020 21:57

adapt docstring to pandas

efdee72

define pandas object first to avoid generating extra spark jobs

a8138c0

itholic Jul 2, 2020

lopez- Jul 20, 2020

ueshin Jul 22, 2020
