Deduplication reporting update #187

agalitsyna · 2023-10-16T23:07:47Z

Added a --count-dups reporting mode for the scipy and scikit-learn backends that reports the number of duplicates. The count is slightly inflated because the parent pair is counted as well, but that seems to be the most sensible number to report.
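To illustrate what "the parent is counted as well" means, here is a minimal sketch in plain Python. It does not reproduce the pairtools implementation; the cluster labels are made-up illustration data, and the variable names are assumptions.

```python
from collections import Counter

# Hypothetical cluster labels: pairs sharing a label are duplicates of each other.
cluster_ids = [0, 0, 0, 1, 2, 2]

# Size of each cluster, with the parent pair included.
cluster_sizes = Counter(cluster_ids)

# Per-pair count with the parent included, and the alternative
# convention that excludes the parent.
n_with_parent = [cluster_sizes[c] for c in cluster_ids]
n_without_parent = [n - 1 for n in n_with_parent]

print(n_with_parent)     # [3, 3, 3, 1, 2, 2]
print(n_without_parent)  # [2, 2, 2, 0, 1, 1]
```

Under the parent-inclusive convention a pair with no duplicates would get a count of 1 rather than 0, which is presumably the "unfairness" mentioned above.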

Phlya · 2023-10-18T10:04:26Z

pairtools/cli/dedup.py

+    is_flag=True,
+    default=False,
+    help="Add column to the output pairs with the number of duplicates. "
+    "Compatible with sklearn and scipy backends only. "


Phlya · 2023-10-18T10:04:47Z

pairtools/cli/dedup.py

+    default=False,
+    help="Add column to the output pairs with the number of duplicates. "
+    "Compatible with sklearn and scipy backends only. "
+    "Is not counted by default. [output dedup pairs format option]",


Maybe "Off by default"?

Phlya · 2023-10-18T10:04:58Z

pairtools/cli/dedup.py

@@ -388,6 +398,11 @@ def dedup_py(
    send_header_to_dedup = send_header_to in ["both", "dedup"]
    send_header_to_dup = send_header_to in ["both", "dups"]

+    count_dups = kwargs.get("count_dups", False)
+    if backend=="cython" and count_dups:
+        logger.warning("Not counting number of duplicates with Cython backend.")
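
The hunk above only shows the warning; whether the flag is also reset afterwards is not visible here. One possible way to make the fallback explicit is sketched below. `resolve_count_dups` is a hypothetical helper, not part of pairtools, and the choice to return `False` for the Cython backend is an assumption.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_count_dups(backend: str, count_dups: bool) -> bool:
    """Hypothetical helper: return whether duplicate counting is effectively on."""
    if backend == "cython" and count_dups:
        # Warn and fall back, so later code cannot expect a count column
        # that this backend does not produce.
        logger.warning("Not counting number of duplicates with Cython backend.")
        return False
    return count_dups
```

With this shape, `resolve_count_dups("scipy", True)` stays `True`, while `resolve_count_dups("cython", True)` warns and returns `False`.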


Phlya · 2024-01-03T15:09:49Z

Can you add tests for this feature?

golobor

looks good!

golobor · 2024-03-09T15:04:22Z

pairtools/lib/dedup.py

@@ -221,8 +279,16 @@ def _dedup_chunk(
    df = df.reset_index()  # Remove the index temporarily

    # Set up columns to store the dedup info:
-    df["clusterid"] = np.nan
-    df["duplicate"] = False
+    df.loc[:, "clusterid"] = np.nan


@agalitsyna, tests fail at this line with the error message "ValueError: cannot set a frame with no defined index and a scalar".
Googling suggests that this error arises when one tries to index an empty frame: https://stackoverflow.com/questions/48306694/valueerror-cannot-set-a-frame-with-no-defined-index-and-a-value-that-cannot-be

ping @agalitsyna

Maybe df.assign? What's your favorite way of filling in columns in a dataframe?
https://stackoverflow.com/questions/34811971/how-do-i-fill-a-column-with-one-value-in-pandas
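
A minimal sketch of the `df.assign` option on an empty chunk, assuming the failing case is a frame with columns but no rows. The column names `chrom1` and `pos1` are placeholders for illustration, not the actual schema used in `_dedup_chunk`.

```python
import numpy as np
import pandas as pd

# An empty chunk: columns are defined but there are no rows.
# reset_index() mirrors the step that precedes the failing line.
df = pd.DataFrame({"chrom1": [], "pos1": []}).reset_index()

# assign() returns a new frame and accepts scalar values even with zero rows.
df = df.assign(clusterid=np.nan, duplicate=False)

print(list(df.columns))  # ['index', 'chrom1', 'pos1', 'clusterid', 'duplicate']
print(len(df))           # 0
```

Plain `df["clusterid"] = np.nan`, as in the lines removed by the diff, should behave the same way on an empty frame; `assign` just sets both columns in a single call.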

Deduplication reporting update

08d3a90

agalitsyna requested a review from Phlya October 16, 2023 23:07

Phlya reviewed Oct 18, 2023

golobor approved these changes Mar 9, 2024

Merge branch 'master' into dedup_update

2fb3690

golobor reviewed Mar 9, 2024

agalitsyna commented Oct 16, 2023

Phlya Oct 18, 2023

Phlya Oct 18, 2023

Phlya Oct 18, 2023

Phlya commented Jan 3, 2024

golobor left a comment

golobor Mar 9, 2024

golobor Mar 16, 2024

agalitsyna Apr 25, 2024