Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Synthetic difference in differences #2095

Merged
merged 51 commits into from
Jan 12, 2024
Merged

Conversation

memoryz
Copy link
Contributor

@memoryz memoryz commented Oct 12, 2023

What changes are proposed in this pull request?

This PR contains Spark implementation for 3 causal estimation methods: difference in differences, synthetic control and synthetic difference in differences.

Additional contributors to this PR:
@sarahshy @andrewnaber-msft

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change any dependencies?

  • No. You can skip this section.
  • Yes. Make sure the dependencies are resolved correctly, and list changes here.

Does this PR add a new feature? If so, have you added samples on website?

  • No. You can skip this section.
  • Yes. Make sure you have added samples following below steps.
  1. Find the corresponding markdown file for your new feature in website/docs/documentation folder.
    Make sure you choose the correct class estimators/transformers and namespace.
  2. Follow the pattern in markdown file and add another section for your new API, including pyspark, scala (and .NET potentially) samples.
  3. Make sure the DocTable points to correct API link.
  4. Navigate to website folder, and run yarn run start to make sure the website renders correctly.
  5. Don't forget to add <!--pytest-codeblocks:cont--> before each python code blocks to enable auto-tests for python samples.
  6. Make sure the WebsiteSamplesTests job pass in the pipeline.

@github-actions
Copy link

Hey @memoryz 👋!
Thank you so much for contributing to our repository 🙌.
Someone from SynapseML Team will be reviewing this pull request soon.

We use semantic commit messages to streamline the release process.
Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix.
This helps us to create release messages and credit you for your hard work!

Examples of commit messages with semantic prefixes:

  • fix: Fix LightGBM crashes with empty partitions
  • feat: Make HTTP on Spark back-offs configurable
  • docs: Update Spark Serving usage
  • build: Add codecov support
  • perf: improve LightGBM memory usage
  • refactor: make python code generation rely on classes
  • style: Remove nulls from CNTKModel
  • test: Add test coverage for CNTKModel

To test your commit locally, please follow our guild on building from source.
Check out the developer guide for additional guidance on testing your change.

@memoryz
Copy link
Contributor Author

memoryz commented Oct 26, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Signed-off-by: Jason Wang <[email protected]>
@memoryz
Copy link
Contributor Author

memoryz commented Oct 26, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Copy link
Contributor Author

memoryz commented Oct 26, 2023

/azp run

}

override def copy(extra: ParamMap): DiffInDiffModel = {
copyValues(new DiffInDiffModel(uid))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you dont seem to handle extra, you can use defaultcopy i think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I'm now passing extra to copyValues. The effect is same as defaultCopy.


trait HasUnitCol extends Params {
final val unitCol = new Param[String](this, "unitCol",
"Column that identifies the units in panel data")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you expand on this doc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

lossHistoryUnitWeights = Some(lossHistoryUnitWeights.toList)
)

copyValues(new DiffInDiffModel(this.uid))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you wrap in a copy?

Copy link
Contributor Author

@memoryz memoryz Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to copy the shared params. The DiffInDiffModel and SyntheticDiffInDiffEstimator share these two params: unitCol and timeCol. Wrapping in a copyValues will make sure the values of these two params are copied over to the DiffInDiffModel instance.

// Spark mode and breeze mode should get same loss history and same solution
// .setLocalSolverThreshold(1)

test("SyntheticDiffInDiffEstimator can estimate the treatment effect") {
Copy link
Collaborator

@mhamilton723 mhamilton723 Dec 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we test both spark and breeze fitting branches?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented the unit test for the linalg routine for both breeze mode and Spark mode, and since the MirrorDescent code is abstracted on top of linalg routines, I didn't implement the unit tests for the Spark fitting branch specifically. But I've just added a unit test for that with latest commit.

"name": "python"
},
"save_output": true,
"synapse_widget": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remove the output so it doesent blow up size of NB?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Collaborator

@mhamilton723 mhamilton723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few minor comments, thank you so much for this lovely contribution :)

@memoryz
Copy link
Contributor Author

memoryz commented Dec 18, 2023

/azp run

@memoryz memoryz enabled auto-merge (squash) December 18, 2023 21:22
@memoryz
Copy link
Contributor Author

memoryz commented Dec 18, 2023

@mhamilton723 I've addressed all the comments so far. Ready for review.

@mhamilton723
Copy link
Collaborator

/azp run

mhamilton723
mhamilton723 previously approved these changes Dec 22, 2023
@mhamilton723
Copy link
Collaborator

/azp run

2 similar comments
@mhamilton723
Copy link
Collaborator

/azp run

@mhamilton723
Copy link
Collaborator

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Copy link
Contributor Author

memoryz commented Jan 4, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@memoryz
Copy link
Contributor Author

memoryz commented Jan 8, 2024

@mhamilton723 can you merge this PR? The failing unit tests have nothing to do with this change.

@mhamilton723
Copy link
Collaborator

/azp run

@mhamilton723 mhamilton723 disabled auto-merge January 12, 2024 05:25
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mhamilton723 mhamilton723 merged commit cbc022c into microsoft:master Jan 12, 2024
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants