feat: devcontainer development environment #487

Draft: wants to merge 44 commits into base branch dev

Conversation

d0choa
Collaborator

@d0choa d0choa commented Feb 12, 2024

This PR makes it possible to set up a development environment using the devcontainer strategy. More info...

A few working functionalities:

  • start the development environment by just clicking a button/badge (in docker volume)
  • start the development environment from the local project directory (by mounting the directory)
  • working testing environment
  • working local deployment of the documentation
  • devcontainer added to our documentation
  • Spark UI forwarded
  • github-cli and git credentials within devcontainer (mounted from local)
  • google-cloud-cli and gcp credentials within devcontainer (mounted from local)
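As a rough illustration of how the items above map onto a devcontainer configuration, here is a minimal sketch that builds the equivalent of a `devcontainer.json` as a Python dict. It is not the configuration shipped in this PR. The property names (`image`, `features`, `forwardPorts`, `mounts`) come from the devcontainer specification, and 4040 is the default Spark UI port; the base image, the container name, the `/home/vscode` home directory, and the helper name `build_devcontainer_config` are assumptions made for the example.

```python
import json


def build_devcontainer_config(home="${localEnv:HOME}", remote_home="/home/vscode"):
    """Return a dict mirroring a devcontainer.json (illustrative, not the PR's file)."""
    return {
        # Hypothetical container name.
        "name": "dev-environment",
        # Assumed base image; the PR may well use a different one.
        "image": "mcr.microsoft.com/devcontainers/python:3.10",
        # github-cli installed as a devcontainer feature.
        "features": {"ghcr.io/devcontainers/features/github-cli:1": {}},
        # Spark UI listens on port 4040 by default.
        "forwardPorts": [4040],
        # Bind-mount git and gcloud credentials from the local machine.
        "mounts": [
            f"source={home}/.gitconfig,target={remote_home}/.gitconfig,type=bind",
            f"source={home}/.config/gcloud,target={remote_home}/.config/gcloud,type=bind",
        ],
    }


if __name__ == "__main__":
    # Print the config as JSON, ready to be saved as .devcontainer/devcontainer.json.
    print(json.dumps(build_devcontainer_config(), indent=2))
```

The `${localEnv:HOME}` placeholder is resolved by the devcontainer tooling on the host, so the same file can work for different users without hard-coded paths.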

Here are a few things I haven't fully tested or might be material for future work:

  • Run devcontainer inside dataproc cluster. Hint
  • Spawning Airflow from devcontainer ?
  • Separate 3 environments: devcontainer, ci, production Hint1 | Hint2
  • welcome message Hint

Feedback welcome

@d0choa d0choa marked this pull request as ready for review February 12, 2024 15:37
@d0choa
Collaborator Author

d0choa commented Feb 12, 2024

@ireneisdoomed and @DSuveges, I tagged you both here but only in case you have the time.

I think this doesn't clash with any current functionality and it's optional so it's not a big priority.

The thing to look at could be to start a devcontainer yourselves (whenever you have something to do) and confirm things are mostly working for you. You can look at the requested changes in the documentation for more info.

@d0choa
Collaborator Author

d0choa commented Feb 12, 2024

I managed to install the devcontainer in dataproc.

The only extra requirements were:

  • To activate docker when creating the cluster. Info
  • It needs to have the github credentials (~/.gitconfig). What I did was to install the github CLI and run gh auth on the remote machine before starting the container

Screenshot 2024-02-12 at 16 33 58

Update:
It looks like further configuration is required to make the cluster available to the devcontainer:

Exception in thread "main" org.apache.spark.SparkException: When running with master 'yarn' either HADOOP_CONF_DIR or YARN_CONF_DIR must be set in the environment.
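To make the error above concrete: Spark refuses to start with master 'yarn' unless HADOOP_CONF_DIR or YARN_CONF_DIR is set, and a container does not inherit these from the cluster host by default. The sketch below checks for the two variables before picking a master. The helper names and the fallback to `local[*]` are assumptions for illustration, not something this PR implements.

```python
import os

# The two variables named in the Spark error message.
YARN_CONF_VARS = ("HADOOP_CONF_DIR", "YARN_CONF_DIR")


def yarn_conf_available(env=None):
    """Return True if at least one of the YARN config variables is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(var) for var in YARN_CONF_VARS)


def choose_spark_master(env=None):
    """Pick 'yarn' only when its configuration is visible; otherwise fall back to local mode.

    The local[*] fallback is an assumption of this sketch.
    """
    return "yarn" if yarn_conf_available(env) else "local[*]"


if __name__ == "__main__":
    print(f"Spark master for this environment: {choose_spark_master()}")
```

Inside a devcontainer started on the cluster, this check would report local mode unless the Hadoop configuration directory is also mounted into the container and the variable is exported there.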

Update 2:

After thinking about this for a little longer, I realised it doesn't make much sense. The point of using dataproc is to use the dataproc environment. By using a devcontainer you isolate the development from the Dataproc capabilities. There is a longer route, which is creating a dataproc custom image. I think it's a different type of endeavour to the one attempted here.

@DSuveges
Contributor

> After thinking about this for a little longer, I realised it doesn't make much sense. The point of using dataproc is to use the dataproc environment. By using a devcontainer you isolate the development from the Dataproc capabilities. There is a longer route, which is creating a dataproc custom image. I think it's a different type of endeavour to the one attempted here.

So is this going to be merged? To me it seems having dev containers won't provide much benefit compared to using a poetry-managed environment when working locally. And dataproc-based remote work won't be facilitated by the current implementation.

@d0choa
Collaborator Author

d0choa commented Feb 13, 2024

We have people in the team without a properly set-up environment. The entry point to start developing still requires a decent amount of investment: pyenv, poetry, gcp, github, pre-commits, testing, etc. Quite a few things tend to behave differently when different people use different environments. Having a single shared environment could help us minimise issues of the type "it works on my machine".

Because it's optional, if there is not enough adoption it defeats the purpose. IMO the benefits outweigh the risks, but we won't know unless we test in real-world scenarios.

@github-actions github-actions bot added the documentation (Improvements or additions to documentation), size-M, and Feature labels Apr 29, 2024
@d0choa d0choa marked this pull request as draft April 29, 2024 19:46
@addramir
Contributor

@d0choa what is the status of this PR? Is it still relevant?

@d0choa
Collaborator Author

d0choa commented Sep 25, 2024

I wouldn't close it. It's not a priority, but it has a lot of good things we could eventually rescue

Labels
documentation Improvements or additions to documentation Feature size-M

3 participants