[BUG] Structured Dataset create with remote uri and read directly will fail #5954

Closed
2 tasks done
Future-Outlier opened this issue Nov 4, 2024 · 10 comments · Fixed by flyteorg/flytekit#2914
Labels: bug (Something isn't working), untriaged (This issue has not yet been looked at by the Maintainers)

Comments

@Future-Outlier
Member

Describe the bug

Creating a StructuredDataset from a remote uri and reading it directly in the same task fails. This case should work.

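import pandas as pd
from flytekit import task
from flytekit.types.structured import StructuredDataset

@task
def return_sd() -> StructuredDataset:
    # Point the dataset at a parquet file that already exists in remote storage.
    sd = StructuredDataset(uri="s3://my-s3-bucket/s3_flyte_dir/df.parquet", file_format="parquet")
    # Reading it back inside the same task is the step that fails.
    print("sd:", sd.open(pd.DataFrame).all())
    return sd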

Expected behavior

The task should read the DataFrame from the remote uri and return the StructuredDataset without error.

Additional context to reproduce

  1. Upload a parquet file to S3 storage first.
  2. According to @pingsutw, FlyteFile may have the same bug.

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@Future-Outlier added the bug and untriaged labels on Nov 4, 2024
@Future-Outlier changed the title from "[BUG] Structured Dataset create and read directly will fail" to "[BUG] Structured Dataset create with remote uri and read directly will fail" on Nov 4, 2024
@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Nov 5, 2024

Hi @Future-Outlier,

If I'm not mistaken, steps to reproduce the error are described as follows:

  1. Upload a simple parquet file to s3 storage:

Screenshot 2024-11-05 at 9 45 56 PM

  2. Run the task remotely by executing this command:
pyflyte run --remote reprod.py return_sd

Error Message

Screenshot 2024-11-05 at 9 51 59 PM

Does this match your experience? Thanks.

@Future-Outlier
Member Author

Yes, and it does not work in local execution either.

@Future-Outlier
Member Author

My suggestion is that we make local execution work first.

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Nov 5, 2024

For the local run, do you mean loading the parquet file from the local file system, instead of s3?

Some Observations

I observe that self.literal is always None, regardless of the input uri.

Screenshot 2024-11-05 at 10 52 37 PM

As suggested here, StructuredDataset should allow data loading by specifying uri. Am I correct?
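
The observation about self.literal can be illustrated with a toy model. The classes below are hypothetical and are not flytekit's implementation: `ToyLiteral`, `ToyDataset`, `read_strict`, `resolve_literal`, and `read_lazy` are names invented for this sketch. It only mimics an object whose reader depends on an internal literal that the constructor never fills in from uri, and it shows one possible remedy (deriving the literal from uri on first use), which may or may not match the actual fix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyLiteral:
    """Hypothetical stand-in for the serialized form a reader needs."""
    uri: str
    file_format: str


class ToyDataset:
    """Toy model only; NOT flytekit's StructuredDataset."""

    def __init__(self, uri: Optional[str] = None, file_format: str = "parquet"):
        self.uri = uri
        self.file_format = file_format
        # Mirrors the observation: literal stays None whatever uri is passed.
        self.literal: Optional[ToyLiteral] = None

    def read_strict(self) -> str:
        # Fails for user-constructed objects because literal was never set.
        if self.literal is None:
            raise ValueError("literal is None; nothing to read from")
        return f"read {self.literal.file_format} from {self.literal.uri}"

    def resolve_literal(self) -> ToyLiteral:
        # Assumed remedy: build the literal from uri the first time it is needed.
        if self.literal is None:
            if self.uri is None:
                raise ValueError("neither literal nor uri is set")
            self.literal = ToyLiteral(self.uri, self.file_format)
        return self.literal

    def read_lazy(self) -> str:
        lit = self.resolve_literal()
        return f"read {lit.file_format} from {lit.uri}"
```

In this sketch, `ToyDataset(uri=...).read_strict()` raises because nothing ever connected uri to the literal, while `read_lazy()` succeeds by resolving it on demand.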

@Future-Outlier
Member Author

I mean making my example above work in local execution:

  1. create a structured dataset from a parquet file from remote storage
  2. read it

@JiangJiaWei1103
Contributor

Okay, let me try it, thanks!

@JiangJiaWei1103
Contributor

JiangJiaWei1103 commented Nov 8, 2024

Follow-up Issues

After testing some cases, I observe two more errors:

  • Open StructuredDataset with df set to a pandas DataFrame
@task
def return_sd() -> StructuredDataset:
    sd = StructuredDataset(
        df=pd.DataFrame({
            "name": ["hanru", "jiawei"],
            "height": [190, 172]
        })
    )
    print("sd:", sd.open(pd.DataFrame).all())
    return sd
  • Open StructuredDataset with local uri
@task
def return_sd() -> StructuredDataset:
    sd = StructuredDataset(
        uri="./tmp/df.parquet",
        file_format="parquet"
    )
    print("sd:", sd.open(pd.DataFrame).all())
    return sd

Both cases are expected to work, and I think they are closely related to this issue. However, I will first fix the main error described in the original report.
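
The cases in this thread differ mainly in where the data lives: a remote uri such as s3://, a local path, or an in-memory DataFrame. As a rough sketch of how the two uri cases could be told apart, the helper below classifies a uri by its scheme. `is_remote_uri` is a hypothetical name for illustration and is not part of flytekit; treating a one-letter scheme as a Windows drive letter is an assumption of this sketch.

```python
from urllib.parse import urlparse


def is_remote_uri(uri: str) -> bool:
    """Return True when the uri names a non-local scheme such as s3://."""
    scheme = urlparse(uri).scheme
    # An empty scheme or file:// is local; a one-letter scheme is assumed
    # to be a Windows drive letter (for example "C:") rather than a protocol.
    return scheme not in ("", "file") and len(scheme) > 1
```

For the examples in this issue, `is_remote_uri("s3://my-s3-bucket/s3_flyte_dir/df.parquet")` is true and `is_remote_uri("./tmp/df.parquet")` is false.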

@Future-Outlier
Member Author

Very nice, let's push it this week. I will reach out to you.

@wild-endeavor
Contributor

Of the follow-up issues, I think the second one is more important than the first. The first is a bit odd, especially if the type of the dataframe is different (polars, etc.).

Yeah, to @pingsutw's point, this issue also affects files and folders. cc @eapolinario, since he's trying to spend some time on this. I think this is not so much a bug (though it is that too) as a failure in UX design.

@JiangJiaWei1103
Contributor

Hi @wild-endeavor,

Thanks for your reply. The newly merged PR focuses on solving the local run of StructuredDataset with uri only. I'm open to any further progress on related issues in the future!
