Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add stdin_bytes to skip encoding #915

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

Dreamsorcerer
Copy link
Contributor

Fixes #818.

@Dreamsorcerer Dreamsorcerer changed the title All stdin_bytes to skip encoding Add stdin_bytes to skip encoding Feb 12, 2023
@Dreamsorcerer
Copy link
Contributor Author

@bitprophet Not sure what you pushed, but seems to have added a bunch of unrelated changes to the diff.

@Dreamsorcerer
Copy link
Contributor Author

I should still have the original branch, if you just want me to force push it back?

@Dreamsorcerer
Copy link
Contributor Author

@bitprophet ?

@bitprophet
Copy link
Member

bitprophet commented May 12, 2023

Yea looks like the typing updates made github lose its mind. if you can rebase and force-push that'd probably be ideal - you'll probably want to make sure you copy over any type hints from main to your branch's diff, if there even are any.


From skimming what looks like the two commits actually relevant: my off the cuff thoughts:

  • if we go this route we may want to rename the option to stdin_is_bytes=True or maybe decode_stdin=False
  • however I'm wondering if there's anything cleaner/smarter we can do within the "number of bytes to read" helper instead (per your original 'fix' for this elsewhere)
  • or, since the "it's binary, please do not do any edge encoding/decoding" approach is arguably more correct, is there anything smarter we can do to autodetect?
    • are there approaches other tools use when handling unknown stdin?
    • eg "attempt to decode some reasonable first few bytes as if it was $configured_encoding, and if this fails, fallback to assuming binary"?

@Dreamsorcerer
Copy link
Contributor Author

Ah, it looks like I made the changes through Github, not locally. I'll try cherry picking to a new branch.

@Dreamsorcerer
Copy link
Contributor Author

Dreamsorcerer commented May 12, 2023

* however I'm wondering if there's anything cleaner/smarter we can do within the "number of bytes to read" helper instead (per your original 'fix' for this elsewhere)

That's a different issue, so not sure that function has anything to do with this one.

* or, since the "it's binary, please do not do any edge encoding/decoding" approach is arguably more correct, is there anything smarter we can do to autodetect?
  
  * are there approaches other tools use when handling unknown stdin?
  * eg "attempt to decode some reasonable first few bytes as if it was $configured_encoding, and if this fails, fallback to assuming binary"?

To be honest, I never fully understood the reason for decoding to a str and back again. So, you'll probably just have to make a decision on this one.

The only related thing I can think of is charset-normalizer, which can guess the encoding used. I'm still not sure I'd use it to try and detect if something is binary or not though. It's always going to be an estimate, so you'll likely still have edge cases that cause problems on occasion.

@Dreamsorcerer
Copy link
Contributor Author

Let me know if you want to continue this way, and I'll sort out the typing. It'll be a little tricky now, as the data being passed around is str | bytes and we're detecting it based on an option, rather than an isinstance() check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Encoding/decoding binary stream breaks application
2 participants