Add top-level attribute to denote default phased/unphased status #2

Open
jeromekelleher opened this issue Nov 24, 2021 · 2 comments

@jeromekelleher
Contributor

Currently we assume genotypes are unphased if the phased marker isn't present. However, I'd imagine it's pretty common for all genotypes in a dataset to be either phased or unphased, so requiring the extra storage in the all-phased case seems excessive. We also don't want to have to scan everything just to see whether the data is all phased.

So, how about we have a top-level field which tells us the default phased-ness?

@tomwhite
Collaborator

This would be useful to add.

Would specifying this attribute be required to indicate default phased-ness, or would it still be permitted to have an array full of true values (all phased) or false values (all unphased)? In terms of implementation, when converting a VCF file we don't in general know up front whether it is phased, so we'd have to generate the phased array and then throw it away if all entries turned out to be true, or all false.

Would it be an error to specify both the attribute and the array?
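
For illustration, the "generate, then throw away" step could look roughly like the sketch below. This is only a sketch under assumptions: it uses NumPy, and the function name `summarise_phasing` and the idea of returning a default alongside an optional array are hypothetical, not anything the spec currently defines.

```python
import numpy as np


def summarise_phasing(phased):
    """Collapse a per-call phased array into a default where possible.

    Returns a pair (default, array):
      - (True, None)   if every call is phased
      - (False, None)  if every call is unphased
      - (None, array)  if the data is mixed (or empty), so the array must be kept
    Hypothetical helper, for discussion only.
    """
    phased = np.asarray(phased, dtype=bool)
    if phased.size == 0:
        # Nothing to summarise; keep the (empty) array as-is.
        return None, phased
    if phased.all():
        return True, None
    if not phased.any():
        return False, None
    return None, phased
```

Note that the full array still has to be built and scanned once during conversion, so the saving would be in storage only, not in conversion work.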

@jeromekelleher
Contributor Author

Hmm, it is tricky all right. I guess in retrospect the actual amount of storage required for an array of all 0s or all 1s is going to be pretty small, so perhaps it's not worth worrying about this. If we start summarising this at the file level, then why not summarise a bunch of other things?
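
A quick way to sanity-check the storage point is to compress a constant boolean array. The sketch below uses `zlib` from the standard library purely as a stand-in; the compressor actually used by the storage format, and the array shape chosen here, are assumptions.

```python
import zlib

import numpy as np


def compressed_size(arr):
    """Size in bytes of the array's raw buffer after zlib compression."""
    return len(zlib.compress(arr.tobytes()))


# Hypothetical shape: 10,000 variants x 100 samples, every call phased.
all_phased = np.ones((10_000, 100), dtype=np.bool_)

raw_bytes = all_phased.nbytes  # one byte per call before compression
stored_bytes = compressed_size(all_phased)

print(f"raw: {raw_bytes} bytes, compressed: {stored_bytes} bytes")
```

A constant array like this shrinks to a tiny fraction of its raw size, which supports the view that a separate top-level attribute would save very little.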
