Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF added sample has undefined format fields #282

Open
noporpoise opened this issue Oct 1, 2015 · 2 comments
Open

VCF added sample has undefined format fields #282

noporpoise opened this issue Oct 1, 2015 · 2 comments
Assignees

Comments

@noporpoise
Copy link
Contributor

When adding a new sample to a VCF, format fields for the new sample seem to take on undefined values rather than the "missing" value. What is the best way to set all format fields to missing for a new sample?

I'm adding a sample and format field to a VCF with the following code:

// open input / output
htsFile *vin = hts_open(vcf_path, "r");
if(vin == NULL) die("Cannot read %s: %s", vcf_path, strerror(errno));
htsFile *vout = hts_open(out_path, "w");
if(vout == NULL) die("Cannot write %s: %s", out_path, strerror(errno));

// read/write headers
bcf_hdr_t *hdrin = bcf_hdr_read(vin);
bcf_hdr_t *hdrout = bcf_hdr_dup(hdrin);
bcf_hdr_append(hdrout, "##FORMAT=<ID=COV,Number=G,Type=Float,Description=\"Genotype?\">\n");
bcf_hdr_add_sample(hdrout, "NewSample");

if(bcf_hdr_write(vout, hdrout) != 0) die("Cannot write header");

// read vcf entries
bcf1_t *v = bcf_init();

size_t i, nsamples = bcf_hdr_nsamples(hdrout);
float *cov = calloc(2 * nsamples, sizeof(float));

while(bcf_read(vin, hdrin, v) >= 0)
{
  // Unpack ref,alt,filter,info info
  bcf_unpack(v, BCF_UN_ALL);

  // Set cov to something
  for(i = 0; i < 2*nsamples; i++) cov[i] = i;

  // Update cov
  bcf_update_format_float(hdrout, v, "COV", cov, 2*nsamples);

  // Print
  if(bcf_write(vout, hdrout, v) != 0) die("Cannot write record");
}

free(cov);

bcf_destroy(v);
bcf_hdr_destroy(hdrin);
bcf_hdr_destroy(hdrout);
hts_close(vin);
hts_close(vout);

Input VCF:

##fileformat=VCFv4.1
##fileDate=20150920
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=K29,Number=0,Type=Flag,Description="Found at k=29">
##FORMAT=<ID=HI,Number=G,Type=Float,Description="field">
##reference=ref/ref.fa
##contig=<ID=ref,length=599>
##contig=<ID=2,length=11>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ben jerry
ref 77  id0 AC  A   .   PASS    K29 HI  1.1,2.2 .,.
ref 125 id1 C   CA  .   PASS    K29 HI  2.2,3.3 4.4,5.5

Output VCF which seems to have weird values in the pre-existing format field HI for NewSample:

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20150920
##INFO=<ID=K29,Number=0,Type=Flag,Description="Found at k=29">
##FORMAT=<ID=HI,Number=G,Type=Float,Description="field">
##reference=ref/ref.fa
##contig=<ID=ref,length=599>
##contig=<ID=2,length=11>
##FORMAT=<ID=COV,Number=G,Type=Float,Description="Genotype?">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ben jerry   NewSample
ref 77  id0 AC  A   .   PASS    K29 HI:COV  1.1,2.2:0,1 .,.:2,3 0,3.58732e-43:4,5
ref 125 id1 C   CA  .   PASS    K29 HI:COV  2.2,3.3:0,1 4.4,5.5:2,3 0,3.58732e-43:4,5

I can't see any easy way to set all format fields to missing value for a new sample. Is there a simple way? It seems the fastest way would be from within htslib. A function in vcf.h would be useful, I can have a go at writing one if it's needed.

@mp15
Copy link
Member

mp15 commented Mar 15, 2016

I think the problem here is that:

  • You're reading using bcf_read(vin, hdrin, v) with record v and hdrin as your header.
  • You're writing with bcf_write(vout, hdrout, v) with the same record v and hdrout as your header.

Therefore there's nothing that tells the API that record v, is setup to correspond to the new changed header. I am considering writing a function bcf_translate_hdr with arguments old_hdr, new_hdr, and record to do the translation from old header to new header. Will that do for you?

@noporpoise
Copy link
Contributor Author

That sounds like it would be very useful. What is the correct way to add a sample to a VCF and print it out currently? Should I be editing the input header and not using an output header?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants