Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Peak calling with ATAC-seq data #8

Open
RenzoTale88 opened this issue Dec 11, 2019 · 26 comments
Open

Peak calling with ATAC-seq data #8

RenzoTale88 opened this issue Dec 11, 2019 · 26 comments

Comments

@RenzoTale88
Copy link

Hello,
I've got a question. I would like to try your software to perform peak calling, but instead of using Chip-seq data I woul like to use it on ATAC-seq data.
Do you think it is possible? If so, are there any settings that should be used?
Thank you in advance,

Andrea

@ivargr
Copy link
Member

ivargr commented Dec 12, 2019

Hi!

We have en experimental branch that supports this. It is not very well tested, but you could give it a try.

You will need to install Graph Peak Caller by cloning this repository, checking out the branch called dev-atac-seq-support and running pip3 install . in the graph peak caller directory.

Then you should be able to run graph_peak_caller callpeaks with -a True to tell Graph Peak Caller that it is dealing with ATAC-seq reads. When -a is set to True, each read is extended in both directions into a total fragment with the length specified by --fragment_length, so you will also need to set --fragment_length to some number depending on your data/smoothing. The --fragment_length parameter corresponds to MACS2's --extsize if you are familiar with that.

Let me know if you run into some problems!

@RenzoTale88
Copy link
Author

Thank you for your quick reply. Do you mind giving some more specific instruction on how to check out the branch? I've tried to simple run git checkout dev-atac-seq-support within the graph_peak_caller folder, but it didn't work.

@ivargr
Copy link
Member

ivargr commented Dec 12, 2019

Sure, try doing a git fetch origin first.

The following should work:

git fetch origin
git checkout dev-atac-seq-support
pip3 install -e .

If you already have Graph Peak Caller installed, you may need to uninstall it first with pip3 uninstall graph_peak_caller.

@RenzoTale88
Copy link
Author

So, when I try to run the checkout command I get this error:

error: Your local changes to the following files would be overwritten by checkout:
        setup.py
Please, commit your changes or stash them before you can switch branches.
Aborting

However, it seems to be working if I force the checkout (git checkout -f dev-atac-seq-support). The flag is present in the help, so I'll test it.

Which method would you recommend to estimate the fragment length?

Thanks again for the support,
Andrea

@ivargr
Copy link
Member

ivargr commented Dec 12, 2019

That warning means that you have made som changes to some of the files, but forcing checkout should work fine (as you saw), as long as you are not afraid of losing your changes.

I am not very familiar with ATAC-seq data/peak calling, and I think there are no correct answer for what fragment length should be used, but in this thread it seems that people are using either 73, 150 or 200 as fragment length.

@RenzoTale88
Copy link
Author

So, I've tried to proceed as suggested in the guide. Aligned the reads to my graph, filtered them with the parameters provided, converted to json and split them by chromosome.
In the meanwhile I've created from my single chromosome vg files the references for the analysis as referred in the wiki:

for chr in {1..N}; do
    vg view -Vj CHR${chr}.vg > chr.$chr.json
    graph_peak_caller create_ob_graph chr.$chr.json
    vg stats -r CHR${chr}.vg  | awk '{print $2}' > node_range_chr.$chr.txt
    graph_peak_caller find_linear_path -g chr.$chr.nobg chr.$chr.json chr.$chr chr.${chr}_linear_pathv2.interval
done

This process generates the files *.json, *.nobg, *.sequences, *.sequencesv2, linear_pathv2.interval and node_range for every chromosome.
However, when I try to run callpeaks I get the following error:

2019-12-13 07:22:40,892, INFO: Sample files: ['FILTER/sample1_ATAC/sample1_ATAC_filtered_chr.1.json']
2019-12-13 07:22:40,892, INFO: Using graphs: ['LOCALGRAPH/chr.1.nobg'] 
2019-12-13 07:22:40,893, INFO: Will use sequence graphs. ['LOCALGRAPH/chr.1.nobg.sequences']
2019-12-13 07:22:40,893, INFO: Using graphs from data directory LOCALGRAPH
2019-12-13 07:22:40,893, INFO: Will use [''] as extra experiments names for each run, based on graph file names.If only running on single graph, this should be empty. 
2019-12-13 07:22:40,893, WARNING: Did not find linear map for  for graph LOCALGRAPH/chr.1.nobg. Will create.
2019-12-13 07:22:40,893, WARNING: Did not find linear map for  for graph LOCALGRAPH/chr.1.nobg. Will create.
2019-12-13 07:22:42,605, INFO: Getting topologically sorted nodes
2019-12-13 07:22:42,605, INFO: 0 nodes processed
Traceback (most recent call last):
  File "/PATH/TO/MY/FOLDER/Andrea/myanaconda/graph_peak_caller/bin/graph_peak_caller", line 11, in <module>
    load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')()
  File "/PATH/TO/MY/FOLDER/Andrea/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main
    run_argument_parser(sys.argv[1:])
  File "/PATH/TO/MY/FOLDER/Andrea/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser
    args.func(args)
  File "/PATH/TO/MY/FOLDER/Andrea/graph_peak_caller/graph_peak_caller/callpeaks_interface.py", line 182, in run_callpeaks2
    create_linear_map(graph, linear_map_name)
  File "/PATH/TO/MY/FOLDER/Andrea/graph_peak_caller/graph_peak_caller/util.py", line 6, in create_linear_map
    linear_map = LinearMap.from_graph(ob_graph)
  File "/PATH/TO/MY/FOLDER/Andrea/graph_peak_caller/graph_peak_caller/control/linearmap.py", line 104, in from_graph
    node_ids = list(graph.get_topological_sorted_node_ids())
  File "/PATH/TO/MY/FOLDER/Andrea/myanaconda/graph_peak_caller/lib/python3.7/site-packages/offsetbasedgraph/graph.py", line 539, in get_topological_sorted_node_ids
    assert not unfinished, unfinished
AssertionError: {9238458: 1}

In addition to that, I'd like to signal a minor bug in a function, that tries to sum a list and a string causing an error. The problem is at line 101 of file preprocess_interface.py. I've changed the line from:
out_files = {chrom: open(reads_base_name + "_" + chrom + ".json", "w")
to
out_files = {chrom: open('.'.join(reads_base_name) + "_" + chrom + ".json", "w")

Thank you again for your help,

Andrea

@ivargr
Copy link
Member

ivargr commented Dec 13, 2019

Seems like something is wrong (either with Graph Peak Caller or your graphs). Thanks for sharing the details!

Would you be able to share either the *.nobg files or the *.vg files with me? Then I can try to figure out what is wrong.

@RenzoTale88
Copy link
Author

Hi,
sorry for the late reply. Do you mind if I share it through email or some cloud services?
Thank you in advance
Andrea

@ivargr
Copy link
Member

ivargr commented Dec 16, 2019

Sure, that's fine. You can email them to [email protected] or upload them somewhere and share the link

@RenzoTale88
Copy link
Author

Hello, just wanted to ask whether there are some progresses with this. Thank you in advance,
Andrea

@ivargr
Copy link
Member

ivargr commented Jan 10, 2020

Really sorry for the delay, I've recieved the graph you sent me and will try to reproduce the error as soon as possible, hopefully during the weekend.

@RenzoTale88
Copy link
Author

No need to apologise, thank you very much for your support! :)

Andrea

@ivargr
Copy link
Member

ivargr commented Jan 13, 2020

Hi!

I've managed to have a look at the graph you sent me, and I think maybe the problem is that not all nodes in your graph are connected. For instance, both node 135833768 and node 135834098 have no edges going out.

How did you make this graph? Did you make it from a vcf and a linear reference using vg construct -r ref.fa -v variants.vcf.gz?

@RenzoTale88
Copy link
Author

Hi!
So, the graph was generated using CACTUS, then hal2vg and finally indexed through vg version 1.20.0.
You can find a better explanation in this github issue:

vgteam/sv-genotyping-paper#6

@ivargr
Copy link
Member

ivargr commented Jan 13, 2020

I see! Graph Peak Caller really only works with graphs that are directed, connected and non-cyclic. I'm not sure why Cactus would create a non-connected graph. Is the graph you sent me a single graph with multiple chromosomes/scaffolds or is it only one chromosome?

@RenzoTale88
Copy link
Author

No the graph is multiple chromosomes, joined with the vg ids command and then indexed later on.
This was necessary because I've processed each chromosome separately, then converted alignments for each chromosome into a vg graph, renamed the ids to be on the same space and finally create the indexes.

@ivargr
Copy link
Member

ivargr commented Jan 14, 2020

Okay, I understand. Graph peak caller can only call peaks on a graph representing a single chromosome, so the way to run graph peak caller on multiple chromosomes is to run it on each chromosome graph separately, as described from Step 6 here: https://github.com/uio-bmi/graph_peak_caller/wiki/Graph-based-ChIP-seq-tutorial

If you've at one point in you graph creation pipeline had one graph per chromosome, this approach should work as long as they have converted node id space that match the joint graph that was used to make the index. Usually with vg, you would make one graph for each chromosome first, then run vg ids graph1.vg graph2.vg graph3.vg ... to convert the ids of all your graphs to be able to send all those graphs into vg index. If this is what you have done, you can convert each of these graphs to json and run graph peak caller on each graph as in the guide I linked to above. I'm happy to try to help you to get this to work if you have any issues.

This, however, still requires that each and one of you chromosome graphs is directed, connected and acyclic.

@RenzoTale88
Copy link
Author

Hello again,
I've amde a try using the graphs as you suggested. I've followed the guide to create the index and everything seems to have worked fine. I've also managed to split the filtered reads into chromosome-wise subsets to process separately. However, when I try to calculate the shift I get the following error:

2020-01-16 15:58:40,261, WARNING: Too few paired peaks (0) to get good shift estimate
2020-01-16 15:58:40,261, WARNING: Too few paired peaks (0) to get good shift estimate
Traceback (most recent call last):
  File "/PATH/TO/myanaconda/DataPy3/bin/graph_peak_caller", line 11, in <module>
    sys.exit(main())
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/command_line_interface.py", line 36, in main
    run_argument_parser(sys.argv[1:])
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/command_line_interface.py", line 673, in run_argument_parser
    args.func(args)
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/preprocess_interface.py", line 37, in shift_estimation
    d = estimator.get_estimates()
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/shiftestimation/shift_estimation_multigraph.py", line 18, in get_estimates
    peakmodel = PeakModel(opt, treatment)
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/shiftestimation/shiftestimation.py", line 81, in __init__
    self.build()
  File "/PATH/TO/myanaconda/DataPy3/lib/python3.5/site-packages/graph_peak_caller/shiftestimation/shiftestimation.py", line 112, in build
    raise NotEnoughPairsException("Not enough pairs to build model. "
graph_peak_caller.shiftestimation.shiftestimation.NotEnoughPairsException: 'Not enough pairs to build model. Try a broader range by setting -m lower and/or -M higher.'

The software asks me to change parameters with -m or -M, but I cannot find these settings anywhere in help of the tool. I've used the filtered reads to do this, but maybe I should've used the raw reads. Do you think it's this? How should I proceed?

Thanks again for your help!

@ivargr
Copy link
Member

ivargr commented Jan 17, 2020

It seems like you are on the right direction now.

I think you get this error because you don't specify --fragment-length on graph_peak_caller callpeaks. Try specifying -a True --fragment_length SOME_NUMBER like I described in my first answer. The -a True tells Graph Peak Caller to treat this as ATAC-seq data, and --fragment-length will be your smoothing window. If you specify these, Graph Peak Caller will not try to estimate the --fragment length, and you will not get that error.

Let me know if you face any problems.

@RenzoTale88
Copy link
Author

I've tried to run the callpeaks on one chromosome as you suggested and following the guidelines on the website. This time I'm getting this error:

[ATAC]$ graph_peak_caller callpeaks -a True -g LOCALGRAPH/genome1.1.nobg -s FILTER/Sample1_Bcell_ATAC/Sample1_Bcell_ATAC_filtered_genome1.1.json -n "" -f 100 -p True -u $unique_reads -G 2700000000
2020-01-17 15:03:29,307, INFO: Sample files: ['FILTER/Sample1_Bcell_ATAC/Sample1_Bcell_ATAC_filtered_genome1.1.json']
2020-01-17 15:03:29,307, INFO: Using graphs: ['LOCALGRAPH/genome1.1.nobg'] 
2020-01-17 15:03:29,307, INFO: Will use sequence graphs. ['LOCALGRAPH/genome1.1.nobg.sequences']
2020-01-17 15:03:29,307, INFO: Using graphs from data directory LOCALGRAPH
2020-01-17 15:03:29,307, INFO: Will use [''] as extra experiments names for each run, based on graph file names.If only running on single graph, this should be empty. 
2020-01-17 15:03:29,307, WARNING: Did not find linear map for  for graph LOCALGRAPH/genome1.1.nobg. Will create.
2020-01-17 15:03:29,307, WARNING: Did not find linear map for  for graph LOCALGRAPH/genome1.1.nobg. Will create.
2020-01-17 15:03:42,053, INFO: Getting topologically sorted nodes
2020-01-17 15:03:42,053, INFO: 0 nodes processed
2020-01-17 15:03:42,054, INFO: Finding starts and ends
2020-01-17 15:03:42,054, INFO: 0 nodes processed
Traceback (most recent call last):
  File "/PATH/TO/Andrea/myanaconda/graph_peak_caller/bin/graph_peak_caller", line 11, in <module>
    load_entry_point('graph-peak-caller', 'console_scripts', 'graph_peak_caller')()
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 39, in main
    run_argument_parser(sys.argv[1:])
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/command_line_interface.py", line 679, in run_argument_parser
    args.func(args)
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/callpeaks_interface.py", line 182, in run_callpeaks2
    create_linear_map(graph, linear_map_name)
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/util.py", line 6, in create_linear_map
    linear_map = LinearMap.from_graph(ob_graph)
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/control/linearmap.py", line 106, in from_graph
    starts = cls.find_starts(graph, node_ids)
  File "/PATH/TO/Andrea/graph_peak_caller/graph_peak_caller/control/linearmap.py", line 134, in find_starts
    max_dists[j] = max(cur_dist, max_dists[j])
IndexError: index 2382941 is out of bounds for axis 0 with size 2

For every chromosome, I got the file such as these:

  1. genome1.1.nobg,
  2. genome1.1.nobg.sequences,
  3. genome1.1.nobg.sequencesV2,
  4. genome1.1.json,
  5. node_range_genome1.1.txt,
  6. genome1.1_linear_pathv2.interval

What should I do? Thank you anyway for your help!
Andrea

@RenzoTale88
Copy link
Author

Hi again, sorry was wondering whether there has been any development in this.
Thanks again for your help so far,
Andrea

@ivargr
Copy link
Member

ivargr commented Feb 17, 2020

Sorry for the late reply. Would you be able to share the above mentioned files and the Sample1_Bcell_ATAC_filtered_genome1.1.json file with me, and then I can have a look at this? :)

@RenzoTale88
Copy link
Author

Hi,
sure I've just sent you the link. Let me know if you receive it. Thanks
Andrea

@RenzoTale88
Copy link
Author

Hi,
Sorry to insist, have you received the dataset?

@ivargr
Copy link
Member

ivargr commented Feb 28, 2020

Hi!

Sorry for the late reply again. I got the data and managed to reproduce your error.

I think the problem still is that your graph is not connected. For instance, node 133450827 has an edge going to 135833768, but these two nodes don't have any other edges to any nodes, so they are isolated together. I guess there might be more such cases in your graph.

I am not sure why your graph is disconnected (I don't know much about cactus, but I thought it would create one graph for every chromosome). Do you know why?

Anyway, Graph Peak Caller will not work properly with a graph that is disconnected. It might be possible to fix this one error you got, but there will likely be more problems. Also, I suspect you will have trouble interpreting your results in the end if your graph is disconnected. Thus, I would suggest that you try to find out if this is what your graph is supposed to look like or not. I could try to help you with this if you need any help. Also let me know if you have any further questions.

@RenzoTale88
Copy link
Author

Hi, sorry for my late reply again. It is very strange, since the graph is supposed to be connected (and with alternative paths as well). Not sure how this is possible to be honest.
I'll try again without fragmenting into multiple subgraph, maybe that will help. I'll keep you updated anyway. Thanks for the help
Andrea

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants