Exporting gWalks into json files #156

Open
mesut-unal opened this issue Oct 11, 2024 · 9 comments

Comments

@mesut-unal

Dear developers, I am using gGnome for a project. It is really helpful for getting the walks, but I am having trouble exporting all the walks (from the rds files) to a json file suitable for the gGnome.js browser. I tried writing them to csv files using the nodes, edges, and grl info so that I could compile a json from them. That works and I can visualize them in gGnome.js, but I need to double-check that I am extracting everything correctly. There are variables and names whose explanation I can't find, such as the edges.in and edges.out columns in gNodes. For instance, I see (4)->,1560(1)-> in one of the rows, but I can't be sure what these numbers are, and I can't see 1560 anywhere else in the csv files I extracted. I'd appreciate it if you could tell me where I can find such details and/or point me to a gWalks-to-json function that I can use to cross-check my results. I see that you have gen_gg_json_files in jsUtils.R, but I haven't been able to use it without an error. I'd appreciate your help.

@shihabdider
Collaborator

Thanks for trying the package!

Have you tried using the $json method of the gWalk class?

Here's an example signature:

 gwalk$json(
            filename = output_path, 
            verbose = TRUE,
            annotations = annotations,
            include.graph = FALSE
        )

(where gwalk is an instantiated gWalk object). See: http://www.mskilab.com.s3-website-us-east-1.amazonaws.com/gGnome/tutorial.html#How_are_files_generated

For the edges.in, the number in parens is the copy number of that edge. The other number is the node index the edge is pointing to/from.
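
To make that reading concrete, here is a minimal base-R sketch that splits such a string into node index and copy number. The function name `parse_edge_field` is hypothetical, and the entry format (comma-separated, optional node index, copy number in parentheses, trailing arrow) is an assumption inferred only from the single example `(4)->,1560(1)->` quoted above. The thread does not say what an entry without a node index means, so it is simply reported as `NA`.

```r
# Parse an edges.in / edges.out style string such as "(4)->,1560(1)->".
# Assumption: entries are comma-separated and shaped like
# "<node index>(<copy number>)<arrow>", where the node index may be absent.
parse_edge_field <- function(x) {
  entries <- strsplit(x, ",", fixed = TRUE)[[1]]
  pattern <- "^(-?[0-9]*)\\(([^)]*)\\)(.*)$"
  matched <- regmatches(entries, regexec(pattern, entries))
  rows <- lapply(matched, function(m) {
    if (length(m) == 0) return(NULL)  # skip entries that do not fit the assumed shape
    data.frame(
      node = as.integer(m[2]),                    # NA when no node index is printed
      cn = suppressWarnings(as.numeric(m[3])),    # number inside the parentheses
      arrow = m[4],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

parsed <- parse_edge_field("(4)->,1560(1)->")
print(parsed)
```

Applied to a whole column, something like `lapply(nodes_dt$edges.in, parse_edge_field)` would give one small table per node (`nodes_dt` being a placeholder for the exported nodes table).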

@mesut-unal
Author

Thanks for the reply. Yes, I tried it. When I use it on a gWalk that contains all the walks obtained with the peel function:

gg = gG(jabba = graph)
walks = peel(gg, verbose = TRUE)
saveRDS(walks, "outputname.rds")

I get this error

[1] "#############"
Key: <listid>
       listid    V1
        <num> <int>
    1:      1 -1714
    2:      1 -1713
    3:      1 -1712
    4:      1 -1711
    5:      1 -1710
   ---             
21681:    462 -1146
21682:    462 -1145
21683:    463  -849
21684:    463  -848
21685:    464 -1381
Error in cids[[x]] : subscript out of bounds
In addition: Warning message:
In rbind(abbs, toprint) :
  number of columns of result is not a multiple of vector length (arg 1)

I also tried it on rds files holding gWalks that have a specific length and contain a specific gene. There I could produce a json file, with this output:

Key: <listid>
   listid    V1
    <num> <int>
1:      1  1751
2:      1  1752
Saving JSON to: '/path/to/your/data/walks_json'
Warning message:
In rbind(abbs, toprint) :
  number of columns of result is not a multiple of vector length (arg 1)

However, it is empty when I import it to gGnome.js browser.

@shihabdider
Collaborator

Can you post the exact code you're using, including the call to the $json method and the call that actually produces the error? Please also include a print of the walks object.

@mesut-unal
Author

Yes, this is the part that returns the error message:

rds_files <- list.files(input_folder, pattern = "^226295-WG01.*_walks.rds$", full.names = TRUE)

gwalk <- readRDS(rds_files[1])

gwalk$json(
  filename = "/path/to/your/data/walks_rds/227184-WG01_walks.json", 
  verbose = TRUE,
  include.graph = FALSE
)

and the output of gwalk is

> gwalk
gWalk object with 464 walks (432 linear and 32 circular)
Key: <walk.id>
   walk.id  name length       wid circular     cn
     <num> <int>  <int>     <int>    <num> <lgcl>
1:       1   405      6 198295559    FALSE      1
2:       2   406      3 190207008    FALSE      2
3:       3   362      4 170805979    FALSE      1
4:       4   363      8 161118484    FALSE      1
5:       5   407     10 152725008    FALSE      1
                                                                                                                                                            gr
                                                                                                                                                         <num>
1:         3:168352403-198295559- -> 3:164600003-168352402- -> 3:164187603-164600002- -> ... -> 3:91136003-164187602- -> 3:60661899-91136002- -> 3:1-60661898-
2:                                                                                            4:1-78347980+ -> 4:78354042-180955649+ -> 4:180957136-190214555+
3:                                                                    6:113745687-170805979- -> 6:47596995-113745686- -> 6:23081821-47596994- -> 6:1-23081820-
4: 7:142415727-143707602- -> 7:75256603-142415726- -> 7:70973894-75256602- -> ... -> 7:75256603-142415726+ -> 7:142415727-143707602+ -> 7:143707603-159345973+
5:          12:95660441-95727668- -> 12:56991907-57034372- -> 12:56929770-56991906- -> ... -> 1:124932203-150228402- -> 1:94954905-124932202- -> 1:1-94947816-

 ... 
(459 more walks )
Warning message:
In rbind(abbs, toprint) :
  number of columns of result is not a multiple of vector length (arg 1)

Please let me know if you need anything else.

@shihabdider
Collaborator

Any chance you can upload one of the *walks.rds files into this issue thread? (assuming it's not protected patient data). I can then try reproducing your error on my side.

@mesut-unal
Author

Hi @shihabdider, did you get a chance to look at the data I sent you?

@shihabdider
Collaborator

@mesut-unal Sorry for the long delay! (It's been a hectic month). I'll take a look now and get back to you by end of day.

@shihabdider
Collaborator

shihabdider commented Nov 22, 2024

OK, it looks like the length of cids does not match the number of walks. I need to investigate further why cids are not being generated for some of these walks. In the meantime, the following custom_json function can be used in place of the default $json method; it should avoid this error by ignoring entries for which there is no cid:

custom_json <- function (
    walk,
    filename = ".",
    save = TRUE,
    verbose = FALSE,
    annotations = NULL,
    nfields = NULL,
    efields = NULL,
    stack.gap = 1e+05,
    include.graph = TRUE,
    settings = list(y_axis = list(title = "copy number", visible = TRUE)),
    cid.field = NULL,
    no.y = FALSE
) {
    message("custom_json")
    if (length(walk) == 0) {
        warning("This is an empty gWalk so no JSON will be produced.")
        return(NA)
    }
    if (length(walk$edges) == 0) {
        warning("There are no edges in this gWalk so no JSON will be produced.")
        return(NA)
    }
    non.alt.exist = any(walk$dt[, sapply(sedge.id, length) == 0])
    if (non.alt.exist) {
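        # Recurse through custom_json rather than the default $json method, so the workaround also applies to the filtered walks.
        return(custom_json(refresh(walk[walk$dt[, sapply(sedge.id, length) > 0]]), filename = filename, save = save, verbose = verbose, annotations = annotations, nfields = nfields, efields = efields, stack.gap = stack.gap, include.graph = include.graph, settings = settings, cid.field = cid.field, no.y = no.y))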
    }
    if (include.graph) {
        graph.js = refresh(walk$graph)$json(filename = NA, save = FALSE, verbose = verbose, annotations = annotations, nfields = nfields, efields = efields, settings = settings, no.y = no.y)
    }
    pids = split(walk$dt[, .(pid = walk.id, strand = "+", type = ifelse(walk$circular, "cycle", "path"))], 1:walk$length)
    efields = unique(c("type", efields))
    protected_efields = c("cid", "source", "sink", "title", "weight")
    rejected_efields = intersect(efields, protected_efields)
    if (length(rejected_efields) > 0) {
        warning(sprintf("The following fields were included in efields: \"%s\", but since these are conserved fields in the json walks output then they will be not be included in efields. If these fields contain important metadata that you want included in the json output, then consider renaming these field names in your gWalk object.", paste(rejected_efields, collapse = "\" ,\"")))
        efields = setdiff(efields, rejected_efields)
    }
    missing_efields = setdiff(efields, names(walk$edges$dt))
    if (length(missing_efields) > 0) {
        warning(sprintf("Invalid efields value/s provided: \"%s\". These fields were not found in the gWalk and since will be ignored.", paste(missing_efields, collapse = "\" ,\"")))
        efields = intersect(efields, names(walk$edges$dt))
    }
    sedu = dunlist(walk$sedge.id)
    print("#############")
    print(sedu)
    cids = lapply(unname(split(cbind(data.table(cid = sedu$V1, source = walk$graph$edges[sedu$V1]$left$dt$snode.id, sink = -walk$graph$edges[sedu$V1]$right$dt$snode.id, title = "", weight = 1), walk$graph$edges[sedu$V1]$dt[, ..efields], fill = TRUE), sedu$listid)), function(x) unname(split(x, 1:nrow(x))))
    snu = dunlist(walk$snode.id)
    snu$ys = gGnome:::draw.paths.y(walk$grl) %>% unlist
    protected_nfields = c("chromosome", "startPoint", "endPoint", "y", "type", "strand", "title")
    rejected_nfields = intersect(nfields, protected_nfields)
    if (length(rejected_nfields) > 0) {
        warning(sprintf("The following fields were included in nfields: \"%s\", but since these are conserved fields in the json walks output then they will be not be included in nfields. If these fields contain important metadata that you want included in the json output, then consider renaming these field names in your gWalk object.", paste(rejected_nfields, collapse = "\" ,\"")))
        nfields = setdiff(nfields, rejected_nfields)
    }
    missing_nfields = setdiff(nfields, names(walk$nodes$dt))
    if (length(missing_nfields) > 0) {
        warning(sprintf("Invalid nfields value/s provided: \"%s\". These fields were not found in the gWalk and since will be ignored.", paste(missing_nfields, collapse = "\" ,\"")))
        # Node fields must be matched against the nodes table, as in the check above.
        nfields = intersect(nfields, names(walk$nodes$dt))
    }
    iids = lapply(unname(split(cbind(data.table(iid = abs(snu$V1)), walk$graph$nodes[snu$V1]$dt[, .(chromosome = seqnames, startPoint = start, endPoint = end, y = snu$ys, type = "interval", strand = ifelse(snu$V1 > 0, "+", "-"), title = abs(snu$V1))], walk$graph$nodes[snu$V1]$dt[, ..nfields]), snu$listid)), function(x) unname(split(x, 1:nrow(x))))
    walks.js = lapply(1:min(length(walk), length(cids), length(iids)), 
        function(x) c(as.list(pids[[x]]), list(cids = rbindlist(cids[[x]])), list(iids = rbindlist(iids[[x]]))))
    if (include.graph) {
        out = c(graph.js, list(walks = walks.js))
    }
    else {
        out = list(walks = walks.js)
    }
    if (save) {
        if (verbose) {
            message("Saving JSON to: ", filename)
        }
        jsonlite::write_json(out, filename, pretty = TRUE, auto_unbox = TRUE, digits = 4)
        return(normalizePath(filename))
    }
    else {
        return(out)
    }
}

walk = readRDS("226295-WG01_walks.rds")
custom_json(walk, filename="test.json", verbose = TRUE, include.graph = FALSE)
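
One way to look for the kind of length mismatch described above is to compare the walk ids that survive a split-by-id step against the full set of walks. The base-R sketch below uses made-up edge ids as a stand-in for `walk$sedge.id`; it does not use the attached file, and whether this mechanism is what actually happens there is not established in this thread.

```r
# Toy stand-in for walk$sedge.id: one vector of signed edge ids per walk.
# These values are invented for illustration only.
sedge_ids <- list(c(-3L, -2L), integer(0), c(5L, 6L, 7L))

# Flatten to (listid, V1) pairs, similar in spirit to the dunlist() output printed above.
flat <- data.frame(
  listid = rep(seq_along(sedge_ids), lengths(sedge_ids)),
  V1 = unlist(sedge_ids)
)

# split() only creates groups for ids that actually occur in listid.
groups <- split(flat, flat$listid)

n_walks <- length(sedge_ids)
n_groups <- length(groups)
missing_walks <- setdiff(seq_len(n_walks), as.integer(names(groups)))

cat("walks:", n_walks, "groups:", n_groups, "\n")
cat("walks without an edge group:", missing_walks, "\n")
```

In this toy case, positional indexing such as `groups[[2]]` returns the edges of walk 3, not walk 2. Truncating to the shortest length avoids the out-of-bounds error, but if a group is missing in the middle it may pair a walk with another walk's edges, so matching by id is worth checking.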

@mesut-unal
Author

Hi @shihabdider, thanks for taking a look at it and preparing the custom json. Did you get a chance to try the output in the gGnome.js browser? I still get the same problem with custom_json: the browser shows an empty page.
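
Before loading a file into the browser, it may help to check whether the JSON itself contains walks with non-empty `cids` and `iids`. The sketch below reads a file back with jsonlite and counts entries per walk. To stay runnable on its own it first writes a tiny invented file; the key names (`walks`, `pid`, `cids`, `iids`) follow those used in custom_json above, and the values are made up. It only checks that the file is populated, not whether gGnome.js needs anything else to render it.

```r
library(jsonlite)

# Write a tiny stand-in file so the sketch runs on its own (values are invented).
demo <- list(walks = list(
  list(pid = 1, strand = "+", type = "path",
       cids = list(list(cid = 10, source = 1, sink = -2)),
       iids = list(list(iid = 1), list(iid = 2))),
  list(pid = 2, strand = "+", type = "cycle", cids = list(), iids = list())
))
path <- tempfile(fileext = ".json")
write_json(demo, path, auto_unbox = TRUE)

# Read it back without simplification so each walk stays a plain list.
js <- fromJSON(path, simplifyVector = FALSE)

# One row per walk: how many edge (cids) and interval (iids) records it carries.
summary_tab <- data.frame(
  pid = vapply(js$walks, function(w) as.numeric(w$pid), numeric(1)),
  n_cids = vapply(js$walks, function(w) length(w$cids), integer(1)),
  n_iids = vapply(js$walks, function(w) length(w$iids), integer(1))
)
print(summary_tab)
```

Replacing `path` with the real output file (for example the `test.json` written above) shows whether `walks` is empty or whether individual walks lack intervals.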
