Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

issues about inport data in SEVtras #20

Open
TATABOX99 opened this issue Jun 12, 2024 · 24 comments
Open

issues about inport data in SEVtras #20

TATABOX99 opened this issue Jun 12, 2024 · 24 comments

Comments

@TATABOX99
Copy link

Dear Author,

I am using this script for analysis:
import SEVtras
SEVtras.sEV_recognizer(input_path='/home/yeziyang/Sc', sample_file='/home/yeziyang/Sc/Sc1_LN', out_path='/home/yeziyang/Sc/outputs', species='Homo')

My 10x_mtx formatted data is stored in this directory: /home/yeziyang/Sc/Sc1_LN/outs/raw_feature_bc_matrix/matrix.mtx.gz. I have ensured it is extracted from the raw_feature_bc_matrix. However, I encountered the following error:
File "run_SEVtras.py", line 2, in
SEVtras.sEV_recognizer(input_path='/home/yeziyang/Sc', sample_file='/home/yeziyang/Sc/Sc1_LN', out_path='/home/yeziyang/Sc/outputs', species='Homo')
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/main.py", line 79, in sEV_recognizer
sample_log = get_sample(sample_file)
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/utils.py", line 18, in get_sample
with open(sample_log, 'r') as f:
IsADirectoryError: [Errno 21] Is a directory: '/home/yeziyang/Sc/Sc1_LN'

I am not sure why this error occurs. Could you please give me some advice? Thank you very much!

Additionally, when I used echo "Sc1_LN" > sample_file and ran the above code again, setting alpha to 0.08, it showed max() arg is an empty sequence. I am not sure if this is due to an error in my import process.

@RuiqiaoHe
Copy link
Member

The parameter sample_file should be a file containing all samples, formatted as follows:
sample1
sample2
......
sampleN

In the case of your dataset, it would be:
Sc1_LN
Sc2_LN
Sc3_LN
......
ScN_LN

For the "max() arg" error, please first check if your dataset was generated by scRNA-seq? SEVtras doesn't support single nucleus RNA-seq data. Then, you can try to lower the parameter alpha.

@TATABOX99
Copy link
Author

Thank you very much for your prompt response. The correct import command allowed the program to run successfully. Additionally, I saw in the guide that you suggest using multiple sample data to identify SEVs. Can I merge all my samples into one Seurat object, convert it to h5ad format, and then analyze it? If so, would clustering the cells in my object before using SEVtras affect the determination of the SEV sources?

@TATABOX99
Copy link
Author

By the way, I found that analyzing a single sample theoretically takes an entire day. Is this normal for this software?
My RAM is 256G, with CPU core is 20.

@RuiqiaoHe
Copy link
Member

RuiqiaoHe commented Jun 13, 2024

Please refer to Question 9 in Troubleshooting. You can obtain the cell type information based on your own processing procedure, such as filtering or regressing. SEVtras only needs the cell type information, instead of the processed expression matrix.
In my environment (naive Linux), SEVtras takes only a few minutes to finish a 10X single cell sample. You can try to speed it up by increasing the number of CPUs if the memory load is sufficient (predefine_threads).

@TATABOX99
Copy link
Author

Here I have an additional question. When perform data quality control, I use this code to remove poor-quality cells: test.seu <- subset(test.seu, subset = nFeature_RNA > 200 & nFeature_RNA < 6000 & percent.mt < 5). Will this step affect the identification of SEV (Single Extracellular Vesicles)?

@RuiqiaoHe
Copy link
Member

The identification of sEVs is a separate step without the information of the cell matrix (SEVtras.sEV_recognizer). Cell matrix pre-processing only affects the calculation of ESAI (SEVtras.ESAI_calculator).

@TATABOX99
Copy link
Author

I have encountered some new difficulties. After completing the calculations in part 1, I obtained 15 "itera_gene.txt" and "raw_file.h5ad" files. Before running part 2, I noticed that you mentioned, "The first two parameters represent the path to sEV- and cell- anndata objects" and "With the output of SEVtras.sEV_recognizer in Part I sEVs recognizing and cell matrix with cell type, SEVtras can track each sEV to the original cell type and calculate the sEV secretion activity index (ESAI)."

However, the data I included in the first step is single-cell sequencing data directly from patients' original samples, without any clustering, meaning I do not have a cell matrix with cell type, and there is no "celltype" metadata. I would like to ask if this means I should first merge the raw_file.h5ad obtained from the first step into a Seurat object, process and cluster it, and then convert it back to an h5ad object to get the "test_cell.h5ad" file mentioned in "SEVtras.ESAI_calculator(adata_ev_path='./tests/sEV_SEVtras.h5ad', adata_cell_path='./tests/test_cell.h5ad', out_path='./outputs', Xraw=False, OBSsample='batch', OBScelltype='celltype')". But where does the "sEV_SEVtras.h5ad" file come from?

@RuiqiaoHe
Copy link
Member

"sEV_SEVtras.h5ad" generated by SEVtras.sEV_recognizer, located in outputs directory.

@TATABOX99
Copy link
Author

Dear esteemed author,

In your guide, regarding the second part, you mentioned, "With the output of SEVtras.sEV_recognizer in Part I sEVs recognizing and cell matrix with cell type, SEVtras can track each sEV to original cell type and calculate sEV secretion activity index (ESAI)." Could you clarify whether the "cell matrix with cell type" referred to here should be constructed using the filtered feature bc matrix from single-cell data output or if it would be more appropriate to use the raw feature bc matrix to construct an object containing cell clustering information?

Thank you for your guidance.

@RuiqiaoHe
Copy link
Member

Please refer to question 9 in Troubleshooting. A raw feature bc matrix with cell clustering information is preferred, regardless of how you generate the cell clustering information.

@TATABOX99
Copy link
Author

I am encountering an issue while running the SEVtras.ESAI_calculator function.
I used a processed Seurat object which I converted to H5AD format as the tutorial in Troubleshooting. However, when I run the SEVtras.ESAI_calculator function, I encounter the following error:

/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/anndata/_core/anndata.py:1785: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behaviour.
[AnnData(sparse.csr_matrix(a.shape), obs=a.obs) for a in all_adatas],
/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/scanpy/preprocessing/_pca.py:229: FutureWarning: X.dtype being converted to np.float32 from float64. In the next version of anndata (0.9) conversion will not be automatic. Pass dtype explicitly to avoid this warning. Pass AnnData(X, dtype=X.dtype, ...) to get the future behaviour.
adata.obsm['X_pca'] = X_pca
No matched cell sample for sEV deconvolution
... (repeated multiple times) ...
Traceback (most recent call last):
File "ESAI_calculator.py", line 2, in
SEVtras.ESAI_calculator(adata_ev_path='/home/yeziyang/Sc/outputs/sEV_SEVtras.h5ad', adata_cell_path='/home/yeziyang/Sc/PTCALL_RAW.h5ad', out_path='/home/yeziyang/Sc/outputs/combined', Xraw=False, OBSsample='batch', OBScelltype='celltype')
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/main.py", line 188, in ESAI_calculator
celltype_e_number, adata_evS, adata_com = deconvolver(adata_ev, adata_cell, species, OBSsample, OBScelltype, OBSev, OBSMpca, cellN, Xraw, normalW)
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/SEVtras/functional.py", line 131, in deconvolver
near_neighbor_dat.index = adata_ev.obs.index
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/generic.py", line 5500, in setattr
return object.setattr(self, name, value)
File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.set
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/generic.py", line 766, in _set_axis
self._mgr.set_axis(axis, labels)
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
self._validate_set_axis(axis, new_labels)
File "/home/yeziyang/miniconda3/envs/SEVtras_env/lib/python3.7/site-packages/pandas/core/internals/base.py", line 58, in _validate_set_axis
f"Length mismatch: Expected axis has {old_len} elements, new "
ValueError: Length mismatch: Expected axis has 0 elements, new values have 31098 elements

Since I am a beginner in Python, I cannot understand what went wrong with the converted data. Additionally, it doesn't seem to be a problem with the samples, as I have confirmed that both adata_ev and adata_cell use the same samples. Below is the code I used in Python to convert a Seurat object to H5AD format. I greatly need your help, thank you!

import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import csr_matrix
import pandas as pd

directory = '/home/yeziyang/Sc/'

X = io.mmread(directory + 'matrix.mtx')
adata = anndata.AnnData(X=X.transpose().tocsr())

metadata = pd.read_csv(directory + 'metadata.csv')

with open(directory + 'gene_names.csv', 'r') as f:
gene_names = f.read().splitlines()

adata.obs = metadata
adata.obs.index = adata.obs["barcode"]
adata.var.index = gene_names

adata.obs['celltype'] = pd.Categorical(adata.obs['celltype'].astype(str))

adata.write_h5ad(directory + 'PTCALL_RAW.h5ad')

@TATABOX99
Copy link
Author

adata_ev 'obs' columns:
Index(['n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt',
'total_counts_ribo', 'pct_counts_ribo', 'n_genes', 'score', 'batch',
'sEV', 'score1'],
dtype='object')

adata_cell 'obs' columns:
Index(['orig.ident', 'nCount_RNA', 'nFeature_RNA', 'barcode', 'percent.mt',
'patient', 'celltype', 'tissue_type', 'RNA_snn_res.1',
'seurat_clusters'],
dtype='object')
It looks like my adata_ev does not have a celltype obs. Is this normal? Did I overlook any important steps that caused the inconsistency between the two datasets?

@RuiqiaoHe
Copy link
Member

Sorry for my fault. Please use following code:
import scanpy as sc
adata_cell = sc.read_h5ad(directory + 'PTCALL_RAW.h5ad')
adata_cell.obs['batch'] = adata_cell.obs['orig.ident']
Since SEVtras gets sample information based on the OBSsample parameter, the default is batch.

@TATABOX99
Copy link
Author

Thank you very much for your suggestion. That part of the program is now running normally. However, an error occurred during the downstream analysis when converting to a Seurat object:
Convert("SEVtras_combined.h5ad", dest = "h5seurat", overwrite = T)
Warning: Unknown file type: h5ad
Warning: 'assay' not set, setting to 'RNA'
Creating h5Seurat file for version 3.1.5.9900
Adding X as data
Adding raw/X as counts
Adding meta.features from raw/var
Adding X_pca as cell embeddings for pca
Adding X_umap as cell embeddings for umap
Adding PCs as feature loadings for pca
Warning: Cannot find features for feature loadings, will not be able to load
Adding miscellaneous information for pca
Adding standard deviations for pca
Adding miscellaneous information for umap
Adding hvg to miscellaneous data
Adding log1p to miscellaneous data

scdata <- LoadH5Seurat("SEVtras_combined.h5seurat", meta.data = T, misc = F)
Validating h5Seurat file
Initializing RNA with data
Error in fixupDN.if.valid(value, x@Dim) :
length of Dimnames[[1]] (21203) is not equal to Dim[1] (4542)

This seems to indicate an error in generating the data in h5ad format. I wonder if you have encountered this error before. I would greatly appreciate any further advice you could provide.

@RuiqiaoHe
Copy link
Member

Please refer to question 13 at Troubleshooting html. The code would reliably convert Seurat object to h5ad format.

@TATABOX99
Copy link
Author

Dear Author,

Upon investigation, we found that the error originates from the version issue of the Seurat-disk we are using. It seems feasible to extract information from this h5ad format dataset directly in Python and then construct the Seurat object directly in R. However, we found that in the newly constructed Seurat object, the number of features and cells changed. Below is the information of the Seurat object we initially used:

testAB.integrated
An object of class Seurat
31317 features across 161079 samples within 1 assay
Active assay: RNA (31317 features, 5000 variable features)
3 layers present: counts, data, scale.data
4 dimensional reductions calculated: pca, harmony, umap, tsne

After processing with SEVtras, the number of features decreased to 4542, while the number of cells increased to 192177. We have two questions: first, is this a normal phenomenon that could occur during the computation process? Second, can we extract the EVs determination information in the metadata of the newly generated Seurat object and transfer it to the initial Seurat object, 'testAB.integrated', so that we can retain the EVs information while also preserving the feature information?

@RuiqiaoHe
Copy link
Member

Since I don't know the full picture of your data, I can only answer with what I assume to be the case.

  1. The number of cells depends only on your input in parameter adata_cell of SEVtras.ESAI_calculator. So it won't increase. If you indicate the cell number in the output of SEVtras.sEV_recognizer (raw_SEVtras.h5ad), this is not the real cell in your analysis. I only control the total UMI in the file. As for the number of features, it can rise from different feature selection parameters between Seurat and Scanpy.
  2. It is ok to extract the sEV information in adata or Seurat object. You can just save them in the adata.obs and use the original data for other analysis.

@TATABOX99
Copy link
Author

TATABOX99 commented Jul 14, 2024 via email

@RuiqiaoHe
Copy link
Member

SEVtras.ESAI_calculator() in Part 2 does not require the input cell matrix to be filtered or not, but is concerned with the number of genes in the matrix. The input for SEVtras.ESAI_calculator() can be raw_feature_matrix in you statement, and also can be other prepossessed cell matrix (filter_feature_matrix) that you want to analyze (with parameter Xraw). Please refer to point 9 in Troubleshooting..
The results in SEVtras_sEVs.h5ad would not conflict with your analysis in your conventional single-cell analysis.

@TATABOX99
Copy link
Author

Moreover, functional enrichment analysis of these sEV-containing droplets highlighted pathways associated with sEV formation and release (P < 0.05, hypergeometric test) (Extended Data Fig. ​Fig.5d).5d)
尊敬的作者您好,在您的文章中有这样一个描述。这一个图来自于您的GO富集分析结果,我想确认一下,你在这里比较的是part2的结果中的SEVtras_combined.h5ad内sEV部分和非sEV部分吗?因为我们的分析在这里有一些问题,在使用ESAI高水平的单细胞cluster与低水平cluster作比较时,EV释放相关通路并没有被富集。我们不太清楚如何去解释这个结果。按照您的算法,这个值的高水平似乎可以代表高EVs释放的能力。

@RuiqiaoHe
Copy link
Member

The functional enrichment analysis used high abundance genes in identified sEVs droplets.

@TATABOX99
Copy link
Author

作者您好,通过对raw_SEVtras.h5ad和sEV_SEVtras.h5ad的比对我们发现,在那些没有被识别为EVs的液滴中,有相当一部分的score值是大于15的,这导致了比对EVs和细胞的两个标记物CD63和CD9的表达情况时细胞部分显著高于了EVs部分,我们是否可以将raw_SEVtras.h5ad中的液滴进行手动标记,将大于15分的液滴标记为EVs,保存成h5ad格式以替代sEV_SEVtras.h5ad进行后续分析?我们已经这样做了并且发现这样区分后,EVs部分的两个标记物表达显著高于了细胞部分。但作为这个软件的初学者,我们不太确定这样的分类是否符合您创作这个软件时的逻辑。

@RuiqiaoHe
Copy link
Member

The operation of manual filtering is actually equivalent to modifying the parameter score_t. Thus, it is ok to filter by yourself.
And the question of parameter meaning you asked in other issue:
The score_t, threads, search_UMI have the same as corresponding parameters in SEVtras.sEV_recognizer(),
The max_M and flag are to ensure the enrichment will work.

@kingwzun
Copy link

kingwzun commented Aug 2, 2024

I have encountered some new difficulties. After completing the calculations in part 1, I obtained 15 "itera_gene.txt" and "raw_file.h5ad" files. Before running part 2, I noticed that you mentioned, "The first two parameters represent the path to sEV- and cell- anndata objects" and "With the output of SEVtras.sEV_recognizer in Part I sEVs recognizing and cell matrix with cell type, SEVtras can track each sEV to the original cell type and calculate the sEV secretion activity index (ESAI)."

However, the data I included in the first step is single-cell sequencing data directly from patients' original samples, without any clustering, meaning I do not have a cell matrix with cell type, and there is no "celltype" metadata. I would like to ask if this means I should first merge the raw_file.h5ad obtained from the first step into a Seurat object, process and cluster it, and then convert it back to an h5ad object to get the "test_cell.h5ad" file mentioned in "SEVtras.ESAI_calculator(adata_ev_path='./tests/sEV_SEVtras.h5ad', adata_cell_path='./tests/test_cell.h5ad', out_path='./outputs', Xraw=False, OBSsample='batch', OBScelltype='celltype')". But where does the "sEV_SEVtras.h5ad" file come from?

Hi TATABOX99, can I get your email? I want to get the code to generate adata_cell in ESAI_calculator.(My email is [email protected]

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants