Add another page
bsweger committed Oct 8, 2024
1 parent 1ee74c5 commit 668e658
Showing 3 changed files with 66 additions and 2 deletions.
2 changes: 1 addition & 1 deletion docs/conf.py
@@ -3,7 +3,7 @@
# -- Project information

project = "Cladetime"
-copyright = "2024, Reich Lab @ The University of Massachusetts, Amherst"
+copyright = "2024, Reich Lab @ The University of Massachusetts Amherst"
author = "Reich Lab"

release = "0.1"
2 changes: 1 addition & 1 deletion docs/index.md
@@ -1,3 +1,3 @@
# Cladetime

-## Hi!
+Cladetime is a lightweight Python library for manipulating SARS-CoV-2 sequence and clade data provided by [nextstrain.org](https://nextstrain.org/).
64 changes: 64 additions & 0 deletions docs/user-guide.md
@@ -0,0 +1,64 @@
# User Guide

## Installing

Cladetime can be installed with [pip](https://pip.pypa.io/):

```bash
pip install git+https://github.com/reichlab/cladetime.git
```

## Finding Nextstrain SARS-CoV-2 sequences and sequence metadata

Cladetime's `CladeTime` class is a lightweight interface to the files published on nextstrain.org.

```python
from cladetime import CladeTime

# Instantiating a CladeTime object with no parameters will use the
# latest available data from nextstrain.org.
ct = CladeTime()

# URL to the most recent SARS-CoV-2 sequence file (.fasta)
ct.url_sequence
# 'https://nextstrain-data.s3.amazonaws.com/files/ncov/open/sequences.fasta.zst?versionId=d66Hn1T0eFMAg8osEh8Yrod.QEUBRxvu'

# URL to the metadata that describes the sequences in the above file
ct.url_sequence_metadata
# 'https://nextstrain-data.s3.amazonaws.com/files/ncov/open/metadata.tsv.zst?versionId=JTXXFlKyyvt9AerxKMwoZflhFYQFrDek'

# Metadata about the Nextstrain data pipeline that generated the sequence file and its metadata
ct.ncov_metadata
# {'schema_version': 'v1',
# 'nextclade_version': 'nextclade 3.8.2',
# 'nextclade_dataset_name': 'SARS-CoV-2',
# 'nextclade_dataset_version': '2024-09-25--21-50-30Z',
# 'nextclade_tsv_sha256sum': '5b0f2b64bfe694a3c96bd5a116de8fae23b144bfd3d22da774d4bfe9a84399c3',
# 'metadata_tsv_sha256sum': '1dc6a4204039e5c69eed84583faf75bbec1629e531dc99aab5bd566d3fb28295'}
```
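
Both URLs above carry a `versionId` query parameter, which is how S3 addresses one specific version of an object. As a minimal sketch using only the standard library, the identifier can be pulled out of such a URL as follows. The helper name `s3_version_id` is hypothetical and is not part of Cladetime.

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def s3_version_id(url: str) -> Optional[str]:
    """Return the versionId query parameter of a URL, or None if it is absent.

    Hypothetical helper for illustration; not part of the Cladetime API.
    """
    query = parse_qs(urlsplit(url).query)
    values = query.get("versionId")
    return values[0] if values else None


# URL copied from the example output above
url = (
    "https://nextstrain-data.s3.amazonaws.com/files/ncov/open/"
    "sequences.fasta.zst?versionId=d66Hn1T0eFMAg8osEh8Yrod.QEUBRxvu"
)
print(s3_version_id(url))
```

Recording this value alongside an analysis is one way to note exactly which version of the sequence file was used.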

## Working with SARS-CoV-2 sequence metadata

The `CladeTime` class also provides a Polars LazyFrame that points to Nextstrain's sequence metadata file. This file is in .tsv format and contains information about the sequences, such as their collection date, host, and location.

The metadata also includes a clade assignment for each sequence. Nextstrain assigns clades based on a reference tree, and the reference tree varies over time.

TODO: better wording, add links for reference

```python
import polars as pl
from cladetime import CladeTime

ct = CladeTime()

# ct contains a Polars LazyFrame that references the sequence metadata .tsv file on AWS S3
lz = ct.sequence_metadata
lz
# <LazyFrame at 0x105341190>

# TODO: some polars examples
```

## Time Traveling

omg!
