Skip to content

Commit

Permalink
Add AnnotatedProteinChain, cache PDB downloads, include disordered at…
Browse files Browse the repository at this point in the history
…oms/residues
  • Loading branch information
AntonOresten committed Sep 11, 2024
1 parent 738d3e6 commit 51c3c32
Show file tree
Hide file tree
Showing 13 changed files with 266 additions and 181 deletions.
2 changes: 1 addition & 1 deletion Project.toml
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
name = "ProteinChains"
uuid = "b8e8f2a5-48d3-44f1-ba0d-c71cb7726ff8"
authors = ["Anton Oresten <[email protected]> and contributors"]
version = "0.1.1"
version = "0.2.0"

[deps]
AssigningSecondaryStructure = "8ed43e74-60fb-4e11-99b9-91deed37aef7"
Expand Down
63 changes: 37 additions & 26 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
[![Build Status](https://github.com/MurrellGroup/ProteinChains.jl/actions/workflows/CI.yml/badge.svg?branch=main)](https://github.com/MurrellGroup/ProteinChains.jl/actions/workflows/CI.yml?query=branch%3Amain)
[![Coverage](https://codecov.io/gh/MurrellGroup/ProteinChains.jl/branch/main/graph/badge.svg)](https://codecov.io/gh/MurrellGroup/ProteinChains.jl)

This Julia package provides implements the `ProteinChain` type: an efficient and GPU-friendly representation of protein chains.
This Julia package provides implements the `ProteinChain` type: a GPU-friendly structure-of-arrays representation of protein chains.

## Installation

Expand All @@ -16,44 +16,55 @@ Pkg.add("ProteinChains")

## Examples

The `ProteinChain` type is meant to only store a set of quintessential fields, from which most other properties can be derived.

```julia
julia> using ProteinChains

julia> structure = pdb"1EYE" # convenient macro to download proteins from the PDB
julia> structure = pdb"1EYE" # string macro to fetch proteins from the PDB
[ Info: Downloading file from PDB: 1EYE
1-chain ProteinStructure "1EYE.cif":
2 fields:
name::String = "1EYE.cif"
chains::Vector{ProteinChain{Float64}} = <exceeds max length>
2 properties:
ids::Vector{String} = ["A"]
lengths::Vector{Int64} = [253]
1-chain ProteinStructure "1EYE.cif"
256-residue ProteinChain{Float64} (A)

julia> propertynames(chain)
(:id, :sequence, :backbone, :numbering, :atoms)
```
To store additional properties, `AnnotatedProteinChain` can be used to add dynamic properties to the chain:
julia> chain = structure["A"]
253-residue ProteinChain "A":
4 fields:
```julia
julia> annotated_chain = annotate(chain; model=1)
256-residue AnnotatedProteinChain{Float64} (A):
6 fields:
id::String = "A"
sequence::String = <exceeds max length>
backbone::Array{Float64,3} = <exceeds max length>
numbering::Vector{Int64} = <exceeds max length>
atoms::Vector{Vector{ProteinChains.Atom{Float64}}} = <exceeds max length>
2 properties:
indexable_properties::Vector{Symbol} = Symbol[]
1 property:
model::Int64 = 1
```
For properties of type `<:AbstractArray` that represent residue-level information, `annotate_indexable!` will index the last dimension of the property when the chain is indexed:
```julia
julia> annotate_indexable!(annotated_chain; secondary_structure=assign_secondary_structure(annotated_chain)
256-residue AnnotatedProteinChain{Float64} (A):
6 fields:
id::String = "A"
sequence::String = <exceeds max length>
backbone::Array{Float64,3} = <exceeds max length>
numbering::Vector{Int64} = <exceeds max length>
modelnum::Int64 = 1

julia> chain.numbering
253-element Vector{Int64}:
5
6
7
8
271
272
273
274
atoms::Vector{Vector{ProteinChains.Atom{Float64}}} = <exceeds max length>
indexable_properties::Vector{Symbol} = [:secondary_structure]
2 properties:
model::Int64 = 1
secondary_structure::Vector{Int64} = [1, 1, 3, 3, 3, 3, 3, 3, 3, 1 2, 2, 2, 2, 2, 2, 2, 1, 1, 1]
```
## See also
- [Backboner.jl](https://github.com/MurrellGroup/Backboner.jl)
- [BioStructures.jl](https://github.com/BioJulia/BioStructures.jl)
- [PDBTools.jl](https://github.com/m3g/PDBTools.jl)
15 changes: 10 additions & 5 deletions src/ProteinChains.jl
Original file line number Diff line number Diff line change
Expand Up @@ -18,20 +18,25 @@ export prepend_residue
include("atom.jl")

include("chain.jl")
export AbstractProteinChain
export ProteinChain
export countresidues
export length
export psi_angles, omega_angles, phi_angles

include("annotated-chain.jl")
export AnnotatedProteinChain
export annotate!, annotate
export annotate_indexable!, annotate_indexable

include("structure.jl")
export ProteinStructure

include("secondary-structure.jl")
export assign_secondary_structure, assign_secondary_structure!

include("io/io.jl")
export readchains
export readpdb
export readcif
export writechains
export readcif, readpdb
export writecif, writepdb
export PDBFormat, MMCIFFormat
export pdbentry, @pdb_str

Expand Down
52 changes: 52 additions & 0 deletions src/annotated-chain.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
@dynamic mutable struct AnnotatedProteinChain{T} <: AbstractProteinChain{T}
id::String
sequence::String
backbone::Array{T,3}
numbering::Vector{Int64}
atoms::Vector{Vector{Atom{T}}}
indexable_properties::Vector{Symbol}

function AnnotatedProteinChain(
id::AbstractString, sequence::AbstractString, backbone::AbstractArray{T,3},
numbering::AbstractVector{<:Integer}, atoms::Vector{Vector{Atom{T}}},
indexable_properties=Symbol[]; kwargs...
) where T
@assert length(sequence) == size(backbone, 3) == length(atoms) == length(numbering)
AnnotatedProteinChain{T}(id, sequence, backbone, atoms, numbering, indexable_properties; kwargs...)
end
end

ProteinChain(chain::AbstractProteinChain) = ProteinChain(chain.id, chain.sequence, chain.backbone, chain.numbering, chain.atoms)
AnnotatedProteinChain(chain::AbstractProteinChain) = AnnotatedProteinChain(chain.id, chain.sequence, chain.backbone, chain.numbering, chain.atoms)

function Base.getindex(chain::AnnotatedProteinChain, i::AbstractVector{<:Integer})
properties = Dict{Symbol,Any}()
for name in getproperties(chain; fields=false)
value = getproperty(chain, name)
properties[name] = name in chain.indexable_properties ? collect(selectdim(value, ndims(value), i)) : value
end
return annotate(ProteinChain(chain)[i]; properties...)
end

function annotate_indexable!(chain::AnnotatedProteinChain; annotations...)
for (name, value) in annotations
@assert value isa AbstractArray
@assert size(value)[end] == length(chain)
push!(chain.indexable_properties, name)
setproperty!(chain, name, value)
end
return chain
end

annotate_indexable(chain::AnnotatedProteinChain; annotations...) = annotate_indexable!(deepcopy(chain); annotations...)
annotate_indexable(chain::ProteinChain; annotations...) = annotate_indexable!(AnnotatedProteinChain(chain); annotations...)

function annotate!(chain::AnnotatedProteinChain; annotations...)
for (name, value) in annotations
setproperty!(chain, name, value)
end
return chain
end

annotate(chain::AnnotatedProteinChain; annotations...) = annotate(deepcopy(chain); annotations...)
annotate(chain::ProteinChain; annotations...) = annotate!(AnnotatedProteinChain(chain); annotations...)
16 changes: 8 additions & 8 deletions src/atom.jl
Original file line number Diff line number Diff line change
Expand Up @@ -6,22 +6,22 @@ const ATOMIC_NUMBER_TO_ELEMENT_SYMBOL = Dict(n => s for (s, n) in ELEMENT_SYMBOL
element_symbol_to_atomic_number(element_symbol::AbstractString) = ELEMENT_SYMBOL_TO_ATOMIC_NUMBER[uppercase(strip(element_symbol))]
atomic_number_to_element_symbol(atomic_number::Integer) = ATOMIC_NUMBER_TO_ELEMENT_SYMBOL[atomic_number]

pad_atom_name(atom_name::AbstractString, element_symbol::AbstractString) = rpad(" "^(2-length(strip(element_symbol)))*strip(atom_name), 4)
pad_atom_name(name::AbstractString, element_symbol::AbstractString) = rpad(" "^(2-length(strip(element_symbol)))*strip(name), 4)

encode_atom_name(atom_name::AbstractString, element_symbol::AbstractString) = reinterpret(UInt32, codeunits(pad_atom_name(atom_name, element_symbol)))[1]
decode_atom_name(atom_name::UInt32) = String(reinterpret(UInt8, [atom_name]))
encode_atom_name(name::AbstractString, element_symbol::AbstractString) = reinterpret(UInt32, codeunits(pad_atom_name(name, element_symbol)))[1]
decode_atom_name(name::UInt32) = String(reinterpret(UInt8, [name]))

mutable struct Atom{T<:AbstractFloat}
atom_name::UInt32
name::UInt32
atomic_number::Int8
x::T
y::T
z::T
end

Atom(atom_name::UInt32, atomic_number::Integer, x::T, y::T, z::T) where T = Atom{T}(atom_name, atomic_number, x, y, z)
Atom(atom_name::AbstractString, element_symbol::AbstractString, coords::AbstractVector{<:AbstractFloat}) =
Atom(encode_atom_name(atom_name, element_symbol), element_symbol_to_atomic_number(element_symbol), coords...)
Atom(name::UInt32, atomic_number::Integer, x::T, y::T, z::T) where T = Atom{T}(name, atomic_number, x, y, z)
Atom(name::AbstractString, element_symbol::AbstractString, coords::AbstractVector{<:AbstractFloat}) =
Atom(encode_atom_name(name, element_symbol), element_symbol_to_atomic_number(element_symbol), coords...)

coords(atom::Atom) = [atom.x, atom.y, atom.z]

Expand All @@ -33,4 +33,4 @@ function offset!(atom::Atom, coords::Vector{<:Real})
atom.y += coords[2]
atom.z += coords[3]
return atom
end
end
71 changes: 36 additions & 35 deletions src/chain.jl
Original file line number Diff line number Diff line change
@@ -1,41 +1,53 @@
@dynamic mutable struct ProteinChain{T<:AbstractFloat}
"""
AbstractProteinChain{T<:AbstractFloat}
An abstract type representing a protein chain, with a type parameter
`T` that specifies the floating-point type of the coordinates.
## Subtypes
- `ProteinChain{T}`: A concrete subtype of `AbstractProteinChain` that represents a protein chain with a spe.
"""
abstract type AbstractProteinChain{T<:AbstractFloat} end

Base.summary(chain::AbstractProteinChain) = "$(length(chain))-residue $(typeof(chain)) ($(chain.id))"
Base.length(chain::AbstractProteinChain) = size(chain.backbone, 3)

mutable struct ProteinChain{T} <: AbstractProteinChain{T}
id::String
sequence::String
backbone::Array{T,3}
numbering::Vector{Int}
atoms::Vector{Vector{Atom{T}}}

function ProteinChain(
id::AbstractString, sequence::AbstractString, backbone::AbstractArray{T,3},
numbering::AbstractVector{<:Integer} = collect(1:length(sequence)),
atoms::Vector{Vector{Atom{T}}} = [Atom{T}[] for _ in sequence],
) where T
@assert length(sequence) == size(backbone, 3) == length(numbering) == length(atoms)
new{T}(id, sequence, backbone, numbering, atoms)
end
end

"""
ProteinChain{T<:AbstractFloat}
## Examples
```jldoctest
julia> structure = pdb"1EYE";
[ Info: Downloading file from PDB: 1EYE
julia> structure[1]
253-residue ProteinChain "A":
4 fields:
id::String = "A"
sequence::String = <field value exceeds max length>
backbone::Array{Float64,3} = <field value exceeds max length>
atoms::Vector{Vector{ProteinChains.Atom{Float64}}} = <field value exceeds max length>
2 properties:
numbering::Vector{Int64} = [5, 6, 7, 8, 9, 10, 11, 12, 13, 14 … 265, 266, 267, 268, 269, 270, 271, 272, 273, 274]
modelnum::Int64 = 1
```
"""
ProteinChain

countresidues(chain::ProteinChain) = length(chain.sequence)
Base.getindex(chain::ProteinChain, i::AbstractVector{<:Integer}) =
ProteinChain(chain.id, chain.sequence[i], chain.backbone[:,:,i], chain.numbering[i], chain.atoms[i])

import Backboner

Base.getindex(chain::ProteinChain, i::AbstractVector{<:Integer}) = ProteinChain(chain.id, chain.sequence[i], chain.backbone[:,:,i], chain.atoms[i])
Backboner.Backbone(chain::AbstractProteinChain) = Backbone(chain.backbone)
Backboner.ChainedBonds(chain::AbstractProteinChain) = ChainedBonds(Backbone(chain))
Backboner.Frames(chain::AbstractProteinChain, ideal_residue=STANDARD_RESIDUE) = Frames(Backbone(chain), ideal_residue)

Base.summary(chain::ProteinChain) = "$(countresidues(chain))-residue ProteinChain \"$(chain.id)\""
psi_angles(chain::AbstractProteinChain) = get_torsion_angles(Backbone(chain))[1:3:end]
omega_angles(chain::AbstractProteinChain) = get_torsion_angles(Backbone(chain))[2:3:end]
phi_angles(chain::AbstractProteinChain) = get_torsion_angles(Backbone(chain))[3:3:end]

function offset!(chain::ProteinChain, coords::Vector{<:Real})
@assert length(coords) == 3
function offset!(chain::AbstractProteinChain, coords::Vector{<:Real})
chain.backbone .+= coords
for residue_atoms in chain.atoms
for atom in residue_atoms
Expand All @@ -44,14 +56,3 @@ function offset!(chain::ProteinChain, coords::Vector{<:Real})
end
return chain
end

import Backboner

Backboner.Backbone(chain::ProteinChain) = Backbone(chain.backbone)
Backboner.ChainedBonds(chain::ProteinChain) = ChainedBonds(Backbone(chain))
Backboner.Frames(chain::ProteinChain, ideal_residue=STANDARD_RESIDUE) = Frames(Backbone(chain), ideal_residue)

# TODO: "contiguity" field/property to mask outputs with NaN where residues are not contiguous
psi_angles(chain::ProteinChain) = get_torsion_angles(Backbone(chain))[1:3:end]
omega_angles(chain::ProteinChain) = get_torsion_angles(Backbone(chain))[2:3:end]
phi_angles(chain::ProteinChain) = get_torsion_angles(Backbone(chain))[3:3:end]
39 changes: 39 additions & 0 deletions src/io/download.jl
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
using Base.Filesystem

const MAX_CACHE_ENTRIES = Ref{Int}(4)
const CACHE_DIR = Ref{Union{String, Nothing}}(nothing)

function initialize_cache_dir()
if CACHE_DIR[] === nothing
CACHE_DIR[] = mktempdir(prefix="pdb_cache_")
atexit(() -> rm(CACHE_DIR[], recursive=true))
end
end

function manage_cache()
entries = readdir(CACHE_DIR[])
while length(entries) > MAX_CACHE_ENTRIES[]
oldest_entry = entries[argmin([mtime(joinpath(CACHE_DIR[], entry)) for entry in entries])]
rm(joinpath(CACHE_DIR[], oldest_entry))
entries = readdir(CACHE_DIR[])
end
end

function pdbentry(pdbid::AbstractString; format=BioStructures.MMCIFFormat, kwargs...)
initialize_cache_dir()
path = BioStructures.downloadpdb(pdbid; dir=CACHE_DIR[], format=format, kwargs...)
manage_cache()
return read(path, ProteinStructure, format)
end

macro pdb_str(pdbid::AbstractString)
:(pdbentry($(esc(pdbid))))
end

macro pdb_str(pdbid::AbstractString, chain::AbstractString)
:(pdbentry($(esc(pdbid)))[$chain])
end

macro pdb_str(pdbid::AbstractString, ba_number::Integer)
:(pdbentry($(esc(pdbid)), ba_number=$ba_number))
end
10 changes: 1 addition & 9 deletions src/io/io.jl
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,4 @@ get_format(path::AbstractString) = get(pdbextension_to_format, lowercase(last(sp
include("renumber.jl")
include("read.jl")
include("write.jl")

pdbentry(pdbid::AbstractString; format=BioStructures.MMCIFFormat, kwargs...) = mktempdir() do dir
path = BioStructures.downloadpdb(pdbid; dir, format, kwargs...)
readchains(path, format)
end

macro pdb_str(pdbid::AbstractString)
:(pdbentry($(esc(pdbid))))
end
include("download.jl")
Loading

2 comments on commit 51c3c32

@AntonOresten
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JuliaRegistrator register

Release notes:

  • Add AnnotatedProteinChain
  • Cache up to four PDB entries that have been fetched with the pdbentry function or @pdb_str macro.
  • Disordered atoms/residues no longer get thrown away (collapses to default).

@JuliaRegistrator
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Registration pull request created: JuliaRegistries/General/115010

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.2.0 -m "<description of version>" 51c3c32ea162b83913f36a144b42e84dd373bd10
git push origin v0.2.0

Please sign in to comment.