TranscriptResolver

Author

Jun Ying Lim

Department of Integrative Biology and UC Museum of Paleontology, University of California, Berkeley

Description

TranscriptResolver

Reconciles output from citizen science transcription efforts on Notes from Nature (http://notesfromnature.org) for the Calbug project (http://calbug.berkeley.edu/).
The code should work for any type of data that contains replicates (e.g., replicate verbatim transcriptions of the same specimen label).
It uses sequence alignment algorithms to find consensus strings (http://blog.notesfromnature.org/2014/01/14/checking-notes-from-nature-data/).

Different consensus methods:

vote_count - Chooses the most frequently occurring value. Recommended for fields where the choice of values is constrained (e.g., drop-down lists).
consensus - Performs a character sequence alignment on the replicate strings and produces a consensus string. Recommended for fields where input is free-form (e.g., verbatim transcription of fields).
metadata - Does not perform any consensus method per se. Instead, it combines all values into a single string, delimited by "|".
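
The three methods can be illustrated with a minimal sketch. This is not the code in consensus_tools.py: the function names below are hypothetical, and the consensus step assumes the replicate strings have already been aligned to equal length (in TranscriptResolver that alignment comes from MAFFT), with "-" as an assumed gap character.

```python
from collections import Counter


def vote_count(values):
    """Return the most frequent value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


def metadata(values, delimiter="|"):
    """Join all replicate values into one delimited string."""
    return delimiter.join(str(v) for v in values)


def column_consensus(aligned, gap="-"):
    """Majority vote per column over equal-length, pre-aligned strings.

    Columns where the gap character wins are dropped from the result.
    """
    consensus = []
    for column in zip(*aligned):
        winner = Counter(column).most_common(1)[0][0]
        if winner != gap:
            consensus.append(winner)
    return "".join(consensus)


# Three replicate transcriptions of the same (made-up) label field.
replicates = ["Berkeley", "Berkeley", "Berkley"]
# Hypothetical aligner output for the same replicates.
aligned = ["Berkeley", "Berkeley", "Berk-ley"]

print(vote_count(replicates))
print(metadata(replicates))
print(column_consensus(aligned))
```

The column vote is the simplest possible consensus rule; a real implementation may weight columns or handle ties differently.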

TranscriptPrepare

Custom script to prepare raw Notes from Nature output for resolving using TranscriptResolver
Main steps:
- Normalizes or prepares fields
- Cleans up column headers
- Creates a file for bulk upload into the Essig Database

TranscriptClean

Custom script to prepare raw Notes from Nature output for resolving using TranscriptResolver
Main steps:
- Normalizes or prepares fields
- Cleans up column headers
- Creates a file for bulk upload into the Essig Database

Installation

Python packages

You will need several Python packages to run TranscriptResolver. You can install them with pip. If you're running Python 3, pip should be installed by default. Enter the following in Terminal:

pip3 install fuzzywuzzy numpy pandas nltk python-Levenshtein pymysql biopython
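
To confirm the install worked, a short check along these lines may help. It is a sketch, not part of the repository. Note that two packages are imported under a different name than the one given to pip: python-Levenshtein imports as Levenshtein and biopython imports as Bio.

```python
import importlib.util

# Maps each pip package name to the module name used in import statements.
REQUIRED = {
    "fuzzywuzzy": "fuzzywuzzy",
    "numpy": "numpy",
    "pandas": "pandas",
    "nltk": "nltk",
    "python-Levenshtein": "Levenshtein",
    "pymysql": "pymysql",
    "biopython": "Bio",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages()
if missing:
    print("Missing packages: pip3 install " + " ".join(missing))
else:
    print("All required packages are installed.")
```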

Other binaries

The string consensus functions use the sequence alignment algorithms provided by MAFFT. You will have to install mafft into the directory /usr/local/bin

If you have Homebrew installed, you can install it as follows: brew install mafft
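
Because the location of the binary matters, it can be worth checking where mafft actually ended up. The sketch below uses only the standard library; the function name is hypothetical, and it does not test whether TranscriptResolver itself can call the binary.

```python
import os
import shutil

EXPECTED_DIR = "/usr/local/bin"  # directory named in the installation note


def locate(binary="mafft"):
    """Return (path, in_expected_dir); path is None if not on PATH."""
    path = shutil.which(binary)
    if path is None:
        return None, False
    return path, os.path.dirname(path) == EXPECTED_DIR


path, in_expected = locate()
if path is None:
    print("mafft not found on PATH")
elif in_expected:
    print(f"mafft found at {path}")
else:
    print(f"mafft found at {path}, not in {EXPECTED_DIR}")
```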

Usage

To run TranscriptResolver in Terminal, cd to the folder containing TranscriptResolver and run:

python3 transcriptResolver.py

The program will prompt you for the necessary parameters. You can also supply them directly on the command line:

python3 transcriptResolver.py -wd <yourworkingdir> -file <yourfile> -stem <yourstemname> -col_id UNIQUE_ID -col_target [<field1>,<field2>] -col_method [<method1>,<method2>]
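
The -col_target and -col_method values are bracketed, comma-separated lists that pair up by position: the first method applies to the first field, and so on. The sketch below shows one way such arguments could be parsed and checked. It is an assumption about the format, not the parsing code in transcriptResolver.py, and the field names in the example are made up.

```python
# Method names taken from the list of consensus methods in the Description.
VALID_METHODS = {"vote_count", "consensus", "metadata"}


def parse_bracket_list(text):
    """Turn '[a,b]' into ['a', 'b']; surrounding brackets are optional."""
    inner = text.strip()
    if inner.startswith("[") and inner.endswith("]"):
        inner = inner[1:-1]
    return [item.strip() for item in inner.split(",") if item.strip()]


def pair_targets(col_target, col_method):
    """Map each target field to its consensus method, by position."""
    targets = parse_bracket_list(col_target)
    methods = parse_bracket_list(col_method)
    if len(targets) != len(methods):
        raise ValueError("col_target and col_method must have equal length")
    unknown = sorted(set(methods) - VALID_METHODS)
    if unknown:
        raise ValueError("unknown method(s): " + ", ".join(unknown))
    return dict(zip(targets, methods))


# Hypothetical field names, for illustration only.
plan = pair_targets("[Country,Locality]", "[vote_count,consensus]")
print(plan)
```

Some shells treat square brackets as glob patterns, so quoting the bracketed values on the command line may be necessary.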

Contact

Feel free to contact me (junyinglim@berkeley.edu) if you have any questions or need help with installation!
