renamed all dspy to rosetta

mdeland · Nov 10, 2013 · commit c0b10bd
1 parent b13b5e0

Showing 44 changed files with 6,233 additions and 50 deletions.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -9,17 +9,17 @@ likelihood of your contribution being merged.**
 How to contribute
 -----------------
 
-The preferred way to contribute to dspy is to fork the 
-[project repository](https://github.com/columbia-applied-data-science/dspy/) on
+The preferred way to contribute to rosetta is to fork the 
+[project repository](https://github.com/columbia-applied-data-science/rosetta/) on
 GitHub:
 
-1. Fork the [project repository](https://github.com/columbia-applied-data-science/dspy/):
+1. Fork the [project repository](https://github.com/columbia-applied-data-science/rosetta/):
    click on the 'Fork' button near the top of the page. This creates
    a copy of the code under your account on the GitHub server.
 
 2. Clone this copy to your local disk:
 
-          $ git clone [email protected]:YourLogin/dspy.git
+          $ git clone [email protected]:YourLogin/rosetta.git
 
 3. Create a branch to hold your changes:
 
@@ -37,7 +37,7 @@ GitHub:
 
           $ git push -u origin my-feature
 
-Finally, go to the web page of the your fork of the dspy repo,
+Finally, go to the web page of the your fork of the rosetta repo,
 and click 'Pull request' to send your changes to the maintainers for
 review. request. This will send an email to the committers.
 
@@ -54,7 +54,7 @@ following rules before submitting a pull request:
    example script in the ``examples/`` folder. Have a look at other
    examples for reference. Examples should demonstrate why the new
    functionality is useful in practice and, if possible, compare it
-   to other methods available in dspy.
+   to other methods available in rosetta.
 
 -  At least one paragraph of narrative documentation with links to
    references in the literature (with PDF links when possible) and

diff --git a/LICENSE.txt b/LICENSE.txt
@@ -2,10 +2,10 @@
 License
 =======
 
-DSpy is distributed under a 3-clause ("Simplified" or "New") BSD
+Rosetta is distributed under a 3-clause ("Simplified" or "New") BSD
 license. 
 
-DSpy license
+Rosetta license
 ==============
 
 Redistribution and use in source and binary forms, with or without
@@ -40,28 +40,28 @@ About the Copyright Holders
 ===========================
 
 The core team that coordinates development on GitHub can be found here:
-http://github.com/columbia-applied-data-science/DSpy
+http://github.com/columbia-applied-data-science/Rosetta
 
-Full credits for DSpy contributors can be found in the documentation.
+Full credits for Rosetta contributors can be found in the documentation.
 
 Our Copyright Policy
 ====================
 
-DSpy uses a shared copyright model. Each contributor maintains copyright
-over their contributions to DSpy. However, it is important to note that
+Rosetta uses a shared copyright model. Each contributor maintains copyright
+over their contributions to Rosetta. However, it is important to note that
 these contributions are typically only changes to the repositories. Thus,
-the DSpy source code, in its entirety, is not the copyright of any single
+the Rosetta source code, in its entirety, is not the copyright of any single
 person or institution. Instead, it is the collective copyright of the
-entire DSpy Development Team. If individual contributors want to maintain
+entire Rosetta Development Team. If individual contributors want to maintain
 a record of what changes/contributions they have specific copyright on,
 they should indicate their copyright in the commit message of the change
-when they commit the change to one of the DSpy repositories.
+when they commit the change to one of the Rosetta repositories.
 
 With this in mind, the following banner should be used in any source code
 file to indicate the copyright and license terms:
 
 #-----------------------------------------------------------------------------
-# Copyright (c) 2013, DSpy Development Team
+# Copyright (c) 2013, Rosetta Development Team
 # All rights reserved.
 #
 # Distributed under the terms of the BSD Simplified License.

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,10 +1,10 @@
-recursive-include dspy *
-recursive-include dspy/cmd *
-recursive-include dspy/modeling *
-recursive-include dspy/parallel *
-recursive-include dspy/tests *
-recursive-include dspy/text *
-recursive-include dspy/workflow *
+recursive-include rosetta *
+recursive-include rosetta/cmd *
+recursive-include rosetta/modeling *
+recursive-include rosetta/parallel *
+recursive-include rosetta/tests *
+recursive-include rosetta/text *
+recursive-include rosetta/workflow *
 
 include MANIFEST.in
 include LICENSE

diff --git a/README.md b/README.md
@@ -1,4 +1,4 @@
-DSpy
+Rosetta
 ====
 
 Tools for data science with a focus on text processing.
@@ -32,7 +32,7 @@ See the `examples/` directory for more details.
 
 Install
 -------
-Check out the dev branch or a tagged release from the [dspyrepo][dspyrepo].  Then (so long as you have `pip`).
+Check out the dev branch or a tagged release from the [rosettarepo][rosettarepo].  Then (so long as you have `pip`).
 
     make
     make test
@@ -44,11 +44,11 @@ Development
 
 You can check the latest sources with
 
-    git clone git://github.com/columbia-applied-data-science/dspy
+    git clone git://github.com/columbia-applied-data-science/rosetta
 
 ### Contributing
 
-Feel free to contribute a bug report or a request by opening an [issue](https://github.com/columbia-applied-data-science/dspy/issues)
+Feel free to contribute a bug report or a request by opening an [issue](https://github.com/columbia-applied-data-science/rosetta/issues)
 
 Before contributing code, read `CONTRIBUTING.md`
 
@@ -57,12 +57,12 @@ Dependencies
 
 Testing
 -------
-From the base repo directory, `dspy/`, you can run all tests with
+From the base repo directory, `rosetta/`, you can run all tests with
 
     make test
 
 History
 -------
-The *DS* in DSpy clearly relates to *Data Science*.  However, it came first from *Data Structure* and the *Dead Sea*.  The tools concentrate on streaming text, and the dead sea scrolls are the most famous version of text in a stream (a lake actually...but just pretend and it's really cool).
+The *DS* in Rosetta clearly relates to *Data Science*.  However, it came first from *Data Structure* and the *Dead Sea*.  The tools concentrate on streaming text, and the dead sea scrolls are the most famous version of text in a stream (a lake actually...but just pretend and it's really cool).
 
-[dspyrepo]: https://github.com/columbia-applied-data-science/dspy
+[rosettarepo]: https://github.com/columbia-applied-data-science/rosetta
diff --git a/examples/vw_helpers.md b/examples/vw_helpers.md
@@ -1,9 +1,9 @@
 Working with Vowpal Wabbit (VW)
 ===============================
 
-To work with the `dspy` utilities you need to:
+To work with the `rosetta` utilities you need to:
 
-* Clone the [dspy repo][dspyrepo]  and read `README.md`.
+* Clone the [rosetta repo][rosettarepo]  and read `README.md`.
 
 Create the sparse file (sfile)
 ------------------------------
@@ -21,15 +21,15 @@ The `TextFileStreamer` needs a method to convert the text files to a list of str
 Once you have a tokenizer, just initialize a streamer and write the VW file.
 
 ```python
-from dspy import TextFileStreamer, TokenizerBasic
+from rosetta import TextFileStreamer, TokenizerBasic
 
 my_tokenizer = TokenizerBasic()
 stream = TextFileStreamer(text_base_path='bodyfiles', tokenizer=my_tokenizer)
 stream.to_vw('doc_tokens.vw', n_jobs=-1)
 ```
 
 ### Method 2: `files_to_vw.py`
-`files_to_vw.py` is a fast and simple command line utility for converting files to VW format.  Installing `dspy` will put these utilities in your path.
+`files_to_vw.py` is a fast and simple command line utility for converting files to VW format.  Installing `rosetta` will put these utilities in your path.
 
 * Try converting the first 5 files in `my_base_path`.  The following should print 5 lines of of results, in [vw format][vwinput]
 
@@ -150,7 +150,7 @@ The python function `filter_sfile.py` takes in `ddrs.vw` and streams a filtered
 You can view the topics and predictions with this:
 
 ```python
-from dspy.text.vw_helpers import LDAResults
+from rosetta.text.vw_helpers import LDAResults
 num_topics = 5
 lda = LDAResults('topics.dat', 'prediction.dat', num_topics, 'sff_file.pkl')
 lda.print_topics()
@@ -195,9 +195,9 @@ Contribute!
 
 
 [vwinput]: https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format
-[dspyrepo]: https://github.com/columbia-applied-data-science/dspy
+[rosettarepo]: https://github.com/columbia-applied-data-science/rosetta
 [vwlda]: https://github.com/JohnLangford/vowpal_wabbit/wiki/lda.pdf
 [vwtricks]: www.slideshare.net/jakehofman/technical-tricks-of-vowpal-wabbit‎
 [hashing]: https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction
 [spot]: http://en.wikipedia.org/wiki/Single_Point_of_Truth
-[issue]: https://github.com/columbia-applied-data-science/dspy/issues
+[issue]: https://github.com/columbia-applied-data-science/rosetta/issues
diff --git a/makefile b/makefile
@@ -6,7 +6,7 @@ PYTHON ?= python
 UNITTEST ?= unittest
 CTAGS ?= ctags
 
-TESTDIR=dspy/tests
+TESTDIR=rosetta/tests
 
 all: install test
 
@@ -18,7 +18,7 @@ install: clean
 
 # Reinstall with pip
 reinstall: clean
-	pip uninstall dspy
+	pip uninstall rosetta
 	$(PYTHON) setup.py sdist
 	pip install dist/*
 
@@ -50,14 +50,14 @@ test-cmd:
 	$(PYTHON) -m $(UNITTEST) discover -s $(TESTDIR) -p '*cmd*' -v
 
 trailing-spaces:
-	find dspy -name "*.py" | xargs perl -pi -e 's/[ \t]*$$//'
+	find rosetta -name "*.py" | xargs perl -pi -e 's/[ \t]*$$//'
 
 ctags:
 	# make tags for symbol based navigation in emacs and vim
 	# Install with: sudo apt-get install exuberant-ctags
 	$(CTAGS) -R *
 
 code-analysis:
-	flake8 dspy | grep -v __init__ | grep -v external
-	pylint -E -i y dspy/ -d E1103,E0611,E1101
+	flake8 rosetta | grep -v __init__ | grep -v external
+	pylint -E -i y rosetta/ -d E1103,E0611,E1101
 
diff --git a/rosetta/__init__.py b/rosetta/__init__.py
@@ -0,0 +1 @@
+from rosetta.text.api import *
diff --git a/rosetta/cmd/__init__.py b/rosetta/cmd/__init__.py
diff --git a/rosetta/cmd/bashrc_additions b/rosetta/cmd/bashrc_additions
@@ -0,0 +1,27 @@
+# Additions to your bashrc
+#
+#
+###############################################################################
+# INSTALLATION
+###############################################################################
+# Put desired sections in your ~/.bashrc (or ~/.bash_profile on macs) and then
+# "source it" or close then open a new terminal.
+#
+###############################################################################
+# Body function
+###############################################################################
+# This allows you to run a command on the body of the function, skipping the header
+# (but still printing the header).  For example,
+# 
+# $ cat filewithheader | body sort -k1,1
+#
+# will sort filewithheader, using the first field, but leave the header at the top
+# of the file.
+
+body() {
+    IFS= read -r header
+    printf '%s\n' "$header"
+    "$@"
+}
+
+export -f body
diff --git a/rosetta/cmd/concat_csv.py b/rosetta/cmd/concat_csv.py
@@ -0,0 +1,80 @@
+#!/usr/bin/env python
+"""
+Concat a list of csv files in an "outer join" style.
+
+From pandas, uses DataFrame.from_csv, DataFrame.to_csv, concat to do
+reads/writes/joins.  Except noted below, the default arguments are used.
+"""
+
+import argparse
+import sys
+
+import pandas as pd
+
+
+def _cli():
+    # Text to display after help
+    epilog = """
+    EXAMPLES
+
+    Concat two files, each with a header and index, redirect output to newfile
+    $ python concat_csv.py --index --header file1 file2 > newfile
+
+    Concat two files, write result to newfile
+    $ python concat_csv.py --index --header -o newfile file1 file2
+
+    Concat all files in mydir/, write result to stdout.
+    $ python concat_csv.py  mydir/*
+    """
+    parser = argparse.ArgumentParser(
+        description=globals()['__doc__'], epilog=epilog,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+
+    parser.add_argument(
+        'paths', nargs='*', help='Concat files in this space separated list')
+    parser.add_argument(
+        '-o', '--outfile', default=sys.stdout,
+        type=argparse.FileType('w'),
+        help='Write to OUT_FILE rather than sys.stdout.')
+    parser.add_argument(
+        '-s', '--sep', default=',',
+        help='Delimiter to use.  Regular expressions are accepted.'
+        '  [default: %(default)s]')
+
+    parser.add_argument(
+        '--index', action='store_true', default=False,
+        help='Flag to set if files have an index (leftmost column).'
+        ' [default: %(default)s].')
+    parser.add_argument(
+        '--header', action='store_true', default=False,
+        help='Flag to set if files have headers (in top row).  '
+        '[default: %(default)s]')
+
+    parser.add_argument(
+        '-a', '--axis', type=int, default=0,
+        help='Axes along which to concatenate')
+
+    # Parse and check args
+    args = parser.parse_args()
+
+    # Call the module interface
+    _concat(
+        args.outfile, args.paths, args.sep, args.index, args.header, args.axis)
+
+
+def _concat(outfile, paths, sep, index, header, axis):
+    # Read
+    index_col = 0 if index else False
+    header_row = 0 if header else False
+    kwargs = {'sep': sep, 'index_col': index_col, 'header': header_row}
+    frames = pd.concat(
+        (pd.DataFrame.from_csv(p, **kwargs) for p in paths), axis=axis)
+
+    # Write
+    kwargs = {'sep': sep, 'index': index, 'header': header}
+
+    frames.to_csv(outfile, **kwargs)
+
+
+if __name__ == '__main__':
+    _cli()
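
Review note on `rosetta/cmd/concat_csv.py`: the added `_concat` relies on `pd.DataFrame.from_csv`, which was removed in pandas 1.0, and passes `header=False`, which newer pandas versions may reject. A minimal sketch of the same "outer join" concat against current pandas follows. The function name `concat_csv` and the in-memory sample inputs are hypothetical and used only for illustration; this is not the committed implementation.

```python
import io

import pandas as pd


def concat_csv(sources, sep=",", index=False, header=False, axis=0):
    """Concatenate CSV sources in an outer-join style and return CSV text.

    Hypothetical helper mirroring the arguments of _concat in the diff.
    """
    # read_csv expects None (not False) to mean "no header row" and
    # "no index column".
    read_kwargs = {
        "sep": sep,
        "index_col": 0 if index else None,
        "header": 0 if header else None,
    }
    frames = [pd.read_csv(src, **read_kwargs) for src in sources]

    # join="outer" is the pd.concat default; it is spelled out here to
    # match the behaviour described in the module docstring.
    combined = pd.concat(frames, axis=axis, join="outer")

    out = io.StringIO()
    combined.to_csv(out, sep=sep, index=index, header=header)
    return out.getvalue()


# Made-up sample data: two files sharing an index column but not a data column.
file1 = io.StringIO("id,a\n1,10\n2,20\n")
file2 = io.StringIO("id,b\n3,30\n")
result = concat_csv([file1, file2], index=True, header=True)
print(result)
```

Columns missing from one input are left empty in the output rows that came from it, which is the "outer join" behaviour the docstring describes.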