Deploying to pages from @ 7893e8e 🚀

cytomining · Jan 13, 2024 · 9f1da2d
1 parent 4c68827
commit 9f1da2d

Showing 5 changed files with 309 additions and 1 deletion.
diff --git a/_sources/overview.md.txt b/_sources/overview.md.txt
@@ -117,3 +117,163 @@ Specify the converted data destination using the  :code:`convert(..., dest_path=
 ```{eval-rst}
   Parquet data destination type may be specified by using :code:`convert(..., dest_datatype="parquet", ...)` (:mod:`convert() <cytotable.convert.convert>`).
 ```
+
+## Data Transformations
+
+CytoTable performs various types of data transformations.
+This section helps define the terminology and sets expectations for how it is used.
+CytoTable might use one or all of these depending on user configuration.
+
+### Data Chunking
+
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Data source"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+<td>
+
+"Chunk 1"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+
+</table>
+
+"Chunk 2"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+</tr>
+</table>
+
+_Example of data chunking performed on a simple table of data._
+
+```{eval-rst}
+Data chunking within CytoTable involves slicing data sources into "chunks" of rows which all contain the same columns and have fewer rows than the original data source.
+CytoTable chunks data according to the ``chunk_size`` argument (for example, :code:`convert(..., chunk_size=1000, ...)`; see :mod:`convert() <cytotable.convert.convert>`) so that each operation works on a subset of the data and needs less memory.
+CytoTable may be used to create chunked data output by disabling concatenation and joins, e.g. :code:`convert(..., concat=False, join=False, ...)` (:mod:`convert() <cytotable.convert.convert>`).
+Parquet "datasets" are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see `PyArrow documentation <https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html>`_ or `Pandas documentation <https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html>`_ on using source paths which are directories).
+```
+
+### Data Concatenations
+
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Chunk 1"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+
+</table>
+
+"Chunk 2"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+<td>
+
+"Concatenated data"
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+
+</td>
+</tr>
+</table>
+
+_Example of data concatenation performed on simple tables of similar data "chunks"._
+
+Data concatenation within CytoTable involves bringing two or more data "chunks" with the same columns together as a unified dataset.
+Just as chunking slices data apart, concatenation brings it back together.
+Data concatenation within CytoTable typically uses a [ParquetWriter](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html) to compose a single file from many individual files.
+
+### Data Joins
+
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+
+"Table 1" (notice __Col_C__)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+
+</table>
+
+"Table 2" (notice __Col_Z__)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>2024-01-01</td></tr>
+</table>
+
+</td>
+<td>
+
+"Joined data" (as Table 1 <a href="https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join">left-joined</a> with Table 2)
+
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td><td>2024-01-01</td></tr>
+</table>
+
+</td>
+</tr>
+<tr >
+<td colspan="2" style="text-align:center;font-weight:bold;">
+Join Specification in SQL
+</td>
+</tr>
+<tr >
+<td colspan="2">
+
+```sql
+SELECT *
+FROM Table_1
+LEFT JOIN Table_2 ON
+Table_1.Col_A = Table_2.Col_A;
+```
+
+</td>
+</tr>
+</table>
+
+_Example of a data join performed on simple example tables._
+
+```{eval-rst}
+Data joins within CytoTable involve bringing two or more data sources with differing columns together as a new dataset.
+The word "join" here is interpreted through `SQL-based terminology on joins <https://en.wikipedia.org/wiki/Join_(SQL)>`_.
+Joins may be specified in CytoTable using `DuckDB-style SQL <https://duckdb.org/docs/sql/introduction.html>`_ through :code:`convert(..., joins="SELECT * FROM ... JOIN ...", ...)` (:mod:`convert() <cytotable.convert.convert>`).
+Also see CytoTable's presets found here: :data:`presets.config <cytotable.presets.config>` or via `GitHub source code for presets.config <https://github.com/cytomining/CytoTable/blob/main/cytotable/presets.py>`_.
+```
+
+Note: data software outside of CytoTable sometimes uses the term "merge" to describe capabilities which are similar to a join (for example, [`pandas.DataFrame.merge`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html)).
+Within CytoTable, we describe these operations as "joins" to avoid confusion when developing alongside the underlying technologies (for example, [DuckDB SQL](https://duckdb.org/docs/archive/0.9.2/sql/introduction) includes no `MERGE` keyword).
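
The three transformations documented in `overview.md.txt` above (chunking, concatenation, and joins) can be sketched end to end with the Python standard library alone. This is a minimal illustration under stated assumptions, not CytoTable's implementation: rows are plain dictionaries rather than Parquet data, the helper names `chunk_rows`, `concat_chunks`, and `left_join` are hypothetical, and SQLite (via `sqlite3`) stands in for DuckDB. The join uses an explicit column list so that the result has the same four columns as the "Joined data" table; `SELECT *` in SQLite would return `Col_A` and `Col_B` once from each table.

```python
import sqlite3
from itertools import islice


def chunk_rows(rows, chunk_size):
    """Yield successive lists holding at most chunk_size rows each."""
    iterator = iter(rows)
    while chunk := list(islice(iterator, chunk_size)):
        yield chunk


def concat_chunks(chunks):
    """Bring chunks with identical columns back together as one list of rows."""
    rows = []
    columns = None
    for chunk in chunks:
        for row in chunk:
            if columns is None:
                columns = list(row)  # the first row fixes the expected columns
            elif list(row) != columns:
                raise ValueError("all chunks must share the same columns")
            rows.append(row)
    return rows


def left_join(table_1, table_2):
    """Left-join Table_1 with Table_2 on Col_A using an in-memory SQLite database.

    Assumption: SQLite is only a stand-in here; CytoTable uses DuckDB-style SQL.
    """
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE Table_1 (Col_A INTEGER, Col_B TEXT, Col_C REAL)")
        con.execute("CREATE TABLE Table_2 (Col_A INTEGER, Col_B TEXT, Col_Z TEXT)")
        con.executemany(
            "INSERT INTO Table_1 VALUES (:Col_A, :Col_B, :Col_C)", table_1
        )
        con.executemany(
            "INSERT INTO Table_2 VALUES (:Col_A, :Col_B, :Col_Z)", table_2
        )
        return con.execute(
            "SELECT Table_1.Col_A, Table_1.Col_B, Table_1.Col_C, Table_2.Col_Z "
            "FROM Table_1 LEFT JOIN Table_2 ON Table_1.Col_A = Table_2.Col_A"
        ).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    source = [
        {"Col_A": 1, "Col_B": "a", "Col_C": 0.01},
        {"Col_A": 2, "Col_B": "b", "Col_C": 0.02},
    ]
    chunks = list(chunk_rows(source, chunk_size=1))  # two chunks of one row each
    print(chunks)
    print(concat_chunks(chunks))  # the original rows, back in one list
    print(
        left_join(
            [{"Col_A": 1, "Col_B": "a", "Col_C": 0.01}],
            [{"Col_A": 1, "Col_B": "a", "Col_Z": "2024-01-01"}],
        )
    )
```

In CytoTable itself these steps are driven by the `chunk_size`, `concat`, `join`, and `joins` arguments of `convert()` and operate on Parquet files rather than in-memory lists.
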
diff --git a/_static/custom.css b/_static/custom.css
@@ -2,3 +2,19 @@ div.body h5 {
     font-size: 110%;
     font-weight: bold;
 }
+
+html body table td,
+html body table th {
+    border: 1px solid #d6d6d6;
+    padding: 6px 13px;
+}
+
+
+html body table table th {
+    background: #eee;
+}
+
+table {
+    border-spacing: 0;
+    border-collapse: collapse;
+}
diff --git a/index.html b/index.html
@@ -94,6 +94,12 @@ <h2>References<a class="headerlink" href="#references" title="Permalink to this
 <li class="toctree-l3"><a class="reference internal" href="overview.html#data-destination-types">Data Destination Types</a></li>
 </ul>
 </li>
+<li class="toctree-l2"><a class="reference internal" href="overview.html#data-transformations">Data Transformations</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="overview.html#data-chunking">Data Chunking</a></li>
+<li class="toctree-l3"><a class="reference internal" href="overview.html#data-concatenations">Data Concatenations</a></li>
+<li class="toctree-l3"><a class="reference internal" href="overview.html#data-joins">Data Joins</a></li>
+</ul>
+</li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="tutorial.html">Tutorial</a><ul>

diff --git a/overview.html b/overview.html
@@ -141,6 +141,126 @@ <h3>Data Destination Types<a class="headerlink" href="#data-destination-types" t
 </div></blockquote>
 </section>
 </section>
+<section id="data-transformations">
+<h2>Data Transformations<a class="headerlink" href="#data-transformations" title="Permalink to this heading">¶</a></h2>
+<p>CytoTable performs various types of data transformations.
+This section helps define the terminology and sets expectations for how it is used.
+CytoTable might use one or all of these depending on user configuration.</p>
+<section id="data-chunking">
+<h3>Data Chunking<a class="headerlink" href="#data-chunking" title="Permalink to this heading">¶</a></h3>
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+<p>“Data source”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+</td>
+<td>
+<p>“Chunk 1”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+<p>“Chunk 2”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+</td>
+</tr>
+</table>
+<p><em>Example of data chunking performed on a simple table of data.</em></p>
+<p>Data chunking within CytoTable involves slicing data sources into “chunks” of rows which all contain the same columns and have fewer rows than the original data source.
+CytoTable chunks data according to the <code class="docutils literal notranslate"><span class="pre">chunk_size</span></code> argument (for example, <code class="code docutils literal notranslate"><span class="pre">convert(...,</span> <span class="pre">chunk_size=1000,</span> <span class="pre">...)</span></code>; see <a class="reference internal" href="python-api.html#cytotable.convert.convert" title="cytotable.convert.convert"><code class="xref py py-mod docutils literal notranslate"><span class="pre">convert()</span></code></a>) so that each operation works on a subset of the data and needs less memory.
+CytoTable may be used to create chunked data output by disabling concatenation and joins, e.g. <code class="code docutils literal notranslate"><span class="pre">convert(...,</span> <span class="pre">concat=False,</span> <span class="pre">join=False,</span> <span class="pre">...)</span></code> (<a class="reference internal" href="python-api.html#cytotable.convert.convert" title="cytotable.convert.convert"><code class="xref py py-mod docutils literal notranslate"><span class="pre">convert()</span></code></a>).
+Parquet “datasets” are an abstraction which may be used to read CytoTable output data chunks which are not concatenated or joined (for example, see <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html">PyArrow documentation</a> or <a class="reference external" href="https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html">Pandas documentation</a> on using source paths which are directories).</p>
+</section>
+<section id="data-concatenations">
+<h3>Data Concatenations<a class="headerlink" href="#data-concatenations" title="Permalink to this heading">¶</a></h3>
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+<p>“Chunk 1”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+<p>“Chunk 2”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+</td>
+<td>
+<p>“Concatenated data”</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+<tr><td>2</td><td>b</td><td>0.02</td></tr>
+</table>
+</td>
+</tr>
+</table>
+<p><em>Example of data concatenation performed on simple tables of similar data “chunks”.</em></p>
+<p>Data concatenation within CytoTable involves bringing two or more data “chunks” with the same columns together as a unified dataset.
+Just as chunking slices data apart, concatenation brings it back together.
+Data concatenation within CytoTable typically uses a <a class="reference external" href="https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetWriter.html">ParquetWriter</a> to compose a single file from many individual files.</p>
+</section>
+<section id="data-joins">
+<h3>Data Joins<a class="headerlink" href="#data-joins" title="Permalink to this heading">¶</a></h3>
+<table>
+<tr><th>Original</th><th>Changes</th></tr>
+<tr>
+<td>
+<p>“Table 1” (notice <strong>Col_C</strong>)</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td></tr>
+</table>
+<p>“Table 2” (notice <strong>Col_Z</strong>)</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>2024-01-01</td></tr>
+</table>
+</td>
+<td>
+<p>“Joined data” (as Table 1 <a href="https://en.wikipedia.org/wiki/Join_(SQL)#Left_outer_join">left-joined</a> with Table 2)</p>
+<table>
+<tr><th>Col_A</th><th>Col_B</th><th>Col_C</th><th>Col_Z</th></tr>
+<tr><td>1</td><td>a</td><td>0.01</td><td>2024-01-01</td></tr>
+</table>
+</td>
+</tr>
+<tr >
+<td colspan="2" style="text-align:center;font-weight:bold;">
+Join Specification in SQL
+</td>
+</tr>
+<tr >
+<td colspan="2">
+<div class="highlight-sql notranslate"><div class="highlight"><pre><span></span><span class="k">SELECT</span><span class="w"> </span><span class="o">*</span>
+<span class="k">FROM</span><span class="w"> </span><span class="n">Table_1</span>
+<span class="k">LEFT</span><span class="w"> </span><span class="k">JOIN</span><span class="w"> </span><span class="n">Table_2</span><span class="w"> </span><span class="k">ON</span>
+<span class="n">Table_1</span><span class="p">.</span><span class="n">Col_A</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">Table_2</span><span class="p">.</span><span class="n">Col_A</span><span class="p">;</span>
+</pre></div>
+</div>
+</td>
+</tr>
+</table>
+<p><em>Example of a data join performed on simple example tables.</em></p>
+<p>Data joins within CytoTable involve bringing two or more data sources with differing columns together as a new dataset.
+The word “join” here is interpreted through <a class="reference external" href="https://en.wikipedia.org/wiki/Join_(SQL)">SQL-based terminology on joins</a>.
+Joins may be specified in CytoTable using <a class="reference external" href="https://duckdb.org/docs/sql/introduction.html">DuckDB-style SQL</a> through <code class="code docutils literal notranslate"><span class="pre">convert(...,</span> <span class="pre">joins=&quot;SELECT</span> <span class="pre">*</span> <span class="pre">FROM</span> <span class="pre">...</span> <span class="pre">JOIN</span> <span class="pre">...&quot;,</span> <span class="pre">...)</span></code> (<a class="reference internal" href="python-api.html#cytotable.convert.convert" title="cytotable.convert.convert"><code class="xref py py-mod docutils literal notranslate"><span class="pre">convert()</span></code></a>).
+Also see CytoTable’s presets found here: <a class="reference internal" href="python-api.html#cytotable.presets.config" title="cytotable.presets.config"><code class="xref py py-data docutils literal notranslate"><span class="pre">presets.config</span></code></a> or via <a class="reference external" href="https://github.com/cytomining/CytoTable/blob/main/cytotable/presets.py">GitHub source code for presets.config</a>.</p>
+<p>Note: data software outside of CytoTable sometimes uses the term “merge” to describe capabilities which are similar to a join (for example, <a class="reference external" href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html"><code class="docutils literal notranslate"><span class="pre">pandas.DataFrame.merge</span></code></a>).
+Within CytoTable, we describe these operations as “joins” to avoid confusion when developing alongside the underlying technologies (for example, <a class="reference external" href="https://duckdb.org/docs/archive/0.9.2/sql/introduction">DuckDB SQL</a> includes no <code class="docutils literal notranslate"><span class="pre">MERGE</span></code> keyword).</p>
+</section>
+</section>
 </section>
 
 
@@ -181,6 +301,12 @@ <h3>Navigation</h3>
 <li class="toctree-l3"><a class="reference internal" href="#data-destination-types">Data Destination Types</a></li>
 </ul>
 </li>
+<li class="toctree-l2"><a class="reference internal" href="#data-transformations">Data Transformations</a><ul>
+<li class="toctree-l3"><a class="reference internal" href="#data-chunking">Data Chunking</a></li>
+<li class="toctree-l3"><a class="reference internal" href="#data-concatenations">Data Concatenations</a></li>
+<li class="toctree-l3"><a class="reference internal" href="#data-joins">Data Joins</a></li>
+</ul>
+</li>
 </ul>
 </li>
 <li class="toctree-l1"><a class="reference internal" href="tutorial.html">Tutorial</a></li>

diff --git a/searchindex.js b/searchindex.js