codeCAI - Generating Code from Natural Language

codeCAI
Table of Contents
  1. About codeCAI
  2. Installation
  3. Usage
  4. Further Material
  5. License
  6. Acknowledgments

About codeCAI

codeCAI is a conversational assistant that enables analysts to specify data analyses using natural language, which are then translated into executable Python code statements.

To realize an assistant capable of interpreting analytical instructions, we use a Transformer-based language model with an adapted tree encoding scheme and a restrictive grammar model, which learns to map natural language specifications to a tree-based representation of the output code. With this syntax-driven approach, we aim to enable the language model to capture the hierarchical relationships in the syntax trees that represent Python code fragments.
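To illustrate the idea of a tree encoding (a simplified sketch, not the exact scheme from the paper), the following example derives a hierarchical position for each AST node from its root-to-node path, the kind of structural signal a tree encoding can offer the model instead of flat sequence positions:

```python
# Illustrative sketch only: derives a hierarchical position for every AST
# node from its root-to-node path. A tree encoding can feed such paths to
# the Transformer instead of flat sequence positions, so that structurally
# related nodes get related encodings. This is not the paper's exact scheme.
import ast

def tree_positions(code):
    """Map each AST node to the tuple of child indices leading to it."""
    tree = ast.parse(code)
    positions = {}

    def walk(node, path):
        positions[node] = tuple(path)
        for i, child in enumerate(ast.iter_child_nodes(node)):
            walk(child, path + [i])

    walk(tree, [])
    return positions

for node, path in tree_positions("y = f(x) + 1").items():
    print(f"{type(node).__name__:10s} {path}")
```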

We have implemented a RASA-based dialogue system prototype, which we have integrated into the JupyterLab environment. Using this natural-language interface, analysts can specify data analyses at a high level of abstraction; these specifications are then automatically mapped to executable program instructions, which are generated in a Jupyter notebook.

Implementation overview

We evaluated the approach on two tasks, semantic parsing and code generation, using the ZIH HPC cluster. In both cases, the goal is to generate formal meaning representations from natural language input, that is, lambda-calculus expressions or Python code.

Experimental results show that on one benchmark the tree encoding performs better than the sequential encoding used by the original Transformer architecture. To test whether the tree-encoded Transformer learns to predict the AST structure correctly, we looked at exact match accuracy as well as token- and sequence-level precision and recall. The takeaway from the analysis of correctly predicted prefixes is that string literals have a significant impact on prediction quality and that longer sequences are more difficult to predict. We also found that the tree encoding yields an improvement of up to 3.0% over the sequential encoding when string literals are excluded.
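As a rough illustration of these metrics (the actual evaluation scripts may differ in tokenization and averaging details), exact match accuracy and token-level precision and recall over predicted token sequences can be computed like this:

```python
# Hedged sketch of the reported metrics; the repository's evaluation code
# may compute them differently (tokenization, averaging, etc.).
from collections import Counter

def exact_match_accuracy(predictions, references):
    """Fraction of predicted sequences identical to their reference."""
    hits = sum(p == r for p, r in zip(predictions, references))
    return hits / len(references)

def token_precision_recall(predicted, reference):
    """Token-level precision/recall based on multiset overlap."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall

# A single differing string literal breaks the exact match but leaves
# token-level scores high, mirroring the observation above.
pred = ["df", "=", "pd", ".", "read_csv", "(", "'data.csv'", ")"]
ref  = ["df", "=", "pd", ".", "read_csv", "(", "'x.csv'", ")"]
print(exact_match_accuracy([pred], [ref]))   # 0.0
print(token_precision_recall(pred, ref))     # (0.875, 0.875)
```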

Demo video: demo.mp4

Installation

Prerequisites

  • Python 3.8 (earlier versions may work as well)
  • Conda (Miniconda3 is sufficient)

Rasa Backend

  1. Create and activate Python environment
    conda create -n pyenv_rasa python=3.8
    conda activate pyenv_rasa
    Alternatively, you can use virtualenv (e.g. if you would like to use the system site-packages during code inference):
    conda deactivate
    python3 -m virtualenv ~/pyenv_rasa
    source ~/pyenv_rasa/bin/activate # use this everywhere instead of "conda activate pyenv_rasa" below
    (If virtualenv is not installed, run pip3 install --user virtualenv after conda deactivate.)
  2. Install Rasa
    pip3 install rasa==2.2.8 rasa-sdk==2.2.0 tensorflow==2.3.4 protobuf==3.14.0
  3. Clone repository (if not done yet)
    git clone https://github.com/SmartDataAnalytics/codeCAI.git
    cd codeCAI
  4. Install nl2code model
    pip3 install -e nl2codemodel/
  5. Install missing and incorrectly versioned dependencies
    pip3 install sentencepiece==0.1.95 torchmetrics==0.5.1
  6. Download Rasa model
    mkdir -p rasa/models
    wget -P rasa/models --content-disposition 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=20210726-155823_kmeans.tar.gz'
  7. Download NL2Code vocabulary, grammar graph and checkpoint
    mkdir -p nl2codemodel/models
    wget -P nl2codemodel/models --content-disposition 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_vocab_src.model' 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=usecase-nd_grammargraph.gpickle' 'https://cloudstore.zih.tu-dresden.de/index.php/s/c3PcAjpPX6AeZaF/download?path=%2F&files=last.ckpt' 
  8. Test Rasa installation
    cd rasa
    export NL2CODE_CONF=$PWD/nl2code.yaml
    rasa run actions -vv
    In a separate terminal (under rasa working directory):
    conda activate pyenv_rasa
    rasa shell -m models/20210726-155823_kmeans.tar.gz
    You can use the examples in the rasa/examples directory, e.g. k_means_clustering.csv (or .json), to converse with the assistant.

(back to top)

Jupyter Plugin

  1. Create and activate Python environment using conda
    cd codecai_jupyter_nli
    conda env create -f environment.yml
    conda activate codecai_jupyter_nli
  2. Install dependencies
    pip3 install -e .
  3. Build the extension
    jlpm run build
    (or jlpm run watch for development, to automatically rebuild on changes)

Usage

codeCAI Rasa Backend

  1. Activate Python environment
    conda activate pyenv_rasa
  2. Change to the Rasa project directory
    cd rasa
  3. Run Rasa server (adjust port 8888 if it is already used by an application other than JupyterLab; a Python smoke test for the running server follows this list)
    rasa run --enable-api --debug -m models/***.tar.gz --cors ["localhost:8888"]
  4. Run Rasa action server (in a separate console window, in the rasa directory)
    conda activate pyenv_rasa
    cd rasa
    export NL2CODE_CONF=$PWD/nl2code.yaml
    rasa run actions -vv
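To check that the servers are up, you can send a test message from Python instead of going through the Jupyter plugin. This is a minimal sketch, assuming Rasa's built-in REST channel is enabled and the server listens on the default port 5005:

```python
# Minimal smoke test for the running Rasa server. Assumes the default
# REST channel is enabled and the server listens on port 5005; adjust
# the URL and sender id as needed.
import requests

resp = requests.post(
    "http://localhost:5005/webhooks/rest/webhook",
    json={"sender": "smoke_test", "message": "Hi"},
)
resp.raise_for_status()
for reply in resp.json():  # one entry per bot utterance
    print(reply.get("text"))
```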

(back to top)

codeCAI Jupyter Plugin

  1. Activate Python environment using conda
    conda activate codecai_jupyter_nli
  2. Change to the Jupyter base directory (creating it if necessary)
    mkdir -p ~/jupyter_root
    cd ~/jupyter_root
  3. Run JupyterLab
    jupyter lab
  4. Open dialog assistant
    • Open (or create) a Jupyter notebook

    • Press Ctrl+Shift+C to open the Command Palette

      (or select View → Activate Command Palette)

    • Type "nli"

    • Select "Show codeCAI NLI"

  5. Use dialog assistant
    • Type an instruction (e.g. "Hi" or "Which ML methods do you know?") in the text field at the bottom of the "codeCAI NLI" side panel on the right. Working examples can be found in the rasa/examples directory, e.g. k_means_clustering.csv (or .json).
    • Press Return or click the "Send" button

Further Material

  1. In the ScaDS.AI Living Lab lecture, we presented an overview of state-of-the-art language models for program synthesis, introduced some basic characteristics of these models, and discussed several of their limitations. One possible research direction that could help alleviate these limitations is the inclusion of structural knowledge, an attempt we have made in this work and briefly introduce in the lecture:

    Language Models for Code Generation - ScaDS.AI Living Lab Lecture

  2. codeCAI Poster - Generating Code from Natural Language

  3. codeCAI Paper (OpenReview) - Transformer with Tree-order Encoding for Neural Program Generation

(back to top)

License

Distributed under the MIT License. See LICENSE.txt for more information.

(back to top)

Acknowledgments

This work was supported by the German Federal Ministry of Education and Research (BMBF, 01IS18026A-D) by funding the competence center for Big Data and AI "ScaDS.AI Dresden/Leipzig". The authors gratefully acknowledge the GWK support for funding this project by providing computing time through the Center for Information Services and HPC (ZIH) at TU Dresden.

(back to top)