# Real-time Amazon bot detection with BERT and Deephaven

This example uses Deephaven to perform real-time predictions of whether an Amazon review was generated by ChatGPT. The data comes from the Amazon Reviews Dataset, collected by Julian McAuley's lab and hosted on Hugging Face.

The model used for bot prediction comes from Vidhi Kishor Waghela's entry in a Kaggle competition on detecting ChatGPT-generated text. The detector's training data, training script, and resulting PyTorch model are stored in the detector directory.

This Deephaven example can be run in Jupyter using Deephaven's Python package, or inside a Docker container. We've provided scripts, notebooks, and instructions for both paths, so pick the one that feels most comfortable to you.

## Git LFS

The trained PyTorch model used in this project is stored with Git LFS. To access the model, you need to install Git LFS and enable it for this repository.

1. Install Git LFS by following the instructions here.

2. Configure Git LFS for this repo and use it to pull the PyTorch model:

   ```sh
   git lfs install
   git lfs fetch
   git lfs pull
   ```

Now that the PyTorch model is available, continue to the Jupyter or Docker section to start working with this example.
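If you want to verify that the pull actually replaced the LFS pointer with real model weights, an un-pulled LFS file is just a tiny text pointer. Here's a quick, hypothetical Python check; the model's filename and extension under detector/ are assumptions:

```python
# Hypothetical sanity check that the LFS pull worked. An un-pulled LFS file
# is a small text pointer that starts with "version https://git-lfs".
from pathlib import Path

model_path = next(Path("detector").glob("*.pt"), None)  # filename/extension assumed
if model_path is None:
    print("No .pt file found under detector/")
elif model_path.read_bytes().startswith(b"version https://git-lfs"):
    print(f"{model_path} is still an LFS pointer; run `git lfs pull`.")
else:
    print(f"{model_path} looks like real model data ({model_path.stat().st_size:,} bytes).")
```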

## Jupyter

Deephaven's Python package requires Java 17 or higher to be installed on your machine. See this page for OS-specific instructions on installing Java.
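If you're not sure which Java version is on your PATH, running `java -version` in a terminal will tell you. Here's a small, hypothetical Python equivalent of that check, just for convenience:

```python
# Hypothetical helper to confirm Java 17+ is on your PATH before installing
# Deephaven's Python package; equivalent to running `java -version` yourself.
import re
import shutil
import subprocess

java = shutil.which("java")
if java is None:
    print("Java not found on PATH; install JDK 17 or newer.")
else:
    # `java -version` writes to stderr, e.g. 'openjdk version "17.0.2" ...'
    out = subprocess.run([java, "-version"], capture_output=True, text=True).stderr
    m = re.search(r'version "(\d+)', out)
    major = int(m.group(1)) if m else 0
    print("OK" if major >= 17 else f"Found Java {major}; Deephaven needs 17 or higher.")
```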

### Set up the environment

1. Navigate to the jupyter subdirectory:

   ```sh
   cd jupyter
   ```

2. Execute a script to set up the environment:

   ```sh
   chmod +x create-venv.sh
   ./create-venv.sh
   ```

   This creates a Python virtual environment called dh-amazon-venv and installs all of the required Python packages into that environment (a rough sketch of what such a script does appears after these steps).

3. Activate the environment and start Jupyter:

   ```sh
   source dh-amazon-venv/bin/activate
   jupyter notebook
   ```

Once you've started Jupyter, you're ready to go!
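For the curious, here's a rough Python equivalent of what a setup script like create-venv.sh typically does. The exact package list is an assumption, so defer to the script itself:

```python
# Rough sketch of a venv-creation script, using Python's built-in venv module.
# The package list below is an assumption; the repo's actual script may differ.
import subprocess
import venv

venv.create("dh-amazon-venv", with_pip=True)  # create the virtual environment
subprocess.run(
    ["dh-amazon-venv/bin/pip", "install",
     "deephaven-server", "torch", "transformers", "jupyter"],  # assumed requirements
    check=True,
)
```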

### Download the data

This step only needs to be done once, and can take quite a while depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on an 8-core M2 MacBook Pro.

1. Open the download_data.ipynb notebook and select the dh-amazon-venv kernel.

2. Set the NUM_PROC variable at the top of the second cell to the number of processors available to you. This has a significant impact on the download speed.

3. Run the whole notebook. This will download the Amazon data, filter it for 2023, and write it to the amazon-data directory in Parquet format (see the sketch after these steps).
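For context, here's a hypothetical sketch of what that download-filter-write workflow can look like with the Hugging Face datasets library. The dataset config, column names, and output path are assumptions; the notebook's actual code may differ:

```python
# Hypothetical sketch of the download step, assuming the Hugging Face datasets
# library and the McAuley-Lab Amazon Reviews dataset; details may differ from
# the notebook.
from pathlib import Path

from datasets import load_dataset

NUM_PROC = 8  # set to the number of processors available on your machine

reviews = load_dataset(
    "McAuley-Lab/Amazon-Reviews-2023",  # dataset name is an assumption
    "raw_review_All_Beauty",            # one product category, for illustration
    split="full",
    num_proc=NUM_PROC,
    trust_remote_code=True,
)

# Keep only reviews from 2023 (timestamps are in milliseconds since the epoch).
start_2023 = 1_672_531_200_000  # 2023-01-01 UTC
end_2023 = 1_704_067_200_000    # 2024-01-01 UTC
reviews = reviews.filter(
    lambda r: start_2023 <= r["timestamp"] < end_2023, num_proc=NUM_PROC
)

Path("amazon-data").mkdir(exist_ok=True)
reviews.to_parquet("amazon-data/reviews.parquet")  # output path is an assumption
```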

### Run the example

Finally, open the detect_bots.ipynb notebook and select the dh-amazon-venv kernel. This notebook walks you through the whole example and gives you the opportunity to play with Deephaven. We hope you learn something new!
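The heart of the example is the BERT detector itself. For orientation, here's a hypothetical sketch of how scoring a single review might look, assuming a transformers-style sequence classifier; the notebook's actual loading code, file paths, saved-model format, and label ordering may all differ:

```python
# Hypothetical sketch of scoring one review with the BERT detector.
# Assumes the saved artifact is a state dict for a two-class
# BertForSequenceClassification; the repo's actual format may differ.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.load_state_dict(torch.load("detector/model.pt", map_location="cpu"))  # path assumed
model.eval()

def chatgpt_probability(text: str) -> float:
    """Return the model's probability that `text` was generated by ChatGPT."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Index 1 assumes label 1 is the "ChatGPT-generated" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()

print(chatgpt_probability("This product exceeded my expectations in every way!"))
```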

## Docker

To run this example with Docker, you must have Docker installed on your machine. See this guide for OS-specific instructions.

### Start the Deephaven server

1. Navigate to the docker subdirectory:

   ```sh
   cd docker
   ```

2. Build and run the Docker image with Docker Compose:

   ```sh
   docker compose up
   ```

3. Once the image is built, navigate to the Deephaven IDE at http://localhost:10000/ide/.

The Deephaven IDE contains all of the scripts associated with this example. Let's get started!

### Download the data

This step only needs to be done once, and can take quite a while depending on the speed of your internet connection and the processing power of your machine. It took about 20 minutes on an 8-core M2 MacBook Pro. You may need to allocate more resources to the Docker engine to access the full capabilities of your machine; this can be done in Docker Desktop. See this guide for more details.

1. In the right-hand sidebar, open the download_data.py script.

2. Set the NUM_PROC variable on line 8 to the number of processors available to you. This has a significant impact on the download speed.

3. Run the script using the "play" button at the top of the screen. This will download the Amazon data, filter it for 2023, and write it to the amazon-data directory in Parquet format.

### Run the example

Once you've downloaded the data, you're ready to start working with the example. The code is divided between two scripts, stream_data.py and detect_bots.py. Running detect_bots.py also executes stream_data.py, so you can start with detect_bots.py directly if you'd like. A rough sketch of the streaming step appears below. We hope you enjoy this example!
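As a hint at how the streaming half might work, here's a hypothetical sketch using Deephaven's table replayer, which turns a static historical table into a ticking one. The file path, column name, time window, and even the choice of replay mechanism are assumptions; stream_data.py may do something different:

```python
# Hypothetical sketch of replaying the static Parquet table as a live stream
# with Deephaven's TableReplayer (assumes a recent Deephaven version).
# Paths, column names, and the time window below are all assumptions.
from deephaven import parquet
from deephaven.replay import TableReplayer
from deephaven.time import to_j_instant

static_reviews = parquet.read("/data/amazon-data/reviews.parquet")  # path assumed

# Replay one hour of 2023 review traffic, keyed on an assumed Timestamp column.
start = to_j_instant("2023-01-01T00:00:00 ET")
end = to_j_instant("2023-01-01T01:00:00 ET")
replayer = TableReplayer(start, end)
live_reviews = replayer.add_table(static_reviews, "Timestamp")
replayer.start()
```

Replaying a static table this way is one idiomatic method for simulating a live feed in Deephaven: downstream tables built from live_reviews update incrementally as rows tick in, which is what makes real-time bot scoring possible.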