✂️ CLIPPER: Compression enables long-context synthetic data generation

arXiv · Dataset · Models

This repository hosts the code for our paper, CLIPPER: Compression enables long-context synthetic data generation.

Pipeline Overview

✂️ CLIPPER is a compression-based approach to generating instruction-following data. CLIPPER works by compressing long-form documents (e.g., books) into smaller, information-rich representations (e.g., chapter outlines), which are then used to create grounded instructions for tasks like narrative claim verification.
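A minimal sketch of this two-stage idea is shown below. It is illustrative only, not the repository's actual pipeline: call_llm is a hypothetical stand-in for whichever LLM backend is used, the prompts are paraphrased, and the real prompts and pipeline code live in prompts/ and scripts/clipper.

    # Illustrative sketch of the CLIPPER idea; not the repository's actual pipeline.
    # `call_llm` is a hypothetical helper standing in for an LLM backend.
    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # replace with your preferred LLM client

    def compress_book(chapters: list[str]) -> list[str]:
        # Stage 1: compress each chapter into a short, information-rich outline.
        return [call_llm(f"Write a concise outline of this chapter:\n\n{ch}") for ch in chapters]

    def generate_claims(outlines: list[str]) -> list[dict]:
        # Stage 2: generate grounded TRUE/FALSE claims from the compressed outlines
        # (rather than the full book) for narrative claim verification.
        book_outline = "\n\n".join(outlines)
        return [
            {"claim": call_llm(f"Write a claim that is TRUE given this outline:\n\n{book_outline}"), "label": "TRUE"},
            {"claim": call_llm(f"Write a claim that is FALSE given this outline:\n\n{book_outline}"), "label": "FALSE"},
        ]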

📣 Updates

  • [2025-02-19]: Datasets and models for CLIPPER are now available as a Hugging Face collection: link.

📦 Using CLIPPER

Getting Started

  1. Install the requirements for CLIPPER:
    conda create -n clipper python=3.10 
    conda activate clipper
    pip install -r requirements.txt
    python -m pip install flash-attn --no-build-isolation
    huggingface-cli login       # Log in to Hugging Face using your access token 
    sudo apt-get install git-lfs
    
  2. Set up the Hugging Face cache directory:
    • Open your shell configuration file, typically ~/.bashrc or ~/.bash_profile for Bash, or ~/.zshrc for Zsh.
    • Add the Hugging Face cache directory path to it: export HF_HOME=/path/to/huggingface_cache.
    • Add your Hugging Face access token: export HF_TOKEN=<your_token>.
    • Save and close the file, then source it to apply the changes: source ~/.bashrc, source ~/.bash_profile, or source ~/.zshrc.
    • Double-check that the variables are set correctly: echo $HF_HOME. An optional Python check is sketched below.
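If you prefer to verify the setup from Python, the small sketch below checks that the cache path and token are picked up; it assumes HF_HOME and HF_TOKEN are set as above (or that you have already run huggingface-cli login).

    # Optional sanity check; assumes HF_HOME/HF_TOKEN are set as described above.
    import os
    from huggingface_hub import whoami

    print(os.environ.get("HF_HOME"))   # should print your cache directory
    print(whoami()["name"])            # raises an error if no valid token is found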

Project Structure

.
├── README.md
├── assets
├── data
│   ├── books
│   ├── outputs
│   └── wp
├── prompts
└── scripts
    ├── clipper
    ├── eval
    └── wp
  • data contains all books as well as the output chapter outlines and summaries used in the paper.
    • books contains all books used in the paper. Each subdirectory holds the segmented chapters of one book; the corresponding full book is also available in the books directory.
      • gutenberg.csv contains the metadata for the Gutenberg books used in the paper.
      • New books should be cleaned and split into chapters before being added.
    • outputs contains all output chapter outlines and summaries used in the paper. Each subdirectory holds the outputs for one book, including claims, summaries, and chapter outlines.
    • wp contains the cleaned raw WritingPrompts data and the corresponding generated claims.
  • scripts contains the code to construct data with CLIPPER:
    • eval/inference.py runs inference with the fine-tuned models on the test set (a rough stand-alone illustration is sketched after this list).
    • pipeline_*.sh are bash scripts that run the entire data construction pipeline.
  • prompts contains all prompts used in the paper.
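For a rough sense of what inference looks like, the sketch below scores a single claim against a full book with plain Hugging Face transformers. It is illustrative only: the model id, file path, and prompt format are assumptions, and the actual evaluation logic lives in scripts/eval/inference.py.

    # Illustrative only; the model id, path, and prompt format are assumptions.
    # See scripts/eval/inference.py for the actual evaluation code.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "<org>/<clipper-finetuned-model>"   # placeholder; use an id from the HF collection
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    book_text = open("data/books/<book>/full_book.txt").read()   # placeholder path
    prompt = f"{book_text}\n\nClaim: The narrator never leaves the city.\nIs this claim true or false?"

    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=16)
    print(tokenizer.decode(output[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True))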

Datasets & Models
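The released datasets and models are gathered in the Hugging Face collection linked in the Updates section above. A minimal loading sketch, with a placeholder repository id (substitute the real ids from the collection):

    # Placeholder repository id; substitute the real one from the HF collection.
    from datasets import load_dataset

    ds = load_dataset("<org>/<clipper-dataset>")
    print(ds)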

Finetuning

Evaluation

📜 Citation

@misc{pham2025clippercompressionenableslongcontext,
      title={CLIPPER: Compression enables long-context synthetic data generation}, 
      author={Chau Minh Pham and Yapei Chang and Mohit Iyyer},
      year={2025},
      eprint={2502.14854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.14854}, 
}
