
liuh886/open-phrasebank


Open Phrasebank

Building your own phrasebank. ✨


This repository provides open phrase banks: collections of frequently used phrases that can feed, for example, the auto-complete function of an IDE. (Note: this library does not provide an IDE or an auto-complete function; it offers ready-to-use phrasebanks.)
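As an illustration of how an editor plugin might consume a phrasebank, here is a minimal sketch of prefix completion over a sorted phrase list. It is not part of this library: `complete` is a hypothetical helper, and the sample phrases are made up for the example (a real phrasebank file has one phrase per line).

```python
import bisect

def complete(prefix, sorted_phrases, limit=5):
    """Return up to `limit` phrases that start with `prefix` (case-sensitive)."""
    # In a sorted list, all phrases sharing a prefix sit next to each other,
    # so a binary search finds where they begin.
    start = bisect.bisect_left(sorted_phrases, prefix)
    matches = []
    for phrase in sorted_phrases[start:]:
        if not phrase.startswith(prefix) or len(matches) >= limit:
            break
        matches.append(phrase)
    return matches

# Hypothetical sample entries, sorted once up front.
phrases = sorted([
    "in addition to",
    "in contrast to",
    "in terms of",
    "it is important to",
    "on the other hand",
])
print(complete("in ", phrases))
```

Sorting once and using binary search keeps each lookup cheap even for phrasebanks with thousands of lines.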

The repository also includes tools for building a phrase bank from your own text or from an open corpus.

Why Use a Phrase Bank

Boosting Typing Experience with Phrasebank 🚀

Academic Writing 🕵️‍♀

You can further customize the phrasebank to your needs: for a particular discipline, a particular style (descriptive, analytical, persuasive, or critical), or a particular section (abstract, body text), as long as you can find good source material.

Open Phrasebanks

Academic Phrasebank

Elsevier OA CC-BY contains 40k articles from Elsevier's journals, covering disciplines from Arts and Business to STEM and Social Sciences [1].

| No. | Phrasebank | Source | N-gram length | Lines | Comments |
|---|---|---|---|---|---|
| 1 | 📍academic_phrasebank | Book: Academic Phrasebank 2014 | 2-5 | 2,190 | Extracted from PDF (Zhihao, 2024) |
| 2 | 📍elsevier_phrasebank | Corpus: Elsevier OA CC-BY 2020 | 2-6 | 3,792 | Extracted by n-gram frequency (Zhihao, 2024) |
| 3 | 📍bawe_1000.csv | Corpus: British Academic Written English | 4-6 | 1,000 | The corpus is not openly accessible, so only the 1,000 most frequent entries are listed here (Zhihao, 2024) |
| 4 | 📍academic_word_list | Academic Word List, Coxhead (2000) | 1 | 570 | The 570 academic English words (excluding the 2,000 most frequent English words) |
| 5 | 📍elsevier_awl | Nos. 2 and 4 | 2-6 | 994 | Elsevier phrasebank entries that contain AWL words (Zhihao, 2024) |
| 6 | 📍elsevier_ENVI_EART | No. 2 | 2-7 | 3,700 | Environment & Earth Science collection, 3,700 phrases (Zhihao, 2024) |
| 7 | 📍elsevier_PSYC_SOCI | No. 2 | 2-7 | 3,700 | Social Science & Psychology collection, 3,700 phrases (Zhihao, 2024) |
| 8 | 📍elsevier_MEDI | No. 2 | 2-7 | 3,700 | Medicine collection, 3,700 phrases (Zhihao, 2024) |
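The elsevier_awl entry combines a phrasebank with the Academic Word List. One plausible way to derive such a list, shown here as a sketch rather than the procedure actually used in this repository, is to keep only the phrases that contain at least one AWL word. `contains_awl_word` is a hypothetical helper, and both sample lists are illustrative stand-ins for the real files.

```python
def contains_awl_word(phrase, awl_words):
    """True if any whitespace-separated word of `phrase` is in `awl_words`."""
    return any(word in awl_words for word in phrase.lower().split())

# Illustrative samples; in practice these would be read from the
# phrasebank file and the word list file, one entry per line.
awl_words = {"analysis", "significant", "data"}
phrases = [
    "the analysis of",
    "on the other hand",
    "a significant increase in",
]

awl_phrases = [p for p in phrases if contains_awl_word(p, awl_words)]
print(awl_phrases)
```

Storing the word list in a set keeps each membership check constant-time, which matters once the phrasebank has thousands of lines.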

English Frequent Phrasebank

| No. | Phrasebank | Source | N-gram length | Lines | Comments |
|---|---|---|---|---|---|
| 1 | 📍google-10000-english | Google Books Corpus | 1 | 10,000 | The 10,000 most common English words from Google Books Corpus |
| 2 | 📍Wordlist 1200.txt | Internet | 1 | 2,000 | The 2,000 most common English words |

Other Phrasebank

| No. | Phrasebank | Source | N-gram length | Lines | Comments |
|---|---|---|---|---|---|
| 1 | 📍emoji | | 1 | 745 | (Zhihao, 2024) |

Quickstart

You can download a ready-made phrasebank from the tables above. To build a custom one, install the package and follow the steps below.

pip install openphrasebank

Get a Self-defined Phrasebank in 3 Steps

Below is an example based on n-gram frequency. More examples, such as extracting phrases from a PDF, are available in the documentation.

1️⃣ Load and Tokenize the Data

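import openphrasebank as opb

# Load and tokenize the selected subject areas and text fields of the
# dataset, and save the tokens to a cache file.
tokens_gen = opb.load_and_tokenize_data(dataset_name="orieg/elsevier-oa-cc-by",
                                        subject_areas=['PSYC', 'SOCI'],
                                        keys=['title', 'abstract', 'body_text'],
                                        save_cache=True,
                                        cache_file='temp_tokens.json')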

2️⃣ Generate N-grams

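# N-gram lengths to count
n_values = [1, 2, 3, 4, 5, 6, 7, 8]
# Keep the result: step 3 reads it as ngram_freqs[n]
ngram_freqs = opb.generate_multiple_ngrams(tokens_gen, n_values)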
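To make the idea behind this step concrete, here is a minimal standard-library sketch of n-gram frequency counting. It is not the implementation of `opb.generate_multiple_ngrams`; `count_ngrams` is a hypothetical helper and the token list is a toy example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens, joined with spaces."""
    # zip over n shifted copies of the token list yields each n-token window.
    windows = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(window) for window in windows)

tokens = "on the other hand it is on the other side".split()
bigram_freqs = count_ngrams(tokens, 2)
trigram_freqs = count_ngrams(tokens, 3)

print(bigram_freqs["on the"])         # how often this bigram occurs
print(trigram_freqs["on the other"])  # how often this trigram occurs
```

A list of n tokens produces n - k + 1 windows of length k, so longer n-grams are both fewer and rarer, which is why the filtering step uses smaller limits for larger n.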

3️⃣ Filter and save

# Define the top limits for each n-gram length
top_limits = {1: 2000, 2: 2000, 3: 1000, 4: 300, 5: 200, 6: 200, 7: 200, 8: 200}

# Filter the frequent n-grams and store the results in a dictionary
phrases = {}
freqs = {}
for n, limit in top_limits.items():
    phrases[n], freqs[n] = opb.filter_frequent_ngrams(ngram_freqs[n], limit, min_freq=20)

# Combine and sort the phrases from n-gram lengths 2 to 6
sorted_phrases = sorted(sum((phrases[n] for n in range(2, 7)), []))

# Write the sorted phrases to a text file, one phrase per line
with open('../elsevier_phrasebank_PSYC_SOCI.txt', 'w', encoding='utf-8') as file:
    for line in sorted_phrases:
        file.write(line + '\n')
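The filtering in this step keeps at most a fixed number of n-grams per length and drops any that occur too rarely. The sketch below shows that logic with the standard library only. It is an assumption about the behavior, not the source of `opb.filter_frequent_ngrams`; `top_ngrams` is a hypothetical helper and the counts are invented.

```python
from collections import Counter

def top_ngrams(freqs, limit, min_freq=1):
    """Return (phrases, counts) for up to `limit` n-grams with count >= min_freq."""
    # most_common sorts by descending count, so truncating first and then
    # applying the frequency floor gives the same result as the reverse order.
    kept = [(gram, count) for gram, count in freqs.most_common(limit)
            if count >= min_freq]
    return [gram for gram, _ in kept], [count for _, count in kept]

# Invented counts for illustration.
freqs = Counter({
    "on the other hand": 42,
    "it is important to": 30,
    "as a result of": 25,
    "in the case of": 3,
})

phrases_4, counts_4 = top_ngrams(freqs, limit=4, min_freq=20)
print(phrases_4)
print(counts_4)
```

Here the limit of 4 would admit every entry, but the frequency floor of 20 removes the rarest one, which mirrors how `min_freq=20` guards against noisy long n-grams.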

How to Contribute

You can contribute either a phrasebank or code. See the contributing guide for details.

Known Issues

| Phrasebank | Issues |
|---|---|
| academic_phrasebank | Tables in the PDF file were not handled properly, so many sentences were not extracted correctly. (Zhihao) |
| elsevier_phrasebank | |


Footnotes

  1. Over 20 disciplines. See orieg/elsevier-oa-cc-by on Hugging Face Datasets.