cosmosage

Introduction

Large language models are emerging as powerful tools for many natural language tasks. Very large parameter counts of 1e11 or more are needed for a general-purpose model such as GPT-4. However, even small models can be extremely powerful if the application is sufficiently narrow.

cosmosage is an attempt to fine-tune a relatively modest large language model on cosmology-specific datasets with the goal of making a general-purpose natural-language assistant for cosmologists.

Author

Tijmen de Haan [email protected]

I gave a colloquium on cosmosage, which you can watch here: https://www.youtube.com/watch?v=azwfG2UTNEY

I also wrote an article on cosmosage, which is available as a preprint at https://arxiv.org/abs/2407.04420

Project Structure

A walkthrough of the project is given in IPython Notebook format in cosmosage.ipynb. The notebook covers the multi-step process of fine-tuning the language model on cosmology-specific datasets: data collection, preprocessing, model training, and evaluation.
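To make the preprocessing and evaluation steps more concrete, here is a minimal sketch of cleaning raw text into JSON Lines records and holding out part of them for evaluation. It is illustrative only and is not taken from cosmosage.ipynb: the function names, the {"text": ...} record layout, and the 10% hold-out default are all assumptions.

```python
import json
import random


def clean_records(texts):
    """Strip whitespace, drop empty strings, and remove exact duplicates (order kept)."""
    seen = set()
    records = []
    for text in texts:
        text = text.strip()
        if text and text not in seen:
            seen.add(text)
            # Assumed record layout: one "text" field per training example.
            records.append({"text": text})
    return records


def split_train_eval(records, eval_fraction=0.1, seed=0):
    """Shuffle a copy of the records and hold out a fraction for evaluation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Hold out at least one record whenever there is any data at all.
    n_eval = max(1, int(len(shuffled) * eval_fraction)) if shuffled else 0
    return shuffled[n_eval:], shuffled[:n_eval]


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

A fixed seed keeps the split reproducible between runs, so evaluation numbers from different training runs are computed on the same held-out records.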

Syntax, Code Style, Tools Used

The .py files are kept consistently formatted with black on its default settings.
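If you want to confirm that your changes still match this formatting before contributing, something like the following sketch may help. It assumes black is installed in the current Python environment; the helper names are hypothetical and are not part of the repository.

```python
import subprocess
import sys


def black_check_command(paths):
    """Build a command that asks black whether `paths` are already formatted."""
    # --check makes black report files it would change without rewriting them.
    return [sys.executable, "-m", "black", "--check", *paths]


def is_black_clean(paths):
    """Return True if black (default settings) would leave every file unchanged."""
    result = subprocess.run(black_check_command(paths), capture_output=True, text=True)
    # A nonzero exit code means reformatting is needed, black failed,
    # or black is not installed in this environment.
    return result.returncode == 0
```

For example, is_black_clean(["fine_tune.py"]) checks a single file, and is_black_clean(["."]) checks the whole working directory.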

The codebase was written with the help of Pylance, GitHub Copilot, GPT-4, and Cursor, a fork of VS Code.

Usage

To get started with training cosmosage:

Ensure you have Jupyter Notebook and the required dependencies installed.
Open cosmosage.ipynb and follow its steps for a guide to training and using the model.
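Before opening the notebook, a quick check like the sketch below can tell you which packages are missing from your environment. This README does not list the dependencies, so the default only checks for the notebook package (Jupyter Notebook); extend the list with whatever cosmosage.ipynb imports. The function name is hypothetical.

```python
import importlib.util


def missing_packages(names=("notebook",)):
    """Return the top-level package names that cannot be found in this environment."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]
```

An empty list means every named package can be imported. Pass top-level names only, for example "transformers" rather than a dotted submodule path.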

If you'd like to run a version of cosmosage that I've trained, head over to https://huggingface.co/Tijmen2, where I have posted, and will continue to post, trained versions.
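One possible way to try a posted model is through the Hugging Face transformers library, as in the sketch below. It assumes transformers and a backend such as PyTorch are installed and that the chosen model fits in your memory. The helper names are hypothetical, and the model name is left as an argument: take it from the page linked above rather than from this example.

```python
def repo_id(model_name, user="Tijmen2"):
    """Build a Hugging Face Hub repository ID of the form 'user/model_name'."""
    return f"{user}/{model_name}"


def ask(model_name, question, max_new_tokens=256):
    """Download the named model and generate a continuation of `question`."""
    # Imported lazily so this file can be loaded without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=repo_id(model_name))
    outputs = generator(question, max_new_tokens=max_new_tokens)
    # The text-generation pipeline returns a list of dicts, one per output.
    return outputs[0]["generated_text"]
```

Calling ask with a model name from that page and a question string downloads the weights on first use. The sketch sends the question as a raw prompt; if the model card specifies a prompt or chat format, following it will likely give better answers.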

Contributing

If you'd like to get involved, please contact me at [email protected]

License

This project is licensed under the MIT License - see the LICENSE file for details.

Repository Files

The repository history contains 58 commits. Files at the top level:

.gitattributes
.gitignore
HISTORY.md
LICENSE
README.md
TODO.md
analyze_asl_dict.py
clean_jsonl.py
clean_mmd.py
cosmosage.ipynb
eval_corpus.py
extract_textbooks.py
fine_tune.py
fine_tune_lora.py
generate_faiss_index_standalone.py
generate_summaries_standalone.py
generate_synth_standalone.py
grade_qa.py
plot_tf_log.py
quant_autogptq.py
quantize_cosmosage.sh
rag_create_index.ipynb
rag_inference.ipynb
scrape_arxiv.py
start_vllm.sh
tex_to_json.py
