Merge pull request #1530 from UUDigitalHumanitieslab/feature/settings-documentation

Feature/settings documentation
CentreForDigitalHumanities · Apr 18, 2024 · c234b81
2 parents 191ec50 + 839b5ad
commit c234b81
Showing 5 changed files with 148 additions and 35 deletions.
diff --git a/backend/ianalyzer/settings.py b/backend/ianalyzer/settings.py
@@ -65,12 +65,9 @@
     'default': {
         'host': os.getenv('ES_HOST', 'localhost'),
         'port': 9200,
-        'username': '',
-        'password': '',
         'chunk_size': 900,  # Maximum number of documents sent during ES bulk operation
         'max_chunk_bytes': 1*1024*1024,  # Maximum size of ES chunk during bulk operation
         'bulk_timeout': '60s',  # Timeout of ES bulk operation
-        'overview_query_size': 20,  # Number of results to appear in the overview query
         'scroll_timeout': '3m',  # Time before scroll results time out
         'scroll_page_size': 5000,  # Number of results per scroll page
         'index_prefix': 'ianalyzer'  # Prefix applied to index names created on this server
@@ -82,7 +79,6 @@
 CORPORA = {}
 
 WORDCLOUD_LIMIT = 1000
-DIRECT_DOWNLOAD_LIMIT = 1000
 
 # Celery configuration
 CELERY_BROKER_URL = os.getenv('CELERY_BROKER', 'redis://')
@@ -95,15 +91,6 @@
 # for main content fields, needed for the new highlighter
 NEW_HIGHLIGHT_CORPORA = []
 
-# This needs to be the last line of the settings.py, so that all settings can be overridden.
-try:
-    from ianalyzer.settings_local import *
-except ImportError as e:
-    warnings.warn(
-        'No local settings file - configure your environment in backend/ianalyzer/settings_local.py',
-        Warning
-    )
-
 DEFAULT_FROM_EMAIL = '[email protected]'
 
 LOGGING = {
@@ -133,3 +120,12 @@
 }
 
 MEDIA_ROOT = 'data'
+
+# This needs to be the last line of the settings.py, so that all settings can be overridden.
+try:
+    from ianalyzer.settings_local import *
+except ImportError as e:
+    warnings.warn(
+        'No local settings file - configure your environment in backend/ianalyzer/settings_local.py',
+        Warning
+    )
diff --git a/documentation/Django-project-settings.md b/documentation/Django-project-settings.md
@@ -0,0 +1,131 @@
+# Django project settings
+
+This file describes how to configure project settings in Django.
+
+## Different settings files
+
+We keep different settings files to handle different environments.
+
+`settings.py` is the default settings file in a development environment. The version in the repository is replaced in our deployment setup, so what you write here affects all development environments but not production environments. Developers can override settings in their own environment using `settings_local.py`, but `settings.py` is a good place for sensible defaults.
+
+`common_settings.py` is intended for "universal" project settings that apply in both production and development servers. It is imported by `settings.py` on both development and production.
+
+`settings_local.py` is ignored in version control, but the development configuration `settings.py` will attempt to import it. This file can be used for sensitive information, or configurations that are unique to your setup, such as which corpora you're using. You can also use this to override any existing settings.
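+
+For example, a minimal `settings_local.py` could look like the sketch below. The corpus path is the illustrative one used in the `CORPORA` section of this document, and the overridden value is arbitrary; neither is a recommended configuration.
+
+```python
+# backend/ianalyzer/settings_local.py (illustrative sketch, not a real configuration)
+
+# Corpora available in this environment; the path is a placeholder for your own checkout.
+CORPORA = {
+    'times': '/home/me/ianalyzer/backend/corpora/times/times.py',
+}
+
+# Override a default from settings.py; 500 is an arbitrary example value.
+WORDCLOUD_LIMIT = 500
+```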
+
+`settings_test.py` is used during unit tests. It imports everything configured in `settings.py`, but can add or override some settings. Note that you can also adjust settings for individual tests.
+
+### Using a different settings module
+
+Django supports using a different settings module ([more about settings in Django](https://docs.djangoproject.com/en/5.0/topics/settings/)).
+
+However, the Celery configuration depends on `settings.py`, so when you run a command like `celery -A ianalyzer worker`, Celery will use the `settings.py` in `/backend/ianalyzer`.
+
+This means that you cannot simply point to an alternative settings module if you also need a Celery worker; you should overwrite the `settings.py` file in `/backend/ianalyzer` before starting the worker.
+
+## Project settings
+
+For project settings supported by external libraries, see:
+
+- [Django settings reference](https://docs.djangoproject.com/en/5.0/ref/settings/)
+- [configuration for Django REST framework](https://www.django-rest-framework.org/api-guide/settings/)
+- [configuration for dj-rest-auth](https://dj-rest-auth.readthedocs.io/en/latest/configuration.html)
+- [configuration for djangosaml2](https://djangosaml2.readthedocs.io/contents/setup.html#configuration)
+- [configuration for Celery](https://docs.celeryq.dev/en/stable/django/first-steps-with-django.html)
+
+In addition, I-analyzer adds the following settings.
+
+### `SERVERS`
+
+Configuration of elasticsearch servers. This is a dictionary. The keys are the (internal) names you give to each server. (You can connect more than one server, though in most cases, you only need one.)
+
+Each value is itself a dictionary that specifies the server:
+
+- `'host'` and `'port'` specify the address where you access the server
+- `'chunk_size'`: Maximum number of documents sent during ES bulk operation
+- `'max_chunk_bytes'`: Maximum size of ES chunk during bulk operation
+- `'bulk_timeout'`: Timeout of ES bulk operation
+- `'scroll_timeout'`: Time before scroll results time out
+- `'scroll_page_size'`: Number of results per scroll page
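+
+As an illustration, a single-server configuration could look like the sketch below. The values are the development defaults from `settings.py`; that they sit under a `SERVERS` dictionary is assumed from the `'default'` key, so adjust everything to your own setup.
+
+```python
+import os
+
+SERVERS = {
+    # Naming the server 'default' makes it the server for all corpora.
+    'default': {
+        'host': os.getenv('ES_HOST', 'localhost'),
+        'port': 9200,
+        'chunk_size': 900,
+        'max_chunk_bytes': 1*1024*1024,
+        'bulk_timeout': '60s',
+        'scroll_timeout': '3m',
+        'scroll_page_size': 5000,
+    },
+}
+```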
+
+The following optional settings are implemented but have no documentation:
+
+- `'certs_location'`
+- `'api_key'`
+- `'api_id'`
+
+
+#### Setting a default server
+
+If you name one of the servers `'default'`, it will act as the default for all corpora. This is especially recommended if you have only one server.
+
+If you don't assign a default server this way, the server for each corpus must be configured explicitly in `CORPUS_SERVER_NAMES` (see below).
+
+### `CORPORA`
+
+A dictionary that specifies Python corpus definitions that should be imported in your project.
+
+Each key must be the name of a corpus, and its value must be the absolute path to the Python file that contains the definition. For example:
+
+```python
+CORPORA = {
+    'times': '/home/me/ianalyzer/backend/corpora/times/times.py',
+}
+```
+
+The key of the corpus must match the name of the corpus class. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). For example, `'times'` is a valid key for the `Times` class. It will usually match the filename as well, but this is not strictly necessary.
+
+### `CORPUS_SERVER_NAMES`
+
+A dictionary that specifies which elasticsearch server should be used for which corpus.
+
+Each key in the dictionary should be the name for a corpus, i.e. one of the keys in the `CORPORA` setting. Each value should be the name for an elasticsearch server, i.e. one of the keys in the `SERVERS` setting.
+
+You do not need to include corpora which use the `'default'` server.
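+
+For example, to use a non-default server for the `times` corpus (`'special_server'` is a hypothetical name and would have to be one of the keys in `SERVERS`):
+
+```python
+CORPUS_SERVER_NAMES = {
+    'times': 'special_server',
+}
+```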
+
+### `LOGO_LINK`
+
+URL of the logo of your organisation. This is used in emails sent to users.
+
+### `DEFAULT_FROM_EMAIL`
+
+The address from which emails to users should be sent.
+
+By default, a development server will use the [console backend](https://docs.djangoproject.com/en/5.0/topics/email/#console-backend) for emails, where it does not really matter what you fill in here.
+
+### `NLTK_DATA_PATH`
+
+Some functionality on I-analyzer will download the stopwords corpus from [NLTK](https://nltk.readthedocs.io/en/latest/). This setting controls the directory where data downloaded from NLTK can be stored.
+
+### `CSV_FILES_PATH`
+
+Path to the directory where prepared download files for users should be stored.
+
+### `WORDCLOUD_LIMIT`
+
+The maximum number of documents that is analysed in the wordcloud (a.k.a. "most frequent words") visualisation.
+
+### `BASE_URL`
+
+The base URL for the application. This URL can be used to generate links to the frontend in emails and citation templates.
+
+### `NEW_HIGHLIGHT_CORPORA`
+
+List of corpora that have been re-indexed, so that the top-level term vectors for main content fields include positions and offsets. This is needed for the updated highlight functionality that was introduced in version 4.2.0 of I-analyzer.
+
+The list should contain the _titles_ (not names) of updated corpora. You only need to list corpora with a Python definition; legacy highlighting is not supported for database-only corpora.
+
+### `SAML_GROUP_NAME`
+
+Optional, should be a string.
+
+If you define a `SAML_GROUP_NAME` in settings, SAML users will always be added to a group with that name when they create an account. (The group will be created if it does not exist.) This can be used to give permissions to SAML users. The group is not used to handle authentication, so you can add non-SAML users to it as well.
+
+### `DEFAULT_CORPUS_IMAGE`
+
+A path (string) to an image file.
+
+Corpora can include an image to use in the interface (e.g. in the corpus selection menu); if the corpus has no image, this one will be used instead.
+
+### Settings for individual corpora
+
+Python corpus definitions typically rely on the Django settings, to avoid hard-coding properties that depend on the server. When you include a corpus definition in your settings, read the source file to see the related settings. Some of these settings may be optional.
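+
+For instance, a corpus definition may expect the location of its source files in a setting. The name below follows the `CORPUSNAME_DATA` convention described in the guide on adding a new corpus; whether a particular corpus reads this exact setting is an assumption, so check its source file.
+
+```python
+# Hypothetical setting for a corpus named "times"; the path is a placeholder.
+TIMES_DATA = '/MyData/CorpusData'
+```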
diff --git a/documentation/First-time-setup.md b/documentation/First-time-setup.md
@@ -38,7 +38,7 @@ http.cors.enabled: true
 http.cors.allow-origin: "*"
 ```
 4. Create and activate a virtualenv for Python.
-5. Create the file `backend/ianalyzer/settings_local.py`.`ianalyzer/settings_local.py` is included in .gitignore and thus not cloned to your machine. The variable `CORPORA` specifies which corpora are available, and the path of the corpus definition file. This file is also the place to add corpus-specific configurations (like the location of source files). See instructions of adding corpora below.
+5. Create the file `backend/ianalyzer/settings_local.py`. `ianalyzer/settings_local.py` is included in .gitignore and thus not cloned to your machine. It can be used to customise your environment, and to include the corpora that are defined in the source code in your environment. See the instructions for adding corpora below.
 6. Install the requirements for both the backend and frontend:
 ```
 yarn postinstall
@@ -75,8 +75,8 @@ To include corpora on your environment, you need to index them from their source
 
 _Note:_ these instructions are for indexing a corpus that already has a corpus definition. For adding new corpus definitions, see [How to add a new corpus to I-analyzer](./documentation/How-to-add-a-new-corpus-to-Ianalyzer.md).
 
-1. Add the corpus to the `CORPORA` dictionary in your local settings file. The key should match the class name of the corpus definition. This match is not case-sensitive, and your key may include extra non-alphabetic characters (they will be ignored when matching). The value should be the absolute path the corpus definition file (e.g. `.../backend/corpora/times/times.py`).
-2. Set configurations for your corpus. Check the definition file to see which variables it expects to find in the configuration. Some of these may already be set in settings.py, but you will at least need to define the (absolute) path to your source files.
+1. Add the corpus to the `CORPORA` dictionary in your local settings file. See [CORPORA settings documentation](/documentation/Django-project-settings.md#corpora).
+2. Set configurations for your corpus. Check the definition file to see which variables it expects to find in the configuration. Some of these may be optional, but you will at least need to define the (absolute) path to your source files.
 3. Activate your python virtual environment. Create an ElasticSearch index from the source files by running, e.g., `yarn django index dutchannualreports`, for indexing the Dutch Annual Reports corpus in a development environment. See [Indexing](documentation/Indexing-corpora.md) for more information.
 
 ## Running a dev environment

diff --git a/documentation/How-to-add-a-new-corpus-to-Ianalyzer.md b/documentation/How-to-add-a-new-corpus-to-Ianalyzer.md
@@ -77,7 +77,7 @@ The default implementation of `documentation_path` will look at the following at
 
 ## Settings file
 
-The django settings can be used to configure variables that may be depend on the environment. Please use the following naming convention.
+The Django settings can be used to configure variables that may depend on the environment. Please use the following naming convention when you add settings for your corpus.
 
 ```python
 CORPUSNAME_DATA = '/MyData/CorpusData' # the directory where the xml / html or other files are located
@@ -104,26 +104,12 @@ Note that for a property like the elasticsearch index, we define a default value
 
 ### Corpus selection
 
-The dictionary `CORPORA` defines the name of the corpora and their filepath. It is defined as
+To include the new corpus in an instance of I-analyzer, the project settings must be adjusted.
 
-```python
-CORPORA = {
-    'times': '.../times.py',
-}
-```
-
-The key of the corpus must match the name of the corpus class (but lowercase/hyphenated), so `'times'` is the key for the `Times` class. Typically, the key also matches the `es_index` of the corpus, as well as its filename.
+The [CORPORA setting](/documentation/Django-project-settings.md#corpora) must be updated to include the corpus in your project.
 
-`CORPUS_SERVER_NAMES` defines to which server (defined in `SERVERS`) the backend should make requests. You only need to include corpora that do not use the `'default'` server.
-
-```python
-CORPUS_SERVER_NAMES = {
-    'times': 'special_server',
-}
-```
+Additionally, you can specify an elasticsearch server for the corpus with the [CORPUS_SERVER_NAMES setting](/documentation/Django-project-settings.md#corpus_server_names).
 
-### settings vs. settings_local
-`settings.py` imports all information in `settings_local.py`. If a variable is defined in both, `settings_local` overrules `settings`. All sensitive information (server names, user names, passwords) should be in `settings_local.py`, as this will 1) never be committed to github, and 2) be located in the `private` folder upon deployment.
 
 ## Elasticsearch
 Once the corpus definition and associated settings are added, the only remaining step is to make the Elasticsearch index. By running `yarn django index corpusname`, information is extracted and sent to Elasticsearch.

diff --git a/documentation/SAML.md b/documentation/SAML.md
@@ -10,4 +10,4 @@ The only tweaks added on top of the DjangoSaml2 package are:
 
 ### Authorisation
 
-If you define a `SAML_GROUP_NAME` in settings, SAML users will always be added to a group with that name when they create an account. (The group will be created if it does not exist.) This can be used to give permissions to SAML users. The group is not used to handle authentication, so you can add non-SAML users to it as well.
+The setting [SAML_GROUP_NAME](/documentation/Django-project-settings.md#saml_group_name) can be used to control permissions for SAML users.