Refactor (#117)
* refactored data pump including tools

* make possible to change configs from cmd
set port so it can be easily changed via cmd
updates for dev-5 import
less verbose progressbar
proper no term startup

* control anonym email by global var

* less verbose for acceptable outcomes

* less verbose, ignoring known inconsistencies

* ignoring known errors, issues and specific cases

* improve info

* less verbose
do not log specific cases
fixed add_checksums call by not parsing (empty) response content

* unify log before/after messages and add count check

* fix log

* improve logs

* better docs

* Generate JSON from DB into /input/data folder not /input/data-json because in repo_import.py it expects *.jsons in the /input/data folder.

* Add result of importing logs into files because `postgres.log` contains only the errors of the first attempt and the user does not know that the dumps were successfully imported.

* Add NOTE that all scripts must have `LF` line separator because it cannot find the `init.dspacedb5.sh` file.

* show db table count

* Fix importing of groups - one collection could have more group relations.

* For the workflowitem was used wrong endpoint.

* add @Property not get

* removed ()

* enforce line ending for specific types
update docs
reinit sql dumps rather than create (statement order references tables before creating them)

* revert fixes

* implementation of db diffs

* add local sch_id to schema dict

* Send bitstream mime type instead of id because IDs are different in CLARIN-DSpace5 and CLARIN-DSpace7

* key in dict as str, missing wf, incorrect handles

* Do not create list where it shouldn't be created because it throws an error during importing resource policies.

* add diff_all
more fixes, less logging if known cases
make consistency validation even stricter
do not make assumptions about the response content
fix .html added to the end of the url
ignoring known metadatavalue changes
ignore file listing
date.issued/approx.date normalized in v5/v7 comparison
improved versions comparison

* If statement compares str not int - add item_id into migrated_versions as string value.

* norm text

* refactor_jm_resource_policies (#121)

* Do not import resourcepolicy for deleted item/bundle.

* Update condition -> do not call private attrs.

* refactor_jm_resourcepolicies_condition (#123)

* Updated condition

* Prettify condition

* Log resource policy type

* refactor_jm_user_metadata

* refactor_jm_conflicts (#125)

* Find out newer versions of the item

* One Item history is imported

* Item previous version are in the right sequence

* Versions of the item which are replaced in another repository are fetched in a specific list.

* Importing of Item versions is working

* Importing of Item versions is working

* Uncommented item import and added handle prefix to const

* Updated checksum

* The sequences are updated

* clearly separate TESTs part of Readme

* Updated comments

* Removed empty row.

* Removed another empty row.

---------

Co-authored-by: MajoBerger <[email protected]>

---------

Co-authored-by: jm <jm@maz>
Co-authored-by: milanmajchrak <[email protected]>
Co-authored-by: Paurikova2 <[email protected]>
Co-authored-by: milanmajchrak <[email protected]>
Co-authored-by: MajoBerger <[email protected]>
6 people authored Nov 30, 2023
1 parent 90c300a commit 07a3ae6
Showing 109 changed files with 5,316 additions and 3,255 deletions.
5 changes: 5 additions & 0 deletions .gitattributes
@@ -0,0 +1,5 @@
*.sh eol=lf
*.py eol=lf
*.md eol=lf
apt-requirements.txt eol=lf
*.bat eol=crlf
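
The commit message notes that scripts must use `LF` line separators, otherwise `init.dspacedb5.sh` cannot be found. The following is a minimal sketch, not part of this commit, of how a working tree could be checked against the LF rules above. The helper names `has_crlf` and `find_crlf_files` are hypothetical, and the pattern list is copied by hand from the `.gitattributes` entries rather than parsed from the file.

```python
from pathlib import Path

# Patterns copied by hand from the .gitattributes entries that require LF.
LF_PATTERNS = ("*.sh", "*.py", "*.md", "apt-requirements.txt")


def has_crlf(data: bytes) -> bool:
    """Return True if the raw file content contains a CRLF line ending."""
    return b"\r\n" in data


def find_crlf_files(root: Path) -> list[Path]:
    """List files under `root` that match an LF pattern but contain CRLF."""
    offenders = []
    for pattern in LF_PATTERNS:
        # rglob searches recursively, like an unanchored .gitattributes pattern.
        for path in root.rglob(pattern):
            if path.is_file() and has_crlf(path.read_bytes()):
                offenders.append(path)
    return sorted(offenders)
```

Calling `find_crlf_files(Path("."))` from the repository root would return the offending paths; an empty list means every matched file already uses LF.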
23 changes: 9 additions & 14 deletions .github/workflows/test.yml
@@ -1,33 +1,28 @@
name: Test dspace on dev-5
name: build-and-test

on:
workflow_dispatch:
schedule:
# * is a special character in YAML so you have to quote this string
- cron: '0 0 * * *'
push:
branches: [ "main" ]


jobs:
test:
runs-on: dspace-bbt
runs-on: ubuntu-latest

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
- uses: actions/checkout@v3

- name: install requirements
run: pip install -r requirements.txt

- name: test
run: python3 -m unittest -v 2> output.txt

- name: report result
run: echo $? > result.txt
- name: smoketest
run: |
cd ./src
python repo_import.py --help
# multi line commands for future reference
- name: Run a multi-line script
- name: test
run: |
echo first line
echo second line
cd ./tests
python -m unittest discover ./ -v
4 changes: 4 additions & 0 deletions .gitignore
@@ -139,3 +139,7 @@ clarin-dspace-dump-8.8.23
# data folders
data/
temp-files/

__logs
input
*.bak
46 changes: 46 additions & 0 deletions README.dev.md
@@ -0,0 +1,46 @@
# How to write new tests
Check test.example package. Everything necessary should be there.

Test data are in `test/data` folder.
If your test data contains special characters like čřšáý and so on, it is recommended
to make `.stripped` variation of the file.
E.g. `my_format.json` and `my_format.stripped.json` for loading data
and `my_format.test.xml` and `my_format.test.stripped.xml` for testing.

If not on dev-5 (e.g. when run on localhost), the `.stripped` version of the files will be loaded.
The reason is that DSpace has trouble with special characters when it runs on Windows.
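
The naming convention above can be sketched as a small helper. This is an illustration only, assuming the `.stripped` marker always goes directly before the final extension, as in the two examples. The functions `stripped_variant` and `pick_test_file` and the `on_dev5` flag are hypothetical and do not come from the test package.

```python
from pathlib import Path


def stripped_variant(path: Path) -> Path:
    """Insert `.stripped` before the final extension of the file name."""
    return path.with_name(f"{path.stem}.stripped{path.suffix}")


def pick_test_file(path: Path, on_dev5: bool) -> Path:
    """Hypothetical selector: original file on dev-5, stripped copy elsewhere."""
    return path if on_dev5 else stripped_variant(path)
```

With this sketch, `my_format.json` maps to `my_format.stripped.json` and `my_format.test.xml` maps to `my_format.test.stripped.xml`, matching the examples in the text.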


## Settings
See const.py for constants used at testing.

To set up logs, navigate to support.logs.py and modify method set_up_logging.

## Run

In order to run tests, use command
`python -m unittest`

Recommended variation is
`python -m unittest -v 2> output.txt`
which leaves result in output.txt

Before running for the first time, requirements must be installed with following command
`pip install -r requirements.txt`

It is possible to run the tests in PyCharm with a configuration like this:

![image](https://user-images.githubusercontent.com/88670521/186934112-d0f828fd-a809-4ed8-bbfd-4457b734d8fd.png)


# How to re-initialize dspace 7 database

Recreate your local CLARIN-DSpace7.* database **NOTE: all data will be deleted**

- Install again the database following the official tutorial steps: https://wiki.lyrasis.org/display/DSDOC7x/Installing+DSpace#InstallingDSpace-PostgreSQL11.x,12.xor13.x(withpgcryptoinstalled)
- Or try to run these commands in the <PSQL_PATH>/bin:
> - `createdb --username=postgres --owner=dspace --encoding=UNICODE dspace` // create database
> - `psql --username=postgres dspace -c "CREATE EXTENSION pgcrypto;"` // Add pgcrypto extension
> > If it throws warning that `-c` parameter was ignored, just write a `CREATE EXTENSION pgcrypto;` command in the database cmd.
> > CREATE EXTENSION pgcrypto;
![image](https://user-images.githubusercontent.com/90026355/228528044-f6ad178c-f525-4b15-b6cc-03d8d94c8ccc.png)
154 changes: 39 additions & 115 deletions README.md
@@ -13,143 +13,67 @@ there exists automatic function that sends email, what we don't want
because we use this endpoint for importing existing data.

### Prerequisites:
- Installed CLARIN-DSpace7.*. with running database, solr, tomcat
1. Install CLARIN-DSpace7.*. (postgres, solr, dspace backend)

### Steps:
1. Clone python-api: https://github.com/dataquest-dev/dspace-python-api (branch `main`) and dpace://https://github.com/dataquest-dev/DSpace (branch `dtq-dev`)
2. Clone python-api: https://github.com/dataquest-dev/dspace-python-api (branch `main`) and https://github.com/dataquest-dev/DSpace (branch `dtq-dev`)

***
2. Get database dump (old CLARIN-DSpace) and unzip it into the `<PSQL_PATH>/bin` (or wherever you want)

***
3. Create CLARIN-DSpace5.* databases (dspace, utilities) from dump.
> // clarin-dspace database
> - `createdb --username=postgres --owner=dspace --encoding=UNICODE clarin-dspace` // create a clarin database with owner
> // Running on second try:
> - `psql -U postgres clarin-dspace < <CLARIN_DUMP_FILE_PATH>`
3. Get database dump (old CLARIN-DSpace) and unzip it into `input/dump` directory in `dspace-python-api` project.

> // clarin-utilities database
> - `createdb --username=postgres --owner=dspace --encoding=UNICODE clarin-utilities` // create the utilities database with owner
> // Running on second try:
> - `psql -U postgres clarin-utilities < <UTILITIES_DUMP_FILE_PATH>`
***
4. Recreate your local CLARIN-DSpace7.* database **NOTE: all data will be deleted**
- Install again the database following the official tutorial steps: https://wiki.lyrasis.org/display/DSDOC7x/Installing+DSpace#InstallingDSpace-PostgreSQL11.x,12.xor13.x(withpgcryptoinstalled)
- Or try to run these commands in the <PSQL_PATH>/bin:
> - `createdb --username=postgres --owner=dspace --encoding=UNICODE dspace` // create database
> - `psql --username=postgres dspace -c "CREATE EXTENSION pgcrypto;"` // Add pgcrypto extension
> > If it throws warning that `-c` parameter was ignored, just write a `CREATE EXTENSION pgcrypto;` command in the database cmd.
> > CREATE EXTENSION pgcrypto;
![image](https://user-images.githubusercontent.com/90026355/228528044-f6ad178c-f525-4b15-b6cc-03d8d94c8ccc.png)


> // Now the clarin database for DSpace7 should be created
> - Run the database by the command: `pg_ctl start -D "<PSQL_PATH>\data\"`
4. Create CLARIN-DSpace5.* databases (dspace, utilities) from dump.
Run `scripts/start.local.dspace.db.bat` or use `scripts/init.dspacedb5.sh` directly with your database.

***
5. (Your DSpace project must be installed) Go to the `dspace/bin` and run the command `dspace database migrate force` // force because of local types
5. Go to the `dspace/bin` in dspace7 installation and run the command `dspace database migrate force` (force because of local types).
**NOTE:** `dspace database migrate force` creates default database data that may not be in the database dump, so after migration some tables may contain more data than the dump. Data from the dump that already exists in the database is not migrated.

***
6. Create an admin by running the command `dspace create-administrator` in the `dspace/bin`

***
7. Prepare `dspace-python-api` project for migration
**IMPORTANT:** If `data` folder doesn't exist in the project, create it

Update `const.py`
- `user = "<ADMIN_NAME>"`
- `password = "<ADMIN_PASSWORD>"`

- `# http or https`
- `use_ssl = False`
- `host = "<YOUR_SERVER>" e.g., localhost`
- `# host = "dev-5.pc"`
- `fe_port = "<YOUR_FE_PORT>"`
- `# fe_port = ":4000"`
- `be_port = "<YOUR_BE_PORT>"`
- `# be_port = ":8080"`
- `be_location = "/server/"`
#### Database const - for copying sequences
- `CLARIN_DSPACE_NAME = "clarin-dspace"`
- `CLARIN_DSPACE_HOST = "localhost"`
- `CLARIN_DSPACE_USER = "<USERNAME>"`
- `CLARIN_DSPACE_PASSWORD = "<PASSWORD>"`
- `CLARIN_UTILITIES_NAME = "clarin-utilities"`
- `CLARIN_UTILITIES_HOST = "localhost"`
- `CLARIN_UTILITIES_USER = "<USERNAME>"`
- `CLARIN_UTILITIES_PASSWORD = "<PASSWORD>"`
- `CLARIN_DSPACE_7_NAME = "dspace"`
- `CLARIN_DSPACE_7_HOST = "localhost"`
- `CLARIN_DSPACE_7_PORT = 5432`
- `CLARIN_DSPACE_7_USER = "<USERNAME>"`
- `CLARIN_DSPACE_7_PASSWORD = "<PASSWORD>"`
#### const - for importing licenses
- `OLD_LICENSE_DEFINITION_STRING = <OLD_SITE_URL>`
- `NEW_LICENSE_DEFINITION_STRING = <NEW_SITE_URL>`

**NOTE:** Be sure, that `authorization = True`, because some of the used endpoints won't work
7. Create JSON files from the database tables.
**NOTE: You must do it for both databases `clarin-dspace` and `clarin-utilities`** (JSON files are stored in the `data` folder)
- Go to `dspace-python-api` and run
```
pip install -r requirements.txt
(optional on ubuntu like systems) apt install libpq-dev
python db_to_json.py --database=clarin-dspace
python db_to_json.py --database=clarin-utilities
```

***
8. Create JSON files from the database tables. **NOTE: You must do it for both databases `clarin-dspace` and `clarin-utilities`** (JSON files are stored in the `data` folder)
- Go to `dspace-python-api` in the cmd
- Run `pip install -r requirements.txt`
- Run `python create_jsons.py --database <DATABSE NAME> --host <HOST> --user postgres --password <PASSWORD FOR POSTGRES>` e.g., `python create_jsons.py --database clarin-dspace --host localhost --user postgres --password pass` (arguments for database connection - database, host, user, password) for the BOTH databases // NOTE there must exist data folder in the project structure
8. Prepare `dspace-python-api` project for migration

***
9. Make sure, your backend configuration (`dspace.cfg`) includes all handle prefixes from generated handle json in property `handle.additional.prefixes`,
e.g.,`handle.additional.prefixes = 11858, 11234, 11372, 11346, 20.500.12801, 20.500.12800`
- copy the files used during migration into `input/` directory:
```
> ls -R ./input
input:
data dump icon
input/data:
bitstream.json fileextension.json piwik_report.json
bitstreamformatregistry.json ...
input/dump:
clarin-dspace-8.8.23.sql clarin-utilities-8.8.23.sql
input/icon:
aca.png by.png gplv2.png mit.png ...
```

***
10. Copy `assetstore` from dspace5 to dspace7 (for bitstream import). `assetstore` is in the folder where you have installed DSpace `dspace/assetstore`.
9. Update `project_settings.py`

***
11. Create `icon/` folder if it doesn't exist in project and copy all the icons that are used into it.
10. Make sure your backend configuration (`dspace.cfg`) lists all handle prefixes from the generated handle JSON in the `handle.additional.prefixes` property,
e.g., `handle.additional.prefixes = 11858, 11234, 11372, 11346, 20.500.12801, 20.500.12800`

11. Copy `assetstore` from dspace5 to dspace7 (for bitstream import). `assetstore` is in the folder where you have installed DSpace `dspace/assetstore`.

***
12. Import data from the json files (python-api/data/*) into dspace database (CLARIN-DSpace7.*)
12. Import data from the json files (python-api/input/*) into dspace database (CLARIN-DSpace7.*)
- **NOTE:** database must be up to date (`dspace database migrate force` must be called in the `dspace/bin`)
- **NOTE:** dspace server must be running
- From the `dspace-python-api` run command `python main.data_pump.py`
- run command `cd ./src && python repo_import.py`
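
For the handle prefix step above, the value of `handle.additional.prefixes` could be derived from the generated handle JSON instead of being collected by hand. The sketch below is an assumption-heavy illustration: it assumes the handle export is a JSON array of objects with a `handle` string field of the form `prefix/suffix`, which the README does not confirm. The function names `handle_prefixes` and `as_cfg_line` are hypothetical.

```python
import json


def handle_prefixes(rows: list[dict]) -> list[str]:
    """Collect the distinct prefixes (text before the first '/') of all handles."""
    prefixes = set()
    for row in rows:
        handle = row.get("handle")  # assumed field name, not confirmed by the source
        if handle and "/" in handle:
            prefixes.add(handle.split("/", 1)[0])
    return sorted(prefixes)


def as_cfg_line(prefixes: list[str]) -> str:
    """Format the prefixes as a single dspace.cfg property line."""
    return "handle.additional.prefixes = " + ", ".join(prefixes)


def load_rows(path: str) -> list[dict]:
    """Read the exported table; the file location is passed in by the caller."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

A caller would pass the path of the exported handle table to `load_rows`, then print `as_cfg_line(handle_prefixes(rows))` and compare the result with the property in `dspace.cfg`.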

***
## !!!Migration notes:!!!
- The values of table attributes that describe the last modification time of dspace object (for example attribute `last_modified` in table `Item`) have a value that represents the time when that object was migrated and not the value from migrated database dump.
- If you don't have valid and complete data, not all data will be imported.

# How to write new tests
Check test.example package. Everything necessary should be there.

Test data are in `test/data` folder.
If your test data contains special characters like čřšáý and so on, it is recommended
to make `.stripped` variation of the file.
E.g. `my_format.json` and `my_format.stripped.json` for loading data
and `my_format.test.xml` and `my_format.test.stripped.xml` for testing.

If not on dev-5 (e.g. when run on localhost), `.stripped` version of files will be loaded.
The reason for this is, that when dspace runs on windows, it has trouble with special characters.


## Settings
See const.py for constants used at testing.

To set up logs, navigate to support.logs.py and modify method set_up_logging.

## Run

In order to run tests, use command
`python -m unittest`

Recommended variation is
`python -m unittest -v 2> output.txt`
which leaves result in output.txt

Before running for the first time, requirements must be installed with following command
`pip install -r requirements.txt`

It is possible to run in Pycharm with configuration like so:

![image](https://user-images.githubusercontent.com/88670521/186934112-d0f828fd-a809-4ed8-bbfd-4457b734d8fd.png)
19 files renamed without changes
91 changes: 0 additions & 91 deletions const.py

This file was deleted.
