Commit
0 parents
commit c1ce72a
Showing 33 changed files with 6,900 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,88 @@ | ||
--- | ||
title: Introduction to OpenRefine | ||
teaching: 15 | ||
exercises: 0 | ||
--- | ||
|
||
::::::::::::::::::::::::::::::::::::::: objectives | ||
|
||
- Explain what the OpenRefine software does | ||
- Explain how the OpenRefine software can help work with data files | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
:::::::::::::::::::::::::::::::::::::::: questions | ||
|
||
- What is OpenRefine? What can it do? | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
## What is OpenRefine? | ||
|
||
OpenRefine is a desktop application that uses your web browser as a graphical interface. It is described as "a power tool for working with messy data" ([David Huynh](https://web.archive.org/web/20141021040915/http://davidhuynh.net/spaces/nicar2011/tutorial.pdf)) - but what does this mean? It is probably easiest to describe the kinds of data OpenRefine is good at working with and the sorts of problems it can help you or your team solve. | ||
|
||
OpenRefine is most useful where you have data in a simple tabular format, such as a spreadsheet, a comma-separated values (CSV) file or a tab-delimited (TSV) file, but with internal inconsistencies in data formats, in where data appears, or in the terminology used. OpenRefine can be used to standardize and clean data across your file. It can help you: | ||
|
||
- Get an overview of a data set | ||
- Resolve inconsistencies in a data set, for example standardizing date formatting | ||
- Split data up into more granular parts, for example splitting up cells with multiple authors into separate cells | ||
- Match local data up to other data sets, for example matching forms of personal names against name authority records in the Virtual International Authority File (VIAF) | ||
- Enhance a data set with data from other sources | ||
|
||
Some common scenarios might be: | ||
|
||
- Where you want to know how many times a particular value (name, publisher, subject) appears in a column in your data | ||
- Where you want to know how values are distributed across your whole data set | ||
- Where you have a list of dates which are formatted in different ways, and want to change all the dates in the list to a single common date format. For example: | ||
|
||
| Data you have | Desired data | | ||
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | | ||
| 1st January 2014 | 2014-01-01 | | ||
| 01/01/2014 | 2014-01-01 | | ||
| Jan 1 2014 | 2014-01-01 | | ||
| 2014-01-01 | 2014-01-01 | | ||
|
||
- Where you have a list of names or terms that differ from each other but refer to the same people, places or concepts. For example: | ||
|
||
| Data you have | Desired data | | ||
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | | ||
| London | London | | ||
| London] | London | | ||
| London,] | London | | ||
| london | London | | ||
|
||
- Where you have several pieces of data combined in a single column, and you want to separate them out with one column for each piece. For example, going from a single address field (in the first column) to each part of the address in a separate field: | ||
|
||
| Address in single field | Institution | Library name | Address 1 | Address 2 | Town/City | Region | Country | Postcode | | ||
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | :------------------------------------------------------------- | :---------------- | :-------- | :---------- | :------------ | :------------- | :------- | | ||
| University of Wales, Llyfrgell Thomas Parry Library, Llanbadarn Fawr, ABERYSTWYTH, Ceredigion, SY23 3AS, United Kingdom | University of Wales | Llyfrgell Thomas Parry Library | Llanbadarn Fawr | | Aberystwyth | Ceredigion | United Kingdom | SY23 3AS | | ||
| University of Aberdeen, Queen Mother Library, Meston Walk, ABERDEEN, AB24 3UE, United Kingdom | University of Aberdeen | Queen Mother Library | Meston Walk | | Aberdeen | | United Kingdom | AB24 3UE | | ||
| University of Birmingham, Barnes Library, Medical School, Edgbaston, BIRMINGHAM, West Midlands, B15 2TT, United Kingdom | University of Birmingham | Barnes Library | Medical School | Edgbaston | Birmingham | West Midlands | United Kingdom | B15 2TT | | ||
| University of Warwick, Library, Gibbett Hill Road, COVENTRY, CV4 7AL, United Kingdom | University of Warwick | Library | Gibbett Hill Road | | Coventry | | United Kingdom | CV4 7AL | | ||
|
||
- Where you want to add to your data from an external data source: | ||
|
||
| Data you have | Date of Birth from VIAF (Virtual International Authority File) | Date of Death from VIAF (Virtual International Authority File) | | ||
| ----------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------- | :------------------------------------------------------------- | | ||
| Braddon, M. E. (Mary Elizabeth) | 1835 | 1915 | | ||
| Rossetti, William Michael | 1829 | 1919 | | ||
| Prest, Thomas Peckett | 1810 | 1879 | | ||
|
||
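The date and place-name scenarios above can also be sketched in plain Python, which may help show the kind of transformation involved. This is only an illustration, not how OpenRefine works internally. The function names `normalize_date` and `clean_place_name` are hypothetical, the list of accepted formats is an assumption based on the four example rows, `01/01/2014` is assumed to be day-first, and English month names are assumed.

```python
import re
from datetime import datetime

# Formats assumed from the example table; slash dates are treated as DD/MM/YYYY.
DATE_FORMATS = ["%d %B %Y", "%d/%m/%Y", "%b %d %Y", "%Y-%m-%d"]


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no assumed format matches."""
    # Drop ordinal suffixes such as "1st" or "22nd" so "%d" can parse the day.
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next assumed format
    return None


def clean_place_name(text):
    """Strip stray trailing ',' and ']' and upper-case the first letter."""
    stripped = text.strip().rstrip(",]").strip()
    return stripped[:1].upper() + stripped[1:]


for raw in ["1st January 2014", "01/01/2014", "Jan 1 2014", "2014-01-01"]:
    print(raw, "->", normalize_date(raw))

for raw in ["London", "London]", "London,]", "london"]:
    print(raw, "->", clean_place_name(raw))
```

A value that matches none of the assumed formats comes back as `None` rather than being guessed at, which mirrors the caution needed with ambiguous dates.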
## What Should I Know When Working With OpenRefine? | ||
|
||
- No internet connection is needed, and none of the data or commands you enter in OpenRefine are sent to a remote server. | ||
- You are NOT modifying original/raw data. | ||
- Projects are autosaved every five minutes and when OpenRefine is properly shut down (Ctrl+C). See [History in User Manual](https://docs.openrefine.org/manual/running/#history-undoredo) for details. | ||
- Files are saved locally, so if you work on two computers you will have to export and import your projects to move them between machines. | ||
|
||
:::::::::::::::::::::::::::::::::::::::: keypoints | ||
|
||
- OpenRefine is 'a power tool for working with messy data' | ||
- OpenRefine works best with data in a simple tabular format | ||
- OpenRefine can help you split data up into more granular parts | ||
- OpenRefine can help you match local data up to other data sets | ||
- OpenRefine can help you enhance a data set with data from other sources | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
--- | ||
title: Importing data into OpenRefine | ||
teaching: 10 | ||
exercises: 5 | ||
--- | ||
|
||
::::::::::::::::::::::::::::::::::::::: objectives | ||
|
||
- Successfully import data into OpenRefine | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
:::::::::::::::::::::::::::::::::::::::: questions | ||
|
||
- How do I get data into OpenRefine? | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
## Importing data | ||
|
||
OpenRefine does not manipulate your data directly. | ||
Instead, the data you import and all the changes you make are stored in a project. | ||
You can stop working on a project and continue later if you like. | ||
When you want to 'refine' a new file, you start by creating a new project. | ||
When you want to continue working on a project, you can open it through "Open Project". | ||
It is also possible to export a project on one computer and continue working on it on a different | ||
computer. | ||
To do so, you transfer the exported files to the new computer and use "Import Project" on the new | ||
computer. | ||
|
||
::::::::::::::::::::::::::::::::::::::::: callout | ||
|
||
## What kinds of data files can I import? | ||
|
||
There are several options for getting your data set into OpenRefine. You can upload or import files in a variety of formats including: | ||
|
||
- TSV (tab-separated values) | ||
- CSV (comma-separated values) | ||
- TXT | ||
- Excel | ||
- JSON (JavaScript Object Notation) | ||
- XML (Extensible Markup Language) | ||
- Google Spreadsheet | ||
|
||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
::::::::::::::::::::::::::::::::::::::: checklist | ||
|
||
## Create your first OpenRefine project (using provided data) | ||
|
||
To import the data for the exercise below, follow the instructions in [Setup](https://librarycarpentry.github.io/lc-open-refine/index.html) to download the data and run OpenRefine. *NOTE: If OpenRefine does not open in a browser window, open your browser and enter the address [http://127.0.0.1:3333/](http://127.0.0.1:3333/) to go to the OpenRefine interface.* | ||
|
||
1. Once OpenRefine is launched in your browser, click `Create Project` from the left-hand menu and select `Get data from This Computer`. | ||
2. Click `Choose Files` (or 'Browse', depending on your setup) and locate the file you have downloaded, called `doaj-article-sample.csv`. | ||
3. Click `Next»`. The next screen (see below) gives you options to ensure the data is imported into OpenRefine correctly. The options vary depending on the type of data you are importing. | ||
4. Click in the `Character encoding` box and set it to `UTF-8`. This ensures that OpenRefine correctly interprets the imported data as UTF-8 encoded. If you don't select this you may find that some special characters (e.g. smart quotation marks) are not displayed correctly. | ||
5. Ensure the first row is used to create the column headings by checking the box `Parse next 1 line(s) as column headers`. | ||
6. OpenRefine will automatically select `Use character " to enclose cells containing column separators`. This makes sure that OpenRefine doesn't misinterpret any commas (or other separator characters) that are part of the cell data as delimiters. Keep this option selected. | ||
7. From OpenRefine 3.4 onwards there is an option to `Trim leading & trailing whitespace from strings` when importing separator-based files. Keeping this checked ensures that values like `English` and `English `, which differ by a single trailing space, are not treated as different values after the import. | ||
8. Make sure the `Attempt to parse cell text into numbers` box is not checked. Otherwise OpenRefine tries to detect numbers automatically, which could cause errors such as confusion between date formats (e.g. DD/MM/YYYY vs MM/DD/YYYY). | ||
9. The Project Name box in the upper right corner will default to the title of your imported file. Click in the `Project Name` box to give your project a different name, if desired. | ||
|
||
:::::::::::::::::::::::::::::::::::::: instructor | ||
|
||
This is a good moment to review the points from [What Should I Know When Working with OpenRefine?](01-introduction.md#what-should-i-know-when-working-with-openrefine) | ||
|
||
::::::::::::::::::::::::::::::::::::::::::::::::: | ||
10. Once you have selected the appropriate options for your project, click the `Create project »` button at the top right of the screen. This will create the project and open it for you. Projects are saved as you work on them, so there is no need to save copies as you go along. | ||
|
||
![Create Project in OpenRefine](fig/openrefine_ui.png){alt="OpenRefine Create Project screen, with highlights for the address bar, mentioned settings and the Create Project button."} | ||
|
||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
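For readers who find it easier to reason about import options in code, the choices in the checklist above (UTF-8 text, first row as headers, `"` as the quote character, trimmed whitespace, no automatic number parsing) can be approximated with Python's standard `csv` module. This is a rough sketch, not OpenRefine's importer: the `read_rows` helper and the sample data are hypothetical, and well-formed rows are assumed.

```python
import csv
import io

# Hypothetical sample standing in for a UTF-8 encoded CSV file.
SAMPLE = 'Title,Language,Citation\n"Data, cleaned",English ,12\n'


def read_rows(stream, trim_whitespace=True):
    """Read CSV text into a list of dicts keyed by the header row."""
    # quotechar='"' keeps commas inside quoted cells from splitting the cell.
    reader = csv.DictReader(stream, delimiter=",", quotechar='"')
    rows = []
    for record in reader:
        if trim_whitespace:
            # Remove leading and trailing whitespace from every cell value.
            record = {key: value.strip() for key, value in record.items()}
        rows.append(record)  # values stay as text; nothing is parsed as a number
    return rows


rows = read_rows(io.StringIO(SAMPLE))
print(rows[0])
```

To try the same sketch on a downloaded file, the stream could be opened with `open("doaj-article-sample.csv", encoding="utf-8", newline="")`, which makes the character encoding explicit in the same way as the `Character encoding` box.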
To open an existing project in OpenRefine you can click `Open Project` from the main OpenRefine screen (in the left hand menu). When you click this, you will see a list of the existing projects and can click on a project's name to open it. | ||
|
||
### Going Further | ||
|
||
- Look at the other options on the Import screen - try changing some of these options and see how that changes the Preview and how the data appears after import. | ||
|
||
::::::::::::::::::::::::::::::::::::::: instructor | ||
To revisit OpenRefine's homepage, select the large blue diamond in the upper left corner of the browser window. Carefully guide learners through this so they can explore import options when creating new projects or re-opening existing ones. | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
- Do you have access to JSON or XML data? If so, the first stage of the import process will prompt you to select a 'record path', that is, the parts of the file that will form the data rows in the OpenRefine project. | ||
|
||
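The idea of a 'record path' can be illustrated outside OpenRefine with a small Python sketch. In OpenRefine the path is chosen interactively on the import screen. Here it is modelled, as an assumption, as a list of keys leading to the list of records. The JSON document and the `select_records` helper are hypothetical.

```python
import json

# Hypothetical JSON document: the records sit under "results" -> "articles".
DOCUMENT = json.loads("""
{
  "source": "example",
  "results": {
    "articles": [
      {"title": "First article", "language": "English"},
      {"title": "Second article", "language": "French"}
    ]
  }
}
""")


def select_records(document, record_path):
    """Follow the keys in record_path and return the list found there."""
    node = document
    for key in record_path:
        node = node[key]
    return node


# Each element of the selected list becomes one data row.
rows = select_records(DOCUMENT, ["results", "articles"])
for row in rows:
    print(row["title"], "|", row["language"])
```

Everything outside the chosen path (here the `source` field) does not become a row, which is why picking the right path matters.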
:::::::::::::::::::::::::::::::::::::::: keypoints | ||
|
||
- Use the `Create Project` option to import data | ||
- You can control how data imports using options on the import screen | ||
- Several file types may be imported into OpenRefine. | ||
|
||
:::::::::::::::::::::::::::::::::::::::::::::::::: | ||
|
||
|