diff --git a/contact/index.html b/contact/index.html
index 599e3cf9..e5972446 100644
--- a/contact/index.html
+++ b/contact/index.html
@@ -1,9 +1,9 @@
Contact Us - LDaCA
+We have a page on LinkedIn and we share a Twitter account with the Australian Text Analytics Platform:">

Contact Us


You can contact the Language Data Commons of Australia by email at ldaca@uq.edu.au and make sure you subscribe to our newsletter.

Our logo was designed by Otis Carmichael. Read more about Otis and the ideas behind his design.

We have a page on LinkedIn and we share a Twitter account with the Australian Text Analytics Platform.




Contact Us


You can contact the Language Data Commons of Australia by email at ldaca@uq.edu.au and make sure you subscribe to our newsletter.

Our logo was designed by Otis Carmichael. Read more about Otis and the ideas behind his design.

We have a page on LinkedIn and we share a Twitter account with the Australian Text Analytics Platform:




+Case Studies Accounts by collectors of various kinds of language data illustrating the different solutions they have adopted for the problems encountered in the process.">
@@ -81,7 +79,6 @@

View resources designed for data users

News

Follow what's new at LDaCA and be informed about events which we organise.

-

View the publications produced by LDaCA as well as blog posts written by our team.

@@ -126,21 +123,6 @@

Get in touch with LDaCA and get involved in our act

- -

LDaCA acknowledges Traditional Owners of Country throughout Australia and recognises the continuing connection to lands, waters and communities. We pay our respects to their Ancestors and their descendants, who continue cultural and spiritual connections to Country.

diff --git a/index.json b/index.json
index 4546978f..6f2f8594 100644
--- a/index.json
+++ b/index.json
@@ -1 +1 @@
-[{"categories":null,"content":"A brief overview of Crate-O, RO-Crate, Schemas, Profiles and Modes.","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":" Crate-O Use Cases RO-Crate Collection Hierarchy Schemas, Profiles and Modes Schema.org Style Schemas (SOSSs) and RO-Crate Profiles and Modes Crate-O is a browser-based editor that allows you to create and update Research Object Crates (RO-Crates), either using the web interface or with metadata from a spreadsheet. It provides researchers with a relatively simple way to describe their data using best practice in formal metadata description. Designed as a Vue component, Crate-O can be easily integrated into any Vue.js project. Simply install the component with npm and include it in your application. As a Vue software component, it can be used, cloned and distributed by anyone. Crate-O is implemented in the GitHub page Crate-O. This implementation works only with Google Chrome and Microsoft Edge. It does not link to other services, and as such, any data uploaded to Crate-O will not go anywhere other than populating your RO-Crate. We will be releasing versions that work with online resources directly, which will be compatible with other browsers (see the Roadmap). RO-Crate is a way of packaging research data that stores the data together with its associated metadata and other component files, such as the data license. It is a flexible, developer-friendly approach to linked-data description and packaging. 
While the current version of Crate-O is designed for editing self-contained RO-Crates (and works fine with crates containing tens of thousands of entities), our roadmap includes adding the ability to edit fragments of larger linked-data resources and to integrate with repositories, such as the Oni repository, data API and archival repositories such as the Language Data Commons of Australia. ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:0:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Crate-O Use Cases Crate-O is designed to work with any of the following use cases: Describe data collections and files on a user’s computer, and add contextual information about those files Describe abstract contextual entities, such as in a Cultural Collection or an encyclopaedia Annotate existing resources elsewhere on the web Submit a data collection to the LDaCA Portal Edit a Schema that contains a set of vocabulary terms, such as the terms used by LDaCA. ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:1:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"RO-Crate Collection Hierarchy The diagram below shows the hierarchical relationship between collections, objects and files in a corpus, together with the metadata categories which track these relationships. Self-contained corpus crate with all resources Image Source: LDaCA The metadata is organised according to Schema.org entity types. Entity Definition Class rdfs:Class is used to classify resources. Classes in the Language Data Commons (LDAC) schema include CollectionEvent, CollectionProtocol, DataDepositLicense, DataLicense and DataReuseLicense (see https://w3id.org/ldac/terms). Property rdfs:Property is an attribute of an instance of a Class. 
For example, on an entity that is an instance of Class Person the property “name” would be their name, expressed as a text string, while “affiliation” would be a property that referenced another entity, their university. DefinedTerm A ‘word, name, acronym, phrase, etc. with a formal definition’, ‘often used in the context of category or subject classification.’ DefinedTerms allow us to a) have accurate definitions of the values we want to give to properties, and b) group such definitions in DefinedTermSets, which can function as controlled vocabularies. The table below shows an example of the relationship between each of these entities: Level Example Class Annotation ↓ ↓ Property annotationType ↓ ↓ Defined Term Set AnnotationTypeTerms ↓ ↓ Defined Terms Gestural, Phonemic, Phonetic, Phonological, Prosodic, Semantic, Syntactic, Transcription, Translation For more details on these and other metadata entities, see Metadata for Language Data. ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:2:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Schemas, Profiles and Modes This diagram shows the relationship between the three main components used by Crate-O and other tools employed by LDaCA for specifying and validating RO-Crates. This section explains what these components are and how they relate. The three main components for RO-Crate editing with Crate-O Image Source: LDaCA A Schema specifies a metadata vocabulary of Classes and Properties, based on the RO-Crate specification’s use of Schema.org classes. An RO-Crate Mode is a set of lightweight syntactic rules for combining Schema.org Style Schema (SOSS) Classes, Properties and DefinedTerms, expressed in a JSON file that can be: loaded into an editor such as Crate-O imported into another program and used for RO-Crate validation used to summarise the rules for an RO-Crate Profile. 
An RO-Crate Profile has (at least) a document that explains how metadata entities from the Schema are used for a particular purpose. These are all inter-related, and can be developed together or separately using tools. See the links below to the LDAC schema, profile and modes: LDAC Schema LDAC Profile LDAC Modes ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:3:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Schema.org Style Schemas (SOSSs) and RO-Crate Profiles and Modes Schema.org, which provides the basic vocabulary for RO-Crate, has a light-touch approach to describing what it refers to as its schema (with a small-s), which might also be thought of as an ontology. Schema.org is defined as a set of Classes and Properties, each of which has an online definition. The below example illustrates that the base class Thing and its subclass Person has properties such as birthDate. Class: Thing → Sub-Class: Person → Property: birthDate Schema.org specifies which Properties can occur in the domain of which Classes, and the range of Classes that are expected as values for a Property. While Schema.org has terms for Class and Property, it does not use these for defining the classes and properties in Schema.org itself (possibly as this would be circular). Rather, it uses the equivalent Classes from the rdf: and rdfs: vocabularies. 
Here is the definition for Person: { \"@id\": \"schema:Person\", \"@type\": \"rdfs:Class\", \"owl:equivalentClass\": { \"@id\": \"foaf:Person\" }, \"rdfs:comment\": \"A person (alive, dead, undead, or fictional).\", \"rdfs:label\": \"Person\", \"rdfs:subClassOf\": { \"@id\": \"schema:Thing\" }, \"schema:source\": { \"@id\": \"http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#source_rNews\" } } The Class definition does not have any information about the occurrence of properties – that is found in a Property definition: { \"@id\": \"schema:sibling\", \"@type\": \"rdf:Property\", \"rdfs:comment\": \"A sibling of the person.\", \"rdfs:label\": \"sibling\", \"schema:domainIncludes\": { \"@id\": \"schema:Person\" }, \"schema:rangeIncludes\": { \"@id\": \"schema:Person\" } } A SOSS is a Flattened JSON-LD graph, just like an RO-Crate. Some members of the RO-Crate community are beginning to define its basic schema and RO-Crate Profiles using the SOSS’s same approach. To make an RO-Crate Mode File, we transform the flat graph of a schema into something optimised for driving an editor or a validator; it creates a list of Classes, and what properties each may have. 
Base Mode File creation, combining the Schema.org schema and RO-Crate additions using the rocsoss script Image Source: LDaCA ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:4:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":" Back to Newsletters LDaCA Newsletter — Quarter 3 2024 Campaign URL Copy Twitter 0 tweets Subscribe Past Issues RSS Translate English العربية Afrikaans беларуская мова български català 中文(简体) 中文(繁體) Hrvatski Česky Dansk eesti keel Nederlands Suomi Français Deutsch Ελληνική हिन्दी Magyar Gaeilge Indonesia íslenska Italiano 日本語 ភាសាខ្មែរ 한국어 македонски јазик بهاس ملايو Malti Norsk Polski Português Português - Portugal Română Русский Español Kiswahili Svenska עברית Lietuvių latviešu slovenčina slovenščina српски தமிழ் ภาษาไทย Türkçe Filipino украї́нська Tiếng Việt Read about the new phase of the project, several events we will be running this year, a symposium hosted by the ARDC... and more! LDaCA Newsletter — Quarter 3 2024WelcomeWelcome to the third issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we announce the new phase of the project, advise of several workshops and other events we will be running, and report back from a symposium hosted by the Australian Research Data Commons (ARDC). If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our LinkedIn page. NewsNew phase of LDaCAWe are delighted to announce that the ARDC’s investment in LDaCA will continue until June 2028. 
In this next phase, LDaCA will: develop the social and technical foundations for a national, distributed archival repository of language materials continue securing vulnerable and nationally significant collections of Aboriginal and Torres Strait Islander languages, Indigenous languages in Australia’s Pacific region, varieties of Australian English and migrant languages, and sign languages of Australia and its region continue to develop the LDaCA data portal for accessing and repurposing language data of significance to researchers and communities, including data which is held in galleries, libraries, archives and museums (GLAM) establish workflows which link repositories and analytics environments so that researchers can create fully described, reproducible research on written, spoken, multimodal and signed language provide training and develop resources for researchers and communities which support best practice in archiving, sharing, accessing and analysing language data in line with FAIR and CARE principles. In this phase, a new Chief Investigator (CI), Dr Rose Barrowcliffe, joins The University of Queensland (UQ) team (Prof Michael Haugh and Dr Martin Schweinberger) and three new partners will join LDaCA: Batchelor Institute of Indigenous Tertiary Education (CI: Prof Kathryn Gilbey), the Australian Digital Observatory (ADO) (CI: Dr Marissa Takahashi) and the University of Western Australia (CI: Prof Clint Bracknell). Monash University will not continue as a partner, at least in the short term, and we would like to take this opportunity to thank CI Assoc Prof Louisa Willoughby and her team for the wonderful contribution they have made over the past two years. 
Our other partners will continue their involvement: Australian National University (ANU) (CI: Prof Catherine Travis), The University of Melbourne (CI: Assoc Prof Nick Thieberger), The University of Sydney (USyd) (CI: Prof Monika Bednarek), AARNet (CI: Adam Bell) and First Languages Australia (CI: Beau Williams). The project is part of the ARDC’s HASS and Indigenous Research Data Commons (RDC), which is establishing long-term, enduring national digital research infrastructure. It supports researchers in harnessing research data to enhance Australian social and cultural wellbeing, and helps us understand and preserve our culture, history and heritage. In partnership with research institutions and government, the HASS and Indigenous RDC has achieved joint resourcing of more than $40 million over the coming four years. LDaCA also welcomes the new focus areas which will increase the size and value of the RDC: the social sciences project (currentl","date":"2024-01-04","objectID":"/news/newsletter/newsletter-q3-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 3 2024","uri":"/news/newsletter/newsletter-q3-2024/"},{"categories":null,"content":"Information about the principles on which the work of LDaCA is based.","date":"2022-02-15","objectID":"/about/principles/","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":" The Language Data Commons of Australia aims to ensure long-term access to language data collections for analysis and reuse. Sustainable management of and access to these significant collections of intangible cultural heritage are underpinned by two sets of complementary guiding principles of data management and stewardship, namely the FAIR and CARE principles. 
","date":"2022-02-15","objectID":"/about/principles/:0:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"FAIR Principles The FAIR principles were first published by a group of stakeholders representing academia, industry, funding agencies and scholarly publishers. The principles aim to address issues surrounding data management and stewardship, focusing on four areas (which provide the FAIR acronym). ","date":"2022-02-15","objectID":"/about/principles/:1:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Findable Metadata and data should be easy to find for both humans and computers. Making the data findable includes: (Meta)data are assigned a globally unique and persistent identifier Data are described with rich metadata Metadata clearly and explicitly include the identifier of the data they describe (Meta)data are registered or indexed in a searchable resource ","date":"2022-02-15","objectID":"/about/principles/:1:1","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Accessible Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation. (Meta)data are retrievable by their identifier using a standardised communications protocol The protocol is open, free and universally implementable The protocol allows for an authentication and authorisation procedure, where necessary Metadata are accessible, even when the data are no longer available ","date":"2022-02-15","objectID":"/about/principles/:1:2","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Interoperable The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage and processing. 
(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation (Meta)data use vocabularies that follow FAIR principles (Meta)data include qualified references to other (meta)data ","date":"2022-02-15","objectID":"/about/principles/:1:3","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Reusable The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. (Meta)data are richly described with a plurality of accurate and relevant attributes (Meta)data are released with a clear and accessible data usage license (Meta)data are associated with detailed provenance (Meta)data meet domain-relevant community standards The Australian Research Data Commons (ARDC) and LDaCA support FAIR data practices and initiatives that make data and related research outputs FAIR. At the same time, the ARDC and LDaCA acknowledge that the implementation of the principles will look different across disciplines and will need discipline-specific approaches and standards. ","date":"2022-02-15","objectID":"/about/principles/:1:4","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"CARE Principles The CARE Principles for Indigenous Data Governance were developed by the Global Indigenous Data Alliance (GIDA); they are a response to the FAIR principles and aim to complement them. GIDA highlights how the FAIR principles and the open data movement focus on increasing data sharing among researchers and entities but do not take into account power differentials and historical contexts or fully engage with Indigenous Peoples’ rights and interests. These include the rights to generate value from Indigenous data in ways that are grounded in Indigenous worldviews and to advance Indigenous innovation and self-determination. 
The CARE Principles also focus on four areas. ","date":"2022-02-15","objectID":"/about/principles/:2:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Collective Benefit Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data. For inclusive development and innovation For improved government and citizen engagement For equitable outcomes ","date":"2022-02-15","objectID":"/about/principles/:2:1","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Authority to control Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered. Recognizing rights and interests Data for governance Governance of data ","date":"2022-02-15","objectID":"/about/principles/:2:2","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Responsibility Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit. For positive relationships For expanding capability and capacity For Indigenous languages and worldviews ","date":"2022-02-15","objectID":"/about/principles/:2:3","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Ethics Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem. For minimizing harm and maximizing benefit For justice For future use The Australian Research Data Commons (ARDC) and LDaCA support the CARE principles to further extend data management principles, ensuring that Indigenous communities benefit from the data, and that authority to control Indigenous data is held by Indigenous Peoples. 
We have a responsibility to share how data is used to collectively benefit Indigenous Peoples, and that Indigenous Peoples’ rights and wellbeing are the primary concern at all stages of the data life cycle. ","date":"2022-02-15","objectID":"/about/principles/:2:4","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"A guide to navigating the various sections of the Crate-O interface.","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":" Main Menu Mode Selector Current Entity Property Groups Entity Properties Entity Navigator ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:0:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Main Menu Crate-O Main Menu Image Source: LDaCA From the Main Menu, the following options are available: File Option Description Open Directory Select a directory/folder to locate or set where files are saved. This is the required first step for all RO-Crates. Load Files Load files from the selected directory into this RO-Crate. Bulk Add Select an Excel spreadsheet or similar from a directory to assist you with metadata description (the directory can be the same as that for the RO-Crate). This will append to your existing RO-Crate if one has already been created. Note that this option currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. Save Save the state of this page into your RO-Crate. This will create an ro-crate-metadata.json file, or add data into an existing ro-crate-metadata.json file. Close Close without saving. Help Display general help from anywhere in Crate-O. About Display general information about Crate-O. 
","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:1:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Mode Selector Crate-O Mode Selector Image Source: LDaCA Below the main menu, the Mode Selector allows you to choose a predefined mode or load one from your computer. Note that for blank RO-Crates, the mode will default to Simple RO-Crate Dataset, however for language data, the mode Language Data Commons top-level Collection (corpus) is most relevant. When a file or folder is open in Crate-O, Selected Directory to the right of the Mode dropdown menu shows the directory or folder you currently have open. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:2:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Current Entity Crate-O Current Entity Image Source: LDaCA Below the Mode Selector, a home icon followed by entity breadcrumbs indicates your entity history, and underneath that, the entity you are currently located on. These will update as you navigate to different entities. Note that the breadcrumbs section doesn’t reflect the tree structure of the RO-Crate, but instead shows the history of the entities you have visited. To the right of this is the option to Remove Entity that you are currently viewing (provided it is not the root entity). ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:3:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Property Groups On the left-hand panel, depending on the selected mode, different Property Groups will be available: Crate-O Property Groups Image Source: LDaCA Property Group Description About The core metadata for this RO-Crate and its subject matter. 
Related People, Orgs \u0026 Works The context for the creation of this RO-Crate: who made it, funded it etc. Structure How the parts of this RO-Crate relate, such as collection and object relationships. Provenance A detailed description of how entities were created, by whom and with which tools. Space \u0026 Time Where and when the data was collected, i.e. the times and places it mentions or describes. Software \u0026 Hardware Computer programs and execution environments that could be used to create data, have created data, or are being packaged and described. Others Other properties not in the above categories. Note that if you find a property in Other that should be in one of the above groups or have other suggestions, please raise an issue on GitHub. Hovering over each property group will display tooltips related to that group. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:4:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Entity Properties In the middle panel, the Entity Properties for the property group you have selected are shown. Crate-O Entity Properties Image Source: LDaCA The exclamation mark icons to the left indicate that an entity is a required property for that mode, and saving an RO-Crate without completing one of these sections will result in a warning message. Below each field is a tooltip providing further detail on the given entity. Each entity field has a ‘delete’ icon beside it, however, these are only functional (and appear in red) for entities that can be removed, i.e. required properties cannot be deleted, but you can edit these. There are several compulsory metadata fields required for each collection: Property Group Metadata Description Additional Notes About @id Persistent, managed unique ID in URL format (if available), for example, a DOI for a collection or an ORCID, personal home page URL or email address for a person. 
Existing persistent identifiers for persons and organisations can be found at: Open Researcher and Contributor ID (ORCID)Research Organization Registry (ROR)Trove (use the People \u0026 Organisations search category). About @type The type of the entity. Both Dataset and RepositoryCollection are required for top-level collections in the Language Data Commons (LDAC) Mode, as these determine the available properties. About Name The name of this collection. About Description An abstract of the collection. Include as much detail as possible about the motivation and use of the dataset, including things that we do not yet have properties for. May include but not limited to the following: a brief description of participants (age range, gender, particular characteristics as a group, etc.)purpose of the collectionresearch value of the dataset. About License Link to a document that describes the rights and obligations for users of this collection record. This does not necessarily cover the license terms that may apply to Objects in the collection which may have specific licensing. Licensing on specific objects overrides the license attached to a collection record. Related People, Orgs \u0026 Works Author The person or organisation responsible for creating this collection of data. The options Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Related People, Orgs \u0026 Works Publisher The organisation responsible for releasing this dataset. The lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Others Rights Holder A person or organisation owning or managing rights over the resource. The options Text, Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. 
Others Accountable Person The person or organisation who is the data steward for this resource. The options Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Others In Language The language(s) of the materials (including PrimaryMaterials, DerivedMaterials and Annotations) in this collection. The lookup pulls from AustLang, Glottolog and Ethnologue. Both AustLang names and codes can be used. If a language is not present, it can be created. Others Subject Language The languages that the materials in the collection are about (not the language that it is in). This is particularly used on Annotations that may talk about PrimaryMaterials or DerivedMaterials. The lookup pulls from AustLang, Glottolog and Ethnologue. Both AustLang names and codes can be used. If a language is not present, it can be created. To browse all the metadata entities associated with the LDAC Mode, see Metadata for Language Data. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:5:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Entity Navigator On the Entity Navigator panel, there are some further options related to navigating and creating metadata entities. Crate-O Entity Navigator Image Source: LDaCA Create New Entity: Create a new metadata entity, such as a provenance action. For example, CreateAction describes how a work is created with more precision than a property like author. All Entities: Select to view all metadata entities associated with your collection. Unlinked Entities: Select to view all metadata entities that are not currently referenced by properties on the Root dataset or other entities, for example, an entity of type Person which is not referenced using an author property. 
Below this, there is pagination which allows you to navigate through results, with 10 results appearing per page. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:6:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":" Back to Newsletters \u003c!DOCTYPE html\u003e LDaCA Newsletter — Quarter 2 2024 Campaign URL Copy Twitter 0 tweets Subscribe Past Issues RSS Translate English العربية Afrikaans беларуская мова български català 中文(简体) 中文(繁體) Hrvatski Česky Dansk eesti keel Nederlands Suomi Français Deutsch Ελληνική हिन्दी Magyar Gaeilge Indonesia íslenska Italiano 日本語 ភាសាខ្មែរ 한국어 македонски јазик بهاس ملايو Malti Norsk Polski Português Português - Portugal Română Русский Español Kiswahili Svenska עברית Lietuvių latviešu slovenčina slovenščina српски தமிழ் ภาษาไทย Türkçe Filipino украї́нська Tiếng Việt Read about the new case study and blog post up on our website, our draft project plan, a recap of the ARDC Comp Skills Summer School 2024 and more! LDaCA Newsletter — Quarter 2 2024WelcomeWelcome to the second issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we highlight the LDaCA draft project plan and co-design workshop report published by the Australian Research Data Commons (ARDC), and report back from several exciting events, including the ARDC’s Computational Skills Summer School 2024 and a tech team visit to Batchelor Institute of Indigenous Tertiary Education. If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our LinkedIn page. NewsNew content on websiteWe have added two new pieces of content to our website: a case study about data management in language technology, based on projects at tech company Appen. 
a blog post about why the Australian National Corpus is not a single collection within LDaCA. LDaCA draft project plan On 14 March, the ARDC released draft project plans for the next phase of the HASS and Indigenous Research Data Commons (HASS\u0026I RDC), including a plan for the LDaCA project. They were seeking feedback from the public on project activities planned for the next four years. Feedback closed on 22 March, and once this feedback has been reviewed and incorporated, final project plans will be released in the coming months. Meanwhile, the draft plans are still available to view online. Co-designing a digital future for Indigenous language materials Robert McLellan and his team are seeking Indigenous language champions who are working with language materials for a research project called “Co-designing a digital future for Indigenous language materials”. We plan to run a series of Zoom and face-to-face interviews to find out what works well and what doesn’t work when finding, accessing and using Indigenous language materials. We know that there are challenges involved with accessing and using language materials stored in various locations. We want to contribute to making that data more findable and usable so that it can support and enhance the language work currently being undertaken. So, we are seeking input to help build language platforms, spaces and tools that suit Indigenous language workers and communities. If you are interested or would like to know more, please take a minute to get in touch through our Contact Form. All interviewees will receive a gift voucher worth $70 as a thank you for their time. Graduate Digital Research Fellowship program The Graduate Digital Research Fellowship (GDRF) program ran very successfully in 2023. We are hoping to build on this success, with three fellows from The University of Queensland (UQ) participating in the 2024 program: Lu Jin is a PhD candidate in architecture and urban design.
Her multidisciplinary work will design urban green infrastructural networks in a circular food system for city resilience and sustainability. In the GDRF program, Lu will be applying machine learning methods to identify and assess green cover in street view images. David Gilchrist is a PhD candidate in journalism. His research looks at how journalists connect with audiences, and in the GDRF program, he hopes to explore methods for finding and gathering relevant","date":"2023-02-04","objectID":"/news/newsletter/newsletter-q2-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 2 2024","uri":"/news/newsletter/newsletter-q2-2024/"},{"categories":null,"content":" LDaCA Newsletter — Quarter 1 2024 Read about our Graduate Digital Research Fellowship, ARDC's upcoming Computational Skills Summer School, our 2023 ALS conference activities and more! Welcome Welcome to the first issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we introduce the 2024 Graduate Digital Research Fellowship program, highlight the upcoming Computational Skills Summer School 2024 from the Australian Research Data Commons (ARDC) and report back from the 2023 Australian Linguistics Society (ALS) annual conference.
If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our new LinkedIn page. News New LinkedIn page LDaCA is now on LinkedIn! We look forward to sharing more content about the project and connecting with the wider community. Follow our page or send us a message here. Website update Our website has a new look and has been reorganised. We hope that you like the design changes and that information is now easier to find. We encourage feedback via email or LinkedIn so that we can continue to make the website more useful. Online policy documents and advice Several documents outlining LDaCA policies and providing advice for data custodians are now available on the LDaCA website under Resources, then LDaCA Resources. The key policy document details our Access Policy; this policy document is supported by a guide to Determining Access Conditions. There is also a document explaining the LDaCA Data Onboarding Process, supported by Guidance for Data Governance Decisions and a guide to Obtaining a DOI. These documents are the outcome of wonderful work by project members at the Australian National University: Catherine Travis, Li Nguyen and Cale Johnstone. Rosanna Smith (LDaCA User Design Analyst) prepared the material for online publication. 2024 Graduate Digital Research Fellowship Program In the last five years, the University of Queensland (UQ) has hosted several iterations of a Graduate Digital Research Fellowship program focused on researchers in the Humanities and Social Sciences (HASS). In 2023, the program ran under the auspices of ATAP. Two members of our UQ team, Sam Hames (Postdoctoral Research Fellow in Computational Humanities) and Simon Musgrave (Engagement Lead for ATAP and LDaCA), led the program and four students participated. You can read more about the history of the program and the 2023 cohort in this blog post.
Image: LDaCA The program will run again in 2024 as an LDaCA activity, supporting research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spend 12–15 weeks learning about digital and computational skills in order to enhance their current research topic or to work on an independent digital project. They explore digital research methods in areas including, but not limited to: computational analysis of text creation of new software tools to support their specific research area social media analytics assessment and analysis of digital environments including mobile apps online games as social and cultural objects. The Fellows meet regularly in activities to develop a sound understanding of digital research methods and tools, including seminars, reading groups and training works","date":"2023-12-03","objectID":"/news/newsletter/newsletter-q1-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 1 2024","uri":"/news/newsletter/newsletter-q1-2024/"},{"categories":null,"content":"A step-by-step guide to creating an RO-Crate metadata file.","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":" This guide takes you through the steps to create an RO-Crate metadata file, using a top-level collection for language data as an example. To view more details and screenshots of each section discussed, select the click-through links throughout the page. Open Directory Select Mode Add Entity Metadata Save RO-Crate Append Data from Spreadsheet ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:0:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Open Directory Open Crate-O in a compatible browser. 
In the Main Menu, select Open Directory. Navigate to a folder where the RO-Crate will be saved, or create a new one, then confirm your selection. For the pop-up message asking Let site view files?, select View Files. Your current working directory will be displayed in the Selected Directory section in the Mode Selector. Crate-O: Open Directory Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:1:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Select Mode In the Mode Selector, the Mode dropdown shows the current mode that is being displayed, i.e. the metadata framework associated with your collection. Change the mode from the default Simple RO-Crate Dataset to Language Data Commons top-level Collection (corpus). Crate-O: Select Mode Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:2:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Add Entity Metadata The Current Entity section shows your location in the current RO-Crate. The Property Groups panel allows you to navigate to the various groups associated with your RO-Crate. The Entity Properties panel is where you can add data about your collection. For blank RO-Crates, the following missing property messages will appear by default. Clicking the blue Add buttons will automatically add these missing items to your RO-Crate. Crate-O: Missing Properties Image Source: LDaCA Enter the entity properties you have for your collection. Add as little or as much information about your collection as you like, as this can be saved and worked on further later. To browse all the metadata entities associated with the Language Data Commons (LDAC) Mode, see Metadata for Language Data. 
Crate-O: Add Entity Metadata Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:3:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Save RO-Crate In the Main Menu, select Save. For the pop-up message asking Save changes to [Selected Directory]?, select Save changes. Crate-O: Save RO-Crate Image Source: LDaCA Your RO-Crate is now successfully saved in your working directory, with two files: ro-crate-metadata.json: The saved RO-Crate in JSON format. ro-crate-preview.html: An HTML file that can be viewed on a web browser and shows the contents of your RO-Crate. ro-crate-metadata.json ro-crate-preview.html ro-crate-metadata.json Image Source: LDaCA ro-crate-preview.html Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:4:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Required Properties After saving, if there are required properties missing from your RO-Crate, the section Saved with warnings will appear. You can select the dropdown on this message to view the missing required properties. Clicking on one of these warnings will take you to the relevant property group. If you choose to edit any of these sections, select Save again to ensure your most recent changes are not lost. 
Crate-O: Saved with Warnings Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:4:1","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Append Data from Spreadsheet If you have a spreadsheet in a compatible format that you want to add to your collection to assist with metadata description, or you want to create a new RO-Crate only using spreadsheet data, open an existing or new directory, then select Bulk Add in the Main Menu and load the spreadsheet. This will append it to your existing RO-Crate. For more guidance on the spreadsheet requirements and a template, see Spreadsheet Upload. Note that this option currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. Bulk Add also only reads from the spreadsheet and does not write to it. ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:5:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":" LDaCA Newsletter - Quarter 4 2023 06/10/23 Welcome Welcome to the second issue of this newsletter.
We thank you for your interest in the Language Data Commons of Australia and the Australian Text Analytics Platform, and we hope that this newsletter will help to keep you informed about our activities. If you have questions about anything you read here, or if you have feedback for us, you can mail us at ldaca@uq.edu.au (or use one of the alternative contact points listed at the end of this message). News ARDC All Staff Meeting ARDC held an event in Melbourne last month for all of their staff. Michael Haugh and Robert McLellan were invited to present at the event about working with the ARDC on LDaCA. Their presentation, which was very well received, was part of a showcase of projects in the ARDC's three thematic data commons (Planet, People, HASS and Indigenous). Image Credit: ARDC Event at the Australian Linguistic Society Conference We will be holding a meeting during the Australian Linguistics Society 2023 Conference to provide an update on the activities of the project. This event will take place at lunchtime on the last day of the conference (Friday December 1), venue TBC. New Team Members Two people have joined our team in the last quarter, Otis Carmichael and Rosanna Smith, and we will let them introduce themselves: Hi there, my name’s Otis Carmichael, I’m a Waanyi man from Brisbane and a technical designer on the LDaCA team. My initial involvement with the team was to design the team’s logo and other promotional material, however I am now involved in Community Engagement and working on design throughout the website and tools. I’m excited to be on board with the project because I believe that it has potential to really support the domain of language research, by helping to ensure that research data is stored in a more sustainable manner and thus will remain accessible for the language’s community. Hi, my name's Rosanna Smith and I'm a user design analyst on the LDaCA team, based in Melbourne.
My background is in linguistics, having managed a variety of language projects for use in machine learning development and language technology. I’m excited to contribute to a project such as LDaCA that is guided fundamentally by FAIR and CARE principles, as well as the opportunity to apply my previous experience and further promote diversity and accessibility in language data management and research. Events Forthcoming Events Workshop on immigrant language corpora in Australia When: 9-10 November, 2023 Where: Onsite. ANU, Acton, ACT 2601 Guest speakers: John Hajek, Ingrid Piller, Sophie Loy-Wilson, Adrian Vickers Australia has long been seen as one of the world’s most multilingual and multicultural societies, with more than 490 languages coming from around 300 ancestries and cultural traditions (ABS, 2021, 2022). For decades, the language and cultural maintenance of various immigrant groups have been under investigation by many scholars, not only in linguistics but also in history, sociology, anthropology, and many other disciplines. This work has amassed a large body of immigrant language data that provides information about how Australia’s immigration history has contributed to the country today. The purpose of this workshop is to bring together scholars working with language corpora from across different disciplines.
The workshop is being run as part of the Langua","date":"2023-08-14","objectID":"/news/newsletter/newsletter-q4-2023/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 4 2023","uri":"/news/newsletter/newsletter-q4-2023/"},{"categories":null,"content":"Guidance and a template for adding new data to an RO-Crate via spreadsheet.","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":" Template Tab Breakdown Column Breakdown Root Authors Publishers Licenses People Objects Files (CSV, EAF, WAV) Upload Spreadsheet to an RO-Crate with Crate-O ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:0:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Template For collections where there are a lot of interconnected objects and files, it may be easier or preferable to add the metadata for these via uploading a spreadsheet to an existing RO-Crate in Crate-O, rather than adding these items manually. An RO-Crate metadata spreadsheet template can be downloaded below and populated with metadata specific to your collection: ro-crate-metadata-template.xlsx Spreadsheet upload currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. 
The template is based on an example data collection that contains three types of files within each object: Audio files (WAV), the primary material Text files (CSV), transcriptions of the audio files ELAN files (EAF), linguistic annotations of the audio files ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:1:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Tab Breakdown The spreadsheet has the below tabs by default, but depending on your collection, you may need to add additional tabs, or others may not be applicable. Tab Description Root Metadata about the root or top level of the collection. Authors Metadata about the person or organisation responsible for creating this collection. Publishers Metadata about the organisation responsible for releasing this collection. Licenses Metadata about the license(s) within the collection; both for the objects and files, and for the collection’s metadata. People Metadata about the people within the collection. Objects Metadata about the entities within the collection that could encompass one or more files. Files Metadata about the files in your collection. If the collection has multiple file formats, duplicate this tab and add the formats to the tab names, e.g. csv_files, eaf_files, wav_files. ELAN (.eaf) files can have relative or absolute paths to the data they relate to. The ELAN preferences file is generally not needed for the collection and relates to the particular ELAN user only. Below the header, an example row is included to illustrate how the section can be filled. The example row is colour-coded according to whether the column: requires the user to input data (blue) is pre-filled with a formula or static value and doesn’t require editing (green). HINT: Highlight the example row and drag it down to copy all the pre-filled cells. 
Don’t forget to remove the example rows before you upload your spreadsheet to Crate-O! At a minimum, it’s best practice to include @id and @type columns in each of your spreadsheet tabs, as these appear in Crate-O for each of the entities. The tables in the next section provide further details on what constitutes a valid @id and @type in each tab. For more detailed lists of these, see Metadata for Language Data. HINT: To type a column name beginning with @ in Excel, put an apostrophe before it '@. This will force it to be recognised as a text value rather than a formula. The columns provided in the template tabs are illustrative only and may not all apply to your collection; please edit these as needed. Where a column header begins with a full stop (.), this indicates that the column will be ignored when the data is loaded into Crate-O and will not appear in the RO-Crate. This can be helpful if you want to retain other information in your spreadsheet that may not be in a format applicable to the RO-Crate. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:2:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Column Breakdown The section below describes each of the columns included in the template, ordered by tab. Please note that the columns provided in the template tabs are illustrative only and should be edited according to the requirements of your collection. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Root The root tab provides information about the top level of the collection. Unlike the other tabs, the root tab can only have one row, so if there are columns that require more than one value, duplicate that column. 
Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, a DOI for a collection. @type Pre-filled The type of the collection. Both Dataset and RepositoryCollection are required. name Data entry The name of this collection. description Data entry An abstract of the collection. Include as much detail as possible about the motivation and use of the dataset, including things that we do not yet have properties for. isRef_author Pre-filled Generated from the @id column in the Authors tab. isRef_publisher Pre-filled Generated from the @id column in the Publishers tab. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:1","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Authors An author is a person or organisation responsible for creating the collection. It is possible for collections to have multiple authors. Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, an ROR for an organisation or an ORCID, personal home page URL or email address for a person. @type Data entry The type of the author. Either Person or Organization can be selected. name Data entry The name of the author. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:2","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Publishers A publisher is an organisation responsible for releasing the collection. It is possible for collections to have multiple publishers. Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, an ROR for an organisation. @type Pre-filled The type of the publisher. Only Organization is valid. name Data entry The name of the organisation. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:3","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Licenses A license for a collection establishes the conditions for who can access, share and reuse the data, and other conditions as required. It is a legal arrangement between the creator of the data and the end-user specifying what users can do with the data. Column Type Description @id Data entry A URL to a version of the license (if available), for example, a URL of a Creative Commons license. If there is no URL, a license.txt file containing the text of the license needs to be included in the repository, and license.txt should be added as the @id. @type Pre-filled The type of the license. Only DataReuseLicense is valid. name Data entry The name of the license. description Data entry A description of the license. metadataIsPublic Data entry Determines whether the collection metadata can be viewed publicly. Requires a Boolean value (TRUE or FALSE). allowTextIndex Data entry Determines whether the collection text can be indexed for search purposes. Requires a Boolean value (TRUE or FALSE). It is possible to leave the licensing tab blank if these details are still being finalised for the collection; however, this will need to be amended later in Crate-O. For custom licenses (i.e. those specific to a particular collection), it is recommended that a copy of the license be included in the repository to ensure that it remains accessible. Furthermore, if there are any additional usage restrictions or options for use outside of a given license, this information can be included in a usageInfo field, e.g. “For any use not permitted by the CC-BY-ND 4.0 License, please contact the Data Steward”. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:4","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"People This tab contains information about the people within the collection. Column Type Description @id Pre-filled A unique identifier for the person, generated from the name column. Identifiers should be prefixed with #. @type Pre-filled The type of the entity. Only Person is valid. name Data entry The name of the person. language_code Data entry The language spoken by the person. An example of an optional metadata field from the source data. Language codes can be obtained from AustLang, Glottolog and Ethnologue. gender Data entry The gender of the person. An example of an optional metadata field from the source data. birthDate Data entry The birth date (year) of the person. An example of an optional metadata field from the source data. isRef_specializationOf Data entry A reference to another Person entity, used for collections where a person appears more than once with different demographic info (e.g. a different age). In these collections, there should be a ‘canonical’ person for each participant and another Person entity each time they participate, with different ages or other statuses. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:5","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Objects An object is a single resource or a group of tightly related resources in a collection. For example, a work (document) in a written corpus, or the files associated with a dialogue or session in a speech study (recordings, transcriptions etc.). Some systems, such as PARADISEC, refer to Objects as Items or may use other terms. Column Type Description @id Pre-filled A unique identifier for the object, generated from the name column. 
Identifiers should be prefixed with #. @type Pre-filled The type of the entity. Only RepositoryObject is valid. name Data entry The name of the object. description Data entry A description of the object. isRef_speaker Pre-filled Generated from the .pseudonym column with # prefixed. .pseudonym Data entry An example of a column from a data steward’s source data, so that speakers in the collection are anonymised. datePublished Data entry The date the object was published. The date will be converted in Crate-O to ISO 8601 format, e.g. Wed Jun 12 2024 10:00:00 GMT+1000 (Australian Eastern Standard Time). isRef_pdcm:memberOf Pre-filled The collection this object is a member of, generated from the @id column in the Root tab. Or if the collection contains sub-collections, a reference to another RepositoryCollection @id. isRef_license Data entry The @id of the license to which this object adheres. isRef_indexableText Data entry Identifies which of the files in the given object has content that is indexed for search purposes. For example, in the template, the content of the CSV file would be searchable, whereas the EAF and WAV files would not. If isRef_indexableText is not included in a collection, search will only run on the metadata and not the transcript file content. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:6","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Files (CSV, EAF, WAV) A file is a container for data and can store data in different formats. For example, a single object could have an audio file as well as a text file containing a transcription of the audio. Three examples of file tabs are included in the template, and their columns are combined in the table below. Tab Column Type Description CSV, EAF, WAV @id Pre-filled The filepath to the given file. Generated from the .folder, .filename and .postfix columns. 
CSV, EAF, WAV @type Pre-filled The type of the entity. Only File is valid. CSV, EAF, WAV .folder Data entry The folder name in which the given file appears. CSV, EAF, WAV .filename Data entry The name of the given file, without postfixes. CSV, EAF, WAV .postfix Data entry The file format of the given file, for example, .csv, .eaf, .wav. CSV, EAF isType_Annotation Data entry Indicates whether the given file is an annotation of another file. Requires a Boolean value (TRUE or FALSE). WAV isType_PrimaryMaterial Data entry Indicates whether the given file is the object of study, such as a literary work, film, or recording of natural discourse. Requires a Boolean value (TRUE or FALSE). CSV, EAF, WAV isRef_partOf Pre-filled Generated from the .filename column. CSV, EAF isRef_annotationOf Data entry The full filename of the primary material that the given file is an annotation of. CSV, EAF, WAV .objectId Pre-filled Generated from the .filename column. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:7","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Upload Spreadsheet to an RO-Crate with Crate-O For steps on adding your spreadsheet data to an existing RO-Crate, see Append Data from Spreadsheet. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:4:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":" LDaCA Newsletter | Quarter 3, 2023 13/07/23 Welcome Welcome to the first issue of this newsletter. We thank you for your interest in the Language Data Commons of Australia and the Australian Text Analytics Platform, and we hope that this newsletter will help to keep you informed about our activities. If you have questions about anything you read here, or if you have feedback for us, you can mail us at info@ldaca.edu.au (or use one of the alternative contact points listed at the end of this message). News The next phase The initial round of funding from ARDC for LDaCA and ATAP ended in June. ARDC have submitted a Research Infrastructure Investment Plan (otherwise known as the RIIP) covering the period through to 2028, but further funding for LDaCA has been approved for the coming financial year through to end of June 2024 while we await the outcome of the RIIP2023 application. LDaCA Director Prof Michael Haugh. Image ARDC/Marc Grimwade In the coming year, LDaCA will build on the work accomplished in the last two years while also developing some new areas of work. 
These will include building capabilities in repurposing data, particularly for community research platforms, and in collecting and organising data, particularly for donated social media data. ATAP will also continue to receive funding support within the broader LDaCA program. Language materials at Batchelor College, NT. Image: Ben Foley Community Connect project LDaCA has recently been successful in gaining funding under ARDC’s Community Connect program which aims to make data more accessible to communities (researchers or others). Through a co-design process in the second part of this year (2023), we will develop a training workshop and associated materials to provide Indigenous language workers and community members with the knowledge and skills to use our data portal effectively and to find the data in which they are interested. The co-design process will involve the participants in the University of Queensland’s Winter Language Revitalisation Intensive Workshop as well as discussion sessions at both the Australian Languages Workshop (July 21-23) and the Puliima Conference (August 21-25) where our team members, led by Robert McLellan, will explore ways of breaking down the jargon which is comfortable for academics but which can be a barrier for others. On the basis of the knowledge gained in these activities, the team will then develop the workshop and training materials. Initial delivery of a workshop will occur in November. The Language Revitalisation program at UQ involves Samantha Disbray, Des Crump and Robert McLellan (also LDaCA’s Program Manager). On the LDaCA side, our Engagement lead, Simon Musgrave, will coordinate the project, and we are delighted that Otis Carmichael (our logo designer) will join the team. Events Forthcoming Events Workshop on Data Migration Skills This workshop will introduce you to tools to efficiently migrate material in various formats to the standards developed by LDaCA. 
ANU, July 20 and 21 (Day 1 - hybrid, Day 2 - in person only) Details: Events - ATAP Heritage data at Monash University. Image: Simon Musgrave Recent Events Catherine Travis and Li Nguyen convened a workshop on Language Corpora in Australia (Canberra and online, July 3). The link to the program and abstracts can be found here. Peter Sefton and Simon Musgrave participated in the 2nd Workshop on Digita","date":"2023-07-13","objectID":"/news/newsletter/newsletter-q3-2023/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 3 2023","uri":"/news/newsletter/newsletter-q3-2023/"},{"categories":null,"content":"As detailed in an earlier post, we ran a Graduate Digital Research Fellowship (GDRF) program in 2023 under the auspices of the Australian Text Analytics Program (ATAP). The program ran again in 2024, under the Language Data Commons of Australia banner with three students participating and with Sam Hames and Simon Musgrave co-ordinating the activity. The program supports research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spent 12–15 weeks learning about digital and computational skills in order to enhance their current research/thesis topic or to work on an independent digital project. They explored digital research methods in areas including: computational analysis of text; analysis of digital images; and automatic speech recognition (and automatic non-speech recognition). The Fellows met regularly in activities to develop a sound understanding of digital research methods and tools, including seminars and training workshops. They also had access to mentors who advised and collaborated on their digital projects. 
The three Fellows who participated in the 2024 program and their projects were: ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:0:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"David Gilchrist (UQ, Journalism): David’s research looks at how journalists connect with their audiences, particularly in local and hyper-local contexts. In the Fellowship program, David focused on two areas: how to collect and curate material published online, and whether there was a place for sentiment analysis in his methodology. More abstractly, the questions which David was addressing were about how to integrate computational methods into a qualitative enquiry, and (not surprisingly!) we had some fascinating discussions with David about possible answers to such questions. The opportunity to work with scholars from different disciplines was highly rewarding. It provided a variety of interesting and thought provoking insights and questions that ultimately enhanced my own work. What’s more, the experience of working with a small team of dedicated scholars and mentors established a high quality academic environment and built long-lasting friendships that made this a truly inspiring and rewarding fellowship. Our mentors are exceptional in their academic and leadership abilities that build a team that will find longevity well beyond this initial experience. I highly recommend this program. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:1:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"Lu Jin (UQ, Architecture, Design and Planning): Lu is researching productive urban food landscapes as a spatial design issue. 
A preliminary step in her work is to be able to assess the amount and diversity of vegetation in the city and her work as a Fellow explored possible methods of analysing digital images which could make this task easy and quick. She used images and videos captured on a phone and then experimented with different image analysis algorithms to see which ones best identified greenery and its boundaries. The experiments showed that relatively simple approaches worked quite well, but further work is needed to decide whether the results will be useful for large scale surveys of urban environments. In Lu’s own words: Working closely with scholars from various disciplines allowed me to gain insights into research tools that are perhaps more common to other fields. Given my background in architecture, I’ve always been daunted by the idea of coding and programming. But in this day and age, knowing how to manipulate digital data is such a powerful skill to harness. This research fellowship program has kickstarted my learning journey into applying and adapting digital research methods for urban design and analysis. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:2:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"Quy Pham (UQ, Applied Linguistics): Quy is studying errors in the spoken language produced by students learning English as a foreign language. He is using a large spoken corpus of material produced by learners of English in 10 countries and areas in Asia, ICNALE: The International Corpus Network of Asian Learners of English. One phenomenon which might relate to errors is pausing, and Quy was therefore interested to explore the possibilities of automatically identifying various characteristics of silent and filled pauses in recordings of speech, including their duration and placement within sentences. 
Building from this, he also looked at whether it was possible to automatically tag the start and end points of specific tasks within a controlled interaction (spoiler alert: not really). The GDRF program provided me with a unique opportunity to sharpen my coding and critical thinking skills. I gained valuable experience in applying a range of Python packages and large language models to analyse pausing behaviors in a large spoken learner corpus, while being aware of their critical limitations. On a personal level, I also had a chance to meet with an amazing group of bright and talented researchers from diverse fields. We engaged in insightful discussions on various topics such as citation practices, research ethics and project management. I highly recommend this program to anyone considering it. From left: Quy, Sam, Lu, Simon and David Image Source: Simon Musgrave A GDRF program will be offered again in 2025 by the Language Data Commons of Australia. If you think the program could benefit you and would like further information, you can contact Sam Hames or Simon Musgrave. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:3:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"LDaCA Senior Data Manager Dr Julia Colleen Miller discusses her experience in archiving and data management for cultural heritage data.","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"by Dr Julia Colleen Miller For nearly ten years, I have been officially working as a Data Manager. 
I say officially because, having worked on collaborative research projects and earned a PhD in Linguistics within the context of an endangered language documentation collaborative project, I’ve been managing other people’s data, as well as my own, since I was a fledgling academic in the early 2000s. I am based at the Australian National University (ANU), and since mid-2023, I’ve been working as the senior data manager for the Language Data Commons of Australia (LDaCA). Prior to this, I worked for the ARC Centre of Excellence for the Dynamics of Language (CoEDL), which could offer an eight-year data manager position, very generous in a field where contracts often last only two or three years. It was at CoEDL where I first took on this academic-adjacent role, focusing on bolstering research and research infrastructure. It was also the first time I had heard of such a position becoming available that addressed the data management and archiving needs of those actively recording or assembling cultural heritage data. Such data includes tabular data, text corpora, collections of songs and music, ethnographic interviews, elicited linguistic content, oral histories, traditional narratives and lexicons, and can be in different formats, such as text, video, audio, image, film, magnetic tape and digital files. The best part of this CoEDL role was that the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), where I had been running the ANU unit since 2010, was going to be the main repository for CoEDL, and the two roles of archivist and data manager would be fundamentally intertwined. I had finally found my calling! 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:0:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"The day-to-day of data management and archiving At CoEDL, my primary responsibility was to help facilitate the data management and subsequent archiving of materials collected by CoEDL members and affiliates from institutions such as ANU, The University of Queensland, The University of Melbourne, Western Sydney University (WSU), as well as by the broader global research communities. This involved managing a continuous flow of enquiries from new and continuing researchers, offering guidance, and overseeing the entire data management and archiving processes. My aim was to ensure that valuable research data was preserved and made: accessible for future research available to the people and communities recorded, as well as to their descendants. Most days, in-person or over Zoom, I met with depositors to: guide them through the process of describing their data for their own research databases, making eventual archiving much easier discuss appropriate file formats find or create collaborative workspaces for CoEDL students, staff and their external collaborators develop data access protocols, refining conditions of access for archived collections offer file transfer options, ensuring secure data transmission inform researchers of other archiving alternatives if PARADISEC was not the best fit for their material. Over the years, I have tried to make the data management and archiving processes more efficient and helpful by creating archiving guides. Originally, these were static PDF files available on the CoEDL website; however, as they needed to be dynamic sources of information for rapidly changing technologies, they are now a series of webpages. 
Figure 1: Screenshot of the overview page of PARADISEC's Archiving Guides and Technical Workflows website. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:1:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Managing the PARADISEC ANU unit In my concurrent role as manager of the ANU unit of PARADISEC, I oversaw a wide range of digitisation and archiving activities, ensuring that important cultural materials were preserved and accessible for current and future use. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Digitising analogue materials We often receive magnetic tape audio (cassette and reel-to-reel tapes) and field notebooks to archive. Engaging with legacy formats is equal parts exciting and challenging. The challenging parts are: finding and maintaining the necessary playback devices dealing with damaged or mouldy materials creating functional, evolving workflows adhering to standards set by digital preservation’s peak bodies. …but the rewards are legion. There is nothing like hearing voices recorded decades ago, and then sharing those recordings with descendants of those speakers! It’s also a great feeling to make available rare language material by digitising the items, and then engage with young researchers to help enrich the descriptions of the content. Even more so if it helps them with their research topics. 
Some of the digitising tasks we conducted during this time for CoEDL researchers and general PARADISEC depositors were: digitising audio cassettes and reel-to-reel tapes, performing all necessary post-production editing to prepare for archiving Figure 2: Studer reel-to-reel tape player, with a tape threaded and ready to digitise. Image Source: Dr Julia Colleen Miller photographing tapes and tape boxes containing written metadata on their labels or inserts found within; these images then accompanied the audio files in the archive Figure 3: Collection of images of reel-to-reel tapes, tape boxes and loose sheets of paper, all containing important written metadata about the contents of the tapes. Image Source: Haoyi Li digitising manuscripts and field journals, ensuring high-quality reproduction. Figure 4: Digitising a field notebook with a DSLR camera mounted to an overhead shelf (out of shot). The camera is tethered to a laptop for remote capture. Image Source: Dr Julia Colleen Miller It is very satisfying knowing that by digitising these items, we helped make available legacy recordings brought to us by retiring researchers, small regional cultural institutions, and those found in our very own university archives. If you are interested in our workflows for audio digitisation or manuscript and field note image capture, you can find more information below: audio digitising workflow digitising fieldnotes \u0026 manuscripts workflow. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Handling born-digital files Of course, we are now living in a time where digital recording devices of all kinds are all around us, making content creation very easy. 
With this easy access to quality recorders for use in the field, people have the freedom to collect as much data as needed to address their research questions. We receive hundreds (sometimes thousands) of digital files each week to help manage, and eventually, archive. Some activities involved in managing born-digital files are: receiving files and creating detailed inventories of them — a critical step in ensuring all files are looked after and that quality checking can be logged resampling audio, transcoding image and video files to archival formats and managing the transfer to PARADISEC for archiving devising solutions for problematic video formats and liaising with commercial service providers and other institutions for help when needed. Below is an image of an inventory spreadsheet containing the structural metadata of thousands of files that went into a single collection: Lauren Reed’s Western Highlands Sign Languages. Figure 5: Screenshot of spreadsheet of structural metadata of audio and video files from the LRW1 collection in PARADISEC. This file and metadata inventory informed the transcoding workflows and quality-checking processes prior to archiving. Image Source: Dr Julia Colleen Miller Creating an inventory like this informed me as to how to proceed in transcoding tasks, based on the format and specifications of the files. Also, I could compare details of the output files to the originals, thus helping with quality checks. As I worked through the files, I could keep track of my progress by marking which ones had been transcoded and were ready to send to the archive. This task took months to complete, so it was important to be able to pick up each day right where I had left off on the previous day. If you are interested in learning how to batch extract this type of metadata from media files, visit the page Quality control of audio and video files from PARADISEC’s archiving guides and technical workflows. 
Over the life of CoEDL, we monitored the annual growth of files added to PARADISEC. The following graphic shows the period from 2008 to 2022. Note: CoEDL was active between November 2014 and December 2022. Figure 6: Line graph showing the growth of the number of files held in the PARADISEC archive between the years 2008 and 2022. Image Source: Dr Julia Colleen Miller Notice the increase in files starting in 2015. It was at this time I received research assistant (RA) support from CoEDL to help manage the influx of data. Over the years, I supervised several RAs and volunteers, managing their day-to-day tasks and overseeing their professional growth. By 2020, after refinements in our server connections and file transfer pipelines, we were sending a minimum of 500GB of audio-visual files to the archive every other day! I could never have achieved this without the support of the CoEDL RAs, Team Data: Jen Plaistowe, Tina Gregor, Melody Ross, Haoyi Li, Shubo Li, and our volunteer, Emma Cuppit. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Collaboration with cultural institutions in the region Collaborating with other cultural institutions was a significant aspect of my roles at CoEDL and PARADISEC, allowing us to increase access to archived materials, enhance our archiving capabilities and share expertise. Below are two examples of collaborative projects. 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"ANU Pacific Research Archives Occasionally, requests would come from ANU’s Pacific Research Archives staff, or directly from students and researchers, to digitise items collected across the Pacific which were housed in ANU’s archives. We targeted audio tapes and field notes for digitisation, enhancing ANU’s finding guides with detailed metadata. Typically, we archived these digital files in PARADISEC and shared links to new collections, or provided copies back to the Pacific Research Archives. A notable project involved digitising select items from the Helen Groger-Wurm collection at ANU. Helen Groger-Wurm, an anthropologist active in the 1960s, focused on Northern Australia, particularly Eastern Arnhem Land bark paintings. PARADISEC and CoEDL secured funding to digitise 26 audio tapes from this collection. This effort supported CoEDL PhD student Haoyi Li, who, due to COVID-19 restrictions, couldn’t conduct fieldwork in Arnhem Land when she was hoping to. Haoyi identified relevant items for her research, and we expanded the digitisation to include field notes alongside the audio tapes. This initiative enabled her to start her primary research remotely and gain valuable skills in archival engagement with legacy recordings, high-resolution image capture of field notes, open reel audio tape digitisation, metadata collection and archival record enrichment. Figure 7: Haoyi Li threading a Helen Groger-Wurm reel-to-reel tape for digitisation on the Revox C 270 tape player. Image Source: Dr Julia Colleen Miller The images below are from a tape digitised by Haoyi Li from the Helen Groger-Wurm collection in the Pacific Research Archives at ANU. 
The image shows that there is metadata written on the tape box and labels, as well as on the slip of paper found in the tape box. These tape-counter reference points to topics contained on the recording and the tape labels provided the title and description for the archive. Figure 8: Images of a Helen Groger-Wurm tape digitised by Haoyi Li. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"AIATSIS, NLA and NFSA We sometimes find items held in other cultural institutions, which we would like to digitise and add to existing collections in PARADISEC. For example, we identified items produced by the important 20th century Australian linguist Arthur Capell, including 168 reel-to-reel audio tapes of non-Australian language recordings found in the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) that we were able to digitise, and nearly 50 acetate discs and one magnetic wire recording found at the National Library of Australia (NLA). We enlisted the aid of the National Film and Sound Archive (NFSA) in digitising the discs and magnetic wire, as we did not have the facilities to handle those formats. Below is an image of a Capell disc being digitised by the NFSA. This contains recordings of the parable of the Prodigal Son and the Lord’s Prayer in the Bilua language of the Solomon Islands. Listen to the recording via the Capell collection in PARADISEC. Figure 9: One of Arthur Capell's acetate discs being digitised at the National Film and Sound Archive (NFSA). Image Source: Gerry O'Neill The next image is of the magnetic wire recording on its spool. This item contains narratives in languages of North and Central America, including Kiowa and Navajo. To hear these recordings, see the Capell collection. 
Figure 10: Magnetic wire audio recording from the Arthur Capell collection. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Managing large-scale, short-term projects In addition to my regular duties as data manager and digital archivist, I was sometimes tasked with overseeing the management of large-scale projects that required careful planning, coordination and implementation. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Phonemic modelling of Tok Pisin The first was a three-year project between ANU and the Commonwealth Defence Science and Technology Group (DSTG) called ‘Phonemic Modelling of Tok Pisin’. This project involved creating phonemic transcriptions of spoken Tok Pisin, the English-based creole which is a national language of Papua New Guinea. We retrieved open-access, untranscribed audio recordings of Tok Pisin held in PARADISEC and created rich time-aligned transcriptions, adding these back to their source collections, as well as providing these transcriptions to the DSTG. Below are images of transcriptions in the ELAN transcription software. The recordings were made by Andy Pawley in the Kaironk Valley region of Papua New Guinea and are held in PARADISEC as collection AP4. Figure 11: Screenshots of the ELAN software showing an orthographic time-aligned transcription of Tok Pisin (left) and a phonemic transcription of the same recording (right). 
Image Source: Dr Julia Colleen Miller Some tasks for this project included: targeting existing PARADISEC collections that held Tok Pisin content and preparing the audio for transcription liaising with DSTG on technical details to ensure mutual understanding and alignment creating detailed workflows and timelines to meet project deadlines hiring, training and managing a transcription team, including three Tok Pisin speakers delegating tasks to team members and monitoring their progress writing annual reports and conducting team debriefing meetings to gather feedback and adjust workflows as needed. In the end, PARADISEC gained rich transcriptions of previously untranscribed Tok Pisin audio held within the archive. Additionally, we were able to employ and train two ANU PhD students and one of PARADISEC’s archivists, Steven Gagau, all fluent Tok Pisin speakers, in using the ELAN software for creating the time-aligned transcriptions. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Digitising language material from the Katherine region Another significant project involved the secure storage of audio-visual materials and manuscripts from the Katherine region of Australia during renovations at their language centre. CoEDL Deputy Director Jane Simpson (ANU) and CoEDL CI Caroline Jones (WSU) tasked my team with storing, inventorying and overseeing the digitisation of materials contained in 42 shipping boxes from Mimi Arts \u0026 Crafts, previously known as Diwurruwurru-jaru (Katherine Regional Aboriginal Language Centre). Figure 12: Shipping boxes from Mimi Arts \u0026 Crafts, formerly Diwurruwurru-jaru (Katherine Regional Aboriginal Language Centre), stored at the ANU awaiting comprehensive inventory. Each of these 42 boxes contained two file boxes. 
Image Source: Dr Julia Colleen Miller Each box contained two smaller boxes with language-learning materials and cultural recordings created with Elders, primarily in the 1990s and 2000s. We expanded on an earlier inventory, creating a comprehensive list to facilitate digitisation and the eventual return of original items and digitised files to the Katherine Region. Figure 13: Six file boxes showing the diversity of contents, such as audio tapes, VHS and U-Matic tapes and written language-learning materials. Image Source: Dr Julia Colleen Miller Key aspects of the project included: developing a detailed inventory workflow and training a new RA collaborating with stakeholders, such as Caroline Jones, Jane Simpson, other invested researchers, Mimi Arts \u0026 Crafts representatives, AIATSIS, commercial digitisation providers, and loads of volunteer facilitators providing progress reports and maintaining communication with stakeholders facilitating the temporary removal and return of items for early digitisation efforts managing the repackaging and distribution of boxes as digitisation progressed. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Service to the archiving community My contributions to the archiving community extended beyond CoEDL, through presentations, training modules and active participation in professional organisations. I have participated in conferences and workshops, representing CoEDL and PARADISEC while discussing best practices in archiving and data management. I also focused on developing training sessions aimed at enhancing digitisation techniques, metadata management and workflow optimisation for fellow archivists. 
As a member of the International Association of Sound and Audiovisual Archives (IASA), I continue to contribute to their Technical Committee, where I can participate in discussions that influence the development of digital preservation standards. I also contribute to the creation of guidelines for born-digital video, aiming to ensure they are relevant to current technological advancements and archival needs. I have collaborated with Charles Sturt University’s Master of Information Studies program, guiding a student through an approved curriculum I developed that aimed to bridge academic learning with practical skills. These teaching materials have been used again for students from the University of Manchester’s Master of Arts programs in Digital Media, Culture and Society, and in Library and Archive Studies under the tutelage of my PARADISEC colleague, Nick Ward. During ANU’s COVID-19 lockdown, I designed a 10-week online training module to accommodate my newly hired RAs, providing foundational knowledge in digitisation and archiving techniques. This online training was extended to ANU’s Summer Scholar interns, who were also missing out on activities due to lockdown. We focused on audio and manuscript digitisation practices, as well as introducing them to the history of digital language archives. When the lockdown was lifted, we followed up with hands-on training. One of the interns, a PhB student in Linguistics, Daniel Majchrzak (seen in the image below), has been trained in high-resolution image capture of manuscripts and fieldnotes and is now working in our studio to digitise the papers of Luise Hercus. Figure 14: PhB student, Daniel Majchrzak, using a shelf-mounted digital camera for high-resolution image capture of the manuscripts of Luise Hercus. 
Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:5:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Final thoughts Data management is an integral element of institutional research infrastructure. It often falls within that fluid space between the roles of academic and professional staff — within what is called the Third Space in academia (to read more on this topic, take a look at \"Reconstructing Identities in Higher Education: The rise of ‘Third Space’ professionals\", by Celia Whitchurch, 2013). Many of the activities carried out as Data Manager for CoEDL, and now for LDaCA, were designed to assist in educational development and knowledge exchange, as well as to promote public engagement and outreach. These tasks not only assisted students and academic researchers to better conduct their research enquiries and safeguard their work, but they also facilitated activities that directly reinforced the academic missions of our universities. I love the work that I do. I feel lucky to be a part of so many diverse projects, even if it just involves: helping people name their files or store their research materials securely; making an inventory of cassette tapes to be digitised; connecting students to archival records of a language they are thinking about for their PhD research; helping researchers repatriate or rematriate recordings in a format that is accessible to the people who have been recorded, and to their descendants. Every day is a new and exciting challenge! 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:6:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"by Maria Weaver We have learned from conversations with data stewards over the last two years that some are new to the practice of using persistent identifiers (PIDs) for data collections. This blog post will hopefully offer some guidance by briefly providing some background information and addressing the common questions we receive about PIDs and Digital Object Identifiers (DOIs). ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:0:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"A brief overview of persistent identifiers Most of us are familiar with the concept of identifiers (IDs); they mark and individualise a person through life, from their name at birth through to IDs assigned at school, at work and in other contexts. Identifiers are needed for countless things in complex societies, e.g. material goods, documents and transactions, but the purpose remains the same: to unambiguously identify someone or something among similar others within a context. As far as we can see in the literature, the term PID is closely tied to the rise of the internet and digital information management in the 1990s. This makes sense as PIDs are meant to operate in the context of the web environment. The qualifier “persistent” refers to the need to mitigate effects of the transient nature of online content, where websites and digital resources can disappear, change or become inaccessible over time (Chapekis et al. 2024). PIDs are a critical part of scholarly communication, especially online where the volume of data is increasing every minute. 
Research outcomes and resources have to be shared appropriately, therefore, data objects and entities need to be unambiguously identifiable and reliably findable, and the connections between them made evident. PIDs are fundamental to fulfilling these requirements. The Australian Research Data Commons’s Australian National PID Strategy 2024 provides an excellent summary of how PIDs contribute to strengthening the digital infrastructure ecosystem and to driving research innovation in Australia. A strategy used by some PID systems (e.g. the DOI system) to sustain persistence is separating how the object is identified from where the object is located. The identification component consists of a globally unique identifier; a typical convention is as a series of numbers, letters and special characters. This identifier is optimised on the web by being actionable, that is, it can be formatted as a locator, e.g. a URL, that points to information or metadata about the object (also referred to as the reference to the object). Based on this strategy, an identifier is considered persistent when it consistently resolves over a reasonably long period to the webpage containing information or metadata about the object, including information about the location of the object. The location component is where the object is served or made available, whether on the web or in the physical world. Should the object location change, e.g. a data collection is moved to a different repository or becomes available through other websites, the relevant metadata on the reference page should reflect that fact. Note that although the reference is a digital description, it can be a description of any object or entity, whether digital, physical (e.g. scientific specimens, cultural heritage objects) or abstract (e.g. software, organisations). 
Understandably, PID systems on a global scale require large investments in robust technological infrastructures, including geographically distributed servers, digital preservation, and redirection and resolver systems. However, such infrastructure is useless without people and social structures to ensure that PIDs work as expected for as long as possible. Organisational commitment from stakeholders, including data owners, repository managers and PID registration agencies, is vital. Data stewards, for example, are responsible for obtaining PIDs for their datasets, creating rich and accurate metadata, ensuring changes to the metadata and the dataset are reflected in the PID registry, and adhering to data governance best practices, community standards and protocols. The PID landscape continues to evolve. If you are interested in delving deeper into persistence strategies used by different PID systems or nuanced views of identifier persistence from experts, see the Further Reading items below. Below are a few examples of PIDs within digital scholarly ecosystems: DOI (Digital Object Identi","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:1:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Digital Object Identifiers Contributors who allow LDaCA to make their data collections available through the portal can use the PIDs they have if they already follow a specific PID practice. However, if it is their first time obtaining a PID, we recommend DOIs as they are likely obtainable through a university library or institutional research repository. This has been the pattern with collections in the LDaCA portal to date. DOIs are also available through general-purpose repositories like Zenodo. 
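As a rough illustration of how a DOI keeps identification separate from location, the sketch below splits a DOI name into its prefix and suffix, turns it into an actionable link by prepending the public resolver address, and percent-encodes it for embedding in another identifier. The function names are hypothetical and chosen only for this example; the DOI used is the one for the La Trobe Corpus of Spoken Australian English discussed in this post.

```python
from urllib.parse import quote, unquote

RESOLVER = 'https://doi.org/'  # public DOI resolver service

def split_doi(doi_name):
    # A DOI name has the form prefix/suffix; only the first slash separates them.
    prefix, suffix = doi_name.split('/', 1)
    return prefix, suffix

def to_url(doi_name):
    # Make the identifier actionable by prepending the resolver address.
    return RESOLVER + doi_name

def encode_for_id(doi_name):
    # Percent-encode the slash so the DOI can sit inside another identifier.
    return quote(doi_name, safe='')

doi = '10.26181/23089559'
print(split_doi(doi))      # ('10.26181', '23089559')
print(to_url(doi))         # https://doi.org/10.26181/23089559
print(encode_for_id(doi))  # 10.26181%2F23089559
print(unquote(encode_for_id(doi)) == doi)  # True
```

The identifier itself never changes; only the metadata behind the resolver needs updating if the collection moves.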
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"What is a DOI? DOI stands for Digital Object Identifier, a PID which is in fact a digital identifier of an object, rather than an identifier of a digital object (DOI Key Facts). The DOI System was launched in 2000 by the International DOI Foundation (IDF), a not-for-profit organisation that functions as the governing body. DOI Registration Agencies (RAs) all over the world offer services for registering DOIs and community-specific metadata. University libraries and institutional research repositories can be members of an RA (e.g. DataCite in Australia) which allows them to create (or mint) and register DOIs for their organisation. Here’s an example of a DOI for a collection available via the LDaCA data portal: In the partial screenshot below of the record for the collection, the DOI number (10.26181/23089559) is part of the collection’s @id. The full @id value is arcp://name,doi10.26181%2F23089559 and the last part of this is the DOI (the slash which separates parts of the DOI is represented by the encoding %2F here). Figure 1: The collection record for the La Trobe Corpus of Spoken Australian English in the LDaCA data portal. Image Source: LDaCA The DOI resolves to an item in La Trobe University’s institutional repository. The content of the collection is available here, but there are also links to the record in the LDaCA portal which includes the DOI. Figure 2: Top part of page in La Trobe University’s repository showing files in collection. Image Source: LDaCA Figure 3: Bottom part of page in La Trobe University’s repository showing collection details. 
Image Source: LDaCA Clicking the Cite button on the La Trobe repository page shows the DOI link in its full form: Figure 4: Full DOI link displayed under citation on repository page. Image Source: LDaCA The DOI link comprises three parts: the resolver service (https://doi.org/), the prefix identifying the assigning body (10.26181) and the suffix identifying the resource (23089559.v1). ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:1","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Why should my data collection have a DOI? A review of the uptake of PID systems from the late 1990s to the late 2010s reported that DOIs are, for several good reasons, the most widely adopted PIDs in research data repository systems (Klump \u0026 Huber 2017). DOIs follow a standardised system: there is a syntax specification for the DOI name construction which is followed by all registrars, providing consistency across the board. The Handle System, a trusted and long-established digital infrastructure, is the DOI system’s resolution component providing reliability and persistence in the findability of research objects. Another notable characteristic of the DOI system is its well-developed social infrastructure. The IDF develops and enforces rules and policies for the maintenance of records and for managing the division of responsibilities between RAs, RA members and DOI users. As a consequence, DOIs promote good data governance practices. Apart from the act of obtaining a DOI, the associated responsibilities of creating and recording accurate and rich metadata, ensuring that the DOI record is updated through location and data changes, and providing clear data reuse and access terms, enlighten data owners and stewards about some of the practicalities of data governance. The DOI system’s underlying infrastructure supports interoperability across different platforms and systems. 
For example, DataCite’s “Add to ORCID Record” feature allows a researcher to manually claim a DOI of their work and link it to their ORCID record (DataCite 2024). Data repositories can use DOIs to track usage metrics such as views, downloads and citations, which also informs data owners as to how researchers engage with the data. Below is a screenshot of a section of the DOI landing page for the La Trobe Corpus of Spoken Australian English with some of the key information highlighted and annotated. The DOI was issued by DataCite via the La Trobe University Library’s DOI service. Figure 5: Annotated screenshot of DOI landing page for the La Trobe Corpus of Spoken Australian English with key information highlighted. Image Source: LDaCA ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:2","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"How do I get a DOI for my data collection? We have an article about this topic on our website — see Obtaining a DOI. ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:3","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"What format(s) should my data files be in before I share them with LDaCA? We understand that the choice of file formats during data planning and collection can be influenced by factors like research team decisions, discipline-specific norms, software-hardware compatibility and equipment funding. Although LDaCA's data migration experts can work with your data in any provided format, file formats that are easily shareable and reusable include: Text — TXT, PDF, (DOC/DOCX) Tabular data (which may include text) — CSV, (XLS/XLSX) Audio — WAV, MP3 Video — MP4. Bracketed formats are proprietary and therefore less sustainable. 
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:4","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"References Australian Research Data Commons. 2024. Australian national Persistent Identifier (PID) strategy 2024. Zenodo. https://doi.org/10.5281/ZENODO.10656275. Chapekis, Athena, Samuel Bestvater, Emma Remy and Gonzalo Rivero. 2024. When Online Content Disappears. Pew Research Center. https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/. (4 July, 2024). DataCite. 2024. Add a DOI to Your ORCID Record. DataCite Support. https://support.datacite.org/docs/orcid-claiming. (4 July, 2024). Klump, Jens \u0026 Robert Huber. 2017. 20 Years of Persistent Identifiers – Which Systems are Here to Stay? Data Science Journal 16. 09. https://doi.org/10.5334/dsj-2017-009. ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:3:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Further reading Car, Nicholas N. J., Pavel Golodoniuc \u0026 Jens Klump. 2017. The Challenge of Ensuring Persistency of Identifier Systems in the World of Ever-Changing Technology. Data Science Journal 16. 13. https://doi.org/10.5334/dsj-2017-013. Kunze, John \u0026 Richard Rodgers. 2008. The ARK Identifier Scheme. UC Office of the President: California Digital Library. https://escholarship.org/uc/item/9p9863nc. (4 July, 2024). 
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:4:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version Developer Track Session 2 Time: 05 June 2024, 09:00 - 10:30 · Location: Drottningporten 2 Crate-O - A drop-in linked data metadata editor for RO-Crate (and other) linked data in repositories and beyond Peter Sefton, Alvin Sebastian, Moises Sacal Bonequi, Rosanna Smith University of Queensland, Australia Research Object Crate is a metadata packaging standard which has been widely adopted over the last few years in research contexts and which debuted at Open Repositories with a workshop in 2019. Crate-O is an editor for the RO-Crate Metadata Specification. RO-Crate has been presented here at Open Repositories for the last few years, and is now starting to be incorporated into many research repository solutions (though they are not always called repositories). I am presenting in the next session on why RO-Crate is important for repositories. Five ways RO-Crate data packages are important for repositories Time: 05 June 2024, 11:00 - 12:30 · Location: Drottningporten 1 Peter Sefton*, Stian Soiland-Reyes** *University of Queensland, Australia; **The University of Manchester, UK Research Object Crate is a linked data metadata packaging standard which has been widely adopted in research contexts. 
In this presentation, we will briefly explain what RO-Crate is, how it is being adopted worldwide, then go on to list ways that RO-Crate is growing in importance in the repository world: Uploading of complex multi-file objects means RO-Crate is compatible with any general-purpose repository that can accept a zip file (with some coding, repository services can do more with RO-Crates) Download for well-described data objects complete with metadata from a repository rather than just a zip or file with no metadata Using RO-Crate metadata reduces the amount of customisation that is required in repository software, as ALL the metadata is described using the same simple, self-documenting linked-data structures, so generic display templates Sufficiently well-described RO-Crates can be used to make data FAIR compliant, aiding in Findability, Accessibility, Interoperability and Reusability thanks to standardised metadata and mature tooling And if you’re looking for a sustainable repository solution, there are tools which can run a repository from a set of static files on a storage service, in line with the ideas put forward by Suleman in the closing keynote for OR2023. The Language Data Commons of Australia Data Partnerships (LDaCA-DP), Language Data Commons of Australia Research Data Commons (LDaCA-RDC), and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768, https://doi.org/10.47486/HIR001, \u0026 https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). A version of the Crate-O component is available as a playground for Chrome browsers only. This allows you to describe your local folders in your computer and generate an RO-Crate. 
https://github.com/Language-Research-Technology/ro-crate-modes/blob/main/docs/soss-profiles.md A Schema.org style Schema (SOSS) specifies a metadata vocabulary of Classes and Properties, based on the RO-Crate specification’s use of Schema.org classes. An RO-Crate Profile has (at least) a document that explains how metadata entities from the Schema are used for a particular purpose. An RO-Crate Mode is a set of lightweight syntactic rules for combining SOSS Classes, Properties and DefinedTerms, expressed in a JSON file. Crate-O RO-Crate Editor Mode Files are editor configurations that implement RO-Crate Metadata Profiles. The configuration files are intended to form the basis of an approach for describing RO-Crate editor behaviour and can be used for RO-Crate validation. Initial versions of this work were based on the Describo Profiles (which vary between versions of Describo) used to configure the Describo family of RO-Crate editing","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-crate-o/:0:0","tags":["RO-Crate","Open Repositories","Crate-O"],"title":"Crate-O - a drop-in linked data metadata editor for RO-Crate (and other) linked data in repositories and beyond","uri":"/news/posts/open-repositories-2024-crate-o/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version Five ways RO-Crate data packages are important for repositories Presented at: The 19th International Conference on Open Repositories, June 3-6th 2024, Göteborg, Sweden Session: Presentations: Integrations for Research Data Management Time: 05/June/2024: 11:00 - 12:30 · Location: Drottningporten 1 Peter Sefton*, Stian Soiland-Reyes** *University of Queensland, Australia; **The University of Manchester, UK Research Object Crate is a linked data metadata packaging standard which has been widely adopted in research contexts. 
In this presentation, we will briefly explain what RO-Crate is, how it is being adopted worldwide, then go on to list ways that RO-Crate is growing in importance in the repository world: Uploading of complex multi-file objects means RO-Crate is compatible with any general-purpose repository that can accept a ZIP file (with some coding, repository services can do more with RO-Crates). Download for well-described data objects complete with metadata from a repository rather than just a ZIP or file with no metadata. Using RO-Crate metadata reduces the amount of customisation that is required in repository software, as ALL the metadata is described using the same simple, self-documenting linked-data structures, so generic display templates. Sufficiently well-described RO-Crates can be used to make data FAIR compliant, aiding in Findability, Accessibility, Interoperability and Reusability thanks to standardised metadata and mature tooling. And if you’re looking for a sustainable repository solution, there are tools which can run a repository from a set of static files on a storage service, in line with the ideas put forward by Suleman in the closing keynote for OR2023. ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:0:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"Uploading of complex multi-file objects RO-Crate [1], [2] is a data packaging format and can be used to put multiple data files together with their metadata into a package such as a ZIP, tar or disk image file. This means that as long as your repository can handle a ZIP file it can take RO-Crates. RO-Crates enable data to travel with metadata. 
Beyond simply allowing the upload of opaque RO-Crates, there are opportunities for repository software to recognise metadata in an uploaded package and to pre-populate built-in metadata forms and/or datastores. This is not a pattern the authors have seen widely implemented in comprehensive institutionally focussed repositories, although at the time of writing it is being explored in Dataverse and InvenioRDM/Zenodo. We would encourage repository developers to explore this further, particularly those working with research data. RO-Crate support is increasing in research-domain repositories (e.g. RO-Crate upload with metadata extraction is supported by WorkflowHub and ROHub). One of the design features of an RO-Crate is that, because it can be “just a ZIP file”, it can be used with any old repository that can handle ZIP files. The JSON file can also be uploaded separately. As RO-Crate adoption increases, the repository may be able to start to use the RO-Crate metadata it already has. You may have heard, or will hear, from Dieuwertje Bloemen here at OR2024 that Dataverse has support for RO-Crate metadata preview and is building import/export mechanisms. The RO-Crate specification describes a method of packaging data in a folder, which can be zipped, with any kind of file. This slide shows a folder of data, including a file with a typical obscure file name based on a timestamp (in this case created by a Dropbox upload from a digital camera). The RO-Crate Metadata file, in JSON Linked Data Format, can describe files. This one has a name (i.e. a title) for the file: “Cute puppy”. A human-readable description and preview (ro-crate-preview.html) can be in an HTML file that lives alongside the metadata. This slide shows an HTML view of the data that shows the image with its metadata, including the Schema.org name (equivalent to a Dublin Core title) for the file. 
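To make the shape of such a metadata file concrete, here is a minimal sketch, assuming the RO-Crate 1.1 layout of a metadata descriptor, a root dataset and one file entity. The helper name, the file name and the root dataset values (name, description, date and licence) are placeholders invented for illustration; only the “Cute puppy” title comes from the slide described above.

```python
import json

def minimal_crate(file_id, file_name):
    # Build the JSON-LD structure of an RO-Crate metadata file for one described file.
    return {
        '@context': 'https://w3id.org/ro/crate/1.1/context',
        '@graph': [
            {   # metadata descriptor: marks this document as RO-Crate metadata
                '@id': 'ro-crate-metadata.json',
                '@type': 'CreativeWork',
                'conformsTo': {'@id': 'https://w3id.org/ro/crate/1.1'},
                'about': {'@id': './'},
            },
            {   # root dataset: the folder being described (placeholder values)
                '@id': './',
                '@type': 'Dataset',
                'name': 'Example crate',
                'description': 'A folder with one photo, for illustration only.',
                'datePublished': '2024-06-05',
                'license': {'@id': 'https://creativecommons.org/licenses/by/4.0/'},
                'hasPart': [{'@id': file_id}],
            },
            {   # the file entity: the obscure file name gets a readable title
                '@id': file_id,
                '@type': 'File',
                'name': file_name,
            },
        ],
    }

# Hypothetical timestamp-style file name of the kind a camera upload produces.
crate = minimal_crate('2024-06-05_09.00.00.jpg', 'Cute puppy')
text = json.dumps(crate, indent=2)
print(text)
```

Saving the printed JSON as ro-crate-metadata.json next to the image would give a folder that can be zipped and uploaded as described.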
Increasingly, research repository infrastructure is accepting RO-Crate input - this screenshot from WorkflowHub documents the upload API for submitting RO-Crate packaged descriptions of scientific workflows to the system. These can then be downloaded by others for reuse. Here, RO-Crate allows bypassing of the traditional “title, author, license, description” fields (rendered from the crate), as well as permitting user extensions on metadata to be kept in the repository. ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:1:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"RO-Crate is a packaging format suitable for downloads One of the perennial problems with downloads is that once a user has the data, it often does not come with metadata as shown on the landing page, or if present it is in an ad hoc or specialised format. RO-Crate solves this by specifying an extensible way to put linked-data metadata with data assets and to provide an HTML page or small website with the data to explain it. Thus data travels with its metadata and can be made human-readable. RO-Crate download is already available in many data repositories. Examples include: WorkflowHub: A registry for describing, sharing and publishing scientific computational workflows. ROHub: A repository of Earth Science datasets and computational methods. TLCMap: The Time Layered Cultural map is a set of tools that work together for mapping Australian history and culture. The Language Data Commons of Australia data portal: entirely built on RO-Crates, the underlying data consists of crates-on-disk and the API is based on RO-Crate metadata. Senckenberg Wildlive portal: exposes metadata about automatic photo captures of endangered animals using RO-Crate. Dataverse: at the time of writing, RO-Crate downloads are in development. 
We will encourage developers from other repository platforms to follow the Dataverse project’s lead and add RO-Crate support. This is a screenshot of the Gazetteer of Historical Australian Places – not exactly a repository but an example of a place where people can download datasets. This kind of download could be added to any repository system where there is at least one file that has metadata; offer a download ZIP option that has machine-readable JSON metadata (linked data in JSON-LD) and a human-readable summary of the metadata – this one has descriptions of a few files in it. Every repository implementer should add FAIR Signposting – just a couple of HTTP headers – this means machines can go from an HTML landing page to the actual download without guesswork – or even better – to an RO-Crate! I’m sure you’ll hear this mentioned in one of the talks by Herbert van de Sompel and colleagues, such as the one on FAIRiCat. Again, Dataverse is ahead of the curve and has already implemented this. As we mentioned above, data should travel with the metadata – one example of this from the WorkflowHub is how other services like the LifeMonitor retrieve the RO-Crate and then look for custom annotations in the metadata to pick up and connect to the testing infrastructure. This was only possible because RO-Crate is extensible - you are not trapped with whatever 25 properties we’ve selected. The vessel is still RO-Crate, the repository didn’t need to add anything to support the LifeMonitor. One of the key benefits of linked-data metadata over previous ‘legacy’ approaches is that multiple vocabularies can be combined into a single metadata document in a way that is not possible with, say, MARC or MODS XML, and that all these vocabularies can use the same syntax and approach to describing data. 
This means that a simple generic RO-Crate viewer can be used to visualise any metadata whether it is basic “Who, What, Where” metadata (like Dublin Core) or domain-specific metadata like the RO-Crate metadata profile (https://w3id.org/ldac/profile) used by the Language Data Commons of Australia. This can be displayed alongside the core RO-Crate metadata without any expensive configuration or coding. If the recommendations are followed, the RO-Crate metadata terms are self-documenting, e.g. all the Language Data Commons terms which use a Schema.org Style approach, are defined here: https://w3id.org/ldac/terms. This slide is a repeat of one we used last year – this screenshot is a bit of (undated) DSpace documentation found following a tip from Kim Sheppard – we have included it here to illustrate that storing additional metadata (in this case METS) for an object was done by convention – it had to be stored in a special file called METS.xml. Us","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:2:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"References [1] Sefton, Peter et al., “RO-Crate Metadata Specification 1.1.3,” Apr. 2023, doi: 10.5281/ZENODO.3406497. ↩ [2] S. Soiland-Reyes et al., “Packaging research artefacts with RO-Crate,” Data Sci., vol. 5, no. 2, pp. 97–138, Jan. 2022, doi: 10.3233/DS-210053. ↩ [3] S. Leo et al., “Recording provenance of workflow runs with RO-Crate.” arXiv, Dec. 12, 2023. doi: 10.48550/arXiv.2312.07852. ↩ [4] P. Sefton, M. La Rosa, and Mi. Lynch, “Arkisto: a repository based platform for managing all kinds of research data,” in ptsefton.com, Jun. 2021. Accessed: Jan. 31, 2022. [Online]. 
Available: http://ptsefton.com/2021/06/11/or-2021-arkisto/index.html ↩ ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:2:1","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version From “R-Drive to RRKive” – a comprehensive, open and sustainable set of principles and tools for low (and high) resource archival-repositories Presented at: The 19th International Conference on Open Repositories, June 3-6th 2024, Göteborg, Sweden Session: Presentations: Open and Sustainable Infrastructure Time: 04/June/2024: 13:30 - 15:00 · Location: Drottningporten 1 Peter Sefton*, Robert McLellan*, Michael Lynch**, Moises Sacal Bonequi*, Nick Thieberger*** *University of Queensland; **University of Sydney; ***University of Melbourne We present a toolkit for sustainable Archival Repositories, with metadata and storage standards as well as APIs and data portals, that can be assembled by communities with various levels of resourcing into repository solutions at a variety of scales, with fallback to offline operation. Tools are designed to work in low-resource environments, allowing communities to have agency and control over their materials. We prioritise sustainability, simplicity, standardisation, linked-data description and clear licensing over user interface features, in line with Suleman’s keynote presentation at OR2023, while still being able to drive rich, full-featured services when resources allow and fall back to a sustainable core if needed. Much of the data we work with is subject to Indigenous Cultural Intellectual Property (ICIP) rights, controlled by First Nations peoples guided by Indigenous data sovereignty principles; we must ensure it is handled in a conscionable and culturally responsible manner. 
Making data Accessible does not always mean Open Access – under both research ethics and the CARE and FAIR principles, data by and about humans needs access control and licensing. Our framework deals with these issues and is built from the ground up to be “as open as possible, as closed as needed”. NOTE: Since the abstract was submitted we have adopted the name “Protocols” for the use in work we are doing on tools for FAIR and CARE adoption. This slide summarises the aims of the Language Data Commons of Australia project. The area of the strategy we’re focusing on for this presentation is highlighted – tools, standards and technical infrastructure. The activities we are undertaking are: Develop shared tools, standards and technical infrastructure to help data stewards care for data for the long term Build data portals with useful search functions and lightweight technical structures. Towards the following outcomes: Standards and tools are available and being applied by data stewards Good governance and standardised, distributed storage of data helps preserve data Discovering and locating language data is easy via linked portals. We don’t need to define the term “Repository” at the Open Repositories conference, but we do need to define a term that we use at LDaCA when we talk about Repositories – we use the inclusive term “Archival Repository” to refer to systems designed to keep data for the long term. We do this to avoid drawing boundaries around what is a “repository” vs an “archive” or a Digital Preservation system and to include as many practitioners as possible in the conversation. The first implementation of this stream of work was the UTS Research Data Portal – this has not been updated for a while due to staffing changes at the university and a lack of ongoing governance that made the data repository dependent on individual people (three of whom are authors of this paper and no longer work for UTS), but as of mid-2024 it is being renovated and revitalised. 
This screenshot shows what an LDaCA site looks like; a typical data-discovery portal. Underneath is not a typical (by Open Repositories standards) repository solution like DSpace, Invenio or EPrints, all of which are more or less monolithic architectures. Following the implementation protocols we will discuss below, data is very much considered separately from the application used to provide views of the Archival Repositor","date":"2024-06-04","objectID":"/news/posts/open-repositories-2024-pilars/:0:0","tags":["PILARS"],"title":"A comprehensive, open and sustainable set of principles and tools for low (and high) resource Archival Repositories\n","uri":"/news/posts/open-repositories-2024-pilars/"},{"categories":null,"content":"by Rosanna Smith, Simon Musgrave, Robert McLellan and Ben Foley ","date":"2024-05-23","objectID":"/news/posts/idil/:0:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"International Decade of Indigenous Languages 2022–2032 The world’s billions of people use thousands of languages in myriads of ways, through communication by speech, sign, text and song (Evans, 2022). Diversity of language is critical for conveying the world’s knowledges and expressing the range of human experiences. Australia is in an area of great language diversity, with hundreds of languages on this continent. Languages require action to remain in use. The United Nations General Assembly has declared the period between 2022 and 2032 as the International Decade of Indigenous Languages (IDIL). The aim of IDIL is to draw global attention to the critical status of Indigenous languages worldwide and encourage action for their revitalisation, promotion and ongoing use. The International Decade aims to achieve four main outcomes: Learn, teach, and transmit: Empower Indigenous Peoples to learn, teach and transmit their languages to the present and future generations. 
Establish as global priority and foster commitment: Establish the usage, preservation, revitalisation and promotion of Indigenous languages as a global priority for societal development, peacebuilding and reconciliation by 2030. Ensure legal recognition: Ensure Indigenous languages are recognised by Member States within their legal systems and legislation, which, in turn, are supported with comprehensive language-related laws and policy frameworks, and are backed by allocated financial, institutional and human resources. Enhance the functional usage: Develop an enabling environment to enhance the functional usage of Indigenous languages in socio-cultural, economic, environmental, legal and political domains. ","date":"2024-05-23","objectID":"/news/posts/idil/:1:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Voices of Country: Australia’s National Action Plan for IDIL International Decade of Indigenous Languages Logo Image Source: International Decade of Indigenous Languages Directions Group The Australian Government has partnered with the IDIL Directions Group to produce the Voices of Country Action Plan. The Action Plan provides a framework to guide Australia’s participation in the International Decade, focusing in particular on Aboriginal and Torres Strait Islander community needs, and is a call to action for all stakeholders. The plan was launched at the PULiiMA Indigenous Languages and Technology Conference in Garamilla/Darwin in August 2023. 
","date":"2024-05-23","objectID":"/news/posts/idil/:2:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Voices of Country Themes Voices of Country is guided by five inter-connected themes, represented as a native wattle tree, itself a source of food, medicine and tools, with language, culture and traditional knowledges as its foundation. Image Source: Gilimbaa with cultural elements created by David Williams (Wakka Wakka) Voices of Country is framed through five interconnected themes: Stop the Loss: Securing the future and continuance of Australia’s first languages. Aboriginal and Torres Strait Islander Communities are Centre: Over the Decade, our voices, and those of our Elders, will ensure that community leadership and priorities are at the centre. Our voices will come through – everything we do will be by community, for community. This will give us stronger and safer communities. Intergenerational Knowledge Transfer: Restore intergenerational traditional language and cultural transmission. Caring for Country: Everyone has the opportunity to practise their language and culture on Country. Truth-telling and Celebration: Aboriginal and Torres Strait Islander peoples speak their language with integrity and pride. Wider Australia shares this pride, respecting and celebrating Australia’s first languages and all that they encompass. ","date":"2024-05-23","objectID":"/news/posts/idil/:3:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"LDaCA’s contribution A sub-part of Theme 2 in the Action Plan has the title ‘Support communities to build and be the custodians of language resources and materials’. Making language resources and materials sustainable and accessible is central to the work of LDaCA. 
This includes working in collaboration with communities, and providing them with the tools and support they need to be effective custodians of their language materials. The sub-theme has six components (Voices of Country p34): Embed Aboriginal and Torres Strait Islander cultural practices and knowledge systems into linguistic methodologies. Develop resources that document and record languages, such as dictionaries, encyclopaedias, thesauri, grammar and orthography documents. Record and digitise written and oral language collections, resources and materials. Provide communities with the support, training and infrastructure they require to build language resources and collections. Support Aboriginal and Torres Strait Islander peoples to access and repurpose historical language materials. Repatriate language materials and cultural knowledge to community ownership and control. We will give a brief summary of how LDaCA is contributing to each of these areas of activity, first covering components two to six and then returning to show how all of those activities contribute to the first component. In this summary, we will refer occasionally to the FAIR and CARE ethical research principles which are core to our work. The FAIR principles have been developed as a foundation for good, scholarly ways of finding and using data, providing key guideposts (Findable, Accessible, Interoperable, Reusable) for transparent, reproducible, and reusable language research. Further key principles for advancing Indigenous innovation and self-determination, CARE (Collective benefit, Authority to control, Responsibility, Ethics), guide ethical work with Indigenous language collections. The concluding discussion will also detail the connection between the principles, especially the CARE principles, and LDaCA’s work supporting the Voices of Country Action Plan. 
","date":"2024-05-23","objectID":"/news/posts/idil/:4:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 2: Develop resources that document and record languages LDaCA contributes to work under component two by providing recommendations for good practice in developing sustainable resources. It is important to avoid having data locked into bespoke individual solutions which may be difficult to maintain. Instead, we advocate using data formats that can be supported into the future and which can be the basis for various tools. Although LDaCA can have only a limited role in developing resources, such as dictionaries or text collections, we can provide advice about key aspects of data handling, assist with designing backup strategies, and work with people to develop long-term strategies for appropriate access to language resources. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:1","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 3: Record and digitise written and oral language materials Similarly to component two, we have an indirect role in activity related to component three. We support recording and digitisation through our collaborative partnerships with the Nyingarn project and with PARADISEC. Nyingarn is a platform for making manuscript and typescript sources for Indigenous languages available as searchable digital objects, while PARADISEC has extensive experience in digitising language materials. Increasingly, LDaCA and these partners share a single technological ecosystem, leading to greater alignment in formats and description for digital versions of historical materials, which, in turn, increases the sustainability and usability of the materials. 
","date":"2024-05-23","objectID":"/news/posts/idil/:4:2","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 4: Provide communities with support, training and infrastructure to build language resources Contributing to component four, LDaCA is working with various Indigenous groups and organisations to help manage the language resources and collections they hold. An example is our partnership with the Batchelor Institute of Indigenous Tertiary Education, which holds an important archive of physical and digital works in, and about, Indigenous languages. The collection is managed by the Batchelor Institute Library and has strong connections with language projects through the Institute’s Centre for Australian Languages and Linguistics (CALL). The CALL Collection includes text, audio and video resources that have been contributed over at least 30 years by students and staff of the Batchelor Institute, as well as by teachers, linguists and language workers external to the institute. We are working with the Batchelor Institute to: backup and reformat item record information copy metadata from an ageing database into contemporary metadata standard format embed cultural protocols to ensure long-term and appropriate access to this significant collection. We are also working on training programs, designed with First Nations people, which will help make resources accessible to non-academic users of language data. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:3","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 5: Support Aboriginal and Torres Strait Islander peoples to access and repurpose historical language materials Component five aligns closely with the A and the R in the FAIR principles: Accessibility and Reusability. 
In order for data to meet these FAIR criteria, it must be well-described, but in many cases, existing collections of Indigenous data are not as well-described as they might be, and even when they are considered to be well-described, the description is often based on non-Indigenous knowledge. LDaCA is contributing to component five of the sub-theme by developing methods to make it easier to enrich existing descriptions of data, with the aim of moving towards Indigenous knowledges as the basis for the description of Indigenous data. We have carried out a pilot project with The University of Queensland libraries, in which records from the library system have been enriched with additional annotations based on Indigenous knowledges of the collection items. The results of this project can be seen in a dedicated data portal. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:4","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 6: Repatriate language materials and cultural knowledge to community ownership and control The A in the CARE principles stands for Authority to control. LDaCA’s approach to sustainable data includes a commitment to storing explicit licences with data, detailing in plain text who can access data and what the data can be used for. In line with component six, the communities and people who are the source of Indigenous data have the authority to make decisions about such matters: they decide who can access data and what users can do with the data. LDaCA has developed policies and technical solutions for implementing such access controls that make ongoing administration of access as straightforward as possible. We are sharing ideas and solutions with our colleagues in the Indigenous stream of the HASS and Indigenous Research Data Commons. The solutions that we are using for Indigenous data can also apply to other human-sourced data. 
As Levi-Craig Murray is fond of saying: “What’s good for mob, is good for everyone”. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:5","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 1: Embed Aboriginal and Torres Strait Islander cultural practices and knowledge systems into linguistic methodologies As stated above, we see all of the activities from components two to six as contributing to the aim of embedding Indigenous knowledges and cultural practices into our methodologies. When working with language material, LDaCA is guided by FAIR and CARE principles (Carroll et al. 2020; Carroll et al. 2021; Wilkinson et al. 2016). The FAIR principles provide guidance on ensuring that data will be useful in the future, though they do this from a Western scientific perspective. The CARE principles extend the FAIR data management principles and we hope that our commitment to these principles assists us in contributing to the goals of the Voices of Country plan: Making data more sustainable and more informed by Indigenous knowledges helps ensure that Indigenous communities benefit from the data. Providing accessible solutions for access management contributes to First Nations people’s authority to control Indigenous data. Supporting Indigenous communities in developing data capabilities helps to build positive relationships and show responsibility in our work with Indigenous Peoples. Basing all of our work on ethical principles respects the rights and well-being of Indigenous Peoples. We see all of our activities as both being guided by the CARE principles, and as contributing to the overall aim of embedding Indigenous knowledges and cultural practices into our methodologies. 
Indigenous control of access to data, Indigenous perspectives made part of the description of data, and using Indigenous perspectives to make data more accessible to communities are all part of this endeavour. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:6","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Further reading and references 2022 - 2032 International Decade of Indigenous Languages. 2022 - 2032 International Decade of Indigenous Languages. https://idil2022-2032.org/. (3 October, 2023a). International Decade of Indigenous Languages 2022 – 2032 | United Nations For Indigenous Peoples. https://www.un.org/development/desa/indigenouspeoples/indigenous-languages.html. (3 October, 2023b). International Decade of Indigenous Languages 2022–2032 | Office for the Arts. https://www.arts.gov.au/what-we-do/indigenous-arts-and-languages/international-decade-indigenous-languages/international-decade-indigenous-languages-2022-2032. (3 October, 2023c). Voices of Country: Australia’s National Action Plan for the International Decade of Indigenous Languages | Office for the Arts. https://www.arts.gov.au/what-we-do/indigenous-arts-and-languages/international-decade-indigenous-languages/voices-country-australias-national-action-plan-international-decade-indigenous-languages. (3 October, 2023d). Carroll, Stephanie Russo, Ibrahim Garba, Oscar L. Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, et al. 2020. The CARE Principles for Indigenous Data Governance. Data Science Journal 19. 43. https://doi.org/10.5334/dsj-2020-043. Carroll, Stephanie Russo, Edit Herczog, Maui Hudson, Keith Russell \u0026 Shelley Stall. 2021. Operationalizing the CARE and FAIR Principles for Indigenous data futures. Scientific Data. Nature Publishing Group 8(1). 108. https://doi.org/10.1038/s41597-021-00892-0. Evans, Nicholas. 2022. 
Words of Wonder: Endangered Languages and What They Tell Us (2nd ed.). Wiley. ISBN: 978-1-119-75877-8 Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3(1). 160018. https://doi.org/10.1038/sdata.2016.18. Thanks to Teresa Chan and Harriet Sheppard who provided valuable feedback on this piece. ","date":"2024-05-23","objectID":"/news/posts/idil/:5:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"LDaCA team member Teresa Chan discusses data considerations for her Master's research project.","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"by Teresa Chan During a Master of Applied Linguistics degree at UNSW Sydney, I was given the opportunity to enrol in a course in which students become interns in the School of Humanities and Languages’ Language Processing Lab. Although I would still have readings and assessments to complete, the teaching model would be different to a standard postgraduate course: I would attend regular online lab meetings with my supervisor and other interns, and design and conduct my own research project. The first few weeks of the course were dedicated to learning about real-world applications and key methods in computational linguistics and language technology. We then focused on the research project for the remainder of the course. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:0:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project proposal The idea for the project stemmed from an intriguing observation I had made about language use in my circle of friends. Certain friends, who had grown up in south-west Sydney and were mostly from a Vietnamese background, were prone to using the expression and whatnot, something that I never said myself. Researching this phenomenon revealed that it was a discourse-pragmatic feature called a general extender (GE). Figure 1: Two people using general extenders (GEs) in a conversation Image Source: Teresa Chan GEs are phrase-final expressions such as and stuff and or whatever that prototypically ’extend’ the referent of a phrase by evoking a more general set of items to which it belongs. They serve a range of discourse and interpersonal functions, such as hedging, establishing solidarity or marking in-group affiliation (Norrby \u0026 Winter, 2002). Regional and social variation in the use of GEs has been found across spoken and written English, including Australian English (Travis, Grama \u0026 Gonzalez, 2017), with different patterns of usage depending on variables like age, gender and socio-economic class. However, at the time of the project, ethnicity had rarely been considered (e.g. Secova, 2017), despite the fact that the high frequency of a specific GE form in a region can make it a statistical “marker of ethnic identity” (Aijmer, 2013, p. 146). There was also only one study on GEs in computer-mediated communication (CMC) (Fernandez \u0026 Yuldashev, 2011). Additionally, although GE usage in Australian English had been compared to its usage in varieties of English spoken in other countries, no research had looked at usage across different cities within Australia. 
My project would attempt to fill these gaps in the literature and explore socio-regional variation in GE usage in spoken and CMC settings. “Socio-regional” refers to the stratification of socio-economic class by region (Sheard, 2019, p. 486). The research questions I would aim to address were: Are there differences in GE usage in Australia between: a. different cities? b. different ethnicities? Does GE usage differ between spoken settings and CMC settings in Australia? Do the results corroborate or refute previous findings, particularly those for Australian English? ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:1:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Data selection and collection Several factors influenced the selection of the data sources and the collection of data from these sources, which were X (formerly Twitter) and two spoken corpora of Australian English: AusTalk (Burnham et al., 2011) and Sydney Speaks (Travis, 2014–2021). ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project scope and timeline The limited scope of the project and the short timeline I had for data collection meant I had to be strategic when selecting data sources. For the spoken language component, I could not construct my own corpus of recordings without requiring a potentially lengthy ethics clearance process, and so would need to rely on existing datasets like AusTalk and Sydney Speaks. For the CMC component, I needed to create my own dataset, which was reliant on both the suitability of the data source for my purposes and the skills required to access the data. 
Previous studies have found linguistic differences by geographical location within the same country on X (e.g. Rechkemmer, Wilson \u0026 Mihalcea, 2020), making it an appropriate choice for a CMC site. Furthermore, the website had an advanced search engine that allowed you to search for geolocated posts. X was also compatible with commercial web scraping software, allowing me to compensate for my lack of knowledge and experience as a novice researcher whose only relevant capability was a rudimentary knowledge of Python. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:1","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset accessibility Accessibility needed to be considered when selecting and collecting data. The spoken corpora needed to already have transcriptions of the recordings available, and their metadata needed to contain fields that matched the variables I was looking for. A small portion of the recordings in the AusTalk dataset had already been transcribed and made available online. This data could be downloaded from two separate (now defunct) web-based tools: the transcriptions via the Alveo website and the metadata via the Alveo Query Engine. The metadata contained detailed demographic information about the speakers in the recordings, allowing me to operationalise different fields into my chosen variables: age, gender, socio-economic class (education level), ethnicity (cultural heritage) and geographical location (current city). For the Sydney Speaks corpus, I submitted an application for access to the main researcher, LDaCA Chief Investigator Catherine Travis, and once granted, she sent me a link to download the data files. The metadata was encoded in the file names, with different codes used for ethnicity, gender and age. 
Luckily for future researchers, the majority of the Sydney Speaks corpus will be incorporated into the LDaCA portal, where the data and metadata will be made FAIR: findable, accessible, interoperable and reusable. Aside from the metadata requiring more manual work to extract, the utility of the Sydney Speaks dataset was lower than AusTalk for three main reasons: Only a limited number of ethnicities were included: Anglo (Australian), Chinese, Greek and Italian. The data was collected solely in Sydney, meaning it could only be analysed for a single geographical location. No metadata field was available to operationalise into a class variable. The accessibility of data and metadata from the CMC source was also key. Posts on X were highly accessible compared to other CMC sites, such as Facebook, where posts could be hidden from those outside of the user’s social network. However, the reverse was true when looking for information about a user on X, as usernames, user profiles and profile pictures were less likely to reflect the user’s true identity compared to Facebook. As a result, the metadata was sparser, and aside from geographical location, only one other variable was operationalised: gender. I assigned a gender to each user based on their username and handle, consulting profile pictures and descriptions in ambiguous cases (see Figure 2 for examples). While being arguably limited in making such a determination, this method was the most suitable option available at the time. If I wanted more reliable user information, ethics clearance would be needed, as it was only possible to determine all variables conclusively by contacting the users. 
Figure 2: Assigning genders to X users on two example posts Image Source: Teresa Chan ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:2","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset size The data sources needed to contain enough instances of GEs to allow robust analysis. In previous studies, the frequency of GE usage across major varieties of English ranged from about 20 to 40 GEs per 10,000 words of speech, though rates of 50 to 60 GEs per 10,000 words had been observed among adolescents (Cheshire, 2007; Norrby \u0026 Winter, 2002). See Figure 3 for the rates of use of GEs (per 10,000 words) by age across several of these studies. Figure 3: Rates of use of GEs (per 10,000 words) by age, across studies Image Source: Catherine Travis These studies looked at spoken language, as the prevalence of GEs was found to be higher in spoken corpora compared to written corpora. Indeed, one study (Martínez, 2011) found the frequency of GEs was approximately 9 to 10 per 10,000 words in the spoken subcorpus compared to 1 to 1.5 in the written subcorpus of two different corpora of British English (International Corpus of English-GB component and the British National Corpus). Additionally, in the written subcorpora, GEs were mainly found in fictional dialogue or conversation, or in informal writing, such as emails. This suggests that GEs are predominantly a feature of spoken language. Based on the frequency figures outlined above, datasets needed to contain hundreds of thousands of words of speech in order to produce a sufficient number of GE tokens. Both AusTalk and Sydney Speaks featured spoken language and proved to be large enough datasets (274,676 words and 568,555 words, respectively). 
For the AusTalk corpus, participants completed a number of different tasks, but only data from two of the tasks were considered for the dataset: a story re-telling task, where a participant was asked to re-tell a story from a previous task in their own words, and an interview task, where the participant and researcher discussed a topic of interest to the participant. These tasks were designed to elicit a more spontaneous speech style, which would be fertile ground for finding GEs. Although interactions on CMC sites generally feature more written than spoken language, they seem more like oral interactions than written interactions. In a study on a synchronous CMC site (instant messaging), Fernandez and Yuldashev (2011) found GE usage patterns were closer to those used in spoken interaction rather than written interaction, though the frequency of occurrence was noticeably lower compared to the spoken corpora they investigated. I theorised that an asynchronous CMC site like X should also demonstrate usage patterns that were similar to spoken interaction. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:3","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset novelty To corroborate and extend previous findings, data sources that had never been analysed before were preferred, which was the case for AusTalk and X. Although patterns of GE usage by age, gender and socio-economic class had been studied previously in Sydney Speaks (Travis, Grama and Gonzalez, 2017), re-analysis of the dataset could illuminate further trends. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:4","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Data analysis Once I had cleaned and organised the data for all three datasets, I uploaded them into a commercial corpus manager and text analysis tool (Sketch Engine) and conducted concordance searches. I searched for GE variants from a list I had compiled before starting data collection, based on those found in previous studies. Figure 4 shows the first 20 results of a concordance search for the GE and stuff in the Sydney Speaks dataset. Figure 4: Concordance search for *and stuff* in the Sydney Speaks dataset Image Source: Teresa Chan Each search result was manually inspected to ensure it could be identified as a GE. After selecting only genuine instances, the results of each search were exported as CSV files, then imported into Microsoft Excel for annotation. Each GE token was coded with the relevant variables for the dataset, e.g. age, gender. Each token was also sorted into a GE subcategory and broader GE category, for instance, or something was in the (or) something (like that) subcategory and something category. Once all tokens were coded and sorted, statistical analysis was completed in Excel using the Data Analysis ToolPak add-on. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:3:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project results Addressing each research question in turn: Are there differences in GE usage in Australia between: a. Different cities? There were no statistically significant differences in frequency of GE usage between different Australian cities in AusTalk. 
However, the Sydney Speaks mean frequency per 10,000 words was significantly higher than the mean frequencies of the other cities in AusTalk, including Sydney [1]. Looking at GE categories, there were no statistically significant results. However, there were apparent trends, such as Sydney AusTalk speakers using less common variants more often compared to other cities [2]. Thus, there is some evidence of linguistic differences by geographical location. b. Different ethnicities? There were no statistically significant differences in frequency of usage between Anglo and non-Anglo Australians within both spoken corpora. However, when the corpora were compared, the mean frequency of both Anglo and non-Anglo Australians in Sydney Speaks was significantly higher than that of both Anglo and non-Anglo Australians in AusTalk [3]. There were no statistically significant differences in preferred variants, though there were some evident trends. For instance, Chinese Australians in the Sydney Speaks dataset used whatnot/what not variants more often than other ethnicities [4], which might relate to my original observation about Vietnamese Australians. Does GE usage differ between spoken settings and CMC settings in Australia? There were more GE tokens on X than in the spoken corpora. However, the mean frequency per user was significantly lower on X than the mean frequency per speaker in the spoken corpora [5], likely due to the limitation in post lengths on X. Figure 5 illustrates the mean GE frequency for Sydney speakers in the AusTalk and Sydney Speaks datasets compared to X users [6]. There were also GE variants such as (and) stuff (like that) that were used significantly more frequently by speakers in the spoken corpora than by X users [7]. Therefore, it seems usage patterns on X differ from those in spoken Australian English. 
Figure 5: Mean GE frequency for Sydney speakers in AusTalk and Sydney Speaks datasets compared to X (formerly Twitter) Image Source: Teresa Chan Do the results corroborate or refute previous findings, particularly those specific to Australian English? The AusTalk results corroborated the robust finding on age (e.g. Secova, 2017; Travis et al., 2017), that is, younger speakers used GEs more frequently than older speakers [8]. There was also support across both corpora for previous studies that found no gender and class differences in frequency of usage. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:4:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project reflection Though the analysis revealed interesting trends, more data is necessary to make further claims about the difference between GE usage in spoken Australian English and on X. I considered expanding on this in a PhD, but decided against it before I ended up working at LDaCA. If I did resume my research, I would need to consider other CMC sites, with X on the decline and no clear winner among the alternatives. However, other large platforms like Reddit and Facebook come with their own challenges, including restrictions in access to their underlying APIs (Application Programming Interfaces). Using the APIs would allow me to avoid web scraping and its associated risks (e.g. potential violation of website terms) and limitations (e.g. difficulty in handling large or complex websites). Additionally, I would drastically change how I approach data capture and analysis, with a focus on digital automation and open-source software. The LDaCA portal would also be an invaluable resource to locate corpora of Australian English, with a clear and straightforward path in accessing both the data and associated metadata. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:5:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"References Aijmer, K. (2013). General Extenders. In H. Pichler (Ed.). Understanding Pragmatic Markers: A Variational Pragmatic Approach (pp. 127–147). Edinburgh, United Kingdom: Edinburgh University Press. Burnham, D., Estival, D., Fazio, S., Viethen, J. Cox, J., Dale, R., Cassidy, S., Epps, J., Togneri, R., Wagner, M., Kinoshita, Y., Göcke, R., Arciuli, J., Onslow. M., Lewis, T., Butcher, A., \u0026 Hajek, J. (2011). Building an audio-visual corpus of Australian English: large corpus collection with an economical portable and replicable Black Box. Proceedings of 12th Annual Conference of the International Speech Communication Association (Interspeech 2011) (pp. 841–844). Retrieved from https://austalk.edu.au/bibliography Cheshire, J. (2007). Discourse variation, grammaticalisation and stuff like that. Journal of Sociolinguistics, 11(2), 155–193. Fernandez, J. \u0026 Yuldashev, A. (2011). Variation in the use of general extenders and stuff in instant messaging interactions. Journal of Pragmatics, 43(10), 2610–2626. Martínez, I. M. P. (2011). “I might, I might go I mean it depends on money things and stuff”. A preliminary analysis of general extenders in British teenagers’ discourse. Journal of Pragmatics, 43(9), 2452–2470. Norrby, C. \u0026 Winter, J. (2002). Affiliation in Adolescents’ Use of Discourse Extenders. In C. Allen (Ed.), Proceedings of the 2001 Conference of the Australian Linguistic Society. Retrieved from http://www.als.asn.au/proceedings/als2001.html Rechkemmer, A., Wilson, S. R. \u0026 Mihalcea, R. (2020). Small Town or Metropolis? Analyzing the Relationship between Population Size and Language. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, … \u0026 S. 
Piperidis (Eds.), Proceedings of the 12th Language Resources and Evaluation Conference. Retrieved from http://www.lrec-conf.org/proceedings/lrec2020/index.html Secova, M. (2017). Discourse-pragmatic variation in Paris French and London English: Insights from general extenders. Journal of Pragmatics, 114, 1–15. Sheard, E. (2019). Variation, Language Ideologies and Stereotypes: Orientations towards like and youse in Western and Northern Sydney. Australian Journal of Linguistics, 39(4), 485–510. Travis, C. E. (2014–2021). Sydney Speaks. Australian Research Council Centre of Excellence for the Dynamics of Language, Australian National University: http://www.dynamicsoflanguage.edu.au/sydney-speaks/ Travis, C. E., Grama, J. \u0026 Gonzalez, S. (2017). General extenders over time in Sydney English: From or something to and stuff. Paper presented at the Australian Linguistic Society Annual Conference, Sydney, Australia, 4–7 December. Retrieved from http://www.dynamicsoflanguage.edu.au/sydney-speaks/dissemination/ ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:6:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Footnotes 1 The mean frequency per 10,000 words for AusTalk speakers was significantly lower than Sydney Speaks speakers, in Sydney (t(129) = 2.40, p = .018), Adelaide (t(154) = 3.01, p = .003), Brisbane (t(29) = 4.89, p \u003c .001) and Canberra (t(129) = 2.03, p = .044). ↩ 2 Less common variants made up 15.00% of AusTalk tokens in Sydney, compared to 9.03% in Adelaide, 6.90% in Brisbane and 7.20% in Canberra. ↩ 3 The mean frequency of Anglo Australians in the Sydney Speaks dataset was significantly higher than both Anglo Australians (t(108) = 4.66, p \u003c .001) and non-Anglo Australians (t(69) = 3.67, p \u003c .001) in AusTalk. 
The mean frequency of non-Anglo Australians in the Sydney Speaks dataset was significantly higher than both Anglo Australians (t(121) = 3.15, p = .002) and non-Anglo Australians (t(55) = 2.85, p = .006) in AusTalk. ↩ 4 Whatnot/what not variants made up 9.04% of Sydney Speaks tokens for Chinese Australians, compared to 0.99% for Anglo Australians, 4.49% for Greek Australians and 1.61% for Italian Australians. ↩ 5 The mean frequency per speaker/user for AusTalk speakers was significantly higher than X users (t(64) = 4.51, p \u003c .001). The mean frequency for Sydney Speaks speakers was significantly higher than X users (t(117) = 9.85, p \u003c .001). ↩ 6 The mean frequency for AusTalk speakers was significantly higher than X users in Sydney (t(877) = 4.00, p \u003c .001). The mean frequency for Sydney Speaks speakers was significantly higher than X users in Sydney (t(117) = 9.85, p \u003c .001). ↩ 7 The (and) stuff (like that) variant was used significantly more frequently by speakers in the AusTalk (t(15) = 3.17, p = .006) and Sydney Speaks (t(78) = 6.21, p \u003c .001) datasets compared to X users. ↩ 8 Speakers in the datasets were assigned one of three age brackets: Young (19-29 year-olds), Middle (30-59 year-olds) and Old (60-79 year-olds). The mean frequencies for the Young (t(36) = 4.07, p \u003c .001) and Middle (t(48) = 2.42, p = .02) age brackets in AusTalk were significantly higher than the Old age bracket. ↩ ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:6:1","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":" This blog post features an interview with Kalin Stefanov (ARC DECRA Fellow at Monash University), who is working on an Australian Sign Language (Auslan) project. 
We would like to thank Chief Investigator Louisa Willoughby for first letting us know about Kalin’s work, and Kalin for graciously agreeing to the interview and providing his thoughts. Kalin Stefanov, Monash University Image Source: Kalin Stefanov How did you get interested in sign language? During my PhD at KTH Royal Institute of Technology, I worked on a project that aimed at creating a learning environment where children can learn signs in a game-like setting. An on-screen avatar presents the signs and gives the child certain tasks to accomplish, and in doing so, the child gets to practise the signs. The system was thus required to interpret the signs produced by the child, distinguish them from other signs, and indicate whether or not it was the right one. For this project, I developed a real-time hand gesture recognition technology applied to the recognition of Swedish Sign Language signs. To date, this is the only real-time Swedish Sign Language recogniser and one of the few such technologies in the world (both in academia and industry). The project involved close collaboration with researchers (native signers) at the Department of Linguistics, Stockholm University and technology evaluation with the target end-users. This was my (immersed) introduction to the fascinating world of sign language communication and since then I have participated in a range of research projects related to different sign language technologies. Can you tell us about the project you are currently undertaking? My current project addresses the computational modelling of Auslan. The project expects to generate knowledge by creating the largest Auslan dataset, enabling further advancements in this research area. The dataset will also play an essential role in other research fields, e.g. sign linguistics. Expected outcomes include the invention of the first Auslan recogniser and generator, representing a substantial advancement towards fully automated Auslan translation. 
The Auslan recogniser will be able to distinguish 1000+ isolated Auslan signs. This means that by providing a video of an isolated Auslan sign, the recogniser will ‘translate’ it to written or spoken English. We also aim to provide mechanisms for personalisation of the recogniser to account for individual variations in signing. The Auslan generator will operate in the reverse direction and be able to synthesise 1000+ isolated Auslan signs. This means that by providing a written or spoken English word, the generator will ‘translate’ it to a video of an isolated Auslan sign. We also aim to provide mechanisms for the personalisation of the generator to account for individual preferences in appearance and signing style. Who will be helped by the output of your project, and how? The Auslan recogniser and generator will be general enough to form the core of many applications targeted at specific needs. I see at least two contexts that could benefit directly from the project outcomes: Auslan education and research. In education, on the one hand, I can envision an application for individual Auslan sign acquisition and training (think ‘Duolingo for Auslan’). On the other hand, in collaboration with A/Prof Louisa Willoughby and her extensive network, we will attempt to co-design more focused pedagogical applications of the technologies. We are still in the early stages of the technology development, hence we have not yet finalised what this could look like. One possible direction is to embed the developed technologies in education offerings for future Auslan interpreters. This will potentially improve the quality of Auslan training and ensure that more students pass interpreting exams and enter the workforce as confident Auslan interpreters. 
In research, collecting and annotating sign language data is an expensive task that needs the","date":"2024-04-11","objectID":"/news/posts/kalin-stefanov/:0:0","tags":null,"title":"Interview with Kalin Stefanov","uri":"/news/posts/kalin-stefanov/"},{"categories":null,"content":"LDaCA team member Rosanna Smith discusses data management in language technology, focusing on projects at tech company Appen.","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"by Rosanna Smith Founded in Sydney in 1996, Appen is a technology company that collects and improves data for the purposes of training and developing machine learning and artificial intelligence systems. During my time at Appen, a significant part of my work involved a diverse range of projects for both automatic speech recognition (ASR) and text-to-speech (TTS) technologies, including phonetic transcription, grapheme-to-phoneme or letter-to-sound (L2S) rule development, and part-of-speech (POS) annotation. To break down these language technologies further, ASR enables computers to process human spoken language into readable text, allowing users to operate devices through speech or facilitate translation of that speech into other languages. Conversely, TTS generates an audio version of a written text. TTS can be used to improve accessibility, particularly for those with vision impairments or learning disabilities such as dyslexia. Language technology relies heavily on good-quality data, and the table below illustrates just some of the processes used in Appen projects, with use cases they can be applied to. 
Process Type Description Example Use Case Phonetic Transcription The process of transcribing words in a lexicon according to their sound (pronunciation lexicon), through the use of a phonetic alphabet such as the International Phonetic Alphabet (IPA) or X-SAMPA (a phonetic script that uses symbols all found on a standard keyboard, aiding in native-speaker transcription). Suprasegmental features such as stress, tone, vowel length, pitch accent and syllabification can also be encoded where relevant to the language. Use in ASR: Appen developed language packs in 26 languages for the Intelligence Advanced Research Projects Activity (IARPA) Babel program, facilitating improvement in speech recognition performance for languages other than English which have very little transcribed data. These projects generally have data collection, orthographic transcription and phonetic transcription components, the latter using X-SAMPA annotation. Further reading: IARPA - Babel Example dataset: IARPA Babel Mongolian Language Pack Use in TTS: In collaboration with health service providers GuildLink and MedAdvisor, Appen produces TTS audio of Australian Consumer Medicine Information (CMI) documents. This audio is available at medsinfo. Further reading: GuildLink Makes Consumer Medicines Information More Accessible Grapheme-to-Phoneme or Letter-to-Sound (L2S) Rule Development A set of rules for a given language that map between orthography and pronunciation, used to generate phonetic transcriptions from an input text. These rules can be both single and multi-letter mappings such as orthographic \u003cs\u003e → X-SAMPA /s/ and \u003csh\u003e → /S/ in English. The rules can refer to the surrounding environment which allows for alternative phonetic mappings based on the characters’ position in a word, surrounding morphemes, or other linguistic processes that impact pronunciation. 
For example, rules are implemented to specify when \u003cs\u003e should be pronounced as /z/ as in ‘roses’ or as /s/ as in ‘sit’. Suprasegmental features such as stress, tone, vowel length, pitch accent and syllabification may also be encoded, however, the accuracy of these in the output pronunciations depends on the given language and the extent to which these features are predictable from orthography alone. As was the case for the IARPA Babel program, one of the first steps in producing pronunciation lexicons involves running the wordlist through a language or dialect-specific L2S algorithm to generate hypothesised X-SAMPA pronunciations of each word. For many of the 26 languages in Babel, these rules didn’t yet exist and had to be developed through research, consultation and iterative testing with native speaker linguists. The automated output is then checked and edited as needed by native speaker linguists. L2S rules may be further developed according to customised requirements, for example, additiona","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:0:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Data Management during the Life of a Project Due to the variety of services, data types and customisation capabilities provided by Appen, it is difficult to present a single standard approach to data management. Instead, I’ll focus on the processes that were developed for datasets that have both orthographic and phonetic transcription components, also known as transcription and lexicon multi-component projects. The basic procedure for these projects is outlined below: The client or project sponsor specifies the parameters for the dataset and its collection in consultation with Appen specialists. 
Appen carries out data collection, usually in the form of conversational or scripted telephony, VoIP or microphone recordings with qualified participants from Appen’s crowd of more than a million. Conversational data refers to spontaneous or unscripted natural speech on a variety of topics either over the telephone or with both participants in the same room, while for scripted data, participants read and respond to a set text of prompts, curated to facilitate topic, domain, keywords, key phrases and phonetic coverage. All participants are provided with a consent form explaining the purpose of the collection and how the data will be used, and only take part in the collection if they are happy and comfortable to do so, and if they sign the consent form. Personal data such as names are anonymised and sensitive personal data is not collected. The audio data is then quality-checked to ensure it meets the requirements for the collection, including demographic balance, language and content, background noise levels, audio levels and recording duration. The approved audio goes through some pre-processing steps to batch the data into smaller segments, then is orthographically transcribed by the transcription (TX) team. Timestamps are checked at this stage as well to ensure correct alignment of the audio and the transcribed text, and a variety of labels are added to capture speaker and non-speaker noise events (e.g. laugh, cough, background noise). Rigorous quality assurance processes are applied while the transcription is ongoing and on the resulting complete data set. Figure 1: Example of audio with transcription, batched and presented to transcribers in tool. Image Source: Appen Figure 2: Example of the resulting transcription text file. Image Source: Appen A sorted list of unique word forms (e.g. 
‘house’, ‘houses’) occurring in the dataset is created from the orthographically transcribed data and sent to the lexicon (LX) team, who work with a group of native speaker linguists to prepare phonetic transcriptions of the words, either in X-SAMPA or another phonetic script. If a POS lexicon is also required for the dataset, this is similarly created from the sorted list of unique word forms and checked by native speaker linguists. Figure 3: Example of a Hungarian pronunciation lexicon loaded to the transcription tool for native speaker review. Image Source: Appen Figure 4: Example of a UK English pronunciation lexicon. Image Source: Appen Figure 5: Example of a UK English POS lexicon. Image Source: Appen Data across each stage is validated for quality assurance. The audio, transcription and lexicon(s) are packaged for delivery. As shown above, the data for a single project goes through multiple stages of annotation and requires specialist linguist expertise in the data management process before each part can be combined for final delivery. One major challenge in this process is dealing with errors and inconsistencies that arise both prior to, and during, the validation stage, particularly where data is shared between the TX and LX teams. As transcribers and linguists prepare the annotations, variations may arise in features like spelling, capitalisation and punctuation. Languages may have no or multiple accepted orthographic conventions, and a standard approach must be determined for the project and adhered to. Conversely, other languages may alre","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:1:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Data Management for Pre-Labeled Datasets Figure 6: Visual representation of pre-labeled dataset types. 
Image Source: Appen At the time of writing, Appen has over 280 audio, image, video and text datasets in over 80 languages available as pre-labeled datasets. These datasets are publicly accessible and can be filtered according to several categories: product type, common use cases, language and number of hours of audio, word or image count, if applicable (called Unit in the table below). Combined with a standard search function, this filtering allows interested parties to further refine their query for data most applicable to their needs. Product Type Common Use Cases Language Unit AudioImageTextVideo ASRAction ClassificationAutomatic CaptioningBaby MonitorCall CentreChatbotContent ClassificationConversational AIDocument ProcessingDocument SearchFacial RecognitionFitness ApplicationsGesture RecognitionIn Car HMI \u0026 EntertainmentKeyword SpottingLanguage ModellingMTNERSearch EnginesSecurity \u0026 Other Consumer ApplicationsSpeech AnalyticsTTSVirtual Assistant A-Z A measure of the volume of the dataset, either in hours or word count. Each listing contains further details about the dataset, particularly in relation to aspects of the data collection, including the specifics of the language collected, the volume of the dataset, the number of contributors involved, recording conditions and the file formats available. Additional details appear in a pop-up window when you click on the “tile” for a given dataset. For example: Catalogue entries for a Danish POS dataset and a Mongolian pronunciation dictionary: Figure 7: Danish POS dictionary catalogue summary and metadata. Image Source: Appen Figure 8: Mongolian pronunciation dictionary catalogue summary and metadata. Image Source: Appen This is a helpful starting point for browsing and refining the dataset selection, but in order to find the most suitable match, clients consult directly with Appen to discuss their language technology needs in full. 
More detailed metadata for each dataset is tracked internally and this is used to further assist with client queries and requests. Some of the additional metadata categories recorded include the demographics of the contributors for each dataset, such as the age range and gender distribution of the participants, the dialect coverage within a specific language, the year of collection, and the domains or topics discussed in the collection. Using the metadata as filters, requests can be divided into three alternative outcomes: Dataset(s) matching all requirements are available and a sample of the data is shared with the client to confirm this is the case. Dataset(s) matching only some of the requirements are available and samples of these are shared in case they are still applicable to the project. No dataset matching the requirements is available and a quote is then prepared for producing a new dataset or enhancing an existing dataset. In terms of dataset storage, collections are catalogued first according to language and second by dataset type. This method works particularly well for pre-labeled datasets, because, if nothing else, clients usually know which language(s) they want data for and other specifics can follow from there. ","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:2:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Conclusions Both in cases where new datasets are being created and finalised collections made ready for cataloguing, data management is a crucial consideration at Appen and, more broadly, any language technology project. 
Two main issues that are important to consider for data collections are: Ensuring that collections with multiple component parts are organised in such a way that updates can be applied to the whole, rather than introducing inconsistencies between related data. Determining the parties that will be using the repository and ensuring that the methods of accessing data are intuitive for all groups. These issues are handled in the Appen workflow through the use of iterative semi-automated spelling checks and other similar processes which enable modifications to be integrated into all levels of the dataset, and through clear recording and accessibility of the metadata describing a collection, available both to Appen staff and its clients. For LDaCA, these issues are of equal importance for wider data management. To the first point, while it is ultimately the responsibility and prerogative of the data contributor to decide how their data should be organised, LDaCA can also assist with this process. Guidance is available on best practices for data management and organisation of metadata, both in the form of the documentation available at LDaCA Resources and through automated validation of metadata category requirements for datasets added and edited through Crate-O. To the second point, LDaCA has a responsibility to ensure that tools to facilitate data management and discoverability are accessible and intuitive for all, including the portals that use the Oni application to access data packaged as RO-Crates. This should be an iterative process, with further improvements to operability implemented based on user feedback and needs. 
","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:3:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"LDaCA team members who also worked on the AusNC project explain the relationship between the two and why AusNC is not an entity in LDaCA.","date":"2024-03-22","objectID":"/news/posts/ausnc/","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"by Simon Musgrave and Michael Haugh ","date":"2024-03-22","objectID":"/news/posts/ausnc/:0:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Overview of the Australian National Corpus (AusNC) The Australian National Corpus (AusNC) project was an important precursor to the Language Data Commons of Australia (LDaCA) project. Important aims and motivations of the LDaCA project are to secure language data in Australia and to make it easily available to researchers (and other interested parties). But language researchers have been aware of these issues for some time and the AusNC was an earlier attempt to address them. The AusNC was discussed in various places before the work of assembling data began. A meeting of interested people was held at the conferences of the Australian Linguistic Society and the Applied Linguistics Association of Australia in 2008, and a workshop was held in 2008 (proceedings published in Haugh et al. 2009), supported by an ARC-funded network for research into speech and communication (HSC-Net). Following these meetings, funding support from the Australian National Data Service, and subsequent funding and technical support from Griffith University enabled us to start the AusNC project, which was led by Michael Haugh, then a staff member at Griffith. 
Several data sets, which already existed and could easily be accessed, were included in this initial stage of the project: AustLit (selected samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s) The Australian Corpus of English (ACE) The Australian component of the International Corpus of English (ICE-AUS) Australian Radio Talkback (ART) The Corpus of Oz Early English (COOEE) The Monash Corpus of English (MCE) The Griffith Corpus of Spoken Australian English (GCSAusE) Braided Channels (oral histories of women from the Channel Country) Mitchell and Delbridge (a sample from data collected around 1960 documenting the speech of Australian adolescents) The team behind the AusNC always hoped that additional data would be added to the corpus, although funding constraints meant that only one further collection, the La Trobe Corpus of Spoken English, was added to the corpus shortly after the initial launch of the AusNC web interface. The Griffith University eResearch Services built the web interface and hosted the data for the AusNC, which was launched in 2012. From 2014, the Alveo virtual laboratory also held a copy of the data and provided alternative ways of interacting with it. However, after the launch of the Alveo version, the Griffith version of the corpus changed in various ways, and the two versions remained out of sync for a time, before both eventually became inaccessible when changes in the underlying technologies meant that maintaining those digital infrastructures was no longer cost effective. The Alveo virtual laboratory still exists, in principle, but has not been regularly maintained since 2019, and its services are rarely operational. 
With no ongoing funding and no staff member affiliated with the AusNC project, Griffith University indicated that it could not maintain the AusNC data sets and interface, and closed those facilities when the data was transferred to LDaCA in 2023.1 All the original data sets are now available (or are in the process of being made available) through the LDaCA data portal, and in one case, there is a substantial increase in the amount of data available (see discussion of Mitchell and Delbridge below). Figure 1: Results of a frequency search in the Australian National Corpus interface. Image Source: LDaCA ","date":"2024-03-22","objectID":"/news/posts/ausnc/:1:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Corpus design There are three general types of corpora for language data. In the first type, the data was collected for specific research purposes by individuals. In the second type, the data was collected following a plan. In the third type, the data was collected opportunistically. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected for specific research purposes by individuals Likely the most common type of corpus is a collection created by an individual researcher or research group to try to answer a specific question or set of questions. An example of such a corpus in the AusNC is the Mitchell and Delbridge dataset. Mitchell and Delbridge wanted to know what Australian speech sounded like, how it differed from other varieties of English (often Received Pronunciation), and what sort of variation occurred within the Australian community. 
To answer these questions, Mitchell and Delbridge recruited school principals across the country to organise recordings of adolescents reading word lists and other prepared material, with a small, less structured component. As mentioned above, the AusNC made only a small part of this dataset available. LDaCA is delighted that, with the assistance of the University of Sydney, we are now able to provide access to data from all of the 7,736 speakers from 330 schools who were recorded. Another example of such a corpus in the AusNC is the Braided Channels collection, which contains 70 hours of oral history interviews with women from Queensland’s Channel Country. This is an example of a collection that was created to answer historical and cultural questions, but has been repurposed to enable researchers to address linguistic questions as well. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:1","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected following a plan In the second type of corpus, data is collected according to a defined plan. The kinds of material and the relative amounts of different kinds of material are decided in advance, and then data is collected accordingly. We can distinguish two sub-types within this category: the plan already exists or the plan needs to be developed independently. Existing plan In some cases, data is collected according to the plan of another corpus which already exists, so that meaningful comparisons can be made. An example of this approach in the AusNC is the Australian Corpus of English (ACE), which was designed to be comparable with two previous corpora: the Brown Corpus of American English and the Lancaster-Oslo-Bergen Corpus of UK English (which was itself designed to be comparable with the Brown Corpus). 
The Australian component of the International Corpus of English (ICE-AUS) is another example of this type of corpus in the AusNC. Independent plan Alternatively, a corpus can be designed independently, but with the intention that it should be representative of some realm of language use, showing different genres and registers, and possibly also sampling some aspects of demographic variation. A classic example of such a corpus is the British National Corpus (BNC: 1994 version, 2014 version), which includes samples of both written and spoken language, with different kinds of written text in specified proportions, and which samples speaker variation across dimensions such as geography, gender and age. The BNC is sometimes referred to as a ‘reference corpus’; the idea is that it provides a baseline against which phenomena from other datasets can be compared. COOEE was constructed on similar, if less ambitious, lines. The corpus is divided into four time periods, with an approximately equal amount of material in each. Texts are assigned to one of four different types, and the proportion of each of those types is approximately equal in each time period. Such a design makes it easy to examine chronological change or variation between registers. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:2","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected opportunistically The third type of corpus is one that is almost without a plan; the collection of data is opportunistic. Again, we can distinguish some sub-types here. Restricted amount of data Sometimes, opportunistic data collection is the only viable strategy. An example of this is the approach often taken by linguists documenting languages whose future is uncertain. 
In that situation, carefully planning what kinds of data to collect and in what proportions would be ideal, but is typically a luxury which cannot be afforded. Collecting anything which is easily available is at least a good starting point — if one becomes aware that there are large gaps in the data collected, one can try to address the problem in the future, but that may not ever be possible. This type of corpus is the result of making the best one can of limited opportunities, which restrict the amount of data which is available. Collections of spoken interaction are often assembled in this way. The Griffith Corpus of Spoken Australian English (GCSAusE), which includes recordings made and transcribed by students, is an example of this type of corpus in the AusNC. Extensive data Opportunistic data collection is also relevant to big data: many researchers work with data from the web nowadays, and this almost inevitably involves opportunistic procedures. However, in this case, the large amount of data which is accessible can justify a claim that a significant part of the variation occurring in language use (at least for a certain type of use) will be represented in the data. On this basis, many linguists are happy to use a corpus of web data, at least for exploratory research. Figure 2: The Australian National Corpus logo. Image Source: LDaCA ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:3","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"AusNC as an opportunistic collection We have set out three broad strategies above for constructing a corpus. The AusNC calls itself a corpus, so can we assign it to one of these types? We hope that the answer to this question is clear: the AusNC was an opportunistic collection. When the possibility of a corpus was being discussed, there were proposals for it to be carefully planned. 
For example, in a paper published following one of the workshops leading up to the AusNC, Pho (2009) argued that an Australian corpus should mirror the design of the BNC. However, even putting together a collection of written material on the scale of the BNC is a huge task, never mind the spoken language component; the BNC itself was produced by a large team working over a number of years. The AusNC project never enjoyed a level of funding which would have made such a plan possible (indeed, the amount of funding we received, while very welcome, was modest). In these circumstances, we had no alternative but to create the AusNC from existing data sources which custodians were prepared to share. We did hope that this model would make it easy to expand the corpus, but this did not occur, as we were not able to develop the infrastructure necessary for ongoing expansion of the corpus with the limited amount of funding we had. In 2012, it seemed obvious that the appropriate term for our project was corpus. We were putting together a collection of language data and it came to exist as an entity in its own right. What was included might be the result of rather random factors, but there was a coherence in that all the data related to language use, in fact, use of the English language in Australia. No one was talking about ‘data commons’2 when the AusNC was created (or we were not listening). If the concept had been actively discussed, perhaps we would have described what we were doing as a data commons. The inception of LDaCA coincided, more or less, with the two factors mentioned above which led to the AusNC becoming non-viable: the understandable withdrawal of support from Griffith University and the instability of the Alveo infrastructure due to funding constraints. It was therefore inevitable that transferring the AusNC collection to the new project would be an early priority. 
Approaching this task, we faced the problem of whether it still made sense to refer to the collection of datasets as the AusNC. After discussion within our project, and with the custodians of the AusNC datasets, we decided that retaining the name would suggest a coherence which did not exist now that the AusNC was part of LDaCA. The AusNC is now part of a larger assemblage of collections of data, and we felt that any internal coherence which the AusNC might have did not differentiate it, or the datasets it was composed of, from that broader landscape. Therefore, we made the decision to include the component datasets of the AusNC as separate collections in the LDaCA data portal. The top-level description of each of the collections contains the following sentence to acknowledge the history of the data: “This collection was previously accessible online via the Australian National Corpus (AusNC), an initiative managed by Griffith University between 2012 and 2023.” Additionally, this blog post is available to those who are interested in the history of the AusNC.3 The corpus is dead, long live the data commons! ","date":"2024-03-22","objectID":"/news/posts/ausnc/:3:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"References Grossman, Robert L., Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells. 2016. “A Case for Data Commons: Toward Data Science as a Service.” Computing in Science \u0026 Engineering 18 (5): 10–20. DOI: 10.1109/MCSE.2016.92 Haugh, Michael, Kate Burridge, Jean Mulder and Pam Peters (eds.). 2009. Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages. Somerville, MA: Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html. Musgrave, Simon \u0026 Michael Haugh. 2020. The Australian National Corpus (and beyond). 
In Louisa Willoughby \u0026 Howard Manns (eds.), Australian English Reimagined. Abingdon: Routledge. Pho, Phuong Dzung. 2009. Towards the Design of the Australian National Corpus. In Michael Haugh, Kate Burridge, Jean Mulder \u0026 Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, 25–29. Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:4:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Footnotes Thanks to Teresa Chan who provided valuable feedback on a draft. 1 We think it is important, however, to acknowledge and recognise the longstanding commitment by Griffith University to maintaining the AusNC over a period of more than ten years. ↩ 2 “A global trusted system of systems that provides frictionless access to high quality interoperable resources, services and artefacts for research….. Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community.” (Grossman et al. 2016) ↩ 3 More details of the history of the AusNC can be found in Musgrave \u0026 Haugh (2020). 
↩ ","date":"2024-03-22","objectID":"/news/posts/ausnc/:4:1","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"A list of the Oni data portals currently available, including their current collections.","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":" This page lists, and links to, the Oni portals currently available, together with the collections contained in each. As additional collections and portals are added, these will be documented below. To find out more about each collection, simply click through to be redirected to the Collections page. ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:0:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"LDaCA Portal Current Collections: A COrpus of Oz Early English (COOEE) AustLit Australian Corpus of English Australian Radio Talkback Braided Channels International Corpus of English (ICE-AUS) The La Trobe Corpus of Spoken Australian English The speech of Australian adolescents: research data and recordings collected by A.G. 
Mitchell and Arthur Delbridge in 1959 and 1960 ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:1:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"Indigenous Language Data (ILD) Portal Current Collections: UQ Indigenous Language Collection: Elwyn Flint Collection, UQFL173 Fryer Library, The University of Queensland UQ Library Collection Caroline Kelly Papers ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:2:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"In 2018, when he was working at The University of Queensland’s (UQ) Research Computing Centre (RCC), Marco Fahmi (former LDaCA Project Manager) initiated a Graduate Digital Research Fellowship (GDRF) program focused on researchers in the Humanities and Social Sciences (HASS). The program has run in some years since then, hosted by the RCC (2019) and also by the UQ Graduate School (2020). In 2023, the program ran again, this time under the auspices of the Australian Text Analytics Program (ATAP). Two UQ staff members, Sam Hames (Postdoctoral Research Fellow in Computational Humanities) and Simon Musgrave (Engagement Lead for ATAP and for the Language Data Commons of Australia) led the program, and four students participated. The program supports research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spent 12-15 weeks learning about digital and computational skills in order to enhance their current research/thesis topic or to work on an independent digital project. 
They explored digital research methods in areas including Computational analysis of text Creation of new software tools to support their specific research area Social media analytics Assessment and analysis of digital environments including mobile apps Online games as social and cultural objects The Fellows met regularly in activities to develop a sound understanding of digital research methods and tools, including seminars, reading groups, and training workshops. They also had access to mentors who advised and collaborated on their digital projects. The four Fellows who participated in the 2023 program and their projects were: ","date":"2024-01-10","objectID":"/news/posts/gdrf/:0:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Evelyn (Eve) Ansell (UQ, Linguistics): Eve works in the field of Conversation Analysis which uses very detailed transcriptions of spoken interactions as data. The detail means the basic word forms involved are not easy to identify. For example, dea:::l indicates the word deal with emphasis (the underlining) and a greatly lengthened vowel sound (the colons). Clearly, the search function in a word processor is not going to find this as an instance of the word deal! Eve’s project looked at the possibility of developing code in a Jupyter notebook which could identify base word forms in such transcriptions and allow the researcher to see all instances of a word in an interaction. Eve also had the opportunity to work on the problem as part of an industry placement with AARNet. In Eve’s words: My only regret is that I didn’t do the GDRF sooner. This sort of learning should be required early on in all Humanities research training programs. I cannot speak highly enough of Simon and Sam. They enthusiastically share their domains of knowledge, but also engage with us about our own work. 
To have access to experts willing to talk through current projects, future scoping, frustrations, and wild ideas is incredibly valuable. It’s easy to forget how much of our research work happens in the digital space. It’s not just the data, it’s how we interface with it. The GDRF taught me to think differently about how I treat my research workflow at almost every point. ","date":"2024-01-10","objectID":"/news/posts/gdrf/:1:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Mengyan (Kelly) Hou (UQ, Electrical Engineering and Computer Science): The core of Kelly’s PhD project is designing an app to support young people who have completed active cancer treatment to enhance their experience in digital healthcare interventions. In the Fellowship program, Kelly began surveying existing apps with similar functions and aims as the start of her own design process. An important component of this work was considering methodological questions: when we want to answer the question “Is this a good app?”, how do we go about that? Can we understand the intentions of the app designers and assess how effectively those intentions have been realised? Or should we try to judge what sort of experience the app provides for users (even if we are not members of the target audience)? Kelly examined these issues in relation to a number of relevant apps, as well as beginning to grapple with the (very topical) question of how an AI-based conversational agent might be included in her design. ","date":"2024-01-10","objectID":"/news/posts/gdrf/:2:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Christina Maxwell (UQ, Psychology): Christina’s research investigates where cisgender women create the boundaries for ‘womanhood’ in relation to transgender women within the context of feminist social movements. 
In her work as a Fellow, Christina looked at whether social media data, in this case from Twitter, could be another source of data to investigate the boundary creation issue. Similar to Kelly, Christina’s work involved considering methodological issues: what methods for analysing texts could usefully be applied to the Twitter data? It is exciting for all of us involved in the program that Christina (with co-authors) presented this research to the SASP-ACPID Conference 2023. To quote the abstract for that presentation: “As both anti-inclusion and pro-inclusion feminists are increasingly turning to social media to communicate their messages, we will collect data from Twitter to map out each movement online and how they position themselves and their goals in relation to the inclusion of transgender women. This exploratory mixed-methods project will use topic modelling, thematic analysis, and Linguistic Inquiry and Word Count…”. In Christina’s words: The Fellowship was instrumental in honing my ideas and offering invaluable hands-on support to bring my vision to life. It played a pivotal role in developing one of the most exciting and interesting projects of my PhD and I look forward to using these new skills and ways of thinking in future digitally-based projects. Update - August 2024 Christina’s work has now been published as a journal article with Sam as one of her co-authors: Maxwell, C., Selvanathan, H. P., Hames, S., Crimston, C. R., \u0026 Jetten, J. (2024). A mixed-methods approach to understand victimization discourses by opposing feminist sub-groups on social media. British Journal of Social Psychology, 00, 1–22. 
https://doi.org/10.1111/bjso.12785 ","date":"2024-01-10","objectID":"/news/posts/gdrf/:3:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Giulio Pitroso (Griffith, Sociology): Giulio’s research examines gaming communities in Australia and in Italy, focused on the way stereotypes tied to Italians and organised crime are articulated among players of video games. He is also interested in digital communities, ethnic identities on the Internet, and cultural and social manifestations related to Italian criminal organisations. In his work as a Fellow, he looked at the intersection of these areas of interest by examining the discussions, speculations, and expectations of fans about the (yet to be released) Mafia IV video game. This was tackled using comment threads from YouTube channels as a source of data. Once again, methodological questions were important for Giulio, including the questions about what analytic techniques could usefully be used. A GDRF program will be offered again in 2024, this time under the banner of the Language Data Commons of Australia. If you think the program could benefit you, there is an online form for expressions of interest (due by February 19 2024). For further information, you can contact Sam Hames or Simon Musgrave. 
","date":"2024-01-10","objectID":"/news/posts/gdrf/:4:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Find answers to the most commonly asked questions about the project.","date":"2023-12-12","objectID":"/about/faqs/","tags":null,"title":"Frequently Asked Questions","uri":"/about/faqs/"},{"categories":null,"content":"A variety of LDaCA open-source tools available on GitHub.","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":" A variety of LDaCA open-source tools are available at our GitHub organisation. Highlights include: ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:0:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Metadata Editor Crate-O: A tool that allows you to create and update Research Object Crates (RO-Crates) using a web interface, and with metadata spreadsheets. It provides researchers with a relatively simple way to describe their data using the best practices in formal metadata description. For more information about Crate-O and how to create RO-Crates with the interface, see the Crate-O User Guide. ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:1:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Data Storage OCFL-js, a library that implements the Oxford Common File Layout (OCFL): A specification for laying out digital collections on file or object storage. It is designed with long-term preservation principles in mind and does not rely on specialised software. 
Amongst the benefits of using OCFL with RO-Crate objects are: completeness: a repository can be re-indexed from the files it stores versioning: repositories can make changes to objects and still allow their history to persist ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:2:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Data Portal and Access API Oni: A web application that provides indexing, searching and access to secure data repositories following the Arkisto model. This is used to build the LDaCA Portal: The online interface of the Language Data Commons of Australia where users can discover and access language collections. ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:3:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Specifies the license(s) that data contributors have applied to the content of their collection, including the content coverage of that license.","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":" Specifies the license(s) that data contributors have applied to the content of their collection, including the content coverage of that license. 
","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:0","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"A Corpus of Oz Early English (COOEE) All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:1","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"AustLit All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:2","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"Australian Radio Talkback All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:3","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"Braided Channels All content: Attribution-NoDerivatives 3.0 Australia (CC BY-ND 3.0 AU) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:4","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"International Corpus of English (Aus) All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:5","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The Australian Corpus of English All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:6","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The La Trobe Corpus of Spoken Australian English All content: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) 
","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:7","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The speech of Australian adolescents: research data and recordings collected by A.G. Mitchell and Arthur Delbridge in 1959 and 1960 All content: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:8","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"LDaCA team member Harriet Sheppard describes collecting data on a remote island in Papua New Guinea.","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/fieldwork-png/","tags":null,"title":"Fieldwork in Papua New Guinea","uri":"/resources/general-resources/case-studies/fieldwork-png/"},{"categories":null,"content":"by Harriet Sheppard Harriet on an outrigger boat off Vanatɨna Image Source: Harriet Sheppard My fieldwork experience, in many ways, conforms to stereotypes of language fieldwork – a community outsider working with a smaller language community located far from an urban centre. For my postgraduate research, I documented and described grammatical topics of Sudest (known as Vanga Vanatɨna by its speakers), an Oceanic language spoken in Papua New Guinea (PNG). The language is spoken on the islands of Vanatinai (also known as Sudest and Tagula) and Yeina in the Louisiade Archipelago, approximately 350 kilometres southeast of the PNG mainland. In the following, I discuss decisions I made and processes I followed related to the collection of language data and archiving the resulting corpus of texts, including ethical considerations, ownership of the data, data formats, metadata, and data security and reuse. This is by no means meant as a guide (of best practice) for projects where language data are being collected with speakers of un(der)described languages. 
Rather, it is an account of some of the issues and procedures that need to be considered when collecting language data and how I handled them in this instance. Map of Louisiade Archipelago Image Source: Harriet Sheppard Before recording with a speaker for the first time, I would explain the project to the speaker and their rights as contributors to the project. I would explain how I would use the language recordings, how they would be stored and accessed and how speakers of the language or other researchers might access and use the recordings in the future. This included discussing how they, the person(s) making the recording, would retain ownership over the recording (i.e. intellectual and cultural property rights). They would also be the ones to choose access rights over the text and have the option to withdraw any or all recordings they made from the corpus at any point (more on these below). After these discussions, if the speaker was still interested in the project, I would then record an oral consent form with them. The consent form was in English, the language of wider communication in the province. In cases where the speaker didn’t speak or spoke limited English, we would have a translator present for this process. I chose to use an oral consent form as it is accessible to all participants no matter their literacy levels. It also has the advantage of being a more durable format compared with a paper consent form, particularly when working in a hot and humid climate, and it can be easily backed up. Transcription session with Abel Sam Image Source: Harriet Sheppard Each time I made a recording with a speaker or speakers, I played the recording back to them to check that they were happy with the recording and so that they could decide on what sort of access conditions they wanted the archived recording to have. 
If the speaker was not satisfied with the recording, sometimes they would choose to make another recording and delete the original or simply delete the recording and maybe try again later. If they were happy with the recording, they would decide on the access conditions they wanted for the archived recording. The possibilities ranged from the most restricted, being that only I, as the original researcher, could access the recordings and use them for my research, to the opposite end of the access continuum where anyone interested could download and listen to the recordings, read any transcriptions, and potentially use them for educational or research purposes. In between these two ends of the continuum, they could opt for other access conditions such as only allowing access for registered users of the archive, or only for researchers and speakers, or just for speakers. They could also choose to have a text embargoed for a certain period of time. Notes from an elicitation session Image Source: Harriet Sheppard When I was preparing my ethics applicati","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/fieldwork-png/:0:0","tags":null,"title":"Fieldwork in Papua New Guinea","uri":"/resources/general-resources/case-studies/fieldwork-png/"},{"categories":null,"content":"LDaCA Chief Investigator Catherine Travis and former team member Cale Johnstone explain the complexities of handling data from three periods in a 40 year span.","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"by Catherine Travis and Cale Johnstone ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:0:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Introducing the Sydney Speaks project 
Sydney Speaks is a large-scale sociolinguistic project, funded through the ARC Centre of Excellence for the Dynamics of Language (CoEDL 2014-2022) at the ANU, led by Catherine Travis, working with James Grama and Simon Gonzalez (CoEDL post-doctoral fellows, 2017-2019); Katrina Hayes and Cale Johnstone (Project Managers); Benjamin Purser (lead RA); and many RAs who worked on data collection, transcription, and checking vowel alignment. The Sydney Speaks Corpora Contemporary linguistic recordings (collected 2015-present): Sydney Speaks 2010s Legacy linguistic recordings (Collected 1977-1981): Sydney Social Dialect Survey Legacy oral histories (collected 1987-1988): NSW Bicentennial Oral History Collection Data access concerns Combining legacy and contemporary data collections References Sydney Speaks Corpus and Publications Since 2015, the Sydney Speaks project has recorded the spontaneous speech of Sydneysiders for the purpose of documenting and exploring Australian English as spoken in Australia’s largest and most ethnically and linguistically diverse city. In order to study language change in real-time, the project also incorporates legacy data from two sources: a sociolinguistic project carried out in the late 1970s and an oral history collection created during the 1980s. Altogether, the birth years of the participants span over a century, from the 1890s to the 1990s, and five age groups are represented, as captured in Figure 1. Figure 1: Sub-corpora included in the Sydney Speaks project: Sydney Speaks 2010s; Sydney Social Dialect Survey; NSW Bicentennial Oral History Project Image Source: Catherine Travis Each of the three sub-corpora of Australian English presents a different set of challenges and issues for data management practices that maximise the value of the data, while protecting the ethical, moral and legal rights of the participants. Below, we present each sub-corpus individually and provide a summary of each. 
We describe the ways in which recordings were made and transcription was managed; outline how we were able to get comparable demographic information from participants when this was not standardised across the corpora; and detail how we dealt with participant consent and privacy. We close by reviewing a set of data access concerns around de-identification, data security and access and re-use. In this overview of the process by which different datasets collected for distinct purposes and under varying conditions have been brought together to constitute a coherent corpus of language data, we hope to highlight some of the considerations key to working with sociolinguistic data, and the enormous potential for the incorporation of other data sources through the use of appropriate standards for data management. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:1:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"The Sydney Speaks corpora ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Contemporary linguistic recordings (collected 2015-present): Sydney Speaks 2010s Sub corpus overview The Sydney Speaks 2010s sub-corpus comprises recordings made from 2015 to the present. Data collection is still ongoing, but as of June 2023, Sydney Speaks 2010s comprises 142 sociolinguistic interviews with men and women of varying ages from different ethnic communities in Sydney representing some of the largest ethnic minorities in Australia, namely Anglo-Celtic, Chinese, Greek, and Italian Australians. The interviews were generally conducted individually (though some were done in pairs), and they were conducted by community members, typically with people they knew. 
The interviewers were provided with a list of suggested topics, covering such things as childhood memories, growing up in Sydney, school and work experiences, and for the Greek, Italian and Chinese participants, engagement with their ethnic community and language, but they were told to use these as a guide only, and to follow the participants’ lead in choosing topics. Thus, the data is highly interactional, in some cases conversational, and topics vary greatly across recordings. There is a focus on recording narratives of personal experience, as these have been demonstrated to be particularly ideal for recording the everyday vernacular (Labov 1984:32-42). After the sociolinguistic interview, participants read aloud a word list, which provides a comparison for both the sociolinguistic interview data and other studies of Australian English (as many have focused on read word lists). The suggested interview topics and the word list are given in Figures 2a and 2b. Figure 2a: Suggested interview topics Image Source: Catherine Travis Figure 2b: Suggested word list Image Source: Catherine Travis Recordings and transcripts Interviews are recorded using a Zoom H4N digital recording device in WAV format at 44.1 kHz/16 bits. They are orthographically transcribed, aligned at the utterance level and then uploaded into LaBB-CAT (Language, Brain and Behaviour Corpus Analysis Tool (Fromont \u0026 Hay 2012)), for forced alignment (which automates the process for alignment at the level of the phoneme). LaBB-CAT also serves as the corpus management tool, where the corpus is stored, data is further annotated, and concordance searches can be conducted. 
LaBB-CAT does not include corpus analysis tools as such, but data can be downloaded in multiple formats to allow for analyses, including CSV and WAV, as well as in formats that are compatible with tools that are used widely by linguists, such as Elan, illustrated in Figure 3 (Lausberg \u0026 Sloetjes 2009) and Praat in Figure 4 (Boersma \u0026 Weenink 2019). Figure 3: Sample SydS transcript in Elan, time aligned at the level of the Intonation Unit [SydS_CYM_061_Mark] Image Source: Catherine Travis Figure 4: Sample SydS transcript in Praat, downloaded from LaBB-CAT, where it has been automatically aligned at the level of the word and phoneme, and broken down by morpheme [SydS_CYM_061_Mark] Image Source: Catherine Travis Participant metadata Detailed speaker metadata is collected via a demographic form that includes information such as place and year of birth, current and past suburbs of residence, occupation, education, community background, languages, and social networks (see Figure 5). This questionnaire is conducted orally at the end of the interview (as a continuation of the interview), and the participants’ responses are filled in by the interviewer. Information is extracted from the written form, validated in the recorded audio, and added to a metadata spreadsheet in Excel. We use customised metadata, rather than drawing on a standard metadata vocabulary; however the demographic information from each sub-corpus is standardised in Excel and comparable across the three sub-corpora. 
Figure 5: Sample Sydney Speaks demographic information form Image Source: Catherine Travis Participant consent and privacy The Sydney Speaks contemporary data collection ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:1","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Legacy linguistic recordings (Collected 1977-1981): Sydney Social Dialect Survey Horvath, Barbara. 1985. Variation in Australian English: The sociolects of Sydney. Cambridge: Cambridge University Press Sub corpus overview The Sydney Social Dialect Survey (SSDS) is a collection of 177 sociolinguistic interviews with adult and teenage Australians from Anglo-Celtic, Greek, and Italian backgrounds, recorded in Sydney between 1977 and 1980, as part of an ARC-funded project led by Barbara Horvath, of the Department of Linguistics at the University of Sydney. For the Sydney Speaks project, recordings from Anglo-Celtic adults (born in the 1930s) and from Anglo-Celtic, Greek and Italian Teenagers (born in the 1960s) were included. The recordings with Greek and Italian adults were set aside, as they arrived in Australia as adults and speak English as a second language, and thus raise a different set of questions for the study of language variation and change. There were 7 participants who did not meet the Sydney Speaks participant criteria (e.g. one of their parents was not of the target ethnic groups), leaving a total of 20 adults and 72 teenagers for inclusion. Like the contemporary linguistic data, the Sydney Social Dialect Survey comprises sociolinguistic interviews, but these are more interview-like: the interviewers were not community members, and they did not typically know the participants; and the topics were more defined (some common topics being games, layout of the school, nicknames, and language). 
Recordings and transcripts Interviews were conducted in the 1970s with a cassette recorder and made available to the Sydney Speaks project directly by the lead researcher. The audio cassettes, type-written transcripts, and demographic information of the participants were stored in boxes in Horvath’s garage in Sydney and passed on to Catherine Travis in 2013. The cassettes (pictured in Figure 7) were digitised using PARADISEC equipment in the College of Asia Pacific Studies at the ANU to create WAV files (96.1 kHz, resampled using Audacity into smaller files of 44.1 kHz). Fortuitously, the cassettes had been preserved very well, and nearly all of them (124/130) were able to be digitised, and with further refinement, it was possible to conduct acoustic analyses (though in some cases, this presented considerable challenges that did not arise with the new recordings). The typewritten transcripts were scanned and digitised as PDFs, but it was not possible to convert them into machine-readable transcripts, partly because they had been marked up by hand for transcript corrections, coding and annotation, as can be seen in the sample in Figure 8. They were therefore re-transcribed in Elan by the Sydney Speaks team, following the protocols applied for the contemporary data (see sample Elan rendition of part of Figure 8 in Figure 9). Figure 7: SSDS cassette collection Image Source: Catherine Travis Figure 8: Sample SSDS transcript with annotations [SSDS_ITF_120_Sara] Image Source: Catherine Travis Figure 9: Sample SSDS transcript time-aligned in Elan [SSDS_ITF_120_Sara] Image Source: Catherine Travis Participant metadata Metadata for each speaker was extracted from original type-written profiles that included information such as suburb, date of birth, age, occupation, education, languages, and time lived in Sydney (see Figure 10). Details were standardised and added to the project metadata database in an Excel spreadsheet. 
Further demographic information was added from several other sources, including participant overview documents put together by the lead researcher, original cassette labels as well as from the audio recordings themselves. We were not able to extract the same demographic information for all participants, and we have more details for some than for others, something which needs to be taken into account in cross-corpus comparisons. Figure 10: Sample demographic information: Sydney Social Dialect Survey Image Source: Catherine ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:2","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Legacy oral histories (collected 1987-1988): NSW Bicentennial Oral History Collection NSW Bicentennial Oral History Project. 1987. NSW Bicentennial oral history collection. Council on the Ageing NSW Branch and NSW Oral History Association of Australia, housed at the National Library of Australia. The process of conducting oral history involves recording interviews to collect information about the past, from the perspective of those who lived through relevant events. In recording everyday voices, oral histories are particularly valuable for research across Humanities disciplines as they capture the ‘little-heard voices of society’. They are of particular value for linguistic analysis, because they aim to ensure that ‘the historical record includes different languages and vernacular speech, accent and dialect’ (Oral History Statement of value; What is oral history? Retrieved May 23, 2022, from Oral History Australia). Like the sociolinguistic interview, oral histories elicit narratives of personal experience, and thus they provide highly comparable data. 
Sub corpus overview The NSW Bicentennial Oral History Collection, produced by the NSW Bicentennial Oral History Project, comprises 200 interviews recorded in 1987 and 1988 with men and women born before 1910. The Sydney Speaks project has incorporated 31 interviews with people who were born in Sydney and whose parents were also born in Sydney. The recordings include discussions about life in Sydney in the early part of the twentieth century, including the war, women’s first experience in the workforce, the outbreak of the Spanish flu, and so on. The collection is managed by the State Library of NSW and the National Library of Australia. In 2017, the Sydney Speaks project gained access to the collection via direct request to the National Library of Australia. Recordings and transcripts The original audio of each interview was recorded on cassette tape and was made available to the Sydney Speaks team in MP3 and WAV format, allowing for acoustic analysis. Type-written transcripts (Figure 11) were also made available, from which the Sydney Speaks team was able to create machine-readable versions via OCR (Optical Character Recognition). The original transcript captured the content very accurately, and this was imported into Elan and edited to produce detailed transcriptions that are aligned with the audio and facilitate linguistic analysis (Figure 12). Figure 11: Sample transcript from Bicentennial Oral History Project [BCNT_AEF_032_Camila] Image Source: Catherine Travis Figure 12: Sample Bicentennial Oral History Project transcript time-aligned in Elan [BCNT_AEF_032_Camila] Image Source: Catherine Travis Participant metadata Demographic data was collected at the time of the interview for each participant; this included name and date/place of birth as a minimum, but a ‘summary’ provided further information, such as parents’ education and occupations, employment (past and current), interests and marital status (see sample in Figure 13). 
The recordings themselves also provided a wealth of demographic information, as is common practice in oral histories, meaning that full demographic profiles could be developed for most participants. Some investment is necessary to capture and systematise this information, but its availability is one of the clear advantages they have for (socio)linguistic analysis. Figure 13: Sample demographic information: Bicentennial Oral History Project Image Source: Catherine Travis Participant consent and privacy Incorporating speech data from oral history collections is new ground for linguistic research and there are no existing guidelines to follow regarding the consideration of participant consent. The NSW Bicentennial Oral History Collection manual indicates which ‘restrictions on use’ were sought by the participants. A small number of participants asked for their name not to be used and for permission to be sought before publication of the data. For most of ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:3","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data access concerns ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"De-identification All speakers have been given pseudonyms, for which we aim to parallel the original name (thus, Sarah may be Sally, Alfredo may be Alberto etc.). In the transcripts themselves, names and all other identifying content (such as addresses, school names, nicknames, etc.) have been de-identified in all data formats. In the transcripts, this is done by using pseudonyms, for readability purposes (rather than noting [XXX], or [pseudonym], for example). 
To indicate to the analyst what words are pseudonyms, all pseudonyms are marked (preceded by a tilde, e.g. ~Jane, ~Millers ~Point). For the audio, we identify the segment that needs de-identification, and run a low-pass filter using a Praat script so that the name is not recognisable. During an interview, some participants have requested that a section not be included in the study, generally due to what they perceive to be the sensitive nature of a certain topic (for example, one participant talking about banking in Hong Kong). In these cases, this portion of the recording has been deleted, and it has not been transcribed or included in any analysis. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:1","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data security Data security measures are guided by the Sydney Speaks ethics approval. Long-term storage is especially important for legacy data that hasn’t been digitised or archived where data loss is a realistic risk. Contemporary data is collected in Sydney and transferred to the research team based at the ANU in Canberra. A remote data transfer system using an online cloud service ensures the safe transfer of raw data. Original audio and transcripts in the possession of the project are stored in a locked cabinet, managed by the project lead. Data has been digitised to secure the collection long term, and it is stored in an online cloud service as well as backed up on external hard drives. Having multiple copies of the data increases the security of the collection and using pseudonyms in the file naming protocols protects the identity of participants. 
","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:2","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data access and reuse In accordance with the project ethics protocol, the data is made available to other researchers with the approval of the Sydney Speaks project lead, Catherine Travis. While data from the NSW Bicentennial Oral History Collection is openly accessible online, the agreement between the National Library of Australia and the Sydney Speaks project allows the Chief Investigator to determine future access to the data from the subset of speakers included in the Sydney Speaks collection, with the condition of correct attribution. The Sydney Social Dialect Survey legacy data was transferred to the Sydney Speaks in full, including the capacity to make decisions about data access and reuse. Regular outreach activities such as presentations, workshops, public lectures and publications, promote the corpus and increase awareness of the data. The corpora are also described on lists of significant language data collections, such as the Sydney Corpus Lab Blog. They are stored with the DoI managed through the library at the Australian National University: https://dx.doi.org/10.25911/m03c-yz22. Access to the Sydney Speaks collection (including all three sub-corpora) is managed on a case-by-case basis. There is an agreed-upon set of terms and conditions for use of the collection, to ensure that any use is in accordance with the ethics approval, and these conditions are specified in the data access licenses developed with support from the Language Data Commons of Australia (LDaCA). Users must fill in an online application form, specifying how the corpora will be used and guaranteeing appropriate attribution to gain access. 
","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:3","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Combining legacy and contemporary data collections In the past, language data was often collected without much consideration of the use of that data beyond the specific purpose for which it was collected. Issues such as ethics, data storage, and long-term data management plans were not of primary concern in the way that they are today, when we are guided by the FAIR principles around Findability, Accessibility, Interoperability, and Re-usability. This does not mean, however, that older language collections (or collections made from other disciplines and for other purposes) cannot be made FAIR. The Sydney Speaks project has demonstrated that legacy corpora can be brought into line with standards appropriate for data in the current digital age. Integrating contemporary and legacy data collections in this way allows for an upscaling of our studies in both the size and the scope of the language data we work with, and in so doing can open a treasure trove of knowledge on Australian language, society, culture, and history. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:4:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"References Boersma, Frederic, J. and D. Weenink. 2019. Praat: Doing phonetics by computer [Computer Software] (6.1.03 ed.): Retrieved 1 September 2019 from http://www.praat.org/. Fromont, Robert and Jennifer Hay. 2012. LaBB-CAT: An annotation store. Proceedings of the Australasian Language Technology Workshop: 113-117. Labov, William. 1984. Field methods of the project on linguistic change and variation. In John Baugh, and Joel Sherzer (eds), Language in use: Readings in sociolinguistics, 28-53. 
Englewood Cliffs, NJ: Prentice Hall. Lausberg, Hedda and Han Sloetjes. 2009. Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, \u0026 Computers (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands. http://tla.mpi.nl/tools/tla-tools/elan/). 41(3): 841-849. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:5:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Sydney Speaks Corpus and Publications Travis, Catherine E., James Grama, Simon Gonzalez, Benjamin Purser and Cale Johnstone. 2023. Sydney Speaks Corpus. ARC Centre of Excellence for the Dynamics of Language, Australian National University. https://dx.doi.org/10.25911/m03c-yz22 Gonzalez, Simon, James Grama and Catherine E. Travis. 2020. Comparing the performance of forced aligners used in sociophonetic research. Linguistics Vanguard 6(1). https://doi.org/10.1515/lingvan-2019-0058 Grama, James, Catherine E. Travis and Simon Gonzalez. 2019. Initiation, progression and conditioning of the short-front vowel shift in Australian English. In Sasha Calhoun, Paola Escudero, Marija Tabain, and Paul Warren (eds), Proceedings of the 19th International Congress of Phonetic Sciences (ICPhS), Melbourne, Australia, 1769-1773. Canberra, Australia: Australasian Speech Science and Technology Association Inc. https://assta.org/proceedings/ICPhS2019/papers/ICPhS_1818.pdf Grama, James, Catherine E. Travis and Simon Gonzalez. 2020. Ethnolectal and community change ov(er) time: Word-final (er) in Australian English. Australian Journal of Linguistics 40(3): 346-368. https://doi.org/10.1080/07268602.2020.1823818 Grama, James, Catherine E. Travis and Simon Gonzalez. 2021. Ethnic variation in real time: Change in Australian English diphthongs. 
In Hans Van de Velde, Nanna Haug Hilton, and Remco Knooihuizen (eds), Studies in Language Variation, 292-314. Amsterdam: John Benjamins. https://www.jbe-platform.com/content/books/9789027259820-silv.25.13gra Lee, Esther. 2020. Quotatives over time: A study in ethnic variation. Honours thesis, School of Literature, Languages and Linguistics, Australian National University. http://hdl.handle.net/1885/298816 Purser, Benjamin, James Grama and Catherine E. Travis. 2020. Australian English over time: Using sociolinguistic analysis to inform dialect coaching. Voice and Speech Review 14(3): 269-291. https://doi.org/10.1080/23268263.2020.1750791 Qiao, Gan and Catherine E. Travis. 2022. Ethnicity and social class in pre-vocalic the in Australian English. In Rosey Billington (Ed.), Proceedings of the Eighteenth Australasian International Conference on Speech Science and Technology (pp. 56-60): Australasian Speech Science and Technology Association. https://sst2022.files.wordpress.com/2022/12/qiao-travis-2022-ethnicity-and-social-class-in-pre-vocalic-the-in-australian-english.pdf Sheard, Elena. 2022. Longevity of an ethnolectal marker in Australian English: Word-final (er) and the Greek-Australian community. In Rosey Billington (Ed.), Proceedings of the Eighteenth Australasian International Conference on Speech Science and Technology (pp. 51-55): Australasian Speech Science and Technology Association. https://sst2022.files.wordpress.com/2022/12/sheard-2022-longevity-of-an-ethnolectal-marker-in-australian-english-word-final-er-and-the-greek-australian-community.pdf Sheard, Elena. 2023. Explaining language change over the lifespan: A panel and trend analysis of Australian English. PhD thesis, School of Literature, Languages and Linguistics, Australian National University. http://hdl.handle.net/1885/292110 Travis, Catherine E., James Grama and Benjamin Purser. 2023. Stability and change in (ing): Ethnic and grammatical variation over time in Australian English. 
English World-Wide 44(3): 429-463. https://doi.org/10.1075/eww.22043.tra Travis, Catherine E. and Rena Torres Cacoullos. 2023. Form and function covariation: Obligation modals in Australian English. Language Variation and Change 35(3): 351-377. https://doi.org/10.1017/S0954394523000200. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:6:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Information about the approach to metadata being taken by LDaCA.","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":" Metadata is often defined as ‘data about data’. High-quality metadata is important in making data FAIR: Findable: Metadata is the starting point for searching data collections. For example, if we want to find data in a particular language, this will only be possible for data that has a language recorded in its metadata. (Tracking languages is in itself problematic, see below.) Accessible: Access conditions that apply to data should be part of the associated metadata. Interoperable: Information about the format of data and whether it requires specific software to be usable should be part of the associated metadata. Reusable: All of the aspects of metadata mentioned above contribute to making data reusable. The more we know about some data, the easier it is to know whether it will be useful to us or not. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:0:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"RO-Crate Profiles RO-Crates in general have basic metadata requirements, but it is possible to specify a profile for crates for specific purposes. 
LDaCA is developing such a profile for our data; we are basing this largely on previous work in the area. An important aspect of the RO-Crate approach is that it uses the principles of Linked Open Data. This means that terms used in our metadata will (whenever possible) link to an openly available definition. In developing the profile, we are drawing on two existing attempts to provide vocabularies for describing language data. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:1:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Open Language Archives Community (OLAC) OLAC is an international partnership of institutions and individuals; one of their activities is developing consensus on best current practice for the digital archiving of language resources and this includes making recommendations for metadata. The OLAC metadata scheme is based on Dublin Core (DC), a widely used general metadata schema. OLAC have suggested refinements and extensions of the DC base which make it more useful for describing language resources. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:2:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Component Metadata Infrastructure (CMDI) CMDI was developed within the CLARIN project. It draws on the earlier ISLE Metadata Initiative (IMDI), but where IMDI attempted to specify a comprehensive scheme for (multimodal) language data, CMDI adopts a more flexible approach where components are assembled into reusable profiles. This is very similar to the RO-Crate approach described above but with an important difference: the components of a CMDI profile are all drawn from a central registry, whereas components of an RO-Crate profile come from any linkable location. 
The metadata being developed as part of the LDaCA RO-Crate profile can be viewed as a Schema (a list of terms used to describe language resources). We are also developing a more fully documented version as a Gitbook, Metadata for Language Data. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:3:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Identifying Codes for Languages One very important piece of metadata for language data is a description of the language or languages that the data represent. This is not a simple problem because the relationship between languages and names for them is not one-to-one. Some languages have more than one name: for example, Farsi and Persian can both be used to refer to the same language. Some names refer to more than one language: for example, there are languages called Buru used in Nigeria and in Indonesia. To avoid the confusion that can arise from such situations, various systems have been developed to assign unique identifiers to languages. None of these systems gives a comprehensive list of languages and all such systems struggle with another problem, the distinction between separate languages and dialects of one language, as can be seen in the case study below. LDaCA includes identifiers from each of the three systems below where they are available and relevant. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"ISO-639 This system is recognised as a standard by the International Organization for Standardization (ISO). An earlier part of the standard (ISO 639-1) uses two-letter codes to identify languages; the more recent ISO 639-3 uses three-letter codes. These codes are used by Ethnologue, which is a catalogue of the languages of the world, and in many other contexts.
The ISO 639-3 code for French is fra, and Warlpiri is wbp. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:1","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Glottolog Glottolog is an alternative catalogue of the world’s languages, language families and dialects; Glottolog uses the term languoid to cover all of these. Each languoid is assigned a unique identifier consisting of four alphanumeric characters and four digits. For example, (standard) French has the code stan1290, and Warlpiri is warl1254. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:2","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Austlang AustLang provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. Alphanumeric codes are used as persistent identifiers, while associated text strings are changeable and can reflect community preferences (including alternative names and spellings). In AustLang, Warlpiri has two codes: C15 for the language in general, and C15.1 for the variety named as Wakirti Warlpiri. (French is not covered by AustLang.) ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:3","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Case study - Kala Lagaw Ya Kala Lagaw Ya is a language spoken in the Torres Strait Islands. The language has several dialects or varieties and the table below shows how the different code schemes deal with this.
","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:4","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"A collection of published material related to language research.","date":"2023-11-06","objectID":"/resources/general-resources/publications/","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"The Open Handbook of Linguistic Data Management Edited by Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister. MIT Press 2022. “A guide to principles and methods for the management, archiving, sharing, and citing of linguistic research data, especially digital data. “Doing language science” depends on collecting, transcribing, annotating, analyzing, storing, and sharing linguistic research data. This volume offers a guide to linguistic data management, engaging with current trends toward the transformation of linguistics into a more data-driven and reproducible scientific endeavor. It offers both principles and methods, presenting the conceptual foundations of linguistic data management and a series of case studies, each of which demonstrates a concrete application of abstract principles in a current practice.” This material is Open Access. 
","date":"2023-11-06","objectID":"/resources/general-resources/publications/:1:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Language Documentation and Conservation “LD\u0026C publishes papers on all topics related to language documentation and conservation, including, but not limited to, the goals of language documentation, data management, fieldwork methods, ethical issues, orthography design, reference grammar design, lexicography, methods of assessing ethnolinguistic vitality, archiving matters, language planning, areal survey reports, short field reports on endangered or underdocumented languages, reports on language maintenance, preservation, and revitalization efforts, plus software, hardware, and book reviews.” LD\u0026C is an Open Access journal. ","date":"2023-11-06","objectID":"/resources/general-resources/publications/:2:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Living Languages/Lenguas vivas/Línguas vivas journal “Living Languages is an international, multilingual journal dedicated to topics in language revitalization and sustainability. The goal of the journal is to promote scholarly work and experience-sharing in the field. The primary focus is on bringing together language revitalization practitioners from a diversity of backgrounds, whether academic or not, within a peer-reviewed publication venue that is not limited to academic contributions and is inclusive of a diversity of perspectives and forms of expression.” Living Languages is an Open Access journal. ","date":"2023-11-06","objectID":"/resources/general-resources/publications/:3:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Pacific Linguistics A digital archive of many Pacific Linguistics publications up to 2012. 
","date":"2023-11-06","objectID":"/resources/general-resources/publications/:4:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Australian organisations that are active in the language research space, as well as international organisations working in the global language space.","date":"2023-11-06","objectID":"/resources/general-resources/organisations/","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Organisations ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:0","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Linguistic Society (ALS) The national organisation for linguists and linguistics in Australia. Its primary goal is to further interest in and support for linguistics research and teaching in Australia. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:1","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Applied Linguistics Association of Australia (ALAA) The national professional organisation for applied linguistics in Australia. It welcomes academics, teachers, researchers, students and members of the wider community to join and become part of an active community interested in questions, issues and problems that can be understood and addressed through a focus on language in our world. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:2","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australasian Language Technology Association (ALTA) Has the purpose of promoting language technology research and development in Australia and New Zealand. 
","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:3","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australasian Speech Science and Technology Association (ASSTA) A scientific association that aims to advance the understanding of speech science and its application to speech technology in a way that is appropriate for Australia and New Zealand. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:4","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) An Indigenous-led, national institute that celebrates, educates and inspires people from all walks of life to connect with the knowledge, heritage and cultures of Australia’s First Peoples. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:5","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"International Organisations ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:0","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Common Language Resources and Technology Infrastructure (CLARIN) A research infrastructure that was initiated from the vision that all digital language resources and tools from all over Europe and beyond are accessible through an online environment for the support of researchers in the humanities and social sciences. 
","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:1","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Endangered Languages Documentation Program (ELDP) Supports the documentation and preservation of endangered languages through granting, training and outreach activities. The collections compiled through their funding are freely accessible at the Endangered Languages Archive. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:2","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Linguistic Data Consortium (LDC) An open consortium of universities, libraries, corporations and government research laboratories. LDC was formed in 1992 to address the critical data shortage then facing language technology research and development. Initially, LDC’s primary role was as a repository and distribution point for language resources, but with the help of its members, LDC has grown into an organisation that creates and distributes a wide array of language resources. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:3","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Existing archives of language in numerous forms, including recorded speech, text, images, books and music.","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Holds 14,500 hours of audio recordings and 2,000 hours of video recordings that might otherwise have been lost. These recordings are of performance, narrative, singing, and other oral tradition. 
This amounts to 150 terabytes and represents 1,315 languages, mainly from the Pacific region. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:1:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Living Archive of Aboriginal Languages (LAAL) A digital archive of endangered literature in Australian Indigenous languages of the Northern Territory. It contains nearly 4,000 books in 50 languages from 40 communities available to read online or download freely. This is a living archive, with connections to the people and communities where the books were created. This will allow for collaborative research work with the Indigenous authorities and communities. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:2:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Endangered Languages Archive (ELAR) A digital repository for preserving multimedia collections of endangered languages from all over the world, making them available for future generations. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:3:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"The Language Archive (TLA) An integral part of the Max Planck Institute for Psycholinguistics in Nijmegen. It contains various types of materials, including: audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other relevant information required to document and describe languages and how people use them; records of speech in everyday interactions in families and communities; naturalistic data from adult conversations from endangered and under-studied languages, and linguistic phenomena. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:4:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Kaipuleohone Language Archive The digital language archive of the University of Hawai’i. Founded in 2008, the archive houses texts, images, audio, and video collected from around the world by linguists, anthropologists, ethnomusicologists, and more. Our collection includes a wealth of photographs, notes, dictionaries, transcriptions, and other materials related to small and endangered languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:5:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Digital Endangered Languages and Musics Archives Network (DELAMAN) An international network of archives of data on linguistic and cultural diversity, in particular on small languages and cultures under pressure. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:6:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Information about ethical practices, especially in the research of language and First Nations cultures.","date":"2023-11-06","objectID":"/resources/general-resources/ethics/","tags":null,"title":"Ethics","uri":"/resources/general-resources/ethics/"},{"categories":null,"content":" Australian Institute of Aboriginal and Torres Strait Islander Studies Code of Ethics Australian Linguistic Society policies Includes Policy on Linguistic rights of Aboriginal and Islander communities and Statement of Ethics. 
Australian Code for the Responsible Conduct of Research, 2018 Linguistic Society of America: Ethics Statement 2019 Further resources Tromsø recommendations for citation of research data in linguistics Language and linguistics datasets are often not cited, or cited imprecisely, because of confusion surrounding the proper methods for citing them. For the use of researchers and scholars in the field working with datasets, the Tromsø recommendations propose components of data citation for referencing language data, both in the bibliography and in the text of linguistics publications. ","date":"2023-11-06","objectID":"/resources/general-resources/ethics/:0:0","tags":null,"title":"Ethics","uri":"/resources/general-resources/ethics/"},{"categories":null,"content":"Learn more about Indigenous language work in Australia through key projects in the space, as well as projects involved in the documentation of the world's languages.","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Indigenous Languages of Australia ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:0","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"50 Words Project Showcases words from 64 languages across Australia, with further languages being added regularly as more communities around Australia become involved. For each language, you can hear the words spoken via a map that shows the general location of the language. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:1","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"AustLang Provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:2","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Australian Indigenous Language Collections A guide to materials held in the National Library of Australia, with links to similar resources at State Libraries. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:3","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Contemporary and Historical Reconstruction in the Indigenous Languages of Australia (CHIRILA) A lexical database (a database with words from different languages). Currently, there are about 780,000 words, from all over Australia, of which about 20% is publicly available. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:4","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Digital Daisy Bates A collection of over 23,000 pages of wordlists of Australian languages, originally recorded by Daisy Bates in the early 1900s, made up of the original questionnaires and around 4,000 pages of typescripts. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:5","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"First Languages Australia First Languages Australia is a partner institution of LDaCA and encourages communication between communities, the government and key partners whose work can impact Aboriginal and Torres Strait Islander languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:6","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Gambay A map of Australia’s first languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:7","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Indigenous Voices of Queensland An online resource created to improve public access to audio recordings, held in the Fryer Library (UQ), of Aboriginal and Torres Strait Islander people speaking Indigenous languages of Queensland. The project aims to make the recordings as openly available as possible so they can be used to assist communities with language recovery and education, and deepen knowledge of First Nations languages. A map summary and table of languages is also available for browsing, based on recordings from three academics: Elwyn Flint, Bruce Rigsby and Bruce Sommer. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:8","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Interactive Queenslander Languages Map An interactive map providing resources for Queensland Aboriginal and Torres Strait Islander languages, hosted by the State Library of Queensland and searchable by suburb or language. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:9","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Language Centres The Indigenous Languages and Arts (ILA) program currently supports a number of organisations, including a network of 20 language centres. A list of these centres is available from the above link in PDF or DOCX format. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:10","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Living Archive of Aboriginal Languages (LAAL) See Language Archives. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:11","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Living Languages Provides grassroots training to people, communities and Language Centres in Australia doing language work on the ground in remote, regional and urban areas. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:12","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Nyingarn A platform for primary sources in Australian Indigenous languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:13","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Languages of the World ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:0","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Ethnologue Provides information about the languages of the world, but operates on a subscription model. 
The information that is available without a subscription is very limited: the first three lines of individual language entries which include the ISO 639-3 code, the classification of the language into a language family, and a link to the language’s OLAC page. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:1","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Glottolog Comprehensive reference information for the world’s languages, especially the lesser-known languages. LDaCA uses Glottolog language codes in our metadata. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:2","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Open Language Archives Community (OLAC) An international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. OLAC harvests metadata and their website has a search facility to find resources for languages. OLAC metadata recommendations are the basis for some of LDaCA’s metadata. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:3","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"World Atlas of Linguistic Structures Online (WALS) A large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:4","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Further information relating to Indigenous research.","date":"2023-11-06","objectID":"/resources/general-resources/indigenous-knowledge/","tags":null,"title":"Indigenous Knowledge","uri":"/resources/general-resources/indigenous-knowledge/"},{"categories":null,"content":" Protocols for using First Nations Cultural and Intellectual Property in the Arts (Australia Council). Citing Indigenous elders and knowledge keepers: Templates For Citing Indigenous Elders and Knowledge Keepers (Lorisia MacLeod) University of Alberta Library’s guide University of Alberta Library’s blog TK Labels: Traditional Knowledge labels are an initiative for Indigenous communities and local organisations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with already existing community rules, governance and protocols for using, sharing and circulating knowledge and data. Decolonising linguistics: Spinning a better yarn Rawlings, V, Flexner, J L, Riley, L (eds.) (2021) Community-Led Research: Walking New Pathways Together. Sydney University Press The Linguistics and Applied Linguistics Indigenous Curriculum Resource Project (The University of Melbourne) This report, prepared by Sasha Wilmoth, includes a bibliography of linguistic work by Australian Indigenous authors. 
","date":"2023-11-06","objectID":"/resources/general-resources/indigenous-knowledge/:0:0","tags":null,"title":"Indigenous Knowledge","uri":"/resources/general-resources/indigenous-knowledge/"},{"categories":null,"content":"A collection of digital tools useful for language work.","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Audacity A free, easy-to-use, multi-track audio editor and recorder for Windows, macOS, GNU/Linux and other operating systems. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:1:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"ELAN A tool for making time-aligned annotations on audio and video recordings. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:2:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Praat Free software for doing phonetics by computer. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:3:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"FieldWorks Software tools for managing linguistic and cultural data. FieldWorks supports tasks ranging from the initial entry of collected data through to the preparation of data for publication, including dictionary development, interlinearization of texts, morphological analysis, and other publications. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:4:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Toolbox A data management and analysis tool for field linguists. 
It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:5:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"AntConc A freeware corpus analysis toolkit for concordancing and text analysis. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:6:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Browse the project's upcoming events.","date":"2023-10-18","objectID":"/news/events/","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":" Recurring Events Previous Events Webinars ","date":"2023-10-18","objectID":"/news/events/:0:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Indigenous Data Governance: A discussion Part of The University of Queensland Research and Innovation Week 2024 The University of Queensland is committed to growing the opportunities for Indigenous people to lead and conduct their own research. One important strand in developing this innovative research culture is ensuring that appropriate governance is in place for Indigenous data. This event will present a panel of Indigenous researchers, data custodians and data stewards discussing current developments in this field including: the importance of Indigenous Cultural and Intellectual Property concerns governance of Indigenous Data Framework for the HASS \u0026 Indigenous Research Data Commons tools and methods for data governance. Panellists: Rose Barrowcliffe, Robert McLellan, Lesley Acres. The discussion will be facilitated by Grant Sarra. 
When: 4–6pm, 30 September 2024 Where: Anthropology Museum, Michie Building, The University of Queensland (St Lucia Campus) Details ","date":"2023-10-18","objectID":"/news/events/:0:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Data Migration Skills Workshop The Language Data Commons of Australia infrastructure is based on widely accepted standards such as Research Object Crates and the Oxford Common File Layout. The project has been running for more than three years and the processes and tools for aligning data with these standards are now well developed. This workshop aims to show the application of those tools to data in a variety of formats to efficiently migrate material to the LDaCA standards. The hands-on workshop is intended for data librarians and other professionals who are interested in these issues, but participation is open to anyone who is prepared to work on language data. Participants will have some experience of working with code and/or metadata and will commit to bringing datasets which they work with (or are responsible for) to use in the practical exercises which will make up a large part of the workshop. We will also have example datasets available, and we are open to participants working in teams based on complementary skills. Leader: Peter Sefton When: 3–5 September 2024 Where: Engma Room (Room 3.165), Coombs Building, Australian National University Further information: If you are interested in participating in this workshop, please contact Peter Sefton ","date":"2023-10-18","objectID":"/news/events/:0:2","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Recurring Events ","date":"2023-10-18","objectID":"/news/events/:1:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"RO-Crate Clinic Drop-in The RO-Crate community run a weekly drop-in call in Australia. For further information contact Peter Sefton. 
When: Weekly, Thursday 14:00 AEST Where: Online via Zoom ","date":"2023-10-18","objectID":"/news/events/:1:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Previous Events ","date":"2023-10-18","objectID":"/news/events/:2:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Previous Workshops If your university or organisation would like to host a workshop, please contact us. ","date":"2023-10-18","objectID":"/news/events/:2:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Webinars Our webinar series was a joint initiative with the Language Technology and Data Analysis Laboratory (LADAL) at the School of Languages and Cultures, The University of Queensland. ","date":"2023-10-18","objectID":"/news/events/:2:2","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Australian Digital Observatory Online Office Hours LDaCA has previously run regular online office hours, jointly hosted with the Australian Digital Observatory (ADO). This activity will not continue in 2024, but LDaCA and ADO are developing an alternative model to help Australian researchers working with linguistics, text analytics, digital and computational methods, social media and web archives. In the meantime, you are welcome to contact us by email at ldaca@uq.edu.au with your technical questions, research problems and rough ideas to get advice and feedback from the combined expertise of our ARDC research infrastructure projects. No question is too small, and even if we don’t know the answer we are likely to be able to point you to someone who does. 
","date":"2023-10-18","objectID":"/news/events/:2:3","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Defines policies, roles, responsibilities and procedures for ongoing use and storage of data, as well as for access to data.","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":" Purpose 1. Access Conditions 1.1 Legal, moral and ethical constraints 1.2 Licensing 2. Persistent Identifiers 3. Metadata 4. Appendix: Determining Copyright ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:0:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"Purpose Data governance defines policies, roles, responsibilities and procedures for ongoing use and storage of data, as well as for access to data. Effective data governance maximises sustainability, while ensuring data integrity and protecting research participants. Long-term sustainability requires a data management plan. This document provides guidance on some components of data governance that are key to a data management plan, including access conditions, licensing, persistent identifiers and metadata. The principles below are those employed by LDaCA, but they are widely applicable and represent best practice for data governance, in accordance with the FAIR and CARE principles. Questions for reflection: In each section, you will find a thought bubble marking some questions for reflection that will help you start to explore these data governance topics. This content is designed as guidance for Data Stewards considering how to manage their data into the future. 
","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:1:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1. Access Conditions Access conditions refer to who can access data and what use is permitted. Defining specific conditions for access supports data reusability and the advancement of the scientific endeavour; it also protects the data from misuse. To determine access conditions, the Data Steward must: understand the legal, moral and ethical constraints to sharing data, and prepare a license outlining access conditions. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1.1 Legal, moral and ethical constraints Legal constraints In Australia, as in many other countries, research data is recognised as Intellectual Property that can be protected under legal mechanisms such as copyright. When considering how data can be shared and accessed by others, it is important to consider these legal constraints. Copyright protects expressions of ideas in works such as books, music, paintings, films, and performative acts such as speech, sign and gesture, etc., and therefore also data collections. The creator of the work is known as the copyright owner. Copyright provides two types of rights: Economic rights: The owner has the exclusive right to reproduce, publish, perform, communicate, and adapt or modify their work, for both commercial and non-commercial purposes. This right can be transferred or shared with others via assignment or licensing. Moral rights: The work must be correctly attributed and not treated in a derogatory manner. This protects the integrity of the work. 
Moral rights cannot be transferred or shared. Questions for reflection: How can I identify the copyright owner of a language collection? Unlike trademarks and patents, copyright doesn’t require registration and there are no official records that can be searched to identify a copyright owner in Australia. Additionally, the creator of material is not necessarily the copyright owner and copyright may also be jointly owned. Copyright ownership is determined according to a set of complex rules set out in the Copyright Act and its amendments. It is important to review the law in detail to ensure the correct owner has been identified. Legal advice should be sought where the copyright owner cannot be clearly identified. If the copyright owner has died, the copyright is usually passed on to that person’s spouse or children. (See more information at the end of this section.) The copyright owner of a language collection can be identified by considering the following questions: Question Further Information Does the collection comprise materials all collected under the same conditions (e.g. as part of the same research project)? (Yes / No) If the collection includes material from a third party, the copyright owner and copyright status should be identified for each subset of the collection. Has the copyright owner been determined by a contract, formal agreement, or other relevant document? (Yes / No) In some cases, existing contracts or agreements will assign copyright ownership in advance. This will take precedence over the rules set out in the Copyright Act. What type of material is included in the collection? (Textual work / Musical work / Dramatic work / Computer program / Artistic work / Film or video / Sound recordings) Generally, the author of a textual work, musical work, dramatic work, computer program or artistic work (i.e. the person who created the work) is the first owner of copyright. 
However, the general rules for films, videos and sound recordings are different. For sound recordings, copyright is owned by the ‘maker’ of the sound recording, typically the person or company who owns the recording equipment. For film and video, copyright is owned by the person or company which made the arrangements for the creation of the work. In the academic context, this is typically the university where the research was conducted. Was the material created by an employee in the course of their employment? (Yes/No) When the work was created by an employee as part of their usual work duties, the employer is the copyright owner (unless there is a specific employment agreement that specifies otherwise). Is the work a performance? (Yes/No) Performers’ rights apply to live performances including dramatic works, musical works, dances, circus acts, expressions of folklore, readings, and recitations of existing or improvised literary works recorded or filmed with or without an audience. Permission must be sought to record the live performance, and to broadcast and distribute recordings. As of 1 January 200","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:1","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1.2 Licensing Documenting access conditions is key to ensuring appropriate use of the data over time. Transparency and clarity surrounding access conditions also support the sustainability of the data and reduce the need for the Data Steward to be available to communicate or enforce the access conditions. While access conditions can be documented internally in a data management plan, a common and useful mechanism for documenting and managing access is via licensing. 
Licensing allows the copyright owner to share the right to access and use the data without forfeiting or transferring the ownership of the copyright of the work. The license sets out the conditions for who can access the data, how it can be used, and what other conditions are required. While licensing is a legal mechanism, it can also be used to uphold other conditions as determined by the Data Steward. Questions for reflection: Is there an existing data license? Question Further Information Is there an existing license outlining the access conditions?YesNo If the collection has already been made available, a license may have already been prepared. Avoid duplicating previous work by checking this first. Although it is best practice to have a single license attached to a collection, multiple licenses can exist as long as they are non-exclusive. Questions for reflection: What information needs to be included in the license? Key parts of a data license Image Source: LDaCA Question Further Information What are the parties relevant to this license? Who is the author of the material? Who is the copyright owner? Is access managed by a Data Steward? Which individuals or groups are permitted access to the material? Who is the Licensor (copyright owner) and Licensee? What materials are covered by the license? Does this license cover the entire collection or a subset of the material? Is this an exclusive or non-exclusive license? Does this license grant an individual or group exclusive rights to use and share the material? If yes, this is an exclusive license. Describe the rights that are being transferred to the licensee. What rights are being transferred? Can the material be modified? For what purpose? Can the material be shared? Under what conditions? Does this collection include sensitive information? Does the material include personal information? What is the protocol for ensuring participant privacy? What are the responsibilities of the licensee? 
What are the requirements surrounding citation? How should the material be correctly attributed? What is the suggested citation for the collection? Under which region is the license legally binding? Is the license limited to a specific geographical region? Other considerations How long will the license be valid? Does it expire after a specific period of time? Is the licensee required to pay a fee? How is the fee processed? Under what conditions can the license be terminated? How will disputes be resolved? While in some cases it may be necessary to write a custom data license, for other collections it may be possible to apply an already-existing license, such as a Creative Commons license. Creative Commons provides a useful option for promoting data reusability while protecting key access conditions. The Creative Commons licenses combine four main elements in different ways: BY - Attribution: This is a requirement for all Creative Commons licenses. The original creator must always be attributed. SA - Share-Alike: Data can only be shared under identical license terms. NC - Non-Commercial: Only non-commercial use of the data is permitted. ND - No Derivatives: Material can be copied and distributed in the original form only and cannot be adapted. A copyright owner may also choose to waive their rights and make data available in the public domain before the duration of the copyright has expired. Creative Commons provides a relevant tool for documenting this decision: CC0: A copyright owner opts out of copyright and waives their ex","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:2","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"2. Persistent Identifiers Persistent identifiers (PIDs) are digital identifiers that are permanently assigned to physical and digital objects. 
In contrast to other identifiers that are used online, such as URLs, PIDs are persistent, meaning they point to reliable information in the long term. They have become crucial for research data as they make a collection citable; ensure that it is findable even if moved to a different location; and establish its relationship to other objects and entities in the academic research environment (e.g. researchers, funders, organisations, academic publications, software and other datasets). While various PID systems have been in use for research data over the last 25 years, including ISBN and ORCiD, the DOI System has been the most widely used globally to date and is becoming the default identifier for research datasets. A DOI is a unique number made up of a prefix and a suffix separated by a forward slash. It is resolvable by displaying it as a link: https://doi.org/10.1000/182. An example of a DOI for a data collection (obtained via the university library) can be viewed here: Sydney Speaks DOI landing page. Questions for reflection: Who can get a DOI? How? Question Further Information Does the collection have an existing PID?YesNo Investigate whether the collection already has an existing PID as this is sometimes automatically generated when the collection is listed or archived with an archive or library. Who can generate a DOI for the collection? The most common DOI minting services are universities, research organisations, research libraries and research repositories. You can make enquiries with your university library or the research repository of the organisation that supported the data collection. Find out more about PIDs: Klump, J. and Huber, R., 2017. 20 Years of Persistent Identifiers – Which Systems are Here to Stay?. Data Science Journal, 16, p.9. 
DOI: http://doi.org/10.5334/dsj-2017-009 ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:3:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"3. Metadata Metadata is data about data: information that describes the data collection as a whole; provides the context and conditions under which the data was collected and can be stored, shared and used; records characteristics such as the format, duration or size of the data making up the collection; and includes socio-demographic details of participants. Standardised metadata allows data to be more easily found and understood, and to be compared and grouped with other similar objects. Metadata is often managed in two different ways: A standard metadata vocabulary (such as Dublin Core (DC), Darwin Core and Metadata Object Description Schema (MODS)). These vocabularies set out a limited list of metadata terms that can be used across disciplines with the aim of promoting a shared metadata framework. A customised metadata strategy, unique to a particular project to meet specific needs. Where a customised metadata strategy is used, metadata terms should be clearly defined so as to facilitate comprehension of the metadata and to avoid misunderstandings or multiple interpretations. Customised metadata terms can be mapped onto existing vocabularies in order to organise metadata in a manner aligned to a standard. Questions for reflection: What does it mean to apply standards to metadata? Question Further Information Does the collection use metadata terms from an existing metadata vocabulary? (Yes/No) If the collection uses customised metadata terms (i.e. not an existing metadata vocabulary), consider mapping the relationships between the custom system and an existing metadata vocabulary. 
This will facilitate findability as the metadata aligns with standards used in the research community. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:4:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"4. Appendix: Determining Copyright Information is provided as guidance only; legal advice should be sought. Select the correct table for works and subject matter other than works. Works: Step 1: Is the author known? Author of the work is known* Step 2: When was the material first made public? Work has not been made public Work has been made public Step 3: Was the material first made public before or after the author died? N/A Work was made public before the author died Work was not made public before the author died Step 4: Was the material first made public before or on/after 1 January 2019? N/A N/A Work was made public with the author’s permission before 1 January 2019 Work was not made public with the author’s permission before 1 January 2019 Copyright duration Date author died + 70 years. Date author died + 70 years. Date the material was first made public + 70 years. Date author died + 70 years. *Different copyright duration applies to works for which the author is unknown. Subject matter other than works: sound recordings, films and videos Step 1: When was the material created? Created before 1 January 2019 Created on or after 1 January 2019 Step 2: When was the material first made public? Not made public First made public before 1 Jan 2019 First made public on or after 1 Jan 2019 and within 50 years of the date of creation First made public on or after 1 Jan 2019 more than 50 years after the date of creation Made public within 50 years of the date of creation Made public more than 50 years after the date of creation Copyright duration Date the material was created + 70 years. 
Date the material was first made public + 70 years. Date the material was first made public + 70 years. Date the material was created + 70 years. Date the material was first made public + 50 years. Date the material was created + 50 years. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:5:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":["LDaCA"],"content":"The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions: What is your role in these projects? (What do you/your team do as part of your participation?) What excites you most about the projects? (What motivates you to participate?) What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? This blog post features Catherine Travis (CT), Nicholas Evans (NE), and Monika Bednarek (MB). Nick was a Chief Investigator in one of the first ARDC-funded projects for LDaCA under the Australian Data Partnerships program. Although he is not a CI on our current project, Nick remains a friend and supporter of LDaCA. The interview was undertaken via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance in undertaking the interviews and creating these blog posts. 
","date":"2023-10-18","objectID":"/news/posts/interview-2/:0:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What is your role in these projects? CT: Our main goal at the Australian National University is to grow the set of collections that will be incorporated into LDaCA with a particular focus on Australian English and migrant languages. This involves identifying relevant collections, which is a bit of a ‘language dig’, as these are diverse, dispersed, and often quite hidden away. Many have restrictions around data sharing, so we work closely with data stewards to set up the appropriate access and licensing conditions. As well as this, we are developing tools that can enhance the usability of both legacy corpora and newer collections, including tools for aligning transcripts that exist in different formats with their corresponding audio, standardising orthography, streamlining the anonymisation process, and so on. Right now, most of what we know about Australian English comes from middle-class Australians of Anglo-Celtic background living in major urban centres. A better representation of Australian society in our language collections will broaden the scope of the research that can be done, allow new questions to be asked, and provide a better picture of Australia overall. NE: My role is to keep our foot on the gas for the huge job of getting a continued flow of new corpus data, as well as digitised legacy data, mostly for the languages of New Guinea and the Pacific. MB: I’m the Academic Lead for these projects at the University of Sydney. In this role, I manage the projects overall and work closely with staff from the Sydney Corpus Lab and the Sydney Informatics Hub. We develop text analytics tools and training for analysis of text/language, including but not limited to tools for linguistic and discourse analysis. 
A team at PARADISEC is also involved and contributes to work packages around language collections for Indigenous languages of Australia and the Pacific. My role as director of the Sydney Corpus Lab is crucial for these projects. The lab’s mission is to build research capacity in corpus linguistics at the University of Sydney, to connect Australian corpus linguists, and to promote the method in Australia, both in linguistics and in other disciplines. We organise relevant events on corpus linguistics and text analytics, including guest lectures and workshops, and create resources such as corpora, blog posts, video playlists, and curated introductions in different languages. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:1:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What excites you most about the projects? CT: There is a wide range of collections out there that represent an absolute treasure trove of knowledge not only for language in Australia, but also for Australian society, culture, and history. For example, oral histories or ethnographic interviews can provide invaluable linguistic data, and likewise, sociolinguistic interviews may contain invaluable information for historians, sociologists, or anthropologists. But because there is no way to know what collections exist, what is in them, and what is accessible, this treasure trove is underutilised. With many decades of data collection behind us and with recent technological developments, now is the perfect time to open this up to create a language data commons. It is exciting to think of the opportunities that this presents for us to cross disciplinary boundaries, and expand our understanding of Australia’s social, cultural, and linguistic history. NE: We all know that most of the world’s languages are under-resourced, lacking even a grammar or dictionary. 
A dream I have – which this project will move us very slightly toward – is that every one of the world’s 7,000 languages might one day have a vast library of corpus data, comparable to the 60 million words we have for Classical Greek, for example. Each language and culture deserves its own vast library wing. To give an idea of the scale of the challenge, so far in our part of the world there are only three languages (Bislama, Gurindji Kriol, and Ku Waru) where we have even one million words. MB: To work across disciplines, to collaborate with data scientists, and to learn many new things myself! And to discover the value that we bring to such collaborations as domain experts in the Humanities and Social Sciences. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:2:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? CT: Given the number of existing language collections, one piece of advice I would give to someone who wants to get started with data collection would be to learn what is already out there prior to beginning new data collection, and to think about how any new data you collect may contribute to existing collections or research projects, or alternatively, how existing collections might shape the kind of new data you might collect or the kind of research you might do. It is standard practice to contextualise the research that we do within the field; we are at a point where we should be doing the same with data collection. It is through our cumulative efforts, taking advantage of the work that has preceded us and building on that, that we will most advance in our scientific endeavours. NE: Find good ways of getting sensitive and accurate commentaries on the meaning of materials. 
We need the equivalent of Biblical or Talmudic commentaries for the digital age – metatexts that comment on what other texts mean. MB: Just give it a go and don’t be daunted! Go to workshops or summer schools, and avail yourself of other free or low-cost opportunities. Be (and remain) critical in your use of tools, and remember the value of your own knowledge and expertise. Discover the joy of mastering a new tool/technique or of knowing enough to find out that it is not for you or for your research project. If you find it useful, practice and also keep good notes so that you can return to the tool later and still know what to do. Always be transparent in how you use the tool/technique and make sure you know why you are using it. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:3:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"Acknowledgments The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ","date":"2023-10-18","objectID":"/news/posts/interview-2/:3:1","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":null,"content":"Outlines the standards and processes that support the onboarding of data collections to LDaCA.","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":" Purpose Onboarding Process 1. Initiating contact between LDaCA and Data Stewards 2. Work Plan 3. Determining appropriate governance and access conditions 4. 
Preparing data for cataloguing in LDaCA 5. Ongoing engagement ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:0:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Purpose This document outlines the standards and processes that support the onboarding of language collections to LDaCA. A successful onboarding will result in the collection being catalogued with LDaCA and made available in accordance with the specified access conditions. To determine if you are ready to onboard a collection, check this data management and governance capacity flowchart. Data Management and Governance Capacity Image Source: LDaCA ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:1:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Onboarding Process The LDaCA onboarding process is carried out collaboratively between LDaCA and the Data Steward. There are five broad steps to be followed; while the order of these steps is flexible, each step should be appropriately covered. Initiating contact between LDaCA and Data Stewards. Establishing a Work Plan. Determining the data governance and access conditions appropriate for this collection, and applying that in LDaCA. Preparing data for onboarding to LDaCA. Facilitating ongoing engagement. These five steps are covered in detail below, along with the responsibilities of LDaCA and the Data Steward for each step, and associated resources. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"1. 
Initiating contact between LDaCA and Data Stewards Data Stewards wishing to onboard data to LDaCA may approach LDaCA about this, and LDaCA will also seek to identify appropriate collections for onboarding. LDaCA will gather relevant publicly available information, and record that information in a collection onboarding database. This includes information about the current location, significance and description of the collection; size, format and data structure; existing metadata; persistent identifiers (if any) and contact person. Contact between LDaCA and the Data Steward is initiated with the aim of identifying the appropriate Data Steward, establishing a collaborative relationship, providing information about LDaCA, and discussing shared goals for the onboarding of the collection to LDaCA. LDaCA will: Respond to requests to onboard data to LDaCA and/or research data collections and approach contact people about onboarding data to LDaCA. Provide information about LDaCA. Confirm the Data Steward. Help to determine if the data collection is ready for onboarding. Discuss needs and goals for onboarding to LDaCA. Additional resources relevant to this step: Initial Contact Checklist ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:1","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"2. Work Plan Once communication between LDaCA and the Data Steward is in place, a Work Plan is prepared in order to establish the terms according to which the data will be onboarded to LDaCA, including the goals and responsibilities of each party, and the steps and timeline for carrying out the onboarding process. At this stage, it is important to share key information and resources about the LDaCA project and manage expectations. LDaCA will: Provide the Work Plan template for onboarding. Collaborate with the Data Steward on the development of the Work Plan. 
Provide further information about LDaCA, its services and limitations. Data Steward will: Confirm their role as the Data Steward. Collaborate with LDaCA on the development of the Work Plan. Additional resources relevant to this step: Work Plan template Terms of Use ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:2","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"3. Determining appropriate governance and access conditions LDaCA will collaborate with the Data Steward to (i) understand legal, ethical and moral considerations around data accessibility, and (ii) map the access conditions and how these should be implemented. At this stage, the Data Steward must prepare a data access license that clearly outlines the conditions for data access, sharing and reuse. LDaCA makes available varying levels of access, from open access to restricted access. A persistent identifier for the collection must also be provided to support sustainable data management practices. LDaCA will: Provide guidance and support with regards to data governance, persistent identifiers, determining access conditions and licensing. Data Steward will: Review the legal, ethical and moral conditions for providing access to the collection, including the identification of existing data access licenses if applicable. Ensure that any agreed-on restrictions with funding bodies and participants are strictly adhered to. Identify the copyright holder and confirm the copyright status. Provide or develop data access licenses that outline the specific conditions for access to the collection, including citation information. Provide a persistent identifier for the collection (obtaining a DOI if necessary). 
Additional resources relevant to this step: Access Policy Guidance for Data Governance Decisions Determining Access Conditions Obtaining a DOI ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:3","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"4. Preparing data for cataloguing in LDaCA LDaCA coordinates with the Data Steward to apply standards to the collection to support its long-term sustainability. The standards applied allow the data to be packaged in RO-Crates, the metadata to be mapped onto LDaCA’s metadata schema, and additional metadata to be added where possible/required. LDaCA will: Prepare the data for packaging as RO-Crates. Support the metadata mapping process. Data Steward will: Provide a copy of the collection data. Support the RO-Crate packaging by applying standards to the data and reviewing the packaged data. Provide metadata and collaborate with metadata mapping. Additional resources relevant to this step: Information on Metadata Information on RO-Crate ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:4","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"5. Ongoing engagement Finally, we implement ongoing engagement by outlining shared responsibilities and processes in order to ensure that access requests, system updates and user feedback are responded to appropriately by LDaCA and Data Stewards. This is key for collections that require access approval on a case-by-case basis and for collections that may introduce data updates, such as additional data, transcription or annotation, or edits to existing data, transcription or annotation. LDaCA will: Incorporate updates to the data collection provided by the Data Steward. Make the Data Steward aware of any takedown requests. 
Update the Work Plan as relevant, with responsibilities for ongoing maintenance of the collection in LDaCA. Data Steward will: Make LDaCA aware of any updates to the data collection they wish to incorporate, and work with LDaCA to incorporate them. Review any takedown requests, make a decision about how to address them, and inform LDaCA of that decision. Update the Work Plan as relevant, with responsibilities for ongoing maintenance of the collection in LDaCA. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:5","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Outlines the LDaCA Access Policy, making data appropriately accessible in accordance with legal, moral and ethical considerations of data sharing.","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":" Purpose 1. Access Types and Licensing Classification of Access Types 2. Onboarding Data to LDaCA LDaCA Responsibilities Data Steward Responsibilities 3. Ongoing Data Management Strategy 4. Supporting Information LDaCA Terms of Use Privacy Policy Takedown Policy ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:0:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Purpose This document outlines the LDaCA Access Policy, which is developed to accommodate the goal of making data appropriately accessible, in accordance with legal, moral and ethical considerations of data sharing, and tailored to meet the needs and requirements of different data collections. Access conditions for data in LDaCA are determined by the Data Steward. 
To assist with this, LDaCA provides: information to help Data Stewards make informed decisions about appropriate access conditions for the data; tools and resources to facilitate this process as collections are onboarded and made available (including for applying standards as required); and mechanisms for managing these once data is onboarded. The Access Policy comprises three key components, as outlined in this document: Access types and licensing Onboarding data to LDaCA Ongoing data management ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:1:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"1. Access Types and Licensing The foundation of the LDaCA Access Policy is licensing, i.e. the action of setting out conditions for accessing and using data in a license that is attached to all data in our systems. Access conditions are defined by the Data Steward for each collection, and they may restrict access as little or as much as desired, from open access, to an automated click-through license, to case-by-case approval prior to gaining access to the data. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:2:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Classification of Access Types Type of Access Description Basic Process Open access Data is openly available under a Creative Commons license or similar (including for data in the public domain). The user can read the license and directly access the data. Authorization Required Different levels of implementation: click-through license; by invitation; by application (case-by-case approval by Data Steward). Data is available with some restrictive conditions of use, as outlined in the license. Users must authenticate their identity to access these materials. 
In some cases, access to the license is limited to specific users, who can be defined by invitation and/or application (note that these are not mutually exclusive – either or both options may be implemented). Such authorisation requires ongoing engagement from the Data Steward in order to manage access lists and approve/decline access requests. A number of levels of authorization can be implemented depending on the access conditions:License acceptance: the user must read and accept the license (click-through) before accessing the data.By invitation: the Data Steward invites users to access and accept the license. Only invited users can gain access.By application: specific conditions apply, and the user must provide extra information to gain access, which they do through a linked application form which must be reviewed and approved by the Data Steward before access is granted. Approval may be dependent on, for example, the community or academic affiliation of the applicant; the specific use of the data requested; payment of a fee; or other. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:2:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"2. Onboarding Data to LDaCA In accordance with the Access Policy, LDaCA has an established process for onboarding to LDaCA. Throughout this process, the LDaCA team will work with the Data Steward to develop an effective strategy for managing access, in accordance with relevant legal, moral and ethical considerations applying to that data. Read more about our data onboarding process. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"LDaCA Responsibilities Supporting the Data Steward to standardise aspects of the language collection as required for successful onboarding, including with regards to metadata, data governance and data preparation. Adapting the onboarding process as relevant to the specific needs and requirements of the collection and Data Steward, and working to facilitate a successful and efficient onboarding process. Providing clear information to the Data Steward to ensure comprehension of the purpose of each step in the onboarding process, and responding to questions and concerns. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Data Steward Responsibilities Providing a persistent identifier for the data. If the collection does not have an existing persistent identifier, LDaCA recommends getting a DOI (Digital Object Identifier), which is becoming the default identifier for research datasets. A DOI makes a collection citable; it ensures that it is findable even if moved to a different location; and it establishes its relationship to other objects and entities in the academic research environment (e.g. researchers, funders, organisations, academic publications, software and other datasets). See Obtaining a DOI for more information. Providing metadata that is organised with a consistent structure, and includes descriptors, definitions and contextual information where relevant. This includes metadata at the level of the collection (e.g. collection name, a narrative description of the corpus, the subject language(s), author(s) or Collector(s), publication year, access conditions, etc.), file (e.g. 
length in minutes, words) and participants (e.g. age, gender). LDaCA technologies enable programmatic detection, extraction and summarisation of existing metadata in a dataset; in order for this to work effectively, the metadata must be standardised. Metadata are mapped to open schemas including the Open Language Archives Community (OLAC) vocabularies for describing language data and Schema.org for generic descriptions (e.g. Name, Identifier, Description). This approach allows many metadata terms to be linked to openly available and widely used definitions. Some metadata may be specific to a corpus or difficult to map to existing vocabulary terms. LDaCA makes use of the Language Data Ontology, which has been developed in consultation with OLAC and metadata specialists, to ensure consistency across terms. See Metadata for more information. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:2","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"3. Ongoing Data Management Strategy A strategy for ongoing access and data management of the collection in LDaCA must be developed in collaboration with the Data Steward. Responsibilities and processes must be clearly outlined specifying that Data Stewards will respond appropriately to access requests, system updates and user feedback. This is key for collections that require access approval on a case-by-case basis and for those collections that may introduce data updates, such as additional data, edits to existing data, transcription, or annotation etc. Both LDaCA and the Data Steward are jointly responsible for ongoing maintenance of the collection in LDaCA, and updating the Work Plan as needed. Read more about our ongoing engagement. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:4:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"4. 
Supporting Information The Access Policy is supported by the LDaCA Terms of Use, Privacy Policy, and Takedown Policy. These are summarised below; please go to the links to view the actual policies online. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"LDaCA Terms of Use The LDaCA Terms of Use provide an overarching agreement for all users who interact with the platform in adherence with general legal, ethical and moral standards. The key points included in the Terms of Use are: Users must adhere to conditions set by the Data Steward. Users must exercise ethical standards, paying particular attention to potential issues around sensitive data and participant privacy. Users must uphold the moral rights of data owners. Publicly available metadata in the LDaCA catalogue are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Data is provided ‘as is’ and LDaCA provides no guarantee to the accuracy or completeness of the material. LDaCA may change the site and suspend or cancel user access without notice. LDaCA is not responsible for any breaches of legal, moral or ethical standards by users. The Terms of Use can be viewed here. LDaCA Terms of Use: Key Points Image Source: LDaCA ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Privacy Policy The LDaCA Privacy Policy describes what information is collected about users and how personal information is managed by LDaCA. Information collected by LDaCA may include home organisation, social login provider and personal data such as name, email address, affiliation, and information about how a user accesses and uses LDaCA. 
This information is necessary to facilitate the functioning of the site and is required to provide access to restricted language collections. The LDaCA Privacy Policy outlines the limitations of the use of personal information and the measures taken to protect user privacy. The policy document is currently under development. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:2","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Takedown Policy The LDaCA Takedown Policy outlines the steps to be taken by users in making a request for data to be removed, or access to be adjusted in some way. LDaCA recognises that licensing and decisions surrounding access to data are made by the Data Steward. Therefore, the Data Steward is also responsible for assessing and determining the outcome of takedown requests. The Takedown Request mechanism supports FAIR and CARE Principles by facilitating a process by which data access may be questioned and discussed by relevant stakeholders. The policy document is currently under development. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:3","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Defines the workflow for determining the access conditions for a data collection, to be outlined in the license.","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":" Step 1: Consider access condition(s) for the entire collection Step 2: Organise relevant materials Step 3: Define the license terms Step 4: Determine authorisation level Options for access by application LDaCA recommends the following workflow to determine the access conditions to be outlined in the license. 
Examples of existing data access licenses can be found here. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:0:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 1: Consider access condition(s) for the entire collection Do the same access conditions apply to the entire collection? If yes → two licenses can be defined (one for metadata, one for the collection data). Metadata is licensed with an open license in order to allow the user to find the collection through the LDaCA catalogue. If no → all of the ways the data can be accessed must be defined, with a license for each subset. Figure 1: Access Conditions Flowchart Image Source: LDaCA Name the license(s). Complete the following steps 2-4 for each license. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:1:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 2: Organise relevant materials What material can be accessed with this license? The material can be organised in different ways: Metadata vs. data Formats, e.g. transcriptions vs. audio List of files/speakers (based on a shared quality such as sensitive material, transcribed or non-anonymised material) ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:2:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 3: Define the license terms What conditions are included in the license? Can data be shared? Can data be modified? What is the license duration? Provide a brief description of the license conditions (1-2 lines). List the correct citation for this material. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:3:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 4: Determine authorisation level What authorisation is required for users to access this license? If none → all users can accept the click-through license and access the material. If by invitation → only users invited by the Data Steward can accept the license and access the data. If by application → arrange a form or a document with requirements for users to fill in. Figure 2: Access Conditions Flowchart - Authorisation Level Image Source: LDaCA This information can be mapped out visually or listed in an Excel sheet with the following basic structure: License Step 1: Determine the number of separate licenses needed, and name the license Step 2: Define the material Step 3: List the license terms Step 4: Describe the authorisation process, if applicable 1 2 ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:4:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Options for access by application Access by application can be managed in a number of formats: Integrated forms, e.g. set three key questions for users requesting access to the collection. PDF application form, i.e. provide a dynamic PDF form that the user can download, complete and re-upload. Document upload request, e.g. request a copy of the user’s research ethics approval or proof of affiliation. Link to external service, e.g. Google Forms or a payment service. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:5:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Explore our glossary of key terms and concepts related to LDaCA and the language data field.","date":"2023-10-03","objectID":"/resources/glossary/","tags":null,"title":"Glossary","uri":"/resources/glossary/"},{"categories":null,"content":" Access Conditions Conditions which specify who can access data and what they can do with that data. A well-governed archival repository has mechanisms in place to administer and implement such conditions which will be specified on a data license. ADA Australian Data Archive. A national service for the collection and preservation of digital research data. More information ADM+S ARC Centre of Excellence for Automated Decision-Making and Society. It brings together universities, industry, government and the community to support the development of responsible, ethical and inclusive automated decision-making. More information ADO Australian Digital Observatory. An ARDC platform working to establish a national infrastructure to support a diverse array of researchers, especially in the humanities, in accessing and working with dynamic digital data. More information AIATSIS Australian Institute of Aboriginal and Torres Strait Islander Studies. Australia’s only national institution focused exclusively on the diverse history, culture and heritage of Aboriginal and Torres Strait Island Australia, with a growing collection of over one million items, dedicated to Australian Aboriginal and Torres Strait Islander cultures, histories and contemporary stories. More information API Application Programming Interface. A way for computer programs to communicate with each other. It is a way for one computer or system to ask another computer or system to do something, like provide a dataset. 
ARC Australian Research Council. Its purpose is to grow knowledge and innovation for the benefit of the Australian community through funding the highest quality research, assessing the quality, engagement and impact of research, and providing advice on research matters. More information Archival Repository A location for the storage of data that has an appropriate governance regime in place. ARCP Archive and Packaging ID. A globally unique, searchable ID with zero management overhead that can be used like a URL in linked data systems but does not resolve to content in a browser. ARDC Australian Research Data Commons. The ARDC is Australia’s leading research data infrastructure facility accelerating Australian research and innovation by driving excellence in the creation, analysis and retention of high-quality data assets. More information ARDS Aboriginal Resource and Development Services (ARDS Aboriginal Corporation). Its work champions the importance of language and culture in developing self-determination for Aboriginal people, and supports Aboriginal communities to increase control and understanding of mainstream services and systems. More information Arkisto A scalable, standards-based platform for sustainable data. Data on an Arkisto deployment is always available on disc (or object storage) with a complete description independently of any services such as websites or APIs. Once the data is safe and well-described, Arkisto has a flexible model for how data can be accessed using a variety of services. Built on top of RO-Crate and OCFL. More information See also: Oxford Common File Layout See also: RO-Crate ASR Automatic Speech Recognition. ASR enables computers to process human spoken language into readable text, allowing users to operate devices through speech or facilitate translation of that speech into other languages. ATAP Australian Text Analytics Platform. 
An open source environment that provides researchers with tools and training for analysing, processing and exploring text. More information AustLang Provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. Alphanumeric codes are used as persistent identifiers, while associated text strings are changeable and can reflect community preferences (including alternative names and spellings). In AustLang, Warlpiri has two codes: C15 for the ","date":"2023-10-03","objectID":"/resources/glossary/:0:0","tags":null,"title":"Glossary","uri":"/resources/glossary/"},{"categories":null,"content":"Outlines steps for acquiring a Digital Object Identifier (DOI) for a data collection.","date":"2023-09-15","objectID":"/resources/ldaca-resources/obtaining-a-doi/","tags":null,"title":"Obtaining a DOI","uri":"/resources/ldaca-resources/obtaining-a-doi/"},{"categories":null,"content":"How do I get a DOI for my dataset? Through the collection's custodial organisation: Identify the collection’s custodial organisation – a university or research organisation that funded, sponsored, published, approved, or has a current or historical stakeholder interest in the research project that collected the data. In many cases, contacting the university library about getting a DOI for a dataset is a good starting point. Alternatively, an enquiry may be made with the research repository of the custodial organisation regarding options for getting a DOI. An example of a collection DOI obtained via the university library is Sydney Speaks at the Australian National University (ANU), which can be viewed here: Sydney Speaks DOI landing page. The ANU Library administers and manages the landing pages for ANU DOIs. 
If the above does not suit the context of the dataset, and the dataset has not been published before, consider depositing it in a data repository like Zenodo, a multi-disciplinary open repository that automatically assigns a DOI to uploaded datasets and publications and hosts landing pages. Zenodo is a proponent of FAIR Principles and open science; however, it allows upload of data with different access levels: Open, Closed, Restricted, and Embargoed - read the Zenodo General Policies, Terms of Use, and FAQs. The Zenodo DOI landing page can contain a link to point researchers to the data content in LDaCA. If unsure whether any of the above applies to your context, please contact LDaCA to discuss PIDs for your data. ","date":"2023-09-15","objectID":"/resources/ldaca-resources/obtaining-a-doi/:1:0","tags":null,"title":"Obtaining a DOI","uri":"/resources/ldaca-resources/obtaining-a-doi/"},{"categories":["LDaCA"],"content":" Download as PDF This presentation was delivered by Peter Sefton at the Open Repositories 2023 conference in South Africa on 2023-06-14 in the Presentations: Discipline specific systems with FAIR principles session. This contains the slides and complete speaker notes, which have been edited after the conference. We will present a standards-based generalised architecture for large-scale data* repositories for research and preservation illustrated with real-world examples drawn from a number of languages and cultural archive projects. This work is taking place in the context of the Australian Humanities and Social Sciences Research Data Commons, particularly the Language Data component thereof and the long-established PARADISEC cultural archive. The standards used include the Oxford Common File Layout for storage, Research Object CRATE (RO-Crate) for consistent linked-data description of FAIR digital objects, and a language data metadata profile to ensure long-term interoperability between systems and re-usability over time. 
We also discuss data licensing and authorization for access to non-open resources. We suggest that the approach shown here may be used in other disciplines or for other kinds of digital library, repository or archival systems. *The submitted abstract did not have the word data here - added for clarity By: Peter Sefton (University of Queensland), Simon Musgrave (University of Queensland \u0026 Monash University) \u0026 Nick Thieberger (University of Melbourne) This work is supported by the Australian Research Data Commons. The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the HASS (Humanities and Social Sciences) and Indigenous Research Data Commons (HASS+I RDC). The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Cultures at the University of Queensland with several partner institutions. We would like to acknowledge the traditional custodians of the lands on which we live and work and the importance of Indigenous knowledge, culture and language to these projects. 
Peter Sefton lives and works on Wiradjuri land, and for Nick Thieberger and Simon Musgrave, it’s the land of the Kulin nation. PARADISEC is an online archive of cultural data which has been maintained for twenty years. In this presentation, we will look at some of the lessons learned from PARADISEC. In summary – the PARADISEC approach to simple data and metadata storage is something we want to continue in LDaCA, while the high cost for PARADISEC of commissioning and maintaining its own software stack is something we want to address by taking a more standards-based approach to managing language and other data over the coming decades. The Arkisto platform started in 2019 as a way to capture the lessons of PARADISEC and other projects such as Alveo (another language data project similar in scope to LDaCA) which was presented at OR 2014: Sefton PM, Estival D, Cassidy S, Burnham D, Berghold J. The Human Communication Science Virtual Lab (HCS vLab): A repository microclimate in a rapidly evol","date":"2023-06-29","objectID":"/news/posts/arkisto-stack-or-2023/:0:0","tags":["Access"],"title":"Towards a Generic Research Data Commons: A highly scalable standards-based repository framework for Language and other Humanities data\n","uri":"/news/posts/arkisto-stack-or-2023/"},{"categories":["LDaCA"],"content":" The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions: What is your role in these projects? (What do you/your team do as part of your participation?) What excites you most about the projects? (What motivates you to participate?) 
What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? This blog post features Louisa Willoughby (LW), Martin Schweinberger (MS) and Nick Thieberger (NT). The interview was undertaken via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance in undertaking the interviews and creating these blog posts. ","date":"2023-04-26","objectID":"/news/posts/interview-1/:0:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What is your role in these projects? LW: I’m the academic lead on the multimodal corpus. We’re building infrastructure for people to create video dictionaries that link to corpus examples (and vice versa), using Signbank and the Auslan corpus as our test case. MS: I am a Chief Investigator and steering committee member of the Australian Text Analytics (ATAP) Project and I am also a CI of the Language Data Commons of Australia (LDaCA). I am particularly focusing on the Language Technology and Data Analysis Laboratory (LADAL) which is part of ATAP and represents an infrastructure for computational text analytics in the humanities. LADAL has an outreach component as it organises webinars and workshops and it provides computational resources in the form of self-paced online tutorials as well as interactive notebooks that allow researchers to try out methods and apply them to their own data. I see my task as organising and managing activities around LADAL by creating and optimizing resources as well as getting involved in outreach via events and building (inter-)national partnerships and collaborations with other computational humanities infrastructures. NT: I am in charge of work packages 1.1 and 1.2, Indigenous languages of Australia and the Pacific. I worked at AIATSIS in the past and set up an Aboriginal language centre in Port Hedland (Wangka Maya). 
I have worked in Vanuatu, and did my PhD research on Nafsan, a language from Efate, sparking an interest in languages of the Pacific. The work of recording speakers of Nafsan also resulted in a corpus of time-aligned text and media, as well as historical manuscripts that I have been finding in archives around the world. This led me to be concerned that existing records of these languages be made accessible to current speakers and so I helped establish the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) which has many connections to cultural agencies in the Pacific. I currently lead a project (Nyingarn) to convert manuscripts in Australian Indigenous languages to text. ","date":"2023-04-26","objectID":"/news/posts/interview-1/:1:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What excites you most about the projects? LW: That it is both building an exciting tool for researchers to better understand language variation AND making a resource that is useful for Auslan students and members of the community. MS: One of the most fulfilling aspects of my work is the opportunity to collaborate with researchers from diverse communities and fields or to see how resources I created help people from very different backgrounds in their work. I do take pride in inspiring and enabling researchers to explore the potential of computational methods, whether they are working on projects related to the language sciences, health sciences, social sciences, arts, or beyond. As a quantitative linguist who came to computation as a computerphobe philosopher, I really enjoy creating visualizations that communicate data and results and that provide an intuitive understanding of complex data sets. I also take pleasure in showing others how to perform statistical analyses that help them make informed decisions based on their findings. 
By optimizing workflows and automating procedures, I help researchers to work more efficiently and effectively, enabling them to achieve their goals more quickly and easily. Through my work, I aim to create an environment where researchers from diverse backgrounds and specializations can benefit from the potential of computational tools, regardless of their field of study or level of experience. I value the opportunity to work collaboratively with researchers to find solutions to complex problems, and to share my expertise with others to promote innovation and growth within the field. Overall, I believe that the application of computational methods is an exciting area of research that has the potential to transform the way we approach scientific inquiry in the humanities and social sciences. By helping researchers to explore the possibilities of these tools, I hope to contribute to a more vibrant and innovative research community. NT: Making textual material in these languages available so that speakers can find records in their own languages. Often there is very little available information for these languages, so every record becomes all the more important, especially in the context of colonial dispossession where the languages are no longer spoken everyday and there are efforts to relearn the language. It is exciting that this work can become part of a national commons, and be supported into the future, so that more and more material can be included. The simple task of locating a language record, digitising it, and making the files available with appropriate licenses means that it can be used by speakers, and by researchers for various new purposes. 
","date":"2023-04-26","objectID":"/news/posts/interview-1/:2:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? LW: Jump in and start playing! There are lots of different tools and corpora online and I find it easiest to learn how to use them by just having a go and seeing what you can get out. It can be fun to look for little bits of variation or to see how common a certain word or phrase is, and doing something small will give you a sense of the kind of data you get out of the tools and how you might then incorporate them into a wider project. MS: If you’re just starting out with a new project, it’s essential to take it one step at a time. Start with a simple visualization or basic text processing task, such as concordancing, and build on it gradually. Don’t be intimidated by others who may have more experience or skills than you do. Instead, take the opportunity to learn from them and seek their guidance when necessary. It’s important to keep in mind that progress takes time, and comparing yourself to others can be discouraging. Instead, focus on comparing your current skills and accomplishments to your former self. Take pride in what you’ve created and the progress you’ve made. Celebrate your small wins and use them as motivation to keep moving forward. Another key to improving your skills is to seek feedback from others. Share your work with peers, mentors, or instructors who can provide constructive criticism and suggestions for improvement. Don’t be afraid to ask questions or seek clarification when you’re unsure of something. Remember that learning is a continuous process, and even the most experienced professionals have room for growth and improvement. 
Embrace the journey, enjoy the learning process, and stay motivated by setting achievable goals for yourself. By doing so, you’ll be well on your way to developing your skills and achieving your goals. NT: The first steps in language data collection involve recording speakers, with all appropriate permissions in place, and with a consent form signed by them. Making those recordings as well as possible, following recommendations, doing training and so on, and managing the resulting files so they are not lost. Learn the basics of text querying, searching, and regular expressions. Be aware of the ethical considerations in using material from someone else’s language. But just start now! (Further information to supplement these suggestions is available from our Resources page and from the ATAP pages introducing Text Analysis) ","date":"2023-04-26","objectID":"/news/posts/interview-1/:3:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"Acknowledgments The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). 
","date":"2023-04-26","objectID":"/news/posts/interview-1/:3:1","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":null,"content":"A guide to navigating the various sections of the portal interface.","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Data and Page Structure Home Page (Top Menu, Left Panel, Main Panel) Collection Page Object Page File Page ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:0:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Data and Page Structure Both the data and the webpages in the portal are structured in a hierarchy: Collections contain Objects and Objects contain Files. Level Description Collection A group of related Objects. Examples of collections include corpora and sub-corpora, as well as aggregations of cultural objects such as PARADISEC collections, which bring together items collected in a region or a session with consultants. ↓ Object A single resource or a group of tightly related resources that record a communicative event; for example, a dialogue or session in a speech study, a work (document) in a written corpus. ↓ File A container for data (in the form of bits and bytes). Files can store data in different formats; for example, a single Object could have an audio file as well as a text file containing a transcription of the audio. 
","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:1:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Home Page There are three main sections of the Home Page in the portal: the top menu bar, the left panel and the main panel. Each of these is discussed below. Home Page: Top Menu, Left Panel and Main Panel sections Image Source: LDaCA ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Top Menu Home Page: Top Menu Image Source: LDaCA The top menu of the home page allows you to access some of the main features of the portal: LDaCA logo: Returns you to the home page from anywhere in the portal. Collections: Filters the results section of the main panel so that only collections are shown (excluding objects, files and notebooks). A search performed while in this view will only search the text in the Name and Description fields at the collection level. Browse: Resets your results to the default settings. Login: Takes you to the CILogon page where you can login to apply for access to collections or view your current access. See Login for more details. Help: Provides more general information about the infrastructure the portal was built with, as well as the Oni API functionalities. It also links back to this user guide for easy reference. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:1","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Left Panel Home Page: Left Panel Image Source: LDaCA The left panel of the portal home page allows you to refine your data query through the use of a search field and filters. 
Searching allows you to view records which include specific information, while filtering allows you to view records which share values for metadata categories. Hovering over the information icon next to each filter will display tooltips related to that filter. See Search and Filters for more details on how to use the search and filter functions. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:2","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Main Panel Home Page: Main Panel Image Source: LDaCA The main panel on the right of the home page allows you to view the top-level descriptions of collections, objects, files and notebooks in the portal. The top row shows the total number of Index entries or items; this total will reduce as you apply filters and search queries. Below this is the option to Reset Search, clearing all (current) filters and searches. The Sort by and Order by dropdown boxes can also be configured; see Sort and Order to learn more about their usage. Below this, there is a row of buttons which allow you to navigate through pages of results, with 10 results appearing on each page. The results in the main panel contain a set of top-level descriptions: Name: The name of the collection, object, file or notebook. Type: The type of object a record describes, i.e. a collection, object or file. Language: The language(s) of the materials (including PrimaryMaterials, DerivedMaterials and Annotations) in this item. Description: An abstract of the collection, object, file or notebook. Members: The number of items (objects, files, etc.) within the given collection. Member of: The collection that an object, file or notebook is a part of. Note that top-level descriptions will not appear in cases where they aren’t provided for that item, e.g. 
usually Description is not present for objects and files, and Member of does not appear for top-level collections. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:3","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Item Icons The icons next to each item show you some important information about that item (Access, Communication Mode and File Format). You can hover over an icon to see more detail. Icon Category Icon Tooltip Access You can access this data immediately. By doing so, you accept the license terms specified on the record. Access You can access this data after logging in. You may also have to agree to license terms in an automatic process. Access There are restrictions on access to this data. Log in to get further information. Communication Mode SpokenLanguage Communication Mode WrittenLanguage Communication Mode SignedLanguage File Format text/plain File Format application/msword File Format application/tei+xml application/vnd.openxmlformats-officedocument.wordprocessingml.document File Format audio/mpeg audio/x-wav File Format application/mp4 File Format text/csv File Format application/pdf ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:4","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Collection Page Clicking on one of the collections from the main page results will take you to the Collection page for that item. Collection Page: Braided Channels Example Image Source: LDaCA The main panel of the Collection page lists the main details and metadata associated with the collection. Clicking on the question mark icon or information icon next to each heading will display tooltips related to that item. Below the main description and metadata is the number of objects present in the collection. 
The objects are then listed below, with buttons allowing navigation by page if there are more than 10 objects in the collection. The right panel has the following sections: Access: Defines the license and access conditions for the current collection. Content: Lists some of the main features of the current collection including Language, Linguistic Genre, Communication Mode, File Formats and Data licenses for access. Retrieve Metadata: View or download the metadata associated with the current collection, as well as the license and access conditions for this metadata. Notebooks: Lists any notebooks associated with the current collection. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:3:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Object Page Clicking on one of the objects or files from either the main page results or from a Collection page will take you to the Object page for that item. Object Page: A Corpus of Oz Early English Example Image Source: LDaCA The left panel of the Object page lists the main details and metadata associated with the object. Clicking on the question mark icon or information icon next to each heading will display tooltips related to that item. The right panel has the following sections: Access: Defines the license and access conditions for the current object, together with a click-through link to the full license. Member Of: Lists the collection that the current object is a part of. Other Objects in this Collection: Lists the other objects that are in the same collection as the current object. The end of the Object page provides details about the files that form part of the current object. In the example above, the object Text 1-028 1791 Convict has two associated files: a plain text version and a text version including metadata codes, the details of which can be viewed using the dropdown arrow to the right of the file name. 
A preview of the file is available (provided that the access conditions permit this), and there are two options: View File to display the full file in the web browser, and Download File to save a copy of the file to your local system. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:4:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"File Page Clicking View File on one of the files in an Object page will take you to the File page for that item. File Page: Braided Channels Example Image Source: LDaCA From this page, you can view the whole file or download it with the option at the end of the page. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:5:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":["LDaCA"],"content":" The data which is being made accessible through the Language Data Commons of Australia will contribute to the task of documenting language use and language behaviour in Australia. But what does this include? We are aware of many data sources and, in the short term, the important questions are about priorities: What will be immediately useful? What data is stored precariously? What relationships can be leveraged? But in thinking for the long term, there is value in asking the question from a more conceptual point of view: What were the possibilities for making records of language use over our history and what has resulted (and not resulted) from those possibilities? I will try to at least start answering that question by providing a conceptual map of a metaphorical landscape and this will be structured around two themes: demography and technology. Demography is important because what languages were being used at any particular time depends on who was living in Australia at that time. 
On this dimension, the major point of articulation is the arrival of non-Indigenous people. The places of origin of those non-Indigenous people have changed over time, and that has had linguistic consequences, but not on the same scale as the initial change. Technology is important because the kinds of records which might exist of language use depend on the means which were available to make such records at a given time. And on this dimension, I think that there are three points of articulation that are important: the possibility of making written records, the possibility of recording sound and vision, and the possibility of digital records. Before European contact, Australia had a very diverse linguistic ecology. Credible estimates of the number of distinct languages range from around 250 up to as many as 490, with many more dialects. Contact between language groups was common and therefore multilingualism and multidialectalism were also common. We know that there was contact between Indigenous Australians and fishermen from the Indonesian archipelago (especially Makassarese people) from the evidence of loan words. But the only record which existed of this period in the linguistic history of the continent was what was passed from one generation to another by oral transmission. The presence of Europeans on the Australian continent brought a huge change to both dimensions of the map I am developing. Europeans landed on parts of Australia from some time in the 17th century, but I will take two dates in the 18th century to be crucial. Although earlier visitors may have recorded a few words which they heard from Indigenous Australians, written records of the languages only start properly (and even then to a limited extent) with Cook’s expedition in 1770, reflecting perhaps the scientific orientation of Joseph Banks. 
This is the first point of technological articulation in the language data landscape which introduced the possibility of making language records independent of human memory. And then from 1788, there has been a continuous non-Indigenous presence in Australia representing a huge demographic articulation point. During the 19th century (and into the 20th century), written records of Australian languages were produced by a variety of people such as explorers, missionaries, administrators (in fact, pretty much anyone who could be bothered). These records are scattered and new material continues to be found (for example, Des Crump’s ongoing work in the State Library of Queensland and the Queensland State Archives). If you would like to get a flavour of some records of this kind, the Nyingarn project is making many of them available online. (Nyingarn builds on earlier work which presented the Indigenous language materials collected by Daisy Bates as an online resource.) However, the efforts of these early recorders were sporadic, uncoordinated and poorly focused. In 1945, Sydney Baker wrote: “Records of their languages are extremely deficient for","date":"2022-12-07","objectID":"/news/posts/data-map/:0:0","tags":["Australia","language data"],"title":"Language data in Australia - Mapping a conceptual landscape","uri":"/news/posts/data-map/"},{"categories":["LDaCA"],"content":" PDF version By Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi – delivered by Peter Sefton at eResearch 2022 in Brisbane This work is licensed under CC BY 4.0. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ This presentation will look at how a Metadata Standard - RO-Crate - with Metadata Profile (the Language Data Commons) is being developed and implemented. Two major collections are collaborating on the standard, PARADISEC and the Language Data Commons of Australia (LDaCA). 
This ongoing standardisation effort for language data is designed to improve interoperability, reduce costs for data migration and allow storage on disk, object storage or in archival repositories. RO-Crate is a linked-data metadata system which allows discovery metadata (Who, what where) based on the widely adopted Schema.org vocabulary to be seamlessly integrated with more discipline-specific metadata. RO-Crate uses metadata profiles to provide guidance for packaging resources for particular disciplines and purposes. In this presentation, we will introduce a RO-Crate metadata profile for language data which extends the core RO-Crate standard with new vocabulary terms adapted from pre-linked-data discipline-specific metadata efforts, particularly the Open Language Archives Community (OLAC) standards. The profile has English-language guidance on how to structure collections of resources in a repository with links between them, such that they can be indexed and displayed via APIs and search/browse portals. The profile is also implemented as a series of machine-readable profiles for the Describo Online metadata description system. We will demonstrate current ways of describing items in a variety of languages and modes (spoken, written and signed), from a large set of heterogeneous language resources held by PARADISEC and LDaCA. We will also show how to access them via API calls and a search portal, and how resources may be stored in simple storage systems using the Arkisto platform (a set of standards and principles). This work is supported by the Australian Research Data Commons. The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). 
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions. This page shows a YouTube demo of the PARADISEC website. PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) is a digital archive of records of some of the many small cultures and languages of the world and it has developed models to ensure that the archive can provide access to interested communities while also conforming with emerging international standards for digital archiving. 
Australian researchers have been making unique and irreplaceable audiovisual recordings in the region since portable field recorders became available in the mid-twent","date":"2022-11-25","objectID":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/:0:0","tags":["Metadata"],"title":"Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate)\n","uri":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/"},{"categories":["LDaCA"],"content":"Conclusion In this presentation, we have shown the major components of an ecosystem for storing, discovering and analysing language data using common standards for describing objects in a repository. The RO-Crate standard is used as the key metadata container, with a common vocabulary of language-specific terms for describing data. This approach should reduce development costs and increase data reuse. The approach can also be adapted to other disciplines and domains with the development only of new profiles. ","date":"2022-11-25","objectID":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/:1:0","tags":["Metadata"],"title":"Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate)\n","uri":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/"},{"categories":["LDaCA"],"content":" This work is licensed under a Creative Commons Attribution 4.0 International License. Download as PDF This is a write-up of a talk given at eResearch Australasia 2022, delivered by Peter Sefton, with some additional detail. By: Peter Sefton, Jenny Fewster, Moises Sacal Bonequi, Cale Johnstone, Catherine Travis, River Tae Smith, Patrick Carnuccio Edited by: Simon Musgrave The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). 
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions. This work is supported by the Australian Research Data Commons. Last year at eResearch Australasia, the Language Data Commons of Australia (LDaCA) team presented a design for a distributed access control system which could look after the A-is-for-accessible in FAIR data; in this presentation, we describe and demonstrate a pilot system based on that design, showing how data licenses that allow access by identified groups of people to language data collections can be used with an AAF pilot system (CILogon) to give the right people access to data resources. The ARDC have invested in a pilot of this work as part of the HASS Research Data Commons and Indigenous Research Capability Program integration activities. The system has to be able to implement data access policies with real-world complexity and one of our challenges has been developing a data access policy that works across a range of different collections of language data. 
Here we present a pilot data access policy that we have developed, describing how this policy captures the decisions that must be made by a range of data providers to ensure data accessibility that complies with diverse legal, moral and ethical considerations. We will discuss how the CARE and FAIR principles underpin this work, and compare this work to other projects such as CADRE, which promise to deliver more complex solutions in the future. Initial work is with collections curated in a research context but we will also address community access to these resources. The idea is to separate safe storage of data from its delivery. Each item in a repository is stored with licensing information in natural language (English at the moment, but could be other languages) and the repository defers access decisions to an Authorization system, where data custodians can design whatever process they like for granting license access. This can range from simple click-through licenses where anyone can agree to license terms, to detailed multi-step workflows where applicants are vetted based on whatever criteria the rights holder wishes: qualifications, membership of a cultural group, whether they have paid a subscription fee, etc. Regarding rights, our project is informed by the CARE principles for Indigenous data which also describe the level of respect which should be given to any data collected from individuals or communities. The current moveme","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:0:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: Why not “just” implement an access control list (ACL) in the repository? 
There are a few reasons for the distributed approach we have taken in LDaCA: ACLs need maintenance over time - people’s identities change, they retire and die, so storing a list of identifiers such as email addresses alongside content is not a viable long-term preservation strategy. Rather, we will encourage data custodians to describe in words what are permitted uses for the data, and by whom, in a license, then allow whoever is the current data custodian to manage that access in a separate administrative system. We expect these administrative systems to be ephemeral, and change over time but also to generate less friction over time as standards are developed. Expected future benefits of concentrating these processes will include that people do not have to prove the same claims they make about themselves multiple times and that it is easier for data custodians to authorise access. LDaCA data will be stored in a variety of places with separate portal applications serving data for specific purposes; if these systems all have in-built authorization schemes, even if they are the same, then we have the problem of synchronizing access control lists around a network of services. Accessing data that requires some sort of authorization process is not language or humanities-specific, so working with an existing application that can handle pre-authorization workflows and access-control authorization decisions is an attractive choice and should allow LDaCA to take advantage of centrally managed services with functionality that improves over time rather than having to develop and maintain our own systems. If complex access controls are implemented inside a system then there is a risk that data becomes stranded inside that system and cannot be reused without completely re-implementing the access control. For example, imagine an archive of cultural material with complex access controls encoded into the business logic such as “this item is accessible only to male initiates”. 
Applications like this need to store user accounts with attributes on both data and user records that can be used to authorise access. There is a high risk of data being stranded in a system such as this if it is no longer supported. This will be mitigated somewhat if the rules are also expressed as licenses, perhaps a composition of Traditional Knowledge (TK) Labels - but the access system is baked-in to the application and not portable. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:1:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: Yes but why does data need to have a license if we already have access controls? The point of Research Data Commons projects like LDaCA is to create an ecosystem where data can be re-used. For language data, this means that users, including researchers and community members, will be able to download data for certain authorised purposes and activities. The license is the way that data custodians communicate to data users (and future administrators) what those purposes and activities are. A license, which is always packaged with data will allow: A user to inspect a five-year-old dataset in their downloads folder and work out what they are allowed to do with it. An IT professional to clean up a laptop that has been handed in by (or seized from – it happens) a departing faculty member. A developer to re-create an access control replacing a decommissioned system. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:2:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: So many licenses! 
Sounds like a lot of work! We expect that the overhead of writing licenses will diminish greatly over time and standard clauses and complete licenses will be established. A data depositor will be able to choose from a set of standard license terms (such as a standard “restricted to CIs and participants license” for a given repository), using that as a template to mint their own license for a given data set with its own name and ID. The user can choose a standard way of adding pre-authorised licensees (such as email invitations). This ID can then be used by an authorization system. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:3:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: So you have centralised authorization into a system that grants licenses; doesn’t that mean you are locked-in to that system? No, and yes. No, there is no lock-in regarding the list of Licenses and pre-authorised users; licenses and access control lists can be exported via an API so it is possible to import them into another system or save them for audit purposes. Yes, there is lock-in, in that at this stage the workflow used to give access to users is specific to the system (such as REMS). But, because our process requires a governance step first in writing a license, then there is a statement of intent for re-building those processes later if needed - a step which is very likely to be missing in a system with built-in access control. 
Also, over time, we expect the administrative burden of constructing workflows will become less as standards are developed for a couple of things: Licenses can be made less complex (particularly in the context of academic studies) if they specify re-use by particular known cohorts in advance - this comes down to improving the design of studies to encourage data reuse. This may also help to simplify academic ethics processes in the medium to long term. The CADRE project is looking to improve pre-authorization workflows that automatically source relevant information about potential users - fetching their publication record, and potentially remembering what certifications they have, so these attributes can be used and reused for decision-making. It is conceivable that this approach might be useful in cultural contexts as well to allow data custodians to manage data sharing - this is a discussion we have yet to have in the broader HASS RDC. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:4:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: What if I have a really simple requirement like giving access to just a couple of people - doesn’t this license approach just add complexity? If a data item needs to be locked down to a small group of people, say the chief investigator and the participants in a recorded dialogue then an obvious implementation is to maintain a small access control list (ACL) for the item. But all of the issues identified above with application-specific ACLs are the same, no matter the size of the cohort: the data set can’t be access controlled outside of its home system. 
If the system is no longer running then the data may be completely inaccessible, and if there is no license document stored with the data setting out terms of re-use in general terms then there is no indication to future administrators about who, if anyone, should have access to the data. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:5:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: We don’t need a license, we have a “terms of use” Same thing. Terms of use for data are what a license does. We are designing our systems so that all the relevant terms and conditions go in one place to minimise confusion. The final three slides have been contributed by co-author Patrick. These slides briefly outline the AAF process for the next phase that will provide the foundations for the development of the service and the creation of those policies to support the community and the service. This process will support the project to deliver a viable service that meets researchers’ needs and is trusted by the community and the participants to safely distribute data to authorised persons. The AAF’s business analyst is conducting interviews with the key stakeholders. This discovery process will collect information on the current and the “to-be” state of the service. Together these will establish goals and expectations and provide the basis for further prototyping a service that meets stakeholder needs. The process will facilitate the building of a service that empowers the data custodians, the communities and participants to manage access. 
The basis for prototyping is iterative: Identify, Prioritise, Pilot, Review, Update requirements. This leads to a production service that meets participant, community and researcher requirements and unifies the services, policies and trust framework for the community. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:6:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":null,"content":" Bio My name is Otis Carmichael and I am a proud Waanyi man living in Meanjin. My designs reflect my appreciation of the past and my hope for the future. The design of the LDaCA logo came from my own experience researching the Waanyi language, the language of my mob. The design flows from the beginning to the end, where it flows back into itself at the beginning, representing the cyclical nature of life and the Land. The strands of language interweave with themselves, influenced by everything that touches them until they become something entirely their own. The complexity of these languages is a thing of beauty and their preservation, study and proliferation lays claim to this unceded land. The inspiration of the design is as follows: L (Language) - Language is shown as a flow of words that can be merged, split and rearranged to create different meanings, just as these lines can be. D (Data) - Data is depicted by these specks of raw information, undecipherable before processing but heavy with power. a - This data is not just numbers and words but aspects of cultures and knowledge with connections to the land. C (Commons) - Whilst research often appears niche in these commons, there is no limit to the range of its influence, the ripples of which disrupt everything it touches. 
A (Australia) - The language data of this common has been collected from many individual nations that make up ‘Australia’, often with a foundation on extraction-based research techniques. This project aims to make up for the mistakes of the past by celebrating and giving back to these nations, which are so incredibly distinct whilst sharing beautiful connections with each other. The colours utilised in this design draw from those that are seen in my people’s land of Boodjamulla, in the Gulf of Carpentaria (see below). They reflect the connections between the Sky, the People and the Land in the hope that these connections continue to strengthen indefinitely. ","date":"2022-11-07","objectID":"/designer/:0:0","tags":null,"title":"Otis Carmichael","uri":"/designer/"},{"categories":null,"content":" You can contact the Language Data Commons of Australia by email at ldaca@uq.edu.au and make sure you subscribe to our newsletter. Our logo was designed by Otis Carmichael. Read more about Otis and the ideas behind his design. We have a page on LinkedIn and we share a Twitter account with the Australian Text Analytics Platform. ","date":"2022-10-28","objectID":"/contact/:0:0","tags":null,"title":"Contact Us","uri":"/contact/"},{"categories":["LDaCA"],"content":" This is a presentation Peter Sefton gave to the Humanities, Arts and Social Sciences Research Data Commons and Indigenous Research Capability Program Technical Advisory Group on Friday 11th February 2022. Thanks to Simon Musgrave for reviewing this and adding a little detail here and there. (This is also available on Dr Sefton’s site) PDF version The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are establishing a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). 
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. For this Research Data Commons work, we are using the Arkisto Platform (introduced at eResearch 2020). Arkisto aims to secure the long-term preservation of data independently of code and services - recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in a database or hard-coded logic of some hard-to-maintain application. We are using three key standards on this project … The first standard is the Oxford Common File Layout - this is a way of keeping version-controlled digital objects on a plain old filesystem or object store. Here’s the introduction to the spec: Introduction This section is non-normative. This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital objects in a structured, transparent, and predictable manner. It is designed to promote long-term access and management of digital objects within digital repositories. Need The OCFL initiative began as a discussion amongst digital repository practitioners to identify well-defined, common, and application-independent file management for a digital repository’s persisted objects and represents a specification of the community’s collective recommendations addressing five primary requirements: completeness, parsability, versioning, robustness, and storage diversity. Completeness The OCFL recommends storing metadata and the content it describes together so the OCFL object can be fully understood in the absence of original software. 
The OCFL does not make recommendations about what constitutes an object, nor does it assume what type of metadata is needed to fully understand the object, recognizing those decisions may differ from one repository to another. However, it is recommended that when making this decision, implementers consider what is necessary to rebuild the objects from the files stored. Parsability One goal of the OCFL is to ensure objects remain fixed over time. This can be difficult as software and infrastructure change, and content is migrated. To combat this challenge, the OCFL ensures that both humans and machines can understand the layout and corresponding inventory regardless of the software or infrastructure used. This allows for humans to read the layout and corresponding inventory, and understand it without the use of machines. Additionally, if existing software were to become obsolete, the OCFL could easily be understood by a lightweight application, even without the full feature repository that might have been used in the past. Versioning Another need expressed by the community was the need to update and change objects, either the content itself or the metadata associated with the object. The OCFL relies heavily on the prior art in the [Moab] Design for Digital Object Versioning which utilises forward deltas to track the history of the object. 
Utilizing this schema allows implementers of the OCF","date":"2022-06-08","objectID":"/news/posts/rdc-tech-meeting/:0:0","tags":["Repositories"],"title":"HASS RDC Technical Advisory Group Meeting LDaCA \u0026 ATAP Intro","uri":"/news/posts/rdc-tech-meeting/"},{"categories":null,"content":"Information about the first datasets which have been added to the LDaCA and Oni .","date":"2022-02-15","objectID":"/about/sample-collections/","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"LDaCA Portal LDaCA has begun adding datasets, including: most of the datasets which were part of the Australian National Corpus collection, now available at our data portal (or click the Data Portal button top right). ","date":"2022-02-15","objectID":"/about/sample-collections/:1:0","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"A Corpus of Oz Early English (COOEE) A collection of texts written in Australia between 1788 and 1900. The corpus is divided into four time periods (1788–1825, 1826–1850, 1851–1875 and 1876–1900) each holding about 500,000 words. Four registers were defined for COOEE: the Speech-based Register (SB), the Private Written Register (PrW), the Public Written Register (PcW) and the register of Government English (GE). For each time period, there is a similar number of words in the different registers. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:1","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"AustLit The contribution from AustLit provides full-text access to select samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s. The collection includes literature intended for popular audiences as well as literature intended for audiences concerned with literary quality or the establishment of a national canon. 
","date":"2022-02-15","objectID":"/about/sample-collections/:1:2","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Australian Corpus of English The Australian Corpus of English (ACE) was compiled to match Australian data from 1986 with the American (Brown) and British (LOB) corpora of written English from the 1960s. It includes 500 samples of published texts taken from 15 different categories of nonfiction and fiction, including newspapers, reportage, editorials, reviews; magazines and journals: popular, academic; government and corporate documents; fiction monographs and short stories (both popular and literary). ","date":"2022-02-15","objectID":"/about/sample-collections/:1:3","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Australian Radio Talkback Australian Radio Talkback (ART) is a set of transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006. It includes 27 audio recordings and transcripts of talkback from ABC National Radio, ABC Radio broadcasts to eastern Australia, ABC Radio broadcasts to southern and western Australia, as well as commercial stations broadcasting to eastern Australia and southern and western Australia. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:4","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Braided Channels The Braided Channels research collection is constructed from some 70 hours of oral history interviews with women from Australia’s Channel Country, together with archival film, transcripts, photos and music. It includes both audiovisual recordings and transcripts of interviews. 
","date":"2022-02-15","objectID":"/about/sample-collections/:1:5","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"International Corpus of English (ICE-AUS) The Australian component of the International Corpus of English (ICE-AUS) is an approximately one million word corpus of transcribed spoken and written Australian English from 1992-1995. It consists of 500 samples of Australian English (60% speech, 40% writing) that match the structure of other corpora associated with the International Corpus of English. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:6","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"The La Trobe Corpus of Spoken Australian English The La Trobe Corpus of Spoken Australian English comprises a collection of six recordings and transcriptions of spoken interaction amongst Australian speakers of English (some in conversation with native French speakers speaking English) made in Melbourne from 2001 to 2002. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:7","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"The speech of Australian adolescents: research data and recordings collected by A.G. Mitchell and Arthur Delbridge in 1959 and 1960 This dataset comprises 22,187 recordings of Australian English as spoken by 7,736 students at 330 schools across Australia and specific information about the speakers. The recordings were made on reel-to-reel tapes and were used to create the 1965 monograph The speech of Australian adolescents: a survey and the revised 1965 publication The pronunciation of English in Australia (originally published in 1946). The Australian National Corpus provided access to a sample of this material; the full dataset is now available. 
Other datasets not yet available in the portal: Sydney Speaks: This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney. From Farms to Freeways: This research project sought to analyse the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s, including their responses to social changes brought about by rapid suburbanisation in the Western Sydney region in the post-war period. Two-hour taped discussions were held with 34 women, aged 60 and over, who were in their early twenties during the Western Sydney region’s population growth. A collection of government documents in various languages. This is a very small dataset assembled to check that our technology can handle different languages and different scripts; more information about this work is available in this presentation. Work is underway to make data from other earlier projects accessible through LDaCA: Datasets from The Australian National Corpus not listed above (Monash Corpus of English, Griffith Corpus of Spoken Australian English) AusTalk Corpus of Australian English as a Second Language (AusESL) ","date":"2022-02-15","objectID":"/about/sample-collections/:1:8","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Indigenous Language Data (ILD) Portal ","date":"2022-02-15","objectID":"/about/sample-collections/:2:0","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Caroline Kelly Papers The Caroline Kelly Papers collection comprises personal and professional papers of Caroline Kelly, including correspondence; financial and legal papers; unpublished poetry and stories; theatre records and publications; anthropology field notes, reports and articles; photographs and newspaper cuttings. 
","date":"2022-02-15","objectID":"/about/sample-collections/:2:1","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Elwyn Flint Collection, UQFL173 The Elwyn Flint collection comprises written documents collected by Flint mostly as part of a long term research project in the 1960s, known as the Queensland Speech Survey, during which Flint recorded traditional Aboriginal languages. This collection also documents the Yuulngu (Gupapuyngu) language. Other parts of the collection include journal articles of languages, correspondence with field-linguists and staff from academic institutions, Indonesian material, and documentation of the English spoken by Australian Aboriginal and Torres Strait Islander peoples, Australian and regional pidgins, “migrants” and “mother-tongue” people from different ages, regions or socio-economic groups. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:2","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Fryer Library, The University of Queensland The Fryer Library manuscript collections consists of unpublished materials and includes personal papers, photographs, audio recordings, architectural plans, artworks and more. Manuscripts are generally arranged in collections according to who created or collected the material. Note that Fryer Library’s published material can also be searched through its Library Search. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:3","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"UQ Library Collection A selection of Indigenous language resources held in the University of Queensland Library. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:4","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Principles Information about the principles on which the work of LDaCA is based. 
","date":"2022-02-15","objectID":"/data/:0:1","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Technologies Information about the technologies being used in LDaCA. ","date":"2022-02-15","objectID":"/data/:0:2","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Metadata Information about the approach to metadata being taken by LDaCA. ","date":"2022-02-15","objectID":"/data/:0:3","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Sample Collections Information about the first datasets which have been added to LDaCA. ","date":"2022-02-15","objectID":"/data/:0:4","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Case Studies Accounts by collectors of various kinds of language data illustrating the different solutions they have adopted for the problems encountered in the process. ","date":"2022-02-15","objectID":"/data/:0:5","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Learn more about our project's structure and partnerships.","date":"2022-02-15","objectID":"/about/organisation/","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":" LDaCA is one strand of the partnership between the Australian Research Data Commons (ARDC) and the School of Languages and Cultures at The University of Queensland. This partnership includes a number of projects that explore language-related technologies, data collection infrastructure and Indigenous capability programs. These projects are being led out of the Language Technology and Data Analytics Lab (LADAL), which is overseen by Professor Michael Haugh and Dr Martin Schweinberger. The Language Data Commons of Australia (LDaCA) project receives investment from the Australian Research Data Commons (ARDC) through its HASS and Indigenous Research Data Commons program. 
","date":"2022-02-15","objectID":"/about/organisation/:0:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Partner Institutions University of Queensland: Professor Michael Haugh Australian National University: Professor Catherine Travis Monash University: Associate Professor Louisa Willoughby University of Melbourne: Associate Professor Nick Thieberger University of Sydney: Professor Monika Bednarek (Sydney Corpus Lab) AARNet: Adam Bell First Languages Australia: Beau Williams ","date":"2022-02-15","objectID":"/about/organisation/:1:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Advisory/Consultative Partners Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Australian Digital Observatory CLARIN ","date":"2022-02-15","objectID":"/about/organisation/:2:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Information about the technologies being used in LDaCA.","date":"2022-02-15","objectID":"/about/technologies/","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":" LDaCA is basing its data storage on Research Object Crates and the Oxford Common File Layout, which are described below. The overall approach is informed by the Arkisto platform, taking the view that research data has interest and value that extends beyond funding cycles and its long-term preservation and accessibility must continue to be managed. This presentation gives further details of the technical architecture. ","date":"2022-02-15","objectID":"/about/technologies/:0:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":"Research Object Crates (RO-Crate) A Research Object (RO) is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations. 
RO-Crate is a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wide variety of situations. While RO-Crates can be considered general-purpose containers of arbitrary data and open-ended metadata, in practical use within a particular domain, application or framework, it is beneficial to further constrain RO-Crate to a specific profile: a set of conventions, types and properties that one minimally can require and expect to be present in that subset of RO-Crates. LDaCA is developing such a profile to be used for language data. Watch this video for a quick introduction to the RO-Crate concept. The RO-Crate community run a weekly drop-in call in Australia (details). ","date":"2022-02-15","objectID":"/about/technologies/:1:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":"Oxford Common File Layout (OCFL) OCFL is a specification for laying out digital collections on file or object storage. It is designed with long-term preservation principles in mind and does not rely on specialised software. Amongst the benefits of using OCFL with RO-Crate objects are: completeness: a repository can be re-indexed from the files it stores versioning: repositories can make changes to objects and still allow their history to persist ","date":"2022-02-15","objectID":"/about/technologies/:2:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":["LDaCA"],"content":"FAIR and CARE Data is becoming increasingly important in today’s world, so corpus linguists might feel that the rest of the world is finally catching up. But the rest of the world are bringing with them new approaches to how data is handled. This means that fields such as corpus linguistics may need to reassess their practices. 
Such reassessment includes addressing concerns about how data is stored and who can access it (data stewardship) – concerns that are a part of the Open Science movement, ultimately grounded on principles of equity and accountability. The most influential approach to data stewardship today is the FAIR principles. According to these principles, data should be: Findable Metadata and data should be easy to find for both humans and computers. Accessible Once the user finds the required data, they need to know how the data can be accessed, possibly including authentication and authorisation. Interoperable The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing. Reusable The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. In general, corpus linguists do well on the interoperability criterion. Corpus data is usually stored in non-proprietary formats; even when some structure is imposed on the data, this is almost always in a form which is saved as a simple text file (e.g. CSV files or XML annotations). Data stored in such formats is easy to move between applications. But what about the other three criteria? Some corpus data is easy to discover; it is findable. For example, CLARIN, the portal to the European Union language resource infrastructure, provides access to many large data collections, as does the Linguistic Data Consortium in the USA. However, some data is never made part of a large collection and often remains under the control of individual researchers or research teams. Such data may be almost impossible to find. Even if we can find such data, it is unlikely to be accompanied by good descriptions of the data and metadata, making reusability problematic. 
Of course, big corpora such as the British National Corpus will be both findable and accompanied by comprehensive corpus manuals. However, it is worth considering how to make other corpora more findable, including the provision of corpus manuals or corpus descriptions. Corpus resource databases such as CoRD do aim to work towards this principle. Accessibility may also be an issue for some data. Copyright law may allow use of material for individual research but prohibit any further distribution of the material. The FAIR approach to such cases is that metadata should be available so that interested parties can know that a data holding exists (F), and the metadata will include information about the conditions under which the data may or may not be shared or reused (A and R). For linguists, there is another very important set of principles concerning data, the CARE principles developed by the Global Indigenous Data Alliance: Collective Benefit Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data. Authority to control Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered. Responsibility Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit. Ethics Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem. 
These principles are presented as applying particularly to Indigenous data, but we believe that researchers should adopt this approach in all cases where the people who particip","date":"2022-02-08","objectID":"/news/posts/fair-and-care/:1:0","tags":["FAIR","CARE"],"title":"What are the FAIR and CARE principles and why should corpus linguists know about them?","uri":"/news/posts/fair-and-care/"},{"categories":null,"content":"Refine your results according to a set of major metadata categories.","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Filters allow you to refine what items are available to view and search according to a set of major metadata categories and can be accessed from the left panel of the portal home page. Select a filter in order to see the dropdown list of options; the number icons beside each filter indicate how many options are available. A total of five options are visible per page; to access more options, use the pagination icons, or the search bar if you are looking for a particular option. Part of the Filter List Linguistic Genre Filter Expanded Filter List Example Image Source: LDaCA Linguistic Genre Filter Example Image Source: LDaCA Within each filter, the number icons beside each option indicate how many items are available for that particular option. For example, in the images above, there are 11 options under the Linguistic Genre filter, and there are 8098 items available for the Linguistic Genre Narrative. Once you have selected all the filters you wish to apply, select the Apply Filters button, which appears on a floating pop-up box and also above the Reset Search button. 
The top of the results section will show the filters and selected options that were applied next to the Filtering by: heading, and these can either be removed individually by clicking the × next to that filter or in bulk with the Clear Filters button. If needed, you can further select or deselect options by repeating this process with the left panel filters, at which point the Apply Filters button will reappear. ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:0:0","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Filter Categories ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:0","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Collection A group of related objects such as a corpus, a sub-corpus or items collected in a single data collection session. Examples: A COrpus of Oz Early English (COOEE) Braided Channels International Corpus of English (ICE-AUS) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:1","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Sub-Collection The sub-collections, if any, that comprise a collection. Examples: ICE: S1A: Conversations ICE: S2A: Dialogues ICE: S2B: Monologues ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:2","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Access The access conditions associated with an item. Note that these differ from the three general categories of the Access icons and are instead the specific licenses for the item’s usage. 
Examples: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Attribution 4.0 International (CC BY 4.0) Attribution-NoDerivs 3.0 Australia (CC BY-ND 3.0 AU) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:3","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Record Type The type or categorisation of an item, i.e. a collection, object or file. For individual files, this field also gives information about the nature of the material, i.e. primary material, transcription, annotation etc. Examples: Annotation Video Transcription ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:4","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Language The language(s) which the content of the item is in. Examples: English Waanyi Vietnamese ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:5","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Communication Mode The mode(s) (spoken, written, signed, etc.) in which the content of the resource was created. Examples: WrittenLanguage SpokenLanguage SignedLanguage ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:6","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Linguistic Genre The style or form of communication to which the item can be assigned. Examples: Narrative Report Interview ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:7","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"File Format The media type and encoding format of the file. 
Examples: text/plain (.txt) application/mp4 (.mp4) audio/x-wav (.wav) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:8","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Refine your queries with word, phrase and pattern searches.","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Basic Search Advanced Search (Search Fields, Boolean Operators, Query String Syntax, Regular Expressions, Reserved Characters, Show Query) The portal has both basic and advanced search capabilities. Search allows you to refine what items are displayed through basic queries with one or more words, or advanced queries using regular expressions to find specific patterns. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:0:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Basic Search The basic search function can be found in the top left section of the portal home page. Basic Search Image Source: LDaCA Basic Search allows you to search for: single words multiple words (i.e. two or more) where at least one of these words occur in the result (e.g. “northern territory” will return results where “northern” and “territory” occur in isolation, as well as any instances where “northern” and “territory” occur in the same item) case-insensitivity. Basic Search does not allow you to search for: exact phrases parts of words case-sensitivity. Selecting the magnifying glass icon or hitting Enter/Return on a keyboard executes a Basic Search (but note that the Enter/Return option is not available in Advanced Search). 
By default, results are sorted by relevance and ordered in descending order. Each result will display a Relevance Score based on the relevance of the result to the search query. If at least one of the query words occurs in the text field of an object, this will be highlighted in yellow; however, highlighting will not occur for matches within metadata fields such as name and description. Note that queries in Basic Search will be applied to all fields in the collections (e.g. name, description, text). If you need to refine the search field, or search for multiple words in specific fields, use the Advanced Search function. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:1:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Advanced Search The advanced search function is accessed by selecting Advanced Search below the Basic Search bar. Advanced Search Image Source: LDaCA ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Search Fields By default, queries in Advanced Search are applied to all fields, however, this can be refined using the All Fields dropdown menu on the left. The current search fields available are Name, Description, Language and Text. To search in more than one field, select Add New Line and specify the additional field you wish to search. To reset your search query, select Clear. The information entered in the Advanced Search text field is treated as part of a ‘mini-language’: Your query specifications across multiple lines will be compiled as a single query string consisting of a series of search terms and operators which can be viewed by clicking the Show Query button. In general, the search text is not case-sensitive. Exceptions to this are Boolean operators (see below). 
","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:1","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Boolean Operators The standard Boolean operators AND, OR and NOT are supported in Advanced Search. These can either be added in the dropdown menu between fields when Add New Line is selected, or included within the search text field, using parentheses around the full query, e.g. (koala AND kangaroo). For instance, to search for items that contain both ‘rainbow’ and ’lorikeet’ or ‘pink’ and ‘cockatoo’ but not ‘galah’, the query should be: (((rainbow AND lorikeet) OR (pink AND cockatoo)) NOT galah) If you are using multiple Boolean operators within a single text field, you may want to group parts of the query in parentheses like in the example above, but ensure in all cases you remember the parentheses around the full query or the results will be incorrect. To search for the literal words AND, OR and NOT, add a backward slash (\\) before that word to ’escape’ it, e.g. \\OR. Note that this is a situation where the search text is case-sensitive; ‘and’ does not need to be escaped, but ‘AND’ does. Escaping will not return case-sensitive matches; it will just prevent its use as a Boolean operator. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:2","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Query String Syntax Symbol Function \" \" Use double quotation marks before and after a phrase to search for that exact phrase, e.g. \"public house\". Entries where a hyphen occurs in the text instead of a space will also be returned in a phrasal search. ~ Creates a fuzzy query to return results similar to the search term by changing, removing, inserting or transposing one or more characters. Add a number following this operator to increase the number of variations, e.g. 
brwn~2 will find instances of ‘brown’, ‘been’, ‘own’, etc. Fuzzy queries can also be applied to phrasal searches, allowing the specified words to be further apart or in a different order, e.g. \"house home\"~3 will find instances of ‘house and home’, ‘house is my home’, ‘home, the house’, etc. ? Wildcard to replace a single character. Wildcards cannot be included in a phrasal search. e.g. qu?ck will find instances of ‘quick’ and ‘quack’. * Wildcard to replace zero or more characters. Wildcards cannot be included in a phrasal search. e.g. gre* will find instances of ‘green’, ‘grew’, ‘greater’, etc. This wildcard can also be used to find related word forms e.g. ask* will find instances of ‘ask’, ‘asks’, ‘asked’ and ‘asking’. ( ) Defines a sub-expression. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:3","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Regular Expressions Some regular expression patterns can be used within the query string by surrounding the pattern in forward slashes, e.g. /gr[ae]y/ or /honou*r/. Currently, regular expressions can only be used for complete word searches and not for partial words or phrases. This search function does not support full Perl-compatible regex syntax. For more information see: RegExp Syntax. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:4","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Reserved Characters The following are reserved characters (i.e. part of the ‘mini-language’) that can have specific search functions and may need ’escaping’ with \\ if you want to search for the literal characters. + − = \u0026\u0026 ; || \u003e \u003c ! ( ) { } [ ] ^ \" ~ * ? 
: \\ / ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:5","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Show Query If you need to check your search query against what is actually sent to the back-end search engine, select Show Query. For example, setting the search field to Language and searching for Danish has the following query string: ( language.name.@value: Danish ) ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:6","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Sort and order the results in the portal.","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). The portal search engine allows you to sort and order all items of a query according to the below options. By default, items are sorted by Collections first, and are in Descending order. 
","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:0:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"Sort By Collections Relevance Name ","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:1:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"Order By Ascending Descending ","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:2:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"A guide to logging in to the portal.","date":"2019-01-29","objectID":"/resources/user-guides/portal/login/","tags":null,"title":"Login","uri":"/resources/user-guides/portal/login/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Some collections require that you log in either to access, or apply for access, to the data. This can be done by selecting Login from the top menu bar. Portal Top Menu Options Image Source: LDaCA You will be taken to the LDaCAData login page. Select Sign in with CILogon. LDaCA Logon Image Source: LDaCA You will be taken to the CILogon page which: provides details of the Consent to Attribute Release allows you to select an identity provider. CILogon facilitates secure access to cyber infrastructure (CI). In order to use the CILogon Service, you must first select an identity provider. An identity provider is an organisation where you have an account and can log in to gain access to online services. Note that the identity provider defaults to ORCID but this can be changed to your preferred authentication. If you are a faculty, staff or student member of a university or college, please select that for your identity provider. 
If your school is not listed, please contact help@cilogon.org, and they will try to add your school in the future. You can also select your ORCID, Google, GitHub or Microsoft account to authenticate for the CILogon Service. For further details about CILogon and the identity providers it supports, see CILogon FAQ. CILogon Image Source: LDaCA First select your preferred identity provider in the dropdown before clicking Log On (there is also a search bar available to more easily find your selection). If you want the system to remember your selection on subsequent visits, select Remember this selection. CILogon: Select an Identity Provider Image Source: LDaCA Once your login is successful, you will be redirected back to the portal page and the Login section of the top menu bar will instead display your account name. Portal: Logged In View Image Source: LDaCA Hovering your cursor on the arrow next to your account name will trigger a dropdown menu with the following two options: User Information: View your user details and memberships to collections that have more specific access conditions. Logout: Select to log out of your account. ","date":"2019-01-29","objectID":"/resources/user-guides/portal/login/:0:0","tags":null,"title":"Login","uri":"/resources/user-guides/portal/login/"},{"categories":null,"content":"A guide to viewing and applying for access to collections in the portal.","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). 
Apply for Access to Collections with Access Restrictions Apply for a Data License View your Current Access List ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:0:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"Apply for Access to Collections with Access Restrictions When viewing an item that has restricted access, you will see one of the following permission warnings, depending on whether or not you are logged in to the portal: You do not have permission to see these files. Sign up or Login You do not have permission to see these files. You are logged in and you can apply for permission to view these files apply for access or refresh permissions Restricted Access Example Image Source: LDaCA Once you have logged in to the LDaCA portal (see Login for further details), select apply for access in the warning text and you will be directed to the LDaCA-REMS login page in a new tab. REMS (Resource Entitlement Management System) is an open-source tool for managing access rights to resources such as research datasets. REMS is used by many research and government organisations. In this document, LDaCA-REMS refers to the specific way REMS is implemented in the LDaCA environment. LDaCA-REMS Interface Welcome Page Image Source: LDaCA Select Login and follow the CILogon prompts (see Login again). Once logged in, you will be presented with the license application form containing the following five sections: State: Shows the status of your application across three states: Apply: Displays a tick mark when you click on the Accept Terms button in the Licenses section. Approval: Displays a tick mark when you click Submit application in the Actions section. Approved: Displays a tick mark when your application is approved. 
Applicants: Shows the name of the applicant and an Invite member button that allows the applicant to send a link to the license application page to another person; the invitee will be prompted to log into LDaCA-REMS. Resources: Provides a link to the license document. Licenses: Provides a link for the applicant to view the license and an Accept license button to accept the terms. Actions: Contains features actionable by the applicant, including Submit application, an important function to select during the application process. LDaCA-REMS Interface Submit Application Image Source: LDaCA You will also notice the three top menu items, Catalogue, Applications and About: Catalogue: Lists resources to which you can apply for access. Applications: Lists all your applications, each with details such as the application ID, resource name, state and activity dates. About: A brief description of the LDaCA-REMS tool. ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:1:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"Apply for a Data License Data licensing is the mechanism used in LDaCA-REMS for data stewards to grant or deny legal permission to access and use data. The image below shows how to apply for a data license in three steps: Go to Licenses to read the license terms. Clicking on the link will open these in a new tab. Then, still under Licenses, select Accept license. Go to Actions and select Submit application to activate the approval process. LDaCA-REMS Application Submission Steps Image Source: LDaCA Note that the application form above illustrates a basic license application procedure. Access restrictions and procedures for application approval will vary across collections (e.g. requiring applicants to complete a questionnaire, or to upload material explaining intended use of the data). 
Data stewards can set up the license application approval as an automated process (using bots) which takes 1-2 minutes to complete. Other data stewards might prefer to receive, review and approve or reject the application themselves, or designate a colleague for the task. Understandably, human-mediated approval is likely to take longer than an automated process. Every applicant will be sent an email notification confirming the submission of a license application, and another email confirming approval. Once approved, you should see all three tick marks under State: LDaCA-REMS State: Application Approved Image Source: LDaCA Navigate back to the data object you require. If the warning text is still displayed, click on refresh permissions. The page will update and you should be able to access the data. Note that these permissions apply to other data objects in the collection that are covered by the same license terms; you do not need to apply for permission for each data object. ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:2:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"View your Current Access List To view your current list of licenses you have access to, hover your cursor over the arrow next to your account name to trigger a dropdown menu and select User Information. If you are not logged in, see the steps under Login. The User Memberships section shows your current access list, and you can click Check Memberships to refresh this list. You can also select Verify your access in REMS to log into LDaCA-REMS. 
LDaCA-REMS User Memberships Image Source: LDaCA ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:3:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"A guide to downloading individual files from the portal.","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). ","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/:0:0","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":"Download Individual Files Files from a collection can be downloaded from two locations: the File section of an Object page the File page for the selected item. To download a file from an Object page, navigate to the Files section at the end of the page (ensuring you click on the arrow to expand the relevant file) and select Download File on the right-hand side of the file entry. Object Page: Download File Image Source: LDaCA To download a file from a File page, navigate to the end of the page and select Download File. 
File Page: Download File Image Source: LDaCA ","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/:1:0","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":"A guide to citing collections and data accessed through the portal.","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Citing a Collection Citing Parts of a Collection Citing Published Work Citing LDaCA as Access Source Follow this guide when you need to cite a data source that you accessed through the LDaCA portal in a publication. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:0:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing a Collection Cite the collection as a whole whenever you use, or need to refer to, any part of it in your work, unless otherwise stated in the citation information provided in the LDaCA portal. You can find each collection’s citation information on the collection level page under “How to cite this collection”. The collection level page is the webpage presented upon selection of the collection name from the portal homepage or from a list of search results. Note that this citation information is separate from the Citation metadata field that appears in some collections and refers to published papers about the collection. The citation information is not in any specific format or style (e.g. 
APA, MLA); it is meant to provide the essential citation elements for a minimal bibliographic reference (Tromsø Recommendations 2019) which are: Author, Date, Title, Publisher, Locator However, the citation information given may be what the data steward had approved or had specified in the data license. For the definitions of the above and additional citation elements, and further advice on citing linguistic data, refer to The Tromsø Recommendations for Citation of Research Data in Linguistics. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:1:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing Parts of a Collection You can mention particular parts of the collection (e.g. objects, files, accompanying materials) you used in the text of your work, accompanied by the in-text citation e.g. Author, Date (or equivalent footnote if using a numbered style). Note that the manner of citing a collection would most likely be subject to the context of your research project, and guidance from your school or publisher may take precedence over the information provided here. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:2:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing Published Work Descriptive metadata at the collection level may include a Citation field that contains a link to, or citation information for, a published work. The published work is typically also authored by the data collection creator and explains the research work and processes behind the collection of the data. LDaCA provides this information as specified by the data collection creator. In cases like this, cite the publication in addition to citing the data collection. 
As an example, the COrpus of Oz Early English (COOEE) shows a Citation field with a link to the thesis that explains the research concepts, planning, protocols, processes etc. in the creation of the COOEE collection. The user should cite both the collection and the thesis. COOEE Citation Metadata Field Example Image Source: LDaCA ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:3:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing LDaCA as Access Source If you discovered or accessed the data collection that you used in your work through the LDaCA portal, please cite or acknowledge LDaCA as follows: Data accessed via the Language Data Commons of Australia data portal https://data.ldaca.edu.au on [yyyy-mm-dd]. (The Language Data Commons of Australia is a nationally funded partnership project by the Australian Research Data Commons, The University of Queensland, Australian National University, The University of Melbourne, The University of Sydney, Monash University, First Languages Australia and AARNet). ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:4:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"}] \ No newline at end of file +[{"categories":null,"content":"A brief overview of Crate-O, RO-Crate, Schemas, Profiles and Modes.","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":" Crate-O Use Cases RO-Crate Collection Hierarchy Schemas, Profiles and Modes Schema.org Style Schemas (SOSSs) and RO-Crate Profiles and Modes Crate-O is a browser-based editor that allows you to create and update Research Object Crates (RO-Crates), either using the web interface or with metadata from a spreadsheet. 
It provides researchers with a relatively simple way to describe their data using best practice in formal metadata description. Designed as a Vue component, Crate-O can be easily integrated into any Vue.js project. Simply install the component with npm and include it in your application. As a Vue software component, it can be used, cloned and distributed by anyone. Crate-O is implemented in the GitHub page Crate-O. This implementation works only with Google Chrome and Microsoft Edge. It does not link to other services, and as such, any data uploaded to Crate-O will not go anywhere other than populating your RO-Crate. We will be releasing versions that work with online resources directly, which will be compatible with other browsers (see the Roadmap). RO-Crate is a way of packaging research data that stores the data together with its associated metadata and other component files, such as the data license. It is a flexible, developer-friendly approach to linked-data description and packaging. While the current version of Crate-O is designed for editing self-contained RO-Crates (and works fine with crates containing tens of thousands of entities), our roadmap includes adding the ability to edit fragments of larger linked-data resources and to integrate with repositories, such as the Oni repository, data API and archival repositories such as the Language Data Commons of Australia. 
","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:0:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Crate-O Use Cases Crate-O is designed to work with any of the following use cases: Describe data collections and files on a user’s computer, and add contextual information about those files Describe abstract contextual entities, such as in a Cultural Collection or an encyclopaedia Annotate existing resources elsewhere on the web Submit a data collection to the LDaCA Portal Edit a Schema that contains a set of vocabulary terms, such as the terms used by LDaCA. ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:1:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"RO-Crate Collection Hierarchy The diagram below shows the hierarchical relationship between collections, objects and files in a corpus, together with the metadata categories which track these relationships. Self-contained corpus crate with all resources Image Source: LDaCA The metadata is organised according to Schema.org entity types. Entity Definition Class rdfs:Class is used to classify resources. Classes in the Language Data Commons (LDAC) schema include CollectionEvent, CollectionProtocol, DataDepositLicense, DataLicense and DataReuseLicense (see https://w3id.org/ldac/terms). Property rdfs:Property is an attribute of an instance of a Class. For example, on an entity that is an instance of Class Person the property “name” would be their name, expressed as a text string, while “affiliation” would be a property that referenced another entity, their university. DefinedTerm A ‘word, name, acronym, phrase, etc. 
with a formal definition’, ‘often used in the context of category or subject classification.’ DefinedTerms allow us to a) have accurate definitions of the values we want to give to properties, and b) group such definitions in DefinedTermSets, which can function as controlled vocabularies. The table below shows an example of the relationship between each of these entities: Level Example Class Annotation ↓ ↓ Property annotationType ↓ ↓ Defined Term Set AnnotationTypeTerms ↓ ↓ Defined Terms Gestural, Phonemic, Phonetic, Phonological, Prosodic, Semantic, Syntactic, Transcription, Translation For more details on these and other metadata entities, see Metadata for Language Data. ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:2:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Schemas, Profiles and Modes This diagram shows the relationship between the three main components used by Crate-O and other tools employed by LDaCA for specifying and validating RO-Crates. This section explains what these components are and how they relate. The three main components for RO-Crate editing with Crate-O Image Source: LDaCA A Schema specifies a metadata vocabulary of Classes and Properties, based on the RO-Crate specification’s use of Schema.org classes. An RO-Crate Mode is a set of lightweight syntactic rules for combining Schema.org Style Schema (SOSS) Classes, Properties and DefinedTerms, expressed in a JSON file that can be: loaded into an editor such as Crate-O imported into another program and used for RO-Crate validation used to summarise the rules for an RO-Crate Profile. An RO-Crate Profile has (at least) a document that explains how metadata entities from the Schema are used for a particular purpose. These are all inter-related, and can be developed together or separately using tools. 
See the links below to the LDAC schema, profile and modes: LDAC Schema LDAC Profile LDAC Modes ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:3:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":"Schema.org Style Schemas (SOSSs) and RO-Crate Profiles and Modes Schema.org, which provides the basic vocabulary for RO-Crate, has a light-touch approach to describing what it refers to as its schema (with a small-s), which might also be thought of as an ontology. Schema.org is defined as a set of Classes and Properties, each of which has an online definition. The below example illustrates that the base class Thing and its subclass Person has properties such as birthDate. Class: Thing → Sub-Class: Person → Property: birthDate Schema.org specifies which Properties can occur in the domain of which Classes, and the range of Classes that are expected as values for a Property. While Schema.org has terms for Class and Property, it does not use these for defining the classes and properties in Schema.org itself (possibly as this would be circular). Rather, it uses the equivalent Classes from the rdf: and rdfs: vocabularies. 
Here is the definition for Person: { \"@id\": \"schema:Person\", \"@type\": \"rdfs:Class\", \"owl:equivalentClass\": { \"@id\": \"foaf:Person\" }, \"rdfs:comment\": \"A person (alive, dead, undead, or fictional).\", \"rdfs:label\": \"Person\", \"rdfs:subClassOf\": { \"@id\": \"schema:Thing\" }, \"schema:source\": { \"@id\": \"http://www.w3.org/wiki/WebSchemas/SchemaDotOrgSources#source_rNews\" } } The Class definition does not have any information about the occurrence of properties – that is found in a Property definition: { \"@id\": \"schema:sibling\", \"@type\": \"rdf:Property\", \"rdfs:comment\": \"A sibling of the person.\", \"rdfs:label\": \"sibling\", \"schema:domainIncludes\": { \"@id\": \"schema:Person\" }, \"schema:rangeIncludes\": { \"@id\": \"schema:Person\" } } A SOSS is a Flattened JSON-LD graph, just like an RO-Crate. Some members of the RO-Crate community are beginning to define its basic schema and RO-Crate Profiles using the SOSS’s same approach. To make an RO-Crate Mode File, we transform the flat graph of a schema into something optimised for driving an editor or a validator; it creates a list of Classes, and what properties each may have. 
Base Mode File creation, combining the Schema.org schema and RO-Crate additions using the rocsoss script Image Source: LDaCA ","date":"2024-05-22","objectID":"/resources/user-guides/crate-o/general/:4:0","tags":null,"title":"General Information","uri":"/resources/user-guides/crate-o/general/"},{"categories":null,"content":" LDaCA Newsletter — Quarter 3 2024 Read about the new phase of the project, several events we will be running this year, a symposium hosted by the ARDC... and more! Welcome Welcome to the third issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we announce the new phase of the project, advise of several workshops and other events we will be running, and report back from a symposium hosted by the Australian Research Data Commons (ARDC). If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our LinkedIn page. News New phase of LDaCA We are delighted to announce that the ARDC’s investment in LDaCA will continue until June 2028. 
In this next phase, LDaCA will: develop the social and technical foundations for a national, distributed archival repository of language materials; continue securing vulnerable and nationally significant collections of Aboriginal and Torres Strait Islander languages, Indigenous languages in Australia’s Pacific region, varieties of Australian English and migrant languages, and sign languages of Australia and its region; continue to develop the LDaCA data portal for accessing and repurposing language data of significance to researchers and communities, including data which is held in galleries, libraries, archives and museums (GLAM); establish workflows which link repositories and analytics environments so that researchers can create fully described, reproducible research on written, spoken, multimodal and signed language; provide training and develop resources for researchers and communities which support best practice in archiving, sharing, accessing and analysing language data in line with FAIR and CARE principles. In this phase, a new Chief Investigator (CI), Dr Rose Barrowcliffe, joins The University of Queensland (UQ) team (Prof Michael Haugh and Dr Martin Schweinberger) and three new partners will join LDaCA: Batchelor Institute of Indigenous Tertiary Education (CI: Prof Kathryn Gilbey), the Australian Digital Observatory (ADO) (CI: Dr Marissa Takahashi) and the University of Western Australia (CI: Prof Clint Bracknell). Monash University will not continue as a partner, at least in the short term, and we would like to take this opportunity to thank CI Assoc Prof Louisa Willoughby and her team for the wonderful contribution they have made over the past two years. 
Our other partners will continue their involvement: Australian National University (ANU) (CI: Prof Catherine Travis), The University of Melbourne (CI: Assoc Prof Nick Thieberger), The University of Sydney (USyd) (CI: Prof Monika Bednarek), AARNet (CI: Adam Bell) and First Languages Australia (CI: Beau Williams). The project is part of the ARDC’s HASS and Indigenous Research Data Commons (RDC), which is establishing long-term, enduring national digital research infrastructure. It supports researchers in harnessing research data to enhance Australian social and cultural wellbeing, and helps us understand and preserve our culture, history and heritage. In partnership with research institutions and government, the HASS and Indigenous RDC has achieved joint resourcing of more than $40 million over the coming four years. LDaCA also welcomes the new focus areas which will increase the size and value of the RDC: the social sciences project (currentl","date":"2024-01-04","objectID":"/news/newsletter/newsletter-q3-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 3 2024","uri":"/news/newsletter/newsletter-q3-2024/"},{"categories":null,"content":"Information about the principles on which the work of LDaCA is based.","date":"2022-02-15","objectID":"/about/principles/","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":" The Language Data Commons of Australia aims to ensure long-term access to language data collections for analysis and reuse. Sustainable management of and access to these significant collections of intangible cultural heritage are underpinned by two sets of complementary guiding principles of data management and stewardship, namely the FAIR and CARE principles. 
","date":"2022-02-15","objectID":"/about/principles/:0:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"FAIR Principles The FAIR principles were first published by a group of stakeholders representing academia, industry, funding agencies and scholarly publishers. The principles aim to address issues surrounding data management and stewardship, focusing on four areas (which provide the FAIR acronym). ","date":"2022-02-15","objectID":"/about/principles/:1:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Findable Metadata and data should be easy to find for both humans and computers. Making the data findable includes: (Meta)data are assigned a globally unique and persistent identifier Data are described with rich metadata Metadata clearly and explicitly include the identifier of the data they describe (Meta)data are registered or indexed in a searchable resource ","date":"2022-02-15","objectID":"/about/principles/:1:1","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Accessible Once the user finds the required data, they need to know how they can be accessed, possibly including authentication and authorisation. (Meta)data are retrievable by their identifier using a standardised communications protocol The protocol is open, free and universally implementable The protocol allows for an authentication and authorisation procedure, where necessary Metadata are accessible, even when the data are no longer available ","date":"2022-02-15","objectID":"/about/principles/:1:2","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Interoperable The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage and processing. 
(Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation (Meta)data use vocabularies that follow FAIR principles (Meta)data include qualified references to other (meta)data ","date":"2022-02-15","objectID":"/about/principles/:1:3","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Reusable The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. (Meta)data are richly described with a plurality of accurate and relevant attributes (Meta)data are released with a clear and accessible data usage license (Meta)data are associated with detailed provenance (Meta)data meet domain-relevant community standards The Australian Research Data Commons (ARDC) and LDaCA support FAIR data practices and initiatives that make data and related research outputs FAIR. At the same time, the ARDC and LDaCA acknowledge that the implementation of the principles will look different across disciplines and will need discipline-specific approaches and standards. ","date":"2022-02-15","objectID":"/about/principles/:1:4","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"CARE Principles The CARE Principles for Indigenous Data Governance were developed by the Global Indigenous Data Alliance (GIDA); they are a response to the FAIR principles and aim to complement them. GIDA highlights how the FAIR principles and the open data movement focus on increasing data sharing among researchers and entities but do not take into account power differentials and historical contexts or fully engage with Indigenous Peoples’ rights and interests. These include the rights to generate value from Indigenous data in ways that are grounded in Indigenous worldviews and to advance Indigenous innovation and self-determination. 
The CARE Principles also focus on four areas. ","date":"2022-02-15","objectID":"/about/principles/:2:0","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Collective Benefit Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data. For inclusive development and innovation For improved government and citizen engagement For equitable outcomes ","date":"2022-02-15","objectID":"/about/principles/:2:1","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Authority to control Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered. Recognizing rights and interests Data for governance Governance of data ","date":"2022-02-15","objectID":"/about/principles/:2:2","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Responsibility Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit. For positive relationships For expanding capability and capacity For Indigenous languages and worldviews ","date":"2022-02-15","objectID":"/about/principles/:2:3","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"Ethics Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem. For minimizing harm and maximizing benefit For justice For future use The Australian Research Data Commons (ARDC) and LDaCA support the CARE principles to further extend data management principles, ensuring that Indigenous communities benefit from the data, and that authority to control Indigenous data is held by Indigenous Peoples. 
We have a responsibility to share how data is used to collectively benefit Indigenous Peoples, and to ensure that Indigenous Peoples’ rights and wellbeing are the primary concern at all stages of the data life cycle. ","date":"2022-02-15","objectID":"/about/principles/:2:4","tags":null,"title":"Principles","uri":"/about/principles/"},{"categories":null,"content":"A guide to navigating the various sections of the Crate-O interface.","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":" Main Menu Mode Selector Current Entity Property Groups Entity Properties Entity Navigator ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:0:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Main Menu Crate-O Main Menu Image Source: LDaCA From the Main Menu, the following options are available: File Option Description Open Directory Select a directory/folder to locate or set where files are saved. This is the required first step for all RO-Crates. Load Files Load files from the selected directory into this RO-Crate. Bulk Add Select an Excel spreadsheet or similar from a directory to assist you with metadata description (the directory can be the same as that for the RO-Crate). This will append to your existing RO-Crate if one has already been created. Note that this option currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. Save Save the state of this page into your RO-Crate. This will create an ro-crate-metadata.json file, or add data into an existing ro-crate-metadata.json file. Close Close without saving. Help Display general help from anywhere in Crate-O. About Display general information about Crate-O. 
","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:1:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Mode Selector Crate-O Mode Selector Image Source: LDaCA Below the main menu, the Mode Selector allows you to choose a predefined mode or load one from your computer. Note that for blank RO-Crates, the mode will default to Simple RO-Crate Dataset, however for language data, the mode Language Data Commons top-level Collection (corpus) is most relevant. When a file or folder is open in Crate-O, Selected Directory to the right of the Mode dropdown menu shows the directory or folder you currently have open. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:2:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Current Entity Crate-O Current Entity Image Source: LDaCA Below the Mode Selector, a home icon followed by entity breadcrumbs indicates your entity history, and underneath that, the entity you are currently located on. These will update as you navigate to different entities. Note that the breadcrumbs section doesn’t reflect the tree structure of the RO-Crate, but instead shows the history of the entities you have visited. To the right of this is the option to Remove Entity that you are currently viewing (provided it is not the root entity). ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:3:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Property Groups On the left-hand panel, depending on the selected mode, different Property Groups will be available: Crate-O Property Groups Image Source: LDaCA Property Group Description About The core metadata for this RO-Crate and its subject matter. 
Related People, Orgs \u0026 Works The context for the creation of this RO-Crate: who made it, funded it etc. Structure How the parts of this RO-Crate relate, such as collection and object relationships. Provenance A detailed description of how entities were created, by whom and with which tools. Space \u0026 Time Where and when the data was collected, i.e. the times and places it mentions or describes. Software \u0026 Hardware Computer programs and execution environments that could be used to create data, have created data, or are being packaged and described. Others Other properties not in the above categories. Note that if you find a property in Others that should be in one of the above groups, or if you have other suggestions, please raise an issue on GitHub. Hovering over each property group will display tooltips related to that group. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:4:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Entity Properties In the middle panel, the Entity Properties for the property group you have selected are shown. Crate-O Entity Properties Image Source: LDaCA The exclamation mark icons to the left indicate that an entity is a required property for that mode, and saving an RO-Crate without completing one of these sections will result in a warning message. Below each field is a tooltip providing further detail on the given entity. Each entity field has a ‘delete’ icon beside it; however, these are only functional (and appear in red) for entities that can be removed, i.e. required properties cannot be deleted, but you can edit them. There are several compulsory metadata fields required for each collection: Property Group Metadata Description Additional Notes About @id Persistent, managed unique ID in URL format (if available), for example, a DOI for a collection or an ORCID, personal home page URL or email address for a person. 
Existing persistent identifiers for persons and organisations can be found at: Open Researcher and Contributor ID (ORCID), Research Organization Registry (ROR) and Trove (use the People \u0026 Organisations search category). About @type The type of the entity. Both Dataset and RepositoryCollection are required for top-level collections in the Language Data Commons (LDAC) Mode, as these determine the available properties. About Name The name of this collection. About Description An abstract of the collection. Include as much detail as possible about the motivation and use of the dataset, including things that we do not yet have properties for. May include, but is not limited to, the following: a brief description of participants (age range, gender, particular characteristics as a group, etc.); purpose of the collection; research value of the dataset. About License Link to a document that describes the rights and obligations for users of this collection record. This does not necessarily cover the license terms that may apply to Objects in the collection which may have specific licensing. Licensing on specific objects overrides the license attached to a collection record. Related People, Orgs \u0026 Works Author The person or organisation responsible for creating this collection of data. The options Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Related People, Orgs \u0026 Works Publisher The organisation responsible for releasing this dataset. The lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Others Rights Holder A person or organisation owning or managing rights over the resource. The options Text, Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. 
Others Accountable Person The person or organisation who is the data steward for this resource. The options Person and Organization are available. If Organization is selected, the lookup pulls from the Research Organization Registry (ROR), or a new organisation can be created if not present. Others In Language The language(s) of the materials (including PrimaryMaterials, DerivedMaterials and Annotations) in this collection. The lookup pulls from AustLang, Glottolog and Ethnologue. Both AustLang names and codes can be used. If a language is not present, it can be created. Others Subject Language The languages that the materials in the collection are about (not the language that it is in). This is particularly used on Annotations that may talk about PrimaryMaterials or DerivedMaterials. The lookup pulls from AustLang, Glottolog and Ethnologue. Both AustLang names and codes can be used. If a language is not present, it can be created. To browse all the metadata entities associated with the LDAC Mode, see Metadata for Language Data. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:5:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":"Entity Navigator On the Entity Navigator panel, there are some further options related to navigating and creating metadata entities. Crate-O Entity Navigator Image Source: LDaCA Create New Entity: Create a new metadata entity, such as a provenance action. For example, CreateAction describes how a work is created with more precision than a property like author. All Entities: Select to view all metadata entities associated with your collection. Unlinked Entities: Select to view all metadata entities that are not currently referenced by properties on the Root dataset or other entities, for example, an entity of type Person which is not referenced using an author property. 
Below this, there is pagination which allows you to navigate through results, with 10 results appearing per page. ","date":"2023-04-22","objectID":"/resources/user-guides/crate-o/basic-navigation/:6:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/crate-o/basic-navigation/"},{"categories":null,"content":" LDaCA Newsletter — Quarter 2 2024 Read about the new case study and blog post up on our website, our draft project plan, a recap of the ARDC Comp Skills Summer School 2024 and more! Welcome Welcome to the second issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we highlight the LDaCA draft project plan and co-design workshop report published by the Australian Research Data Commons (ARDC), and report back from several exciting events, including the ARDC’s Computational Skills Summer School 2024 and a tech team visit to Batchelor Institute of Indigenous Tertiary Education. If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our LinkedIn page. News New content on website We have added two new pieces of content to our website: a case study about data management in language technology, based on projects at tech company Appen. 
a blog post about why the Australian National Corpus is not a single collection within LDaCA. LDaCA draft project plan On 14 March, the ARDC released draft project plans for the next phase of the HASS and Indigenous Research Data Commons (HASS\u0026I RDC), including a plan for the LDaCA project. They were seeking feedback from the public on project activities planned for the next four years. Feedback closed on 22 March, and once this feedback has been reviewed and incorporated, final project plans will be released in the coming months. Meanwhile, the draft plans are still available to view online. Co-designing a digital future for Indigenous language materials Robert McLellan and his team are seeking Indigenous language champions who are working with language materials for a research project called “Co-designing a digital future for Indigenous language materials”. We plan to run a series of Zoom and face-to-face interviews to find out what works well and what doesn’t work when finding, accessing and using Indigenous language materials. We know that there are challenges involved with accessing and using language materials stored in various locations. We want to contribute to making that data more findable and usable so that it can support and enhance the language work currently being undertaken. So, we are seeking input to help build language platforms, spaces and tools that suit Indigenous language workers and communities. If you are interested or would like to know more, please take a minute to get in touch through our Contact Form. All interviewees will receive a gift voucher worth $70 as a thank you for their time. Graduate Digital Research Fellowship program The Graduate Digital Research Fellowship (GDRF) program ran very successfully in 2023. We are hoping to build on this success, with three fellows from The University of Queensland (UQ) participating in the 2024 program: Lu Jin is a PhD candidate in architecture and urban design. 
Her multidisciplinary work will design urban green infrastructural networks in a circular food system for city resilience and sustainability. In the GDRF program, Lu will be applying machine learning methods to identify and assess green cover in street view images. David Gilchrist is a PhD candidate in journalism. His research looks at how journalists connect with audiences, and in the GDRF program, he hopes to explore methods for finding and gathering relevant","date":"2023-02-04","objectID":"/news/newsletter/newsletter-q2-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 2 2024","uri":"/news/newsletter/newsletter-q2-2024/"},{"categories":null,"content":" LDaCA Newsletter — Quarter 1 2024 Read about our Graduate Digital Research Fellowship, ARDC's upcoming Computational Skills Summer School, our 2023 ALS conference activities and more! Welcome Welcome to the first issue for 2024 of this newsletter about the activities of the Language Data Commons of Australia (LDaCA) and the Australian Text Analytics Platform (ATAP). This quarter, we introduce the 2024 Graduate Digital Research Fellowship program, highlight the upcoming Computational Skills Summer School 2024 from the Australian Research Data Commons (ARDC) and report back from the 2023 Australian Linguistics Society (ALS) annual conference. 
If you have any questions or feedback, please email us at ldaca@uq.edu.au or message us on our new LinkedIn page. News New LinkedIn page LDaCA is now on LinkedIn! We look forward to sharing more content about the project and connecting with the wider community. Follow our page or send us a message here. Website update Our website has a new look and has been reorganised. We hope that you like the design changes and that information is now easier to find. We encourage feedback via email or LinkedIn so that we can continue to make the website more useful. Online policy documents and advice Several documents outlining LDaCA policies and providing advice for data custodians are now available on the LDaCA website under Resources, then LDaCA Resources. The key policy document details our Access Policy; this policy document is supported by a guide to Determining Access Conditions. There is also a document explaining the LDaCA Data Onboarding Process, supported by Guidance for Data Governance Decisions and a guide to Obtaining a DOI. These documents are the outcome of wonderful work by project members at the Australian National University: Catherine Travis, Li Nguyen and Cale Johnstone. Rosanna Smith (LDaCA User Design Analyst) prepared the material for online publication. 2024 Graduate Digital Research Fellowship Program In the last five years, the University of Queensland (UQ) has hosted several iterations of a Graduate Digital Research Fellowship program focused on researchers in the Humanities and Social Sciences (HASS). In 2023, the program ran under the auspices of ATAP. Two members of our UQ team, Sam Hames (Postdoctoral Research Fellow in Computational Humanities) and Simon Musgrave (Engagement Lead for ATAP and LDaCA), led the program and four students participated. You can read more about the history of the program and the 2023 cohort in this blog post. 
Image: LDaCA The program will run again in 2024 as an LDaCA activity, supporting research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spend 12–15 weeks learning about digital and computational skills in order to enhance their current research topic or to work on an independent digital project. They explore digital research methods in areas including, but not limited to: computational analysis of text creation of new software tools to support their specific research area social media analytics assessment and analysis of digital environments including mobile apps online games as social and cultural objects. The Fellows meet regularly in activities to develop a sound understanding of digital research methods and tools, including seminars, reading groups and training works","date":"2023-12-03","objectID":"/news/newsletter/newsletter-q1-2024/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 1 2024","uri":"/news/newsletter/newsletter-q1-2024/"},{"categories":null,"content":"A step-by-step guide to creating an RO-Crate metadata file.","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":" This guide takes you through the steps to create an RO-Crate metadata file, using a top-level collection for language data as an example. To view more details and screenshots of each section discussed, select the click-through links throughout the page. Open Directory Select Mode Add Entity Metadata Save RO-Crate Append Data from Spreadsheet ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:0:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Open Directory Open Crate-O in a compatible browser. 
In the Main Menu, select Open Directory. Navigate to a folder where the RO-Crate will be saved, or create a new one, then confirm your selection. For the pop-up message asking Let site view files?, select View Files. Your current working directory will be displayed in the Selected Directory section in the Mode Selector. Crate-O: Open Directory Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:1:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Select Mode In the Mode Selector, the Mode dropdown shows the current mode that is being displayed, i.e. the metadata framework associated with your collection. Change the mode from the default Simple RO-Crate Dataset to Language Data Commons top-level Collection (corpus). Crate-O: Select Mode Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:2:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Add Entity Metadata The Current Entity section shows your location in the current RO-Crate. The Property Groups panel allows you to navigate to the various groups associated with your RO-Crate. The Entity Properties panel is where you can add data about your collection. For blank RO-Crates, the following missing property messages will appear by default. Clicking the blue Add buttons will automatically add these missing items to your RO-Crate. Crate-O: Missing Properties Image Source: LDaCA Enter the entity properties you have for your collection. Add as little or as much information about your collection as you like, as this can be saved and worked on further later. To browse all the metadata entities associated with the Language Data Commons (LDAC) Mode, see Metadata for Language Data. 
Crate-O: Add Entity Metadata Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:3:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Save RO-Crate In the Main Menu, select Save. For the pop-up message asking Save changes to [Selected Directory]?, select Save changes. Crate-O: Save RO-Crate Image Source: LDaCA Your RO-Crate is now successfully saved in your working directory, with two files: ro-crate-metadata.json: The saved RO-Crate in JSON format. ro-crate-preview.html: An HTML file that can be viewed on a web browser and shows the contents of your RO-Crate. ro-crate-metadata.json ro-crate-preview.html ro-crate-metadata.json Image Source: LDaCA ro-crate-preview.html Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:4:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Required Properties After saving, if there are required properties missing from your RO-Crate, the section Saved with warnings will appear. You can select the dropdown on this message to view the missing required properties. Clicking on one of these warnings will take you to the relevant property group. If you choose to edit any of these sections, select Save again to ensure your most recent changes are not lost. 
Crate-O: Saved with Warnings Image Source: LDaCA ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:4:1","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":"Append Data from Spreadsheet If you have a spreadsheet in a compatible format that you want to add to your collection to assist with metadata description, or you want to create a new RO-Crate only using spreadsheet data, open an existing or new directory, then select Bulk Add in the Main Menu and load the spreadsheet. This will append it to your existing RO-Crate. For more guidance on the spreadsheet requirements and a template, see Spreadsheet Upload. Note that this option currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. Bulk Add also only reads from the spreadsheet and does not write to it. ","date":"2022-03-27","objectID":"/resources/user-guides/crate-o/ro-crate-creation/:5:0","tags":null,"title":"RO-Crate Creation","uri":"/resources/user-guides/crate-o/ro-crate-creation/"},{"categories":null,"content":" Back to Newsletters LDaCA Newsletter | Quarter 4, 2023 06/10/23 Welcome Welcome to the second issue of this newsletter. 
We thank you for your interest in the Language Data Commons of Australia and the Australian Text Analytics Program, and we hope that this newsletter will help to keep you informed about our activities. If you have questions about anything you read here, or if you have feedback for us, you can mail us at ldaca@uq.edu.au (or use one of the alternative contact points listed at the end of this message). News ARDC All Staff Meeting ARDC held an event in Melbourne last month for all of their staff. Michael Haugh and Robert McLellan were invited to present at the event about working with the ARDC on LDaCA. Their presentation, which was very well received, was part of a showcase of projects in the ARDC's three thematic data commons (Planet, People, HASS and Indigenous). Image Credit: ARDC Event at the Australian Linguistic Society Conference We will be holding a meeting during the Australian Linguistic Society 2023 Conference to provide an update on the activities of the project. This event will take place at lunchtime on the last day of the conference (Friday December 1), venue TBC. New Team Members Two people have joined our team in the last quarter, Otis Carmichael and Rosanna Smith, and we will let them introduce themselves: Hi there, my name’s Otis Carmichael, I’m a Waanyi man from Brisbane and a technical designer on the LDaCA team. My initial involvement with the team was to design the team’s logo and other promotional material, however I am now involved in Community Engagement and working on design throughout the website and tools. I’m excited to be on board with the project because I believe that it has potential to really support the domain of language research, by helping to ensure that research data is stored in a more sustainable manner and thus will remain accessible for the language’s community. Hi, my name's Rosanna Smith and I'm a user design analyst on the LDaCA team, based in Melbourne. 
My background is in linguistics, having managed a variety of language projects for use in machine learning development and language technology. I’m excited to contribute to a project such as LDaCA that is guided fundamentally by FAIR and CARE principles, as well as the opportunity to apply my previous experience and further promote diversity and accessibility in language data management and research. Events Forthcoming Events Workshop on immigrant language corpora in Australia When: 9-10 November, 2023 Where: Onsite. ANU, Acton, ACT 2601 Guest speakers: John Hajek, Ingrid Piller, Sophie Loy-Wilson, Adrian Vickers Australia has long been seen as one of the world’s most multilingual and multicultural societies, with more than 490 languages coming from around 300 ancestries and cultural traditions (ABS, 2021, 2022). For decades, the language and cultural maintenance of various immigrant groups have been under investigation by many scholars, not only in linguistics but also in history, sociology, anthropology, and many other disciplines. This work has amassed a large body of immigrant language data that provides information about how Australia’s immigration history has contributed to the country today. The purpose of this workshop is to bring together scholars working with language corpora from across different disciplines. 
The workshop is being run as part of the Langua","date":"2023-08-14","objectID":"/news/newsletter/newsletter-q4-2023/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 4 2023","uri":"/news/newsletter/newsletter-q4-2023/"},{"categories":null,"content":"Guidance and a template for adding new data to an RO-Crate via spreadsheet.","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":" Template Tab Breakdown Column Breakdown Root Authors Publishers Licenses People Objects Files (CSV, EAF, WAV) Upload Spreadsheet to an RO-Crate with Crate-O ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:0:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Template For collections where there are a lot of interconnected objects and files, it may be easier or preferable to add the metadata for these via uploading a spreadsheet to an existing RO-Crate in Crate-O, rather than adding these items manually. An RO-Crate metadata spreadsheet template can be downloaded below and populated with metadata specific to your collection: ro-crate-metadata-template.xlsx Spreadsheet upload currently only has functionality to add new data, and cannot overwrite or edit existing data in your RO-Crate. 
The template is based on an example data collection that contains three types of files within each object: Audio files (WAV), the primary material Text files (CSV), transcriptions of the audio files ELAN files (EAF), linguistic annotations of the audio files ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:1:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Tab Breakdown The spreadsheet has the below tabs by default, but depending on your collection, you may need to add additional tabs, or others may not be applicable. Tab Description Root Metadata about the root or top level of the collection. Authors Metadata about the person or organisation responsible for creating this collection. Publishers Metadata about the organisation responsible for releasing this collection. Licenses Metadata about the license(s) within the collection; both for the objects and files, and for the collection’s metadata. People Metadata about the people within the collection. Objects Metadata about the entities within the collection that could encompass one or more files. Files Metadata about the files in your collection. If the collection has multiple file formats, duplicate this tab and add the formats to the tab names, e.g. csv_files, eaf_files, wav_files. ELAN (.eaf) files can have relative or absolute paths to the data they relate to. The ELAN preferences file is generally not needed for the collection and relates to the particular ELAN user only. Below the header, an example row is included to illustrate how the section can be filled. The example row is colour-coded according to whether the column: requires the user to input data (blue) is pre-filled with a formula or static value and doesn’t require editing (green). HINT: Highlight the example row and drag it down to copy all the pre-filled cells. 
Don’t forget to remove the example rows before you upload your spreadsheet to Crate-O! At a minimum, it’s best practice to include @id and @type columns in each of your spreadsheet tabs, as these appear in Crate-O for each of the entities. The tables in the next section provide further details on what constitutes a valid @id and @type in each tab. For more detailed lists of these, see Metadata for Language Data. HINT: To type a column name beginning with @ in Excel, put an apostrophe before it '@. This will force it to be recognised as a text value rather than a formula. The columns provided in the template tabs are illustrative only and may not all apply to your collection; please edit these as needed. Where a column header begins with a full stop (.), this indicates that the column will be ignored when the data is loaded into Crate-O and will not appear in the RO-Crate. This can be helpful if you want to retain other information in your spreadsheet that may not be in a format applicable to the RO-Crate. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:2:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Column Breakdown The section below describes each of the columns included in the template, ordered by tab. Please note that the columns provided in the template tabs are illustrative only and should be edited according to the requirements of your collection. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Root The root tab provides information about the top level of the collection. Unlike the other tabs, the root tab can only have one row, so if there are columns that require more than one value, duplicate that column. 
Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, a DOI for a collection. @type Pre-filled The type of the collection. Both Dataset and RepositoryCollection are required. name Data entry The name of this collection. description Data entry An abstract of the collection. Include as much detail as possible about the motivation and use of the dataset, including things that we do not yet have properties for. isRef_author Pre-filled Generated from the @id column in the Authors tab. isRef_publisher Pre-filled Generated from the @id column in the Publishers tab. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:1","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Authors An author is a person or organisation responsible for creating the collection. It is possible for collections to have multiple authors. Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, an ROR for an organisation or an ORCID, personal home page URL or email address for a person. @type Data entry The type of the author. Either Person or Organization can be selected. name Data entry The name of the author. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:2","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Publishers A publisher is an organisation responsible for releasing the collection. It is possible for collections to have multiple publishers. Column Type Description @id Data entry Persistent, managed unique ID in URL format (if available), for example, an ROR for an organisation. @type Pre-filled The type of the publisher. Only Organization is valid. name Data entry The name of the organisation. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:3","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Licenses A license for a collection establishes the conditions for who can access, share and reuse the data, and other conditions as required. It is a legal arrangement between the creator of the data and the end-user specifying what users can do with the data. Column Type Description @id Data entry A URL to a version of the license (if available), for example, a URL of a Creative Commons license. If there is no URL, a license.txt file containing the text of the license needs to be included in the repository, and license.txt should be added as the @id. @type Pre-filled The type of the license. Only DataReuseLicense is valid. name Data entry The name of the license. description Data entry A description of the license. metadataIsPublic Data entry Determines whether the collection metadata can be viewed publicly. Requires a Boolean value (TRUE or FALSE). allowTextIndex Data entry Determines whether the collection text can be indexed for search purposes. Requires a Boolean value (TRUE or FALSE). It is possible to leave the licensing tab blank if these details are still being finalised for the collection, however, this will need to be amended later in Crate-O. For custom licenses (i.e. those specific to a particular collection), it is recommended that a copy of the license be included in the repository to ensure that it remains accessible. Furthermore, if there are any additional usage restrictions or options for use outside of a given license, this information can be included in a usageInfo field, e.g. “For any use not permitted by the CC-BY-ND 4.0 License, please contact the Data Steward”. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:4","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"People This tab contains information about the people within the collection. Column Type Description @id Pre-filled A unique identifier for the person, generated from the name column. Identifiers should be prefixed with #. @type Pre-filled The type of the entity. Only Person is valid. name Data entry The name of the person. language_code Data entry The language spoken by the person. An example of an optional metadata field from the source data. Language codes can be obtained from AustLang, Glottolog and Ethnologue. gender Data entry The gender of the person. An example of an optional metadata field from the source data. birthDate Data entry The birth date (year) of the person. An example of an optional metadata field from the source data. isRef_specializationOf Data entry A reference to another Person entity, used for collections where a person appears more than once with different demographic info (e.g. a different age). In these collections, there should be a ‘canonical’ person for each participant and another Person entity each time they participate, with different ages or other statuses. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:5","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Objects An object is a single resource or a group of tightly related resources in a collection. For example, a work (document) in a written corpus, or the files associated with a dialogue or session in a speech study (recordings, transcriptions etc.). Some systems, such as PARADISEC, refer to Objects as Items or may use other terms. Column Type Description @id Pre-filled A unique identifier for the object, generated from the name column. 
Identifiers should be prefixed with #. @type Pre-filled The type of the entity. Only RepositoryObject is valid. name Data entry The name of the object. description Data entry A description of the object. isRef_speaker Pre-filled Generated from the .pseudonym column with # prefixed. .pseudonym Data entry An example of a column from a data steward’s source data, so that speakers in the collection are anonymised. datePublished Data entry The date the object was published. The date will be converted in Crate-O to ISO 8601 format, e.g. Wed Jun 12 2024 10:00:00 GMT+1000 (Australian Eastern Standard Time). isRef_pdcm:memberOf Pre-filled The collection this object is a member of, generated from the @id column in the Root tab. Or if the collection contains sub-collections, a reference to another RepositoryCollection @id. isRef_license Data entry The @id of the license to which this object adheres. isRef_indexableText Data entry Identifies which of the files in the given object has content that is indexed for search purposes. For example, in the template, the content of the CSV file would be searchable, whereas the EAF and WAV files would not. If isRef_indexableText is not included in a collection, search will only run on the metadata and not the transcript file content. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:6","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Files (CSV, EAF, WAV) A file is a container for data and can store data in different formats. For example, a single object could have an audio file as well as a text file containing a transcription of the audio. Three examples of file tabs are included in the template, and their columns are combined in the table below. Tab Column Type Description CSV, EAF, WAV @id Pre-filled The filepath to the given file. Generated from the .folder, .filename and .postfix columns. 
CSV, EAF, WAV @type Pre-filled The type of the entity. Only File is valid. CSV, EAF, WAV .folder Data entry The folder name in which the given file appears. CSV, EAF, WAV .filename Data entry The name of the given file, without postfixes. CSV, EAF, WAV .postfix Data entry The file format of the given file, for example, .csv, .eaf, .wav. CSV, EAF isType_Annotation Data entry Indicates whether the given file is an annotation of another file. Requires a Boolean value (TRUE or FALSE). WAV isType_PrimaryMaterial Data entry Indicates whether the given file is the object of study, such as a literary work, film, or recording of natural discourse. Requires a Boolean value (TRUE or FALSE). CSV, EAF, WAV isRef_partOf Pre-filled Generated from the .filename column. CSV, EAF isRef_annotationOf Data entry The full filename of the primary material that the given file is an annotation of. CSV, EAF, WAV .objectId Pre-filled Generated from the .filename column. ","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:3:7","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":"Upload Spreadsheet to an RO-Crate with Crate-O For steps on adding your spreadsheet data to an existing RO-Crate, see Append Data from Spreadsheet. 
","date":"2021-02-28","objectID":"/resources/user-guides/crate-o/spreadsheet-upload/:4:0","tags":null,"title":"Spreadsheet Upload","uri":"/resources/user-guides/crate-o/spreadsheet-upload/"},{"categories":null,"content":" Back to Newsletters \u003c!DOCTYPE html\u003e LDaCA Newsletter | Quarter 3, 2023 Campaign URL Copy Twitter 0 tweets Subscribe Past Issues RSS Translate English العربية Afrikaans беларуская мова български català 中文(简体) 中文(繁體) Hrvatski Česky Dansk eesti keel Nederlands Suomi Français Deutsch Ελληνική हिन्दी Magyar Gaeilge Indonesia íslenska Italiano 日本語 ភាសាខ្មែរ 한국어 македонски јазик بهاس ملايو Malti Norsk Polski Português Português - Portugal Română Русский Español Kiswahili Svenska עברית Lietuvių latviešu slovenčina slovenščina српски தமிழ் ภาษาไทย Türkçe Filipino украї́нська Tiếng Việt LDaCA Newsletter | Quarter 3, 202313/07/23 WelcomeWelcome to the first issue of this newsletter. We thank you for your interest in the Language Data Commons of Australia and the Australian Text Analytics Program, and we hope that this newsletter will help to keep you informed about our activities. If you have questions about anything you read here, or if you have feedback for us, you can mail us at info@ldaca.edu.au (or use one of the alternative contact points listed at the end of this message). NewsThe next phaseThe initial round of funding from ARDC for LDaCA and ATAP ended in June. ARDC have submitted a Research Infrastructure Investment Plan (otherwise known as the RIIP) covering the period through to 2028, but further funding for LDaCA has been approved for the coming financial year through to end of June 2024 while we await the outcome of the RIIP2023 application. LDaCA Director Prof Michael Haugh. Image ARDC/Marc Grimwade In the coming year, LDaCA will build on the work accomplished in the last two years while also developing some new areas of work. 
These will include building capabilities in repurposing data, particularly for community research platforms, and in collecting and organising data, particularly for donated social media data. ATAP will also continue to receive funding support within the broader LDaCA program. Language materials at Batchelor College, NT. Image: Ben Foley Community Connect project LDaCA has recently been successful in gaining funding under ARDC’s Community Connect program which aims to make data more accessible to communities (researchers or others). Through a co-design process in the second part of this year (2023), we will develop a training workshop and associated materials to provide Indigenous language workers and community members with the knowledge and skills to use our data portal effectively and to find the data in which they are interested. The co-design process will involve the participants in the University of Queensland’s Winter Language Revitalisation Intensive Workshop as well as discussion sessions at both the Australian Languages Workshop (July 21-23) and the Puliima Conference (August 21-25) where our team members, led by Robert McLellan, will explore ways of breaking down the jargon which is comfortable for academics but which can be a barrier for others. On the basis of the knowledge gained in these activities, the team will then develop the workshop and training materials. Initial delivery of a workshop will occur in November. The Language Revitalisation program at UQ involves Samantha Disbray, Des Crump and Robert McLellan (also LDaCA’s Program Manager). On the LDaCA side, our Engagement lead, Simon Musgrave, will coordinate the project, and we are delighted that Otis Carmichael (our logo designer) will join the team. Events Forthcoming Events Workshop on Data Migration Skills This workshop will introduce you to tools to efficiently migrate material in various formats to the standards developed by LDaCA. 
ANU, July 20 and 21 (Day 1 - hybrid, Day 2 - in person only) Details: Events - ATAP Heritage data at Monash University. Image: Simon Musgrave Recent Events Catherine Travis and Li Nguyen convened a workshop on Language Corpora in Australia (Canberra and online, July 3). The link to the program and abstracts can be found here. Peter Sefton and Simon Musgrave participated in the 2nd Workshop on Digita","date":"2023-07-13","objectID":"/news/newsletter/newsletter-q3-2023/:0:0","tags":null,"title":"LDaCA Newsletter Quarter 3 2023","uri":"/news/newsletter/newsletter-q3-2023/"},{"categories":null,"content":"As detailed in an earlier post, we ran a Graduate Digital Research Fellowship (GDRF) program in 2023 under the auspices of the Australian Text Analytics Program (ATAP). The program ran again in 2024, under the Language Data Commons of Australia banner with three students participating and with Sam Hames and Simon Musgrave co-ordinating the activity. The program supports research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spent 12–15 weeks learning about digital and computational skills in order to enhance their current research/thesis topic or to work on an independent digital project. They explored digital research methods in areas including: computational analysis of text analysis of digital images automatic speech recognition (and automatic non-speech recognition). The Fellows met regularly in activities to develop a sound understanding of digital research methods and tools, including seminars and training workshops. They also had access to mentors who advised and collaborated on their digital projects. 
The three Fellows who participated in the 2024 program and their projects were: ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:0:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"David Gilchrist (UQ, Journalism): David’s research looks at how journalists connect with their audiences, particularly in local and hyper-local contexts. In the Fellowship program, David focused on two areas: how to collect and curate material published online whether there was a place for sentiment analysis in his methodology. More abstractly, the questions which David was addressing were about how to integrate computational methods into a qualitative enquiry, and (not surprisingly!) we had some fascinating discussions with David about possible answers to such questions. The opportunity to work with scholars from different disciplines was highly rewarding. It provided a variety of interesting and thought provoking insights and questions that ultimately enhanced my own work. What’s more, the experience of working with a small team of dedicated scholars and mentors established a high quality academic environment and built long-lasting friendships that made this a truly inspiring and rewarding fellowship. Our mentors are exceptional in their academic and leadership abilities that build a team that will find longevity well beyond this initial experience. I highly recommend this program. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:1:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"Lu Jin (UQ, Architecture, Design and Planning): Lu is researching productive urban food landscapes as a spatial design issue. 
A preliminary step in her work is to be able to assess the amount and diversity of vegetation in the city and her work as a Fellow explored possible methods of analysing digital images which could make this task easy and quick. She used images and videos captured on a phone and then experimented with different image analysis algorithms to see which ones best identified greenery and its boundaries. The experiments showed that relatively simple approaches worked quite well, but further work is needed to decide whether the results will be useful for large scale surveys of urban environments. In Lu’s own words: Working closely with scholars from various disciplines allowed me to gain insights into research tools that are perhaps more common to other fields. Given my background in architecture, I’ve always been daunted by the idea of coding and programming. But in this day and age, knowing how to manipulate digital data is such a powerful skill to harness. This research fellowship program has kickstarted my learning journey into applying and adapting digital research methods for urban design and analysis. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:2:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"Quy Pham (UQ, Applied Linguistics): Quy is studying errors in the spoken language produced by students learning English as a foreign language. He is using a large spoken corpus of material produced by learners of English in 10 countries and areas in Asia, ICNALE: The International Corpus Network of Asian Learners of English. One phenomenon which might relate to errors is pausing, and Quy was therefore interested to explore the possibilities of automatically identifying various characteristics of silent and filled pauses in recordings of speech, including their duration and placement within sentences. 
Building from this, he also looked at whether it was possible to automatically tag the start and end points of specific tasks within a controlled interaction (spoiler alert: not really). The GDRF program provided me with a unique opportunity to sharpen my coding and critical thinking skills. I gained valuable experience in applying a range of Python packages and large language models to analyse pausing behaviors in a large spoken learner corpus, while being aware of their critical limitations. On a personal level, I also had a chance to meet with an amazing group of bright and talented researchers from diverse fields. We engaged in insightful discussions on various topics such as citation practices, research ethics and project management. I highly recommend this program to anyone considering it. From left: Quy, Sam, Lu, Simon and David Image Source: Simon Musgrave A GDRF program will be offered again in 2025 by the Language Data Commons of Australia. If you think the program could benefit you and would like further information, you can contact Sam Hames or Simon Musgrave. ","date":"2024-08-13","objectID":"/news/posts/gdrf-2024/:3:0","tags":null,"title":"Graduate Digital Research Fellowship — 2024","uri":"/news/posts/gdrf-2024/"},{"categories":null,"content":"LDaCA Senior Data Manager Dr Julia Colleen Miller discusses her experience in archiving and data management for cultural heritage data.","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"by Dr Julia Colleen Miller For nearly ten years, I have been officially working as a Data Manager. 
I say officially because, having worked on collaborative research projects and earned a PhD in Linguistics within the context of an endangered language documentation collaborative project, I’ve been managing other people’s data, as well as my own, since I was a fledgling academic in the early 2000s. I am based at the Australian National University (ANU), and since mid-2023, I’ve been working as the senior data manager for the Language Data Commons of Australia (LDaCA). Prior to this, I worked for the ARC Centre of Excellence for the Dynamics of Language (CoEDL), which could offer an eight-year data manager position, very generous in a field where contracts often last only two or three years. It was at CoEDL where I first took on this academic-adjacent role, focusing on bolstering research and research infrastructure. It was also the first time I had heard of such a position becoming available that addressed the data management and archiving needs of those actively recording or assembling cultural heritage data. Such data includes tabular data, text corpora, collections of songs and music, ethnographic interviews, elicited linguistic content, oral histories, traditional narratives and lexicons, and can be in different formats, such as text, video, audio, image, film, magnetic tape and digital files. The best part of this CoEDL role was that the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC), where I had been running the ANU unit since 2010, was going to be the main repository for CoEDL, and the two roles of archivist and data manager would be fundamentally intertwined. I had finally found my calling! 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:0:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"The day-to-day of data management and archiving At CoEDL, my primary responsibility was to help facilitate the data management and subsequent archiving of materials collected by CoEDL members and affiliates from institutions such as ANU, The University of Queensland, The University of Melbourne, Western Sydney University (WSU), as well as by the broader global research communities. This involved managing a continuous flow of enquiries from new and continuing researchers, offering guidance, and overseeing the entire data management and archiving processes. My aim was to ensure that valuable research data was preserved and made: accessible for future research available to the people and communities recorded, as well as to their descendants. Most days, in-person or over Zoom, I met with depositors to: guide them through the process of describing their data for their own research databases, making eventual archiving much easier discuss appropriate file formats find or create collaborative workspaces for CoEDL students, staff and their external collaborators develop data access protocols, refining conditions of access for archived collections offer file transfer options, ensuring secure data transmission inform researchers of other archiving alternatives if PARADISEC was not the best fit for their material. Over the years, I have tried to make the data management and archiving processes more efficient and helpful by creating archiving guides. Originally, these were static PDF files available on the CoEDL website; however, as they needed to be dynamic sources of information for rapidly changing technologies, they are now a series of webpages. 
Figure 1: Screenshot of the overview page of PARADISEC's Archiving Guides and Technical Workflows website. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:1:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Managing the PARADISEC ANU unit In my concurrent role as manager of the ANU unit of PARADISEC, I oversaw a wide range of digitisation and archiving activities, ensuring that important cultural materials were preserved and accessible for current and future use. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Digitising analogue materials We often receive magnetic tape audio (cassette and reel-to-reel tapes) and field notebooks to archive. Engaging with legacy formats is equal parts exciting and challenging. The challenging parts are: finding and maintaining the necessary playback devices dealing with damaged or mouldy materials creating functional, evolving workflows adhering to standards set by digital preservation’s peak bodies. …but the rewards are legion. There is nothing like hearing voices recorded decades ago, and then sharing those recordings with descendants of those speakers! It’s also a great feeling to make available rare language material by digitising the items, and then engage with young researchers to help enrich the descriptions of the content. Even more so if it helps them with their research topics. 
Some of the digitising tasks we conducted during this time for CoEDL researchers and general PARADISEC depositors were: digitising audio cassettes and reel-to-reel tapes, performing all necessary post-production editing to prepare for archiving Figure 2: Studer reel-to-reel tape player, with a tape threaded and ready to digitise. Image Source: Dr Julia Colleen Miller photographing tapes and tape boxes containing written metadata on their labels or inserts found within; these images then accompanied the audio files in the archive Figure 3: Collection of images of reel-to-reel tapes, tape boxes and loose sheets of paper, all containing important written metadata about the contents of the tapes. Image Source: Haoyi Li digitising manuscripts and field journals, ensuring high-quality reproduction. Figure 4: Digitising a field notebook with a DSLR camera mounted to an overhead shelf (out of shot). The camera is tethered to a laptop for remote capture. Image Source: Dr Julia Colleen Miller It is very satisfying knowing that by digitising these items, we helped make available legacy recordings brought to us by retiring researchers, small regional cultural institutions, and those found in our very own university archives. If you are interested in our workflows for audio digitisation or manuscript and field note image capture, you can find more information below: audio digitising workflow digitising fieldnotes \u0026 manuscripts workflow. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Handling born-digital files Of course, we are now living in a time where digital recording devices of all kinds are all around us, making content creation very easy. 
With this easy access to quality recorders for use in the field, people have the freedom to collect as much data as needed to address their research questions. We receive hundreds (sometimes thousands) of digital files each week to help manage, and eventually, archive. Some activities involved in managing born-digital files are: receiving files and creating detailed inventories of them — a critical step in ensuring all files are looked after and that quality checking can be logged resampling audio, transcoding image and video files to archival formats and managing the transfer to PARADISEC for archiving devising solutions for problematic video formats and liaising with commercial service providers and other institutions for help when needed. Below is an image of an inventory spreadsheet containing the structural metadata of thousands of files that went into a single collection: Lauren Reed’s Western Highlands Sign Languages. Figure 5: Screenshot of spreadsheet of structural metadata of audio and video files from the LRW1 collection in PARADISEC. This file and metadata inventory informed the transcoding workflows and quality-checking processes prior to archiving. Image Source: Dr Julia Colleen Miller Creating an inventory like this informed me as to how to proceed in transcoding tasks, based on the format and specifications of the files. Also, I could compare details of the output files to the originals, thus helping with quality checks. As I worked through the files, I could keep track of my progress by marking which ones had been transcoded and were ready to send to the archive. This task took months to complete, so it was important to be able to pick up each day right where I had left off on the previous day. If you are interested in learning how to batch extract this type of metadata from media files, visit the page Quality control of audio and video files from PARADISEC’s archiving guides and technical workflows. 
Over the life of CoEDL, we monitored the annual growth of files added to PARADISEC. The following graphic shows the period from 2008 to 2022. Note: CoEDL was active between November 2014 and December 2022. Figure 6: Line graph showing the growth of the number of files held in the PARADISEC archive between the years 2008 and 2022. Image Source: Dr Julia Colleen Miller Notice the increase in files starting in 2015. It was at this time I received research assistant (RA) support from CoEDL to help manage the influx of data. Over the years, I supervised several RAs and volunteers, managing their day-to-day tasks and overseeing their professional growth. By 2020, after refinements in our server connections and file transfer pipelines, we were sending a minimum of 500GB of audio-visual files to the archive every other day! I could never have achieved this without the support of the CoEDL RAs, Team Data: Jen Plaistowe, Tina Gregor, Melody Ross, Haoyi Li, Shubo Li, and our volunteer, Emma Cuppit. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:2:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Collaboration with cultural institutions in the region Collaborating with other cultural institutions was a significant aspect of my roles at CoEDL and PARADISEC, allowing us to increase access to archived materials, enhance our archiving capabilities and share expertise. Below are two examples of collaborative projects. 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"ANU Pacific Research Archives Occasionally, requests would come from ANU’s Pacific Research Archives staff, or directly from students and researchers, to digitise items collected across the Pacific which were housed in ANU’s archives. We targeted audio tapes and field notes for digitisation, enhancing ANU’s finding guides with detailed metadata. Typically, we archived these digital files in PARADISEC and shared links to new collections, or provided copies back to the Pacific Research Archives. A notable project involved digitising select items from the Helen Groger-Wurm collection at ANU. Helen Groger-Wurm, an anthropologist active in the 1960s, focused on Northern Australia, particularly Eastern Arnhem Land bark paintings. PARADISEC and CoEDL secured funding to digitise 26 audio tapes from this collection. This effort supported CoEDL PhD student Haoyi Li, who, due to COVID-19 restrictions, couldn’t conduct fieldwork in Arnhem Land when she was hoping to. Haoyi identified relevant items for her research, and we expanded the digitisation to include field notes alongside the audio tapes. This initiative enabled her to start her primary research remotely and gain valuable skills in archival engagement with legacy recordings, high-resolution image capture of field notes, open reel audio tape digitisation, metadata collection and archival record enrichment. Figure 7: Haoyi Li threading a Helen Groger-Wurm reel-to-reel tape for digitisation on the Revox C 270 tape player. Image Source: Dr Julia Colleen Miller The images below are from a tape digitised by Haoyi Li from the Helen Groger-Wurm collection in the Pacific Research Archives at ANU. 
The image shows that there is metadata written on the tape box and labels, as well as on the slip of paper found in the tape box. These tape-counter reference points to topics on the recording, together with the tape labels, provided the title and description for the archive. Figure 8: Images of a Helen Groger-Wurm tape digitised by Haoyi Li. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"AIATSIS, NLA and NFSA We sometimes find items held in other cultural institutions, which we would like to digitise and add to existing collections in PARADISEC. For example, we identified items produced by the important 20th century Australian linguist Arthur Capell, including 168 reel-to-reel audio tapes of non-Australian language recordings found in the Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) that we were able to digitise, and nearly 50 acetate discs and one magnetic wire recording found at the National Library of Australia (NLA). We enlisted the aid of the National Film and Sound Archive (NFSA) in digitising the discs and magnetic wire, as we did not have the facilities to handle those formats. Below is an image of a Capell disc being digitised by the NFSA. This contains recordings of the parable of the Prodigal Son and the Lord’s Prayer in the Bilua language of the Solomon Islands. Listen to the recording via the Capell collection in PARADISEC. Figure 9: One of Arthur Capell's acetate discs being digitised at the National Film and Sound Archive (NFSA). Image Source: Gerry O'Neill The next image is of the magnetic wire recording on its spool. This item contains narratives in languages of North and Central America, including Kiowa and Navajo. To hear these recordings, see the Capell collection. 
Figure 10: Magnetic wire audio recording from the Arthur Capell collection. Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:3:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Managing large-scale, short-term projects In addition to my regular duties as data manager and digital archivist, I was sometimes tasked with overseeing the management of large-scale projects that required careful planning, coordination and implementation. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Phonemic modelling of Tok Pisin The first was a three-year project between ANU and the Commonwealth Defence Science and Technology Group (DSTG) called ‘Phonemic Modelling of Tok Pisin’. This project involved creating phonemic transcriptions of spoken Tok Pisin, the English-based creole which is a national language of Papua New Guinea. We retrieved open-access, untranscribed audio recordings of Tok Pisin held in PARADISEC and created rich time-aligned transcriptions, adding these back to their source collections, as well as providing these transcriptions to the DSTG. Below are images of transcriptions in the ELAN transcription software. The recordings were made by Andy Pawley in the Kaironk Valley region of Papua New Guinea and are held in PARADISEC as collection AP4. Figure 11: Screenshots of the ELAN software showing an orthographic time-aligned transcription of Tok Pisin (left) and a phonemic transcription of the same recording (right). 
Image Source: Dr Julia Colleen Miller Some tasks for this project included: targeting existing PARADISEC collections that held Tok Pisin content and preparing the audio for transcription liaising with DSTG on technical details to ensure mutual understanding and alignment creating detailed workflows and timelines to meet project deadlines hiring, training and managing a transcription team, including three Tok Pisin speakers delegating tasks to team members and monitoring their progress writing annual reports and conducting team debriefing meetings to gather feedback and adjust workflows as needed. In the end, PARADISEC gained rich transcriptions of previously untranscribed Tok Pisin audio held within the archive. Additionally, we were able to employ and train two ANU PhD students and one of PARADISEC’s archivists, Steven Gagau, all fluent Tok Pisin speakers, in using the ELAN software for creating the time-aligned transcriptions. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:1","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Digitising language material from the Katherine region Another significant project involved the secure storage of audio-visual materials and manuscripts from the Katherine region of Australia during renovations at their language centre. CoEDL Deputy Director Jane Simpson (ANU) and CoEDL CI Caroline Jones (WSU) tasked my team with storing, inventorying and overseeing the digitisation of materials contained in 42 shipping boxes from Mimi Arts \u0026 Crafts, previously known as Diwurruwurru-jaru (Katherine Regional Aboriginal Language Centre). Figure 12: Shipping boxes from Mimi Arts \u0026 Crafts, formerly Diwurruwurru-jaru (Katherine Regional Aboriginal Language Centre), stored at the ANU awaiting comprehensive inventory. Each of these 42 boxes contained two file boxes. 
Image Source: Dr Julia Colleen Miller Each box contained two smaller boxes with language-learning materials and cultural recordings created with Elders, primarily in the 1990s and 2000s. We expanded on an earlier inventory, creating a comprehensive list to facilitate digitisation and the eventual return of original items and digitised files to the Katherine Region. Figure 13: Six file boxes showing the diversity of contents, such as audio tapes, VHS and U-Matic tapes and written language-learning materials. Image Source: Dr Julia Colleen Miller Key aspects of the project included: developing a detailed inventory workflow and training a new RA collaborating with stakeholders, such as Caroline Jones, Jane Simpson, other invested researchers, Mimi Arts \u0026 Crafts representatives, AIATSIS, commercial digitisation providers, and loads of volunteer facilitators providing progress reports and maintaining communication with stakeholders facilitating the temporary removal and return of items for early digitisation efforts managing the repackaging and distribution of boxes as digitisation progressed. ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:4:2","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Service to the archiving community My contributions to the archiving community extended beyond CoEDL, through presentations, training modules and active participation in professional organisations. I have participated in conferences and workshops, representing CoEDL and PARADISEC while discussing best practices in archiving and data management. I also focused on developing training sessions aimed at enhancing digitisation techniques, metadata management and workflow optimisation for fellow archivists. 
As a member of the International Association of Sound and Audiovisual Archives (IASA), I continue to contribute to their Technical Committee, where I can participate in discussions that influence the development of digital preservation standards. I also contribute to the creation of guidelines for born-digital video, aiming to ensure they are relevant to current technological advancements and archival needs. I have collaborated with Charles Sturt University’s Master of Information Studies program, guiding a student through an approved curriculum I developed that aimed to bridge academic learning with practical skills. These teaching materials have been used again for students from the University of Manchester’s Master of Arts programs in Digital Media, Culture and Society, and in Library and Archive Studies under the tutelage of my PARADISEC colleague, Nick Ward. During ANU’s COVID-19 lockdown, I designed a 10-week online training module to accommodate my newly hired RAs, providing foundational knowledge in digitisation and archiving techniques. This online training was extended to ANU’s Summer Scholar interns, who were also missing out on activities due to lockdown. We focused on audio and manuscript digitisation practices, as well as introducing them to the history of digital language archives. When the lockdown was lifted, we followed up with hands-on training. One of the interns, a PhB student in Linguistics, Daniel Majchrzak (seen in the image below), has been trained in high-resolution image capture of manuscripts and fieldnotes and is now working in our studio to digitise the papers of Luise Hercus. Figure 14: PhB student, Daniel Majchrzak, using a shelf-mounted digital camera for high-resolution image capture of the manuscripts of Luise Hercus. 
Image Source: Dr Julia Colleen Miller ","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:5:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"Final thoughts Data management is an integral element of institutional research infrastructure. It often falls within that fluid space between the roles of academic and professional staff — within what is called the Third Space in academia (to read more on this topic, take a look at \"Reconstructing Identities in Higher Education: The rise of ‘Third Space’ professionals\", by Celia Whitchurch, 2013). Many of the activities carried out as Data Manager for CoEDL, and now for LDaCA, were designed to assist in educational development and knowledge exchange, as well as to promote public engagement and outreach. These tasks not only assisted students and academic researchers to better conduct their research enquiries and safeguard their work, but they also facilitated activities that directly reinforced the academic missions of our universities. I love the work that I do. I feel lucky to be a part of so many diverse projects, even if it just involves: helping people name their files or store their research materials securely making an inventory of cassette tapes to be digitised connecting students to archival records of a language they are thinking about for their PhD research helping researchers repatriate rematriate recordings in a format that is accessible to the people who have been recorded, and to their descendants. Every day is a new and exciting challenge! 
","date":"2024-07-09","objectID":"/news/posts/cultural-heritage/:6:0","tags":null,"title":"Safeguarding Cultural Heritage: The Essential Role of Archiving and Data Management","uri":"/news/posts/cultural-heritage/"},{"categories":null,"content":"by Maria Weaver We have learned from conversations with data stewards over the last two years that some are new to the practice of using persistent identifiers (PIDs) for data collections. This blog post will hopefully offer some guidance by briefly providing some background information and addressing the common questions we receive about PIDs and Digital Object Identifiers (DOIs). ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:0:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"A brief overview of persistent identifiers Most of us are familiar with the concept of identifiers (IDs); they mark and individualise a person through life, from their name at birth through to IDs assigned at school, at work and in other contexts. Identifiers are needed for countless things in complex societies, e.g. material goods, documents and transactions, but the purpose remains the same: to unambiguously identify someone or something among similar others within a context. As far as we can see in the literature, the term PID is closely tied to the rise of the internet and digital information management in the 1990s. This makes sense as PIDs are meant to operate in the context of the web environment. The qualifier “persistent” refers to the need to mitigate effects of the transient nature of online content, where websites and digital resources can disappear, change or become inaccessible over time (Chapekis et al. 2024). PIDs are a critical part of scholarly communication, especially online where the volume of data is increasing every minute. 
Research outcomes and resources have to be shared appropriately, therefore, data objects and entities need to be unambiguously identifiable and reliably findable, and the connections between them made evident. PIDs are fundamental to fulfilling these requirements. The Australian Research Data Commons’s Australian National PID Strategy 2024 provides an excellent summary of how PIDs contribute to strengthening the digital infrastructure ecosystem and to driving research innovation in Australia. A strategy used by some PID systems (e.g. the DOI system) to sustain persistence is separating how the object is identified from where the object is located. The identification component consists of a globally unique identifier; a typical convention is as a series of numbers, letters and special characters. This identifier is optimised on the web by being actionable, that is, it can be formatted as a locator, e.g. a URL, that points to information or metadata about the object (also referred to as the reference to the object). Based on this strategy, an identifier is considered persistent when it consistently resolves over a reasonably long period to the webpage containing information or metadata about the object, including information about the location of the object. The location component is where the object is served or made available, whether on the web or in the physical world. Should the object location change, e.g. a data collection is moved to a different repository or becomes available through other websites, the relevant metadata on the reference page should reflect that fact. Note that although the reference is a digital description, it can be a description of any object or entity, whether digital, physical (e.g. scientific specimens, cultural heritage objects) or abstract (e.g. software, organisations). 
Understandably, PID systems on a global scale require large investments in robust technological infrastructures, including geographically distributed servers, digital preservation, and redirection and resolver systems. However, such infrastructure is useless without people and social structures to ensure that PIDs work as expected for as long as possible. Organisational commitment from stakeholders, including data owners, repository managers and PID registration agencies, is vital. Data stewards, for example, are responsible for obtaining PIDs for their datasets, creating rich and accurate metadata, ensuring changes to the metadata and the dataset are reflected in the PID registry, and adhering to data governance best practices, community standards and protocols. The PID landscape continues to evolve. If you are interested in delving deeper into persistence strategies used by different PID systems or nuanced views of identifier persistence from experts, see the Further Reading items below. Below are a few examples of PIDs within digital scholarly ecosystems: DOI (Digital Object Identi","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:1:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Digital Object Identifiers Contributors who allow LDaCA to make their data collections available through the portal can use the PIDs they have if they already follow a specific PID practice. However, if it is their first time obtaining a PID, we recommend DOIs as they are likely obtainable through a university library or institutional research repository. This has been the pattern with collections in the LDaCA portal to date. DOIs are also available through general-purpose repositories like Zenodo. 
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"What is a DOI? DOI stands for Digital Object Identifier, a PID which is in fact a digital identifier of an object, rather than an identifier of a digital object (DOI Key Facts). The DOI System was launched in 2000 by the International DOI Foundation (IDF), a not-for-profit organisation that functions as the governing body. DOI Registration Agencies (RAs) all over the world offer services for registering DOIs and community-specific metadata. University libraries and institutional research repositories can be members of an RA (e.g. DataCite in Australia) which allows them to create (or mint) and register DOIs for their organisation. Here’s an example of a DOI for a collection available via the LDaCA data portal: In the partial screenshot below of the record for the collection, the DOI number (10.26181/23089559) is part of the collection’s @id. The full @id value is arcp://name,doi10.26181%2F23089559 and the last part of this is the DOI (the slash which separates parts of the DOI is represented by the encoding %2F here). Figure 1: The collection record for the La Trobe Corpus of Spoken Australian English in the LDaCA data portal. Image Source: LDaCA The DOI resolves to an item in La Trobe University’s institutional repository. The content of the collection is available here, but there are also links to the record in the LDaCA portal which includes the DOI. Figure 2: Top part of page in La Trobe University’s repository showing files in collection. Image Source: LDaCA Figure 3: Bottom part of page in La Trobe University’s repository showing collection details. 
Image Source: LDaCA Clicking the Cite button on the La Trobe repository page shows the DOI link in its full form: Figure 4: Full DOI link displayed under citation on repository page. Image Source: LDaCA The DOI link consists of three parts: the resolver service (https://doi.org/), the prefix identifying the assigning body (10.26181) and the suffix identifying the resource (23089559.v1). ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:1","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Why should my data collection have a DOI? A review of the uptake of PID systems from the late 1990s to the late 2010s reported that DOIs are, for several good reasons, the most widely adopted PIDs in research data repository systems (Klump \u0026 Huber 2017). DOIs follow a standardised system: there is a syntax specification for the DOI name construction which is followed by all registrars, providing consistency across the board. The Handle System, a trusted and long-established digital infrastructure, is the DOI system’s resolution component providing reliability and persistence in the findability of research objects. Another notable characteristic of the DOI system is its well-developed social infrastructure. The IDF develops and enforces rules and policies for the maintenance of records and for managing the division of responsibilities between RAs, RA members and DOI users. As a consequence, DOIs promote good data governance practices. Apart from the act of obtaining a DOI, the associated responsibilities of creating and recording accurate and rich metadata, ensuring that the DOI record is updated through location and data changes, and providing clear data reuse and access terms, enlighten data owners and stewards about some of the practicalities of data governance. The DOI system’s underlying infrastructure supports interoperability across different platforms and systems. 
For example, DataCite’s “Add to ORCID Record” feature allows a researcher to manually claim a DOI for their work and link it to their ORCID record (DataCite 2024). Data repositories can use DOIs to track usage metrics such as views, downloads and citations, which also shows data owners how researchers engage with the data. Below is a screenshot of a section of the DOI landing page for the La Trobe Corpus of Spoken Australian English with some of the key information highlighted and annotated. The DOI was issued by DataCite via the La Trobe University Library’s DOI service. Figure 5: Annotated screenshot of DOI landing page for the La Trobe Corpus of Spoken Australian English with key information highlighted. Image Source: LDaCA ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:2","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"How do I get a DOI for my data collection? We have an article about this topic on our website — see Obtaining a DOI. ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:3","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"What format(s) should my data files be in before I share them with LDaCA? We understand that the choice of file formats during data planning and collection can be influenced by factors like research team decisions, discipline-specific norms, software-hardware compatibility and equipment funding. Although LDaCA's data migration experts can work with your data in any provided format, file formats that are easily shareable and reusable include: Text — TXT, PDF, (DOC/DOCX); Tabular data (which may include text) — CSV, (XLS/XLSX); Audio — WAV, MP3; Video — MP4. Bracketed formats are proprietary and therefore less sustainable. 
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:2:4","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"References Australian Research Data Commons. 2024. Australian national Persistent Identifier (PID) strategy 2024. Zenodo. https://doi.org/10.5281/ZENODO.10656275. Chapekis, Athena, Samuel Bestvater, Emma Remy and Gonzalo Rivero. 2024. When Online Content Disappears. Pew Research Center. https://www.pewresearch.org/data-labs/2024/05/17/when-online-content-disappears/. (4 July, 2024). DataCite. 2024. Add a DOI to Your ORCID Record. DataCite Support. https://support.datacite.org/docs/orcid-claiming. (4 July, 2024). Klump, Jens \u0026 Robert Huber. 2017. 20 Years of Persistent Identifiers – Which Systems are Here to Stay? Data Science Journal 16. 09. https://doi.org/10.5334/dsj-2017-009. ","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:3:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":null,"content":"Further reading Car, Nicholas N. J., Pavel Golodoniuc \u0026 Jens Klump. 2017. The Challenge of Ensuring Persistency of Identifier Systems in the World of Ever-Changing Technology. Data Science Journal 16. 13. https://doi.org/10.5334/dsj-2017-013. Kunze, John \u0026 Richard Rodgers. 2008. The ARK Identifier Scheme. UC Office of the President: California Digital Library. https://escholarship.org/uc/item/9p9863nc. (4 July, 2024). 
","date":"2024-07-03","objectID":"/news/posts/persistent-identifiers/:4:0","tags":null,"title":"Persistent Identifiers (PIDs) and Digital Object Identifiers (DOIs)","uri":"/news/posts/persistent-identifiers/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version Developer Track Session 2 Time: 05 June 2024, 09:00 - 10:30 · Location: Drottningporten 2 Crate-O - A drop-in linked data metadata editor for RO-Crate (and other) linked data in repositories and beyond Peter Sefton, Alvin Sebastian, Moises Sacal Bonequi, Rosanna Smith University of Queensland, Australia Research Object Crate is a metadata packaging standard which has been widely adopted over the last few years in research contexts and which debuted at Open Repositories with a workshop in 2019. Crate-O is an editor for the RO-Crate Metadata Specification. RO-Crate has been presented here at Open Repositories for the last few years, and is now starting to be incorporated into many research repository solutions (though they are not always called repositories). I am presenting in the next session on why RO-Crate is important for repositories. Five ways RO-Crate data packages are important for repositories Time: 05 June 2024, 11:00 - 12:30 · Location: Drottningporten 1 Peter Sefton*, Stian Soiland-Reyes** *University of Queensland, Australia; **The University of Manchester, UK Research Object Crate is a linked data metadata packaging standard which has been widely adopted in research contexts. 
In this presentation, we will briefly explain what RO-Crate is, how it is being adopted worldwide, then go on to list ways that RO-Crate is growing in importance in the repository world: Uploading of complex multi-file objects means RO-Crate is compatible with any general-purpose repository that can accept a ZIP file (with some coding, repository services can do more with RO-Crates). Download for well-described data objects complete with metadata from a repository rather than just a ZIP or file with no metadata. Using RO-Crate metadata reduces the amount of customisation that is required in repository software, as ALL the metadata is described using the same simple, self-documenting linked-data structures, so generic display templates can be used. Sufficiently well-described RO-Crates can be used to make data FAIR compliant, aiding in Findability, Accessibility, Interoperability and Reusability thanks to standardised metadata and mature tooling. And if you’re looking for a sustainable repository solution, there are tools which can run a repository from a set of static files on a storage service, in line with the ideas put forward by Suleman in the closing keynote for OR2023. The Language Data Commons of Australia Data Partnerships (LDaCA-DP), Language Data Commons of Australia Research Data Commons (LDaCA-RDC), and Australian Text Analytics Platform (ATAP) projects received investment (https://doi.org/10.47486/DP768, https://doi.org/10.47486/HIR001, \u0026 https://doi.org/10.47486/PL074) from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). A version of the Crate-O component is available as a playground for Chrome browsers only. This allows you to describe local folders on your computer and generate an RO-Crate. 
https://github.com/Language-Research-Technology/ro-crate-modes/blob/main/docs/soss-profiles.md A Schema.org style Schema (SOSS) specifies a metadata vocabulary of Classes and Properties, based on the RO-Crate specification’s use of Schema.org classes. An RO-Crate Profile has (at least) a document that explains how metadata entities from the Schema are used for a particular purpose. An RO-Crate Mode is a set of lightweight syntactic rules for combining SOSS Classes, Properties and DefinedTerms, expressed in a JSON file. Crate-O RO-Crate Editor Mode Files are editor configurations that implement RO-Crate Metadata Profiles. The configuration files are intended to form the basis of an approach for describing RO-Crate editor behaviour and can be used for RO-Crate validation. Initial versions of this work were based on the Describo Profiles (which vary between versions of Describo) used to configure the Describo family of RO-Crate editing","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-crate-o/:0:0","tags":["RO-Crate","Open Repositories","Crate-O"],"title":"Crate-O - a drop-in linked data metadata editor for RO-Crate (and other) linked data in repositories and beyond","uri":"/news/posts/open-repositories-2024-crate-o/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version Five ways RO-Crate data packages are important for repositories Presented at: The 19th International Conference on Open Repositories, June 3-6th 2024, Göteborg, Sweden Session: Presentations: Integrations for Research Data Management Time: 05/June/2024: 11:00 - 12:30 · Location: Drottningporten 1 Peter Sefton*, Stian Soiland-Reyes** *University of Queensland, Australia; **The University of Manchester, UK Research Object Crate is a linked data metadata packaging standard which has been widely adopted in research contexts. 
In this presentation, we will briefly explain what RO-Crate is, how it is being adopted worldwide, then go on to list ways that RO-Crate is growing in importance in the repository world: Uploading of complex multi-file objects means RO-Crate is compatible with any general-purpose repository that can accept a ZIP file (with some coding, repository services can do more with RO-Crates). Download for well-described data objects complete with metadata from a repository rather than just a ZIP or file with no metadata. Using RO-Crate metadata reduces the amount of customisation that is required in repository software, as ALL the metadata is described using the same simple, self-documenting linked-data structures, so generic display templates can be used. Sufficiently well-described RO-Crates can be used to make data FAIR compliant, aiding in Findability, Accessibility, Interoperability and Reusability thanks to standardised metadata and mature tooling. And if you’re looking for a sustainable repository solution, there are tools which can run a repository from a set of static files on a storage service, in line with the ideas put forward by Suleman in the closing keynote for OR2023. ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:0:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"Uploading of complex multi-file objects RO-Crate [1], [2] is a data packaging format and can be used to put multiple data files together with their metadata into a package such as a ZIP, tar or disk image file. This means that as long as your repository can handle a ZIP file it can take RO-Crates. RO-Crates enable data to travel with metadata. 
Beyond simply allowing the upload of opaque RO-Crates, there are opportunities for repository software to recognise metadata in an uploaded package and to pre-populate built-in metadata forms and/or datastores. This is not a pattern the authors have seen widely implemented in comprehensive institutionally focussed repositories, although at the time of writing it is being explored in Dataverse and InvenioRDM/Zenodo. We would encourage repository developers to explore this further, particularly those working with research data. RO-Crate support is increasing in research-domain repositories (e.g. RO-Crate upload with metadata extraction is supported by WorkflowHub and ROHub). One of the design features of an RO-Crate is that, as it can be “just a ZIP file”, it can be used with any old repository that can handle ZIP files. The JSON file can also be uploaded separately. As RO-Crate adoption increases, the repository may be able to start to use the RO-Crate metadata it already has. You may have heard from Dieuwertje Bloemen here at OR2024 that Dataverse has support for RO-Crate metadata preview and is building import/export mechanisms. The RO-Crate specification describes a method of packaging data in a folder, which can be zipped, with any kind of file. This slide shows a folder of data, including a file with a typical obscure file name based on a timestamp (in this case created by a Dropbox upload from a digital camera). The RO-Crate Metadata file, in JSON Linked Data Format, can describe files. This one has a name (i.e. a title) for the file: “Cute puppy”. A human-readable description and preview (ro-crate-preview.html) can be in an HTML file that lives alongside the metadata. This slide shows an HTML view of the data that shows the image with its metadata, including the Schema.org name (equivalent to a Dublin Core title) for the file. 
Increasingly, research repository infrastructure is accepting RO-Crate input - this screenshot from WorkflowHub documents the upload API for submitting RO-Crate packaged descriptions of scientific workflows to the system. These can then be downloaded by others for reuse. Here, RO-Crate allows bypassing of the traditional “title, author, license, description” fields (rendered from the crate), as well as permitting user extensions on metadata to be kept in the repository. ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:1:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"RO-Crate is a packaging format suitable for downloads One of the perennial problems with downloads is that once a user has the data, it often does not come with metadata as shown on the landing page, or if present it is in an ad hoc or specialised format. RO-Crate solves this by specifying an extensible way to put linked-data metadata with data assets and to provide an HTML page or small website with the data to explain it. Thus data travels with its metadata and can be made human-readable. RO-Crate download is already available in many data repositories. Examples include: WorkflowHub: A registry for describing, sharing and publishing scientific computational workflows. ROHub: A repository of Earth Science datasets and computational methods. TLCMap: The Time Layered Cultural map is a set of tools that work together for mapping Australian history and culture. The Language Data Commons of Australia data portal: entirely built on RO-Crates, the underlying data consists of crates-on-disk and the API is based on RO-Crate metadata. Senckenberg Wildlive portal: exposes metadata about automatic photo captures of endangered animals using RO-Crate. Dataverse: at the time of writing, RO-Crate downloads are in development. 
We will encourage developers from other repository platforms to follow the Dataverse project’s lead and add RO-Crate support. This is a screenshot of the Gazetteer of Historical Australian Places – not exactly a repository but an example of a place where people can download datasets. This kind of download could be added to any repository system where there is at least one file that has metadata: offer a ZIP download option that has machine-readable JSON metadata (linked data in JSON-LD) and a human-readable summary of the metadata – this one has descriptions of a few files in it. Every repository implementer should add FAIR Signposting – just a couple of HTTP headers – this means machines can go from an HTML landing page to the actual download without guesswork – or even better – to an RO-Crate! I’m sure you’ll hear this mentioned in one of the talks by Herbert van de Sompel and colleagues, such as the one on FAIRiCat. Again, Dataverse is ahead of the curve and has already implemented this. As we mentioned above, data should travel with the metadata – one example of this from the WorkflowHub is how other services like the LifeMonitor retrieve the RO-Crate and then look for custom annotations in the metadata to pick up and connect to the testing infrastructure. This was only possible because RO-Crate is extensible - you are not trapped with whatever 25 properties we’ve selected. The vessel is still RO-Crate; the repository didn’t need to add anything to support the LifeMonitor. One of the key benefits of linked-data metadata over previous ‘legacy’ approaches is that multiple vocabularies can be combined into a single metadata document in a way that is not possible with, say, MARC or MODS XML, and that all these vocabularies can use the same syntax and approach to describing data. 
This means that a simple generic RO-Crate viewer can be used to visualise any metadata whether it is basic “Who, What, Where” metadata (like Dublin Core) or domain-specific metadata like the RO-Crate metadata profile (https://w3id.org/ldac/profile) used by the Language Data Commons of Australia. This can be displayed alongside the core RO-Crate metadata without any expensive configuration or coding. If the recommendations are followed, the RO-Crate metadata terms are self-documenting, e.g. all the Language Data Commons terms which use a Schema.org Style approach, are defined here: https://w3id.org/ldac/terms. This slide is a repeat of one we used last year – this screenshot is a bit of (undated) DSpace documentation found following a tip from Kim Sheppard – we have included it here to illustrate that storing additional metadata (in this case METS) for an object was done by convention – it had to be stored in a special file called METS.xml. Us","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:2:0","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"References [1] Sefton, Peter et al., “RO-Crate Metadata Specification 1.1.3,” Apr. 2023, doi: 10.5281/ZENODO.3406497. ↩ [2] S. Soiland-Reyes et al., “Packaging research artefacts with RO-Crate,” Data Sci., vol. 5, no. 2, pp. 97–138, Jan. 2022, doi: 10.3233/DS-210053. ↩ [3] S. Leo et al., “Recording provenance of workflow runs with RO-Crate.” arXiv, Dec. 12, 2023. doi: 10.48550/arXiv.2312.07852. ↩ [4] P. Sefton, M. La Rosa, and Mi. Lynch, “Arkisto: a repository based platform for managing all kinds of research data,” in ptsefton.com, Jun. 2021. Accessed: Jan. 31, 2022. [Online]. 
Available: http://ptsefton.com/2021/06/11/or-2021-arkisto/index.html ↩ ","date":"2024-06-05","objectID":"/news/posts/open-repositories-2024-ro-crate/:2:1","tags":["RO-Crate","Open Repositories"],"title":"Five ways RO-Crate data packages are important for repositories","uri":"/news/posts/open-repositories-2024-ro-crate/"},{"categories":["LDaCA"],"content":"PDF version | Powerpoint Version From “R-Drive to RRKive” – a comprehensive, open and sustainable set of principles and tools for low (and high) resource archival-repositories Presented at: The 19th International Conference on Open Repositories, June 3-6th 2024, Göteborg, Sweden Session: Presentations: Open and Sustainable Infrastructure Time: 04/June/2024: 13:30 - 15:00 · Location: Drottningporten 1 Peter Sefton*, Robert McLellan*, Michael Lynch**, Moises Sacal Bonequi*, Nick Thieberger*** *University of Queensland; **University of Sydney; ***University of Melbourne We present a toolkit for sustainable Archival Repositories, with metadata and storage standards as well as APIs and data portals, that can be assembled by communities with various levels of resourcing into repository solutions at a variety of scales, with fallback to offline operation. Tools are designed to work in low-resource environments, allowing communities to have agency and control over their materials. We prioritise sustainability, simplicity, standardisation, linked-data description and clear licensing over user interface features, in line with Suleman’s keynote presentation at OR2023, while still being able to drive rich, full-featured services when resources allow and fall back to a sustainable core if needed. Much of the data we work with is subject to Indigenous Cultural Intellectual Property (ICIP) rights, controlled by First Nations peoples guided by Indigenous data sovereignty principles; we must ensure it is handled in a conscionable and culturally responsible manner. 
Making data Accessible does not always mean Open Access – under both research ethics and the CARE and FAIR principles, data by and about humans needs access control and licensing. Our framework deals with these issues and is built from the ground up to be “as open as possible, as closed as needed”. NOTE: Since the abstract was submitted we have adopted the name “Protocols” for use in the work we are doing on tools for FAIR and CARE adoption. This slide summarises the aims of the Language Data Commons of Australia project. The area of the strategy we’re focusing on for this presentation is highlighted – tools, standards and technical infrastructure. The activities we are undertaking are: develop shared tools, standards and technical infrastructure to help data stewards care for data for the long term; and build data portals with useful search functions and lightweight technical structures. These work towards the following outcomes: standards and tools are available and being applied by data stewards; good governance and standardised, distributed storage of data helps preserve data; and discovering and locating language data is easy via linked portals. We don’t need to define the term “Repository” at the Open Repositories conference, but we do need to define a term that we use at LDaCA when we talk about Repositories – we use the inclusive term “Archival Repository” to refer to systems designed to keep data for the long term. We do this to avoid drawing boundaries around what is a “repository” vs an “archive” or a Digital Preservation system and to include as many practitioners as possible in the conversation. The first implementation of this stream of work was the UTS Research Data Portal – this has not been updated for a while due to staffing changes at the university and a lack of ongoing governance that made the data repository dependent on individual people (three of whom are authors of this paper and no longer work for UTS), but as of mid-2024 it is being renovated and revitalised. 
This screenshot shows what an LDaCA site looks like; a typical data-discovery portal. Underneath is not a typical (by Open Repositories standards) repository solution like DSpace, Invenio or EPrints, all of which are more or less monolithic architectures. Following the implementation protocols we will discuss below, data is very much considered separately from the application used to provide views of the Archival Repositor","date":"2024-06-04","objectID":"/news/posts/open-repositories-2024-pilars/:0:0","tags":["PILARS"],"title":"A comprehensive, open and sustainable set of principles and tools for low (and high) resource Archival Repositories\n","uri":"/news/posts/open-repositories-2024-pilars/"},{"categories":null,"content":"by Rosanna Smith, Simon Musgrave, Robert McLellan and Ben Foley ","date":"2024-05-23","objectID":"/news/posts/idil/:0:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"International Decade of Indigenous Languages 2022–2032 The world’s billions of people use thousands of languages in myriads of ways, through communication by speech, sign, text and song (Evans, 2022). Diversity of language is critical for conveying the world’s knowledges and expressing the range of human experiences. Australia is in an area of great language diversity, with hundreds of languages on this continent. Languages require action to remain in use. The United Nations General Assembly has declared the period between 2022 and 2032 as the International Decade of Indigenous Languages (IDIL). The aim of IDIL is to draw global attention to the critical status of Indigenous languages worldwide and encourage action for their revitalisation, promotion and ongoing use. The International Decade aims to achieve four main outcomes: Learn, teach, and transmit: Empower Indigenous Peoples to learn, teach and transmit their languages to the present and future generations. 
Establish as global priority and foster commitment: Establish the usage, preservation, revitalisation and promotion of Indigenous languages as a global priority for societal development, peacebuilding and reconciliation by 2030. Ensure legal recognition: Ensure Indigenous languages are recognised by Member States within their legal systems and legislation, which, in turn, are supported with comprehensive language-related laws and policy frameworks, and are backed by allocated financial, institutional and human resources. Enhance the functional usage: Develop an enabling environment to enhance the functional usage of Indigenous languages in socio-cultural, economic, environmental, legal and political domains. ","date":"2024-05-23","objectID":"/news/posts/idil/:1:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Voices of Country: Australia’s National Action Plan for IDIL International Decade of Indigenous Languages Logo Image Source: International Decade of Indigenous Languages Directions Group The Australian Government has partnered with the IDIL Directions Group to produce the Voices of Country Action Plan. The Action Plan provides a framework to guide Australia’s participation in the International Decade, focusing in particular on Aboriginal and Torres Strait Islander community needs, and is a call to action for all stakeholders. The plan was launched at the PULiiMA Indigenous Languages and Technology Conference in Garamilla/Darwin in August 2023. 
","date":"2024-05-23","objectID":"/news/posts/idil/:2:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Voices of Country Themes Voices of Country is guided by five inter-connected themes, represented as a native wattle tree, itself a source of food, medicine and tools, with language, culture and traditional knowledges as its foundation. Image Source: Gilimbaa with cultural elements created by David Williams (Wakka Wakka) Voices of Country is framed through five interconnected themes: Stop the Loss: Securing the future and continuance of Australia’s first languages. Aboriginal and Torres Strait Islander Communities are Centre: Over the Decade, our voices, and those of our Elders, will ensure that community leadership and priorities are at the centre. Our voices will come through – everything we do will be by community, for community. This will give us stronger and safer communities. Intergenerational Knowledge Transfer: Restore intergenerational traditional language and cultural transmission. Caring for Country: Everyone has the opportunity to practise their language and culture on Country. Truth-telling and Celebration: Aboriginal and Torres Strait Islander peoples speak their language with integrity and pride. Wider Australia shares this pride, respecting and celebrating Australia’s first languages and all that they encompass. ","date":"2024-05-23","objectID":"/news/posts/idil/:3:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"LDaCA’s contribution A sub-part of Theme 2 in the Action Plan has the title ‘Support communities to build and be the custodians of language resources and materials’. Making language resources and materials sustainable and accessible is central to the work of LDaCA. 
This includes working in collaboration with communities, and providing them with the tools and support they need to be effective custodians of their language materials. The sub-theme has six components (Voices of Country p34): Embed Aboriginal and Torres Strait Islander cultural practices and knowledge systems into linguistic methodologies. Develop resources that document and record languages, such as dictionaries, encyclopaedias, thesauri, grammar and orthography documents. Record and digitise written and oral language collections, resources and materials. Provide communities with the support, training and infrastructure they require to build language resources and collections. Support Aboriginal and Torres Strait Islander peoples to access and repurpose historical language materials. Repatriate language materials and cultural knowledge to community ownership and control. We will give a brief summary of how LDaCA is contributing to each of these areas of activity, first covering components two to six and then returning to show how all of those activities contribute to the first component. In this summary, we will refer occasionally to the FAIR and CARE ethical research principles which are core to our work. The FAIR principles have been developed as a foundation for good, scholarly ways of finding and using data, providing key guideposts (Findable, Accessible, Interoperable, Reusable) for transparent, reproducible, and reusable language research. Further key principles for advancing Indigenous innovation and self-determination, CARE (Collective benefit, Authority to control, Responsibility, Ethics), guide ethical work with Indigenous language collections. The concluding discussion will also detail the connection between the principles, especially the CARE principles, and LDaCA’s work supporting the Voices of Country Action Plan. 
","date":"2024-05-23","objectID":"/news/posts/idil/:4:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 2: Develop resources that document and record languages LDaCA contributes to work under component two by providing recommendations for good practice in developing sustainable resources. It is important to avoid having data locked into bespoke individual solutions which may be difficult to maintain. Instead, we advocate using data formats that can be supported into the future and which can be the basis for various tools. Although LDaCA can have only a limited role in developing resources, such as dictionaries or text collections, we can provide advice about key aspects of data handling, assist with designing backup strategies, and work with people to develop long-term strategies for appropriate access to language resources. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:1","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 3: Record and digitise written and oral language materials Similarly to component two, we have an indirect role in activity related to component three. We support recording and digitisation through our collaborative partnerships with the Nyingarn project and with PARADISEC. Nyingarn is a platform for making manuscript and typescript sources for Indigenous languages available as searchable digital objects, while PARADISEC has extensive experience in digitising language materials. Increasingly, LDaCA and these partners share a single technological ecosystem, leading to greater alignment in formats and description for digital versions of historical materials, which, in turn, increases the sustainability and usability of the materials. 
","date":"2024-05-23","objectID":"/news/posts/idil/:4:2","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 4: Provide communities with support, training and infrastructure to build language resources Contributing to component four, LDaCA is working with various Indigenous groups and organisations to help manage the language resources and collections they hold. An example is our partnership with the Batchelor Institute of Indigenous Tertiary Education, which holds an important archive of physical and digital works in, and about, Indigenous languages. The collection is managed by the Batchelor Institute Library and has strong connections with language projects through the Institute’s Centre for Australian Languages and Linguistics (CALL). The CALL Collection includes text, audio and video resources that have been contributed over at least 30 years by students and staff of the Batchelor Institute, as well as by teachers, linguists and language workers external to the institute. We are working with the Batchelor Institute to: backup and reformat item record information copy metadata from an ageing database into contemporary metadata standard format embed cultural protocols to ensure long-term and appropriate access to this significant collection. We are also working on training programs, designed with First Nations people, which will help make resources accessible to non-academic users of language data. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:3","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 5: Support Aboriginal and Torres Strait Islander peoples to access and repurpose historical language materials Component five aligns closely with the A and the R in the FAIR principles: Accessibility and Reusability. 
In order for data to meet these FAIR criteria, it must be well-described, but in many cases, existing collections of Indigenous data are not as well-described as they might be, and even when they are considered to be well-described, the description is often based on non-Indigenous knowledge. LDaCA is contributing to component five of the sub-theme by developing methods to make it easier to enrich existing descriptions of data, with the aim of moving towards Indigenous knowledges as the basis for the description of Indigenous data. We have carried out a pilot project with The University of Queensland libraries, in which records from the library system have been enriched with additional annotations based on Indigenous knowledges of the collection items. The results of this project can be seen in a dedicated data portal. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:4","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 6: Repatriate language materials and cultural knowledge to community ownership and control The A in the CARE principles stands for Authority to control. LDaCA’s approach to sustainable data includes a commitment to storing explicit licences with data, detailing in plain text who can access data and what the data can be used for. In line with component six, the communities and people who are the source of Indigenous data have the authority to make decisions about such matters: they decide who can access data and what users can do with the data. LDaCA has developed policies and technical solutions for implementing such access controls that make ongoing administration of access as straightforward as possible. We are sharing ideas and solutions with our colleagues in the Indigenous stream of the HASS and Indigenous Research Data Commons. The solutions that we are using for Indigenous data can also apply to other human-sourced data. 
As Levi-Craig Murray is fond of saying: “What’s good for mob, is good for everyone”. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:5","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Component 1: Embed Aboriginal and Torres Strait Islander cultural practices and knowledge systems into linguistic methodologies As stated above, we see all of the activities from components two to six as contributing to the aim of embedding Indigenous knowledges and cultural practices into our methodologies. When working with language material, LDaCA is guided by FAIR and CARE principles (Carroll et al. 2020; Carroll et al. 2021; Wilkinson et al. 2016). The FAIR principles provide guidance on ensuring that data will be useful in the future, though they do this from a Western scientific perspective. The CARE principles extend the FAIR data management principles and we hope that our commitment to these principles assists us in contributing to the goals of the Voices of Country plan: Making data more sustainable and more informed by Indigenous knowledges helps ensure that Indigenous communities benefit from the data. Providing accessible solutions for access management contributes to First Nations people’s authority to control Indigenous data. Supporting Indigenous communities in developing data capabilities helps to build positive relationships and show responsibility in our work with Indigenous Peoples. Basing all of our work on ethical principles respects the rights and well-being of Indigenous Peoples. We see all of our activities as both being guided by the CARE principles, and as contributing to the overall aim of embedding Indigenous knowledges and cultural practices into our methodologies. 
Indigenous control of access to data, Indigenous perspectives made part of the description of data, and using Indigenous perspectives to make data more accessible to communities are all part of this endeavour. ","date":"2024-05-23","objectID":"/news/posts/idil/:4:6","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"Further reading and references 2022 - 2032 International Decade of Indigenous Languages. 2022 - 2032 International Decade of Indigenous Languages. https://idil2022-2032.org/. (3 October, 2023a). International Decade of Indigenous Languages 2022 – 2032 | United Nations For Indigenous Peoples. https://www.un.org/development/desa/indigenouspeoples/indigenous-languages.html. (3 October, 2023b). International Decade of Indigenous Languages 2022–2032 | Office for the Arts. https://www.arts.gov.au/what-we-do/indigenous-arts-and-languages/international-decade-indigenous-languages/international-decade-indigenous-languages-2022-2032. (3 October, 2023c). Voices of Country: Australia’s National Action Plan for the International Decade of Indigenous Languages | Office for the Arts. https://www.arts.gov.au/what-we-do/indigenous-arts-and-languages/international-decade-indigenous-languages/voices-country-australias-national-action-plan-international-decade-indigenous-languages. (3 October, 2023d). Carroll, Stephanie Russo, Ibrahim Garba, Oscar L. Figueroa-Rodríguez, Jarita Holbrook, Raymond Lovett, Simeon Materechera, Mark Parsons, et al. 2020. The CARE Principles for Indigenous Data Governance. Data Science Journal 19. 43. https://doi.org/10.5334/dsj-2020-043. Carroll, Stephanie Russo, Edit Herczog, Maui Hudson, Keith Russell \u0026 Shelley Stall. 2021. Operationalizing the CARE and FAIR Principles for Indigenous data futures. Scientific Data. Nature Publishing Group 8(1). 108. https://doi.org/10.1038/s41597-021-00892-0. Evans, Nicholas. 2022. 
Words of Wonder: Endangered Languages and What They Tell Us (2nd ed.). Wiley. ISBN: 978-1-119-75877-8 Wilkinson, Mark D., Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3(1). 160018. https://doi.org/10.1038/sdata.2016.18. Thanks to Teresa Chan and Harriet Sheppard who provided valuable feedback on this piece. ","date":"2024-05-23","objectID":"/news/posts/idil/:5:0","tags":null,"title":"IDIL 2022–2032: Voices of Country Action Plan — An LDaCA Discussion Paper","uri":"/news/posts/idil/"},{"categories":null,"content":"LDaCA team member Teresa Chan discusses data considerations for her Master's research project.","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"by Teresa Chan During a Master of Applied Linguistics degree at UNSW Sydney, I was given the opportunity to enrol in a course in which students become interns in the School of Humanities and Languages’ Language Processing Lab. Although I would still have readings and assessments to complete, the teaching model would be different to a standard postgraduate course: I would attend regular online lab meetings with my supervisor and other interns, and design and conduct my own research project. The first few weeks of the course were dedicated to learning about real-world applications and key methods in computational linguistics and language technology. We then focused on the research project for the remainder of the course. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:0:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project proposal The idea for the project stemmed from an intriguing observation I had made about language use in my circle of friends. Certain friends, who had grown up in south-west Sydney and were mostly from a Vietnamese background, were prone to using the expression and whatnot, something that I never said myself. Researching this phenomenon revealed that it was a discourse-pragmatic feature called a general extender (GE). Figure 1: Two people using general extenders (GEs) in a conversation Image Source: Teresa Chan GEs are phrase-final expressions such as and stuff and or whatever that prototypically ’extend’ the referent of a phrase by evoking a more general set of items to which it belongs. They serve a range of discourse and interpersonal functions, such as hedging, establishing solidarity or marking in-group affiliation (Norrby \u0026 Winter, 2002). Regional and social variation in the use of GEs has been found across spoken and written English, including Australian English (Travis, Grama \u0026 Gonzalez, 2017), with different patterns of usage depending on variables like age, gender and socio-economic class. However, at the time of the project, ethnicity had rarely been considered (e.g. Secova, 2017), despite the fact that the high frequency of a specific GE form in a region can make it a statistical “marker of ethnic identity” (Aijmer, 2013, p. 146). There was also only one study on GEs in computer-mediated communication (CMC) (Fernandez \u0026 Yuldashev, 2011). Additionally, although GE usage in Australian English had been compared to its usage in varieties of English spoken in other countries, no research had looked at usage across different cities within Australia. 
My project would attempt to fill these gaps in the literature and explore socio-regional variation in GE usage in spoken and CMC settings. “Socio-regional” refers to the stratification of socio-economic class by region (Sheard, 2019, p. 486). The research questions I would aim to address were: Are there differences in GE usage in Australia between: a. different cities? b. different ethnicities? Does GE usage differ between spoken settings and CMC settings in Australia? Do the results corroborate or refute previous findings, particularly those for Australian English? ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:1:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Data selection and collection Several factors influenced the selection of the data sources and the collection of data from these sources, which were X (formerly Twitter) and two spoken corpora of Australian English: AusTalk (Burnham et al., 2011) and Sydney Speaks (Travis, 2014–2021). ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project scope and timeline The limited scope of the project and the short timeline I had for data collection meant I had to be strategic when selecting data sources. For the spoken language component, I could not construct my own corpus of recordings without requiring a potentially lengthy ethics clearance process, and so would need to rely on existing datasets like AusTalk and Sydney Speaks. For the CMC component, I needed to create my own dataset, which was reliant on both the suitability of the data source for my purposes and the skills required to access the data. 
Previous studies have found linguistic differences by geographical location within the same country on X (e.g. Rechkemmer, Wilson \u0026 Mihalcea, 2020), making it an appropriate choice for a CMC site. Furthermore, the website had an advanced search engine that allowed you to search for geolocated posts. X was also compatible with commercial web scraping software, allowing me to compensate for my lack of knowledge and experience as a novice researcher whose only relevant capability was a rudimentary knowledge of Python. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:1","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset accessibility Accessibility needed to be considered when selecting and collecting data. The spoken corpora must already have transcriptions of the recordings available and the metadata needed to contain fields that matched the variables I was looking for. A small portion of the recordings in the AusTalk dataset had already been transcribed and made available online. This data could be downloaded from two separate (now defunct) web-based tools: the transcriptions via the Alveo website and the metadata via the Alveo Query Engine. The metadata contained detailed demographic information about the speakers in the recordings, allowing me to operationalise different fields into my chosen variables: age, gender, socio-economic class (education level), ethnicity (cultural heritage) and geographical location (current city). For the Sydney Speaks corpus, I submitted an application for access to the main researcher, LDaCA Chief Investigator Catherine Travis, and once granted, she sent me a link to download the data files. The metadata was encoded in the file names, with different codes used for ethnicity, gender and age. 
Luckily for future researchers, the majority of the Sydney Speaks corpus will be incorporated into the LDaCA portal, where the data and metadata will be made FAIR: findable, accessible, interoperable and reusable. Aside from the metadata requiring more manual work to extract, the utility of the Sydney Speaks dataset was lower than AusTalk for three main reasons: Only a limited number of ethnicities were included: Anglo (Australian), Chinese, Greek and Italian. The data was collected solely in Sydney, meaning it could only be analysed for a single geographical location. No metadata field was available to operationalise into a class variable. The accessibility of data and metadata from the CMC source was also key. Posts on X were highly accessible compared to other CMC sites, such as Facebook, where posts could be hidden from those outside of the user’s social network. However, the reverse was true when looking for information about a user on X, as usernames, user profiles and profile pictures were less likely to reflect the user’s true identity compared to Facebook. As a result, the metadata was sparser, and aside from geographical location, only one other variable was operationalised: gender. I assigned a gender to each user based on their username and handle, consulting profile pictures and descriptions in ambiguous cases (see Figure 2 for examples). While being arguably limited in making such a determination, this method was the most suitable option available at the time. If I wanted more reliable user information, ethics clearance would be needed, as it was only possible to determine all variables conclusively by contacting the users. 
Figure 2: Assigning genders to X users on two example posts Image Source: Teresa Chan ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:2","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset size The data sources needed to contain enough instances of GEs to allow robust analysis. In previous studies, the frequency of GE usage across major varieties of English ranged from about 20 to 40 GEs per 10,000 words of speech, though rates of 50 to 60 GEs per 10,000 words had been observed among adolescents (Cheshire, 2007; Norrby \u0026 Winter, 2002). See Figure 3 for the rates of use of GEs (per 10,000 words) by age across several of these studies. Figure 3: Rates of use of GEs (per 10,000 words) by age, across studies Image Source: Catherine Travis These studies looked at spoken language, as the prevalence of GEs was found to be higher in spoken corpora than in written corpora. Indeed, one study (Martínez, 2011) found the frequency of GEs was approximately 9 to 10 per 10,000 words in the spoken subcorpus compared to 1 to 1.5 in the written subcorpus of two different corpora of British English (International Corpus of English-GB component and the British National Corpus). Additionally, in the written subcorpora, GEs were mainly found in fictional dialogue or conversation, or in informal writing, such as emails. This suggests that GEs are predominantly a feature of spoken language. Based on the frequency figures outlined above, datasets needed to contain hundreds of thousands of words of speech in order to produce a sufficient number of GE tokens. Both AusTalk and Sydney Speaks featured spoken language and proved to be large enough datasets (274,676 words and 568,555 words, respectively). 
For the AusTalk corpus, participants completed a number of different tasks, but only data from two of the tasks were considered for the dataset: a story re-telling task, where a participant was asked to re-tell a story from a previous task in their own words, and an interview task, where the participant and researcher discussed a topic of interest to the participant. These tasks were designed to elicit a more spontaneous speech style, which would be fruitful ground to find GEs. Although interactions on CMC sites generally feature more written than spoken language, they seem more like oral interactions than written interactions. In a study on a synchronous CMC site (instant messaging), Fernandez and Yuldashev (2011) found GE usage patterns were closer to those of spoken interaction than written interaction, though the frequency of occurrence was noticeably lower compared to the spoken corpora they investigated. I theorised that an asynchronous CMC site like X should also demonstrate usage patterns that were similar to spoken interaction. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:3","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Dataset novelty To corroborate and extend previous findings, data sources that had never been analysed before were preferred, which was the case for AusTalk and X. Although patterns of GE usage by age, gender and socio-economic class had been studied previously in Sydney Speaks (Travis, Grama \u0026 Gonzalez, 2017), re-analysis of the dataset could illuminate further trends. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:2:4","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Data analysis Once I had cleaned and organised the data for all three datasets, I uploaded them into commercial corpus management and text analysis software (Sketch Engine) and conducted concordance searches. I searched for GE variants from a list I had compiled before starting data collection, based on those found in previous studies. Figure 4 shows the first 20 results of a concordance search for the GE and stuff in the Sydney Speaks dataset. Figure 4: Concordance search for *and stuff* in the Sydney Speaks dataset Image Source: Teresa Chan Each search result was manually inspected to ensure it could be identified as a GE. After selecting only genuine instances, the results of each search were exported as CSV files, then imported into Microsoft Excel for annotation. Each GE token was coded with the relevant variables for the dataset, e.g. age, gender. Each token was also sorted into a GE subcategory and a broader GE category; for instance, or something was placed in the (or) something (like that) subcategory and the something category. Once all tokens were coded and sorted, statistical analysis was completed in Excel using the Data Analysis ToolPak add-on. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:3:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project results Addressing each research question in turn: Are there differences in GE usage in Australia between: a. Different cities? There were no statistically significant differences in frequency of GE usage between different Australian cities in AusTalk. 
However, the Sydney Speaks mean frequency per 10,000 words was significantly higher than the mean frequency for every city in AusTalk, including Sydney [1]. Looking at GE categories, there were no statistically significant results. However, there were apparent trends, such as Sydney AusTalk speakers using less common variants more often than speakers in other cities [2]. Thus, there is some evidence of linguistic differences by geographical location. b. Different ethnicities? There were no statistically significant differences in frequency of usage between Anglo and non-Anglo Australians within either spoken corpus. However, when the corpora were compared, the mean frequency of both Anglo and non-Anglo Australians in Sydney Speaks was significantly higher than that of both Anglo and non-Anglo Australians in AusTalk [3]. There were no statistically significant differences in preferred variants, though there were some evident trends. For instance, Chinese Australians in the Sydney Speaks dataset used whatnot/what not variants more often than other ethnicities [4], which might relate to my original observation about Vietnamese Australians. Does GE usage differ between spoken settings and CMC settings in Australia? There were more GE tokens on X than in the spoken corpora. However, the mean frequency per user was significantly lower on X than the mean frequency per speaker in the spoken corpora [5], likely due to the limit on post length on X. Figure 5 illustrates the mean GE frequency for Sydney speakers in the AusTalk and Sydney Speaks datasets compared to X users [6]. There were also GE variants such as (and) stuff (like that) that were used significantly more frequently by speakers in the spoken corpora than by X users [7]. Therefore, it seems usage patterns on X differ from those in spoken Australian English. 
Figure 5: Mean GE frequency for Sydney speakers in AusTalk and Sydney Speaks datasets compared to X (formerly Twitter) Image Source: Teresa Chan Do the results corroborate or refute previous findings, particularly those specific to Australian English? The AusTalk results corroborated the robust finding on age (e.g. Secova, 2017; Travis et al., 2017), that is, younger speakers used GEs more frequently than older speakers [8]. There was also support across both corpora for previous studies that found no gender and class differences in frequency of usage. ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:4:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Project reflection Though the analysis revealed interesting trends, more data is necessary to make further claims about the difference between GE usage in spoken Australian English and on X. I considered expanding on this in a PhD, but decided against it before I ended up working at LDaCA. If I did resume my research, I would need to consider other CMC sites, with X on the decline and no clear winner among the alternatives. However, other large platforms like Reddit and Facebook come with their own challenges, including restrictions on access to their underlying APIs (Application Programming Interfaces). Using the APIs would allow me to avoid web scraping and its associated risks (e.g. potential violation of website terms) and limitations (e.g. difficulty in handling large or complex websites). Additionally, I would drastically change how I approach data capture and analysis, with a focus on digital automation and open-source software. The LDaCA portal would also be an invaluable resource for locating corpora of Australian English, with a clear and straightforward path to accessing both the data and associated metadata. 
","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:5:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"References Aijmer, K. (2013). General Extenders. In H. Pichler (Ed.), Understanding Pragmatic Markers: A Variational Pragmatic Approach (pp. 127–147). Edinburgh, United Kingdom: Edinburgh University Press. Burnham, D., Estival, D., Fazio, S., Viethen, J., Cox, J., Dale, R., Cassidy, S., Epps, J., Togneri, R., Wagner, M., Kinoshita, Y., Göcke, R., Arciuli, J., Onslow, M., Lewis, T., Butcher, A., \u0026 Hajek, J. (2011). Building an audio-visual corpus of Australian English: large corpus collection with an economical portable and replicable Black Box. Proceedings of 12th Annual Conference of the International Speech Communication Association (Interspeech 2011) (pp. 841–844). Retrieved from https://austalk.edu.au/bibliography Cheshire, J. (2007). Discourse variation, grammaticalisation and stuff like that. Journal of Sociolinguistics, 11(2), 155–193. Fernandez, J. \u0026 Yuldashev, A. (2011). Variation in the use of general extenders and stuff in instant messaging interactions. Journal of Pragmatics, 43(10), 2610–2626. Martínez, I. M. P. (2011). “I might, I might go I mean it depends on money things and stuff”. A preliminary analysis of general extenders in British teenagers’ discourse. Journal of Pragmatics, 43(9), 2452–2470. Norrby, C. \u0026 Winter, J. (2002). Affiliation in Adolescents’ Use of Discourse Extenders. In C. Allen (Ed.), Proceedings of the 2001 Conference of the Australian Linguistic Society. Retrieved from http://www.als.asn.au/proceedings/als2001.html Rechkemmer, A., Wilson, S. R. \u0026 Mihalcea, R. (2020). Small Town or Metropolis? Analyzing the Relationship between Population Size and Language. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, … \u0026 S. 
Piperidis (Eds.), Proceedings of the 12th Language Resources and Evaluation Conference. Retrieved from http://www.lrec-conf.org/proceedings/lrec2020/index.html Secova, M. (2017). Discourse-pragmatic variation in Paris French and London English: Insights from general extenders. Journal of Pragmatics, 114, 1–15. Sheard, E. (2019). Variation, Language Ideologies and Stereotypes: Orientations towards like and youse in Western and Northern Sydney. Australian Journal of Linguistics, 39(4), 485–510. Travis, C. E. (2014–2021). Sydney Speaks. Australian Research Council Centre of Excellence for the Dynamics of Language, Australian National University: http://www.dynamicsoflanguage.edu.au/sydney-speaks/ Travis, C. E., Grama, J. \u0026 Gonzalez, S. (2017). General extenders over time in Sydney English: From or something to and stuff. Paper presented at the Australian Linguistic Society Annual Conference, Sydney, Australia, 4–7 December. Retrieved from http://www.dynamicsoflanguage.edu.au/sydney-speaks/dissemination/ ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:6:0","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":"Footnotes 1 The mean frequency per 10,000 words for AusTalk speakers was significantly lower than Sydney Speaks speakers, in Sydney (t(129) = 2.40, p = .018), Adelaide (t(154) = 3.01, p = .003), Brisbane (t(29) = 4.89, p \u003c .001) and Canberra (t(129) = 2.03, p = .044). ↩ 2 Less common variants made up 15.00% of AusTalk tokens in Sydney, compared to 9.03% in Adelaide, 6.90% in Brisbane and 7.20% in Canberra. ↩ 3 The mean frequency of Anglo Australians in the Sydney Speaks dataset was significantly higher than both Anglo Australians (t(108) = 4.66, p \u003c .001) and non-Anglo Australians (t(69) = 3.67, p \u003c .001) in AusTalk. 
The mean frequency of non-Anglo Australians in the Sydney Speaks dataset was significantly higher than both Anglo Australians (t(121) = 3.15, p = .002) and non-Anglo Australians (t(55) = 2.85, p = .006) in AusTalk. ↩ 4 Whatnot/what not variants made up 9.04% of Sydney Speaks tokens for Chinese Australians, compared to 0.99% for Anglo Australians, 4.49% for Greek Australians and 1.61% for Italian Australians. ↩ 5 The mean frequency per speaker/user for AusTalk speakers was significantly higher than X users (t(64) = 4.51, p \u003c .001). The mean frequency for Sydney Speaks speakers was significantly higher than X users (t(117) = 9.85, p \u003c .001). ↩ 6 The mean frequency for AusTalk speakers was significantly higher than X users in Sydney (t(877) = 4.00, p \u003c .001). The mean frequency for Sydney Speaks speakers was significantly higher than X users in Sydney (t(117) = 9.85, p \u003c .001). ↩ 7 The (and) stuff (like that) variant was used significantly more frequently by speakers in the AusTalk (t(15) = 3.17, p = .006) and Sydney Speaks (t(78) = 6.21, p \u003c .001) datasets compared to X users. ↩ 8 Speakers in the datasets were assigned one of three age brackets: Young (19-29 year-olds), Middle (30-59 year-olds) and Old (60-79 year-olds). The mean frequencies for the Young (t(36) = 4.07, p \u003c .001) and Middle (t(48) = 2.42, p = .02) age brackets in AusTalk were significantly higher than the Old age bracket. ↩ ","date":"2024-05-15","objectID":"/resources/general-resources/case-studies/masters-research-project/:6:1","tags":null,"title":"A Master's Research Project","uri":"/resources/general-resources/case-studies/masters-research-project/"},{"categories":null,"content":" This blog post features an interview with Kalin Stefanov (ARC DECRA Fellow at Monash University), who is working on an Australian Sign Language (Auslan) project. 
We would like to thank Chief Investigator Louisa Willoughby for first letting us know about Kalin’s work, and Kalin for graciously agreeing to the interview and providing his thoughts. Kalin Stefanov, Monash University Image Source: Kalin Stefanov How did you get interested in sign language? During my PhD at KTH Royal Institute of Technology, I worked on a project that aimed at creating a learning environment where children can learn signs in a game-like setting. An on-screen avatar presents the signs and gives the child certain tasks to accomplish, and in doing so, the child gets to practise the signs. The system was thus required to interpret the signs produced by the child, distinguish them from other signs, and indicate whether or not it was the right one. For this project, I developed a real-time hand gesture recognition technology applied to the recognition of Swedish Sign Language signs. To date, this is the only real-time Swedish Sign Language recogniser and one of the few such technologies in the world (both in academia and industry). The project involved close collaboration with researchers (native signers) at the Department of Linguistics, Stockholm University and technology evaluation with the target end-users. This was my (immersed) introduction to the fascinating world of sign language communication and since then I have participated in a range of research projects related to different sign language technologies. Can you tell us about the project you are currently undertaking? My current project addresses the computational modelling of Auslan. The project expects to generate knowledge by creating the largest Auslan dataset, enabling further advancements in this research area. The dataset will also play an essential role in other research fields, e.g. sign linguistics. Expected outcomes include the invention of the first Auslan recogniser and generator, representing a substantial advancement towards fully automated Auslan translation. 
The Auslan recogniser will be able to distinguish 1000+ isolated Auslan signs. This means that by providing a video of an isolated Auslan sign, the recogniser will ‘translate’ it to written or spoken English. We also aim to provide mechanisms for personalisation of the recogniser to account for individual variations in signing. The Auslan generator will operate in the reverse direction and be able to synthesise 1000+ isolated Auslan signs. This means that by providing a written or spoken English word, the generator will ‘translate’ it to a video of an isolated Auslan sign. We also aim to provide mechanisms for the personalisation of the generator to account for individual preferences in appearance and signing style. Who will be helped by the output of your project, and how? The Auslan recogniser and generator will be general enough to form the core of many applications targeted at specific needs. I see at least two contexts that could benefit directly from the project outcomes: Auslan education and research. In education, on the one hand, I can envision an application for individual Auslan sign acquisition and training (think ‘Duolingo for Auslan’). On the other hand, in collaboration with A/Prof Louisa Willoughby and her extensive network, we will attempt to co-design more focused pedagogical applications of the technologies. We are still in the early stages of the technology development, hence we have not yet finalised what this could look like. One possible direction is to embed the developed technologies in education offerings for future Auslan interpreters. This will potentially improve the quality of Auslan training and ensure that more students pass interpreting exams and enter the workforce as confident Auslan interpreters. 
In research, collecting and annotating sign language data is an expensive task that needs the","date":"2024-04-11","objectID":"/news/posts/kalin-stefanov/:0:0","tags":null,"title":"Interview with Kalin Stefanov","uri":"/news/posts/kalin-stefanov/"},{"categories":null,"content":"LDaCA team member Rosanna Smith discusses data management in language technology, focusing on projects at tech company Appen.","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"by Rosanna Smith Founded in Sydney in 1996, Appen is a technology company that collects and improves data for the purposes of training and developing machine learning and artificial intelligence systems. During my time at Appen, a significant part of my work involved a diverse range of projects for both automatic speech recognition (ASR) and text-to-speech (TTS) technologies, including phonetic transcription, grapheme-to-phoneme or letter-to-sound (L2S) rule development, and part-of-speech (POS) annotation. To break down these language technologies further, ASR enables computers to process human spoken language into readable text, allowing users to operate devices through speech or facilitate translation of that speech into other languages. Conversely, TTS generates an audio version of a written text. TTS can be used to improve accessibility, particularly for those with vision impairments or learning disabilities such as dyslexia. Language technology relies heavily on good-quality data, and the table below illustrates just some of the processes used in Appen projects, with use cases they can be applied to. 
Process Type Description Example Use Case Phonetic Transcription The process of transcribing words in a lexicon according to their sound (pronunciation lexicon), through the use of a phonetic alphabet such as the International Phonetic Alphabet (IPA) or X-SAMPA (a phonetic script that uses symbols all found on a standard keyboard, aiding in native-speaker transcription). Suprasegmental features such as stress, tone, vowel length, pitch accent and syllabification can also be encoded where relevant to the language. Use in ASR: Appen developed language packs in 26 languages for the Intelligence Advanced Research Projects Activity (IARPA) Babel program, facilitating improvement in speech recognition performance for languages other than English which have very little transcribed data. These projects generally have data collection, orthographic transcription and phonetic transcription components, the latter using X-SAMPA annotation. Further reading: IARPA - Babel Example dataset: IARPA Babel Mongolian Language Pack Use in TTS: In collaboration with health service providers GuildLink and MedAdvisor, Appen produces TTS audio of Australian Consumer Medicine Information (CMI) documents. This audio is available at medsinfo. Further reading: GuildLink Makes Consumer Medicines Information More Accessible Grapheme-to-Phoneme or Letter-to-Sound (L2S) Rule Development A set of rules for a given language that map between orthography and pronunciation, used to generate phonetic transcriptions from an input text. These rules can be both single and multi-letter mappings such as orthographic \u003cs\u003e → X-SAMPA /s/ and \u003csh\u003e → /S/ in English. The rules can refer to the surrounding environment which allows for alternative phonetic mappings based on the characters’ position in a word, surrounding morphemes, or other linguistic processes that impact pronunciation. 
For example, rules are implemented to specify when \u003cs\u003e should be pronounced as /z/ as in ‘roses’ or as /s/ as in ‘sit’. Suprasegmental features such as stress, tone, vowel length, pitch accent and syllabification may also be encoded; however, the accuracy of these in the output pronunciations depends on the given language and the extent to which these features are predictable from orthography alone. As was the case for the IARPA Babel program, one of the first steps in producing pronunciation lexicons involves running the wordlist through a language or dialect-specific L2S algorithm to generate hypothesised X-SAMPA pronunciations of each word. For many of the 26 languages in Babel, these rules didn’t yet exist and had to be developed through research, consultation and iterative testing with native speaker linguists. The automated output is then checked and edited as needed by native speaker linguists. L2S rules may be further developed according to customised requirements, for example, additiona","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:0:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Data Management during the Life of a Project Due to the variety of services, data types and customisation capabilities provided by Appen, it is difficult to present a single standard approach to data management. Instead, I’ll focus on the processes that were developed for datasets that have both orthographic and phonetic transcription components, also known as transcription and lexicon multi-component projects. The basic procedure for these projects is outlined below: The client or project sponsor specifies the parameters for the dataset and its collection in consultation with Appen specialists. 
Appen carries out data collection, usually in the form of conversational or scripted telephony, VoIP or microphone recordings with qualified participants from Appen’s crowd of more than a million. Conversational data refers to spontaneous or unscripted natural speech on a variety of topics either over the telephone or with both participants in the same room, while for scripted data, participants read and respond to a set text of prompts, curated to facilitate topic, domain, keywords, key phrases and phonetic coverage. All participants are provided with a consent form explaining the purpose of the collection and how the data will be used, and only take part in the collection if they are happy and comfortable to do so, and if they sign the consent form. Personal data such as names are anonymised and sensitive personal data is not collected. The audio data is then quality-checked to ensure it meets the requirements for the collection, including demographic balance, language and content, background noise levels, audio levels and recording duration. The approved audio goes through some pre-processing steps to batch the data into smaller segments, then is orthographically transcribed by the transcription (TX) team. Timestamps are checked at this stage as well to ensure correct alignment of the audio and the transcribed text, and a variety of labels are added to capture speaker and non-speaker noise events (e.g. laugh, cough, background noise). Rigorous quality assurance processes are applied while the transcription is ongoing and on the resulting complete data set. Figure 1: Example of audio with transcription, batched and presented to transcribers in tool. Image Source: Appen Figure 2: Example of the resulting transcription text file. Image Source: Appen A sorted list of unique word forms (e.g. 
‘house’, ‘houses’) occurring in the dataset is created from the orthographically transcribed data and sent to the lexicon (LX) team, who work with a group of native speaker linguists to prepare phonetic transcriptions of the words, either in X-SAMPA or another phonetic script. If a POS lexicon is also required for the dataset, this is similarly created from the sorted list of unique word forms and checked by native speaker linguists. Figure 3: Example of a Hungarian pronunciation lexicon loaded to the transcription tool for native speaker review. Image Source: Appen Figure 4: Example of a UK English pronunciation lexicon. Image Source: Appen Figure 5: Example of a UK English POS lexicon. Image Source: Appen Data across each stage is validated for quality assurance. The audio, transcription and lexicon(s) are packaged for delivery. As shown above, the data for a single project goes through multiple stages of annotation and requires specialist linguist expertise in the data management process before each part can be combined for final delivery. One major challenge in this process is dealing with errors and inconsistencies that arise both prior to, and during, the validation stage, particularly where data is shared between the TX and LX teams. As transcribers and linguists prepare the annotations, variations may arise in features like spelling, capitalisation and punctuation. Languages may have no or multiple accepted orthographic conventions, and a standard approach must be determined for the project and adhered to. Conversely, other languages may alre","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:1:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Data Management for Pre-Labeled Datasets Figure 6: Visual representation of pre-labeled dataset types. 
Image Source: Appen At the time of writing, Appen has over 280 audio, image, video and text datasets in over 80 languages available as pre-labeled datasets. These datasets are publicly accessible and can be filtered according to several categories: product type, common use cases, language and number of hours of audio, word or image count, if applicable (called Unit in the table below). Combined with a standard search function, this filtering allows interested parties to further refine their query for data most applicable to their needs. Product Type Common Use Cases Language Unit AudioImageTextVideo ASRAction ClassificationAutomatic CaptioningBaby MonitorCall CentreChatbotContent ClassificationConversational AIDocument ProcessingDocument SearchFacial RecognitionFitness ApplicationsGesture RecognitionIn Car HMI \u0026 EntertainmentKeyword SpottingLanguage ModellingMTNERSearch EnginesSecurity \u0026 Other Consumer ApplicationsSpeech AnalyticsTTSVirtual Assistant A-Z A measure of the volume of the dataset, either in hours or word count. Each listing contains further details about the dataset, particularly in relation to aspects of the data collection, including the specifics of the language collected, the volume of the dataset, the number of contributors involved, recording conditions and the file formats available. Additional details appear in a pop-up window when you click on the “tile” for a given dataset. For example: Catalogue entries for a Danish POS dataset and a Mongolian pronunciation dictionary: Figure 7: Danish POS dictionary catalogue summary and metadata. Image Source: Appen Figure 8: Mongolian pronunciation dictionary catalogue summary and metadata. Image Source: Appen This is a helpful starting point for browsing and refining the dataset selection, but in order to find the most suitable match, clients consult directly with Appen to discuss their language technology needs in full. 
More detailed metadata for each dataset is tracked internally and this is used to further assist with client queries and requests. Some of the additional metadata categories recorded include the demographics of the contributors for each dataset, such as the age range and gender distribution of the participants, the dialect coverage within a specific language, the year of collection, and the domains or topics discussed in the collection. Using the metadata as filters, requests can be divided into three alternative outcomes: Dataset(s) matching all requirements are available and a sample of the data is shared with the client to confirm this is the case. Dataset(s) matching only some of the requirements are available and samples of these are shared in case they are still applicable to the project. No dataset matching the requirements is available and a quote is then prepared for producing a new dataset or enhancing an existing dataset. In terms of dataset storage, collections are catalogued first according to language and second by dataset type. This method works particularly well for pre-labeled datasets, because, if nothing else, clients usually know which language(s) they want data for and other specifics can follow from there. ","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:2:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"Conclusions Both in cases where new datasets are being created and finalised collections made ready for cataloguing, data management is a crucial consideration at Appen and, more broadly, any language technology project. 
Two main issues that are important to consider for data collections are: Ensuring that collections with multiple component parts are organised in such a way that updates can be applied to the whole, rather than introducing inconsistencies between related data. Determining the parties that will be using the repository and ensuring that the methods of accessing data are intuitive for all groups. These issues are handled in the Appen workflow through the use of iterative semi-automated spelling checks and other similar processes which enable modifications to be integrated into all levels of the dataset, and through clear recording and accessibility of the metadata describing a collection, available both to Appen staff and its clients. For LDaCA, these issues are of equal importance for wider data management. To the first point, while it is ultimately the responsibility and prerogative of the data contributor to decide how their data should be organised, LDaCA can also assist with this process. Guidance is available on best practices for data management and organisation of metadata, both in the form of the documentation available at LDaCA Resources, as well as automated validation of metadata category requirements for datasets added and edited through Crate-O. To the second point, LDaCA has a responsibility to ensure that tools to facilitate data management and discoverability are accessible and intuitive for all, including the portals that use the Oni application to access data packaged as RO-Crates. This should be an iterative process, with further improvements to operability implemented based on user feedback and needs. 
","date":"2024-03-22","objectID":"/resources/general-resources/case-studies/data-mangement-appen/:3:0","tags":null,"title":"Data Management in Language Technology: A Case Study of Appen","uri":"/resources/general-resources/case-studies/data-mangement-appen/"},{"categories":null,"content":"LDaCA team members who also worked on the AusNC project explain the relationship between the two and why AusNC is not an entity in LDaCA.","date":"2024-03-22","objectID":"/news/posts/ausnc/","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"by Simon Musgrave and Michael Haugh ","date":"2024-03-22","objectID":"/news/posts/ausnc/:0:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Overview of the Australian National Corpus (AusNC) The Australian National Corpus (AusNC) project was an important precursor to the Language Data Commons of Australia (LDaCA) project. Important aims and motivations of the LDaCA project are to secure language data in Australia and to make it easily available to researchers (and other interested parties). But language researchers have been aware of these issues for some time and the AusNC was an earlier attempt to address them. The AusNC was discussed in various places before the work of assembling data began. A meeting of interested people was held at the conferences of the Australian Linguistic Society and the Applied Linguistics Association of Australia in 2008, and a workshop was held in 2008 (proceedings published in Haugh et al. 2009), supported by an ARC-funded network for research into speech and communication (HCSNet). Following these meetings, funding support from the Australian National Data Service, and subsequent funding and technical support from Griffith University, enabled us to start the AusNC project, which was led by Michael Haugh, then a staff member at Griffith. 
Several data sets, which already existed and could easily be accessed, were included in this initial stage of the project: AustLit (selected samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s) The Australian Corpus of English (ACE) The Australian component of the International Corpus of English (ICE-AUS) Australian Radio Talkback (ART) The Corpus of Oz Early English (COOEE) The Monash Corpus of English (MCE) The Griffith Corpus of Spoken Australian English (GCSAusE) Braided Channels (oral histories of women from the Channel Country) Mitchell and Delbridge (a sample from data collected around 1960 documenting the speech of Australian adolescents) The team behind the AusNC always hoped that additional data would be added to the corpus, although funding constraints meant that only one further collection, the La Trobe Corpus of Spoken English, was added to the corpus shortly after the initial launch of the AusNC web interface. The Griffith University eResearch Services built the web interface and hosted the data for the AusNC, which was launched in 2012. From 2014, the Alveo virtual laboratory also held a copy of the data and provided alternative ways of interacting with it. However, after the launch of the Alveo version, the Griffith version of the corpus changed in various ways, and the two versions remained out of sync for a time, before both eventually became inaccessible when changes in the underlying technologies meant that maintaining those digital infrastructures was no longer cost effective. The Alveo virtual laboratory still exists, in principle, but has not been regularly maintained since 2019, and its services are rarely operational. 
With no ongoing funding and no staff member affiliated with the AusNC project, Griffith University indicated that it could not maintain the AusNC data sets and interface, and closed those facilities when the data was transferred to LDaCA in 2023.1 All the original data sets are now available (or are in the process of being made available) through the LDaCA data portal, and in one case, there is a substantial increase in the amount of data available (see discussion of Mitchell and Delbridge below). Figure 1: Results of a frequency search in the Australian National Corpus interface. Image Source: LDaCA ","date":"2024-03-22","objectID":"/news/posts/ausnc/:1:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Corpus design There are three general types of corpora for language data. In the first type, the data was collected for specific research purposes by individuals. In the second type, the data was collected following a plan. In the third type, the data was collected opportunistically. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected for specific research purposes by individuals Likely the most common type of corpus is a collection created by an individual researcher or research group to try to answer a specific question or set of questions. An example of such a corpus in the AusNC is the Mitchell and Delbridge dataset. Mitchell and Delbridge wanted to know what Australian speech sounded like, how it differed from other varieties of English (often Received Pronunciation), and what sort of variation occurred within the Australian community. 
To answer these questions, Mitchell and Delbridge recruited school principals across the country to organise recordings of adolescents reading word lists and other prepared material, with a small, less structured component. As mentioned above, the AusNC made only a small part of this dataset available. LDaCA is delighted that, with the assistance of the University of Sydney, we are now able to provide access to data from all of the 7,736 speakers from 330 schools who were recorded. Another example of such a corpus in the AusNC is the Braided Channels collection, which contains 70 hours of oral history interviews with women from Queensland’s Channel Country. This is an example of a collection that was created to answer historical and cultural questions, but has been repurposed to enable researchers to address linguistic questions as well. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:1","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected following a plan In the second type of corpus, data is collected according to a defined plan. The kinds of material and the relative amounts of different kinds of material are decided in advance, and then data is collected accordingly. We can distinguish two sub-types within this category: the plan already exists or the plan needs to be developed independently. Existing plan In some cases, data is collected according to the plan of another corpus which already exists, so that meaningful comparisons can be made. An example of this approach in the AusNC is the Australian Corpus of English (ACE), which was designed to be comparable with two previous corpora: the Brown Corpus of American English and the Lancaster-Oslo-Bergen Corpus of UK English (which was itself designed to be comparable with the Brown Corpus). 
The Australian component of the International Corpus of English (ICE-AUS) is another example of this type of corpus in the AusNC. Independent plan Alternatively, a corpus can be designed independently, but with the intention that it should be representative of some realm of language use, showing different genres and registers, and possibly also sampling some aspects of demographic variation. A classic example of such a corpus is the British National Corpus (BNC: 1994 version, 2014 version), which includes samples of both written and spoken language, with different kinds of written text in specified proportions, and which samples speaker variation across dimensions such as geography, gender and age. The BNC is sometimes referred to as a ‘reference corpus’; the idea is that it provides a baseline against which phenomena from other datasets can be compared. COOEE was constructed on similar, if less ambitious, lines. The corpus is divided into four time periods, with an approximately equal amount of material in each. Texts are assigned to one of four different types, and the proportion of each of those types is approximately equal in each time period. Such a design makes it easy to examine chronological change or variation between registers. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:2","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Data collected opportunistically The third type of corpus is one that is almost without a plan; the collection of data is opportunistic. Again, we can distinguish some sub-types here. Restricted amount of data Sometimes, opportunistic data collection is the only viable strategy. An example of this is the approach often taken by linguists documenting languages whose future is uncertain. 
In that situation, carefully planning what kinds of data to collect and in what proportions would be ideal, but is typically a luxury which cannot be afforded. Collecting anything which is easily available is at least a good starting point — if one becomes aware that there are large gaps in the data collected, one can try to address the problem in the future, but that may not ever be possible. This type of corpus is the result of making the best one can of limited opportunities, which restrict the amount of data which is available. Collections of spoken interaction are often assembled in this way. The Griffith Corpus of Spoken Australian English (GCSAusE), which includes recordings made and transcribed by students, is an example of this type of corpus in the AusNC. Extensive data Opportunistic data collection is also relevant to big data: many researchers work with data from the web nowadays, and this almost inevitably involves opportunistic procedures. However, in this case, the large amount of data which is accessible can justify a claim that a significant part of the variation occurring in language use (at least for a certain type of use) will be represented in the data. On this basis, many linguists are happy to use a corpus of web data, at least for exploratory research. Figure 2: The Australian National Corpus logo. Image Source: LDaCA ","date":"2024-03-22","objectID":"/news/posts/ausnc/:2:3","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"AusNC as an opportunistic collection We have set out three broad strategies above for constructing a corpus. The AusNC calls itself a corpus, so can we assign it to one of these types? We hope that the answer to this question is clear: the AusNC was an opportunistic collection. When the possibility of a corpus was being discussed, there were proposals for it to be carefully planned. 
For example, in a paper published following one of the workshops leading up to the AusNC, Pho (2009) argued that an Australian corpus should mirror the design of the BNC. However, even putting together a collection of written material on the scale of the BNC is a huge task, never mind the spoken language component; the BNC itself was produced by a large team working over a number of years. The AusNC project never enjoyed a level of funding which would have made such a plan possible (indeed, the amount of funding we received, while very welcome, was modest). In these circumstances, we had no alternative but to create the AusNC from existing data sources which custodians were prepared to share. We did hope that this model would make it easy to expand the corpus, but this did not occur, as we were not able to develop the infrastructure necessary for ongoing expansion of the corpus with the limited amount of funding we had. In 2012, it seemed obvious that the appropriate term for our project was corpus. We were putting together a collection of language data and it came to exist as an entity in its own right. What was included might be the result of rather random factors, but there was a coherence in that all the data related to language use, specifically use of the English language in Australia. No one was talking about ‘data commons’2 when the AusNC was created (or we were not listening). If the concept had been actively discussed, perhaps we would have described what we were doing as a data commons. The inception of LDaCA coincided, more or less, with the two factors mentioned above which led to the AusNC becoming non-viable: the understandable withdrawal of support from Griffith University and the instability of the Alveo infrastructure due to funding constraints. It was therefore inevitable that transferring the AusNC collection to the new project would be an early priority. 
Approaching this task, we faced the problem of whether it still made sense to refer to the collection of datasets as the AusNC. After discussion within our project, and with the custodians of the AusNC datasets, we decided that retaining the name would suggest a coherence which did not exist now that the AusNC was part of LDaCA. The AusNC is now part of a larger assemblage of collections of data, and we felt that any internal coherence which the AusNC might have did not differentiate it, or the datasets it was composed of, from that broader landscape. Therefore, we made the decision to include the component datasets of the AusNC as separate collections in the LDaCA data portal. The top-level description of each of the collections contains the following sentence to acknowledge the history of the data: “This collection was previously accessible online via the Australian National Corpus (AusNC), an initiative managed by Griffith University between 2012 and 2023.” Additionally, this blog post is available to those who are interested in the history of the AusNC.3 The corpus is dead, long live the data commons! ","date":"2024-03-22","objectID":"/news/posts/ausnc/:3:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"References Grossman, Robert L., Allison Heath, Mark Murphy, Maria Patterson, and Walt Wells. 2016. “A Case for Data Commons: Toward Data Science as a Service.” Computing in Science \u0026 Engineering 18 (5): 10–20. DOI: 10.1109/MCSE.2016.92 Haugh, Michael, Kate Burridge, Jean Mulder and Pam Peters (eds.). 2009. Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages. Somerville, MA: Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html. Musgrave, Simon \u0026 Michael Haugh. 2020. The Australian National Corpus (and beyond). 
In Louisa Willoughby \u0026 Howard Manns (eds.), Australian English Reimagined. Abingdon: Routledge. Pho, Phuong Dzung. 2009. Towards the Design of the Australian National Corpus. In Michael Haugh, Kate Burridge, Jean Mulder \u0026 Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages, 25–29. Cascadilla Proceedings Project. http://www.lingref.com/cpp/ausnc/2008/index.html. ","date":"2024-03-22","objectID":"/news/posts/ausnc/:4:0","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"Footnotes Thanks to Teresa Chan who provided valuable feedback on a draft. 1 We think it is important, however, to acknowledge and recognise the longstanding commitment by Griffith University to maintaining the AusNC over a period of more than ten years. ↩ 2 “A global trusted system of systems that provides frictionless access to high quality interoperable resources, services and artefacts for research….. Data commons collocate data, storage, and computing infrastructure with core services and commonly used tools and applications for managing, analyzing, and sharing data to create an interoperable resource for the research community.” (Grossman et al. 2016) ↩ 3 More details of the history of the AusNC can be found in Musgrave \u0026 Haugh (2020). 
↩ ","date":"2024-03-22","objectID":"/news/posts/ausnc/:4:1","tags":null,"title":"What happened to the Australian National Corpus (AusNC)?","uri":"/news/posts/ausnc/"},{"categories":null,"content":"A list of the Oni data portals currently available, including their current collections.","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":" This page lists, and links to, the Oni portals currently available, together with the collections contained in each. As additional collections and portals are added, these will be documented below. To find out more about each collection, simply click through to be redirected to the Collections page. ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:0:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"LDaCA Portal Current Collections: A COrpus of Oz Early English (COOEE) AustLit Australian Corpus of English Australian Radio Talkback Braided Channels International Corpus of English (ICE-AUS) The La Trobe Corpus of Spoken Australian English The speech of Australian adolescents: research data and recordings collected by A.G. 
Mitchell and Arthur Delbridge in 1959 and 1960 ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:1:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"Indigenous Language Data (ILD) Portal Current Collections: UQ Indigenous Language Collection: Elwyn Flint Collection, UQFL173 Fryer Library, The University of Queensland UQ Library Collection Caroline Kelly Papers ","date":"2024-01-29","objectID":"/resources/user-guides/portal/available-portals/:2:0","tags":null,"title":"Available Portals","uri":"/resources/user-guides/portal/available-portals/"},{"categories":null,"content":"In 2018, when he was working at The University of Queensland’s (UQ) Research Computing Centre (RCC), Marco Fahmi (former LDaCA Project Manager) initiated a Graduate Digital Research Fellowship (GDRF) program focused on researchers in the Humanities and Social Sciences (HASS). The program has run in some years since then, hosted by the RCC (2019) and also by the UQ Graduate School (2020). In 2023, the program ran again, this time under the auspices of the Australian Text Analytics Platform (ATAP). Two UQ staff members, Sam Hames (Postdoctoral Research Fellow in Computational Humanities) and Simon Musgrave (Engagement Lead for ATAP and for the Language Data Commons of Australia), led the program, and four students participated. The program supports research students to explore the possibilities of digital scholarship, particularly in HASS. Our Fellows are students undertaking extended research projects who spent 12-15 weeks learning about digital and computational skills in order to enhance their current research/thesis topic or to work on an independent digital project. 
They explored digital research methods in areas including Computational analysis of text Creation of new software tools to support their specific research area Social media analytics Assessment and analysis of digital environments including mobile apps Online games as social and cultural objects The Fellows met regularly in activities to develop a sound understanding of digital research methods and tools, including seminars, reading groups, and training workshops. They also had access to mentors who advised and collaborated on their digital projects. The four Fellows who participated in the 2023 program and their projects were: ","date":"2024-01-10","objectID":"/news/posts/gdrf/:0:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Evelyn (Eve) Ansell (UQ, Linguistics): Eve works in the field of Conversation Analysis which uses very detailed transcriptions of spoken interactions as data. The detail means the basic word forms involved are not easy to identify. For example, dea:::l indicates the word deal with emphasis (the underlining) and a greatly lengthened vowel sound (the colons). Clearly, the search function in a word processor is not going to find this as an instance of the word deal! Eve’s project looked at the possibility of developing code in a Jupyter notebook which could identify base word forms in such transcriptions and allow the researcher to see all instances of a word in an interaction. Eve also had the opportunity to work on the problem as part of an industry placement with AARNet. In Eve’s words: My only regret is that I didn’t do the GDRF sooner. This sort of learning should be required early on in all Humanities research training programs. I cannot speak highly enough of Simon and Sam. They enthusiastically share their domains of knowledge, but also engage with us about our own work. 
To have access to experts willing to talk through current projects, future scoping, frustrations, and wild ideas is incredibly valuable. It’s easy to forget how much of our research work happens in the digital space. It’s not just the data, it’s how we interface with it. The GDRF taught me to think differently about how I treat my research workflow at almost every point. ","date":"2024-01-10","objectID":"/news/posts/gdrf/:1:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Mengyan (Kelly) Hou (UQ, Electrical Engineering and Computer Science): The core of Kelly’s PhD project is designing an app to support young people who have completed active cancer treatment to enhance their experience in digital healthcare interventions. In the Fellowship program, Kelly began surveying existing apps with similar functions and aims as the start of her own design process. An important component of this work was considering methodological questions: when we want to answer the question “Is this a good app?”, how do we go about that? Can we understand the intentions of the app designers and assess how effectively those intentions have been realised? Or should we try to judge what sort of experience the app provides for users (even if we are not members of the target audience)? Kelly examined these issues in relation to a number of relevant apps, as well as beginning to grapple with the (very topical) question of how an AI-based conversational agent might be included in her design. ","date":"2024-01-10","objectID":"/news/posts/gdrf/:2:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Christina Maxwell (UQ, Psychology): Christina’s research investigates where cisgender women create the boundaries for ‘womanhood’ in relation to transgender women within the context of feminist social movements. 
In her work as a Fellow, Christina looked at whether social media data, in this case from Twitter, could be another source of data to investigate the boundary creation issue. Similar to Kelly, Christina’s work involved considering methodological issues: what methods for analysing texts could usefully be applied to the Twitter data? It is exciting for all of us involved in the program that Christina (with co-authors) presented this research to the SASP-ACPID Conference 2023. To quote the abstract for that presentation: “As both anti-inclusion and pro-inclusion feminists are increasingly turning to social media to communicate their messages, we will collect data from Twitter to map out each movement online and how they position themselves and their goals in relation to the inclusion of transgender women. This exploratory mixed-methods project will use topic modelling, thematic analysis, and Linguistic Inquiry and Word Count…”. In Christina’s words: The Fellowship was instrumental in honing my ideas and offering invaluable hands-on support to bring my vision to life. It played a pivotal role in developing one of the most exciting and interesting projects of my PhD and I look forward to using these new skills and ways of thinking in future digitally-based projects. Update - August 2024 Christina’s work has now been published as a journal article with Sam as one of her co-authors: Maxwell, C., Selvanathan, H. P., Hames, S., Crimston, C. R., \u0026 Jetten, J. (2024). A mixed-methods approach to understand victimization discourses by opposing feminist sub-groups on social media. British Journal of Social Psychology, 00, 1–22. 
https://doi.org/10.1111/bjso.12785 ","date":"2024-01-10","objectID":"/news/posts/gdrf/:3:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Giulio Pitroso (Griffith, Sociology): Giulio’s research examines gaming communities in Australia and in Italy, focused on the way stereotypes tied to Italians and organised crime are articulated among players of video games. He is also interested in digital communities, ethnic identities on the Internet, and cultural and social manifestations related to Italian criminal organisations. In his work as a Fellow, he looked at the intersection of these areas of interest by examining the discussions, speculations, and expectations of fans about the (yet to be released) Mafia IV video game. This was tackled using comment threads from YouTube channels as a source of data. Once again, methodological questions were important for Giulio, including the questions about what analytic techniques could usefully be used. A GDRF program will be offered again in 2024, this time under the banner of the Language Data Commons of Australia. If you think the program could benefit you, there is an online form for expressions of interest (due by February 19 2024). For further information, you can contact Sam Hames or Simon Musgrave. 
","date":"2024-01-10","objectID":"/news/posts/gdrf/:4:0","tags":null,"title":"Graduate Digital Research Fellowship","uri":"/news/posts/gdrf/"},{"categories":null,"content":"Find answers to the most commonly asked questions about the project.","date":"2023-12-12","objectID":"/about/faqs/","tags":null,"title":"Frequently Asked Questions","uri":"/about/faqs/"},{"categories":null,"content":"A variety of LDaCA open-source tools available on GitHub.","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":" A variety of LDaCA open-source tools are available at our GitHub organisation. Highlights include: ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:0:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Metadata Editor Crate-O: A tool that allows you to create and update Research Object Crates (RO-Crates) using a web interface, and with metadata spreadsheets. It provides researchers with a relatively simple way to describe their data using the best practices in formal metadata description. For more information about Crate-O and how to create RO-Crates with the interface, see the Crate-O User Guide. ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:1:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Data Storage OCFL-js, a library that implements the Oxford Common File Layout (OCFL): A specification for laying out digital collections on file or object storage. It is designed with long-term preservation principles in mind and does not rely on specialised software. 
Amongst the benefits of using OCFL with RO-Crate objects are: completeness: a repository can be re-indexed from the files it stores versioning: repositories can make changes to objects and still allow their history to persist ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:2:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Data Portal and Access API Oni: A web application that provides indexing, searching and access to secure data repositories following the Arkisto model. This is used to build the LDaCA Portal: The online interface of the Language Data Commons of Australia where users can discover and access language collections. ","date":"2023-11-22","objectID":"/resources/ldaca-resources/ldaca-software-tools/:3:0","tags":null,"title":"LDaCA Software Tools","uri":"/resources/ldaca-resources/ldaca-software-tools/"},{"categories":null,"content":"Specifies the license(s) that data contributors have applied to the content of their collection, including the content coverage of that license.","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":" Specifies the license(s) that data contributors have applied to the content of their collection, including the content coverage of that license. 
","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:0","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"A Corpus of Oz Early English (COOEE) All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:1","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"AustLit All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:2","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"Australian Radio Talkback All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:3","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"Braided Channels All content: Attribution-NoDerivatives 3.0 Australia (CC BY-ND 3.0 AU) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:4","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"International Corpus of English (Aus) All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:5","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The Australian Corpus of English All content: Attribution 4.0 International (CC BY 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:6","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The La Trobe Corpus of Spoken Australian English All content: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) 
","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:7","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"The speech of Australian adolescents: research data and recordings collected by A.G. Mitchell and Arthur Delbridge in 1959 and 1960 All content: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) ","date":"2023-11-15","objectID":"/resources/ldaca-resources/licenses/:0:8","tags":null,"title":"Licenses","uri":"/resources/ldaca-resources/licenses/"},{"categories":null,"content":"LDaCA team member Harriet Sheppard describes collecting data on a remote island in Papua New Guinea.","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/fieldwork-png/","tags":null,"title":"Fieldwork in Papua New Guinea","uri":"/resources/general-resources/case-studies/fieldwork-png/"},{"categories":null,"content":"by Harriet Sheppard Harriet on an outrigger boat off Vanatɨna Image Source: Harriet Sheppard My fieldwork experience, in many ways, conforms to stereotypes of language fieldwork – a community outsider working with a smaller language community located far from an urban centre. For my postgraduate research, I documented and described grammatical topics of Sudest (known as Vanga Vanatɨna by its speakers), an Oceanic language spoken in Papua New Guinea (PNG). The language is spoken on the islands of Vanatinai (also known as Sudest and Tagula) and Yeina in the Louisiade Archipelago, approximately 350 kilometres southeast of the PNG mainland. In the following, I discuss decisions I made and processes I followed related to the collection of language data and archiving the resulting corpus of texts, including ethical considerations, ownership of the data, data formats, metadata, and data security and reuse. This is by no means meant as a guide (of best practice) for projects where language data are being collected with speakers of un(der)described languages. 
Rather, it is an account of some of the issues and procedures that need to be considered when collecting language data and how I handled them in this instance. Map of Louisiade Archipelago Image Source: Harriet Sheppard Before recording with a speaker for the first time, I would explain the project to the speaker and their rights as contributors to the project. I would explain how I would use the language recordings, how they would be stored and accessed and how speakers of the language or other researchers might access and use the recordings in the future. This included discussing how they, the person(s) making the recording, would retain ownership over the recording (i.e. intellectual and cultural property rights). They would also be the ones to choose access rights over the text and have the option to withdraw any or all recordings they made from the corpus at any point (more on these below). After these discussions, if the speaker was still interested in the project, I would then record an oral consent form with them. The consent form was in English, the language of wider communication in the province. In cases where the speaker spoke little or no English, we would have a translator present for this process. I chose to use an oral consent form as it is accessible to all participants no matter their literacy levels. It also has the advantage of being a more durable format compared with a paper consent form, particularly when working in a hot and humid climate, and it can also be easily backed up. Transcription session with Abel Sam Image Source: Harriet Sheppard Each time I made a recording with a speaker or speakers, I played the recording back to them to check that they were happy with the recording and so that they could decide on what sort of access conditions they wanted the archived recording to have. 
If the speaker was not satisfied with the recording, sometimes they would choose to make another recording and delete the original or simply delete the recording and maybe try again later. If they were happy with the recording, they would decide on the access conditions they wanted for the archived recording. The possibilities ranged from the most restricted, being that only I, as the original researcher, could access the recordings and use them for my research, to the opposite end of the access continuum where anyone interested could download and listen to the recordings, read any transcriptions, and potentially use them for educational or research purposes. In between these two ends of the continuum, they could opt for other access conditions such as only allowing access for registered users of the archive, or only for researchers and speakers, or just for speakers. They could also choose to have a text embargoed for a certain period of time. Notes from an elicitation session Image Source: Harriet Sheppard When I was preparing my ethics applicati","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/fieldwork-png/:0:0","tags":null,"title":"Fieldwork in Papua New Guinea","uri":"/resources/general-resources/case-studies/fieldwork-png/"},{"categories":null,"content":"LDaCA Chief Investigator Catherine Travis and former team member Cale Johnstone explain the complexities of handling data from three periods in a 40 year span.","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"by Catherine Travis and Cale Johnstone ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:0:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Introducing the Sydney Speaks project 
Sydney Speaks is a large-scale sociolinguistic project, funded through the ARC Centre of Excellence for the Dynamics of Language (CoEDL 2014-2022) at the ANU, led by Catherine Travis, working with James Grama and Simon Gonzalez (CoEDL post-doctoral fellows, 2017-2019); Katrina Hayes and Cale Johnstone (Project Managers); Benjamin Purser (lead RA); and many RAs who worked on data collection, transcription, and checking vowel alignment. The Sydney Speaks Corpora Contemporary linguistic recordings (collected 2015-present): Sydney Speaks 2010s Legacy linguistic recordings (Collected 1977-1981): Sydney Social Dialect Survey Legacy oral histories (collected 1987-1988): NSW Bicentennial Oral History Collection Data access concerns Combining legacy and contemporary data collections References Sydney Speaks Corpus and Publications Since 2015, the Sydney Speaks project has recorded the spontaneous speech of Sydneysiders for the purpose of documenting and exploring Australian English as spoken in Australia’s largest and most ethnically and linguistically diverse city. In order to study language change in real time, the project also incorporates legacy data from two sources: a sociolinguistic project carried out in the late 1970s and an oral history collection created during the 1980s. Altogether, the birth years of the participants span over a century, from the 1890s to the 1990s, and five age groups are represented, as captured in Figure 1. Figure 1: Sub-corpora included in the Sydney Speaks project: Sydney Speaks 2010s; Sydney Social Dialect Survey; NSW Bicentennial Oral History Project Image Source: Catherine Travis Each of the three sub-corpora of Australian English presents a different set of challenges and issues for data management practices that maximise the value of the data, while protecting the ethical, moral and legal rights of the participants. Below, we present each sub-corpus individually, beginning with a summary of each. 
We describe the ways in which recordings were made and transcription was managed; outline how we were able to get comparable demographic information from participants when this was not standardised across the corpora; and detail how we dealt with participant consent and privacy. We close by reviewing a set of data access concerns around de-identification, data security and access and re-use. In this overview of the process by which different datasets collected for distinct purposes and under varying conditions have been brought together to constitute a coherent corpus of language data, we hope to highlight some of the considerations key to working with sociolinguistic data, and the enormous potential for the incorporation of other data sources through the use of appropriate standards for data management. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:1:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"The Sydney Speaks corpora ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Contemporary linguistic recordings (collected 2015-present): Sydney Speaks 2010s Sub corpus overview The Sydney Speaks 2010s sub-corpus comprises recordings made from 2015 to the present. Data collection is still ongoing, but as of June 2023, Sydney Speaks 2010s comprises 142 sociolinguistic interviews with men and women of varying ages from different ethnic communities in Sydney representing some of the largest ethnic minorities in Australia, namely Anglo-Celtic, Chinese, Greek, and Italian Australians. The interviews were generally conducted individually (though some were done in pairs), and they were conducted by community members, typically with people they knew. 
The interviewers were provided with a list of suggested topics, covering such things as childhood memories, growing up in Sydney, school and work experiences, and, for the Greek, Italian and Chinese participants, engagement with their ethnic community and language, but they were told to use these as a guide only, and to follow the participants’ lead in choosing topics. Thus, the data is highly interactional, in some cases conversational, and topics vary greatly across recordings. There is a focus on recording narratives of personal experience, as these have been demonstrated to be particularly well suited to capturing the everyday vernacular (Labov 1984:32-42). After the sociolinguistic interview, participants read aloud a word list, which provides a point of comparison with both the sociolinguistic interview data and other studies of Australian English (as many have focused on read word lists). The suggested interview topics and the word list are given in Figures 2a and 2b. Figure 2a: Suggested interview topics Image Source: Catherine Travis Figure 2b: Suggested word list Image Source: Catherine Travis Recordings and transcripts Interviews are recorded using a Zoom H4N digital recording device in WAV format at 44.1 kHz/16 bits. They are orthographically transcribed, aligned at the utterance level and then uploaded into LaBB-CAT (Language, Brain and Behaviour Corpus Analysis Tool; Fromont \u0026 Hay 2012) for forced alignment (which automates alignment at the level of the phoneme). LaBB-CAT also serves as the corpus management tool, where the corpus is stored, data is further annotated, and concordance searches can be conducted. 
LaBB-CAT does not include corpus analysis tools as such, but data can be downloaded in multiple formats to allow for analyses, including CSV and WAV, as well as in formats that are compatible with tools that are used widely by linguists, such as Elan, illustrated in Figure 3 (Lausberg \u0026 Sloetjes 2009) and Praat in Figure 4 (Boersma \u0026 Weenink 2019). Figure 3: Sample SydS transcript in Elan, time aligned at the level of the Intonation Unit [SydS_CYM_061_Mark] Image Source: Catherine Travis Figure 4: Sample SydS transcript in Praat, downloaded from LaBB-CAT, where it has been automatically aligned at the level of the word and phoneme, and broken down by morpheme [SydS_CYM_061_Mark] Image Source: Catherine Travis Participant metadata Detailed speaker metadata is collected via a demographic form that includes information such as place and year of birth, current and past suburbs of residence, occupation, education, community background, languages, and social networks (see Figure 5). This questionnaire is conducted orally at the end of the interview (as a continuation of the interview), and the participants’ responses are filled in by the interviewer. Information is extracted from the written form, validated in the recorded audio, and added to a metadata spreadsheet in Excel. We use customised metadata, rather than drawing on a standard metadata vocabulary; however the demographic information from each sub-corpus is standardised in Excel and comparable across the three sub-corpora. 
Figure 5: Sample Sydney Speaks demographic information form Image Source: Catherine Travis Participant consent and privacy The Sydney Speaks contemporary data collection ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:1","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Legacy linguistic recordings (Collected 1977-1981): Sydney Social Dialect Survey Horvath, Barbara. 1985. Variation in Australian English: The sociolects of Sydney. Cambridge: Cambridge University Press Sub corpus overview The Sydney Social Dialect Survey (SSDS) is a collection of 177 sociolinguistic interviews with adult and teenage Australians from Anglo-Celtic, Greek, and Italian backgrounds, recorded in Sydney between 1977 and 1980, as part of an ARC-funded project led by Barbara Horvath, of the Department of Linguistics at the University of Sydney. For the Sydney Speaks project, recordings from Anglo-Celtic adults (born in the 1930s) and from Anglo-Celtic, Greek and Italian Teenagers (born in the 1960s) were included. The recordings with Greek and Italian adults were set aside, as they arrived in Australia as adults and speak English as a second language, and thus raise a different set of questions for the study of language variation and change. There were 7 participants who did not meet the Sydney Speaks participant criteria (e.g. one of their parents was not of the target ethnic groups), leaving a total of 20 adults and 72 teenagers for inclusion. Like the contemporary linguistic data, the Sydney Social Dialect Survey comprises sociolinguistic interviews, but these are more interview-like: the interviewers were not community members, and they did not typically know the participants; and the topics were more defined (some common topics being games, layout of the school, nicknames, and language). 
Recordings and transcripts Interviews were conducted in the 1970s with a cassette recorder and made available to the Sydney Speaks project directly by the lead researcher. The audio cassettes, type-written transcripts, and demographic information of the participants were stored in boxes in Horvath’s garage in Sydney and passed on to Catherine Travis in 2013. The cassettes (pictured in Figure 7) were digitised using PARADISEC equipment in the College of Asia Pacific Studies at the ANU to create WAV files (96.1 kHz, resampled using Audacity into smaller files of 44.1 kHz). Fortuitously, the cassettes had been preserved very well, and nearly all of them (124/130) were able to be digitised, and with further refinement, it was possible to conduct acoustic analyses (though in some cases, this presented considerable challenges that did not arise with the new recordings). The typewritten transcripts were scanned and digitised as PDFs, but it was not possible to convert them into machine-readable transcripts, partly because they had been marked up by hand for transcript corrections, coding and annotation, as can be seen in the sample in Figure 8. They were therefore re-transcribed in Elan by the Sydney Speaks team, following the protocols applied for the contemporary data (see sample Elan rendition of part of Figure 8 in Figure 9). Figure 7: SSDS cassette collection Image Source: Catherine Travis Figure 8: Sample SSDS transcript with annotations [SSDS_ITF_120_Sara] Image Source: Catherine Travis Figure 9: Sample SSDS transcript time-aligned in Elan [SSDS_ITF_120_Sara] Image Source: Catherine Travis Participant metadata Metadata for each speaker was extracted from original type-written profiles that included information such as suburb, date of birth, age, occupation, education, languages, and time lived in Sydney (see Figure 10). Details were standardised and added to the project metadata database in an Excel spreadsheet. 
Further demographic information was added from several other sources, including participant overview documents put together by the lead researcher, original cassette labels as well as from the audio recordings themselves. We were not able to extract the same demographic information for all participants, and we have more details for some than for others, something which needs to be taken into account in cross-corpus comparisons. Figure 10: Sample demographic information: Sydney Social Dialect Survey Image Source: Catherine ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:2","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Legacy oral histories (collected 1987-1988): NSW Bicentennial Oral History Collection NSW Bicentennial Oral History Project. 1987. NSW Bicentennial oral history collection. Council on the Ageing NSW Branch and NSW Oral History Association of Australia, housed at the National Library of Australia. The process of conducting oral history involves recording interviews to collect information about the past, from the perspective of those who lived through relevant events. In recording everyday voices, oral histories are particularly valuable for research across Humanities disciplines as they capture the ‘little-heard voices of society’. They are of particular value for linguistic analysis, because they aim to ensure that ‘the historical record includes different languages and vernacular speech, accent and dialect’ (Oral History Statement of value; What is oral history? Retrieved May 23, 2022, from Oral History Australia). Like the sociolinguistic interview, oral histories elicit narratives of personal experience, and thus they provide highly comparable data. 
Sub corpus overview The NSW Bicentennial Oral History Collection, produced by the NSW Bicentennial Oral History Project, comprises 200 interviews recorded in 1987 and 1988 with men and women born before 1910. The Sydney Speaks project has incorporated 31 interviews with people who were born in Sydney and whose parents were also born in Sydney. The recordings include discussions about life in Sydney in the early part of the twentieth century, including the war, women’s first experience in the workforce, the outbreak of the Spanish flu, and so on. The collection is managed by the State Library of NSW and the National Library of Australia. In 2017, the Sydney Speaks project gained access to the collection via direct request to the National Library of Australia. Recordings and transcripts The original audio of each interview was recorded on cassette tape and was made available to the Sydney Speaks team in MP3 and WAV format, allowing for acoustic analysis. Type-written transcripts (Figure 11) were also made available, from which the Sydney Speaks team was able to create machine-readable versions via OCR (Optical Character Recognition). The original transcript captured the content very accurately, and this was imported into Elan and edited to produce detailed transcriptions that are aligned with the audio and facilitate linguistic analysis (Figure 12). Figure 11: Sample transcript from Bicentennial Oral History Project [BCNT_AEF_032_Camila] Image Source: Catherine Travis Figure 12: Sample Bicentennial Oral History Project transcript time-aligned in Elan [BCNT_AEF_032_Camila] Image Source: Catherine Travis Participant metadata Demographic data was collected at the time of the interview for each participant; this included name and date/place of birth as a minimum, but a ‘summary’ provided further information, such as parents’ education and occupations, employment (past and current), interests and marital status (see sample in Figure 13). 
The recordings themselves also provided a wealth of demographic information, as is common practice in oral histories, meaning that full demographic profiles could be developed for most participants. Some investment is necessary to capture and systematise this information, but its availability is one of the clear advantages that oral histories have for (socio)linguistic analysis. Figure 13: Sample demographic information: Bicentennial Oral History Project Image Source: Catherine Travis Participant consent and privacy Incorporating speech data from oral history collections is new ground for linguistic research, and there are no existing guidelines to follow regarding the consideration of participant consent. The NSW Bicentennial Oral History Collection manual indicates which ‘restrictions on use’ were sought by the participants. A small number of participants asked for their name not to be used and for permission to be sought before publication of the data. For most of ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:2:3","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data access concerns ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"De-identification All speakers have been given pseudonyms that aim to parallel the original name (thus, Sarah may be Sally, Alfredo may be Alberto etc.). Names and all other identifying content (such as addresses, school names, nicknames, etc.) have been de-identified in all data formats. In the transcripts, this is done by using pseudonyms, for readability purposes (rather than noting [XXX], or [pseudonym], for example). 
To indicate to the analyst what words are pseudonyms, all pseudonyms are marked (preceded by a tilde, e.g. ~Jane, ~Millers ~Point). For the audio, we identify the segment that needs de-identification, and run a low-pass filter using a Praat script so that the name is not recognisable. During an interview, some participants have requested that a section not be included in the study, generally due to what they perceive to be the sensitive nature of a certain topic (for example, one participant talking about banking in Hong Kong). In these cases, this portion of the recording has been deleted, and it has not been transcribed or included in any analysis. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:1","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data security Data security measures are guided by the Sydney Speaks ethics approval. Long-term storage is especially important for legacy data that hasn’t been digitised or archived where data loss is a realistic risk. Contemporary data is collected in Sydney and transferred to the research team based at the ANU in Canberra. A remote data transfer system using an online cloud service ensures the safe transfer of raw data. Original audio and transcripts in the possession of the project are stored in a locked cabinet, managed by the project lead. Data has been digitised to secure the collection long term, and it is stored in an online cloud service as well as backed up on external hard drives. Having multiple copies of the data increases the security of the collection and using pseudonyms in the file naming protocols protects the identity of participants. 
","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:2","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Data access and reuse In accordance with the project ethics protocol, the data is made available to other researchers with the approval of the Sydney Speaks project lead, Catherine Travis. While data from the NSW Bicentennial Oral History Collection is openly accessible online, the agreement between the National Library of Australia and the Sydney Speaks project allows the Chief Investigator to determine future access to the data from the subset of speakers included in the Sydney Speaks collection, with the condition of correct attribution. The Sydney Social Dialect Survey legacy data was transferred to the Sydney Speaks project in full, including the capacity to make decisions about data access and reuse. Regular outreach activities, such as presentations, workshops, public lectures and publications, promote the corpus and increase awareness of the data. The corpora are also described on lists of significant language data collections, such as the Sydney Corpus Lab Blog. They are stored with a DOI managed through the library at the Australian National University: https://dx.doi.org/10.25911/m03c-yz22. Access to the Sydney Speaks collection (including all three sub-corpora) is managed on a case-by-case basis. There is an agreed-upon set of terms and conditions for use of the collection, to ensure that any use is in accordance with the ethics approval, and these conditions are specified in the data access licenses developed with support from the Language Data Commons of Australia (LDaCA). To gain access, users must fill in an online application form, specifying how the corpora will be used and guaranteeing appropriate attribution. 
","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:3:3","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Combining legacy and contemporary data collections In the past, language data was often collected without much consideration of the use of that data beyond the specific purpose for which it was collected. Issues such as ethics, data storage, and long-term data management plans were not of primary concern in the way that they are today, when we are guided by the FAIR principles around Findability, Accessibility, Interoperability, and Re-usability. This does not mean, however, that older language collections (or collections made in other disciplines and for other purposes) cannot be made FAIR. The Sydney Speaks project has demonstrated that legacy corpora can be brought into line with standards appropriate for data in the current digital age. Integrating contemporary and legacy data collections in this way allows for an upscaling of our studies in both the size and the scope of the language data we work with, and in so doing can open a treasure trove of knowledge on Australian language, society, culture, and history. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:4:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"References Boersma, Paul and David Weenink. 2019. Praat: Doing phonetics by computer [Computer Software] (6.1.03 ed.). Retrieved 1 September 2019 from http://www.praat.org/. Fromont, Robert and Jennifer Hay. 2012. LaBB-CAT: An annotation store. Proceedings of the Australasian Language Technology Workshop: 113-117. Labov, William. 1984. Field methods of the project on linguistic change and variation. In John Baugh and Joel Sherzer (eds), Language in use: Readings in sociolinguistics, 28-53. 
Englewood Cliffs, NJ: Prentice Hall. Lausberg, Hedda and Han Sloetjes. 2009. Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, \u0026 Computers (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands. http://tla.mpi.nl/tools/tla-tools/elan/). 41(3): 841-849. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:5:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Sydney Speaks Corpus and Publications Travis, Catherine E., James Grama, Simon Gonzalez, Benjamin Purser and Cale Johnstone. 2023. Sydney Speaks Corpus. ARC Centre of Excellence for the Dynamics of Language, Australian National University. https://dx.doi.org/10.25911/m03c-yz22 Gonzalez, Simon, James Grama and Catherine E. Travis. 2020. Comparing the performance of forced aligners used in sociophonetic research. Linguistics Vanguard 6(1). https://doi.org/10.1515/lingvan-2019-0058 Grama, James, Catherine E. Travis and Simon Gonzalez. 2019. Initiation, progression and conditioning of the short-front vowel shift in Australian English. In Sasha Calhoun, Paola Escudero, Marija Tabain, and Paul Warren (eds), Proceedings of the 19th International Congress of Phonetic Sciences (ICPhS), Melbourne, Australia, 1769-1773. Canberra, Australia: Australasian Speech Science and Technology Association Inc. https://assta.org/proceedings/ICPhS2019/papers/ICPhS_1818.pdf Grama, James, Catherine E. Travis and Simon Gonzalez. 2020. Ethnolectal and community change ov(er) time: Word-final (er) in Australian English. Australian Journal of Linguistics 40(3): 346-368. https://doi.org/10.1080/07268602.2020.1823818 Grama, James, Catherine E. Travis and Simon Gonzalez. 2021. Ethnic variation in real time: Change in Australian English diphthongs. 
In Hans Van de Velde, Nanna Haug Hilton, and Remco Knooihuizen (eds), Studies in Language Variation, 292-314. Amsterdam: John Benjamins. https://www.jbe-platform.com/content/books/9789027259820-silv.25.13gra Lee, Esther. 2020. Quotatives over time: A study in ethnic variation. Honours thesis, School of Literature, Languages and Linguistics, Australian National University. http://hdl.handle.net/1885/298816 Purser, Benjamin, James Grama and Catherine E. Travis. 2020. Australian English over time: Using sociolinguistic analysis to inform dialect coaching. Voice and Speech Review 14(3): 269-291. https://doi.org/10.1080/23268263.2020.1750791 Qiao, Gan and Catherine E. Travis. 2022. Ethnicity and social class in pre-vocalic the in Australian English. In Rosey Billington (Ed.), Proceedings of the Eighteenth Australasian International Conference on Speech Science and Technology (pp. 56-60): Australasian Speech Science and Technology Association. https://sst2022.files.wordpress.com/2022/12/qiao-travis-2022-ethnicity-and-social-class-in-pre-vocalic-the-in-australian-english.pdf Sheard, Elena. 2022. Longevity of an ethnolectal marker in Australian English: Word-final (er) and the Greek-Australian community. In Rosey Billington (Ed.), Proceedings of the Eighteenth Australasian International Conference on Speech Science and Technology (pp. 51-55): Australasian Speech Science and Technology Association. https://sst2022.files.wordpress.com/2022/12/sheard-2022-longevity-of-an-ethnolectal-marker-in-australian-english-word-final-er-and-the-greek-australian-community.pdf Sheard, Elena. 2023. Explaining language change over the lifespan: A panel and trend analysis of Australian English. PhD thesis, School of Literature, Languages and Linguistics, Australian National University. http://hdl.handle.net/1885/292110 Travis, Catherine E., James Grama and Benjamin Purser. 2023. Stability and change in (ing): Ethnic and grammatical variation over time in Australian English. 
English World-Wide 44(3): 429-463. https://doi.org/10.1075/eww.22043.tra Travis, Catherine E. and Rena Torres Cacoullos. 2023. Form and function covariation: Obligation modals in Australian English. Language Variation and Change 35(3): 351-377. https://doi.org/10.1017/S0954394523000200. ","date":"2023-11-13","objectID":"/resources/general-resources/case-studies/sydney-speaks/:6:0","tags":null,"title":"Sydney Speaks","uri":"/resources/general-resources/case-studies/sydney-speaks/"},{"categories":null,"content":"Information about the approach to metadata being taken by LDaCA.","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":" Metadata is often defined as ‘data about data’. High-quality metadata is important in making data FAIR: Findable: Metadata is the starting point for searching data collections. For example, if we want to find data in a particular language, this will only be possible for data that has a language recorded in its metadata. (Tracking languages is in itself problematic, see below.) Accessible: Access conditions that apply to data should be part of the associated metadata. Interoperable: Information about the format of data and whether it requires specific software to be usable should be part of the associated metadata. Reusable: All of the aspects of metadata mentioned above contribute to making data reusable. The more we know about some data, the easier it is to know whether it will be useful to us or not. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:0:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"RO-Crate Profiles RO-Crates in general have basic metadata requirements, but it is possible to specify a profile for crates for specific purposes. 
LDaCA is developing such a profile for our data; we are basing this largely on previous work in the area. An important aspect of the RO-Crate approach is that it uses the principles of Linked Open Data. This means that terms used in our metadata will (whenever possible) link to an openly available definition. In developing the profile, we are drawing on two existing attempts to provide vocabularies for describing language data. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:1:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Open Language Archives Community (OLAC) OLAC is an international partnership of institutions and individuals; one of their activities is developing consensus on best current practice for the digital archiving of language resources and this includes making recommendations for metadata. The OLAC metadata scheme is based on Dublin Core (DC), a widely used general metadata schema. OLAC have suggested refinements and extensions of the DC base which make it more useful for describing language resources. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:2:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Component Metadata Infrastructure (CMDI) CMDI was developed within the CLARIN project. It draws on the earlier ISLE Metadata Initiative (IMDI), but where IMDI attempted to specify a comprehensive scheme for (multimodal) language data, CMDI adopts a more flexible approach where components are assembled into reusable profiles. This is very similar to the RO-Crate approach described above but with an important difference: the components of a CMDI profile are all drawn from a central registry, whereas components of an RO-Crate profile come from any linkable location. 
The metadata being developed as part of the LDaCA RO-Crate profile can be viewed as a Schema (a list of terms used to describe language resources). We are also developing a more fully documented version as a Gitbook, Metadata for Language Data. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:3:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Identifying Codes for Languages One very important piece of metadata for language data is a description of the language or languages that the data represent. This is not a simple problem because the relationship between languages and names for them is not one-to-one. Some languages have more than one name: for example, Farsi and Persian can both be used to refer to the same language. Some names refer to more than one language: for example, there are languages called Buru used in Nigeria and in Indonesia. To avoid the confusion that can arise from such situations, various systems have been developed to assign unique identifiers to languages. None of these systems gives a comprehensive list of languages and all such systems struggle with another problem, the distinction between separate languages and dialects of one language, as can be seen in the case study below. LDaCA includes identifiers from each of the three systems below where they are available and relevant. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:0","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"ISO-639 This system is recognised as a standard by the International Organization for Standardization (ISO). An earlier version of this system used two-letter codes to identify languages; more recent versions use three-letter codes (referred to as ISO 639-3). These codes are used by Ethnologue, which is a catalogue of the languages of the world, and in many other contexts. 
The ISO 639-3 code for French is fra, and the code for Warlpiri is wbp. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:1","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Glottolog Glottolog is an alternative catalogue of the world’s languages, language families and dialects - Glottolog uses the term languoid to cover all of these. Each languoid is assigned a unique identifier consisting of four alphanumeric characters and four digits. For example, (standard) French has the code stan1290, and Warlpiri is warl1254. ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:2","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Austlang AustLang provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. Alphanumeric codes are used as persistent identifiers, while associated text strings are changeable and can reflect community preferences (including alternative names and spellings). In AustLang, Warlpiri has two codes: C15 for the language in general, and C15.1 for the variety named as Wakirti Warlpiri. (French is not covered by AustLang.) ","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:3","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"Case study - Kala Lagaw Ya Kala Lagaw Ya is a language spoken in the Torres Strait Islands. The language has several dialects or varieties and the table below shows how the different code schemes deal with this. 
","date":"2023-11-13","objectID":"/resources/ldaca-resources/metadata/:4:4","tags":null,"title":"Metadata","uri":"/resources/ldaca-resources/metadata/"},{"categories":null,"content":"A collection of published material related to language research.","date":"2023-11-06","objectID":"/resources/general-resources/publications/","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"The Open Handbook of Linguistic Data Management Edited by Andrea L. Berez-Kroeker, Bradley McDonnell, Eve Koller, Lauren B. Collister. MIT Press 2022. “A guide to principles and methods for the management, archiving, sharing, and citing of linguistic research data, especially digital data. “Doing language science” depends on collecting, transcribing, annotating, analyzing, storing, and sharing linguistic research data. This volume offers a guide to linguistic data management, engaging with current trends toward the transformation of linguistics into a more data-driven and reproducible scientific endeavor. It offers both principles and methods, presenting the conceptual foundations of linguistic data management and a series of case studies, each of which demonstrates a concrete application of abstract principles in a current practice.” This material is Open Access. 
","date":"2023-11-06","objectID":"/resources/general-resources/publications/:1:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Language Documentation and Conservation “LD\u0026C publishes papers on all topics related to language documentation and conservation, including, but not limited to, the goals of language documentation, data management, fieldwork methods, ethical issues, orthography design, reference grammar design, lexicography, methods of assessing ethnolinguistic vitality, archiving matters, language planning, areal survey reports, short field reports on endangered or underdocumented languages, reports on language maintenance, preservation, and revitalization efforts, plus software, hardware, and book reviews.” LD\u0026C is an Open Access journal. ","date":"2023-11-06","objectID":"/resources/general-resources/publications/:2:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Living Languages/Lenguas vivas/Línguas vivas journal “Living Languages is an international, multilingual journal dedicated to topics in language revitalization and sustainability. The goal of the journal is to promote scholarly work and experience-sharing in the field. The primary focus is on bringing together language revitalization practitioners from a diversity of backgrounds, whether academic or not, within a peer-reviewed publication venue that is not limited to academic contributions and is inclusive of a diversity of perspectives and forms of expression.” Living Languages is an Open Access journal. ","date":"2023-11-06","objectID":"/resources/general-resources/publications/:3:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Pacific Linguistics A digital archive of many Pacific Linguistics publications up to 2012. 
","date":"2023-11-06","objectID":"/resources/general-resources/publications/:4:0","tags":null,"title":"Publications","uri":"/resources/general-resources/publications/"},{"categories":null,"content":"Australian organisations that are active in the language research space, as well as international organisations working in the global language space.","date":"2023-11-06","objectID":"/resources/general-resources/organisations/","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Organisations ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:0","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Linguistic Society (ALS) The national organisation for linguists and linguistics in Australia. Its primary goal is to further interest in and support for linguistics research and teaching in Australia. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:1","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Applied Linguistics Association of Australia (ALAA) The national professional organisation for applied linguistics in Australia. It welcomes academics, teachers, researchers, students and members of the wider community to join and become part of an active community interested in questions, issues and problems that can be understood and addressed through a focus on language in our world. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:2","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australasian Language Technology Association (ALTA) Has the purpose of promoting language technology research and development in Australia and New Zealand. 
","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:3","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australasian Speech Science and Technology Association (ASSTA) A scientific association that aims to advance the understanding of speech science and its application to speech technology in a way that is appropriate for Australia and New Zealand. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:4","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Australian Institute of Aboriginal and Torres Strait Islander Studies (AIATSIS) An Indigenous-led, national institute that celebrates, educates and inspires people from all walks of life to connect with the knowledge, heritage and cultures of Australia’s First Peoples. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:1:5","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"International Organisations ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:0","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Common Language Resources and Technology Infrastructure (CLARIN) A research infrastructure that was initiated from the vision that all digital language resources and tools from all over Europe and beyond are accessible through an online environment for the support of researchers in the humanities and social sciences. 
","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:1","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Endangered Languages Documentation Program (ELDP) Supports the documentation and preservation of endangered languages through granting, training and outreach activities. The collections compiled through their funding are freely accessible at the Endangered Languages Archive. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:2","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Linguistic Data Consortium (LDC) An open consortium of universities, libraries, corporations and government research laboratories. LDC was formed in 1992 to address the critical data shortage then facing language technology research and development. Initially, LDC’s primary role was as a repository and distribution point for language resources, but with the help of its members, LDC has grown into an organisation that creates and distributes a wide array of language resources. ","date":"2023-11-06","objectID":"/resources/general-resources/organisations/:2:3","tags":null,"title":"Organisations","uri":"/resources/general-resources/organisations/"},{"categories":null,"content":"Existing archives of language in numerous forms, including recorded speech, text, images, books and music.","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Holds 14,500 hours of audio recordings and 2,000 hours of video recordings that might otherwise have been lost. These recordings are of performance, narrative, singing, and other oral tradition. 
This amounts to 150 terabytes and represents 1,315 languages, mainly from the Pacific region. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:1:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Living Archive of Aboriginal Languages (LAAL) A digital archive of endangered literature in Australian Indigenous languages of the Northern Territory. It contains nearly 4,000 books in 50 languages from 40 communities available to read online or download freely. This is a living archive, with connections to the people and communities where the books were created. This will allow for collaborative research work with the Indigenous authorities and communities. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:2:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Endangered Languages Archive (ELAR) A digital repository for preserving multimedia collections of endangered languages from all over the world, making them available for future generations. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:3:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"The Language Archive (TLA) An integral part of the Max Planck Institute for Psycholinguistics in Nijmegen. It contains various types of materials, including: audio and video language corpus data from languages around the world; photographs, notes, experimental data, and other relevant information required to document and describe languages and how people use them; records of speech in everyday interactions in families and communities; naturalistic data from adult conversations from endangered and under-studied languages, and linguistic phenomena. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:4:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Kaipuleohone Language Archive The digital language archive of the University of Hawai’i. Founded in 2008, the archive houses texts, images, audio, and video collected from around the world by linguists, anthropologists, ethnomusicologists, and more. Our collection includes a wealth of photographs, notes, dictionaries, transcriptions, and other materials related to small and endangered languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:5:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Digital Endangered Languages and Musics Archives Network (DELAMAN) An international network of archives of data on linguistic and cultural diversity, in particular on small languages and cultures under pressure. ","date":"2023-11-06","objectID":"/resources/general-resources/language-archives/:6:0","tags":null,"title":"Language Archives","uri":"/resources/general-resources/language-archives/"},{"categories":null,"content":"Information about ethical practices, especially in the research of language and First Nations cultures.","date":"2023-11-06","objectID":"/resources/general-resources/ethics/","tags":null,"title":"Ethics","uri":"/resources/general-resources/ethics/"},{"categories":null,"content":" Australian Institute of Aboriginal and Torres Strait Islander Studies Code of Ethics Australian Linguistic Society policies Includes Policy on Linguistic rights of Aboriginal and Islander communities and Statement of Ethics. 
Australian Code for the Responsible Conduct of Research, 2018 Linguistic Society of America: Ethics Statement 2019 Further resources Tromsø recommendations for citation of research data in linguistics Language and linguistics datasets are often not cited, or cited imprecisely, because of confusion surrounding the proper methods for citing them. For the use of researchers and scholars in the field working with datasets, the Tromsø recommendations propose components of data citation for referencing language data, both in the bibliography and in the text of linguistics publications. ","date":"2023-11-06","objectID":"/resources/general-resources/ethics/:0:0","tags":null,"title":"Ethics","uri":"/resources/general-resources/ethics/"},{"categories":null,"content":"Learn more about Indigenous language work in Australia through key projects in the space, as well as projects involved in the documentation of the world's languages.","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Indigenous Languages of Australia ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:0","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"50 Words Project Showcases words from 64 languages across Australia, with further languages being added regularly as more communities around Australia become involved. For each language, you can hear the words spoken via a map that shows the general location of the language. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:1","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"AustLang Provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:2","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Australian Indigenous Language Collections A guide to materials held in the National Library of Australia, with links to similar resources at State Libraries. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:3","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Contemporary and Historical Reconstruction in the Indigenous Languages of Australia (CHIRILA) A lexical database (a database with words from different languages). Currently, there are about 780,000 words, from all over Australia, of which about 20% is publicly available. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:4","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Digital Daisy Bates A collection of over 23,000 pages of wordlists of Australian languages, originally recorded by Daisy Bates in the early 1900s, made up of the original questionnaires and around 4,000 pages of typescripts. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:5","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"First Languages Australia First Languages Australia is a partner institution of LDaCA and encourages communication between communities, the government and key partners whose work can impact Aboriginal and Torres Strait Islander languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:6","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Gambay A map of Australia’s first languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:7","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Indigenous Voices of Queensland An online resource created to improve public access to audio recordings, held in the Fryer Library (UQ), of Aboriginal and Torres Strait Islander people speaking Indigenous languages of Queensland. The project aims to make the recordings as openly available as possible so they can be used to assist communities with language recovery and education, and deepen knowledge of First Nations languages. A map summary and table of languages is also available for browsing, based on recordings from three academics: Elwyn Flint, Bruce Rigsby and Bruce Sommer. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:8","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Interactive Queenslander Languages Map An interactive map providing resources for Queensland Aboriginal and Torres Strait Islander languages, hosted by the State Library of Queensland and searchable by suburb or language. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:9","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Language Centres The Indigenous Languages and Arts (ILA) program currently supports a number of organisations, including a network of 20 language centres. A list of these centres is available from the above link in PDF or DOCX format. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:10","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Living Archive of Aboriginal Languages (LAAL) See Language Archives. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:11","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Living Languages Provides grassroots training to people, communities and Language Centres in Australia doing language work on the ground in remote, regional and urban areas. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:12","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Nyingarn A platform for primary sources in Australian Indigenous languages. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:1:13","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Languages of the World ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:0","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Ethnologue Provides information about the languages of the world, but operates on a subscription model. 
The information that is available without a subscription is very limited: the first three lines of individual language entries which include the ISO 639-3 code, the classification of the language into a language family, and a link to the language’s OLAC page. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:1","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Glottolog Comprehensive reference information for the world’s languages, especially the lesser-known languages. LDaCA uses Glottolog language codes in our metadata. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:2","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Open Language Archives Community (OLAC) An international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by: (i) developing consensus on best current practice for the digital archiving of language resources, and (ii) developing a network of interoperating repositories and services for housing and accessing such resources. OLAC harvests metadata and their website has a search facility to find resources for languages. OLAC metadata recommendations are the basis for some of LDaCA’s metadata. ","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:3","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"World Atlas of Linguistic Structures Online (WALS) A large database of structural (phonological, grammatical, lexical) properties of languages gathered from descriptive materials (such as reference grammars) by a team of 55 authors. 
","date":"2023-11-06","objectID":"/resources/general-resources/language-projects/:2:4","tags":null,"title":"Language Projects","uri":"/resources/general-resources/language-projects/"},{"categories":null,"content":"Further information relating to Indigenous research.","date":"2023-11-06","objectID":"/resources/general-resources/indigenous-knowledge/","tags":null,"title":"Indigenous Knowledge","uri":"/resources/general-resources/indigenous-knowledge/"},{"categories":null,"content":" Protocols for using First Nations Cultural and Intellectual Property in the Arts (Australia Council). Citing Indigenous elders and knowledge keepers: Templates For Citing Indigenous Elders and Knowledge Keepers (Lorisia MacLeod) University of Alberta Library’s guide University of Alberta Library’s blog TK Labels: Traditional Knowledge labels are an initiative for Indigenous communities and local organisations. Developed through sustained partnership and testing within Indigenous communities across multiple countries, the Labels allow communities to express local and specific conditions for sharing and engaging in future research and relationships in ways that are consistent with already existing community rules, governance and protocols for using, sharing and circulating knowledge and data. Decolonising linguistics: Spinning a better yarn Rawlings, V, Flexner, J L, Riley, L (eds.) (2021) Community-Led Research: Walking New Pathways Together. Sydney University Press The Linguistics and Applied Linguistics Indigenous Curriculum Resource Project (The University of Melbourne) This report, prepared by Sasha Wilmoth, includes a bibliography of linguistic work by Australian Indigenous authors. 
","date":"2023-11-06","objectID":"/resources/general-resources/indigenous-knowledge/:0:0","tags":null,"title":"Indigenous Knowledge","uri":"/resources/general-resources/indigenous-knowledge/"},{"categories":null,"content":"A collection of digital tools useful for language work.","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Audacity A free, easy-to-use, multi-track audio editor and recorder for Windows, macOS, GNU/Linux and other operating systems. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:1:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"ELAN A tool for making time-aligned annotations on audio and video recordings. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:2:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Praat Free software for doing phonetics by computer. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:3:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"FieldWorks Software tools for managing linguistic and cultural data. FieldWorks supports tasks ranging from the initial entry of collected data through to the preparation of data for publication, including dictionary development, interlinearization of texts, morphological analysis, and other publications. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:4:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Toolbox A data management and analysis tool for field linguists. 
It is especially useful for maintaining lexical data, and for parsing and interlinearizing text, but it can be used to manage virtually any kind of data. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:5:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"AntConc A freeware corpus analysis toolkit for concordancing and text analysis. ","date":"2023-11-01","objectID":"/resources/general-resources/useful-tools/:6:0","tags":null,"title":"Useful Tools","uri":"/resources/general-resources/useful-tools/"},{"categories":null,"content":"Browse the project's upcoming events.","date":"2023-10-18","objectID":"/news/events/","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":" Recurring Events Previous Events Webinars ","date":"2023-10-18","objectID":"/news/events/:0:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Indigenous Data Governance: A discussion Part of The University of Queensland Research and Innovation Week 2024 The University of Queensland is committed to growing the opportunities for Indigenous people to lead and conduct their own research. One important strand in developing this innovative research culture is ensuring that appropriate governance is in place for Indigenous data. This event will present a panel of Indigenous researchers, data custodians and data stewards discussing current developments in this field including: the importance of Indigenous Cultural and Intellectual Property concerns governance of Indigenous Data Framework for the HASS \u0026 Indigenous Research Data Commons tools and methods for data governance. Panellists: Rose Barrowcliffe, Robert McLellan, Lesley Acres. The discussion will be facilitated by Grant Sarra. 
When: 4–6pm, 30 September 2024 Where: Anthropology Museum, Michie Building, The University of Queensland (St Lucia Campus) Details ","date":"2023-10-18","objectID":"/news/events/:0:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Data Migration Skills Workshop The Language Data Commons of Australia infrastructure is based on widely accepted standards such as Research Object Crates and the Oxford Common File Layout. The project has been running for more than three years and the processes and tools for aligning data with these standards are now well developed. This workshop aims to show how those tools can be applied to data in a variety of formats to migrate material efficiently to the LDaCA standards. The hands-on workshop is intended for data librarians and other professionals who are interested in these issues, but participation is open to anyone who is prepared to work on language data. Participants will have some experience of working with code and/or metadata and will commit to bringing datasets which they work with (or are responsible for) to use in the practical exercises which will make up a large part of the workshop. Example datasets will also be available, and we are open to the possibility of participants working in teams based on complementary skills. Leader: Peter Sefton When: 3–5 September 2024 Where: Engma Room (Room 3.165), Coombs Building, Australian National University Details: (registration form to come shortly) ","date":"2023-10-18","objectID":"/news/events/:0:2","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Recurring Events ","date":"2023-10-18","objectID":"/news/events/:1:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"RO-Crate Clinic Drop-in The RO-Crate community run a weekly drop-in call in Australia. For further information contact Peter Sefton.
When: Weekly, Thursday 14:00 AEST Where: Online via Zoom ","date":"2023-10-18","objectID":"/news/events/:1:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Previous Events ","date":"2023-10-18","objectID":"/news/events/:2:0","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Previous Workshops If your university or organisation would like to host a workshop, please contact us. ","date":"2023-10-18","objectID":"/news/events/:2:1","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Webinars Our webinar series was a joint initiative with the Language Technology and Data Analysis Laboratory (LADAL) at the School of Languages and Cultures, The University of Queensland. ","date":"2023-10-18","objectID":"/news/events/:2:2","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Australian Digital Observatory Online Office Hours LDaCA has previously run regular online office hours, jointly hosted with the Australian Digital Observatory (ADO). This activity will not continue in 2024, but LDaCA and ADO are developing an alternative model to help Australian researchers working with linguistics, text analytics, digital and computational methods, social media and web archives. In the meantime, you are welcome to contact us by email at ldaca@uq.edu.au with your technical questions, research problems and rough ideas to get advice and feedback from the combined expertise of our ARDC research infrastructure projects. No question is too small, and even if we don’t know the answer we are likely to be able to point you to someone who does. 
","date":"2023-10-18","objectID":"/news/events/:2:3","tags":null,"title":"Events","uri":"/news/events/"},{"categories":null,"content":"Defines policies, roles, responsibilities and procedures for ongoing use and storage of data, as well as for access to data.","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":" Purpose 1. Access Conditions 1.1 Legal, moral and ethical constraints 1.2 Licensing 2. Persistent Identifiers 3. Metadata 4. Appendix: Determining Copyright ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:0:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"Purpose Data governance defines policies, roles, responsibilities and procedures for ongoing use and storage of data, as well as for access to data. Effective data governance maximises sustainability, while ensuring data integrity and protecting research participants. Long-term sustainability requires a data management plan. This document provides guidance on some components of data governance that are key to a data management plan, including access conditions, licensing, persistent identifiers and metadata. The principles below are those employed by LDaCA, but they are widely applicable and represent best practice for data governance, in accordance with the FAIR and CARE principles. Questions for reflection: In each section, you will find a thought bubble marking some questions for reflection that will help you start to explore these data governance topics. This content is designed as guidance for Data Stewards considering how to manage their data into the future. 
","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:1:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1. Access Conditions Access conditions refer to who can access data and what use is permitted. Defining specific conditions for access supports data reusability and the advancement of the scientific endeavour; it also protects the data from misuse. To determine access conditions, the Data Steward must: understand the legal, moral and ethical constraints to sharing data, and prepare a license outlining access conditions. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1.1 Legal, moral and ethical constraints Legal constraints In Australia, as in many other countries, research data is recognised as Intellectual Property that can be protected under legal mechanisms such as copyright. When considering how data can be shared and accessed by others, it is important to consider these legal constraints. Copyright protects expressions of ideas in works such as books, music, paintings, films, and performative acts such as speech, sign and gesture, etc., and therefore also data collections. The creator of the work is known as the copyright owner. Copyright provides two types of rights: Economic rights: The owner has the exclusive right to reproduce, publish, perform, communicate, and adapt or modify their work, for both commercial and non-commercial purposes. This right can be transferred or shared with others via assignment or licensing. Moral rights: The work must be correctly attributed and not treated in a derogatory manner. This protects the integrity of the work. 
Moral rights cannot be transferred or shared. Questions for reflection: How can I identify the copyright owner of a language collection? Unlike trademarks and patents, copyright doesn’t require registration and there are no official records that can be searched to identify a copyright owner in Australia. Additionally, the creator of material is not necessarily the copyright owner and copyright may also be jointly owned. Copyright ownership is determined according to a set of complex rules set out in the Copyright Act and its amendments. It is important to review the law in detail to ensure the correct owner has been identified. Legal advice should be sought where the copyright owner cannot be clearly identified. If the copyright owner has died, the copyright is usually passed on to that person’s spouse or children. (See more information at the end of this section.) The copyright owner of a language collection can be identified by considering the following questions: Question Further Information Does the collection comprise materials all collected under the same conditions (e.g. as part of the same research project)? (Yes / No) If the collection includes material from a third party, the copyright owner and copyright status should be identified for each subset of the collection. Has the copyright owner been determined by a contract, formal agreement, or other relevant document? (Yes / No) In some cases, existing contracts or agreements will assign copyright ownership in advance. This will take precedence over the rules set out in the Copyright Act. What type of material is included in the collection? (Textual work / Musical work / Dramatic work / Computer program / Artistic work / Film or video / Sound recordings) Generally, the author of a textual work, musical work, dramatic work, computer program or artistic work (i.e. the person who created the work) is the first owner of copyright.
However, the general rules for films, videos and sound recordings are different. For sound recordings, copyright is owned by the ‘maker’ of the sound recording, typically the person or company who owns the recording equipment. For film and video, copyright is owned by the person or company which made the arrangements for the creation of the work. In the academic context, this is typically the university where the research was conducted. Was the material created by an employee in the course of their employment? (Yes / No) When the work was created by an employee as part of their usual work duties, the employer is the copyright owner (unless there is a specific employment agreement that specifies otherwise). Is the work a performance? (Yes / No) Performers’ rights apply to live performances including dramatic works, musical works, dances, circus acts, expressions of folklore, readings, and recitations of existing or improvised literary works recorded or filmed with or without an audience. Permission must be sought to record the live performance, and to broadcast and distribute recordings. As of 1 January 200","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:1","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"1.2 Licensing Documenting access conditions is key to ensuring appropriate use of the data over time. Transparency and clarity surrounding access conditions also support the sustainability of the data and reduce the need for the Data Steward to be available to communicate or enforce the access conditions. While access conditions can be documented internally in a data management plan, a common and useful mechanism for documenting and managing access is licensing.
Licensing allows the copyright owner to share the right to access and use the data without forfeiting or transferring the ownership of the copyright of the work. The license sets out the conditions for who can access the data, how it can be used, and what other conditions are required. While licensing is a legal mechanism, it can also be used to uphold other conditions as determined by the Data Steward. Questions for reflection: Is there an existing data license? Question Further Information Is there an existing license outlining the access conditions?YesNo If the collection has already been made available, a license may have already been prepared. Avoid duplicating previous work by checking this first. Although it is best practice to have a single license attached to a collection, multiple licenses can exist as long as they are non-exclusive. Questions for reflection: What information needs to be included in the license? Key parts of a data license Image Source: LDaCA Question Further Information What are the parties relevant to this license? Who is the author of the material? Who is the copyright owner? Is access managed by a Data Steward? Which individuals or groups are permitted access to the material? Who is the Licensor (copyright owner) and Licensee? What materials are covered by the license? Does this license cover the entire collection or a subset of the material? Is this an exclusive or non-exclusive license? Does this license grant an individual or group exclusive rights to use and share the material? If yes, this is an exclusive license. Describe the rights that are being transferred to the licensee. What rights are being transferred? Can the material be modified? For what purpose? Can the material be shared? Under what conditions? Does this collection include sensitive information? Does the material include personal information? What is the protocol for ensuring participant privacy? What are the responsibilities of the licensee? 
What are the requirements surrounding citation? How should the material be correctly attributed? What is the suggested citation for the collection? Under which region is the license legally binding? Is the license limited to a specific geographical region? Other considerations How long will the license be valid? Does it expire after a specific period of time? Is the licensee required to pay a fee? How is the fee processed? Under what conditions can the license be terminated? How will disputes be resolved? While in some cases it may be necessary to write a custom data license, for other collections it may be possible to apply an already-existing license, such as a Creative Commons license. Creative Commons provides a useful option for promoting data reusability while protecting key access conditions. The Creative Commons licenses combine four main elements in different ways: BY - Attribution: This is a requirement for all Creative Commons licenses. The original creator must always be attributed. SA - Share-Alike: Data can only be shared under identical license terms. NC - Non-Commercial: Only non-commercial use of the data is permitted. ND - No Derivatives: Material can be copied and distributed in the original form only and cannot be adapted. A copyright owner may also choose to waive their rights and make data available in the public domain before the duration of the copyright has expired. Creative Commons provides a relevant tool for documenting this decision: CC0: A copyright owner opts out of copyright and waives their ex","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:2:2","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"2. Persistent Identifiers Persistent identifiers (PIDs) are digital identifiers that are permanently assigned to physical and digital objects. 
In contrast to other identifiers that are used online, such as URLs, PIDs are persistent, meaning they point to reliable information in the long term. They have become crucial for research data as they make a collection citable; ensure that it is findable even if moved to a different location; and establish its relationship to other objects and entities in the academic research environment (e.g. researchers, funders, organisations, academic publications, software and other datasets). While various PID systems have been in use for research data over the last 25 years, including ISBN and ORCID, the DOI System has been the most widely used globally to date and is becoming the default identifier for research datasets. A DOI is a unique identifier made up of a prefix and a suffix separated by a forward slash. It can be resolved by appending it to https://doi.org/ to form a link, for example https://doi.org/10.1000/182. An example of a DOI for a data collection (obtained via the university library) can be viewed here: Sydney Speaks DOI landing page. Questions for reflection: Who can get a DOI? How? Question Further Information Does the collection have an existing PID? (Yes / No) Investigate whether the collection already has a PID, as this is sometimes automatically generated when the collection is listed or archived with an archive or library. Who can generate a DOI for the collection? The most common DOI minting services are universities, research organisations, research libraries and research repositories. You can make enquiries with your university library or the research repository of the organisation that supported the data collection. Find out more about PIDs: Klump, J. and Huber, R., 2017. 20 Years of Persistent Identifiers – Which Systems are Here to Stay? Data Science Journal, 16, p.9.
DOI: http://doi.org/10.5334/dsj-2017-009 ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:3:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"3. Metadata Metadata is data about data: information that describes the data collection as a whole; provides the context and conditions under which the data was collected and can be stored, shared and used; records characteristics such as the format, duration or size of the data making up the collection; and includes socio-demographic details of participants. Standardised metadata allows data to be found and understood more easily, and to be compared and grouped with similar objects. Metadata is often managed in two different ways: A standard metadata vocabulary (such as Dublin Core (DC), Darwin Core and Metadata Object Description Schema (MODS)). These vocabularies set out a limited list of metadata terms that can be used across disciplines with the aim of promoting a shared metadata framework. A customised metadata strategy, unique to a particular project to meet specific needs. Where a customised metadata strategy is used, metadata terms should be clearly defined so as to facilitate comprehension of the metadata and to avoid misunderstandings or multiple interpretations. Customised metadata terms can be mapped onto existing vocabularies in order to organise metadata in a manner aligned to a standard. Questions for reflection: What does it mean to apply standards to metadata? Question Further Information Does the collection use metadata terms from an existing metadata vocabulary? Yes / No If the collection uses customised metadata terms (i.e. not an existing metadata vocabulary), consider mapping the relationships between the custom system and an existing metadata vocabulary. 
This will facilitate findability as the metadata aligns with standards used in the research community. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:4:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":null,"content":"4. Appendix: Determining Copyright Information is provided as guidance only; legal advice should be sought. Select the correct table for works and subject matter other than works. Works: Step 1: Is the author known? Author of the work is known* Step 2: When was the material first made public? Work has not been made public Work has been made public Step 3: Was the material first made public before or after the author died? N/A Work was made public before the author died Work was not made public before the author died Step 4: Was the material first made public before or on/after 1 January 2019? N/A N/A Work was made public with the author’s permission before 1 January 2019 Work was not made public with the author’s permission before 1 January 2019 Copyright duration Date author died + 70 years. Date author died + 70 years. Date the material was first made public + 70 years. Date author died + 70 years. *Different copyright duration applies to works for which the author is unknown. Subject matter other than works: sound recordings, films and videos Step 1: When was the material created? Created before 1 January 2019 Created on or after 1 January 2019 Step 2: When was the material first made public? Not made public First made public before 1 Jan 2019 First made public on or after 1 Jan 2019 and within 50 years of the date of creation First made public on or after 1 Jan 2019 more than 50 years after the date of creation Made public within 50 years of the date of creation Made public more than 50 years after the date of creation Copyright duration Date the material was created + 70 years. 
Date the material was first made public + 70 years. Date the material was first made public + 70 years. Date the material was created + 70 years. Date the material was first made public + 50 years. Date the material was created + 50 years. ","date":"2023-10-18","objectID":"/resources/ldaca-resources/guidance-for-data-governance-decisions/:5:0","tags":null,"title":"Guidance for Data Governance Decisions","uri":"/resources/ldaca-resources/guidance-for-data-governance-decisions/"},{"categories":["LDaCA"],"content":"The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions: What is your role in these projects? (What do you/your team do as part of your participation?) What excites you most about the projects? (What motivates you to participate?) What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? This blog post features Catherine Travis (CT), Nicholas Evans (NE), and Monika Bednarek (MB). Nick was a Chief Investigator in one of the first ARDC-funded projects for LDaCA under the Australian Data Partnerships program. Although he is not a CI on our current project, Nick remains a friend and supporter of LDaCA. The interview was undertaken via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance in undertaking the interviews and creating these blog posts. 
","date":"2023-10-18","objectID":"/news/posts/interview-2/:0:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What is your role in these projects? CT: Our main goal at the Australian National University is to grow the set of collections that will be incorporated into LDaCA with a particular focus on Australian English and migrant languages. This involves identifying relevant collections, which is a bit of a ‘language dig’, as these are diverse, dispersed, and often quite hidden away. Many have restrictions around data sharing, so we work closely with data stewards to set up the appropriate access and licensing conditions. As well as this, we are developing tools that can enhance the usability of both legacy corpora and newer collections, including tools for aligning transcripts that exist in different formats with their corresponding audio, standardising orthography, streamlining the anonymisation process, and so on. Right now, most of what we know about Australian English comes from middle-class Australians of Anglo-Celtic background living in major urban centres. A better representation of Australian society in our language collections will broaden the scope of the research that can be done, allow new questions to be asked, and provide a better picture of Australia overall. NE: My role is to keep our foot on the gas for the huge job of getting a continued flow of new corpus data, as well as digitised legacy data, mostly for the languages of New Guinea and the Pacific. MB: I’m the Academic Lead for these projects at the University of Sydney. In this role, I manage the projects overall and work closely with staff from the Sydney Corpus Lab and the Sydney Informatics Hub. We develop text analytics tools and training for analysis of text/language, including but not limited to tools for linguistic and discourse analysis. 
A team at PARADISEC is also involved and contributes to work packages around language collections for Indigenous languages of Australia and the Pacific. My role as director of the Sydney Corpus Lab is crucial for these projects. The lab’s mission is to build research capacity in corpus linguistics at the University of Sydney, to connect Australian corpus linguists, and to promote the method in Australia, both in linguistics and in other disciplines. We organise relevant events on corpus linguistics and text analytics, including guest lectures and workshops, and create resources such as corpora, blog posts, video playlists, and curated introductions in different languages. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:1:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What excites you most about the projects? CT: There is a wide range of collections out there that represent an absolute treasure trove of knowledge not only for language in Australia, but also for Australian society, culture, and history. For example, oral histories or ethnographic interviews can provide invaluable linguistic data, and likewise, sociolinguistic interviews may contain invaluable information for historians, sociologists, or anthropologists. But because there is no way to know what collections exist, what is in them, and what is accessible, this treasure trove is underutilised. With many decades of data collection behind us and with recent technological developments, now is the perfect time to open this up to create a language data commons. It is exciting to think of the opportunities that this presents for us to cross disciplinary boundaries, and expand our understanding of Australia’s social, cultural, and linguistic history. NE: We all know that most of the world’s languages are under-resourced, lacking even a grammar or dictionary. 
A dream I have – which this project will move us very slightly toward – is that every one of the world’s 7,000 languages might one day have a vast library of corpus data, comparable to the 60 million words we have for Classical Greek, for example. Each language and culture deserves its own vast library wing. To give an idea of the scale of the challenge, so far in our part of the world there are only three languages (Bislama, Gurindji Kriol, and Ku Waru) where we have even one million words. MB: To work across disciplines, to collaborate with data scientists, and to learn many new things myself! And to discover the value that we bring to such collaborations as domain experts in the Humanities and Social Sciences. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:2:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? CT: Given the number of existing language collections, one piece of advice I would give to someone who wants to get started with data collection would be to learn what is already out there prior to beginning new data collection, and to think about how any new data you collect may contribute to existing collections or research projects, or alternatively, how existing collections might shape the kind of new data you might collect or the kind of research you might do. It is standard practice to contextualise the research that we do within the field; we are at a point where we should be doing the same with data collection. It is through our cumulative efforts, taking advantage of the work that has preceded us and building on that, that we will most advance in our scientific endeavours. NE: Find good ways of getting sensitive and accurate commentaries on the meaning of materials. 
We need the equivalent of Biblical or Talmudic commentaries for the digital age – metatexts that comment on what other texts mean. MB: Just give it a go and don’t be daunted! Go to workshops or summer schools, and avail yourself of other free or low-cost opportunities. Be (and remain) critical in your use of tools, and remember the value of your own knowledge and expertise. Discover the joy of mastering a new tool/technique or of knowing enough to find out that it is not for you or for your research project. If you find it useful, practice and also keep good notes so that you can return to the tool later and still know what to do. Always be transparent in how you use the tool/technique and make sure you know why you are using it. ","date":"2023-10-18","objectID":"/news/posts/interview-2/:3:0","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":["LDaCA"],"content":"Acknowledgments The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). ","date":"2023-10-18","objectID":"/news/posts/interview-2/:3:1","tags":null,"title":"Interview with our Chief Investigators - Part 2","uri":"/news/posts/interview-2/"},{"categories":null,"content":"Outlines the standards and processes that support the onboarding of data collections to LDaCA.","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":" Purpose Onboarding Process 1. Initiating contact between LDaCA and Data Stewards 2. Work Plan 3. Determining appropriate governance and access conditions 4. 
Preparing data for cataloguing in LDaCA 5. Ongoing engagement ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:0:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Purpose This document outlines the standards and processes that support the onboarding of language collections to LDaCA. A successful onboarding will result in the collection being catalogued with LDaCA and made available in accordance with the specified access conditions. To determine if you are ready to onboard a collection, check this data management and governance capacity flowchart. Data Management and Governance Capacity Image Source: LDaCA ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:1:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Onboarding Process The LDaCA onboarding process is carried out collaboratively between LDaCA and the Data Steward. There are five broad steps to be followed; while the order of these steps is flexible, each step should be appropriately covered. Initiating contact between LDaCA and Data Stewards. Establishing a Work Plan. Determining the data governance and access conditions appropriate for this collection, and applying that in LDaCA. Preparing data for onboarding to LDaCA. Facilitating ongoing engagement. These five steps are covered in detail below, along with the responsibilities of LDaCA and the Data Steward for each step, and associated resources. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:0","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"1. 
Initiating contact between LDaCA and Data Stewards Data Stewards wishing to onboard data to LDaCA may approach LDaCA about this, and LDaCA will also seek to identify appropriate collections for onboarding. LDaCA will gather relevant publicly available information, and record that information in a collection onboarding database. This includes information about the current location, significance and description of the collection; size, format and data structure; existing metadata; persistent identifiers (if any) and contact person. Contact between LDaCA and the Data Steward is initiated with the aim of identifying the appropriate Data Steward, establishing a collaborative relationship, providing information about LDaCA, and discussing shared goals for the onboarding of the collection to LDaCA. LDaCA will: Respond to requests to onboard data to LDaCA and/or research data collections and approach contact people about onboarding data to LDaCA. Provide information about LDaCA. Confirm the Data Steward. Help to determine if the data collection is ready for onboarding. Discuss needs and goals for onboarding to LDaCA. Additional resources relevant to this step: Initial Contact Checklist ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:1","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"2. Work Plan Once communication between LDaCA and the Data Steward is in place, a Work Plan is prepared in order to establish the terms according to which the data will be onboarded to LDaCA, including the goals and responsibilities of each party, and the steps and timeline for carrying out the onboarding process. At this stage, it is important to share key information and resources about the LDaCA project and manage expectations. LDaCA will: Provide the Work Plan template for onboarding. Collaborate with the Data Steward on the development of the Work Plan. 
Provide further information about LDaCA, its services and limitations. Data Steward will: Confirm their role as the Data Steward. Collaborate with LDaCA on the development of the Work Plan. Additional resources relevant to this step: Work Plan template Terms of Use ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:2","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"3. Determining appropriate governance and access conditions LDaCA will collaborate with the Data Steward to (i) understand legal, ethical and moral considerations around data accessibility, and (ii) map the access conditions and how these should be implemented. At this stage, the Data Steward must prepare a data access license that clearly outlines the conditions for data access, sharing and reuse. LDaCA makes available varying levels of access, from open access to restricted access. A persistent identifier for the collection must also be provided to support sustainable data management practices. LDaCA will: Provide guidance and support with regards to data governance, persistent identifiers, determining access conditions and licensing. Data Steward will: Review the legal, ethical and moral conditions for providing access to the collection, including the identification of existing data access licenses if applicable. Ensure that any agreed-on restrictions with funding bodies and participants are strictly adhered to. Identify the copyright holder and confirm the copyright status. Provide or develop data access licenses that outline the specific conditions for access to the collection, including citation information. Provide a persistent identifier for the collection (obtaining a DOI if necessary). 
Additional resources relevant to this step: Access Policy Guidance for Data Governance Decisions Determining Access Conditions Obtaining a DOI ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:3","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"4. Preparing data for cataloguing in LDaCA LDaCA coordinates with the Data Steward to apply standards to the collection to support its long-term sustainability. The standards applied allow the data to be packaged in RO-Crates, the metadata to be mapped onto LDaCA’s metadata schema, and additional metadata to be added where possible/required. LDaCA will: Prepare the data for packaging as RO-Crates. Support the metadata mapping process. Data Steward will: Provide a copy of the collection data. Support the RO-Crate packaging by applying standards to the data and reviewing the packaged data. Provide metadata and collaborate with metadata mapping. Additional resources relevant to this step: Information on Metadata Information on RO-Crate ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:4","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"5. Ongoing engagement Finally, we implement ongoing engagement by outlining shared responsibilities and processes in order to ensure that access requests, system updates and user feedback are responded to appropriately by LDaCA and Data Stewards. This is key for collections that require access approval on a case-by-case basis and for collections that may introduce data updates, such as additional data, transcription or annotation, or edits to existing data, transcription or annotation. LDaCA will: Incorporate updates to the data collection provided by the Data Steward. Make the Data Steward aware of any takedown requests. 
Update the Work Plan as relevant, with responsibilities for ongoing maintenance of the collection in LDaCA. Data Steward will: Make LDaCA aware of any updates to the data collection they wish to incorporate, and work with LDaCA to incorporate them. Review any takedown requests, make a decision about how to address them, and inform LDaCA of that decision. Update the Work Plan as relevant, with responsibilities for ongoing maintenance of the collection in LDaCA. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/data-onboarding-process/:2:5","tags":null,"title":"Data Onboarding Process","uri":"/resources/ldaca-resources/data-onboarding-process/"},{"categories":null,"content":"Outlines the LDaCA Access Policy, making data appropriately accessible in accordance with legal, moral and ethical considerations of data sharing.","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":" Purpose 1. Access Types and Licensing Classification of Access Types 2. Onboarding Data to LDaCA LDaCA Responsibilities Data Steward Responsibilities 3. Ongoing Data Management Strategy 4. Supporting Information LDaCA Terms of Use Privacy Policy Takedown Policy ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:0:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Purpose This document outlines the LDaCA Access Policy, which is developed to accommodate the goal of making data appropriately accessible, in accordance with legal, moral and ethical considerations of data sharing, and tailored to meet the needs and requirements of different data collections. Access conditions for data in LDaCA are determined by the Data Steward. 
To assist with this, LDaCA provides: information to help Data Stewards make informed decisions about appropriate access conditions for the data; tools and resources to facilitate this process as collections are onboarded and made available (including for applying standards as required); and mechanisms for managing these once data is onboarded. The Access Policy comprises three key components, as outlined in this document: Access types and licensing Onboarding data to LDaCA Ongoing data management ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:1:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"1. Access Types and Licensing The foundation of the LDaCA Access Policy is licensing, i.e. the action of setting out conditions for accessing and using data in a license that is attached to all data in our systems. Access conditions are defined by the Data Steward for each collection, and they may restrict access as little or as much as desired, from open access, to an automated click-through license, to case-by-case approval prior to gaining access to the data. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:2:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Classification of Access Types Type of Access Description Basic Process Open access Data is openly available under a Creative Commons license or similar (including for data in the public domain). The user can read the license and directly access the data. Authorization Required Different levels of implementation: click-through license; by invitation; by application (case-by-case approval by Data Steward) Data is available with some restrictive conditions of use, as outlined in the license. Users must authenticate their identity to access these materials. 
In some cases, access to the license is limited to specific users, who can be defined by invitation and/or application (note that these are not mutually exclusive – either or both options may be implemented). Such authorisation requires ongoing engagement from the Data Steward in order to manage access lists and approve/decline access requests. A number of levels of authorisation can be implemented depending on the access conditions: License acceptance: the user must read and accept the license (click-through) before accessing the data. By invitation: the Data Steward invites users to access and accept the license. Only invited users can gain access. By application: specific conditions apply, and the user must provide extra information to gain access through a linked application form, which must be reviewed and approved by the Data Steward before access is granted. Approval may depend on, for example, the community or academic affiliation of the applicant; the specific use of the data requested; payment of a fee; or other criteria. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:2:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"2. Onboarding Data to LDaCA In accordance with the Access Policy, LDaCA has an established process for onboarding data. Throughout this process, the LDaCA team will work with the Data Steward to develop an effective strategy for managing access, in accordance with relevant legal, moral and ethical considerations applying to that data. Read more about our data onboarding process. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"LDaCA Responsibilities Supporting the Data Steward to standardise aspects of the language collection as required for successful onboarding, including with regards to metadata, data governance and data preparation. Adapting the onboarding process as relevant to the specific needs and requirements of the collection and Data Steward, and working to facilitate a successful and efficient onboarding process. Providing clear information to the Data Steward to ensure comprehension of the purpose of each step in the onboarding process, and responding to questions and concerns. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Data Steward Responsibilities Providing a persistent identifier for the data. If the collection does not have an existing persistent identifier, LDaCA recommends getting a DOI (Digital Object Identifier), which is becoming the default identifier for research datasets. A DOI makes a collection citable; it ensures that it is findable even if moved to a different location; and it establishes its relationship to other objects and entities in the academic research environment (e.g. researchers, funders, organisations, academic publications, software and other datasets). See Obtaining a DOI for more information. Providing metadata that is organised with a consistent structure, and includes descriptors, definitions and contextual information where relevant. This includes metadata at the level of the collection (e.g. collection name, a narrative description of the corpus, the subject language(s), author(s) or Collector(s), publication year, access conditions, etc.), file (e.g. 
length in minutes, words) and participants (e.g. age, gender). LDaCA technologies enable programmatic detection, extraction and summarisation of existing metadata in a dataset; in order for this to work effectively, the metadata must be standardised. Metadata are mapped to open schemas including the Open Language Archives Community (OLAC) vocabularies for describing language data and Schema.org for generic descriptions (e.g. Name, Identifier, Description). This approach allows many metadata terms to be linked to openly available and widely used definitions. Some metadata may be specific to a corpus or difficult to map to existing vocabulary terms. LDaCA makes use of the Language Data Ontology, which has been developed in consultation with OLAC and metadata specialists, to ensure consistency across terms. See Metadata for more information. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:3:2","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"3. Ongoing Data Management Strategy A strategy for ongoing access and data management of the collection in LDaCA must be developed in collaboration with the Data Steward. Responsibilities and processes must be clearly outlined specifying that Data Stewards will respond appropriately to access requests, system updates and user feedback. This is key for collections that require access approval on a case-by-case basis and for those collections that may introduce data updates, such as additional data, edits to existing data, transcription, or annotation etc. Both LDaCA and the Data Steward are jointly responsible for ongoing maintenance of the collection in LDaCA, and updating the Work Plan as needed. Read more about our ongoing engagement. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:4:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"4. 
Supporting Information The Access Policy is supported by the LDaCA Terms of Use, Privacy Policy, and Takedown Policy. These are summarised below; please go to the links to view the actual policies online. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:0","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"LDaCA Terms of Use The LDaCA Terms of Use provide an overarching agreement for all users who interact with the platform in adherence with general legal, ethical and moral standards. The key points included in the Terms of Use are: Users must adhere to conditions set by the Data Steward. Users must exercise ethical standards, paying particular attention to potential issues around sensitive data and participant privacy. Users must uphold the moral rights of data owners. Publicly available metadata in the LDaCA catalogue are licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. Data is provided ‘as is’ and LDaCA provides no guarantee to the accuracy or completeness of the material. LDaCA may change the site and suspend or cancel user access without notice. LDaCA is not responsible for any breaches of legal, moral or ethical standards by users. The Terms of Use can be viewed here. LDaCA Terms of Use: Key Points Image Source: LDaCA ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:1","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Privacy Policy The LDaCA Privacy Policy describes what information is collected about users and how personal information is managed by LDaCA. Information collected by LDaCA may include home organisation, social login provider and personal data such as name, email address, affiliation, and information about how a user accesses and uses LDaCA. 
This information is necessary to facilitate the functioning of the site and is required to provide access to restricted language collections. The LDaCA Privacy Policy outlines the limitations of the use of personal information and the measures taken to protect user privacy. The policy document is currently under development. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:2","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Takedown Policy The LDaCA Takedown Policy outlines the steps to be taken by users in making a request for data to be removed, or access to be adjusted in some way. LDaCA recognises that licensing and decisions surrounding access to data are made by the Data Steward. Therefore, the Data Steward is also responsible for assessing and determining the outcome of takedown requests. The Takedown Request mechanism supports FAIR and CARE Principles by facilitating a process by which data access may be questioned and discussed by relevant stakeholders. The policy document is currently under development. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/access-policy/:5:3","tags":null,"title":"Access Policy","uri":"/resources/ldaca-resources/access-policy/"},{"categories":null,"content":"Defines the workflow for determining the access conditions for a data collection, to be outlined in the license.","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":" Step 1: Consider access condition(s) for the entire collection Step 2: Organise relevant materials Step 3: Define the license terms Step 4: Determine authorisation level Options for access by application LDaCA recommends the following workflow to determine the access conditions to be outlined in the license. 
Examples of existing data access licenses can be found here. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:0:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 1: Consider access condition(s) for the entire collection Do the same access conditions apply to the entire collection? If yes → two licenses can be defined (one for metadata, one for the collection data). Metadata is licensed with an open license in order to allow the user to find the collection through the LDaCA catalogue. If no → all of the ways the data can be accessed must be defined, with a license for each subset. Figure 1: Access Conditions Flowchart Image Source: LDaCA Name the license(s). Complete the following steps 2-4 for each license. ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:1:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 2: Organise relevant materials What material can be accessed with this license? The material can be organised in different ways: Metadata vs. data Formats, e.g. transcriptions vs. audio List of files/speakers (based on a shared quality such as sensitive material, transcribed or non-anonymised material) ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:2:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 3: Define the license terms What conditions are included in the license? Can data be shared? Can data be modified? What is the license duration? Provide a brief description of the license conditions (1-2 lines). List the correct citation for this material. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:3:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Step 4: Determine authorisation level What authorisation is required for users to access this license? If none → all users can accept the click-through license and access the material. If by invitation → only users invited by the Data Steward can accept the license and access the data. If by application → arrange a form or a document with requirements for users to fill in. Figure 2: Access Conditions Flowchart - Authorisation Level Image Source: LDaCA This information can be mapped out visually or listed in an Excel sheet with the following basic structure: License Step 1: Determine the number of separate licenses needed, and name the license Step 2: Define the material Step 3: List the license terms Step 4: Describe the authorisation process, if applicable 1 2 ","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:4:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Options for access by application Access by application can be managed in a number of formats: Integrated forms, e.g. set three key questions for users requesting access to the collection. PDF application form, i.e. provide a dynamic PDF form that the user can download, complete and re-upload. Document upload request, i.e. request a copy of the user’s research ethics approval or proof of affiliation. Link to external service, i.e. Google Forms, payment service. 
","date":"2023-10-03","objectID":"/resources/ldaca-resources/determining-access-conditions/:5:0","tags":null,"title":"Determining Access Conditions","uri":"/resources/ldaca-resources/determining-access-conditions/"},{"categories":null,"content":"Explore our glossary of key terms and concepts related to LDaCA and the language data field.","date":"2023-10-03","objectID":"/resources/glossary/","tags":null,"title":"Glossary","uri":"/resources/glossary/"},{"categories":null,"content":" Access Conditions Conditions which specify who can access data and what they can do with that data. A well-governed archival repository has mechanisms in place to administer and implement such conditions which will be specified on a data license. ADA Australian Data Archive. A national service for the collection and preservation of digital research data. More information ADM+S ARC Centre of Excellence for Automated Decision-Making and Society. It brings together universities, industry, government and the community to support the development of responsible, ethical and inclusive automated decision-making. More information ADO Australian Digital Observatory. An ARDC platform working to establish a national infrastructure to support a diverse array of researchers, especially in the humanities, in accessing and working with dynamic digital data. More information AIATSIS Australian Institute of Aboriginal and Torres Strait Islander Studies. Australia’s only national institution focused exclusively on the diverse history, culture and heritage of Aboriginal and Torres Strait Island Australia, with a growing collection of over one million items, dedicated to Australian Aboriginal and Torres Strait Islander cultures, histories and contemporary stories. More information API Application Programming Interface. A way for computer programs to communicate with each other. It is a way for one computer or system to ask another computer or system to do something, like provide a dataset. 
ARC Australian Research Council. Its purpose is to grow knowledge and innovation for the benefit of the Australian community through funding the highest quality research, assessing the quality, engagement and impact of research, and providing advice on research matters. More information Archival Repository A location for the storage of data that has an appropriate governance regime in place. ARCP Archive and Packaging ID. A globally unique, searchable ID with zero management overhead, which can be used like a URL in linked data systems but does not resolve to content in a browser. ARDC Australian Research Data Commons. The ARDC is Australia’s leading research data infrastructure facility accelerating Australian research and innovation by driving excellence in the creation, analysis and retention of high-quality data assets. More information ARDS Aboriginal Resource and Development Services (ARDS Aboriginal Corporation). Its work champions the importance of language and culture in developing self-determination for Aboriginal people, and supports Aboriginal communities to increase control and understanding of mainstream services and systems. More information Arkisto A scalable, standards-based platform for sustainable data. Data on an Arkisto deployment is always available on disc (or object storage) with a complete description independently of any services such as websites or APIs. Once the data is safe and well-described, Arkisto has a flexible model for how data can be accessed using a variety of services. Built on top of RO-Crate and OCFL. More information See also: Oxford Common File Layout See also: RO-Crate ASR Automatic Speech Recognition. ASR enables computers to process human spoken language into readable text, allowing users to operate devices through speech or facilitate translation of that speech into other languages. ATAP Australian Text Analytics Platform. 
An open source environment that provides researchers with tools and training for analysing, processing and exploring text. More information AustLang Provides a controlled vocabulary of persistent identifiers, a thesaurus of languages and peoples, and information about Aboriginal and Torres Strait Islander languages which has been assembled from referenced sources. Alphanumeric codes are used as persistent identifiers, while associated text strings are changeable and can reflect community preferences (including alternative names and spellings). In AustLang, Warlpiri has two codes: C15 for the ","date":"2023-10-03","objectID":"/resources/glossary/:0:0","tags":null,"title":"Glossary","uri":"/resources/glossary/"},{"categories":null,"content":"Outlines steps for acquiring a Digital Object Identifier (DOI) for a data collection.","date":"2023-09-15","objectID":"/resources/ldaca-resources/obtaining-a-doi/","tags":null,"title":"Obtaining a DOI","uri":"/resources/ldaca-resources/obtaining-a-doi/"},{"categories":null,"content":"How do I get a DOI for my dataset? Through the collection's custodial organisation: Identify the collection’s custodial organisation – a university or research organisation that funded, sponsored, published, approved, or has a current or historical stakeholder interest in the research project that collected the data. In many cases, contacting the university library about getting a DOI for a dataset is a good starting point. Alternatively, an enquiry may be made with the research repository of the custodial organisation regarding options for getting a DOI. An example of a collection DOI obtained via the university library is Sydney Speaks at the Australian National University (ANU), which can be viewed here: Sydney Speaks DOI landing page. The ANU Library administers and manages the landing pages for ANU DOIs. 
If the above does not suit the context of the dataset, and the dataset has not been published before, consider depositing it in a data repository like Zenodo, a multi-disciplinary open repository that automatically assigns a DOI to uploaded datasets and publications and hosts landing pages. Zenodo is a proponent of FAIR Principles and open science; however, it allows upload of data with different access levels: Open, Closed, Restricted, and Embargoed - read the Zenodo General Policies, Terms of Use, and FAQs. The Zenodo DOI landing page can contain a link to point researchers to the data content in LDaCA. If unsure whether any of the above applies to your context, please contact LDaCA to discuss PIDs for your data. ","date":"2023-09-15","objectID":"/resources/ldaca-resources/obtaining-a-doi/:1:0","tags":null,"title":"Obtaining a DOI","uri":"/resources/ldaca-resources/obtaining-a-doi/"},{"categories":["LDaCA"],"content":" Download as PDF This presentation was delivered by Peter Sefton at the Open Repositories 2023 conference in South Africa on 2023-06-14 in the Presentations: Discipline specific systems with FAIR principles session. This contains the slides and complete speaker notes, which have been edited after the conference. We will present a standards-based generalised architecture for large-scale data* repositories for research and preservation illustrated with real-world examples drawn from a number of languages and cultural archive projects. This work is taking place in the context of the Australian Humanities and Social Sciences Research Data Commons, particularly the Language Data component thereof and the long-established PARADISEC cultural archive. The standards used include the Oxford Common File Layout for storage, Research Object CRATE (RO-Crate) for consistent linked-data description of FAIR digital objects, and a language data metadata profile to ensure long-term interoperability between systems and re-usability over time. 
We also discuss data licensing and authorization for access to non-open resources. We suggest that the approach shown here may be used in other disciplines or for other kinds of digital library, repository or archival systems. *The submitted abstract did not have the word data here - added for clarity By: Peter Sefton (University of Queensland), Simon Musgrave (University of Queensland \u0026 Monash University) \u0026 Nick Thieberger (University of Melbourne) This work is supported by the Australian Research Data Commons. The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the HASS (Humanities and Social Sciences) and Indigenous Research Data Commons (HASS+I RDC). The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Cultures at the University of Queensland with several partner institutions. We would like to acknowledge the traditional custodians of the lands on which we live and work and the importance of Indigenous knowledge, culture and language to these projects. 
Peter Sefton lives and works on Wiradjuri land, and for Nick Thieberger and Simon Musgrave, it’s the land of the Kulin nation. PARADISEC is an online archive of cultural data which has been maintained for twenty years. In this presentation, we will look at some of the lessons learned from PARADISEC. In summary – the PARADISEC approach to simple data and metadata storage is something we want to continue in LDaCA, while the high cost for PARADISEC of commissioning and maintaining its own software stack is something we want to address by taking a more standards-based approach to managing language and other data over the coming decades. The Arkisto platform started in 2019 as a way to capture the lessons of PARADISEC and other projects such as Alveo (another language data project similar in scope to LDaCA) which was presented at OR 2014: Sefton PM, Estival D, Cassidy S, Burnham D, Berghold J. The Human Communication Science Virtual Lab (HCS vLab): A repository microclimate in a rapidly evol","date":"2023-06-29","objectID":"/news/posts/arkisto-stack-or-2023/:0:0","tags":["Access"],"title":"Towards a Generic Research Data Commons: A highly scalable standards-based repository framework for Language and other Humanities data\n","uri":"/news/posts/arkisto-stack-or-2023/"},{"categories":["LDaCA"],"content":" The Australian Text Analytics Platform (ATAP) and the Language Data Commons of Australia (LDaCA) are collaborative projects led by the University of Queensland and supported by the Australian Research Data Commons to develop infrastructure for researchers who work with language data. In this blog post series, we feature interviews with the Chief Investigators of the two projects. In each post, we present their answers to three questions: What is your role in these projects? (What do you/your team do as part of your participation?) What excites you most about the projects? (What motivates you to participate?) 
What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? This blog post features Louisa Willoughby (LW), Martin Schweinberger (MS) and Nick Thieberger (NT). The interview was undertaken via email, and we are grateful to Kelvin Lee from the Sydney Corpus Lab for his assistance in undertaking the interviews and creating these blog posts. ","date":"2023-04-26","objectID":"/news/posts/interview-1/:0:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What is your role in these projects? LW: I’m the academic lead on the multimodal corpus. We’re building infrastructure for people to create video dictionaries that link to corpus examples (and vice versa), using Signbank and the Auslan corpus as our test case. MS: I am a Chief Investigator and steering committee member of the Australian Text Analytics (ATAP) Project and I am also a CI of the Language Data Commons of Australia (LDaCA). I am particularly focusing on the Language Technology and Data Analysis Laboratory (LADAL) which is part of ATAP and represents an infrastructure for computational text analytics in the humanities. LADAL has an outreach component as it organises webinars and workshops and it provides computational resources in the form of self-paced online tutorials as well as interactive notebooks that allow researchers to try out methods and apply them to their own data. I see my task as organising and managing activities around LADAL by creating and optimizing resources as well as getting involved in outreach via events and building (inter-)national partnerships and collaborations with other computational humanities infrastructures. NT: I am in charge of work packages 1.1 and 1.2, Indigenous languages of Australia and the Pacific. I worked at AIATSIS in the past and set up an Aboriginal language centre in Port Hedland (Wangka Maya). 
I have worked in Vanuatu, and did my PhD research on Nafsan, a language from Efate, sparking an interest in languages of the Pacific. The work of recording speakers of Nafsan also resulted in a corpus of time-aligned text and media, as well as historical manuscripts that I have been finding in archives around the world. This led me to be concerned that existing records of these languages be made accessible to current speakers and so I helped establish the Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) which has many connections to cultural agencies in the Pacific. I currently lead a project (Nyingarn) to convert manuscripts in Australian Indigenous languages to text. ","date":"2023-04-26","objectID":"/news/posts/interview-1/:1:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What excites you most about the projects? LW: That it is both building an exciting tool for researchers to better understand language variation AND making a resource that is useful for Auslan students and members of the community. MS: One of the most fulfilling aspects of my work is the opportunity to collaborate with researchers from diverse communities and fields or to see how resources I created help people from very different backgrounds in their work. I do take pride in inspiring and enabling researchers to explore the potential of computational methods, whether they are working on projects related to the language sciences, health sciences, social sciences, arts, or beyond. As a quantitative linguist who came to computation as a computerphobe philosopher, I really enjoy creating visualizations that communicate data and results and that provide an intuitive understanding of complex data sets. I also take pleasure in showing others how to perform statistical analyses that help them make informed decisions based on their findings. 
By optimizing workflows and automating procedures, I help researchers to work more efficiently and effectively, enabling them to achieve their goals more quickly and easily. Through my work, I aim to create an environment where researchers from diverse backgrounds and specializations can benefit from the potential of computational tools, regardless of their field of study or level of experience. I value the opportunity to work collaboratively with researchers to find solutions to complex problems, and to share my expertise with others to promote innovation and growth within the field. Overall, I believe that the application of computational methods is an exciting area of research that has the potential to transform the way we approach scientific inquiry in the humanities and social sciences. By helping researchers to explore the possibilities of these tools, I hope to contribute to a more vibrant and innovative research community. NT: Making textual material in these languages available so that speakers can find records in their own languages. Often there is very little available information for these languages, so every record becomes all the more important, especially in the context of colonial dispossession where the languages are no longer spoken everyday and there are efforts to relearn the language. It is exciting that this work can become part of a national commons, and be supported into the future, so that more and more material can be included. The simple task of locating a language record, digitising it, and making the files available with appropriate licenses means that it can be used by speakers, and by researchers for various new purposes. 
","date":"2023-04-26","objectID":"/news/posts/interview-1/:2:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"What advice would you give someone who wants to get started with text analytics, corpus linguistics, or language data collection? LW: Jump in and start playing! There are lots of different tools and corpora online and I find it easiest to learn how to use them by just having a go and seeing what you can get out. It can be fun to look for little bits of variation or to see how common a certain word or phrase is, and doing something small will give you a sense of the kind of data you get out of the tools and how you might then incorporate them into a wider project. MS: If you’re just starting out with a new project, it’s essential to take it one step at a time. Start with a simple visualization or basic text processing task, such as concordancing, and build on it gradually. Don’t be intimidated by others who may have more experience or skills than you do. Instead, take the opportunity to learn from them and seek their guidance when necessary. It’s important to keep in mind that progress takes time, and comparing yourself to others can be discouraging. Instead, focus on comparing your current skills and accomplishments to your former self. Take pride in what you’ve created and the progress you’ve made. Celebrate your small wins and use them as motivation to keep moving forward. Another key to improving your skills is to seek feedback from others. Share your work with peers, mentors, or instructors who can provide constructive criticism and suggestions for improvement. Don’t be afraid to ask questions or seek clarification when you’re unsure of something. Remember that learning is a continuous process, and even the most experienced professionals have room for growth and improvement. 
Embrace the journey, enjoy the learning process, and stay motivated by setting achievable goals for yourself. By doing so, you’ll be well on your way to developing your skills and achieving your goals. NT: The first steps in language data collection involve recording speakers, with all appropriate permissions in place, and with a consent form signed by them. Making those recordings as well as possible, following recommendations, doing training and so on, and managing the resulting files so they are not lost. Learn the basics of text querying, searching, and regular expressions. Be aware of the ethical considerations in using material from someone else’s language. But just start now! (Further information to supplement these suggestions is available from our Resources page and from the ATAP pages introducing Text Analysis) ","date":"2023-04-26","objectID":"/news/posts/interview-1/:3:0","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":["LDaCA"],"content":"Acknowledgments The Australian Text Analytics Platform program (https://doi.org/10.47486/PL074) and the HASS Research Data Commons and Indigenous Research Capability Program (https://doi.org/10.47486/HIR001) received investment from the Australian Research Data Commons (ARDC). The ARDC is funded by the National Collaborative Research Infrastructure Strategy (NCRIS). 
","date":"2023-04-26","objectID":"/news/posts/interview-1/:3:1","tags":null,"title":"Interview with our Chief Investigators - Part 1","uri":"/news/posts/interview-1/"},{"categories":null,"content":"A guide to navigating the various sections of the portal interface.","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Data and Page Structure Home Page (Top Menu, Left Panel, Main Panel) Collection Page Object Page File Page ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:0:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Data and Page Structure Both the data and the webpages in the portal are structured in a heirarchy: Collections contain Objects and Objects contain Files. Level Description Collection A group of related Objects. Examples of collections include corpora, and sub-corpora, as well as aggregations of cultural objects such as PARADISEC collections, which bring together items collected in a region or a session with consultants. ↓ Object A single resource or a group of tightly related resources that record a communicative event; for example, a dialogue or session in a speech study, a work (document) in a written corpus. ↓ File A container for data (in the form of bits and bytes). Files can store data in different formats; for example, a single Object could have an audio file as well as a text file containing a transcription of the audio. 
","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:1:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Home Page There are three main sections of the Home Page in the portal: the top menu bar, the left panel and the main panel. Each of these is discussed below. Home Page: Top Menu, Left Panel and Main Panel sections Image Source: LDaCA ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Top Menu Home Page: Top Menu Image Source: LDaCA The top menu of the home page allows you to access some of the main features of the portal: LDaCA logo: Returns you to the home page from anywhere in the portal. Collections: Filters the results section of the main panel so that only collections are shown (excluding objects, files and notebooks). A search performed while in this view will only search the text in the Name and Description fields at the collection level. Browse: Resets your results to the default settings. Login: Takes you to the CILogon page where you can login to apply for access to collections or view your current access. See Login for more details. Help: Provides more general information about the infrastructure the portal was built with, as well as the Oni API functionalities. It also links back to this user guide for easy reference. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:1","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Left Panel Home Page: Left Panel Image Source: LDaCA The left panel of the portal home page allows you to refine your data query through the use of a search field and filters. 
Searching allows you to view records which include specific information, while filtering allows you to view records which share values for metadata categories. Hovering over the information icon next to each filter will display tooltips related to that filter. See Search and Filters for more details on how to use the search and filter functions. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:2","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Main Panel Home Page: Main Panel Image Source: LDaCA The main panel on the right of the home page allows you to view the top-level descriptions of collections, objects, files and notebooks in the portal. The top row shows the total number of Index entries or items; this total will reduce as you apply filters and search queries. Below this is the option to Reset Search, clearing all (current) filters and searches. The Sort by and Order by dropdown boxes can also be configured; see Sort and Order to learn more about their usage. Below this, there is a row of buttons which allow you to navigate through pages of results, with 10 results appearing on each page. The results in the main panel contain a set of top-level descriptions: Name: The name of the collection, object, file or notebook. Type: The type of object a record describes, i.e. a collection, object or file. Language: The language(s) of the materials (including PrimaryMaterials, DerivedMaterials and Annotations) in this item. Description: An abstract of the collection, object, file or notebook. Members: The number of items (objects, files, etc.) within the given collection. Member of: The collection that an object, file or notebook is a part of. Note that top-level descriptions will not appear in cases where they aren’t provided for that item, e.g. 
usually Description is not present for objects and files, and Member of does not appear for top-level collections. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:3","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Item Icons The icons next to each item show you some important information about that item (Access, Communication Mode and File Format). You can hover over an icon to see more detail. Icon Category Icon Tooltip Access You can access this data immediately. By doing so, you accept the license terms specified on the record. Access You can access this data after logging in. You may also have to agree to license terms in an automatic process. Access There are restrictions on access to this data. Log in to get further information. Communication Mode SpokenLanguage Communication Mode WrittenLanguage Communication Mode SignedLanguage File Format text/plain File Format application/msword File Format application/tei+xml application/vnd.openxmlformats-officedocument.wordprocessingml.document File Format audio/mpeg audio/x-wav File Format application/mp4 File Format text/csv File Format application/pdf ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:2:4","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Collection Page Clicking on one of the collections from the main page results will take you to the Collection page for that item. Collection Page: Braided Channels Example Image Source: LDaCA The main panel of the Collection page lists the main details and metadata associated with the collection. Clicking on the question mark icon or information icon next to each heading will display tooltips related to that item. Below the main description and metadata is the number of objects present in the collection. 
The objects are then listed below, with buttons allowing navigation by page if there are more than 10 objects in the collection. The right panel has the following sections: Access: Defines the license and access conditions for the current collection. Content: Lists some of the main features of the current collection including Language, Linguistic Genre, Communication Mode, File Formats and Data licenses for access. Retrieve Metadata: View or download the metadata associated with the current collection, as well as the license and access conditions for this metadata. Notebooks: Lists any notebooks associated with the current collection. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:3:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"Object Page Clicking on one of the objects or files from either the main page results or from a Collection page will take you to the Object page for that item. Object Page: A Corpus of Oz Early English Example Image Source: LDaCA The left panel of the Object page lists the main details and metadata associated with the object. Clicking on the question mark icon or information icon next to each heading will display tooltips related to that item. The right panel has the following sections: Access: Defines the license and access conditions for the current object, together with a click-through link to the full license. Member Of: Lists the collection that the current object is a part of. Other Objects in this Collection: Lists the other objects that are in the same collection as the current object. The end of the Object page provides details about the files that form part of the current object. In the example above, the object Text 1-028 1791 Convict has two associated files: a plain text version and a text version including metadata codes, the details of which can be viewed using the dropdown arrow to the right of the file name. 
A preview of the file is available (provided that the access conditions permit this), and there are two options: View File to display the full file in the web browser, and Download File to save a copy of the file to your local system. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:4:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":null,"content":"File Page Clicking View File on one of the files in an Object page will take you to the File page for that item. File Page: Braided Channels Example Image Source: LDaCA From this page, you can view the whole file or download it with the option at the end of the page. ","date":"2023-01-29","objectID":"/resources/user-guides/portal/basic-navigation/:5:0","tags":null,"title":"Basic Navigation","uri":"/resources/user-guides/portal/basic-navigation/"},{"categories":["LDaCA"],"content":" The data which is being made accessible through the Language Data Commons of Australia will contribute to the task of documenting language use and language behaviour in Australia. But what does this include? We are aware of many data sources and, in the short term, the important questions are about priorities: What will be immediately useful? What data is stored precariously? What relationships can be leveraged? But in thinking for the long term, there is value in asking the question from a more conceptual point of view: What were the possibilities for making records of language use over our history and what has resulted (and not resulted) from those possibilities? I will try to at least start answering that question by providing a conceptual map of a metaphorical landscape and this will be structured around two themes: demography and technology. Demography is important because what languages were being used at any particular time depends on who was living in Australia at that time. 
On this dimension, the major point of articulation is the arrival of non-Indigenous people. The places of origin of those non-Indigenous people have changed over time, and that has had linguistic consequences, but not on the same scale as the initial change. Technology is important because the kinds of records which might exist of language use depend on the means which were available to make such records at a given time. And on this dimension, I think that there are three points of articulation that are important: the possibility of making written records, the possibility of recording sound and vision, and the possibility of digital records. Before European contact, Australia had a very diverse linguistic ecology. Credible estimates of the number of distinct languages range from around 250 up to as many as 490, with many more dialects. Contact between language groups was common and therefore multilingualism and multidialectalism were also common. We know that there was contact between Indigenous Australians and fishermen from the Indonesian archipelago (especially Makassarese people) from the evidence of loan words. But the only record which existed of this period in the linguistic history of the continent was what was passed from one generation to another by oral transmission. The presence of Europeans on the Australian continent brought a huge change to both dimensions of the map I am developing. Europeans landed on parts of Australia from some time in the 17th century, but I will take two dates in the 18th century to be crucial. Although earlier visitors may have recorded a few words which they heard from Indigenous Australians, written records of the languages only start properly (and even then to a limited extent) with Cook’s expedition in 1770, reflecting perhaps the scientific orientation of Joseph Banks. 
This is the first point of technological articulation in the language data landscape which introduced the possibility of making language records independent of human memory. And then from 1788, there has been a continuous non-Indigenous presence in Australia representing a huge demographic articulation point. During the 19th century (and into the 20th century), written records of Australian languages were produced by a variety of people such as explorers, missionaries, administrators (in fact, pretty much anyone who could be bothered). These records are scattered and new material continues to be found (for example, Des Crump’s ongoing work in the State Library of Queensland and the Queensland State Archives). If you would like to get a flavour of some records of this kind, the Nyingarn project is making many of them available online. (Nyingarn builds on earlier work which presented the Indigenous language materials collected by Daisy Bates as an online resource.) However, the efforts of these early recorders were sporadic, uncoordinated and poorly focused. In 1945, Sydney Baker wrote: “Records of their languages are extremely deficient for","date":"2022-12-07","objectID":"/news/posts/data-map/:0:0","tags":["Australia","language data"],"title":"Language data in Australia - Mapping a conceptual landscape","uri":"/news/posts/data-map/"},{"categories":["LDaCA"],"content":" PDF version By Peter Sefton, Nick Thieberger, Marco La Rosa, Simon Musgrave, River Tae Smith, Moises Sacal Bonequi – delivered by Peter Sefton at eResearch 2022 in Brisbane This work is licensed under CC BY 4.0. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ This presentation will look at how a Metadata Standard - RO-Crate - with Metadata Profile (the Language Data Commons) is being developed and implemented. Two major collections are collaborating on the standard, PARADISEC and the Language Data Commons of Australia (LDaCA). 
This ongoing standardisation effort for language data is designed to improve interoperability, reduce costs for data migration and allow storage on disk, object storage or in archival repositories. RO-Crate is a linked-data metadata system which allows discovery metadata (Who, what where) based on the widely adopted Schema.org vocabulary to be seamlessly integrated with more discipline-specific metadata. RO-Crate uses metadata profiles to provide guidance for packaging resources for particular disciplines and purposes. In this presentation, we will introduce a RO-Crate metadata profile for language data which extends the core RO-Crate standard with new vocabulary terms adapted from pre-linked-data discipline-specific metadata efforts, particularly the Open Language Archives Community (OLAC) standards. The profile has English-language guidance on how to structure collections of resources in a repository with links between them, such that they can be indexed and displayed via APIs and search/browse portals. The profile is also implemented as a series of machine-readable profiles for the Describo Online metadata description system. We will demonstrate current ways of describing items in a variety of languages and modes (spoken, written and signed), from a large set of heterogeneous language resources held by PARADISEC and LDaCA. We will also show how to access them via API calls and a search portal, and how resources may be stored in simple storage systems using the Arkisto platform (a set of standards and principles). This work is supported by the Australian Research Data Commons. The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). 
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions. This page shows a YouTube demo of the PARADISEC website. PARADISEC (the Pacific And Regional Archive for Digital Sources in Endangered Cultures) is a digital archive of records of some of the many small cultures and languages of the world and it has developed models to ensure that the archive can provide access to interested communities while also conforming with emerging international standards for digital archiving. 
Australian researchers have been making unique and irreplaceable audiovisual recordings in the region since portable field recorders became available in the mid-twent","date":"2022-11-25","objectID":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/:0:0","tags":["Metadata"],"title":"Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate)\n","uri":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/"},{"categories":["LDaCA"],"content":"Conclusion In this presentation, we have shown the major components of an ecosystem for storing, discovering and analysing language data using common standards for describing objects in a repository. The RO-Crate standard is used as the key metadata container, with a common vocabulary of language-specific terms for describing data. This approach should reduce development costs and increase data reuse. The approach can also be adapted to other disciplines and domains with the development only of new profiles. ","date":"2022-11-25","objectID":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/:1:0","tags":["Metadata"],"title":"Designing a metadata ecosystem for language research based on Research Object Crate (RO-Crate)\n","uri":"/news/posts/ldaca-metadata-ecosystem-eresearch-2022/"},{"categories":["LDaCA"],"content":" This work is licensed under a Creative Commons Attribution 4.0 International License. Download as PDF This is a write-up of a talk given at eResearch Australasia 2022, delivered by Peter Sefton, with some additional detail. By: Peter Sefton, Jenny Fewster, Moises Sacal Bonequi, Cale Johnstone, Catherine Travis, River Tae Smith, Patrick Carnuccio Edited by: Simon Musgrave The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are building towards a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). 
The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. The platform will provide workbench services to support computational research, starting with code-notebooks with no-code research tools provided in later phases. Research artefacts such as code and derived data will be made available as fully documented research objects that are re-runnable and rigorously described. Metrics to demonstrate the impact of the platform are projected to include usage statistics, data and article citations. These projects are led by Professor Michael Haugh of the School of Languages and Culture at the University of Queensland with several partner institutions. This work is supported by the Australian Research Data Commons. Last year at eResearch Australasia, the Language Data Commons of Australia (LDaCA) team presented a design for a distributed access control system which could look after the A-is-for-accessible in FAIR data; in this presentation, we describe and demonstrate a pilot system based on that design, showing how data licenses that allow access by identified groups of people to language data collections can be used with an AAF pilot system (CILogon) to give the right people access to data resources. The ARDC have invested in a pilot of this work as part of the HASS Research Data Commons and Indigenous Research Capability Program integration activities. The system has to be able to implement data access policies with real-world complexity and one of our challenges has been developing a data access policy that works across a range of different collections of language data. 
Here we present a pilot data access policy that we have developed, describing how this policy captures the decisions that must be made by a range of data providers to ensure data accessibility that complies with diverse legal, moral and ethical considerations. We will discuss how the CARE and FAIR principles underpin this work, and compare this work to other projects such as CADRE, which promise to deliver more complex solutions in the future. Initial work is with collections curated in a research context but we will also address community access to these resources. The idea is to separate safe storage of data from its delivery. Each item in a repository is stored with licensing information in natural language (English at the moment, but could be other languages) and the repository defers access decisions to an Authorization system, where data custodians can design whatever process they like for granting license access. This can range from simple click-through licenses where anyone can agree to license terms, to detailed multi-step workflows where applicants are vetted based on whatever criteria the rights holder wishes: qualifications, membership of a cultural group, whether they have paid a subscription fee, etc. Regarding rights, our project is informed by the CARE principles for Indigenous data which also describe the level of respect which should be given to any data collected from individuals or communities. The current moveme","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:0:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: Why not “just” implement an access control list (ACL) in the repository? 
There are a few reasons for the distributed approach we have taken in LDaCA: ACLs need maintenance over time - people’s identities change, they retire and die, so storing a list of identifiers such as email addresses alongside content is not a viable long-term preservation strategy. Rather, we will encourage data custodians to describe in words what are permitted uses for the data, and by whom, in a license, then allow whoever is the current data custodian to manage that access in a separate administrative system. We expect these administrative systems to be ephemeral, and change over time but also to generate less friction over time as standards are developed. Expected future benefits of concentrating these processes will include that people do not have to prove the same claims they make about themselves multiple times and that it is easier for data custodians to authorise access. LDaCA data will be stored in a variety of places with separate portal applications serving data for specific purposes; if these systems all have in-built authorization schemes, even if they are the same, then we have the problem of synchronizing access control lists around a network of services. Accessing data that requires some sort of authorization process is not language or humanities-specific, so working with an existing application that can handle pre-authorization workflows and access-control authorization decisions is an attractive choice and should allow LDaCA to take advantage of centrally managed services with functionality that improves over time rather than having to develop and maintain our own systems. If complex access controls are implemented inside a system then there is a risk that data becomes stranded inside that system and cannot be reused without completely re-implementing the access control. For example, imagine an archive of cultural material with complex access controls encoded into the business logic such as “this item is accessible only to male initiates”. 
Applications like this need to store user accounts with attributes on both data and user records that can be used to authorise access. There is a high risk of data being stranded in a system such as this if it is no longer supported. This will be mitigated somewhat if the rules are also expressed as licenses, perhaps a composition of Traditional Knowledge (TK) Labels - but the access system is baked-in to the application and not portable. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:1:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: Yes but why does data need to have a license if we already have access controls? The point of Research Data Commons projects like LDaCA is to create an ecosystem where data can be re-used. For language data, this means that users, including researchers and community members, will be able to download data for certain authorised purposes and activities. The license is the way that data custodians communicate to data users (and future administrators) what those purposes and activities are. A license, which is always packaged with data will allow: A user to inspect a five-year-old dataset in their downloads folder and work out what they are allowed to do with it. An IT professional to clean up a laptop that has been handed in by (or seized from – it happens) a departing faculty member. A developer to re-create an access control replacing a decommissioned system. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:2:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: So many licenses! 
Sounds like a lot of work! We expect that the overhead of writing licenses will diminish greatly over time and standard clauses and complete licenses will be established. A data depositor will be able to choose from a set of standard license terms (such as a standard “restricted to CIs and participants license” for a given repository), using that as a template to mint their own license for a given data set with its own name and ID. The user can choose a standard way of adding pre-authorised licensees (such as email invitations). This ID can then be used by an authorization system. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:3:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: So you have centralised authorization into a system that grants licenses - doesn’t that mean you are locked-in to that system? No, and Yes. No, there is no lock-in regarding the list of Licenses and pre-authorised users; licenses and access control lists can be exported via an API so it is possible to import them into another system or save them for audit purposes. Yes, there is lock-in, in that at this stage the workflow used to give access to users is specific to the system (such as REMS). But, because our process requires a governance step first in writing a license, then there is a statement of intent for re-building those processes later if needed - a step which is very likely to be missing in a system with built-in access control. 
Also, over time, we expect the administrative burden of constructing workflows will become less as standards are developed for a couple of things: Licenses can be made less complex (particularly in the context of academic studies) if they specify re-use by particular known cohorts in advance - this comes down to improving the design of studies to encourage data reuse. This may also help to simplify academic ethics processes in the medium to long term. The CADRE project is looking to improve pre-authorization workflows that automatically source relevant information about potential users - fetching their publication record, and potentially remembering what certifications they have, so these attributes can be used and reused for decision-making. It is conceivable that this approach might be useful in cultural contexts as well to allow data custodians to manage data sharing - this is a discussion we have yet to have in the broader HASS RDC. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:4:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: What if I have a really simple requirement like giving access to just a couple of people - doesn’t this license approach just add complexity? If a data item needs to be locked down to a small group of people, say the chief investigator and the participants in a recorded dialogue then an obvious implementation is to maintain a small access control list (ACL) for the item. But all of the issues identified above with application-specific ACLs are the same, no matter the size of the cohort: the data set can’t be access controlled outside of its home system. 
If the system is no longer running then the data may be completely inaccessible, and if there is no license document stored with the data setting out terms of re-use in general terms then there is no indication to future administrators about who, if anyone, should have access to the data. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:5:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":["LDaCA"],"content":"Q: We don’t need a license, we have a “terms of use” Same thing. Terms of use for data are what a license does. We are designing our systems so that all the relevant terms and conditions go in one place to minimise confusion. The final three slides have been contributed by co-author Patrick. These slides briefly outline the AAF process for the next phase that will provide the foundations for the development of the service and the creation of those policies to support the community and the service. This process will support the project to deliver a viable service that meets researchers’ needs and is trusted by the community and the participants to safely distribute data to authorised persons. The AAF’s business analyst is conducting interviews with the key stakeholders. This discovery process will collect information on the current and the “to-be” state of the service. Together these will establish goals and expectations and provide the basis for further prototyping a service that meets stakeholder needs. The process will facilitate the building of a service that empowers the data custodians, the communities and participants to manage access. 
The basis for prototyping is iterative: Identify, Prioritise, Pilot, Review, Update requirements. This leads to a production service that meets participant, community and researcher requirements and unifies the services, policies and trust framework for the community. ","date":"2022-11-16","objectID":"/news/posts/fair-care-eresearch-2022/:6:0","tags":["Access"],"title":"A CARE and FAIR-ready distributed access control system for human-created data using the Australian Access Federation and beyond\n","uri":"/news/posts/fair-care-eresearch-2022/"},{"categories":null,"content":" Bio My name is Otis Carmichael and I am a proud Waanyi man living in Meanjin. My designs reflect my appreciation of the past and my hope for the future. The design of the LDaCA logo came from my own experience researching the Waanyi language, the language of my mob. The design flows from the beginning to the end, where it flows back into itself at the beginning, representing the cyclical nature of life and the Land. The strands of language interweave with themselves, influenced by everything that touches them until they become something entirely their own. The complexity of these languages is a thing of beauty and their preservation, study and proliferation lays claim to this unceded land. The inspiration of the design is as follows: L (Language) - Language is shown as a flow of words that can be merged, split and rearranged to create different meanings, just as these lines can be. D (Data) - Data is depicted by these specks of raw information, undecipherable before processing but heavy with power. a - This data is not just numbers and words but aspects of cultures and knowledge with connections to the land. C (Commons) - Whilst research often appears niche in these commons, there is no limit to the range of its influence, the ripples of which disrupt everything it touches. 
A (Australia) - The language data of this common has been collected from many individual nations that make up ‘Australia’, often with a foundation on extraction based research techniques. This project aims to make up for the mistakes of the past by celebrating and giving back to these nations, which are so incredibly distinct whilst sharing beautiful connections with each other. The colours utilised in this design draw from those that are seen in my people’s land of Boodjamulla, in the Gulf of Carpentaria (see below). They reflect the connections between the Sky, the People and the Land in the hope that these connections continue to strengthen indefinitely. ","date":"2022-11-07","objectID":"/designer/:0:0","tags":null,"title":"Otis Carmichael","uri":"/designer/"},{"categories":null,"content":" You can contact the Language Data Commons of Australia by email at ldaca@uq.edu.au and make sure you subscribe to our newsletter. Our logo was designed by Otis Carmichael. Read more about Otis and the ideas behind his design. We have a page on LinkedIn and we share a Twitter account with the Australian Text Analytics Platform: ","date":"2022-10-28","objectID":"/contact/:0:0","tags":null,"title":"Contact Us","uri":"/contact/"},{"categories":null,"content":"If you believe that something on our website or data portal shouldn’t be available publicly, please review our takedown policy and submit the form below. ","date":"2022-10-28","objectID":"/takedown/:0:0","tags":null,"title":"Takedown Form","uri":"/takedown/"},{"categories":["LDaCA"],"content":" This is a presentation Peter Sefton gave to the Humanities, Arts and Social Sciences Research Data Commons and Indigenous Research Capability Program Technical Advisory Group on Friday 11th February 2022. Thanks to Simon Musgrave for reviewing this and adding a little detail here and there. 
(This is also available on Dr Sefton’s site) PDF version The Language Data Commons of Australia Data Partnerships (LDaCA) and the Australian Text Analytics Platform (ATAP) are establishing a scalable and flexible language data and analytics commons. These projects will be part of the Humanities and Social Sciences Research Data Commons (HASS RDC). The Data Commons will focus on preservation and discovery of distributed multi-modal language data collections under a variety of governance frameworks. This will include access control that reflects ethical constraints and intellectual property rights, including those of Aboriginal and Torres Strait Islander, migrant and Pacific communities. For this Research Data Commons work, we are using the Arkisto Platform (introduced at eResearch 2020). Arkisto aims to secure the long-term preservation of data independently of code and services - recognizing the ephemeral nature of software and platforms. We know that sustaining software platforms can be hard and aim to make sure that important data assets are not locked up in a database or hard-coded logic of some hard-to-maintain application. We are using three key standards on this project … The first standard is the Oxford Common File Layout - this is a way of keeping version-controlled digital objects on a plain old filesystem or object store. Here’s the introduction to the spec: Introduction This section is non-normative. This Oxford Common File Layout (OCFL) specification describes an application-independent approach to the storage of digital objects in a structured, transparent, and predictable manner. It is designed to promote long-term access and management of digital objects within digital repositories. 
Need The OCFL initiative began as a discussion amongst digital repository practitioners to identify well-defined, common, and application-independent file management for a digital repository’s persisted objects and represents a specification of the community’s collective recommendations addressing five primary requirements: completeness, parsability, versioning, robustness, and storage diversity. Completeness The OCFL recommends storing metadata and the content it describes together so the OCFL object can be fully understood in the absence of original software. The OCFL does not make recommendations about what constitutes an object, nor does it assume what type of metadata is needed to fully understand the object, recognizing those decisions may differ from one repository to another. However, it is recommended that when making this decision, implementers consider what is necessary to rebuild the objects from the files stored. Parsability One goal of the OCFL is to ensure objects remain fixed over time. This can be difficult as software and infrastructure change, and content is migrated. To combat this challenge, the OCFL ensures that both humans and machines can understand the layout and corresponding inventory regardless of the software or infrastructure used. This allows for humans to read the layout and corresponding inventory, and understand it without the use of machines. Additionally, if existing software were to become obsolete, the OCFL could easily be understood by a lightweight application, even without the full feature repository that might have been used in the past. Versioning Another need expressed by the community was the need to update and change objects, either the content itself or the metadata associated with the object. The OCFL relies heavily on the prior art in the [Moab] Design for Digital Object Versioning which utilises forward deltas to track the history of the object. 
Utilizing this schema allows implementers of the OCF","date":"2022-06-08","objectID":"/news/posts/rdc-tech-meeting/:0:0","tags":["Repositories"],"title":"HASS RDC Technical Advisory Group Meeting LDaCA \u0026 ATAP Intro","uri":"/news/posts/rdc-tech-meeting/"},{"categories":null,"content":"Information about the first datasets which have been added to the LDaCA and Oni .","date":"2022-02-15","objectID":"/about/sample-collections/","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"LDaCA Portal LDaCA has begun adding datasets, including: most of the datasets which were part of the Australian National Corpus collection, now available at our data portal (or click the Data Portal button top right). ","date":"2022-02-15","objectID":"/about/sample-collections/:1:0","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"A Corpus of Oz Early English (COOEE) A collection of texts written in Australia between 1788 and 1900. The corpus is divided into four time periods (1788–1825, 1826–1850, 1851–1875 and 1876–1900) each holding about 500,000 words. Four registers were defined for COOEE: the Speech-based Register (SB), the Private Written Register (PrW), the Public Written Register (PcW) and the register of Government English (GE). For each time period, there is a similar number of words in the different registers. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:1","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"AustLit The contribution from AustLit provides full-text access to select samples of out-of-copyright poetry, fiction and criticism ranging from 1795 to the 1930s. The collection includes literature intended for popular audiences as well as literature intended for audiences concerned with literary quality or the establishment of a national canon. 
","date":"2022-02-15","objectID":"/about/sample-collections/:1:2","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Australian Corpus of English The Australian Corpus of English (ACE) was compiled to match Australian data from 1986 with the American (Brown) and British (LOB) corpora of written English from the 1960s. It includes 500 samples of published texts taken from 15 different categories of nonfiction and fiction, including newspapers, reportage, editorials, reviews; magazines and journals: popular, academic; government and corporate documents; fiction monographs and short stories (both popular and literary). ","date":"2022-02-15","objectID":"/about/sample-collections/:1:3","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Australian Radio Talkback Australian Radio Talkback (ART) is a set of transcribed recordings of samples of national, regional and commercial Australian talkback radio from 2004 to 2006. It includes 27 audio recordings and transcripts of talkback from ABC National Radio, ABC Radio broadcasts to eastern Australia, ABC Radio broadcasts to southern and western Australia, as well as commercial stations broadcasting to eastern Australia and southern and western Australia. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:4","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Braided Channels The Braided Channels research collection is constructed from some 70 hours of oral history interviews with women from Australia’s Channel Country, together with archival film, transcripts, photos and music. It includes both audiovisual recordings and transcripts of interviews. 
","date":"2022-02-15","objectID":"/about/sample-collections/:1:5","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"International Corpus of English (ICE-AUS) The Australian component of the International Corpus of English (ICE-AUS) is an approximately one million word corpus of transcribed spoken and written Australian English from 1992-1995. It consists of 500 samples of Australian English (60% speech, 40% writing) that match the structure of other corpora associated with the International Corpus of English. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:6","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"The La Trobe Corpus of Spoken Australian English The La Trobe Corpus of Spoken Australian English comprises a collection of six recordings and transcriptions of spoken interaction amongst Australian speakers of English (some in conversation with native French speakers speaking English) made in Melbourne from 2001 to 2002. ","date":"2022-02-15","objectID":"/about/sample-collections/:1:7","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"The speech of Australian adolescents: research data and recordings collected by A.G. Mitchell and Arthur Delbridge in 1959 and 1960 This dataset comprises 22,187 recordings of Australian English as spoken by 7,736 students at 330 schools across Australia and specific information about the speakers. The recordings were made on reel-to-reel tapes and were used to create the 1965 monograph The speech of Australian adolescents: a survey and the revised 1965 publication The pronunciation of English in Australia (originally published in 1946). The Australian National Corpus provided access to a sample of this material; the full dataset is now available. 
Other datasets not yet available in the portal: Sydney Speaks: This project seeks to document and explore Australian English, as spoken in Australia’s largest and most ethnically and linguistically diverse city – Sydney. From Farms to Freeways: This research project sought to analyse the experiences of women who had lived in the Blacktown and Penrith areas since the early 1950s, including their responses to social changes brought about by rapid suburbanisation in the Western Sydney region in the post-war period. Two-hour taped discussions were held with 34 women, aged 60 and over, who were in their early twenties during the Western Sydney region’s population growth. A collection of government documents in various languages. This is a very small dataset assembled to check that our technology can handle different languages and different scripts; more information about this work is available in this presentation. Work is underway to make data from other earlier projects accessible through LDaCA: Datasets from The Australian National Corpus not listed above (Monash Corpus of English, Griffith Corpus of Spoken Australian English) AusTalk Corpus of Australian English as a Second Language (AusESL) ","date":"2022-02-15","objectID":"/about/sample-collections/:1:8","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Indigenous Language Data (ILD) Portal ","date":"2022-02-15","objectID":"/about/sample-collections/:2:0","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Caroline Kelly Papers The Caroline Kelly Papers collection comprises personal and professional papers of Caroline Kelly, including correspondence; financial and legal papers; unpublished poetry and stories; theatre records and publications; anthropology field notes, reports and articles; photographs and newspaper cuttings. 
","date":"2022-02-15","objectID":"/about/sample-collections/:2:1","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Elwyn Flint Collection, UQFL173 The Elwyn Flint collection comprises written documents collected by Flint mostly as part of a long term research project in the 1960s, known as the Queensland Speech Survey, during which Flint recorded traditional Aboriginal languages. This collection also documents the Yuulngu (Gupapuyngu) language. Other parts of the collection include journal articles of languages, correspondence with field-linguists and staff from academic institutions, Indonesian material, and documentation of the English spoken by Australian Aboriginal and Torres Strait Islander peoples, Australian and regional pidgins, “migrants” and “mother-tongue” people from different ages, regions or socio-economic groups. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:2","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Fryer Library, The University of Queensland The Fryer Library manuscript collections consists of unpublished materials and includes personal papers, photographs, audio recordings, architectural plans, artworks and more. Manuscripts are generally arranged in collections according to who created or collected the material. Note that Fryer Library’s published material can also be searched through its Library Search. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:3","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"UQ Library Collection A selection of Indigenous language resources held in the University of Queensland Library. ","date":"2022-02-15","objectID":"/about/sample-collections/:2:4","tags":null,"title":"Collections","uri":"/about/sample-collections/"},{"categories":null,"content":"Principles Information about the principles on which the work of LDaCA is based. 
","date":"2022-02-15","objectID":"/data/:0:1","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Technologies Information about the technologies being used in LDaCA. ","date":"2022-02-15","objectID":"/data/:0:2","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Metadata Information about the approach to metadata being taken by LDaCA. ","date":"2022-02-15","objectID":"/data/:0:3","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Sample Collections Information about the first datasets which have been added to LDaCA. ","date":"2022-02-15","objectID":"/data/:0:4","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Case Studies Accounts by collectors of various kinds of language data illustrating the different solutions they have adopted for the problems encountered in the process. ","date":"2022-02-15","objectID":"/data/:0:5","tags":null,"title":"Language Data","uri":"/data/"},{"categories":null,"content":"Learn more about our project's structure and partnerships.","date":"2022-02-15","objectID":"/about/organisation/","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":" LDaCA is one strand of the partnership between the Australian Research Data Commons (ARDC) and the School of Languages and Cultures at The University of Queensland. This partnership includes a number of projects that explore language-related technologies, data collection infrastructure and Indigenous capability programs. These projects are being led out of the Language Technology and Data Analytics Lab (LADAL), which is overseen by Professor Michael Haugh and Dr Martin Schweinberger. The Language Data Commons of Australia (LDaCA) project receives investment from the Australian Research Data Commons (ARDC) through its HASS and Indigenous Research Data Commons program. 
","date":"2022-02-15","objectID":"/about/organisation/:0:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Partner Institutions University of Queensland: Professor Michael Haugh Australian National University: Professor Catherine Travis Monash University: Associate Professor Louisa Willoughby University of Melbourne: Associate Professor Nick Thieberger University of Sydney: Professor Monika Bednarek (Sydney Corpus Lab) AARNet: Adam Bell First Languages Australia: Beau Williams ","date":"2022-02-15","objectID":"/about/organisation/:1:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Advisory/Consultative Partners Pacific and Regional Archive for Digital Sources in Endangered Cultures (PARADISEC) Australian Digital Observatory CLARIN ","date":"2022-02-15","objectID":"/about/organisation/:2:0","tags":null,"title":"Organisation","uri":"/about/organisation/"},{"categories":null,"content":"Information about the technologies being used in LDaCA.","date":"2022-02-15","objectID":"/about/technologies/","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":" LDaCA is basing its data storage on Research Object Crates and the Oxford Common File Layout, which are described below. The overall approach is informed by the Arkisto platform, taking the view that research data has interest and value that extends beyond funding cycles and its long-term preservation and accessibility must continue to be managed. This presentation gives further details of the technical architecture. ","date":"2022-02-15","objectID":"/about/technologies/:0:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":"Research Object Crates (RO-Crate) A Research Object (RO) is a structured archive of all the items that contributed to the research outcome, including their identifiers, provenance, relations and annotations. 
RO-Crate is a lightweight approach to packaging research data with their metadata. It is based on schema.org annotations in JSON-LD, and aims to make best-practice in formal metadata description accessible and practical for use in a wide variety of situations. While RO-Crates can be considered general-purpose containers of arbitrary data and open-ended metadata, in practical use within a particular domain, application or framework, it is beneficial to further constrain RO-Crate to a specific profile: a set of conventions, types and properties that one minimally can require and expect to be present in that subset of RO-Crates. LDaCA is developing such a profile to be used for language data. Watch this video for a quick introduction to the RO-Crate concept. The RO-Crate community run a weekly drop-in call in Australia (details). ","date":"2022-02-15","objectID":"/about/technologies/:1:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":null,"content":"Oxford Common File Layout (OCFL) OCFL is a specification for laying out digital collections on file or object storage. It is designed with long-term preservation principles in mind and does not rely on specialised software. Amongst the benefits of using OCFL with RO-Crate objects are: completeness: a repository can be re-indexed from the files it stores versioning: repositories can make changes to objects and still allow their history to persist ","date":"2022-02-15","objectID":"/about/technologies/:2:0","tags":null,"title":"Technologies","uri":"/about/technologies/"},{"categories":["LDaCA"],"content":"FAIR and CARE Data is becoming increasingly important in today’s world, so corpus linguists might feel that the rest of the world is finally catching up. But the rest of the world are bringing with them new approaches to how data is handled. This means that fields such as corpus linguistics may need to reassess their practices. 
Such reassessment includes addressing concerns about how data is stored and who can access it (data stewardship) – concerns that are a part of the Open Science movement, ultimately grounded in principles of equity and accountability. The most influential approach to data stewardship today is the FAIR principles. According to these principles, data should be: Findable Metadata and data should be easy to find for both humans and computers. Accessible Once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation. Interoperable The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing. Reusable The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings. In general, corpus linguists do well on the interoperability criterion. Corpus data is usually stored in non-proprietary formats; even when some structure is imposed on the data, this is almost always in a form which is saved as a simple text file (e.g. CSV files or XML annotations). Data stored in such formats is easy to move between applications. But what about the other three criteria? Some corpus data is easy to discover; it is findable. For example, CLARIN, the portal to the European Union language resource infrastructure, provides access to many large data collections, as does the Linguistic Data Consortium in the USA. However, some data is never made part of a large collection and often remains under the control of individual researchers or research teams. Such data may be almost impossible to find. Even if we can find such data, it is unlikely to be accompanied by good descriptions of the data and metadata, making reusability problematic. 
Of course, big corpora such as the British National Corpus will be both findable and accompanied by comprehensive corpus manuals. However, it is worth considering how to make other corpora more findable, including the provision of corpus manuals or corpus descriptions. Corpus resource databases such as CoRD do aim to work towards this principle. Accessibility may also be an issue for some data. Copyright law may allow use of material for individual research but prohibit any further distribution of the material. The FAIR approach to such cases is that metadata should be available so that interested parties can know that a data holding exists (F), and the metadata will include information about the conditions under which the data may or may not be shared or reused (A and R). For linguists, there is another very important set of principles concerning data, the CARE principles developed by the Global Indigenous Data Alliance: Collective Benefit Data ecosystems shall be designed and function in ways that enable Indigenous Peoples to derive benefit from the data. Authority to control Indigenous Peoples’ rights and interests in Indigenous data must be recognised and their authority to control such data be empowered. Responsibility Those working with Indigenous data have a responsibility to share how those data are used to support Indigenous Peoples’ self-determination and collective benefit. Ethics Indigenous Peoples’ rights and wellbeing should be the primary concern at all stages of the data life cycle and across the data ecosystem. 
These principles are presented as applying particularly to Indigenous data, but we believe that researchers should adopt this approach in all cases where the people who particip","date":"2022-02-08","objectID":"/news/posts/fair-and-care/:1:0","tags":["FAIR","CARE"],"title":"What are the FAIR and CARE principles and why should corpus linguists know about them?","uri":"/news/posts/fair-and-care/"},{"categories":null,"content":"Refine your results according to a set of major metadata categories.","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Filters allow you to refine what items are available to view and search according to a set of major metadata categories and can be accessed from the left panel of the portal home page. Select a filter in order to see the dropdown list of options; the number icons beside each filter indicate how many options are available. A total of five options are visible per page; to access more options, use the pagination icons, or the search bar if you are looking for a particular option. Part of the Filter List Linguistic Genre Filter Expanded Filter List Example Image Source: LDaCA Linguistic Genre Filter Example Image Source: LDaCA Within each filter, the number icons beside each option indicate how many items are available for that particular option. For example, in the images above, there are 11 options under the Linguistic Genre filter, and there are 8098 items available for the Linguistic Genre Narrative. Once you have selected all the filters you wish to apply, select the Apply Filters button, which appears on a floating pop-up box and also above the Reset Search button. 
The top of the results section will show the filters and selected options that were applied next to the Filtering by: heading, and these can either be removed individually by clicking the × next to that filter or in bulk with the Clear Filters button. If needed, you can further select or deselect options by repeating this process with the left panel filters, at which point the Apply Filters button will reappear. ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:0:0","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Filter Categories ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:0","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Collection A group of related objects such as a corpus, a sub-corpus or items collected in a single data collection session. Examples: A COrpus of Oz Early English (COOEE) Braided Channels International Corpus of English (ICE-AUS) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:1","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Sub-Collection The sub-collections, if any, that comprise a collection. Examples: ICE: S1A: Conversations ICE: S2A: Dialogues ICE: S2B: Monologues ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:2","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Access The access conditions associated with an item. Note that these differ from the three general categories of the Access icons and are instead the specific licenses for the item’s usage. 
Examples: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) Attribution 4.0 International (CC BY 4.0) Attribution-NoDerivs 3.0 Australia (CC BY-ND 3.0 AU) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:3","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Record Type The type or categorisation of an item, i.e. a collection, object or file. For individual files, this field also gives information about the nature of the material, i.e. primary material, transcription, annotation etc. Examples: Annotation Video Transcription ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:4","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Language The language(s) which the content of the item is in. Examples: English Waanyi Vietnamese ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:5","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Communication Mode The mode(s) (spoken, written, signed, etc.) in which the content of the resource was created. Examples: WrittenLanguage SpokenLanguage SignedLanguage ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:6","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Linguistic Genre The style or form of communication to which the item can be assigned. Examples: Narrative Report Interview ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:7","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"File Format The media type and encoding format of the file. 
Examples: text/plain (.txt) application/mp4 (.mp4) audio/x-wav (.wav) ","date":"2022-01-23","objectID":"/resources/user-guides/portal/filters/:1:8","tags":null,"title":"Filters","uri":"/resources/user-guides/portal/filters/"},{"categories":null,"content":"Refine your queries with word, phrase and pattern searches.","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Basic Search Advanced Search (Search Fields, Boolean Operators, Query String Syntax, Regular Expressions, Reserved Characters, Show Query) The portal has both basic and advanced search capabilities. Search allows you to refine what items are displayed through basic queries with one or more words, or advanced queries using regular expressions to find specific patterns. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:0:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Basic Search The basic search function can be found in the top left section of the portal home page. Basic Search Image Source: LDaCA Basic Search allows you to search for: single words multiple words (i.e. two or more) where at least one of these words occur in the result (e.g. “northern territory” will return results where “northern” and “territory” occur in isolation, as well as any instances where “northern” and “territory” occur in the same item) case-insensitivity. Basic Search does not allow you to search for: exact phrases parts of words case-sensitivity. Selecting the magnifying glass icon or hitting Enter/Return on a keyboard executes a Basic Search (but note that the Enter/Return option is not available in Advanced Search). 
By default, results are sorted by relevance and ordered in descending order. Each result will display a Relevance Score based on the relevance of the result to the search query. If at least one of the query words occurs in the text field of an object, this will be highlighted in yellow; however, highlighting will not occur for matches within metadata fields such as name and description. Note that queries in Basic Search will be applied to all fields in the collections (e.g. name, description, text). If you need to refine the search field, or search for multiple words in specific fields, use the Advanced Search function. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:1:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Advanced Search The advanced search function is accessed by selecting Advanced Search below the Basic Search bar. Advanced Search Image Source: LDaCA ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:0","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Search Fields By default, queries in Advanced Search are applied to all fields; however, this can be refined using the All Fields dropdown menu on the left. The current search fields available are Name, Description, Language and Text. To search in more than one field, select Add New Line and specify the additional field you wish to search. To reset your search query, select Clear. The information entered in the Advanced Search text field is treated as part of a ‘mini-language’: your query specifications across multiple lines will be compiled as a single query string consisting of a series of search terms and operators which can be viewed by clicking the Show Query button. In general, the search text is not case-sensitive. Exceptions to this are Boolean operators (see below). 
","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:1","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Boolean Operators The standard Boolean operators AND, OR and NOT are supported in Advanced Search. These can either be added in the dropdown menu between fields when Add New Line is selected, or included within the search text field, using parentheses around the full query, e.g. (koala AND kangaroo). For instance, to search for items that contain both ‘rainbow’ and ’lorikeet’ or ‘pink’ and ‘cockatoo’ but not ‘galah’, the query should be: (((rainbow AND lorikeet) OR (pink AND cockatoo)) NOT galah) If you are using multiple Boolean operators within a single text field, you may want to group parts of the query in parentheses like in the example above, but ensure in all cases you remember the parentheses around the full query or the results will be incorrect. To search for the literal words AND, OR and NOT, add a backward slash (\\) before that word to ’escape’ it, e.g. \\OR. Note that this is a situation where the search text is case-sensitive; ‘and’ does not need to be escaped, but ‘AND’ does. Escaping will not return case-sensitive matches; it will just prevent its use as a Boolean operator. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:2","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Query String Syntax Symbol Function \" \" Use double quotation marks before and after a phrase to search for that exact phrase, e.g. \"public house\". Entries where a hyphen occurs in the text instead of a space will also be returned in a phrasal search. ~ Creates a fuzzy query to return results similar to the search term by changing, removing, inserting or transposing one or more characters. Add a number following this operator to increase the number of variations, e.g. 
brwn~2 will find instances of ‘brown’, ‘been’, ‘own’, etc. Fuzzy queries can also be applied to phrasal searches, allowing the specified words to be further apart or in a different order, e.g. \"house home\"~3 will find instances of ‘house and home’, ‘house is my home’, ‘home, the house’, etc. ? Wildcard to replace a single character. Wildcards cannot be included in a phrasal search. e.g. qu?ck will find instances of ‘quick’ and ‘quack’. * Wildcard to replace zero or more characters. Wildcards cannot be included in a phrasal search. e.g. gre* will find instances of ‘green’, ‘grew’, ‘greater’, etc. This wildcard can also be used to find related word forms e.g. ask* will find instances of ‘ask’, ‘asks’, ‘asked’ and ‘asking’. ( ) Defines a sub-expression. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:3","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Regular Expressions Some regular expression patterns can be used within the query string by surrounding the pattern in forward slashes, e.g. /gr[ae]y/ or /honou*r/. Currently, regular expressions can only be used for complete word searches and not for partial words or phrases. This search function does not support full Perl-compatible regex syntax. For more information see: RegExp Syntax. ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:4","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Reserved Characters The following are reserved characters (i.e. part of the ‘mini-language’) that can have specific search functions and may need ’escaping’ with \\ if you want to search for the literal characters. + − = \u0026\u0026 ; || \u003e \u003c ! ( ) { } [ ] ^ \" ~ * ? 
: \\ / ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:5","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Show Query If you need to check your search query against what is actually sent to the back-end search engine, select Show Query. For example, setting the search field to Language and searching for Danish has the following query string: ( language.name.@value: Danish ) ","date":"2021-01-23","objectID":"/resources/user-guides/portal/search/:2:6","tags":null,"title":"Search","uri":"/resources/user-guides/portal/search/"},{"categories":null,"content":"Sort and order the results in the portal.","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). The portal search engine allows you to sort and order all items of a query according to the options below. By default, items are sorted by Collections first, and are in Descending order. 
","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:0:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"Sort By Collections Relevance Name ","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:1:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"Order By Ascending Descending ","date":"2020-01-23","objectID":"/resources/user-guides/portal/sort-and-order/:2:0","tags":null,"title":"Sort and Order","uri":"/resources/user-guides/portal/sort-and-order/"},{"categories":null,"content":"A guide to logging in to the portal.","date":"2019-01-29","objectID":"/resources/user-guides/portal/login/","tags":null,"title":"Login","uri":"/resources/user-guides/portal/login/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Some collections require that you log in either to access, or apply for access, to the data. This can be done by selecting Login from the top menu bar. Portal Top Menu Options Image Source: LDaCA You will be taken to the LDaCAData login page. Select Sign in with CILogon. LDaCA Logon Image Source: LDaCA You will be taken to the CILogon page which: provides details of the Consent to Attribute Release allows you to select an identity provider. CILogon facilitates secure access to cyber infrastructure (CI). In order to use the CILogon Service, you must first select an identity provider. An identity provider is an organisation where you have an account and can log in to gain access to online services. Note that the identity provider defaults to ORCID but this can be changed to your preferred authentication. If you are a faculty, staff or student member of a university or college, please select that for your identity provider. 
If your school is not listed, please contact help@cilogon.org, and they will try to add your school in the future. You can also select your ORCID, Google, GitHub or Microsoft account to authenticate for the CILogon Service. For further details about CILogon and the identity providers it supports, see CILogon FAQ. CILogon Image Source: LDaCA First select your preferred identity provider in the dropdown before clicking Log On (there is also a search bar available to more easily find your selection). If you want the system to remember your selection on subsequent visits, select Remember this selection. CILogon: Select an Identity Provider Image Source: LDaCA Once your login is successful, you will be redirected back to the portal page and the Login section of the top menu bar will instead display your account name. Portal: Logged In View Image Source: LDaCA Hovering your cursor on the arrow next to your account name will trigger a dropdown menu with the following two options: User Information: View your user details and memberships to collections that have more specific access conditions. Logout: Select to log out of your account. ","date":"2019-01-29","objectID":"/resources/user-guides/portal/login/:0:0","tags":null,"title":"Login","uri":"/resources/user-guides/portal/login/"},{"categories":null,"content":"A guide to viewing and applying for access to collections in the portal.","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). 
Apply for Access to Collections with Access Restrictions Apply for a Data License View your Current Access List ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:0:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"Apply for Access to Collections with Access Restrictions When viewing an item that has restricted access, you will see one of the following permission warnings, depending on whether or not you are logged in to the portal: You do not have permission to see these files. Sign up or Login You do not have permission to see these files. You are logged in and you can apply for permission to view these files apply for access or refresh permissions Restricted Access Example Image Source: LDaCA Once you have logged in to the LDaCA portal (see Login for further details), select apply for access in the warning text and you will be directed to the LDaCA-REMS login page in a new tab. REMS (Resource Entitlement Management System) is an open-source tool for managing access rights to resources such as research datasets. REMS is used by many research and government organisations. In this document, LDaCA-REMS refers to the specific way REMS is implemented in the LDaCA environment. LDaCA-REMS Interface Welcome Page Image Source: LDaCA Select Login and follow the CILogon prompts (see Login again). Once logged in, you will be presented with the license application form containing the following five sections: State: Shows the status of your application across three states: Apply: Displays a tick mark when you click on the Accept Terms button in the Licenses section. Approval: Displays a tick mark when you click Submit application in the Actions section. Approved: Displays a tick mark when your application is approved. 
Applicants: Shows the name of the applicant and an Invite member button that allows the applicant to send a link to the license application page to another person; the invitee will be prompted to log into LDaCA-REMS. Resources: Provides a link to the license document. Licenses: Provides a link for the applicant to view the license and an Accept license button to accept the terms. Actions: Contains features actionable by the applicant, including Submit application, an important function to select during the application process. LDaCA-REMS Interface Submit Application Image Source: LDaCA You will also notice the three top menu items, Catalogue, Applications and About: Catalogue: Lists resources to which you can apply for access. Applications: Lists all your applications, each with details such as the application ID, resource name, state and activity dates. About: A brief description of the LDaCA-REMS tool. ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:1:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"Apply for a Data License Data licensing is the mechanism used in LDaCA-REMS for data stewards to grant or deny legal permission to access and use data. The image below shows how to apply for a data license in three steps: Go to Licenses to read the license terms. Clicking on the link will open these in a new tab. Then, still under Licenses, select Accept license. Go to Actions and select Submit application to activate the approval process. LDaCA-REMS Application Submission Steps Image Source: LDaCA Note that the application form above illustrates a basic license application procedure. Access restrictions and procedures for application approval will vary across collections (e.g. requiring applicants to complete a questionnaire, or to upload material explaining intended use of the data). 
Data stewards can set up the license application approval as an automated process (using bots) which takes 1-2 minutes to complete. Other data stewards might prefer to receive, review and approve or reject the application themselves, or designate a colleague for the task. Understandably, human-mediated approval is likely to take longer than an automated process. Every applicant will be sent an email notification confirming the submission of a license application, and another email confirming approval. Once approved, you should see all three tick marks under State: LDaCA-REMS State: Application Approved Image Source: LDaCA Navigate back to the data object you require. If the warning text is still displayed, click on refresh permissions. The page will update and you should be able to access the data. Note that these permissions apply to other data objects in the collection that are covered by the same license terms; you do not need to apply for permission for each data object. ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:2:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"View your Current Access List To view your current list of licenses you have access to, hover your cursor over the arrow next to your account name to trigger a dropdown menu and select User Information. If you are not logged in, see the steps under Login. The User Memberships section shows your current access list, and you can click Check Memberships to refresh this list. You can also select Verify your access in REMS to log into LDaCA-REMS. 
LDaCA-REMS User Memberships Image Source: LDaCA ","date":"2018-01-29","objectID":"/resources/user-guides/portal/collection-access/:3:0","tags":null,"title":"Collection Access","uri":"/resources/user-guides/portal/collection-access/"},{"categories":null,"content":"A guide to downloading individual files from the portal.","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). ","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/:0:0","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":"Download Individual Files Files from a collection can be downloaded from two locations: the File section of an Object page the File page for the selected item. To download a file from an Object page, navigate to the Files section at the end of the page (ensuring you click on the arrow to expand the relevant file) and select Download File on the right-hand side of the file entry. Object Page: Download File Image Source: LDaCA To download a file from a File page, navigate to the end of the page and select Download File. 
File Page: Download File Image Source: LDaCA ","date":"2017-01-29","objectID":"/resources/user-guides/portal/download-data/:1:0","tags":null,"title":"Download Data","uri":"/resources/user-guides/portal/download-data/"},{"categories":null,"content":"A guide to citing collections and data accessed through the portal.","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":" This user guide uses ‘portal’ to refer to the interface across all of the available Oni portals (see Available Portals for more details). Citing a Collection Citing Parts of a Collection Citing Published Work Citing LDaCA as Access Source Follow this guide when you need to cite a data source that you accessed through the LDaCA portal in a publication. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:0:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing a Collection Cite the collection as a whole whenever you use, or need to refer to, any part of it in your work, unless otherwise stated in the citation information provided in the LDaCA portal. You can find each collection’s citation information on the collection level page under “How to cite this collection”. The collection level page is the webpage presented upon selection of the collection name from the portal homepage or from a list of search results. Note that this citation information is separate from the Citation metadata field that appears in some collections and refers to published papers about the collection. The citation information is not in any specific format or style (e.g. 
APA, MLA); it is meant to provide the essential citation elements for a minimal bibliographic reference (Tromsø Recommendations 2019), which are: Author, Date, Title, Publisher, Locator. However, the citation information given may be what the data steward has approved or specified in the data license. For the definitions of the above and additional citation elements, and further advice on citing linguistic data, refer to The Tromsø Recommendations for Citation of Research Data in Linguistics. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:1:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing Parts of a Collection You can mention particular parts of the collection (e.g. objects, files, accompanying materials) you used in the text of your work, accompanied by the in-text citation, e.g. Author, Date (or equivalent footnote if using a numbered style). Note that the manner of citing a collection will most likely depend on the context of your research project, and guidance from your school or publisher may take precedence over the information provided here. ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:2:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing Published Work Descriptive metadata at the collection level may include a Citation field that contains a link to, or citation information for, a published work. The published work is typically also authored by the data collection creator and explains the research work and processes behind the collection of the data. LDaCA provides this information as specified by the data collection creator. In cases like this, cite the publication in addition to citing the data collection. 
As an example, the COrpus of Oz Early English (COOEE) shows a Citation field with a link to the thesis that explains the research concepts, planning, protocols, processes etc. in the creation of the COOEE collection. The user should cite both the collection and the thesis. COOEE Citation Metadata Field Example Image Source: LDaCA ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:3:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"},{"categories":null,"content":"Citing LDaCA as Access Source If you discovered or accessed the data collection that you used in your work through the LDaCA portal, please cite or acknowledge LDaCA as follows: Data accessed via the Language Data Commons of Australia data portal https://data.ldaca.edu.au on [yyyy-mm-dd]. (The Language Data Commons of Australia is a nationally funded partnership project by the Australian Research Data Commons, The University of Queensland, Australian National University, The University of Melbourne, The University of Sydney, Monash University, First Languages Australia and AARNet). ","date":"2016-04-16","objectID":"/resources/user-guides/portal/cite-data/:4:0","tags":null,"title":"Cite Data","uri":"/resources/user-guides/portal/cite-data/"}] \ No newline at end of file diff --git a/news/events/index.html b/news/events/index.html index 2b50253d..77d9eb89 100644 --- a/news/events/index.html +++ b/news/events/index.html @@ -1,5 +1,5 @@ Events - LDaCA -
+

Events


Recurring Events   Previous Events   -Webinars   


Indigenous Data Governance: A discussion

Part of The University of Queensland Research and Innovation Week 2024

The University of Queensland is committed to growing the opportunities for Indigenous people to lead and conduct their own research. One important strand in developing this innovative research culture is ensuring that appropriate governance is in place for Indigenous data.

This event will present a panel of Indigenous researchers, data custodians and data stewards discussing current developments in this field including:

  • the importance of Indigenous Cultural and Intellectual Property concerns
  • governance of Indigenous Data Framework for the HASS & Indigenous Research Data Commons
  • tools and methods for data governance.

Panellists: Rose Barrowcliffe, Robert McLellan, Lesley Acres. The discussion will be facilitated by Grant Sarra.

When: 4–6pm, 30 September 2024

Where: Anthropology Museum, Michie Building, The University of Queensland (St Lucia Campus)

Details


Data Migration Skills Workshop

The Language Data Commons of Australia infrastructure is based on widely accepted standards such as Research Object Crates and the Oxford Common File Layout. The project has been running for more than three years and the processes and tools for aligning data with these standards are now well developed. This workshop aims to show the application of those tools to data in a variety of formats to efficiently migrate material to the LDaCA standards.

The hands-on workshop is intended for data librarians and other professionals who are interested in these issues, but participation is open to anyone who is prepared to work on language data. Participants will have some experience of working with code and/or metadata and will commit to bringing datasets that they work with (or are responsible for) to use in the practical exercises, which will make up a large part of the workshop. Example datasets will also be available, and we are open to participants working in teams based on complementary skills.

Leader: Peter Sefton

When: 3–5 September 2024

Where: Engma Room (Room 3.165), Coombs Building, Australian National University

Further information: If you are interested in participating in this workshop, please contact Peter Sefton.


Recurring Events

RO-Crate Clinic Drop-in

The RO-Crate community run a weekly drop-in call in Australia. For further information contact Peter Sefton.

When: Weekly, Thursday 14:00 AEST

Where: Online via Zoom


Previous Events

Previous Workshops


If your university or organisation would like to host a workshop, please contact us.


Webinars

Our webinar series was a joint initiative with the Language Technology and Data Analysis Laboratory (LADAL) at the School of Languages and Cultures, The University of Queensland.


Australian Digital Observatory Online Office Hours

LDaCA has previously run regular online office hours, jointly hosted with the Australian Digital Observatory (ADO). This activity will not continue in 2024, but LDaCA and ADO are developing an alternative model to help Australian researchers working with linguistics, text analytics, digital and computational methods, social media and web archives.

In the meantime, you are welcome to contact us by email at ldaca@uq.edu.au with your technical questions, research problems and rough ideas to get advice and feedback from the combined expertise of our ARDC research infrastructure projects. No question is too small, and even if we don’t know the answer we are likely to be able to point you to someone who does.





- -

Resources for Data Contributors

A collection of resources for people interested in collecting, organising and caring for data, as well as contributing data to LDaCA. -




-
\ No newline at end of file diff --git a/resources/data-contributors/index.xml b/resources/data-contributors/index.xml deleted file mode 100644 index ad7dc12b..00000000 --- a/resources/data-contributors/index.xml +++ /dev/null @@ -1 +0,0 @@ -Resources for Data Contributors on LDaCAhttps://www.ldaca.edu.au/resources/data-contributors/Recent content in Resources for Data Contributors on LDaCAHugoen-usinfo@ldaca.edu.au (Language Data Commons of Australia)info@ldaca.edu.au (Language Data Commons of Australia) \ No newline at end of file diff --git a/resources/data-users/index.xml b/resources/data-users/index.xml deleted file mode 100644 index c849084b..00000000 --- a/resources/data-users/index.xml +++ /dev/null @@ -1 +0,0 @@ -Resources for Data Users on LDaCAhttps://www.ldaca.edu.au/resources/data-users/Recent content in Resources for Data Users on LDaCAHugoen-usinfo@ldaca.edu.au (Language Data Commons of Australia)info@ldaca.edu.au (Language Data Commons of Australia) \ No newline at end of file diff --git a/resources/index.html b/resources/index.html index 0dff1964..f8c704a5 100644 --- a/resources/index.html +++ b/resources/index.html @@ -10,6 +10,6 @@ CancelHomeAbout ▾Resources ▾News ▾Contact -




\ No newline at end of file diff --git a/sitemap.xml b/sitemap.xml index 65c9e03c..77d8d3b2 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -1 +1 @@ -https://www.ldaca.edu.au/resources/user-guides/crate-o/general/2024-05-22T15:43:09+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q3-2024/2024-01-04T10:37:28+10:00https://www.ldaca.edu.au/resources/ldaca-resources/2023-11-01T15:01:37+11:00https://www.ldaca.edu.au/about/principles/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/resources/general-resources/2023-11-01T16:54:49+11:00https://www.ldaca.edu.au/resources/data-contributors/2023-11-01T16:54:49+11:00https://www.ldaca.edu.au/resources/data-users/2023-11-01T16:54:49+11:00https://www.ldaca.edu.au/resources/user-guides/crate-o/basic-navigation/2023-04-22T15:42:05+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q2-2024/2023-02-04T10:37:28+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q1-2024/2023-12-03T10:37:28+10:00https://www.ldaca.edu.au/resources/user-guides/2023-05-14T11:23:57+10:00https://www.ldaca.edu.au/resources/user-guides/crate-o/ro-crate-creation/2022-03-27T16:01:02+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q4-2023/2023-08-14T10:37:28+10:00https://www.ldaca.edu.au/resources/user-guides/crate-o/spreadsheet-upload/2021-02-28T13:53:43+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q3-2023/2023-07-13T10:37:28+10:00https://www.ldaca.edu.au/2024-08-13T00:00:00+00:00https://www.ldaca.edu.au/news/posts/gdrf-2024/2024-08-13T00:00:00+00:00https://www.ldaca.edu.au/news/posts/cultural-heritage/2024-07-09T16:24:18+10:00https://www.ldaca.edu.au/news/posts/persistent-identifiers/2024-07-03T09:45:50+10:00https://www.ldaca.edu.au/categories/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/crate-o/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-repositories-2024-crate-o/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-repositories-2024-ro-crate/2024-06-05T00:00:00+00:00https://ww
w.ldaca.edu.au/categories/ldaca/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/open-repositories/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/ro-crate/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-repositories-2024-pilars/2024-06-04T00:00:00+00:00https://www.ldaca.edu.au/tags/pilars/2024-06-04T00:00:00+00:00https://www.ldaca.edu.au/news/posts/idil/2024-05-23T09:45:50+10:00https://www.ldaca.edu.au/resources/general-resources/case-studies/masters-research-project/2024-05-15T16:17:33+10:00https://www.ldaca.edu.au/resources/user-guides/portal/2024-05-14T12:23:27+10:00https://www.ldaca.edu.au/resources/user-guides/crate-o/2024-05-14T12:23:17+10:00https://www.ldaca.edu.au/news/publications/2024-04-19T14:17:24+11:00https://www.ldaca.edu.au/news/posts/kalin-stefanov/2024-04-11T14:29:40+10:00https://www.ldaca.edu.au/resources/general-resources/case-studies/data-mangement-appen/2024-03-22T15:12:12+11:00https://www.ldaca.edu.au/news/posts/ausnc/2024-03-22T15:12:12+11:00https://www.ldaca.edu.au/resources/user-guides/portal/available-portals/2024-01-29T11:45:16+11:00https://www.ldaca.edu.au/news/posts/gdrf/2024-01-10T09:00:00+10:00https://www.ldaca.edu.au/about/faqs/2023-12-12T17:13:28+10:00https://www.ldaca.edu.au/resources/ldaca-resources/ldaca-software-tools/2023-11-22T14:51:49+11:00https://www.ldaca.edu.au/resources/ldaca-resources/licenses/2023-11-15T12:05:32+11:00https://www.ldaca.edu.au/resources/general-resources/case-studies/2023-11-13T16:02:05+11:00https://www.ldaca.edu.au/resources/general-resources/case-studies/fieldwork-png/2023-11-13T15:55:01+11:00https://www.ldaca.edu.au/resources/general-resources/case-studies/sydney-speaks/2023-11-13T15:53:30+11:00https://www.ldaca.edu.au/resources/ldaca-resources/metadata/2023-11-13T15:49:31+11:00https://www.ldaca.edu.au/resources/general-resources/publications/2023-11-06T14:19:16+11:00https://www.ldaca.edu.au/resources/general-resource
s/organisations/2023-11-06T14:18:49+11:00https://www.ldaca.edu.au/resources/general-resources/language-archives/2023-11-06T14:18:36+11:00https://www.ldaca.edu.au/resources/general-resources/ethics/2023-11-06T14:18:11+11:00https://www.ldaca.edu.au/resources/general-resources/language-projects/2023-11-06T14:18:01+11:00https://www.ldaca.edu.au/resources/general-resources/indigenous-knowledge/2023-11-06T14:17:24+11:00https://www.ldaca.edu.au/resources/general-resources/useful-tools/2023-11-01T16:27:35+11:00https://www.ldaca.edu.au/about/2023-10-18T11:20:35+10:00https://www.ldaca.edu.au/news/2023-10-18T11:20:35+10:00https://www.ldaca.edu.au/news/events/2023-10-18T11:12:10+10:00https://www.ldaca.edu.au/news/newsletter/2023-10-18T11:11:52+10:00https://www.ldaca.edu.au/news/posts/2023-10-18T11:10:21+10:00https://www.ldaca.edu.au/resources/ldaca-resources/guidance-for-data-governance-decisions/2023-10-18T12:09:10+11:00https://www.ldaca.edu.au/news/posts/interview-2/2023-10-18T07:51:43+11:00https://www.ldaca.edu.au/resources/ldaca-resources/data-onboarding-process/2023-10-03T13:59:34+11:00https://www.ldaca.edu.au/resources/ldaca-resources/access-policy/2023-10-03T12:49:58+11:00https://www.ldaca.edu.au/resources/ldaca-resources/determining-access-conditions/2023-10-03T12:07:19+11:00https://www.ldaca.edu.au/resources/glossary/2023-10-03T12:07:09+11:00https://www.ldaca.edu.au/resources/ldaca-resources/obtaining-a-doi/2023-09-15T15:42:57+10:00https://www.ldaca.edu.au/tags/access/2023-06-29T00:00:00+00:00https://www.ldaca.edu.au/news/posts/arkisto-stack-or-2023/2023-06-29T00:00:00+00:00https://www.ldaca.edu.au/news/posts/interview-1/2023-04-26T15:28:35+10:00https://www.ldaca.edu.au/resources/user-guides/portal/basic-navigation/2023-01-29T11:45:30+11:00https://www.ldaca.edu.au/tags/australia/2022-12-07T15:28:35+10:00https://www.ldaca.edu.au/tags/language-data/2022-12-07T15:28:35+10:00https://www.ldaca.edu.au/news/posts/data-map/2022-12-07T15:28:35+10:00https://www.ldaca.edu.au/news
/posts/ldaca-metadata-ecosystem-eresearch-2022/2022-11-25T00:00:00+00:00https://www.ldaca.edu.au/tags/metadata/2022-11-25T00:00:00+00:00https://www.ldaca.edu.au/news/posts/fair-care-eresearch-2022/2022-11-16T00:00:00+00:00https://www.ldaca.edu.au/designer/2022-11-07T17:13:28+10:00https://www.ldaca.edu.au/contact/2022-10-28T17:13:28+10:00https://www.ldaca.edu.au/news/posts/rdc-tech-meeting/2022-06-08T15:28:35+10:00https://www.ldaca.edu.au/tags/repositories/2022-06-08T15:28:35+10:00https://www.ldaca.edu.au/about/sample-collections/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/data/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/about/organisation/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/resources/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/about/technologies/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/tags/care/2022-02-08T15:28:35+10:00https://www.ldaca.edu.au/tags/fair/2022-02-08T15:28:35+10:00https://www.ldaca.edu.au/news/posts/fair-and-care/2022-02-08T15:28:35+10:00https://www.ldaca.edu.au/resources/user-guides/portal/filters/2022-01-23T14:23:10+11:00https://www.ldaca.edu.au/resources/user-guides/portal/search/2021-01-23T14:23:27+11:00https://www.ldaca.edu.au/resources/user-guides/portal/sort-and-order/2020-01-23T15:02:22+11:00https://www.ldaca.edu.au/resources/user-guides/portal/login/2019-01-29T11:38:37+11:00https://www.ldaca.edu.au/resources/user-guides/portal/collection-access/2018-01-29T11:38:46+11:00https://www.ldaca.edu.au/resources/user-guides/portal/download-data/2017-01-29T11:39:04+11:00https://www.ldaca.edu.au/resources/user-guides/portal/cite-data/2016-04-16T13:09:22+10:00 \ No newline at end of file 
+https://www.ldaca.edu.au/resources/user-guides/crate-o/general/2024-05-22T15:43:09+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q3-2024/2024-01-04T10:37:28+10:00https://www.ldaca.edu.au/resources/ldaca-resources/2023-11-01T15:01:37+11:00https://www.ldaca.edu.au/about/principles/2022-02-15T17:13:28+10:00https://www.ldaca.edu.au/resources/general-resources/2023-11-01T16:54:49+11:00https://www.ldaca.edu.au/resources/user-guides/crate-o/basic-navigation/2023-04-22T15:42:05+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q2-2024/2023-02-04T10:37:28+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q1-2024/2023-12-03T10:37:28+10:00https://www.ldaca.edu.au/resources/user-guides/2023-05-14T11:23:57+10:00https://www.ldaca.edu.au/resources/user-guides/crate-o/ro-crate-creation/2022-03-27T16:01:02+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q4-2023/2023-08-14T10:37:28+10:00https://www.ldaca.edu.au/resources/user-guides/crate-o/spreadsheet-upload/2021-02-28T13:53:43+10:00https://www.ldaca.edu.au/news/newsletter/newsletter-q3-2023/2023-07-13T10:37:28+10:00https://www.ldaca.edu.au/2024-08-13T00:00:00+00:00https://www.ldaca.edu.au/news/posts/gdrf-2024/2024-08-13T00:00:00+00:00https://www.ldaca.edu.au/news/posts/cultural-heritage/2024-07-09T16:24:18+10:00https://www.ldaca.edu.au/news/posts/persistent-identifiers/2024-07-03T09:45:50+10:00https://www.ldaca.edu.au/categories/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/crate-o/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-repositories-2024-crate-o/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-repositories-2024-ro-crate/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/categories/ldaca/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/open-repositories/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/ro-crate/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/tags/2024-06-05T00:00:00+00:00https://www.ldaca.edu.au/news/posts/open-reposi
tories-2024-pilars/ 2024-06-04T00:00:00+00:00
https://www.ldaca.edu.au/tags/pilars/ 2024-06-04T00:00:00+00:00
https://www.ldaca.edu.au/news/posts/idil/ 2024-05-23T09:45:50+10:00
https://www.ldaca.edu.au/resources/general-resources/case-studies/masters-research-project/ 2024-05-15T16:17:33+10:00
https://www.ldaca.edu.au/resources/user-guides/portal/ 2024-05-14T12:23:27+10:00
https://www.ldaca.edu.au/resources/user-guides/crate-o/ 2024-05-14T12:23:17+10:00
https://www.ldaca.edu.au/news/publications/ 2024-04-19T14:17:24+11:00
https://www.ldaca.edu.au/news/posts/kalin-stefanov/ 2024-04-11T14:29:40+10:00
https://www.ldaca.edu.au/resources/general-resources/case-studies/data-mangement-appen/ 2024-03-22T15:12:12+11:00
https://www.ldaca.edu.au/news/posts/ausnc/ 2024-03-22T15:12:12+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/available-portals/ 2024-01-29T11:45:16+11:00
https://www.ldaca.edu.au/news/posts/gdrf/ 2024-01-10T09:00:00+10:00
https://www.ldaca.edu.au/about/faqs/ 2023-12-12T17:13:28+10:00
https://www.ldaca.edu.au/resources/ldaca-resources/ldaca-software-tools/ 2023-11-22T14:51:49+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/licenses/ 2023-11-15T12:05:32+11:00
https://www.ldaca.edu.au/resources/general-resources/case-studies/ 2023-11-13T16:02:05+11:00
https://www.ldaca.edu.au/resources/general-resources/case-studies/fieldwork-png/ 2023-11-13T15:55:01+11:00
https://www.ldaca.edu.au/resources/general-resources/case-studies/sydney-speaks/ 2023-11-13T15:53:30+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/metadata/ 2023-11-13T15:49:31+11:00
https://www.ldaca.edu.au/resources/general-resources/publications/ 2023-11-06T14:19:16+11:00
https://www.ldaca.edu.au/resources/general-resources/organisations/ 2023-11-06T14:18:49+11:00
https://www.ldaca.edu.au/resources/general-resources/language-archives/ 2023-11-06T14:18:36+11:00
https://www.ldaca.edu.au/resources/general-resources/ethics/ 2023-11-06T14:18:11+11:00
https://www.ldaca.edu.au/resources/general-resources/language-projects/ 2023-11-06T14:18:01+11:00
https://www.ldaca.edu.au/resources/general-resources/indigenous-knowledge/ 2023-11-06T14:17:24+11:00
https://www.ldaca.edu.au/resources/general-resources/useful-tools/ 2023-11-01T16:27:35+11:00
https://www.ldaca.edu.au/about/ 2023-10-18T11:20:35+10:00
https://www.ldaca.edu.au/news/ 2023-10-18T11:20:35+10:00
https://www.ldaca.edu.au/news/events/ 2023-10-18T11:12:10+10:00
https://www.ldaca.edu.au/news/newsletter/ 2023-10-18T11:11:52+10:00
https://www.ldaca.edu.au/news/posts/ 2023-10-18T11:10:21+10:00
https://www.ldaca.edu.au/resources/ldaca-resources/guidance-for-data-governance-decisions/ 2023-10-18T12:09:10+11:00
https://www.ldaca.edu.au/news/posts/interview-2/ 2023-10-18T07:51:43+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/data-onboarding-process/ 2023-10-03T13:59:34+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/access-policy/ 2023-10-03T12:49:58+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/determining-access-conditions/ 2023-10-03T12:07:19+11:00
https://www.ldaca.edu.au/resources/glossary/ 2023-10-03T12:07:09+11:00
https://www.ldaca.edu.au/resources/ldaca-resources/obtaining-a-doi/ 2023-09-15T15:42:57+10:00
https://www.ldaca.edu.au/tags/access/ 2023-06-29T00:00:00+00:00
https://www.ldaca.edu.au/news/posts/arkisto-stack-or-2023/ 2023-06-29T00:00:00+00:00
https://www.ldaca.edu.au/news/posts/interview-1/ 2023-04-26T15:28:35+10:00
https://www.ldaca.edu.au/resources/user-guides/portal/basic-navigation/ 2023-01-29T11:45:30+11:00
https://www.ldaca.edu.au/tags/australia/ 2022-12-07T15:28:35+10:00
https://www.ldaca.edu.au/tags/language-data/ 2022-12-07T15:28:35+10:00
https://www.ldaca.edu.au/news/posts/data-map/ 2022-12-07T15:28:35+10:00
https://www.ldaca.edu.au/news/posts/ldaca-metadata-ecosystem-eresearch-2022/ 2022-11-25T00:00:00+00:00
https://www.ldaca.edu.au/tags/metadata/ 2022-11-25T00:00:00+00:00
https://www.ldaca.edu.au/news/posts/fair-care-eresearch-2022/ 2022-11-16T00:00:00+00:00
https://www.ldaca.edu.au/designer/ 2022-11-07T17:13:28+10:00
https://www.ldaca.edu.au/contact/ 2022-10-28T17:13:28+10:00
https://www.ldaca.edu.au/takedown/ 2022-10-28T17:13:28+10:00
https://www.ldaca.edu.au/news/posts/rdc-tech-meeting/ 2022-06-08T15:28:35+10:00
https://www.ldaca.edu.au/tags/repositories/ 2022-06-08T15:28:35+10:00
https://www.ldaca.edu.au/about/sample-collections/ 2022-02-15T17:13:28+10:00
https://www.ldaca.edu.au/data/ 2022-02-15T17:13:28+10:00
https://www.ldaca.edu.au/about/organisation/ 2022-02-15T17:13:28+10:00
https://www.ldaca.edu.au/resources/ 2022-02-15T17:13:28+10:00
https://www.ldaca.edu.au/about/technologies/ 2022-02-15T17:13:28+10:00
https://www.ldaca.edu.au/tags/care/ 2022-02-08T15:28:35+10:00
https://www.ldaca.edu.au/tags/fair/ 2022-02-08T15:28:35+10:00
https://www.ldaca.edu.au/news/posts/fair-and-care/ 2022-02-08T15:28:35+10:00
https://www.ldaca.edu.au/resources/user-guides/portal/filters/ 2022-01-23T14:23:10+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/search/ 2021-01-23T14:23:27+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/sort-and-order/ 2020-01-23T15:02:22+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/login/ 2019-01-29T11:38:37+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/collection-access/ 2018-01-29T11:38:46+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/download-data/ 2017-01-29T11:39:04+11:00
https://www.ldaca.edu.au/resources/user-guides/portal/cite-data/ 2016-04-16T13:09:22+10:00
\ No newline at end of file
diff --git a/resources/data-users/index.html b/takedown/index.html
similarity index 52%
rename from resources/data-users/index.html
rename to takedown/index.html
index ac8d579a..02e6dd68 100644
--- a/resources/data-users/index.html
+++ b/takedown/index.html
@@ -1,5 +1,5 @@
-Resources for Data Users - LDaCA
-
+Takedown Form - LDaCA
+

Resources for Data Users

A collection of resources for people interested in finding, accessing and using data. -




Takedown Form

If you believe that something on our website or data portal shouldn’t be available publicly, please review our takedown policy and submit the form below.



-
\ No newline at end of file