-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path11_chapter11_video.Rmd
1082 lines (805 loc) · 47.2 KB
/
11_chapter11_video.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
output: html_document
---
# Videos {#chapter11}
<center>
{width=25%}
</center>
The schema we propose to document video files is a combination of elements extracted from the [Dublin Core Metadata Initiative](https://dublincore.org/) (DCMI) and from the [VideoObject (from schema.org)](https://schema.org/VideoObject) schemas. This schema is very similar to the schema we proposed for audio files (see chapter 10).
The Dublin Core is a generic and versatile standard, which we also use (in an augmented form) for the documentation of *Documents* (Chapter 4), *Images* (Chapter 9), and *Audio* files (chapter 10). It contains 15 core elements, to which we added a selection of elements from VideoObject. We also included the elements `keywords`, `topics`, `tags`, `provenance` and `additional` that are found in other schemas documented in the Guide.
The resulting metadata schema is simple, but it contains the elements needed to document the resources and their content in a way that will foster their discoverability in data catalogs. Compliance with the VideoObject elements contributes to search engine optimization, as search engines like Google, Bing and others "reward" metadata published in formats compatible with the schema.org recommendations.
<br>
```json
{
"repositoryid": "string",
"published": 0,
"overwrite": "no",
"metadata_information": {},
"video_description": {},
"provenance": [],
"tags": [],
"lda_topics": [],
"embeddings": [],
"additional": { }
}
```
<br>
When published in a NADA catalog, the metadata related to video files will appear in a specific tab.
<br>
{width=100%}
<br>
## Augmenting video metadata
Videos typically come with limited metadata. To make them more discoverable, a transcription of the video content can be generated, stored, and indexed in the catalog. The metadata schema we propose includes an element `transcription` that can store transcriptions (and possibly their automatically-generated translations) in the video metadata. Word embedding models and topic models can be applied to the transcriptions to further augment the metadata. This will significantly increase the discoverability of the resource, and offer the possibility to apply semantic searchability on video metadata.
Machine learning speech-to-text solutions are available (although not for all languages) to automatically generate transcriptions at a low cost. This includes commercial applications like [Whisper by openAI](https://openai.com/research/whisper), [Microsoft Azure](https://azure.microsoft.com/en-us/services/cognitive-services/speech-to-text/), or [Amazon Transcribe](https://aws.amazon.com/transcribe/pricing/). Open source solutions in Python also exist.
Transcriptions of videos published on Youtube are available on-line (the example below was extracted from https://www.youtube.com/watch?v=Axs8NPVYmms).
<br>
{width=100%}
<br>
Note that some care must be taken when adding automatic speech transcriptions into your metadata, as the transcriptions are not always perfect and may return unexpected results. This will be the case when the sound quality is low, or when the video includes sections in an unknown language (see the example below, of a video in English that includes a brief segmnent in Somali; the speech-to-text algorithm may in such case attempt to transcribe text it does not recognize, returning invalid information).
<br>
{width=100%}
<br>
## Schema description
The first three elements of the schema (`repositoryid`, `published`, and `overwrite`) are not part of the video metadata. They are parameters used to indicate how the video metadata will be published in a NADA catalog.
- **`repositoryid`** identifies the collection in which the metadata will be published. By default, the metadata will be published in the central catalog. To publish them in a collection, the collection must have been previously created in NADA.
- **`published`**: Indicates whether the metadata must be made visible to visitors of the catalog. By default, the value is 0 (unpublished). This value must be set to 1 (published) to make the metadata visible.
- **`overwrite`**: Indicates whether metadata that may have been previously uploaded for the same video can be overwritten. By default, the value is "no". It must be set to "yes" to overwrite existing information. Note that a video will be considered as being the same as a previously uploaded one if the identifier provided in the metadata element `video_description > idno` is the same.
### Metadata information
**`metadata_information`** *[Optional ; Not Repeatable]*
The metadata information set is used to document the video <u>metadata</u> (not the video itself). This provides information useful for archiving purposes. This set is optional. It is recommended however to enter at least the identification and affiliation of the metadata producer, and the date of creation of the metadata. One reason for this is that metadata can be shared and harvested across catalogs/organizations, so metadata produced by one organization can be found in other data centers.
<br>
```json
"metadata_information": {
"title": "string",
"idno": "string",
"producers": [
{
"name": "string",
"abbr": "string",
"affiliation": "string",
"role": "string"
}
],
"production_date": "string",
"version": "string"
}
```
<br>
- **`title`** *[Optional ; Not Repeatable ; String]* <br>
The title of the video.<br>
- **`idno`** *[Optional ; Not Repeatable ; String]* <br>
A unique identifier for the metadata document (unique in the catalog; ideally also unique globally). This is different from the video unique ID (see `idno` element in section *video_description* below), although it is good practice to generate identifiers that would maintain an easy connection between the metadata `idno` element and the video `idno` found under `video_description` (see below).<br>
- **`producers`** *[Optional ; Repeatable]* <br>
This refers to the producer(s) of the metadata, NOT to the producer(s) of the video. This could for example be the data curator in a data center.<br>
- **`name`** *[Optional ; Not repeatable ; String]* <br>
Name of the metadata producer/curator. An alternative to entering the name of the curator (e.g. for privacy protection purpose) is to enter the curator ID (see the element *abbr* below)<br>
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
Can be used to provide an ID of the metadata producer/curator.<br>
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
Affiliation of the metadata producer/curator.<br>
- **`role`** *[Optional ; Not repeatable ; String]* <br>
Specific role of the metadata producer/curator.<br>
- **`production_date`** *[Optional ; Not repeatable ; String]* <br>
Date the metadata (not the table) was produced.<br>
- **`version`** *[Optional ; Not repeatable ; String]* <br>
Version of the metadata (not version of the table).<br>
### Video description
**`video_description`** *[Required ; Not Repeatable]* <br>
The `video_description` section contains all elements that will be used to describe the video and its content. These are the elements that will be indexed and made searchable when published in a data catalog.
- **`idno`** *[Mandatory, Not Repeatable ; String]* <br>
`idno` is an identification number that is used to uniquely identify a video in a catalog. It will also help users of the data cite the video properly. The best option is to obtain a [Digital Object Identifier (DOI)](https://www.doi.org/) for the video, as it will ensure that the ID is unique globally. Alternatively, it can be an identifier constructed by an organization using a consistent scheme. Note that the schema allows you to provide more than one identifier for a video (see `identifiers` below). This element maps to the “identifier” element in the Dublin Core.
- **`identifiers`** *[Optional ; Repeatable]* <br>
<br>
```json
"identifiers": [
{
"type": "string",
"identifier": "string"
}
]
```
<br>
This element is used to enter video identifiers other than the `idno` element described above). It can for example be a Digital Object Identifier (DOI). Note that the identifier entered in `idno` can be repeated here, allowing to attach a "type" attribute to it.
- **`type`** *[Optional ; Not repeatable ; String]* <br>
The type of unique identifier, e.g., "DOI".
- **`value`** *[Required ; Not repeatable ; String]* <br>
The identifier. <br><br>
- **`title`** *[Required ; Not repeatable ; String]* <br>
The title of the video. This element maps to the element *caption* in VideoObject.
- **`alt_title`** *[Optional ; Not repeatable ; String]* <br>
An alias for the video title. This element maps to the element *alternateName* in VideoObject.
- **`description`** *[Optional ; Not repeatable ; String]* <br>
A brief description of the video, typically about a paragraph long (around 150 to 250 words). This element maps to the element *abstract* in VideoObject.
- **`genre`** *[Optional ; Repeatable ; String]* <br>
The genre of the video, broadcast channel or group. This is a VideoObject element. A controlled vocabulary can be used.
- **`keywords`** *[Optional ; Repeatable]* <br>
<br>
```json
"keywords": [
{
"name": "string",
"vocabulary": "string",
"uri": "string"
}
]
```
<br>
A list of keywords that provide information on the core content of the video. Keywords provide a convenient solution to improve the discoverability of the video, as it allows terms and phrases not found elsewhere in the video metadata to be indexed and to make the video discoverable by text-based search engines. A controlled vocabulary will preferably be used (although not required), such as the [UNESCO Thesaurus](http://vocabularies.unesco.org/browser/thesaurus/en/). The list can combine keywords from multiple controlled vocabularies, and user-defined keywords.
- **`name`** *[Required ; Not repeatable ; String]* <br>
The keyword itself.
- **`vocabulary`** *[Optional ; Not repeatable ; String]* <br>
The controlled vocabulary (including version number or date) from which the keyword is extracted, if any.
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
The URL of the controlled vocabulary from which the keyword is extracted, if any.<br><br>
```{r, indent=" ", eval=F, echo=T}
my_video <- list(
# ... ,
video_description = list(
# ... ,
keywords = list(
list(name = "Migration",
vocabulary = "Unesco Thesaurus (June 2021)",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
list(name = "Migrants",
vocabulary = "Unesco Thesaurus (June 2021)",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
list(name = "Refugee",
vocabulary = "Unesco Thesaurus (June 2021)",
uri = "http://vocabularies.unesco.org/browser/thesaurus/en/page/concept427"),
list(name = "Forced displacement"),
list(name = "Internally displaced population (IDP)")
),
# ...
),
# ...
)
```
<br>
- **`topics`** *[Optional ; Repeatable]* <br>
<br>
```json
"topics": [
{
"id": "string",
"name": "string",
"parent_id": "string",
"vocabulary": "string",
"uri": "string"
}
]
```
<br>
Information on the topics covered in the video. A controlled vocabulary will preferably be used, for example the [CESSDA Topics classification](https://vocabularies.cessda.eu/vocabulary/TopicClassification), a typology of topics available in 11 languages; or the [Journal of Economic Literature (JEL) Classification System](https://en.wikipedia.org/wiki/JEL_classification_codes), or the [World Bank topics classification](https://documents.worldbank.org/en/publication/documents-reports/docadvancesearch). Note that you may use more than one controlled vocabulary. This element is a block of five fields:<br>
- **`id`** *[Optional ; Not repeatable ; String]* <br>
The identifier of the topic, taken from a controlled vocabulary.<br>
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name (label) of the topic, preferably taken from a controlled vocabulary.<br>
- **`parent_id`** *[Optional ; Not repeatable ; String]* <br>
The parent identifier of the topic (identifier of the item one level up in the hierarchy), if a hierarchical controlled vocabulary is used.<br>
- **`vocabulary`** *[Optional ; Not repeatable ; String]* <br>
The name (including version number) of the controlled vocabulary used, if any.<br>
- **`uri`** *[Optional ; Not repeatable ; String]* <br>
The URL to the controlled vocabulary used, if any. <br><br>
```{r, indent=" ", eval=F, echo=T}
my_video <- list(
# ... ,
video_description = list(
# ... ,
topics = list(
list(name = "Demography.Migration",
vocabulary = "CESSDA Topic Classification",
uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
list(name = "Demography.Censuses",
vocabulary = "CESSDA Topic Classification",
uri = "https://vocabularies.cessda.eu/vocabulary/TopicClassification"),
list(id = "F22",
name = "International Migration",
parent_id = "F2 - International Factor Movements and International Business",
vocabulary = "JEL Classification System",
uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
list(id = "O15",
name = "Human Resources - Human Development - Income Distribution - Migration",
parent_id = "O1 - Economic Development",
vocabulary = "JEL Classification System",
uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
list(id = "O12",
name = "Microeconomic Analyses of Economic Development",
parent_id = "O1 - Economic Development",
vocabulary = "JEL Classification System",
uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J"),
list(id = "J61",
name = "Geographic Labor Mobility - Immigrant Workers",
parent_id = "J6 - Mobility, Unemployment, Vacancies, and Immigrant Workers",
vocabulary = "JEL Classification System",
uri = "https://www.aeaweb.org/econlit/jelCodes.php?view=jel#J")
),
# ...
),
)
```
<br>
- **`persons`** *[Optional ; Repeatable]* <br>
<br>
```json
"persons": [
{
"name": "string",
"role": "string"
}
]
```
<br>
A list of persons who appear in the video.<br>
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the person.<br>
- **`role`** *[Optional ; Not repeatable, String]* <br>
The role of the person mentioned in `name`.<br><br>
```{r, indent=" ", eval=F, echo=T}
my_video <- list(
metadata_information = list(
# ...
),
video_description = list(
# ... ,
persons = list(
list(name = "John Smith",
role = "Keynote speaker"),
list(name = "Jane Doe",
role = "Debate moderator")
),
# ...
)
```
<br>
- **`main_entity`** *[Optional ; Not repeatable ; String]* <br>
Indicates the primary entity described in the video. This element maps to the element `mainEntity` in VideoObject.
- **`date_created`** *[Optional, Not Repeatable ; String]* <br>
The date the video was created. It is recommended to enter the date in the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY). The date the video is created refers to the date that the video was produced and considered ready for dissemination.
- **`date_published`** *[Optional, Not Repeatable ; String]* <br>
The date the video was published. It is recommended to use the ISO 8601 format (YYYY-MM-DD or YYYY-MM or YYYY).
- **`version`** *[Optional, Not Repeatable ; String]* <br>
The version of the video refers to the published version of the video.
- **`status`** *[Optional ; Not repeatable, String]* <br>
The status of the video in terms of its stage in a lifecycle. A controlled vocabulary should be used. Example terms include {`Incomplete, Draft, Published, Obsolete`}. Some organizations define a set of terms for the stages of their publication lifecycle. This element maps to the element *creativeWorkStatus* in VideoObject.
- **`country`** *[Optional ; Repeatable]* <br>
<br>
```json
"country": [
{
"name": "string",
"code": "string"
}
]
```
<br>
The list of countries (or regions) covered by the video, if applicable. This refers to the <u>content</u> of the video, not to the country where the video was released. This is a repeatable block of two elements:
- **`name`** *[Required ; Not repeatable ; String]* <br>
The country/region name. Note that many organizations have their own policies on the naming of countries/regions/economies/territories, which data curators will have to comply with.
- **`code`** *[Optional ; Not repeatable ; String]* <br>
The country/region code (entered as a string, even for numeric codes). It is recommended to use a standard list of countries and regions, such as the ISO country list ([ISO 3166](https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes)).
<br>
- **`spatial_coverage`** *[Optional ; Not repeatable ; String]* <br>
Indicates the place(s) which are depicted or described in the video. This element maps to the element `contentLocation` in VideoObject. This element complements the `ref_country` element. It can be used to qualify the geographic coverage of the video, in the form of a free text.
- **`content_reference_time`** *[Optional ; Not repeatable ; String]* <br>
The specific time described by the video, for works that emphasize a particular moment within an event. This element maps to the element `contentReferenceTime` in VideoObject.
- **`temporal_coverage`** *[Optional ; Not repeatable ; String]* <br>
Indicates the period that the video applies to, i.e. that it describes, either as a DateTime or as a textual string indicating a time period in ISO 8601 time interval format. This element maps to the element `temporalCoverage` in VideoObject.
- **`recorded_at`** *[Optional ; Not repeatable ; String]* <br>
This element maps to the element `recordedAt` in VideoObject schema. It identifies the event where the video was recorded (e.g., a conference, or a demonstration).
- **`audience`** *[Optional ; Not repeatable ; String]* <br>
A brief description of the intended audience of the video, i.e. the group for whom it was created.
- **`bbox`** *[Optional ; Repeatable]* <br>
<br>
```json
"bbox": [
{
"west": "string",
"east": "string",
"south": "string",
"north": "string"
}
]
```
<br>
This element is used to define one or multiple bounding box(es), which are the (rectangular) fundamental geometric description of the geographic coverage of the video. A bounding box is defined by west and east longitudes and north and south latitudes, and includes the largest geographic extent of the video's geographic coverage. The bounding box provides the geographic coordinates of the top left (north/west) and bottom-right (south/east) corners of a rectangular area. This element can be used in catalogs as the first pass of a coordinate-based search.
- **`west`** *[Required ; Not repeatable ; String]* <br>
West longitude of the box
- **`east`** *[Required ; Not repeatable ; String]* <br>
East longitude of the box
- **`south`** *[Required ; Not repeatable ; String]* <br>
South latitude of the box
- **`north`** *[Required ; Not repeatable ; String]* <br>
North latitude of the box
<br>
- **`language`** *[Optional, Repeatable]* <br>
<br>
```json
"language": [
{
"name": "string",
"code": "string"
}
]
```
<br>
Most videos will only be provided in one language. This is however a repeatable field, to allow for more than one language to be listed. For the language code, ISO codes will preferably be used. The language refers to the language in which the video is published. This is a block of two elements (at least one must be provided for each language):
- **`name`** *[Optional ; Not repeatable ; String]* <br>
The name of the language.
- **`code`** *[Optional ; Not repeatable ; String]* <br>
The code of the language. The use of [ISO 639-2](https://www.loc.gov/standards/iso639-2/php/code_list.php) (the alpha-3 code in Codes for the representation of names of languages) is recommended. Numeric codes must be entered as strings. <br><br>
- **`creator`** *[Optional, Not repeatable ; String]* <br>
Organization or person who created/authored the video.
- **`production_company`** *[Optional, Not repeatable ; String]* <br>
The production company or studio responsible for the item. This element maps to the element *productionCompany* in VideoObject.
- **`publisher`** *[Optional, Not repeatable ; String]* <br>
```{r, indent=" ", eval=F, echo=T}
my_video = list(
# ... ,
video_description = list(
# ... ,
publisher = "@@@@@",
# ...
)
)
```
<br>
- **`repository`** *[Optional ; Not repeatable ; String]* <br>
The name of the repository (organization).
- **`contacts`** *[Optional, Repeatable]* <br>
Users of the video may need further clarification and information. This section may include the name-affiliation-email-URI of one or multiple contact persons. This block of elements will identify contact persons who can be used as resource persons regarding problems or questions raised by the user community. The URI attribute should be used to indicate a URN or URL for the homepage of the contact individual. The email attribute is used to indicate an email address for the contact individual. It is recommended to avoid putting the actual name of individuals. The information provided here should be valid for the long term. It is therefore preferable to identify contact persons by a title. The same applies for the email field. Ideally, a "generic" email address should be provided. It is easy to configure a mail server in such a way that all messages sent to the generic email address would be automatically forwarded to some staff members.
<br>
```json
"contacts": [
{
"name": "string",
"role": "string",
"affiliation": "string",
"email": "string",
"telephone": "string",
"uri": "string"
}
]
```
<br>
- **`name`** *[Required, Not repeatable, String]* <br>
Name of a person or unit (such as a data help desk). It will usually be better to provide a title/function than the actual name of the person. Keep in mind that people do not stay forever in their position.
- **`role`** *[Optional, Not repeatable, String]* <br>
The specific role of `name`, in regards to supporting users. This element is used when multiple names are provided, to help users identify the most appropriate person or unit to contact.
- **`affiliation`** *[Optional, Not repeatable, String]* <br>
Affiliation of the person/unit.
- **`email`** *[Optional, Not repeatable, String]* <br>
E-mail address of the person.
- **`telephone`** *[Optional, Not repeatable, String]* <br>
A phone number that can be called to obtain information or provide feedback on the table. This should never be a personal phone number; a corporate number (typically of a data help desk) should be provided.
- **`uri`** *[Optional, Not repeatable, String]* <br>
A link to a website where contact information for `name` can be found.<br><br>
- **`contributors`** *[Optional, Repeatable]* <br>
<br>
```json
"contributors": [
{
"name": "string",
"affiliation": "string",
"abbr": "string",
"role": "string",
"uri": "string"
}
]
```
<br>
Identifies the person(s) and/or organization(s) who contributed to the production of the video. The `role` attribute allows defining what the specific contribution of the identified person or organization was. <br>
- **`name`** *[Optional, Not Repeatable ; String]* <br>
The name of the contributor (person or organization).
- **`affiliation`** *[Optional, Not Repeatable ; String]* <br>
The affiliation of the contributor.
- **`abbr`** *[Optional, Not Repeatable ; String]* <br>
The abbreviation for the institution which has been listed as the affiliation of the contributor.
- **`role`** *[Optional, Not Repeatable ; String]* <br>
The specific role of the contributor. This could for example be "Cameraman", "Sound engineer", etc.
- **`uri`** *[Optional, Not Repeatable ; String]* <br>
A URI (link to a website, or email address) for the contributor. <br><br>
```{r, indent=" ", eval=F, echo=T}
my_video = list(
# ... ,
video_description = list(
# ... ,
contributors = list(
list(
name = "",
affiliation = "",
abbr = "",
role = "",
uri = "")
),
# ...
)
)
```
- **`sponsors`** *[Optional ; Repeatable]* <br>
<br>
```json
"sponsors": [
{
"name": "string",
"abbr": "string",
"grant": "string",
"role": "string"
}
]
```
<br>
This element is used to list the funders/sponsors of the video. If different funding agencies financed different stages of the production process, use the "role" attribute to distinguish them.
- **`name`** *[Required ; Not repeatable ; String]* <br>
The name of the sponsor (person or organization)
- **`abbr`** *[Optional ; Not repeatable ; String]* <br>
The abbreviation (acronym) of the sponsor.
- **`grant`** *[Optional ; Not repeatable ; String]* <br>
The grant (or contract) number.
- **`role`** *[Optional ; Not repeatable ; String]* <br>
The specific role of the sponsor. <br><br>
- **`translators`** *[Optional ; Repeatable]* <br>
<br>
```json
"translators": [
{
"first_name": "string",
"initial": "string",
"last_name": "string",
"affiliation": "string"
}
]
```
<br>
Organization or person who adapted the video to different languages. This element maps to the element *translator* in VideoObject.
- **`first_name`** *[Optional ; Not repeatable ; String]* <br>
The first name of the translator.
- **`initial`** *[Optional ; Not repeatable ; String]* <br>
The initials of the translator.
- **`last_name`** *[Optional ; Not repeatable ; String]* <br>
The last name of the translator.
- **`affiliation`** *[Optional ; Not repeatable ; String]* <br>
The affiliation of the translator. <br><br>
- **`is_based_on`** *[Optional ; Not repeatable, String]* <br>
A resource from which this video is derived or from which it is a modification or adaption. This element maps to the element *isBasedOn* in VideoObject.
- **`is_part_of`** *[Optional ; Not repeatable, String]* <br>
Indicates another video that this video is part of. This element maps to the element *isPartOf* in VideoObject.
- **`relations`** *[Optional ; Repeatable, String]* <br>
<br>
```json
"relations": [
"string"
]
```
<br>
Defines, as a free text field, the relation between the video being documented and other resources. This is a Dublin Core element.
- **`video_provider`** *[Optional ; Not repeatable, String]* <br>
<br>
{width=100%}
<br>
The person or organization who provides the video. This element maps to the element *provider* in VideoObject.
- **`video_url`** *[Optional ; Not repeatable, String]* <br>
URL of the video. This element maps to the element *url* in VideoObject.
- **`embed_url`** *[Optional ; Not repeatable, String]* <br>
A URL pointing to a player for a specific video. This element maps to the element *embedUrl* in VideoObject. For example, "https://www.youtube.com/embed/7Aif1xjstws"
To be embedded, a video must be hosted on a video sharing platform like Youtube (www.youtube.com). To obtain the "embed link" from youtube, click on the "Share" button, then "Embed". In the result box, select the content of the element `src = `.
<br>
{width=100%}
<br>
- **`encoding_format`** *[Optional ; Not repeatable, String]* <br>
The video file format, typically expressed using a MIME format. This element corresponds to the "encodingFormat" element of VideoObject and maps to the element *format* of the Dublin Core.
- **`duration`** *[Optional ; Not repeatable, String]* <br>
The duration of the item (movie, audio recording, event, etc.) in ISO 8601 format. This element is a VideoObject element.
ISO 8601 durations are expressed using the following format, where (n) is replaced by the value for each of the date and time elements that <u>follows</u> the (n). For example: (3)H means 3 hours.
:::note
**`P(n)Y(n)M(n)DT(n)H(n)M(n)S`**
Where:<br>
- P is the **Period designator** and is always placed at the beginning of the duration<br>
- (n)Y represents the number of years<br>
- (n)M represents the number of months<br>
- (n)W represents the number of weeks<br>
- (n)D represents the number of days<br>
- T is the **Time designator** and always precedes the time components<br>
- (n)H represents the number of hours<br>
- (n)M represents the number of minutes<br>
- (n)S represents the number of seconds<br>
For example, **P1Y2M20DT3H30M8S** represents a duration of one year, two months, twenty days, three hours, thirty minutes, and eight seconds.
Date and time elements including their designator may be omitted if their value is zero, and lower-order elements may also be omitted for reduced precision. For example, "P23DT23H" and "P4Y" are both acceptable duration representations.
As *M* can represent both Month and Minutes, the time designator *T* is used. For example, "P1M" is a one-month duration and "PT1M" is a one-minute duration.
This information on the ISO 8601 was adapted from [wikipedia](https://en.wikipedia.org/wiki/ISO_8601) where more detailed information can be found.
:::
- **`rights`** *[Optional ; Not repeatable, String]* <br>
A textual description of the rights associated to the video. If a copyright is available, the three following elements will be used instead of this element.
- **`copyright_holder`** *[Optional ; Not repeatable, String]* <br>
The party holding the legal copyright to the video. This element corresponds to the "copyrightHolder" element of VideoObject.
- **`copyright_notice`** *[Optional ; Not repeatable, String]* <br>
Text of a notice appropriate for describing the copyright aspects of the video, ideally indicating the owner of the copyright. This element corresponds to the "copyrightNotice" element of VideoObject.
- **`copyright_year`** *[Optional ; Not repeatable, String]* <br>
The year during which the claimed copyright for the video was first asserted. This element corresponds to the "copyrightYear" element of VideoObject.
- **`credit_text`** *[Optional ; Not repeatable, String]* <br>
This element can be used to credit the person(s) and/or organization(s) associated with a published video. This element corresponds to the "creditText" element of VideoObject.
- **`citation`** *[Optional ; Not repeatable, String]* <br>
This element provides a required or recommended citation of the audio file.
- **`transcript`** *[Optional ; Repeatable, String]* <br>
<br>
```json
"transcript": [
{
"language_name": "string",
"language_code": "string",
"text": "string"
}
]
```
<br>
The transcript of the video content, provided as a text. Note that if the text is very long, an alternative is to save it in a separate text file and to make it available in a data catalog as an external resource.
- **`language_name`** *[Optional ; Not repeatable ; String]* <br>
The name of the language of the transcript.
- **`language_code`** *[Optional ; Not repeatable ; String]* <br>
The code of the language of the transcript, preferably the ISO code.
- **`text`** *[Optional ; Not repeatable ; String]* <br>
The transcript itself. Adding the transcript in the metadata will make the video much more discoverable, as the content of the transcription can be indexed in catalogs.
- **`media`** *[Optional ; Repeatable ; String]* <br>
<br>
```json
"media": [
"string"
]
```
<br>
A description of the media on which the recording is stored (other than the online file format); e,g., "CD-ROM".
- **`album`** *[Optional ; Repeatable]* <br>
<br>
```json
"album": [
{
"name": "string",
"description": "string",
"owner": "string",
"uri": "string"
}
]
```
<br>
When a video is published in a catalog containing many other videos, it may be desirable to organize them by album. Albums are collections of videos organized by theme, period, location, or other criteria. One video can belong to more than one album. Albums are "virtual collections".
- **`name`** *[Optional ; Not Repeatable ; String]* <br>
The name (label) of the album.
- **`description`** *[Optional ; Not Repeatable ; String]* <br>
A brief description of the album.
- **`owner`** *[Optional ; Not Repeatable ; String]* <br>
The owner of the album.
- **`uri`** *[Optional ; Not Repeatable ; String]* <br>
A link (URL) to the album.
- **`provenance`** *[Optional ; Repeatable]* <br>
Metadata can be programmatically harvested from external catalogs. The `provenance` group of elements is used to store information on the provenance of harvested metadata, and on alterations that may have been done to the harvested metadata. These elements are NOT part of the IPTC or DCMI metadata standard.<br>
<br>
```json
"provenance": [
{
"origin_description": {
"harvest_date": "string",
"altered": true,
"base_url": "string",
"identifier": "string",
"date_stamp": "string",
"metadata_namespace": "string"
}
}
]
```
<br>
- **`origin_description`** *[Required ; Not repeatable]* <br>
The `origin_description` elements are used to describe when and from where metadata have been extracted or harvested. <br>
- **`harvest_date`** *[Required ; Not repeatable ; String]* <br>
The date and time the metadata were harvested, in ISO 8601 format.<br>
- **`altered`** *[Optional ; Not repeatable ; Boolean]* <br>
A boolean variable ("true" or "false"; "true by default) indicating whether the harvested metadata have been modified before being re-published. In many cases, the unique identifier of the study (element `idno` in the Study Description / Title Statement section) will be modified when published in a new catalog.<br>
- **`base_url`** *[Required ; Not repeatable ; String]* <br>
The URL from where the metadata were harvested.<br>
- **`identifier`** *[Optional ; Not repeatable ; String]* <br>
The unique dataset identifier (`idno` element) in the source catalog. When harvested metadata are re-published in a new catalog, the identifier will likely be changed. The `identifier` element in `provenance` is used to maintain traceability.<br>
- **`date_stamp`** *[Optional ; Not repeatable ; String]* <br>
The datestamp (in UTC date format) of the metadata record in the originating repository (this should correspond to the date the metadata were last updated in the source catalog).<br>
- **`metadata_namespace`** *[Optional ; Not repeatable ; String]* <br>
@@@@@@@ <br><br>
- **`tags`** *[Optional ; Repeatable]* <br>
As shown in section 1.7 of the Guide, tags, when associated with `tag_groups`, provide a powerful and flexible solution to enable custom facets (filters) in data catalogs. See section 1.7 for an example in R.
<br>
```json
"tags": [
{
"tag": "string",
"tag_group": "string"
}
]
```
<br>
- **`tag`** *[Required ; Not repeatable ; String]* <br>
A user-defined tag.
- **`tag_group`** *[Optional ; Not repeatable ; String]* <br><br>
A user-defined group (optional) to which the tag belongs. Grouping tags allows implementation of controlled facets in data catalogs.
- **`lda_topics`** *[Optional ; Not repeatable]* <br>
We mentioned in Chapter 1 the importance of producing rich metadata, and the opportunities that machine learning offers to enrich (or "augment") metadata in a largely automated manner. One application of machine learning, more specifically of natural language processing, to enrich metadata related to publications is the topic extraction using Latent Dirichlet Allocation (LDA) models. LDA models must be trained on large corpora of documents. They do not require any pre-defined taxonomy of topics. The approach consists of "clustering" words that are likely to appear in similar contexts (the number of "clusters" or "topics" is a parameter provided when training a model). Clusters of related words form "topics". A topic is thus defined by a list of keywords, each one of them provided with a score indicating its importance in the topic. Typically, the top 10 words that represent a topic will be used to describe it. The description of the topics covered by a document can be indexed to improve searchability (possibly in a selective manner, by setting thresholds on the topic shares and word weights).
Once an LDA topic model has been trained, it can be used to infer the topic composition of any text. In the case of indicators and time series, this text will be a concatenation of some metadata elements including the series' name, definitions, keywords, concepts, and possibly others. This inference will then provide the share that each topic represents in the metadata. The sum of all represented topics is 1 (100%).
<br>
```json
"lda_topics": [
{
"model_info": [
{
"source": "string",
"author": "string",
"version": "string",
"model_id": "string",
"nb_topics": 0,
"description": "string",
"corpus": "string",
"uri": "string"
}
],
"topic_description": [
{
"topic_id": null,
"topic_score": null,
"topic_label": "string",
"topic_words": [
{
"word": "string",
"word_weight": 0
}
]
}
]
}
]
```
<br>
The `lda_topics` element includes the following metadata fields.
- **`model_info`** *[Optional ; Not repeatable]* <br>
Information on the LDA model.<br>
- `source` *[Optional ; Not repeatable ; String]* <br>
The source of the model (typically, an organization).<br>
- `author` *[Optional ; Not repeatable ; String]* <br>
The author(s) of the model.<br>
- `version` *[Optional ; Not repeatable ; String]* <br>
The version of the model, which could be defined by a date or a number.<br>
- `model_id` *[Optional ; Not repeatable ; String]* <br>
The unique ID given to the model.<br>
- `nb_topics` *[Optional ; Not repeatable ; Numeric]* <br>
The number of topics in the model (the number of topics to be extracted from a corpus is the key parameter of any LDA model).<br>
- `description` *[Optional ; Not repeatable ; String]* <br>
A brief description of the model.<br>
- `corpus` *[Optional ; Not repeatable ; String]* <br>
A brief description of the corpus on which the LDA model was trained.<br>
- `uri` *[Optional ; Not repeatable ; String]* <br>
A link to a web page where additional information on the model is available.<br><br>
- **`topic_description`** *[Optional ; Repeatable]* <br>
The topic composition extracted from selected elements of the series metadata (typically, the name, definitions, and concepts).<br>
- `topic_id` *[Optional ; Not repeatable ; String]* <br>
The identifier of the topic; this will often be a sequential number (Topic 1, Topic 2, etc.).<br>
- `topic_score` *[Optional ; Not repeatable ; Numeric]* <br>
The share of the topic in the metadata (%).<br>
- `topic_label` *[Optional ; Not repeatable ; String]* <br>
The label of the topic, if any (not automatically generated by the LDA model).<br>
- `topic_words` *[Optional ; Not repeatable]* <br>
The list of N keywords describing the topic (e.g., the top 5 words).<br>
- `word` *[Optional ; Not repeatable ; String]* <br>
The word.<br>
- `word_weight` *[Optional ; Not repeatable ; Numeric]* <br>
The weight of the word in the definition of the topic.<br><br>
- **`embeddings`** *[Optional ; Repeatable]* <br>
In Chapter 1 (section 1.n), we briefly introduced the concept of word embeddings and their use in implementation of semantic search tools. Word embedding models convert text (words, phrases, documents) into large-dimension numeric vectors (e.g., a vector of 100 or 200 numbers) that are representative of the semantic content of the text. The vectors are generated by submitting a text to a pre-trained word embedding model (possibly via an API).
The word vectors do not have to be stored in the series/indicator metadata to be exploited by search engines. When a semantic search tool is implemented in a catalog, the vectors will be stored in a database end processed by a tool like Milvus. A metadata element is however provided to store the vectors for preservation and sharing purposes. This block of metadata elements is repeatable, allowing multiple vectors to be stored. When using vectors in a search engine, it is critical to only use vectors generated by one same model.
<br>
```json
"embeddings": [
{
"id": "string",
"description": "string",
"date": "string",
"vector": null
}
]
```
<br>
The `embeddings` element contains four metadata fields:
- **`id`** *[Optional ; Not repeatable ; String]* <br>
A unique identifier of the word embedding model used to generate the vector.
- **`description`** *[Optional ; Not repeatable ; String]* <br>
A brief description of the model. This may include the identification of the producer, a description of the corpus on which the model was trained, the identification of the software and algorithm used to train the model, the size of the vector, etc.
- **`date`** *[Optional ; Not repeatable ; String]* <br>
The date the model was trained (or a version date for the model).
- **`vector`** *[Required ; Not repeatable ; @@@@]*
The numeric vector representing the video metadata. <br><br>
- **`additional`** *[Optional ; Not repeatable]*<br>
The `additional` element allows data curators to add their own metadata elements to the schema. All custom elements must be added within the `additional` block; embedding them elsewhere in the schema would cause schema validation to fail.
<br>
{width=100%}
<br>
## Complete example
### In R
```{r, eval=F, echo=T}
library(nadar)
# ----------------------------------------------------------------------------------
# Enter credentials (API confidential key) and catalog URL
my_keys <- read.csv("C:/confidential/my_API_keys.csv", header=F, stringsAsFactors=F)
set_api_key("my_keys[1,1")
set_api_url("https://.../index.php/api/")
set_api_verbose(FALSE)
# ----------------------------------------------------------------------------------
setwd("C:/my_videos")
id = "MDA_VDO_001"
thumb = "vdo_001.jpg"
# Generate the metadata
my_video = list(
metadata_information = list(
title = "Mogadishu, Somalia: A Call for Help",
idno = id,
producers = list(
list(name = "John Doe", affiliation = "National Library")
),
production_date = "2021-09-03"
),
video_description = list(
idno = id,
title = "Mogadishu, Somalia: A Call for Help",
alt_title = "Somalia: Guterres in Mogadishu",
date_published = "2011-09-01",
description = "During a landmark visit, the United Nations High Commissioner for Refugees calls on the international community to rapidly increase aid to Somalia.",