From e8e75be6008a41d8de39b747e4d82697f63c3927 Mon Sep 17 00:00:00 2001
From: Moritz Twente <127845092+mtwente@users.noreply.github.com>
Date: Wed, 2 Oct 2024 18:03:20 +0200
Subject: [PATCH] Abstract Formatting (#72)

---
 submissions/405/index.qmd | 36 ++++++++++++++++++++++++------------
 1 file changed, 24 insertions(+), 12 deletions(-)

diff --git a/submissions/405/index.qmd b/submissions/405/index.qmd
index 29dc6b9..01c44b8 100644
--- a/submissions/405/index.qmd
+++ b/submissions/405/index.qmd
@@ -25,22 +25,34 @@ bibliography: references.bib
 
 ## Introduction
 
-Data-driven approaches bring extensive opportunities for research to analyze large volumes of data, and gain new knowledge and insights. This is considered especially beneficial for implementation in the humanities and social sciences [@weichselbraun2021]. Application of data-driven research methodologies in the field of history requires a sufficient source base, which should be accurate, transparently shaped and large enough for robust analysis [@braake2016].
-Web archives preserve valuable resources that can be drawn upon to analyze the development of the websites and even the whole domains through the decades, and provide access to them [@brugger2018]. At first glance, the volumes of data captured are impressive and suggest the opportunity for big data research practices. For example, the Web Crawls collection of the Internet Archive alone includes 80.2 PB of data [@webcrawls2024]. At the same time, the web-archived collections expose a set of other characteristics relevant to big data and this can be challenging for their efficient use. Such features include, for instance, a high level of velocity, exhaustive in scope and diverse in variety [@kitchin2014], which require addressing and resolving specific issues.
+Data-driven approaches bring extensive opportunities for research to analyze large volumes of data, and gain new knowledge and insights. This is considered especially beneficial for implementation in the humanities and social sciences [@weichselbraun2021]. Application of data-driven research methodologies in the field of history requires a sufficient source base, which should be accurate, transparently shaped and large enough for robust analysis [@braake2016].
+
+Web archives preserve valuable resources that can be drawn upon to analyze the development of the websites and even the whole domains through the decades, and provide access to them [@brugger2018]. At first glance, the volumes of data captured are impressive and suggest the opportunity for big data research practices. For example, the Web Crawls collection of the Internet Archive alone includes 80.2 PB of data [@webcrawls2024]. At the same time, the web-archived collections expose a set of other characteristics relevant to big data and this can be challenging for their efficient use. Such features include, for instance, a high level of velocity, exhaustive in scope and diverse in variety [@kitchin2014], which require addressing and resolving specific issues.
+
 This research focuses on museums’ presence on the web, describes opportunities for implementation of data-driven research, and identifies challenges faced by researchers. In particular, in the paper the opportunity to extract data, to investigate the complexity of structure of the archived websites, and to analyze the content are addressed. At the same time, the findings are relevant to other studies devoted to the use of the archived web in computational research in the humanities.
 
 ## Data-Driven Research of the Museums’ Web Presence
 
-Proliferation of digital technologies and the World Wide Web have profoundly impacted museums, transforming their functions and engagement practices. To comprehend these changes, a thorough examination of museum websites and their historical evolution is essential. The focus of this research is on the application of a data-driven approach to the history of the Metropolitan Museum of Art (MET) and the National Museum of Australia (NMA). The cases were selected from two publicly available web archives – the Internet Archive [@internetarchive2024] and the Australian Web Archive, Trove [@trove2024] – that are the oldest web archives which means they preserve the historical web from the same starting point, making it possible to compare their infrastructures and their use in research.
-When working with web archives, the main strategies to apply data-driven methods to research the websites’ history refer to obtaining data from the general pool or using special collections. In the first scenario, scholars obtain data captured during the general crawling process which is not specifically curated. In this case, there is a large chance that the website is not crawled systematically and in terms of the full depth of hierarchy of the pages. The second scenario assumes more rigorous preservation practices that result in a more comprehensive dataset. Analyzing the MET and the NMA, the researcher may use different approaches to obtaining data. Studying the MET’s history on the web, the researcher can search for data from the Internet Archive and also from the special collection devoted to the MET in the Archive-It project [@metropolitan2024]. However, the special collection was only initiated in 2019. To study the previous years the researcher needs to necessarily apply to the general pool of the Internet Archive. To investigate the web presence of the NMA scholars may mainly observe the data from the Internet Archive and from Trove as there is no special collection of the preserved web devoted to the NMA.
-Both web archives offer open APIs to obtain large datasets suitable for data-driven research [@api2024; @troveapi2024]. However, even obtaining data is challenging. The statistics of the Internet Archive show that the preserved version of the MET museum website [@met2024] was saved 20,519 times between November 11, 1996, and July 28, 2023, including 10,867,395 captures of text/HTML files, corresponding to 6,559,761 URLs [@summary_met2023]. The statistics for the National Museum of Australia on the Internet Archive show 353,134 captures of text/HTML files, which relates to 175,999 URLs [@summary_nma2023]. At the same time, the attempt to download all the timestamps using the Wayback Machine Downloader returns only 1,342,067 files for the MET website. The same is relevant for the NMA website obtained from the Internet Archive. The stage of obtaining the dataset requires more attention to the APIs and gaining reliable data.
-Difficulties related to obtaining data and building datasets have been addressed partially by creating research infrastructures to work with web archived materials. The Internet Archive introduced several initiatives to collect, store and provide access to preserved materials and process data such as Archive-It [@archiveit2024] and ARCH (Archives Research Compute Hub [@arch2024]). The GLAM Workbench [@glam2024] has been created to analyze materials from the Australian Web Archive, the Internet Archive and several other web archives. Initially focused on Australia and New Zealand digital platforms the GLAM Workbench suggests a range of solutions based on the use of Jupyter notebooks for exploration and usage of data from GLAM institutions including web archival data. These infrastructures support researchers in finding solutions of various problems in obtaining and processing data, opening them up to wide opportunities to explore the archived web. Regarding the topic of museums on the web, the GLAM Workbench is particularly valuable because some examples in the notebooks have already focused on the Australian web domain and the code from the notebooks can be easily transformed for addressing the topic related to the NMA’s web presence. Using these research infrastructures is beneficial also for solving some technical issues related to the limited capacity of personal computers to address large amount of data [@robertson2022].
-Data-driven approaches require not only obtaining information in as complete as possible form but also assessment of the available data, which can be considered as a step in source criticism. Analysis of the distribution of data can be based on a URL analysis of data preserved on the web archives. The URL analysis serves as a necessary step in the source assessment because its study can reveal the temporal distribution of the data, identify the gaps, trace the regularity of crawling and updating the website, specify the distribution of the file formats, and identify other characteristics related to the complexity of the websites and their changes over time. URL-string after crawling becomes a part of the identification of information in the web archive, being enriched with a timestamp. URLs collected from the web archive provide a series of characteristics such as protocol, domain and subdomain names, path, timestamps, and parameters which are all valuable resources of information. Some of these parameters are more important for tracing the technical side of the website’s history. For example, tracing the protocols http and https give us an idea about accepting of security measures and technological upgrades. Some other characteristics provide valuable information to study the content of the website.
-Identification of the subdomains contributes to our understanding complexity and segmentation of the museum’s website and the presentation of information to the website’s visitors. Often the subdomains have been used for specific projects within the museum’s activities and can be studied separately from the ‘main’ website due to their own structure and content. Both of the considered case studies through their history included the subdomain structures and experimented over the years to find more suitable approaches to the complexity of the website. Subdomains could have a different design and non-overlapping content so that they can be located as a large web structure within the museums’ activities. The analysis of the subdomains facilitates reconstruction of their life cycle, sustainability and use in comparison with the main domain. Building the network of the domain and the subdomains serves as a way to identify the important webpages through the most connected nodes (web pages) and the edges (hyperlinks). In this regard, the main website performs as a metastructure that encompasses various substructures (such as subdomains). Therefore, data-driven approaches are helpful in the analysis of functional segmentation, autonomy and integration processes within such a museum’s web universe.
-The challenge in the identification of the subdomains refers not only to the inherent incompleteness of preserved data but also to the deficient methods of obtaining data from the web archives. Our experience shows that the API of the Internet Archive deriving the millions of URLs from the same domain returns errors and collapses the processing. Also, we are able to get some subdomains of the metmuseum.org website from the Archive-It platform (The Metropolitan Museum of Art Web Archive, 2024) but the MET preserves the data systematically only from 2019 and for this reason identification of the subdomains from the past perspective can be challenging and researchers need to seek better ways to achieve access this data. The GLAM Workbench has a notebook in relation to obtaining subdomains. The code can also be adjusted to any web domain for searching on the Internet Archive (and some other web archives). Regarding the research of the NMA and subdomains, the GLAM workbench suggests a highly powerful tool to consider the NMA’s website as a part of the gov.au domain [@exploring_govau2024]. The sub-subdomains of nma.gov.au website can be analyzed around the main museum’s website and at the same time the museum’s website can be considered in connection with other websites from the gov.au domain. Such an approach has a strong potential for discoveries related to positioning of the museums’ website along with the other 1825 third-level domains of the governmental segment, identifying the unique webpages and other characteristics.
-The archived web is a complex resource that encompasses a large amount of heterogeneous data [@brugger2013]. The single webpage may include various formats of information and analysing the whole website requires finding the appropriate methods. Deducing complexity and investigation data separated according to formats is a widely accepted method of analysis of the website’s content. Textual analysis is a subversion of such research on the websites when only the texts are taken for the study. Building a corpus of texts from the museum websites preserved on the Danish Web Archive gave insights into the development of the Danish museums on the web and the identification of the attributes specific to the museum clusters [@skov2024].
-Separating the content according to the formats and selecting the particular type of data for the analysis, may appear to be a simple task. However, building a corpus is a very complex task which requires defining appropriate approaches to how to obtain data and what type of data to include in the corpus. There are many pitfalls to consider: the transfer of dynamic web content to the static version on the web archive inescapably changes the nature of the data and requires decisions on how to shape the dataset for analysis [@brugger2010]. Moreover, not all the textual data represent the same level of data. Another issue is the multiplication of data when the same page has been crawled and preserved on the web archive several times. The Internet Archive and the GLAM Workbench suggest different solutions in this regard. The Internet Archive provides users with unique identifiers (‘digest’) of every captured URL. If the content on the same URL has been changed, the hash sum and subsequently the identifier will vary as well. It helps to treat the web pages differently if their content is diverse. At the same time, the significance of the changes cannot be assessed from a distance. The Glam Workbench suggests a code published on the Jupyter Notebook to harvest textual data from the required archived webpages [@harvesting2024]. At the same time, the obtaining of data is possible by the lists of the URLs, which aids in treating the large amount of webpages automatically.
-Textual data analysis is a well-established sphere of computational humanities. However, the complexity of the website is significantly greater than the text only. Analysis of images can be beneficial for many reasons. One such task is to reveal the selection processes in publishing images on the websites in general and in the digital collections in particular. Art museums had to identify the priorities in publishing pictures and develop specific strategies for that. We do not know much about selection processes, especially in the early years of the web, and how these solutions evolved due to the influences of political and cultural events, movements and actions. Data-driven research is able to identify and highlight these trends. To analyze the currently published collections some museums suggest open APIs for obtaining metadata about the museum objects. Both the MET and the NMA provide open access to their collections through the open API [@collection_api2024; @museum_api2024]. Access to the metadata of the publicly available objects is provided on the digital platforms. At the same time, the metadata is limited to information about the objects and does not include metadata regarding their web presence, including the date of the first publication online. In this regard, the timestamps from the web archives can be considered as a valuable resource to analyze the publishing processes from a historical perspective. At the same time, in the web archive the preserved pictures are disconnected from the metadata about the image and this gap requires finding specific solutions to connect the image and metadata to make the discoveries easier.
+Proliferation of digital technologies and the World Wide Web have profoundly impacted museums, transforming their functions and engagement practices. To comprehend these changes, a thorough examination of museum websites and their historical evolution is essential. The focus of this research is on the application of a data-driven approach to the history of the Metropolitan Museum of Art (MET) and the National Museum of Australia (NMA). The cases were selected from two publicly available web archives – the Internet Archive [@internetarchive2024] and the Australian Web Archive, Trove [@trove2024] – that are the oldest web archives which means they preserve the historical web from the same starting point, making it possible to compare their infrastructures and their use in research.
+
+When working with web archives, the main strategies to apply data-driven methods to research the websites’ history refer to obtaining data from the general pool or using special collections. In the first scenario, scholars obtain data captured during the general crawling process which is not specifically curated. In this case, there is a large chance that the website is not crawled systematically and in terms of the full depth of hierarchy of the pages. The second scenario assumes more rigorous preservation practices that result in a more comprehensive dataset. Analyzing the MET and the NMA, the researcher may use different approaches to obtaining data. Studying the MET’s history on the web, the researcher can search for data from the Internet Archive and also from the special collection devoted to the MET in the Archive-It project [@metropolitan2024]. However, the special collection was only initiated in 2019. To study the previous years the researcher needs to necessarily apply to the general pool of the Internet Archive. To investigate the web presence of the NMA scholars may mainly observe the data from the Internet Archive and from Trove as there is no special collection of the preserved web devoted to the NMA.
+
+Both web archives offer open APIs to obtain large datasets suitable for data-driven research [@api2024; @troveapi2024]. However, even obtaining data is challenging. The statistics of the Internet Archive show that the preserved version of the MET museum website [@met2024] was saved 20,519 times between November 11, 1996, and July 28, 2023, including 10,867,395 captures of text/HTML files, corresponding to 6,559,761 URLs [@summary_met2023]. The statistics for the National Museum of Australia on the Internet Archive show 353,134 captures of text/HTML files, which relates to 175,999 URLs [@summary_nma2023]. At the same time, the attempt to download all the timestamps using the Wayback Machine Downloader returns only 1,342,067 files for the MET website. The same is relevant for the NMA website obtained from the Internet Archive. The stage of obtaining the dataset requires more attention to the APIs and gaining reliable data.
+
+Difficulties related to obtaining data and building datasets have been addressed partially by creating research infrastructures to work with web archived materials. The Internet Archive introduced several initiatives to collect, store and provide access to preserved materials and process data such as Archive-It [@archiveit2024] and ARCH (Archives Research Compute Hub [@arch2024]). The GLAM Workbench [@glam2024] has been created to analyze materials from the Australian Web Archive, the Internet Archive and several other web archives. Initially focused on Australia and New Zealand digital platforms the GLAM Workbench suggests a range of solutions based on the use of Jupyter notebooks for exploration and usage of data from GLAM institutions including web archival data. These infrastructures support researchers in finding solutions of various problems in obtaining and processing data, opening them up to wide opportunities to explore the archived web. Regarding the topic of museums on the web, the GLAM Workbench is particularly valuable because some examples in the notebooks have already focused on the Australian web domain and the code from the notebooks can be easily transformed for addressing the topic related to the NMA’s web presence. Using these research infrastructures is beneficial also for solving some technical issues related to the limited capacity of personal computers to address large amount of data [@robertson2022].
+
+Data-driven approaches require not only obtaining information in as complete as possible form but also assessment of the available data, which can be considered as a step in source criticism. Analysis of the distribution of data can be based on a URL analysis of data preserved on the web archives. The URL analysis serves as a necessary step in the source assessment because its study can reveal the temporal distribution of the data, identify the gaps, trace the regularity of crawling and updating the website, specify the distribution of the file formats, and identify other characteristics related to the complexity of the websites and their changes over time. URL-string after crawling becomes a part of the identification of information in the web archive, being enriched with a timestamp. URLs collected from the web archive provide a series of characteristics such as protocol, domain and subdomain names, path, timestamps, and parameters which are all valuable resources of information. Some of these parameters are more important for tracing the technical side of the website’s history. For example, tracing the protocols http and https give us an idea about accepting of security measures and technological upgrades. Some other characteristics provide valuable information to study the content of the website.
+
+Identification of the subdomains contributes to our understanding complexity and segmentation of the museum’s website and the presentation of information to the website’s visitors. Often the subdomains have been used for specific projects within the museum’s activities and can be studied separately from the ‘main’ website due to their own structure and content. Both of the considered case studies through their history included the subdomain structures and experimented over the years to find more suitable approaches to the complexity of the website. Subdomains could have a different design and non-overlapping content so that they can be located as a large web structure within the museums’ activities. The analysis of the subdomains facilitates reconstruction of their life cycle, sustainability and use in comparison with the main domain. Building the network of the domain and the subdomains serves as a way to identify the important webpages through the most connected nodes (web pages) and the edges (hyperlinks). In this regard, the main website performs as a metastructure that encompasses various substructures (such as subdomains). Therefore, data-driven approaches are helpful in the analysis of functional segmentation, autonomy and integration processes within such a museum’s web universe.
+
+The challenge in the identification of the subdomains refers not only to the inherent incompleteness of preserved data but also to the deficient methods of obtaining data from the web archives. Our experience shows that the API of the Internet Archive deriving the millions of URLs from the same domain returns errors and collapses the processing. Also, we are able to get some subdomains of the metmuseum.org website from the Archive-It platform (The Metropolitan Museum of Art Web Archive, 2024) but the MET preserves the data systematically only from 2019 and for this reason identification of the subdomains from the past perspective can be challenging and researchers need to seek better ways to achieve access this data. The GLAM Workbench has a notebook in relation to obtaining subdomains. The code can also be adjusted to any web domain for searching on the Internet Archive (and some other web archives). Regarding the research of the NMA and subdomains, the GLAM workbench suggests a highly powerful tool to consider the NMA’s website as a part of the gov.au domain [@exploring_govau2024]. The sub-subdomains of nma.gov.au website can be analyzed around the main museum’s website and at the same time the museum’s website can be considered in connection with other websites from the gov.au domain. Such an approach has a strong potential for discoveries related to positioning of the museums’ website along with the other 1825 third-level domains of the governmental segment, identifying the unique webpages and other characteristics.
+
+The archived web is a complex resource that encompasses a large amount of heterogeneous data [@brugger2013]. The single webpage may include various formats of information and analysing the whole website requires finding the appropriate methods. Deducing complexity and investigation data separated according to formats is a widely accepted method of analysis of the website’s content. Textual analysis is a subversion of such research on the websites when only the texts are taken for the study. Building a corpus of texts from the museum websites preserved on the Danish Web Archive gave insights into the development of the Danish museums on the web and the identification of the attributes specific to the museum clusters [@skov2024].
+
+Separating the content according to the formats and selecting the particular type of data for the analysis, may appear to be a simple task. However, building a corpus is a very complex task which requires defining appropriate approaches to how to obtain data and what type of data to include in the corpus. There are many pitfalls to consider: the transfer of dynamic web content to the static version on the web archive inescapably changes the nature of the data and requires decisions on how to shape the dataset for analysis [@brugger2010]. Moreover, not all the textual data represent the same level of data. Another issue is the multiplication of data when the same page has been crawled and preserved on the web archive several times. The Internet Archive and the GLAM Workbench suggest different solutions in this regard. The Internet Archive provides users with unique identifiers (‘digest’) of every captured URL. If the content on the same URL has been changed, the hash sum and subsequently the identifier will vary as well. It helps to treat the web pages differently if their content is diverse. At the same time, the significance of the changes cannot be assessed from a distance. The Glam Workbench suggests a code published on the Jupyter Notebook to harvest textual data from the required archived webpages [@harvesting2024]. At the same time, the obtaining of data is possible by the lists of the URLs, which aids in treating the large amount of webpages automatically.
+
+Textual data analysis is a well-established sphere of computational humanities. However, the complexity of the website is significantly greater than the text only. Analysis of images can be beneficial for many reasons. One such task is to reveal the selection processes in publishing images on the websites in general and in the digital collections in particular. Art museums had to identify the priorities in publishing pictures and develop specific strategies for that. We do not know much about selection processes, especially in the early years of the web, and how these solutions evolved due to the influences of political and cultural events, movements and actions. Data-driven research is able to identify and highlight these trends. To analyze the currently published collections some museums suggest open APIs for obtaining metadata about the museum objects. Both the MET and the NMA provide open access to their collections through the open API [@collection_api2024; @museum_api2024]. Access to the metadata of the publicly available objects is provided on the digital platforms. At the same time, the metadata is limited to information about the objects and does not include metadata regarding their web presence, including the date of the first publication online. In this regard, the timestamps from the web archives can be considered as a valuable resource to analyze the publishing processes from a historical perspective. At the same time, in the web archive the preserved pictures are disconnected from the metadata about the image and this gap requires finding specific solutions to connect the image and metadata to make the discoveries easier.
+
 Apart from the texts and images, the websites incorporate other formats of data and their use in the research is more problematic for analysis. The museums represented on the web multimedia content including videos, animation, conducted podcasts, etc. All of this and other content is valuable for our understanding of their evolution on the web. At the same time, these types of content are very challenging for web archiving [@muller2021], so the specific methodologies should be developed for their systematic preservation and then for the subsequent analysis, including data-driven practices.
 
 ## Conclusion