diff --git a/categories/index.html b/categories/index.html
index 9f0781e..98e0b5a 100644
--- a/categories/index.html
+++ b/categories/index.html
@@ -432,9 +432,9 @@
Signed networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
- read more
+Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
+ read more diff --git a/index.json b/index.json index 1881147..eb98682 100644 --- a/index.json +++ b/index.json @@ -252,4 +252,4 @@ -[{"categories":null,"contents":"In this course, you will learn the basics of cluster computing with R for social science workloads. The day starts with an introduction to supercomputer architecture, including a hands-on session focused on running jobs on a supercomputer.\nThe second half of the programme focuses on translating your R workflow from a GUI (Rstudio) workflow on your desktop to a scripting/batch environment on the supercomputer. Topics covered here include: efficient programming, parallel computing, and using the SLURM job manager to send your job/analysis to the supercomputer.\nIn this course you will:\nDo practical exercises to learn how to effectively use the Lisa national computing cluster and the national supercomputer, and how to complete your tasks with minimal effort in the shortest possible time. Experience how to achieve high performance with R by using SURF\u0026rsquo;s supercomputing facilities. Additional information When 27-09-2024 Where SURF Amsterdam Registration https://www.surf.nl/en/agenda/cluster-computing-for-social-scientists-with-r Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University, together with a SURF instructor Erik-Jan\u0026rsquo;s website Materials Course materials, including slides and code, are open and can be accessed here. ","date":"September 27, 2024","image":"http://odissei-soda.nl/images/workshops/Supercomputing_hubeda2984481e4d3f13fb836f8a1151c6_83157_650x0_resize_box_3.png","permalink":"/workshops/cluster-computing/","title":"Supercomputing for social scientists with R"},{"categories":null,"contents":"Signed networks are a way to represent relationships between entities. This type of netwworks are called \u0026lsquo;signed\u0026rsquo; because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting). Community detection in signed networks aims to identify groups of nodes that share similar connection patterns. In this tutorial, we will guide you through applying two popular community detection algorithms to signed networks, using Python.\nAlgorithms We will be using two algorithms:\nSpinglass: This algorithm leverages a spin model metaphor to partition nodes into communities. It considers both the weights and signs of connections. SPONGE: This spectral clustering technique identifies communities by analyzing the eigenvectors of the signed adjacency matrix. They are implemented in Python. First, you need install the necessary libraries (pandas is not strictly necessary to implement the algorithms, but we\u0026rsquo;ll use it throught the tutorial).\npip install igraph pip install git+https://github.com/alan-turing-institute/SigNet.git pip install pandas Once installed, you can import the libraries and start working with the algorithms.\nimport igraph as ig from signet.cluster import Cluster import pandas as pd Construct and visualize a signed network from data To begin, we\u0026rsquo;ll construct an example network. This network will be constructed from a series of signed interactions among agents: the edgelist. We can use pandas to read the edgelist from a file or from a list of tuples. 
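If the interactions are instead stored in a file, a minimal sketch of the file-based route could look like this (the file name and its three columns source, target, weight are assumptions for illustration):

import pandas as pd

# hypothetical CSV with one signed interaction per row: source,target,weight
edgelist_df = pd.read_csv("signed_edgelist.csv")

In the rest of this tutorial, though, we build the edgelist directly from a list of tuples.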
In this network, 1 means a positive interaction, and -1 means a negative interaction.\nedgelist = [(\u0026#39;A\u0026#39;, \u0026#39;B\u0026#39;, 1), (\u0026#39;A\u0026#39;, \u0026#39;C\u0026#39;, -1), (\u0026#39;B\u0026#39;, \u0026#39;C\u0026#39;, -1) ] # create DataFrame using edgelist edgelist_df = pd.DataFrame(edgelist, columns =[\u0026#39;source\u0026#39;, \u0026#39;target\u0026#39;, \u0026#39;weight\u0026#39;]) To construct the signed network, we use igraph. In the network, the nodes (agents) will be A, B, C, and the edges (interactions) are the ones listed in the edgelist -with their corresponding weights.\ng = ig.Graph.TupleList(edgelist_df.itertuples(index=False), directed=False, weights=False, edge_attrs=\u0026#34;weight\u0026#34;) ig.summary(g) IGRAPH UNW- 3 3 -- + attr: name (v), weight (e) Then, we can plot our network, still using igraph.\n# color the edges based on their weight g.es[\u0026#34;color\u0026#34;] = [\u0026#34;#00BFB3\u0026#34; if edge[\u0026#39;weight\u0026#39;] \u0026gt;0 else \u0026#34;#F05D5E\u0026#34; for edge in g.es] ig.plot(g, vertex_color=\u0026#34;grey\u0026#34;, edge_color=g.es[\u0026#34;color\u0026#34;], edge_width=5, vertex_label=g.vs[\u0026#34;name\u0026#34;]) Figure 1. A signed network\rDetect communities Spinglass The community-spinglass method is a community detection approach grounded in statistical mechanics. Initially proposed by Reichardt and Bornholdt for unsigned networks, it was later extended by Traag and Bruggeman later extended to signed networks. This algorithm is implemented in the igraph package.\nspinglass = g.community_spinglass(weights=\u0026#34;weight\u0026#34;, spins=50, gamma= 0.5, lambda_= 0.5, implementation=\u0026#39;neg\u0026#39;) for i, node in enumerate(g.vs): print(f\u0026#39;node {node[\u0026#34;name\u0026#34;]}: community {spinglass.membership[i]}\u0026#39;) node A: community 0 node B: community 0 node C: community 1 Changing the parameters gamma and lambda_ gives more or less importance to positive or negative ties within a community, depending on whether we want agents with negative interactions to be found in the same group of agents.\nSPONGE The SPONGE method (Signed Positive Over Negative Generalized Eigenproblem), introduced by Cucuringu et al. (2019) is based on minimizing the number of violations. These \u0026ldquo;violations\u0026rdquo; consist of positive edges between communities and negative edges within communities. An open-source Python implementation of the SPONGE algorithm is available on GitHub.\n# get the adjacency matrix of the network (only signs, not weights) A = g.get_adjacency_sparse(attribute=\u0026#39;weight\u0026#39;).sign() c = Cluster((A.multiply(A\u0026gt;0), -A.multiply(A\u0026lt;0))) sponge = c.SPONGE(k=2) node A: community 0 node B: community 0 node C: community 1 Changing the parameter k, you can set as many communities as you want.\nConclusion This tutorial has equipped you with the knowledge and code to apply two common community detection algorithms – Spinglass and SPONGE – to signed networks using Python libraries like igraph and signet. Applying these algorithms, you can gain insights into the underlying community structure of signed networks.\nAs you saw, there are some parameter choices to be done. We found that changing parameters can drastically influence the results of the algorithms (i.e., find communities where there are none, or just not look for what you wanted to find.). 
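Because of this sensitivity, it can be worth rerunning the detection over a small grid of parameter values and checking how stable the resulting memberships are (keeping in mind that spinglass itself is stochastic). A minimal sketch of such a check, reusing the graph g and the spinglass call from above (the gamma values are arbitrary illustrations):

# compare spinglass partitions across a few gamma values
for gamma in (0.1, 0.5, 1.0):
    part = g.community_spinglass(weights="weight", spins=50,
                                 gamma=gamma, lambda_=0.5,
                                 implementation="neg")
    print(f"gamma={gamma}: membership {part.membership}")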
We suggest you to check our last pre-print Community detection in bipartite signed networks is highly dependent on parameter choice and the related code to discover more about the parameter tuning in the case of two-mode (bipartite) signed networks.\nIf you are doing similar work and you have a different method, let us know! In addition, if you have further questions about community detection, or you think we can help you, do not hesitate to contact us!\nReferences Reichardt, Jörg, and Stefan Bornholdt. “Statistical Mechanics of Community Detection.” Physical Review E, vol. 74, no. 1, July 2006. Crossref, https://doi.org/10.1103/physreve.74.016110. Traag, V. A., and Jeroen Bruggeman. “Community Detection in Networks with Positive and Negative Links.” Physical Review E, vol. 80, no. 3, Sept. 2009. Crossref, https://doi.org/10.1103/physreve.80.036115. Cucuringu, Mihai, et al. \u0026ldquo;SPONGE: A Generalized Eigenproblem for Clustering Signed Networks.\u0026rdquo; arXiv, 2019, https://arxiv.org/abs/1904.08575 Candellone, Elena, et al. \u0026ldquo;Community Detection in Bipartite Signed Networks is Highly Dependent on Parameter Choice.\u0026rdquo; arXiv, 2024, https://arxiv.org/abs/2405.08203 ","date":"May 15, 2024","image":"http://odissei-soda.nl/images/tutorial-9/algorithm_hu5459c0360c2b0cb7a147d2df0eb350ca_1164949_650x0_resize_q100_box.jpg","permalink":"/tutorials/comunity-detection-signed-networks/","title":"Detecting communities in signed networks with Python"},{"categories":null,"contents":"Doing open, reproducible science means doing your best to openly share research data and analysis code. With these open materials, others can check and understand your research, use it to prepare their own analysis, find examples for teaching, and more. However, sometimes datasets contain sensititive or confidential information, which makes it difficult — if not impossible — to share. In this case, producing and sharing a synthetic version of the data might be a solution. In this post, we show how to do this in an auditable, transparent way with the software package metasyn.\nMetasyn is a Python package that helps you to generate synthetic data, with two ideas in mind. First, it is easy to use and understand. Second, and most importantly, it is privacy-friendly. Unlike most other synthetic data generation tools, metasyn strictly limits the statistical information in its data generation model to adhere to the highest privacy standards and only generates data that is similar on an individual column level. This makes it a great tool for initial exploration, code development, and sharing of datasets while maintining very high privacy levels - but it is not suitable for in-depth statistical analysis.\nWith metasyn, you fit a model to a dataset and synthesize data similar to the original based on that model. You can then export the synthetic data and the model used to generate it, in easy-to-read format. As a result, metasyn allows data owners to safely share synthetic datasets based on their source data, as well as the model used to generate it, without worrying about leaking any private information from the original dataset.\nLet\u0026rsquo;s say you want to use metasyn to collaborate on a sensitive dataset with others. In this tutorial, we will show you everything you need to know to get started.\nStep 1: Setup The first step is installing metasyn. The easiest way to do so is by installing it through pip. 
This can be done by typing the following command in your terminal:\npip install metasyn Then, in a Python environment, you can import metasyn (and Polars, which will be used to load the dataset):\nimport polars as pl from metasyn import MetaFrame, demo_file Step 2: Creating a DataFrame Before we can pass a dataset into metasyn, we need to convert it to a Polars DataFrame. In doing so, we can indicate which columns contain categorical values. We can also tell polars to find columns that may contain dates or timestamps. Metasyn can later use this information to generate categorical or date-like values where appropriate. For more information on how to use Polars, check out the Polars documentation.\nFor this tutorial, we will use the Titanic dataset, which comes preloaded with metasyn (its file path can be accessed using the demo_file function). We will specify the data types of the Sex and Embarked columns as categorical, and we will also try to parse dates in the DataFrame.\n# Get the CSV file path for the Titanic dataset csv_path = demo_file(\u0026#34;titanic\u0026#34;) # Replace this with your file path if needed # Create a Polars DataFrame df = pl.read_csv( source=csv_path, dtypes={\u0026#34;Sex\u0026#34;: pl.Categorical, \u0026#34;Embarked\u0026#34;: pl.Categorical}, try_parse_dates=True, ) Step 3: Generating a MetaFrame Now that we have created a DataFrame, we can easily generate a MetaFrame for it. Metasyn can later use this MetaFrame to generate synthetic data that aligns with the original dataset.\nA MetaFrame is a simple model that captures the essentials of each variable in the original dataset (e.g., variable names, types, data types, the percentage of missing values, and distribution), without containing any actual data entries.\nA MetaFrame can be created by simply calling MetaFrame.fit_dataframe(), passing in the DataFrame as a parameter.\n# Generate and fit a MetaFrame to the DataFrame mf = MetaFrame.fit_dataframe(df) Step 4: Generating synthetic data With our MetaFrame in place, we can use it to generate synthetic data. To do so, we can call synthesize on our MetaFrame, and pass in the amount of rows of data that we want to generate. This will return a DataFrame with synthetic data, that is similar to our original dataset.\n# generate synthetic data syn_df = mf.synthesize(5) That\u0026rsquo;s it! You can now read, analyze, modify, use and share this DataFrame as you would with any other \u0026ndash; knowing that it is rather unlikely to leak private information (though if you need actual formal privacy guarantees, look at our disclosure control plugin).\nStep 5: Exporting the MetaFrame Let\u0026rsquo;s say we want to go one step further, and also share the an auditable representation of the MetaFrame alongside our synthetic data. We can easily do so by exporting it to a JSON file.\nThese exported files follow the Generative Metadata Format (GMF). This is a format that was designed to be easy-to-read and understand.\nOther users can then import this file to generate synthetic data similar to the original dataset, without ever having access to the original data. 
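To illustrate what this looks like on the receiving side, here is a minimal sketch (the export and MetaFrame.from_json calls themselves are introduced just below; the file name is the one used later in this tutorial):

from metasyn import MetaFrame

# the recipient only needs the exported GMF (.json) file, never the original data
mf_shared = MetaFrame.from_json("exported_metaframe.json")
syn_df_shared = mf_shared.synthesize(100)  # generate 100 synthetic rows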
In addition, due to these files being easy to read, others can easily understand and evaluate how the synthetic data is generated.\nTo export the MetaFrame, we can call the export method on an existing MetaFrame (in this case, mf), passing in the file path of where we want to save the JSON file.\n# Serialize and export the MetaFrame mf.export(\u0026#34;exported_metaframe.json\u0026#34;) To load the MetaFrame from the exported JSON file, we can use the MetaFrame.from_json() class method, passing in the file path as a parameter:\n# Create a MetaFrame based on a GMF (.json) file mf = MetaFrame.from_json(file_path) Conclusion You now know how to use metasyn to generate synthetic data from a dataset. Both the synthetic data and the model (MetaFrame) used to generate it can be shared safely, while maintaining a high level of privacy.\nEnjoy using metasyn!\nFor more information on how to use metasyn, check out the documentation or the GitHub repository.\n","date":"April 26, 2024","image":"http://odissei-soda.nl/images/metasyn-tutorial/metasyn_hu815f2b6e5313f67d740aa04f6f4cb5c8_145073_650x0_resize_box_3.png","permalink":"/tutorials/generating-synthetic-data-with-metasyn/","title":"Generating synthetic data in a safe way with metasyn"},{"categories":null,"contents":"About the workshop The workshop will teach how to work efficiently with data from CBS (Statistics Netherlands), using its remote access environment. It is designed for people getting acces to the remote environment for the first time, or people who want to improve the efficiency and reproducibility of their workflows. Topics covered include project organization, principles for writing legible programs (in R, but people using other languages are encouraged to join as well; the principles apply to all languages), and to retrieve, store and configure data pipelines in a reproducible and understandable way.\nThere will be a 2 hour plenary session and a 1 hour consultation session, where specific questions will be adressed in small groups. Questions can be sent in advanced in the registration process.\nAdditional information When Friday, 19th April 2024, from 13.30h to 16.30h Where Utrecht University, Administration building, room “Van Lier \u0026amp; Eggink”. Registration You can register here, and consult additional info, here. It is free of charge for ODISSEI organization members. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University. Erik-Jan\u0026rsquo;s website Materials A link with all workshop materials can be found here. ","date":"April 19, 2024","image":"http://odissei-soda.nl/images/metasyn-tutorial/metasyn_hu815f2b6e5313f67d740aa04f6f4cb5c8_145073_650x0_resize_box_3.png","permalink":"/workshops/efficient-reproducible-research-cbs-microdata/","title":"Efficient and reproducible research with CBS microdata"},{"categories":null,"contents":"Author: Flavio Hafner. Post photo from Glenn Carstens-Peters on Unsplash\nThis post is the first of a series of blogposts arising from a collaboration between the eScience Center and the ODISSEI Social Data Science Team. You can also find this text at the eScience Center page.\nResearch often relies on accessing novel data, for instance by collecting them from the internet through web scraping. If you have ever tried this, you may have run into your IP address being blocked by the website you scrape. Websites do this with a good reason — to protect themselves against malicious acts, such as denial-of-service attacks or large-scale scraping by parties with ill intent. 
This makes sense from the websites’ perspective, but prevents you from answering your research question.\nBut this problem can be solved. In this tutorial, we show how you as a researcher can use IP rotation to circumvent certain scraping protections with the sirup package, which works on Linux operating systems.\nBefore we jump into it, it is important to highlight that web scraping and IP rotation need to respect the law and should only be a last resort. For instance, before you scrape data from a website, you should ask the data owner whether they are willing to make them available to you through a data sharing agreement. If you nevertheless decide to scrape the data, you should get approval from the ethical review board at your university. Moreover, do only scrape data that are publicly available on the web and do not send excessive number of requests to the website in a given time.\nFor rotating the IP address, we will use a VPN service. Here you can read more about what a VPN service is — in short, the service has a bunch of remote servers, and connecting your computer to one of these servers changes your IP address.\nWhat you need 1. OpenVPN OpenVPN is a system that allows you to create secure VPN connections. You can install it by following these instructions.\n2. Root access to your computer Because internet connections are an important security concern, OpenVPN requires root access — this is the equivalent to administrator rights on a Windows computer. If you have root access, you can for instance run the following command on your terminal:\nsudo ls -lh # will ask you for your root password Installing and setting up sirup You can install sirup as follows:\npython -m pip install sirup To use the package and change your IP address, you need an account with a VPN service provider that offers OpenVPN configuration files for your account. At the time of writing, for instance ProtonVPN and Surfshark offer this option — note that these services are not for free. We will use ProtonVPN in this tutorial.\nAfter creating an account, you need to download two sets of files.\nFirst, you download credentials that identify your Proton account when using OpenVPN. On the ProtonVPN website, click on “Account” and then you see something like this:\nFigure 1. Credentials that identify your Proton account when using OpenVPN\rCopy and paste the username and the password into a .txt file that looks like this:\nusername password Then, save the file as “proton_credentials.txt”. Remember where it is stored — we will need it later.\nA first warning on security. Storing account credentials like this makes it easy for you to use the sirup package. But it also increases the risk that unauthorized persons get a hold on these credentials. Thus, be careful to store the credentials in a safe place on your laptop and do not share them with anyone.\nSecond, to use OpenVPN we need configuration files, whose names end with .ovpn. The files allow OpenVPN to connect to a server from the VPN service provider. In ProtonVPN, go to the \u0026ldquo;Download\u0026rdquo; section of your account. Select the options as follows:\nFigure 2. Options to download .ovpn configuration files\rAnd download the configuration file(s) you want to use. Store the downloaded files on your computer, and remember the location.\nNow you are ready!\nUsing sirup We start by defining the path to the proton_credentials.txt file. 
When you execute the code below, you will be asked to enter the root password, which is necessary to make the connection.\nimport getpass auth_file = \u0026#34;proton_credentials.txt\u0026#34; pwd = getpass.getpass(\u0026#34;Please enter your root password:\u0026#34;) A second warning on security. The code above stores your root password during the python session without encrypting it. This is OK to do on your laptop — if someone gets access to your python session, your security has already been compromised — but not recommended on a shared computer such as a cluster or a cloud service.\nChanging the IP address with sirup Now you can use the VPNConnector to change your IP address. We will use the \u0026quot;my_config_file.ovpn\u0026quot; configuration file.\nfrom sirup.VPNConnector import VPNConnector config_file = \u0026#34;my_config_file.ovpn\u0026#34; The code below first connects to the server associated with \u0026quot;my_config_file.ovpn\u0026quot; and then disconnects.\nconnector = VPNConnector(auth_file, config_file) # Let\u0026#39;s see the current IP address when no VPN tunnel is active print(connector.base_ip) connector.connect(pwd=pwd) # Now the IP address should differ print(connector.current_ip) connector.disconnect(pwd=pwd) # Now current_ip should be the same as base_ip above print(connector.current_ip) Rotating the IP address with sirup Instead of connecting to a single server, you can also rotate across many different servers — which means you rotate your IP address across a set of potential addresses. Doing so is useful for larger scraping jobs because it will spread your requests across more servers.\nTo do this, you need to download multiple configuration files as described above. Store all of the .ovpn configuration files together in a separate directory. Let\u0026rsquo;s say you store them in the \u0026quot;/path/to/config/files/\u0026quot; directory. You need to define this path in your python script:\nconfig_path = \u0026#34;/path/to/config/files/\u0026#34; The following code connects to two different servers before disconnecting again:\nfrom sirup.IPRotator import IPRotator rotator = IPRotator(auth_file=my_auth_file, config_location=config_path, seed=seed) # this will ask for the root password print(rotator.connector.base_ip) rotator.connect() print(rotator.connector.current_ip) rotator.rotate() print(rotator.connector.current_ip) rotator.disconnect() print(rotator.connector.current_ip) Conclusion This tutorial has walked you through the steps to manage your IP address in python, using the sirup package. We hope it makes your scraping workflows easier!\nsirup is an open-source package developed by the Netherlands eScience Center. If you use the tool, you can cite this zenodo repository with the DOI: https://doi.org/10.5281/zenodo.10261949.\nThe source code of the package is here, where you can contribute to it, build on it and submit issues.\nThanks to Patrick Bos, Peter Kalverla, Kody Moodley and Carlos Gonzalez Poses for comments.\n","date":"February 27, 2024","image":"http://odissei-soda.nl/images/tutorial-8/sirup_hu5459c0360c2b0cb7a147d2df0eb350ca_814836_650x0_resize_q100_box.jpg","permalink":"/tutorials/ip-rotation/","title":"How to manage your IP address in python"},{"categories":null,"contents":"By now, it\u0026rsquo;s no surprise to anybody the astonishing results large language models produce. Models such as GPT-4, Bard, Bert or RoBERTa have sparked intense research and media attention, as well as changed many people\u0026rsquo;s workflows. 
However, these models have issues. A common critique is that they function as black boxes: users do not know much about their training data or modelling choices. Besides, training them usually requires gigantic datasets and processing power. Therefore, there is value in alternative models that can be trained by researchers, having full control over the input data and internal process. In this tutorial, we explain how to train a natural language processing model using fastText: a lightweight, easy-to-implement and efficient word embedding model that has shown good performance in various natural language tasks over the years.\nFirst, a bit about word-embedding models. Word-embeddings models are one type of natural language processing models. By producing real number vector representations of words, they offer a powerful way to capture the semantic meaning of words in large datasets, which is why they are widely used in diverse research applications. Indeed, their uses are numerous: semantic similarity, text generation, document representation, author recognition, knowledge graph construction, sentiment analysis, or bias detection (Caliskan et al., 2017).\nInstallation To install Fasttext, we recommend checking the fasttext-wheel PyPI module. To verify the installation succeeded, you have to importat the package in a Python script.\n\u0026gt;\u0026gt;\u0026gt; import fasttext If there are no error messages, you have succeeded and we can move to the training part.\nTraining the model Training data To train fastText, you need a corpus: a large collection of text. The required size of these corpora varies depending on the research purpose: from several thousand to billions of words. Some research benefits from smaller, well-curated corpora; other research benefits from large unstructured corpora. However, while the exact size needed is hard to determine, do keep in mind that the text in the training data has to relate to your research question! If you want to use word embeddings for studying doctors\u0026rsquo; notes, you need doctors\u0026rsquo; notes - and not legal medical texts. If you want to study niche cultural sub-groups, you need data from these groups - and not necessarily a corpus of random Internet interactions. The corpus is an integral part of your research! Generally, the larger the research-related corpus you can get, the better.\nIn this tutorial we use a freely available corpus of Science-Fiction texts downloaded from Kaggle. Preferably, the text you feed to fastText should have each sentence on a new line.\nHyperparameters We will train an unsupervised fastText model, which means that lot of implementation decisions need to be made. If you don\u0026rsquo;t have specific methodological reasons and/or you lack the time or computing power for a proper grid search, we suggest you go with the default parameter options - which are optimized for many research contexts -, but switching the \u0026lsquo;dim\u0026rsquo; parameter to 300. Empirical research has shown that a dimensionality of 300 leads to optimal performance in most settings, even if that will increase computational resources and training time. If you can afford spending the time thinking about hyperparameters, you could tune the training model (CBOW or SkipGram), learning rate, dimensionality of the vector, context widow size, and more. 
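If you do decide to set hyperparameters explicitly, the call could look like the sketch below (the values are only illustrations of the available knobs, not recommendations; the corpus file is the Science-Fiction text file used in the fitting step below):

model = fasttext.train_unsupervised(
    'internet_archive_scifi_v3.txt',
    model='skipgram',  # training model: 'skipgram' or 'cbow'
    lr=0.05,           # learning rate
    dim=300,           # dimensionality of the word vectors
    ws=5,              # context window size
    epoch=5,           # number of passes over the corpus
    minCount=5,        # ignore words occurring fewer than 5 times
)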
You can see here the full list of tuning parameters available.\nFitting model We fit the model with the following command:\n\u0026gt;\u0026gt;\u0026gt; model = fasttext.train_unsupervised(\u0026#39;internet_archive_scifi_v3.txt\u0026#39;, dim = 300) Then, you can save the trained model, so that you do not have to train it again. For this, you need to feed the save_model() method a path to which to save the file. Make sure to add \u0026lsquo;.bin\u0026rsquo; to save the model as a .bin file.\n\u0026gt;\u0026gt;\u0026gt; model.save_model(\u0026#39;scifi_fasttext_model.bin\u0026#39;) Re-opening a saved model is done with the load_model() method:\n\u0026gt;\u0026gt;\u0026gt; model = fasttext.load_model(\u0026#39;scifi_fasttext_model.bin\u0026#39;) Using word embeddings Now we have trained the model, we have the word embeddings ready to be used. And, luckily, fastText comes with some nice functions to work with word embeddings! Here we highlight two of possible uses of word embeddings: obtaining most similar words, and analogies - but remember there are more possible uses. We start by simply retrieving the word embeddings. This can be done with any of the two following commands.\n\u0026gt;\u0026gt;\u0026gt; model.get_word_vector(\u0026#39;villain\u0026#39;) \u0026gt;\u0026gt;\u0026gt; model[\u0026#39;villain\u0026#39;] array([ 0.01417591, -0.06866349, 0.09390495, -0.04146367, 0.10481305, -0.2541916 , 0.26757774, -0.04365376, -0.02336818, 0.07684527, -0.05139925, 0.14692445, 0.07103274, 0.23373744, -0.28555775, .............................................................. -0.14082788, 0.27454248, 0.02602287, 0.03754443, 0.18067479, 0.20172128, 0.02454677, 0.04874028, -0.17860755, -0.01387627, 0.02247835, 0.05518318, 0.04844297, -0.2925061 , -0.05710272], dtype=float32) Since fastText does not only train an embedding for the full word, but also so for the ngrams in each word as well, subwords and their embeddings can be accessed as follows:\n\u0026gt;\u0026gt;\u0026gt; ngrams, hashes = model.get_subwords(\u0026#39;villain\u0026#39;) \u0026gt;\u0026gt;\u0026gt; \u0026gt;\u0026gt;\u0026gt; for ngram, hash in zip(ngrams, hashes): \u0026gt;\u0026gt;\u0026gt; print(ngram, model.get_input_vector(hash)) Note: using the get_subwords() method returns two lists, one with the ngrams of type string, the other with hashes. These hashes are not the same as embeddings, but rather are the identifier that fastText uses to store and retrieve embeddings. Therefore, to get the (sub-)word embedding using a hash, the get_input_vector() method has to be used.\nFurthermore, vectors can be created for full sentences as well:\n\u0026gt;\u0026gt;\u0026gt; model.get_sentence_vector(\u0026#39;the villain defeated the hero, tyrrany reigned throughout the galaxy for a thousand eons.\u0026#39;) array([-2.73631997e-02, 7.83981197e-03, -1.97590180e-02, -1.42770987e-02, 6.88663125e-03, -1.63909234e-02, 5.72902411e-02, 1.44126266e-02, -1.64726824e-02, 8.55281111e-03, -5.33024594e-02, 4.74718548e-02, ................................................................. 3.30820642e-02, 7.64035881e-02, 7.33195152e-03, 4.60342802e-02, 4.94049815e-03, 2.52075139e-02, -2.30138078e-02, -3.56832631e-02, -2.22732662e-03, -1.84207838e-02, 2.37668958e-03, -1.00214258e-02], dtype=float32) Most similar words A nice usecase of fastText is to retrieve similar words. 
For instance, you can retrieve the set of 10 words with the most similar meaning (i.e., most similar word vector) to a target word using a nearest neighbours algorithm based on the cosine distance.\n\u0026gt;\u0026gt;\u0026gt; model.get_nearest_neighbors(\u0026#39;villain\u0026#39;) [(0.9379335641860962, \u0026#39;villainy\u0026#39;), (0.9019550681114197, \u0026#39;villain,\u0026#39;), (0.890184223651886, \u0026#39;villain.\u0026#39;), (0.8709720969200134, \u0026#39;villains\u0026#39;), (0.8297745585441589, \u0026#39;villains.\u0026#39;), (0.8225630521774292, \u0026#39;villainous\u0026#39;), (0.8214142918586731, \u0026#39;villains,\u0026#39;), (0.6485553979873657, \u0026#39;Villains\u0026#39;), (0.6020095944404602, \u0026#39;heroine\u0026#39;), (0.5941146612167358, \u0026#39;villa,\u0026#39;)] Interestingly, this also works for words not in the model corpus, including misspelled words!\n\u0026gt;\u0026gt;\u0026gt; model.get_nearest_neighbors(\u0026#39;vilain\u0026#39;) [(0.6722341179847717, \u0026#39;villain\u0026#39;), (0.619519829750061, \u0026#39;villain.\u0026#39;), (0.6137816309928894, \u0026#39;lain\u0026#39;), (0.6128077507019043, \u0026#39;villainous\u0026#39;), (0.609745979309082, \u0026#39;villainy\u0026#39;), (0.6089878678321838, \u0026#39;Glain\u0026#39;), (0.5980470180511475, \u0026#39;slain\u0026#39;), (0.5925296545028687, \u0026#39;villain,\u0026#39;), (0.5779100060462952, \u0026#39;villains\u0026#39;), (0.5764451622962952, \u0026#39;chaplain\u0026#39;)] Analogies Another nice use for fastText is creating analogies. Since the word embedding vectors are created in relation to every other word in the corpus, these relations should be preserved in the vector space so that analogies can be created. For analogies, a triplet of words is required according to the formula \u0026lsquo;A is to B as C is to [output]\u0026rsquo;. For example, if we take the formula \u0026lsquo;Men is to Father as [output] is to Mother\u0026rsquo;, we get the expected answer of Women.\n\u0026gt;\u0026gt;\u0026gt; model.get_analogies(\u0026#39;men\u0026#39;, \u0026#39;father\u0026#39;, \u0026#39;mother\u0026#39;) [(0.6985629200935364, \u0026#39;women\u0026#39;), (0.6015384793281555, \u0026#39;all\u0026#39;), (0.5977899432182312, \u0026#39;man\u0026#39;), (0.5835891366004944, \u0026#39;out\u0026#39;), (0.5830296874046326, \u0026#39;now\u0026#39;), (0.5767865180969238, \u0026#39;one\u0026#39;), (0.5711579322814941, \u0026#39;in\u0026#39;), (0.5671708583831787, \u0026#39;wingmen\u0026#39;), (0.567089855670929, \u0026#39;women\u0026#34;\u0026#39;), (0.5663136839866638, \u0026#39;were\u0026#39;)] However, since the model that we have created was done using uncleaned data from a relatively small corpus, our output is not perfect. For example, with the following analogy triplets, the correct answer of bad comes fourth, after villainy, villain. 
and villain, showing that for a better model, we should do some additional cleaning of our data (e.g., removing punctuation).\n\u0026gt;\u0026gt;\u0026gt; model.get_analogies(\u0026#39;good\u0026#39;, \u0026#39;hero\u0026#39;, \u0026#39;villain\u0026#39;) [(0.5228292942047119, \u0026#39;villainy\u0026#39;), (0.5205934047698975, \u0026#39;villain.\u0026#39;), (0.5122538208961487, \u0026#39;villain,\u0026#39;), (0.5047158598899841, \u0026#39;bad\u0026#39;), (0.483129620552063, \u0026#39;villains.\u0026#39;), (0.4676515460014343, \u0026#39;good\u0026#34;\u0026#39;), (0.4662466049194336, \u0026#39;vill\u0026#39;), (0.46115875244140625, \u0026#39;villains\u0026#39;), (0.4569159746170044, \u0026#34;good\u0026#39;\u0026#34;), (0.4529685974121094, \u0026#39;excellent.\u0026#34;\u0026#39;)] Conclusion In this blogpost we have shown how to train a lightweight, efficient natural language processing model using fastText. After installing it, we have shown how to use some of the fastText functions to train the model, retrieve word embeddings, and usem them for different questions. While this was a toy example, we hope you found it inspiring for your own research! And remember, if you have a research idea that entails using natural language models and are stuck or do not know how to start - you can contact us!\nBonus: Tips to improve your model performance Depending on your research, various tweaks can be made to your data to improve fastText performance. For example, if multi-word phrases (e.g., Social Data Science) play some key aspect in your analyses, you might want to change these word phrases in the data with a dash or underscore (e.g., Social_Data_Science) so that these phrases are trained as a single token, not as a sum of the tokens Social, Data, and Science.\nAs shown with the good-hero-villain analogy, removing punctuation and other types of non-alphabetic characters can help remove learning representations for unwanted tokens. For this, stemming (removing word stems) and lemmatization (converting all words to their base form) can also be useful. Similarly, two other ways to deal with unwanted tokens are to remove stop-words from your data, and to play around with the minCount parameter (i.e, the minimum number of times a word needs to occur in the data to have a token be trained) when training your model.\nMost importantly, try to gather as much knowledge about the domain of your research. This tip can seem obvious, but having a proper understanding of the topic you are researching is the most important skill to have when it comes to language models. Let\u0026rsquo;s take the Sci-Fi corpus we used as an example: the 10th nearest neigbor of the word villain was villa. If you don\u0026rsquo;t really understand what either of those words mean, you would not know that these results seem fishy (since the model we created has low internal quality, it relies very much on the trained n-grams. Since both words contain the n-gram \u0026lsquo;villa\u0026rsquo;, they are rated as being close in the vector space). 
Therefore, make sure to understand the domain to the best of your abilities and scrutinize your findings to get reliable results.\n","date":"January 22, 2024","image":"http://odissei-soda.nl/images/tutorial-6/logo-color_hu1abe885bd00d0c3db0943f497a717f5c_29206_650x0_resize_box_3.png","permalink":"/tutorials/fasttext/","title":"Training a fastText model from scratch using Python"},{"categories":null,"contents":"In this tutorial we present Geoflow, a newly created tool designed to visualize international flows in an interactive way. The tool is free and open-source, and can be accessed here. It is designed to visualize any international flows like, for instance, cash or migration flows. Since it\u0026rsquo;s easier to understand its capabilities by using them, we\u0026rsquo;ll start showing a couple of examples of what Geoflow can do. After that, we briefly explain how to upload your own dataset and visualize it using the tool.\nWhat Geoflow can do For these examples, we use the included demo dataset showing investments in fossil companies across countries. Figure 1 shows the top 10 investments in fossil fuel companies from China. The visualization is straightforward: the arrows indicate in which countries these investments are located. By placing the cursor on top of Bermudas, its arrow gets highlighted, showing the flow strength (weight), as well as the inflow and outflow country. Figure 2 is a barplot that shows to which countries most investments go to: in this case, it is mainly Singapore, followed by the Netherlands.\nFigure 1. Map of top 10 investments in fossil fuel companies from China\rFigure 2. Barplot of 10 investments in fossil fuel companies from China\rIn general, in the tool we can select the source and target countries for which we want to see the flows. There is also an option to select source or target for a country, which is useful when we want to focus on one country: for example, selecting all flows into the Netherlands or from the Netherlands. Besides, we can select the number of flows to visualize in the upper part. Lastly, we can select whether we want to visualize the inflow or the outflow. This last option changes the colouring of the countries (colouring either the inflow or outflow countries), and it also changes the barplot visualization.\nFigure 3. Some configuration options for the visualization\rHow to use Geoflow for your visualizations It is very straightforward to use Geoflow for your own visualizations. You only need to upload a .csv dataset to the app with the following columns:\nSource: the source of the flow, written in ISO2 format. Target: the target of the flow, also in ISO2 format Weight: strength of the flow (e.g., the number of migrants, or the financial revenue) Additionally, you can add:\nYear: If year is present, a new visualization on the bottom right panel shows a time series for the flows. You can see it in Figure 4 below. Other columns: They will be interpreted as categorical variables. This allows you to split the flows into categories, as shown in Figure 5 It is a requirement that the data format is .csv and the name of the variables are source, target, weight and (if included) year, so note you might have to reformat your data to use the tool.\nFigure 4. How the time series plot look like, with Germany highlighted\rFigure 5. Map of China top 10 fossil fuel investments, showing only investments in Manufacturing\rConclusion In this tutorial we have shown you how to use Geoflow for your own visualizations. 
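As a recap of the required input format, a sketch of preparing such a file with pandas could look like this (the country codes, weights and file name are made up for illustration):

import pandas as pd

# columns must be named exactly source, target, weight and (optionally) year
flows = pd.DataFrame({
    "source": ["CN", "CN", "NL"],   # ISO2 codes of the origin countries
    "target": ["SG", "NL", "DE"],   # ISO2 codes of the destination countries
    "weight": [120.5, 80.2, 15.0],  # strength of each flow
    "year":   [2020, 2020, 2021],   # optional: enables the time series plot
})
flows.to_csv("my_flows.csv", index=False)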
We have shown what the Geoflow capabilities are, and how to upload your data to use it. We hope you\u0026rsquo;ve find it inspiring!\nGeoflow is open-source and has been developed by Peter Kok, Javier García Bernardo and mbabic332. If you use the tool, you can cite this Zenodo repo, with a DOI. If you wish to expand on Geoflow, you can check the source code here, and contribute or build on it. The tool is written using JavaScript.\n","date":"December 11, 2023","image":"http://odissei-soda.nl/images/tutorial-7/geoflow-miniature_hu6dbb4503ac6d6ce6a020e99dbb85449c_282734_650x0_resize_box_3.png","permalink":"/tutorials/geoflow-visualizer/","title":"Visualizing international flows with Geoflow visualizer"},{"categories":null,"contents":"In October 2023, we hosted a workshop about data visualization.\nAbout the workshop A key aspect of data science is turning data into a story that anyone can understand at a glance. In this Data Visualization Bootcamp, you will learn how to represent data in visual formats like pictures, charts and graphs.\nIn this workshop you will:\nlearn the most important principles of data visualization,\nlearn how to use data visualization libraries in Python (with Bokeh and Pyodide),\ndevelop your own Panel/hvPlot/Bokeh dashboard to interactively visualize data.\nPlenary instruction is combined with hands-on practice with your own datasets in small groups. The workshop will cover how interactive visualizations and dashboarding in Python/Bokeh differs from RShiny. You will learn how to run Bokeh Apps in the browser via Pyodide (no server required).\nBy the end of the workshop, you will be able to create powerful storytelling visuals with your own research data.\nAdditional information When Friday, 20 October 2023. It was also taught in 2021 and 2022. Where Data Science Center, University of Amsterdam. Registration Registration is no longer possible. Instructors Javier Garcia-Bernardo, Assistant Professor at Utrecht University. Javier\u0026rsquo;s website. Materials Course materials, including slides and code, are open and can be accessed here. ","date":"October 20, 2023","image":"http://odissei-soda.nl/images/workshops/data-visualization_hu3d03a01dcc18bc5be0e67db3d8d209a6_1348270_650x0_resize_q100_box.jpg","permalink":"/workshops/data-visualization/","title":"Data Visualization Bootcamp"},{"categories":null,"contents":"One common issue we encounter in helping researchers work with the housing register data of Statistics Netherlands is its transactional nature: each row in the housing register table contains data on when someone registered and deregistered at an address (more info in Dutch here).\nIn this post, we show how to use this transactional data to perform one of the most common transformations we see: what part of a certain time interval (e.g, the entire year 2021 or January 1999) did the people I’m interested in live in the Netherlands? To solve this issue, we will use time interval objects, as implemented in the package {lubridate} which is part of the {tidyverse} since version 2.0.0.\nlibrary(tidyverse) The data Obviously, we cannot share actual Statistics Netherlands microdata here, so we first generate some tables that capture the gist of the data structure. 
First, let’s generate some basic person identifiers and some info about each person:\nCode (person_df \u0026lt;- tibble( person_id = factor(c(\u0026#34;A10232\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A10124\u0026#34;)), firstname = c(\u0026#34;Aron\u0026#34;, \u0026#34;Beth\u0026#34;, \u0026#34;Carol\u0026#34;, \u0026#34;Dave\u0026#34;), income_avg = c(14001, 45304, 110123, 43078) )) # A tibble: 4 × 3 person_id firstname income_avg \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 2 A39211 Beth 45304 3 A28183 Carol 110123 4 A10124 Dave 43078 Then, we create a small example of housing transaction register data. In this data, for any period where a person is not registered to a house, they are assumed to live abroad (because everyone in the Netherlands is required to be registered at an address).\nCode (house_df \u0026lt;- tibble( person_id = factor(c(\u0026#34;A10232\u0026#34;, \u0026#34;A10232\u0026#34;, \u0026#34;A10232\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A10124\u0026#34;)), house_id = factor(c(\u0026#34;H1200E\u0026#34;, \u0026#34;H1243D\u0026#34;, \u0026#34;H3432B\u0026#34;, \u0026#34;HA7382\u0026#34;, \u0026#34;H53621\u0026#34;, \u0026#34;HC39EF\u0026#34;, \u0026#34;HA3A01\u0026#34;, \u0026#34;H222BA\u0026#34;)), start_date = ymd(c(\u0026#34;20200101\u0026#34;, \u0026#34;20200112\u0026#34;, \u0026#34;20211120\u0026#34;, \u0026#34;19800101\u0026#34;, \u0026#34;19900101\u0026#34;, \u0026#34;20170303\u0026#34;, \u0026#34;20190202\u0026#34;, \u0026#34;19931023\u0026#34;)), end_date = ymd(c(\u0026#34;20200112\u0026#34;, \u0026#34;20211120\u0026#34;, \u0026#34;20230720\u0026#34;, \u0026#34;19891231\u0026#34;, \u0026#34;20170102\u0026#34;, \u0026#34;20180720\u0026#34;, \u0026#34;20230720\u0026#34;, \u0026#34;20230720\u0026#34;)) )) # A tibble: 8 × 4 person_id house_id start_date end_date \u0026lt;fct\u0026gt; \u0026lt;fct\u0026gt; \u0026lt;date\u0026gt; \u0026lt;date\u0026gt; 1 A10232 H1200E 2020-01-01 2020-01-12 2 A10232 H1243D 2020-01-12 2021-11-20 3 A10232 H3432B 2021-11-20 2023-07-20 4 A39211 HA7382 1980-01-01 1989-12-31 5 A39211 H53621 1990-01-01 2017-01-02 6 A28183 HC39EF 2017-03-03 2018-07-20 7 A28183 HA3A01 2019-02-02 2023-07-20 8 A10124 H222BA 1993-10-23 2023-07-20 Interval objects! Notice how each transaction in the housing data has a start and end date, indicating when someone registered and deregistered at an address. A natural representation of this information is as a single object: a time interval. 
The package {lubridate} has support for specific interval objects, and several operations on intervals:\ncomputing the length of an interval with int_length() computing whether two intervals overlap with int_overlap() and much more\u0026hellip; as you can see here So let’s transform these start and end columns into a single interval column!\nhouse_df \u0026lt;- house_df |\u0026gt; mutate( # create the interval int = interval(start_date, end_date), # drop the start/end columns .keep = \u0026#34;unused\u0026#34; ) house_df # A tibble: 8 × 3 person_id house_id int \u0026lt;fct\u0026gt; \u0026lt;fct\u0026gt; \u0026lt;Interval\u0026gt; 1 A10232 H1200E 2020-01-01 UTC--2020-01-12 UTC 2 A10232 H1243D 2020-01-12 UTC--2021-11-20 UTC 3 A10232 H3432B 2021-11-20 UTC--2023-07-20 UTC 4 A39211 HA7382 1980-01-01 UTC--1989-12-31 UTC 5 A39211 H53621 1990-01-01 UTC--2017-01-02 UTC 6 A28183 HC39EF 2017-03-03 UTC--2018-07-20 UTC 7 A28183 HA3A01 2019-02-02 UTC--2023-07-20 UTC 8 A10124 H222BA 1993-10-23 UTC--2023-07-20 UTC We will want to compare this interval with a reference interval to compute the proportion of time that a person lived in the Netherlands within the reference interval. Therefore, we quickly define a new interval operation which truncates an interval to a reference interval. Don’t worry too much about it for now, we will use it later. Do notice that we’re always using the int_*() functions defined by {lubridate} to interact with the interval objects.\n# utility function to truncate an interval object to limits (also vectorized so it works in mutate()) int_truncate \u0026lt;- function(int, int_limits) { int_start(int) \u0026lt;- pmax(int_start(int), int_start(int_limits)) int_end(int) \u0026lt;- pmin(int_end(int), int_end(int_limits)) return(int) } Computing the proportion in the Netherlands The next step is to define a function that computes for each person a proportion overlap for a reference interval. By creating a function, it will be easy later to do the same operation for different intervals (e.g., different reference years) to work with the rich nature of the Statistics Netherlands microdata. To compute this table, we make extensive use of the {tidyverse}, with verbs like filter(), mutate(), and summarize(). 
If you want to know more about these, take a look at the {dplyr} documentation (but of course you can also use your own flavour of data processing, such as {data.table} or base R).\n# function to compute overlap proportion per person proportion_tab \u0026lt;- function(housing_data, reference_interval) { # start with the housing data housing_data |\u0026gt; # only retain overlapping rows, this makes the following # operations more efficient by only computing what we need filter(int_overlaps(int, reference_interval)) |\u0026gt; # then, actually compute the overlap of the intervals mutate( # use our earlier truncate function int_tr = int_truncate(int, reference_interval), # then, it\u0026#39;s simple to compute the overlap proportion prop = int_length(int_tr) / int_length(reference_interval) ) |\u0026gt; # combine different intervals per person summarize(prop_in_nl = sum(prop), .by = person_id) } Now we’ve defined this function, let’s try it out for a specific year such as 2017!\nint_2017 \u0026lt;- interval(ymd(\u0026#34;20170101\u0026#34;), ymd(\u0026#34;20171231\u0026#34;)) prop_2017 \u0026lt;- proportion_tab(house_df, int_2017) prop_2017 # A tibble: 3 × 2 person_id prop_in_nl \u0026lt;fct\u0026gt; \u0026lt;dbl\u0026gt; 1 A39211 0.00275 2 A28183 0.832 3 A10124 1 Now we’ve computed this proportion, notice that we only have three people. This means that the other person was living abroad in that time, with a proportion in the Netherlands of 0. To nicely display this information, we can join the proportion table with the original person dataset and replace the NA values in the proportion column with 0.\nleft_join(person_df, prop_2017, by = \u0026#34;person_id\u0026#34;) |\u0026gt; mutate(prop_in_nl = replace_na(prop_in_nl, 0)) # A tibble: 4 × 4 person_id firstname income_avg prop_in_nl \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 0 2 A39211 Beth 45304 0.00275 3 A28183 Carol 110123 0.832 4 A10124 Dave 43078 1 Success! We now have a dataset for each person with the proportion of time they lived in the Netherlands in 2017. If you look at the original housing dataset, you may see the following patterns reflected in the proportion:\nAron indeed did not live in the Netherlands at this time. Beth moved away on January 2nd, 2017. Carol moved into the Netherlands on March 3rd, 2017 and remained there until 2018 Dave lived in the Netherlands this entire time. Conclusion In this post, we used interval objects and operations from the {lubridate} package to wrangle transactional housing data into a proportion of time spent living in the Netherlands. The advantage of using this package and its functions is that any particularities with timezones, date comparison, and leap years are automatically dealt with so that we could focus on the end result rather than the details.\nIf you are doing similar work and you have a different method, let us know! In addition, if you have further questions about working with Statistics Netherlands microdata or other complex or large social science datasets, do not hesitate to contact us on our website: https://odissei-soda.nl.\nBonus appendix: multiple time intervals Because we created a function that takes in the transaction data and a reference interval, we can do the same thing for multiple time intervals (e.g., years) and combine the data together in one wide or long dataset. 
This is one way to do this:\nlibrary(glue) # for easy string manipulation # initialize an empty dataframe with all our columns nl_prop \u0026lt;- tibble(person_id = factor(), prop_in_nl = double(), yr = integer()) # then loop over the years of interest for (yr in 2017L:2022L) { # construct reference interval for this year ref_int \u0026lt;- interval(ymd(glue(\u0026#34;{yr}0101\u0026#34;)), ymd(glue(\u0026#34;{yr}1231\u0026#34;))) # compute the proportion table for this year nl_prop_yr \u0026lt;- proportion_tab(house_df, ref_int) |\u0026gt; mutate(yr = yr) # append this year to the dataframe nl_prop \u0026lt;- bind_rows(nl_prop, nl_prop_yr) } # we can pivot it to a wide format nl_prop_wide \u0026lt;- nl_prop |\u0026gt; pivot_wider( names_from = yr, names_prefix = \u0026#34;nl_prop_\u0026#34;, values_from = prop_in_nl ) # and join it with the original person data, replacing NAs with 0 again person_df |\u0026gt; left_join(nl_prop_wide, by = \u0026#34;person_id\u0026#34;) |\u0026gt; mutate(across(starts_with(\u0026#34;nl_prop_\u0026#34;), \\(p) replace_na(p, 0))) # A tibble: 4 × 9 person_id firstname income_avg nl_prop_2017 nl_prop_2018 nl_prop_2019 \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 0 0 0 2 A39211 Beth 45304 0.00275 0 0 3 A28183 Carol 110123 0.832 0.549 0.912 4 A10124 Dave 43078 1 1 1 # ℹ 3 more variables: nl_prop_2020 \u0026lt;dbl\u0026gt;, nl_prop_2021 \u0026lt;dbl\u0026gt;, # nl_prop_2022 \u0026lt;dbl\u0026gt; ","date":"September 29, 2023","image":"http://odissei-soda.nl/images/tutorial-5/lubridate_1_hu9ab8adf199cf1b31e00f5b6813db3886_430902_650x0_resize_box_3.png","permalink":"/tutorials/lubridate/","title":"Wrangling interval data using lubridate"},{"categories":null,"contents":"These days, our online presence leaves traces of our behavior everywhere. There is data of what we do and say in platforms such as WhatsApp, Instagram, online stores, and many others. Of course, this so-called ‘digital trace data’ is of interest for social scientists: new, rich, enormous datasets that can be used to describe and understand our social world. However, this data is commonly owned by private companies. How can social scientists access and make sense of this data?\nIn this tutorial, we use data donation and the Port software to get access to WhatsApp group-chat data in a way that completely preserves privacy of research participants. Our goal is to show a small peek of what can be achieved with these methods. If you have an idea for your own research that entails collecting digital trace data, don’t hesitate to contact us! We can help you think about data acquisition, analysis and more.\nWith data donation, it is possible to collect data about any online platform: under the General Data Protection Regulation (EU law), companies are required to provide their data to any citizen that requests it. This data is available in so-called Data Download Packages (DDP’s), which are rather cumbersome to work with and contain personal information. Therefore, the Port software processes these DDP’s so that the data is in a format ready for analysis, while completely guaranteeing privacy of the respondents. The only thing research participants have to do is request their DDP’s, see which information they are sharing and consent to sharing it.\nSince we do not dive in with a lot of detail, we refer to Port\u0026rsquo;s github for more details on how to get started with your own project. 
There you can find a full guide to install and use Port, examples of past studies done with it, a tutorial for creating your own data donation workflow, and more. You can also read more about data donation in general here and here.\nAn application with WhatsApp data In this example, we extract some basic information from WhatsApp group chats, such as how many messages links, locations, and pictures were shared, as well as which person in the group the participant responded most to.\nNote that this is the only information we want to collect from the participants of the study, not the whole group chat file!\nThe first step in creating a DDP processing script is to obtain an example DDP and examine it. This example DDP can be, for example, your own DDP requested from WhatsApp. Usually, platforms provide a (compressed) folder with many different files; i.e., data in a format that is not ready to use. Once uncompressed, a WhatsApp group chat file could look like this:\n[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more. [16/03/2022, 15:20:25] person1: Hi shiva! [16/03/2022, 15:25:38] person2: Hi 👋 [16/03/2022, 15:26:48] person3: Hoi! [16/03/2022, 18:39:29] person2: https://youtu.be/KBmUTY6mK_E [16/03/2022, 18:35:51] person1: Location: https://maps.google.com/?q=52.089451,5.108469 [20/03/2022, 20:08:51] person4: I’m about to generate some very random messages so that I can make some screenshots for the explanation to participants [24/03/2022, 20:19:38] person1: @user3 if you remove your Profile picture for a moment I will redo the screenshots 😁 [26/03/2022, 18:52:15] person2: Well done Utrecht 😁 [14/07/2020, 22:05:54] person4: 👍Bedankt As part of a collaboration with data donation researchers using Port, we wrote a Python script1 to convert this into the information we need. The main script is available here; in short, it does the following:\nseparate the header and the message itself parse the date and time information in each message remove unneeded information such as alert notifications anonymize usernames convert the extracted information to a nice data frame format to show the participant for consent One big problem we had to overcome is that messages and alert notifications cannot be identified in the same way (i.e., using the same regular expression) on every device. Through trial-and-error, we tailored the steps to work with every operating system, language, and device required for this study. Indeed, if you design a study like this, it is very important to try out your script on many different DDPs from different people and devices. That way you will make sure you have covered possible variation in DPPs before actually starting data collection. This is a process that can take quite a while, so keep this in mind when you want to run a data donation study!\nThe end result In Figure 1 you can see a (fictitious) snippet of the dataset obtained. This is how a dataset in which you combine donations from different users would look like. As can be seen, we have moved from a rather untidy text file to a tidy, directly analysable dataset, where each row corresponds to an user in a given data donation package, and the rest of the columns give information about that user. 
In particular, the dataset displays the following information: data donation package id (ddp_id), an anonymized user name (user_name), number of words sent on the groupchat by the user (nwords), date of first and last messages sent on the groupchat by the user (date_first_mess, date_last_mess), number of urls, files and locations sent on the groupchat by the user (respectively, nurls, nfiles, nlocations), and the (other) user that has replied the most to that user (replies_from), as well as the user that that user has replied to the most (replies_to).\nTable 1. Snippet of fictitious dataset\nddp_id user_name nwords date_first_mess date_last_mess nurls nfiles nlocations replies_from replies_to 1 User1_1 121 10/08/2023 27/08/2023 0 15 0 User1_2 User1_4 1 User1_2 17 11/08/2023 28/08/2023 3 1 2 User1_1 User1_1 1 User1_3 44 10/08/2023 28/08/2023 9 6 3 User1_2 User1_1 1 User1_4 50 12/08/2023 29/08/2023 0 3 1 User1_3 User1_1 2 User2_1 123 01/05/2022 01/11/2022 2 0 1 User2_2 User2_2 2 User2_2 250 01/05/2022 02/11/2022 0 32 3 User2_3 User2_1 2 User2_3 176 08/07/2022 04/12/2022 6 0 5 User2_2 User2_3 3 User3_1 12 05/06/2023 26/07/2023 12 2 0 User3_1 User3_2 3 User3_2 16 06/06/2023 26/07/2023 17 2 0 User3_2 User3_1 In Figure 2 you can see a screenshot of how the Port software would display the data to be shared (number of words or messages, date stamps…) and ask the research subjects for consent. As you see, the Port software guarantees that research subjects are aware of what information they are sharing and consent to it. The rest of the DDPs, including sensitive data, are analyzed locally and do not leave the respondents\u0026rsquo; devices.\nFigure 2. How the Port software displays the data to be shared and asks for consent\rConclusion The aim of this post was to illustrate how to use data donation with the software Port to extract online platform data. We illustrated all of this with the extraction of group-chat information from WhatsApp data. The main challenge of this project was to write a robust script that transforms this data into a nice, readily usable format while maintaining privacy. If you want to implement something similar but do not know how or where to start, let us know and we can help!\nThis script uses a deprecated version of Port, but a large part of the script can be reused;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"September 8, 2023","image":"http://odissei-soda.nl/images/tutorial-4/whatsapp_header_hu3d03a01dcc18bc5be0e67db3d8d209a6_1708177_650x0_resize_q100_box.jpg","permalink":"/tutorials/collect-online-data-whatsapp/","title":"Collecting online platforms data for science: an example using WhatsApp"},{"categories":null,"contents":"In June 2023, we hosted a workshop about Network Science.\nAbout the workshop How can networks at Statistics Netherlands (CBS) help us understand and predict social systems? In this workshop, we provide participants with the conceptual and practical skills necessary to use network science tools to answer social, economic and biological questions.\nThis workshop introduces concepts and tools in network science. The objective of the course is that participants acquire hands-on knowledge on how to analyze CBS network data. Participants will be able to understand when a network approach is useful, work with networks efficiently, and create network variables.\nThe course has a hands-on focus, with lectures accompanied by programming practicals (in Python) to apply the knowledge to real networks.\nAdditional information When Thursday June 22, 2023.
(It was also taught in 2022, together with Eszter Boyani). Where Summer Institute in Computational Social Science, Erasmus University Rotterdam. Registration Registration is no longer possible. Instructors Javier Garcia-Bernardo, Assistant Professor at Utrecht University. Javier\u0026rsquo;s website. Materials Course materials, including slides and code, are open and can be accessed here. ","date":"June 22, 2023","image":"http://odissei-soda.nl/images/workshops/network-science_hu3d03a01dcc18bc5be0e67db3d8d209a6_668913_650x0_resize_q100_box.jpg","permalink":"/workshops/network-science/","title":"Network science"},{"categories":null,"contents":"In May 2023, we hosted a workshop on Causal Impact Assessment focusing on policy interventions.\nAbout the workshop How do we assess whether a school policy intervention has had the desired effect on student performance? How do we estimate the impact a natural disaster has had on the inhabitants of affected regions? How can we determine whether a change in the maximum speed on highways has led to fewer accidents? These types of questions are at the core of many social scientific research problems. While questions with this structure are seemingly simple, their causal effects are notoriously hard to estimate, because often the researchers cannot perform a randomized controlled experiment.\nThe workshop offers hands-on training on several basic and advanced methods (e.g., difference-in-differences, interrupted time series and synthetic control) that can be applied when assessing causal impact. There is a focus on both the assumptions underlying the methods, and how to put them into practice.\nAdditional information When May, 2023. It is likely to be held again in 2024. Where Utrecht University. Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website) \u0026amp; Oisín Ryan, Assistant Professor at UMC (Oisín\u0026rsquo;s website). Materials Course materials, including slides and code, are open and can be accessed here. ","date":"May 26, 2023","image":"http://odissei-soda.nl/images/workshops/causal-inference_hu6a86fe0f271ec969b45a77f81a48f6ab_152168_650x0_resize_box_3.png","permalink":"/workshops/causal-inference-for-policy-evaluation/","title":"Causal Impact Assesment"},{"categories":null,"contents":"ArtScraper is a Python library to download images and metadata for artworks available on WikiArt and Google Arts \u0026amp; Culture.\nInstallation pip install git+https://github.com/sodascience/artscraper.git Downloading images from WikiArt To download data from WikiArt it is necessary to obtain free API keys.\nOnce you have the API keys, you can simply run the code below to download the images and metadata of three artworks by Aleksandra Ekster.\n# artworks to scrape some_links = [ \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/women-s-costume-1918\u0026#34;, \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/still-life-1913\u0026#34;, \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/view-of-paris-1912\u0026#34; ] # download images and metadata to the folder \u0026#34;data\u0026#34; with WikiArtScraper(output_dir=\u0026#34;data\u0026#34;) as scraper: for url in some_links: scraper.load_link(url) scraper.save_metadata() scraper.save_image() Downloading images from Google Arts \u0026amp; Culture To download data from Google Arts \u0026amp; Culture you need to download Firefox and geckodriver.
The installation instructions can be found in our GitHub repository.\nOnce you have Firefox and geckodriver, you can simply run the code below to download artworks. You are not allowed to share or publish the images. Use them only for research.\n# artworks to scrape some_links = [ \u0026#34;https://artsandculture.google.com/asset/helena-hunter-fairytales/dwFMypq0ZSiq6w\u0026#34;, \u0026#34;https://artsandculture.google.com/asset/erina-takahashi-and-isaac-hernandez-in-fantastic-beings-laurent-liotardo/MQEhgoWpWJUd_w\u0026#34;, \u0026#34;https://artsandculture.google.com/asset/rinaldo-roberto-masotti/swG7r2rgfvPOFQ\u0026#34; ] # If you are on Windows, you can download geckodriver, place it in your directory, # and use the argument geckodriver_path=\u0026#34;geckodriver.exe\u0026#34; with GoogleArtScraper(\u0026#34;data\u0026#34;) as scraper: for url in some_links: scraper.load_link(url) scraper.save_metadata() scraper.save_image() You can find more examples here\nDo you want to know more about this library? Check our GitHub repository\nAre you using it for academic work? Please cite our package:\nSchram, Raoul, Garcia-Bernardo, Javier, van Kesteren, Erik-Jan, de Bruin, Jonathan, \u0026amp; Stamkou, Eftychia. (2022). ArtScraper: A Python library to scrape online artworks (0.1.1). Zenodo. https://doi.org/10.5281/zenodo.7129975 ","date":"October 4, 2022","image":"http://odissei-soda.nl/images/tutorial-2/nantenbo_header_hu48d98726654ff846597d2fe25a58192f_52421_650x0_resize_q100_box.jpg","permalink":"/tutorials/artscraper/","title":"ArtScraper: A Python library to scrape online artworks"},{"categories":null,"contents":"With the increasing popularity of open science practices, it is now more and more common to openly share code along with more traditional scientific objects such as papers. But what are the best ways to create an understandable, openly accessible, findable, citable, and stable archive of your code? In this post, we look at what you need to do to prepare your code folder and then how to upload it to Zenodo.\nPrepare your code folder To make code available, you will be uploading it to the internet as a single folder. The code you will upload will be openly accessible, and it will stay that way indefinitely. Therefore, it is necessary that you prepare your code folder (also called a “repository”) for publication. This requires time and effort, and for every project the requirements are different. Think about the following checklist:\nMust-haves Make a logical, understandable folder structure. For example, for a research project with data processing, visualization, and analysis I like the following structure: my_project/\u0026lt;br/\u0026gt; ├─ raw_data/\u0026lt;br/\u0026gt; │ ├─ questionnaire_data.csv\u0026lt;br/\u0026gt; ├─ processed_data/\u0026lt;br/\u0026gt; │ ├─ questionnaire_processed.rds\u0026lt;br/\u0026gt; │ ├─ analysis_object.rds\u0026lt;br/\u0026gt; ├─ img/\u0026lt;br/\u0026gt; │ ├─ plot.png\u0026lt;br/\u0026gt; ├─ 01_load_and_process_data.R\u0026lt;br/\u0026gt; ├─ 02_create_visualisations.R\u0026lt;br/\u0026gt; ├─ 03_main_analysis.R\u0026lt;br/\u0026gt; ├─ 04_output_results.R\u0026lt;br/\u0026gt; ├─ my_project.Rproj\u0026lt;br/\u0026gt; ├─ readme.md Make sure no privacy-sensitive information is leaked. Remove non-shareable data objects (raw and processed!), passwords hardcoded in your scripts, comments containing private information, and so on. 
Create a legible readme file in the folder that describes what the code does, where to find which parts of the code, and what needs to be done to run the code. You can choose how elaborate to make this! It could be a simple text file, a word document, a pdf, or a markdown document with images describing the structure. It is best if someone who does not know the project can understand the entire folder based on the readme – this includes yourself in a few years from now! Strong recommendations Reformat the code so that it is portable and easily reproducible. This means that when someone else downloads the folder, they do not need to change the code to run it. For example this means that you do not read data with absolute paths (e.g., C:/my_name/Documents/PhD/projects/project_title/raw_data/questionnaire_data.csv) on your computer, but only to relative paths on the project (e.g., raw_data/questionnaire_data.csv). For example, if you use the R programming language it is good practice to use an RStudio project. Format your code so that it is legible by others. Write informative comments, split up your scripts in logical chunks, and use a consistent style (for R I like the tidyverse style) Nice to have Record the software packages that you used to run the projects, including their versions. If a package gets updated, your code may no longer run! Your package manager may already do this, e.g., for python you can use pip freeze \u0026gt; requirements.txt. In R, you can use the renv package for this. If you have privacy-sensitive data, it may still be possible to create a synthetic or fake version of this data for others to run the code on. This ensures maximum reproducibility. Compressing the code folder The last step before uploading the code repository to Zenodo is to compress the folder. This can be done in Windows 11 by right-clicking the folder and pressing “compress to zip file”. It’s a good idea to go into the compressed folder afterwards, and checking if everything is there and also removing any unnecessary files (such as .Rhistory files for R).\nFigure 1: Zipping the code folder.\rAfter compressing, your code repository is now ready to be uploaded!\nUploading to Zenodo Zenodo is a website where you can upload any kind of research object: papers, code, datasets, questionnaires, presentations, and much more. After uploading, Zenodo will create a page containing your research object and metadata about the object, such as publication date, author, and keywords. In the figure below you can see an example of a code repository uploaded to Zenodo.\nFigure 2: A code repository uploaded to the Zenodo website. See https://zenodo.org/record/6504837\rOne of the key features of Zenodo is that you can get a Digital Object Identifier (DOI) for the objects you upload, making your research objects persistent and easy to find and cite. For example, in APA style I could cite the code as follows:\nvan Kesteren, Erik-Jan. (2022). My project (v1.2). Zenodo. https://doi.org/10.5281/zenodo.6504837\nZenodo itself is fully open source, hosted by CERN, and funded by the European Commission. These are exactly the kinds of conditions which make it likely to last for a long time! Hence, it is an excellent choice for uploading our code. So let’s get started!\nCreate an account To upload anything to Zenodo, you need an account. If you already have an ORCID or a GitHub account, then you can link these immediately to your Zenodo login. 
I do recommend doing so as it will make it easy to link these services and use them together.\nFigure 3: Zenodo sign-up page. See https://zenodo.org/signup/\rStart a new upload When you click the “upload” button, you will get a page where you can upload your files, determine the type of the upload, and create metadata for the research object. Now zip your prepared code folder and drag it to the upload window!\nFigure 4: Uploading a zipped folder to the Zenodo website.\rFill out the metadata One of the first options you need to specify is the “upload type”. For code repositories, you can choose the “software” option. The remaining metadata is relatively simple to fill out (such as author and institution). However, one category to pay attention to is the license: by default the CC-BY-4.0 license is selected. For a short overview of what this means, see the creative commons website: https://creativecommons.org/licenses/by/4.0/. You can opt for a different license by including a file called LICENSE in your repository.\nFigure 5: Selecting the \u0026lsquo;software\u0026rsquo; option for upload type.\rPublish! The last step is to click “publish”. Your research code is now findable, citable, understandable, reproducible, and archived until the end of times! You can now show it to all your colleagues and easily cite it in your manuscript. If you get feedback and you want to change your code, you can also upload a new version of the same project on the Zenodo website.\nConclusion In this post, I described a checklist for preparing your code folder for publication with a focus on understandability, and I have described one way in which you can upload your prepared code repository to an open access archive. Zenodo is an easy, dependable and well-built option, but of course there are many alternatives, such as hosting it on your own website, using the Open Science Framework, GitHub, or using a publisher’s website; each has its own advantages and disadvantages.\n","date":"September 5, 2022","image":"http://odissei-soda.nl/images/tutorial-1/tutorial1_header_hu650f89d19acf6379f59d4eaf3a39c00b_29682_650x0_resize_box_3.png","permalink":"/tutorials/share-your-reserarch-code/","title":"How to share your research code"},{"categories":null,"contents":"In May 2022, we hosted a workshop about efficient programming for accessing CBS microdata.\nAbout the workshop What to do when your CBS microdata analysis takes too many computational resources to run on the remote access environment? In this workshop we covered solutions to this problem. It will be an accessible introduction to a variety of ways in which you can programme more efficiently when using microdata in your research. Furthermore, it will discuss when you should and should not move your project to the ODISSEI Secure Supercomputer.\nThe introduction will include some live coding, exploring different options for project organisation, speeding up code, benchmarking, profiling, and reducing memory requirements. During his talk, Van Kesteren will also touch upon topics such as \u0026ldquo;embarassingly parallel\u0026rdquo;, scientific programming, data pipelines, open source, and open science. Although the presentation will center around data analysis with R, these principles also hold for other languages, such as Python or Julia.\nAdditional information When May 16th, 2022. Where Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website). 
Materials Course materials, including slides and code, are open and can be accessed here. ","date":"May 16, 2022","image":"http://odissei-soda.nl/images/workshops/cbs_hub238c02f97a123a553dda5ed269ec1f4_33026_650x0_resize_q100_box.jpg","permalink":"/workshops/accesing-cbs-microdata/","title":"Efficient programming with CBS microdata"},{"categories":null,"contents":"In September 2022, we hosted a workshop about synthetic data.\nAbout the workshop Open data is one of the pillars of open science. However, there are often barriers in the way of making research data openly available, relating to consent, privacy, or organisational boundaries. In such cases, synthetic data is an excellent solution: the real data is kept secret, but a \u0026ldquo;fake\u0026rdquo; version of the data is available. The promise of the synthetic dataset is that others can then investigate the data structure, rerun scripts, use the data in educational materials, or even run a completely different analysis on their own.\nBut how do you generate synthetic data? In this session, we will introduce the field of synthetic data generation and apply several tools to generate synthetic versions of datasets, with various level of utility and privacy. We will be paying extra attention to practical issues such as missing values, data types, and disclosure control. Participants can either use a provided example dataset or they can bring their own data!\nAdditional information When September 2022. Where Open Science Festival. Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website), Raoul Schram, Research Engineer at Utrecht University (Raoul\u0026rsquo;s website) \u0026amp; Thom Volker, PhD Candidate at Utrecht Universit (Thom\u0026rsquo;s website). Materials Course materials, including slides and code, are open and can be accessed here. ","date":"January 9, 2022","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/workshops/creating-synthetic-data/","title":"How to create synthetic data"},{"categories":null,"contents":"During your time as a fellow:\nyou will spend between 3-5 months full-time* working on a social science research project. you can propose your own project, based on your interests. you are a member of the SoDa team at the Methodology \u0026amp; Statistics department of Utrecht University. you will get a salary during this time, paid for by the team. one of the senior team members will be your mentor. To apply, you have to submit a short proposal for your project, together with a substantive supervisor. 
We are looking for projects in the social sciences for which a computational or data-related problem needs to be solved.\nThe next submission deadline is 31 May 2024\n* part-time possible, but the project should be your main priority.\n","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/fellowship/","title":"Fellowship"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/principles/","title":"Principles"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/projects/","title":"Projects"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/team/","title":"The SoDa Team"}] \ No newline at end of file +[{"categories":null,"contents":"In this course, you will learn the basics of cluster computing with R for social science workloads. The day starts with an introduction to supercomputer architecture, including a hands-on session focused on running jobs on a supercomputer.\nThe second half of the programme focuses on translating your R workflow from a GUI (Rstudio) workflow on your desktop to a scripting/batch environment on the supercomputer. Topics covered here include: efficient programming, parallel computing, and using the SLURM job manager to send your job/analysis to the supercomputer.\nIn this course you will:\nDo practical exercises to learn how to effectively use the Lisa national computing cluster and the national supercomputer, and how to complete your tasks with minimal effort in the shortest possible time. Experience how to achieve high performance with R by using SURF\u0026rsquo;s supercomputing facilities. Additional information When 27-09-2024 Where SURF Amsterdam Registration https://www.surf.nl/en/agenda/cluster-computing-for-social-scientists-with-r Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University, together with a SURF instructor Erik-Jan\u0026rsquo;s website Materials Course materials, including slides and code, are open and can be accessed here. ","date":"September 27, 2024","image":"http://odissei-soda.nl/images/workshops/Supercomputing_hubeda2984481e4d3f13fb836f8a1151c6_83157_650x0_resize_box_3.png","permalink":"/workshops/cluster-computing/","title":"Supercomputing for social scientists with R"},{"categories":null,"contents":"Signed networks are a way to represent relationships between entities. These types of networks are called \u0026lsquo;signed\u0026rsquo; because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting). Community detection in signed networks aims to identify groups of nodes that share similar connection patterns. In this tutorial, we will guide you through applying two popular community detection algorithms to signed networks, using Python.\nAlgorithms We will be using two algorithms:\nSpinglass: This algorithm leverages a spin model metaphor to partition nodes into communities. It considers both the weights and signs of connections. 
SPONGE: This spectral clustering technique identifies communities by analyzing the eigenvectors of the signed adjacency matrix. Both algorithms are implemented in Python. First, you need to install the necessary libraries (pandas is not strictly necessary to implement the algorithms, but we\u0026rsquo;ll use it throughout the tutorial).\npip install igraph pip install git+https://github.com/alan-turing-institute/SigNet.git pip install pandas Once installed, you can import the libraries and start working with the algorithms.\nimport igraph as ig from signet.cluster import Cluster import pandas as pd Construct and visualize a signed network from data To begin, we\u0026rsquo;ll construct an example network. This network will be constructed from a series of signed interactions among agents: the edgelist. We can use pandas to read the edgelist from a file or from a list of tuples. In this network, 1 means a positive interaction, and -1 means a negative interaction.\nedgelist = [(\u0026#39;A\u0026#39;, \u0026#39;B\u0026#39;, 1), (\u0026#39;A\u0026#39;, \u0026#39;C\u0026#39;, -1), (\u0026#39;B\u0026#39;, \u0026#39;C\u0026#39;, -1) ] # create DataFrame using edgelist edgelist_df = pd.DataFrame(edgelist, columns =[\u0026#39;source\u0026#39;, \u0026#39;target\u0026#39;, \u0026#39;weight\u0026#39;]) To construct the signed network, we use igraph. In the network, the nodes (agents) will be A, B, C, and the edges (interactions) are the ones listed in the edgelist, with their corresponding weights.\ng = ig.Graph.TupleList(edgelist_df.itertuples(index=False), directed=False, weights=False, edge_attrs=\u0026#34;weight\u0026#34;) ig.summary(g) IGRAPH UNW- 3 3 -- + attr: name (v), weight (e) Then, we can plot our network, still using igraph.\n# color the edges based on their weight g.es[\u0026#34;color\u0026#34;] = [\u0026#34;#00BFB3\u0026#34; if edge[\u0026#39;weight\u0026#39;] \u0026gt;0 else \u0026#34;#F05D5E\u0026#34; for edge in g.es] ig.plot(g, vertex_color=\u0026#34;grey\u0026#34;, edge_color=g.es[\u0026#34;color\u0026#34;], edge_width=5, vertex_label=g.vs[\u0026#34;name\u0026#34;]) Figure 1. A signed network\rDetect communities Spinglass The community-spinglass method is a community detection approach grounded in statistical mechanics. Initially proposed by Reichardt and Bornholdt for unsigned networks, it was later extended to signed networks by Traag and Bruggeman. This algorithm is implemented in the igraph package.\nspinglass = g.community_spinglass(weights=\u0026#34;weight\u0026#34;, spins=50, gamma= 0.5, lambda_= 0.5, implementation=\u0026#39;neg\u0026#39;) for i, node in enumerate(g.vs): print(f\u0026#39;node {node[\u0026#34;name\u0026#34;]}: community {spinglass.membership[i]}\u0026#39;) node A: community 0 node B: community 0 node C: community 1 Changing the parameters gamma and lambda_ gives more or less importance to positive or negative ties within a community, depending on whether we want agents with negative interactions to be found in the same group of agents.\nSPONGE The SPONGE method (Signed Positive Over Negative Generalized Eigenproblem), introduced by Cucuringu et al. (2019), is based on minimizing the number of violations. These \u0026ldquo;violations\u0026rdquo; consist of positive edges between communities and negative edges within communities.
An open-source Python implementation of the SPONGE algorithm is available on GitHub.\n# get the adjacency matrix of the network (only signs, not weights) A = g.get_adjacency_sparse(attribute=\u0026#39;weight\u0026#39;).sign() c = Cluster((A.multiply(A\u0026gt;0), -A.multiply(A\u0026lt;0))) sponge = c.SPONGE(k=2) node A: community 0 node B: community 0 node C: community 1 By changing the parameter k, you can set as many communities as you want.\nConclusion This tutorial has equipped you with the knowledge and code to apply two common community detection algorithms – Spinglass and SPONGE – to signed networks using Python libraries like igraph and signet. By applying these algorithms, you can gain insights into the underlying community structure of signed networks.\nAs you saw, there are some parameter choices to be made. We found that changing parameters can drastically influence the results of the algorithms (i.e., finding communities where there are none, or simply not looking for what you wanted to find). We suggest you check our latest preprint Community detection in bipartite signed networks is highly dependent on parameter choice and the related code to discover more about the parameter tuning in the case of two-mode (bipartite) signed networks.\nIf you are doing similar work and you have a different method, let us know! In addition, if you have further questions about community detection, or you think we can help you, do not hesitate to contact us!\nReferences Reichardt, Jörg, and Stefan Bornholdt. “Statistical Mechanics of Community Detection.” Physical Review E, vol. 74, no. 1, July 2006. Crossref, https://doi.org/10.1103/physreve.74.016110. Traag, V. A., and Jeroen Bruggeman. “Community Detection in Networks with Positive and Negative Links.” Physical Review E, vol. 80, no. 3, Sept. 2009. Crossref, https://doi.org/10.1103/physreve.80.036115. Cucuringu, Mihai, et al. \u0026ldquo;SPONGE: A Generalized Eigenproblem for Clustering Signed Networks.\u0026rdquo; arXiv, 2019, https://arxiv.org/abs/1904.08575 Candellone, Elena, et al. \u0026ldquo;Community Detection in Bipartite Signed Networks is Highly Dependent on Parameter Choice.\u0026rdquo; arXiv, 2024, https://arxiv.org/abs/2405.08203 ","date":"May 15, 2024","image":"http://odissei-soda.nl/images/tutorial-9/algorithm_hu5459c0360c2b0cb7a147d2df0eb350ca_1164949_650x0_resize_q100_box.jpg","permalink":"/tutorials/community-detection-signed-networks/","title":"Detecting communities in signed networks with Python"},{"categories":null,"contents":"Doing open, reproducible science means doing your best to openly share research data and analysis code. With these open materials, others can check and understand your research, use it to prepare their own analysis, find examples for teaching, and more. However, sometimes datasets contain sensitive or confidential information, which makes it difficult — if not impossible — to share. In this case, producing and sharing a synthetic version of the data might be a solution. In this post, we show how to do this in an auditable, transparent way with the software package metasyn.\nMetasyn is a Python package that helps you to generate synthetic data, with two ideas in mind. First, it is easy to use and understand. Second, and most importantly, it is privacy-friendly. Unlike most other synthetic data generation tools, metasyn strictly limits the statistical information in its data generation model to adhere to the highest privacy standards and only generates data that is similar on an individual column level.
This makes it a great tool for initial exploration, code development, and sharing of datasets while maintaining very high privacy levels, but it is not suitable for in-depth statistical analysis.\nWith metasyn, you fit a model to a dataset and synthesize data similar to the original based on that model. You can then export the synthetic data and the model used to generate it, in an easy-to-read format. As a result, metasyn allows data owners to safely share synthetic datasets based on their source data, as well as the model used to generate them, without worrying about leaking any private information from the original dataset.\nLet\u0026rsquo;s say you want to use metasyn to collaborate on a sensitive dataset with others. In this tutorial, we will show you everything you need to know to get started.\nStep 1: Setup The first step is installing metasyn. The easiest way to do so is by installing it through pip. This can be done by typing the following command in your terminal:\npip install metasyn Then, in a Python environment, you can import metasyn (and Polars, which will be used to load the dataset):\nimport polars as pl from metasyn import MetaFrame, demo_file Step 2: Creating a DataFrame Before we can pass a dataset into metasyn, we need to convert it to a Polars DataFrame. In doing so, we can indicate which columns contain categorical values. We can also tell polars to find columns that may contain dates or timestamps. Metasyn can later use this information to generate categorical or date-like values where appropriate. For more information on how to use Polars, check out the Polars documentation.\nFor this tutorial, we will use the Titanic dataset, which comes preloaded with metasyn (its file path can be accessed using the demo_file function). We will specify the data types of the Sex and Embarked columns as categorical, and we will also try to parse dates in the DataFrame.\n# Get the CSV file path for the Titanic dataset csv_path = demo_file(\u0026#34;titanic\u0026#34;) # Replace this with your file path if needed # Create a Polars DataFrame df = pl.read_csv( source=csv_path, dtypes={\u0026#34;Sex\u0026#34;: pl.Categorical, \u0026#34;Embarked\u0026#34;: pl.Categorical}, try_parse_dates=True, ) Step 3: Generating a MetaFrame Now that we have created a DataFrame, we can easily generate a MetaFrame for it. Metasyn can later use this MetaFrame to generate synthetic data that aligns with the original dataset.\nA MetaFrame is a simple model that captures the essentials of each variable in the original dataset (e.g., variable names, types, data types, the percentage of missing values, and distribution), without containing any actual data entries.\nA MetaFrame can be created by simply calling MetaFrame.fit_dataframe(), passing in the DataFrame as a parameter.\n# Generate and fit a MetaFrame to the DataFrame mf = MetaFrame.fit_dataframe(df) Step 4: Generating synthetic data With our MetaFrame in place, we can use it to generate synthetic data. To do so, we can call synthesize on our MetaFrame, and pass in the number of rows of data that we want to generate. This will return a DataFrame with synthetic data that is similar to our original dataset.\n# generate synthetic data syn_df = mf.synthesize(5) That\u0026rsquo;s it!
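As an optional sanity check (an illustrative addition, not a step from the original tutorial), you can inspect the synthetic frame and confirm that its shape and column names match your expectations; the exact values will differ on every run because the data are newly generated:

print(syn_df.head())                 # five synthetic rows with the same columns as the Titanic data
print(syn_df.shape)                  # (5, number of columns in the original DataFrame)
print(syn_df.columns == df.columns)  # True: column names are preserved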
You can now read, analyze, modify, use and share this DataFrame as you would with any other \u0026ndash; knowing that it is rather unlikely to leak private information (though if you need actual formal privacy guarantees, look at our disclosure control plugin).\nStep 5: Exporting the MetaFrame Let\u0026rsquo;s say we want to go one step further, and also share an auditable representation of the MetaFrame alongside our synthetic data. We can easily do so by exporting it to a JSON file.\nThese exported files follow the Generative Metadata Format (GMF). This is a format that was designed to be easy to read and understand.\nOther users can then import this file to generate synthetic data similar to the original dataset, without ever having access to the original data. In addition, due to these files being easy to read, others can easily understand and evaluate how the synthetic data is generated.\nTo export the MetaFrame, we can call the export method on an existing MetaFrame (in this case, mf), passing in the file path where we want to save the JSON file.\n# Serialize and export the MetaFrame mf.export(\u0026#34;exported_metaframe.json\u0026#34;) To load the MetaFrame from the exported JSON file, we can use the MetaFrame.from_json() class method, passing in the file path as a parameter:\n# Create a MetaFrame based on a GMF (.json) file mf = MetaFrame.from_json(\u0026#34;exported_metaframe.json\u0026#34;) Conclusion You now know how to use metasyn to generate synthetic data from a dataset. Both the synthetic data and the model (MetaFrame) used to generate it can be shared safely, while maintaining a high level of privacy.\nEnjoy using metasyn!\nFor more information on how to use metasyn, check out the documentation or the GitHub repository.\n","date":"April 26, 2024","image":"http://odissei-soda.nl/images/metasyn-tutorial/metasyn_hu815f2b6e5313f67d740aa04f6f4cb5c8_145073_650x0_resize_box_3.png","permalink":"/tutorials/generating-synthetic-data-with-metasyn/","title":"Generating synthetic data in a safe way with metasyn"},{"categories":null,"contents":"About the workshop The workshop will teach how to work efficiently with data from CBS (Statistics Netherlands), using its remote access environment. It is designed for people getting access to the remote environment for the first time, or people who want to improve the efficiency and reproducibility of their workflows. Topics covered include project organization, principles for writing legible programs (in R, but people using other languages are encouraged to join as well; the principles apply to all languages), and how to retrieve, store and configure data pipelines in a reproducible and understandable way.\nThere will be a 2-hour plenary session and a 1-hour consultation session, where specific questions will be addressed in small groups. Questions can be sent in advance during the registration process.\nAdditional information When Friday, 19th April 2024, from 13.30h to 16.30h Where Utrecht University, Administration building, room “Van Lier \u0026amp; Eggink”. Registration You can register here, and consult additional info here. It is free of charge for ODISSEI organization members. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University. Erik-Jan\u0026rsquo;s website Materials A link with all workshop materials can be found here.
","date":"April 19, 2024","image":"http://odissei-soda.nl/images/metasyn-tutorial/metasyn_hu815f2b6e5313f67d740aa04f6f4cb5c8_145073_650x0_resize_box_3.png","permalink":"/workshops/efficient-reproducible-research-cbs-microdata/","title":"Efficient and reproducible research with CBS microdata"},{"categories":null,"contents":"Author: Flavio Hafner. Post photo from Glenn Carstens-Peters on Unsplash\nThis post is the first of a series of blogposts arising from a collaboration between the eScience Center and the ODISSEI Social Data Science Team. You can also find this text at the eScience Center page.\nResearch often relies on accessing novel data, for instance by collecting them from the internet through web scraping. If you have ever tried this, you may have run into your IP address being blocked by the website you scrape. Websites do this with a good reason — to protect themselves against malicious acts, such as denial-of-service attacks or large-scale scraping by parties with ill intent. This makes sense from the websites’ perspective, but prevents you from answering your research question.\nBut this problem can be solved. In this tutorial, we show how you as a researcher can use IP rotation to circumvent certain scraping protections with the sirup package, which works on Linux operating systems.\nBefore we jump into it, it is important to highlight that web scraping and IP rotation need to respect the law and should only be a last resort. For instance, before you scrape data from a website, you should ask the data owner whether they are willing to make them available to you through a data sharing agreement. If you nevertheless decide to scrape the data, you should get approval from the ethical review board at your university. Moreover, do only scrape data that are publicly available on the web and do not send excessive number of requests to the website in a given time.\nFor rotating the IP address, we will use a VPN service. Here you can read more about what a VPN service is — in short, the service has a bunch of remote servers, and connecting your computer to one of these servers changes your IP address.\nWhat you need 1. OpenVPN OpenVPN is a system that allows you to create secure VPN connections. You can install it by following these instructions.\n2. Root access to your computer Because internet connections are an important security concern, OpenVPN requires root access — this is the equivalent to administrator rights on a Windows computer. If you have root access, you can for instance run the following command on your terminal:\nsudo ls -lh # will ask you for your root password Installing and setting up sirup You can install sirup as follows:\npython -m pip install sirup To use the package and change your IP address, you need an account with a VPN service provider that offers OpenVPN configuration files for your account. At the time of writing, for instance ProtonVPN and Surfshark offer this option — note that these services are not for free. We will use ProtonVPN in this tutorial.\nAfter creating an account, you need to download two sets of files.\nFirst, you download credentials that identify your Proton account when using OpenVPN. On the ProtonVPN website, click on “Account” and then you see something like this:\nFigure 1. Credentials that identify your Proton account when using OpenVPN\rCopy and paste the username and the password into a .txt file that looks like this:\nusername password Then, save the file as “proton_credentials.txt”. 
Remember where it is stored — we will need it later.\nA first warning on security. Storing account credentials like this makes it easy for you to use the sirup package. But it also increases the risk that unauthorized persons get a hold on these credentials. Thus, be careful to store the credentials in a safe place on your laptop and do not share them with anyone.\nSecond, to use OpenVPN we need configuration files, whose names end with .ovpn. The files allow OpenVPN to connect to a server from the VPN service provider. In ProtonVPN, go to the \u0026ldquo;Download\u0026rdquo; section of your account. Select the options as follows:\nFigure 2. Options to download .ovpn configuration files\rAnd download the configuration file(s) you want to use. Store the downloaded files on your computer, and remember the location.\nNow you are ready!\nUsing sirup We start by defining the path to the proton_credentials.txt file. When you execute the code below, you will be asked to enter the root password, which is necessary to make the connection.\nimport getpass auth_file = \u0026#34;proton_credentials.txt\u0026#34; pwd = getpass.getpass(\u0026#34;Please enter your root password:\u0026#34;) A second warning on security. The code above stores your root password during the python session without encrypting it. This is OK to do on your laptop — if someone gets access to your python session, your security has already been compromised — but not recommended on a shared computer such as a cluster or a cloud service.\nChanging the IP address with sirup Now you can use the VPNConnector to change your IP address. We will use the \u0026quot;my_config_file.ovpn\u0026quot; configuration file.\nfrom sirup.VPNConnector import VPNConnector config_file = \u0026#34;my_config_file.ovpn\u0026#34; The code below first connects to the server associated with \u0026quot;my_config_file.ovpn\u0026quot; and then disconnects.\nconnector = VPNConnector(auth_file, config_file) # Let\u0026#39;s see the current IP address when no VPN tunnel is active print(connector.base_ip) connector.connect(pwd=pwd) # Now the IP address should differ print(connector.current_ip) connector.disconnect(pwd=pwd) # Now current_ip should be the same as base_ip above print(connector.current_ip) Rotating the IP address with sirup Instead of connecting to a single server, you can also rotate across many different servers — which means you rotate your IP address across a set of potential addresses. Doing so is useful for larger scraping jobs because it will spread your requests across more servers.\nTo do this, you need to download multiple configuration files as described above. Store all of the .ovpn configuration files together in a separate directory. Let\u0026rsquo;s say you store them in the \u0026quot;/path/to/config/files/\u0026quot; directory. You need to define this path in your python script:\nconfig_path = \u0026#34;/path/to/config/files/\u0026#34; The following code connects to two different servers before disconnecting again:\nfrom sirup.IPRotator import IPRotator rotator = IPRotator(auth_file=my_auth_file, config_location=config_path, seed=seed) # this will ask for the root password print(rotator.connector.base_ip) rotator.connect() print(rotator.connector.current_ip) rotator.rotate() print(rotator.connector.current_ip) rotator.disconnect() print(rotator.connector.current_ip) Conclusion This tutorial has walked you through the steps to manage your IP address in python, using the sirup package. 
We hope it makes your scraping workflows easier!\nsirup is an open-source package developed by the Netherlands eScience Center. If you use the tool, you can cite this zenodo repository with the DOI: https://doi.org/10.5281/zenodo.10261949.\nThe source code of the package is here, where you can contribute to it, build on it and submit issues.\nThanks to Patrick Bos, Peter Kalverla, Kody Moodley and Carlos Gonzalez Poses for comments.\n","date":"February 27, 2024","image":"http://odissei-soda.nl/images/tutorial-8/sirup_hu5459c0360c2b0cb7a147d2df0eb350ca_814836_650x0_resize_q100_box.jpg","permalink":"/tutorials/ip-rotation/","title":"How to manage your IP address in python"},{"categories":null,"contents":"By now, the astonishing results that large language models produce are no surprise to anybody. Models such as GPT-4, Bard, BERT or RoBERTa have sparked intense research and media attention, as well as changed many people\u0026rsquo;s workflows. However, these models have issues. A common critique is that they function as black boxes: users do not know much about their training data or modelling choices. Besides, training them usually requires gigantic datasets and processing power. Therefore, there is value in alternative models that can be trained by researchers, having full control over the input data and internal process. In this tutorial, we explain how to train a natural language processing model using fastText: a lightweight, easy-to-implement and efficient word embedding model that has shown good performance in various natural language tasks over the years.\nFirst, a bit about word-embedding models. Word-embedding models are one type of natural language processing model. By producing real-number vector representations of words, they offer a powerful way to capture the semantic meaning of words in large datasets, which is why they are widely used in diverse research applications. Indeed, their uses are numerous: semantic similarity, text generation, document representation, author recognition, knowledge graph construction, sentiment analysis, or bias detection (Caliskan et al., 2017).\nInstallation To install fastText, we recommend checking the fasttext-wheel PyPI module. To verify the installation succeeded, you have to import the package in a Python script.\n\u0026gt;\u0026gt;\u0026gt; import fasttext If there are no error messages, you have succeeded and we can move to the training part.\nTraining the model Training data To train fastText, you need a corpus: a large collection of text. The required size of these corpora varies depending on the research purpose: from several thousand to billions of words. Some research benefits from smaller, well-curated corpora; other research benefits from large unstructured corpora. However, while the exact size needed is hard to determine, do keep in mind that the text in the training data has to relate to your research question! If you want to use word embeddings for studying doctors\u0026rsquo; notes, you need doctors\u0026rsquo; notes - and not legal medical texts. If you want to study niche cultural sub-groups, you need data from these groups - and not necessarily a corpus of random Internet interactions. The corpus is an integral part of your research! Generally, the larger the research-related corpus you can get, the better.\nIn this tutorial we use a freely available corpus of Science-Fiction texts downloaded from Kaggle.
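If you want to follow along with a corpus of your own, a minimal sketch for turning a raw text file into a training file with one sentence per line (as recommended below) could look like this. The file names are assumptions for this example, and a proper sentence tokenizer (for instance from NLTK or spaCy) would do a better job than this naive split on periods:

# Illustrative only: write the raw text as a one-sentence-per-line training file.
# "my_raw_corpus.txt" is a hypothetical input file; the output name matches the
# training command used further below.
with open("my_raw_corpus.txt", encoding="utf-8") as f:
    raw_text = f.read()

sentences = [s.strip() for s in raw_text.replace("\n", " ").split(".") if s.strip()]

with open("internet_archive_scifi_v3.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences))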
Preferably, the text you feed to fastText should have each sentence on a new line.\nHyperparameters We will train an unsupervised fastText model, which means that a lot of implementation decisions need to be made. If you don\u0026rsquo;t have specific methodological reasons and/or you lack the time or computing power for a proper grid search, we suggest you go with the default parameter options (which are optimized for many research contexts), but switch the \u0026lsquo;dim\u0026rsquo; parameter to 300. Empirical research has shown that a dimensionality of 300 leads to optimal performance in most settings, even though it increases the required computational resources and training time. If you can afford to spend time thinking about hyperparameters, you could tune the training model (CBOW or SkipGram), learning rate, dimensionality of the vector, context window size, and more. You can see the full list of available tuning parameters here.\nFitting model We fit the model with the following command:\n\u0026gt;\u0026gt;\u0026gt; model = fasttext.train_unsupervised(\u0026#39;internet_archive_scifi_v3.txt\u0026#39;, dim = 300) Then, you can save the trained model, so that you do not have to train it again. For this, you need to feed the save_model() method a path to which to save the file. Make sure to add \u0026lsquo;.bin\u0026rsquo; to save the model as a .bin file.\n\u0026gt;\u0026gt;\u0026gt; model.save_model(\u0026#39;scifi_fasttext_model.bin\u0026#39;) Re-opening a saved model is done with the load_model() method:\n\u0026gt;\u0026gt;\u0026gt; model = fasttext.load_model(\u0026#39;scifi_fasttext_model.bin\u0026#39;) Using word embeddings Now that we have trained the model, the word embeddings are ready to be used. And, luckily, fastText comes with some nice functions to work with word embeddings! Here we highlight two possible uses of word embeddings: obtaining the most similar words, and analogies - but remember there are more possible uses. We start by simply retrieving the word embeddings. This can be done with either of the two following commands.\n\u0026gt;\u0026gt;\u0026gt; model.get_word_vector(\u0026#39;villain\u0026#39;) \u0026gt;\u0026gt;\u0026gt; model[\u0026#39;villain\u0026#39;] array([ 0.01417591, -0.06866349, 0.09390495, -0.04146367, 0.10481305, -0.2541916 , 0.26757774, -0.04365376, -0.02336818, 0.07684527, -0.05139925, 0.14692445, 0.07103274, 0.23373744, -0.28555775, .............................................................. -0.14082788, 0.27454248, 0.02602287, 0.03754443, 0.18067479, 0.20172128, 0.02454677, 0.04874028, -0.17860755, -0.01387627, 0.02247835, 0.05518318, 0.04844297, -0.2925061 , -0.05710272], dtype=float32) Since fastText not only trains an embedding for the full word, but also for the ngrams in each word, subwords and their embeddings can be accessed as follows:\n\u0026gt;\u0026gt;\u0026gt; ngrams, hashes = model.get_subwords(\u0026#39;villain\u0026#39;) \u0026gt;\u0026gt;\u0026gt; \u0026gt;\u0026gt;\u0026gt; for ngram, hash in zip(ngrams, hashes): \u0026gt;\u0026gt;\u0026gt; print(ngram, model.get_input_vector(hash)) Note: using the get_subwords() method returns two lists, one with the ngrams of type string, the other with hashes. These hashes are not the same as embeddings, but rather are the identifier that fastText uses to store and retrieve embeddings.
Therefore, to get the (sub-)word embedding using a hash, the get_input_vector() method has to be used.\nFurthermore, vectors can be created for full sentences as well:\n\u0026gt;\u0026gt;\u0026gt; model.get_sentence_vector(\u0026#39;the villain defeated the hero, tyrrany reigned throughout the galaxy for a thousand eons.\u0026#39;) array([-2.73631997e-02, 7.83981197e-03, -1.97590180e-02, -1.42770987e-02, 6.88663125e-03, -1.63909234e-02, 5.72902411e-02, 1.44126266e-02, -1.64726824e-02, 8.55281111e-03, -5.33024594e-02, 4.74718548e-02, ................................................................. 3.30820642e-02, 7.64035881e-02, 7.33195152e-03, 4.60342802e-02, 4.94049815e-03, 2.52075139e-02, -2.30138078e-02, -3.56832631e-02, -2.22732662e-03, -1.84207838e-02, 2.37668958e-03, -1.00214258e-02], dtype=float32) Most similar words A nice usecase of fastText is to retrieve similar words. For instance, you can retrieve the set of 10 words with the most similar meaning (i.e., most similar word vector) to a target word using a nearest neighbours algorithm based on the cosine distance.\n\u0026gt;\u0026gt;\u0026gt; model.get_nearest_neighbors(\u0026#39;villain\u0026#39;) [(0.9379335641860962, \u0026#39;villainy\u0026#39;), (0.9019550681114197, \u0026#39;villain,\u0026#39;), (0.890184223651886, \u0026#39;villain.\u0026#39;), (0.8709720969200134, \u0026#39;villains\u0026#39;), (0.8297745585441589, \u0026#39;villains.\u0026#39;), (0.8225630521774292, \u0026#39;villainous\u0026#39;), (0.8214142918586731, \u0026#39;villains,\u0026#39;), (0.6485553979873657, \u0026#39;Villains\u0026#39;), (0.6020095944404602, \u0026#39;heroine\u0026#39;), (0.5941146612167358, \u0026#39;villa,\u0026#39;)] Interestingly, this also works for words not in the model corpus, including misspelled words!\n\u0026gt;\u0026gt;\u0026gt; model.get_nearest_neighbors(\u0026#39;vilain\u0026#39;) [(0.6722341179847717, \u0026#39;villain\u0026#39;), (0.619519829750061, \u0026#39;villain.\u0026#39;), (0.6137816309928894, \u0026#39;lain\u0026#39;), (0.6128077507019043, \u0026#39;villainous\u0026#39;), (0.609745979309082, \u0026#39;villainy\u0026#39;), (0.6089878678321838, \u0026#39;Glain\u0026#39;), (0.5980470180511475, \u0026#39;slain\u0026#39;), (0.5925296545028687, \u0026#39;villain,\u0026#39;), (0.5779100060462952, \u0026#39;villains\u0026#39;), (0.5764451622962952, \u0026#39;chaplain\u0026#39;)] Analogies Another nice use for fastText is creating analogies. Since the word embedding vectors are created in relation to every other word in the corpus, these relations should be preserved in the vector space so that analogies can be created. For analogies, a triplet of words is required according to the formula \u0026lsquo;A is to B as C is to [output]\u0026rsquo;. 
For example, if we take the formula \u0026lsquo;Men is to Father as [output] is to Mother\u0026rsquo;, we get the expected answer of Women.\n\u0026gt;\u0026gt;\u0026gt; model.get_analogies(\u0026#39;men\u0026#39;, \u0026#39;father\u0026#39;, \u0026#39;mother\u0026#39;) [(0.6985629200935364, \u0026#39;women\u0026#39;), (0.6015384793281555, \u0026#39;all\u0026#39;), (0.5977899432182312, \u0026#39;man\u0026#39;), (0.5835891366004944, \u0026#39;out\u0026#39;), (0.5830296874046326, \u0026#39;now\u0026#39;), (0.5767865180969238, \u0026#39;one\u0026#39;), (0.5711579322814941, \u0026#39;in\u0026#39;), (0.5671708583831787, \u0026#39;wingmen\u0026#39;), (0.567089855670929, \u0026#39;women\u0026#34;\u0026#39;), (0.5663136839866638, \u0026#39;were\u0026#39;)] However, since the model we created was trained on uncleaned data from a relatively small corpus, our output is not perfect. For example, with the following analogy triplet, the correct answer of bad comes fourth, after villainy, villain. and villain, showing that for a better model, we should do some additional cleaning of our data (e.g., removing punctuation).\n\u0026gt;\u0026gt;\u0026gt; model.get_analogies(\u0026#39;good\u0026#39;, \u0026#39;hero\u0026#39;, \u0026#39;villain\u0026#39;) [(0.5228292942047119, \u0026#39;villainy\u0026#39;), (0.5205934047698975, \u0026#39;villain.\u0026#39;), (0.5122538208961487, \u0026#39;villain,\u0026#39;), (0.5047158598899841, \u0026#39;bad\u0026#39;), (0.483129620552063, \u0026#39;villains.\u0026#39;), (0.4676515460014343, \u0026#39;good\u0026#34;\u0026#39;), (0.4662466049194336, \u0026#39;vill\u0026#39;), (0.46115875244140625, \u0026#39;villains\u0026#39;), (0.4569159746170044, \u0026#34;good\u0026#39;\u0026#34;), (0.4529685974121094, \u0026#39;excellent.\u0026#34;\u0026#39;)] Conclusion In this blogpost we have shown how to train a lightweight, efficient natural language processing model using fastText. After installing it, we have shown how to use some of the fastText functions to train the model, retrieve word embeddings, and use them for different questions. While this was a toy example, we hope you found it inspiring for your own research! And remember, if you have a research idea that entails using natural language models and are stuck or do not know how to start - you can contact us!\nBonus: Tips to improve your model performance Depending on your research, various tweaks can be made to your data to improve fastText performance. For example, if multi-word phrases (e.g., Social Data Science) play a key role in your analyses, you might want to join these word phrases in the data with a dash or underscore (e.g., Social_Data_Science) so that these phrases are trained as a single token, not as a sum of the tokens Social, Data, and Science.\nAs shown with the good-hero-villain analogy, removing punctuation and other types of non-alphabetic characters can help avoid learning representations for unwanted tokens. For this, stemming (reducing words to their stems) and lemmatization (converting all words to their base form) can also be useful. Similarly, two other ways to deal with unwanted tokens are to remove stop-words from your data, and to play around with the minCount parameter (i.e., the minimum number of times a word needs to occur in the data to have a token be trained) when training your model.\nMost importantly, try to gather as much knowledge as possible about the domain of your research.
Most importantly, try to gather as much knowledge as possible about the domain of your research. This tip can seem obvious, but having a proper understanding of the topic you are researching is the most important skill to have when it comes to language models. Let\u0026rsquo;s take the Sci-Fi corpus we used as an example: the 10th nearest neighbor of the word villain was villa. If you don\u0026rsquo;t really understand what either of those words means, you would not know that these results seem fishy (since the model we created has low internal quality, it relies very much on the trained n-grams. Since both words contain the n-gram \u0026lsquo;villa\u0026rsquo;, they are rated as being close in the vector space). Therefore, make sure to understand the domain to the best of your abilities and scrutinize your findings to get reliable results.\n","date":"January 22, 2024","image":"http://odissei-soda.nl/images/tutorial-6/logo-color_hu1abe885bd00d0c3db0943f497a717f5c_29206_650x0_resize_box_3.png","permalink":"/tutorials/fasttext/","title":"Training a fastText model from scratch using Python"},{"categories":null,"contents":"In this tutorial we present Geoflow, a newly created tool designed to visualize international flows in an interactive way. The tool is free and open-source, and can be accessed here. It is designed to visualize any kind of international flow, for instance cash or migration flows. Since it\u0026rsquo;s easier to understand its capabilities by using them, we\u0026rsquo;ll start by showing a couple of examples of what Geoflow can do. After that, we briefly explain how to upload your own dataset and visualize it using the tool.\nWhat Geoflow can do For these examples, we use the included demo dataset showing investments in fossil fuel companies across countries. Figure 1 shows the top 10 investments in fossil fuel companies from China. The visualization is straightforward: the arrows indicate in which countries these investments are located. By placing the cursor on top of Bermuda, its arrow gets highlighted, showing the flow strength (weight), as well as the inflow and outflow country. Figure 2 is a barplot that shows which countries most investments go to: in this case, it is mainly Singapore, followed by the Netherlands.\nFigure 1. Map of top 10 investments in fossil fuel companies from China\rFigure 2. Barplot of top 10 investments in fossil fuel companies from China\rIn general, in the tool we can select the source and target countries for which we want to see the flows. There is also an option to select source or target for a country, which is useful when we want to focus on one country: for example, selecting all flows into the Netherlands or from the Netherlands. In addition, we can select the number of flows to visualize in the upper part. Lastly, we can select whether we want to visualize the inflow or the outflow. This last option changes the colouring of the countries (colouring either the inflow or outflow countries), and it also changes the barplot visualization.\nFigure 3. Some configuration options for the visualization\rHow to use Geoflow for your visualizations It is very straightforward to use Geoflow for your own visualizations. You only need to upload a .csv dataset to the app with the following columns:\nSource: the source of the flow, written in ISO2 format. Target: the target of the flow, also in ISO2 format. Weight: strength of the flow (e.g., the number of migrants, or the financial revenue). Additionally, you can add:\nYear: If year is present, a new visualization on the bottom right panel shows a time series for the flows. You can see it in Figure 4 below. Other columns: These will be interpreted as categorical variables. This allows you to split the flows into categories, as shown in Figure 5.\nIt is a requirement that the data format is .csv and the names of the variables are source, target, weight and (if included) year, so note that you might have to reformat your data to use the tool.
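For instance, here is a minimal sketch of how such a file could be put together with pandas (the country codes and values below are made up purely for illustration):
import pandas as pd

flows = pd.DataFrame({
    "source": ["CN", "CN", "NL"],    # ISO2 code of the origin country
    "target": ["NL", "SG", "DE"],    # ISO2 code of the destination country
    "weight": [120.5, 300.0, 75.2],  # strength of the flow
    "year": [2020, 2020, 2021],      # optional; enables the time series panel
})
flows.to_csv("my_flows.csv", index=False)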
Figure 4. What the time series plot looks like, with Germany highlighted\rFigure 5. Map of China\u0026rsquo;s top 10 fossil fuel investments, showing only investments in Manufacturing\rConclusion In this tutorial we have shown you how to use Geoflow for your own visualizations. We have shown what Geoflow\u0026rsquo;s capabilities are, and how to upload your data to use it. We hope you\u0026rsquo;ve found it inspiring!\nGeoflow is open-source and has been developed by Peter Kok, Javier García Bernardo and mbabic332. If you use the tool, you can cite this Zenodo repo, with a DOI. If you wish to expand on Geoflow, you can check the source code here, and contribute or build on it. The tool is written using JavaScript.\n","date":"December 11, 2023","image":"http://odissei-soda.nl/images/tutorial-7/geoflow-miniature_hu6dbb4503ac6d6ce6a020e99dbb85449c_282734_650x0_resize_box_3.png","permalink":"/tutorials/geoflow-visualizer/","title":"Visualizing international flows with Geoflow visualizer"},{"categories":null,"contents":"In October 2023, we hosted a workshop about data visualization.\nAbout the workshop A key aspect of data science is turning data into a story that anyone can understand at a glance. In this Data Visualization Bootcamp, you will learn how to represent data in visual formats like pictures, charts and graphs.\nIn this workshop you will:\nlearn the most important principles of data visualization,\nlearn how to use data visualization libraries in Python (with Bokeh and Pyodide),\ndevelop your own Panel/hvPlot/Bokeh dashboard to interactively visualize data.\nPlenary instruction is combined with hands-on practice with your own datasets in small groups. The workshop will cover how interactive visualizations and dashboarding in Python/Bokeh differ from RShiny. You will learn how to run Bokeh apps in the browser via Pyodide (no server required).\nBy the end of the workshop, you will be able to create powerful storytelling visuals with your own research data.\nAdditional information When Friday, 20 October 2023. It was also taught in 2021 and 2022. Where Data Science Center, University of Amsterdam. Registration Registration is no longer possible. Instructors Javier Garcia-Bernardo, Assistant Professor at Utrecht University. Javier\u0026rsquo;s website. Materials Course materials, including slides and code, are open and can be accessed here. ","date":"October 20, 2023","image":"http://odissei-soda.nl/images/workshops/data-visualization_hu3d03a01dcc18bc5be0e67db3d8d209a6_1348270_650x0_resize_q100_box.jpg","permalink":"/workshops/data-visualization/","title":"Data Visualization Bootcamp"},{"categories":null,"contents":"One common issue we encounter in helping researchers work with the housing register data of Statistics Netherlands is its transactional nature: each row in the housing register table contains data on when someone registered and deregistered at an address (more info in Dutch here).\nIn this post, we show how to use this transactional data to perform one of the most common transformations we see: what part of a certain time interval (e.g., the entire year 2021 or January 1999) did the people I’m interested in live in the Netherlands?
To solve this issue, we will use time interval objects, as implemented in the package {lubridate} which is part of the {tidyverse} since version 2.0.0.\nlibrary(tidyverse) The data Obviously, we cannot share actual Statistics Netherlands microdata here, so we first generate some tables that capture the gist of the data structure. First, let’s generate some basic person identifiers and some info about each person:\nCode (person_df \u0026lt;- tibble( person_id = factor(c(\u0026#34;A10232\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A10124\u0026#34;)), firstname = c(\u0026#34;Aron\u0026#34;, \u0026#34;Beth\u0026#34;, \u0026#34;Carol\u0026#34;, \u0026#34;Dave\u0026#34;), income_avg = c(14001, 45304, 110123, 43078) )) # A tibble: 4 × 3 person_id firstname income_avg \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 2 A39211 Beth 45304 3 A28183 Carol 110123 4 A10124 Dave 43078 Then, we create a small example of housing transaction register data. In this data, for any period where a person is not registered to a house, they are assumed to live abroad (because everyone in the Netherlands is required to be registered at an address).\nCode (house_df \u0026lt;- tibble( person_id = factor(c(\u0026#34;A10232\u0026#34;, \u0026#34;A10232\u0026#34;, \u0026#34;A10232\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A39211\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A28183\u0026#34;, \u0026#34;A10124\u0026#34;)), house_id = factor(c(\u0026#34;H1200E\u0026#34;, \u0026#34;H1243D\u0026#34;, \u0026#34;H3432B\u0026#34;, \u0026#34;HA7382\u0026#34;, \u0026#34;H53621\u0026#34;, \u0026#34;HC39EF\u0026#34;, \u0026#34;HA3A01\u0026#34;, \u0026#34;H222BA\u0026#34;)), start_date = ymd(c(\u0026#34;20200101\u0026#34;, \u0026#34;20200112\u0026#34;, \u0026#34;20211120\u0026#34;, \u0026#34;19800101\u0026#34;, \u0026#34;19900101\u0026#34;, \u0026#34;20170303\u0026#34;, \u0026#34;20190202\u0026#34;, \u0026#34;19931023\u0026#34;)), end_date = ymd(c(\u0026#34;20200112\u0026#34;, \u0026#34;20211120\u0026#34;, \u0026#34;20230720\u0026#34;, \u0026#34;19891231\u0026#34;, \u0026#34;20170102\u0026#34;, \u0026#34;20180720\u0026#34;, \u0026#34;20230720\u0026#34;, \u0026#34;20230720\u0026#34;)) )) # A tibble: 8 × 4 person_id house_id start_date end_date \u0026lt;fct\u0026gt; \u0026lt;fct\u0026gt; \u0026lt;date\u0026gt; \u0026lt;date\u0026gt; 1 A10232 H1200E 2020-01-01 2020-01-12 2 A10232 H1243D 2020-01-12 2021-11-20 3 A10232 H3432B 2021-11-20 2023-07-20 4 A39211 HA7382 1980-01-01 1989-12-31 5 A39211 H53621 1990-01-01 2017-01-02 6 A28183 HC39EF 2017-03-03 2018-07-20 7 A28183 HA3A01 2019-02-02 2023-07-20 8 A10124 H222BA 1993-10-23 2023-07-20 Interval objects! Notice how each transaction in the housing data has a start and end date, indicating when someone registered and deregistered at an address. A natural representation of this information is as a single object: a time interval. 
The package {lubridate} has support for specific interval objects, and several operations on intervals:\ncomputing the length of an interval with int_length() computing whether two intervals overlap with int_overlaps() and much more\u0026hellip; as you can see here So let’s transform these start and end columns into a single interval column!\nhouse_df \u0026lt;- house_df |\u0026gt; mutate( # create the interval int = interval(start_date, end_date), # drop the start/end columns .keep = \u0026#34;unused\u0026#34; ) house_df # A tibble: 8 × 3 person_id house_id int \u0026lt;fct\u0026gt; \u0026lt;fct\u0026gt; \u0026lt;Interval\u0026gt; 1 A10232 H1200E 2020-01-01 UTC--2020-01-12 UTC 2 A10232 H1243D 2020-01-12 UTC--2021-11-20 UTC 3 A10232 H3432B 2021-11-20 UTC--2023-07-20 UTC 4 A39211 HA7382 1980-01-01 UTC--1989-12-31 UTC 5 A39211 H53621 1990-01-01 UTC--2017-01-02 UTC 6 A28183 HC39EF 2017-03-03 UTC--2018-07-20 UTC 7 A28183 HA3A01 2019-02-02 UTC--2023-07-20 UTC 8 A10124 H222BA 1993-10-23 UTC--2023-07-20 UTC We will want to compare this interval with a reference interval to compute the proportion of time that a person lived in the Netherlands within the reference interval. Therefore, we quickly define a new interval operation which truncates an interval to a reference interval. Don’t worry too much about it for now, we will use it later. Do notice that we’re always using the int_*() functions defined by {lubridate} to interact with the interval objects.\n# utility function to truncate an interval object to limits (also vectorized so it works in mutate()) int_truncate \u0026lt;- function(int, int_limits) { int_start(int) \u0026lt;- pmax(int_start(int), int_start(int_limits)) int_end(int) \u0026lt;- pmin(int_end(int), int_end(int_limits)) return(int) } Computing the proportion in the Netherlands The next step is to define a function that computes, for each person, the proportion of overlap with a reference interval. By creating a function, it will be easy later to do the same operation for different intervals (e.g., different reference years) to work with the rich nature of the Statistics Netherlands microdata. To compute this table, we make extensive use of the {tidyverse}, with verbs like filter(), mutate(), and summarize().
If you want to know more about these, take a look at the {dplyr} documentation (but of course you can also use your own flavour of data processing, such as {data.table} or base R).\n# function to compute overlap proportion per person proportion_tab \u0026lt;- function(housing_data, reference_interval) { # start with the housing data housing_data |\u0026gt; # only retain overlapping rows, this makes the following # operations more efficient by only computing what we need filter(int_overlaps(int, reference_interval)) |\u0026gt; # then, actually compute the overlap of the intervals mutate( # use our earlier truncate function int_tr = int_truncate(int, reference_interval), # then, it\u0026#39;s simple to compute the overlap proportion prop = int_length(int_tr) / int_length(reference_interval) ) |\u0026gt; # combine different intervals per person summarize(prop_in_nl = sum(prop), .by = person_id) } Now we’ve defined this function, let’s try it out for a specific year such as 2017!\nint_2017 \u0026lt;- interval(ymd(\u0026#34;20170101\u0026#34;), ymd(\u0026#34;20171231\u0026#34;)) prop_2017 \u0026lt;- proportion_tab(house_df, int_2017) prop_2017 # A tibble: 3 × 2 person_id prop_in_nl \u0026lt;fct\u0026gt; \u0026lt;dbl\u0026gt; 1 A39211 0.00275 2 A28183 0.832 3 A10124 1 Now we’ve computed this proportion, notice that we only have three people. This means that the other person was living abroad in that time, with a proportion in the Netherlands of 0. To nicely display this information, we can join the proportion table with the original person dataset and replace the NA values in the proportion column with 0.\nleft_join(person_df, prop_2017, by = \u0026#34;person_id\u0026#34;) |\u0026gt; mutate(prop_in_nl = replace_na(prop_in_nl, 0)) # A tibble: 4 × 4 person_id firstname income_avg prop_in_nl \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 0 2 A39211 Beth 45304 0.00275 3 A28183 Carol 110123 0.832 4 A10124 Dave 43078 1 Success! We now have a dataset for each person with the proportion of time they lived in the Netherlands in 2017. If you look at the original housing dataset, you may see the following patterns reflected in the proportion:\nAron indeed did not live in the Netherlands at this time. Beth moved away on January 2nd, 2017. Carol moved into the Netherlands on March 3rd, 2017 and remained there until 2018 Dave lived in the Netherlands this entire time. Conclusion In this post, we used interval objects and operations from the {lubridate} package to wrangle transactional housing data into a proportion of time spent living in the Netherlands. The advantage of using this package and its functions is that any particularities with timezones, date comparison, and leap years are automatically dealt with so that we could focus on the end result rather than the details.\nIf you are doing similar work and you have a different method, let us know! In addition, if you have further questions about working with Statistics Netherlands microdata or other complex or large social science datasets, do not hesitate to contact us on our website: https://odissei-soda.nl.\nBonus appendix: multiple time intervals Because we created a function that takes in the transaction data and a reference interval, we can do the same thing for multiple time intervals (e.g., years) and combine the data together in one wide or long dataset. 
This is one way to do this:\nlibrary(glue) # for easy string manipulation # initialize an empty dataframe with all our columns nl_prop \u0026lt;- tibble(person_id = factor(), prop_in_nl = double(), yr = integer()) # then loop over the years of interest for (yr in 2017L:2022L) { # construct reference interval for this year ref_int \u0026lt;- interval(ymd(glue(\u0026#34;{yr}0101\u0026#34;)), ymd(glue(\u0026#34;{yr}1231\u0026#34;))) # compute the proportion table for this year nl_prop_yr \u0026lt;- proportion_tab(house_df, ref_int) |\u0026gt; mutate(yr = yr) # append this year to the dataframe nl_prop \u0026lt;- bind_rows(nl_prop, nl_prop_yr) } # we can pivot it to a wide format nl_prop_wide \u0026lt;- nl_prop |\u0026gt; pivot_wider( names_from = yr, names_prefix = \u0026#34;nl_prop_\u0026#34;, values_from = prop_in_nl ) # and join it with the original person data, replacing NAs with 0 again person_df |\u0026gt; left_join(nl_prop_wide, by = \u0026#34;person_id\u0026#34;) |\u0026gt; mutate(across(starts_with(\u0026#34;nl_prop_\u0026#34;), \\(p) replace_na(p, 0))) # A tibble: 4 × 9 person_id firstname income_avg nl_prop_2017 nl_prop_2018 nl_prop_2019 \u0026lt;fct\u0026gt; \u0026lt;chr\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; \u0026lt;dbl\u0026gt; 1 A10232 Aron 14001 0 0 0 2 A39211 Beth 45304 0.00275 0 0 3 A28183 Carol 110123 0.832 0.549 0.912 4 A10124 Dave 43078 1 1 1 # ℹ 3 more variables: nl_prop_2020 \u0026lt;dbl\u0026gt;, nl_prop_2021 \u0026lt;dbl\u0026gt;, # nl_prop_2022 \u0026lt;dbl\u0026gt; ","date":"September 29, 2023","image":"http://odissei-soda.nl/images/tutorial-5/lubridate_1_hu9ab8adf199cf1b31e00f5b6813db3886_430902_650x0_resize_box_3.png","permalink":"/tutorials/lubridate/","title":"Wrangling interval data using lubridate"},{"categories":null,"contents":"These days, our online presence leaves traces of our behavior everywhere. There is data of what we do and say in platforms such as WhatsApp, Instagram, online stores, and many others. Of course, this so-called ‘digital trace data’ is of interest for social scientists: new, rich, enormous datasets that can be used to describe and understand our social world. However, this data is commonly owned by private companies. How can social scientists access and make sense of this data?\nIn this tutorial, we use data donation and the Port software to get access to WhatsApp group-chat data in a way that completely preserves privacy of research participants. Our goal is to show a small peek of what can be achieved with these methods. If you have an idea for your own research that entails collecting digital trace data, don’t hesitate to contact us! We can help you think about data acquisition, analysis and more.\nWith data donation, it is possible to collect data about any online platform: under the General Data Protection Regulation (EU law), companies are required to provide their data to any citizen that requests it. This data is available in so-called Data Download Packages (DDP’s), which are rather cumbersome to work with and contain personal information. Therefore, the Port software processes these DDP’s so that the data is in a format ready for analysis, while completely guaranteeing privacy of the respondents. The only thing research participants have to do is request their DDP’s, see which information they are sharing and consent to sharing it.\nSince we do not dive in with a lot of detail, we refer to Port\u0026rsquo;s github for more details on how to get started with your own project. 
There you can find a full guide to install and use Port, examples of past studies done with it, a tutorial for creating your own data donation workflow, and more. You can also read more about data donation in general here and here.\nAn application with WhatsApp data In this example, we extract some basic information from WhatsApp group chats, such as how many messages, links, locations, and pictures were shared, as well as which person in the group the participant responded most to.\nNote that this is the only information we want to collect from the participants of the study, not the whole group chat file!\nThe first step in creating a DDP processing script is to obtain an example DDP and examine it. This example DDP can be, for example, your own DDP requested from WhatsApp. Usually, platforms provide a (compressed) folder with many different files; i.e., data in a format that is not ready to use. Once uncompressed, a WhatsApp group chat file could look like this:\n[16/03/2022, 15:10:17] Messages and calls are end-to-end encrypted. No one outside of this chat, not even WhatsApp, can read or listen to them. Tap to learn more. [16/03/2022, 15:20:25] person1: Hi shiva! [16/03/2022, 15:25:38] person2: Hi 👋 [16/03/2022, 15:26:48] person3: Hoi! [16/03/2022, 18:39:29] person2: https://youtu.be/KBmUTY6mK_E [16/03/2022, 18:35:51] person1: Location: https://maps.google.com/?q=52.089451,5.108469 [20/03/2022, 20:08:51] person4: I’m about to generate some very random messages so that I can make some screenshots for the explanation to participants [24/03/2022, 20:19:38] person1: @user3 if you remove your Profile picture for a moment I will redo the screenshots 😁 [26/03/2022, 18:52:15] person2: Well done Utrecht 😁 [14/07/2020, 22:05:54] person4: 👍Bedankt As part of a collaboration with data donation researchers using Port, we wrote a Python script1 to convert this into the information we need. The main script is available here; in short, it does the following:\nseparate the header and the message itself parse the date and time information in each message remove unneeded information such as alert notifications anonymize usernames convert the extracted information to a nice data frame format to show the participant for consent One big problem we had to overcome is that messages and alert notifications cannot be identified in the same way (i.e., using the same regular expression) on every device. Through trial and error, we tailored the steps to work with every operating system, language, and device required for this study. Indeed, if you design a study like this, it is very important to try out your script on many different DDPs from different people and devices. That way you will make sure you have covered possible variation in DDPs before actually starting data collection. This is a process that can take quite a while, so keep this in mind when you want to run a data donation study!
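To make the first steps concrete, here is a minimal sketch (not the actual study script) of how a single chat line in the format shown above could be split into a date, a time, a sender and a message; real exports differ per device and language, which is exactly why the production script needs several such patterns:
import re

LINE_PATTERN = re.compile(
    r"^\[(?P<date>\d{2}/\d{2}/\d{4}), (?P<time>\d{2}:\d{2}:\d{2})\] "
    r"(?:(?P<sender>[^:]+): )?(?P<message>.*)$"
)

def parse_line(line):
    # returns date, time, sender (None for system notifications) and message
    match = LINE_PATTERN.match(line)
    return match.groupdict() if match else None

parse_line("[16/03/2022, 15:20:25] person1: Hi shiva!")
# {'date': '16/03/2022', 'time': '15:20:25', 'sender': 'person1', 'message': 'Hi shiva!'}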
The end result In Table 1 you can see a (fictitious) snippet of the dataset obtained. This is how a dataset that combines donations from different users would look. As can be seen, we have moved from a rather untidy text file to a tidy, directly analysable dataset, where each row corresponds to a user in a given data donation package, and the rest of the columns give information about that user. In particular, the dataset displays the following information: data donation package id (ddp_id), an anonymized user name (user_name), number of words sent on the groupchat by the user (nwords), date of first and last messages sent on the groupchat by the user (date_first_mess, date_last_mess), number of urls, files and locations sent on the groupchat by the user (respectively, nurls, nfiles, nlocations), the (other) user that replied the most to that user (replies_from), and the user that they themselves replied to the most (replies_to).\nTable 1. Snippet of fictitious dataset\nddp_id user_name nwords date_first_mess date_last_mess nurls nfiles nlocations replies_from replies_to 1 User1_1 121 10/08/2023 27/08/2023 0 15 0 User1_2 User1_4 1 User1_2 17 11/08/2023 28/08/2023 3 1 2 User1_1 User1_1 1 User1_3 44 10/08/2023 28/08/2023 9 6 3 User1_2 User1_1 1 User1_4 50 12/08/2023 29/08/2023 0 3 1 User1_3 User1_1 2 User2_1 123 01/05/2022 01/11/2022 2 0 1 User2_2 User2_2 2 User2_2 250 01/05/2022 02/11/2022 0 32 3 User2_3 User2_1 2 User2_3 176 08/07/2022 04/12/2022 6 0 5 User2_2 User2_3 3 User3_1 12 05/06/2023 26/07/2023 12 2 0 User3_1 User3_2 3 User3_2 16 06/06/2023 26/07/2023 17 2 0 User3_2 User3_1 In Figure 2 you can see a screenshot of how the Port software would display the data to be shared (number of words or messages, date stamps…) and ask the research subjects for consent. As you can see, the Port software guarantees that research subjects are aware of what information they are sharing and consent to it. The rest of the DDPs, including sensitive data, are analyzed locally and do not leave the respondents\u0026rsquo; devices.\nFigure 2. How the Port software displays the data to be shared and asks for consent\rConclusion The aim of this post was to illustrate how to use data donation with the software Port to extract online platform data. We illustrated all of this with the extraction of group-chat information from WhatsApp data. The main challenge of this project was to write a robust script that transforms this data into a nice, readily usable format while maintaining privacy. If you want to implement something similar but do not know how or where to start, let us know and we can help!\nThis script uses a deprecated version of Port, but a large part of the script can be reused;\u0026#160;\u0026#x21a9;\u0026#xfe0e;\n","date":"September 8, 2023","image":"http://odissei-soda.nl/images/tutorial-4/whatsapp_header_hu3d03a01dcc18bc5be0e67db3d8d209a6_1708177_650x0_resize_q100_box.jpg","permalink":"/tutorials/collect-online-data-whatsapp/","title":"Collecting online platforms data for science: an example using WhatsApp"},{"categories":null,"contents":"In June 2023, we hosted a workshop about Network Science.\nAbout the workshop How can networks at Statistics Netherlands (CBS) help us understand and predict social systems? In this workshop, we provide participants with the conceptual and practical skills necessary to use network science tools to answer social, economic and biological questions.\nThis workshop introduces concepts and tools in network science. The objective of the course is that participants acquire hands-on knowledge on how to analyze CBS network data. Participants will be able to understand when a network approach is useful, work with networks efficiently, and create network variables.\nThe course has a hands-on focus, with lectures accompanied by programming practicals (in Python) to apply the knowledge on real networks.\nAdditional information When Thursday June 22, 2023.
(It was also taught in 2022, together with Eszter Boyani). Where Summer Institute in Computational Social Science, Erasmus University Rotterdam. Registration Registration is no longer possible. Instructors Javier Garcia-Bernardo, Assistant Professor at Utrecht University. Javier\u0026rsquo;s website. Materials Course materials, including slides and code, are open and can be accessed here. ","date":"June 22, 2023","image":"http://odissei-soda.nl/images/workshops/network-science_hu3d03a01dcc18bc5be0e67db3d8d209a6_668913_650x0_resize_q100_box.jpg","permalink":"/workshops/network-science/","title":"Network science"},{"categories":null,"contents":"In May 2023, we hosted a workshop on Causal Impact Assessment focusing on policy interventions.\nAbout the workshop How do we assess whether a school policy intervention has had the desired effect on student performance? How do we estimate the impact a natural disaster has had on the inhabitants of affected regions? How can we determine whether a change in the maximum speed on highways has led to fewer accidents? These types of questions are at the core of many social scientific research problems. While questions with this structure are seemingly simple, their causal effects are notoriously hard to estimate, because often the researchers cannot perform a randomized controlled experiment.\nThe workshop offers hands-on training on several basic and advanced methods (e.g., difference-in-differences, interrupted time series and synthetic control) that can be applied when assessing causal impact. There is a focus on both the assumptions underlying the methods, and how to put them into practice.\nAdditional information When May, 2023. It is likely to be held again in 2024. Where Utrecht University. Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website) \u0026amp; Oisín Ryan, Assistant Professor at UMC (Oisín\u0026rsquo;s website). Materials Course materials, including slides and code, are open and can be accessed here. ","date":"May 26, 2023","image":"http://odissei-soda.nl/images/workshops/causal-inference_hu6a86fe0f271ec969b45a77f81a48f6ab_152168_650x0_resize_box_3.png","permalink":"/workshops/causal-inference-for-policy-evaluation/","title":"Causal Impact Assessment"},{"categories":null,"contents":"ArtScraper is a Python library to download images and metadata for artworks available on WikiArt and Google Arts \u0026amp; Culture.\nInstallation pip install git+https://github.com/sodascience/artscraper.git Downloading images from WikiArt To download data from WikiArt it is necessary to obtain free API keys.\nOnce you have the API keys, you can simply run the code below to download the images and metadata of three artworks by Aleksandra Ekster.\n# import the scraper class from artscraper import WikiArtScraper # artworks to scrape some_links = [ \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/women-s-costume-1918\u0026#34;, \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/still-life-1913\u0026#34;, \u0026#34;https://www.wikiart.org/en/aleksandra-ekster/view-of-paris-1912\u0026#34; ] # download images and metadata to the folder \u0026#34;data\u0026#34; with WikiArtScraper(output_dir=\u0026#34;data\u0026#34;) as scraper: for url in some_links: scraper.load_link(url) scraper.save_metadata() scraper.save_image() Downloading images from Google Arts \u0026amp; Culture To download data from Google Arts \u0026amp; Culture you need to download Firefox and geckodriver.
The installation instructions can be found in our GitHub repository.\nOnce you have Firefox and geckodriver, you can simply run the code below to download artworks. You are not allowed to share or publish the images. Use them only for research.\n# artworks to scrape some_links = [ \u0026#34;https://artsandculture.google.com/asset/helena-hunter-fairytales/dwFMypq0ZSiq6w\u0026#34;, \u0026#34;https://artsandculture.google.com/asset/erina-takahashi-and-isaac-hernandez-in-fantastic-beings-laurent-liotardo/MQEhgoWpWJUd_w\u0026#34;, \u0026#34;https://artsandculture.google.com/asset/rinaldo-roberto-masotti/swG7r2rgfvPOFQ\u0026#34; ] # If you are on Windows, you can download geckodriver, place it in your directory, # and use the argument geckodriver_path=\u0026#34;geckodriver.exe\u0026#34; with GoogleArtScraper(\u0026#34;data\u0026#34;) as scraper: for url in some_links: scraper.load_link(url) scraper.save_metadata() scraper.save_image() You can find more examples here\nDo you want to know more about this library? Check our GitHub repository\nAre you using it for academic work? Please cite our package:\nSchram, Raoul, Garcia-Bernardo, Javier, van Kesteren, Erik-Jan, de Bruin, Jonathan, \u0026amp; Stamkou, Eftychia. (2022). ArtScraper: A Python library to scrape online artworks (0.1.1). Zenodo. https://doi.org/10.5281/zenodo.7129975 ","date":"October 4, 2022","image":"http://odissei-soda.nl/images/tutorial-2/nantenbo_header_hu48d98726654ff846597d2fe25a58192f_52421_650x0_resize_q100_box.jpg","permalink":"/tutorials/artscraper/","title":"ArtScraper: A Python library to scrape online artworks"},{"categories":null,"contents":"With the increasing popularity of open science practices, it is now more and more common to openly share code along with more traditional scientific objects such as papers. But what are the best ways to create an understandable, openly accessible, findable, citable, and stable archive of your code? In this post, we look at what you need to do to prepare your code folder and then how to upload it to Zenodo.\nPrepare your code folder To make code available, you will be uploading it to the internet as a single folder. The code you will upload will be openly accessible, and it will stay that way indefinitely. Therefore, it is necessary that you prepare your code folder (also called a “repository”) for publication. This requires time and effort, and for every project the requirements are different. Think about the following checklist:\nMust-haves Make a logical, understandable folder structure. For example, for a research project with data processing, visualization, and analysis I like the following structure: my_project/\u0026lt;br/\u0026gt; ├─ raw_data/\u0026lt;br/\u0026gt; │ ├─ questionnaire_data.csv\u0026lt;br/\u0026gt; ├─ processed_data/\u0026lt;br/\u0026gt; │ ├─ questionnaire_processed.rds\u0026lt;br/\u0026gt; │ ├─ analysis_object.rds\u0026lt;br/\u0026gt; ├─ img/\u0026lt;br/\u0026gt; │ ├─ plot.png\u0026lt;br/\u0026gt; ├─ 01_load_and_process_data.R\u0026lt;br/\u0026gt; ├─ 02_create_visualisations.R\u0026lt;br/\u0026gt; ├─ 03_main_analysis.R\u0026lt;br/\u0026gt; ├─ 04_output_results.R\u0026lt;br/\u0026gt; ├─ my_project.Rproj\u0026lt;br/\u0026gt; ├─ readme.md Make sure no privacy-sensitive information is leaked. Remove non-shareable data objects (raw and processed!), passwords hardcoded in your scripts, comments containing private information, and so on. 
Create a legible readme file in the folder that describes what the code does, where to find which parts of the code, and what needs to be done to run the code. You can choose how elaborate to make this! It could be a simple text file, a word document, a pdf, or a markdown document with images describing the structure. It is best if someone who does not know the project can understand the entire folder based on the readme – this includes yourself in a few years from now! Strong recommendations Reformat the code so that it is portable and easily reproducible. This means that when someone else downloads the folder, they do not need to change the code to run it. For example this means that you do not read data with absolute paths (e.g., C:/my_name/Documents/PhD/projects/project_title/raw_data/questionnaire_data.csv) on your computer, but only to relative paths on the project (e.g., raw_data/questionnaire_data.csv). For example, if you use the R programming language it is good practice to use an RStudio project. Format your code so that it is legible by others. Write informative comments, split up your scripts in logical chunks, and use a consistent style (for R I like the tidyverse style) Nice to have Record the software packages that you used to run the projects, including their versions. If a package gets updated, your code may no longer run! Your package manager may already do this, e.g., for python you can use pip freeze \u0026gt; requirements.txt. In R, you can use the renv package for this. If you have privacy-sensitive data, it may still be possible to create a synthetic or fake version of this data for others to run the code on. This ensures maximum reproducibility. Compressing the code folder The last step before uploading the code repository to Zenodo is to compress the folder. This can be done in Windows 11 by right-clicking the folder and pressing “compress to zip file”. It’s a good idea to go into the compressed folder afterwards, and checking if everything is there and also removing any unnecessary files (such as .Rhistory files for R).\nFigure 1: Zipping the code folder.\rAfter compressing, your code repository is now ready to be uploaded!\nUploading to Zenodo Zenodo is a website where you can upload any kind of research object: papers, code, datasets, questionnaires, presentations, and much more. After uploading, Zenodo will create a page containing your research object and metadata about the object, such as publication date, author, and keywords. In the figure below you can see an example of a code repository uploaded to Zenodo.\nFigure 2: A code repository uploaded to the Zenodo website. See https://zenodo.org/record/6504837\rOne of the key features of Zenodo is that you can get a Digital Object Identifier (DOI) for the objects you upload, making your research objects persistent and easy to find and cite. For example, in APA style I could cite the code as follows:\nvan Kesteren, Erik-Jan. (2022). My project (v1.2). Zenodo. https://doi.org/10.5281/zenodo.6504837\nZenodo itself is fully open source, hosted by CERN, and funded by the European Commission. These are exactly the kinds of conditions which make it likely to last for a long time! Hence, it is an excellent choice for uploading our code. So let’s get started!\nCreate an account To upload anything to Zenodo, you need an account. If you already have an ORCID or a GitHub account, then you can link these immediately to your Zenodo login. 
I do recommend doing so as it will make it easy to link these services and use them together.\nFigure 3: Zenodo sign-up page. See https://zenodo.org/signup/\rStart a new upload When you click the “upload” button, you will get a page where you can upload your files, determine the type of the upload, and create metadata for the research object. Now zip your prepared code folder and drag it to the upload window!\nFigure 4: Uploading a zipped folder to the Zenodo website.\rFill out the metadata One of the first options you need to specify is the “upload type”. For code repositories, you can choose the “software” option. The remaining metadata is relatively simple to fill out (such as author and institution). However, one category to pay attention to is the license: by default the CC-BY-4.0 license is selected. For a short overview of what this means, see the creative commons website: https://creativecommons.org/licenses/by/4.0/. You can opt for a different license by including a file called LICENSE in your repository.\nFigure 5: Selecting the \u0026lsquo;software\u0026rsquo; option for upload type.\rPublish! The last step is to click “publish”. Your research code is now findable, citable, understandable, reproducible, and archived until the end of times! You can now show it to all your colleagues and easily cite it in your manuscript. If you get feedback and you want to change your code, you can also upload a new version of the same project on the Zenodo website.\nConclusion In this post, I described a checklist for preparing your code folder for publication with a focus on understandability, and I have described one way in which you can upload your prepared code repository to an open access archive. Zenodo is an easy, dependable and well-built option, but of course there are many alternatives, such as hosting it on your own website, using the Open Science Framework, GitHub, or using a publisher’s website; each has its own advantages and disadvantages.\n","date":"September 5, 2022","image":"http://odissei-soda.nl/images/tutorial-1/tutorial1_header_hu650f89d19acf6379f59d4eaf3a39c00b_29682_650x0_resize_box_3.png","permalink":"/tutorials/share-your-reserarch-code/","title":"How to share your research code"},{"categories":null,"contents":"In May 2022, we hosted a workshop about efficient programming for accessing CBS microdata.\nAbout the workshop What to do when your CBS microdata analysis takes too many computational resources to run on the remote access environment? In this workshop we covered solutions to this problem. It will be an accessible introduction to a variety of ways in which you can programme more efficiently when using microdata in your research. Furthermore, it will discuss when you should and should not move your project to the ODISSEI Secure Supercomputer.\nThe introduction will include some live coding, exploring different options for project organisation, speeding up code, benchmarking, profiling, and reducing memory requirements. During his talk, Van Kesteren will also touch upon topics such as \u0026ldquo;embarassingly parallel\u0026rdquo;, scientific programming, data pipelines, open source, and open science. Although the presentation will center around data analysis with R, these principles also hold for other languages, such as Python or Julia.\nAdditional information When May 16th, 2022. Where Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website). 
Materials Course materials, including slides and code, are open and can be accessed here. ","date":"May 16, 2022","image":"http://odissei-soda.nl/images/workshops/cbs_hub238c02f97a123a553dda5ed269ec1f4_33026_650x0_resize_q100_box.jpg","permalink":"/workshops/accesing-cbs-microdata/","title":"Efficient programming with CBS microdata"},{"categories":null,"contents":"In September 2022, we hosted a workshop about synthetic data.\nAbout the workshop Open data is one of the pillars of open science. However, there are often barriers in the way of making research data openly available, relating to consent, privacy, or organisational boundaries. In such cases, synthetic data is an excellent solution: the real data is kept secret, but a \u0026ldquo;fake\u0026rdquo; version of the data is available. The promise of the synthetic dataset is that others can then investigate the data structure, rerun scripts, use the data in educational materials, or even run a completely different analysis on their own.\nBut how do you generate synthetic data? In this session, we will introduce the field of synthetic data generation and apply several tools to generate synthetic versions of datasets, with various level of utility and privacy. We will be paying extra attention to practical issues such as missing values, data types, and disclosure control. Participants can either use a provided example dataset or they can bring their own data!\nAdditional information When September 2022. Where Open Science Festival. Registration Registration is no longer possible. Instructors Erik-Jan van Kesteren, Assistant Professor at Utrecht University (Erik-Jan\u0026rsquo;s website), Raoul Schram, Research Engineer at Utrecht University (Raoul\u0026rsquo;s website) \u0026amp; Thom Volker, PhD Candidate at Utrecht Universit (Thom\u0026rsquo;s website). Materials Course materials, including slides and code, are open and can be accessed here. ","date":"January 9, 2022","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/workshops/creating-synthetic-data/","title":"How to create synthetic data"},{"categories":null,"contents":"During your time as a fellow:\nyou will spend between 3-5 months full-time* working on a social science research project. you can propose your own project, based on your interests. you are a member of the SoDa team at the Methodology \u0026amp; Statistics department of Utrecht University. you will get a salary during this time, paid for by the team. one of the senior team members will be your mentor. To apply, you have to submit a short proposal for your project, together with a substantive supervisor. 
We are looking for projects in the social sciences for which a computational or data-related problem needs to be solved.\nThe next submission deadline is 31 May 2024\n* part-time possible, but the project should be your main priority.\n","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/fellowship/","title":"Fellowship"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/principles/","title":"Principles"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/projects/","title":"Projects"},{"categories":null,"contents":"","date":"January 1, 1","image":"http://odissei-soda.nl/images/workshops/syntheticdata_hu8a3495d7a8e26162d9e6ccba84eefed3_12080_650x0_resize_box_3.png","permalink":"/team/","title":"The SoDa Team"}] \ No newline at end of file diff --git a/index.xml b/index.xml index c995173..6235472 100644 --- a/index.xml +++ b/index.xml @@ -18,11 +18,11 @@Signed networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
- read more +Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
+ read more diff --git a/tutorials/comunity-detection-signed-networks/index.html b/tutorials/community-detection-signed-networks/index.html similarity index 99% rename from tutorials/comunity-detection-signed-networks/index.html rename to tutorials/community-detection-signed-networks/index.html index 0d22173..b01e4ae 100644 --- a/tutorials/comunity-detection-signed-networks/index.html +++ b/tutorials/community-detection-signed-networks/index.html @@ -418,7 +418,7 @@Signed networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting). Community detection in signed networks aims to identify groups of nodes that share similar connection patterns. In this tutorial, we will guide you through applying two popular community detection algorithms to signed networks, using Python.
+Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting). Community detection in signed networks aims to identify groups of nodes that share similar connection patterns. In this tutorial, we will guide you through applying two popular community detection algorithms to signed networks, using Python.
We will be using two algorithms:
Signed networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
- read more +Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
+ read moreSigned networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
- read more +Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
+ read more diff --git a/tutorials/index.xml b/tutorials/index.xml index 96c0fe2..ba777ae 100644 --- a/tutorials/index.xml +++ b/tutorials/index.xml @@ -9,11 +9,11 @@Signed networks are a way to represent relationships between entities. This type of netwworks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
- read more +Signed networks are a way to represent relationships between entities. These types of networks are called ‘signed’ because the connections between entities are signed: they can be positive (or cooperative) or negative (or conflicting).
+ read more