
Commit

Update first chapter
apiad committed Jul 15, 2024
1 parent 50e45b0 commit 502309b
Showing 3 changed files with 240 additions and 5 deletions.
28 changes: 25 additions & 3 deletions docs/search.json
@@ -14,7 +14,7 @@
"href": "theory/intro.html",
"title": "1  Introduction to Formal Languages",
"section": "",
"text": "1.1 What is a language\nIntuitively, a language is just a collection of correct sentences. In natural languages (Spanish, English, etc,), each sentence is made up of words, which have some intrinsic meaning, and there are rules that describe which sequences of words are valid.\nSome of these rules, which we often call “syntactic” are just about the structure of words and sentences, and not their meaning–like how nouns and adjectives must match in gender and number or how verbs connect to adverbs and other modifiers. Other rules, which we call “semantic”, deal with the valid meanings of collections of words–the reason why the sentence “the salad was happy” is perfectly valid syntactically but makes no sense. In linguistics, the set of rules that determine which sentences are valid is called a “grammar”.\nIn formal language theory, we want to make all these notions as precise as possible in mathematical terms. To achieve so, we will have to make some simplifications which, ultimately, will imply that natural languages fall outside the scope of what formal language theory can fully study. But these simplifications will enable us to define a very robust notion of language for which we can make pretty strong theoretical claims.\nSo let’s build this definition from the ground up, starting with our notion of words, or, formally, symbols:\nExamples of symbols in abstract languages might be single letters like a, b or c. In programming languages, a symbol might be a variable name, a number, or a keyword like for or class. The next step is to define sentences:\nAn example of a sentence formed with the symbols a and b is abba. In a programming language like C# or Python, a sentence can be anything from a single expression to a full program.\nWe are almost ready to define a language. But before, we need to define a “vocabulary”, which is just a collection of valid symbols.\nGiven a concrete vocabulary, we can then define a language as a (posibly infinite) subset of all the sentences that can be formed with the symbols from that vocabulary.\nLet’s see some examples.",
"text": "1.1 What is a language\nIntuitively, a language is just a collection of correct sentences. In natural languages (Spanish, English, etc,), each sentence is made up of words, which have some intrinsic meaning, and there are rules that describe which sequences of words are valid.\nSome of these rules, which we often call “syntactic” are just about the structure of words and sentences, and not their meaning–like how nouns and adjectives must match in gender and number or how verbs connect to adverbs and other modifiers. Other rules, which we call “semantic”, deal with the valid meanings of collections of words–the reason why the sentence “the salad was happy” is perfectly valid syntactically but makes no sense. In linguistics, the set of rules that determine which sentences are valid is called a “grammar”.\nIn formal language theory, we want to make all these notions as precise as possible in mathematical terms. To achieve so, we will have to make some simplifications which, ultimately, will imply that natural languages fall outside the scope of what formal language theory can fully study. But these simplifications will enable us to define a very robust notion of language for which we can make pretty strong theoretical claims.\nSo let’s build this definition from the ground up, starting with our notion of words, or, formally, symbols:\nExamples of symbols in abstract languages might be single letters like a, b or c. In programming languages, a symbol might be a variable name, a number, or a keyword like for or class. The next step is to define sentences:\nAn example of a sentence formed with the symbols a and b is abba. In a programming language like C# or Python, a sentence can be anything from a single expression to a full program.\nOne special string is the empty string, which has zero symbols, and will often bite us in proofs. It is often denoted as \\(\\epsilon\\).\nWe are almost ready to define a language. But before, we need to define a “vocabulary”, which is just a collection of valid symbols.\nAn example of a vocabulary is \\(\\{ a,b,c \\}\\), which contains three symbols. In a programming language like Python, a sensible vocabulary would be something like \\(\\{ \\mathrm{for}, \\mathrm{while}, \\mathrm{def}, \\mathrm{class}, ... \\}\\) containing all keywords, but also symbols like +, ., etc.\nGiven a concrete vocabulary, we can then define a language as a (posibly infinite) subset of all the sentences that can be formed with the symbols from that vocabulary.\nLet’s see some examples.",
"crumbs": [
"Formal Language Theory",
"<span class='chapter-number'>1</span>  <span class='chapter-title'>Introduction to Formal Languages</span>"
@@ -25,7 +25,7 @@
"href": "theory/intro.html#what-is-a-language",
"title": "1  Introduction to Formal Languages",
"section": "",
"text": "Definition 1.1 (Symbol) A symbol is an atomic element that has an intrinsic meaning.\n\n\n\nDefinition 1.2 (Sentence) A sentence is a finite sequence of symbols.\n\n\n\n\nDefinition 1.3 (Vocabulary) A vocabulary \\(V\\) is a finite set of symbols.\n\n\n\nDefinition 1.4 (Language) Given a vocabulary \\(V\\), a language \\(L\\) is a set of sentences with symbols taken from \\(V\\).",
"text": "Definition 1.1 (Symbol) A symbol is an atomic element that has an intrinsic meaning.\n\n\n\nDefinition 1.2 (Sentence) A sentence (alternatively called a string) is a finite sequence of symbols.\n\n\n\n\n\nDefinition 1.3 (Vocabulary) A vocabulary \\(V\\) is a finite set of symbols.\n\n\n\n\n\n\n\n\nWhat about identifiers?\n\n\n\nIf you think about our definition of vocabulary for a little bit, you’ll notice we defined it as finite set of symbols. At the same time, I’m claiming that things like variable and function names, and all identifiers in general, will end up being part of the vocabulary in programming languages. However, there are infinitely many valid identifiers, so… how does that work?\nThe solution to this problem is that we will actually deal with two different languages, on two different levels. We will define a first language for the tokens, which just determines what types of identifiers, numbers, etc., are valid. Then the actual programming language will be defined based on the types of tokens available. So, all numbers are the same token, and all identifiers as well.\n\n\n\n\nDefinition 1.4 (Language) Given a vocabulary \\(V\\), a language \\(L\\) is a set of sentences with symbols taken from \\(V\\).",
"crumbs": [
"Formal Language Theory",
"<span class='chapter-number'>1</span>  <span class='chapter-title'>Introduction to Formal Languages</span>"
@@ -36,7 +36,29 @@
"href": "theory/intro.html#examples-of-languages",
"title": "1  Introduction to Formal Languages",
"section": "1.2 Examples of languages",
"text": "1.2 Examples of languages",
"text": "1.2 Examples of languages\nTo illustrate how rich languages can be, let’s define a simple vocabulary with just two symbols, \\(V = \\{a,b\\}\\), and see how many interesting languages we can come up with.\nThe simplest possible language in any vocabulary is the singleton language whose only sentence is formed by a single symbol from the vocabulary. For example, \\(L_a=\\{a\\}\\) or \\(L_b = \\{b\\}\\). This is, of course, rather useless, so let’s keep up.\nWe can also define what’s called a finite language, which is just a collection a few (or perhaps many) specific strings. For example, \\[L_1 = \\{bab, abba, ababa, babba\\}\\]\n\n\n\n\n\n\nNote\n\n\n\nSince languages are sets, there is no intrinsic order to the sentences in a language. For visualization purposes, we will often sort sentences in a language in shortest-to-largest, and then lexicographic order, assuming there is a natural order for the symbols. But this is just one arbitrary way of doing it.\n\n\nNow we can enter the realm of infinite languages. Even when the vocabulary is finite, and each sentence itself is also as finite sequence of symbols, we can have infinitely many different sentences in a language. If you need to convince yourself of this claim, think about the language of natural numbers: every natural number is a finite sequence of, at most, 10 different digits, and yet, we have infinitely many natural numbers because we always take a number and add a digit at the end to make a new one.\nIn the same sense, we can have infinite languages simply by concatenating symbols from the vocabulary ad infinitum. The most straightforward infinite language we can make from an arbitrary vocabulary \\(V\\) is called the universe language, and it’s just the collection of all possible strings one can form with symbols from \\(V\\).\n\nDefinition 1.5 (Universe language) Given a vocabulary \\(V\\), the universe language, denoted \\(V^*\\) is the set of all possible strings that can be formed with symbols from \\(V\\).\n\nAn extensional representation of a finite portion of \\(V^*\\) would be:\n\\[V^* = \\{\\epsilon,a,b,aa,ab,ba,bb,aaa,aab,aba,abb,baa,bab,bba,bbb,...\\}\\]\nWe can now easily see that an alternative definition of language could be any subset of the universe language of a given vocabulary \\(V\\).\nNow let’s take it up a notch. We can come up with a gazillion languages just involving \\(a\\) and \\(b\\), by concocting different relationships between the symbols. For this, we will need some way to describe the languages that doesn’t require listing all the elements–as they are infinitely many. We can do it with natural language, of course, but in the long run it will pay to be a slightly more formal when describing infinite languages.\nFor example, let \\(L_2\\) be the language of strings over the alphabet \\(V=\\{a,b\\}\\) that has the exact same number of \\(a\\) and \\(b\\).\n\\[L_2 = \\{\\epsilon, ab, aabb, abab, baba, baab, abba, ...\\}\\]\nWe can define it with a bit of math syntax sugar as follows:\n\\[L_2 = \\{ \\omega \\in \\{a,b\\}^* | \\#(a,\\omega) = \\#(b,\\omega) \\}\\]\nLet’s unpack this definition. We start by saying, \\(\\omega in \\{a,b\\}^*\\), which literaly parses to “strings \\(\\omega\\) in the universe language of the vocabulary \\(\\{a,b\\}\\)”. This is just standard notation to say “string made out of \\(a\\) and \\(b\\). 
Then we add the conditional part \\(\\#(a,\\omega) = \\#(b,\\omega)\\) which should be pretty straightforward: we are using the \\(\\#(\\mathrm{&lt;symbol&gt;},\\mathrm{&lt;string&gt;})\\) notation to denote the function that counts a given symbol in a string.\n\\(L_2\\) is slightly more interesting than \\(V^*\\) because it introduces the notion that a formal language is equivalent to a computation. This insight is the fundamental idea that links formal language and computability theories, and we will formalize this idea in the next section. But first, let’s see other, even more interesting languages, to solidify this intuition that languages equal computation.\nLet’s define \\(L_3\\) as the language of all strings in \\(V^*\\) where the number \\(a\\) is a prime factor of the number of \\(b\\). Intuitively, working with this language–e.g., finding valid strings–will require us to solve prime factoring, as any question about \\(L\\) that has different answers for string in \\(L\\) than for strings not in \\(L\\) will necessarily go through what it means for a number to be a prime factor of another.\nBut it gets better. We can define the language of all strings made out of \\(a\\) and \\(b\\) such that, when interpreting \\(a\\) as \\(0\\) and \\(b\\) as \\(1\\), the resulting binary number has any property we want. We can thus codify all problems in number theory as problems in formal language theory.\nAnd, as you can probably understand already, we can easily codify any mathematical problem, not just about number theory. Ultimately, we can define a language as the set of strings that are valid input/ouput pairs for any specific problem we can come up with. Let’s make this intuition formal.",
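Here is a short Python sketch (an editor's illustration under assumed names, not part of the chapter) that enumerates a finite portion of the universe language V* in shortest-to-longest, then lexicographic, order and filters it to recover the first strings of L2.

from itertools import product

# Sketch only: enumerate every string over the vocabulary up to a maximum
# length, shortest first and in lexicographic order within each length.
def universe(vocabulary, max_length):
    for n in range(max_length + 1):
        for symbols in product(sorted(vocabulary), repeat=n):
            yield "".join(symbols)

V = {"a", "b"}

print(list(universe(V, 3)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb', 'aaa', 'aab', 'aba', 'abb', 'baa', 'bab', 'bba', 'bbb']

# Filtering the universe language gives the strings of L2 up to length 4.
L2 = [w for w in universe(V, 4) if w.count("a") == w.count("b")]
print(L2)
# ['', 'ab', 'ba', 'aabb', 'abab', 'abba', 'baab', 'baba', 'bbaa']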
"crumbs": [
"Formal Language Theory",
"<span class='chapter-number'>1</span>  <span class='chapter-title'>Introduction to Formal Languages</span>"
]
},
{
"objectID": "theory/intro.html#recognizing-a-language",
"href": "theory/intro.html#recognizing-a-language",
"title": "1  Introduction to Formal Languages",
"section": "1.3 Recognizing a language",
"text": "1.3 Recognizing a language\nThe central problem is formal language theory is called the word problem. Intuitively, it is about determining whether a given string is part of a language, or not. Formally:\n\nDefinition 1.6 (The Word Problem) Given a language \\(L\\) on some vocabulary \\(V\\), the word problem is defined as devising a procedure that, for any string \\(\\omega \\in V^*\\), determines where \\(\\omega \\in L\\).\n\nNotice that we didn’t define the word problem simply as “given a language \\(L\\) and a string \\(\\omega\\), is $omega L$”. Why? Because we might be able to answer that question correctly only for some \\(\\omega\\), but not all. Instead, the word problem is coming up with an algorithm that answers for all possible strings \\(\\omega\\)–technically, a procedure, which is not exactly the same, we will see the details in ?sec-computability.\nThe word problem is the most important question in formal language theory, and one of the central problems in computer science in general. So much so, that we actually classify languages (and by extension, all computer science problems) according to how easy or hard it is to solve their related word problem.\nIn the next few chapters, we will review different classes of languages that have certain common characterists which make them, in a sense, equally complex. But first, let’s see what it would take to solve the word problem in our example languages.\nSolving the word problem in any finite language is trivial. You only need to iterate through all of the strings in the language. The word problem becomes way more interesting when we have infinite languages. In these cases, we need to define a recognizer mechanism, that is, some sort of computational algorithm or procedure to determine whether any particular string is part of the language.\nFor example, language \\(L_2\\) has a very simple solution to the word problem. The following Python program gets the job done:\ndef l2(s):\n a,b = 0,0\n\n for c in s:\n if c == \"a\":\n a += 1\n else:\n b += 1\n return a == b\nOne fundamental question in formal language theory is not only coming up with a solution to the word problem for a given language but, actually, coming up with the simplest solution–for a very specific definition of simple: how much do you need to remember.\nFor example, we can solve \\(L_2\\) with \\(O(n)\\) memory. That is, we need to remember something that is proportional to how many \\(a\\)’s or \\(b\\)’s are in the string. And we cannot do it with less, as we will prove a couple chapters down the road.\nNow, let’s turn to the opposite problem, that of generating strings from a given language, and wonder what, if any, is the connection between these two.",
"crumbs": [
"Formal Language Theory",
"<span class='chapter-number'>1</span>  <span class='chapter-title'>Introduction to Formal Languages</span>"
]
},
{
"objectID": "theory/intro.html#generating-a-language",
"href": "theory/intro.html#generating-a-language",
"title": "1  Introduction to Formal Languages",
"section": "1.4 Generating a language",
"text": "1.4 Generating a language",
"crumbs": [
"Formal Language Theory",
"<span class='chapter-number'>1</span>  <span class='chapter-title'>Introduction to Formal Languages</span>"
