Merge pull request #287 from bast/radovan/profiling-exercise
add profiling exercise
bast authored Oct 24, 2024
2 parents 3754551 + d94db4d commit fe285f8
Showing 2 changed files with 117 additions and 6 deletions.
123 changes: 117 additions & 6 deletions content/profiling.md
@@ -43,11 +43,122 @@
> - [ ] Introduce line profilers
> - [ ] Visualize one code example using `scalene`
## Exercise
> [!IMPORTANT]
> Prepare two exercises for the last 20 minutes of this lecture.
> Left to do:
> - [ ] Provide an exercise in pure Python; Radovan has some ideas
> - [ ] Provide an exercise showing the performance improvement when introducing NumPy and/or pandas; Gregor will work on this

## Exercises

:::{exercise} Exercise Profiling-1
Work in progress: we will provide an exercise showing the performance
improvement from introducing NumPy and/or pandas.
:::

::::{exercise} Exercise Profiling-2
In this exercise we will use the `scalene` profiler to find out where most of the time is spent
and where most of the memory is used in a given code example.

Please try to go through the exercise in the following steps:
1. Make sure `scalene` is installed in your environment (if you have followed
this course from the start and installed the recommended software
environment, then it is).
1. Download Leo Tolstoy's "War and Peace" from the following link (the text is
provided by [Project Gutenberg](https://www.gutenberg.org/)):
<https://www.gutenberg.org/cache/epub/2600/pg2600.txt>
   (right-click and "save as" to download the file and **save it as "book.txt"**;
   a scripted alternative is sketched after the code example below).
1. **Before** you run the profiler, try to predict in which function the code
will spend most of the time and in which function it will use most of the
memory.
1. Run the `scalene` profiler on the following code example and browse the
generated HTML report to find out where most of the time is spent and where
most of the memory is used:
```console
$ scalene example.py
```
   Alternatively, you can do this (and then open the generated file in a browser):
```console
$ scalene example.py --html > profile.html
```
You can find an example of the generated HTML report in the solution below.
1. Does the result match your prediction? Can you explain the results?

```python
"""
The code below reads a text file and counts the number of unique words in it
(case-insensitive).
"""
import re


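# Variant 1: reads the whole file into one string, collects all words in a
# list, then deduplicates them via a set.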
def count_unique_words1(file_path: str) -> int:
with open(file_path, "r", encoding="utf-8") as file:
text = file.read()
words = re.findall(r"\b\w+\b", text.lower())
return len(set(words))


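# Variant 2: reads the file line by line, but stores unique words in a list
# and does a linear-time membership check before each append.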
def count_unique_words2(file_path: str) -> int:
unique_words = []
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower())
for word in words:
if word not in unique_words:
unique_words.append(word)
return len(unique_words)


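# Variant 3: reads the file line by line and stores unique words in a set
# (constant-time membership check).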
def count_unique_words3(file_path: str) -> int:
unique_words = set()
with open(file_path, "r", encoding="utf-8") as file:
for line in file:
words = re.findall(r"\b\w+\b", line.lower())
for word in words:
unique_words.add(word)
return len(unique_words)


def main():
_result = count_unique_words1("book.txt")
_result = count_unique_words2("book.txt")
_result = count_unique_words3("book.txt")


if __name__ == "__main__":
main()
```
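
If you would rather script the download from step 2 than save the file via
the browser, here is a minimal sketch using only Python's standard library
(same URL as above; it assumes a working network connection):

```python
# Sketch: download "War and Peace" and save it as book.txt (the URL from step 2).
from urllib.request import urlretrieve

urlretrieve("https://www.gutenberg.org/cache/epub/2600/pg2600.txt", "book.txt")
```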

:::{solution}
```{figure} profiling/exercise2.png
:alt: Result of the profiling run for the above code example.
:width: 100%
Result of the profiling run for the above code example. You can click on the image to make it larger.
```

Results:
- Most time is spent in the `count_unique_words2` function.
- Most memory is used in the `count_unique_words1` function.

Explanation:
- The `count_unique_words2` function is the slowest because it **uses a list**
  to store unique words and checks whether a word is already in the list before
  adding it.
  Checking whether a list contains an element may require traversing the
  whole list, an O(n) operation, so the lookup time grows linearly with the
  number of unique words collected so far.
- The `count_unique_words1` and `count_unique_words3` functions are faster
  because they **use a set** to store unique words.
  Checking whether a set contains an element is an average-case O(1)
  operation (a hash lookup).
- The `count_unique_words1` function uses the most memory because it **reads
  the whole file into one string**, **creates a list of all words** in the text
  (including duplicates), and only then builds a set from that list.
- The `count_unique_words3` function uses less memory because it traverses
the text file line by line instead of reading the whole file into memory.

What we can learn from this exercise:
- When processing large files, it can be good to read them line by line or in
  batches instead of reading the whole file into memory.
- It is good to have an overview of the standard data structures and their
  advantages and disadvantages (e.g. adding an element to a list is fast, but
  checking whether the list already contains the element is slow); the timing
  sketch below illustrates the difference.
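
To see the list-versus-set difference in isolation, here is a minimal timing
sketch (not part of the exercise; the word count and the number of repetitions
are arbitrary):

```python
import timeit

# Build a list and a set containing the same 100 000 distinct words.
words = [f"word{i}" for i in range(100_000)]
as_list = list(words)
as_set = set(words)

# Probe with a word that is not contained: the worst case for the list,
# which is scanned end to end, while the set does a single hash lookup.
print("list:", timeit.timeit(lambda: "missing" in as_list, number=100))
print("set: ", timeit.timeit(lambda: "missing" in as_set, number=100))
```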
:::
::::
Binary file added content/profiling/exercise2.png
