A data analysis tool primarily used for parsing Korean Congressional meeting record files (PDF files). It might assist with your political science papers, provided that you are not intimidated by the terminal command line and Python.
This library was written to complete my graduation thesis and includes modules for web scraping, text analysis, and more. Since it was designed for personal use, it has not been published on PyPI.
Specifically, it uses the `PyMuPDF` library to extract text content from PDF files, and then filters the content using regular expressions, converting a PDF file into a plain Python dict (`dict[str, list[str]]`), stored in the format `{"person name": ["speech content line #1", "speech content line #2"...]}` for further analysis.
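The regular-expression step can be sketched roughly as follows. This is not the library's actual code: the speaker marker (`◯` followed by a name), the sample lines, and the function name `group_by_speaker` are all assumptions made for illustration. In practice the input lines would come from the text that PyMuPDF extracts from each page.

```python
import re
from collections import defaultdict

# Assumption: a speaker turn starts with a marker such as "◯Kim ...".
# The pattern the library really uses may differ.
SPEAKER_LINE = re.compile(r"^◯\s*(?P<name>\S+)")


def group_by_speaker(lines: list[str]) -> dict[str, list[str]]:
    """Collect the lines following each speaker marker under that speaker's name."""
    speeches: dict[str, list[str]] = defaultdict(list)
    current = None  # name of the person currently speaking
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_LINE.match(line)
        if match:
            current = match.group("name")
            rest = line[match.end():].strip()
            if rest:
                speeches[current].append(rest)
        elif current is not None:
            speeches[current].append(line)
        # Lines before the first speaker marker are dropped.
    return dict(speeches)


sample = [
    "◯Kim Good morning.",
    "I have a question.",
    "",
    "◯Lee Thank you.",
    "◯Kim One more thing.",
]
result = group_by_speaker(sample)
```

The result has the same shape as the dict described above, so anything written against `dict[str, list[str]]` should work on it.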
The design is not complicated and can be easily replicated in other languages. Humbly speaking, the main contribution of this library is the author's physical labor, integrating tedious details together for convenient future use. From a computer science perspective, its design doesn't have any standout features.
To perform decade-spanning analyses, I conducted a simple analysis and reverse engineering of the KR Assembly website, implementing a web scraping module for downloading meeting records within a specified range.
This web crawler heavily depends on the API design of the Congressional website. If the API of the Congressional website changes, the crawling function will be affected. Additionally, when using its crawling function to download PDF meeting records, you may want to limit the frequency of the crawler's operations, such as scraping a small number of PDF files per second. This could potentially alleviate the server pressure on the Congressional website and prevent your own IP from being banned due to high-frequency access.
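One way to limit the request frequency is to put a small delay between iterations of the download loop. The helper below is a minimal sketch and not part of the library; the name `throttled` and its parameters are hypothetical. It also assumes that the network request happens inside the loop body (for example when `conf.pdf` is read), which I have not verified.

```python
import time
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def throttled(
    items: Iterable[T],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[T]:
    """Yield items so that consecutive yields are at least min_interval seconds apart."""
    last = None  # time at which the previous item was handed out
    for item in items:
        if last is not None:
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield item


# Hypothetical usage with the library's own objects:
#
#     for conf in throttled(cw.period("2019-01-01", "2019-03-01"), min_interval=2.0):
#         with open(f"./{conf.date}.pdf", "wb") as out:
#             out.write(conf.pdf)
```

The `clock` and `sleep` parameters exist only so the helper can be tested without actually waiting.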
cd src/CongressWatch/
`main.py` is everything you need.
# Show all options / commands
python3 main.py
# 'main.py comm':
# This command takes in the name of a committee and then plots the historical changes in the proportion of female parliamentarians' speeches in meetings of that committee.
python3 main.py comm '환노위'
# Other commands
# 'main.py detail':
python3 main.py detail YEAR TOP_N
# 'main.py plotfreq':
python3 main.py plotfreq YEAR
# ... and so on. Run `main.py` without any arguments to see all.
# Also, there are some useful little scripts under this directory:
python3 _get_percentage_by_comm.py get 20
- TUI Mode
This is a TUI based on the `textual` library, supporting both mouse and keyboard operation. As `textual` is (was?) still in the early stages of development, the interface may stop working when its API changes. If that happens, try downgrading `textual` to an earlier version (prior to 0.10), or use the command-line interface instead.
python3 app.py PDF_FILE
# Example:
python3 app.py pdf_for_testing/sample_2020.pdf
- With Python
cd src/
# make sure you're in this directory:
# `/CongressWatchKR/src`
# instead of
# `dearaj/src`.
pwd
/Home/Downloads/CongressWatchKR/src
then
$ ~/D/C/src (with-local-analyzer)> python3
Python 3.11.3 (main, Apr 7 2023, 19:30:05) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
finally
>>> import CongressWatch as cw
>>> cw.GEN_PERIOD_DICT
{21: ('2020-05-30', '2024-05-29'), 20: ('2016-05-30', '2020-05-29'), 19: ('2012-05-30', '2016-05-29'), 18: ('2008-05-30', '2012-05-29'), 17: ('2004-05-30', '2008-05-29'), 16: ('2000-05-30', '2004-05-29'), 15: ('1996-05-30', '2000-05-29'), 14: ('1992-05-30', '1996-05-29'), 13: ('1988-05-30', '1992-05-29'), 12: ('1985-04-11', '1988-05-29'), 11: ('1981-04-11', '1985-04-10'), 10: ('1979-03-12', '1980-10-27'), 9: ('1973-03-12', '1979-03-11'), 8: ('1971-07-01', '1972-10-17'), 7: ('1967-07-01', '1971-06-30'), 6: ('1963-12-17', '1967-06-30')}
>>>
>>>
>>> cw.period("2019-01-01", "2019-03-01")
<'period' object, with 10 conferences from 2019-01-01 to 2019-03-01, at 0x10f159900>
>>> for conf in cw.period("2019-01-01", "2019-03-01"): print(conf)
...
Conference(2019-01-09, 14:05, 수, 제365회 국회(임시회) 제04차 남북경제협력특별위원회, 2 movie(s))
Conference(2019-01-01, 00:15, 화, 제365회 국회(임시회) 제02차 국회운영위원회, 1 movie(s))
Conference(2019-01-16, 09:36, 수, 제365회 국회(임시회) 폐회중 제01차 과학기술정보방송통신위원회, 1 movie(s))
Conference(2019-01-15, 10:16, 화, 제365회 국회(임시회) 제01차 국방위원회, 1 movie(s))
Conference(2019-01-09, 10:24, 수, 제365회 국회(임시회) 제02차 행정안전위원회, 1 movie(s))
Conference(2019-01-09, 11:05, 수, 제365회 국회(임시회) 제01차 보건복지위원회, 1 movie(s))
Conference(2019-01-18, 10:03, 금, 제365회 국회(임시회) 폐회중 제02차 보건복지위원회, 2 movie(s))
Conference(2019-01-22, 14:28, 화, 제366회 국회(임시회) 제01차 문화체육관광위원회, 1 movie(s))
Conference(2019-01-21, 14:38, 월, 제366회 국회(임시회) 제01차 행정안전위원회, 1 movie(s))
Conference(2019-01-24, 10:07, 목, 제366회 국회(임시회) 제08차 정치개혁특별위원회, 1 movie(s))
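Since `cw.GEN_PERIOD_DICT` maps each assembly number to its start and end dates, it can be used to find which assembly a given date belongs to. The sketch below works on a hand-copied subset of the dict shown above so that it runs without the library; the function name `assembly_for` is hypothetical.

```python
from datetime import date
from typing import Optional

# A few entries copied from the cw.GEN_PERIOD_DICT output shown above.
PERIODS = {
    21: ("2020-05-30", "2024-05-29"),
    20: ("2016-05-30", "2020-05-29"),
    11: ("1981-04-11", "1985-04-10"),
    10: ("1979-03-12", "1980-10-27"),
}


def assembly_for(day: str, periods: dict = PERIODS) -> Optional[int]:
    """Return the assembly number whose term contains `day` (ISO format), or None."""
    target = date.fromisoformat(day)
    for number, (start, end) in periods.items():
        if date.fromisoformat(start) <= target <= date.fromisoformat(end):
            return number
    return None  # the date falls in a gap between terms or outside the table
```

Note that the terms are not contiguous (for example between the 10th and 11th assemblies), so a `None` result is possible even for dates inside the overall range.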
Downloading PDF files
>>> # The download function uses Python's file operation functions to write to local storage in binary format. This is a relatively primitive API, which might not be very friendly to computer novices.
>>> import CongressWatch as cw
>>> for conf in cw.period("2019-01-01", "2019-03-01"):
...     file = f"{conf.date}_{conf.ct1}.{conf.ct2}.{conf.ct3}.{conf.mc}.pdf"
...     with open(f"./{file}", 'wb') as out:
...         out.write(conf.pdf)
...
# These numbers are the return values of out.write(), i.e. the size in bytes of each PDF file you've downloaded. You can just ignore them.
1760256
498688
1280000
486400
494592
1103872
1641472
412672
475136
714752
# The mysterious codes:
# `ct1`, `ct2`, etc., are part of the Congressional website's API,
# and they are used to locate a specific PDF file.
# The website's API design includes abbreviations
# in Korean romanized letters and English,
# neither of which is my native language.
#
# As a result, I gave up trying to decipher these abbreviations and obediently use what they provide.
# If you want to use this download function,
# I suggest you simply copy this code, changing the date range to the one you need, of course.
# After the download is complete,
# you will see PDF file names like these in your directory:
2019-01-01_20.365.02.324.pdf 2019-01-18_20.365.02.333.pdf
2019-01-09_20.365.01.333.pdf 2019-01-21_20.366.01.345.pdf
2019-01-09_20.365.02.345.pdf 2019-01-22_20.366.01.359.pdf
2019-01-09_20.365.04.4GT.pdf 2019-01-24_20.366.08.4AL.pdf
2019-01-15_20.365.01.337.pdf CongressWatch/
2019-01-16_20.365.01.356.pdf
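If you later need the codes back from a file name produced by the loop above, they can be split out again. This is a small sketch, not part of the library; `RecordName` and `parse_record_name` are hypothetical names, and the code assumes the exact `{date}_{ct1}.{ct2}.{ct3}.{mc}.pdf` naming used in the example.

```python
from pathlib import Path
from typing import NamedTuple


class RecordName(NamedTuple):
    """The pieces of a file name written by the download loop above."""
    date: str
    ct1: str
    ct2: str
    ct3: str
    mc: str


def parse_record_name(path: str) -> RecordName:
    """Split '2019-01-09_20.365.04.4GT.pdf' back into its date and API codes."""
    stem = Path(path).stem            # drops only the final ".pdf"
    day, codes = stem.split("_", 1)   # the date comes before the first underscore
    ct1, ct2, ct3, mc = codes.split(".")
    return RecordName(day, ct1, ct2, ct3, mc)


record = parse_record_name("2019-01-09_20.365.04.4GT.pdf")
```

The codes are kept as strings because some of them, such as `4GT`, are not numeric and others carry leading zeros.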
Getting the MP list of a specific period
>>> import CongressWatch as cw
>>> cw.Assembly(19).has("문재인")
True
>>> cw.Assembly(20).has("박근혜")
False
`PDFText` is a wrapper class based on the `PyMuPDF` module. It can parse a 국회회의록 (National Assembly meeting record) PDF file into a Python dictionary in the form of `{MP_NAME: LIST_OF_LINES_OF_HIS_OR_HER_SPEECH}`. For more information, use the built-in function `help`.
>>> cw.PDFText
<class 'ajpdf.PDFText'>
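Once a record is in the `{MP_NAME: LIST_OF_LINES}` form, simple statistics need only plain Python. As an illustration, the sketch below computes what share of all speech lines belongs to a given set of members, which is roughly the kind of proportion the `comm` command plots. Measuring speech by line count is an assumption here, the library may measure it differently, and `speech_share` is a hypothetical name.

```python
def speech_share(speeches: dict[str, list[str]], names: set[str]) -> float:
    """Fraction of all speech lines spoken by the members listed in `names`."""
    total = sum(len(lines) for lines in speeches.values())
    if total == 0:
        return 0.0  # empty record: avoid dividing by zero
    selected = sum(len(lines) for name, lines in speeches.items() if name in names)
    return selected / total


# Made-up data in the same shape as the parsed dict.
parsed = {"A": ["line 1", "line 2", "line 3"], "B": ["line 4"]}
share = speech_share(parsed, {"B"})
```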
`Conferences(N)` is the collection of conferences for the Nth assembly. Its elements are `Conference` objects.
>>> for conf in cw.Conferences(15):
...     print(conf)
...     print(conf.movies)
...     break
...
Conference(1999-12-28, 12:00, 화, 제209회 국회(임시회) 제01차 본회의, 1 movie(s))
[Movie(41 speak(s))]
`Movie` objects
>>> for conf in cw.Conferences(10):
...     for movie in conf:
...         for speak in movie:
...             print(speak)
...     break
...
Speak(real_time=None, play_time='00:05:58', speak_type='보고', no=106739, speak_title='구범모의원', wv=0)
Speak(real_time=None, play_time='00:02:15', speak_type='인사', no=106740, speak_title='부총리겸경제기획원장관', wv=0)
Speak(real_time=None, play_time='00:04:31', speak_type='기타', no=106741, speak_title='위원장', wv=0)
`Movie` objects and `Speak` objects come from the metadata of meeting video records (video files). This library is not concerned with the videos themselves, but their metadata serves as an index for searching meetings: it tells us which members spoke in a particular meeting, as well as their speaking time, type of speech, and other information. With these objects there is no need to obtain the video files, which occupy too much disk space, take longer to download, and would require more powerful hardware for computer vision analysis.
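Using this metadata as an index could look roughly like the sketch below. The `Speak` dataclass here is only a stand-in that mirrors the fields visible in the printed output; it is not the library's class. The helper names are hypothetical, and the sample titles are made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Speak:
    """Stand-in with the fields shown in the output above (not the library's class)."""
    real_time: Optional[str]
    play_time: str
    speak_type: str
    no: int
    speak_title: str
    wv: int


def to_seconds(hms: str) -> int:
    """Convert an 'HH:MM:SS' string such as play_time into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def index_by_title(speaks: Iterable[Speak]) -> dict[str, list[int]]:
    """Map each speak_title to the `no` values of the speeches filed under it."""
    index: dict[str, list[int]] = defaultdict(list)
    for speak in speaks:
        index[speak.speak_title].append(speak.no)
    return dict(index)


sample = [
    Speak(None, "00:05:58", "report", 106739, "Member A", 0),
    Speak(None, "00:02:15", "greeting", 106740, "Minister B", 0),
    Speak(None, "00:04:31", "other", 106741, "Member A", 0),
]
index = index_by_title(sample)
```

Whether `play_time` is a duration or an offset into the video is not something I can confirm from the output alone, so `to_seconds` only converts the format and makes no claim about its meaning.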
Because it was designed for personal use, this program has only been tested on the author's computer, and some configuration information, such as the font settings for the graphics module, is hard-coded into the source files. However, it is not difficult to modify: as long as you have a basic understanding of Python, you can change these configuration settings.
Regarding further updates to this library, I will only conduct necessary bug fixes as conditions allow, and won't add any new functions. If you are a proficient Python user, you should be able to see the simple design of this library and easily modify or extend it with the help of some auxiliary stuff ;-). If you are a Python beginner and encounter problems when using this library, you can try to contact me. However, due to potential changes in the network conditions of my region, I can't guarantee a timely response.
I anticipate that this library could offer a bit of help to future studies in Korean politics. This `README` file is therefore written for those who plan to use the library for writing their papers, especially my seniors and juniors in the political science department. I also recognize that some AI experts might want to use it to gather training data for NLP. When using it, please be mindful of the terms of the Korean Congressional website, which prohibit commercial use of the data, as well as the terms of the third-party libraries this library depends on. CongressWatch uses many third-party libraries, some of which may also limit commercial use.
This library was developed while simultaneously being tested and used. If you see some hard-to-understand file names or cryptic screenshots, they are byproducts from the development process. You just need to use the library following the example code, and make the necessary code modifications when errors occur (I guess most issues are related to the configuration files, as they have been hard-coded into the source code to facilitate my paper writing).
This library has many areas that should be improved. For example, it should use SQL databases to store the information obtained, instead of storing it in pickle, JSON, or CSV formats and keeping it in memory as Python objects for analysis. This is due to the lack of consideration in the initial design and the author's lack of engineering experience.
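As a rough idea of what SQL storage could look like, the sketch below saves a parsed `{name: [lines]}` dict into SQLite using Python's built-in `sqlite3` module. The table layout, the column names, and the functions `save_speeches` and `lines_per_speaker` are all assumptions for illustration; nothing like this exists in the library today.

```python
import sqlite3


def save_speeches(
    conn: sqlite3.Connection, conference_id: str, speeches: dict[str, list[str]]
) -> None:
    """Store one parsed record as rows of (conference, speaker, line number, text)."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS speech_line (
               conference_id TEXT NOT NULL,
               speaker       TEXT NOT NULL,
               line_no       INTEGER NOT NULL,
               content       TEXT NOT NULL,
               PRIMARY KEY (conference_id, speaker, line_no)
           )"""
    )
    rows = [
        (conference_id, speaker, line_no, line)
        for speaker, lines in speeches.items()
        for line_no, line in enumerate(lines)
    ]
    # INSERT OR REPLACE keeps the function safe to run twice on the same record.
    conn.executemany("INSERT OR REPLACE INTO speech_line VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def lines_per_speaker(conn: sqlite3.Connection) -> dict[str, int]:
    """Count stored speech lines per speaker across all conferences."""
    cursor = conn.execute("SELECT speaker, COUNT(*) FROM speech_line GROUP BY speaker")
    return dict(cursor.fetchall())


conn = sqlite3.connect(":memory:")  # use a file path instead to persist the data
save_speeches(conn, "conf-1", {"A": ["first", "second"], "B": ["third"]})
counts = lines_per_speaker(conn)
```

With the data in a table, aggregations over many conferences become single queries instead of loops over pickled Python objects held in memory.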