Data #887

Open
wants to merge 55 commits into
base: main
55 commits
5b98772
Add IL Israeli Knesset Corpus sample
Oct 31, 2024
4474855
only sample speakers
GiliGoldin Nov 18, 2024
3bebb74
ids fix
GiliGoldin Nov 18, 2024
f7d9518
protocol 2021 deleted
GiliGoldin Nov 18, 2024
c69da10
faction names fix
GiliGoldin Nov 19, 2024
5d131ab
faction names fix and roles fixed
GiliGoldin Nov 19, 2024
2d683d0
ud-syn placeholder links
GiliGoldin Nov 19, 2024
534a157
taxonomies
GiliGoldin Nov 19, 2024
5175081
parlamint2conllu add he
GiliGoldin Nov 19, 2024
9d39b03
ud link changes
GiliGoldin Nov 20, 2024
19c9928
ud link changes
GiliGoldin Nov 20, 2024
300799b
reference to s instead of seg and fix msds
GiliGoldin Nov 20, 2024
f46960f
merge sentences to one segment
GiliGoldin Nov 21, 2024
332bb31
fix fallback ud tree treatment
GiliGoldin Nov 21, 2024
27a4fa7
Merge branch 'data' into data
matyaskopp Nov 21, 2024
158ce27
fixed: filenames, maintitle, meeting elements, bibl, settingDesc, ids…
GiliGoldin Nov 24, 2024
8040ae5
add abb names same as full name as we don't have abbreviated
GiliGoldin Nov 24, 2024
97b5979
Merge branch 'data' into data
matyaskopp Nov 25, 2024
ebb9a2c
deleted and reverted unrelated files
GiliGoldin Nov 25, 2024
199b869
removed minister roles and changed gov dates
GiliGoldin Nov 25, 2024
f71de66
fix nested words, fixed ner, erased unnececssary taxenomies
GiliGoldin Nov 27, 2024
cb109e2
fix ud links
GiliGoldin Nov 27, 2024
ede9e3e
ud links
GiliGoldin Nov 28, 2024
da45193
Merge branch 'data' into data
matyaskopp Nov 28, 2024
7d48aff
multi words
GiliGoldin Nov 28, 2024
8085586
multi words
GiliGoldin Nov 28, 2024
9fb37a9
Merge pull request #886 from clarin-eric/main
matyaskopp Nov 29, 2024
1d7be11
Merge branch 'data' into data
matyaskopp Nov 29, 2024
7eec9b8
fix one day affiliation overlap with the same organization report
matyaskopp Nov 29, 2024
e9d0061
Merge branch 'data' into data
matyaskopp Nov 29, 2024
257f397
place derived formats in correct folders
matyaskopp Nov 29, 2024
4b910aa
add prime ministers
GiliGoldin Dec 1, 2024
f14ec49
polish segments to text conversion
matyaskopp Dec 2, 2024
5e8eec0
Merge branch 'data' into data
matyaskopp Dec 2, 2024
a4480ac
languages fix
GiliGoldin Dec 2, 2024
f277525
join fixes
GiliGoldin Dec 2, 2024
b32736a
[data f14ec491] polish segments to text conversion
matyaskopp Dec 2, 2024
97c4a5c
join fixes
GiliGoldin Dec 2, 2024
11640a7
join fixes
GiliGoldin Dec 2, 2024
9cb4c22
removing indentation spaces inside orthographical tokens
matyaskopp Dec 2, 2024
4571733
Merge branch 'data' into data
matyaskopp Dec 2, 2024
8d6b4a9
fixing joins
GiliGoldin Dec 2, 2024
bd6b621
fix default translation expectation behaviour (do not use the setup f…
matyaskopp Dec 2, 2024
fba109f
[data bd6b621e] fix default translation expectation behaviour (do not…
matyaskopp Dec 2, 2024
abc39b0
Merge branch 'data' into data
matyaskopp Dec 2, 2024
9f7767c
fix join in last token
GiliGoldin Dec 3, 2024
3732c07
fix join in last token
GiliGoldin Dec 3, 2024
d2fc9e7
Merge pull request #881 from GiliGoldin/data
matyaskopp Dec 6, 2024
cabf663
fix saxon path in create sample action
matyaskopp Dec 6, 2024
04ab061
add template IL readme
matyaskopp Dec 6, 2024
8341641
DEV change
matyaskopp Dec 6, 2024
17d957f
Merge pull request #888 from clarin-eric/data-IL
matyaskopp Dec 6, 2024
767642a
action: generating ParlaMint-[IL] sample files with #888
matyaskopp Dec 6, 2024
61c44e7
action: generating ParlaMint roots for Sample folder #888
matyaskopp Dec 6, 2024
2984f5b
Trigger Action empty commit
matyaskopp Dec 7, 2024
7 changes: 5 additions & 2 deletions .github/actions/ParlaMintValidate/action.yml
@@ -6,10 +6,13 @@ inputs:
     required: true
   requireTaxonomiesTranslations:
     description: 'require every term in common taxonomies to be translated'
-    default: '1'
+    required: false
 runs:
   using: "composite"
   steps:
+    - name: Set environment variables
+      run: echo "REQ_TRANSLATION=${{ inputs.requireTaxonomiesTranslations || '1' }}" >> $GITHUB_ENV
+      shell: bash
     - name: Convert and Validate
-      run: ${{ github.action_path }}/validate.sh '${{inputs.parlas}}' '${{inputs.requireTaxonomiesTranslations}}'
+      run: ${{ github.action_path }}/validate.sh '${{inputs.parlas}}' '${{env.REQ_TRANSLATION}}'
       shell: bash
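For reviewers: the new step moves the fallback to `'1'` out of the input declaration and into the expression that sets `REQ_TRANSLATION`. A rough shell analogue of that fallback is sketched below. The function name `resolve_req_translation` is hypothetical and not part of this PR, and the sketch assumes that an unset or empty value should resolve to `1`.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): resolve the translation flag,
# falling back to 1 when the caller passes nothing or an empty string.
resolve_req_translation() {
  # ${1:-1} expands to "1" when $1 is unset OR empty, otherwise to $1.
  local value="${1:-1}"
  printf '%s\n' "$value"
}

resolve_req_translation ""    # prints 1 (empty input falls back)
resolve_req_translation "0"   # prints 0 (explicit value is kept)
```

This only illustrates the shell side. How the `||` operator in GitHub Actions expressions treats particular input values is defined by the expressions documentation, not by this sketch.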
5 changes: 3 additions & 2 deletions .github/workflows/createSample.yml
@@ -49,16 +49,17 @@ jobs:
       - name: Create Parliaments Samples ${{needs.Changes.outputs.parla_changed}}
         run: |
           cd $GITHUB_WORKSPACE/ParlaMint
+          SAXON=Scripts/bin/saxon.jar
           for parla in $(jq -r '.[]' <<< '${{needs.Changes.outputs.parla_changed}}' ); do
             echo "::group::Processing ParlaMint-$parla"
             DIR="${{env.SAMPLE_DIR}}/$parla"
             mkdir $DIR
             echo "::notice::New sample files [$parla] TEXT"
-            java -jar $GITHUB_WORKSPACE/Saxon.jar outDir=$DIR revRespPers='GitHub Action' -xsl:${{env.SAMPLE_SCRIPT}} Samples/ParlaMint-$parla/ParlaMint-$parla.xml
+            java -jar $SAXON outDir=$DIR revRespPers='GitHub Action' -xsl:${{env.SAMPLE_SCRIPT}} Samples/ParlaMint-$parla/ParlaMint-$parla.xml

             echo "::notice::New sample files [$parla] ANNOTATED"
             if [ -f "Samples/ParlaMint-$parla/ParlaMint-$parla.ana.xml" ] ; then
-              java -jar $GITHUB_WORKSPACE/Saxon.jar outDir=$DIR revRespPers='GitHub Action' -xsl:${{env.SAMPLE_SCRIPT}} Samples/ParlaMint-$parla/ParlaMint-$parla.ana.xml
+              java -jar $SAXON outDir=$DIR revRespPers='GitHub Action' -xsl:${{env.SAMPLE_SCRIPT}} Samples/ParlaMint-$parla/ParlaMint-$parla.ana.xml
             else
               echo "::warning::skipping annotated conversion - missing corpus root file"
             fi
7 changes: 7 additions & 0 deletions .gitignore
@@ -4,3 +4,10 @@ nohup.*
 *.zip
 *.tar
 *.tgz
+.idea/
+/output.log
+/output1.log
+/Samples/ParlaMint-IL/add-common-content/
+/Samples/ParlaMint-IL/text.seg/
+/Samples/ParlaMint-IL/text.seg.ana/
+/Samples/ParlaMint-IL/text.seg.ana/
33 changes: 19 additions & 14 deletions Makefile
@@ -356,7 +356,7 @@ text: $(text-XX)
 $(text-XX): text-%: %
 	rm -f `ls ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/ParlaMint-$<_*.txt | grep -v '.ana.'`
 	find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep -v '.ana.' | $P --jobs 10 \
-	'$s -xsl:Scripts/parlamint-tei2text.xsl {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{/.}.txt'
+	'$s -xsl:Scripts/parlamint-tei2text.xsl {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{.}.txt'

 text.ana-XX = $(addprefix text.ana-, $(PARLIAMENTS))
 ## text.ana ## create text version from TEI.ana files
@@ -365,7 +365,7 @@ text.ana: $(text.ana-XX)
 $(text.ana-XX): text.ana-%: %
 	rm -f ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/ParlaMint-$<_*.ana.txt
 	find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep '.ana.' | $P --jobs 10 \
-	'$s -xsl:Scripts/parlamint-tei2text.xsl {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{/.}.txt'
+	'$s -xsl:Scripts/parlamint-tei2text.xsl {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{.}.txt'



@@ -377,7 +377,7 @@ $(meta-XX): meta-%: %
 	rm -f ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/*-meta.tsv
 	find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-*_*.xml" | grep -v '.ana.' | $P --jobs 10 \
 	'$s meta=../${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/ParlaMint-$<.xml -xsl:Scripts/parlamint2meta.xsl \
-	{} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{/.}-meta.tsv'
+	{} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/{.}-meta.tsv'



@@ -495,22 +495,27 @@ $(composite-teiHeader-INPLACE-XX): composite-teiHeader-INPLACE-%: % composite-te
 text.seg-XX = $(addprefix text.seg-, $(PARLIAMENTS))
 ## text.seg ## create text version from TEI files - each line contains one segment
 text.seg: $(text.seg-XX)
-## text-XX ## convert TEI files to text
+## text.seg-XX ## convert TEI files to text
 $(text.seg-XX): text.seg-%: %
-	mkdir -p ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg
-	rm -f `ls ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/ParlaMint-$<_*.seg.txt | grep -v '.ana.'`
-	find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep -v '.ana.' | $P --jobs 10 \
-	'$s -xsl:Scripts/parlamint-tei2text.xsl element=seg {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/{/.}.txt'
+	@mkdir -p ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg
+	@rm -f ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/ParlaMint-$<_*.seg.txt
+	@find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep -v '.ana.' | $P --jobs 10 \
+	'$s -xsl:Scripts/parlamint-tei2text.xsl element=seg {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/{/.}.seg.txt'
+	@echo "INFO: segments converted to text are stored in ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg"

 text.seg.ana-XX = $(addprefix text.seg.ana-, $(PARLIAMENTS))
-## text.seg ## create text version from TEI.ana files - each line contains one segment
+## text.seg.ana ## create text version from TEI.ana files - each line contains one segment
 text.seg.ana: $(text.seg.ana-XX)
-## text.seg.ana-XX ## convert TEI.seg.ana files to text
+## text.seg.ana-XX ## convert TEI.ana files to text
 $(text.seg.ana-XX): text.seg.ana-%: %
-	mkdir -p ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg
-	rm -f ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/ParlaMint-$<_*.seg.ana.txt
-	find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep '.ana.' | $P --jobs 10 \
-	'$s -xsl:Scripts/parlamint-tei2text.xsl element=seg {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg/{/.}.txt'
+	@mkdir -p ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg.ana
+	@rm -f ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg.ana/ParlaMint-$<_*.seg.txt
+	@find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX} -maxdepth 2 -type f -name "ParlaMint-$<_*.xml" | grep '.ana.' | $P --jobs 10 \
+	'$s -xsl:Scripts/parlamint-tei2text.xsl element=seg {} > ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg.ana/{/.}'
+	@find -H ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg.ana -type f | $P 'mv {} {.}.seg.txt'
+	@echo "INFO: annotated segments converted to text are stored in ${DATADIR}/ParlaMint-$<${CORPUSDIR_SUFFIX}/text.seg.ana"
+
+


 ######---------------
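For reviewers: the `text.seg.ana` rule names its output in two passes. The conversion writes to `{/.}` (basename with the last extension removed), and the follow-up `mv {} {.}.seg.txt` strips one more extension before appending `.seg.txt`. The sketch below mimics that with plain bash parameter expansion, assuming `$P` is GNU parallel with its standard `{/.}` and `{.}` replacement strings. The function `seg_ana_target` and the sample filename are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): compute the final file name that
# the text.seg.ana rule would produce for one annotated input file.
seg_ana_target() {
  local src="$1"
  local base="${src##*/}"      # drop directories, e.g. ParlaMint-XX_example.ana.xml
  local step1="${base%.*}"     # like {/.}: drop ".xml" -> ParlaMint-XX_example.ana
  local step2="${step1%.*}"    # like {.} in the mv pass: drop ".ana"
  printf 'text.seg.ana/%s.seg.txt\n' "$step2"
}

seg_ana_target "ParlaMint-XX/2024/ParlaMint-XX_example.ana.xml"
# prints text.seg.ana/ParlaMint-XX_example.seg.txt
```

Under these assumptions, annotated and plain outputs share the `ParlaMint-XX_*.seg.txt` pattern and are told apart only by directory, which matches the `rm -f` glob used in both rules.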