# Solr Cloud 7 extraction
This README collects notes from performing statistics extraction from the Solr Cloud 7 setup populated by [webarchive-discovery](https://github.com/ukwa/webarchive-discovery) used at the Royal Danish Library. This proved to be tricky due to an internal, non-overridable limit in Solr 7 resulting in timeouts.
Due to a hard timeout of 10 minutes, the extraction was bisected both at shard level and at query level. Bash snippets were used for this together with the general streaming-oriented script `solr_stream_queries.sh`.
The notes below should make it possible for someone else to repeat the process.
## Initial setup and test
`solr_stream_queries.sh` requires either Solr server oriented arguments or a configuration file. The latter is recommended: Create a file `solr.conf` and state essential information there, e.g.:
```
: ${SERVER:="localhost"}
: ${PORT:="52300"}
: ${INDEX:="ns1"}
```
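The `: ${VAR:="default"}` idiom only assigns a value when the variable is unset, so anything in `solr.conf` can be overridden from the environment for a single run. The snippets below use exactly that pattern, e.g. (with a hypothetical sub-collection `ns42`)
```
INDEX=ns42 ./solr_stream_queries.sh solrq_font.q
```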
The script can be used directly with
```
./solr_stream_queries.sh solrq_font.q
```
If the file `font-result.csv` is produced and looks OK, everything is fine and the statistics extraction (at least for `font`) can be done without further tricks. Simply run `solr_stream_queries.sh` for each `.q` file, e.g. `for Q in solrq_*.q; do ./solr_stream_queries.sh "$Q"; done`, then merge the CSVs with
```
java DomainStatJSON.java $(find . -maxdepth 1 -iname "*-result.csv") > total.summary
```
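Note: `java DomainStatJSON.java` launches the single-file source directly, which requires Java 11 or later.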
The notes below are for setups where timeouts occur and where the overall collection consists of multiple sub-collections. The sections **Bisect 4** and **Bisect 8** are also relevant for setups without sub-collections.
## Extraction
Netarchive Search uses sub-collections named `ns1, ns2... ns143` served by 5 physical machines (25 sub-collections/machine). Normally they are queried using the alias `ns`, but the collections can also be queried individually.
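One way of listing the available collections (assuming the Solr Collections API is reachable at the host and port from `solr.conf`):
```
curl -s "http://localhost:52300/solr/admin/collections?action=LIST&wt=json"
```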
Find out what the highest collection ID is (140 at the time of extraction) and run ~13 parallel jobs on each machine to saturate CPU power. The snippet below processes the collections in two waves, odd-numbered IDs first, then even-numbered ones:
```
mkdir -p shard_results
for BASE in $(seq 1 2); do
    seq $BASE 2 140 | xargs -P 0 -I {} bash -c 'INDEX=ns{} OUTPUT_PREFIX=shard_results/{}_ ./solr_stream_queries.sh'
done
```
Some of these extractions failed due to timeouts, so it is time to handle problems.
For each query and for each sub-collection (aka shard), check if there is an output; if not, create a CSV file of size 0, signalling a need for bisection with the snippets under "Re-running and bisecting":
```
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -z "$(find shard_results -maxdepth 1 -iname "${SHARD}_q_${QT}-result.csv" -o -iname "${SHARD}[a-z]_q_${QT}-result.csv")" ]]; then
            touch "shard_results/${SHARD}_q_${QT}-result.csv"
        fi
    done
done
```
Move all successful results with `size > 0` to the folder `done`:
```
mkdir -p done
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -s shard_results/${SHARD}_q_${QT}-result.csv ]]; then
            mv "shard_results/${SHARD}_q_${QT}-result.csv" done/
        fi
    done
done
```
### Re-running and bisecting
The script skips tasks where a result is already present, so re-running is safe.
In case of timeouts for the full queries, bisecting the exports is a possibility. Large queries such as `http`, `https`, `html` and `total` are prone to timeouts. The snippet below looks for missing results for all queries, then performs a bisection into 4 parts for those.
Note: The hash uses base32 encoding with the characters `ABCDEFGHIJKLMNOPQRSTUVWXYZ234567`. Solr sorts those in order `234567ABCDEFGHIJKLMNOPQRSTUVWXYZ`.
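The bisection boundaries below follow from slicing that sorted alphabet into equally sized buckets: 4 buckets of 8 characters give the boundaries `C`, `K`, `S`, and 8 buckets of 4 characters give `6`, `C`, `G`, `K`, `O`, `S`, `W`. A small sketch that prints the boundary characters for any power-of-two split:
```
SORTED="234567ABCDEFGHIJKLMNOPQRSTUVWXYZ"
PARTS=4  # or 8, 16, 32
for I in $(seq 1 $((PARTS - 1))); do
    echo "${SORTED:$((I * 32 / PARTS)):1}"
done
```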
#### Bisect 4
```
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    # ${BISECT:0:1} is the part letter, ${BISECT:2} the hash range for the filter
    for BISECT in "a#[* TO sha1:C}" "b#[sha1:C TO sha1:K}" "c#[sha1:K TO sha1:S}" "d#[sha1:S TO *]"; do
        find shard_results/ -size 0 -iname "*q_${QT}-result.csv" | grep -o '[0-9]*' | sort -u |
            xargs -P 0 -I {} bash -c "FILTER=\"hash:${BISECT:2}\" INDEX=ns{} OUTPUT_PREFIX=shard_results/{}${BISECT:0:1}_ ./solr_stream_queries.sh solrq_${QT}.q"
    done
done
```
After this, move all complete results (all 4 bisection parts of size > 0) to `4`:
```
mkdir -p 4
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -s shard_results/${SHARD}a_q_${QT}-result.csv && -s shard_results/${SHARD}b_q_${QT}-result.csv && \
              -s shard_results/${SHARD}c_q_${QT}-result.csv && -s shard_results/${SHARD}d_q_${QT}-result.csv ]]; then
            mv shard_results/${SHARD}_q_${QT}-result.csv shard_results/${SHARD}[a-d]_q_${QT}-result.csv 4/
        fi
    done
done
```
Merge the bisected files into a single file and move that to `done`. This assumes that if one bisection part is there, all 4 bisection parts are there. The `mv -n` does not overwrite results already present in `done`.
```
mkdir -p done
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -s 4/${SHARD}a_q_${QT}-result.csv ]]; then
            echo "$QT $SHARD"
            java -Xmx8g DomainStatMerger.java 4/${SHARD}a_q_${QT}-result.csv 4/${SHARD}b_q_${QT}-result.csv \
                 4/${SHARD}c_q_${QT}-result.csv 4/${SHARD}d_q_${QT}-result.csv > 4/${SHARD}_q_${QT}-result.csv
            mv -n 4/${SHARD}_q_${QT}-result.csv done/
        fi
    done
done
```
#### Bisect 8
In case of timeouts even with bisection, the answer is of course more bisection!
Start by locating problematic bisections (where only some parts were generated) from above:
```
> find shard_results/ -iname "*[a-d]_q_*.csv"
shard_results/93c_q_html-result.csv
shard_results/96d_q_html-result.csv
```
If that looks correct, delete **all** 4 parts of the problematic bisections:
```
# The sed replaces the concrete part letter with the literal glob [a-d],
# so the unquoted $(...) expands to all four parts of each problematic bisection
rm $(find shard_results/ -size 0 -iname "*[a-d]_q_*.csv" | sed 's/[a-d]_q_/[a-d]_q_/')
```
Then perform a bisection in 8 parts:
```
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for BISECT in "a#[* TO sha1:6}" "b#[sha1:6 TO sha1:C}" "c#[sha1:C TO sha1:G}" "d#[sha1:G TO sha1:K}" \
                  "e#[sha1:K TO sha1:O}" "f#[sha1:O TO sha1:S}" "g#[sha1:S TO sha1:W}" "h#[sha1:W TO *]"; do
        find shard_results/ -size 0 -iname "*q_${QT}-result.csv" | grep -o '[0-9]*' | sort -u |
            xargs -P 0 -I {} bash -c "FILTER=\"hash:${BISECT:2}\" INDEX=ns{} OUTPUT_PREFIX=shard_results/{}${BISECT:0:1}_ ./solr_stream_queries.sh solrq_${QT}.q"
    done
done
```
After this, move all complete results (all 8 bisection parts of size > 0) to `8`:
```
mkdir -p 8
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -s shard_results/${SHARD}a_q_${QT}-result.csv && -s shard_results/${SHARD}b_q_${QT}-result.csv && \
              -s shard_results/${SHARD}c_q_${QT}-result.csv && -s shard_results/${SHARD}d_q_${QT}-result.csv && \
              -s shard_results/${SHARD}e_q_${QT}-result.csv && -s shard_results/${SHARD}f_q_${QT}-result.csv && \
              -s shard_results/${SHARD}g_q_${QT}-result.csv && -s shard_results/${SHARD}h_q_${QT}-result.csv ]]; then
            mv shard_results/${SHARD}_q_${QT}-result.csv shard_results/${SHARD}[a-h]_q_${QT}-result.csv 8/
        fi
    done
done
```
and merge to `done`:
```
mkdir -p done
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -s 8/${SHARD}a_q_${QT}-result.csv ]]; then
            echo "$QT $SHARD"
            java -Xmx8g DomainStatMerger.java 8/${SHARD}a_q_${QT}-result.csv 8/${SHARD}b_q_${QT}-result.csv \
                 8/${SHARD}c_q_${QT}-result.csv 8/${SHARD}d_q_${QT}-result.csv 8/${SHARD}e_q_${QT}-result.csv \
                 8/${SHARD}f_q_${QT}-result.csv 8/${SHARD}g_q_${QT}-result.csv 8/${SHARD}h_q_${QT}-result.csv \
                 > 8/${SHARD}_q_${QT}-result.csv
            mv -n 8/${SHARD}_q_${QT}-result.csv done/
        fi
    done
done
```
If anything is still missing, try re-running bisect 8 and if that fails, break out bisect 16 or the ultimate bisect 32. That is mostly left as an exercise for the reader; a sketch for generating a 16-way `BISECT` list is shown below.
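A sketch (untested; it follows the same boundary logic as the note under "Re-running and bisecting") of how the `BISECT` list for a 16-way split could be generated:
```
SORTED="234567ABCDEFGHIJKLMNOPQRSTUVWXYZ"
PARTS=16
LABELS="abcdefghijklmnop"
BISECTS=()
for I in $(seq 0 $((PARTS - 1))); do
    LOW="sha1:${SORTED:$((I * 32 / PARTS)):1}"
    HIGH="sha1:${SORTED:$(((I + 1) * 32 / PARTS)):1}"
    [[ $I -eq 0 ]] && LOW="*"
    if [[ $I -eq $((PARTS - 1)) ]]; then
        # Last bucket is open-ended with an inclusive upper bound
        BISECTS+=("${LABELS:$I:1}#[$LOW TO *]")
    else
        BISECTS+=("${LABELS:$I:1}#[$LOW TO $HIGH}")
    fi
done
printf '%s\n' "${BISECTS[@]}"
```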
Check if anything is missing:
```
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ ! -s done/${SHARD}_q_${QT}-result.csv ]]; then
            echo "$QT $SHARD is missing"
        fi
    done
done
```
### Hacks
For each query, check whether there are any results (including bisected ones) for all shards:
```
for QT in $(find . -maxdepth 1 -iname "solrq_*.q" | sed 's/.*solrq_\(.*\).q/\1/'); do
    for SHARD in $(seq 1 140); do
        if [[ -z "$(find shard_results -maxdepth 1 -iname "${SHARD}_q_${QT}-result.csv" -o -iname "${SHARD}[a-z]_q_${QT}-result.csv")" ]]; then
            echo "Missing: $QT $SHARD"
        fi
    done
done
```
## Merging
Merge results of the same type:
```
find done/ -iname "*.csv" | sed 's/.*[0-9]*_q_//' | sort -u | while read -r M; do
    echo "Processing $M"
    java -Xmx4g DomainStatMerger.java done/*_q_$M > total_$M
done
```
Merge results across types for the big summary:
```
java DomainStatJSON.java $(find . -maxdepth 1 -iname "total_*.csv") > total.summary
```
## Sanity checking
The sum of the statistics for `http` and `https` should equal those for `total`.
Extract a sample of 1000 entries from the middle of the total summary, then verify that the number of `http+https` entries matches `total` and that the bytes from those entries match as well. Do note that the order of `http`, `https` and `total` matters in the input, so if the order is different in the concrete `total.summary`, the snippet below must be adjusted accordingly. Yes, it is a quick hack.
```
head -n $( echo "$(wc -l < total.summary) / 2" | bc ) total.summary | tail -n 1000 > sample1000
# Split the sample into one record per line, then check that
# n_http + n_https - n_total and s_http + s_https - s_total are 0 for every record
cat sample1000 | sed -e 's/\({"[0-9][0-9][0-9]\)/\n\1/g' -e 's/, \("[0-9][0-9][0-9][0-9]"\)/\n{\1/g' | grep "{" |
    while read -r LINE; do
        sed 's/.*n_http":\([0-9]\+\).*n_total":\([0-9]\+\).*n_https":\([0-9]\+\).*/\1+\3-\2/' <<< "$LINE" | bc
        sed 's/.*s_http":\([0-9]\+\).*s_total":\([0-9]\+\).*s_https":\([0-9]\+\).*/\1+\3-\2/' <<< "$LINE" | bc
    done | grep -v '^0$'
```
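If the snippet prints nothing, every difference was `0` (and filtered away by the final `grep -v '^0$'`), meaning the numbers add up.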