
Materials and methods

Materials

All chemicals were obtained from Sigma (St Louis, MO, USA) unless otherwise stated. 2’-fucosyllactose (2’FL) and 3-fucosyllactose (3FL) were obtained from Glycom/DSM (Esbjerg, Denmark). Blood group A type II (BgA), Blood group B type II (BgB), Blood group H type II (BgH) and LewisY (LeY) were obtained from Elicityl (Crolles, France). Lewis A trisaccharide (LeA), 3’-sialyl Lewis A (sLeA), Lewis X trisaccharide (LeX), 3’-sialyl Lewis X (sLeX), 2-acetamido-2-deoxy-6-O-(α-l-fucopyranosyl)-d-glucopyranose (6FN), 2-acetamido-2-deoxy-4-O-(α-l-fucopyranosyl)-d-glucopyranose (4FN), 2-acetamido-2-deoxy-3-O-(α-l-fucopyranosyl)-d-glucopyranose (3FN), 4-nitrophenyl α-l-fucopyranoside (pNP-Fuc), 2-Chloro-4-nitrophenyl-α-l-fucopyranoside (CNP-Fuc), 2-Chloro-4-nitrophenol (CNP) and N-acetyllactosamine (LacNAc) were obtained from Biosynth Ltd (Compton, UK). FA2G2 N-glycan was from Ludger (Oxford, UK). IgG was purified from human serum using the protein A IgG purification kit from Thermo Fisher (Carlsbad, US). Purified porcine gastric mucin (pPGM) was obtained as previously described71. PNGase B035DRAFT_0334172 was a kind gift from Dr Lucy Crouch (Newcastle University). Phospholipase A2 (PLA2) from honeybee venom (Apis mellifera) was purchased from Sigma (St Louis, MO, USA). Recombinant fucosidases E1_101251B from R. gnavus E1 and ATCC_038333A from R. gnavus ATCC 29149 were produced in-house as previously reported11.

Bioinformatics analyses

For sequence similarity network (SSN) analysis, the sequences encoding GH29 fucosidases were extracted from the CAZy database (www.cazy.org). A total of 9,505 GH29 sequences from the CAZy database (last update 2022-10-18) were reduced to 2,971 sequences by applying a sequence identity cut-off of 0.8 with the CD-HIT suite73. The amino acid sequences were then used to generate the SSN using the Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST) with an alignment score threshold of 96 (40% sequence identity)33,74.
The SSN was visualised using Cytoscape 3.9.1. The GH29 sequences from each cluster are provided in Supplementary Data 5.

A semi-supervised training method, termed GH29BERT, was applied to implement unsupervised protein sequence representation learning and supervised classification for GH29 fucosidases. This pLM training process is composed of two phases (illustrated in Fig. S7): pre-training and classification task-training. The pre-training utilised the BERT-based (Bidirectional Encoder Representations from Transformers) language model75 to extract features from 34,258 non-redundant GH29 sequences derived from the CAZy database (last update: 2023-10-10) and the InterPro database (downloaded on 2023-11-02). A random 95%-5% data split was adopted for model pre-training and training-process validation. The original BERT model has 12 repeated blocks of Transformer encoders76; here we tuned this hyperparameter to 5 for best performance. Notably, the pre-training model, which includes a Masked Language Modelling (MLM) prediction head, hides a certain percentage of input tokens and is trained to predict them in a self-supervised manner. It implements both next-token and previous-token prediction, facilitating bidirectional context understanding, which is critical for protein sequence modelling29.

The classification task-training model, composed of two attention layers, three densely connected layers and one softmax classification head, was trained on 2,796 labelled sequences (see Supplementary Data 6) derived from the top 45 clusters of the SSN, excluding 14 GH29 sequences that were reserved for validation (see list of enzymes in Table 1). We randomly selected 80% of the labelled data for training and 20% for testing. Pre-training and task-training were executed on two NVIDIA A100 40GB GPUs for one week and 2 h, respectively.
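The masking step of the MLM objective and the random data splits described above can be illustrated with a minimal Python sketch. It is not the GH29BERT implementation: the mask token name, the 15% masking fraction (the text states only "a certain percentage"), the toy sequence and both function names are assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical mask token; the real tokeniser may use another symbol


def mask_sequence(seq, mask_fraction=0.15, seed=0):
    """Hide a fraction of residues and return (masked tokens, {position: hidden residue})."""
    rng = random.Random(seed)
    tokens = list(seq)
    n_mask = max(1, round(mask_fraction * len(tokens)))
    targets = {}
    for pos in rng.sample(range(len(tokens)), n_mask):
        targets[pos] = tokens[pos]  # residue the model would be asked to predict
        tokens[pos] = MASK
    return tokens, targets


def random_split(items, first_fraction, seed=0):
    """Shuffle a copy of items and cut it in two, e.g. 95%/5% or 80%/20%."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(first_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


toy_seq = "MKTLLAVSALAGSAQAEPYT"  # hypothetical 20-residue fragment, not a real GH29 sequence
masked, targets = mask_sequence(toy_seq)
train, test = random_split(range(100), 0.8)  # 80%/20% split of 100 placeholder items
```

During pre-training, the model would see `masked` and be scored on how well it recovers the residues stored in `targets`; because the hidden positions can have visible residues on both sides, the prediction draws on context in both directions.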
In addition to evaluating the accuracy of cluster predictions, we incorporated Exponential Cross-Entropy (ECE) to assess the uncertainty of each pLM while processing input sequences. For the classification task in this study, ECE was calculated using $e^{\frac{1}{n}\sum_{i=1}^{n}\mathrm{CE}(s_i, y_i)}$, where $n$ is the number of protein sequences tested, and $s_i$ and $y_i$ denote the sequences and their corresponding cluster labels, respectively. ECE is a form of perplexity, which in our context ranges from 1, indicating deterministic predictions, to 45, equivalent to a completely random selection from the 45 clusters.

In addition to adopting the semi-supervised method through training the pLM on GH29 fucosidase sequences from scratch, i.e., with randomly initialised model parameters, we also included two state-of-the-art pLMs, ESM-228 and ProtT5 model29, to evaluate their efficacy on GH29 sequence cluster prediction.
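The ECE defined earlier in this section, the exponential of the mean cross-entropy over the n test sequences, can be sketched in a few lines of Python. The sketch assumes that the classifier outputs a softmax probability vector over the 45 clusters for each sequence and that CE is the negative natural log of the probability assigned to the true cluster; the function names and toy inputs are hypothetical.

```python
import math

N_CLUSTERS = 45  # number of SSN clusters used as class labels


def cross_entropy(probs, label):
    """CE for one sequence: negative log-probability of its true cluster."""
    return -math.log(probs[label])


def exponential_cross_entropy(prob_rows, labels):
    """ECE = exp of the mean cross-entropy over all tested sequences."""
    n = len(labels)
    mean_ce = sum(cross_entropy(p, y) for p, y in zip(prob_rows, labels)) / n
    return math.exp(mean_ce)


labels = [0, 7, 44]  # hypothetical true cluster indices for three sequences

# Completely random guess: every cluster gets probability 1/45.
uniform = [[1.0 / N_CLUSTERS] * N_CLUSTERS for _ in labels]
# Deterministic, correct prediction: all probability on the true cluster.
certain = [[1.0 if k == y else 0.0 for k in range(N_CLUSTERS)] for y in labels]

ece_uniform = exponential_cross_entropy(uniform, labels)
ece_certain = exponential_cross_entropy(certain, labels)
```

The two toy cases reproduce the bounds stated in the text: `ece_certain` equals 1 and `ece_uniform` is 45 up to floating-point rounding.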
Both ESM-2 and ProtT5 are larger in scale than GH29BERT in terms of training data, number of parameters and training time, and were pre-trained on the entire set of known proteins (see detailed configuration comparison in Table S5). We loaded and froze their official pre-trained parameters, then used labelled GH29 sequences to fine-tune the task-training model, which had the same structure as that used for GH29BERT. For comparison, we established two baselines using a non-pretrained GH29BERT and a one-hot encoding approach, respectively. Non-pretrained GH29BERT was trained directly with labelled GH29 sequences, while the one-hot method trained the same task-training model using a one-hot encoding of the protein sequence. Dimension reduction was performed using Uniform Manifold Approximation and Projection (UMAP)52.

The GH29BERT model is accessible through a user-friendly interface: https://huggingface.co/spaces/Oiliver/GH29BERT. This web tool assigns a corresponding cluster ID to any sequence uploaded. The configuration of the running environment, including all dependencies and packages used, as well as the Python version, is detailed in Supplementary Data 7. The source code and instructions for preparing the running environment are available on GitHub: https://github.com/ke-xing/GH29BERT.

Cloning, expression and purification of fucosidases

The GH29-encoding genes were synthesised without the signal peptide sequence and cloned into pET28a with an N-terminal His6-tag by Prozomix (Haltwhistle, UK). BaGH2926AD218A and BaGH2926AD218N mutants were synthesised by NZYTech (Lisboa, Portugal). Escherichia coli TunerDE3 pLacI cells were transformed with the recombinant plasmids according to the manufacturer’s instructions. Expression was carried out in 1 L LB medium, growing cells at 37°C until OD600 reached 0.3 to 0.6, followed by induction at 16°C for 20-22 h. The cells were harvested by centrifugation at 4,000 g for 35 min.
The His-tagged proteins were purified by immobilised metal affinity chromatography (IMAC) and further purified by gel filtration on an ÄKTApure (Cytiva, Little Chalfont, UK). Protein purification was assessed by standard SDS–polyacrylamide gel electrophoresis using NuPAGE Novex 4–12% Bis-Tris gels (Life Technologies, Paisley, UK). Protein concentration was measured with a NanoDrop (Thermo Scientific, Wilmington, USA) using the extinction coefficient calculated by Protparam77 from the peptide sequence.

Enzymatic activity assays

For kinetics, all enzymes were incubated with CNP-Fuc in 50 mM citrate buffer at pH 6 and 37°C. The amount of enzyme was chosen to fulfil the free-ligand approximation, i.e. product formation was linear with enzyme concentration. The reaction duration was optimised to measure the reaction rates under initial conditions. A standard curve was made with the reaction products CNP and Fuc in a 1:1 ratio from 0 to 0.3 mM to better mimic the reaction products. The release of CNP was monitored using a microplate reader (FLUOstar Omega, BMG LABTECH, Ortenberg, Germany) by monitoring absorbance at 405 nm every 2 min for 40 min in 3 technical replicates. The kinetic parameters were calculated based on the Michaelis-Menten equation using a non-linear regression analysis programme, and one-way ANOVA was performed in comparison with E1_101251B (Prism 5, GraphPad, San Diego, USA).

The enzymatic activity of recombinant fucosidases was determined on 2’FL, 3FL, BgA, BgB, BgH, LeA, sLeA, LeX, sLeX, LeY, 6FN, pNP-Fuc and pPGM using 10 μM enzyme, 0.5 mM substrate or 1 mg/mL for pPGM in 50 mM citrate buffer pH 6 and 1 mg/mL bovine serum albumin (BSA). The reactions were incubated at 37°C and stopped by boiling at 95°C for 10 min. The release of Fuc was quantified with the l-fucose assay kit from Megazyme (Wicklow, Ireland) using a microplate reader (FLUOstar Omega, BMG LABTECH, Ortenberg, Germany) by monitoring absorbance at 340 nm every 2 min.
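For the CNP-Fuc kinetics described above, the conversion of 405 nm readings into initial rates via the CNP standard curve can be sketched as follows. This is only an illustration under stated assumptions: all absorbance values are invented, the standard curve is taken to be linear over 0 to 0.3 mM, and `fit_line` and `absorbance_to_mM` are hypothetical helper names. The actual analysis was carried out in Prism.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical standard curve: CNP concentration (mM) vs absorbance at 405 nm.
std_conc_mM = [0.0, 0.05, 0.1, 0.2, 0.3]
std_abs = [0.05, 0.30, 0.55, 1.05, 1.55]
std_slope, std_intercept = fit_line(std_conc_mM, std_abs)


def absorbance_to_mM(absorbance):
    """Invert the standard curve to obtain the CNP concentration in mM."""
    return (absorbance - std_intercept) / std_slope


# Hypothetical early time course, one reading every 2 min.
times_min = [0, 2, 4, 6, 8]
readings = [0.05, 0.15, 0.25, 0.35, 0.45]
product_mM = [absorbance_to_mM(a) for a in readings]

# Initial rate (mM CNP released per min) from the linear part of the time course.
initial_rate, _ = fit_line(times_min, product_mM)
```

Initial rates obtained in this way at a series of CNP-Fuc concentrations are the inputs to the non-linear Michaelis-Menten regression.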
To determine the specific activity, the enzymatic reactions were optimised by adjusting enzyme concentration and incubation time (Table S6) to obtain 6-25% substrate hydrolysis, which is within the detection limit and corresponds to the linear range. Specific activity was calculated from 4 technical replicates. One unit of activity was defined as the amount of enzyme needed to release 1 μmol Fuc per min under the conditions described above. Enzymatic reactions were also carried out as above but with 0.1 mM substrate and a 24 h incubation, and the released Fuc was confirmed by HPAEC-PAD using a Dionex ICS 5000 system (Thermo Scientific, Hemel Hempstead, UK). The sugars were separated on a CarboPac PA1 analytical column protected with a CarboPac PA1 guard column using the following gradient conditions at 1 mL/min: 0-20 min, 18 mM NaOH; 20.1-35 min, 100 mM NaOH; 35.1-50 min, 18 mM NaOH.

Enzymatic reactions (20 μL) were also performed against complex glycans and glycoproteins using 10 μM of enzyme and 5 μM of oligosaccharides or FA2G2 (5 ng/μL)78, PLA2 (1 mg/mL), IgG (1 mg/mL) untreated or treated with PNGase B035DRAFT_03341 (10 μM)72 or PNGaseF (5000 units/mL), respectively, in 50 mM citrate buffer at pH 6, 37°C for 24 h to release N-glycans. The products were analysed by LC-FD-MS/MS as previously described11. The reactions were stopped by heating at 95 °C for 5 min and then dried down using a Savant SpeedVac centrifugal evaporator (Thermo Fisher, Wilmington, USA), labelled at the reducing end with procainamide using the glycan labelling kit with sodium cyanoborohydride as the reductant (Ludger, Oxford, UK) and purified using a LudgerClean Procainamide Plate (LC-PROC-96, Ludger, Oxford, UK) to remove the excess dye. The samples were dried down using a Thermo Savant SpeedVac centrifugal evaporator and resuspended in 50 µL of 75% acetonitrile: 25% water.
The suspensions were then injected onto a Waters BEH amide column (2.1 × 150 mm, 1.7 µm particle size, 130 Å pore size) at 40 °C on a Dionex Ultimate 3000 UHPLC instrument with a fluorescence detector (λex = 310 nm, λem = 370 nm) coupled to a Bruker Amazon Speed ETD. A 50 mM ammonium formate solution pH 4.4 (Ludger, Oxford, UK) was used as mobile phase A and acetonitrile (Romil, UK) was used as mobile phase B. A 70 min gradient was used at 0.2 mL/min unless otherwise specified: 0-53.5 min, 76% to 51% B, 0.4 mL/min; 53.5-55.5 min, 51% to 0% B; 55.5-57.5 min, 0% B; 57.5-59.5 min, 0% to 76% B; 59.5-65.5 min, 76% B; 65.5-70 min, 76% B, 0.4 mL/min.

The heatmap of enzyme specific activities was constructed via Chiplot (https://www.chiplot.online/). Hierarchical clustering was performed based on Euclidean distance, with complete linkage as the computing method.

Transfucosylation reactions

For transfucosylation, enzymatic reactions with 1 μM enzyme (1.43 μM for BaGH2926A), 180 mM GlcNAc and 18 mM pNP-Fuc were incubated for 1 h at 37°C in 20% (v/v) DMSO, which was included to increase the solubility of pNP-Fuc. The reactions were stopped by addition of three reaction volumes of ethanol. To assay the capacity of the enzymes to carry out further transfucosylation reactions, 1 μM enzyme (1.43 μM for BaGH2926A) was incubated with 180 mM 3FN or 6FN and 18 mM pNP-Fuc in 20% (v/v) DMSO.

Thin layer chromatography (TLC) and TLC-electrospray ionisation-mass spectrometry (TLC-ESI-MS) analysis

To analyse the products of transfucosylation reactions, standards, Fuc (0.01 µmol), pNP-Fuc (0.03 µmol), GlcNAc (0.25 µmol), 4FN (0.005 µmol), 3FN (0.005 µmol), 6FN (0.005 µmol), and reaction samples with GlcNAc (8 × 0.5 µL) or with 3FN or 6FN (4 × 0.5 µL) were loaded on a 12 cm tall plate (TLC Silica gel 60 F254, Sigma-Aldrich, Germany).
The plates were developed using an isopropanol-ammonium hydroxide-water 6:3:1 mixture (namely IPA-NH4OH-H2O) for 3 h or until the solvent front rose to ca. 11.25 cm. The plate was then dried using a hair dryer and stained using a 5% ethanolic solution of sulphuric acid. Gentle heating of the plate allowed the identification of the TLC spots corresponding to controls and reaction products.

TLC-ESI-MS of the enzymatic reactions was performed using an Expression Compact Mass Spectrometer (Advion, UK) coupled with a Plate Express reader (Advion, UK) in positive mode to identify the fucosylated reaction products. The enzymatic reactions were analysed through TLC as described above. The analysis was performed in duplicate so that one TLC plate could be stained and used as a guide to perform the TLC-ESI-MS on the non-stained plate. By comparison with the stained plate, the laser of the Plate Express reader was aimed at the right retention factor (Rf) and the data obtained were analysed using Advion Mass Express software. Comparison of retention factors, in combination with TLC-ESI-MS analysis and NMR, allowed identification of reaction products.

Nuclear Magnetic Resonance (NMR) spectroscopy

An aliquot of the enzymatic reactions (600 μL) was evaporated to dryness and reconstituted in 600 μL of NMR buffer (100 mL D2O containing 0.26 g NaH2PO4, 1.41 g K2HPO4, and 1 mM deuterated trimethylsilyl propionate (TSP) as a reference compound) before 1H-NMR spectroscopic analysis. 1H-NMR spectra were recorded using a 600-MHz Bruker Avance spectrometer fitted with a 5-mm TCI proton-optimised triple resonance NMR inverse cryoprobe and autosampler (Bruker, Bremen, Germany). Sample temperature was controlled at 300 K. Spectra were acquired with 32 scans, a spectral width of 12500 Hz and an acquisition time of 2.6 s. The “noesypr1d” presaturation sequence was used to suppress the residual water signal with a low-power selective irradiation at the water frequency during the recycle delay.
Spectra were then transformed with a 0.3-Hz line broadening and zero filling, manually phased, baseline corrected, and referenced by setting the TSP–d4 signal to 0 ppm. Reaction products were identified by comparison with the spectra of standards (GlcNAc, pNP-Fuc, Fucose, 6FN, 3FN and 4FN).

STD NMR

All NMR binding experiments were performed at 278 K on a Bruker Avance III 800 MHz spectrometer equipped with a 5-mm TXI 800 MHz H-C/N-D-05 Z BTO probe (Bruker, Bremen, Germany). First, FA2G2 was spectroscopically characterised by standard COSY (cosydfesgpph), TOCSY (mlvevphpp), 1H-13C HSQC (hsqctgpsp) and NOESY (noesygpph) for the purpose of assignment. Then, FA2G2 was recovered and prepared in a Shigemi advanced NMR microtube assembly at a concentration of ~200 μM in the presence of ~20 μM BaGH2926A (protein:ligand ratio 1:20), in D2O buffer solution containing 25 mM Tris-d11 pH 7.8 and 100 mM NaCl. An STD NMR pulse sequence including 2.5 ms and 5 ms trim pulses and a 3 ms spoil gradient was used. Saturation was achieved by applying a train of 50 ms Gaussian pulses (0.40 mW) on the f2 channel, at 6.70 ppm (on-resonance experiments) and 40 ppm (off-resonance experiments). The broad protein signals were removed using a 40 ms spinlock (T1ρ) filter. As a first test for binding, an STD NMR experiment with a saturation time of 2 s and a relaxation delay of 5 s was performed. Then, an STD build-up curve was acquired by carrying out STD experiments at different saturation times (0.5, 1, 2, 3, 4 and 5 s) with 2 K scans, in order to obtain the binding epitope mapping. The resulting build-up curves for each proton were fitted to a mono-exponential equation (y = a*[1-exp(-b*x)]), from which the initial slopes (a*b) were obtained. Finally, the binding epitope mapping was obtained by dividing the initial slopes by that of the strongest signal, corresponding to the methyl group of GlcNAc A, to which an arbitrary value of 100% was assigned.
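The build-up analysis above (mono-exponential fit, initial slope, normalisation to the strongest signal) can be sketched in plain Python. The sketch writes the curve as y = a*[1-exp(-b*x)] with b > 0, so that the initial slope dy/dx at x = 0 equals a*b. The fitting strategy (a scan over b with a closed-form a, then golden-section refinement), the proton labels and the synthetic intensities are all assumptions for illustration; the fitting software actually used is not stated in the text.

```python
import math


def fit_buildup(times, values, b_lo=1e-3, b_hi=20.0):
    """Least-squares fit of y = a*(1 - exp(-b*x)); returns (a, b)."""

    def best_a(b):
        # For a fixed b the model is linear in a, so a has a closed form.
        g = [1.0 - math.exp(-b * t) for t in times]
        return sum(y * gi for y, gi in zip(values, g)) / sum(gi * gi for gi in g)

    def sse(b):
        a = best_a(b)
        return sum((y - a * (1.0 - math.exp(-b * t))) ** 2
                   for t, y in zip(times, values))

    # Coarse log-spaced scan over b, then golden-section refinement.
    grid = [b_lo * (b_hi / b_lo) ** (k / 200) for k in range(201)]
    k_best = min(range(len(grid)), key=lambda k: sse(grid[k]))
    lo = grid[max(k_best - 1, 0)]
    hi = grid[min(k_best + 1, len(grid) - 1)]
    phi = (math.sqrt(5) - 1) / 2
    for _ in range(100):
        c = hi - phi * (hi - lo)
        d = lo + phi * (hi - lo)
        if sse(c) < sse(d):
            hi = d
        else:
            lo = c
    b = (lo + hi) / 2
    return best_a(b), b


sat_times = [0.5, 1, 2, 3, 4, 5]  # saturation times in s, as in the text

# Synthetic, noise-free STD intensities (%) for two hypothetical protons.
true_params = {"NAc_CH3": (10.0, 0.8), "H1": (6.0, 0.5)}
curves = {name: [a * (1 - math.exp(-b * t)) for t in sat_times]
          for name, (a, b) in true_params.items()}

slopes = {}
for name, ys in curves.items():
    a_fit, b_fit = fit_buildup(sat_times, ys)
    slopes[name] = a_fit * b_fit  # initial slope of the build-up curve

strongest = max(slopes.values())
epitope = {name: 100.0 * s / strongest for name, s in slopes.items()}
```

With these synthetic inputs the methyl signal has the largest initial slope and is set to 100%, and the second proton is expressed relative to it. Using initial slopes rather than intensities at a single saturation time is the point of acquiring the full build-up curve.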
X-ray crystallography

BaGH2926A was dialysed into 20 mM Tris, 150 mM NaCl. Sitting drop vapour diffusion plates were set up with a protein concentration of 20 mg/mL and 5 mM 2’FL. Crystals appeared in many conditions across commercial sparse matrix screens, with the best diffracting crystals appearing in the following condition: 0.12 M diethylene glycol, 0.12 M triethylene glycol, 0.12 M tetraethylene glycol, 0.12 M pentaethylene glycol, 100 mM Tris(base)/bicine pH 8.5, 12.5% 2-methyl-2,4-pentanediol, 12.5% PEG 1000, 12.5% w/v PEG 3350. Diffraction datasets were collected at Diamond Light Source on beamline I24 at a wavelength of 0.9686 Å. Attempts to soak out the Fuc molecule bound in the active site, so that additional complexes could be obtained, were unsuccessful. By seeding with WT BaGH2926A crystals we also grew diffracting crystals of the BaGH2926AD218N active site mutant. These crystals were grown in 0.1 M sodium acetate pH 4.6, 8% (w/v) PEG 4000. Data were processed using xia279 and dials80. The phase problem was solved by molecular replacement using the search model 1ODU, prepared using Chainsaw81. Initial model building was performed using ArpWarp82, followed by alternating cycles of model building and refinement using coot83, refmac84, and PDBredo85. The refined WT BaGH2926A structure has 0.3% Ramachandran outliers. The final D218N BaGH2926A structure has 0.05% Ramachandran outliers.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.