My friends call me Nacho (yep, like the tortilla chips 🌮) and I am a SWE/Data Scientist from Madrid, now based in Berlin. I am using this GitHub to put together some of the projects I can actually showcase, as a (slightly untidy) portfolio.
I created harmonized versions of the four distributable, annotated clinical datasets in German. Using the harmonized corpora as a meta-dataset, I ran a total of 340 cross-corpus evaluation experiments to assess how well current pre-trained Transformers generalize across the datasets. The paper was published at the 5th Clinical Natural Language Processing Workshop of the ACL. An extended version of the work with a substantially larger k-fold experimental design will come out shortly. Access to the datasets can be requested through their respective data use agreements (DUAs). Unfortunately, the code and model checkpoints are kept private due to GDPR concerns.
I participated in the development of xMEN, a Python package for cross-lingual (x) Medical Entity Normalization. I mainly took care of the PyPI integration, the command-line interface, unit tests and documentation. We first used the package to obtain SOTA results in the Disease Text Mining Shared Task from the Barcelona Supercomputing Center. We presented the results in this book for the International Conference of the Cross-Language Evaluation Forum for European Languages 2023. A complete specification can be found in this paper.
Promptbook is an open-source project that automatically provides a GUI for prompt-generating Python functions. Built on top of Python type hints, it lets you store, launch, tune, reuse and share GPT prompts of all kinds through a convenient GUI. The code and contribution guide are open in the repo.
This little proof of concept showcases how combining different AI actors can be effective for building conversational language teachers. The app first lets you record your voice. It then processes your speech, shows you which grammatical or spelling mistakes you made, and answers back to keep the conversation going. You can also practice your listening skills by replaying the audio of the robot's answer! More technical specs can be found in the repo.
In 2022, Katrin Ortmann presented a new evaluation method for labeled spans that reflects true prediction quality more accurately than traditional span-based metrics. We thought the idea was fantastic, so I provided an implementation within the HuggingFace Evaluate module to make it widely accessible. It has already been of help in some workshops and shared tasks. Just follow the link to see the code!
I developed an application for a kiosk manufacturing company that detects whether the face in front of the camera is real or fake (a print, a screen...). I trained a CNN separately on RGB, infrared and depth information and tested its performance on a self-recorded dataset, paying special attention to the effects of different room lighting conditions and camera positions. The depth model proved superior in most scenarios, obtaining an ACER score of 0.05 across all settings. A summary of the project development can be read here. Unfortunately, the code belongs to the company and cannot be shown.
Musicator is my Python program with a Streamlit frontend that uses a neural network trained on the GTZAN dataset to guess the genre of a song among blues, classical, country, disco, hiphop, jazz, metal, pop, reggae and rock. The source code and a Jupyter Notebook with an extensive analysis of the problem can be found in this repo.
I am always on the lookout to do my bit for the open-source projects I use frequently. As such, I am a top contributor to BigBIO and xMEN, I have made some improvements to Sci-spaCy, and I have contributed various modules and models to Hugging Face.
An old computer-vision application I developed with a couple of classmates that uses CNNs to detect facial landmarks (points that define the mouth, eyes, nose, eyebrows, etc.) for face recognition.
My M.Sc. was very heavy on statistics. I have compiled some interesting from-scratch implementations of various Machine Learning methods in Jupyter notebooks. [🚧 under construction 🚧]
Some other smaller or shared projects. [🚧 under construction 🚧]
- 🦠 Epidemiological Data Analysis in R: I had access to a sweet German biomedical dataset, so here is a varied statistical analysis I performed to show my R skills.
- 💲 Impact of Paid Subscriptions in Physician Platforms: in 2020, I conducted an investigation with a couple of classmates into whether paying for a subscription has a significant effect on the ratings of physicians on the platform Jameda.
- 🗄️ ETL tools for Master Data Management: brief literature survey of ETL and MDM with a fictional demonstration of some use cases.