I spent most of this year learning NLP, as part of my involvement in AI. I learned a lot, thanks to my newly discovered interest in Transformers, rooted in BERT-ology and GPT2-ology. Watching BERT triumph in so many NLP tasks was ground-shaking. All of that would have looked like science fiction to me just a couple of weeks earlier.
Just imagine this: you ask a machine (your own machine, not one in the cloud) “Who is Albert Einstein?”, and the machine answers with “A German-born physicist” and “a person of a century”. After the initial wow effect, I started to look for semantic search implementations, i.e., closed-domain question answering (Q&A) systems that make searching through larger text corpora doable. That is tricky, since BERT accepts only 512 tokens at a time, but it is possible if you combine it with a traditional search engine. You see, if you ask questions about Einstein and you have a large text corpus at your fingertips, you’ll probably find the answers in paragraphs that mention Einstein. It turned out that some Python libraries implement this concept in a couple of lines of code. The main trade-off seemed to be twofold: you don’t control the engine that searches the paragraphs/chunks of text for a specific keyword, and you can’t choose which BERT model you’ll use.
The first point, control over the search engine, is crucial for tuning the initial search speed to your needs, i.e., to the size and the structure of the text corpus you are using as a knowledge source. That is not trivial if your goal is to expose the complete English Wikipedia to semantic search, as mine was. It is doable if you split large texts into pieces no longer than 512 tokens and index them in Elasticsearch through FSCrawler. I implemented one improvement: consecutive text pieces overlap by 15% at their leading and trailing ends. The reason is to avoid splitting the answer to some question across two separate pieces, which would make it invisible to BERT, since BERT sees only one text piece at a time.
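The overlapping split can be sketched in a few lines of Python. This is a minimal sketch, not the code I used: the function name `split_with_overlap` is hypothetical, and it works on any list of tokens, whereas a real setup would count tokens with the tokenizer of the chosen BERT model.

```python
def split_with_overlap(tokens, max_len=512, overlap=0.15):
    """Split a token list into chunks of at most max_len tokens.

    Each chunk shares roughly `overlap` of its length with the previous
    chunk, so an answer sitting on a chunk boundary stays whole in at
    least one chunk (as long as it is shorter than the shared part).
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    if not tokens:
        return []

    shared = int(max_len * overlap)   # tokens repeated in the next chunk
    stride = max_len - shared         # how far the window moves each step

    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break                     # this chunk reached the end of the text
        start += stride
    return chunks
```

With the defaults, 76 tokens are shared between neighbours, so a text of 1000 tokens ends up in three chunks that start at tokens 0, 436, and 872.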
The second point, the choice of the BERT model, is crucial for the accuracy of the search through the pre-filtered paragraphs/chunks that contain the wanted keywords.
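To make the two stages concrete, here is a toy sketch of the idea: a keyword filter first, a reader second. Everything in it is an assumption made for illustration. The `retrieve` function is a naive stand-in for a real search engine such as Elasticsearch, and `reader` is any callable that takes a question and a chunk and returns an `(answer, confidence)` pair, which is the place where a BERT QA model of your choice would go.

```python
import re

# A tiny, hypothetical stopword list; a real search engine has its own analysis.
STOPWORDS = {"who", "what", "when", "where", "is", "was", "a", "an", "the", "of", "in"}


def keywords(text):
    """Lowercase the text and keep only the words that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS]


def retrieve(question, chunks, top_k=3):
    """Stage 1: keep the chunks that mention the question's keywords most often."""
    terms = set(keywords(question))
    scored = []
    for chunk in chunks:
        score = sum(1 for w in keywords(chunk) if w in terms)
        if score > 0:
            scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def answer(question, chunks, reader, top_k=3):
    """Stage 2: run the reader on each retrieved chunk, keep the most confident answer."""
    candidates = [reader(question, chunk) for chunk in retrieve(question, chunks, top_k)]
    return max(candidates, key=lambda cand: cand[1], default=None)
```

Because both stages are plain functions, you can swap either of them: a faster index for the first stage, or a different BERT variant for the second.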
Things get trickier when it comes to choosing the right BERT-like model for answering the questions. Since Google open-sourced BERT, several optimizations have appeared:
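- Distillation, the process of training a smaller “student” model to reproduce the predictions of the larger original model (pruning, i.e., removing weights that contribute little to the prediction, is a related but separate technique), and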
- Shrinking, the process of converting weights from 32-bit floats to 16-bit floats. The result is lower memory consumption and faster models, but possibly less accurate results.
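To get a feel for the 32-bit to 16-bit trade-off, here is a small sketch that uses only Python's standard `struct` module, whose `"e"` format is the IEEE 754 half-precision float. The helper names are hypothetical, and the roughly 110 million parameters of BERT-base serve only as an example size.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest 16-bit float and convert it back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def weights_size_mb(n_params, bytes_per_weight):
    """Memory needed for the weights alone, in megabytes (10**6 bytes)."""
    return n_params * bytes_per_weight / 1e6


bert_base_params = 110_000_000  # approximate size of BERT-base

print(weights_size_mb(bert_base_params, 4))  # 32-bit floats
print(weights_size_mb(bert_base_params, 2))  # 16-bit floats: half the memory
print(to_float16(0.1))                       # close to 0.1, but no longer equal to it
```

The memory is halved exactly, while every single weight picks up a small rounding error. Whether those errors add up to visibly worse answers is something you have to measure on your own corpus.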
Alongside these optimizations, a number of BERT variations and successors appeared, among others: XLNet, RoBERTa, ALBERT, and DistilBERT (the last one being a distilled BERT).
To use them properly, you have to strike the right balance between the accuracy you need and the speed they provide. Also, not all models behave the same way on every text corpus you expose to search. Therefore, you have to choose carefully, testing all of the available ones against the text corpus you are working with.
Things get even more complicated with how the QA variations of BERT are fine-tuned. Two datasets are commonly used for training QA engines: SQuAD and SQuAD 2.0. A model trained on the first one returns an answer to any question against any text you offer, even when the text contains none; but if there is a reasonable answer from a human perspective, it will find it quite accurately. A model trained on the second one has seen the original SQuAD dataset enriched with questions labeled “no answer found”, i.e., questions that cannot be answered from the given passage, so it can also tell you that there is no answer.
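How does a SQuAD 2.0 style model say "no answer"? A common approach in BERT QA implementations, which I assume here, is to treat the first token (`[CLS]`, position 0) as the "no answer" span and compare its score with the best real span. The sketch below works on plain lists of start and end scores; the function names and the `null_threshold` parameter are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Find the highest-scoring (start, end) span, skipping position 0 ([CLS])."""
    best = (float("-inf"), None, None)
    for s in range(1, len(start_logits)):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = start_logits[s] + end_logits[e]
            if score > best[0]:
                best = (score, s, e)
    return best


def predict(start_logits, end_logits, null_threshold=0.0):
    """Return (start, end) token positions, or None for "no answer found"."""
    null_score = start_logits[0] + end_logits[0]   # score of the empty answer
    score, s, e = best_span(start_logits, end_logits)
    if s is None or null_score - score > null_threshold:
        return None
    return (s, e)
```

A model fine-tuned only on the original SQuAD never learned to put a meaningful score on position 0, which is why it always comes back with some span.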
The result of everything mentioned above: a web application capable of answering human questions the way another human would, with the knowledge of the complete English Wikipedia, served from your laptop. For me, this was quite amazing.
Now, this NLP story continues. The most substantial part is also the weakest part: BERT. The piece that gives meaning to this whole story has its weaknesses: it accepts only 512 tokens at the input, and it is slow. The reason is the Transformer architecture, whose self-attention cost grows with the square of the number of input tokens. The cure: a similar but different architecture that accepts more tokens and has linear complexity: Performers.
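A back-of-the-envelope sketch of what "squared" means here: standard self-attention scores every token against every other token, so the number of scores per attention head is the input length squared. The function name is hypothetical, and the count ignores everything else the model does.

```python
def pairwise_scores(n_tokens):
    """Number of attention scores one head computes for an input of n_tokens."""
    return n_tokens * n_tokens


for n in (512, 1024, 4096):
    print(n, pairwise_scores(n))
```

Doubling the input from 512 to 1024 tokens quadruples the work, while an architecture with linear complexity would only double it.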
More on them, next time.