To use a pre-trained BERT model, we need to convert the input data into the format the model expects, so that each sentence can be fed to the pre-trained model to obtain its embedding. This article shows how this can be done using the modules and functions available in Hugging Face’s transformers
package (https://huggingface.co/transformers/index.html).
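As a minimal sketch of the idea (the model name "bert-base-uncased" and the mean-pooling step are illustrative choices of mine, not prescribed by the article), the flow looks roughly like this: tokenize the sentences, run them through the model, and pool the token vectors into one embedding per sentence.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; any BERT-family model name would work similarly.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = ["N-grams are useful features.",
             "BERT produces contextual embeddings."]

# Convert raw sentences into the format the model expects:
# token ids and attention masks, padded to a common length.
inputs = tokenizer(sentences, padding=True, truncation=True,
                   return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# outputs.last_hidden_state has shape (batch, seq_len, hidden_size).
# One common (but not the only) way to get a fixed-size sentence
# embedding is to average the token vectors, using the attention
# mask to ignore padding positions.
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
embeddings = summed / mask.sum(dim=1)                   # mean over real tokens

print(embeddings.shape)  # torch.Size([2, 768]) for bert-base
```

Alternatives to mean pooling, such as taking the [CLS] token's vector, are equally common; which works best depends on the downstream task.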
N-grams are contiguous sequences of n items in a sentence. N can be 1, 2, or any other positive integer, although we usually do not consider very large values of n because such long n-grams rarely appear in more than a few places.
When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents. This post describes several different ways to generate n-grams quickly from input sentences in Python.
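As a minimal sketch (the whitespace tokenization and the helper name `ngrams_zip` are my own illustrative choices, not from the post), here are two common approaches: a pure-Python version built with zip over shifted token lists, and NLTK's ready-made `ngrams` utility.

```python
from nltk.util import ngrams  # NLTK ships a generator for this


def ngrams_zip(tokens, n):
    """Pure-Python n-grams: zip n copies of the token list,
    each shifted one position further to the right."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Whitespace tokenization is a simplification for the example.
tokens = "the quick brown fox jumps".split()

print(ngrams_zip(tokens, 2))
# [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumps')]

print(list(ngrams(tokens, 3)))
# [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'),
#  ('brown', 'fox', 'jumps')]
```

The zip version avoids an extra dependency, while NLTK's generator is lazy and composes well with its other utilities; both produce the same tuples for the same input.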