Have you ever wondered how computers manage to understand language? For us, words and sentences feel natural, but for computers they are a challenge, because computers only work with numbers. To help machines make sense of text, we need to turn words into numbers they can process. Over the years, researchers have developed many ways to do this, starting with basic methods and moving toward more advanced techniques, culminating in the language models we see today, such as ChatGPT and BERT. In this blog, we'll explore the key steps in this journey and see how each new idea has helped computers get better at understanding human language.
The easiest way to turn words into numbers is to simply assign each word a number. For example, "apple" might become 1, "banana" 2, and so on. This method is called label encoding. It's basic and uses very little memory: it converts a fixed set of categories into integers so that a predictive model can work with them.
Applications:
Limitations:
Code Example:
from sklearn.preprocessing import LabelEncoder
words = ['apple', 'banana', 'carrot']
encoder = LabelEncoder()
labels = encoder.fit_transform(words)
print(labels) # Output: [0 1 2]
One-hot encoding avoids the false ordering that label encoding implies. Each word gets its own vector with a single 1 and zeros everywhere else, so every word is distinct, but the vectors carry no information about similarity or meaning between words. If you have a vocabulary of 10,000 words, you end up with 10,000-dimensional vectors that are mostly zero. It's quick for small vocabularies, but unmanageable for big ones.
Applications:
Limitations:
Code Example:
from sklearn.preprocessing import OneHotEncoder
import numpy as np
words = np.array(['apple', 'banana', 'carrot']).reshape(-1, 1)
encoder = OneHotEncoder(sparse_output=False)  # 'sparse_output' replaced the old 'sparse' argument in scikit-learn 1.2+
one_hot = encoder.fit_transform(words)
print(one_hot)
Instead of looking at single words ("unigrams"), n-gram encoding breaks text into contiguous sequences of n items, for instance "new york" (a bigram) or "machine learning is" (a trigram). N-grams capture local context, which is vital for basic text classification, sentiment analysis, and bag-of-words models. However, as n increases, the number of unique combinations skyrockets, making your feature matrix bigger and sparser (see the sketch after the code example below).
Applications:
Limitations:
Code Example:
from sklearn.feature_extraction.text import CountVectorizer
texts = ["machine learning is fun", "learning machines is exciting"]
vectorizer = CountVectorizer(ngram_range=(2,2)) # Bigram
features = vectorizer.fit_transform(texts)
print(vectorizer.get_feature_names_out())
# ['is exciting' 'is fun' 'learning is' 'learning machines' 'machine learning' 'machines is']
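To make that feature explosion concrete, here is a small sketch (our own addition, reusing the same two sentences) that counts how many features the vectorizer produces as n grows:
from sklearn.feature_extraction.text import CountVectorizer
texts = ["machine learning is fun", "learning machines is exciting"]
for n in (1, 2, 3):
    v = CountVectorizer(ngram_range=(1, n))  # unigrams up to n-grams
    v.fit(texts)
    print(f"n up to {n}: {len(v.get_feature_names_out())} features")
# Output: 6, 12, and 16 features respectively -- and this is just two short sentences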
TF-IDF (Term Frequency-Inverse Document Frequency) measures how important a word is in a document relative to a whole collection. Words that appear frequently in one document get a higher score, but if they appear in almost every document, their score is reduced. TF-IDF helps identify discriminative words for document retrieval, spam detection, and baseline text classification. It does not understand synonyms or semantics: "dog" and "canine" are treated as totally different words. (A worked check of the numbers appears after the code example below.)
Applications:
Limitations:
Code Example:
from sklearn.feature_extraction.text import TfidfVectorizer
docs = ["cats chase mice", "dogs love cats"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
# Output: ['cats', 'chase', 'dogs', 'love', 'mice']
print(X.toarray())
# [[0.44943642 0.6316672  0.         0.         0.6316672 ]
#  [0.44943642 0.         0.6316672  0.6316672  0.        ]]
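To see where these numbers come from, here is a quick manual check of the first row ("cats chase mice"), using the smoothed IDF formula that scikit-learn applies by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization; the variable names are ours, for illustration only.
import numpy as np
n_docs = 2
df = np.array([2, 1, 1])                   # 'cats' appears in both docs, 'chase' and 'mice' in one
idf = np.log((1 + n_docs) / (1 + df)) + 1  # smoothed IDF, as scikit-learn computes it
row = 1 * idf                              # each term occurs exactly once in "cats chase mice"
row = row / np.linalg.norm(row)            # L2-normalise the row (TfidfVectorizer's default norm)
print(row)
# [0.44943642 0.6316672  0.6316672 ]  -> the non-zero entries of the first row above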
Word2Vec uses neural networks to learn a vector representation for each word, either by predicting neighboring words from a word (Skip-gram) or by predicting a word from its neighbors (CBOW). Word2Vec allows arithmetic like king - man + woman ≈ queen, showing true semantic learning. Vectors are dense and capture similarity. Downside: each word gets exactly one vector, so it can't distinguish "bank" in "river bank" from "bank" in "money bank." (A sketch of the vector arithmetic appears after the code example below.)
Applications:
Limitations:
Code Example:
from gensim.models import Word2Vec
sentences = [["dog", "barks"], ["cat", "meows"]]
model = Word2Vec(sentences, vector_size=10, min_count=1)
print(model.wv.similarity('dog', 'cat'))
# Output: Numeric similarity score between 'dog' and 'cat'
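The two-sentence corpus above is far too small to show the king - man + woman ≈ queen arithmetic, so here is a rough sketch of how you could try it with pretrained vectors from gensim's downloader (our own addition; the download is a few hundred megabytes and the exact scores depend on the vectors you choose):
import gensim.downloader as api
vectors = api.load('glove-wiki-gigaword-100')  # pretrained GloVe vectors; pretrained Word2Vec vectors work the same way
result = vectors.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)
print(result)
# Output: the top result should be 'queen' with a high similarity score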
BERT (Bidirectional Encoder Representations from Transformers) is a deep learning model built on the Transformer architecture, focused on understanding natural language context. It processes text in both directions at once (bidirectional), giving it a deep understanding of word meaning in context. BERT generates contextual embeddings for all words in a sentence or paragraph. It was pre-trained using tasks like masked language modeling (predict hidden words given surrounding context), enabling self-supervised learning. BERT’s output vectors can be adapted for tasks like text classification, question answering, and named entity recognition by adding a small neural layer and fine-tuning. It uses self-attention to weigh the importance of each word in relation to others, making parallel training, scalability, and context handling possible.
These days, we employ embedding models such as those from sentence-transformers, bge, instructor, nomic-ai, etc. in our RAG pipelines to find the text chunks most similar to a user's prompt, so the chatbot can generate its response from that relevant data. (A minimal sketch of this retrieval step appears after the code example below.)
Applications:
Limitations:
Code Example:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')
print(classifier("I can't stand this movie"))
# Output: [{'label': 'NEGATIVE', 'score': 0.99}]
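And here is a minimal sketch of the RAG-style retrieval step described above, assuming the sentence-transformers package is installed and the small all-MiniLM-L6-v2 model can be downloaded; the chunks and query are made up for illustration:
from sentence_transformers import SentenceTransformer, util
model = SentenceTransformer('all-MiniLM-L6-v2')  # small pretrained embedding model
chunks = ["Cats are small domesticated felines.", "The stock market fell sharply today."]
query = "Tell me about pet cats."
chunk_embeddings = model.encode(chunks)
query_embedding = model.encode(query)
scores = util.cos_sim(query_embedding, chunk_embeddings)  # cosine similarity between the query and each chunk
print(scores)
# The first chunk should score noticeably higher, so it would be passed to the LLM as context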
GPT (Generative Pre-trained Transformer) is a deep learning model based on the Transformer architecture that excels at text generation and understanding. It reads and processes text left to right, predicting the next word from the previous words (autoregressive), and it uses only the decoder part of the Transformer. It can be fine-tuned on specific tasks like summarization, translation, or conversation. It produces contextual embeddings and can create coherent sentences, paragraphs, or entire articles from a prompt. GPT is ideal for text generation, chatbots, creative writing, code synthesis, and even answering questions in a conversational style: you give it a starting sentence (a prompt), and it continues writing based on what it has learned.
We incorporate LLMs like ChatGPT, Gemini, Mistral, LLaMA, Grok, Qwen, etc. into our chatbot and AI agentic pipelines to generate accurate, structured, prompt-driven text outputs that make our lives easier. (A sketch of calling such a model through an API appears after the code example below.)
Applications:
Limitations:
Code Example:
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
print(generator("In the future, AI will", max_length=15))
# Output: Generated sentence, e.g., "In the future, AI will change the way we work..."
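In production chatbots and agents, the model is usually reached through an API rather than run locally. Here is a rough sketch of that call (our own addition, assuming the openai Python package v1+ and an OPENAI_API_KEY environment variable; the model name is purely illustrative):
from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Summarise label encoding in one sentence."}],
)
print(response.choices[0].message.content)
# Output: a short, model-generated explanation of label encoding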
The journey from simple label encoding to powerful models like BERT and GPT highlights how far natural language processing has come. Early techniques gave computers a basic way to handle language as numbers, but struggled with meaning and context. With the arrival of word embeddings and especially transformer-based models, machines can now understand not just individual words, but also the relationships and subtleties within sentences and conversations.
Whether you're building a basic classification model or experimenting with advanced generative AI, choosing the right encoding or embedding technique is a crucial first step. That's why it's important for engineers to understand these concepts before choosing tools for their NLP projects.
Now that we’ve seen how text becomes numbers, the next step is: how do we measure similarity between these numbers? That’s where cosine similarity, dot product, and other distance metrics come in. We’ll cover that in the next post. Thanks for reading!
Want to read more about how we think and build at Ipsator?