Working with Large-Scale One-Hot Encoding: A Memory-Efficient Approach
Tame Your RAM-Hungry Categorical Variables Without Breaking Your Machine
Hey everyone! Recently, I've been diving deep into handling massive datasets, and today I want to share a clever workaround I discovered while tackling the Criteo Advertising Competition on Kaggle. Trust me, this one's going to be good!
The Challenge
Picture this: You've got an 11GB training dataset with categorical variables that can take millions of unique values. Your first instinct? "Let me just load it into pandas and use scikit-learn's DictVectorizer." Well, spoiler alert - your RAM's gonna tap out faster than a rookie in a marathon!
Even with my beefy 16GB machine, I couldn't fit the entire dataset into memory. And while scikit-learn's SGDClassifier has a handy partial_fit method for incremental learning, the same courtesy isn't extended to OneHotEncoder or DictVectorizer. Talk about a pickle! 🥒
Understanding the Data Structure
Before we dive into the solution, let's break down what we're working with:
40 features total
13 continuous variables (I1-I13)
26 categorical variables (C1-C26)
Some categorical variables had over a million unique values!
The Aha! Moment 💡
Here's where things get interesting. I realized I was approaching the problem all wrong. The key insight? We don't need to load all the data at once to create our feature space!
Instead, we can:
First, identify all possible unique values for each categorical variable (a sketch of this first pass follows this list)
Build a small, representative dictionary that contains every value of every categorical variable at least once
Fit our DictVectorizer on this smaller, complete feature space
Transform our actual data incrementally
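Here's a minimal sketch of that first pass. It assumes the training file is a CSV called train.csv with header columns named C1 through C26 for the categoricals (adjust the names and path to match your data); the only thing it ever keeps in memory is the set of unique values per column:

import pandas as pd

# Column names assumed from the dataset description above (C1-C26)
categorical_cols = ['C{}'.format(i) for i in range(1, 27)]

# First pass: stream the file in chunks and collect the unique values
# of every categorical column without loading the full dataset
category_values = {col: set() for col in categorical_cols}
for chunk in pd.read_csv('train.csv', usecols=categorical_cols,
                         dtype=str, chunksize=500000):
    for col in categorical_cols:
        category_values[col].update(chunk[col].dropna().unique())

# Turn the sets into sorted lists so they can be repeated and zipped later
category_values = {col: sorted(vals) for col, vals in category_values.items()}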
The Solution: Creating a Smart Feature Dictionary
Let me break this down with a simple example. Imagine you have three categorical variables:
C1: values from 1-100
C2: values from 1-3
C3: values from 1-1000
Instead of loading all your data, you create a dictionary like this:
feature_dict = {
    # Values are kept as strings so DictVectorizer treats them as categories
    # (numeric values would become a single numeric feature instead)
    'C1': [str(v) for v in range(1, 101)] * 10,   # repeated to match C3's length
    'C2': ['1', '2', '3'] * 333 + ['1'],          # repeated to match C3's length
    'C3': [str(v) for v in range(1, 1001)],       # all 1,000 values
}
Here's the actual implementation:
from sklearn.feature_extraction import DictVectorizer

# category_values maps each categorical column to the list of its unique
# (string) values, collected during the first pass over the file
max_length = max(len(values) for values in category_values.values())

# Pad every column's value list to the same length so the lists can be
# zipped row-wise into representative records
feature_dict = {}
for col, values in category_values.items():
    repeats = max_length // len(values)
    remainder = max_length % len(values)
    feature_dict[col] = values * repeats + values[:remainder]

# Fit DictVectorizer on the representative records; every (column, value)
# pair appears at least once, which is all the fit needs
vectorizer = DictVectorizer(sparse=True)
vectorizer.fit([dict(zip(feature_dict.keys(), row))
                for row in zip(*feature_dict.values())])
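Before touching the real data, you can sanity-check the fitted feature space. DictVectorizer stores the learned feature-to-index mapping in its vocabulary_ attribute, so its size should equal the total number of unique categorical values (the snippet below is an illustrative check built on the category_values dictionary from earlier):

# One feature per (column, value) pair, e.g. 'C1=68fd1e64'
print(len(vectorizer.vocabulary_))

# Transforming a single record yields a 1-row sparse matrix of that width
row = vectorizer.transform([{col: values[0] for col, values in category_values.items()}])
print(row.shape, row.nnz)  # nnz should equal the number of categorical columns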
Processing the Data Incrementally
Now comes the cool part. Instead of transforming all data at once, we process it chunk by chunk:
import pandas as pd

def process_chunks(filename, chunk_size=10000):
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Convert the categorical columns of the chunk to a list of dicts,
        # keeping values as strings so they get one-hot encoded
        chunk_dict = chunk[categorical_cols].astype(str).to_dict('records')
        # Transform with the already-fitted vectorizer (returns a sparse matrix)
        X_transformed = vectorizer.transform(chunk_dict)
        # Do something with the transformed data
        # (e.g., partial_fit your model; classifiers need classes= on the
        # first call -- here I assume the binary 0/1 Criteo labels)
        model.partial_fit(X_transformed, chunk['target'], classes=[0, 1])
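For completeness, here's roughly how I'd wire it all up. The SGDClassifier settings and the train.csv / target names are illustrative assumptions, not the exact setup from my competition code:

from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; on older scikit-learn use loss='log'
model = SGDClassifier(loss='log_loss')

# categorical_cols and vectorizer were set up earlier; the label column
# is assumed to be named 'target', as in the loop above
process_chunks('train.csv', chunk_size=10000)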
Pro Tips 🚀
Don't store transformed data: A transformed chunk of 100,000 records can easily eat up 10GB of disk space due to the high dimensionality.
Use sparse matrices: They're your best friend when dealing with one-hot encoded categorical variables.
Monitor memory usage: Keep an eye on your RAM usage during processing. If it starts climbing too high, reduce your chunk size (a quick way to see how much a sparse chunk actually occupies is shown below).
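As a rough check, a CSR sparse matrix only stores its non-zero entries, so you can add up its three underlying arrays. This assumes X_transformed is the sparse chunk produced in the loop above:

# Actual bytes held by the sparse chunk: data + column indices + row pointers
sparse_bytes = (X_transformed.data.nbytes
                + X_transformed.indices.nbytes
                + X_transformed.indptr.nbytes)
print('Sparse chunk: {:.1f} MB'.format(sparse_bytes / 1e6))

# Compare with what a dense float64 representation of the same chunk would need
dense_bytes = X_transformed.shape[0] * X_transformed.shape[1] * 8
print('Dense equivalent: {:.1f} GB'.format(dense_bytes / 1e9))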
Final Thoughts
This approach helped me break through my initial plateau of 0.47 on the Kaggle leaderboard. The beauty of this solution is that it scales well with large datasets while being memory-efficient.
Remember, sometimes the best solution isn't about having more computational resources - it's about being smarter with how you use them!
That's it for today, folks! Drop your thoughts and experiences in the comments below. Happy modeling!
Follow me on Medium, LinkedIn, and X for more such stories and to stay updated with recent developments in the ML and AI space.