Learn how to navigate identity, metadata, and ambiguity in telecom data to build trust and deliver real business impact—despite the mess.
Introduction
I started working with telecom data in 2022, and since then, I’ve worked across three very different companies—from large telcos to satellite communication providers. If there’s one thing all of them had in common, it’s this: telecom data is never clean and rarely ready for analysis.
That’s why you won’t find a flashy GitHub portfolio from me filled with Kaggle competitions or polished notebooks. The truth is, those datasets don’t even come close to the mess, scale, and velocity of real-world telecom data. The senior leaders I work with don’t ask me to show them Iris or Titanic models—they see “Telco + Satellite + Deployed to Production” on my CV, and they know I’m battle-tested.
Because when you work in a major telecom environment, you learn quickly: the data layer is messy, and no modern stack or glossy dashboard is going to clean it up for you. But that doesn’t mean you can’t build with it. You can, and you must. You still need to build analytics, marketing models, and customer tracking.
This post is about surviving and thriving in that environment—the patterns, pain points, and how to build trust despite the noise.
Section 1: Why Telecom Data Gets So Messy
Telecoms are operationally complex, and their data reflects that complexity. Here are just a few of the realities:
- Customers often stay on outdated plans after renewal, or are silently migrated to newer tariffs, without clear flags in the data.
- Mergers and acquisitions mean customer records may appear in multiple systems. A customer churns from one brand and reappears in another, but it’s technically the same person.
- Technology transitions (like 3G to 5G or the integration of legacy GX and BGAN terminals) create messy handoffs—data gaps, duplicate signals, or timestamps that no longer align across platforms.
- Different services—mobile, broadband, TV, wholesale—often sit in separate systems, each with their own schema, cadence, and ID conventions.
- IDs vary: MSISDN, IMSI, IMEI, internal account IDs, household IDs—with little standardisation.
- Some data flows in real time; other data arrives in nightly or even weekly batches. Synchronising timelines is its own battle.
- Regionalisation and business-unit silos create divergent logic and naming conventions.
Even something as seemingly straightforward as “customer tenure” becomes ambiguous. I have three different contracts under my name: a business mobile, a personal line, and a child’s device. They were activated months apart, but share common payment details and occasionally the same physical handset. So what’s my tenure? Which product? Which line of business? Based on the bill date or SIM activation?
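To make that concrete, here is a minimal BigQuery-style sketch. The `contracts` table and every column in it are hypothetical placeholders; the only point is that three defensible anchor dates give three different tenure numbers for the same payment account.

```sql
-- Minimal sketch of how three reasonable "tenure" definitions diverge.
-- Table and column names (contracts, sim_activation_date, first_bill_date,
-- payment_account_id) are hypothetical placeholders.
SELECT
  payment_account_id,
  -- Tenure anchored on the earliest SIM activation under the account
  DATE_DIFF(CURRENT_DATE(), MIN(sim_activation_date), MONTH) AS months_since_first_sim,
  -- Tenure anchored on the first bill raised against the shared payment account
  DATE_DIFF(CURRENT_DATE(), MIN(first_bill_date), MONTH)     AS months_since_first_bill,
  -- Tenure of the newest contract only (what a product team might report)
  DATE_DIFF(CURRENT_DATE(), MAX(sim_activation_date), MONTH) AS months_on_latest_contract
FROM contracts
GROUP BY payment_account_id;
```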
This kind of complexity is not a corner case. It’s the norm. And it’s why telecom data needs to be navigated—not just queried.
Section 2: Anchoring Your Analysis — Start with Purpose, Not Tables
Before writing a single line of SQL or opening the first table, the most important step is to clarify your goal: What are you trying to improve, predict, or understand, and over what time?
In my workflow, I rely heavily on JIRA tickets to structure my thinking. It gives me a simple but powerful template:
“I want to achieve X. To do that, I need to do Y. While doing it, I discovered Z1, Z2, Z3.”
This helps me stay grounded in purpose and document the real-world complexity I encounter along the way. It is something I learned from the program director at the first telecom company I worked for, and it still helps me to this day.
Too often, analysts jump straight into JOINs without this clarity. That leads to overly complex SQL, poor warehouse performance, and output that nobody trusts.
It’s easy to fall into habits like over-indexing on performance or reaching for ROW_NUMBER() OVER(PARTITION BY col1, col2, …, colN) to deduplicate, without checking whether your partitions match the true grain of the data. Fast SQL doesn’t always mean good SQL—especially when semantic mismatches skew your outputs.
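As a guardrail, I find it helps to verify the grain before deduplicating. The sketch below assumes a hypothetical `raw_usage_events` table; the columns are placeholders rather than a real schema.

```sql
-- Step 1: check the assumed grain. If this returns rows,
-- (msisdn, event_date) is NOT the true grain of the table.
SELECT msisdn, event_date, COUNT(*) AS n
FROM raw_usage_events
GROUP BY msisdn, event_date
HAVING COUNT(*) > 1
LIMIT 100;

-- Step 2: only then deduplicate, keeping the latest record per verified grain.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (
      PARTITION BY msisdn, event_date, event_type   -- the grain you verified
      ORDER BY load_timestamp DESC                  -- keep the freshest load
    ) AS rn
  FROM raw_usage_events
)
WHERE rn = 1;
```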
In telecom, this kind of mistake compounds quickly. For example, BigQuery is powerful at handling nested JSON and evolving schemas, but performance degrades when you attempt large cross-joins without filtering down first. Execution time alone is not a measure of solution quality. Sometimes, slower queries reflect more meaningful segmentation or more precise joins.
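One pattern that helps, sketched below with hypothetical `subscriptions` and `usage_daily` tables, is narrowing and aggregating each side before the join rather than joining everything and filtering afterwards:

```sql
-- Illustrative only: reduce both inputs first, then join on the smaller sets.
WITH active_subs AS (
  SELECT msisdn, account_id
  FROM subscriptions
  WHERE status = 'ACTIVE'
    AND snapshot_date = CURRENT_DATE()
),
recent_usage AS (
  SELECT msisdn, SUM(data_mb) AS data_mb_30d
  FROM usage_daily
  WHERE usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
  GROUP BY msisdn
)
SELECT s.account_id, s.msisdn, u.data_mb_30d
FROM active_subs AS s
LEFT JOIN recent_usage AS u USING (msisdn);
```

The join then only touches active lines and a 30-day slice of usage, which is usually what the business question needed in the first place.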
Anchoring your work in a clearly defined objective helps prevent these problems. It ensures your logic is intentional, your joins are meaningful, and your outputs are aligned with real business needs, not just technical completeness.
Section 3: Identity Resolution — It’s Never Just One ID
In telecom, identity is never as simple as a user ID or email address. It’s a patchwork of technical identifiers that often don’t align across devices, SIMs, accounts, and systems. Here’s what you’re usually dealing with:
- MSISDN (phone number): This may change if a customer ports their number or switches lines.
- IMEI: A unique 15-digit code that identifies a physical device (e.g. smartphone, tablet).
- IMSI: Tied to the SIM card and used to authenticate a user on the mobile network—totally different from IMEI.
- SIM: May move between devices, adding yet another layer of noise.
- Internal customer IDs: Often not standardised across mobile, broadband, TV, and wholesale systems.
- Household or account-level keys: Sometimes data is structured around households; sometimes around individuals.
- Agent or reseller accounts that behave like individuals but aren’t.
So what happens when you’re trying to build a customer 360 view, define churn cohorts, or report on tenure? You hit ambiguity fast. Is this the same person switching SIMs, or two people sharing a device? Is this a churned customer reactivating under a different MSISDN, or a brand-new signup?
The answer isn’t to “fix” this once and for all. That’s not realistic in telecom. Instead, your job is to build structure around the mess. You need:
- An ID resolution hierarchy (e.g. MSISDN → Account ID → hashed email, in priority order; see the sketch after this list)
- A version-controlled logic for this hierarchy—documented, testable, and owned by someone
- A trust threshold: Decide on a “good enough” trust score. Is chasing the 1% edge case worth it?
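As an illustration of the first two points, here is a minimal sketch of a priority-ordered resolution key. It is not a production identity graph; the `identity_candidates` table and its columns are assumptions for the example.

```sql
-- A sketch of a priority-ordered resolution key, not a production identity graph.
-- Source table and columns (identity_candidates, msisdn, account_id, hashed_email)
-- are hypothetical.
SELECT
  *,
  -- First non-null identifier wins, in the documented priority order
  COALESCE(
    CAST(msisdn AS STRING),
    CAST(account_id AS STRING),
    hashed_email
  ) AS resolved_customer_key,
  -- Record which rung of the hierarchy resolved the key, so downstream users
  -- can apply their own trust threshold
  CASE
    WHEN msisdn IS NOT NULL       THEN 'msisdn'
    WHEN account_id IS NOT NULL   THEN 'account_id'
    WHEN hashed_email IS NOT NULL THEN 'hashed_email'
    ELSE 'unresolved'
  END AS resolution_level
FROM identity_candidates;
```

Recording the resolution level alongside the key is what makes the trust threshold workable: downstream users can decide for themselves whether a match on hashed email alone is good enough for their use case.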
Identity in telecom is about managing ambiguity, not eliminating it. The goal is consistency, reproducibility, and knowing when “good enough” is actually good enough.
Section 4: Metadata and Contracts Matter More Than Tables
In messy environments like telecom, data isn’t distrusted because it’s “wrong”—it’s distrusted because people don’t know how to read it. That uncertainty comes from a lack of clarity around:
- What’s actually in a column (I still remember the first time I looked into telco data; it was fun, for sure)
- Where the data originated
- When it was last updated
- What logic was used to derive a metric
- How reliable it is for decisions
This is where metadata and process matter more than the raw tables themselves. If you’re using tools like GitLab for pipeline orchestration or Terraform for infrastructure-as-code, this is your leverage point. You can’t always guarantee perfectly clean data, but you can absolutely guarantee traceability, accountability, and structure.
Here’s how I’ve made this work in practice:
- Versioned SQL logic using GitLab, so every change is visible, testable, and reviewable. For every change, add your initials, the JIRA ticket number, and the date.
- Metadata tagging on tables (e.g., last update timestamp, owner, SLA on refresh; a small example follows this list).
- Terraform deployment of BigQuery datasets with access controls baked into the infrastructure layer.
- Wiki pages that link directly to the tables and scripts, with explanations, change history, and JIRA ticket references.
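As one concrete flavour of the tagging point above, BigQuery lets you attach a description and labels directly to a table. The dataset, table, ticket number, and label values below are invented for illustration; in practice the statement lives in the same version-controlled repository as the rest of the pipeline code.

```sql
-- Illustrative only: dataset, table, ticket, and label values are hypothetical.
ALTER TABLE analytics.customer_360
SET OPTIONS (
  description = 'Customer 360 base table; refreshed daily by the GitLab pipeline. Derivation logic tracked under JIRA DATA-1234.',
  labels = [
    ('owner', 'data_analytics'),
    ('refresh_sla', 'daily_0600_utc'),
    ('source_system', 'billing_and_crm')
  ]
);
```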
When we first tried documenting tables, we thought naming conventions and a few links would be “good enough.” But at scale, that breaks down. Now I’m the first to say: if it’s important, document it. Don’t rely on tribal knowledge. Clean code means readable code: meaningful CTE names, clear comments, and notes that explain what changed, why, and when. Good table hygiene also means tracking the evolution—link each change to a JIRA ticket and log it in a shared doc or wiki.
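To show what that hygiene can look like inside a query file, here is a short, hypothetical example; the change-log entries, initials, tickets, and table names are all invented.

```sql
-- Change log (illustrative format; initials, tickets, and dates are invented):
--   2024-03-14  AB  DATA-2101  Include wholesale accounts in the churn base.
--   2024-02-02  AB  DATA-1987  Initial version.
WITH eligible_postpaid_subscriptions AS (
  -- Postpaid only: prepaid churn is defined and reported separately
  SELECT account_id, msisdn, activation_date, deactivation_date
  FROM subscriptions
  WHERE payment_type = 'POSTPAID'
),
churned_last_month AS (
  -- A line counts as churned when it was deactivated in the previous calendar month
  SELECT account_id, msisdn
  FROM eligible_postpaid_subscriptions
  WHERE DATE_TRUNC(deactivation_date, MONTH)
        = DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH)
)
SELECT COUNT(DISTINCT msisdn) AS churned_lines
FROM churned_last_month;
```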
This kind of operational maturity doesn’t happen overnight—it comes from working at scale. I’ve been in e-commerce, insurance, media, even fashion tech. Nothing has tested my systems thinking like telco. The volume and velocity of the data force you to think in terms of pipelines and contracts, not just dashboards and queries.
Even when the underlying data isn’t perfect, these practices build a surface of trust. That’s often enough to shift stakeholders from PowerPoint arguments to actual data-driven decisions.
Conclusion
Working with telecom data means learning to live with ambiguity—and still deliver. It’s not about finding perfect data; it’s about documenting what you can, tracking your decisions, and building trust through careful, layered thinking.
Across two of the largest telcos I’ve worked with, I’ve received the same feedback: your accuracy is exceptional, and you surface anomalies others miss. That’s not luck; it’s the result of going one layer deeper, every time, and knowing that the story is rarely on the surface.
You don’t need perfect data. But you do need:
- A clear sense of purpose
- A thoughtful approach to identity
- Metadata that builds trust
- Pipelines that are reproducible and versioned
If you do that, you’ll make a big impact—even in a noisy, shifting data landscape like telecom.