When Charu Aggarwal joined IBM Research nearly 25 years ago, he did so against his MIT PhD advisor’s advice. Aggarwal’s doctoral research emphasized mathematics and theory, and “my advisor wanted me to stay in that field,” he says. “But I wanted to work on practical problems.”
That decision ended up having an enormous impact on the development of data analysis, privacy and AI in the decades to follow. Aggarwal, now a Distinguished Research Staff Member (DRSM) at the IBM T.J. Watson Research Center in Yorktown Heights, NY, has published a number of groundbreaking studies, earning a great deal of respect from his peers in the process.
His most recent accolade is the 2021 IEEE W. Wallace McDowell Award, which recognizes his extensive body of work, particularly in the areas of high-dimensional data analysis, data stream research and data privacy techniques. The award, sometimes referred to as the “IT Nobel,” is named after W. Wallace McDowell, IBM’s director of engineering during the development of the pioneering IBM 701 electronic data processing machine in the early 1950s.
The virtues of playing with data
Data with a large number of dimensions, or features, defy conventional computational approaches. “Years ago, it was difficult to apply some of the classical algorithms like clustering or outlier detection on large amounts of data that had many features,” Aggarwal says. When he joined IBM in the mid-1990s, researchers were looking for an effective way to cluster groups of data that had similar features. “There is no unique way to define clustering, so the real problem became that we needed a mathematical formalism to redefine clustering in a way that made sense for high-dimensional data,” he adds.
When researchers tried to cluster IBM sales data using software available at the time, it wouldn’t work properly, Aggarwal says. In response, they introduced a family of methods built on carefully chosen projections of the data: select subsets of features on which analysis could be performed. Aggarwal and his colleagues referred to these methods as “projected data mining techniques,” which included projected clustering and projected outlier detection.
“That was the first time I’d experienced that you can come up with some really good ideas just by playing with real data,” he says. “You can then formalize those ideas in mathematical terms.” Over the next five years Aggarwal worked on a suite of papers related to high-dimensional analysis and data clustering.
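The core idea — that each cluster may be coherent only in its own subset of dimensions — can be illustrated with a toy sketch. This is an invented, simplified example (full-space k-means followed by a per-cluster lowest-variance feature pick), not Aggarwal’s published projected-clustering algorithms; the data and function names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical clusters in 6-D: each is tight in a different pair of
# dimensions and noisy in the remaining four (the data is invented).
a = np.c_[rng.normal(0.0, 0.05, (60, 2)), rng.normal(0.0, 2.0, (60, 4))]
b = np.c_[rng.normal(0.0, 2.0, (60, 4)), rng.normal(10.0, 0.05, (60, 2))]
X = np.vstack([a, b])

def projected_clusters(X, k, d, iters=20):
    """Toy projected clustering: k-means in full space, then report for
    each cluster the d axes along which its members vary least -- that
    cluster's 'projected' subspace."""
    # Farthest-point initialization: deterministic, spreads centers apart.
    centers = [X[0]]
    for _ in range(k - 1):
        dist2 = ((X[:, None, :] - np.array(centers)) ** 2).sum(-1).min(1)
        centers.append(X[np.argmax(dist2)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each point to its nearest center, then recompute centers.
        labels = ((X[:, None, :] - centers) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    # A cluster's projected subspace: its d lowest-variance dimensions.
    subspaces = [sorted(np.argsort(X[labels == j].var(0))[:d].tolist())
                 for j in range(k)]
    return labels, subspaces

labels, subspaces = projected_clusters(X, k=2, d=2)
print(sorted(subspaces))  # recovers each cluster's two coherent dimensions
```

On this toy data the sketch recovers dimensions {0, 1} for one cluster and {4, 5} for the other — exactly the kind of per-cluster feature subset that full-dimensional clustering cannot express.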
Summarizing data streams
Aggarwal’s next significant contributions to data science came in the area of data stream research. In the early 2000s, before big data had come of age, “I was working in [IBM Fellow] Nagui Halim’s department doing analytics work on data streams, and we foresaw that streaming would become very important,” Aggarwal says.
“The challenge for analyzing streamed data is that, whatever you do, you have to do it in real time and make certain decisions on the fly,” he says. That difficulty became the crux of Aggarwal’s work in data streams: how to quickly make intelligent decisions about streamed data.
“The key to analyzing a data stream is to construct the right type of synopsis in real time,” he says. “You have to create and maintain a compressed representation of what you’ve seen. You’re not keeping all the data, so you have to find the right types of math representations that can answer the questions you want to answer.”
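One of the simplest synopses in this spirit is a constant-space running summary — count, linear sum and squared sum — from which questions like “what is the mean and variance so far?” can be answered at any time without storing the stream. This is a minimal illustrative sketch, not the synopsis structures from Aggarwal’s stream-clustering papers; the class name is invented:

```python
class StreamSynopsis:
    """Constant-space summary of a numeric stream: keeps only the count,
    the running sum and the running sum of squares, yet can answer
    mean/variance queries at any point without retaining the data."""

    def __init__(self):
        self.n = 0     # number of values seen
        self.s = 0.0   # running sum of values
        self.ss = 0.0  # running sum of squared values

    def add(self, x):
        """Absorb one stream element in O(1) time and O(1) space."""
        self.n += 1
        self.s += x
        self.ss += x * x

    def mean(self):
        return self.s / self.n

    def variance(self):
        m = self.mean()
        return self.ss / self.n - m * m

# Feed a small stream and query the synopsis at any time.
syn = StreamSynopsis()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    syn.add(x)
print(syn.mean(), syn.variance())  # 5.0 4.0
```

Because the summary is additive, two synopses built on different stream segments can also be merged by summing their fields — the property that makes this style of compressed representation practical for real-time analysis.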
The work proved incredibly successful: Aggarwal’s most-cited paper (of the more than 200 he’s written) was a 2003 study about clustering data streams. Big data soon caught on, and all of a sudden everyone needed to make decisions about lots of real-time data streaming into their systems. “The earliest work in new areas is what gets the most attention when those areas take off,” he says, adding that his work on streaming data analysis continued for another decade, as he and his colleagues examined different types of applications and streamed data.
Data safety in numbers
Aggarwal would later make his mark on data privacy research, where he explored ideas about grouping together data records so patterns and trends would surface, but researchers wouldn’t be able to identify individual people whose data they used.
“We created clusters of similar individuals, which we call a condensation-based approach because we condensed a group of similar individuals,” he says. “We did not include any information that allowed you to distinguish individuals in the group.”
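The condensation idea can be sketched in a few lines: group similar records into groups of at least k members and release only per-group statistics, so no released row distinguishes an individual within its group. This is an invented toy illustration (sorting on one attribute as a crude similarity proxy), not the algorithm from the published 2004 paper; the records and names are hypothetical:

```python
import numpy as np

def condense(records, k):
    """Toy condensation: carve the records into groups of >= k similar
    individuals and release only each group's count, mean and standard
    deviation -- no individual row leaves the function."""
    X = np.asarray(records, dtype=float)
    order = np.argsort(X[:, 0])  # crude similarity: sort on attribute 0
    groups = [order[i:i + k] for i in range(0, len(X) - len(X) % k, k)]
    # Fold any leftover rows into the last group so every group has >= k.
    if len(X) % k:
        groups[-1] = np.concatenate([groups[-1], order[-(len(X) % k):]])
    return [{"n": len(g), "mean": X[g].mean(0), "std": X[g].std(0)}
            for g in groups]

# Hypothetical records: (age, income in $k) for six individuals.
people = [(23, 40), (25, 42), (24, 39), (51, 90), (53, 95), (49, 88)]
summary = condense(people, k=3)
for g in summary:
    print(g["n"], g["mean"].round(1))
```

The released summary preserves the aggregate pattern — a younger, lower-income group and an older, higher-income group — while any of the three members of a group could have contributed any row, which is the privacy property the condensation approach is after.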
The research was published in 2004 at the International Conference on Extending Database Technology (EDBT). A decade later, Aggarwal and fellow IBM researcher Philip Yu received the EDBT 2014 Test-of-Time Award in recognition of the work’s lasting impact. Aggarwal also won an IBM Outstanding Innovation Award for his scientific contributions to privacy technology.
Data privacy research is often criticized for not advancing as fast as the threats to people’s data. It’s a fair criticism, Aggarwal acknowledges, and there isn’t as much commercial application of privacy technology as there should be. “But it’s also true that most privacy techniques require data to be modified in some way, and that makes end users very nervous,” he adds.
Onward to AI
Aggarwal joined IBM straight out of his PhD program, after having interned with the company during the summer of 1995. Yu, his manager at the time and now a computer science professor at the University of Illinois Chicago, encouraged Aggarwal to use his mathematical theory background to tackle the company’s real-world challenges. “One good thing about working for a place like IBM is you get exposure to practical problems,” Aggarwal says.
During his time at IBM Research, Aggarwal has served as editor-in-chief of multiple ACM publications, been named a Master Inventor three times and received two IBM Outstanding Technical Achievement Awards, for his research on data streams and on high-dimensional data, respectively.
Today, Aggarwal’s research is more in the realm of core AI, including the study of graph neural networks, AutoAI and unsupervised learning. “In general, the problem with AI is that even though it can do some things extremely well—such as playing chess and identifying people and things in digital images—research is far from its goal of creating a general form of AI that can make decisions the way humans do,” he says.
To achieve that, researchers will have to make far more progress in unsupervised learning, which is how humans acquire most of their knowledge. “People store away information, whether or not they use it immediately,” he says. “When a task comes into play that can benefit from that information, your unsupervised learning helps you learn faster.”
It will be years before such general-purpose AI comes to fruition, but it’s a safe bet that when it does Aggarwal will still be with IBM Research, and will have played a crucial role in its development.