A newly leaked database shows China is developing a large language model (LLM) system to automatically detect and suppress politically sensitive content, dramatically expanding the country’s capacity for digital censorship, TechCrunch reports.
The tool appears to serve the Chinese government’s long-standing goals of controlling online narratives, using artificial intelligence to identify dissent far more efficiently than traditional methods.
Newsweek has contacted the Chinese Embassy in Washington, D.C., for comment via email.
Why It Matters
The scale and sophistication of the dataset show how authoritarian regimes are beginning to deploy AI to tighten their grip on online discourse. While China has long censored information through keyword filters and human oversight, the new model leverages the capabilities of generative AI to detect more nuanced or coded expressions of dissent.
What To Know
The leaked dataset, found by independent researcher NetAskari and shared with TechCrunch, was stored on an unsecured Elasticsearch server hosted by Baidu, the outlet reported. It included recent entries, some dated as late as December, indicating the model was under active development.
The LLM’s training data included more than 133,000 examples of “sensitive” content, spanning topics such as corruption, rural poverty, military operations, labor unrest and Taiwanese politics.
The model is designed to flag content categorized as “highest priority,” including anything related to military affairs, Taiwan or political criticism.
Even subtle language was marked for suppression, such as the Chinese idiom "When the tree falls, the monkeys scatter," which is used to imply regime instability.
This is not the first time Chinese AI development has faced allegations of censorship. When tested by Newsweek, the newly launched Chinese chatbot DeepSeek was unable to discuss the 1989 Tiananmen Square massacre.
The AI instead responded: “Sorry, that’s beyond my current scope. Let’s talk about something else.” However, when asked about the January 6 Capitol riot in the United States, the bot delivered a detailed timeline and context.
DeepSeek also refused to offer criticisms of Chinese President Xi Jinping but readily listed critiques of U.S. political figures, reinforcing concerns that Chinese-origin AI tools are calibrated to echo state narratives while withholding or distorting politically inconvenient information.
What People Are Saying
Sam Altman, the CEO of OpenAI, wrote in a Washington Post op-ed in July: “We face a strategic choice about what kind of world we are going to live in: Will it be one in which the United States and allied nations advance a global AI that spreads the technology’s benefits and opens access to it, or an authoritarian one, in which nations or movements that don’t share our values use AI to cement and expand their power?”
What Happens Next
China has not publicly confirmed the origins or purpose of the dataset, though the Chinese Embassy in Washington, D.C., told TechCrunch that it opposed “groundless attacks and slanders against China” and emphasized its commitment to creating ethical AI.