Researchers Find a New Algorithm That Trains AI to Avoid Bad Behaviors

Artificial intelligence has moved into the commercial mainstream thanks to the growing prowess of machine learning algorithms that enable computers to train themselves to do things like drive cars, control robots or automate decision-making.

But as AI starts handling sensitive tasks, such as helping pick which prisoners get bail, policy makers are insisting that computer scientists offer assurances that automated systems have been designed to minimize, if not completely avoid, unwanted outcomes such as excessive risk or racial and gender bias.

Machine learning algorithms are being used in an ever-increasing number of applications, and many of these applications affect quality of life. Yet such algorithms often exhibit undesirable behavior, from various types of bias to causing financial loss or delaying medical diagnoses. In standard machine learning approaches, the burden of avoiding this harmful behavior is placed on the user of the algorithm, who most often is not a computer scientist. Thomas et al. introduce a general framework for algorithm design in which this burden is shifted from the user to the designer of the algorithm. The researchers illustrate the benefits of their approach using examples in gender fairness and diabetes management.

A team led by researchers at Stanford and the University of Massachusetts Amherst published a paper Nov. 22 in Science suggesting how to provide such assurances. The paper outlines a new technique that translates a fuzzy goal, such as avoiding gender bias, into the precise mathematical criteria that would allow a machine-learning algorithm to train an AI application to avoid that behavior.

“We want to advance AI that respects the values of its human users and justifies the trust we place in autonomous systems,” said Emma Brunskill, an assistant professor of computer science at Stanford and senior author of the paper.

Abstract

Intelligent machines using machine learning algorithms are ubiquitous, ranging from simple data analysis and pattern recognition tools to complex systems that achieve superhuman performance on various tasks. Ensuring that they do not exhibit undesirable behavior—that they do not, for example, cause harm to humans—is therefore a pressing problem. We propose a general and flexible framework for designing machine learning algorithms. This framework simplifies the problem of specifying and regulating undesirable behavior. To show the viability of this framework, we used it to create machine learning algorithms that precluded the dangerous behavior caused by standard machine learning algorithms in our experiments. Our framework for designing machine learning algorithms simplifies the safe and responsible application of machine learning.

Avoiding misbehavior

The work is premised on the notion that if “unsafe” or “unfair” outcomes or behaviors can be defined mathematically, then it should be possible to create algorithms that learn from data how to avoid these unwanted results with high confidence. The researchers also wanted to develop a set of techniques that would make it easy for users to specify which kinds of unwanted behavior they want to constrain, and that would enable machine learning designers to predict with confidence that a system trained on past data can be relied upon when it is applied in real-world circumstances.
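As a rough illustration of what "defined mathematically" can mean here, a behavioral constraint of this kind is commonly written as a probabilistic guarantee. The symbols below do not appear in this article and are assumptions chosen for the sketch: D is the training data, a is the learning algorithm, each g_i scores one unwanted behavior (a value at or below zero means the behavior is acceptable), and each δ_i is the failure probability the user is willing to tolerate.

```latex
\Pr\bigl( g_i(a(D)) \le 0 \bigr) \ge 1 - \delta_i, \qquad i = 1, \dots, n
```

Read this way, the user supplies the definitions of g_i and δ_i, and the algorithm designer is responsible for making the inequality hold.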

“We show how the designers of machine learning algorithms can make it easier for people who want to build AI into their products and services to describe unwanted outcomes or behaviors that the AI system will avoid with high probability,” said Philip Thomas, an assistant professor of computer science at the University of Massachusetts Amherst and first author of the paper.

Fairness and safety

The researchers tested their approach by trying to improve the fairness of algorithms that predict the GPAs of college students based on exam results, a common practice that can result in gender bias. Using an experimental dataset, they gave their algorithm mathematical instructions to avoid developing a predictive method that systematically overestimated or underestimated GPAs for one gender. With these instructions, the algorithm identified a way to predict student GPAs with much less systematic gender bias than existing methods. Prior methods struggled in this regard either because they had no fairness filter built in or because the algorithms developed to achieve fairness were too limited in scope.
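The fairness instruction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the tolerance `epsilon`, the failure probability `delta`, and the assumption that prediction errors lie in a range of width 8 (GPAs between 0 and 4) are all hypothetical choices made for the example, and the Hoeffding bound used here is a deliberately simple, conservative stand-in for whatever confidence bound a real system would use.

```python
import math


def hoeffding_interval(samples, value_range, delta):
    """Two-sided interval that contains the true mean with probability >= 1 - delta.

    Assumes every sample lies in an interval of width `value_range`.
    """
    n = len(samples)
    mean = sum(samples) / n
    half_width = value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return mean - half_width, mean + half_width


def bias_upper_bound(errors_a, errors_b, value_range, delta):
    """High-confidence upper bound on |mean error of group A - mean error of group B|.

    The failure probability delta is split evenly between the two groups.
    """
    lo_a, hi_a = hoeffding_interval(errors_a, value_range, delta / 2.0)
    lo_b, hi_b = hoeffding_interval(errors_b, value_range, delta / 2.0)
    return max(hi_a - lo_b, hi_b - lo_a)


def passes_fairness_test(predicted, actual, group,
                         epsilon=0.05, delta=0.05, value_range=8.0):
    """Return True only if the gap in mean over/underestimation between the
    two groups (labelled 0 and 1) is at most epsilon with high confidence."""
    errors_a = [p - y for p, y, g in zip(predicted, actual, group) if g == 0]
    errors_b = [p - y for p, y, g in zip(predicted, actual, group) if g == 1]
    return bias_upper_bound(errors_a, errors_b, value_range, delta) <= epsilon
```

A predictor that overestimates one group and underestimates the other fails this test, while an unbiased one passes only once there is enough data to shrink the confidence interval below the chosen tolerance. That second property is the point of the sketch: the check refuses to certify fairness it cannot support with the data at hand.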

The group developed another algorithm and used it to balance safety and performance in an automated insulin pump. Such pumps must decide how large a dose of insulin to give a patient at mealtimes. Ideally, the pump delivers just enough insulin to keep blood sugar levels steady. Too little insulin allows blood sugar levels to rise, leading to short-term discomfort such as nausea and an elevated risk of long-term complications, including cardiovascular disease. Too much insulin causes blood sugar to crash, a potentially deadly outcome.

Machine learning can help by identifying subtle patterns in an individual’s blood sugar responses to doses, but existing methods don’t make it easy for doctors to specify outcomes that automated dosing algorithms should avoid, such as low blood sugar crashes. Using a blood glucose simulator, Brunskill and Thomas showed how pumps could be trained to identify dosing tailored to that person, avoiding complications from over- or under-dosing. Though the group isn’t ready to test this algorithm on real people, it points to an AI approach that might eventually improve quality of life for people with diabetes.
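To make the idea of "specifying an outcome to avoid" concrete, here is a small Python sketch of a safety check for a candidate dosing policy. It is an assumption-laden illustration rather than the method from the paper: the function names, the `max_rate` threshold, and the default `delta` are hypothetical, and the input is assumed to be a list of simulated episodes marked 1 when a low blood sugar event occurred and 0 otherwise.

```python
import math


def hypoglycemia_rate_upper_bound(events, delta):
    """One-sided Hoeffding bound: the true event rate is at most the returned
    value with probability >= 1 - delta, given independent 0/1 outcomes."""
    n = len(events)
    observed_rate = sum(events) / n
    return observed_rate + math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def certify_dosing_policy(events, max_rate, delta=0.05):
    """Return the certified upper bound on the event rate, or None when the
    data cannot support the safety claim (no solution is returned)."""
    bound = hypoglycemia_rate_upper_bound(events, delta)
    return bound if bound <= max_rate else None
```

Returning None instead of an uncertified policy mirrors the goal described in the text: the system declines to act when it cannot show, with high confidence, that the unwanted outcome stays below the limit a doctor has set.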

In their Science paper, Brunskill and Thomas use the term “Seldonian algorithm” to define their approach, a reference to Hari Seldon, a character invented by science fiction author Isaac Asimov. Asimov also proclaimed three laws of robotics, beginning with the injunction that “A robot may not injure a human being or, through inaction, allow a human being to come to harm.”

While acknowledging that the field is still far from guaranteeing the three laws, Thomas said this Seldonian framework will make it easier for machine learning designers to build behavior-avoidance instructions into all sorts of algorithms, in a way that can enable them to assess the probability that trained systems will function properly in the real world.

Brunskill said this proposed framework builds on the efforts that many computer scientists are making to strike a balance between creating powerful algorithms and developing methods to ensure their trustworthiness.

“Thinking about how we can create algorithms that best respect values like safety and fairness is essential as society increasingly relies on AI,” Brunskill said.
