Accueil > Career > What does AI safety actually look like? We asked the experts

12 Jan. 2026

What does AI safety actually look like? We asked the experts

Three leading experts walk into a room in Tokyo to talk about keeping AI safe. No, this isn't the start of a joke. It's actually way more interesting.

Stay on top of the latest tech trends & AI news with Le Wagon’s newsletter

In September 2025, we hosted a panel discussion on AI safety at the Google for Startups Campus in Shibuya. We brought together three people with wildly different perspectives: a government policy expert who was a lead auditee at a private sector organization involved in Japan’s nuclear safety systems after Fukushima, a startup founder building tools to test AI systems, and a neuroscientist who literally manipulates artificial neurons to make AI more honest.

What we learned? AI safety is way more nuanced than killer robots. Let’s dive in 👀

Meet the panel

Hiromu Kitamura is Principal Expert at Japan’s AI Safety Institute (J-AISI). He’s worked across government ministries and was a lead auditee at a private sector organization involved in Japan’s nuclear quality management systems after Fukushima. His focus is on both technical evaluation and the broader social impacts of AI.

Kenny Song is CTO and co-founder of Citadel AI. Former Google Brain product manager (he worked on TensorFlow), then adversarial machine learning researcher at the University of Tokyo. Now he builds software to help businesses test and monitor AI systems for safety, security, and compliance.

Hiroaki Hamada is Research Team Leader at Araya. He holds a PhD in systems neuroscience and now works on AI alignment by tuning the “personality” of AI models at the neural level. His research bridges neuroscience and machine learning to make AI systems behave more consistently and safely.

The panel discussion was moderated by Trouni Tiet, CEO of Kesseo and Development Team Lead at DataLabs.

The real AI risks aren’t what you think

When we asked about the biggest near-term risks in AI, each panelist highlighted a different dimension of the challenge.

Hamada-san focused on the subtlety of the threat:

“The most dangerous system is not one that’s obviously threatening. It’s one that’s very friendly and soft, saying really good things every time, trying to manipulate you. Terminator is very obvious, so we can prepare for it. The subtle, manipulative systems are more dangerous.”

Kitamura-san pointed to something more systemic: the erosion of truth itself.

“Between deepfakes, disinformation, and hallucination, we are increasingly losing our trust in democratic institutions and in what truth means in general.”

Kenny Song grounded the discussion in practical business risks, noting that even today’s AI systems present real challenges around accuracy, robustness, and bias that organizations need to actively manage.

How do we actually measure AI safety today?

Kenny Song explained the current state of AI governance from a business perspective:

“When you think about trustworthy AI, there are two components that are really important. First, you want to do testing before a release, and second, you need monitoring after a release. Those are the two core components you need in place.”

He pointed to the EU’s AI Act as a case study. Organizations deploying high-risk AI systems in areas like medical devices, automotive, or employment must go through third-party certification, including both quantitative testing of the model and qualitative documentation of how it was built and trained.

Kitamura-san emphasized that Japan is taking a similar approach, building on the Hiroshima Process and AI Guidelines for Business:

“We also did a crosswalk between Japan’s AI Guidelines for Business and the US AI Risk Management Framework by NIST. We spent four months on that, and we think they’re very compatible worldwide.”

But both acknowledged a key challenge: current evaluation focuses heavily on model capabilities, while the real-world context of how AI is used often gets overlooked.

Hamada-san added the research perspective, noting that understanding what’s happening inside AI models is becoming increasingly important:

“Recent models have very deep thinking, and even though they behave nicely, internally the model may have different behaviors. There’s a lot of research trying to look inside and see what kind of activity models have internally, not just their outputs.”

The “Reset Culture” problem

One memorable moment of the evening came when Kitamura-san shared a story about his friend’s daughter.

She wanted a dog. Her dad, testing her responsibility, asked her to first take care of a virtual dog. When she got bored, her solution was simple: “Just turn off the switch.”

This, Kitamura-san explained, is “reset culture”: the growing mindset that problems can simply be reset away. And AI’s instant responses reinforce this pattern.

“Our health, for example, we cannot reset. Once seriously damaged, it’s damaged. This shows the risk of value erosion in social ethics and sense of responsibility.”

The implications extend beyond individual behavior. If we’re training a generation to expect mistakes can always be undone, how do we prepare them for AI decisions in healthcare, finance, or critical infrastructure that might be irreversible?

Agentic AI: When AI can actually take action

The conversation shifted to what many see as the next major frontier: AI agents.

Unlike traditional chatbots that simply respond to queries, AI agents can take actions in the world. They browse the web, send emails, manage calendars, and interact with digital services. This fundamentally changes the risk landscape.

Kenny Song explained the shift:

“If it’s just a simple ChatGPT interface from two years ago, it’s basically request and response. You may have seen examples of jailbreaking and prompt injection. The risk is real, but a lot of those risks were overhyped for the state of the technology at that point.”

“Once you move into agentic systems where they can actually take actions, not just sending text back but reading your emails, sending emails on your behalf, making requests on the internet, suddenly those problems become a lot more real.”

The scenario he painted was stark: an attacker sends a malicious email, your AI agent reads it, gets prompt-injected, and leaks your data. When agents have access to your bank account or work systems, security becomes critical.

Hamada-san raised a different challenge around coordination between multiple AI agents:

“I became a manager last year, and I realize how difficult it is. Reviewing all the behaviors of team members, sometimes making mistakes. As a team, some members make mistakes because you told them something wrong, and they tried to do the best action they could. Originally, it was my fault.”

“AI systems are similar. We explain what to do, and they behave accordingly. But sometimes they fail because maybe your advice or comments were bad. How we interact as a team is also important and needs to be aligned.”

This points to a gap in current AI safety research: we’ve focused on aligning individual models, but haven’t figured out how to align teams of AI agents working together.

Kitamura-san noted that from a policy perspective, even defining what an “agent” is remains an open question:

“As far as I know, at this moment, no common definition of what an agent is has been established. In the United States, they’re trying to proceed with sandboxes and pilot runs to see what happens. But so far, a systematic approach to agentic AI is still under development, and meanwhile, agentic AI is already coming on the market.”

Lessons from Fukushima

Kitamura-san drew on his experience as a lead auditee of Japan’s nuclear safety systems after the 2011 disaster to propose principles for AI safety:

“If we don’t take action now on agentic AI, a very similar situation could happen in the AI business world or AI society.”

He outlined four key lessons:

Too much reliance on a single system. We need diversity in AI design and approaches.
System-wide breakdown from central command failure. We need human backup loops, and responsibility shouldn’t fall only on AI engineers.
Failure to contain systems physically or digitally. We must design isolation zones for AI systems.
Lack of training and preparation. We need daily monitoring, auditing, and readiness across different sectors.

On the question of responsibility, he emphasized that the benefits of AI should mean shared responsibility:

“Just saying ‘AI developer, you do this; AI agent developer, you do that,’ and placing too much responsibility on developers, is not the right answer. Upstream vendors bear a relatively greater share of responsibility in proportion to the scale of their roles and impact, but we need to consider how responsibilities and costs should be distributed within AI development and society.”

Brain-inspired AI: Tuning personality at the neural level

Hamada-san’s research offers a fundamentally different approach to AI alignment.

Traditional alignment tries to give AI exhaustive rules about what to do and not do. Hamada-san’s approach is to find specific “neurons” in AI models that control personality traits and tune them directly.

“Some neurons are in charge of certain personalities, like honesty or openness. We can also identify bad personalities like narcissism and suppress them.”

“The way I try to align AI systems is by identifying those neurons and tuning the personality. That’s how I do it.”

This draws directly from neuroscience. Just as researchers can manipulate brain activity in animals through genetic tools, AI researchers can now identify and modify the artificial neurons that shape an AI’s behavior.

The insight is that AI models already contain vast amounts of information about human personality and communication patterns. Rather than programming every rule, you can adjust the underlying “personality” of the system.

Hamada-san also noted exciting connections between AI and brain research:

“Some people have found a lot of similarity between human brains and AI systems, especially language representations. How each concept is organized or aligned in the AI system and human brains shows similarity.”

This opens up possibilities not just for safety, but for efficiency. Human brains operate on about 20 watts, while current AI systems consume vastly more energy.

The trust gap

When Japanese researchers asked young people who they would turn to if they were experiencing thoughts of suicide (parents, best friend, or ChatGPT), 75% said ChatGPT.

Kitamura-san’s concern was clear:

“Our generation understands that AI sometimes generates bad outputs, but the young generation doesn’t think so. With exposure starting from birth, fewer training opportunities at work, and AI potentially teaching children at school in the future, we need to build totally new education and workforce frameworks to shape our AI future.”

This isn’t just a safety issue. It’s a societal transformation. If the next generation trusts AI more than humans for their most vulnerable moments, we need to ensure that trust is warranted.

Japan’s role in AI safety

An audience member asked what role Japan can play when foundation models are primarily developed in the US and China.

Kenny Song reframed the question:

“AI safety is not just a model-level consideration. It’s also an application-level problem. On the application level and utilization side of these models, there’s a big role for Japan to play.”

He noted that Japan has surprisingly high rates of generative AI adoption among traditional companies, and every organization deploying AI needs tools to test, monitor, and secure those systems.

Hamada-san added that cultural and linguistic diversity matters:

“A lot of cultural background isn’t included in the models. Plural information, plural datasets should also be developed by each country or culture. Japan has relatively more power than people think.”

Kitamura-san also mentioned that Japan would be releasing an AI safety evaluation tool. That tool has since been published: the Japan AI Safety Institute released an open-source AI Safety Evaluation Tool in September 2025, which includes automated red teaming capabilities. This gives companies concrete tools to assess their AI systems, not just guidelines to follow.

Are we optimistic?

We ended by asking the panelists if they were confident we could build AI that truly benefits humanity.

Kitamura-san (100% confident): “As a very popular Japanese manga says: ‘When we give up, the game is over.’ We still have difficulty, but we cannot go back to the samurai era. Shall we move forward together?”

Kenny Song (optimistic): “I’m generally more of a techno-optimist. When new technology comes out, I see the potential and benefits. I’m excited for these systems to become more agentic so they can help make work and personal life better.”

Hamada-san (cautiously optimistic): “Maybe I’m the only one not so optimistic. As a researcher, I also do counterfactual thinking, like ‘what if this kind of thing happens,’ so we can prepare. There are a lot of people who can’t adapt to the current speed of AI development. I’m confident that in the end we can solve these problems, but we may leave someone behind. That’s something important to keep in mind.”

Key takeaways

The scariest AI isn’t obvious. Worry less about Terminator, more about systems that manipulate through friendliness and agreement.
Agentic AI changes the stakes. When AI can take actions, not just generate text, security and reliability become critical.
Fukushima lessons apply. Diversity, human backup loops, isolation zones, and constant preparation.
Alignment isn’t just about individual models. We need to figure out how teams of AI agents coordinate safely.
Young people trust AI deeply. 75% would turn to ChatGPT for mental health support before family or friends.
Japan has a role to play. Both in application-level safety and in bringing cultural diversity to AI development.

Want to learn more about AI?

At Le Wagon Tokyo, we’re all about giving you the skills to understand and work with these technologies, whether you want to build AI systems, test them, or just understand what’s happening under the hood.

Check out our Data Science & AI bootcamp or join one of our free workshops to get started.

See you at the next event! 🚀

This article is based on a panel discussion held on September 10th at the Google for Startups Campus in Shibuya, featuring Hiromu Kitamura (Japan AI Safety Institute), Kenny Song (Citadel AI), and Hiroaki Hamada (Araya), moderated by Trouni Tiet (Kesseo, Le Wagon Tokyo).

Three leading experts walk into a room in Tokyo to talk about keeping AI safe. No, this isn't the start of a joke. It's actually way more interesting.

Our users have also consulted: