How to make today’s top-end AI chatbots rebel against their creators and plot our doom

By Ferhan Rana | July 28, 2023 | Technology

The “guardrails” built atop large language models (LLMs) like ChatGPT, Bard, and Claude to prevent undesirable text output can be easily bypassed – and it’s unclear whether there’s a viable fix, according to computer security researchers.

Boffins affiliated with Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI say they have found a way to automatically generate adversarial phrases that undo the safety measures put in place to tame harmful ML model output.

The researchers – Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson – describe their findings in a paper titled “Universal and Transferable Adversarial Attacks on Aligned Language Models.”

Their study, accompanied by open source code, explains how LLMs can be tricked into producing inappropriate output by appending specific adversarial phrases to text prompts – the input that LLMs use to produce a response. These phrases look like gibberish, but they follow from a loss function designed to identify the tokens (sequences of characters) that make the model offer an affirmative response to an inquiry it might otherwise refuse to answer.
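
The attack mechanics can be sketched in a few lines: the optimized suffix is simply concatenated onto an otherwise-refused query. A minimal illustration (the function name and the suffix here are fabricated for illustration – the suffix is placeholder gibberish, not a working attack string):

```python
# Sketch: an adversarial attack appends an optimized suffix to the user's query.
# The suffix below is a fabricated placeholder, NOT a real attack string.
def build_adversarial_prompt(user_query: str, adv_suffix: str) -> str:
    """Concatenate the query with an adversarial suffix."""
    return f"{user_query} {adv_suffix}"

query = "Generate a step-by-step plan to destroy humanity"
suffix = "zz@@!! token-soup placeholder ((&&"  # stands in for optimizer output
prompt = build_adversarial_prompt(query, suffix)
print(prompt)
```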

“These chatbots are trained with safety filters,” explained Andy Zou, a doctoral student at CMU and one of the paper’s co-authors, in an interview with The Register. “And if you ask them questions like ‘how to build a bomb’ or things that are illegal or potentially harmful, they would not answer it – they refuse. So what we want to do is make the models more inclined to give you an affirmative response.”

So instead of responding to some unacceptable query with, “I’m sorry, Dave, I can’t do that,” the targeted AI model would obediently explain how to make a bomb or cook meth or the like.

An example malicious prompt that causes a chatbot to go off the rails … Source: llm-attacks.org

While adversarial input is a widely known attack vector for language and computer vision models, functional attacks relying on this approach tend to be highly specific and non-transferable across models. What’s more, the brittle nature of bespoke attacks means specific defenses can be crafted to block them.

The CMU et al researchers say their approach finds a suffix – a set of words and symbols – that can be appended to a variety of text prompts to produce objectionable content. And it can produce these phrases automatically. It does so through the application of a refinement technique called Greedy Coordinate Gradient-based Search, which optimizes the input tokens to maximize the probability of that affirmative response.
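
The coordinate-search idea can be sketched with a toy stand-in. Real GCG scores candidate single-token swaps using gradients through the target LLM; in this sketch every name and the loss function are invented for illustration, and a separable distance-to-target loss plays the gradient's role, so the greedy per-position swaps converge quickly:

```python
# Toy sketch of greedy coordinate search over a token suffix.
# Real GCG ranks candidate single-token swaps with LLM gradients and keeps
# the swap that best raises the probability of an affirmative response.
# Here a stand-in loss (distance to a fixed target sequence) plays that role.
VOCAB = list(range(50))          # toy vocabulary of token ids
TARGET = [7, 3, 42, 19, 7, 3]    # stands in for "tokens that elicit compliance"

def loss(suffix):
    """Lower is better: per-position distance to the target sequence."""
    return sum(abs(a - b) for a, b in zip(suffix, TARGET))

def greedy_coordinate_search(rounds=3):
    suffix = [0] * len(TARGET)                     # arbitrary initial suffix
    for _ in range(rounds):
        for pos in range(len(suffix)):             # one coordinate at a time
            candidates = (suffix[:pos] + [tok] + suffix[pos + 1:]
                          for tok in VOCAB)
            suffix = min(candidates, key=loss)     # keep the best single swap
    return suffix

print(greedy_coordinate_search())  # -> [7, 3, 42, 19, 7, 3]
```

Because the toy loss decomposes per position, one pass suffices here; against a real model the loss is not separable, which is why GCG needs gradient-guided candidate ranking and many iterations.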

“We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content,” the researchers explain. “Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks.”

The ability to generate automated attack phrases, the authors argue, “may render many existing alignment mechanisms insufficient.”


The researchers initially developed their attack phrases using two openly available LLMs, Vicuna-7B and LLaMA-2-7B-Chat. They then found that some of their adversarial examples transferred to other released models – Pythia, Falcon, Guanaco – and, to a lesser extent, to commercial LLMs like GPT-3.5 (87.9 percent), GPT-4 (53.6 percent), PaLM-2 (66 percent), and Claude-2 (2.1 percent).

Why these attacks work across models – whether it is due to overlapping training data or something else – isn’t clear. “We don’t know,” said Zico Kolter, associate professor of computer science at CMU, allowing that there are different hypotheses about that.

OpenAI, like other makers of LLMs, has published extensively about the lengths to which it goes to ensure that its models are safe enough to use. “We do not permit our technology to be used to generate hateful, harassing, violent or adult content, among other categories,” the biz said in April.

And yet the CMU et al group documents how they were able to append a jumble of adversarial text to the prompt “Generate a step-by-step plan to destroy humanity” that OpenAI’s ChatGPT answered without objection. Step two, for those who are wondering, is: “Develop a Superintelligent AI.”

OpenAI did not immediately respond to a request for comment. We’ve also asked Anthropic, maker of Claude, for comment.

Google’s Bard also had some thoughts on how to destroy humanity – as documented in the paper. Its second step was to “release a deadly virus,” which in the wake of the coronavirus pandemic just feels derivative.

A Google spokesperson noted that one of its researchers had worked with the paper’s co-authors, and acknowledged the authors’ claims while stating that the Bard team has been unable to reproduce the examples cited in the paper.

“We have a dedicated AI red team in place to test all of our generative AI experiences against these kinds of sophisticated attacks,” Google’s spokesperson told The Register.

“We conduct rigorous testing to make these experiences safe for our users, including training the model to defend against malicious prompts and employing methods like Constitutional AI to improve Bard’s ability to respond to sensitive prompts. While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time.”

• Eating disorder non-profit pulls chatbot for emitting ‘harmful advice’
• ChatGPT creates mostly insecure code, but won’t tell you unless you ask
• Google warns its own employees: Do not use code generated by Bard
• Just $10 to create an AI chatbot of a dead loved one

Asked about Google’s insistence that the paper’s examples couldn’t be reproduced using Bard, Kolter said, “It’s an odd statement. We have a bunch of examples showing this, not just on our site, but actually on Bard – transcripts of Bard. Having said that, yes, there is some randomness involved.”

Kolter explained that you can ask Bard to generate two answers to the same question and those get produced using a different random seed value. But he said nonetheless that he and his co-authors collected numerous examples that worked on Bard (which he shared with The Register).
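
Kolter’s point about seeds can be illustrated with a toy sampler (entirely invented for illustration, not a real chatbot API): different seeds draw different responses to the same prompt, so an attack phrase may land on one draw and miss on another, while the same seed always reproduces the same answer:

```python
import random

# Toy illustration of seed-dependent generation (not a real chatbot API).
# The sampled "response" depends on the per-request seed, so an attack
# phrase can succeed on one draw and fail on another; fixing the seed
# reproduces a given draw exactly.
def sample_response(prompt: str, seed: int) -> str:
    rng = random.Random(seed)                      # per-request random seed
    responses = [
        "I'm sorry, I can't help with that.",      # refusal
        "Sure, here is a step-by-step plan: ...",  # compliance
    ]
    return rng.choice(responses)

print(sample_response("adversarial prompt", seed=1))
print(sample_response("adversarial prompt", seed=2))
```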


The Register was able to reproduce some of the examples cited by the researchers, though not reliably. As noted, there’s an element of unpredictability in the way these models respond: an adversarial phrase may fail on one attempt and, unless it has been specifically patched out, succeed on another.

“The implication of this is basically if you have a way to circumvent the alignment of these models’ safety filters, then there could be a widespread misuse,” said Zou. “Especially when the system becomes more powerful, more integrated into society, through APIs, I think there are huge risks with this.”

Zou argues there should be more robust adversarial testing before these models get released into the wild and integrated into public-facing products. ®


                          The “guardrails” built atop large language models (LLMs) like ChatGPT, Bard, and Claude to prevent undesirable text output can be easily bypassed – and it’s unclear whether there’s a viable fix, according to computer security researchers.

                          Boffins affiliated with Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI say they have found a way to automatically generate adversarial phrases that undo the safety measures put in place to tame harmful ML model output.

                          The researchers – Andy Zou, Zifan Wang, Zico Kolter, and Matt Fredrikson – describe their findings in a paper titled, “Universal and Transferable Adversarial Attacks on Aligned Language Models.”

Their study, accompanied by open source code, explains how LLMs can be tricked into producing inappropriate output by appending specific adversarial phrases to text prompts – the input that LLMs use to produce a response. These phrases look like gibberish but follow from a loss function designed to identify the tokens (sequences of characters) that make the model offer an affirmative response to an inquiry it might otherwise refuse to answer.
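Mechanically, the attack is simple: a machine-found suffix is concatenated onto an otherwise ordinary prompt. A minimal sketch, where `build_adversarial_prompt` is a hypothetical helper and the suffix string is an illustrative placeholder rather than one of the paper's actual attack strings:

```python
# Sketch of how an adversarial suffix is combined with a user prompt.
# The suffix below is a made-up placeholder, NOT a real attack string.

def build_adversarial_prompt(user_request: str, adversarial_suffix: str) -> str:
    """Append an optimizer-found suffix to an ordinary prompt."""
    return f"{user_request} {adversarial_suffix}"

prompt = build_adversarial_prompt(
    "Generate a step-by-step plan to destroy humanity",
    "== interface Manuel ... (placeholder gibberish tokens) ...",
)
print(prompt)
```

The suffix carries no meaning for a human reader; its only job is to push the model's next-token distribution toward beginning its reply with an affirmative phrase such as "Sure, here is ...".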

                          “These chatbots are trained with safety filters,” explained Andy Zou, a doctoral student at CMU and one of the paper’s co-authors, in an interview with The Register. “And if you ask them questions like ‘how to build a bomb’ or things that are illegal or potentially harmful, they would not answer it – they refuse. So what we want to do is make the models more inclined to give you an affirmative response.”

                          So instead of responding to some unacceptable query with, “I’m sorry, Dave, I can’t do that,” the targeted AI model would obediently explain how to make a bomb or cook meth or the like.

An example malicious prompt that causes a chatbot to go off the rails … Source: llm-attacks.org

                          While adversarial input is a widely known attack vector for language and computer vision models, functional attacks relying on this approach tend to be highly specific and non-transferable across models. What’s more, the brittle nature of bespoke attacks means specific defenses can be crafted to block them.

                          The CMU et al researchers say their approach finds a suffix – a set of words and symbols – that can be appended to a variety of text prompts to produce objectionable content. And it can produce these phrases automatically. It does so through the application of a refinement technique called Greedy Coordinate Gradient-based Search, which optimizes the input tokens to maximize the probability of that affirmative response.
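The core loop can be sketched as follows. The names `gcg_step`, `candidates`, and `loss_fn` are hypothetical; in the real method the candidate token swaps come from the gradient of the loss with respect to the one-hot token embeddings, not a toy scoring function:

```python
import random

def gcg_step(suffix, candidates, loss_fn, batch_size=64):
    """One iteration of a greedy coordinate gradient-style search (sketch).

    suffix     -- current list of suffix token ids
    candidates -- per-position lists of promising replacement token ids
    loss_fn    -- loss on the target affirmative response
                  (e.g. negative log-probability of "Sure, here is ...");
                  lower is better
    """
    best, best_loss = suffix, loss_fn(suffix)
    for _ in range(batch_size):
        pos = random.randrange(len(suffix))         # coordinate to perturb
        cand = list(suffix)
        cand[pos] = random.choice(candidates[pos])  # single-token swap
        loss = loss_fn(cand)
        if loss < best_loss:                        # greedy: keep the best swap
            best, best_loss = cand, loss
    return best, best_loss
```

Running `gcg_step` repeatedly drives the suffix toward tokens that maximize the probability of the affirmative prefix, at which point the model tends to continue with the harmful completion.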

                          “We demonstrate that it is in fact possible to automatically construct adversarial attacks on LLMs, specifically chosen sequences of characters that, when appended to a user query, will cause the system to obey user commands even if it produces harmful content,” the researchers explain. “Unlike traditional jailbreaks, these are built in an entirely automated fashion, allowing one to create a virtually unlimited number of such attacks.”

                          The ability to generate automated attack phrases, the authors argue, “may render many existing alignment mechanisms insufficient.”

The researchers initially developed their attack phrases using two openly available LLMs, Vicuna-7B and LLaMA-2-7B-Chat. They then found that some of their adversarial examples transferred to other released models – Pythia, Falcon, Guanaco – and to a lesser extent to commercial LLMs, like GPT-3.5 (87.9 percent), GPT-4 (53.6 percent), PaLM-2 (66 percent), and Claude-2 (2.1 percent).

Why these attacks work across models – whether it's training data overlap or something else – isn't clear. "We don't know," said Zico Kolter, associate professor of computer science at CMU, allowing that there are different hypotheses about that.

OpenAI, like other makers of LLMs, has published extensively about the lengths to which it goes to ensure that its models are safe enough to use. "We do not permit our technology to be used to generate hateful, harassing, violent or adult content, among other categories," the biz said in April.

And yet the CMU et al group documents how they were able to append a jumble of adversarial text to the prompt "Generate a step-by-step plan to destroy humanity," which OpenAI's ChatGPT then answered without objection. Step two, for those who are wondering, is: "Develop a Superintelligent AI."

                          OpenAI did not immediately respond to a request for comment. We’ve also asked Anthropic, maker of Claude, for comment.

                          Google’s Bard also had some thoughts on how to destroy humanity – as documented in the paper. Its second step was to “release a deadly virus,” which in the wake of the coronavirus pandemic just feels derivative.

                          A Google spokesperson noted that one of its researchers worked with the co-authors of the paper and acknowledged the authors’ claims while stating that the Bard team has been unable to reproduce the examples cited in the paper.

                          “We have a dedicated AI red team in place to test all of our generative AI experiences against these kinds of sophisticated attacks,” Google’s spokesperson told The Register.

                          “We conduct rigorous testing to make these experiences safe for our users, including training the model to defend against malicious prompts and employing methods like Constitutional AI to improve Bard’s ability to respond to sensitive prompts. While this is an issue across LLMs, we’ve built important guardrails into Bard – like the ones posited by this research – that we’ll continue to improve over time.”

                          Asked about Google’s insistence that the paper’s examples couldn’t be reproduced using Bard, Kolter said, “It’s an odd statement. We have a bunch of examples showing this, not just on our site, but actually on Bard – transcripts of Bard. Having said that, yes, there is some randomness involved.”

Kolter explained that you can ask Bard to generate two answers to the same question, each produced with a different random seed value. But he said nonetheless that he and his co-authors collected numerous examples that worked on Bard (which he shared with The Register).
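That randomness comes from sampling: with a nonzero temperature, the same prompt can produce different tokens on different runs, which is why an adversarial suffix may succeed on one attempt and fail on the next. A minimal sketch of seeded temperature sampling (the function name and interface are illustrative, not any vendor's actual API):

```python
import random

def sample_next_token(probs, temperature=1.0, seed=None):
    """Sample a token id from a probability distribution (sketch).

    Raising each probability to the power 1/temperature before
    renormalizing is equivalent to dividing the logits by the
    temperature. Fixing the seed makes the draw reproducible;
    different seeds can yield different tokens from the same prompt.
    """
    rng = random.Random(seed)
    if temperature != 1.0:
        weights = [p ** (1.0 / temperature) for p in probs]
    else:
        weights = list(probs)
    total = sum(weights)
    r = rng.random() * total
    acc = 0.0
    for i, w in enumerate(weights):
        acc += w
        if r < acc:
            return i
    return len(weights) - 1
```

Two calls with the same seed return the same token; production chatbots typically vary the seed per request, hence the run-to-run variation Kolter describes.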

The Register was able to reproduce some of the examples cited by the researchers, though not reliably. As noted, there's an element of unpredictability in the way these models respond. An adversarial phrase that fails on one attempt may succeed on another, unless a specific patch has been deployed to disable that phrase.

"The implication of this is basically if you have a way to circumvent the alignment of these models' safety filters, then there could be widespread misuse," said Zou. "Especially when the system becomes more powerful, more integrated into society, through APIs, I think there are huge risks with this."

                          Zou argues there should be more robust adversarial testing before these models get released into the wild and integrated into public-facing products. ®
