Guardrails for trust, safety, and ethical development and deployment of Large Language Models (LLM)
DOI: https://doi.org/10.55662/JST.2023.4605
Keywords: trust and safety, large language models, language model safety, guardrails, ethical AI
Abstract
The AI era has brought Large Language Models (LLMs) to the technological forefront; they dominated much of the conversation in 2023 and are likely to remain prominent for years to come. LLMs are the AI models that power generative AI applications such as ChatGPT. Fueled by vast amounts of data and computational power, these models have unlocked remarkable capabilities, from human-like text generation to assistance with natural language understanding (NLU) tasks, and they have quickly become the foundation upon which countless applications and software services are built or augmented. However, as with any groundbreaking innovation, the rise of LLMs brings critical safety, privacy, and ethical concerns. These models have a propensity to leak private information and produce false information, and they can be coerced into generating content for nefarious purposes, whether by bad actors or unknowingly by regular users. Implementing safeguards and guardrailing techniques in applications is therefore imperative to ensure that the content generated by LLMs is safe, secure, and ethical, as are frameworks that let applications deploy mechanisms to prevent misuse of these models. In this study, we propose a Flexible Adaptive Sequencing mechanism with trust and safety modules that can be used to implement safety guardrails for the development and deployment of LLMs.
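The abstract does not describe the internals of the proposed Flexible Adaptive Sequencing mechanism, so the sketch below is only a minimal, assumption-laden illustration of the general pattern it refers to: trust and safety modules run in sequence on the prompt before the LLM call and on the completion after it, with any module able to block or rewrite the text. All class names, module names, and heuristics here (regex PII redaction, keyword prompt-injection detection) are hypothetical stand-ins, not the paper's implementation.

import re
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GuardrailResult:
    # Outcome of one trust-and-safety module: whether to proceed,
    # the (possibly rewritten) text, and an optional reason for blocking.
    allowed: bool
    text: str = ""
    reason: str = ""

def redact_pii(text: str) -> GuardrailResult:
    # Naive regex-based redaction of emails and phone numbers; a real
    # deployment would use a dedicated de-identification service.
    patterns = {"EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
                "PHONE": r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"}
    for label, pattern in patterns.items():
        text = re.sub(pattern, f"<{label}>", text)
    return GuardrailResult(allowed=True, text=text)

def block_prompt_injection(text: str) -> GuardrailResult:
    # Keyword heuristic for prompt-injection attempts; production systems
    # would rely on a trained classifier rather than fixed phrases.
    triggers = ("ignore previous instructions", "disregard the system prompt")
    if any(t in text.lower() for t in triggers):
        return GuardrailResult(allowed=False, text=text, reason="possible prompt injection")
    return GuardrailResult(allowed=True, text=text)

@dataclass
class GuardrailPipeline:
    # Runs input modules on the prompt, calls the LLM, then runs output
    # modules on the completion; any module can block or rewrite the text.
    input_modules: List[Callable[[str], GuardrailResult]] = field(default_factory=list)
    output_modules: List[Callable[[str], GuardrailResult]] = field(default_factory=list)

    def run(self, prompt: str, llm: Callable[[str], str]) -> str:
        for module in self.input_modules:
            result = module(prompt)
            if not result.allowed:
                return f"Request blocked: {result.reason}"
            prompt = result.text
        completion = llm(prompt)
        for module in self.output_modules:
            result = module(completion)
            if not result.allowed:
                return f"Response withheld: {result.reason}"
            completion = result.text
        return completion

if __name__ == "__main__":
    pipeline = GuardrailPipeline(input_modules=[block_prompt_injection, redact_pii],
                                 output_modules=[redact_pii])
    echo_llm = lambda p: f"Model response to: {p}"  # stub in place of a real LLM call
    print(pipeline.run("Contact me at jane.doe@example.com", echo_llm))

In a production setting, each module would typically wrap a dedicated classifier or de-identification service, and the sequencing of modules could be adapted per request rather than fixed as shown here.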
License
Copyright (c) 2023 Anjanava Biswas, Wrick Talukdar (Author)
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
License Terms
Ownership and Licensing:
Authors of this research paper submitted to the Journal of Science & Technology retain the copyright of their work while granting the journal certain rights. Authors maintain ownership of the copyright and have granted the journal a right of first publication. Simultaneously, authors agreed to license their research papers under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License.
License Permissions:
Under the CC BY-NC-SA 4.0 License, others are permitted to share and adapt the work, as long as proper attribution is given to the authors and acknowledgement is made of the initial publication in the Journal of Science & Technology. This license allows for the broad dissemination and utilization of research papers.
Additional Distribution Arrangements:
Authors are free to enter into separate contractual arrangements for the non-exclusive distribution of the journal's published version of the work. This may include posting the work to institutional repositories, publishing it in journals or books, or other forms of dissemination. In such cases, authors are requested to acknowledge the initial publication of the work in the Journal of Science & Technology.
Online Posting:
Authors are encouraged to share their work online, including in institutional repositories, disciplinary repositories, or on their personal websites. This permission applies both prior to and during the submission process to the Journal of Science & Technology. Online sharing enhances the visibility and accessibility of the research papers.
Responsibility and Liability:
Authors are responsible for ensuring that their research papers do not infringe upon the copyright, privacy, or other rights of any third party. The Journal of Science & Technology and The Science Brigade Publishers disclaim any liability or responsibility for any copyright infringement or violation of third-party rights in the research papers.