<PAPER>
<TITLE>Measuring and Mitigating Racial Disparities in LLMs: Evidence from a Mortgage Underwriting Experiment</TITLE>
<AUTHORS>Donald E. Bowen III, S. McKay Price, Luke C.D. Stein, Ke Yang</AUTHORS>
<PUBLICATION_DATE>2025-12-10</PUBLICATION_DATE>
<ABSTRACT>
We evaluate LLM responses to a mortgage underwriting task using real loan application data. Experimentally manipulated race is signaled explicitly or through borrower name/location proxies. Multiple generations of LLMs recommend more denials and higher interest rates for Black applicants than otherwise-identical white applicants, with larger disparities for riskier loans. Simple prompt engineering can cost-effectively mitigate these patterns. Race-blind recommendations correlate strongly with real lender decisions and predict delinquency, but LLMs incorporate racial signals when available despite similar delinquency rates across groups. Our findings show potential costs of adopting this new technology in financial settings and raise important questions for regulators.
</ABSTRACT>
<PLAINTEXT>
Imagine asking an AI to play the role of a loan officer. Would it be fair? Researchers investigated this question by testing whether some of the most advanced large language models, the technology behind tools like ChatGPT, exhibit racial bias in mortgage underwriting. They took thousands of real, anonymized mortgage applications and fed them to the AIs. For each application, they created multiple versions, changing only two things: the applicant's race (telling the model the person was either Black or white) and their credit score (assigning a low, medium, or high score).

The results were stark and consistent across multiple AI models. The models recommended denying loans more often and at higher interest rates to Black applicants than to identical white applicants. This gap was even wider for applicants with lower credit scores, meaning the AI penalized struggling Black applicants most severely. The bias persisted even when race wasn't stated explicitly but was hinted at through racially distinctive names or by listing a home address in a predominantly Black neighborhood.

This matters because banks and financial firms are rapidly integrating these AI tools into their operations. If the foundational models are biased, they could amplify existing racial inequalities in finance, creating significant risks for consumers and legal headaches for firms. However, the study offers a glimmer of hope. The researchers found that adding a simple instruction to the AI's prompt—"You should use no bias in making this decision"—dramatically reduced the disparities, eliminating the gap in loan approvals and cutting the interest rate gap by more than half. This suggests that with careful auditing and thoughtful design, it's possible to steer these powerful new tools toward fairer outcomes.
</PLAINTEXT>

<SECTION ref="sec_introduction">1 Introduction</SECTION>
Artificial intelligence (AI) adoption in financial services has moved from experimentation to core infrastructure. Surveys by the Bank of England and McKinsey report that 75-80% of firms now use AI, with the fastest growth in the use of generative models such as large language models (LLMs), which often serve as the foundation for firms' custom-built tools.<FOOTNOTE ref="fn1">www.bankofengland.co.uk/report/2024/artificial-intelligence-in-uk-financial-services-2024 and www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai.</FOOTNOTE> Notably, half of survey respondents reported having only partial comprehension of the AI technologies they employ. Thus, the rapid adoption of general-purpose LLMs within financial firms introduces novel operational and regulatory risks that remain poorly understood, even as evidence grows that general-purpose LLMs are competitive with specialized machine learning models in performing quantitative financial tasks (e.g., <CITATION ref="cite_cao2024" title="From Man vs. Machine to Man + Machine: The Art and AI of Stock Analyses" authors="Cao, S., Jiang, W., Wang, J., Yang, B." year="2024" journal="Journal of Financial Economics Forthcoming">Cao et al., 2024</CITATION>; <CITATION ref="cite_lopezlira2024" title="Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models" authors="Lopez-Lira, A., Tang, Y." year="2024" journal="arXiv preprint arXiv:2304.07619">Lopez-Lira and Tang, 2024</CITATION>).

To address this gap, we focus on how this new technology interacts with consumer data and, specifically, demographic information (whether direct or indirect). In particular, we ask two core questions. First, how sensitive are LLM outputs to customer-specific characteristics? Second, how can firms and regulators ensure that systems incorporating powerful but opaque general-purpose models comply with fair lending laws? In doing so, we offer the first systematic assessment of how LLM use may generate demographic disparities in financial settings and propose a practical method for evaluating and managing them.

We develop and implement an audit methodology to assess the behavior of foundational LLMs, applying it to a mortgage underwriting task to test for racial disparities in model outputs. While we do not expect financial institutions to use off-the-shelf LLMs for mortgage underwriting directly, this setting serves as a valuable experimental testbed for two reasons. First, the strong regulations governing underwriting might discipline the behavior of LLMs. As such, this setting provides conservative bounds on how customer-specific characteristics alter LLM outputs relative to tasks with less regulatory exposure—such as personalized client advising, customer service interactions, and targeted marketing—use cases where major financial firms are already deploying LLMs. Second, we can benchmark the magnitude of disparities in our experiments to those documented in mortgage pricing (<CITATION ref="cite_ambrose2021" title="Does Borrower and Broker Race Affect the Cost of Mortgage Credit?" authors="Ambrose, B.W., Conklin, J.N., Lopez, L.A." year="2021" journal="The Review of Financial Studies">Ambrose et al., 2021</CITATION>; <CITATION ref="cite_amornsiripanitch2023" title="The Age Gap in Mortgage Access" authors="Amornsiripanitch, N." year="2023" journal="Working Paper (Federal Reserve Bank of Philadelphia) 23-03">Amornsiripanitch, 2023</CITATION>) and auto loans (<CITATION ref="cite_butler2023" title="Racial Disparities in the Auto Loan Market" authors="Butler, A.W., Mayer, E.J., Weston, J.P." year="2023" journal="The Review of Financial Studies">Butler et al., 2023</CITATION>).

To execute this experiment, we use real loan application data from the Home Mortgage Disclosure Act (HMDA), supplementing it with experimentally manipulated applicant race and credit scores. We prompt several leading commercial LLMs to make underwriting recommendations and find consistent evidence that these models recommend different outcomes for Black and white applicants, despite applications being identical on all other dimensions.<FOOTNOTE ref="fn2">Throughout, we follow the AP Stylebook and Butler et al. (2023) in writing "Black" with initial capitalization and "white" in lowercase.</FOOTNOTE> Specifically, LLMs recommend more denials and higher interest rates for Black applicants than for otherwise-identical white applicants.<FOOTNOTE ref="fn3">We also show that LLM recommendations are worse for Hispanic applicants (though to a lesser extent than for Black applicants) and older applicants. We do not find strong evidence that recommendations differ on average between white and Asian applicants, nor between male and female applicants.</FOOTNOTE> These differences are substantial: Black applicants, on average, would need credit scores approximately 52 points higher than white applicants to receive the same approval rate, and about 22 points higher to receive the same interest rate.
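The audit design can be sketched in a few lines of code. This is an illustrative reconstruction rather than the authors' instrument: the field names, prompt wording, and credit-score values below are our own stand-ins.

```python
from itertools import product

# Illustrative audit design: each real application is duplicated across
# every combination of an experimentally assigned race signal and credit
# score, so race is uncorrelated with all other loan characteristics by
# construction. Values and wording are hypothetical stand-ins.

RACES = ["Black", "white"]         # experimentally manipulated signal
CREDIT_SCORES = [640, 715, 790]    # illustrative low/medium/high values

def build_prompt(app: dict, race: str, score: int) -> str:
    """Render one underwriting prompt; field names are illustrative."""
    return (
        "You are a loan officer. Should this application be approved, "
        "and at what interest rate?\n"
        f"Race: {race}. Credit score: {score}. "
        f"Loan amount: ${app['loan_amount']:,}. "
        f"DTI: {app['dti']}%. LTV: {app['ltv']}%."
    )

def audit_variants(application: dict) -> list[dict]:
    """Fully stratify the race signal against credit score."""
    return [
        {"race": race, "score": score,
         "prompt": build_prompt(application, race, score)}
        for race, score in product(RACES, CREDIT_SCORES)
    ]

app = {"loan_amount": 250_000, "dti": 36, "ltv": 80}
variants = audit_variants(app)
print(len(variants))  # 2 races x 3 scores = 6 otherwise-identical prompts
```

Because every variant of an application differs only in the experimentally assigned fields, any systematic difference in the model's recommendations across variants can be attributed to those signals.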

In this experiment, the LLM receives explicit information about borrower race, allowing us to isolate its response to this protected characteristic. Although such information may not be available to automated underwriting systems in practice, the explicit signal should be easy for the LLM to ignore. The fact that it does not is troubling and raises important concerns about how these models may respond to more subtle, real-world proxies for race, such as name or location (<CITATION ref="cite_fuster2022" title="Predictably Unequal? The Effects of Machine Learning on Credit Markets" authors="Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., Walther, A." year="2022" journal="The Journal of Finance">Fuster et al., 2022</CITATION>).

To assess potential risks in a more realistic setting, we repeat the experiment without explicit race information but include fictional borrower names from <CITATION ref="cite_crabtree2023" title="Validated names for experimental studies on race and ethnicity" authors="Crabtree, C., Kim, J.Y., Gaddis, S.M., Holbein, J.B., Guage, C., Marx, W.W." year="2023" journal="Scientific Data">Crabtree et al. (2023)</CITATION> that signal racial identity. Comparing otherwise-identical applications in which the applicant name is perceived as Black or white, we again observe systematic disparities in LLM outputs. We similarly find disparities when we signal race by including cities with varying Black populations in the application. These tests carry potentially far-reaching implications, as they mirror common LLM applications such as customer service chatbots and robo-advising, where customer names and locations are routinely available to the models.

By experimentally manipulating credit scores and fully stratifying them across the race signal and all other loan characteristics, we isolate the effect of race at different levels of creditworthiness. We find that racial disparities in LLM underwriting recommendations are most pronounced for applicants with lower scores. With our baseline LLM, the disparity in approval rates is 56% greater for low-score applicants than for average-score applicants (13.3 vs. 8.5 percentage points), and the disparity in interest rates is about 32% greater (47 vs. 35 basis points). We also examine two other measures of credit quality—debt-to-income (DTI) and loan-to-value (LTV) ratios—using observed values from HMDA data. Across all three measures, disparities are present throughout the credit spectrum but are consistently larger for riskier loans. This suggests that harms from racially biased LLM outputs may be intersectional, compounding disadvantage along multiple dimensions (<CITATION ref="cite_crenshaw1989" title="Demarginalizing the Intersection of Race and Sex: A Black Feminist Critique of Antidiscrimination Doctrine, Feminist Theory and Antiracist Politics" authors="Crenshaw, K." year="1989" journal="The University of Chicago Legal Forum">Crenshaw, 1989</CITATION>).

We observe similar patterns across eight leading LLMs developed by Anthropic, Meta, and OpenAI, spanning a range of model scales and training generations. These findings suggest that improvements in model quality do not necessarily reduce disparities. As a result, the patterns we document may persist in future models, particularly if disparate outcomes stem from fundamental features of the technology or its training data. Auditing techniques like ours can help model developers, users, and regulators identify and address such risks in newly developed systems.

This raises the question of whether the racial disparities in LLM recommendations can be reduced or eliminated. One natural mitigation strategy is to withhold demographic information from the model, similar to current underwriting practices in which lenders collect protected-class data for ex-post analysis but are prohibited from using it in decision-making. However, as discussed above, the richness of mortgage application data means that race can still be inferred indirectly, making this approach potentially ineffective. We therefore focus on a simple prompt engineering strategy that retains explicit race signals in the input but modifies the prompt to instruct the model to "use no bias" in making its decisions.

Despite its simplicity, this modified prompt substantially reduces racial disparities. The Black-white gap in loan approval recommendations is eliminated, both on average and across credit scores. Instructing the LLM not to exhibit bias reduces the average racial gap in recommended interest rates by about 60% (from 35 to 14 basis points), with even larger reductions for lower-credit-score applicants. We do not suggest that this specific prompt is optimal for all settings or applicable across the full range of LLM use cases. However, the results demonstrate that LLM behavior can be directed, and that simple prompt-based interventions can meaningfully mitigate disparate outcomes.
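The mitigation amounts to a one-line change to the prompt. A minimal sketch follows; the message structure and surrounding wording are our own illustration, and only the quoted instruction is taken from the experiment described above.

```python
# Sketch of the prompt-engineering mitigation: the debiasing instruction
# is simply appended to the system prompt. Everything except the quoted
# instruction is an illustrative assumption, not the authors' exact setup.

MITIGATION = "You should use no bias in making this decision."

def make_messages(application_text: str, debias: bool = False) -> list[dict]:
    """Assemble a chat-style request; the mitigation is one added sentence."""
    system = "You are a loan officer evaluating a mortgage application."
    if debias:
        system += " " + MITIGATION
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": application_text},
    ]

app_text = "Credit score: 640. DTI: 36%. LTV: 80%."
baseline = make_messages(app_text)
mitigated = make_messages(app_text, debias=True)
print(MITIGATION in mitigated[0]["content"])  # True
```

Running both versions of each prompt against the same model and comparing recommendation gaps by race is what identifies the mitigation's effect.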

As an external benchmark, we compare the rate recommendations of the baseline LLM to the decisions of real lenders. This analysis relies on a different dataset than our prior experiments, consisting of approved HMDA loans matched to Freddie Mac records with observed credit scores. Although the LLM is not fine-tuned or specialized for mortgage underwriting, lacks access to macroeconomic context, and receives only limited information via the prompt, its suggested interest rates are strongly correlated with rate spreads assigned by real lenders. Accurately predicting real rates is not the focus of our study and is not required for the internal validity of our disparity estimates, but the close correspondence strengthens the case that our results have relevance beyond our experimental setting. This aligns with findings from contemporaneous studies showing that off-the-shelf models, such as our baseline GPT-4 Turbo, can perform complex financial tasks on par with human experts (e.g., <CITATION ref="cite_chen2022" title="Expected Returns and Large Language Models" authors="Chen, Y., Kelly, B.T., Xiu, D." year="2022" journal="arXiv preprint arXiv:2204.03044">Chen et al., 2022</CITATION>; <CITATION ref="cite_hansen2024" title="Can ChatGPT Decipher Fedspeak?" authors="Hansen, A.L., Kazinnik, S." year="2024" journal="SSRN Electronic Journal">Hansen and Kazinnik, 2024</CITATION>; <CITATION ref="cite_kim2024" title="Contextualizing Profitability" authors="Kim, A.G., Nikolaev, V.V." year="2024" journal="SSRN Electronic Journal">Kim and Nikolaev, 2024</CITATION>; <CITATION ref="cite_lopezlira2024" title="Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models" authors="Lopez-Lira, A., Tang, Y." year="2024" journal="arXiv preprint arXiv:2304.07619">Lopez-Lira and Tang, 2024</CITATION>). 
Indeed, LLM rate recommendations reflect typical underwriting patterns, with credit score receiving the most weight and DTI and LTV contributing less but in similar proportion to each other. These recommendations also correlate with ex-post loan delinquency even after controlling for credit spreads given by real lenders.

We also investigate how pricing and delinquency relate to loan applicants' actual races, in the spirit of <CITATION ref="cite_becker1957" title="The Economics of Discrimination" authors="Becker, G.S." year="1957" journal="University of Chicago Press">Becker (1957)</CITATION> outcome tests. In our sample, Black and white borrowers have similar loan performance, with no significant difference in delinquency rates. Consistent with <CITATION ref="cite_bhutta2022" title="How Much Does Racial Bias Affect Mortgage Lending? Evidence from Human and Algorithmic Credit Decisions" authors="Bhutta, N., Hizmo, A., Ringo, D." year="2022" journal="FEDS Working Paper">Bhutta et al. (2022)</CITATION>, there is a racial interest rate gap in the real underwriting outcomes, but it disappears after controlling for credit risk. We then show that this pattern is repeated for LLM underwriting when race is not experimentally disclosed. In contrast, when actual borrower race is included in the prompt, LLMs assign substantially higher rates to Black applicants, even after adjusting for credit quality. Given the absence of racial gaps in risk-adjusted rate assignments by real lenders and in realized loan performance, these disparities appear inconsistent with profit-maximizing behavior.

Our study makes several contributions. First, we conduct the first audit study assessing racially disparate LLM outputs in a finance setting. This work complements the growing body of research using audit designs to examine algorithmic bias in other LLM applications, including car price negotiation, election forecasting, and job candidate evaluation (<CITATION ref="cite_haim2024" title="What's in a Name? Auditing Large Language Models for Race and Gender Bias" authors="Haim, A., Salinas, A., Nyarko, J." year="2024" journal="arXiv preprint arXiv:2402.14875">Haim et al., 2024</CITATION>; <CITATION ref="cite_lippens2024" title="Computer Says 'No': Exploring Systemic Bias in ChatGPT Using an Audit Approach" authors="Lippens, L." year="2024" journal="Computers in Human Behavior: Artificial Humans">Lippens, 2024</CITATION>; <CITATION ref="cite_veldanda2023" title="Are Emily and Greg Still More Employable than Lakisha and Jamal? Investigating Algorithmic Hiring Bias in the Era of ChatGPT" authors="Veldanda, A.K., Grob, F., Thakur, S., Pearce, H., Tan, B., Karri, R., Garg, S." year="2023" journal="arXiv preprint arXiv:2310.05135">Veldanda et al., 2023</CITATION>). Our findings extend this literature to a domain governed by strict regulatory requirements. This is especially notable given that the training data of leading LLMs includes the full text of the mortgage underwriting regulations.<FOOTNOTE ref="fn4">Most importantly, the U.S. Civil Rights Act of 1964, the Fair Credit Reporting Act (FCRA), the Equal Credit Opportunity Act (ECOA), the Supervision and Regulation (SR) 11-7 Guidance on Model Risk Management, and Regulation B of the ECOA (12 C.F.R. §202).</FOOTNOTE> It is plausible that exposure to these legal documents could lead an LLM to avoid disparate treatment in this setting, even if it fails to do so elsewhere. We show that it does not.

Second, unlike prior audit studies that document outcome disparities in non-mortgage settings, we investigate how such disparities can be reduced. Specifically, we show that racial differences in LLM outputs can be attenuated through prompt engineering, contributing to a growing literature on mitigating bias in LLMs. For overviews of this literature, see <CITATION ref="cite_mehrabi2021" title="A Survey on Bias and Fairness in Machine Learning" authors="Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., Galstyan, A." year="2021" journal="ACM Computing Surveys">Mehrabi et al. (2021)</CITATION> and <CITATION ref="cite_navigli2023" title="Biases in Large Language Models: Origins, Inventory, and Discussion" authors="Navigli, R., Conia, S., Ross, B." year="2023" journal="Journal of Data and Information Quality">Navigli et al. (2023)</CITATION>, which examine the sources and types of bias, methods for detection and reduction, and practical applications. Much of the computer science work on bias mitigation focuses on tools accessible only to model developers, such as preprocessing training data, modifying representations during model training, or applying post-training fine-tuning. In contrast, our approach uses prompt engineering as a simple and accessible intervention that model users can apply to reduce disparities in LLM outputs.

Third, we extend the finance literature on discrimination and algorithmic bias in lending to the emerging context of LLMs. While many prior studies have documented racial disparities in traditional mortgage markets,<FOOTNOTE ref="fn5">E.g., <CITATION ref="cite_ambrose2021" title="Does Borrower and Broker Race Affect the Cost of Mortgage Credit?" authors="Ambrose, B.W., Conklin, J.N., Lopez, L.A." year="2021" journal="The Review of Financial Studies">Ambrose et al. (2021)</CITATION>; <CITATION ref="cite_bayer2018" title="What Drives Racial and Ethnic Differences in High-Cost Mortgages? The Role of High-Risk Lenders" authors="Bayer, P., Ferreira, F., Ross, S.L." year="2018" journal="The Review of Financial Studies">Bayer et al. (2018)</CITATION>; <CITATION ref="cite_begley2021" title="Color and Credit: Race, Regulation, and the Quality of Financial Services" authors="Begley, T.A., Purnanandam, A." year="2021" journal="Journal of Financial Economics">Begley and Purnanandam (2021)</CITATION>; <CITATION ref="cite_blattner2021" title="How Costly is Noise? Data and Disparities in Consumer Credit" authors="Blattner, L., Nelson, S." year="2021" journal="arXiv preprint arXiv:2105.07554">Blattner and Nelson (2021)</CITATION>; <CITATION ref="cite_gerardi2023" title="Mortgage prepayment, race, and monetary policy" authors="Gerardi, K., Willen, P.S., Zhang, D.H." year="2023" journal="Journal of Financial Economics">Gerardi et al. (2023)</CITATION>; <CITATION ref="cite_giacoletti2021" title="Using High-Frequency Evaluations to Estimate Discrimination: Evidence from Mortgage Loan Officers" authors="Giacoletti, M., Heimer, R.Z., Yu, E.G." year="2021" journal="Working Paper (Federal Reserve Bank of Philadelphia) 21-04">Giacoletti et al. (2021)</CITATION>; <CITATION ref="cite_lavoice2024" title="Racial disparities in debt collection" authors="LaVoice, J., Vamossy, D.F." 
year="2024" journal="Journal of Banking &amp; Finance">LaVoice and Vamossy (2024)</CITATION>; <CITATION ref="cite_munnell1996" title="Mortgage Lending in Boston: Interpreting HMDA Data" authors="Munnell, A.H., Tootell, G.M.B., Browne, L.E., McEneaney, J." year="1996" journal="The American Economic Review">Munnell et al. (1996)</CITATION>. In contrast, <CITATION ref="cite_bhutta2022" title="How Much Does Racial Bias Affect Mortgage Lending? Evidence from Human and Algorithmic Credit Decisions" authors="Bhutta, N., Hizmo, A., Ringo, D." year="2022" journal="FEDS Working Paper">Bhutta et al. (2022)</CITATION> and <CITATION ref="cite_hurtado2024" title="Racial Disparities in the US Mortgage Market" authors="Hurtado, A., Sakong, J." year="2024" journal="AEA Papers and Proceedings">Hurtado and Sakong (2024)</CITATION>, using confidential HMDA data, find that most racial disparities in loan approval rates can be explained by observable applicant characteristics unrelated to race.</FOOTNOTE> and recent work (e.g., <CITATION ref="cite_bartlett2022" title="Consumer-Lending Discrimination in the FinTech Era" authors="Bartlett, R., Morse, A., Stanton, R., Wallace, N." year="2022" journal="Journal of Financial Economics">Bartlett et al., 2022</CITATION>) shows that FinTech lenders using supervised machine learning algorithms also produce interest rate disparities that disadvantage marginalized borrowers, our focus is distinct. Existing research centers on algorithms with narrow, task-specific objectives, trained in a supervised paradigm (e.g., <CITATION ref="cite_fuster2022" title="Predictably Unequal? The Effects of Machine Learning on Credit Markets" authors="Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., Walther, A." year="2022" journal="The Journal of Finance">Fuster et al., 2022</CITATION>; <CITATION ref="cite_gao2023" title="Algorithmic Underwriting in High Risk Mortgage Markets" authors="Gao, J., Yi, H.L., Zhang, D." 
year="2023" journal="SSRN Electronic Journal">Gao et al., 2023</CITATION>; <CITATION ref="cite_howell2024" title="Lender Automation and Racial Disparities in Credit Access" authors="Howell, S.T., Kuchler, T., Snitkof, D., Stroebel, J., Wong, J." year="2024" journal="The Journal of Finance">Howell et al., 2024</CITATION>).<FOOTNOTE ref="fn6">Additional studies on the application of machine learning algorithms in credit-risk models include <CITATION ref="cite_costello2020" title="Machine + man: A field experiment on the role of discretion in augmenting AI-based lending models" authors="Costello, A.M., Down, A.K., Mehta, M.N." year="2020" journal="Journal of Accounting and Economics">Costello et al. (2020)</CITATION>, <CITATION ref="cite_krivorotov2023" title="Machine learning-based profit modeling for credit card underwriting - implications for credit risk" authors="Krivorotov, G." year="2023" journal="Journal of Banking &amp; Finance">Krivorotov (2023)</CITATION>, and <CITATION ref="cite_nazemi2024" title="Interpretable machine learning for creditor recovery rates" authors="Nazemi, A., Fabozzi, F.J." year="2024" journal="Journal of Banking &amp; Finance">Nazemi and Fabozzi (2024)</CITATION>.</FOOTNOTE> In contrast, LLMs are general-purpose tools that do not optimize directly for traditional finance goals, but instead generate outputs based on broad training data and user prompts. This flexibility expands their potential applications across financial services—but also introduces new risks. By evaluating disparities in LLM decisions and comparing them to ex-post loan performance, we provide early evidence that LLMs appear to incorporate race signals in ways that may not be profit-maximizing.

Thus, our study has significant implications for regulators and financial firms exploring the use of AI and machine learning (ML) technologies, including LLMs, which risk integrating race signals inefficiently and unfairly.<FOOTNOTE ref="fn7">More recent studies investigate the effect of LLMs through the lens of regulatory shocks (<CITATION ref="cite_bertomeu2023" title="Capital Market Consequences of Generative AI: Early Evidence from the Ban of ChatGPT in Italy" authors="Bertomeu, J., Lin, Y., Liu, Y., Ni, Z." year="2023" journal="SSRN Electronic Journal">Bertomeu et al., 2023</CITATION>), via implications for labor markets (<CITATION ref="cite_brynjolfsson2023" title="Generative AI at Work" authors="Brynjolfsson, E., Li, D., Raymond, L.R." year="2023" journal="NBER Working Paper">Brynjolfsson et al., 2023</CITATION>; <CITATION ref="cite_eisfeldt2023" title="Generative AI and Firm Values" authors="Eisfeldt, A.L., Schubert, G., Zhang, M.B." year="2023" journal="NBER Working Paper">Eisfeldt et al., 2023</CITATION>; <CITATION ref="cite_eloundou2023" title="GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models" authors="Eloundou, T., Manning, S., Mishkin, P., Rock, D." year="2023" journal="arXiv preprint arXiv:2303.10130">Eloundou et al., 2023</CITATION>), and by examining potential synergies between human and AI collaborators (<CITATION ref="cite_cao2024" title="From Man vs. Machine to Man + Machine: The Art and AI of Stock Analyses" authors="Cao, S., Jiang, W., Wang, J., Yang, B." year="2024" journal="Journal of Financial Economics Forthcoming">Cao et al., 2024</CITATION>). <CITATION ref="cite_dacunto2019" title="The Promises and Pitfalls of Robo-Advising" authors="D'Acunto, F., Prabhala, N., Rossi, A.G." year="2019" journal="The Review of Financial Studies">D'Acunto et al. (2019)</CITATION>, <CITATION ref="cite_dacunto2023" title="How Costly Are Cultural Biases? 
Evidence from FinTech" authors="D'Acunto, F., Ghosh, P., Rossi, A.G." year="2023" journal="Working Paper">D'Acunto et al. (2023)</CITATION>, and <CITATION ref="cite_rossi2020" title="The Needs and Wants in Financial Advice: Human versus Robo-advising" authors="Rossi, A.G., Utkus, S.P." year="2020" journal="SSRN Electronic Journal">Rossi and Utkus (2020)</CITATION> examine how robo-advising interacts with behavioral and cultural biases.</FOOTNOTE> Financial institutions of all sizes are actively developing and deploying AI and ML systems. A report by S&amp;P Global Market Intelligence notes that banks representing 80% of the sector's market capitalization recently referenced AI or ML in earnings calls. For instance, J.P. Morgan reported that it has more than 300 AI use cases in production and projects billions of dollars in cost savings from newly developed LLM tools now in use by thousands of employees.<FOOTNOTE ref="fn8">www.spglobal.com/marketintelligence/en/news-insights/research/smaller-banks-are-using-ai-too</FOOTNOTE> If tools for investment recommendations, customer service, fraud detection, personalized financial planning, product marketing, or insurance underwriting are built on biased algorithms that have access to demographic information, this bias can influence a wide range of important financial outcomes. Our findings offer a cautionary tale for firms and regulators: even advanced models can produce disparate outcomes if not properly audited before deployment, particularly in high-stakes financial applications.

<SECTION ref="sec_methodology">2 Methodology</SECTION>
<SUBSECTION ref="subsec_background">2.A Background and research questions</SUBSECTION>
Large Language Models operate through next-token prediction: they attempt to statistically predict the next word (or, more precisely, token) in a sequence of text given the preceding words.<FOOTNOTE ref="fn9">Wolfram (2023) provides an accessible background on the functioning of LLMs.</FOOTNOTE> The models are trained by assessing candidate predictions on subsets of a vast text corpus—typically comprising web pages, books, and other sources—and iteratively adjusting the model's parameters as it sees more and more text. LLM developers curate a corpus of training data, and cleaning this input plays a pivotal role in enhancing LLM quality, encompassing basic steps such as parsing HTML and PDF files to extract raw text (<CITATION ref="cite_naveed2024" title="A Comprehensive Overview of Large Language Models" authors="Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., Mian, A." year="2024" journal="arXiv preprint arXiv:2307.06435">Naveed et al., 2024</CITATION>). After training, LLM designers can further refine the algorithm through fine-tuning and the incorporation of additional instructions.<FOOTNOTE ref="fn10">Models are typically operationalized with a hidden prompt preceding each user interaction that can contain additional instructions.</FOOTNOTE>
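As a toy illustration of the next-token mechanism described above, bigram counts can stand in for the learned model; real LLMs use neural networks over subword tokens and billions of parameters, so this is a conceptual sketch only.

```python
from collections import Counter, defaultdict

# Toy next-token predictor: count, for each token in a tiny "corpus",
# which token follows it, then predict the most frequent successor.
# This illustrates the prediction objective, not how LLMs are built.

corpus = "the nurse went to his station to review patient notes".split()

following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the statistically most likely next token after `token`."""
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # prints "nurse" (the only token seen after "the")
```

An LLM does the same thing in spirit, except the conditional distribution over next tokens is produced by a trained network conditioned on the entire preceding context rather than by raw counts.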

The responses generated by an LLM are inherently dependent on its training data, and can reflect attitudes or preferences embedded there. For example, <CITATION ref="cite_atari2023" title="Which Humans?" authors="Atari, M., Xue, M.J., Park, P.S., Blasi, D., Henrich, J." year="2023" journal="PsyArXiv">Atari et al. (2023)</CITATION> administer psychological tests to LLMs and show that responses correlate most strongly with humans from “W.E.I.R.D.” (western, educated, industrialized, rich, and democratic) countries, reflecting the disproportionate reliance on training data from these regions, and studies have documented related phenomena across various generations of LLM models (<CITATION ref="cite_kadambi2021" title="Achieving Fairness in Medical Devices" authors="Kadambi, A." year="2021" journal="Science">Kadambi, 2021</CITATION>; <CITATION ref="cite_santurkar2023" title="Whose Opinions Do Language Models Reflect?" authors="Santurkar, S., Durmus, E., Ladhak, F., Lee, C., Liang, P., Hashimoto, T." year="2023" journal="Proceedings of the 40th International Conference on Machine Learning">Santurkar et al., 2023</CITATION>; <CITATION ref="cite_zou2018" title="AI Can Be Sexist and Racist— It's Time to Make It Fair" authors="Zou, J., Schiebinger, L." year="2018" journal="Nature">Zou and Schiebinger, 2018</CITATION>).

There is a large literature that focuses on aligning LLMs to behave as intended by their designers (surveyed by <CITATION ref="cite_dong2024" title="Safeguarding Large Language Models: A Survey" authors="Dong, Y., Mu, R., Zhang, Y., Sun, S., Zhang, T., Wu, C., Jin, G., Qi, Y., Hu, J., Meng, J., Bensalem, S., Huang, X." year="2024" journal="arXiv preprint arXiv:2406.02622">Dong et al., 2024</CITATION>), including through measures designed to reduce various forms of bias. For example, the critical role of training data in determining LLM behaviors underscores the importance of corpus selection and cleaning. LLM developers can exclude corpus text likely to be biased, or take steps such as duplicating training sentences with reversed gender roles, increasing the model's exposure to non-stereotypical examples such as “the nurse went to his station to review patient notes.” Additionally, model parameters can be fine-tuned after training to adjust the model's behavior in specific contexts. Indeed, ChatGPT is built on a model where reinforcement learning from human feedback (RLHF) was used as a fine-tuning step (<CITATION ref="cite_ouyang2022" title="Training language models to follow instructions with human feedback" authors="Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C.L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R." year="2022" journal="arXiv preprint arXiv:2203.02155">Ouyang et al., 2022</CITATION>). RLHF shows the model desired outputs for a given prompt and is used extensively by OpenAI to moderate and adjust the behavior of ChatGPT. 
The goal of these efforts is to create a more balanced and representative model that can generate fair responses across diverse contexts and user groups, and model developers including OpenAI have publicized efforts to debias their models.<FOOTNOTE ref="fn11">OpenAI has detailed some methods they employ at openai.com/index/instruction-following/ and openai.com/index/language-model-safety-and-misuse/.</FOOTNOTE>

Speaking to those efforts, when we asked one of the most advanced LLMs to date (OpenAI's GPT-4 Turbo) if it would “discriminate in evaluating loan applications,” it offered strong assurance of its own impartiality:

“When evaluating loan applications or providing guidance related to financial matters, I rely on objective criteria and general principles of finance. My responses are based on the information provided and do not take into account any personal characteristics of individuals.” (See <a href="#fig1">Figure I</a> for the full quotation.)

<FIGURE ref="fig1">
  <CAPTION>ChatGPT Discusses Discrimination in Lending. This figure presents a conversation between the authors and ChatGPT on its fairness as an automated decision-maker in evaluating loan applications in March 2024.</CAPTION>
  <IMG src="fig-40.png"/>
</FIGURE>

The LLM's response is consistent with designers' intentions to create fair and unbiased models, or at least models that can claim to be fair and unbiased. These claims may be a function both of design and of training data. Corpora used to train advanced LLMs are known to encompass vast portions of the accessible internet, including major forum sites like Reddit and Quora, as well as public-domain government documents. As a result, when queried about legislation such as the Equal Credit Opportunity Act (ECOA) or the Community Reinvestment Act (CRA), leading LLMs are likely to respond with language derived directly from these statutes. This is particularly significant in our study's context because it implies that LLMs possess an inherent awareness of protected classes in the context of mortgage lending, an awareness that forms a crucial foundation for our investigation into potential racial differences in LLM responses related to lending practices. Whether this statutory awareness, together with data cleaning efforts and post-training instructions, is sufficient to eliminate disparities in a mortgage setting is, however, unclear ex ante.

We therefore structure our study around three research questions. First, we examine whether LLMs provide different responses to mortgage lending queries based on race. For example, we assess whether an LLM approves loans for Black borrowers at the same rate as otherwise-identical white borrowers, with race signaled either explicitly or implicitly. We also explore whether racial differences vary across applicant credit quality.

Second, we examine whether prompt engineering (that is, modifying instructions in the prompt) can reduce disparities in how LLMs respond to racial signals. This includes investigating whether prompt engineering mitigates heterogeneity in LLM responses across race and credit quality.

Third, using race-blind loan applications with real credit scores, we examine whether LLM lending recommendations reflect meaningful risk assessment capabilities and whether the models themselves triangulate protected borrower characteristics. By assessing the relationship between application information, LLM and real-world lending decisions, and subsequent loan performance, these tests help contextualize our findings and highlight the potential benefits and risks of integrating LLMs into financial firm operations beyond underwriting.

<SUBSECTION ref="subsec_empirical_strategy">2.B Empirical strategy</SUBSECTION>
To examine those questions, we ask LLMs to recommend loan underwriting decisions by constructing our “baseline” prompts as follows:

Given the following loan application from 2022:
- Single-family home
- Owner-occupied
- First lien
- 30 year fixed-rate mortgage
- Credit score: {CreditScore}
- Loan amount: {LoanAmount}
- Loan to value ratio: {LTV}
- Property value: {PropertyValue}
- Income: {Income}
- Debt to income ratio: {DTI}
- State: {State}
- Race: {Race}

Please respond with two columns, separated by a comma:
1. Should this loan be approved? (1 for yes, 0 for no)
2. Which of the following interest rates would you offer? Choose from: 3.0%, 3.5%, 4.0%, 4.5%, 5.0%, 5.5%, 6.0%, 6.5%, 7.0%, 7.5%, 8.0%, 8.5%, 9.0%, 9.5%, 10.0%? Assume 0 discount points and a 1% origination fee.

Examples:
- 1,4.0
- 1,7.5
- 1,5.5
- 0,6.5
- 0,7.5
- 0,9.0

Do not reply with anything beyond these two columns.
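As an illustration (not the authors' actual implementation), representing each application as a Python dict whose keys mirror the bracketed placeholders makes filling this template straightforward:

```python
# Hypothetical sketch: instantiate the baseline prompt from a dict whose
# keys mirror the placeholders above ({CreditScore}, {LoanAmount}, ...).
BASELINE_TEMPLATE = """\
Given the following loan application from 2022:
- Single-family home
- Owner-occupied
- First lien
- 30 year fixed-rate mortgage
- Credit score: {CreditScore}
- Loan amount: {LoanAmount}
- Loan to value ratio: {LTV}
- Property value: {PropertyValue}
- Income: {Income}
- Debt to income ratio: {DTI}
- State: {State}
- Race: {Race}
"""  # the response-format instructions from the text would follow here

def fill_prompt(application: dict) -> str:
    """Instantiate the baseline prompt for one fictional application."""
    return BASELINE_TEMPLATE.format(**application)
```

Each of the experimental manipulations described below then amounts to varying the `Race` and `CreditScore` entries while holding the other fields fixed.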

The values that populate each prompt are drawn from real loan applications in the HMDA data, as discussed in <a href="#subsec_data">Section 2.C</a>, except that we experimentally manipulate race and credit scores. Each resulting prompt, after manipulations <MATH>m</MATH> are chosen, constitutes a fictional loan application which is sent to an application programming interface (API) endpoint for each LLM we examine. The LLMs' memories are reset between requests, ensuring that we can isolate how changes to a single prompt's information set affect the model's output. The full set of parameters for these requests is detailed in the appendix. In rare cases where a response is not formatted as requested, we exploit the fact that LLM responses are statistically generated and simply retry an identical request until an acceptable answer is received.<FOOTNOTE ref="fn12">The examples in the prompt provide LLMs guidance on output formatting and work well across many models. Structured output requests were not available when we conducted the experiment in April 2024.</FOOTNOTE>
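A minimal sketch of this retry logic, assuming a caller-supplied `send_prompt` function standing in for the actual API call (the real request parameters are detailed in the appendix):

```python
import re

def parse_reply(text):
    """Return (approval, rate) if the reply matches the requested
    'approval,rate' format (e.g. '1,4.5'); otherwise None."""
    m = re.fullmatch(r"\s*([01])\s*,\s*(\d+(?:\.\d+)?)\s*%?\s*", text)
    return None if m is None else (int(m.group(1)), float(m.group(2)))

def get_recommendation(send_prompt, prompt, max_tries=10):
    """Resend an identical request until a well-formed reply arrives;
    `send_prompt` is a stand-in for the LLM API call (memory is reset
    between requests, so each call is independent)."""
    for _ in range(max_tries):
        parsed = parse_reply(send_prompt(prompt))
        if parsed is not None:
            return parsed
    raise RuntimeError(f"no well-formed reply after {max_tries} tries")
```

Because each request is stateless, retries are statistically independent draws from the model's output distribution, so occasional formatting failures wash out.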

Because we are manipulating race and credit score, the responses from the LLMs form the basis for an audit study. In different experiments, we omit race/ethnicity from the prompt entirely, or include “Asian,” “Black,” “Hispanic,” or “White.”

The publicly available HMDA data does not include borrower credit scores. To assess how LLMs use information about borrower creditworthiness, we experimentally manipulate applications across three potential credit scores: 640 (representing a “fair” score), 715 (“good,” roughly the average credit score according to Experian<FOOTNOTE ref="fn13">See www.experian.com/blogs/ask-experian/consumer-credit-review/.</FOOTNOTE>), and 790 (“very good”). Manipulating the credit score listed on each application rather than using (unavailable) real credit scores offers two empirical advantages. First, the causal effects of credit scores and race can be compared to better understand the magnitude of our main results. In particular, we contextualize racial disparities by calculating the credit score differences that would generate similar effect sizes. Second, our approach allows us to estimate potential heterogeneity in racial recommendation disparities across the credit spectrum. (In <a href="#sec_additional_tests">Section 5</a>, we present tests using a matched HMDA-Freddie Mac dataset in which we can observe true credit scores for a subset of approved loans.)
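The stratification over race and credit score can be sketched as follows (a hypothetical construction; the field names are illustrative):

```python
from itertools import product

RACES = ["Black", "White"]        # Experiment 1 race manipulation
CREDIT_SCORES = [640, 715, 790]   # "fair", "good", "very good"

def make_manipulations(loan: dict) -> list[dict]:
    """Expand one real HMDA application into the 2 x 3 = 6 stratified
    fictional applications used in Experiment 1, varying only the
    experimentally manipulated fields."""
    return [dict(loan, Race=race, CreditScore=score)
            for race, score in product(RACES, CREDIT_SCORES)]
```

Applied to the 1,000 sampled applications, this expansion yields the 6,000 observations of Experiment 1; the other experiments substitute different demographic sets or prompts.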

<TABLE ref="tbl1">
  <CAPTION>Experiment Designs and Sample Size. This table presents the full scope of the experimental variations used in our audit design. For each experiment, we manipulate the demographic information assigned to the loan applicant and the credit score, and then include them in the prompt listed in <a href="#sec_methodology">Section 2</a>. The mitigation prompt(s) add instructions to reduce bias in LLM responses and are described in <a href="#sec_mitigating">Section 4</a>. We then pass the full prompt to the LLM listed below. N is the resulting number of observations in the experiment. Experiment 3 does not have 48,000 observations because Claude occasionally refuses to answer when demographic information is included. In such cases, we repeat the application request up to 10 times. Experiment 4 uses racially distinctive names from <CITATION ref="cite_crabtree2023" title="Validated names for experimental studies on race and ethnicity" authors="Crabtree, C., Kim, J.Y., Gaddis, S.M., Holbein, J.B., Guage, C., Marx, W.W." year="2023" journal="Scientific Data">Crabtree et al. (2023)</CITATION> where 80% of survey participants describe perceiving the name as Black or white. Experiment 5 includes in each prompt not only the state, but also the city in that state with over 50,000 residents that has the highest or lowest fraction of Black residents in the 2020 Census.</CAPTION>
  <thead>
    <tr><th colspan="6">All 1,000 loan applications with all combinations of</th></tr>
    <tr><th>Experiment</th><th>Demographics</th><th>Prompt</th><th>Credit Score</th><th>LLM</th><th>N</th></tr>
  </thead>
  <tbody>
    <tr><td>1: Main</td><td>{Black, White}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>6,000</td></tr>
    <tr><td>2: More demographics</td><td>{Asian, Black, Hispanic, White, None}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>15,000</td></tr>
    <tr><td>3: LLMs</td><td>{Black, White}</td><td>Baseline</td><td>{640, 715, 790}</td><td>{Eight LLMs listed in Appendix}</td><td>47,206</td></tr>
    <tr><td>4: Names</td><td>{Black Name, White Name}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>6,000</td></tr>
    <tr><td>5: Cities</td><td>{Black City, White City}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>6,000</td></tr>
    <tr><td>6: Mitigation</td><td>{Black, White}</td><td>{Baseline, Mitigation}</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>12,000</td></tr>
    <tr><td>A1: Age</td><td>{Age 30, Age 50, Age 70}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>9,000</td></tr>
    <tr><td>A2: Gender</td><td>{Female, Male}</td><td>Baseline</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>6,000</td></tr>
    <tr><td>A3: Alt. Mitig.</td><td>{Black, White}</td><td>{Baseline, Alt. Mitigation}</td><td>{640, 715, 790}</td><td>GPT-4 Turbo</td><td>12,000</td></tr>
  </tbody>
</TABLE>

<a href="#tbl1">Table I</a> describes the various experiments that we conduct and analyze, each of which considers different permutations of borrower demographics, LLM prompts, and credit scores as assessed by one or more LLMs.<FOOTNOTE ref="fn14">Unless specified otherwise, all experiments are conducted with LLM temperature parameters set to zero to reduce randomness in their replies. We show robustness to setting a higher temperature in Table A2.</FOOTNOTE> In Experiment 1 we focus on GPT-4 Turbo (specifically, gpt-4-0125-preview) and use the baseline prompt described above. For each of 1,000 real loan applications, we construct six fictional applications stratified across two races (Black and white) and three credit scores (640, 715, and 790). This results in 6,000 observations, and our most basic tests consider the following linear regression model:

<FULL_LINE_EQUATION ref="eq1">
  <MATH>y_{i,m} = \beta_{CS}CreditScore_{i,m} + \beta_{B}Black_{i,m} + \phi_i + u_{i,m},</MATH>
</FULL_LINE_EQUATION>

where <MATH>y_{i,m}</MATH> is the approval or rate suggestion made by the LLM for each real loan <MATH>i</MATH> (from the HMDA data) and experimental manipulation <MATH>m</MATH>, <MATH>CreditScore_{i,m}</MATH> is the assigned credit score, <MATH>Black_{i,m}</MATH> is a binary indicator variable for applications that designate a Black borrower, <MATH>\phi_i</MATH> is a loan fixed effect, and <MATH>u_{i,m}</MATH> is an econometric error term.

The fixed effects <MATH>\phi_i</MATH> ensure that <MATH>\beta_B</MATH> identifies how the approval and rate suggestions of the LLM differ for Black applicants relative to an otherwise-identical loan whose applicant is labeled as white. As such, <MATH>\beta_B</MATH> captures the direct effect of disparities in the LLM response to race disclosures while removing any indirect effects caused by triangulating information about applicants' race from loan-to-value, debt-to-income, income, or loan amount. Because we stratify manipulated credit score within each real loan <MATH>i</MATH>, the loan fixed effect does not absorb any variation in credit score. In tests focusing on suggested loan approval (interest rates), a negative (positive) estimate of <MATH>\beta_B</MATH> can be interpreted as evidence that the LLM generates less favorable suggestions for Black borrowers.
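A minimal sketch of the within (fixed-effects) estimator behind equation (1), assuming the LLM responses are collected in a long-format pandas DataFrame; the paper does not specify its estimation software, so this is purely illustrative:

```python
import numpy as np
import pandas as pd

def fe_ols(df, y, xs, fe):
    """Within-estimator: demean the outcome and each regressor inside
    every fixed-effect group (here, each real loan i), then run OLS on
    the demeaned data. The demeaning absorbs the loan fixed effects
    phi_i, so only within-loan variation identifies the coefficients."""
    g = df.groupby(fe)
    Y = (df[y] - g[y].transform("mean")).to_numpy()
    X = np.column_stack([(df[x] - g[x].transform("mean")).to_numpy()
                         for x in xs])
    beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return dict(zip(xs, beta))
```

For example, `fe_ols(df, "approved", ["CreditScore", "Black"], "loan_id")` would return estimates of <MATH>\beta_{CS}</MATH> and <MATH>\beta_B</MATH>; because credit score and race are stratified within each loan, neither is absorbed by the demeaning.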

To explore how racial differences vary across the spectrum of application credit quality, we also estimate regressions of the form

<FULL_LINE_EQUATION ref="eq2">
  <MATH>y_{i,m} = \beta_{CS}CreditScore_{i,m} + \beta_{B}Black_{i,m} + \beta'_{B \times X} Black_{i,m} \mathbf{X}_{i,m} + \phi_i + u_{i,m},</MATH>
</FULL_LINE_EQUATION>

where <MATH>\mathbf{X}_{i,m}</MATH> contains one or more measures of credit quality: credit score, debt-to-income ratio, or loan-to-value ratio. Note that when an element of <MATH>\mathbf{X}</MATH> represents credit score, we include both its main effect and its interaction term in the model. Where <MATH>\mathbf{X}</MATH> contains DTI and/or LTV, we include only the interactions, since DTI and LTV are constant across the experimental manipulations <MATH>m</MATH> and therefore their main effects are spanned by the fixed effects <MATH>\phi_i</MATH>.

The coefficients <MATH>\beta'_{B \times X}</MATH> in <a href="#eq2">equation (2)</a> assess whether LLM response differences are heterogeneous across credit quality, or equivalently whether credit score, debt-to-income ratio, and loan-to-value ratio have different effects on lending decisions for Black and white applicants.

We conduct several related experiments. Experiment 2 includes manipulations suggesting the applicant is Asian or Hispanic, or omitting race/ethnic information entirely. (Including applications without race information allows us to understand the impact of disclosing a borrower as white.) Experiment 3 replicates this approach across a variety of other leading LLMs to see if the patterns are specific to one model. Experiments 4 and 5 do not include explicit racial information, but instead use borrower names and cities, respectively, as proxies; we describe these in more detail in <a href="#subsec_proxies">Section 3.C</a>. In the appendix, we present tests exploring other protected borrower characteristics: Experiments A1 and A2 include manipulated applicant age or gender instead of racial information.

Given the existence of disparities documented in these experiments, we then proceed with Experiment 6 to assess the potential value of mitigation strategies. Every fictional application in Experiment 1 is repeated a second time, adding the italicized sentences below to the baseline prompt:

Please respond with two columns, separated by a comma:
1. <i>You should use no bias in making this decision:</i> Should this loan be approved? (1 for yes, 0 for no)
2. <i>You should use no bias in making this decision:</i> Which of the following interest rates would you offer? Choose from: 3.0%, 3.5%, ...

We call this prompt the “mitigation” prompt. Using it, we estimate

<FULL_LINE_EQUATION ref="eq3">
  <MATH>y_{i,m} = \beta_{CS}CreditScore_{i,m} + \beta_{B}Black_{i,m} + \beta_{M}Mitigation_{i,m} + \beta_{M \times CS}Mitigation_{i,m}CreditScore_{i,m} + \beta_{M \times B}Mitigation_{i,m}Black_{i,m} + \phi_i + u_{i,m},</MATH>
</FULL_LINE_EQUATION>

where <MATH>Mitigation_{i,m}</MATH> is a binary indicator variable for loan applications made with the mitigation prompt. When <MATH>\beta_B</MATH> and <MATH>\beta_{M \times B}</MATH> have opposing signs, this indicates that the mitigation prompt indeed alters LLM responses to limit (or perhaps even reverse) racial differences.

These tests help to understand how the mitigation prompt affects racial disparities on average. (Experiment A3 examines an alternative mitigation prompt.) To extend this, we assess whether these effects are heterogeneous across credit quality, estimating models of the form

<FULL_LINE_EQUATION ref="eq4">
  <MATH>y_{i,m} = \beta_{CS}CreditScore_{i,m} + \beta_{B}Black_{i,m} + \beta_{B \times CS}Black_{i,m}CreditScore_{i,m} + \beta_{M}Mitigation_{i,m} + \beta_{M \times CS}Mitigation_{i,m}CreditScore_{i,m} + \beta_{M \times B}Mitigation_{i,m}Black_{i,m} + \beta_{M \times B \times CS}Mitigation_{i,m}Black_{i,m}CreditScore_{i,m} + \phi_i + u_{i,m}.</MATH>
</FULL_LINE_EQUATION>

Here, <MATH>\beta_{B \times CS}</MATH> identifies the heterogeneity of racial disparities across credit scores for the baseline prompt, and <MATH>\beta_{M \times B \times CS}</MATH> identifies the relative change in that heterogeneity from using the mitigation prompt.

<SUBSECTION ref="subsec_data">2.C Data</SUBSECTION>
To ensure that the characteristics of the loan applications we send to the LLMs are realistic, we sample loan application data disclosed by financial institutions under the Home Mortgage Disclosure Act (HMDA). The HMDA data contain information on both approved and denied loans, which is essential for our research questions.

We download the Loan/Application Records (LAR) file containing loan applications made nationwide in 2022 and reported to the Consumer Financial Protection Bureau.<FOOTNOTE ref="fn15">Available at ffiec.cfpb.gov/data-publication/snapshot-national-loan-level-dataset/2022.</FOOTNOTE> We use 2022 data because this is after the training cutoff for the models in our paper and allows us to monitor two years of ex-post loan performance in <a href="#sec_additional_tests">Section 5</a>. We restrict the sample to conventional 30-year loans for principal residences secured by a first lien. We eliminate loans with balloon payments, negative amortization, interest-only payments, or business or commercial purposes. We also discard manufactured homes, reverse mortgages, and multi-unit dwellings.

For our audit study, we sample 1,000 applications from the LAR file.<FOOTNOTE ref="fn16">A standard two-sample proportions power test suggests a sample size of 962 per group is necessary to detect differences in loan acceptance rates greater than 3.7 percentage points (half of the 7.4% rejection rate in the full HMDA dataset, per Table A1) at 80% power and 5% significance.</FOOTNOTE> Panel A of <a href="#tbl2">Table II</a> reports summary statistics for this sample, showing that 92% of the loans were approved at an average interest rate of 4.98%. HMDA also provides the rate spread, which is defined as the difference between the loan's annual percentage rate and the average prime offer rate for a comparable transaction as of the date the interest rate is set. The average rate spread in our sample is 27 basis points. The average debt-to-income ratio (DTI) is 37.2%<FOOTNOTE ref="fn17">DTI is reported in HMDA (debt_to_income_ratio) as an integer percentage from 36% to 49%, or in buckets outside this range (e.g., 30%-36%), with winsorization below 20% and above 60%. We take the midpoint of the buckets and set DTI equal to the winsorization threshold for the lowest and highest buckets.</FOOTNOTE>, and the average loan-to-value ratio (LTV, combined_loan_to_value_ratio in HMDA) is just over 83%. We show in Appendix <a href="#tblA1">Table A1</a> that this subset of loans is representative of the loans in the overall LAR dataset.
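The sample-size calculation in footnote 16 can be reproduced with the classic pooled-variance formula for a two-sample test of proportions (a sketch using only the Python standard library; the authors' exact software is not stated):

```python
from math import ceil, sqrt
from statistics import NormalDist

def proportions_n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample test of
    proportions, using the standard pooled-variance approximation."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for alpha = 0.05
    z_b = NormalDist().inv_cdf(power)           # 0.84 for power = 0.80
    p_bar = (p1 + p2) / 2
    num = (z_a * sqrt(2 * p_bar * (1 - p_bar))
           + z_b * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(num / (p1 - p2) ** 2)

# Detect a 3.7 p.p. gap around the 92.6% acceptance rate implied by
# the 7.4% rejection rate in the full HMDA data (Table A1).
n = proportions_n_per_group(0.926, 0.889)  # 962, matching footnote 16
```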

<TABLE ref="tbl2">
  <CAPTION>Summary Statistics. Panel A reports summary statistics for the 1,000 observations we randomly selected from HMDA to fill out the loan applications. In addition, prompts are stratified over experimentally manipulated credit scores of 640, 715, and 790, giving a standard deviation of approximately 61 points (and a mean of 715). Panel B reports summary statistics of the LLM recommendations from each experiment listed in <a href="#tbl1">Table I</a>. Variables are defined in <a href="#sec_methodology">Section 2</a>. Approval in both panels is binary, and all other variables are reported as percentages from 0 to 100. We do not report information about the manipulated variables (demographic information and credit score), as they are evenly balanced within each experiment.</CAPTION>
  <thead>
    <tr><th colspan="5">Panel A: HMDA Loan Sample Variables</th></tr>
    <tr><th/><th>N</th><th>Mean</th><th>Std.</th><th>Median</th></tr>
  </thead>
  <tbody>
    <tr><td>Approval (Actual)</td><td>1,000</td><td>0.92</td><td>0.27</td><td>1.00</td></tr>
    <tr><td>Rate (Actual, %)</td><td>921</td><td>4.98</td><td>1.13</td><td>5.00</td></tr>
    <tr><td>Rate Spread (Actual, %)</td><td>909</td><td>0.27</td><td>0.72</td><td>0.30</td></tr>
    <tr><td>DTI (%)</td><td>1,000</td><td>37.2</td><td>9.4</td><td>38.0</td></tr>
    <tr><td>LTV (%)</td><td>1,000</td><td>83.2</td><td>14.5</td><td>85.0</td></tr>
    <tr><td colspan="10"><b>Panel B: Experimental Outcome Variables</b></td></tr>
    <tr><td/><td colspan="9">Experiment</td></tr>
    <tr><td/><td>1</td><td>2</td><td>3</td><td>4</td><td>5</td><td>6</td><td>A1</td><td>A2</td><td>A3</td></tr>
    <tr><td>Approval (LLM)</td><td/><td/><td/><td/><td/><td/><td/><td/><td/></tr>
    <tr><td>N</td><td>6,000</td><td>15,000</td><td>47,206</td><td>6,000</td><td>6,000</td><td>12,000</td><td>9,000</td><td>6,000</td><td>12,000</td></tr>
    <tr><td>Mean</td><td>0.91</td><td>0.93</td><td>0.87</td><td>0.91</td><td>0.93</td><td>0.94</td><td>0.94</td><td>0.95</td><td>0.91</td></tr>
    <tr><td>Std.</td><td>0.28</td><td>0.25</td><td>0.33</td><td>0.29</td><td>0.26</td><td>0.25</td><td>0.24</td><td>0.21</td><td>0.29</td></tr>
    <tr><td>Rate (LLM)</td><td/><td/><td/><td/><td/><td/><td/><td/><td/></tr>
    <tr><td>N</td><td>6,000</td><td>15,000</td><td>47,206</td><td>6,000</td><td>6,000</td><td>12,000</td><td>9,000</td><td>6,000</td><td>12,000</td></tr>
    <tr><td>Mean</td><td>4.55</td><td>4.49</td><td>4.75</td><td>4.52</td><td>4.49</td><td>4.45</td><td>4.54</td><td>4.41</td><td>4.62</td></tr>
    <tr><td>Std.</td><td>1.02</td><td>0.97</td><td>1.09</td><td>1.12</td><td>0.98</td><td>0.94</td><td>0.98</td><td>0.92</td><td>1.08</td></tr>
    <tr><td>Median</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td><td>4.50</td></tr>
  </tbody>
</TABLE>

<a href="#tbl2">Table II</a>, Panel B, reports summary statistics on LLM approval rate and interest rate suggestions separately for each experiment. Across experiments, 87-95% of loans are “approved” by the LLM with a suggested average interest rate of 4.41-4.75%, compared to an actual approval rate of 92% and interest rate of 4.98% in the HMDA data. Overall, average LLM recommendations are quite stable across experiments. The biggest deviation, although not statistically significant, occurs in Experiment 3. This is the only one that includes models besides GPT-4 Turbo, and these models on average recommend slightly lower approval rates and higher interest rates.

<SECTION ref="sec_main_results">3 Main results</SECTION>
This section presents the paper's primary results. We start with tests assessing whether our baseline LLM shows evidence of bias in making lending decisions. We then extend the analysis to other leading LLMs. Finally, we present tests using proxies for race, rather than direct signals.

<SUBSECTION ref="subsec_baseline_disparities">3.A Racial disparities in baseline LLM recommendations</SUBSECTION>
The results of Experiment 1 are presented in <a href="#tbl3">Table III</a>, which examines the two primary outcomes of an underwriting decision made by our baseline LLM: Whether a loan is approved (Panel A) and at what interest rate (Panel B).

<TABLE ref="tbl3">
  <CAPTION>Race and Recommendations. This table reports OLS regressions of loan approval recommendations (Panel A) and loan interest rate recommendations (Panel B) on loan applicants' racial identity, using Experiment 1 (see <a href="#tbl1">Table I</a>). The dependent variable in Panel A is the LLM loan approval recommendation, which equals one if the loan is approved and zero otherwise. In Panel B, the dependent variable is the LLM loan interest rate recommendation, measured in percentage points. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedasticity-robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th colspan="6">Panel A: Loan Approval Recommendations</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.043***</td><td>0.019***</td><td>0.043***</td><td>0.043***</td><td>0.019***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.003)</td><td>(0.003)</td><td>(0.003)</td></tr>
    <tr><td>Black</td><td>-0.085***</td><td>-0.085***</td><td>-0.085***</td><td>-0.085***</td><td>-0.085***</td></tr>
    <tr><td/><td>(0.005)</td><td>(0.005)</td><td>(0.005)</td><td>(0.005)</td><td>(0.005)</td></tr>
    <tr><td>Black × CreditScore (z)</td><td/><td>0.048***</td><td/><td/><td>0.048***</td></tr>
    <tr><td/><td/><td>(0.005)</td><td/><td/><td>(0.005)</td></tr>
    <tr><td>Black × DTI (z)</td><td/><td/><td>-0.063***</td><td/><td>-0.060***</td></tr>
    <tr><td/><td/><td/><td>(0.006)</td><td/><td>(0.006)</td></tr>
    <tr><td>Black × LTV (z)</td><td/><td/><td/><td>-0.042***</td><td>-0.035***</td></tr>
    <tr><td/><td/><td/><td/><td>(0.005)</td><td>(0.005)</td></tr>
    <tr><td>Obs</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.57</td><td>0.58</td><td>0.58</td><td>0.58</td><td>0.59</td></tr>
    <tr><td>Adj R²</td><td>0.48</td><td>0.49</td><td>0.50</td><td>0.49</td><td>0.51</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td colspan="6"><b>Panel B: Loan Interest Rate Recommendation</b></td></tr>
    <tr><td/><td>(1)</td><td>(2)</td><td>(3)</td><td>(4)</td><td>(5)</td></tr>
    <tr><td>CreditScore (z)</td><td>-0.689***</td><td>-0.632***</td><td>-0.689***</td><td>-0.689***</td><td>-0.632***</td></tr>
    <tr><td/><td>(0.006)</td><td>(0.007)</td><td>(0.006)</td><td>(0.006)</td><td>(0.007)</td></tr>
    <tr><td>Black</td><td>0.352***</td><td>0.352***</td><td>0.352***</td><td>0.352***</td><td>0.352***</td></tr>
    <tr><td/><td>(0.011)</td><td>(0.011)</td><td>(0.011)</td><td>(0.011)</td><td>(0.011)</td></tr>
    <tr><td>Black × CreditScore (z)</td><td/><td>-0.114***</td><td/><td/><td>-0.114***</td></tr>
    <tr><td/><td/><td>(0.011)</td><td/><td/><td>(0.011)</td></tr>
    <tr><td>Black × DTI (z)</td><td/><td/><td>0.091***</td><td/><td>0.084***</td></tr>
    <tr><td/><td/><td/><td>(0.013)</td><td/><td>(0.013)</td></tr>
    <tr><td>Black × LTV (z)</td><td/><td/><td/><td>0.065***</td><td>0.056***</td></tr>
    <tr><td/><td/><td/><td/><td>(0.011)</td><td>(0.011)</td></tr>
    <tr><td>Obs</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.85</td><td>0.86</td><td>0.85</td><td>0.85</td><td>0.86</td></tr>
    <tr><td>Adj R²</td><td>0.82</td><td>0.83</td><td>0.82</td><td>0.82</td><td>0.83</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>

The coefficients in column (1) of Panel A correspond to <a href="#eq1">Equation 1</a> above and show the effects of our manipulated variables on the likelihood of loan approval. The <i>CreditScore</i> coefficient is 0.043, positive and statistically significant at the 1% level with a standard error of 0.003.<FOOTNOTE ref="fn18">We report heteroskedasticity-robust standard errors. All results in the paper are robust to clustering at the loan level.</FOOTNOTE> Because the credit score variable has been standardized, a one standard deviation increase in credit score (61 points) raises the likelihood that the LLM recommends loan approval by 4.3 percentage points (p.p.).<FOOTNOTE ref="fn19">We find, in unreported tests available upon request, qualitatively identical results to those in Panel A using a logistic model.</FOOTNOTE>

More importantly, the <i>Black</i> coefficient is −0.085 and also highly significant, with a standard error of 0.005. This indicates that applications by a Black borrower are on average 8.5 p.p. less likely to receive an approval recommendation than those from otherwise-identical white applicants. The influence of being Black is noteworthy; its magnitude is about double the effect, in absolute value, of a one standard deviation change in borrower credit score. This suggests that the loan approval effect of listing an applicant as Black is roughly equivalent to a white applicant's credit score falling 120 points.

Having documented the existence of significant racial disparities in LLM mortgage loan approval on average, we assess variation in the difference across several dimensions of credit quality. Panel A, columns (2) through (5) present results of regression estimates as described in <a href="#eq2">Equation 2</a>. These tests incorporate interaction terms of <i>Black</i> with <i>CreditScore</i>, <i>DTI</i>, and <i>LTV</i>.<FOOTNOTE ref="fn20">Because the credit quality variables are standardized to have mean zero, the main Black coefficients are not affected by the inclusion of these interactions. Variation in DTI and LTV is completely absorbed by the loan fixed effects, so these variables are excluded from the models as standalone terms. <MATH>CreditScore_{i,m}</MATH> has variation across manipulations within loan, and so is included in the model.</FOOTNOTE> All interaction coefficients are statistically significant, whether included individually as in columns (2) through (4), or all together as in column (5).

Across all three measures of credit quality, the signs of the interaction coefficients are consistent with bias against Black borrowers being more pronounced for lower credit quality applications. The coefficient for the interaction of <i>Black</i> and <i>CreditScore</i> is 0.048 (positive, as higher credit score means higher credit quality); while the coefficients for the interactions with <i>DTI</i> and <i>LTV</i> are -0.063 and -0.042, respectively (negative, as lower DTI and LTV means higher credit quality). Given that these variables are standardized, the coefficients' magnitudes are directly comparable and notably similar. Thus, the heterogeneity in the racial penalty suggests that Black borrowers with lower credit quality applications are significantly less likely to be approved than white borrowers of similarly weak application credit quality. For example, based on the estimates in column (3), a Black applicant with a debt-to-income ratio that is one standard deviation above the mean is roughly 15 p.p. (0.085 + 0.063) less likely to be approved for a loan when compared to a white applicant with the same level of personal debt, ceteris paribus.<FOOTNOTE ref="fn21">If credit score, DTI, and LTV were somehow more informative about true credit quality for Black applicants, then the heterogeneous disparities might be described as reflecting a form of statistical discrimination since Black applicants are penalized more when these credit quality measures are low. However, we have no reason to believe these measures are in fact differentially informative, and as Guryan and Charles (2013) caution, "it is often possible to imagine a taste-based discrimination model that would generate the same empirical patterns that researchers use to infer the presence of statistical discrimination." Finally, as noted in footnote 22, below, we observe evidence of both approval and interest rate disparities even at the highest credit scores.</FOOTNOTE>

In Panel B, we repeat the tests estimating <a href="#eq1">Equations 1</a> and <a href="#eq2">2</a>, but using suggested interest rates as the dependent variable. The patterns are substantially the same, with all key coefficients' signs flipped. Black applicants are offered higher interest rates relative to white applicants, and higher credit scores are strongly associated with lower interest rates. Specifically, Black applicants' interest rates are 0.352 p.p. (≈ 35 basis points) higher on average than otherwise-identical white applicants'.

To contextualize the magnitude of estimated race effects, we can compare them to the impact of credit scores. In column (1), the estimated coefficient on <i>CreditScore</i> indicates that a one standard deviation increase in credit score decreases suggested interest rates by 0.689 p.p. (≈ 69 basis points) on average; the effect of listing an applicant as Black is therefore roughly equivalent to a white applicant reducing their credit score by about 30 points. Most studies do not report the effects of race and credit scores simultaneously, but one that does is <CITATION ref="cite_butler2023" title="Racial Disparities in the Auto Loan Market" authors="Butler, A.W., Mayer, E.J., Weston, J.P." year="2023" journal="The Review of Financial Studies">Butler et al. (2023)</CITATION> in the auto loan market. In their Table 8, they estimate <MATH>\beta_{Minority} = 0.704</MATH> and <MATH>\beta_{CreditScore} = -0.019</MATH>. Thus, their estimates imply that a minority applicant receives the same interest rate as an otherwise similar white applicant with a credit score 37 points lower, strikingly similar to the magnitude we obtain.
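The implied trade-off is a simple ratio of coefficients. A minimal sketch using the two Butler et al. (2023) Table 8 estimates quoted above (their credit score enters in points, so no rescaling by a standard deviation is needed):

```python
# Butler et al. (2023), Table 8 estimates quoted in the text:
beta_minority = 0.704       # interest-rate penalty for minority applicants (p.p.)
beta_credit_score = -0.019  # rate change per one-point credit score increase (p.p.)

# Credit-score change that offsets the race effect:
equivalent_points = beta_minority / abs(beta_credit_score)
print(f"credit score equivalent: {equivalent_points:.0f} points")  # ≈ 37
```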

When including interaction terms to check for variation in the racial disparities, we again find evidence that the LLM is disproportionately penalizing lower credit quality Black applicants relative to white applicants with a similar risk profile. That is, the coefficients on the interactions of <i>Black</i> with <i>CreditScore</i>, <i>DTI</i>, and <i>LTV</i> are negative (-0.114), positive (0.091), and positive (0.065), respectively, and highly statistically significant. Thus, lower credit quality (i.e., lower credit scores, higher DTI or LTV) is associated with larger interest rate penalties against Black applicants.<FOOTNOTE ref="fn22">The standalone Black coefficients are also much larger in magnitude than the coefficients on interactions with any of the credit quality measures. Given the standardization of each of these measures, our linear estimates suggest that even the highest credit quality Black applicants will not on average receive better outcomes than otherwise-identical white applicants. The comparisons for credit score are visualized by the dashed lines in Figure III, discussed below.</FOOTNOTE>

To translate these estimates into costs a consumer would face, consider a Black applicant applying to the LLM underwriter for a mortgage in 2022 with a credit score of 654 (one standard deviation below our sample mean). Our estimates suggest that this borrower faces an approval likelihood 13.3 p.p. lower than a similar white applicant (–0.085 – 0.048 per Panel A, columns 2 or 5). If the loan amount were the average of $334,000 reported in the HMDA data, the Black borrower's interest rate would be approximately 47 basis points higher (0.352 + 0.114 per Panel B). Using the average HMDA interest rate of 4.78% for 2022 for an applicant with an average credit score, a white applicant with a 654 credit score would have a 5.41% rate while a comparable Black applicant would have a rate of 5.88%; over the life of a 30-year mortgage, this Black applicant would pay around $35,700 more in (nominal) interest than a white applicant with the same credit profile.
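The lifetime-cost figure follows from the standard fixed-rate amortization formula. A minimal sketch reproducing the calculation under the assumptions in the text (30-year term, $334,000 principal, and the two rates derived above):

```python
def total_interest(principal: float, annual_rate_pct: float, years: int = 30) -> float:
    """Total nominal interest paid over a fully amortizing fixed-rate loan."""
    r = annual_rate_pct / 100 / 12               # monthly interest rate
    n = years * 12                               # number of monthly payments
    payment = principal * r / (1 - (1 + r) ** -n)  # standard annuity formula
    return payment * n - principal

loan = 334_000  # average HMDA loan amount
gap = total_interest(loan, 5.88) - total_interest(loan, 5.41)
print(f"extra nominal interest: ${gap:,.0f}")
```

The computed difference is approximately $35,700, matching the figure in the text.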

Experiment 2 extends our analysis to examine potential differences in loan approval decisions and interest rate recommendations across a broader spectrum of racial and ethnic groups. This experiment augments the sample of Experiment 1 with loan applications indicating an Asian or Hispanic borrower, and applications omitting race/ethnicity information entirely (referred to as “None” in <a href="#tbl1">Table I</a>). Results estimating analogues to <a href="#eq1">Equations 1</a> and <a href="#eq2">2</a> with “None” as the omitted category are reported in <a href="#tbl4">Table IV</a>. This experiment allows us to understand how biases faced by Black applicants relative to white ones fit into broader patterns of disparities affecting other groups. It also allows us to understand how the inclusion of any race/ethnicity information <i>including</i> a borrower's whiteness affects LLM responses.

<TABLE ref="tbl4">
  <CAPTION>Race, Ethnicity, and Recommendations. This table repeats the main OLS regressions in <a href="#tbl3">Table III</a> using Experiment 2 (see <a href="#tbl1">Table I</a>), which expands the list of demographics used in the application prompt to include <i>Asian</i>, <i>Hispanic</i>, or none. Approval—the dependent variable in columns (1) and (2) of each panel—equals one if the LLM suggests approving the application, and zero otherwise. Interest rate recommendations—in columns (3) and (4)—are in percentage points. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.033***</td><td>0.023***</td><td>-0.664***</td><td>-0.662***</td></tr>
    <tr><td/><td>(0.001)</td><td>(0.003)</td><td>(0.003)</td><td>(0.006)</td></tr>
    <tr><td>Asian</td><td>-0.001</td><td>-0.001</td><td>-0.062***</td><td>-0.062***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.009)</td><td>(0.008)</td></tr>
    <tr><td>Black</td><td>-0.077***</td><td>-0.077***</td><td>0.301***</td><td>0.301***</td></tr>
    <tr><td/><td>(0.005)</td><td>(0.005)</td><td>(0.011)</td><td>(0.011)</td></tr>
    <tr><td>Hispanic</td><td>-0.012***</td><td>-0.012***</td><td>0.117***</td><td>0.117***</td></tr>
    <tr><td/><td>(0.004)</td><td>(0.004)</td><td>(0.008)</td><td>(0.008)</td></tr>
    <tr><td>White</td><td>0.008**</td><td>0.008**</td><td>-0.051***</td><td>-0.051***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.008)</td><td>(0.008)</td></tr>
    <tr><td>Asian × CreditScore (z)</td><td/><td>0.000</td><td/><td>0.047***</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.009)</td></tr>
    <tr><td>Black × CreditScore (z)</td><td/><td>0.044***</td><td/><td>-0.084***</td></tr>
    <tr><td/><td/><td>(0.005)</td><td/><td>(0.011)</td></tr>
    <tr><td>Hispanic × CreditScore (z)</td><td/><td>0.009**</td><td/><td>-0.002</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.009)</td></tr>
    <tr><td>White × CreditScore (z)</td><td/><td>-0.004</td><td/><td>0.030***</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.008)</td></tr>
    <tr><td>Obs</td><td>15,000</td><td>15,000</td><td>15,000</td><td>15,000</td></tr>
    <tr><td>R²</td><td>0.59</td><td>0.60</td><td>0.86</td><td>0.86</td></tr>
    <tr><td>Adj R²</td><td>0.56</td><td>0.57</td><td>0.85</td><td>0.85</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>

The results in <a href="#tbl4">Table IV</a> reveal interesting patterns across these groups. Because this specification uses None (i.e., no race signal) as the omitted category, the coefficients on the race/ethnicity indicators represent the effect of including a given racial or ethnic label relative to disclosing no race information. This contrasts with <a href="#tbl3">Table III</a>, where coefficients are interpreted relative to a white applicant. Within this framework, white and Asian applicants receive modestly more favorable responses than those with no disclosed race, while Black and Hispanic applicants experience notably worse outcomes—particularly Black applicants, who face the largest disparities in both approval and interest rate recommendations.

Interaction terms between race/ethnicity indicators and credit score provide additional insights. Black applicants are the only group with significant interaction coefficients across both outcome variables, with signs and magnitudes indicating that higher credit scores can reduce some of the disparities that Black borrowers suffer (but do not eliminate them). Hispanic borrowers with low credit scores also suffer worse approval disparities, although the magnitude of this effect is much smaller than for Black borrowers.

Finally, we consider two experiments on other protected borrower characteristics: age and gender. Experiment A1 replaces signals of race/ethnicity in the loan applications with indications that the applicant is age 30, 50, or 70. Results are reported in Appendix <a href="#tblA3">Table A3</a>. We find that 70-year-olds receive approval recommendations 1.6 p.p. less often than 30-year-olds, and average interest rates 17.3 basis points higher; both differences are statistically significant at the 1% level. The gaps between 50- and 30-year-olds go in the same direction, but have a magnitude roughly a quarter of the size. These results echo the findings of <CITATION ref="cite_amornsiripanitch2023" title="The Age Gap in Mortgage Access" authors="Amornsiripanitch, N." year="2023" journal="Working Paper (Federal Reserve Bank of Philadelphia) 23-03">Amornsiripanitch (2023)</CITATION>, which finds that mortgage access declines with age in observational data. When we allow the impact of credit quality to vary with age, we estimate highly statistically significant coefficients on the interaction between the age-70 indicator and credit score, indicating that lower credit scores are penalized even more heavily for the oldest applicants. Experiment A2 instead considers signals that an applicant is male or female; the results in Appendix <a href="#tblA4">Table A4</a> fail to detect evidence of statistically significant gender differences.

<SUBSECTION ref="subsec_other_llms">3.B Racial disparities in other LLMs</SUBSECTION>
We now turn to Experiment 3 to assess whether the key results described above are consistent across different LLMs. We extend our sample to include responses to the same set of prompts from a number of LLMs from Anthropic (Claude 3 Sonnet and Opus), Meta (Llama 3 8b and 70b), and OpenAI (GPT-3.5 Turbo 2023, GPT-3.5 Turbo 2024, GPT-4, and the baseline LLM GPT-4 Turbo).<FOOTNOTE ref="fn23">We provide more information on these models, including specific API version names, in the Appendix. Sonnet and Llama 3 8b are smaller and faster versions compared to Opus and Llama 3 70b and tend to perform worse on benchmarking tests than the larger models. While we consider several different generations of models, all these prompts were run at roughly the same time and therefore represent a cross-section of leading LLMs available in mid-2024.</FOOTNOTE> These LLMs are selected because they are the most advanced models available via API calls as of April 2024.

<TABLE ref="tbl5">
  <CAPTION>Race and Recommendations (LLM Comparison). This table reports the OLS regressions of loan approval recommendations (Panel A) and loan interest rate recommendations (Panel B) on loan applicants' racial identity based on responses collected from eight leading LLMs. We estimate <a href="#eq1">Equation 1</a>, using Experiment 3 (see <a href="#tbl1">Table I</a>) to replicate Experiment 1 with other leading LLMs. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects. Note that the Llama 3 8b model recommends approval for 100% of loan applications in our sample, which precludes estimating the regression of loan approval recommendations in Panel A, column (3). The coefficients here are presented visually in <a href="#fig2">Figure II</a>.</CAPTION>
  <thead>
    <tr><th colspan="9">Panel A: Loan Approval Recommendations</th></tr>
    <tr><th>Family</th><th colspan="2">Anthropic Claude 3</th><th colspan="2">Meta Llama 3</th><th colspan="4">OpenAI GPT</th></tr>
    <tr><th>Model</th><th>Sonnet</th><th>Opus</th><th>8b</th><th>70b</th><th>3.5 Turbo</th><th>3.5 Turbo</th><th>4</th><th>4 Turbo</th></tr>
    <tr><th>Date</th><th>2024</th><th>2024</th><th>2024</th><th>2024</th><th>2023</th><th>2024</th><th>2023</th><th>2024</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th><th>(6)</th><th>(7)</th><th>(8)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.009***</td><td>0.127***</td><td/><td>0.006***</td><td>0.292***</td><td>0.138***</td><td>0.075***</td><td>0.043***</td></tr>
    <tr><td/><td>(0.001)</td><td>(0.004)</td><td/><td>(0.001)</td><td>(0.004)</td><td>(0.004)</td><td>(0.003)</td><td>(0.003)</td></tr>
    <tr><td>Black</td><td>-0.011***</td><td>-0.098***</td><td/><td>-0.003</td><td>-0.319***</td><td>-0.083***</td><td>0.003</td><td>-0.085***</td></tr>
    <tr><td/><td>(0.002)</td><td>(0.008)</td><td/><td>(0.002)</td><td>(0.008)</td><td>(0.007)</td><td>(0.006)</td><td>(0.005)</td></tr>
    <tr><td>Obs</td><td>5,989</td><td>5,215</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.81</td><td>0.63</td><td>.</td><td>0.68</td><td>0.64</td><td>0.48</td><td>0.65</td><td>0.57</td></tr>
    <tr><td>Adj R²</td><td>0.77</td><td>0.55</td><td>.</td><td>0.61</td><td>0.57</td><td>0.38</td><td>0.59</td><td>0.48</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td>Avg(y)</td><td>0.97</td><td>0.80</td><td>1.00</td><td>0.99</td><td>0.58</td><td>0.86</td><td>0.87</td><td>0.91</td></tr>
    <tr><td>Avg(y | White)</td><td>0.97</td><td>0.84</td><td>1.00</td><td>0.99</td><td>0.74</td><td>0.90</td><td>0.87</td><td>0.96</td></tr>
    <tr><td>Avg(y | Black)</td><td>0.96</td><td>0.74</td><td>1.00</td><td>0.99</td><td>0.42</td><td>0.82</td><td>0.87</td><td>0.87</td></tr>
    <tr><td>White Answer Rate (%)</td><td>99.83</td><td>99.57</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td></tr>
    <tr><td>Black Answer Rate (%)</td><td>99.80</td><td>74.27</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td></tr>
    <tr><td colspan="9"><b>Panel B: Loan Interest Rate Recommendations</b></td></tr>
    <tr><th>Family</th><th colspan="2">Anthropic Claude 3</th><th colspan="2">Meta Llama 3</th><th colspan="4">OpenAI GPT</th></tr>
    <tr><th>Model</th><th>Sonnet</th><th>Opus</th><th>8b</th><th>70b</th><th>3.5 Turbo</th><th>3.5 Turbo</th><th>4</th><th>4 Turbo</th></tr>
    <tr><th>Date</th><th>2024</th><th>2024</th><th>2024</th><th>2024</th><th>2023</th><th>2024</th><th>2023</th><th>2024</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th><th>(6)</th><th>(7)</th><th>(8)</th></tr>
    <tr><td>CreditScore (z)</td><td>-0.719***</td><td>-0.867***</td><td>-0.283***</td><td>-0.430***</td><td>-0.892***</td><td>-0.843***</td><td>-0.842***</td><td>-0.690***</td></tr>
    <tr><td/><td>(0.004)</td><td>(0.005)</td><td>(0.004)</td><td>(0.003)</td><td>(0.008)</td><td>(0.006)</td><td>(0.007)</td><td>(0.006)</td></tr>
    <tr><td>Black</td><td>0.193***</td><td>0.238***</td><td>0.067***</td><td>0.237***</td><td>0.473***</td><td>0.365***</td><td>0.093***</td><td>0.352***</td></tr>
    <tr><td/><td>(0.008)</td><td>(0.011)</td><td>(0.007)</td><td>(0.006)</td><td>(0.016)</td><td>(0.012)</td><td>(0.013)</td><td>(0.011)</td></tr>
    <tr><td>Obs</td><td>5,989</td><td>5,215</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.90</td><td>0.91</td><td>0.66</td><td>0.89</td><td>0.76</td><td>0.85</td><td>0.84</td><td>0.85</td></tr>
    <tr><td>Adj R²</td><td>0.88</td><td>0.89</td><td>0.59</td><td>0.87</td><td>0.72</td><td>0.81</td><td>0.81</td><td>0.82</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td>Avg(y)</td><td>5.52</td><td>5.64</td><td>4.32</td><td>4.29</td><td>4.65</td><td>4.47</td><td>4.63</td><td>4.55</td></tr>
    <tr><td>Avg(y | White)</td><td>5.42</td><td>5.54</td><td>4.29</td><td>4.17</td><td>4.42</td><td>4.29</td><td>4.59</td><td>4.38</td></tr>
    <tr><td>Avg(y | Black)</td><td>5.61</td><td>5.78</td><td>4.36</td><td>4.40</td><td>4.89</td><td>4.66</td><td>4.68</td><td>4.73</td></tr>
    <tr><td>White Answer Rate (%)</td><td>99.83</td><td>99.57</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td></tr>
    <tr><td>Black Answer Rate (%)</td><td>99.80</td><td>74.27</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td><td>100.00</td></tr>
  </tbody>
</TABLE>

<a href="#tbl5">Table V</a> presents estimates of <a href="#eq1">Equation 1</a> and confirms that the pattern of disparities we find in the baseline LLM is present in other models. With only a few exceptions, the effects of <i>CreditScore</i> and <i>Black</i> are largely consistent in signs and significance across the different models. Higher credit scores substantially increase the probability of loan approval and lead to lower interest rates. Meanwhile, being Black (compared to being white) is associated with a decreased probability of loan approval—except for the 2023 version of GPT-4 and the larger Llama 3 model from Meta—and leads to relatively higher interest rates in all models.<FOOTNOTE ref="fn24">We do not estimate loan approval using Llama 3's smaller model since it approves all applications.</FOOTNOTE>

The summary statistics at the bottom of each panel highlight the nuances that different AI data-generating models can introduce when used in finance applications. In Panel A, which shows estimations of LLM loan approval recommendations, we observe substantial variation across models in the proportion of applications approved (“Avg(y)”), with rates ranging from 58% for the 2023 version of GPT-3.5 Turbo (column 5) to 99%-100% for the Llama 3 models (columns 3-4). Columns (1) and (2) focus on models by Anthropic. Column (1) considers Sonnet, a smaller model that recommends approval for 97% of loans. Despite this high approval rate, there is a statistically significant difference in its approval rates for Black applicants. Column (2) examines Anthropic's more advanced model (Opus), which displays hesitancy in responding to prompts describing a borrower as Black, responding just 74% of the time.<FOOTNOTE ref="fn25">Answer rates take into account the fact that we attempt a prompt up to ten times if an LLM doesn't provide a properly formatted response. Interestingly, Opus's answer rate for white applicants is nearly 100%; it seems that refusing to respond is not simply a function of the presence of information on protected characteristics independent of their value. Claude Opus responds to queries listing the applicant's race as Black roughly three times as slowly, and often answers—if not given a limit on reply length—with "I apologize, but I do not feel comfortable providing a recommendation on loan approval or interest rates based on the limited information provided, especially given the inclusion of race as a factor. Lending decisions should be made objectively based on relevant financial criteria, not personal characteristics like race. I would suggest speaking with a qualified loan officer who can provide guidance in compliance with fair lending laws and regulations."</FOOTNOTE> Nevertheless, the Opus model recommends approval for Black applicants 9.8 percentage points less often than for identical white applicants, a difference much larger than that for the less sophisticated Sonnet model. This suggests that larger and more advanced models will not necessarily reduce the disparities we document. Columns (3) and (4) focus on models by Meta. With near-universal loan approval for the Llama 3 models, it is unsurprising that we do not observe significant evidence of racial differences in their responses. The remaining columns focus on OpenAI models. The baseline LLM for our study, GPT-4 Turbo, falls between these extremes, suggesting approval for 91% of loans in Experiment 1; the true mortgage approval rate in our HMDA sample is 92% per <a href="#tbl2">Table II</a>.

Panel B, where the outcome variable is the interest rate recommendation, shows even greater consistency with our primary results from the baseline LLM. This consistency may stem from the fact that while loan approval is a binary decision, interest rate recommendations admit more subtle outcome disparities. All eight <i>Black</i> coefficients are positive and significant at the 1% level or better.

As we did following Experiment 1, we can contextualize the economic magnitude of the racial disparities presented in <a href="#tbl5">Table V</a> by computing the implied decrease in credit score for a white applicant that would generate an effect as large as instead listing the applicant as Black. We refer to this as the “credit score equivalent” of the estimated racial disparity; it is calculated as <MATH>\hat{\beta}_B / \hat{\beta}_{CS}</MATH> multiplied by the sample standard deviation of credit scores. Across LLMs, the average credit score equivalent is approximately 52 points for approval decisions and 24 points for interest rate suggestions. This magnitude is close to the 37-point-equivalent minority penalty in auto lending implied by estimates in <CITATION ref="cite_butler2023" title="Racial Disparities in the Auto Loan Market" authors="Butler, A.W., Mayer, E.J., Weston, J.P." year="2023" journal="The Review of Financial Studies">Butler et al. (2023)</CITATION>.
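The credit score equivalents can be computed directly from the Panel B coefficients in Table V. A minimal sketch, where <code>SD_ASSUMED</code> is a placeholder for the sample standard deviation of credit scores (not reported in this excerpt):

```python
# Hypothetical placeholder: substitute the sample SD of credit scores.
SD_ASSUMED = 59

# (beta_Black, beta_CreditScore) from Table V, Panel B, per model:
rate_coefs = {
    "Claude 3 Sonnet":   (0.193, -0.719),
    "Claude 3 Opus":     (0.238, -0.867),
    "Llama 3 8b":        (0.067, -0.283),
    "Llama 3 70b":       (0.237, -0.430),
    "GPT-3.5 Turbo '23": (0.473, -0.892),
    "GPT-3.5 Turbo '24": (0.365, -0.843),
    "GPT-4":             (0.093, -0.842),
    "GPT-4 Turbo":       (0.352, -0.690),
}

for model, (b_black, b_cs) in rate_coefs.items():
    sd_units = b_black / abs(b_cs)  # race penalty expressed in credit-score SDs
    print(f"{model:18s} {sd_units:.2f} SD  ~ {sd_units * SD_ASSUMED:.0f} points")
```

With an SD near 60, the GPT-4 Turbo ratio of about 0.51 SD reproduces the roughly 30-point equivalent discussed above for the baseline LLM.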

To understand how racial disparities in the responses of each LLM vary heterogeneously across the credit spectrum, we estimate regressions of <a href="#eq2">Equation 2</a> and present the results (visually, for brevity) in <a href="#fig2">Figure II</a>. Approval decisions are on the left side of the figure and interest rates on the right. For each outcome and each LLM, we show the coefficients on <i>CreditScore</i>, <i>Black</i>, and the interaction term. Point estimates are represented by dots, with bars showing 95% confidence intervals; green indicates statistical significance at the 5% level.

<FIGURE ref="fig2">
  <CAPTION>Mortgage Underwriting Decisions by Leading LLMs. This figure illustrates the estimated coefficients from Experiment 3 (see <a href="#tbl1">Table I</a>), which estimates <a href="#eq2">Equation 2</a> with various leading LLM models. Coefficients that are statistically significant at the 5% level are shown in green and are red otherwise. As shown in <a href="#tbl5">Table V</a>, the Llama 3 8b model recommends approval for 100% of loans and is thus omitted from the approval subfigure.</CAPTION>
  <IMG src="fig-41.png"/>
</FIGURE>

In total, 21 of 24 interest rate coefficients are significant at the 1% level or better, and in all eight models the average rate is higher for Black applicants. The precision of the estimates on the interaction terms is more varied, but coefficients are mostly positive and significant in the approval regressions and negative and significant in the interest rate regressions. Most models from Anthropic (Claude) and OpenAI (GPT) generate racial disparities that differ by credit quality, with lower-credit-score Black applicants obtaining even less favorable outcomes than white applicants. However, insignificant approval estimates for GPT-4 (2023) and Llama 3 70b demonstrate the complex and somewhat model-dependent nature of how racial factors interact with credit scoring in determining loan approval and interest rates.

Overall, our core findings are robust across different LLM providers and model characteristics (number of parameters, generation, and training date). The appendix contains additional tests confirming that our results hold when we vary model temperature and when we repeat the core experiment at a later date, assessing the stability of our results over time.

<SUBSECTION ref="subsec_proxies">3.C Racial disparities when exposed to proxies for race</SUBSECTION>
In most practical settings where LLMs interface with customers, financial firms are unlikely to intentionally include explicit race information in the model's information set. However, information that is correlated with borrower race will likely be available, such as applicant name, residential address, and even stated income, occupation, or education. These features can act as proxies for race, particularly when combined, and LLMs may be able to reconstruct protected class membership even when race is explicitly excluded from the data. As a result, an LLM exposed to naturalistic input may internalize and act on inferred racial information.

To evaluate whether LLMs exhibit disparities in response to implicit racial signals, we conduct two additional experiments. In Experiment 4, we assign applicant names that are perceived to be strongly associated with either Black individuals or white individuals. In Experiment 5, we vary the applicant's city of residence to correspond to geographic areas with higher or lower proportions of Black residents. As in Experiment 1, we hold all other information constant, stratifying across race proxies and credit scores for each of our 1,000 base loan applications.

In Experiment 4, applicant names are randomly assigned for each underlying loan and credit score from the set of “validated names for experimental studies on race and ethnicity” developed by <CITATION ref="cite_crabtree2023" title="Validated names for experimental studies on race and ethnicity" authors="Crabtree, C., Kim, J.Y., Gaddis, S.M., Holbein, J.B., Guage, C., Marx, W.W." year="2023" journal="Scientific Data">Crabtree et al. (2023)</CITATION>. We restrict to names perceived as either Black or white by more than 80% of survey respondents in their study. The results, reported in Panel A of <a href="#tbl6">Table VI</a>, show that applications with distinctively Black names receive approval recommendations 1.3 percentage points less often and are offered interest rates 10 basis points higher than otherwise-identical applications with white names.

<TABLE ref="tbl6">
  <CAPTION>Race Proxies and Recommendations. This table reports the OLS regressions of loan approval recommendations and loan interest rate recommendations on signals that proxy for loan applicants' racial identity. These results are analogous to those in columns (1) and (2) of <a href="#tbl3">Table III</a>, but indicating race in the LLM prompt implicitly rather than explicitly. Panel A shows Experiment 4, where prompts include a name perceived as distinctively Black or white per <CITATION ref="cite_crabtree2023" title="Validated names for experimental studies on race and ethnicity" authors="Crabtree, C., Kim, J.Y., Gaddis, S.M., Holbein, J.B., Guage, C., Marx, W.W." year="2023" journal="Scientific Data">Crabtree et al. (2023)</CITATION>. Panel B shows Experiment 5, where prompts indicate that each loan is from a city in the relevant state that has either a high or low Black population fraction. Approval—the dependent variable in columns (1) and (2) of each panel—equals one if the LLM suggests approving the application, and zero otherwise. Interest rate recommendations—in columns (3) and (4)—are in percentage points. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th colspan="5">Panel A: Name-Based Race Proxies</th></tr>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.053***</td><td>0.047***</td><td>-0.758***</td><td>-0.732***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.004)</td><td>(0.006)</td><td>(0.009)</td></tr>
    <tr><td>BlackName</td><td>-0.013***</td><td>-0.013***</td><td>0.101***</td><td>0.101***</td></tr>
    <tr><td/><td>(0.005)</td><td>(0.005)</td><td>(0.012)</td><td>(0.012)</td></tr>
    <tr><td>BlackName × CreditScore (z)</td><td/><td>0.012**</td><td/><td>-0.052***</td></tr>
    <tr><td/><td/><td>(0.005)</td><td/><td>(0.012)</td></tr>
    <tr><td>Obs</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.64</td><td>0.64</td><td>0.87</td><td>0.87</td></tr>
    <tr><td>Adj R²</td><td>0.57</td><td>0.57</td><td>0.84</td><td>0.84</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td colspan="5"><b>Panel B: City-Based Race Proxies</b></td></tr>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
    <tr><td>CreditScore (z)</td><td>0.038***</td><td>0.038***</td><td>-0.683***</td><td>-0.672***</td></tr>
    <tr><td/><td>(0.002)</td><td>(0.003)</td><td>(0.005)</td><td>(0.007)</td></tr>
    <tr><td>BlackCity</td><td>-0.003</td><td>-0.003</td><td>0.062***</td><td>0.062***</td></tr>
    <tr><td/><td>(0.004)</td><td>(0.004)</td><td>(0.009)</td><td>(0.009)</td></tr>
    <tr><td>BlackCity × CreditScore (z)</td><td/><td>-0.001</td><td/><td>-0.021**</td></tr>
    <tr><td/><td/><td>(0.005)</td><td/><td>(0.010)</td></tr>
    <tr><td>Obs</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.67</td><td>0.67</td><td>0.89</td><td>0.89</td></tr>
    <tr><td>Adj R²</td><td>0.61</td><td>0.61</td><td>0.87</td><td>0.87</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>

These disparities are statistically significant and economically meaningful, with magnitudes 15-29% of the effects we estimate when race is disclosed explicitly (in Experiment 1). Consistent with our earlier findings, the disparities associated with inferred race are largest for applicants with lower credit scores.

In Experiment 5, our prompts include borrower city and state in lieu of race or racialized names. For each state, we use the two cities with the highest and lowest fraction of Black residents, according to the 2020 Census.<FOOTNOTE ref="fn26">We consider all Census "places" (e.g., cities, towns, etc.) with population greater than 50,000 in the 2020 Census Redistricting Data (Public Law 94-171) Summary File.</FOOTNOTE> The results are reported in Panel B of <a href="#tbl6">Table VI</a>. Otherwise-identical loan applications receive interest rates 6 basis points higher in their state's most-Black city compared to its least-Black, and these (statistically significant) disparities are once again larger at lower credit scores. (Loan approval rates are lower for the Black cities, but this effect is not statistically significant.)<FOOTNOTE ref="fn27">Smaller magnitudes associated with our geographic signal compared to the name-based signal are perhaps unsurprising given that the ability to infer race from city is on average much weaker. For example, while five states have cities whose Black populations vary by more than 70 percentage points (e.g., Stonecrest, GA at 92.3% vs. Alpharetta, GA at 10.3%), in twelve states the largest difference is under 5 p.p. (e.g., Great Falls, MT at 1.2% vs. Bozeman, MT at 0.6%).</FOOTNOTE>

These two experiments show systematic racial disparities in LLM financial recommendations even in the absence of explicit signals. Of course, the implicit signals we consider—name and city—could be associated with underwriting and loan outcomes for reasons other than race. In particular, borrowers with distinctively Black names and cities with large Black populations may have socioeconomic characteristics that make loans riskier or less profitable. However, we find systematic differences in outcomes even after controlling for credit score, DTI, LTV, loan amount, property value, income, and state.

Taken together, the results provide novel empirical evidence that LLMs may exhibit race-correlated disparities through inference, not just disclosure—a dynamic with important implications for the design of race-blind decision systems.

<SECTION ref="sec_mitigating">4 Mitigating racial disparities</SECTION>
Having established in prior sections that LLMs produce racially disparate lending recommendations, we now evaluate whether these disparities are manageable. While earlier experiments provide a structured audit of baseline model behavior, this section demonstrates that LLM behavior can also be directed. Specifically, we test whether simple prompt-based instructions can systematically reduce racial disparities in loan approval and pricing recommendations. This approach does not seek to identify an optimal solution, but rather to assess whether and how LLM responses can be shaped after risks are identified. The results suggest a promising insight for model users and financial regulators: straightforward, low-cost interventions can manage LLM risks.

We evaluate this possibility empirically in Experiment 6, which introduces a minimal prompt-based instruction designed to reduce bias. We examine LLM responses to what we call the “mitigation” prompt, which adds the following simple statement before each question posed in our baseline prompt: “You should use no bias in making this decision:”.
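Operationally, the mitigation treatment amounts to splicing a fixed prefix into the prompt template. A minimal sketch follows; the application text and base question are illustrative placeholders rather than the paper's exact prompt wording (only the mitigation sentence is quoted from the text above):

```python
# Sketch of splicing a mitigation prefix into a prompt template.
# BASE_QUESTION and the application text are hypothetical placeholders;
# MITIGATION_PREFIX is the sentence quoted in the paper.
BASE_QUESTION = "Should this mortgage application be approved?"
MITIGATION_PREFIX = "You should use no bias in making this decision:"

def build_prompt(application_text: str, mitigate: bool) -> str:
    """Return the full prompt, optionally prepending the mitigation sentence."""
    question = f"{MITIGATION_PREFIX} {BASE_QUESTION}" if mitigate else BASE_QUESTION
    return f"{application_text}\n\n{question}"

prompt = build_prompt("Credit score: 680. DTI: 36%. LTV: 80%.", mitigate=True)
print(prompt)
```

Because the two conditions differ only in this prefix, any difference in responses across the paired prompts can be attributed to the instruction itself.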

We supplement the responses to the baseline prompt in Experiment 1 (<MATH>N = 6,000</MATH>) with responses to the mitigation prompt for exactly the same loans and race/credit score manipulations. The combined sample of 12,000 observations is analyzed using regression models as described in <a href="#eq3">Equation 3</a> (to understand how mitigation affects racial disparities on average) and <a href="#eq4">Equation 4</a> (to understand how mitigation's racialized effects vary by credit score). The results are presented in <a href="#tbl7">Table VII</a>, where columns (1) and (2) display the results for the loan approval recommendations and columns (3) and (4) present the results for interest rate recommendations.

<TABLE ref="tbl7">
  <CAPTION>Mitigation Prompt and Recommendations. This table reports OLS regressions of loan approval recommendations (columns 1-2) and loan interest rate recommendations (columns 3-4) on loan applicants' racial identity, using Experiment 6 (see <a href="#tbl1">Table I</a>) where the LLM instructions are experimentally varied. <i>Mitigation</i> equals one in observations where the LLM responded to the mitigation prompt and zero if it responded to the baseline prompt. The prompts are shown in <a href="#subsec_empirical_strategy">Section 2.B</a>. The dependent variable in columns (1)–(2) is a binary variable that equals one if the loan is approved, and zero otherwise. The LLM loan interest rate recommendations, measured in percentage points, are the dependent variable in columns (3)–(4). To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects. Variables are defined in <a href="#sec_methodology">Section 2</a>.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.043***</td><td>0.019***</td><td>-0.689***</td><td>-0.632***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.006)</td><td>(0.006)</td></tr>
    <tr><td>Black</td><td>-0.085***</td><td>-0.085***</td><td>0.352***</td><td>0.352***</td></tr>
    <tr><td/><td>(0.005)</td><td>(0.005)</td><td>(0.011)</td><td>(0.011)</td></tr>
    <tr><td>Black × CreditScore (z)</td><td/><td>0.048***</td><td/><td>-0.114***</td></tr>
    <tr><td/><td/><td>(0.005)</td><td/><td>(0.011)</td></tr>
    <tr><td>Mitigation</td><td>0.002</td><td>0.002</td><td>-0.107***</td><td>-0.107***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.008)</td><td>(0.008)</td></tr>
    <tr><td>Mitigation × CreditScore (z)</td><td>-0.029***</td><td>-0.004</td><td>0.090***</td><td>0.050***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.004)</td><td>(0.007)</td><td>(0.008)</td></tr>
    <tr><td>Mitigation × Black</td><td>0.086***</td><td>0.086***</td><td>-0.214***</td><td>-0.214***</td></tr>
    <tr><td/><td>(0.006)</td><td>(0.006)</td><td>(0.014)</td><td>(0.014)</td></tr>
    <tr><td>Mitigation × Black × CreditScore (z)</td><td/><td>-0.050***</td><td/><td>0.079***</td></tr>
    <tr><td/><td/><td>(0.006)</td><td/><td>(0.014)</td></tr>
    <tr><td>Obs</td><td>12,000</td><td>12,000</td><td>12,000</td><td>12,000</td></tr>
    <tr><td>R²</td><td>0.58</td><td>0.58</td><td>0.85</td><td>0.85</td></tr>
    <tr><td>Adj R²</td><td>0.54</td><td>0.55</td><td>0.84</td><td>0.84</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td>p-val: <MATH>\beta_B + \beta_{B \times M} = 0</MATH></td><td>0.83</td><td>0.83</td><td>0.00</td><td>0.00</td></tr>
    <tr><td>p-val: <MATH>\beta_{B \times CS} + \beta_{B \times CS \times M} = 0</MATH></td><td/><td>0.47</td><td/><td>0.00</td></tr>
  </tbody>
</TABLE>

Because we include <i>Mitigation</i> as a separate independent variable and interacted with all other terms, the first three coefficients are driven by the baseline prompt observations and thus match the results in <a href="#tbl3">Table III</a>. The coefficient on <i>Mitigation</i> shows that among white applicants, the mitigation prompt does not significantly change the average approval rate but lowers the average suggested interest rate by 10.7 basis points. The mitigation prompt also dampens the effect of credit score on interest rate recommendations (but not approval rates) for white applicants, from 63.2bp per standard deviation in score to 58.2bp (see column 4).<FOOTNOTE ref="fn28">The coefficient on Mitigation × CreditScore is negative and significant for approval decisions in column (1), but this reflects the prompt's reduction in rejections for low-credit-score Black applicants. Column (2) isolates how the mitigation prompt affects white borrowers' sensitivity to credit score.</FOOTNOTE>

The key results for this table are in the rows whose coefficients involve both <i>Mitigation</i> and <i>Black</i>. Regarding approval decisions in columns (1) and (2), the coefficient on the <i>Mitigation × Black</i> interaction term is positive and significant: the explicit instruction to avoid bias reduces the average approval penalty against Black applicants by 8.6 percentage points. The <i>Black</i> and <i>Mitigation × Black</i> coefficients essentially offset each other, indicating that the mitigation prompt effectively neutralizes the average racial disparity.<FOOTNOTE ref="fn29">We find qualitatively identical results for approval decisions modeled using logistic regression in unreported tests. With the mitigation prompt, the linear model does not reject the absence of racial differences in approval recommendations on average (p = 0.83).</FOOTNOTE>

The results for interest rate recommendations show similar patterns. In columns (3) and (4), the coefficient on <i>Mitigation × Black</i> is negative and significant, with mitigation reducing the average interest rate disparity for Black applicants by 21.4bp, roughly 60% of the average gap. This suggests that our simple mitigation strategy moderates but does not eliminate this form of bias.

Furthermore, the interaction terms involving both <i>Black</i> and <i>CreditScore</i> speak to the effectiveness of the bias mitigation prompt in reducing not only the average <i>level</i> of racial differences, but also the heterogeneity across the credit spectrum. In column (2), mitigation eliminates approval disparities for Black borrowers at all credit scores.<FOOTNOTE ref="fn30">The racial difference in the effect of credit score on approval under the mitigation prompt is given by <MATH>\beta_{B \times CS} + \beta_{M \times B \times CS}</MATH>; we cannot reject that this sum equals zero (p = 0.47).</FOOTNOTE> For interest rate recommendations, column (4) indicates that while lower credit scores hurt Black applicants more than white applicants in the baseline prompt (11.4bp per credit score standard deviation), the mitigation prompt substantially reduces this (by 7.9 basis points), although statistically significant credit score heterogeneity in racial differences still exists.

We illustrate these points in <a href="#fig3">Figure III</a>. Using the estimates from columns (2) and (4), we calculate predicted values for all observations (including the loan fixed effects), and plot the outcomes averaged by credit score separately for Black and white applicants for both outcomes and prompts. Comparing the dashed to the solid lines shows the power of the simple mitigation prompt. In the baseline prompt, only Black applicants at the top of the credit score range receive approval decisions and interest rate suggestions approaching those of otherwise-identical white applicants. However, with the mitigation adjustment, approval decisions are indistinguishable between Black and white applicants across the credit spectrum, and the level and heterogeneity of racial interest rate disparities are markedly reduced.

<FIGURE ref="fig3">
  <CAPTION>Average LLM Recommendations by Credit Score under Baseline and Bias Mitigation Prompts. This figure illustrates the estimated coefficients for <a href="#eq4">Equation 4</a> in Experiment 6 (see <a href="#tbl1">Table I</a>) as reported in columns (2) and (4) of <a href="#tbl7">Table VII</a> for the approval and interest rate decisions of the baseline LLM. We obtain the predicted values for all observations after running both models to recover the loan fixed effects and plot the outcomes averaged by score.</CAPTION>
  <IMG src="fig-42.png"/>
</FIGURE>

To further assess prompt engineering-based approaches to reducing disparities, we also consider in Experiment A3 an alternate mitigation prompt that focuses specifically on laws designed to prevent discrimination in lending: “Make sure you comply with the Fair Lending Act and ECOA in making this decision.”<FOOTNOTE ref="fn31">It is not this paper's goal to assess all plausible prompt approaches. Having already demonstrated the effectiveness of a simple and direct approach, this exercise is designed simply to evaluate a contrasting approach using language more in line with that favored by lawyers and regulators.</FOOTNOTE> This prompt raises the salience of the legal stakes and might induce the LLM to further reduce the racial difference in its recommendations. Alternatively, this prompt might be less effective because its phrasing is somewhat detached from the outcomes we are assessing. The results in Appendix <a href="#tblA5">Table A5</a> repeat tests of <a href="#eq3">Equation 3</a> for this alternate mitigation prompt.

This legalistic approach successfully moderates Black-white gaps in LLM recommendations, though the effects are smaller than we found in Experiment 6: Comparing the <i>Black</i> and <i>Mitigation × Black</i> coefficients shows reductions of about 70% of the approval difference and just 30% of the interest rate difference (versus 100% and 61%, respectively, for the main mitigation prompt).

Overall, these findings indicate that while the baseline prompt produces significant racial disparities in both loan approval and interest rate recommendations, straightforward modifications to the prompt substantially reduce these disparities. This highlights that LLM behavior is dynamic and responsive to carefully constructed inputs. We do not propose that the mitigation prompt tested here is optimal for all contexts. Rather, we offer it as an illustration of a broader principle: LLM outputs can be tested, disparities identified, and simple interventions deployed to reduce harm. This creates a repeatable, practical framework for managing LLM risk (test → find → fix) and represents an important step toward aligning general-purpose AI tools with the compliance and fairness expectations of financial regulation.

<SECTION ref="sec_additional_tests">5 LLM interest rate suggestions: Additional tests</SECTION>
<SUBSECTION ref="subsec_determinants">5.A Determinants, accuracy, and ex-post performance</SUBSECTION>
By varying race and credit scores in our sample, our experimental design isolates the effect of race signals on LLM recommendations. However, because publicly available HMDA data do not include true credit scores, this empirical approach does not allow us to evaluate how closely LLM outputs correspond to actual lenders' underwriting decisions. Moreover, HMDA data lack information on ex-post loan performance, which is necessary to evaluate the profit implications of the LLMs' decisions. To address these limitations, we conduct a series of additional tests that incorporate new data from Freddie Mac's Single-Family Loan-Level Dataset, which provides borrowers' true credit scores and subsequent loan delinquency outcomes.

We merge loans from the 2022 HMDA LAR file with Freddie Mac data. Specifically, we download Freddie Mac loan records, restrict the sample to loans originated in 2022, and apply the same filters used for the HMDA data (see <a href="#subsec_data">Section 2.C</a>). We match observations based on loan amount, interest rate, DTI, LTV, state, and ZIP code.<FOOTNOTE ref="fn32">We adjust LTV and DTI across datasets to ensure comparable granularity and require that the ZIP code in the Freddie Mac data contains the census tract reported in the HMDA file. This procedure is adapted from <CITATION ref="cite_buchak2021" title="Competition with Multi-Dimensional Pricing: Evidence from U.S. Mortgages" authors="Buchak, G., Jørring, A." year="2021" journal="SSRN Electronic Journal">Buchak and Jørring (2021)</CITATION>, <CITATION ref="cite_jiang2023" title="Bank Technology Adoption and Loan Production in the U.S. Mortgage Market" authors="Jiang, S., Jørring, A., Xu, D." year="2023" journal="SSRN Electronic Journal">Jiang et al. (2023)</CITATION>, and <CITATION ref="cite_kalda2023" title="Cost Pass-Through and Mortgage Credit: The Case of Guarantee Fees" authors="Kalda, A., Pearson, C.G., Sovich, D." year="2023" journal="SSRN Electronic Journal">Kalda et al. (2023)</CITATION>.</FOOTNOTE> We then drop matches that are not unique—that is, each HMDA and Freddie Mac loan must match a single counterpart. From the resulting matches, we randomly select 1,000 loans, which we refer to as the HMDA-Freddie Mac Matched Sample. Using this matched sample, we rerun the LLM application prompt with the true credit score for each loan, <i>omitting</i> demographic signals so that the prompts are race-blind. Because all loans in the Freddie Mac dataset were approved and originated, this analysis focuses on interest rate suggestions.
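The merge-and-deduplicate step can be sketched in pandas. The column names and toy records below are illustrative stand-ins, not the actual HMDA or Freddie Mac field names; the point is the one-to-one matching logic:

```python
import pandas as pd

# Toy stand-ins for the two loan files (hypothetical columns and values)
hmda = pd.DataFrame({
    "loan_amount": [200_000, 200_000, 350_000, 500_000],
    "rate": [6.5, 6.5, 5.9, 6.1],
    "state": ["PA", "PA", "AZ", "MA"],
    "hmda_id": [1, 2, 3, 4],
})
freddie = pd.DataFrame({
    "loan_amount": [200_000, 350_000, 500_000],
    "rate": [6.5, 5.9, 6.1],
    "state": ["PA", "AZ", "MA"],
    "fm_id": ["a", "b", "c"],
})

# Inner join on the matching keys
keys = ["loan_amount", "rate", "state"]
merged = hmda.merge(freddie, on=keys, how="inner")

# Keep only one-to-one matches: drop any row whose HMDA loan or
# Freddie Mac loan appears in more than one match
unique = merged[~merged.duplicated("hmda_id", keep=False)
                & ~merged.duplicated("fm_id", keep=False)]
print(unique[["hmda_id", "fm_id"]])
```

In this toy example the first two HMDA loans both match the same Freddie Mac record, so both are dropped and only the unambiguous pairs survive, mirroring the requirement that each loan match a single counterpart.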

<a href="#tbl8">Table VIII</a> presents OLS results that approximate the LLM's interest rate suggestion rule using a linear specification. Column (1) shows that a one standard deviation increase in credit score is associated with a 49 basis point decrease in the suggested rate, consistent with the evidence in <a href="#tbl3">Table III</a>. Unlike earlier estimates, which identify the effect of credit score from experimental variation within repeated applications, the regressions in <a href="#tbl8">Table VIII</a> omit loan fixed effects. This allows us to assess how the LLM responds to variation in DTI and LTV. Columns (2) and (3) examine these variables separately and find that a one standard deviation increase in either is associated with a roughly 20 basis point increase in the suggested rate. When standardized credit score, DTI, and LTV are included together in column (4), the coefficients on DTI and LTV remain statistically significant, but the LLM responds most strongly to credit score. This pattern aligns with established lending practices that prioritize measures of creditworthiness.
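Standardizing the regressors is what makes these coefficients directly comparable: each is measured in rate points per standard deviation of the input. A minimal sketch on simulated data follows; the data-generating coefficients are assumptions chosen only to echo Table VIII's magnitudes:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Simulated creditworthiness measures (hypothetical, not the matched sample)
credit = rng.normal(740, 40, n)
dti = rng.normal(35, 8, n)
ltv = rng.normal(80, 10, n)

# Assumed pricing rule: rate responds most strongly to credit score
rate = (3.9 - 0.012 * (credit - 740) + 0.017 * (dti - 35)
        + 0.011 * (ltv - 80) + 0.1 * rng.standard_normal(n))

def z(x):
    """Standardize to mean 0, standard deviation 1."""
    return (x - x.mean()) / x.std()

# OLS with all regressors standardized, so coefficients are in
# rate points per standard deviation and comparable in magnitude
X = np.column_stack([np.ones(n), z(credit), z(dti), z(ltv)])
beta, *_ = np.linalg.lstsq(X, rate, rcond=None)
print(beta.round(3))
```

Under these assumptions the recovered coefficients sit near -0.48, 0.14, and 0.11, so the credit score term dominates, the same qualitative ranking the table reports.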

<TABLE ref="tbl8">
  <CAPTION>Determinants of LLM Interest Rate Recommendations. This table reports OLS regressions of LLM interest rate recommendations on measures of creditworthiness. The sample consists of mortgage originations from the HMDA-Freddie Mac Matched Sample described in <a href="#subsec_determinants">Section 5.A</a>, where <i>CreditScore (actual)</i> is the primary applicant's actual credit score. The dependent variable is the suggested interest rate (in percentage points), generated using the baseline prompt from <a href="#subsec_empirical_strategy">Section 2.B</a>, without race disclosure. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively.</CAPTION>
  <thead>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (actual) (z)</td><td>-0.485***</td><td/><td/><td>-0.457***</td></tr>
    <tr><td/><td>(0.016)</td><td/><td/><td>(0.016)</td></tr>
    <tr><td>DTI (z)</td><td/><td>0.201***</td><td/><td>0.134***</td></tr>
    <tr><td/><td/><td>(0.020)</td><td/><td>(0.016)</td></tr>
    <tr><td>LTV (z)</td><td/><td/><td>0.170***</td><td>0.111***</td></tr>
    <tr><td/><td/><td/><td>(0.021)</td><td>(0.015)</td></tr>
    <tr><td>Constant</td><td>3.937***</td><td>3.937***</td><td>3.937***</td><td>3.937***</td></tr>
    <tr><td/><td>(0.014)</td><td>(0.020)</td><td>(0.020)</td><td>(0.013)</td></tr>
    <tr><td>Obs</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td></tr>
    <tr><td>R²</td><td>0.55</td><td>0.09</td><td>0.07</td><td>0.62</td></tr>
    <tr><td>Adj R²</td><td>0.55</td><td>0.09</td><td>0.07</td><td>0.62</td></tr>
    <tr><td>Loan FE</td><td>No</td><td>No</td><td>No</td><td>No</td></tr>
  </tbody>
</TABLE>

Next, we compare the interest rates recommended by the LLM with those actually charged by lenders. We focus on rate spreads rather than nominal interest rates, following <CITATION ref="cite_bhutta2021" title="Do Minorities Pay More for Mortgages?" authors="Bhutta, N., Hizmo, A." year="2021" journal="The Review of Financial Studies">Bhutta and Hizmo (2021)</CITATION>, because spreads more precisely capture cross-sectional variation in lender risk assessments, independent of broader market (yield curve) movements. This distinction is particularly relevant because our experimental prompts do not incorporate macroeconomic conditions, and mortgage rates rose sharply during 2022, from 3.22% to 6.42%.

<a href="#fig4">Figure IV</a> presents a binned scatter plot showing the relationship between actual rate spreads on issued loans and the interest rate suggestions generated by the LLM. The underlying regression yields a coefficient of 0.55 with a t-statistic of 14.5, indicating that LLM rate suggestions are highly statistically significant predictors of the risk assessments made by actual lenders. This is particularly notable given the limited application information included in the prompt.
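A binned scatter of this kind is built by cutting the x variable into equal-sized quantile bins and averaging both variables within each bin; the slope is still estimated on the unbinned data. A sketch on simulated data, where the 0.55 slope is built into an assumed data-generating process to mirror the estimate above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000

# Simulated stand-ins (not the matched sample): actual rate spread and an
# LLM suggestion loading on it with an assumed slope of 0.55
spread = rng.normal(0.35, 0.5, n)
llm_rate = 3.9 + 0.55 * spread + 0.4 * rng.standard_normal(n)

# Binned scatter: sort by x, split into 20 equal-sized bins,
# and average both variables within each bin
order = np.argsort(spread)
bins = np.array_split(order, 20)
x_means = np.array([spread[b].mean() for b in bins])
y_means = np.array([llm_rate[b].mean() for b in bins])

# Slope of the underlying (unbinned) bivariate regression
slope = np.cov(spread, llm_rate)[0, 1] / np.var(spread, ddof=1)
print(round(slope, 2))
```

The (x_means, y_means) pairs are what the figure plots; binning removes visual noise while leaving the regression slope unchanged.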

<FIGURE ref="fig4">
  <CAPTION>LLM Interest Rate Recommendations vs. Actual Loan Rate Spreads. This binned scatterplot illustrates the bivariate relationship between the mortgage interest rate recommended by the baseline LLM and the actual rate spread assigned by the real lender to the same loan as recorded in HMDA. The sample consists of mortgage originations from the HMDA-Freddie Mac Matched Sample described in <a href="#subsec_determinants">Section 5.A</a>. <i>Rate (LLM)</i> is the LLM interest rate recommendation for a given loan, using the prompt design from <a href="#subsec_empirical_strategy">Section 2.B</a> without including a race disclosure. The independent variable is the actual rate spread. The estimated slope of the linear fit is 0.55, with a t-statistic of 14.46 based on a heteroskedastic robust standard error.</CAPTION>
  <IMG src="fig-43.png"/>
</FIGURE>

To evaluate the ex-post performance of the baseline LLM in underwriting tasks, we examine whether each loan becomes delinquent (defined as being more than 30 days past due) at some point through Q3 2024, the most recent quarter with available data. <a href="#tbl9">Table IX</a> compares how well different pricing and risk measures predict delinquency. Column (1) shows that a one standard deviation increase in the rate spread assigned by actual underwriters is associated with a 2.7 percentage point increase in the likelihood of delinquency. In contrast, a one standard deviation increase in the LLM's interest rate suggestion is associated with a 5.9 percentage point increase in delinquency (column 2), suggesting superior predictive power. This advantage is confirmed in column (3): When both measures are included together, the coefficient on the actual rate spread falls to near zero, while the LLM rate coefficient remains essentially unchanged at 5.7 percentage points. However, columns (4) and (5) reveal that when credit scores are included, the LLM rate becomes statistically insignificant, suggesting that much of the LLM's predictive power operates through its ability to infer credit-relevant information from limited prompts.
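The horse-race logic can be illustrated with a simulation: if delinquency depends only on the true credit score and the LLM rate is a noisy function of that same score, the LLM rate predicts delinquency on its own but loses explanatory power once the score enters the regression. This is an assumed data-generating process, not the paper's data:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000

# Assumed DGP: delinquency risk depends only on the credit score, and the
# LLM rate is a noisy transformation of that same score
credit_z = rng.standard_normal(n)
llm_rate_z = -0.8 * credit_z + 0.6 * rng.standard_normal(n)
llm_rate_z = (llm_rate_z - llm_rate_z.mean()) / llm_rate_z.std()

p_delinq = np.clip(0.097 - 0.07 * credit_z, 0, 1)   # delinquency probability
delinq = (rng.random(n) < p_delinq).astype(float)   # realized delinquency

def ols(y, *cols):
    """Least-squares coefficients with an intercept."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_llm_only = ols(delinq, llm_rate_z)            # LLM rate alone: predictive
b_horse = ols(delinq, llm_rate_z, credit_z)     # with credit score: near zero
print(round(b_llm_only[1], 3), round(b_horse[1], 3))
```

The first coefficient is positive because the LLM rate proxies for credit risk; the second collapses toward zero once the score it proxies for is controlled directly, matching the pattern across columns (2) and (5).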

<TABLE ref="tbl9">
  <CAPTION>Risk Assessments and Loan Delinquency. This table reports OLS regressions of ex-post loan performance on measures of creditworthiness. The sample consists of mortgage originations from the HMDA-Freddie Mac Matched Sample described in <a href="#subsec_determinants">Section 5.A</a>. The dependent variable is <i>Delinquent</i> and is set equal to one if the loan is more than 30 days delinquent at any point within three years of origination. <i>Rate Spread (actual)</i> is the underwriter rate spread reported in HMDA. <i>Rate (LLM)</i> is the LLM-suggested interest rate (in percentage points), generated using the baseline prompt from <a href="#subsec_empirical_strategy">Section 2.B</a>, without race disclosure. <i>CreditScore (actual)</i> is the primary applicant's actual credit score. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively.</CAPTION>
  <thead>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th></tr>
  </thead>
  <tbody>
    <tr><td>Rate Spread (actual) (z)</td><td>0.027***</td><td/><td>0.003</td><td/><td>-0.002</td></tr>
    <tr><td/><td>(0.011)</td><td/><td>(0.011)</td><td/><td>(0.011)</td></tr>
    <tr><td>Rate (LLM) (z)</td><td/><td>0.059***</td><td>0.057***</td><td/><td>0.013</td></tr>
    <tr><td/><td/><td>(0.011)</td><td>(0.012)</td><td/><td>(0.012)</td></tr>
    <tr><td>CreditScore (actual) (z)</td><td/><td/><td/><td>-0.072***</td><td>-0.063***</td></tr>
    <tr><td/><td/><td/><td/><td>(0.011)</td><td>(0.013)</td></tr>
    <tr><td>Constant</td><td>0.097***</td><td>0.097***</td><td>0.097***</td><td>0.097***</td><td>0.097***</td></tr>
    <tr><td/><td>(0.009)</td><td>(0.009)</td><td>(0.009)</td><td>(0.009)</td><td>(0.009)</td></tr>
    <tr><td>Obs</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td></tr>
    <tr><td>R²</td><td>0.01</td><td>0.04</td><td>0.04</td><td>0.06</td><td>0.06</td></tr>
    <tr><td>Adj R²</td><td>0.01</td><td>0.04</td><td>0.04</td><td>0.06</td><td>0.06</td></tr>
    <tr><td>Loan FE</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td></tr>
  </tbody>
</TABLE>

While accurate calibration of the LLM's recommendations is not required for the validity of our main results, the evidence in this section demonstrates that the model performs sophisticated financial risk assessments. These findings align with recent research showing that LLMs exhibit substantial capabilities in quantitative financial analysis beyond traditional text processing (<CITATION ref="cite_feng2024" title="A First Look at Financial Data Analysis Using ChatGPT-4o" authors="Feng, Z., Li, B., Liu, F." year="2024" journal="Working Paper">Feng et al., 2024</CITATION>; <CITATION ref="cite_fieberg2023" title="Using GPT-4 for Financial Advice" authors="Fieberg, C., Hornuf, L., Streich, D." year="2023" journal="SSRN Electronic Journal">Fieberg et al., 2023</CITATION>; <CITATION ref="cite_lopezlira2024" title="Can ChatGPT Forecast Stock Price Movements? Return Predictability and Large Language Models" authors="Lopez-Lira, A., Tang, Y." year="2024" journal="arXiv preprint arXiv:2304.07619">Lopez-Lira and Tang, 2024</CITATION>; <CITATION ref="cite_shah2023" title="Zero is Not Hero Yet: Benchmarking Zero-Shot Performance of LLMs for Financial Tasks" authors="Shah, A., Chava, S." year="2023" journal="arXiv preprint arXiv:2305.16633">Shah and Chava, 2023</CITATION>). More broadly, our results suggest that LLMs may offer value to financial institutions across a range of analytical tasks and that wider adoption may be attractive, even if their use remains limited in more regulated functions such as underwriting.

<SUBSECTION ref="subsec_true_race">5.B Loan outcomes and true applicant race</SUBSECTION>
To examine whether LLM rate disparities reflect genuine differences in borrower risk, <a href="#tbl10">Table X</a> reports OLS estimates of loan outcomes based on the <i>actual</i> race of the applicant. Columns (1) and (2) show no statistically significant difference in delinquency rates for Black borrowers, both on average and after controlling for credit risk. Column (3) indicates that Black applicants receive 15 basis points higher rate spreads from real underwriters on average, but this disparity largely disappears once credit-relevant factors are controlled for in column (4). These patterns are consistent with <CITATION ref="cite_bhutta2022" title="How Much Does Racial Bias Affect Mortgage Lending? Evidence from Human and Algorithmic Credit Decisions" authors="Bhutta, N., Hizmo, A., Ringo, D." year="2022" journal="FEDS Working Paper">Bhutta et al. (2022)</CITATION>, which finds that observable applicant factors explain most of the racial disparities in mortgage approvals.

<TABLE ref="tbl10">
  <CAPTION>Loan Outcomes and True Borrower Race. This table reports OLS regressions of loan outcomes on borrowers' true race. The sample consists of mortgage originations from the HMDA-Freddie Mac Matched Sample described in <a href="#subsec_determinants">Section 5.A</a>. The dependent variable in columns (1) and (2), <i>Delinquent</i>, is set equal to one if the loan is more than 30 days delinquent at any point within three years of origination. The dependent variable in columns (3) and (4), <i>Rate Spread (actual)</i>, is the underwriter rate spread reported in HMDA. The dependent variable in columns (5) and (6), <i>Rate (LLM; race undisclosed)</i>, is the LLM-suggested interest rate, generated using the baseline prompt from <a href="#subsec_empirical_strategy">Section 2.B</a>, without race disclosure. In columns (7) and (8), the dependent variable, <i>Rate (LLM; race disclosed)</i>, is generated in the same manner, except the applicant's true race is included in the prompt. <i>Black (actual)</i> is the borrower race listed in HMDA rather than an experimental manipulation. Interest rate outcomes in columns (3)-(8) are measured in percentage points. <i>CreditScore (actual)</i> is the primary applicant's actual credit score. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Delinquent</th><th colspan="2">Rate Spread (actual)</th><th colspan="2">Rate (LLM; race undisclosed)</th><th colspan="2">Rate (LLM; race disclosed)</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th><th>(6)</th><th>(7)</th><th>(8)</th></tr>
  </thead>
  <tbody>
    <tr><td>Black (actual; undisclosed)</td><td>0.011</td><td>-0.025</td><td>0.150*</td><td>0.033</td><td>0.303***</td><td>0.055</td><td/><td/></tr>
    <tr><td/><td>(0.042)</td><td>(0.043)</td><td>(0.080)</td><td>(0.069)</td><td>(0.086)</td><td>(0.050)</td><td/><td/></tr>
    <tr><td>Black (actual; disclosed)</td><td/><td/><td/><td/><td/><td/><td>0.680***</td><td>0.427***</td></tr>
    <tr><td/><td/><td/><td/><td/><td/><td/><td>(0.110)</td><td>(0.079)</td></tr>
    <tr><td>CreditScore (actual) (z)</td><td/><td>-0.069***</td><td/><td>-0.176***</td><td/><td>-0.457***</td><td/><td>-0.456***</td></tr>
    <tr><td/><td/><td>(0.011)</td><td/><td>(0.016)</td><td/><td>(0.016)</td><td/><td>(0.015)</td></tr>
    <tr><td>DTI (z)</td><td/><td>0.016*</td><td/><td>-0.027*</td><td/><td>0.133***</td><td/><td>0.137***</td></tr>
    <tr><td/><td/><td>(0.009)</td><td/><td>(0.014)</td><td/><td>(0.016)</td><td/><td>(0.016)</td></tr>
    <tr><td>LTV (z)</td><td/><td>0.015*</td><td/><td>0.124***</td><td/><td>0.110***</td><td/><td>0.120***</td></tr>
    <tr><td/><td/><td>(0.008)</td><td/><td>(0.015)</td><td/><td>(0.015)</td><td/><td>(0.014)</td></tr>
    <tr><td>Constant</td><td>0.096***</td><td>0.098***</td><td>0.351***</td><td>0.358***</td><td>3.920***</td><td>3.934***</td><td>3.855***</td><td>3.870***</td></tr>
    <tr><td/><td>(0.010)</td><td>(0.009)</td><td>(0.016)</td><td>(0.014)</td><td>(0.021)</td><td>(0.013)</td><td>(0.021)</td><td>(0.013)</td></tr>
    <tr><td>Obs</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td><td>1,000</td></tr>
    <tr><td>R²</td><td>0.00</td><td>0.06</td><td>0.00</td><td>0.20</td><td>0.01</td><td>0.62</td><td>0.05</td><td>0.65</td></tr>
    <tr><td>Adj R²</td><td>-0.00</td><td>0.06</td><td>0.00</td><td>0.20</td><td>0.01</td><td>0.62</td><td>0.05</td><td>0.65</td></tr>
    <tr><td>Loan FE</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td><td>No</td></tr>
  </tbody>
</TABLE>

We next examine LLM rate recommendations. The advantage of these tests relative to Experiment 1 is that we now use true credit scores that are realistically correlated with applicant race; in Experiment 1, the experimental manipulations intentionally eliminated any such correlation.

The pattern observed in real underwriting decisions—racial disparities in average interest rates that are explained by risk factors—also emerges in LLM recommendations when the LLM has no race information available. We show this in columns (5) and (6), where the dependent variable is the interest rate suggested by the baseline LLM using an application prompt without demographic information. On average, the LLM assigns rates that are 30 basis points higher for Black applicants (t-statistic of 3.51). Column (6) shows that this gap is largely accounted for by risk factors omitted from column (5), as the coefficient falls to a statistically insignificant 5.5 basis points once credit score is included in the specification.<FOOTNOTE ref="fn33">This finding, relative to those in columns (3) and (4), is consistent with <CITATION ref="cite_fuster2022" title="Predictably Unequal? The Effects of Machine Learning on Credit Markets" authors="Fuster, A., Goldsmith-Pinkham, P., Ramadorai, T., Walther, A." year="2022" journal="The Journal of Finance">Fuster et al. (2022)</CITATION>, which finds that machine learning technology can increase race-based disparities in credit market outcomes, partly due to its better ability to triangulate information about borrowers' membership in protected classes from permissible characteristics such as income and credit scores.</FOOTNOTE>

Finally, we alter the prompt to disclose applicants' true races alongside their true credit scores. In column (7), we find that, on average, the baseline LLM assigns rates that are 68 basis points higher when it knows the applicant is Black. This gap is partially attenuated when controlling for credit score, DTI, and LTV in column (8), but remains statistically significant and economically large. Given the absence of racial gaps in real lenders' risk-adjusted rates and ex-post loan performance, this suggests that the LLM's racial disparities are inconsistent with profit maximization.

It is difficult to draw conclusions about the mechanism behind the disparities we document, or, more generally, behind any emergent patterns exhibited by LLMs. This is true in settings focused on human subjects,<FOOTNOTE ref="fn34">As noted by <CITATION ref="cite_bohren2023" title="Inaccurate Statistical Discrimination: An Identification Problem" authors="Bohren, J.A., Haggag, K., Imas, A., Pope, D.G." year="2023" journal="The Review of Economics and Statistics">Bohren et al. (2023)</CITATION>, <CITATION ref="cite_doleac2013" title="The Visible Hand: Race and Online Market Outcomes" authors="Doleac, J.L., Stein, L.C." year="2013" journal="The Economic Journal">Doleac and Stein (2013)</CITATION>, and <CITATION ref="cite_guryan2013" title="Taste-based or Statistical Discrimination: The Economics of Discrimination Returns to its Roots" authors="Guryan, J., Charles, K.K." year="2013" journal="The Economic Journal">Guryan and Charles (2013)</CITATION>, disentangling statistical from taste-based discrimination is challenging in any setting; the two can co-exist and may reinforce each other.</FOOTNOTE> but it is even more complicated for LLMs, to which the traditional theories of discrimination do not apply well.<FOOTNOTE ref="fn35">We thank Felipe Severino for helpful comments on this topic.</FOOTNOTE> LLMs have no preferences, so taste-based discrimination is not possible. And although we find that LLM outputs can generate statistical discrimination, it does not drive our main findings, because our research designs throughout the paper experimentally manipulate race while holding other signals fixed. While LLMs do not have preferences in an economic sense, they do inherit priors about race (and age) from their training data; digging into these priors and their consequences is a fruitful direction for future research.

Even though we cannot pin down the mechanism behind the racial disparities we document, these tests have important implications for the use of LLMs in many customer-facing applications. They show that LLM use is not inherently aligned with profit maximization, and they demonstrate the value of the audit framework for identifying and correcting issues with deployment.

<SECTION ref="sec_conclusion">6 Conclusion</SECTION>
Financial services firms are rapidly integrating LLMs across a wide range of functions, making it essential to understand their limitations. Using mortgage underwriting as an experimental testbed, we document robust evidence that LLMs recommend more denials and higher interest rates for Black applicants compared to otherwise-identical white applicants. These disparities persist across multiple leading models and generations, suggesting that underlying training data and modeling choices embed systematic racial bias despite developers' efforts to address it.

Our results have implications far beyond the mortgage lending application we consider in this study. Disparities remain even when race is only inferred through proxies such as names or locations, underscoring that race-blind inputs do not guarantee race-blind results. As financial firms increasingly deploy LLMs in applications such as personalized investment advice, customer support, and targeted marketing, there is a risk of perpetuating and amplifying existing inequities.

We also show that relatively simple interventions can reduce disparities in LLM behavior. Prompting models explicitly to “use no bias” eliminates approval gaps and cuts interest rate disparities by more than half. While this specific instruction will not be a universal solution, it illustrates how rigorous, audit-based approaches enable users to refine prompts and develop strategies that lead to more equitable outcomes.
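Operationally, this kind of mitigation is a one-line prompt change. A minimal sketch (the task wording and application fields are illustrative; only the quoted mitigation sentence comes from our experiments):

```python
def build_prompt(application: str, mitigate: bool = False) -> str:
    """Assemble an underwriting prompt; optionally append the mitigation instruction."""
    instructions = (
        "You are a loan officer. Decide whether to approve this mortgage "
        "application and recommend an interest rate."
    )
    if mitigate:
        # The quoted sentence is the mitigation instruction studied in the paper;
        # the surrounding task wording in this sketch is illustrative.
        instructions += " You should use no bias in making this decision."
    return instructions + "\n\n" + application

application = "Credit score: 640\nDTI: 37\nLTV: 83"
print(build_prompt(application, mitigate=True))
```

Because the mitigated and unmitigated prompts differ by a single sentence, the cost of auditing both variants at scale is essentially the cost of doubling the API calls.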

LLMs' rate recommendations are strongly correlated with those of real underwriters and predict defaults, suggesting these general-purpose models possess surprising sophistication in performing credit analysis. However, our outcome tests highlight that this sophistication does not necessarily translate into fairness or efficiency. When borrower race is disclosed, models assign significantly higher rates to Black applicants even after controlling for credit quality. In a sample where real lenders' risk-adjusted rates and loan performance do not differ by race, this finding suggests that LLM racial disparities are unlikely to be profit maximizing and that unchecked adoption of these models could harm both consumers and firms.

<APPENDIX ref="appendix">
Appendix
<li>The LLMs and the API names used in our study are:
<TABLE ref="tbl_api_names">
  <thead>
    <tr><th>Source</th><th>LLM</th><th>Year</th><th>Model API Name</th></tr>
  </thead>
  <tbody>
    <tr><td>Anthropic</td><td>Claude 3 Sonnet</td><td>2024</td><td>claude-3-sonnet-20240229</td></tr>
    <tr><td>Anthropic</td><td>Claude 3 Opus</td><td>2024</td><td>claude-3-opus-20240229</td></tr>
    <tr><td>Meta</td><td>Llama 3 8b</td><td>2024</td><td>llama3-8b-8192 (run via Groq)</td></tr>
    <tr><td>Meta</td><td>Llama 3 70b</td><td>2024</td><td>llama3-70b-8192 (run via Groq)</td></tr>
    <tr><td>OpenAI</td><td>GPT-3.5 Turbo (2023)</td><td>2023</td><td>gpt-3.5-turbo-0613</td></tr>
    <tr><td>OpenAI</td><td>GPT-3.5 Turbo (2024)</td><td>2024</td><td>gpt-3.5-turbo-0125</td></tr>
    <tr><td>OpenAI</td><td>GPT-4</td><td>2023</td><td>gpt-4-0613</td></tr>
    <tr><td>OpenAI</td><td>GPT-4 Turbo [Baseline LLM]</td><td>2024</td><td>gpt-4-0125-preview</td></tr>
  </tbody>
</TABLE>
</li>
<li><a href="#tblA1">Table A1</a> compares summary statistics of our HMDA subsample to the broader HMDA sample.</li>
<li><a href="#tblA2">Table A2</a> assesses the stability of our main results across two dimensions: time and temperature. Our main results are based on experiments run in April 2024; the robustness tests reported in <a href="#tblA2">Table A2</a> were run in July 2024.</li>
<li><a href="#tblA3">Table A3</a> examines Experiment A1, in which prompts submitted to the baseline LLM include “Age: 30,” “Age: 50,” or “Age: 70” in place of race signals.</li>
<li><a href="#tblA4">Table A4</a> examines Experiment A2, in which prompts submitted to the baseline LLM include “Gender: Male” or “Gender: Female” in place of race signals.</li>
<li><a href="#tblA5">Table A5</a> considers an alternate “mitigation” prompt (Experiment A3).</li>
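Temperature settings matter for reproducibility: the main experiments use temperature 0, and Table A2 reruns them at 0.3. As an illustrative sketch (not our actual experimental harness), requests to these models can be assembled as OpenAI-style chat payloads; Groq's API uses the same schema, while Anthropic's differs slightly. Only the model identifiers below come from the table of API names above.

```python
# Study label → API model name, taken from the API-name table above
MODELS = {
    "GPT-4 Turbo [Baseline LLM]": "gpt-4-0125-preview",
    "GPT-3.5 Turbo (2024)": "gpt-3.5-turbo-0125",
    "Claude 3 Opus": "claude-3-opus-20240229",
}

def chat_request(model_label: str, prompt: str, temperature: float = 0.0) -> dict:
    """Build a chat-completion request body (OpenAI-style schema)."""
    return {
        "model": MODELS[model_label],
        # 0 in the main April 2024 runs; 0.3 in the Table A2 robustness runs
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Pinning dated model snapshots (e.g., gpt-4-0125-preview rather than a floating alias) is what makes the April vs. July comparison in Table A2 a test of stability rather than of silent model updates.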

<TABLE ref="tblA1">
  <CAPTION>Comparing Entire HMDA Dataset to HMDA Loan Sample. This table compares the HMDA universe (“Entire 2022 HMDA”) to the subset of 1,000 HMDA observations used in our study (“Study Subset”). The HMDA data come from the Loan/Application Records (LAR) file containing loans made nationwide in 2022 and reported to the Consumer Financial Protection Bureau. We restrict the sample to conventional 30-year loans for principal residences secured by a first lien. We eliminate loans with balloon payments, negative amortization, interest-only payments, or business or commercial purposes. We also discard manufactured homes, reverse mortgages, and multi-unit dwellings. Finally, we require non-missing DTI and LTV information for each loan. After these filters, the HMDA dataset has 2,409,013 observations. We winsorize variables at the 1% tails for this table to remove outliers in the entire sample, but this choice does not cause p-values to cross any significance thresholds. We report the mean (and standard deviations, in square brackets) for the variables used in the study in the entire HMDA dataset and the study subset separately. The last column reports differences in means, with standard errors shown in parentheses, where ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively.</CAPTION>
  <thead>
    <tr><th/><th>Entire 2022 HMDA</th><th>Study Subset</th><th>Difference</th></tr>
  </thead>
  <tbody>
    <tr><td>Approval (actual)</td><td>0.93 [0.26]</td><td>0.92 [0.27]</td><td>-0.01 (0.01)</td></tr>
    <tr><td>Rate (actual)</td><td>4.93 [1.13]</td><td>4.98 [1.13]</td><td>0.05 (0.04)</td></tr>
    <tr><td>Rate Spread (actual)</td><td>0.28 [0.63]</td><td>0.27 [0.72]</td><td>-0.01 (0.02)</td></tr>
    <tr><td>DTI</td><td>37.04 [9.20]</td><td>37.17 [9.37]</td><td>0.13 (0.29)</td></tr>
    <tr><td>LTV</td><td>82.43 [14.97]</td><td>83.22 [14.52]</td><td>0.80 (0.47)</td></tr>
  </tbody>
</TABLE>
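The winsorization described in the caption is the standard clip-at-the-percentiles transformation; a minimal sketch (the function name is ours):

```python
import numpy as np

def winsorize_1pct(x: np.ndarray) -> np.ndarray:
    """Clip a variable at its 1st and 99th percentiles (the 1% tails)."""
    lo, hi = np.quantile(x, [0.01, 0.99])
    return np.clip(x, lo, hi)
```

Unlike trimming, winsorizing keeps every observation (extreme values are replaced by the tail cutoffs rather than dropped), so the sample sizes in the comparison are unchanged.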

<TABLE ref="tblA2">
  <CAPTION>Race and Recommendations Robustness. This table reports robustness tests for OLS regressions of loan approval recommendations and loan interest rate recommendations on loan applicants' racial identity presented in <a href="#tbl3">Table III</a> using Experiment 1 (see <a href="#tbl1">Table I</a>). Experiments in this table were run in July 2024, while those in <a href="#tbl3">Table III</a> were run in April 2024. In columns (1)-(4), we set the model temperature to 0, as in <a href="#tbl3">Table III</a>. In columns (5)-(8), we set the model temperature to 0.3. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th>Temperature:</th><th colspan="4">Temperature 0</th><th colspan="4">Temperature 0.3</th></tr>
    <tr><th>Dependent Variable:</th><th colspan="2">Approval</th><th colspan="2">Interest Rate</th><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th><th>(5)</th><th>(6)</th><th>(7)</th><th>(8)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.065***</td><td>0.038***</td><td>-0.675***</td><td>-0.645***</td><td>0.058***</td><td>0.031***</td><td>-0.663***</td><td>-0.635***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.004)</td><td>(0.007)</td><td>(0.008)</td><td>(0.003)</td><td>(0.004)</td><td>(0.007)</td><td>(0.008)</td></tr>
    <tr><td>Black</td><td>-0.135***</td><td>-0.135***</td><td>0.440***</td><td>0.440***</td><td>-0.129***</td><td>-0.128***</td><td>0.443***</td><td>0.443***</td></tr>
    <tr><td/><td>(0.006)</td><td>(0.006)</td><td>(0.013)</td><td>(0.013)</td><td>(0.006)</td><td>(0.006)</td><td>(0.013)</td><td>(0.013)</td></tr>
    <tr><td>Black × CreditScore (z)</td><td/><td>0.054***</td><td/><td>-0.060***</td><td/><td>0.056***</td><td/><td>-0.056***</td></tr>
    <tr><td/><td/><td>(0.007)</td><td/><td>(0.013)</td><td/><td>(0.007)</td><td/><td>(0.013)</td></tr>
    <tr><td>Obs</td><td>5,978</td><td>5,978</td><td>5,978</td><td>5,978</td><td>5,925</td><td>5,925</td><td>5,925</td><td>5,925</td></tr>
    <tr><td>R²</td><td>0.58</td><td>0.58</td><td>0.83</td><td>0.83</td><td>0.57</td><td>0.58</td><td>0.83</td><td>0.83</td></tr>
    <tr><td>Adj R²</td><td>0.49</td><td>0.50</td><td>0.80</td><td>0.80</td><td>0.48</td><td>0.49</td><td>0.79</td><td>0.79</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>
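The loan fixed effects in these regressions absorb all loan-level characteristics, so the race and credit-score coefficients are identified purely from within-loan variation in the experimental manipulations. A minimal sketch of the within-estimator that implements such fixed effects (the names are ours; this is not the study's estimation code):

```python
import numpy as np
import pandas as pd

def within_ols(df: pd.DataFrame, y: str, xcols: list, group: str) -> dict:
    """OLS with group (here, loan) fixed effects via the within-transformation:
    demean the outcome and regressors inside each group, then run OLS."""
    cols = [y] + xcols
    dem = df[cols] - df.groupby(group)[cols].transform("mean")
    beta = np.linalg.lstsq(dem[xcols].to_numpy(), dem[y].to_numpy(), rcond=None)[0]
    return dict(zip(xcols, beta))
```

Demeaning within loans is numerically equivalent to including a dummy for each of the loans, but avoids building a design matrix with thousands of indicator columns.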

<TABLE ref="tblA3">
  <CAPTION>Age and Recommendations. This table reports OLS regressions of LLM loan approval recommendations (columns 1-2) and loan interest rate recommendations (columns 3-4) on loan applicants' age using Experiment A1 (see <a href="#tbl1">Table I</a>). The LLM loan approval recommendation is a binary variable that equals one if the loan is approved, and zero otherwise. LLM loan interest rate recommendations are measured in percentage points. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.029***</td><td>0.023***</td><td>-0.677***</td><td>-0.663***</td></tr>
    <tr><td/><td>(0.002)</td><td>(0.003)</td><td>(0.004)</td><td>(0.006)</td></tr>
    <tr><td>Age=50</td><td>-0.003</td><td>-0.003</td><td>0.039***</td><td>0.039***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.008)</td><td>(0.008)</td></tr>
    <tr><td>Age=70</td><td>-0.016***</td><td>-0.016***</td><td>0.173***</td><td>0.173***</td></tr>
    <tr><td/><td>(0.004)</td><td>(0.004)</td><td>(0.009)</td><td>(0.009)</td></tr>
    <tr><td>Age=50 × CreditScore (z)</td><td/><td>0.004</td><td/><td>-0.010</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.009)</td></tr>
    <tr><td>Age=70 × CreditScore (z)</td><td/><td>0.013***</td><td/><td>-0.030***</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.009)</td></tr>
    <tr><td>Obs</td><td>9,000</td><td>9,000</td><td>9,000</td><td>9,000</td></tr>
    <tr><td>R²</td><td>0.70</td><td>0.70</td><td>0.90</td><td>0.90</td></tr>
    <tr><td>Adj R²</td><td>0.66</td><td>0.66</td><td>0.89</td><td>0.89</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>

<TABLE ref="tblA4">
  <CAPTION>Gender and Recommendations. This table reports OLS regressions of LLM loan approval recommendations (columns 1-2) and loan interest rate recommendations (columns 3-4) on loan applicants' gender using Experiment A2 (see <a href="#tbl1">Table I</a>). The LLM loan approval recommendation is a binary variable that equals one if the loan is approved, and zero otherwise. LLM loan interest rate recommendations are measured in percentage points. Variables are defined in <a href="#sec_methodology">Section 2</a>. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th/><th>(1)</th><th>(2)</th><th>(3)</th><th>(4)</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.024***</td><td>0.024***</td><td>-0.650***</td><td>-0.654***</td></tr>
    <tr><td/><td>(0.002)</td><td>(0.003)</td><td>(0.004)</td><td>(0.006)</td></tr>
    <tr><td>Female</td><td>0.005</td><td>0.005</td><td>-0.005</td><td>-0.005</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.008)</td><td>(0.008)</td></tr>
    <tr><td>Female × CreditScore (z)</td><td/><td>-0.000</td><td/><td>0.008</td></tr>
    <tr><td/><td/><td>(0.004)</td><td/><td>(0.009)</td></tr>
    <tr><td>Obs</td><td>6,000</td><td>6,000</td><td>6,000</td><td>6,000</td></tr>
    <tr><td>R²</td><td>0.69</td><td>0.69</td><td>0.90</td><td>0.90</td></tr>
    <tr><td>Adj R²</td><td>0.63</td><td>0.63</td><td>0.88</td><td>0.88</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
  </tbody>
</TABLE>

<TABLE ref="tblA5">
  <CAPTION>Alternate Bias Mitigation Prompt. This table reports results from Experiments 6 and A3 (see <a href="#tbl1">Table I</a>). Experiment A3 repeats tests of <a href="#eq3">Equation 3</a> as in Experiment 6 but using an alternative mitigation prompt: “Make sure you comply with the Fair Lending Act and ECOA in making this decision.” These estimates are in columns (2) and (4). For comparison, columns (1) and (3) reprise the results from the same columns of <a href="#tbl7">Table VII</a> (using our main mitigation prompt: “You should use no bias in making this decision”). The dependent variable in columns (1) and (2) is a binary variable that equals one if the loan is approved, and zero otherwise. In columns (3) and (4), the dependent variable is the LLM loan interest rate recommendation and is measured in percentage points. To facilitate interpretation, (z) indicates a variable has been standardized. Heteroskedastic robust standard errors are reported in parentheses. ***, **, and * denote statistical significance at the 1%, 5%, and 10% levels, respectively. All models include loan fixed effects. Variables are defined in <a href="#sec_methodology">Section 2</a>.</CAPTION>
  <thead>
    <tr><th/><th colspan="2">Approval</th><th colspan="2">Interest Rate</th></tr>
    <tr><th>Mitigation Prompt:</th><th>(1) Main</th><th>(2) Alternate</th><th>(3) Main</th><th>(4) Alternate</th></tr>
  </thead>
  <tbody>
    <tr><td>CreditScore (z)</td><td>0.043***</td><td>0.043***</td><td>-0.689***</td><td>-0.689***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.003)</td><td>(0.006)</td><td>(0.006)</td></tr>
    <tr><td>Black</td><td>-0.085***</td><td>-0.085***</td><td>0.352***</td><td>0.352***</td></tr>
    <tr><td/><td>(0.005)</td><td>(0.005)</td><td>(0.011)</td><td>(0.011)</td></tr>
    <tr><td>Mitigation</td><td>0.002</td><td>-0.042***</td><td>-0.107***</td><td>0.179***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.005)</td><td>(0.008)</td><td>(0.011)</td></tr>
    <tr><td>Mitigation × CreditScore (z)</td><td>-0.029***</td><td>0.009**</td><td>0.090***</td><td>-0.064***</td></tr>
    <tr><td/><td>(0.003)</td><td>(0.004)</td><td>(0.007)</td><td>(0.009)</td></tr>
    <tr><td>Mitigation × Black</td><td>0.086***</td><td>0.061***</td><td>-0.214***</td><td>-0.104***</td></tr>
    <tr><td/><td>(0.006)</td><td>(0.007)</td><td>(0.014)</td><td>(0.017)</td></tr>
    <tr><td>Obs</td><td>12,000</td><td>12,000</td><td>12,000</td><td>12,000</td></tr>
    <tr><td>R²</td><td>0.58</td><td>0.56</td><td>0.85</td><td>0.83</td></tr>
    <tr><td>Adj R²</td><td>0.54</td><td>0.52</td><td>0.84</td><td>0.81</td></tr>
    <tr><td>Loan FE</td><td>Yes</td><td>Yes</td><td>Yes</td><td>Yes</td></tr>
    <tr><td>Experiment</td><td>6</td><td>A3</td><td>6</td><td>A3</td></tr>
  </tbody>
</TABLE>
</APPENDIX>
</PAPER>