Model 2 uses the coding scheme of Brett and Nandkeolyar (unpublished). Their study was conducted in 2010 with MBA students in India, using the Cartoon simulation (available at negotiationandteamresources.com). The authors shared with us 43 transcripts they had collected and coded using human coders.


Each of the codes is shown in the section below. For each of the 12 codes, we provide a definition of the code, a short explanation, and sample sentences from the transcripts. The sample sentences let you see how these scholars operationalized their codes, which is what Model 2 learned from and tries to reproduce when coding your transcripts. As with any coding scheme, different scholars might operationalize concepts slightly differently. You should decide if this coding scheme will be useful to you by reviewing how the authors used it.


When reporting your results from this model, please cite this paper:

Friedman, R., Brett, J., Cho, J., Zhan, X., et al. (2024). Coding negotiations with AI: Instructions and validation for coding Model 2. https://www.ainegotiationlab-vanderbilt.com/static/assets/VandAI_Neg_Lab_Model2_Paper.pdf

Each of the 12 codes is listed below, along with a brief definition and explanation, followed by example sentences from the coded transcripts. (If you prefer to view this material in an Excel file format, download this file.)

Code 1: QUESTION

Description: Asking a question about the other side's preferences, asking for clarification on a statement, or asking for information from the other party.

QUESTION - ASKING OTHER PARTY TO STATE PREFERENCES
  • Okay let's try and figure out whether we can reach anywhere what's the highest you can come? So let's keep the runs per episode (inaudible) so obviously you need for your model (inaudible) so the runs are seven, what's the highest you can come in terms of per episode the price?
  • How much are you looking at?
  • So you need another 4.8 from Ultra Rangers?
  • How much are you willing to lower on Ultra Rangers?
  • And at what point is there no loss? What is the ideal point for you for that?
QUESTION - ASKING FOR CLARIFICATION OR MORE INFORMATION
  • Quantitatively?
  • How much would that be?
  • Plus you would pay the additional for the extra runs?
  • Okay hold on this. Do you know the cost of financing in the 5th year?
  • And how many runs are you looking at?
  • What are the other advantages of it?
  • So correct me. I have, ah, like we have at the one issue which we, I want to ask you is we have four stations, independent stations in that area.
  • What number of runs are you looking at?
  • You say you want the price per episode at 77,000?
  • What's the landing cost?
  • Syndication, and so for me for any product or any commodity or anything the market decides price. So if you have this evaluation based on certain market paradigm, I can buy that. If you could explain that to me or what is the logic behind it, I could kind of try and understand, 77,000--
  • That's fine. If you could break that down for me and explain to me how you reached that --
  • And we're talking about a five year contract?
  • Well, how about the other decision parameters?
QUESTION - ASKING ABOUT PROCESS OF MAKING DEAL
  • So, do you want to start with it?
  • So, what do you think should we do now?
Code 2: INFORMATION

Description: Comments that provide general information about one's self/business, preferences, needs, knowledge

INFORMATION - STATING CURRENT UNDERSTANDINGS
  • I have no information on that.
  • Okay. I am aware of your network. You guys are the second largest I think you are following WIN.
  • For us also I mean as you said you have some base conditions. So as I understand your issues and so if you syndicate about 50 episodes you can't go around syndicating the other 50 to anyone else.
  • This is data from the case.
  • Okay. So, I am [Name] and we understand that you are interested in one of our TV series.
INFORMATION - INFORMATION ABOUT SELF
  • I am the general manager of WCHA.
  • Basically, I am from the WCHI network and -- As you know we run basically on the syndicated choice models.
INFORMATION - STATING DESIRES, INTERESTS, OR PREFERENCES
  • And we want our program to be on your channel.
  • Alright, alright. So we are seriously interested in Altra Rangers.
  • We are interested in purchasing the syndicating rights for Altra Rangers from your company. So.
  • I am not so sure about the young adults.
  • Now before I get into the negotiation I tell you about stepping stone, I mean this is basically how we syndicate the show or we can't do much (inaudible 3:17). The first thing is we are looking for a five-year contract. The period itself is not negotiable. We are also looking at syndicating exactly 100 episodes which basically is the entire series of the show. So we don't have any flexibility to break it down into 50 first and 50 later.
  • So these are two things that really I don't (inaudible) negotiate around.
Code 3: POSITIVE RESPONSE

Description: A short positive response in agreement with a point or validating something previously said

  • Oh very much, very much.
  • Wonderful.
  • Great.
  • Oh yes.
  • Okay.
  • Right.
  • True. True.
  • Yeah. That's a phenomenal position that you take.
  • Absolutely.
  • Correct. Correct. [Overlapping 14:50]
  • We agree to that actually I think --
  • Exactly.
  • That's fine.
  • Fair enough --
  • Okay then we can start with that.
  • Yeah total.
  • Exactly, exactly.
  • Yeah. That's alright.
  • Yes.
  • Sure.
Code 4: NEGATIVE RESPONSE

Description: A short negative response, including "no", shutting down other options, or stating disagreement

  • No, actually I mean given that it has received…
  • No.
  • I don't think that's I think that sometimes you can 1 episode on Monday. You. Okay. You have watched.
  • No. I wouldn't call it trouble. I wouldn't call it financial trouble.
  • I don't have any other options.
Code 5: SUBSTANTIATION

Description: An attempt at cognitive influence (appeals to rationality, logic, data from the case, or interests); normative influence (appeals to reciprocity, fairness, consistency, morality, or norms); or emotional influence (threats, statements or questions about alternatives, sympathy, apologies, flattery, bragging, or schmoozing)

SUBSTANTIATION - MENTIONS OF ALTERNATIVES
  • Even given the popularity of the show, there are other players as you mentioned it earlier that are highly interested in this deal.
  • And I believe that even when it comes to the spots right now, it's not that you guys are neck and neck. You are a little farther down.
  • No, this is the figure I would like to get it. We have counteroffers.
  • So I am just going to see how the value comes out, if it is within what I can adjust I don't mind because as I said we value our relationship.
  • …morning timeslot where this could fit in perfectly. And I don't know if you're aware of this or not, but your competitor does see this as a big, big threat because they are absolutely convinced that if they don't get this show, then they're going to lose their top spot to the person who gets the show.
SUBSTANTIATION - EXPLAINING HOW OTHER PARTY WILL BENEFIT FROM WHAT IS OFFERED
  • So in addition to those things the show has been doing extremely well in its first couple of runs. I don't have the TRP ratings. But what it basically means is for every TV that is on 20% of the viewers tuned in to show at that time, which means any advertising you decide to put up at the same time would have massive, massive reach in terms of who has got the TV on. So basically we have got a 10 rating and then we have also got 20 shares. These numbers we got are reflecting how good the show is. In addition, we have a 10 rating and we have a 20 share rating.
  • I mean one I am basically getting at is for advertising, this show is brilliant you can't get much better than this.
  • See, but you definitely need to take this into account the kind of rating we have received in the past. A 10 rating with almost 20 shares in a competitive timeslot. That's why I'm talking about a competitive timeslot. That's a very, very strong rating. The uniqueness of the characters, the writing style, the way the show has been directed it will be another runaway success.
  • See, we can actually talk about all those. But in the beginning you are aware-you just mentioned-we've been doing exceptionally well for a couple of years now. So, we want a program which is also financially viable to us as an organization.
  • Oh yes, that we are. Yeah. I should say that it has been a phenomenal success. It was a wonderful series and the kind of attention it has grabbed.
  • Okay. You know what? I'll straightaway come to the specifics of the deal. As you know, the ratings of the show have been a great, great success.
  • Oh, I'm willing to relocate and relook like I told you. This is a very good deal for us also. The program has done exceptionally well. But I think it's better if you gain perspective because…
  • And [Overlapping 12:05] This is actually a very profitable proposition for you. You're actually making more profits on these things.
  • What that entails is that, because you have children and young adults watching it, the advertisers are more impressed because the number of your show. So you have very good advertising revenue.
  • We can, excuse me a second. I kind of programming would be to get back operation. We would like to area and it you wish to join us I don't think to argue. So. Yes. Out of that I could, your agreed. You win. Yourself and the other. Those would be a four big targets the reason for. In your show the indication we a new product. We think we can get a relationship with other after this. Chance at the end.
  • But they do realize if they have to lose these episodes to someone they will get knocked out of the second spot and it's a great opportunity for you. If I understand it correctly you are desperately looking for a good [Overlapping 09:10]
SUBSTANTIATION - FAIRNESS- WHAT YOU ARE OFFERING IS UNREASONABLE FOR ME
  • Six is our baseline because below six you lose (inaudible). You fall below six (inaudible)… I haven't checked out the math yet but you are not even offering me concessions with regards to the way you are going to start with the payment. One shot, is it possible? Do you agree with it? I have to check what's the minimum I can offer then because with (inaudible) payment I am still losing money.
  • Yeah so I have to make about seven runs to kind of make a marginal profit. So if give me five runs I am making a loss. On top of that you are selling me at 65 and I actually do have an alternative deal on the table which also I am not --
  • No you have to be much on line, right. At the end of the day if you are making me run five runs and you are charging me 65000 my actually landing cost right now is 3 million let's say --
  • I mean if you give me an offer then I can probably work with it. We can structure a deal so that --
  • Okay so we did some evaluation of how much we think our property is worth and so using the overall figure we then broke it down into how much we should charge per episode and based on number of runs. So for example I could reduce this figure if you agree to fewer runs per episode. So those things can be worked around.
  • Fair enough, so, see I can't come down too much so what I can do is I can come down by about 30%, 75 -- See the reason I am pricing it so high is because the evaluators (inaudible). Now to come down at 35 is crazy I mean not crazy but like --
Code 6: OFFERS - SINGLE ISSUE

Description: An offer containing only one issue

OFFERS - SINGLE ISSUE - STATING YOUR OWN OFFER OR WHAT YOU WOULD LIKE
  • I mean, does $80,000 look like a more reasonable amount?
  • I really, I would want, you know, becomes very tight for us to put 40 percent down. I would prefer if you would 20 percent over 5 years.
  • See, I still feel $35,000 should be the right thing. But I'm really open to-considering that this is a very good show-what do you think should be the right price?
  • I am willing to put up around $45,000 per episode for this.
  • I am willing to give you 8 runs per.
Code 7: OFFERS - MULTIPLE ISSUES

Description: An offer containing multiple issues

OFFERS-MULTIPLE ISSUE - OFFER INCLUDES SEVERAL ISSUES
  • Fine, let me just rephrase the condition again. For Ultra Rangers 70 per episode, price per episode is at 40,000, number of runs per episode is seven. You are going to make the payment all in all in one shot in year zero. For Strums, the per episode cost the price for you would be 13,000, number of runs per episode is seven, again you are going to make the payment in one shot.
Code 8: OFFER ACCEPT

Description: A short affirmation in response to an offer

OFFER ACCEPT - ACCEPT AN OFFER
  • Yeah, that's decent I guess. So.
  • Okay, one second and let me evaluate what you are telling me.
  • I, since you're okay. That I am agreeing.
  • Yeah.
  • Right.
  • Okay.
  • Yes.
Code 9: OFFER REJECT

Description: A short negation in response to an offer

OFFER REJECT - REJECT AN OFFER
  • I would think all of these things would have to come into play somewhere. [Overlapping 11:25] But relatively speaking, I don't think $35,000 is a fair number that you've given out. So, you will have to definitely relook at that number.
  • No, not even $80,000. I mean, where do you all get this $80,000 again? You have to be a little more reasonable.
  • That is not something we can offer. There is not something we can offer.
Code 10: PROCESS COMMENT

Description: A statement about the negotiation itself

PROCESS COMMENT
  • NO EXAMPLES EXIST IN THE TRANSCRIPTS
  • PRESUMABLY, THIS WOULD INCLUDE STATEMENTS LIKE "CAN WE DISCUSS ONE ISSUE AT A TIME" OR "WE'RE RUNNING OUT OF TIME"
  • NOTE: SOME COMMENTS THAT MIGHT BE THOUGHT OF AS PROCESS COMMENTS WERE INCLUDED IN "QUESTIONS"
Code 11: MISCELLANEOUS - ON TASK

Description: A comment that is on-task but otherwise uncodable

MISCELLANEOUS-ON TASK - ON TOPIC MISCELLANEOUS
  • It doesn't have to be…
  • But given that this is an [Inaudible 11:20] agreement…
  • …because as I said…
  • In that case being the summer I should probably start marketing.
  • You are the cartoon.
  • Very well, is the product times. No, a lot of action.
  • Market.
  • I can see stations, so I have information you can't release a little, only a.
  • Do you want me to.
  • Might be something that is one the table. You can't
  • Yes, it's a lot like.
  • I have that actually…
  • You would have to.
  • You want to explore that before we --
  • I think that runs per episode (inaudible) me.
  • I think it is right for me what you think is….
  • And I believe you.
  • So.
  • Agree or whatever.
Code 12: MISCELLANEOUS - OFF TASK

Description: A comment that is off-task and otherwise uncodable

MISCELLANEOUS-OFF TASK - OFF TOPIC MISCELLANEOUS
  • NO EXAMPLES EXIST IN THE TRANSCRIPTS

Transcripts are coded in three steps:

1. Unitization (you need to do this): The model provides one code for each set of words or sentences that you identify as a unit in your Excel document. Units can be speaking turns, sentences, or thought units. Speaking turns are the easiest to set up, since switches between speakers are clearly identifiable in transcripts. Sentences are next easiest, since they are marked by one of these symbols: . ? ! (though different transcribers may end sentences in different places). Thought units are the hardest to create, since that takes careful analysis and can represent as much work as the coding itself. (See the NegotiAct coding manual [10] for how to create thought units.) Clarity of meaning runs in the opposite direction: the longer the unit, the more likely it contains multiple ideas, and the less clear it is to human or AI coders which part to code. Aslani et al. (2014) [1] coded speaking turns, but 72% of their speaking turns contained just one sentence. For the closest alignment with the training data, use sentences as the unit.
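For illustration only, here is a minimal sketch of sentence-level unitization in Python. The "Speaker: text" line format and the splitting rule are our assumptions, not part of the authors' pipeline, and real transcripts will need more careful handling:

    import re

    def sentence_units(transcript_lines):
        """Split "Speaker: text" lines into (speaker, sentence) units."""
        units = []
        for line in transcript_lines:
            speaker, sep, text = line.partition(":")
            if not sep:
                continue  # skip lines without a "Speaker:" prefix
            # Sentences end with ., ?, or !; keep the punctuation attached.
            for sentence in re.findall(r"[^.?!]+[.?!]?", text):
                sentence = sentence.strip()
                if sentence:
                    units.append((speaker.strip(), sentence))
        return units

    # sentence_units(["Buyer: How much are you looking at? We can be flexible."])
    # -> [("Buyer", "How much are you looking at?"),
    #     ("Buyer", "We can be flexible.")]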

2. Model Assigns Code: The model assigns a code to each unit you submit, based on in-context learning; coding is guided by the prompt we developed and tested. (For more on in-context learning, see Xie and Min (2022) [14].) Our prompt for this model includes several elements; a sketch of how such a prompt might be assembled appears after this list:

  • Five fully coded transcripts. These transcripts were chosen from the 43 available transcripts in the following way. First, a combination of five transcripts was considered only if that set included all 12 codes. Second, five of those combinations were chosen at random to test. Third, the combination that produced the highest level of match with human coders was retained.
  • Instructions to pay attention to who was speaking, such as “buyer” or “seller”.
  • Instructions to pay attention to what was said in the conversation before and after the unit being coded.
  • Supplementary instructions about the difference between “substantiation” and “information” since in early tests the model often coded substantiation as information, and vice versa. This confusion is not surprising since substantiation usually comes in the form of providing information, but with the purpose of supporting a specific offer or demand.
  • Additional examples of any codes where the five training transcripts did not contain at least 15 examples. We created enough additional examples (based on our understanding of the code) to bring the examples up to 15. We needed to add 12 examples of multi-issue offer, 12 examples of offer rejected, and 14 examples of Miscellaneous Off-Task.
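A minimal sketch of how a prompt with these elements might be assembled (all names and wording here are our assumptions for illustration; the authors' actual prompt is not reproduced):

    def build_coding_prompt(coded_examples, extra_instructions, context, speaker, unit):
        """Assemble a hypothetical in-context-learning prompt.

        coded_examples   : list of (speaker, text, code) from training transcripts
        extra_instructions : e.g., guidance on Substantiation vs. Information
        context          : conversation before and after the unit being coded
        speaker, unit    : who is speaking and the unit to code
        """
        parts = ["You are coding negotiation transcripts with a 12-code scheme.",
                 extra_instructions,
                 "Fully coded examples:"]
        for ex_speaker, ex_text, ex_code in coded_examples:
            parts.append(f"{ex_speaker}: {ex_text} -> {ex_code}")
        parts.append(f"Conversation before and after the unit: {context}")
        parts.append(f"Assign exactly one code to this unit. {speaker}: {unit}")
        return "\n".join(parts)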

3. We Run the Model Five Times: We automatically run the model five times to assess the consistency of results. As expected, the results are not always the same, since with in-context learning the model learns anew with each run and may learn slightly differently each time. Variation is also expected because some units may reasonably be coded in several ways. Running the coding model five times yields five codes for each unit. If three, four, or five of the five runs agree on a code, we report that code and indicate its level of "consistency" (three, four, or five out of five). If there are not at least three consistent results out of five runs, or if the model fails to assign a code, we do not report a model code; in these cases, the researcher needs to do human coding.
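The majority-vote logic can be sketched as follows (a minimal illustration, assuming each run returns one code string per unit; the function name is ours):

    from collections import Counter

    def consolidate_runs(codes_from_five_runs):
        """Reduce one unit's five codes to a reported code plus consistency.

        Returns (code, n_agreeing) when at least 3 of 5 runs agree, and
        (None, None) otherwise -- those units go to human coding.
        """
        counts = Counter(c for c in codes_from_five_runs if c is not None)
        if counts:
            code, n = counts.most_common(1)[0]
            if n >= 3:
                return code, n  # consistency is 3, 4, or 5 out of 5
        return None, None

    # consolidate_runs(["S", "S", "I", "S", "S"]) -> ("S", 4)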

Validation occurred in several steps:

Validation Step 1: Compare the model coding with the human coding by Brett and Nandkeolyar. To do this, we asked the model to code the 3,496 units contained in the Brett and Nandkeolyar transcripts that were not selected for training. We looked at several criteria.

  • Consistency: In our test, there was perfect consistency of model coding (five out of five runs of the model assigned the same code to a unit) for 3,448 of the units, high consistency (four out of five) for 48 of the units, modest consistency (three out of five) for zero of the units, and no cases where the model did not report a code. Thus, 98.6% of codes had perfect consistency.

  • Match with human coding: We assessed whether the model assigned the same code as the human coders. The overall match level for units where the model assigned a code was 75% (95% CI: .74, .76). To ensure that the model was not biased toward matching the human coders more accurately in earlier or later phases of the negotiation, we tested whether the match level differed between the first and second halves of the transcripts. The match level was 75% for the first half and 74% for the second half, suggesting no bias based on phase of the negotiation. We also looked at match by level of consistency (see Table 1). These results suggest that users may want to accept model-assigned codes only where the model achieves perfect consistency (five out of five).

  • Table 1: Match Percentage by Consistency Level, Validation Step 1

    Level of Consistency    % of Units at This Level*   Units Not Matching   Units Matching   Match %
    Modest (3 out of 5)     0%                          0                    0                n/a
    High (4 out of 5)       1.4%                        29                   19               45%
    Perfect (5 out of 5)    98.6%                       845                  2,603            75%

    *Percentages are among units assigned a code; 0 cases failed to reach the 3-out-of-5 consistency threshold or to receive a code.

    We also calculated Cohen’s kappa, treating the model codes as coming from one rater and the human codes as coming from a second rater. Unlike the percentage match, this calculation accounts for matches that might occur by chance. Cohen’s kappa was calculated in R (R Core Team, 2022) [12] using the irr package (Gamer et al., 2019) [7]. Cohen’s kappa was 0.70, with a no-information rate of .26 (p-value of difference < .001). According to Landis and Koch (1977) [11] this represents “substantial agreement”, and according to Fleiss (1981) [5] it is “fair to good” agreement. Rather than relying on conventional categorical guidelines to interpret the magnitude of kappa, Bakeman (2022) [4] argues that researchers should estimate observer accuracy, that is, how accurate simulated observers would need to be to produce a given value of kappa. The KappaAcc program (Bakeman, 2022) [4] was used to estimate observer accuracy, which was found to be 86%.
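    The authors ran this calculation in R; for readers working in Python, an equivalent sketch using scikit-learn (assumed installed), with placeholder label lists standing in for the real code sequences:

        from sklearn.metrics import cohen_kappa_score

        human_codes = ["S", "RP", "Q", "S", "I"]   # placeholder labels
        model_codes = ["S", "RP", "Q", "I", "I"]   # placeholder labels

        # Percentage match: the share of units where the two raters agree.
        match = sum(h == m for h, m in zip(human_codes, model_codes)) / len(human_codes)

        # Cohen's kappa corrects that agreement rate for chance agreement.
        kappa = cohen_kappa_score(human_codes, model_codes)
        print(f"match = {match:.2f}, kappa = {kappa:.2f}")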

    It is also worth noting that when scholars establish inter-coder reliability, there is often a process of cross-rater discussion used to resolve initial differences of opinion between the two coders. In one study of inter-coder agreement, coder agreements in the 80% range began as initial coder agreements in the 40% range (Garrison et al., 2006) [8]. Of course, in our case there can be no cross-rater discussion between a model and a human, taking away one step that is often used to achieve higher kappas. The closest we can get to that process is to have a third person review the cases of human-model disagreement and judge which code was more correct. Moreover, the fact that so many codes need human-to-human discussion to resolve suggests some inherent ambiguity in code assignments, and opens up the possibility that several different codes might reasonably be assigned to some segments of transcripts.

  • Summary Data and Confusion Matrix: We created a confusion matrix for all codes (see Figure 1). The vertical axis shows human coding; the horizontal axis shows model coding. Also included below (see Table 2) are summary statistics showing which codes appeared most frequently in the human coding (Substantiation was most common, representing 31.64% of the codes, while Miscellaneous Off-Task was least common, representing just 0.11%), and the human-model match level for each code. The highest levels of human-model match (other than for Miscellaneous) were for Positive Response, Question, and Multi-Issue Offer. There appears to be a rough correlation between number of units and match percentage, suggesting that match percentage goes up when there are more examples of a code in the training transcripts for the model to learn from, and more opportunities to find that code in the test transcripts.

    In terms of absolute numbers of mismatches, the largest set is 155 human-coded Substantiation units that were coded as Information by the model. This is an issue we recognized early in our testing, which led to the added prompt instructions intended to reduce this mismatch. The fundamental problem is that Substantiation is often achieved by providing information, but to be Substantiation that information must support a particular argument or claim. There were also 43 cases of human-coded Miscellaneous On-Task that were coded as Information by the model. The next largest set of mismatches comprised 42 cases where humans assigned Single-Issue Offer while the model assigned Multi-Issue Offer, a confusion that is easy to imagine.
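    Since the figure itself is unavailable below, here is a sketch of how such a confusion matrix can be computed (placeholder label lists again; the 12 code abbreviations follow Table 2):

        import pandas as pd
        from sklearn.metrics import confusion_matrix

        labels = ["Q", "I", "RP", "RN", "S", "OS", "OM", "OA", "OR", "P", "MON", "MOFF"]
        human_codes = ["S", "RP", "Q", "S", "I"]   # placeholder labels
        model_codes = ["S", "RP", "Q", "I", "I"]   # placeholder labels

        # Rows show human codes, columns show model codes, as in Figure 1;
        # off-diagonal cells are the mismatches discussed above.
        cm = confusion_matrix(human_codes, model_codes, labels=labels)
        print(pd.DataFrame(cm, index=labels, columns=labels))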

Figure 1: Confusion Matrix, Validation Step 1

[Figure not available in this document.]

Table 2: Match Percentage by Code, Validation Step 1

Human Code                      % of Units Across All Transcripts   Model Match %
Substantiation (S)              31.64                               70%
Response Positive (RP)          22.94                               93%
Question (Q)                    13.56                               86%
Miscellaneous On-Task (MON)     8.55                                69%
Information (I)                 7.61                                66%
Offer Single-Issue (OS)         7.18                                60%
Offer Reject (OR)               2.95                                63%
Offer Accept (OA)               2.43                                34%
Response Negative (RN)          1.29                                49%
Offer Multi-Issue (OM)          1.29                                89%
Process Comment (P)             0.46                                6%
Miscellaneous Off-Task (MOFF)   0.11                                75%
Closer Look at Mismatches: To assess the nature of these and other mismatches, we selected a random sample of 100 mismatches for closer examination. Given that the original human coders may be just as likely to make errors (or simply vary in their judgments) as the model, we wanted to see whether newly trained coders would judge the human codes or the model codes as more accurate. We provided these coders with the 100 speaking turns, the two speaking turns preceding each, and the human and model codes. They were not told which code came from the model and which from the humans, and the order in which the two codes appeared was flipped halfway through the 100 samples to avoid order effects. Each coder first selected independently which code they saw as more accurate; the two coders then resolved any disagreements through discussion. In the end, these new coders judged the model codes correct 67% of the time, the human codes correct 25% of the time, and were uncertain which was correct 8% of the time. On this basis we can expect the model to be correct in about 67% of mismatch cases, so roughly 92% of all model codes should be accurate (75% direct matches plus 67% of the remaining 25%: .75 + .67 × .25 ≈ .92).
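A sketch of how such a blinded review file might be constructed (the mismatch table, its column names, and the output file name are all our assumptions for illustration):

    import pandas as pd

    # Hypothetical mismatch table; real rows would come from the coded output.
    mismatches = pd.DataFrame({
        "context": ["(two prior turns)", "(two prior turns)"],
        "text": ["How much are you looking at?", "We have counteroffers."],
        "human_code": ["Q", "S"],
        "model_code": ["I", "I"],
    })

    sample = mismatches.sample(n=len(mismatches), random_state=1)
    rows = []
    for i, r in enumerate(sample.reset_index().itertuples()):
        first, second = r.human_code, r.model_code
        if i >= len(sample) // 2:       # flip presentation order halfway through
            first, second = second, first
        rows.append({"context": r.context, "unit": r.text,
                     "code_A": first, "code_B": second})

    # Reviewers see code_A and code_B without knowing which rater produced them.
    pd.DataFrame(rows).to_csv("blind_review.csv", index=False)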

Validation Step 2: Match with Human coding for Different Simulations.

The first step of validation matched human and model codes where the negotiation simulation used for training was the same as the simulation used for testing the model (Cartoon). But users may have transcripts from any number of simulations or real-world negotiations, not just the simulation used in the Brett and Nandkeolyar (unpublished) study. Therefore, we wanted to test how well the model would match human coders who applied the Brett and Nandkeolyar scheme to transcripts from other simulations. We selected three transcripts from a study that used The Sweet Shop simulation and three transcripts from a study that used the Les Florets simulation. Since these transcripts were not originally coded with the Brett and Nandkeolyar codes, we trained two coders to use them. After initial training, the coders reached an inter-coder reliability of k = .73. They coded the transcripts separately and then came together to discuss and resolve any cases where they disagreed. This provided the human codes for the Sweet Shop and Les Florets transcripts, which were then coded using our model.

The six transcripts had 1,302 speaking turns, of which 99% were single sentences. The model had perfect consistency for 87% of the speaking turns (all five runs assigned the same code), high consistency for 10% (4 out of 5 runs), and modest consistency for 1.6% (3 out of 5 runs). There was one case with less than 3-out-of-5 consistency. The match percentage was 67% for perfect-consistency codes, 81% for high-consistency codes, and 45% for modest-consistency codes (see Table 3). Overall, the match percentage was 68%. This was lower than in our prior test, as expected, because these transcripts did not involve the same issues and topics as the training transcripts (which used the Cartoon simulation). For that reason, these results may better represent the model’s effectiveness with most transcripts. We also checked whether one set of transcripts did better than the other: the match percentage was 70% for the Les Florets transcripts and 64% for the Sweet Shop transcripts, suggesting that the model performs comparably across different simulations.

We also calculated Cohen’s kappa. The weighted Cohen’s kappa was .63, with a no-information rate of .26 (p-value of difference < .001). According to Landis and Koch (1977) [11] this kappa represents “substantial agreement”, and according to Fleiss (1981) [5] it is “fair to good” agreement. As in Validation Step 1, rather than relying on conventional categorical guidelines to interpret the magnitude of kappa, we used the KappaAcc program (Bakeman, 2022) [4] to estimate observer accuracy, which was found to be 81%.

Table 3: Match Percentage by Consistency Level, Validation Step 2

Level of Consistency    % of Units at This Level*   Units Not Matching   Units Matching   Match %
Modest (3 out of 5)     1.7%                        12                   10               45%
High (4 out of 5)       10.1%                       25                   106              81%
Perfect (5 out of 5)    87.8%                       377                  771              67%

*Percentages are among units assigned a code; 1 case did not reach the 3-out-of-5 consistency threshold or receive a code.

The proportion of speaking units that fell into each category was roughly similar to what we saw in the first validation test, with most speaking units being Information, Question, and Response Positive. In this set of transcripts Substantiation was also fairly common (see Table 4). As with the first validation test, the model-human match percentage appears to be highly correlated with the number of units per code.

The confusion matrix (see Figure 2) shows that, once again, the largest number of mismatches comes from Information/Substantiation. It also shows that nearly all of the mismatches were cases where the model assigned Information when the humans assigned various other codes.

Figure 2: Confusion Matrix, Validation Step 2

[Figure not available in this document.]

Table 4: Match Percentage by Code, Validation Step 2

Human Code                      % of Units Across All Transcripts   Model Match %
Information (I)                 21.98                               57%
Substantiation (S)              16.45                               84%
Response Positive (RP)          15.30                               80%
Question (Q)                    14.37                               95%
Process Comment (P)             6.69                                45%
Miscellaneous On-Task (MON)     6.46                                49%
Offer Single-Issue (OS)         6.30                                57%
Miscellaneous Off-Task (MOFF)   4.77                                31%
Offer Accept (OA)               2.23                                52%
Offer Reject (OR)               2.15                                82%
Offer Multi-Issue (OM)          1.84                                71%
Response Negative (RN)          1.46                                32%

To assess the mismatches, we collected a random sample of 98 mismatched sentences, along with the two prior sentences and the human and model codes. We then removed the column labels and randomly mixed the order of the codes. Since the human coding in this case was done by our own coding team, we wanted a different person to select which of the two codes was more correct; this was done by the first author. The results are shown in Table 5. About 47% of the time the human code was deemed more accurate, while in 40% of the cases the model code was deemed more accurate. In another 6% of the cases, both were deemed correct (because, for example, the sentence was long and really contained two thought units). In 7% of the cases both the human and model codes were deemed incorrect, or the sentence was uninterpretable (because, for example, words were missing from the transcription). Looking, then, at the 32% of speaking units that were mismatches, perhaps half might still be deemed accurate, bringing the match percentage up from 68% to about 85%.

Table 5: Assessment of 98 Sample Mismatches

Code Selection                                      Count   Subtotal
Clear Choice: Human Code is Correct                 46      85
Clear Choice: Model Code is Correct                 39
Both Correct                                        6       6
Both Incorrect: Human and Model Both Incorrect      2       2
Not Understood: Could Not Understand the Sentence   5       5

Set up your transcripts for analysis by putting them into an Excel sheet. Files must be no longer than 999 rows (if you have longer transcripts, split them into smaller files). The format should be as shown below. Label the first column “SpeakerName” and list whatever names you have for the speakers (e.g., buyer/seller, John/Mary). Label the second column “Content” and include the material contained in your unit of analysis (which may be a speaking turn, a sentence, or a thought unit). Also include columns for "ResearcherName", "Email", and "Institution" (often a university), and enter that information in the first data row. Note that there is no space in the headings “SpeakerName” and “ResearcherName.”

If you use speaking turns then speakers will alternate, and the format will look like this:

SpeakerName Content ResearcherName Email Institution
Buyer Words in a speaking turn… Your Name Your Email Your Institution
Seller Words in a speaking turn…
Buyer Words in a speaking turn…
Seller Words in a speaking turn…
etc. Words in a speaking turn…

If you use sentences or thought units then it is possible that speakers may appear several times in a row, and the format will look like this:

SpeakerName Content ResearcherName Email Institution
Buyer Words in sentence or thought unit… Your Name Your Email Your Institution
Seller Words in sentence or thought unit…
Seller Words in sentence or thought unit…
Seller Words in sentence or thought unit…
Buyer Words in sentence or thought unit…
Buyer Words in sentence or thought unit…
Seller Words in sentence or thought unit…
etc. Words in sentence or thought unit…

Create one Excel file for each transcript. Name each file in the following way:

  • YourName_StudyName_1
  • YourName_StudyName_2
  • YourName_StudyName_3
  • etc.

For example, my first file would be named “RayFriedman_CrownStudy_1” and the second file would be named “RayFriedman_CrownStudy_2”, and so on.
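As an illustration (not the authors' code), this layout can be generated with pandas, which writes .xlsx files via openpyxl; the speaker rows below are placeholders:

    import pandas as pd

    # Placeholder units; real rows come from your own unitized transcript.
    units = [("Buyer", "How much are you looking at?"),
             ("Seller", "I am willing to put up around $45,000 per episode.")]

    df = pd.DataFrame(units, columns=["SpeakerName", "Content"])

    # Researcher details go in the first data row only; note that the
    # headings "SpeakerName" and "ResearcherName" contain no spaces.
    df.loc[0, "ResearcherName"] = "Your Name"
    df.loc[0, "Email"] = "Your Email"
    df.loc[0, "Institution"] = "Your Institution"

    # Keep each file under 1,000 rows; split longer transcripts first.
    df.to_excel("RayFriedman_CrownStudy_1.xlsx", index=False)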

To submit your transcripts for the model to code, drag and drop one or several transcript files into the section below. If the files you want to code are listed properly (just below the “Upload” button), click Submit. Note that each time you upload new files, they replace the previously uploaded files and become the set that is ready to submit.

It will likely take about 10 minutes for Claude to process each transcript, although this can vary based on how much demand Claude is under when you submit your files. Do not close your window while you are waiting for results – you will lose your results. Once the analysis of each transcript is complete, you will receive the output in a CSV file that is automatically downloaded to your download folder. We suggest submitting just a few files at a time, so that you can check the output before running too many analyses. A sketch of post-processing this output follows the list below. The output file will include:

  • Transcript Name
  • Speaker
  • The text (thought unit, sentence, or speaking turn)
  • The code assigned to that text
  • Consistency score for that code
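Once the CSV downloads, a minimal sketch of post-processing it in Python (the file name and the column name "Consistency" are our assumptions; check the actual headers in your file):

    import pandas as pd

    out = pd.read_csv("RayFriedman_CrownStudy_1.csv")  # your downloaded output

    # Following the validation results above, one option is to keep only
    # codes assigned with perfect consistency (5 of 5 runs) and route the
    # rest to human coding.
    keep = out[out["Consistency"] == 5]
    review = out[out["Consistency"] < 5]
    print(f"{len(keep)} units auto-coded, {len(review)} units for human review")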
References

[1] Aslani et al. (2014). Measuring negotiation strategy and predicting outcomes: Self-reports, behavioral codes, and linguistic codes. Presented at the annual conference of the International Association for Conflict Management, Leiden, The Netherlands.
[2] Aslani, S., Ramirez-Marin, J., Brett, J., Yao, J., Semnani-Azad, Z., Zhang, Z. X., ... & Adair, W. (2016). Dignity, face, and honor cultures: A study of negotiation strategy and outcomes in three cultures. Journal of Organizational Behavior, 37(8), 1178-1201.
[3] Adair, W. L., & Brett, J. M. (2005). The negotiation dance: Time, culture, and behavioral sequences in negotiation. Organization Science, 16, 33-51.
[4] Bakeman, R. (2022). KappaAcc: A program for assessing the adequacy of kappa. Behavior Research Methods. https://doi.org/10.3758/s13428-022-01836-1
[5] Fleiss, J. L. (1981). Statistical methods for rates and proportions (2nd ed.). New York: John Wiley. ISBN 978-0-471-26370-8.
[6] Friedman, R., Brett, J., Cho, J., Zhan, X., et al. (2024). An application of large language models to coding negotiation transcripts: The Vanderbilt AI Negotiation Lab. (forthcoming)
[7] Gamer, M., Lemon, J., Fellows, I., & Singh, P. (2019). irr: Various coefficients of interrater reliability and agreement. R package version 0.84.1. https://CRAN.R-project.org/package=irr
[8] Garrison, D., Cleveland-Innes, M., Koole, M., & Kappelman, J. (2006). Revisiting methodological issues in transcript analysis: Negotiated coding and reliability. The Internet and Higher Education, 9, 1-8. https://doi.org/10.1016/j.iheduc.2005.11.001
[9] Gunia, B. C., Brett, J. M., Nandkeolyar, A. K., & Kamdar, D. (2011). Paying a price: Culture, trust, and negotiation consequences. Journal of Applied Psychology, 96, 774-789.
[10] Jackel, E., Zerres, A., Hamshorn de Sanchez, C., Lehmann-Willenbrock, N., & Huffmeier, J. (2022). NegotiAct: Introducing a comprehensive coding scheme to capture temporal interaction patterns in negotiations. Group and Organization Management. (See supplementary file for coding guidelines.)
[11] Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159-174. https://doi.org/10.2307/2529310
[12] R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
[13] Weingart, L. R., Thompson, L. L., Bazerman, M. H., & Carroll, J. S. (1990). Tactical behavior and negotiation outcomes. International Journal of Conflict Management, 1, 7-31.
[14] Xie, S. M., & Min, S. (2022). How does in-context learning work? A framework for understanding the differences from traditional supervised learning. Stanford AI Lab Blog, Aug 1. https://ai.stanford.edu/blog/understanding-incontext/