Model 2 uses the coding scheme of Brett and Nandkeolyar (unpublished). The study was conducted in 2010, with MBA students from India. The simulation they used was Cartoon (available at negotiationandteamresources.com). The authors shared with us 43 transcripts they had collected and coded using human coders.
Each of the codes is shown in the section below. For each of the 12 codes, we provide a definition of the code, a short explanation, and sample sentences from the transcripts. The sample sentences let you see how these scholars operationalized their codes, which is what Model 2 learned from and tries to reproduce when coding your transcripts. As with any coding scheme, different scholars might operationalize concepts slightly differently. You should decide if this coding scheme will be useful to you by reviewing how the authors used it.
When reporting your results from this model, please cite this paper:
Code Number | Code Name | Description | Example Sentence Categories
---|---|---|---
1 | Question (Q) | Asking a question about the other side's preferences, asking for clarification on a statement, or asking for information from the other party. | Asking the other party to state preferences; asking for clarification or more information; asking about the process of making a deal
2 | Information (I) | Comments that provide general information about one's self/business, preferences, needs, or knowledge. | Stating current understandings; information about self; stating desires, interests, or preferences
3 | Reaction-positive (RP) | A short positive response in agreement with a point or validating something previously said. |
4 | Reaction-negative (RN) | A negative response, including "no", shutting down other options, any short negative response, or stating disagreement. |
5 | Substantiation (S) | An attempt at cognitive influence (appeals to rationality, logic, data from the case, interests); normative influence (appeals to reciprocity, fairness, consistency, morality, norms); or emotional influence (threats, statements or questions about alternatives, sympathy, apologies, flattery, bragging, or schmoozing). | Mentions of alternatives; explaining how the other party will benefit from what is offered; fairness (what you are offering is unreasonable for me)
6 | Offers-single issue (OS) | An offer containing only one issue. | Stating your own offer or what you would like
7 | Offers-multiple issue (OM) | An offer containing multiple issues. | Offer includes several issues
8 | Offer accept (OA) | A short affirmation in response to an offer. | Accept an offer
9 | Offer reject (OR) | A short negation in response to an offer. | Reject an offer
10 | Process comment (P) | A statement about the negotiation itself. | Process comment
11 | Miscellaneous-on task (MON) | A comment that is on-task but otherwise uncodable. | On-topic miscellaneous
12 | Miscellaneous-off task (MOFF) | A comment that is off-task and otherwise uncodable. | Off-topic miscellaneous
Transcripts are coded in three steps:
1. Unitization (you need to do this): The model provides one code for each set of words or sentences that you identify as a unit in your Excel document. Units can be speaking turns, sentences, or thought units. Speaking turns are the easiest to set up, since switches between speakers are clearly identifiable in transcripts. Sentences are the next easiest, since they are marked by one of these symbols: . ? ! However, different transcribers may end sentences in different places. Thought units are the hardest to create, since identifying them takes careful analysis and can represent as much work as the coding itself. (See the NegotiAct coding manual for how to create thought units.) Clarity of meaning runs in the opposite direction: the longer the unit, the more likely it contains multiple ideas, and the harder it is for human or AI coders to know which part to code. Aslani et al. (2014) coded speaking turns, but 72% of their speaking turns contained just one sentence. The closest alignment with the training data is to use sentences as the unit.
2. Model Assigns Code: The model assigns a code to each unit you submit, based on in-context learning. Coding is guided by the prompt we developed and tested (for more on in-context learning, see Xie and Min, 2022). Our prompt for this model includes several elements.
3. We Run the Model Five Times: We automatically run the model five times to assess the consistency of results. As expected, the results are not always the same: with in-context learning the model learns anew with each run and may learn slightly differently each time. Variation is also expected because some units may reasonably be coded in several ways. Running the coding model five times gives us five codes for each speaking unit. If three, four, or five of the five runs assign the same code, we report that code and indicate its level of "consistency" (three, four, or five out of five). If there are not at least three consistent results out of five runs, or if the model fails to assign a code, we do not report a model code. In these cases, the researcher needs to do human coding.
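To make this consistency rule concrete, here is a minimal sketch in R of how the five run-level codes for one unit can be reduced to a reported code, or to no reported code when fewer than three runs agree. This is an illustration of the rule, not our production pipeline; the function and variable names are our own.

```r
# Minimal sketch of the 3-out-of-5 consistency rule.
# `run_codes` holds the codes the five runs assigned to a single unit.
consensus_code <- function(run_codes, min_agreement = 3) {
  counts <- table(run_codes)                    # tally the codes across the five runs
  best <- names(counts)[which.max(counts)]      # most frequent code
  consistency <- max(counts)                    # 3, 4, or 5 out of 5
  if (consistency >= min_agreement) {
    data.frame(code = best, consistency = consistency)
  } else {
    data.frame(code = NA, consistency = consistency)  # no code reported; human coding needed
  }
}

consensus_code(c("S", "S", "I", "S", "S"))    # reports "S" with consistency 4
consensus_code(c("S", "I", "Q", "RP", "OA"))  # reports NA; fewer than 3 runs agree
```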
Validation occurred in several steps:
Validation Step 1: Compare the model coding with the human coding by Brett and Nandkeolyar. To do this, we asked the model to code the 3,496 units in the Brett and Nandkeolyar transcripts that were not selected for training. We looked at several criteria.
Table 1: Match Percentage by Consistency Level, Validation Step 1
Level of Consistency | Runs Agreeing | % Achieving This Consistency Level (Among Units Assigned a Code) | Number* | Match with Human Codes | Percentage Match
---|---|---|---|---|---
Modest Consistency | 3 out of 5 | 0% | 0 | not match |
 | | | 0 | match | n/a
High Consistency | 4 out of 5 | 1.4% | 29 | not match |
 | | | 19 | match | 45%
Perfect Consistency | 5 out of 5 | 98.6% | 845 | not match |
 | | | 2,603 | match | 75%
*0 cases did not reach the 3 out of 5 consistency threshold or the model failed to assign a code
We also calculated Cohen's kappa, treating the model codes as coming from one rater and the human codes as coming from a second rater. Compared to the percentage match, this calculation accounts for matches that might occur by chance. Cohen's kappa was calculated in R (R Core Team, 2022) using the irr package (Gamer & Lemon, 2019). Cohen's kappa was 0.70, with a no-information rate of .26 (p-value of the difference < .001). According to Landis and Koch (1977) this represents "substantial agreement", and according to Fleiss (1981) it is "fair to good" agreement. Rather than relying on conventional categorical guidelines to interpret the magnitude of kappa, Bakeman (2023) argues that researchers should estimate observer accuracy, that is, how accurate simulated observers would need to be to produce a given value of kappa. The KappaAcc program (Bakeman, 2022) was used to estimate observer accuracy, which was found to be 86%.
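For readers who want to reproduce this kind of agreement check on their own coded units, the sketch below follows the calculation described above using the irr package in R. The small `codes` data frame and its column names are hypothetical stand-ins for one column of human codes and one column of consensus model codes per unit.

```r
library(irr)

# Hypothetical example data: one row per unit, with the human code and the
# consensus model code (column names are ours, not from the authors' files).
codes <- data.frame(
  human = c("S", "RP", "Q", "I", "S", "OA"),
  model = c("S", "RP", "Q", "S", "S", "OA")
)

# Cohen's kappa, treating the model as one rater and the human as a second rater.
kappa2(codes[, c("human", "model")], weight = "unweighted")

# Raw percentage match and the no-information rate
# (the proportion of the most frequent human code).
mean(codes$human == codes$model)
max(table(codes$human)) / nrow(codes)
```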
It is also worth noting that in many cases where scholars establish inter-coder reliability, a process of cross-rater discussion is used to resolve initial differences of opinion between the two coders. In one study of inter-coder agreement, coder agreements in the 80% range began with initial coder agreements in the 40% range (Garrison et al., 2005). In our case there can be no cross-rater discussion between a model and a human, taking away one step that is often used to achieve higher kappas. The closest we can get to that process is to have a third person review the cases of human-model disagreement and judge which code was more correct. Also, the fact that so many codes need human-to-human discussion to resolve suggests some inherent ambiguity about code assignments and opens up the possibility that several different codes might reasonably be assigned to some segments of transcripts.
In terms of absolute numbers of mismatches, the largest set is 155 human-coded Substantiation units that the model coded as Information. We recognized this issue early in our testing and added instructions to the prompt to reduce it. The fundamental problem is that Substantiation is often achieved by providing information, but to count as Substantiation that information must support a particular argument or claim. There were also 43 cases of human-coded Miscellaneous On-Task that the model coded as Information. The next largest set of mismatches was 42 units where the humans assigned Single-issue offer while the model assigned Multi-issue offer, a confusion that is easy to imagine.
Figure 1: Confusion Matrix, Validation Step 1
Table 2: Match Percentage by Code, Validation Step 1
Human Code | % of units Across All Transcripts | Model Match % |
---|---|---|
S | 31.64 | 70% |
RP | 22.94 | 93% |
Q | 13.56 | 86% |
MON | 8.55 | 69% |
I | 7.61 | 66% |
OS | 7.18 | 60% |
OR | 2.95 | 63% |
OA | 2.43 | 34% |
RN | 1.29 | 49% |
OM | 1.29 | 89% |
P | 0.46 | 6% |
MOFF | 0.11 | 75% |
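Per-code figures like those in Table 2 can be derived from the same kind of unit-level data. Below is a hedged sketch, reusing the hypothetical two-column layout from the kappa example, that computes each human code's share of units and the per-code model match percentage.

```r
# Hypothetical unit-level data: one row per unit with human and model codes.
codes <- data.frame(
  human = c("S", "RP", "Q", "I", "S", "OA"),
  model = c("S", "RP", "Q", "S", "S", "OA")
)
codes$match <- codes$human == codes$model

share <- 100 * prop.table(table(codes$human))              # % of units per human code
match_pct <- 100 * tapply(codes$match, codes$human, mean)  # model match % per code

data.frame(code = names(share),
           pct_of_units = round(as.numeric(share), 2),
           model_match_pct = round(as.numeric(match_pct[names(share)])))
```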
Validation Step 2: Match with Human coding for Different Simulations.
The first step of validation matched human and model codes where the negotiation simulation used for training was the same as the simulation used for testing the model (Cartoon). But users may have transcripts from any number of simulations or real-world negotiations, not just the simulation used in the Brett and Nandkeolyar (unpublished) study. Therefore, we wanted to test how well the model would match human coders who applied the Brett and Nandkeolyar scheme to transcripts from other simulations. We selected 3 transcripts from a study that used The Sweet Shop simulation and 3 transcripts from a study that used the Les Florets simulation. Since these transcripts had not been coded using the Brett and Nandkeolyar codes, we trained two coders to use those codes. After initial training, the coders reached an inter-coder reliability of k = .73. They coded the transcripts separately and then met to discuss any cases where they disagreed and assign a code. This provided the human codes for the Sweet Shop and Les Florets transcripts, which were then coded using our model.
The 6 transcripts had 1,302 speaking turns, of which 99% were single sentences. The model had perfect consistency for 87% of the speaking turns (all five runs assigned the same code), high consistency for 10% (4 out of 5 runs assigned the same code), and modest consistency for 1.6% (3 out of 5 runs assigned the same code). There was 1 case of less than 3-out-of-5 consistency. The match percentage was 67% for perfect consistency codes, 81% for high consistency codes, and 45% for modest consistency codes (see Table 3). Overall, the match percentage was 68%. This was lower than in our prior test, as expected, because these transcripts did not cover the same issues and topics as the training transcripts (which used the Cartoon simulation). For that reason, these results may better represent the model's effectiveness with most transcripts. We also checked whether one set of transcripts did better than the other: the match percentage was 70% for the Les Florets transcripts and 64% for the Sweet Shop transcripts, suggesting that the model performs comparably across transcripts from different simulations.
We also calculated Cohen's kappa. The weighted Cohen's kappa was .63, with a no-information rate of .26 (p-value of the difference < .001). According to Landis and Koch (1977) this kappa represents "moderate agreement", and according to Fleiss (1981) it is "fair to good" agreement. Rather than relying on conventional categorical guidelines to interpret the magnitude of kappa, Bakeman (2023) argues that researchers should estimate observer accuracy, that is, how accurate simulated observers would need to be to produce a given value of kappa. The KappaAcc program (Bakeman, 2022) was used to estimate observer accuracy, which was found to be 81%.
Table 3: Match Percentage by Consistency Level, Validation Step 2
Level of Consistency | Runs Agreeing | % Achieving This Consistency Level (Among Units Assigned a Code) | Number* | Match with Human Codes | Percentage Match
---|---|---|---|---|---
Modest Consistency | 3 out of 5 | 1.7% | 12 | not match |
 | | | 10 | match | 45%
High Consistency | 4 out of 5 | 10.1% | 25 | not match |
 | | | 106 | match | 81%
Perfect Consistency | 5 out of 5 | 87.8% | 377 | not match |
 | | | 771 | match | 67%
*1 case did not reach the 3 out of 5 consistency threshold or the model failed to assign a code
The proportion of speaking units that fell into each category was roughly similar to what we saw in the first validation test, with most speaking units being Information, Question, and Reaction-Positive. In this set of transcripts Substantiation was also fairly common (see Table 4). As with the first validation test, the model-human match percentage appears to be highly correlated with how frequently a code occurs.
The confusion matrix (see Figure 2) shows that, once again, the largest number of mismatches comes from Information/Substantiation. It also shows that nearly all of the mismatches were cases where the model assigned a code of Information when the humans assigned various other codes.
Figure 2: Confusion Matrix, Validation Step 2
Table 4: Match Percentage by Code, Validation Step 2
Human Code | % of units Across All Transcripts | Model Match % |
---|---|---|
I | 21.98 | 57% |
S | 16.45 | 84% |
RP | 15.30 | 80% |
Q | 14.37 | 95% |
P | 6.69 | 45% |
MON | 6.46 | 49% |
OS | 6.30 | 57% |
MOFF | 4.77 | 31% |
OA | 2.23 | 52% |
OR | 2.15 | 82% |
OM | 1.84 | 71% |
RN | 1.46 | 32% |
To assess the mismatches, we collected a random sample of 98 sentences with mismatches, along with the two prior sentences and the human and model codes. We then removed the column labels and randomly mixed the order of the two codes, so the judge could not tell which code came from the human and which from the model. Since the human coding in this case was done by our coding team, we wanted a different person to select which of the two codes was more correct; this was done by the first author. The results are shown in Table 5 (a rough sketch of the sampling and blinding step appears after the table). About 47% of the time the human code was deemed more accurate, while in 40% of the cases the model code was deemed more accurate. In another 6% of cases, both were deemed correct (because, for example, the sentence was long and really contained two thought units). In 7% of cases both the human and model codes were deemed incorrect or the sentence was uninterpretable (because, for example, there were missing words in the transcription). Looking, then, at the 32% of speaking units that were mismatches, perhaps half of them might still be deemed accurate, bringing the match percentage up from 68% to about 85%.
Table 5: Assessment of 98 Sample Mismatches
Code Selection | Outcome | Count | Category Total
---|---|---|---
Clear Choice | Human Code is Correct | 46 | 85
 | Model Code is Correct | 39 |
Both Correct | | 6 | 6
Both Incorrect | Human and Model Both Incorrect | 2 | 2
Not Understood | Could not Understand the Sentence | 5 | 5
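Researchers who want to run a similar audit on their own disagreements could sample and blind mismatches along the following lines. This is a rough sketch; the data frame and column names (`sentence`, `human`, `model`) are placeholders of ours, and for brevity it omits the two prior context sentences that we included in our review sheets.

```r
set.seed(1)

# Hypothetical unit-level data: the sentence text plus human and model codes.
units <- data.frame(
  sentence = paste("sentence", 1:500),
  human = sample(c("S", "I", "Q", "RP", "OS"), 500, replace = TRUE),
  model = sample(c("S", "I", "Q", "RP", "OS"), 500, replace = TRUE)
)

# Keep only disagreements and draw the review sample.
mismatches <- units[units$human != units$model, ]
sampled <- mismatches[sample(nrow(mismatches), min(98, nrow(mismatches))), ]

# Randomly swap which code appears in which column so the judge cannot tell
# the human code from the model code.
flip <- sample(c(TRUE, FALSE), nrow(sampled), replace = TRUE)
sampled$code_A <- ifelse(flip, sampled$human, sampled$model)
sampled$code_B <- ifelse(flip, sampled$model, sampled$human)

# The judge sees only the sentence and the two unlabeled codes.
review_sheet <- sampled[, c("sentence", "code_A", "code_B")]
head(review_sheet)
```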
Set up your transcripts for analysis by putting them into an Excel sheet. Files must not be longer than 999 rows (if you have longer transcripts, split them into smaller files). The format should be as shown below. Label the first column "SpeakerName" and list whatever names you have for the speakers (e.g., buyer/seller, John/Mary). Label the second column "Content" and include the material contained in your unit of analysis (which may be a speaking turn, a sentence, or a thought unit). Also include columns for "ResearcherName", "Email", and "Institution" (often a university) and enter that information once, in the first row beneath the headings. Note that there is no space in the headings "SpeakerName" and "ResearcherName." (A code sketch for building such a file appears at the end of this section.)
If you use speaking turns then speakers will alternate, and the format will look like this:
SpeakerName | Content | ResearcherName | Email | Institution
---|---|---|---|---|
Buyer | Words in a speaking turn… | Your Name | Your Email | Your Institution |
Seller | Words in a speaking turn… | |||
Buyer | Words in a speaking turn… | |||
Seller | Words in a speaking turn… | |||
etc. | Words in a speaking turn… |
If you use sentences or thought units then it is possible that speakers may appear several times in a row, and the format will look like this:
SpeakerName | Content | ResearcherName | Email | Institution
---|---|---|---|---|
Buyer | Words in sentence or thought unit… | Your Name | Your Email | Your Institution |
Seller | Words in sentence or thought unit… | |||
Seller | Words in sentence or thought unit… | |||
Seller | Words in sentence or thought unit… | |||
Buyer | Words in sentence or thought unit… | |||
Buyer | Words in sentence or thought unit… | |||
Seller | Words in sentence or thought unit… | |||
etc. | Words in sentence or thought unit… |
Create one Excel file for each transcript. Name each file in the following way: YourName_StudyName_TranscriptNumber.
For example, my first file would be named “RayFriedman_CrownStudy_1” and the second file would be named “RayFriedman_CrownStudy_2”, and so on.
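As an illustration, the sketch below builds one correctly formatted file in R. It assumes the writexl package; any package that writes .xlsx files will do, as long as the column headings match those described above. The researcher details shown are placeholders.

```r
library(writexl)

# One transcript per file, with the required headings (no spaces in
# "SpeakerName" or "ResearcherName"). Researcher details go in the first
# data row only.
transcript <- data.frame(
  SpeakerName    = c("Buyer", "Seller", "Buyer", "Seller"),
  Content        = c("Words in a speaking turn...",
                     "Words in a speaking turn...",
                     "Words in a speaking turn...",
                     "Words in a speaking turn..."),
  ResearcherName = c("Your Name", "", "", ""),
  Email          = c("your.email@university.edu", "", "", ""),
  Institution    = c("Your Institution", "", "", "")
)

# File name follows the YourName_StudyName_TranscriptNumber pattern shown above.
write_xlsx(transcript, "RayFriedman_CrownStudy_1.xlsx")
```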