# Create the data frame with participant and fruit variables
<- data.frame(
df Participant = c(1:15),
Apple = c(1, 3, 2, 1, 2, 3, 2, 3, 1, 2, 3, 1, 2, 1, 3),
Banana = c(3, 1, 3, 2, 1, 2, 1, 2, 3, 1, 2, 3, 1, 3, 2),
Cherry = c(2, 2, 1, 3, 3, 1, 3, 1, 2, 3, 1, 3, 3, 2, 1),
Grape = c(3, 1, 2, 2, 1, 3, 2, 3, 1, 2, 3, 1, 2, 3, 3),
Orange = c(1, 3, 1, 1, 2, 1, 3, 1, 3, 1, 2, 3, 3, 2, 2),
Strawberry = c(2, 2, 3, 3, 3, 2, 1, 2, 2, 3, 1, 2, 1, 1, 1)
)
Exploring Pile Sorts with ChatGPT
I asked ChatGPT a bit about pile sorting (Figure 1), then I requested that the AI demonstrate the method with sample data. Below is the conversation that followed.
The Conversation
USER: General Orienting Question
Please explain the ethnographic field method of pile sorting. Detail how to conduct the actual study. Then list the types of analysis and visualizations that are traditionally used with this method.
ASSISTANT
Pile sorting is a qualitative research method used in ethnography to understand how people categorize and organize different aspects of their world. The method involves asking participants to sort a set of cards, pictures, or objects into groups or piles based on how they perceive similarities or differences between them.
To conduct a pile sorting study, the following steps are typically followed:
Recruitment of participants: The researcher identifies and recruits participants who are knowledgeable about the research topic and are willing to take part in the study.
Preparation of stimuli: The researcher prepares a set of stimuli (such as cards or objects) that represent different aspects of the research topic.
Conducting the sorting task: The participants are asked to sort the stimuli into piles or groups based on their own criteria. The participants may be asked to explain their reasoning as they sort the stimuli.
Recording the data: The researcher records the placement of each stimulus in each pile, as well as any comments or explanations provided by the participants.
Analysis of data: The researcher analyzes the data collected through pile sorting to identify patterns and themes that emerge from the participants’ categorization of the stimuli.
Types of analysis and visualizations traditionally used with pile sorting include:
Hierarchical Cluster Analysis: This involves using statistical methods to group the stimuli into clusters based on similarities and differences identified by participants.
Multidimensional Scaling: This method involves creating a visual representation of the data collected from pile sorting, with the distance between stimuli representing the degree of similarity or difference.
Qualitative Analysis: This involves analyzing the comments and explanations provided by participants during the sorting task to identify themes and patterns.
Tree Diagrams: This is a visualization technique that represents the hierarchy of categories and subcategories identified by participants during the sorting task.
Overall, pile sorting is a useful method for understanding how people categorize and organize information in their world, and can provide valuable insights into cultural and social practices.
ChatGPT did not identify correspondence analysis as a key technique and did not identify the fact that tree diagramming would be better listed as a visualization method under clustering. It seems the AI missed some key pieces (Figure 2).
USER: Demonstrate Data Structure
Please represent a sample data set for what is described in step 4. Let there be 15 participants (Participant 1-15) sorting 6 types of fruit (apple, banana, cherry, grape, orange, strawberry) into three piles (1-3).
ASSISTANT
Here is an example of a sample data set for 15 participants sorting 6 types of fruit into three piles:
Participant | Apple | Banana | Cherry | Grape | Orange | Strawberry |
---|---|---|---|---|---|---|
1 | 1 | 3 | 2 | 3 | 1 | 2 |
2 | 3 | 1 | 2 | 1 | 3 | 2 |
3 | 2 | 3 | 1 | 2 | 1 | 3 |
4 | 1 | 2 | 3 | 2 | 1 | 3 |
5 | 2 | 1 | 3 | 1 | 2 | 3 |
6 | 3 | 2 | 1 | 3 | 1 | 2 |
7 | 2 | 1 | 3 | 2 | 3 | 1 |
8 | 3 | 2 | 1 | 3 | 1 | 2 |
9 | 1 | 3 | 2 | 1 | 3 | 2 |
10 | 2 | 1 | 3 | 2 | 1 | 3 |
11 | 3 | 2 | 1 | 3 | 2 | 1 |
12 | 1 | 2 | 3 | 1 | 3 | 2 |
13 | 2 | 1 | 3 | 2 | 3 | 1 |
14 | 1 | 3 | 2 | 3 | 2 | 1 |
15 | 3 | 2 | 1 | 3 | 2 | 1 |
In this data set, each participant (identified by Participant 1-15) sorted 6 types of fruit (apple, banana, cherry, grape, orange, and strawberry) into three piles (1-3). The numbers in the cells represent the pile number that each fruit was sorted into by each participant. For example, Participant 1 sorted apple into pile 1, banana into pile 3, cherry into pile 2, grape into pile 3, orange into pile 1, and strawberry into pile 2.
In early versions of this conversation, I asked ChatGPT to demonstrate the sample data set directly in R. Unfortunately, regardless of various phrasings of the questions I got erratically structured results. Asking ChatGPT to first present the structure of the and then as a second step express that structure in R produced much more stable results.
Out of the gate, ChatGPT will produce a much smaller data set. The default results are about five individuals, six types of fruits and three groups. Therefore, in the prompt I specifically asked ChatGPT to generate a dataset with more participants.
USER: Express Data in R
Please express this dataset in the R computer language naming the object df
.
ASSISTANT
Here is an example of how to create the df
data frame in R:
In this code, we use the data.frame
function to create a data frame with the participant variable and columns for each type of fruit. The c
function is used to specify the values for each variable in the data frame.
USER: Cluster
Great, that looks perfect. Now use the R language to perform hierarchical clustering analysis. Then describe the analysis.
ASSISTANT
Here’s an example of how to perform hierarchical clustering analysis on the df
data frame in R:
# Extract the fruit columns from the data frame
<- df[, 2:7]
fruit
# Calculate the dissimilarity matrix using Euclidean distance
<- dist(fruit, method = "euclidean")
dist_matrix
# Perform hierarchical clustering using Ward's method
<- hclust(dist_matrix, method = "ward.D2")
hc
# Plot the dendrogram
plot(hc, main = "Fruit Pile Sorting Dendrogram")
In this code, we first extract the columns representing the fruit sorting data from the df
data frame and store them in a new data frame called fruit
. We then calculate the dissimilarity matrix using Euclidean distance, which measures the distance between each pair of observations (in this case, participants) based on their fruit sorting patterns. We use the dist
function to calculate the distance matrix.
Next, we perform hierarchical clustering using Ward’s method, which seeks to minimize the variance of the distances between clusters. We use the hclust
function to perform the clustering and store the result in a new object called hc
.
Finally, we plot the dendrogram using the plot
function, which displays the hierarchical clustering tree. The height of each branch in the tree represents the distance between clusters, with longer branches indicating greater dissimilarity. The dendrogram can be used to identify clusters of participants who sorted the fruit in similar ways.
Hierarchical clustering is a type of unsupervised machine learning algorithm that groups similar observations (in this case, participants) together based on their measured similarity or dissimilarity. In this example, we used Ward’s method to group participants based on the dissimilarity of their fruit sorting patterns, as measured by Euclidean distance. The resulting dendrogram can be used to identify clusters of participants who sorted the fruit in similar ways, which can help to identify patterns or commonalities in their perceptions of the fruit.
ChatGPT did not offer a scree plot. With the help of this SO answer, I added a scree plot manually. Since the groups are pretty random, there isn’t a strong break in the scree.
plot(x = length(hc$height):1,
y = hc$height)
lines(x = length(hc$height):1,
y = hc$height)
USER: Multidimensional Scaling
Ok, that looks good. Now demonstrate the use of multidimensional scaling on the df
data set. Then describe the analysis.
ASSISTANT
Here’s an example of how to perform multidimensional scaling (MDS) analysis on the df
data frame in R:
# Extract the fruit columns from the data frame
<- df[, 2:7]
fruit
# Calculate the dissimilarity matrix using Euclidean distance
<- dist(fruit, method = "euclidean")
dist_matrix
# Perform MDS analysis
<- cmdscale(dist_matrix, k = 2)
mds
# Plot the MDS solution
plot(mds, type = "n", main = "Fruit Pile Sorting MDS")
text(mds, labels = rownames(fruit))
In this code, we first extract the columns representing the fruit sorting data from the df
data frame and store them in a new data frame called fruit
. We then calculate the dissimilarity matrix using Euclidean distance, as in the hierarchical clustering example.
Next, we perform MDS analysis using the cmdscale
function. MDS is a technique used to visualize the similarity or dissimilarity of a set of observations (in this case, participants) in a lower-dimensional space, such as a two-dimensional plane. The k
parameter specifies the number of dimensions to use for the MDS solution; in this example, we use k = 2
to obtain a two-dimensional solution.
Finally, we plot the MDS solution using the plot
function, with type = "n"
to create an empty plot, and then use the text
function to add labels to the plot representing each participant. The resulting plot shows the position of each participant in the two-dimensional space, based on the similarity or dissimilarity of their fruit sorting patterns.
MDS analysis can be useful for visualizing the similarities and differences between a set of observations in a low-dimensional space, making it easier to identify patterns and trends in the data. In this example, we used MDS to visualize the similarity or dissimilarity of participants’ fruit sorting patterns in a two-dimensional space, which can help to identify groups or clusters of participants who sorted the fruit in similar ways.
USER: Correspondence Analysis
Ok, that looks good. Using the R computer language and the FactoMineR
R library, demonstrate the use of correspondence analysis on the df
data set. Then describe the analysis.
I had trouble getting accurate and consistent results on this step and explicitly defined the library and function to use.
ASSISTANT
Here’s an example of how to perform correspondence analysis on the df
data frame in R using the FactoMineR
library:
# Load the FactoMineR library
library(FactoMineR)
Warning: package 'FactoMineR' was built under R version 4.2.3
# Perform correspondence analysis on the fruit data
<- CA(df[, 2:7]) res.ca
# Plot the first two dimensions of the CA results
plot(res.ca, axes = c(1,2), main = "Fruit Pile Sorting Correspondence Analysis")
In this code, we first load the FactoMineR
library using the library
function. We then perform correspondence analysis on the df
data frame using the CA
function from the FactoMineR
library. Correspondence analysis is a statistical technique used to explore the relationship between two categorical variables, in this case, the participants’ fruit sorting patterns and the types of fruit.
The CA
function takes the data frame as an argument and automatically detects the categorical variables in the data set. The resulting res.ca
object contains the results of the correspondence analysis, including the eigenvalues, percentage of variance explained by each dimension, and factor scores for each participant and fruit type.
Finally, we plot the first two dimensions of the correspondence analysis results using the plot
function. The resulting plot shows the relationship between the participants’ fruit sorting patterns and the types of fruit, based on the two dimensions with the highest eigenvalues.
Correspondence analysis can be useful for identifying relationships between two categorical variables and exploring patterns or trends in the data. In this example, we used correspondence analysis to explore the relationship between the participants’ fruit sorting patterns and the types of fruit, which can help to identify any patterns or associations between the two variables.
Citation
@online{craig2023,
author = {Craig, Nathan and , ChatGPT},
title = {Exploring {Pile} {Sorts} with {ChatGPT}},
date = {2023-04-10},
url = {https://nmc.quarto.pub/nmc/posts/2023-04-10-sorting-w-gpt},
langid = {en}
}