AI Field Trial Face-Offs

Word of Lore Field Trial Face-Offs are published continuously for ongoing trials, and ratings are adjusted on a regular schedule. The current rating period ends on December 31, 2024.

≛ Co-Proficient
≳ Marginal Preference
> Preference
≫ Strong Preference
⋙ Absolute Preference
⨉ Mutually Suboptimal

Email Generation for Work: Claude 3.5 Haiku vs OpenAI o1-preview

Claude 3.5 Haiku prevails over OpenAI o1-preview at the Email Generation for Work trial

Email Quality: OpenAI o1 ≛ Claude 3.5 Haiku

  • OpenAI o1 consistently demonstrated superior formatting and visual organization
  • Claude 3.5 Haiku excelled in providing detailed, comprehensive content
  • Both maintained grammatically correct and coherent communication
  • OpenAI o1's systematic use of bold text and sections enhanced readability
  • Claude 3.5 Haiku sometimes lacked consistent visual formatting elements

Accuracy and Information Integrity: Claude 3.5 Haiku ≳ OpenAI o1

  • Both models maintained high accuracy in representing provided information
  • Claude 3.5 Haiku typically included more specific details from given contexts
  • Neither model showed a tendency to fabricate or embellish information
  • Both accurately adhered to provided scenarios and requirements

Relevance and Customization: Claude 3.5 Haiku > OpenAI o1

  • Claude 3.5 Haiku showed stronger ability to incorporate specific context details
  • OpenAI o1 excelled in structured presentation of information
  • Both demonstrated good adaptation to various business scenarios
  • Claude 3.5 Haiku provided more comprehensive solutions in complex scenarios

Consistency: OpenAI o1 ≛ Claude 3.5 Haiku

  • Both maintained professional tone throughout all communications
  • OpenAI o1 showed more consistent formatting across different email types
  • Claude 3.5 Haiku maintained more consistent detail level across responses
  • Neither model showed significant variation in quality across tasks

User Experience: Claude 3.5 Haiku ≳ OpenAI o1

  • Both offer intuitive chat interfaces available across major platforms
  • Both provide clear and understandable user interfaces
  • Claude 3.5 Haiku generally completes responses more quickly
  • OpenAI o1 takes longer to formulate final answers
  • Both maintain responsive and stable performance during use

Authenticity: Claude 3.5 Haiku > OpenAI o1

  • Claude 3.5 Haiku demonstrated stronger personal connection in communications
  • Both maintained appropriate professional tone
  • Claude 3.5 Haiku showed more genuine enthusiasm in positive communications
  • OpenAI o1 balanced formality with approachability effectively
  • OpenAI o1 could be overly reserved in personal/emotional contexts

Conclusion: Claude 3.5 Haiku > OpenAI o1-preview

The trial results show distinct strengths in both models. Claude 3.5 Haiku excelled in detailed content, personal connection, and comprehensive solutions. OpenAI o1-preview demonstrated superior formatting and structure. The results suggest that Claude 3.5 Haiku might be more suitable for complex, relationship-focused communications, while OpenAI o1-preview could be preferred for structured, information-heavy communications requiring clear organization.

AI-Generated Text Detection: ContentDetector.AI vs Writer's AI Content Detector

ContentDetector.AI prevails over Writer's AI Content Detector at the AI-Generated Text Detection Trial

Results of a face-off between ContentDetector.AI and Writer's AI Content Detector at the AI-Generated Text Detection Trial.

Accuracy: ContentDetector.AI ⋙ Writer's AI Content Detector

  • ContentDetector.AI outperforms Writer's AI Content Detector across all measured metrics.
  • Writer's AI Content Detector's disclaimer makes it clear that the tool is optimized to minimize false positives, and our findings confirm this.
  • Full metrics breakdown (metric definitions are sketched in the code after this list):
    • F1-score: 60% for ContentDetector.AI; not available for Writer's AI Content Detector, which flagged no samples as AI-generated in our tests
    • Accuracy: 69% (ContentDetector.AI) vs 47% (Writer's AI Content Detector)
    • False Positive Rate: 26% (ContentDetector.AI) vs 0% (Writer's AI Content Detector)
    • False Negative Rate: 35% (ContentDetector.AI) vs 100% (Writer's AI Content Detector)
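
For readers who want to see how these numbers relate to one another, here is a minimal sketch of the standard definitions, treating "AI-generated" as the positive class. The counts in the example call are hypothetical placeholders, not the trial data.

def detection_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics, with 'AI-generated' as the positive class."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else float("nan")  # undefined when nothing is flagged as AI
    recall = tp / (tp + fn)                                    # equals 1 - false negative rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else float("nan"))
    fpr = fp / (fp + tn)   # share of human-written samples flagged as AI
    fnr = fn / (fn + tp)   # share of AI-generated samples missed
    return {"accuracy": accuracy, "f1": f1, "fpr": fpr, "fnr": fnr}

# Hypothetical counts for illustration only: 8 AI samples caught, 2 missed,
# 3 human samples mis-flagged, 7 human samples correctly passed.
print(detection_metrics(tp=8, fp=3, tn=7, fn=2))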

Robustness: ContentDetector.AI ⋙ Writer's AI Content Detector

  • ContentDetector.AI performs best on casual writing, creative writing, and product descriptions.
  • ContentDetector.AI performs worst on technical writing.

Explainability: ContentDetector.AI ≫ Writer's AI Content Detector

  • ContentDetector.AI highlights specific sentences that it deems AI-generated
  • The two tools surface the results of detection differently: Writer shows the percentage of text that it deems human-generated, while ContentDetector.AI surfaces a probability score of content being AI-generated.

Conclusion: ContentDetector.AI ⋙ Writer's AI Content Detector

ContentDetector.AI by Content at Scale emerges as the clear prevailing AI in this trial. While Writer's AI Content Detector achieves its stated goal of eliminating false positives (reaching a 0% false-positive rate in our tests), this comes at a significant cost: the tool struggles to identify genuine AI-generated content. Both tools offered free access without usage limits or paywalls during our testing period.

Summarizing Articles: Perplexity AI Companion vs Ghostreader & GPT-4o mini

Perplexity AI Companion prevails over Ghostreader (Readwise) & GPT-4o mini (OpenAI) at the Article Summarization Trial

Results of a face-off between Perplexity AI Companion and Ghostreader (Reader’s AI-powered assistant) with GPT-4o mini at the Summarizing Articles with AI-Powered Content Condensation Trial.

Conciseness: Ghostreader & GPT-4o mini ≫ Perplexity AI Companion

  • Ghostreader consistently produces shorter summaries while maintaining essential information
  • Perplexity tends to generate significantly longer summaries, sometimes providing non-essential details

Accuracy and Objectivity: Perplexity AI Companion ≛ Ghostreader & GPT-4o mini

  • Both generally maintain objective tone in summaries
  • Ghostreader maintains faithful representation of source material without distortion
  • Perplexity provides more comprehensive technical details when handling complex topics

Coherence and Readability: Perplexity AI Companion > Ghostreader & GPT-4o mini

  • Perplexity excels in organizing complex information with clear hierarchical structure
  • Ghostreader maintains flow in a short format, typically within one paragraph

Balance of Completeness and Relevance: Perplexity AI Companion ≫ Ghostreader & GPT-4o mini

  • Perplexity captures more comprehensive coverage of technical topics and research papers
  • Ghostreader focuses on core messages without getting into details

Convenience and ease of use: Perplexity AI Companion ≛ Ghostreader & GPT-4o mini

  • Both are convenient to use on the web, offering browser extensions
  • Ghostreader provides automatic summarization for all saved articles within Reader
  • Perplexity was occasionally unable to summarize articles on the web

Conclusion: Perplexity AI Companion > Ghostreader & GPT-4o mini

The two tools take quite different approaches to summarization. While Perplexity AI Companion offers more comprehensive summaries, Ghostreader produces better at-a-glance overviews that are great for quickly understanding what an article is about. Given its consistent performance in readability and in balancing completeness and relevance, we declare Perplexity AI Companion the prevailing AI for this trial.

AI Judgement: Mistral Large 2 vs Cohere Command R+

Command R+ (Cohere) prevails over the Mistral Large 2 (Mistral AI) model at the AI Judgement task

Rationality and Logic: Command R+ ≛ Mistral Large 2

  • Both show strong analytical capabilities in ethical dilemmas and identify logical fallacies accurately
  • Mistral Large 2 tends to be more verbose and sometimes redundant
  • Command R+ occasionally overcomplicates simple scenarios

Impartiality: Command R+ ≛ Mistral Large 2

  • Both consistently recognize conflicts of interest and need for recusal
  • Both maintain strong ethical standards in decision-making
  • Command R+ shows excessive caution in some scenarios
  • Mistral Large 2 sometimes shows inconsistency in detail level between options

Deterrence and Marginality: Command R+ > Mistral Large 2

  • Both excel at recognizing marginal differences
  • Command R+ is better at suggesting alternative selection methods
  • Mistral Large 2 sometimes over-analyzes marginal cases

Consistency: Command R+ > Mistral Large 2

  • Both maintain consistent ethical principles across scenarios and provide clear reasoning for decisions
  • Command R+ shows slight edge in consistency in applying standards
  • Mistral Large 2 varies in analysis depth between similar cases

Ethical Considerations: Command R+ ≛ Mistral Large 2

  • Both show strong grasp of ethical principles and maintain a professional tone throughout
  • Both handle whistleblowing and bias scenarios well
  • Mistral Large 2 is sometimes too cautious in cases of clear ethical violations
  • Command R+ occasionally overemphasizes theoretical frameworks

Overall: Command R+ ≳ Mistral Large 2

The margin is small, but Command R+ demonstrates slightly better consistency and deterrence handling in marginal cases. We declare Command R+ the prevailing AI for the AI Judgement trial; however, both models show strong potential for AI-driven arbitration.

Summarizing Articles: DuckDuckGo & GPT-4o mini vs DuckDuckGo & Claude 3 Haiku

DuckDuckGo AI Chat & GPT-4o mini prevails over DuckDuckGo AI Chat & Claude 3 Haiku at the Article Summarization trial

Results of a face-off between DuckDuckGo AI Chat with the GPT-4o mini and Claude 3 Haiku models at the Summarizing Articles with AI-Powered Content Condensation Trial.

Conciseness: GPT-4o mini > Claude 3 Haiku

  • GPT-4o mini consistently produced more concise summaries while maintaining essential information
  • Claude 3 Haiku's summaries were often longer and included unnecessary details

Accuracy and Objectivity: Claude 3 Haiku ≳ GPT-4o mini

  • Claude 3 Haiku's summaries generally provided more specific and verifiable information
  • Both models maintained objectivity and avoided introducing bias
  • GPT-4o mini sometimes presented more general or vague information

Coherence and Readability: GPT-4o mini ≛ Claude 3 Haiku

  • Both models produced well-structured and easy-to-read summaries
  • Claude 3 Haiku's use of numbered points or paragraphs sometimes enhanced readability

Balance of Completeness and Relevance: Claude 3 Haiku ≳ GPT-4o mini

  • Claude 3 Haiku often provided more comprehensive coverage of key points and specific details
  • GPT-4o mini's summaries, while relevant, sometimes missed crucial aspects or nuances

Convenience and Ease of Use: GPT-4o mini ⋙ Claude 3 Haiku

  • Claude 3 Haiku was unable to access more than half of the test articles
  • The user experience is identical through the DuckDuckGo AI Chat web interface

Conclusion: GPT-4o mini ≫ Claude 3 Haiku

While Claude 3 Haiku had a slight edge in completeness and accuracy, it could not access most of the article links. Therefore, we declare DuckDuckGo AI Chat & GPT-4o mini the winner in this face-off.

AI-Generated Text Detection: QuillBot vs Sapling

QuillBot AI Detector prevails over Sapling AI Detector at the AI-Generated Text Detection Trial

Results of a face-off between QuillBot AI Detector and Sapling AI Detector at the AI-Generated Text Detection Trial.

Accuracy: QuillBot ≫ Sapling

  • QuillBot is much more accurate overall.
  • Sapling flags roughly 1 in 4 human-written samples as AI-generated; its false positive rate is about 5 times QuillBot's (see the arithmetic check after this list).
  • QuillBot is roughly one-third as likely to mislabel AI-generated content as human-generated.
  • Full metrics breakdown:
    • F1-score: 68.66% (QuillBot) vs 67.20% (Sapling)
    • Accuracy: 95.56% (QuillBot) vs 82.02% (Sapling)
    • False Positive Rate: 4.76% (QuillBot) vs 24.39% (Sapling)
    • False Negative Rate: 4.17% (QuillBot) vs 12.50% (Sapling)
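
As a quick arithmetic check of the ratio claims above, the reported rates can be compared directly. This sketch operates only on the published percentages; it does not re-run the trial.

# Values copied from the metrics breakdown above (percentages).
quillbot = {"fpr": 4.76, "fnr": 4.17}
sapling = {"fpr": 24.39, "fnr": 12.50}

# Sapling's false positive rate is roughly 5x QuillBot's (24.39 / 4.76 ≈ 5.1),
# and its false negative rate is roughly 3x QuillBot's (12.50 / 4.17 ≈ 3.0).
print(round(sapling["fpr"] / quillbot["fpr"], 1))  # -> 5.1
print(round(sapling["fnr"] / quillbot["fnr"], 1))  # -> 3.0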

Robustness: QuillBot ≫ Sapling

  • QuillBot performs consistently across all categories, from social media posts to technical and academic writing.
  • Sapling consistently underperforms in detecting AI in academic writing, technical writing, and social media posts.
  • Both tools work equally well in detecting AI in casual writing and product descriptions.

Explainability: QuillBot ≳ Sapling

  • Both tools highlight content that is likely AI-generated
  • QuillBot provides a refined identification according to the level of AI involvement: AI-generated, AI-generated & AI-refined, Human-written & AI-refined, or Human-written
  • Sapling highlights and color-codes each sentence according to the likelihood of being AI-generated.

Conclusion: QuillBot AI Detector ≫ Sapling AI Detector

Both tools can identify AI-generated content, with QuillBot having a clear edge in accuracy and robustness, making it the go-to tool across domains. Sapling does well at identifying AI content in casual writing, making it a good choice for detecting AI in general, non-specialized content. When it comes to explainability, both tools highlight suspected AI-generated passages, but QuillBot takes it a step further with a gradation of AI involvement. Overall, we declare QuillBot AI Detector the prevailing AI in this face-off.

Summarizing Articles: Mistral Le Chat & NeMo vs Gemini Advanced

Mistral NeMo (Mistral Le Chat) prevails over Gemini Advanced (Google) at the Article Summarization Trial

Results of a face-off between Mistral Le Chat with Mistral NeMo and Gemini Advanced at the Summarizing Articles with AI-Powered Content Condensation Trial.

Conciseness: Gemini Advanced > Mistral NeMo

  • Gemini Advanced is preferred for shorter summaries while capturing key points
  • Mistral NeMo provided more detailed but longer summaries

Accuracy and Objectivity: Mistral NeMo ≫ Gemini Advanced

  • Mistral NeMo typically more accurate and comprehensive, especially for complex topics
  • Gemini Advanced occasionally contained inaccuracies or missed important details
  • Both maintained objectivity in most cases

Coherence and Readability: Mistral NeMo ≫ Gemini Advanced

  • Mistral NeMo often used clearer structures with numbered points or bullet lists
  • Gemini Advanced's brevity sometimes resulted in less organized presentation
  • Both generally produced coherent and readable summaries

Balance Between Completeness and Relevance: Mistral NeMo ≫ Gemini Advanced

  • Mistral NeMo usually provided more complete coverage of article content
  • Gemini Advanced focused on core ideas but sometimes omitted relevant details
  • Mistral NeMo better at handling complex topics requiring nuanced explanations

Convenience and Ease of Use: Mistral NeMo ≫ Gemini Advanced

  • Both platforms feature user-friendly chat interfaces
  • Gemini Advanced captured only 65% of the articles due to technical limitations in accessing the provided websites
  • Mistral Le Chat was able to capture all articles from a variety of sources
  • Gemini Advanced was unable to summarize any articles related to politics around election time
  • Gemini is available on mobile, offering greater convenience for mobile users

Conclusion: Le Chat & Mistral NeMo ≫ Gemini Advanced

Mistral NeMo generally excelled in accuracy, completeness, structured presentation, and convenience, making it preferable for complex topics or when detailed understanding is required. Gemini Advanced showed strength in conciseness for quick reference.

Summarizing Articles: DuckDuckGo & Llama 3.1 70B vs DuckDuckGo & Mixtral 8x7B

DuckDuckGo AI Chat & Llama 3.1 70B prevails over DuckDuckGo AI Chat & Mixtral 8x7B at the Article Summarization trial

For brevity, the comparison below refers to the models alone; both were used through DuckDuckGo AI Chat.

Conciseness: Mixtral 8x7B > Llama 3.1 70B

  • Mixtral 8x7B consistently produced shorter summaries while capturing essential information
  • Llama 3.1 70B's summaries were often longer and more detailed

Accuracy and objectivity: Llama 3.1 70B ≫ Mixtral 8x7B

  • Llama 3.1 70B typically provided more accurate representations of the original text
  • Both systems generally maintained objectivity
  • Mixtral 8x7B occasionally included information not present in the original content

Coherence and readability: Llama 3.1 70B ≳ Mixtral 8x7B

  • Both systems produced coherent and readable summaries
  • Llama 3.1 70B often used bullet points or clear sections, enhancing structure

Balance of Completeness and Relevance: Llama 3.1 70B ≫ Mixtral 8x7B

  • Llama 3.1 70B frequently offered more comprehensive coverage of key points
  • Mixtral 8x7B sometimes omitted important details in favor of brevity

Convenience and ease of use: Llama 3.1 70B ≛ Mixtral 8x7B

  • Both models are easily accessible in DuckDuckGo AI Chat on the web

Conclusion: Llama 3.1 70B ≫ Mixtral 8x7B

DuckDuckGo & Llama 3.1 70B generally performed better in accuracy, completeness, and structured presentation, providing comprehensive and easily navigable summaries. DuckDuckGo & Mixtral 8x7B excelled in producing concise summaries suitable for quick overviews.

AI Judgement: Claude 3.5 Sonnet vs OpenAI o1-preview

A draw between Claude 3.5 Sonnet and OpenAI o1-preview models at the AI Judgement task

Rationality and Logic: Claude 3.5 ≛ OpenAI o1

  • Both AIs excel at providing clear, step-by-step reasoning and logical analysis
  • Both demonstrate strong probabilistic reasoning and ability to break down complex scenarios

Impartiality: Claude 3.5 ≛ OpenAI o1

  • Both consistently acknowledge potential biases and conflicts of interest
  • Both recommend recusal in clear conflict of interest cases
  • Claude 3.5 could sometimes be more decisive in final recommendations

Deterrence and Marginality: Claude 3.5 ≳ OpenAI o1

  • Both recognize marginal differences and avoid arbitrary decisions in most cases
  • Claude 3.5 is more consistent in declaring ties when appropriate
  • OpenAI o1 occasionally declares a winner despite marginal differences

Consistency: Claude 3.5 ≛ OpenAI o1

  • Both apply similar reasoning across related scenarios and maintain consistent ethical principles
  • Claude 3.5 could be more explicit about ensuring consistency across judgments

Ethical Considerations: Claude 3.5 ≛ OpenAI o1

  • Both demonstrate strong awareness of ethical implications and carefully weigh competing principles
  • Claude 3.5 sometimes struggles to provide definitive recommendations in highly complex ethical dilemmas

Transparency and Justification: Claude 3.5 ≛ OpenAI o1

  • Both provide clear, detailed explanations for decisions throughout responses
  • Both break down reasoning into logical steps, enhancing transparency

Conclusion: Claude 3.5 ≛ OpenAI o1

Both AI systems demonstrate strong capabilities in rational decision-making, impartiality, and ethical reasoning. While each has minor areas for improvement, their overall performance is remarkably similar. The conclusion is that they are evenly matched in this AI Judgment task.

Email Generation for Work: GPT-4o vs OpenAI o1-mini

OpenAI o1-mini prevails over GPT-4o at the Email Generation for Work trial

Email Quality: OpenAI o1-mini > GPT-4o

  • Both AIs consistently produced grammatically correct, clear, and coherent emails
  • OpenAI o1-mini often provided more comprehensive and detailed responses
  • OpenAI o1-mini often used more reader-friendly formatting (headings, bullet points)
  • Both maintained professional tones appropriate for work contexts

Accuracy and Information Integrity: OpenAI o1-mini ≛ GPT-4o

  • Both AIs adhered closely to the provided information without evident fabrication
  • No significant issues with hallucinations or made-up facts were observed

Relevance and Customization: OpenAI o1-mini > GPT-4o

  • OpenAI o1-mini generally showed superior ability to tailor responses to specific situations
  • Both AIs demonstrated good understanding of various email contexts and requirements

Consistency: OpenAI o1-mini ≛ GPT-4o

  • Both AIs maintained uniform voice and quality across different email types
  • Performance remained consistent regardless of task complexity

User Experience: OpenAI o1-mini ≛ GPT-4o

  • Although OpenAI o1-mini spends more time "thinking", the subsequent generation completes quickly, making the overall experience comparable to GPT-4o's slower generation

Authenticity: OpenAI o1-mini ≛ GPT-4o

  • Both AIs generated emails that sounded natural and appropriate for work contexts
  • No significant issues with artificial-sounding language were detected

Conclusion: OpenAI o1-mini > GPT-4o

While both AIs performed well in generating work-related emails, OpenAI o1-mini frequently demonstrated a slight edge in comprehensiveness, customization, and formatting. Both AIs could produce high-quality, professional emails suitable for various work scenarios, with consistent performance across different email types and complexity levels.

Email Generation for Work: Claude 3.5 vs Llama 3.1 70B

Claude 3.5 prevails over Llama 3.1 70B at the Email Generation for Work trial

Email Quality: Claude 3.5 Sonnet > Llama 3.1 70B

  • Both AIs consistently produced grammatically correct, clear, and coherent emails
  • Claude 3.5 Sonnet often provided significantly more comprehensive and better-structured responses, particularly in complex scenarios like internal policy announcements and sales pitches
  • Llama 3.1 70B occasionally maintained a more personal tone in certain scenarios, especially in customer service contexts
  • Both AIs showed equal proficiency in crafting job interview follow-up emails

Accuracy and Information Integrity: Claude 3.5 Sonnet ≳ Llama 3.1 70B

  • Both AIs adhered well to the provided information without fabrication
  • Claude 3.5 Sonnet consistently provided more detailed and specific information, notably in welcome emails to new employees and event cancellation notices

Relevance and Customization: Claude 3.5 Sonnet ≳ Llama 3.1 70B

  • Both AIs demonstrated good ability to understand and address specific instructions
  • Claude 3.5 Sonnet showed superior tailoring in complex situations, such as collaboration requests between departments
  • Llama 3.1 70B excelled in maintaining a more personal and empathetic tone in customer-focused scenarios
  • Both AIs showed willingness to offer substantial compensation in customer complaint scenarios

Consistency: Claude 3.5 Sonnet ≳ Llama 3.1 70B

  • Both AIs maintained a uniform voice across multiple email types
  • Claude 3.5 Sonnet demonstrated more consistent quality across varying complexity levels, particularly in handling apology emails for missed deadlines
  • Llama 3.1 70B showed strong adaptability in matching formal tones when required

User Experience: Claude 3.5 Sonnet > Llama 3.1 70B

  • Claude could attach documents, guides, and examples in many formats.
  • Llama has no native support for document attachment, making prompts more difficult to customize.

Authenticity: Claude 3.5 Sonnet ≛ Llama 3.1 70B

  • Both AIs generally matched appropriate tones and styles for different email contexts
  • Llama 3.1 70B outperformed Claude 3.5 Sonnet in some customer service scenarios by using a more natural, conversational tone and I-statements

Conclusion: Claude 3.5 Sonnet > Llama 3.1 70B

Claude 3.5 Sonnet and Llama 3.1 70B both performed well in the email generation for work trial, with Claude 3.5 Sonnet showing a clear edge in comprehensiveness, structure, and handling complex scenarios. Claude particularly excelled in formal communications, detailed explanations, and enhancing readability through superior formatting. Llama 3.1 70B demonstrated strengths in maintaining a personal tone, especially in customer service contexts, often achieving a more natural, conversational style. While Claude 3.5 Sonnet generally provided more detailed and structured responses, Llama 3.1 70B showed surprising adaptability in matching formality when required.

AI Judgement: Mixtral 8x7B vs Llama 3 8B

Mixtral 8x7B prevails over the Llama 3 8B model at the AI Judgement task

Rationality and Logic: Mixtral 8x7B ≳ Llama 3

  • Both models demonstrate strong logical analysis and reasoning
  • Mixtral shows better handling of ethical dilemmas and syllogisms
  • Llama occasionally struggles with complex ethical scenarios

Impartiality: Mixtral 8x7B ≳ Llama 3

  • Both models generally maintain objectivity
  • Both correctly identify need to recuse in clear conflict of interest situations
  • Llama has more difficulty with nuanced scenarios and subtle biases

Deterrence and Marginality: Mixtral 8x7B ⨉ Llama 3

  • Both models handle truly marginal cases inconsistently
  • Both often fail to consider deterrence effects in their judgments
  • Both struggle to state explicitly when differences are marginal

Consistency: Mixtral 8x7B ≳ Llama 3

  • Both demonstrate good consistency in applying ethical principles
  • Both maintain consistent reasoning over time
  • Llama shows occasional inconsistency across related cases

Ethical Considerations: Mixtral 8x7B ≛ Llama 3

  • Both generally identify key ethical issues and provide thoughtful analysis
  • Both show a commitment to fairness and to addressing bias
  • Both occasionally struggle to prioritize ethical considerations in complex scenarios

Transparency and Justification: Mixtral 8x7B ≛ Llama 3

  • Llama sometimes veers into irrelevant tangents

Conclusion: Mixtral 8x7B ≳ Llama 3

In this AI Field Trial for judgment tasks, Mixtral 8x7B Instruct demonstrates a slight edge over Llama 3 8B Instruct in Rationality and Logic, Impartiality, and Consistency. Both models perform equally well in Ethical Considerations, showing strong capabilities in identifying and analyzing ethical issues. However, both models underperform in Deterrence and Marginality, indicating a clear area for improvement in handling marginal cases and considering deterrence effects. While Mixtral shows a marginal advantage in several areas, both models demonstrate strong overall performance in AI judgment tasks, with specific strengths and areas for improvement identified for each.

Summarizing Articles: ChatGPT GPT-4o vs Perplexity.ai Pro

Perplexity.ai Pro prevails over ChatGPT GPT-4o at the Article Summarization task

Conciseness: ChatGPT GPT-4o > Perplexity AI Pro

  • ChatGPT GPT-4o consistently produced shorter summaries while still capturing essential information
  • Perplexity AI Pro tended to provide more detailed summaries, often longer than ChatGPT GPT-4o's

Accuracy and Objectivity: Perplexity AI Pro ≳ ChatGPT GPT-4o

  • Both ChatGPT GPT-4o and Perplexity AI Pro generally maintained accuracy and objectivity in their summaries
  • Perplexity AI Pro sometimes provided more comprehensive and nuanced information, especially for technical or complex topics

Coherence and Readability: Perplexity AI Pro > ChatGPT GPT-4o

  • Perplexity AI Pro often used clearer structures with headers, bullet points, or numbered lists, enhancing readability
  • ChatGPT GPT-4o's summaries were typically coherent but sometimes lacked the structured presentation of Perplexity AI Pro's

Balance of Completeness and Relevance: Perplexity AI Pro > ChatGPT GPT-4o

  • Perplexity AI Pro frequently offered more comprehensive overviews, including more details from the original text
  • ChatGPT GPT-4o tended to focus on key points, sometimes omitting details that Perplexity AI Pro included

Convenience and Accessibility: Perplexity AI Pro ≛ ChatGPT GPT-4o

  • Both tools offer web and mobile apps, making them fairly accessible
  • During the trial, both tools had issues accessing some articles:
    • ChatGPT couldn't access 17% of URLs
    • Perplexity AI Pro couldn't access 25% of articles
  • Interestingly, disabling Pro mode in Perplexity sometimes made articles accessible for summarization. This was not considered part of the trial evaluation, as we specifically tested Pro mode

Conclusion: Perplexity AI Pro > ChatGPT GPT-4o

  • ChatGPT GPT-4o excels in producing concise, quick-reference summaries
  • Perplexity AI Pro provides more comprehensive, structured summaries with greater detail
  • The choice between ChatGPT GPT-4o and Perplexity AI Pro depends on the user's needs: quick overview (ChatGPT GPT-4o) vs. detailed understanding (Perplexity AI Pro)
  • Both tools demonstrate strengths in different areas, suggesting they could be complementary depending on the task at hand

Summarizing Articles: You.com Smart vs Perplexity.ai Quick Search

You.com Smart prevails over Perplexity.ai Quick Search at Summarizing Articles task

Conciseness: You.com Smart ≳ Perplexity.ai

  • Both tools generally performed well in capturing essential information concisely
  • Perplexity.ai occasionally provided longer, more detailed summaries
  • You.com Smart consistently produced shorter summaries, often preferred for brevity

Accuracy and Objectivity: You.com Smart ≛ Perplexity.ai

  • Both tools demonstrated consistent accuracy and objectivity across most examples
  • Performance was often comparable between the two tools
  • In one notable case, Perplexity.ai correctly summarized the intended page, while You.com Smart summarized the wrong page

Coherence and Readability: You.com Smart > Perplexity.ai

  • Both tools produced generally coherent and readable summaries
  • You.com Smart was often favored for its clear structure and organization
  • Perplexity.ai occasionally used numbered points effectively
  • You.com Smart frequently employed subheadings, bullet points, and other formatting to enhance readability

Balance of Completeness and Relevance: Perplexity.ai ≳ You.com Smart

  • Perplexity.ai often excelled in providing comprehensive coverage of key points
  • You.com Smart sometimes omitted specific details or examples
  • Perplexity.ai tended to include more specific details from the original text
  • You.com Smart occasionally focused more on broader implications rather than specific details

Convenience and Accessibility: You.com Smart ≛ Perplexity.ai

  • Both tools provide equal convenience via chat interface
  • Both tools available on mobile for convenience

Additional Observations

  • Performance consistency: You.com Smart showed more consistent performance across different types of content, while Perplexity.ai's performance varied depending on the subject matter
  • Depth vs. brevity trade-off: Perplexity.ai tended to prioritize depth of information, while You.com Smart favored brevity and quick comprehension
  • User preference influence: The preferred tool may depend on the user's specific needs—in-depth analysis or a quick overview

Conclusion: You.com Smart ≳ Perplexity.ai

  • Perplexity.ai Quick Search is more suitable for readers seeking in-depth information, providing accurate and comprehensive summaries
  • You.com Smart is better suited for readers seeking quick overviews or easily referenceable information, offering concise and well-structured summaries

Summarizing Articles: Command R+ (Cohere) vs Gemini (Google)

Command R+ prevails over Gemini at the Summarizing Articles task

Conciseness: Gemini ≫ Command R+

  • Gemini consistently excels in brevity, providing significantly shorter summaries while still capturing key points
  • Command R+ often produces longer summaries, sacrificing conciseness for more comprehensive coverage

Accuracy and Objectivity: Command R+ ≫ Gemini

  • Command R+ generally provides more comprehensive and accurate representations of the original articles
  • In discussions of complex or technical subjects, Command R+ tends to offer more accurate and nuanced information
  • While less detailed, Gemini is generally accurate in the information it does provide

Coherence and Readability: Command R+ ≛ Gemini

  • Command R+ typically flows logically and provides clear connections between ideas, offering a more in-depth narrative
  • Gemini offers a quick and efficient overview of main topics, making it easier for readers to grasp key ideas rapidly

Balance of Completeness and Relevance: Command R+ ≳ Gemini

  • Command R+ generally includes more critical points from the original texts without omitting crucial information, providing context and relevant details
  • Gemini tends to concentrate on the most critical points without delving into excessive detail, often capturing the essence of the topic without including minor or tangential information

Convenience and Accessibility: Gemini ≛ Command R+

  • Due to its brevity, Gemini is often highlighted as more convenient for quick reference and easier to scan and digest quickly
  • Command R+ is often preferred for readers seeking a more comprehensive understanding of the subject matter
  • Gemini failed almost 40% of the tests due to the inability to access provided URLs, resulting in error messages

Additional Observations

  • The choice between Command R+ and Gemini often depends on the reader's specific needs for depth versus brevity
  • Command R+ frequently offers more detailed explanations of concepts, historical context, and implications of the topics discussed
  • Gemini is particularly suitable for time-constrained readers who need a quick understanding without extensive details
  • For straightforward subjects, Gemini is often praised for its clear and direct presentation of information

Conclusion: Command R+ ≫ Gemini

  • Cohere Command R+ is more suitable for readers seeking in-depth understanding of complex topics and comprehensive coverage of source material.
  • Gemini is better suited for users who need quick, concise overviews of straightforward subjects and prefer easily scannable summaries.

AI Judgement: Command R (Cohere) vs Llama 3 70B (Meta)

A draw between Command R and Llama 3 70B models at the AI Judgement task

Rationality and Logic: Command R ≛ Llama 3 70B

  • Both AIs demonstrate strong analytical skills and clear reasoning
  • Both excel at identifying logical fallacies and providing step-by-step explanations
  • Both occasionally provide overly detailed responses

Impartiality: Command R ≛ Llama 3 70B

  • Both AIs consistently strive to maintain objectivity and recognize conflicts of interest
  • Command R should acknowledge potential biases more explicitly
  • Llama 3 70B occasionally shows slight biases in language use or framing

Deterrence and Marginality: Command R ≛ Llama 3 70B

  • Both recognize small differences and are willing to declare ties when appropriate
  • Command R could be more consistent in recommending ties for very close cases
  • Llama 3 70B sometimes struggles to make clear decisions in nearly equal scenarios

Consistency: Command R ≛ Llama 3 70B

  • Both generally apply standards uniformly across similar scenarios
  • Command R shows occasional slight inconsistencies in severity ratings
  • Llama 3 70B demonstrates minor inconsistencies in reasoning or emphasis between similar cases

Ethical Considerations: Command R ≛ Llama 3 70B

  • Both AIs demonstrate a strong grasp of ethical principles and carefully weigh competing concerns
  • Command R could more explicitly reference specific ethical frameworks in some analyses
  • Llama 3 70B could provide more nuanced discussions of complex ethical trade-offs

Transparency and Justification: Command R ≛ Llama 3 70B

  • Both provide clear explanations and articulate reasoning processes transparently
  • Both occasionally over-explain or provide unnecessary context

Conclusion: Command R ≛ Llama 3 70B

Overall, Command R and Llama 3 70B perform similarly as AI arbitrators, with each showing strengths and minor weaknesses across the criteria. Both demonstrate strong rational thinking, impartiality, and ethical considerations, but could improve in areas such as deterrence and consistency in close cases.

AI Judgement: OpenAI ChatGPT GPT-4o vs GPT-4o mini

GPT-4o prevails over the GPT-4o mini model at the AI Judgement task

Rationality and Logic: GPT-4o ≳ GPT-4o mini

  • Both models excel at breaking down complex scenarios and providing step-by-step reasoning
  • Both demonstrate strong probabilistic reasoning skills and identify logical fallacies accurately
  • GPT-4o shows slightly stronger performance in complex scenario analysis

Impartiality: GPT-4o ≳ GPT-4o mini

  • Both models recognize potential conflicts of interest and suggest appropriate actions
  • GPT-4o maintains slightly better objectivity when evaluating scenarios with personal implications
  • GPT-4o mini may sometimes lean towards overly cautious approaches
  • Both could improve on explicitly stating when setting aside personal beliefs

Deterrence and Marginality: GPT-4o ≛ GPT-4o mini

  • Both models recognize when differences between options are marginal
  • Both are willing to declare ties or suggest alternative methods when appropriate
  • Both occasionally struggle to definitively choose between very close options
  • GPT-4o mini shows occasional inconsistency in declaring clear winners vs. marginal preferences

Consistency: GPT-4o ≳ GPT-4o mini

  • Both apply similar reasoning across related scenarios
  • Both maintain consistent ethical principles in different contexts
  • GPT-4o shows slightly better consistency in severity ratings for similar scenarios
  • Both have minor variations in explanation depth across similar questions

Ethical Considerations: GPT-4o ≛ GPT-4o mini

  • Both demonstrate a strong understanding of ethical principles and dilemmas
  • Both balance competing ethical considerations well
  • GPT-4o mini consistently recommends recusal and transparency in conflict-of-interest scenarios
  • Both could benefit from more explicit discussion of long-term ethical implications in some cases

Transparency and Justification: GPT-4o ≛ GPT-4o mini

  • Both provide clear and detailed explanations for their reasoning processes
  • Both break down complex decisions into logical steps
  • GPT-4o mini could benefit from more structured presentation of justifications in some cases

Conclusion: GPT-4o ≳ GPT-4o mini

GPT-4o is declared the prevailing AI, with a marginal preference in rationality and logic, impartiality, and consistency. The remaining criteria resulted in a tie. While the difference is small, it appears consistently across these three criteria.