Finally read the tech report on Gemini, #Google’s most capable LLM. Here are some of the interesting details that are often overlooked:
- Gemini models are multimodal by design and really shine across multimodal tasks. Not only can they directly consume text, audio, images, and video, but they can also directly generate images. The provided examples do look good. I wonder whether, at some point, such multimodal models might become the backbone or starting point for more narrow image-generation systems.
- The on-device model is trained with all the best practices, like distillation and 4-bit quantization (a rough quantization sketch follows below). It is also trained on significantly more tokens, since it requires much less compute per token. Done right, this model should be very capable for its inference cost.
- Gemini’s speech recognition is already as good as the Whisper models (large-v3), which is a big deal considering it was not specifically trained for that. Again, this highlights the multimodal capabilities of the model.
- Training at data-center scale takes a lot of systems-engineering wizardry beyond machine-learning knowledge alone. Rare errors that we usually ignore during normal training become frequent and must be handled gracefully.
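To make the 4-bit quantization point from the post above concrete, here is a minimal, illustrative sketch of symmetric 4-bit weight quantization in plain NumPy. It is not Gemini's actual recipe (the report does not spell one out); it only shows the core idea of mapping float weights onto 16 integer levels and back.

```python
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Symmetric per-tensor 4-bit quantization (illustrative only).

    Maps float weights onto the 16 signed levels [-8, 7] and returns
    the integer codes plus the scale needed to dequantize them.
    """
    scale = np.abs(w).max() / 7.0                          # one scale per tensor
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the 4-bit codes."""
    return q.astype(np.float32) * scale

# Toy check: quantize a random weight matrix and measure the error.
w = np.random.randn(256, 256).astype(np.float32) * 0.02
q, scale = quantize_4bit(w)
print("mean abs error:", np.abs(w - dequantize_4bit(q, scale)).mean())
```

Real on-device pipelines typically use per-channel or per-group scales plus quantization-aware training and distillation; the sketch keeps only the basic rounding step.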
Understanding Gemini's Multimodal Capabilities
Summary
Google's Gemini models are cutting-edge multimodal AI systems that understand text, audio, images, and video and can generate text and images. They are designed to process and integrate diverse types of data, pushing the limits of artificial intelligence and enabling advanced applications like speech recognition, creative content generation, and large-scale data analysis.
- Explore multimodal input: Leverage Gemini's ability to process text, images, audio, video, and code simultaneously for solving complex, cross-domain challenges like translating a video or summarizing long-form content.
- Utilize its scalability: Take advantage of Gemini's different model sizes—ranging from Nano for mobile devices to Ultra for highly complex tasks—to meet specific needs efficiently.
- Integrate for diverse tasks: Apply Gemini's human-like reasoning and analysis to innovate in education, research, or accessibility, offering personalized solutions and insights across disciplines.
-
So, Gemini 1.5 has recently been released, but what's new and different about it?

Gemini 1.5 Pro is an advanced Transformer-based model using a sparse mixture-of-experts (MoE) approach, building on the multimodal capabilities of its predecessor, Gemini 1.0. It incorporates extensive MoE and language-model research, letting it handle inputs efficiently by activating only the expert parameters relevant to each input (a toy routing sketch follows this post). Gemini 1.5 Pro demonstrates significant advances in multimodal understanding and computational efficiency. Below are the key features you need to know about:

• 𝗘𝘅𝘁𝗲𝗻𝗱𝗲𝗱 𝗖𝗼𝗻𝘁𝗲𝘅𝘁 𝗟𝗲𝗻𝗴𝘁𝗵: Understands inputs of up to 10 million tokens in research testing, far more than its predecessors, enabling it to process almost a day of audio, large codebases, or extended video content.
• 𝗠𝘂𝗹𝘁𝗶𝗺𝗼𝗱𝗮𝗹 𝗖𝗮𝗽𝗮𝗯𝗶𝗹𝗶𝘁𝗶𝗲𝘀: Natively supports and interleaves data from different modalities (audio, visual, text, code) in the same input sequence.
• 𝗘𝗳𝗳𝗶𝗰𝗶𝗲𝗻𝗰𝘆 𝗮𝗻𝗱 𝗣𝗲𝗿𝗳𝗼𝗿𝗺𝗮𝗻𝗰𝗲: Achieves comparable or superior quality to previous models like Gemini 1.0 Ultra, with significantly less training compute and better serving efficiency.

So when should you use it? Gemini 1.5 Pro excels at processing and understanding complex multimodal data over extended contexts. That makes it ideal for applications requiring deep contextual analysis and the integration of diverse data types, such as advanced natural-language understanding, multimodal content creation and analysis, real-time translation and transcription, large-scale data analysis, and interactive AI systems. Its efficiency and performance in these areas stem from significant improvements in architecture, data handling, and computational efficiency.

Paper: https://lnkd.in/eQbbBQdB
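To make "activating only the relevant parameters" concrete, here is a toy top-k mixture-of-experts layer in plain NumPy. It is a generic illustration of sparse expert routing, not Gemini's actual architecture (which is not published at this level of detail); the dimensions and expert count are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
D_MODEL, N_EXPERTS, TOP_K = 64, 8, 2

# A router plus one tiny feed-forward "expert" per slot.
router_w = rng.normal(scale=0.02, size=(D_MODEL, N_EXPERTS))
experts_w = rng.normal(scale=0.02, size=(N_EXPERTS, D_MODEL, D_MODEL))

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs.

    x: (n_tokens, d_model). Only TOP_K of the N_EXPERTS run per token,
    which is where the compute savings of sparse MoE come from.
    """
    logits = x @ router_w                          # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -TOP_K:]  # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gates = np.exp(chosen - chosen.max())
        gates /= gates.sum()                       # softmax over the chosen experts
        for gate, e in zip(gates, top[t]):
            out[t] += gate * (x[t] @ experts_w[e])
    return out

tokens = rng.normal(size=(4, D_MODEL))
print(moe_layer(tokens).shape)  # (4, 64)
```

Because only TOP_K of the N_EXPERTS feed-forward blocks run for each token, per-token compute stays roughly constant even as the total parameter count grows, which is the efficiency argument behind sparse MoE models.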
-
Ok - here is a full technical breakdown of what we know about Gemini:

* There are 3 model sizes: Ultra, Pro, Nano. The only disclosed sizes are for Nano: 1.8B & 3.25B. That info is not particularly useful, because we could have bounded the size anyway given that it runs on a Pixel.
* Ultra follows Chinchilla scaling laws - the idea is to get the best possible performance for a given compute budget (a quick back-of-the-envelope calculator follows this post). Inference cost was not the concern here, PR was - you want bold numbers. The smaller models are all heavily in the data-saturation regime.
* Gemini is natively multimodal, i.e. trained from scratch on different modalities. Compare that with Flamingo:
  step 1) train an LLM (Chinchilla)
  step 2) train a vision encoder using contrastive pre-training
  step 3) freeze the backbones and train the system end-to-end
  Input: text, audio, image, video. Output: text, image (a big advantage compared to GPT-4V).
  Multimodal demos:
  * https://lnkd.in/dJZTkB79
  * https://lnkd.in/dEBq-aMh
* Needless to say, the model is massively multilingual as well.

---COMPUTE---
I'll just leave you with this excerpt from the paper: "but at Gemini Ultra scale, we combine SuperPods in multiple datacenters using Google’s intra-cluster and inter-cluster network. Google’s network latencies and bandwidths are sufficient to support the commonly used synchronous training paradigm, exploiting model parallelism within superpods and data-parallelism across superpods."
Parallelism across TPUs - sweet - how about parallelism across data centers :)

---EVALS---
* Gemini Ultra’s performance exceeds current SOTA results on 30 of the 32 academic benchmarks. BUT, important note: it's not clear how this translates into actual performance given data contamination and, more generally, the eval issues with LLMs.
* They report 90% on MMLU, which is better than human experts and better than GPT-4, but again the eval methodology does not seem to be the same (the number of shots and the CoT setup look different).
* They report better results than GPT-4V on the new MMMU (multimodal) benchmark.
In general I wouldn't give the eval numbers too much attention, because it's not clear whether the comparison between GPT-4 & Gemini is fair - I just want to play with the model. :)

---MISC UPDATES---
* They also share AlphaCode 2, which is estimated to perform better than 85% of Codeforces competitive-programming participants (compared to 50% for the original AlphaCode). It leveraged Gemini Pro to get these results.
* They introduce TPU v5p with 2x more FLOPS and 3x more HBM than v4. A single pod consists of 8,960 chips!
* Gemini is already powering many Google products (a fine-tuned version of Pro is already in Bard, Nano is running on Pixel).
* On December 13th, Pro will be accessible through an API!

Each one of these could be an update in its own right - I simply dislike Google's shipping strategy. It's a firehose method, and if we're talking about safe deployment it's much better to gradually give people access to these systems - a la OpenAI. Just my 2 cents.
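To ground the Chinchilla point, here is a tiny back-of-the-envelope calculator using the usual approximations from the Chinchilla paper: training compute C ≈ 6·N·D FLOPs and compute-optimal data D ≈ 20·N tokens. The budget plugged in below is a made-up placeholder, not a disclosed Gemini figure.

```python
def chinchilla_optimal(compute_flops: float):
    """Compute-optimal parameter/token split under the common rules of thumb:
    C ~= 6 * N * D and D ~= 20 * N  =>  N = sqrt(C / 120), D = 20 * N.
    """
    n_params = (compute_flops / 120.0) ** 0.5
    n_tokens = 20.0 * n_params
    return n_params, n_tokens

# Placeholder budget (NOT a disclosed Gemini number): 1e25 FLOPs.
n, d = chinchilla_optimal(1e25)
print(f"~{n / 1e9:.0f}B params trained on ~{d / 1e12:.1f}T tokens")
```

The post's point that the smaller models are "heavily in the data-saturation regime" just means they are trained on far more than ~20 tokens per parameter, trading extra training compute for cheaper inference.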
-
The release of Google's Gemini Pro 1.5 is, IMO, the biggest piece of A.I. news yet this year. The LLM has a gigantic million-token context window, multimodal inputs (text, code, image, audio, video) and GPT-4-like capabilities despite being much smaller and faster.

Key Features
1. Despite being a mid-size model (so much faster and cheaper), its capabilities rival the full-size models Gemini Ultra 1.0 and GPT-4, which are the two most capable LLMs available today.
2. At a million tokens, its context window demolishes Claude 2, the foundation LLM with the next-longest context window (Claude 2's is only a fifth of the size at 200k). A million tokens corresponds to roughly 700,000 words (seven lengthy novels), and Gemini Pro 1.5 accurately retrieves needles from this vast haystack 99% of the time! (A quick conversion sketch follows this post.)
3. Accepts text, code, images, audio (a million tokens corresponds to about 11 hours of audio), and video (1M tokens = about an hour of video). Today's episode contains an example of Gemini Pro 1.5 answering my questions about a 54-minute-long video with astounding accuracy and grace.

How did Google pull this off?
• Gemini Pro 1.5 is a Mixture-of-Experts (MoE) architecture, routing each part of your input to specialized expert subnetworks so that only a fraction of the model's weights are active at a time. This allows for focused processing and explains both the speed gains and the high capability level despite it being a mid-size model.
• While OpenAI is also reported to use the MoE approach in GPT-4, Google seems to have achieved greater efficiency with it. This edge may stem from Google's pioneering work on sparsely-gated MoE (published back in 2017) and their resultant deep in-house expertise on the topic.
• Training-data quality is also a likely factor in Google's success.

What's next?
• Google has 10-million-token context windows in testing. That order-of-magnitude jump would correspond to future Gemini releases being able to handle ~70 novels, ~100 hours of audio, or ~10 hours of video.
• If Gemini Pro 1.5 can achieve GPT-4-like capabilities, the Gemini Ultra 1.5 release I imagine is in the works may allow Google to leapfrog OpenAI and reclaim their crown as the world's undisputed A.I. champions (unless OpenAI gets GPT-5 out first)!

Want access?
• Gemini Pro 1.5 is available with a 128k context window through Google AI Studio and (for enterprise customers) through Google Cloud's Vertex AI.
• There's a waitlist for access to the million-token version (I had access through the early-tester program).

Check out today's episode (#762) for more detail on all of the above (including Gemini 1.5 Pro access/waitlist links). The Super Data Science Podcast is available on all major podcasting platforms and a video version is on YouTube. #superdatascience #machinelearning #ai #llms #geminipro #geminiultra
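A quick sanity check on the context-window arithmetic above, using the rough conversion rates quoted in the post (about 0.7 words per token, and roughly 11 hours of audio or 1 hour of video per million tokens). These rates are approximations from the post, not official Gemini specs.

```python
# Rough context-window conversions. The rates below are the approximations
# quoted in the post above, not official Gemini specifications.
WORDS_PER_TOKEN = 0.7          # ~700,000 words per 1M tokens
AUDIO_HOURS_PER_MTOKEN = 11.0  # ~11 hours of audio per 1M tokens
VIDEO_HOURS_PER_MTOKEN = 1.0   # ~1 hour of video per 1M tokens

def describe_context(tokens: int) -> str:
    m = tokens / 1_000_000
    return (f"{tokens:,} tokens ~ {int(tokens * WORDS_PER_TOKEN):,} words, "
            f"~{m * AUDIO_HOURS_PER_MTOKEN:.0f} h audio, "
            f"~{m * VIDEO_HOURS_PER_MTOKEN:.0f} h video")

print(describe_context(1_000_000))   # today's 1M-token window
print(describe_context(10_000_000))  # the 10M-token window reported in testing
```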
-
Google Unveils Gemini: A Multimodal AI Model with Human-like Performance

Google Research has unveiled Gemini, a family of multimodal AI models that demonstrate human-level performance across diverse tasks. With capabilities spanning the image, audio, video, and text domains, Gemini represents a significant advance in the field of artificial intelligence.

Key Highlights:
- Human-Expert Performance: Gemini Ultra, the most advanced model, surpasses human-expert performance on the MMLU benchmark, which spans 57 subjects, achieving a score above 90%.
- Multimodal Reasoning: Gemini excels at tasks requiring both understanding and reasoning across different modalities. It can solve math problems from handwritten notes, analyze charts and generate tables, and even answer questions about video content.
- State-of-the-Art Benchmarks: Gemini sets new state-of-the-art results on 30 out of 32 benchmarks, including text, image, video, and speech-understanding tasks.
- Democratizing Access: Available in various sizes, Gemini caters to different needs. Nano models are designed for on-device usage, Pro models are suited to data-center serving, and the Ultra model tackles highly complex tasks.
- Responsible Development: Google emphasizes responsible deployment, addressing potential bias and harmful outputs through careful fine-tuning and instruction tuning.

Applications:
- Education: Gemini's capabilities offer immense potential in education, providing personalized learning experiences and assisting students with complex concepts.
- Science & Research: Gemini can accelerate scientific discovery by analyzing vast data sets and generating insights across disciplines.
- Productivity & Creativity: Gemini can empower users through intelligent assistance in tasks like writing, coding, and problem-solving.
- Accessibility: Gemini's ability to process diverse modalities makes it a valuable tool for individuals with disabilities.

Availability: As of today, Gemini Pro powers Bard, Google's AI-powered chatbot. On December 13th, developers can access Gemini Pro through APIs. Android users will have access to the Nano model on Pixel 8 Pro devices. Bard Advanced, powered by Gemini Ultra, will launch early next year. https://lnkd.in/gptk-K88

This groundbreaking technology marks a significant leap forward in AI, paving the way for a future where machines can collaborate with humans and solve problems in ways that were once unimaginable.
-
I think the AI community has underestimated the value of large context windows and fully multimodal AI (one that can see video, as well as documents and text) as a solution to many real-world AI problems. I find that, when working inside the context window, even a million tokens' worth, the AI both reasons very well and has very low rates of hallucination. And an AI that can see enables entirely different ways of using AI systems. Here, I give Gemini 1.5 a video of my screen (it would be trivial for it to watch live, of course), and it accurately understands what I am doing and what I could do better. Gemini Pro 1.5 feels like working with GPT-4 after using GPT-3.5. The underlying model still isn't "smart" enough to do everything you want, but the added context window and the ability to hold entire videos or folders of documents make the experience feel superhuman. AI, for better or worse, as manager and advisor. Superclippy for real.
-
I was fortunate to receive an invitation for early access to Google's new Gemini 1.5 Pro model, which boasts a 1-million-token context window. If you want to experiment with it, here are a few things you need to know to get started. It was released to the public yesterday in a low-key announcement aimed primarily at developers.

1. You can access it in AI Studio. (Link in comments.)
2. AI Studio is free.
3. In AI Studio, the interface doesn't natively save your chat history. (It is designed for developers to test prompts in different ways with models.) However, you can save your prompts to a library. (Note: officially it doesn't save chat history, but I have noticed my last few saved prompts include the chat history, so I hope that is a newly upgraded feature, since they are improving it continuously.)
4. You can test prompts with different models in three ways: a chat interface, freeform prompts, and structured prompts. You can learn how each type works using their tutorials.
5. With the Gemini 1.5 Pro model you can, for the first time, upload video to an LLM as an input 🤯
6. The video, however, does not include the audio modality - for now. Technically, the AI ingests the video frame by frame as stills, but it can read timestamps in the video.
7. For any response, you can use the "Get code" button to get the underlying API code for your prompt instead of just the text, which you can copy and paste (a rough example of what that code looks like follows this post).
8. Expect responses (especially with video inputs) to take a bit longer than you are used to with smaller-context, text-only or text-plus-image inputs.

This early peek at Gemini 1.5 Pro is mind-blowing, especially considering it is still in its most primitive state. Iterative releases will only improve it from here. Using it over these last few weeks has already changed my perspective on much of the progress made in AI over the past several years. I will share more of my thoughts about that soon, but for now I wanted to share these tips on access and usage so that you can also get a peek and try it out over the weekend. #ai #google #gemini
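For reference, the "Get code" button in AI Studio emits something along these lines for the Python SDK (google-generativeai). This is a hedged sketch based on the SDK's documented File API pattern, with a placeholder API key, file name, and model name; the exact snippet AI Studio generates may differ.

```python
import time
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder: use your AI Studio key

# Upload the video via the File API, then wait for server-side processing
# (uploads are processed asynchronously before they can be used in a prompt).
video = genai.upload_file(path="screen_recording.mp4")  # placeholder file name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = genai.get_file(video.name)

model = genai.GenerativeModel("gemini-1.5-pro-latest")
response = model.generate_content(
    [video, "Summarize this video and list the key moments with timestamps."]
)
print(response.text)
```

The polling loop is needed because, as point 6 above notes, the service ingests the video frame by frame on the server before the model can reference it in a prompt.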