Tag

performance

Related articles:

The TensorFlow Blog -

Half-precision Inference Doubles On-Device Inference Performance

Posted by Marat Dukhan and Frank Barchard, Software Engineers

CPUs deliver the widest reach for ML inference and remain the default target for TensorFlow Lite. Consequently, improving CPU inference performance is a top priority, and we are excited to announce that we doubled floating-point inference performance in TensorFlow Lite's XNNPack backend by enabling half-precision inference on ARM CPUs. This means that more AI-powered features may be deployed to older and lower-tier devices.

Traditionally, TensorFlow Lite supported two kinds of numerical computations in machine learning models: a) floating-point using the IEEE 754 single-precision (32-bit) format and b) quantized using low-precision integers. While single-precision floating-point numbers provide maximum flexibility and ease of use, they come at the cost of 4X overhead in storage and memory and exhibit a performance overhead compared to 8-bit integer computations. In contrast, half-precision (FP16) floating-point numbers offer an interesting alternative that balances ease of use and performance: the processor needs to transfer half as many bytes, and each vector operation produces twice as many elements. By virtue of this property, FP16 inference paves the way for a 2X speedup for floating-point models compared to the traditional FP32 path.

For a long time, FP16 inference on CPUs remained primarily a research topic, as the lack of hardware support for FP16 computations limited production use cases. However, around 2017 new mobile chipsets started to include support for native FP16 computations, and by now most mobile phones, both high-end and low-end, support it. Building upon this broad availability, we are pleased to announce the general availability of half-precision inference in TensorFlow Lite and XNNPack.

Performance Improvements

Half-precision inference has already been battle-tested in production across Google Assistant, Google Meet, YouTube, and ML Kit, and has demonstrated close to 2X speedups across a wide range of neural network architectures and mobile devices. Below, we present benchmarks on nine public models covering common computer vision tasks:

MobileNet v2 image classification [download]
MobileNet v3-Small image classification [download]
DeepLab v3 segmentation [download]
BlazeFace face detection [download]
SSDLite 2D object detection [download]
Objectron 3D object detection [download]
Face Mesh landmarks [download]
MediaPipe Hands landmarks [download]
KNIFT local feature descriptor [download]

These models were benchmarked on 5 popular mobile devices, including recent and older devices (Pixel 3a, Pixel 5a, Pixel 7, Galaxy M12, and Galaxy S22). The average speedup is shown below.

Single-threaded inference speedup with half-precision (FP16) inference compared to single-precision (FP32) across 5 mobile devices. Higher numbers are better.

The same models were also benchmarked on three laptop computers (MacBook Air M1, Surface Pro X, and Surface Pro 9).

Single-threaded inference speedup with half-precision (FP16) inference compared to single-precision (FP32) across 3 laptop computers. Higher numbers are better.

Currently, the FP16-capable hardware supported in XNNPack is limited to ARM and ARM64 devices with the ARMv8.2 FP16 arithmetic extension, which includes Android phones starting with the Pixel 3, Galaxy S9 (Snapdragon SoC), Galaxy S10 (Exynos SoC), iOS devices with the A11 or newer SoCs, all Apple Silicon Macs, and Windows ARM64 laptops based on the Snapdragon 850 SoC or newer.

How Can I Use It?
To benefit from half-precision inference in XNNPack, the user must provide a floating-point (FP32) model with FP16 weights and special "reduced_precision_support" metadata to indicate model compatibility with FP16 inference. The metadata can be added during model conversion using the _experimental_supported_accumulation_type attribute of the tf.lite.TargetSpec object:

    ...
    converter.target_spec.supported_types = [tf.float16]
    converter.target_spec._experimental_supported_accumulation_type = tf.dtypes.float16

When the compatible model is delegated to XNNPack on hardware with native support for FP16 computations, XNNPack will transparently replace FP32 operators with their FP16 equivalents and insert additional operators to convert model inputs from FP32 to FP16 and model outputs back from FP16 to FP32. If the hardware is not capable of FP16 arithmetic, XNNPack will perform model inference with FP32 calculations. Therefore, a single model can be transparently deployed on both recent and legacy devices.

Additionally, the XNNPack delegate provides an option to force FP16 inference regardless of the model metadata. This option is intended for development workflows, and in particular for testing the end-to-end accuracy of the model when FP16 inference is used. In addition to devices with native FP16 arithmetic support, forced FP16 inference is supported on x86/x86-64 devices with the AVX2 extension in emulation mode: all elementary floating-point operations are computed in FP32, then converted to FP16 and back to FP32. Note that such simulation is slow and not a bit-exact equivalent to native FP16 inference, but it does simulate the effects of the restricted mantissa precision and exponent range of native FP16 arithmetic. To force FP16 inference, either build TensorFlow Lite with the --define xnnpack_force_float_precision=fp16 Bazel option, or apply the XNNPack delegate explicitly and add the TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16 flag to the TfLiteXNNPackDelegateOptions.flags bitmask passed into the TfLiteXNNPackDelegateCreate call:

    TfLiteXNNPackDelegateOptions xnnpack_options = TfLiteXNNPackDelegateOptionsDefault();
    ...
    xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
    TfLiteDelegate* xnnpack_delegate = TfLiteXNNPackDelegateCreate(&xnnpack_options);

XNNPack provides full feature parity between FP32 and FP16 operators: all operators that are supported for FP32 inference are also supported for FP16 inference, and vice versa. In particular, sparse inference operators are supported for FP16 inference on ARM processors, so users can combine the performance benefits of sparse and FP16 inference in the same model.

Future Work

In addition to most ARM and ARM64 processors, the most recent Intel processors, code-named Sapphire Rapids, support native FP16 arithmetic via the AVX512-FP16 instruction set, and the recently announced AVX10 instruction set promises to make this capability widely available on the x86 platform. We plan to optimize XNNPack for these instruction sets in a future release.

Acknowledgements

We would like to thank Alan Kelly, Zhi An Ng, Artsiom Ablavatski, Sachin Joglekar, T.J. Alumbaugh, Andrei Kulik, Jared Duke, and Matthias Grundmann for their contributions towards half-precision inference in TensorFlow Lite and XNNPack.
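For readers applying the delegate from C++, the following is a minimal sketch (not from the post) of how the forced-FP16 flag above might be wired into a complete interpreter setup. The delegate options and the TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16 flag come from the post; the model file name, header paths, and interpreter boilerplate are assumptions that may vary with the TensorFlow Lite version.

    #include <memory>

    #include "tensorflow/lite/delegates/xnnpack/xnnpack_delegate.h"
    #include "tensorflow/lite/interpreter.h"
    #include "tensorflow/lite/kernels/register.h"
    #include "tensorflow/lite/model.h"

    int main() {
      // Load an FP32 model that carries FP16 weights (hypothetical file name).
      auto model = tflite::FlatBufferModel::BuildFromFile("model_fp16.tflite");

      tflite::ops::builtin::BuiltinOpResolver resolver;
      std::unique_ptr<tflite::Interpreter> interpreter;
      tflite::InterpreterBuilder(*model, resolver)(&interpreter);

      // Force FP16 inference even without the model metadata (development only).
      TfLiteXNNPackDelegateOptions xnnpack_options =
          TfLiteXNNPackDelegateOptionsDefault();
      xnnpack_options.flags |= TFLITE_XNNPACK_DELEGATE_FLAG_FORCE_FP16;
      TfLiteDelegate* xnnpack_delegate =
          TfLiteXNNPackDelegateCreate(&xnnpack_options);

      interpreter->ModifyGraphWithDelegate(xnnpack_delegate);
      interpreter->AllocateTensors();
      // ... fill input tensors, call interpreter->Invoke(), read outputs ...

      // The delegate must outlive the interpreter, so release it first.
      interpreter.reset();
      TfLiteXNNPackDelegateDelete(xnnpack_delegate);
      return 0;
    }

Link against the TensorFlow Lite runtime and the XNNPack delegate library for your build system of choice.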

AI-generated summary: TensorFlow Lite's XNNPack backend has doubled floating-point inference performance on ARM CPUs by enabling half-precision inference. This allows AI-powered features to be deployed on older and lower-tier devices. Half-precision (FP16) floating-point numbers balance ease of use and performance, yielding close to a 2X speedup compared to traditional FP32 inference. The availability of hardware support for FP16 computations in mobile chipsets has made this possible. Benchmarks have shown close to 2X speedups across various neural network architectures and mobile devices. To use half-precision inference in XNNPack, a floating-point model with FP16 weights and special metadata must be provided. The XNNPack delegate also provides an option to force FP16 inference. Future work includes optimizing XNNPack for Intel processors that support native FP16 arithmetic.

Visual Studio Blog -

Visual Studio 2022 – 17.8 Performance Enhancements

Version 17.8 welcomes an array of exhilarating performance enhancements, including Improved Razor/Blazor Responsiveness, Enhanced F5 Speed, Optimized IntelliSense for C++ Unreal Engine, and Build Acceleration for Non-SDK style .NET Projects. At the heart of these changes is our commitment to enhancing performance. The post Visual Studio 2022 – 17.8 Performance Enhancements appeared first on Visual Studio Blog.

AI-generated summary: Visual Studio 2022 version 17.8 brings a series of exciting performance enhancements, including a more responsive file-open experience, improved responsiveness, enhanced F5 speed, optimized IntelliSense for C++ Unreal Engine, and build acceleration for non-SDK-style .NET projects. At the heart of these improvements is a commitment to performance, providing a seamless and efficient platform for the coding experience. Enjoy the improvements and have a more productive and exciting coding journey!

Planet PostgreSQL -

Andrew Atkinson: Teach Kelvin Your Thing (TKYT) — High Performance PostgreSQL for Rails 🖥️

Recently I met Kelvin Omereshone, based in Nigeria, for a session on his show Teach Kelvin Your Thing (TKYT). Here’s the description of the show: Teach Kelvin Your Thing was created out of a need for me to learn not just new technologies but how folks who know these technologies use them.

This was a fun opportunity to contribute to Kelvin’s catalog of more than 50 recorded sessions! The sessions have mainly covered web development with JavaScript tech, until this one! Kelvin let me know this was the first TKYT session outside of JavaScript. Maybe we’ll inspire some people to try Ruby! Besides TKYT, Kelvin is a prolific blogger, producer, writer, and an upcoming author! Kelvin is an experienced Sails framework programmer, and recently announced the honor of becoming the lead maintainer of the framework.

Kelvin and I decided to talk about High Performance PostgreSQL for Rails. The session is called High Performance PostgreSQL for Rails applications with Andrew Atkinson and is on YouTube. The recorded session is embedded below. Although we barely scratched the surface of the topic ideas Kelvin had, I’ve written up his questions as a bonus Q&A below the video. Note for the video: unfortunately my fancy microphone wasn’t used (by mistake). Apologies for the audio.

Q&A

With a one-hour session, we only made it through some basics with the Active Record ORM, SQL queries, query planning, and efficient indexes. While those are some of the key ingredients to good performance, there’s much more. The following questions explore more performance-related topics beyond what we covered in the session.

How do you optimize PostgreSQL for high performance in a Rails application? Achieving and sustaining high performance with web applications requires removing latency wherever possible in the request and response cycle. For the database portion, you’ll want to understand your SQL queries well, make sure they’re narrowly scoped, and that indexes are optimized to support them. Besides that, the p[...]

AI-generated summary: I recently joined Kelvin Omereshone, who is based in Nigeria, for a session on his show Teach Kelvin Your Thing (TKYT), a show created out of the need to learn not just new technologies but how the people who know them use them. This was the first TKYT session not about JavaScript, and hopefully it will inspire people to try Ruby. Kelvin is a blogger, producer, writer, and upcoming author; he is an experienced Sails framework programmer and has announced that he is now the framework's lead maintainer. In the session we discussed High Performance PostgreSQL for Rails applications. Achieving high performance requires optimizing SQL queries, using indexes sensibly, and making sure the server instance has appropriate CPU, memory, and disk. We also covered table indexing strategies, database design considerations, common pitfalls developers run into in high-performance Rails applications, how caching affects PostgreSQL performance, and advanced PostgreSQL features and settings that developers may be underusing.

NVIDIA Blog -

Igniting the Future: TensorRT-LLM Release Accelerates AI Inference Performance, Adds Support for New Models Running on RTX-Powered Windows 11 PCs

Artificial intelligence on Windows 11 PCs marks a pivotal moment in tech history, revolutionizing experiences for gamers, creators, streamers, office workers, students and even casual PC users. It offers unprecedented opportunities to enhance productivity for users of the more than 100 million Windows PCs and workstations that are powered by RTX GPUs. And NVIDIA RTX ...

AI-generated summary: Windows 11 PCs with artificial intelligence (AI) capabilities are revolutionizing experiences for various users. NVIDIA RTX technology is making it easier for developers to create AI applications. New optimizations and resources announced at Microsoft Ignite will help developers deliver new end-user experiences faster. TensorRT-LLM for Windows will soon be compatible with OpenAI's Chat API, allowing developers to run projects locally on a PC with RTX. AI Workbench is a toolkit that allows developers to create and customize pretrained generative AI models. NVIDIA and Microsoft are releasing DirectML enhancements to accelerate popular AI models. The upcoming release of TensorRT-LLM will bring improved inference performance and support for additional LLMs. NVIDIA is enabling TensorRT-LLM for Windows to offer a similar API interface to OpenAI's ChatAPI, allowing developers to work with local AI. Developers can leverage cutting-edge AI models and deploy with a cross-vendor API. These advancements will accelerate the development and deployment of AI features and applications on RTX PCs.

Lei Mao's Log Book -

C++ Function Call Performance

Performance Caveats of Passing Functions as Arguments in C++

AI-generated summary: This article discusses the performance implications of passing functions as arguments in C++. By comparing the performance of function pointers, std::function, and lambda functions, it finds that std::function, while convenient, is less efficient than a function pointer or a lambda. For fast functions that are called frequently, std::function should be avoided in favor of function pointers or lambdas to obtain the best performance.
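A minimal sketch (not from the article) of the three calling styles being compared; the square function, the iteration count, and the timing harness are illustrative assumptions. Compile with optimizations, e.g. g++ -O2 -std=c++17.

    #include <chrono>
    #include <cstdio>
    #include <functional>

    // The small, frequently called function under test.
    inline float square(float x) { return x * x; }

    // 1) Function pointer: an indirect call, but no type-erasure overhead.
    float sum_fn_ptr(float (*f)(float), int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += f(static_cast<float>(i));
        return s;
    }

    // 2) std::function: type erasure adds an extra indirection (and possibly a
    //    heap allocation for captured state), making the call hard to inline.
    float sum_std_function(const std::function<float(float)>& f, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += f(static_cast<float>(i));
        return s;
    }

    // 3) Template parameter / lambda: the callable's concrete type is known at
    //    compile time, so the call can usually be inlined completely.
    template <typename F>
    float sum_template(F&& f, int n) {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) s += f(static_cast<float>(i));
        return s;
    }

    int main() {
        constexpr int n = 10'000'000;
        auto bench = [](auto&& run) {
            auto t0 = std::chrono::steady_clock::now();
            volatile float s = run();  // volatile keeps the result from being optimized away
            auto t1 = std::chrono::steady_clock::now();
            (void)s;
            return std::chrono::duration<double, std::milli>(t1 - t0).count();
        };
        std::printf("function pointer: %.2f ms\n",
                    bench([&] { return sum_fn_ptr(&square, n); }));
        std::printf("std::function:    %.2f ms\n",
                    bench([&] { return sum_std_function(square, n); }));
        std::printf("template/lambda:  %.2f ms\n",
                    bench([&] { return sum_template([](float x) { return x * x; }, n); }));
        return 0;
    }

Under optimization, the template/lambda version is typically the easiest for the compiler to inline, while std::function pays type-erasure overhead on every call, which matches the article's conclusion.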

Planet PostgreSQL -

Ryan Lambert: Pre-conference Session Materials: GIS Data, Queries, and Performance

This post supports our full-day pre-conference session, PostGIS and PostgreSQL: GIS Data, Queries, and Performance, at PASS Data Community Summit 2023 on November 13. Downloads for the session: the data, permissions script, and example SQL queries used throughout the session are available below.

AI-generated summary: Ryan Lambert published a post (November 12, 2023) supporting his upcoming full-day pre-conference session at PASS Data Community Summit 2023, "PostGIS and PostgreSQL: GIS Data, Queries, and Performance". The post provides the data downloads, permissions script, and example SQL queries needed for the session. It also walks through loading the demo database with the PgOSM Flex Docker image, along with instructions for setting up the database and loading the data files. Finally, it links to the presentation and the SQL used in the session.

The Keyword -

Get creative with generative AI in Performance Max

Animation depicting images of dogs being generated using a text prompt reading “Elegant dogs on color backdrop”

AI-generated summary: Google's Performance Max, the first-ever AI-powered campaign, has been helping small and large businesses stay ahead of consumer trends and adapt to the unpredictability of the consumer journey. The company has announced generative AI features that will help marketers scale and build high-quality assets that drive performance. With the help of Google AI, marketers can now generate new text and image assets for their campaigns in just a few clicks, taking performance data into consideration when suggesting or generating certain assets. The goal is to help marketers of all sizes drive greater performance and experiment with new types of images and messaging.

Planet PostgreSQL -

Syed Salman Ahmed Bokhari: Performance tuning in PostgreSQL using shared_buffers

Explore the sweet spot for shared_buffers allocation and ensure your PostgreSQL system runs at its peak efficiency. The post Performance tuning in PostgreSQL using shared_buffers appeared first on Stormatics.

AI-generated summary: This article explains the role of PostgreSQL's shared buffer cache (shared_buffers) and how to tune it. Shared buffers reduce disk I/O and improve query response time, but over-allocating them squeezes the memory available to other processes and degrades performance. In the author's tests, raising shared_buffers from the default 128MB to 1GB and 2GB increased throughput from 1548 tps to 1742 and 1780 tps respectively. Tuning shared_buffers appropriately can therefore improve the performance of a PostgreSQL database.
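A minimal sketch (not from the article) of applying the change it describes from a C++ client via libpq; the connection string is a placeholder, ALTER SYSTEM requires superuser privileges, and shared_buffers only takes effect after a server restart.

    // Compile with: g++ shared_buffers_demo.cpp -lpq
    #include <libpq-fe.h>
    #include <cstdio>

    int main() {
        // Placeholder connection string; adjust to your environment.
        PGconn* conn = PQconnectdb("dbname=postgres user=postgres");
        if (PQstatus(conn) != CONNECTION_OK) {
            std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        // Show the current setting (128MB is the stock default).
        PGresult* res = PQexec(conn, "SHOW shared_buffers");
        std::printf("shared_buffers = %s\n", PQgetvalue(res, 0, 0));
        PQclear(res);

        // Persist a larger value, as tested in the article; a server restart
        // (not just a reload) is required before it takes effect.
        PQclear(PQexec(conn, "ALTER SYSTEM SET shared_buffers = '1GB'"));
        std::printf("restart PostgreSQL for the new value to apply\n");

        PQfinish(conn);
        return 0;
    }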

MongoDB -

Search Nodes Now in Public Preview: Performance at Scale with Dedicated Infrastructure

While scalability has become a common buzzword in today’s enterprise vernacular, it’s something we take extremely seriously at MongoDB. Whether it’s increasing a certain capability to be used in additional contexts, or continuing to increase the capacity of a certain technology in size or scale, our product teams are always looking to maximize scalability for our customers’ most demanding workloads. Today we are excited to take the next step in this journey with the announcement of Search Nodes, now available in public preview.

Search Nodes provide dedicated infrastructure for Atlas Search and Vector Search workloads, allowing you to scale search fully independently of database needs. Incorporating Search Nodes into your Atlas deployment allows for better performance at scale, and delivers workload isolation, higher availability, and the ability to optimize resource usage. We see this as the next evolution of our architecture for both Atlas Search and Vector Search, furthering our developer data platform, including the benefits of a fully managed sync without the need for ETL or index management. We have listened to the feedback from our customer base and are excited to take the next step in bringing this feature closer to general availability.

So what exactly is changing, and what are the benefits of Search Nodes? To see where we’re going, let’s take a brief look at where we have been. Previously, Atlas Search (mongot) was co-located with Atlas (mongod) on Atlas Nodes (see diagram below). The pros of this configuration are that it is simple and cheap, enabling a large portion of our current user base to get started quickly.

Figure 1: Diagram of the Atlas Search configuration co-located on Atlas Nodes

However, there are a couple of consequences of this setup. Because Search and Vector Search are co-located on Atlas Nodes and clusters, users have to try to size their workload based on both search and database requirements using a traditional Atlas deployment. This introduces potential issues, including the possibility of resource contention between a database and search deployment, which can cause service interruptions. With both resources commingled, you also lack the granularity to set limits on the share of the overall workload taken by your database or search.

With our announcement of Search Nodes, available in public preview, these considerations are a thing of the past, as we now offer developers greater visibility and control, with benefits including:

Workload isolation
Better performance at scale (40% - 60% decrease in query time for many complex queries)
Higher availability
Improved developer experience

Figure 2: Diagram of dedicated architecture with Search Nodes

Getting started with Search Nodes is super simple. To begin, just follow these steps in the MongoDB UI:

1. Navigate to your “Database Deployments” section in the MongoDB UI
2. Click the green “+Create” button
3. On the “Create New Cluster” page, change the radio button for AWS for “Multi-cloud, multi-region & workload isolation” to enabled
4. Toggle the radio button for “Search Nodes for workload isolation” to enabled, and select the number of nodes in the text box
5. Check the agreement box
6. Click “Create cluster”

For existing Atlas Search users, click “Edit Configuration” in the MongoDB Atlas Search UI and enable the toggle for workload isolation. Then the steps are the same as noted above.

Figure 3: How to enable Search Nodes in the Atlas UI

We’re excited to be offering customers the option of dedicated infrastructure that Search Nodes provide and look forward to seeing the next wave of scalability for both Atlas Search and Vector Search workloads. We’ll also be announcing a more cost- and performance-efficient configuration for Vector Search soon. For further details you can jump right into our docs to learn more. We can’t wait to see what you build!

AI-generated summary: This piece introduces the concepts of vector search and large language models (LLMs) and their applications. Vector search is a similarity-based search method that works by encoding data as vectors and computing the distances between them. LLMs are an AI technique that trains embedding models to understand text and perform natural language processing tasks. The foundations of these techniques date back as far as 300 BC, but it was not until the transformer architecture appeared in 2017 that LLMs could handle much larger amounts of data. The release of OpenAI's ChatGPT popularized LLMs and also drove the adoption of vector search.

Planet PostgreSQL -

Syed Salman Ahmed Bokhari: PostgreSQL performance tuning using work_mem

Enhance PostgreSQL query performance of your database with our guide on optimizing work_mem. The post PostgreSQL performance tuning using work_mem appeared first on Stormatics.

AI-generated summary: This article shows how to improve PostgreSQL performance by tuning its parameters, one of which is work_mem. work_mem sets the maximum amount of memory a query operation (such as a sort or hash) may use before writing to temporary disk files. Tuning work_mem appropriately speeds up queries, reduces disk activity, and improves database performance; setting it too high, however, can exhaust memory, so a balance suited to the specific workload has to be found. Through testing, the article demonstrates that a well-chosen work_mem noticeably improves query speed and resource management.
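A minimal sketch (not from the article) of raising work_mem for a single session from C++ via libpq and checking the effect with EXPLAIN ANALYZE; the connection string, the chosen value, and the demo query are assumptions.

    // Compile with: g++ work_mem_demo.cpp -lpq
    #include <libpq-fe.h>
    #include <cstdio>

    int main() {
        // Placeholder connection string; adjust to your environment.
        PGconn* conn = PQconnectdb("dbname=postgres");
        if (PQstatus(conn) != CONNECTION_OK) {
            std::fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return 1;
        }

        // work_mem can be raised for just this session, so one memory-hungry
        // sort or hash does not require a server-wide change.
        PQclear(PQexec(conn, "SET work_mem = '64MB'"));

        // EXPLAIN ANALYZE shows whether the sort now runs in memory
        // ("Sort Method: quicksort") instead of spilling to disk
        // ("Sort Method: external merge").
        PGresult* res = PQexec(conn,
            "EXPLAIN (ANALYZE) "
            "SELECT * FROM generate_series(1, 1000000) g ORDER BY g DESC");
        for (int i = 0; i < PQntuples(res); ++i)
            std::printf("%s\n", PQgetvalue(res, i, 0));
        PQclear(res);

        PQfinish(conn);
        return 0;
    }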
