
Platform for AI: EAS overview

Last Updated: Jul 03, 2025

After model training is complete, you can use Elastic Algorithm Service (EAS) to quickly deploy your model as an online inference service or a web application. EAS supports heterogeneous resources and combines capabilities such as automatic scaling, one-click stress testing, canary release, and real-time monitoring to ensure service stability and business continuity in high-concurrency scenarios at a lower cost.

EAS architecture

(Figure: EAS architecture diagram)

Details of EAS architecture layers

  • Infrastructure: Uses heterogeneous hardware (CPUs or GPUs) and provides General Unit (GU) specifications and preemptible instances designed for AI scenarios. This helps you reduce costs and improve efficiency.

  • Container scheduling: Helps you manage cluster resources during business peaks and valleys through the following features:

    • Automatic scaling: Automatically adjusts the number of service instances during significant workload fluctuations to prevent resource waste.

    • Scheduled scaling: Adjusts the number of service instances to a specific number at a specific point in time, suitable for scenarios in which you can estimate the workload.

    • Elastic resource pool: If the resources in a dedicated resource group are fully occupied, the service automatically scales out to pay-as-you-go instances in the public resource group to ensure service stability.

  • Model deployment: Lets you monitor service status in real time and simplifies the processes of publishing and updating services. It supports the following features:

    • One-click stress testing: Dynamically increases the load to find the load limit of a service. You can also view second-level real-time monitoring data and stress testing reports.

    • Canary release: Adds multiple services to the same canary group, with some services serving the production environment and others serving the canary environment. You can also adjust the proportion of traffic distributed to each service for more flexible canary releases.

    • Real-time monitoring: After a service is deployed, you can view its metrics on the service monitoring page to understand service invocation and operation status, such as queries per second (QPS), response time, and CPU utilization.

    • Traffic mirroring: Mirrors a proportion of the traffic of one service to another without affecting the normal operation of the source service. This is used to test the performance and reliability of new services.

  • Inference: EAS supports the following inference capabilities:

    • Real-time synchronous inference: Suitable for scenarios that require high throughput and low latency without disrupting online business, such as custom search and intelligent chatbots. The deployment mode can also be adapted to your business requirements for optimal performance.

    • Near-real-time asynchronous inference: Suitable for scenarios such as text-to-image generation and video processing. A message queue is integrated into the inference service, which enables automatic scaling and reduces O&M effort.

Supported regions

EAS is available in the following regions: China (Beijing), China (Shanghai), China (Hangzhou), China (Zhangjiakou), China (Ulanqab), China (Shenzhen), China (Heyuan), China (Guangzhou), China (Chengdu), China (Hong Kong), Japan (Tokyo), Singapore, Indonesia (Jakarta), US (Silicon Valley), US (Virginia), and Germany (Frankfurt).

Billing

See Billing of EAS.

Usage

Step 1: Preparations

  1. Prepare inference resources.

    Select an EAS resource group. EAS provides three types of resources: public resources, dedicated resources, and Lingjun resources. To use dedicated resources or Lingjun resources, you need to purchase them before use. For guidance on resource selection and purchase configurations, see Overview of EAS resource groups.

  2. Prepare the model, pre-processing and post-processing code files, and other required files.

    Prepare the trained model, processing code files, and other required files, upload them to a cloud storage service, and access them through storage mounting.
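For example, if you store the model in Object Storage Service (OSS), you can upload the files with the oss2 SDK and then mount the OSS path when you deploy the service. The following is a minimal sketch; the endpoint, bucket name, and file paths are placeholders.

```python
# Minimal sketch: upload a trained model to OSS so that EAS can mount it later.
# The endpoint, bucket name, and paths are placeholders; replace them with your own.
import oss2

# In practice, read credentials from environment variables or your own secret store.
auth = oss2.Auth("<ACCESS_KEY_ID>", "<ACCESS_KEY_SECRET>")
bucket = oss2.Bucket(auth, "https://oss-cn-hangzhou.aliyuncs.com", "examplebucket")

# Upload the model file and the pre-/post-processing code.
bucket.put_object_from_file("models/my_model/model.pt", "local/model.pt")
bucket.put_object_from_file("models/my_model/processor.py", "local/processor.py")
```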

Step 2: Deploy a service

  • Deployment tools: EAS supports deploying and managing services through a graphical interface or the command line. The two methods have different deployment processes and operational details. A minimal command-line sketch follows this list.

    • Deploy a service:

      • GUI method: Deploy through the console.

      • Command line method: Deploy through the local client (EASCMD).

    • Manage services:

      • GUI method: Manage EAS services on the Inference Service tab of the Elastic Algorithm Service (EAS) page. This includes:

        • Viewing model calling information.

        • Viewing logs, monitoring information, and deployment information.

        • Scaling in, scaling out, starting, stopping, and deleting model services.

      • Command line method: Manage model services through EASCMD. See Run commands to use the EASCMD client.
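As a rough illustration of the command-line method, the following sketch writes a minimal service description file and creates the service with EASCMD. The JSON fields shown follow the commonly documented processor-deployment format, but treat them as assumptions and verify them against the EASCMD documentation.

```python
# Minimal sketch of a service description for a processor-based deployment.
# Field names follow the commonly documented EASCMD JSON format but may vary;
# verify them against the EASCMD reference before use.
import json
import subprocess

service_desc = {
    "name": "demo_model_service",                           # service name (hypothetical)
    "processor": "pmml",                                     # prebuilt processor type
    "model_path": "oss://examplebucket/models/my_model/",    # mounted model location
    "metadata": {
        "instance": 1,                                       # number of service instances
        "cpu": 2,
        "memory": 4000,                                      # memory in MB
    },
}

with open("service.json", "w") as f:
    json.dump(service_desc, f, indent=2)

# Create the service with the EASCMD client (assumes eascmd is installed and configured).
subprocess.run(["eascmd", "create", "service.json"], check=True)
```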

  • Deployment methods: EAS supports image deployment (recommended) and processor deployment. For differences, see Parameters for custom deployment in the console.

    • Image deployment (recommended):

      • Images ensure consistency between the development and deployment environments.

      • EAS provides official images that are suitable for various scenarios. You can use an official image to implement push-button deployment.

      • You can also use a custom image without modification to deploy a model service in a convenient manner.

    • Processor deployment:

      • EAS provides prebuilt processors for common frameworks, such as PMML and XGBoost. Prebuilt processors let you start services quickly but may not meet specific requirements.

      • You can also build custom processors to implement more flexible logic.
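For comparison, an image-based deployment describes a container instead of a processor. The following sketch shows one plausible shape of such a configuration; the field names (containers, image, command, port) and values are assumptions that you should confirm against Parameters for custom deployment in the console.

```python
# Illustrative sketch of an image-based service description.
# Field names and values are assumptions; confirm them in
# "Parameters for custom deployment in the console".
image_service_desc = {
    "name": "demo_image_service",
    "containers": [
        {
            "image": "registry.cn-hangzhou.aliyuncs.com/example/inference:latest",  # hypothetical image
            "command": "python /app/server.py",   # start command inside the container
            "port": 8000,                          # port the inference server listens on
        }
    ],
    "metadata": {"instance": 1, "cpu": 4, "memory": 8000},
}
```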

Step 3: Calling and stress testing

  • Deploy a model as a web UI application: You can use the console to open the web app in a browser and interact with the inference service.

  • Deploy a model as an API service:

    • After deployment, you can use the online service debugging feature to send HTTP service requests to verify that the service can perform inference normally.

    • Call the service for online or asynchronous inference. EAS supports multiple calling methods, such as the Internet endpoint, VPC endpoint, and VPC direct connection. A minimal calling sketch follows this list.

  • For information about stress testing, see Automatic service stress testing.
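As an illustration of calling an API service over its Internet endpoint, the following sketch sends an HTTP request that carries the service token. The URL, token, and request payload are placeholders; the actual input format depends on your model and on the processor or image that serves it.

```python
# Minimal sketch: call an EAS API service over its Internet endpoint.
# The URL and token are placeholders taken from the service's invocation information.
import requests

url = "http://<your-endpoint>.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/demo_model_service"
headers = {"Authorization": "<service-token>"}

# The request body format depends on the model; a JSON payload is used here as an example.
response = requests.post(url, headers=headers, json={"inputs": [1.0, 2.0, 3.0]}, timeout=30)
response.raise_for_status()
print(response.json())
```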

Step 4: Monitor and scale services

  • After the service is running normally, you can enable service monitoring alerts to monitor the usage of service resources.

  • You can also enable horizontal or scheduled automatic scaling features to manage the computing resources of online services in real time.
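To make the idea concrete, a horizontal autoscaling policy is defined by instance bounds and a trigger metric. The field names in the following sketch are illustrative assumptions rather than the exact EAS parameters; see the horizontal automatic scaling documentation for the real configuration.

```python
# Illustrative sketch of a horizontal autoscaling policy (field names are assumptions,
# not the exact EAS parameters): keep between 2 and 10 instances, scale on per-instance QPS.
autoscaling_policy = {
    "min_instances": 2,              # lower bound kept even at low traffic
    "max_instances": 10,             # upper bound to cap cost
    "target_qps_per_instance": 50,   # scale out when per-instance QPS exceeds this value
}
```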

Step 5: Asynchronous inference

Queue service and asynchronous inference are required in scenarios where inference takes a long time. When your inference service receives many requests, create an input queue to store the requests. After the requests are processed, save the results to the output queue and asynchronously return the results. This prevents unprocessed requests from being discarded. Additionally, EAS supports multiple methods of sending request data to the queue service and automatically scales the inference service by monitoring the amount of data in the queue. This effectively controls the number of service instances. For more information, see Asynchronous inference services.
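Conceptually, an asynchronous call enqueues a request and later reads the result from the output queue. The following sketch illustrates this pattern with plain HTTP; the endpoint paths and response format are assumptions for illustration only. See Asynchronous inference services for the supported calling methods and SDKs.

```python
# Conceptual sketch of asynchronous inference: enqueue a request, then poll for the result.
# Endpoint paths and the response format are assumptions for illustration only.
import time
import requests

base = "http://<your-endpoint>.cn-hangzhou.pai-eas.aliyuncs.com/api/predict/demo_async_service"
headers = {"Authorization": "<service-token>"}

# 1. Send the request to the input queue; assume the service returns a request identifier.
put_resp = requests.post(base, headers=headers, data=b"a prompt for text-to-image generation")
request_id = put_resp.text  # hypothetical: the real response format may differ

# 2. Poll the output queue until the result for this request is available.
while True:
    get_resp = requests.get(f"{base}/results/{request_id}", headers=headers)  # hypothetical path
    if get_resp.status_code == 200:
        print(get_resp.content)
        break
    time.sleep(2)
```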

References

  • For information about EAS use cases, see EAS use cases.

  • Data Science Workshop (DSW) of PAI is a cloud-based and interactive integrated development environment (IDE) for machine learning. You can use Notebooks to read data, develop algorithms, and train and deploy models. For more information, see DSW overview.

  • Visualized Modeling (Designer) of PAI provides hundreds of algorithm components. It supports large-scale distributed training for traditional machine learning, deep learning, and reinforcement learning, as well as streaming training and batch training. For more information, see Designer overview.