ML Training API Documentation
Complete API Documentation for External Integration
Version 1.0.0 | Last Updated: November 2025
Quick Start
Get started with the Summit Health ML Training API in minutes. This guide will help you trigger training processes, monitor progress, and manage billing for third-party users.
Base URL: https://your-backend-server.com
API Version: v1
Content-Type: application/json
Prerequisites
- API access credentials (API key or OAuth token)
- Valid billing account for cost allocation
- Training dataset access (MIMIC-III, MIMIC-IV, PubMed, etc.)
Your First Training Request
curl -X POST "https://your-backend-server.com/api/training/start?instance_type=classical&datasets=MIMICIII,MIMIC4" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "user_id": "external_user_123",
    "billing_account": "account_abc"
  }'
Authentication
All API requests require authentication. Summit Health supports two authentication methods:
1. API Key Authentication
Include your API key in the request header:
Authorization: Bearer YOUR_API_KEY
2. OAuth 2.0
For more secure access, use OAuth 2.0:
Authorization: Bearer YOUR_OAUTH_TOKEN
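Because both methods send the credential in the same Authorization header, a client can build its headers once and reuse them for every call. The following is a minimal Python sketch under that assumption; the helper name auth_headers is illustrative and not part of the API.

```python
def auth_headers(token: str) -> dict:
    """Build the headers used by the requests in this guide.

    `token` is either an API key or an OAuth 2.0 access token; both are
    sent as a bearer credential in the Authorization header.
    """
    if not token or token != token.strip():
        raise ValueError("token must be non-empty with no surrounding whitespace")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }


headers = auth_headers("YOUR_API_KEY")
```

In practice the token would normally be read from a secret store or an environment variable rather than hard-coded.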
Training API Endpoints
POST /api/training/start
Start a new training job on the specified instance type.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| instance_type | string | Optional | Instance type: classical, 48vcpu, or trainium. Default: classical |
| datasets | string | Optional | Comma-separated datasets: MIMICIII, MIMIC4, PubMed. Default: MIMICIII |
| user_id | string | Required | External user ID for billing allocation |
| billing_account | string | Required | Billing account ID for cost tracking |
In the request example below, instance_type and datasets are passed in the query string, while user_id and billing_account are sent in the JSON request body.
Request Example
POST /api/training/start?instance_type=48vcpu&datasets=MIMICIII,MIMIC4
Content-Type: application/json
Authorization: Bearer YOUR_API_KEY
{
  "user_id": "external_user_123",
  "billing_account": "account_abc",
  "metadata": {
    "project_name": "Medical NER Model",
    "description": "Training for production deployment"
  }
}
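The same request can be assembled in Python with only the standard library. This is a sketch, not an official client: the function name build_start_request and the client-side validation are assumptions, and the base URL is the placeholder used throughout this guide. The request is built but not sent, so the URL and body can be inspected first.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE_URL = "https://your-backend-server.com"  # placeholder from this guide
ALLOWED_INSTANCE_TYPES = {"classical", "48vcpu", "trainium"}


def build_start_request(api_key, user_id, billing_account,
                        instance_type="classical", datasets=("MIMICIII",)):
    """Build (but do not send) a POST /api/training/start request."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"unknown instance_type: {instance_type!r}")
    if not user_id or not billing_account:
        raise ValueError("user_id and billing_account are required")

    # instance_type and datasets travel in the query string;
    # safe="," keeps the dataset list readable instead of %2C-encoded.
    query = urlencode(
        {"instance_type": instance_type, "datasets": ",".join(datasets)},
        safe=",",
    )
    # user_id and billing_account travel in the JSON body.
    body = json.dumps(
        {"user_id": user_id, "billing_account": billing_account}
    ).encode("utf-8")

    return Request(
        f"{API_BASE_URL}/api/training/start?{query}",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_start_request(
    "YOUR_API_KEY", "external_user_123", "account_abc",
    instance_type="48vcpu", datasets=("MIMICIII", "MIMIC4"),
)
print(request.get_method(), request.full_url)
```

Passing the resulting object to urllib.request.urlopen would send it; keeping construction separate from sending makes the query-string and body split easy to check.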
Response Example
{
  "success": true,
  "message": "Training started successfully on 48 vCPU Instance (3x Faster)",
  "job_id": "training_48vcpu_20251124_152300_12345",
  "model_id": "tinyllama-1b-medical-48vcpu-20251124",
  "process_id": "12345",
  "instance_type": "48vcpu",
  "instance_name": "48 vCPU Instance (3x Faster)",
  "host": "ec2-204-236-243-64.compute-1.amazonaws.com",
  "model_path": "/home/ec2-user/Training_Data/models/tinyllama-1b-medical-phase1",
  "resource_type": "48 vCPU",
  "datasets": ["MIMICIII", "MIMIC4"],
  "status": {
    "current_step": 0,
    "total_steps": 5000,
    "status": "running"
  },
  "progress_detected": true,
  "billing": {
    "estimated_cost": 250.00,
    "estimated_hours": 50,
    "billing_account": "account_abc",
    "user_id": "external_user_123"
  }
}
Important: The training is configured with 3 epochs and a max_steps limit of 5,000 steps. In Hugging Face Transformers, when both max_steps and num_train_epochs are specified, max_steps takes precedence, meaning training will stop at 5,000 steps even if 3 full epochs would require more steps.
Why 5,000 steps instead of 30,000?
- Current Configuration: With an effective batch size of 32 and the combined MIMIC-III + MIMIC-IV dataset (~391,360 documents), one epoch is about 12,230 steps if each document is one training example. At that rate 5,000 steps covers well under one epoch, let alone 3. The limit is a conservative choice for cost efficiency.
- Typical Training: For large medical datasets, a budget on the order of 30,000 steps is more appropriate so that the model sees the training data multiple times. By the same estimate, 30,000 steps is roughly 2.5 epochs, and 3 full epochs need about 36,690 steps.
- Current Limitation: Because training stops at 5,000 steps, it ends long before 3 epochs are complete, which may limit model convergence.
Recommendation: For production medical model training, consider increasing max_steps to 30,000 or more so that training approaches the configured 3 epochs. This can be set by modifying the training script or passed as a custom training parameter. Contact support for assistance with extended training configurations.
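The relationship between steps and epochs can be checked with a few lines of arithmetic. The sketch below assumes that each of the ~391,360 documents becomes exactly one training example (no chunking, packing, or filtering), which the source does not confirm; the function names are illustrative.

```python
import math


def steps_per_epoch(num_examples: int, effective_batch_size: int) -> int:
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / effective_batch_size)


def epochs_covered(max_steps: int, num_examples: int, effective_batch_size: int) -> float:
    """Fraction of the dataset passes completed within max_steps."""
    return max_steps / steps_per_epoch(num_examples, effective_batch_size)


NUM_EXAMPLES = 391_360   # combined MIMIC-III + MIMIC-IV documents (approximate)
BATCH_SIZE = 32          # effective batch size from the configuration above

per_epoch = steps_per_epoch(NUM_EXAMPLES, BATCH_SIZE)
print(f"steps per epoch:     {per_epoch}")
print(f"steps for 3 epochs:  {3 * per_epoch}")
print(f"epochs at 5,000:     {epochs_covered(5_000, NUM_EXAMPLES, BATCH_SIZE):.2f}")
print(f"epochs at 30,000:    {epochs_covered(30_000, NUM_EXAMPLES, BATCH_SIZE):.2f}")
```

If the training pipeline splits long documents into several examples, the number of steps per epoch grows accordingly and each step budget covers even less of the data.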
GET /api/training/process-status
Get the current status of a training job.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Required | Training job ID returned from the start endpoint |
| instance_type | string | Optional | Instance type filter |
Response Example
{
  "job_id": "training_48vcpu_20251124_152300_12345",
  "status": "running",
  "current_step": 2500,
  "total_steps": 5000,
  "progress_percent": 50.0,
  "estimated_time_remaining": "22:30:00",
  "metrics": {
    "train_loss": 0.85,
    "learning_rate": 0.0001
  },
  "billing": {
    "cost_so_far": 125.00,
    "hours_used": 25
  }
}
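A client will usually turn this response into something more convenient than raw strings. The sketch below recomputes progress from the step counters and parses the remaining time, assuming estimated_time_remaining is always formatted as HH:MM:SS as in the example above; the function names are illustrative.

```python
from datetime import timedelta


def parse_remaining(hms: str) -> timedelta:
    """Parse an 'HH:MM:SS' string; hours may exceed 23 for long jobs."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def summarize_status(status: dict) -> str:
    """Return a one-line progress summary for a status response."""
    total = status["total_steps"]
    # Recompute the percentage from the counters rather than trusting
    # progress_percent, and guard against a zero step budget.
    percent = 100.0 * status["current_step"] / total if total else 0.0
    remaining = parse_remaining(status["estimated_time_remaining"])
    return f"{percent:.1f}% complete, about {remaining} remaining"


sample = {
    "status": "running",
    "current_step": 2500,
    "total_steps": 5000,
    "estimated_time_remaining": "22:30:00",
}
print(summarize_status(sample))
```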
Stop a running training job.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| instance_type | string | Optional | Instance type: classical, 48vcpu, or trainium |
GET /api/training/models/list
List all stored training models with metadata.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| instance_type | string | Optional | Filter by instance type |
| status | string | Optional | Filter by status: running, stopped, completed |
| limit | integer | Optional | Maximum number of records (default: 100) |
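Since every filter is optional, a client only needs to send the ones it actually sets. The sketch below builds the query URL for the list endpoint, using the /api/training/models/list path from the cURL examples in this guide; the function name and the client-side checks are assumptions.

```python
from urllib.parse import urlencode

API_BASE_URL = "https://your-backend-server.com"  # placeholder from this guide
VALID_STATUSES = {"running", "stopped", "completed"}


def models_list_url(instance_type=None, status=None, limit=None):
    """Build the models list URL, omitting any filter left as None."""
    if status is not None and status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    if limit is not None and limit < 1:
        raise ValueError("limit must be a positive integer")

    filters = {"instance_type": instance_type, "status": status, "limit": limit}
    query = urlencode({key: value for key, value in filters.items() if value is not None})
    base = f"{API_BASE_URL}/api/training/models/list"
    return f"{base}?{query}" if query else base


print(models_list_url(status="completed", limit=10))
```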
Register a completed training model in the database.
Request Body
{
  "job_id": "training_48vcpu_20251124_152300_12345",
  "model_path": "/home/ec2-user/Training_Data/models/tinyllama-1b-medical-phase1",
  "instance_type": "48vcpu",
  "final_metrics": {
    "train_loss": 0.7547,
    "train_runtime": 161045.76,
    "epoch": 2.98
  }
}
Billing & Cost Allocation
Summit Health provides transparent billing with automatic cost allocation to third-party users and billing accounts.
Pricing Structure
| Resource Type | Pricing | Description |
|---|---|---|
| Classical CPU | $5.00/hour | Standard CPU training instances |
| 48 vCPU | $7.50/hour | High-performance 48-core instances (3x faster) |
| Trainium | $15.00/hour | AWS Trainium instances for accelerated training |
| Base Cost | $10.00/job | One-time setup cost per training job |
| Per Epoch | $0.25/epoch | Additional cost per training epoch |
| Storage | $0.10/GB/month | Model storage cost |
Cost Calculation Example
Training Job Details:
- Instance Type: 48 vCPU
- Training Duration: 50 hours
- Epochs: 3
- Model Size: 4.2 GB
Cost Breakdown:
- Base Cost: $10.00
- Compute (50 hours × $7.50): $375.00
- Epochs (3 × $0.25): $0.75
- Storage (4.2 GB × $0.10): $0.42/month
Total: $385.75 in one-time charges, or $386.17 including the first month of storage ($0.42/month thereafter)
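The breakdown above can be reproduced with a short calculation. The sketch below uses the rates from the pricing table and Decimal arithmetic to avoid floating-point rounding in currency values; mapping the instance_type values to pricing rows (for example classical to Classical CPU) is an assumption, and the function name is illustrative.

```python
from decimal import Decimal

# Rates copied from the pricing table; keys assume the API's instance_type values.
HOURLY_RATES = {
    "classical": Decimal("5.00"),
    "48vcpu": Decimal("7.50"),
    "trainium": Decimal("15.00"),
}
BASE_COST = Decimal("10.00")            # per job
PER_EPOCH = Decimal("0.25")             # per training epoch
STORAGE_PER_GB_MONTH = Decimal("0.10")  # per GB per month


def estimate_job_cost(instance_type, hours, epochs, model_size_gb):
    """Return one-time charges, monthly storage, and the first-month total."""
    compute = HOURLY_RATES[instance_type] * Decimal(str(hours))
    one_time = BASE_COST + compute + PER_EPOCH * epochs
    monthly_storage = STORAGE_PER_GB_MONTH * Decimal(str(model_size_gb))
    return {
        "one_time": one_time,
        "monthly_storage": monthly_storage,
        "first_month_total": one_time + monthly_storage,
    }


cost = estimate_job_cost("48vcpu", hours=50, epochs=3, model_size_gb=4.2)
for name, amount in cost.items():
    print(f"{name}: ${amount:.2f}")
```

For the example job this gives $385.75 in one-time charges and $0.42 per month for storage, which together make up the $386.17 figure.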
Billing Endpoints
GET /api/training/cost-tracking/user-costs
Get a cost breakdown for a specific user or billing account.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| user_id | string | Optional | Filter by user ID |
| billing_account | string | Optional | Filter by billing account |
| start_date | string | Optional | Start date (YYYY-MM-DD) |
| end_date | string | Optional | End date (YYYY-MM-DD) |
Response Example
{
  "user_id": "external_user_123",
  "billing_account": "account_abc",
  "period": {
    "start": "2025-11-01",
    "end": "2025-11-30"
  },
  "costs": {
    "total": 1250.50,
    "training": {
      "compute": 1000.00,
      "base_costs": 50.00,
      "epochs": 15.00
    },
    "storage": 185.50
  },
  "jobs": [
    {
      "job_id": "training_48vcpu_20251124_152300_12345",
      "cost": 386.17,
      "duration_hours": 50,
      "status": "completed"
    }
  ]
}
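When reconciling invoices, it can be useful to confirm that the reported total matches the sum of its components. The sketch below does this for the costs object shown above; it assumes the field layout of the example response and that total is the sum of the training components plus storage, which the example values satisfy. The function names are illustrative.

```python
from decimal import Decimal


def component_sum(costs: dict) -> Decimal:
    """Add up the training components and storage from a costs object."""
    training = costs["training"]
    parts = [
        training["compute"],
        training["base_costs"],
        training["epochs"],
        costs["storage"],
    ]
    # Convert through str() so JSON floats become exact decimal values.
    return sum((Decimal(str(part)) for part in parts), Decimal("0"))


def total_matches(costs: dict) -> bool:
    """True when the reported total equals the sum of its components."""
    return component_sum(costs) == Decimal(str(costs["total"]))


sample_costs = {
    "total": 1250.50,
    "training": {"compute": 1000.00, "base_costs": 50.00, "epochs": 15.00},
    "storage": 185.50,
}
print(component_sum(sample_costs), total_matches(sample_costs))
```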
Record a cost for billing allocation (automatically called by system).
Request Body
{
  "job_id": "training_48vcpu_20251124_152300_12345",
  "user_id": "external_user_123",
  "billing_account": "account_abc",
  "cost_type": "compute",
  "amount": 375.00,
  "duration_hours": 50,
  "metadata": {
    "instance_type": "48vcpu",
    "resource_type": "48 vCPU"
  }
}
When a training job is started with user_id and billing_account, costs are automatically tracked and allocated. You can query costs at any time using the billing endpoints.
Monitoring & Logs
Get training logs for a specific job.
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| job_id | string | Required | Training job ID |
| lines | integer | Optional | Number of log lines to retrieve (default: 500) |
Get training metrics history for visualization.
Deployment Status
GitHub Repository: https://github.com/your-org/summit-health
Vercel Deployment: Auto-deploy on push to main branch
Backend API: https://your-backend-server.com
Deployment Process
- Code Push: Push changes to GitHub main branch
- Auto-Deploy: Vercel automatically deploys frontend
- Backend Update: Backend API updates require manual deployment or CI/CD pipeline
- Verification: Test API endpoints after deployment
Code Examples
Python Example
import time

import requests

API_BASE_URL = "https://your-backend-server.com"
API_KEY = "YOUR_API_KEY"
TIMEOUT_SECONDS = 30  # fail fast instead of hanging on a stalled connection


def start_training(user_id, billing_account, datasets="MIMICIII"):
    """Start a training job"""
    response = requests.post(
        f"{API_BASE_URL}/api/training/start",
        params={
            "instance_type": "48vcpu",
            "datasets": datasets
        },
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "user_id": user_id,
            "billing_account": billing_account
        },
        timeout=TIMEOUT_SECONDS
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()


def check_status(job_id):
    """Check training job status"""
    response = requests.get(
        f"{API_BASE_URL}/api/training/process-status",
        params={"job_id": job_id},
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=TIMEOUT_SECONDS
    )
    response.raise_for_status()
    return response.json()


def get_costs(user_id, billing_account):
    """Get cost breakdown"""
    response = requests.get(
        f"{API_BASE_URL}/api/training/cost-tracking/user-costs",
        params={
            "user_id": user_id,
            "billing_account": billing_account
        },
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=TIMEOUT_SECONDS
    )
    response.raise_for_status()
    return response.json()


# Example usage
if __name__ == "__main__":
    # Start training
    result = start_training(
        user_id="external_user_123",
        billing_account="account_abc",
        datasets="MIMICIII,MIMIC4"
    )
    job_id = result["job_id"]
    print(f"Training started: {job_id}")

    # Monitor progress until the job reaches a terminal status
    while True:
        status = check_status(job_id)
        print(f"Progress: {status['progress_percent']}%")
        if status["status"] == "completed":
            print("Training completed!")
            break
        if status["status"] == "stopped":
            # Without this check the loop would never end for a stopped job
            print("Training stopped before completion.")
            break
        time.sleep(60)  # Check every minute

    # Get final costs
    costs = get_costs("external_user_123", "account_abc")
    print(f"Total cost: ${costs['costs']['total']:.2f}")
JavaScript/Node.js Example
const axios = require('axios');

const API_BASE_URL = 'https://your-backend-server.com';
const API_KEY = 'YOUR_API_KEY';

async function startTraining(userId, billingAccount, datasets = 'MIMICIII') {
  const response = await axios.post(
    `${API_BASE_URL}/api/training/start`,
    {
      user_id: userId,
      billing_account: billingAccount
    },
    {
      params: {
        instance_type: '48vcpu',
        datasets: datasets
      },
      headers: {
        'Authorization': `Bearer ${API_KEY}`,
        'Content-Type': 'application/json'
      }
    }
  );
  return response.data;
}

async function checkStatus(jobId) {
  const response = await axios.get(
    `${API_BASE_URL}/api/training/process-status`,
    {
      params: { job_id: jobId },
      headers: { 'Authorization': `Bearer ${API_KEY}` }
    }
  );
  return response.data;
}

// Resolve after the given number of milliseconds
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Example usage
(async () => {
  const result = await startTraining(
    'external_user_123',
    'account_abc',
    'MIMICIII,MIMIC4'
  );
  console.log('Training started:', result.job_id);

  // Monitor progress; polling in a loop keeps status requests from overlapping
  while (true) {
    const status = await checkStatus(result.job_id);
    console.log(`Progress: ${status.progress_percent}%`);
    if (status.status === 'completed') {
      console.log('Training completed!');
      break;
    }
    if (status.status === 'stopped') {
      console.log('Training stopped before completion.');
      break;
    }
    await sleep(60000); // Check every minute
  }
})().catch((error) => {
  // Report request failures instead of leaving an unhandled rejection
  console.error('Request failed:', error.message);
  process.exitCode = 1;
});
cURL Examples
# Start training
curl -X POST "https://your-backend-server.com/api/training/start?instance_type=48vcpu&datasets=MIMICIII,MIMIC4" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "user_id": "external_user_123",
    "billing_account": "account_abc"
  }'

# Check status
curl -X GET "https://your-backend-server.com/api/training/process-status?job_id=training_48vcpu_20251124_152300_12345" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Get costs
curl -X GET "https://your-backend-server.com/api/training/cost-tracking/user-costs?user_id=external_user_123&billing_account=account_abc" \
  -H "Authorization: Bearer YOUR_API_KEY"

# List models
curl -X GET "https://your-backend-server.com/api/training/models/list?status=completed&limit=10" \
  -H "Authorization: Bearer YOUR_API_KEY"
📞 Support & Contact
For API access, billing questions, or technical support:
- Email: api-support@summithealth.ai
- Documentation: https://docs.summithealth.ai
- Status Page: https://status.summithealth.ai