



### "Fast" ML for Science

Nhan Tran, Fermilab & Duc Hoang, MIT

ICISE HW Camp, 7 March 2024

With material from Aobo Li, Beomki Yeo, Javier Duarte

### Table of contents

- Particle physics & Experimental science
- Fast ML for Science
- ML the basics
  - Training and inference
- Computing technology & platform
- Coprocessors for science
- Efficient ML Codesign
- Examples: particle physics, fusion
- Towards automated, accelerated discovery



# Fast ML for Science





- 1 channel ~ 10 b
- 1 channel, 1 MHz rate ~ 10 Mb/s
- 100k channels, 1 MHz rate ~ 1 Tb/s
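The scaling from one channel to a full detector is a single multiplication. A minimal sketch of that arithmetic, assuming each channel delivers one 10-bit sample per readout and decimal prefixes (the function name `data_rate_bps` is made up for illustration):

```python
def data_rate_bps(bits_per_sample: int, sample_rate_hz: float, n_channels: int = 1) -> float:
    """Raw data rate in bits per second for n_channels read out at sample_rate_hz."""
    return bits_per_sample * sample_rate_hz * n_channels


one_channel = data_rate_bps(10, 1e6)            # 10 b per sample at 1 MHz
detector = data_rate_bps(10, 1e6, 100_000)      # same readout, 100k channels

print(f"1 channel:     {one_channel / 1e6:.0f} Mb/s")
print(f"100k channels: {detector / 1e12:.0f} Tb/s")
```

This ignores any headers, encoding overhead, or zero suppression, which a real readout chain would add or remove.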






Embed more intelligence

# Fast ML for science and the extreme edge

"Scientific discoveries come from groundbreaking ideas and the capability to validate those ideas by testing nature at new scales - finer and more precise temporal and spatial resolution. This is leading to an explosion of data that must be interpreted, and ML is proving a powerful approach. The more efficiently we can test our hypotheses, the faster we can achieve discovery. To fully unleash the power of ML and accelerate discoveries, it is necessary to embed it into our scientific process, into our instruments and detectors."

> Applications and Techniques for Fast Machine Learning in Science https://doi.org/10.3389/fdata.2022.787421

Benchmarks bring innovation

The Fast ML for Science community aims to bring **seemingly different domains** together to develop **techniques, tools, and platforms** for challenges that **far outpace industry**.












Fermilab accelerator complex





# The Need For Speed

[Figure labels: 4 distributed recording ASICs; wireless communication hub]















MLCommons launches machine learning benchmark for devices like smartwatches and voice assistants

by Ben Wodecki, 6/16/2021



With experts from Qualcomm, Fermilab, and Google aiding in its development

MLCommons, the open engineering consortium behind the MLPerf benchmark test, has launched a new measurement suite aimed at 'tiny' devices like smartwatches and voice assistants.

MLPerf Tiny Inference is designed to compare performance of embedded devices and models with a footprint of 100kB or less, by measuring





















- Fast control
  - Immediate response to dynamics of the experiment and data readout
  - Event timing, triggering, etc.
- Slow control
  - Detector stability over minutes, days, weeks, months,...
  - Monitoring and controlling operational parameters: electronics gains, pedestals, calibrations, etc.



# ML - the basics

### Why AI?

Universal function approximation - fit with customizable objective: f(inputs; lots of parameters) = output

- <u>Expressive</u>: able to find patterns and correlations in high-dimensional data not explicitly accounted for
- <u>Powerful</u>: can unlock large gains in performance
- <u>Adaptive</u>, <u>flexible</u>, <u>autonomous</u>: able to adapt to new data, conditions automatically; handles all different types of data representations



#### All of AI in one slide



#### **HEPML-LivingReview**

#### A Living Review of Machine Learning for Particle Physics

Modern machine learning techniques, including deep learning, is rapidly being applied, adapted, and developed for high energy physics. The goal of this document is to provide a nearly comprehensive list of citations for those developing and applying these approaches to experimental, phenomenological, or theoretical analyses. As a living document, it will be updated as often as possible to incorporate the latest developments. A list of proper (unchanging) reviews can be found within. Papers are grouped into a small set of topics to be as useful as possible. Suggestions are most welcome.



The purpose of this note is to collect references for modern machine learning as applied to particle physics. A minimal number of categories is chosen in order to be as useful as possible. Note that papers may be referenced in more than one category. The fact that a paper is listed in this document does not endorse or validate its content - that is for the community (and for peer-review) to decide. Furthermore, the classification here is a best attempt and may have flaws - please let us know if (a) we have missed a paper you think should be included, (b) a paper has been misclassified, or (c) a citation for a paper is not correct or if the journal information is now available. In order to be as useful as possible, this document will continue to evolve so please check back before you write your next paper. If you find this review helpful, please consider citing it using `\cite{hepmllivingreview}` in `HEPML.bib`.

Reviews

- Modern reviews
  - Jet Substructure at the Large Hadron Collider: A Review of Recent Advances in Theory and Machine Learning [DOI]
  - Deep Learning and its Application to LHC Physics [DOI]
  - Machine Learning in High Energy Physics Community White Paper [DOI]
  - Machine learning at the energy and intensity frontiers of particle physics
  - Machine learning and the physical sciences [DOI]
  - Machine and Deep Learning Applications in Particle Physics [DOI]
  - Modern Machine Learning and Particle Physics
  - Machine Learning in the Search for New Fundamental Physics
  - Artificial Intelligence and Machine Learning in Nuclear Physics

https://iml-wg.github.io/HEPML-LivingReview



### Basic elements of machine learning

- Learning mathematical models from data that:
  - characterize the patterns, regularities, and relationships amongst variables in the system
- Three key components:
  - Model: chosen mathematical model
    - Depends on the task, data modality
  - Learning: estimate statistical model from data
  - Prediction and Inference: using statistical model to make predictions on new data points and infer properties of system(s)



### Machine learning computation



Simple 2-input example (Fisher linear discriminant, linear support vector machine, ...):

$$O_1 = I_1 \times W_{11} + I_2 \times W_{21} + b_1$$
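The two-input formula is one instance of a multiply-accumulate over all inputs. A small sketch in plain Python, with made-up numbers for the inputs, weights, and bias (the helper name `linear_layer` is hypothetical):

```python
def linear_layer(inputs, weights, biases):
    """Compute O_j = sum_i I_i * W_ij + b_j for every output j.

    weights[i][j] connects input i to output j.
    """
    return [
        sum(inputs[i] * weights[i][j] for i in range(len(inputs))) + biases[j]
        for j in range(len(biases))
    ]


# Two inputs, one output: O_1 = I_1*W_11 + I_2*W_21 + b_1
I = [1.0, 2.0]
W = [[0.5],     # W_11
     [-0.25]]   # W_21
b = [0.1]

O = linear_layer(I, W, b)
print(O)
```

Each output needs one multiplication per input, which is why the cost of inference is dominated by multiply-accumulate operations.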










#### Some intuition

https://playground.tensorflow.org/









we do so by <u>adjusting the weights</u>





to learn the weights, we need the **derivative** of the loss w.r.t. the weight i.e. "how should the weight be updated to decrease the loss?"

$$w' = w - \alpha \frac{\partial \mathcal{L}}{\partial w}$$

with multiple weights, we need the gradient of the loss w.r.t. the weights

$$\mathbf{w}' = \mathbf{w} - \alpha \nabla_{\mathbf{w}} \mathcal{L}$$
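The update rule can be tried on a loss whose gradient is known in closed form. This is only an illustration: the quadratic loss, the target values, and the step size $\alpha = 0.1$ are assumptions chosen to keep the example short.

```python
def loss(w, target):
    """Toy loss: L(w) = sum_i (w_i - t_i)^2."""
    return sum((wi - ti) ** 2 for wi, ti in zip(w, target))


def grad(w, target):
    """Gradient of the toy loss: dL/dw_i = 2 * (w_i - t_i)."""
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


def gradient_descent(w, target, alpha=0.1, steps=100):
    for _ in range(steps):
        g = grad(w, target)
        w = [wi - alpha * gi for wi, gi in zip(w, g)]   # w' = w - alpha * grad_w L
    return w


target = [3.0, -1.0]
w_final = gradient_descent([0.0, 0.0], target)
print(w_final, loss(w_final, target))
```

With this loss each step shrinks the distance to the minimum by a factor of $1 - 2\alpha$, so the weights approach the target geometrically.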



### Backpropagation

a neural network defines a function of composed operations  $f_L(\mathbf{w}_L, f_{L-1}(\mathbf{w}_{L-1}, \dots f_1(\mathbf{w}_1, \mathbf{x}) \dots))$ and the loss  $\mathcal{L}$  is a function of the network output

→ use <u>chain rule</u> to calculate gradients
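A sketch of the chain rule on the smallest possible composed function: two scalar layers, $f_1 = \tanh(w_1 x)$ and $f_2 = w_2 f_1$, with a squared-error loss. The network, the numbers, and the function names are illustrative assumptions; the finite-difference function is there only to cross-check the analytic gradients.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)    # f_1(w_1, x)
    out = w2 * h             # f_2(w_2, f_1)
    return h, out


def loss(out, y):
    return (out - y) ** 2


def backward(w1, w2, x, y):
    """Chain rule, applied from the loss back towards the first layer."""
    h, out = forward(w1, w2, x)
    dL_dout = 2.0 * (out - y)
    dL_dw2 = dL_dout * h                  # d(out)/d(w2) = h
    dL_dh = dL_dout * w2                  # d(out)/d(h)  = w2
    dL_dw1 = dL_dh * (1.0 - h * h) * x    # d tanh(u)/du = 1 - tanh(u)^2, du/dw1 = x
    return dL_dw1, dL_dw2


def numerical_grad(w1, w2, x, y, eps=1e-6):
    """Central finite differences, used only as a cross-check."""
    def L(a, b):
        return loss(forward(a, b, x)[1], y)
    g1 = (L(w1 + eps, w2) - L(w1 - eps, w2)) / (2 * eps)
    g2 = (L(w1, w2 + eps) - L(w1, w2 - eps)) / (2 * eps)
    return g1, g2


analytic = backward(0.5, -1.5, x=2.0, y=1.0)
numeric = numerical_grad(0.5, -1.5, x=2.0, y=1.0)
print(analytic, numeric)
```

The intermediate quantity `dL_dh` is reused for the earlier layer, which is what makes backpropagation cheaper than differentiating each weight independently.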



🛠 Fermilab

#### Stochastic gradient descent

See animated gifs: <u>http://ruder.io/optimizing-gradient-descent/</u>

stochastic gradient descent (SGD): $w = w - \alpha \tilde{\nabla}_w \mathcal{L}$

use a stochastic gradient estimate to descend the surface of the loss function

recent variants use additional terms to maintain "memory" of previous gradient information and scale gradients per parameter
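A minimal sketch of minibatch SGD with a momentum term, one common way of keeping "memory" of previous gradients. It assumes a noise-free toy dataset $y = 2x + 1$ and hand-picked hyperparameters; per-parameter gradient scaling (as in Adam-style optimizers) is not shown.

```python
import random


def sgd_momentum(data, alpha=0.05, beta=0.9, batch_size=8, epochs=200, seed=0):
    """Fit y = w*x + b by minimizing mean squared error with minibatch SGD."""
    rng = random.Random(seed)
    data = list(data)              # do not reorder the caller's list
    w, b = 0.0, 0.0
    vw, vb = 0.0, 0.0              # velocities: running memory of past gradients
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Stochastic gradient estimate: computed from the minibatch only
            gw = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            gb = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            vw = beta * vw + gw
            vb = beta * vb + gb
            w -= alpha * vw
            b -= alpha * vb
    return w, b


xs = [i / 16 - 1.0 for i in range(32)]
data = [(x, 2.0 * x + 1.0) for x in xs]
w, b = sgd_momentum(data)
print(w, b)
```

Setting `beta=0.0` recovers plain SGD, which makes it easy to compare the two on the same data.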



in many dimensions, local minima and saddle points are largely not an issue: the optimizer can move in exponentially more directions



# **Compute technology & platforms**
#### **Basics of computing**

- Microprocessor: A single Integrated Circuit which can do data processing and logic control
- Integrated Circuit: A chunk of transistors
- Transistor: A minimal building block of electronics



Intel 4004 chipset design (2300 transistors)



#### CPU

- CPU (Central Processing Unit): made of cores, caches, and control units
  - Core: Arithmetic Logic Units (ALUs) and registers
    - ALU: performs mathematical operations
    - Register: small storage that holds the data being processed
  - Cache: on-chip memory
  - Control Unit: distributes operations to the other units





#### Moore's law, Dennard Scaling, Pollack's Rule

- Moore's Law: observation that the number of transistors in processors doubles every two years
- Dennard Scaling
  - Free scaling of the frequency (f) for the same power consumption (P)
  - $P = \alpha C V^2 f$
  - Capacitance (C) and operating voltage (V) are linearly reduced with the size of transistor
- Pollack's rule
  - Observation of Performance ~  $\sqrt{N}$  (N = the number of transistors)
  - Moore's law allows more transistors (N) on the same chip area




Below a transistor size of 65 nm (since 2005), the leakage current (I<sub>leakage</sub>) is no longer negligible, so Dennard scaling breaks down:

 $P = \alpha C V^2 f + V I_{leakage}$ 
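The power formula can be evaluated before and after a shrink to see why Dennard scaling was "free". The sketch assumes the classical scaling rules (a linear shrink by a factor $k$ reduces $C$ and $V$ by $k$ and raises $f$ by $k$), arbitrary units, and a made-up leakage current; none of these numbers come from the slides.

```python
def power(alpha, C, V, f, I_leakage=0.0):
    """P = alpha*C*V^2*f + V*I_leakage (dynamic plus leakage power)."""
    return alpha * C * V**2 * f + V * I_leakage


def dennard_scale(C, V, f, k):
    """Shrink linear dimensions by k: C -> C/k, V -> V/k, f -> k*f."""
    return C / k, V / k, f * k


alpha, C, V, f = 0.5, 1.0, 1.0, 1.0    # arbitrary units
k = 2.0

p_before = power(alpha, C, V, f)
p_after = power(alpha, *dennard_scale(C, V, f, k))

# Transistor area shrinks by k^2, so power per unit area stays constant.
density_ratio = (p_after / (1 / k**2)) / (p_before / 1.0)
print(p_after / p_before, density_ratio)

# With a leakage term (hypothetical value) the scaled transistor draws more
# power than the ideal scaling predicts.
p_leaky = power(alpha, *dennard_scale(C, V, f, k), I_leakage=0.1)
print(p_leaky > p_after)
```

Per-transistor power drops by $k^2$ while $k^2$ more transistors fit in the same area, which is the balance that the leakage term upsets.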





### Flynn's Taxonomy







| Data streams | Instruction streams: one | Instruction streams: many |
|--------------|--------------------------|---------------------------|
| one          | SISD: traditional von Neumann single-CPU computer | MISD: may be pipelined computers |
| many         | SIMD: vector processors, fine-grained data-parallel computers | MIMD: multi-computers, multiprocessors |



### GPUs

- Graphics processing unit
- Many cores (~1000)
  - Each much simpler than a CPU core
  - Small caches
- Originally intended for graphics on a PC screen
- Major vendors: Nvidia, AMD, Intel





### Rise of ML

- Necessity/Data
- Hardware
- ML Research
- Tools






Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2019 by K. Rupp.






| Technology size | Year | Technology size | Year |
|-----------------|------|-----------------|------|
| 10 μm           | 1971 | 130 nm          | 2001 |
| 6 μm            | 1974 | 90 nm           | 2004 |
| 3 μm            | 1977 | 65 nm           | 2006 |
| 1.5 μm          | 1982 | 45 nm           | 2008 |
| 1 μm            | 1985 | 32 nm           | 2010 |
| 800 nm          | 1989 | 22 nm           | 2012 |
| 600 nm          | 1994 | 14 nm           | 2014 |
| 350 nm          | 1995 | 10 nm           | 2017 |
| 250 nm          | 1997 | 7 nm            | 2018 |
| 180 nm          | 1999 | 5 nm            | 2020 |










COMMUNICATIONS OF THE ACM 02/2019 VOL.62 NO.02

#### A New Golden Age for Computer Architecture








- ASIC
  - Google TPU block diagram
  - Very efficient compute, but long development times and challenging to make general purpose





- FPGA
  - More flexible to changing workloads
  - Still not that easy to program



- NPUs
  - Fast moving space
  - Immature software ecosystem
  - Interoperability a challenge





































#### Modalities of processing

|                                             | CPU | GPU | FPGA |
|---------------------------------------------|-----|-----|------|
| Latency                                     | O(10) μs | O(100) μs | Deterministic, O(100) ns |
| I/O with processor                          | Ethernet, USB, PCIe | PCIe, NVLink | Connectivity to any data source via printed circuit board (PCB) |
| Engineering cost                            | Low entry level (programmable with C++, Python, etc.) | Low entry level (programmable with CUDA, OpenCL, etc.) | Some high-level syntax available, traditionally VHDL, Verilog (specialized engineer) |
| Single-precision floating-point performance | O(10) TFLOPs | O(10) TFLOPs | Optimized for fixed-point performance |
| Serial / parallel                           | Optimized for serial performance, increasingly using vector processing | Optimized for parallel performance | Optimized for parallel performance |
| Memory                                      | O(100) GB RAM | O(10) GB | O(10) MB (on the FPGA itself, not the PCB) |
| Backward compatibility                      | Compatible, except for vector instruction sets | Compatible, except for specific features only available on modern GPUs | Not easily backward compatible |



### Accelerated compute

#### **Embedded Systems**

Embedded in our experiments; often (hard) real-time latency constraints, custom architectures

#### Coprocessors

Traditional datacenter-scale compute; throughput-driven; general purpose architectures

#### Fast ML regimes





### Efficient ML codesign (For embedded systems)

- Field Programmable Gate Arrays are reprogrammable integrated circuits
- Contain many different building blocks ('resources') which are connected together as you desire
- Originally popular for prototyping ASICs, but now also for high performance computing



Now Intel!

Now AMD!



- Logic cells / Look-Up Tables (LUTs) perform arbitrary functions on small-bitwidth inputs (2-6)
  - These can be used for boolean operations, arithmetic, small memories
- Flip-Flops register data in time with the clock pulse
- DSPs (Digital Signal Processors) are specialized units for multiplication and arithmetic
  - Faster and more efficient than using LUTs for these types of operations
- BRAMs are small, fast memories: RAMs, ROMs, FIFOs (18 Kb each in Xilinx)
  - Memories built from BRAMs are more efficient than those built from LUTs
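What "arbitrary functions on small-bitwidth inputs" means can be shown with a software model of a look-up table: the function is stored as a truth table and evaluated by indexing. This is a conceptual sketch only; the class `LUT` is hypothetical and says nothing about how a vendor tool actually maps logic.

```python
class LUT:
    """A k-input look-up table: 2**k stored bits, one per input combination."""

    def __init__(self, k, func):
        self.k = k
        # "Programming" the LUT means filling in the truth table for func.
        self.table = [
            func(*[(addr >> i) & 1 for i in range(k)]) & 1
            for addr in range(2 ** k)
        ]

    def __call__(self, *bits):
        addr = sum(bit << i for i, bit in enumerate(bits))
        return self.table[addr]


# The same primitive can hold any boolean function of its inputs.
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)                      # full-adder sum bit
maj3 = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))    # full-adder carry bit

# 1 + 1 + 0 = binary 10, so sum bit 0 and carry bit 1
print(xor3(1, 1, 0), maj3(1, 1, 0))
```

Two 3-input LUTs are enough for a one-bit full adder, which is how LUTs end up being used for arithmetic as well as boolean logic.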



- <u>High speed transceivers</u> with Tb/s total bandwidth: PCIe, (Multi) Gigabit Ethernet, InfiniBand
- And: support for highly parallel algorithm implementations
- Low power per operation (relative to CPU/GPU)
- Cons:
  - Limited resources on chip
  - Difficult to program; concurrency is always challenging



















## How are FPGAs programmed?

- Hardware Description Languages
  - HDLs are programming languages which describe electronic circuits
- High Level Synthesis
  - Compile from C/C++ to VHDL
  - Pre-processor directives and constraints used to optimize the design
  - Drastic decrease in firmware development time!
- Not all rainbows and sunshine: projects are often a mix of HDL and HLS, but HLS can be used to build kernels or IP blocks for dedicated algorithms









## Moving data expensive, computing cheap

| Operation           | Energy (pJ) |
|---------------------|-------------|
| 8b Add              | 0.03        |
| 16b Add             | 0.05        |
| 32b Add             | 0.1         |
| 16b FP Add          | 0.4         |
| 32b FP Add          | 0.9         |
| 8b Mult             | 0.2         |
| 32b Mult            | 3.1         |
| 16b FP Mult         | 1.1         |
| 32b FP Mult         | 3.7         |
| 32b SRAM Read (8KB) | 5           |
| 32b DRAM Read       | 640         |

Adapted from Horowitz. [Bar chart: relative energy cost on a log scale from 1 to 10000]
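The relative costs follow directly from the energy column. A short sketch that normalizes every entry to the 8-bit add; the choice of baseline is an assumption made here for illustration.

```python
# Energy per operation in pJ, copied from the table above.
ENERGY_PJ = {
    "8b Add": 0.03,
    "16b Add": 0.05,
    "32b Add": 0.1,
    "16b FP Add": 0.4,
    "32b FP Add": 0.9,
    "8b Mult": 0.2,
    "32b Mult": 3.1,
    "16b FP Mult": 1.1,
    "32b FP Mult": 3.7,
    "32b SRAM Read (8KB)": 5.0,
    "32b DRAM Read": 640.0,
}

baseline = ENERGY_PJ["8b Add"]
for op, pj in sorted(ENERGY_PJ.items(), key=lambda kv: kv[1]):
    print(f"{op:<22}{pj:>8.2f} pJ {pj / baseline:>10.0f}x")

# How many 32-bit floating-point multiplies cost as much as one DRAM read?
dram_vs_fp_mult = ENERGY_PJ["32b DRAM Read"] / ENERGY_PJ["32b FP Mult"]
print(f"1 DRAM read ~ {dram_vs_fp_mult:.0f} x 32b FP Mult")
```

One off-chip read costs more than a hundred of the most expensive arithmetic operations in the table, which is the sense in which moving data is expensive and computing is cheap.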




## **Efficient machine learning**

- Computation parallelization/ vectorization and in-memory compute (architecture)
- Quantization, reduced precision
  - For ML, 32-bit floating point is often overkill
  - Integer/fixed-point math at 16, 8, 7, 6, 5, ..., 1 bits
- Compression, pruning
  - Maintain the same performance while removing low-weight synapses and neurons
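Both ideas can be sketched on a handful of weights. The helpers `quantize_fixed` and `prune_by_magnitude` are hypothetical names, the weights are made up, and the 6-bit format with 4 fractional bits is just an example. The sketch applies both steps after training; keeping the same performance in practice usually requires retraining or fine-tuning, which is not shown.

```python
def quantize_fixed(x, total_bits, frac_bits):
    """Round x to a signed fixed-point grid and saturate at the representable range."""
    scale = 2 ** frac_bits
    lo = -(2 ** (total_bits - 1))
    hi = 2 ** (total_bits - 1) - 1
    q = max(lo, min(hi, round(x * scale)))   # integer code word
    return q / scale


def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


weights = [0.82, -0.03, 0.41, 0.007, -0.66, 0.12, -0.25, 0.05]
sparse = prune_by_magnitude(weights, 0.5)
quant = [quantize_fixed(w, total_bits=6, frac_bits=4) for w in sparse]
print(sparse)
print(quant)
```

Pruned weights need no multiplier at all, and the surviving ones need only a 6-bit multiplier instead of a 32-bit floating-point unit.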





### Ps and Qs: Quantization-Aware Pruning for Efficient Low Latency Neural Network Inference

Benjamin Hawks<sup>1</sup>, Javier Duarte<sup>2</sup>, Nicholas J. Fraser<sup>3</sup>, Alessandro Pappalardo<sup>3</sup>, Nhan Tran<sup>1,4</sup>, Yaman Umuroglu<sup>3</sup>

<sup>1</sup>Fermi National Accelerator Laboratory, Batavia, IL, United States <sup>2</sup>University of California San Diego, La Jolla, CA, United States, <sup>3</sup>Xilinx Research, Dublin, Ireland, <sup>4</sup>Northwestern University, Evanston, IL, United States

Developed a quantization-aware pruning (QAP) procedure:

- Used BOPs (bit operations) as the hardware efficiency metric
- Fine-tuning vs. lottery ticket pruning
- Effect of batch normalization (BN) and $L_1$ regularization
- Explored generalizability of QAP-ed models, including metrics like neural efficiency
- Bayesian optimization/structured pruning vs. unstructured pruning

| Model         | Precision             | BN or $L_1$ | Pruned [%] | BOPs        | Accuracy [%] | $\langle \epsilon_b^{\epsilon_s=0.5}\rangle$ [%] | $\langle \mathrm{AUC} \rangle$ [%] |
|---------------|-----------------------|-------------|------------|-------------|--------------|--------------------------------------------------|------------------------------------|
| Nominal       | 32-bit floating-point | $L_1$ + BN  | 0          | 4,652,832   | **76.977**   | **0.00171**                                      | **94.335**                         |
| Pruning + PTQ | 16-bit fixed-point    | $L_1$ + BN  | 70         | 631,791     | 75.01        | 0.00210                                          | 94.229                             |
| QAT           | 6-bit fixed-point     | $L_1$ + BN  | 0          | 412,960     | 76.737       | 0.00208                                          | 94.206                             |
| QAP           | 6-bit scaled-integer  | $L_1$ + BN  | **80**     | **189,672** | 76.602       | 0.00211                                          | 94.197                             |
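A sketch of how a BOPs figure like the one in the Nominal row can be estimated. Two things are assumptions not stated on the slide: the per-layer estimate $\mathrm{BOPs} \approx mn\,[(1 - f_p)\,b_a b_w + b_a + b_w + \log_2 n]$ for a fully connected layer with $n$ inputs, $m$ outputs, pruned fraction $f_p$, and activation and weight bit widths $b_a$, $b_w$; and a fully connected network with layer sizes 16, 64, 32, 32, 5.

```python
import math


def dense_bops(n_in, n_out, bits_act, bits_wgt, pruned_fraction=0.0):
    """BOPs estimate for one fully connected layer (assumed formula, see text)."""
    return n_out * n_in * (
        (1.0 - pruned_fraction) * bits_act * bits_wgt
        + bits_act + bits_wgt + math.log2(n_in)
    )


def model_bops(layer_sizes, bits_act, bits_wgt, pruned_fraction=0.0):
    """Sum the per-layer estimate over consecutive layer pairs."""
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(dense_bops(n, m, bits_act, bits_wgt, pruned_fraction) for n, m in pairs)


LAYERS = [16, 64, 32, 32, 5]    # assumed architecture, not given on the slide
nominal = model_bops(LAYERS, bits_act=32, bits_wgt=32)
print(f"{nominal:,.0f}")
```

Under these assumptions the 32-bit, unpruned case reproduces the Nominal entry of the table. The other rows are not reproduced here, since they depend on precision choices that the slide does not spell out.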




### Efficient algorithm codesign

#### More interesting directions: distillation and inductive bias








Model performance improves with distillation of expert knowledge, and the model becomes more robust (see talk)





# Efficient hardware - algorithm codesign





**Physics requirements** 








What kind of platform?




See proposal for **QONNX** 

[Figure: QONNX graph excerpt with node inputs 0 (32×32), 1 = 0.03125, 2 = 0, 3 = 6, followed by Relu]




<u>QKeras</u> (Google), <u>Brevitas</u> (AMD), <u>HAWQ</u> (UC Berkeley), <u>QONNX</u> (Microsoft/AMD)




BNL711 FELIX Firmware Floorplanning



### Why hls4ml

- Open-source
- Community-supported
- User-driven
- Accessible and usable

https://github.com/fastmachinelearning/hls4ml-tutorial

#### Check the accuracy and make a ROC curve

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import accuracy_score

    import plotting  # local helper module used by the tutorial notebook

    # model, X_test, y_test and le come from earlier cells of the notebook
    y_keras = model.predict(X_test)
    print("Accuracy: {}".format(accuracy_score(np.argmax(y_test, axis=1), np.argmax(y_keras, axis=1))))
    plt.figure(figsize=(9, 9))
    _ = plotting.makeRoc(y_test, y_keras, le.classes_)

#### Convert the model to FPGA firmware with hls4ml

Now we will go through the steps to convert the model we trained to a low-latency optimized FPGA firmware with hls4ml. First, we will evaluate its classification performance to make sure we haven't lost accuracy using the fixed-point data types. Then we will synthesize the model with Vivado HLS and check the metrics of latency and FPGA resource usage.

#### Make an hls4ml config & model


The hls4ml Neural Network inference library is controlled through a configuration dictionary. In this example we'll use the simplest variation; later exercises will look at more advanced configuration.

    import hls4ml

Let's visualise what we created. The model architecture is shown, annotated with the shape and data types.

    hls4ml.utils.plot_model(hls_model, show_shapes=True, show_precision=True, to_file=None)

#### Compile, predict

Now we need to check that this model performance is still good. We compile the hls\_model, and then use hls\_model.predict to execute the FPGA firmware with bit-accurate emulation on the CPU.

    hls_model.compile()
    X_test = np.ascontiguousarray(X_test)
    y_hls = hls_model.predict(X_test)


[Screenshot: chat-assistant answer on converting a Keras model with hls4ml; code is truncated where the screenshot cuts it off]

**Use hls4ml to convert a neural network from Keras to HLS**

`hls4ml` is a package developed to convert machine learning models (like those from Keras) into high-level synthesis (HLS) code. This allows the deployment of such models onto FPGAs. The following steps will guide you through converting a Keras neural network model to HLS using `hls4ml`:

**Installation:** If you haven't installed `hls4ml` yet, you can do so using `pip`: […]

    model = Sequential()
    model.add(Dense(64, input_s…
    model.add(Dense(32, activat…
    model.add(Dense(1, activati…
    model.compile(optimizer='ad…

Make sure you've saved your traine…

    model.save('keras_model.h5'…

**3. Convert with hls4ml:**

    import hls4ml

    # Configure the converter
    config = hls4ml.utils.config_from_keras_model(model, granularity='model')
    print(config)

    # Convert the model
    hls_model = hls4ml.converters.convert_from_keras_model(model,
                                                           hls_config=config,
                                                           output_dir='hls4ml…
| pip install hls4                                         | ml                             |                                                                                                                                                                                                                                 |                                | # Print the model configuration to check                                                                                                                               |
| Den se Maria Maria M                                     | la dal                         |                                                                                                                                                                                                                                 | el(                            | hls4ml.utils.plot_model(hls_model, show_shapes=True, show_precision=True,                                                                                              |
| Prepare Your Keras M                                     | lodel:                         |                                                                                                                                                                                                                                 | )                              | · · · · · · · · · · · · · · · · · · ·                                                                                                                                  |
| Let's assume you've tr                                   | rained a simple r              | model in Keras or you have a Keras model ready:                                                                                                                                                                                 |                                | in the his4mi.converters.convert_trom_keras_model function:                                                                                                            |
| python                                                   |                                | 🗂 Copy code                                                                                                                                                                                                                     | )                              | <pre>ins_conrig : is the configuration for the conversion, we generated it using 'seefing from home, and 1' for simplicity but you can outermize this as peeded.</pre> |
| from korne medel                                         | s import Segue                 |                                                                                                                                                                                                                                 | ker                            | <ul> <li>Contrig_iton_kerss_model for simplicity, but you can customize this as needed.</li> <li>'output: director where the HIS project will be created.</li> </ul>   |
| from keras lavers                                        | s import Dense                 | a a a a a a a a a a a a a a a a a a a                                                                                                                                                                                           | bue                            | * 'fora part': specifies the EPGA part. Adjust this based on your target EPGA.                                                                                         |
| area notes rady sa                                       | s amport sense                 |                                                                                                                                                                                                                                 |                                | 4. Build the HLS Project:                                                                                                                                              |
| # Create a simple                                        |                                |                                                                                                                                                                                                                                 | ire i                          | Once the model is converted, you can compile it into HLS:                                                                                                              |
| model = Sequentia                                        | al()                           |                                                                                                                                                                                                                                 |                                | python 🗍 Copy code                                                                                                                                                     |
| model.add(Dense(                                         | 64, input_shap                 | <pre>pe=(16,), activation='relu'))</pre>                                                                                                                                                                                        | pes                            |                                                                                                                                                                        |
| model.add(Dense(                                         | <ol> <li>activation</li> </ol> | n='relu'))                                                                                                                                                                                                                      |                                | <pre>hls_model.compile()</pre>                                                                                                                                         |
| model.add(Dense()                                        | 1, activation=                 | :'sigmoid'))                                                                                                                                                                                                                    | 3                              | 5. Run the HLS Simulation (Optional):                                                                                                                                  |
| model.compile(opt                                        | timizer='adam'                 | <pre>, loss='binary_crossentropy', metrics=['accur</pre>                                                                                                                                                                        |                                | Before synthesizing for FPGA, you can run a C-simulation to check if the model works                                                                                   |
|                                                          |                                |                                                                                                                                                                                                                                 | eti                            | correctly in HLS:                                                                                                                                                      |
| Make sure you've save                                    | ed your trained r              | nodel:                                                                                                                                                                                                                          | 10.0                           | pystion 📋 Copy cade                                                                                                                                                    |
| tutorial                                                 |                                | 🖺 Capy cade                                                                                                                                                                                                                     |                                |                                                                                                                                                                        |
| model eave(thore                                         | e model h511                   |                                                                                                                                                                                                                                 |                                | his_model.build(csin= rue);                                                                                                                                            |
| model.save( Kela                                         | S_moder.no.)                   |                                                                                                                                                                                                                                 |                                | ne<br>nego man hann danara ne kanara ne kanara na kanara na kanara na                                                                                                  |
|                                                          |                                | <pre>y_hts = hts_modet.predict(X_test)</pre>                                                                                                                                                                                    |                                | After this, you'll have an HLS project in the specified 'output_dir' that you can use with                                                                             |
|                                                          |                                |                                                                                                                                                                                                                                 |                                | EPGA development tools to generate bitstreams for EPGA deployment.                                                                                                     |
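
One reason to check the converted model in step 5 is that HLS implementations typically compute in fixed-point rather than floating-point arithmetic, so their outputs can differ slightly from the Keras outputs. The sketch below is a rough, library-free illustration of that effect, not hls4ml's implementation. The helper names `to_fixed` and `max_abs_error`, the 16-bit width with 6 integer bits, and the round-down-and-saturate behaviour are all assumptions chosen for illustration.

```python
import math


def to_fixed(x, total_bits=16, int_bits=6):
    """Snap x onto a signed fixed-point grid (hypothetical helper).

    `int_bits` counts the bits left of the binary point, sign included.
    Values are rounded down to the grid and saturated at the range limits;
    real HLS tools may round or overflow differently.
    """
    frac_bits = total_bits - int_bits
    step = 2.0 ** -frac_bits                 # spacing between representable values
    lo = -(2.0 ** (int_bits - 1))            # most negative representable value
    hi = 2.0 ** (int_bits - 1) - step        # most positive representable value
    quantized = math.floor(x / step) * step  # round down to the grid
    return min(max(quantized, lo), hi)       # saturate out-of-range values


def max_abs_error(reference, candidate):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(r - c) for r, c in zip(reference, candidate))


# Stand-ins for floating-point model outputs and their fixed-point versions.
y_float = [0.1234567, 0.5, -0.75, 0.9990001]
y_fixed = [to_fixed(v) for v in y_float]

# For in-range values the error stays below one grid step (2**-10 here).
print(max_abs_error(y_float, y_fixed))
```

In practice the same comparison would be made between the Keras predictions and `y_hls`, and the bit widths adjusted if the difference is too large.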



Siemens simplifies development of AI accelerators for advanced system-on-chip designs with Catapult AI NN


PR Newswire, Tue, May 21, 2024, 8:00 AM CDT



- Catapult AI NN offers software e[…] solution to synthesize AI neural […]
- Enables software development t[…] models designed in Python into […] facilitating faster and more pow[…] to standard processors

PLANO, Texas, May 21, 2024 /PRNewswire/ […] Industries Software today announced […] High-Level Synthesis (HLS) of neural […] Application-Specific Integrated Circuits […] (SoCs). Catapult AI NN is a complete […] network description from an AI framework […] synthesizes it into an RTL accelerator […] implementation in silicon.

Catapult AI NN brings together hls4ml, an open-source package for machine learning hardware acceleration, and Siemens' Catapult™ HLS software for High-Level Synthesis. Developed in close collaboration with Fermilab, a U.S. Department of Energy Laboratory, and other leading contributors to hls4ml, Catapult AI NN addresses the unique requirements of machine learning accelerator design for power, performance, and area on custom silicon.



- Emerging computing architectures
- New microelectronics technologies
- Efficient neural algorithms, e.g. spiking








# **Outlook**

## Fast ML for Science

Embedding ML into our experiments with extreme requirements brings radical new capabilities, accelerates scientific discovery, and spurs technological innovation



## Intelligent edge of tomorrow

We are developing novel ML techniques and accessible tools, co-designed with cutting-edge hardware for science, while collaborating with researchers and industry.



## Outlook

Powerful intelligent sensing is being demonstrated across a wide array of applications. We continue to advance ML methods and hardware development to enable ultra-fast automated experimentation and future ground-breaking discoveries!

