Review #25A
===========================================================================

Overall merit
-------------
3. Weak reject: contribution is weak or offset by deficiencies, will not
  fight against acceptance

Reviewer expertise
------------------
1. No familiarity

Paper summary
-------------
This paper proposes hybrid power/state tracing for hardware-accelerated real-time systems
through the use of a resource-efficient trace IP core instantiated in the programmable
systems-on-chip and a custom external power measurement subsystem. The advantage
is that such tracing enables unified latency, functional and energy monitoring.
The authors demonstrate several applications of the proposed hybrid tracing,
such as using it to identify application phases and estimate best-case energy savings.

Strengths
---------
+ The proposed solution can enable holistic debugging and understanding of the energy, functional and latency trade-offs in FPGA-accelerated real-time systems.
+ The solution is resource efficient.

Weaknesses
----------
- The solution needs to instrument the DUT's fabric. The cost aspect is not clearly discussed and quantified.
- Research contributions and novelties are unclear.
- Evaluation does not have direct comparison with baseline solutions.

Writing quality
---------------
2. Needs improvement

Experimental methodology
------------------------
3. Poor

Comments for author
-------------------
The proposed solution would be useful for hardware-accelerated real-time systems.
But my biggest concern is that the majority of the paper's content is about the
engineering parts, so it is unclear what the research contributions and novelties are.


Since the paper argues for the benefit of hybrid monitoring, it should
compare against baseline solutions that monitor only latency or only energy
and demonstrate that the holistic trace provides more information,
is more efficient, is easier to analyze, or is cheaper than simply combining
the prior separate monitoring solutions.

The paper targets using holistic tracing information to improve debugging (it's in
the title as well). Readers would expect evidence that the tool helps troubleshoot
a number of real-world cases of performance or energy bottlenecks. The cases presented
in V.A, V.B and V.D are neutral and not directly related to debugging.
V.C is nice. Is that a real-world case of a bottleneck being identified or a
synthetic one? More cases like that would make the evaluation stronger.

Since the trace IP core needs to be instantiated in the DUT's fabric, this may
prevent the solution from being widely adopted compared to off-device monitoring
solutions that can plug directly into the DUT. So it's important to demonstrate
that the benefits significantly outweigh the cost and inconvenience of
modifying the DUT. You mentioned on page 3 that the solution is cost-efficient,
but I did not see this discussed or quantified in later sections. Please
clarify.

I think the writing needs significant improvement. It is overall very verbose,
making the key messages difficult to grasp. For example, the introduction
spends almost two pages on background and only at the end of page 2
describes what this work proposes to do. That point should be made much
earlier, and much of the text about ASICs, DSPs, and performance vs. power trade-offs
should be condensed or deferred to a background section. Spending
too much space on those topics only distracts and confuses readers.
The paper also uses too many terms and acronyms. I'd suggest avoiding
them unless they are really relevant to the paper's main points.
Including a glossary early on can also help.

Questions for authors’ response
---------------------------------
1. Can you clarify the research contributions and novelties of this work?

2. Can you quantify the cost aspect of the trace IP core?

3. Compared to combining the prior latency and energy monitoring solutions,
what are the key advantages of holistic monitoring?


* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Review #25B
===========================================================================

Overall merit
-------------
4. Weak accept: identifiable contribution, may clear the bar but will not
  fight for acceptance

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper proposes a hybrid approach that combines both state (e.g. latency) tracing through an IP core placed in the FPGA fabric and power tracing through external instrumentation setup.

Strengths
---------
The idea is interesting to me. The authors also did a good job explaining the differences between the proposed approach and already existing debugging tools available in FPGA-based SoCs (such as signal tapping and programmable
logic analyzers).

Writing quality
---------------
2. Needs improvement

Experimental methodology
------------------------
4. Average

Comments for author
-------------------
The paper proposes a hybrid approach that combines both state (e.g. latency) tracing through an IP core placed in the FPGA fabric and power tracing through external instrumentation setup.


The idea is interesting to me. The authors also did a good job explaining the differences between the proposed approach and already existing debugging tools available in FPGA-based SoCs (such as signal tapping and programmable
logic analyzers).

Here are some minor comments for the authors:

#missing references:

##Introduction: commonly, such hardware is implemented using FPGAs <- requires citations for support

##Introduction: which is beyond the capabilities of traditional tools <- requires citations for supporting claim


#IC: should be defined first: Integrated Circuits (IC)
(bottom of p.4) <- remove

#towards the end of first col of page 2: Real-Time Systems -> should be RTS since it is already abbreviated.

#Section IV: We use an own <- we use our own

# The paper writing needs polishing. Some words and expressions are repeated excessively throughout the paper, which makes it annoying to read.
For example: 
##"not only .. but also" expression is repeated 28 times in the paper!
##"thus" is repeated 43 times in the paper


* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Review #25C
===========================================================================

Overall merit
-------------
5. Accept: significant contribution, want to see this at the conference and
  will argue against any reject scores

Reviewer expertise
------------------
4. Expert

Paper summary
-------------
A novel hybrid Power/State-Tracing debugging methodology was proposed
in this work leveraging the fine-grained flexibility available in FPGA
accelerated systems. The authors are able to collect hybrid
power+event traces gathered with the aid of an external power data
aggregation sub-system connected to the programmable logic (PL). The
ability to collect hybrid traces is then shown to be a powerful
aid towards automated identification of temporal behavior
characteristics of real-time applications. By having temporal,
functional, and energy-related traces, the authors extract accurate
per-phase and per-component energy baselines required for
power/latency optimization and energy envelope estimation. A
real-world mixed hardware/software visual servoing system and related
closed control-loop was examined as a case study to address
performance/energy trade-off in the FPGA-accelerated heterogeneous
system.

Strengths
---------
+ The authors tackle a tough and extremely timely challenge,
which is observability and explainability of complex multi-core
embedded SoCs.

+ The authors provide a full design and implementation of a system
that marries traditional OS-level techniques, programmable logic, and
custom circuit design into a powerful debugging/analysis tool,
the breed of which is very much needed.

+ A full case study for a Visual Servoing System (VSS) closed
 control loop is provided to show the proposed solution in action.

Weaknesses
----------
- The paper is very dense and it is at times hard to follow
 all the moving parts of the design.  The experimental setup, in
 terms of application workload, could have also been better
 introduced.

- The paper seems to be a bit too vocal about the fact that
 instruction- or cycle-accurate tracing is not required.

Writing quality
---------------
4. Well-written

Experimental methodology
------------------------
5. Good

Comments for author
-------------------
This paper represents an ideal target for this venue for a few
reasons. First, it reviews important aspects of state-of-the-art
FPGA-accelerated embedded systems, such as automatic power gating of
uninstantiated memory resources; or the trade-offs in exploiting these
features vs. the temporal behavior of the platform. Moreover, a brief
yet substantial survey was provided on existing temporal-functional
analyzers.

The contribution is also clear. Providing a hybrid events+power
logging interface with near-zero impact on the temporal and functional
characteristics of the system is a remarkable feat. The proposed
technology is well suited for both energy-sensitive and time-sensitive
real-time systems. Indeed, the achieved energy-latency trade-off is 
beyond the capabilities of traditional tools (e.g., Xilinx ILA,
CoreSight, JTAG). Also, the authors exploited PS-side GPIO ports to
embed PS state information inside the PL, which is a nice touch.
Lastly, by using an FPGA Mezzanine Card (FMC) as an interface to the
FPGA, the authors managed to substantially reduce cost and on-chip
resource consumption while providing a "generic" interface to the
external measurement system through the Serial Peripheral Interface
(SPI).

The experimental methodology appears to be sound. Moreover, the
authors have extended their design to further monitor I/O components
power consumption, on top of processors and PL. I particularly liked
that even minor details were taken into account and described (e.g.,
interconnect used to convert AXI3 to AXI4-Lite or the possibility of
increasing the SCLK frequencies beyond 50 MHz to support high-rate
control systems).  The resolution (sub-microsecond) and sustained
length of tracing (few milliseconds) is also quite remarkable.

The goal of the proposed system is of great interest, in my
view. Automated phase-detection for complex systems is a tough
challenge to approach. Interestingly, the paper shows that it can be
used to estimate the best-case energy saving as well as latency
monitoring.
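
As an illustration of the per-phase energy baselines mentioned above, the
following is a minimal sketch and not the authors' implementation. It assumes
the hybrid trace has already been split into timestamped power samples and
phase-start events on a common time base. The function name, the data layout,
and the left-rectangle integration rule are all assumptions made for this
example.

```python
from bisect import bisect_right


def energy_per_phase(power_samples, phase_events):
    """Attribute energy (in joules) to phases of a hybrid trace.

    power_samples: list of (time_s, power_w) tuples, sorted by time.
    phase_events:  list of (time_s, phase_name) tuples, sorted by time;
                   each event marks the start of a phase that lasts
                   until the next event (hypothetical layout).
    """
    phase_starts = [t for t, _ in phase_events]
    energy = {}
    # Walk over consecutive sample pairs; each sample's power is assumed
    # to hold until the next sample arrives (left-rectangle rule).
    for (t0, p0), (t1, _) in zip(power_samples, power_samples[1:]):
        idx = bisect_right(phase_starts, t0) - 1
        if idx < 0:
            continue  # sample precedes the first phase event
        name = phase_events[idx][1]
        energy[name] = energy.get(name, 0.0) + p0 * (t1 - t0)
    return energy
```

Intervals are attributed to the phase that is active at the start of the
interval, so the attribution error at a phase boundary is bounded by one
sample period.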

There are a few aspects that could be further improved and some
limitations that the authors could mention. The most notable are
reported below.

First, it remains a bit vague how the authors synchronized power and
state data in the external analyzer system. The voltage/current time-stamp
is assigned upon arrival at the external micro-controller, while the
state time-stamp is assigned in the PS (according to Fig. 4), i.e.
before crossing the PL and FMC.
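
To make the concern concrete: if the two time-stamps really came from
different clocks, some mapping between them would be needed before the traces
could be merged. The sketch below shows one generic option, a two-point linear
clock mapping that corrects offset and drift. It is an illustration only and
is not taken from the paper; the function name and the idea of using two
synchronization points are assumptions.

```python
def make_clock_map(sync_a, sync_b):
    """Build a linear mapping from PS time to micro-controller time.

    sync_a, sync_b: (ps_time, mcu_time) pairs captured at two moments
    when both clocks were read for the same event (hypothetical
    synchronization points).
    """
    (ps_a, mcu_a), (ps_b, mcu_b) = sync_a, sync_b
    if ps_b == ps_a:
        raise ValueError("synchronization points must differ in PS time")
    # The slope captures relative clock drift, the intercept the offset.
    rate = (mcu_b - mcu_a) / (ps_b - ps_a)

    def to_mcu(ps_time):
        return mcu_a + rate * (ps_time - ps_a)

    return to_mcu
```

A mapping of this kind only removes constant offset and linear drift; any
variable latency between the PS and the micro-controller would remain as
jitter in the merged trace.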

Second, while using an external power management sub-system has its
benefits, it makes the platform dependent on an external
micro-controller and FMC. This, in turn, impacts the total size,
weight, area and power of the system.

Third, I recognize the value of behavioral monitoring of the operating
system envelope in a way that is asynchronous with application
workload, i.e. in macro-phases. But on the other hand, reasoning at
this level does not allow direct observation of the effect on power or
timing of individual instructions and memory transactions. In the last
paragraph on page 4, the authors claim that: "As periods, deadlines
and execution times of most real-time applications range from hundreds
of μs to dozens of ms, the high single-cycle resolution of ELA-based
solutions is just not required. Similarly, single-instruction and
per-signal trace information from CPUs and FPGA fabric is far more
accurate than needed to ...". This is not always true as signal-level
monitoring is often essential for both temporal and functional aspects
of the system that cannot be captured across macro-phases of
computation. Systems like the GreenHills Probe or the Lauterbach
hardware debuggers are heavily used in practice because they allow
this level of fine-grained observation.

Minor issues:

- Please provide a better explanation for Fig. 1. If Fig. 1 is not
required, it can be safely removed.

- In the Power Monitoring subsection in Section II, a few related
power measurement tools for FPGA-accelerated systems are worth
mentioning. To name a few, Xilinx Power Estimator (XPE), System
Monitor (SYSMON), and, most importantly XADC, which delivers
advantages in flexibility, integration, and cost savings across a wide
range of applications.

- In Fig. 2, the word ADC is misspelled as ADS.

Questions for authors’ response
---------------------------------
- The system is comprised of three major components, namely PS, PL,
 and external micro-controller on the FMC. Each operates in a
 different clock domain. How are these devices kept synchronized?  Is the
same SCLK signal fed to the external device?


* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *


Review #25D
===========================================================================

Overall merit
-------------
4. Weak accept: identifiable contribution, may clear the bar but will not
  fight for acceptance

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper proposes extending execution traces (such as those produced by
ARM's Trace Macrocell) with relevant power information (e.g. supply rail
currents).  The authors suggest that the granularity of raw processor trace is
too fine relative to typical control loop periods in the ms range, and that manual
instrumentation of code represents a better tradeoff than full tracing if the
designer is trying to optimise power consumption against real-time performance
(e.g. deadline misses).  The authors have prototyped their approach on a
Xilinx Zynq attached to a self-built power measurement board and demonstrate
its use in debugging a real-time video processing application.

Strengths
---------
* Looks like a useful tool.
* Optimal point tradeoff is very interesting.

Weaknesses
----------
* Too focussed on the HW artefact.
* Trace rate tradeoff not well explored.
* Existing HW mechanisms would probably work (CoreSight/FTM/ITM).

Writing quality
---------------
4. Well-written

Experimental methodology
------------------------
4. Average

Comments for author
-------------------
HW is cool, but as a reader the most interesting thing is exploring what the
parameters should be.  It reads a bit like the sample rate is dictated by HW
limitations, not by principles.

It seems to me that existing HW mechanisms might be able to do the job already
(particularly on the Zynq platform) e.g. the ITM (instrumentation trace macrocell)
permits the injection of arbitrary information into the trace stream (e.g. the
manual annotation of program phases), while the FTM (fabric trace monitor, a
Xilinx-specific subsystem) would allow you to inject the power trace data.
Combine this with a bit of prefiltering in the FPGA fabric and you've got a
tracing system needing nothing more than a few power-rail sensors (you could
even connect some very high sample-rate ADCs to the FPGA via LVDS, letting you
really explore the sample-rate tradeoff).

Given that, the really interesting part of your paper to me is investigating
the parameter space: what data is useful, and what isn't, and what's the right
temporal/sample resolution for a particular task.  Building HW is a lot of
fun, but it's only scientifically valuable to the extent that you can answer
an interesting question with it.  You've got the structure of something really
interesting here (you're absolutely correct that debugging power usage is
currently really hard), and a very promising approach to doing so.

Questions for authors’ response
---------------------------------
* Have you explored the design tradeoffs for monitoring and if so, how did
  you arrive at ksamples/sec being the optimal value?


Comments
===========================================================================

Response by XXX
---------------------------------------------------------------------------
Dear TPC members,

Given the highly technical FPGA-SoC/Zynq-related techniques used in this work, we find it absolutely remarkable that the reviewers took the time to fully analyze all the technicalities involved.
In particular, some low-level details - even though not explicitly described in the paper - were recognized in the reviews.
Thank you very much - particularly for the feedback/questions, which we would like to annotate as follows.

Note: Our original paper contained additional tables/figures that had to be removed due to the change from 12 to 11 pages. We attached a PDF with those relevant {referenced below} and will include them in the paper if accepted.

Review A:
---------
Based on the insight that there is a demand for energy/latency co-debugging tools (motivated below), we
- proposed hybrid tracing as a promising methodology,
- presented an implementation that *barely affects* the RTS's temporal/functional/energy properties, and
- demonstrated its applicability.

Traditional latency monitoring neither captures energy nor is particularly lightweight w.r.t., e.g., overhead or interference.
Similarly, energy monitoring lacks temporal and functional information (IRQs, IDs) required to balance energy against latency.
Combining the separate solutions into a holistic one is a non-trivial task, in particular given today's complex heterogeneous SoCs.
Amongst many reasons, this holds true because separate energy/latency monitoring solutions
- might rely on different notions of time (e.g., variable-frequency clock cycles vs. microseconds), or
- often are closed/black-boxed/encrypted and thus cannot be synchronized online.

Cost-wise, our solution
- requires 3x to 26x {TabIII} fewer resources (particularly BRAM {Fig12}) and
- is ~50x less expensive than, e.g., an NI-PXIe-6124 {TabIV}.
A detailed resource breakdown is part of the paper (TabI), altogether showing the efficiency of our solution.

The VSS's bottleneck identified in V.C is indeed real-world - preliminary evaluations of a hardware-only power management module show savings close to those projected beforehand.

Review C:
---------
The microcontroller/firmware performs all timestamping (using a sub-us timer) - yielding minimum and constant latencies for state+analog data.
The timestamp in Fig4 is only used for coarse/seconds-only synchronization of Zynq/microcontroller-RTCs.

Although EMS (and DUT instrumentation) affect the overall SWAaP, we believe that this is acceptable as it only holds true during debugging, not deployment.

We agree on the described limitations of macro-phase-monitoring and will include them and above clarifications in the paper.
We would further like to elaborate that, e.g., DVFS-based voltage/current changes commonly span thousands of instructions/clocks (i.e., within reach of our tool).

SCLK is driven by the microcontroller and (two-stage-)synchronized to the PL.
PL/PS synchronization is handled by the Zynq's internal CDC-FIFOs.

Review D:
---------
We considered this very option of (mis?)using ITM+PTM/PFT+FTM initially, but opted for SPIB and an external measurement system because
- using xTM and/or capturing high-rate ADCs increases energy consumption (thus interfering with application-level measurements), and
- the small-size ETB necessitates TPIU-based storage offloading, either externally (MIO) - or to EMIO/PL, then affecting energy *and* (DDRx/RAM-)latencies

Although exploring the tracing parameter space indeed is a promising direction, our parameterization was primarily driven by Ethernet/VSS properties, as monitoring individual Ethernet frames implied analog sampling rates around 200kSPS (due to bitrate, MTU, interframe spacing, etc.).
We, however, already explored feasible ratios of state/analog samples (on implementation/storage-level), as shown in {Fig13}.
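
As a rough illustration of the reasoning behind the ~200kSPS figure, the
sketch below estimates how many analog samples fall within one Ethernet frame
slot. The framing constants (8-byte preamble/SFD, 12-byte interframe gap,
64-byte minimum and 1518-byte maximum frame) are standard Ethernet values. The
100 Mbit/s link speed used in the note that follows is an assumption, since
the response does not state the bitrate, and the function names are
hypothetical.

```python
PREAMBLE_SFD_BYTES = 8      # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # minimum idle time between frames, in byte times
MIN_FRAME_BYTES = 64        # smallest legal frame, header and FCS included
MAX_FRAME_BYTES = 1518      # 1500-byte MTU + 14-byte header + 4-byte FCS


def frame_slot_seconds(frame_bytes, bitrate_bps):
    """Time one frame occupies on the wire, including preamble and gap."""
    wire_bytes = PREAMBLE_SFD_BYTES + frame_bytes + INTERFRAME_GAP_BYTES
    return wire_bytes * 8 / bitrate_bps


def samples_per_frame(frame_bytes, bitrate_bps, sample_rate_sps):
    """Average number of analog samples taken during one frame slot."""
    return frame_slot_seconds(frame_bytes, bitrate_bps) * sample_rate_sps
```

Under the assumed 100 Mbit/s link, `samples_per_frame(MIN_FRAME_BYTES, 100e6, 200e3)`
gives about 1.3 samples for a minimum-size frame and
`samples_per_frame(MAX_FRAME_BYTES, 100e6, 200e3)` gives about 24.6 for a
full-size frame, so 200kSPS is roughly the lowest rate at which every frame
still receives at least one sample.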

Comment @A16 by Shepherd
---------------------------------------------------------------------------
Summary:

There is a general consensus among the reviewers that the paper fits well within the scope of the venue and that what is proposed is both practical and technically insightful. The problem of debugging complex software workloads using a variety of heterogeneous resources across a modern SoC is a tough one. Thus, there is certainly value in the attempt performed by the authors in tackling the many challenges in this space.

Revisions:

Two main weaknesses were raised by the reviewers during the discussion phase. 

(1) The first problem is a fundamental lack of refinement in the presentation of the work. In the current version, key design-level considerations do not stand out very well. Rather, they are blurred by many technical considerations and implementation details. These, albeit important, should not stand in the way of understanding the general design principles. In light of this, the authors need to perform a general restructuring of the paper following a top-down approach. The recommendation is to provide design principles first and only afterwards dive into implementation details and low-level considerations. One way of going about this is to divide the current Section III into an initial section containing an overview of the system with a block-level description of what is presented in Figure 2, followed by a section on implementation details where individual blocks are better described.

(2) The authors need to better clarify how the proposed solution is useful for debugging real-world applications and in a way that is more cost-effective than what is offered by state-of-the-art research- or commercial-grade debugging tools. The authors can start by integrating the additional evaluation and considerations provided as part of their rebuttal. The authors are also encouraged to use the additional space available to them for the final version of the paper.

Apart from the two main points above, a few additional aspects need attention:
(1) The introduction needs to be significantly pruned to be much shorter than 2 full pages. The section shall also clearly highlight two things: (i) what work is being proposed early on, and (ii) what contribution can be claimed by the authors and why;
(2) Address explicit recommendations provided by the reviewers about the need for supporting references as stated;
(3) Perform a general pass on the text to get rid of repetitions; and
(4) Better integrate (a variation of) Figure 1 in the description of what is being proposed.

I look forward to reading an improved draft of the manuscript following the provided guidelines no later than Feb 7, 2020. The paper will need to be deemed in full compliance with the requested revisions by Feb 17, 2020. I strongly recommend that the authors provide a tentative version of the new draft a couple of weeks before the aforementioned deadline. This would allow for an additional iteration on the revisions if necessary.

Best regards,
Your friendly shepherd.