Review #25A
===========================================================================

Overall merit
-------------
3. Weak reject: contribution is weak or offset by deficiencies, will not fight against acceptance

Reviewer expertise
------------------
1. No familiarity

Paper summary
-------------
This paper proposes hybrid power/state tracing for hardware-accelerated real-time systems through the use of a resource-efficient trace IP core instantiated in the programmable system-on-chip and a custom external power measurement subsystem. The advantage is that such tracing enables unified latency, functional and energy monitoring. The authors demonstrate several applications of the proposed hybrid tracing, such as using it to identify application phases and estimate best-case energy savings.

Strengths
---------
+ The proposed solution can enable holistic debugging and understanding of the energy, functional and latency trade-offs in FPGA-accelerated real-time systems.
+ The solution is resource efficient.

Weaknesses
----------
- The solution needs to instrument the DUT's fabric. The cost aspect is not clearly discussed and quantified.
- Research contributions and novelties are unclear.
- Evaluation does not have direct comparison with baseline solutions.

Writing quality
---------------
2. Needs improvement

Experimental methodology
------------------------
3. Poor

Comments for author
-------------------
The proposed solution would be useful for hardware-accelerated real-time systems. But my biggest concern is that the majority of the paper's content is about the engineering parts, so it is unclear what the research contributions and novelties are.

Since the solution argues for the benefit of hybrid monitoring, it should compare with some baseline solutions that only monitor latency or energy individually and hopefully demonstrate that the holistic trace provides more information or is more efficient or easier to analyze or cheaper etc.
compared to simply combining the prior separate monitoring solutions.

The paper targets using holistic tracing information to improve debugging (it's in the title as well). Readers would expect evidence that the tool helps troubleshoot a number of real-world cases about performance or energy bottlenecks. The presented V.A, B, D are neutral cases that are not directly related to debugging. V.C is nice. Is that a real-world case of a bottleneck being identified or a synthetic one? More cases like that would make the evaluation stronger.

Since the trace IP core needs to be instantiated in the DUT's fabric, this may prevent the solution from being widely adopted compared to off-device monitoring solutions that can directly plug in to the DUT. So it's important to demonstrate that the benefits significantly outweigh the cost and inconvenience of modifying the DUT. You mentioned on page 3 that the solution is cost-efficient. But I did not see it being discussed or quantified in later sections. Please clarify.

I think the paper writing needs significant improvement. It is overall very verbose, making the key messages/points difficult to grasp. For example, the introduction spends almost two pages talking about background and only at the end of page 2 does it describe what this work proposes to do. That point should be made much earlier, and much of the text about ASICs, DSPs, and performance vs. power tradeoffs should be made much more condensed or deferred to a background section. Spending too much space talking about those only distracts and confuses the readers.

The paper also uses too many terms and acronyms. I'd suggest avoiding them unless they are really relevant to this paper's main points. Including a glossary early on can also help.

Questions for authors’ response
---------------------------------
1. Can you clarify the research contributions and novelties of this work?
2. Can you quantify the cost aspect of the trace IP core?
3.
Compared to combining the prior latency and energy monitoring solutions, what are the key advantages of the holistic monitoring?

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Review #25B
===========================================================================

Overall merit
-------------
4. Weak accept: identifiable contribution, may clear the bar but will not fight for acceptance

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
The paper proposes a hybrid approach that combines both state (e.g. latency) tracing through an IP core placed in the FPGA fabric and power tracing through an external instrumentation setup.

Strengths
---------
The idea is interesting to me. The authors also did a good job explaining the differences between the proposed approach and already existing debugging tools available in FPGA-based SoCs (such as signal tapping and programmable logic analyzers).

Writing quality
---------------
2. Needs improvement

Experimental methodology
------------------------
4. Average

Comments for author
-------------------
The paper proposes a hybrid approach that combines both state (e.g. latency) tracing through an IP core placed in the FPGA fabric and power tracing through an external instrumentation setup.

The idea is interesting to me. The authors also did a good job explaining the differences between the proposed approach and already existing debugging tools available in FPGA-based SoCs (such as signal tapping and programmable logic analyzers).
Here are some minor comments for the authors:

#missing references:
##Introduction: commonly, such hardware is implemented using FPGAs <- requires citations for support
##Introduction: which is beyond the capabilities of traditional tools <- requires citations for supporting claim
#IC: should be defined first: Integrated Circuits (IC) (bottom of p.4) <- remove
#towards the end of first col of page 2: Real-Time Systems -> should be RTS since it is already abbreviated.
#Section IV: We use an own <- we use our own
# The paper writing needs polishing. There are words and expressions that are excessively repeated in the paper, which makes it annoying to read. For example:
##"not only .. but also" expression is repeated 28 times in the paper!
##"thus" is repeated 43 times in the paper

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Review #25C
===========================================================================

Overall merit
-------------
5. Accept: significant contribution, want to see this at the conference and will argue against any reject scores

Reviewer expertise
------------------
4. Expert

Paper summary
-------------
A novel hybrid Power/State-Tracing debugging methodology was proposed in this work, leveraging the fine-grained flexibility available in FPGA-accelerated systems. The authors are able to collect hybrid power+event traces gathered with the aid of an external power data aggregation sub-system connected to the programmable logic (PL). The ability to collect hybrid traces is then shown to be a powerful aid towards automated identification of temporal behavior characteristics of real-time applications. By having temporal, functional, and energy-related traces, the authors extract accurate per-phase and per-component energy baselines required for power/latency optimization and energy envelope estimation.
A real-world mixed hardware/software visual servoing system and related closed control-loop was examined as a case study to address the performance/energy trade-off in the FPGA-accelerated heterogeneous system.

Strengths
---------
+ The authors tackle a tough and extremely timely challenge, which is observability and explainability of complex multi-core embedded SoCs.
+ The authors provide a full design and implementation of a system that marries traditional OS-level techniques, programmable logic, and custom circuit design to design a powerful debugging/analysis tool the breed of which is very much needed.
+ A full case study for a Visual Servoing System (VSS) closed control loop is provided to show the proposed solution in action.

Weaknesses
----------
- The paper is very dense and it is at times hard to follow all the moving parts of the design. The experimental setup, in terms of application workload, could have also been better introduced.
- The paper seems to be a bit too vocal about the fact that instruction- or cycle-accurate tracing is not required.

Writing quality
---------------
4. Well-written

Experimental methodology
------------------------
5. Good

Comments for author
-------------------
This paper represents an ideal target for this venue for a few reasons. First, it reviews important aspects of state-of-the-art FPGA-accelerated embedded systems, such as automatic power gating of uninstantiated memory resources; or the trade-offs in exploiting these features vs. the temporal behavior of the platform. Moreover, a brief yet substantial survey was provided on existing temporal-functional analyzers.

The contribution is also clear. Providing a hybrid events+power logging interface with near-zero impact on the temporal and functional characteristics of the system is a remarkable feat. The proposed technology is well suited for both energy-sensitive and time-sensitive real-time systems.
Indeed, the achieved energy-latency trade-off is beyond the capabilities of traditional tools (e.g., Xilinx ILA, CoreSight, JTAG). Also, the authors exploited PS-side GPIO ports to embed PS state information inside the PL, which is a nice touch. Lastly, by using an FPGA Mezzanine Card (FMC) as an interface to the FPGA, the authors managed to substantially reduce cost and on-chip resource consumption while providing a "generic" interface to the external measurement system through the Serial Peripheral Interface (SPI).

The experimental methodology appears to be sound. Moreover, the authors have extended their design to further monitor I/O components' power consumption, on top of processors and PL. I particularly liked that even minor details were taken into account and described (e.g., the interconnect used to convert AXI3 to AXI4-Lite or the possibility of increasing the SCLK frequencies beyond 50 MHz to support high-rate control systems). The resolution (sub-microsecond) and sustained length of tracing (few milliseconds) are also quite remarkable.

The goal of the proposed system is of great interest, in my view. Automated phase-detection for complex systems is a tough challenge to approach. Interestingly, the paper shows that it can be used to estimate the best-case energy saving as well as latency monitoring.

There are a few aspects that could be further improved and some limitations that the authors could mention. The most notable are reported below.

First, it remains a bit vague how the authors synchronized power and state data in the external analyzer system. The voltage/current time-stamp is assigned upon arrival at the external micro-controller while the state time-stamp is assigned in the PS (according to Fig. 4), i.e. before crossing the PL and FMC.

Second, while using an external power management sub-system has its benefits, it makes the platform dependent on an external micro-controller and FMC.
This on the other hand impacts total size, weight, area and power of the system.

Third, I recognize the value of behavioral monitoring of the operating system envelope in a way that is asynchronous with application workload, i.e. in macro-phases. But on the other hand, reasoning at this level does not allow direct observation of the effect on power or timing of individual instructions and memory transactions. In the last paragraph on page 4, the authors claim that: "As periods, deadlines and execution times of most real-time applications range from hundreds of μs to dozens of ms, the high single-cycle resolution of ELA-based solutions is just not required. Similarly, single-instruction and per-signal trace information from CPUs and FPGA fabric is far more accurate than needed to ...". This is not always true, as signal-level monitoring is often essential for both temporal and functional aspects of the system that cannot be captured across macro-phases of computation. Systems like the GreenHills Probe or the Lauterbach hardware debuggers are heavily used in practice because they allow this level of fine-grained observation.

Minor issues:
- Please provide a better explanation for Fig. 1. If Fig. 1 is not required, it can be safely removed.
- In the Power Monitoring subsection in Section II, a few related power measurement tools for FPGA-accelerated systems are worth mentioning. To name a few, Xilinx Power Estimator (XPE), System Monitor (SYSMON), and, most importantly, XADC, which delivers advantages in flexibility, integration, and cost savings across a wide range of applications.
- Fig. 2 misspells the word ADC as ADS.

Questions for authors’ response
---------------------------------
- The system is comprised of three major components, namely PS, PL, and external micro-controller on the FMC. Each operates in a different clock domain. How are these devices kept synchronized? Is the same SCLK signal fed to the external device?

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Review #25D
===========================================================================

Overall merit
-------------
4. Weak accept: identifiable contribution, may clear the bar but will not fight for acceptance

Reviewer expertise
------------------
3. Knowledgeable

Paper summary
-------------
This paper proposes extending execution traces (such as those produced by ARM's Trace Macrocell) with relevant power information (e.g. supply rail currents). The authors suggest that the granularity of raw processor trace is too fine relative to typical control loop periods in the ms range, and that manual instrumentation of code represents a better tradeoff than full tracing if the designer is trying to optimise power consumption against real-time performance (e.g. deadline misses). The authors have prototyped their approach on a Xilinx Zynq attached to a self-built power measurement board and demonstrate its use in debugging a real-time video processing application.

Strengths
---------
* Looks like a useful tool.
* Optimal point tradeoff is very interesting.

Weaknesses
----------
* Too focussed on the HW artefact.
* Trace rate tradeoff not well explored.
* Existing HW mechanisms probably would work (CoreSight/FTM/ITM).

Writing quality
---------------
4. Well-written

Experimental methodology
------------------------
4. Average

Comments for author
-------------------
HW is cool, but as a reader the most interesting thing is exploring what the parameters should be. Reads a bit like the sample rate is dictated by HW limitation, not by principles.

It seems to me that existing HW mechanisms might be able to do the job already (particularly on the Zynq platform) e.g. the ITM (instrumentation trace macrocell) permits the injection of arbitrary information into the trace stream (e.g.
the manual annotation of program phases), while the FTM (fabric trace monitor, a Xilinx-specific subsystem) would allow you to inject the power trace data. Combine this with a bit of prefiltering in the FPGA fabric and you've got a tracing system needing nothing more than a few power-rail sensors (you could even connect some very high sample-rate ADCs to the FPGA via LVDS, letting you really explore the sample-rate tradeoff).

Given that, the really interesting part of your paper to me is investigating the parameter space: what data is useful, and what isn't, and what's the right temporal/sample resolution for a particular task. Building HW is a lot of fun, but it's only scientifically valuable to the extent that you can answer an interesting question with it. You've got the structure of something really interesting here (you're absolutely correct that debugging power usage is currently really hard), and a very promising approach to doing so.

Questions for authors’ response
---------------------------------
* Have you explored the design tradeoffs for monitoring and if so, how did you arrive at ksamples/sec being the optimal value?

Comments
===========================================================================

Response by XXX
---------------------------------------------------------------------------
Dear TPC members,

Given the highly technical FPGA-SoC/Zynq-related techniques used in this work, we find it absolutely remarkable that the reviewers took the time to fully analyze all the technicalities involved. In particular, some low-level details - even though not explicitly described in the paper - were recognized in the reviews. Thank you very much - particularly for the feedback/questions, which we would like to annotate as follows.

Note: Our original paper contained additional tables/figures that had to be removed due to the change from 12 to 11 pages. We attached a PDF with those relevant {referenced below} and will include them in the paper if accepted.

Review A:
---------
Based on the insight that there is a demand for energy/latency co-debugging tools (motivated below), we
- proposed hybrid tracing as a promising methodology,
- presented an implementation that *barely affects* the RTS's temporal/functional/energy properties, and
- demonstrated its applicability.

Traditional latency monitoring neither captures energy nor is particularly lightweight w.r.t., e.g., overhead or interference. Similarly, energy monitoring lacks temporal and functional information (IRQs, IDs) required to balance energy against latency.

Combining the separate solutions into a holistic one is a non-trivial task, in particular given today's complex heterogeneous SoCs. Amongst many reasons, this holds true because separate energy/latency monitoring solutions
- might rely on different notions of time (e.g., variable-frequency clock cycles vs. microseconds), or
- often are closed/black-boxed/encrypted and thus cannot be synchronized online.

Cost-wise, our solution
- requires 3x to 26x {TabIII} fewer resources (particularly BRAM {Fig12}) and
- is ~50x less expensive than, e.g., an NI-PXIe-6124 {TabIV}.

A detailed resource breakdown is part of the paper (TabI), altogether showing the efficiency of our solution.

The VSS's bottleneck identified in V.C is indeed real-world - preliminary evaluations of a hardware-only power management module show savings close to those projected beforehand.

Review C:
---------
The microcontroller/firmware performs all timestamping (using a sub-us timer) - yielding minimum and constant latencies for state+analog data. The timestamp in Fig4 is only used for coarse/seconds-only synchronization of Zynq/microcontroller-RTCs.

Although EMS (and DUT instrumentation) affect the overall SWAaP, we believe that this is acceptable as it only holds true during debugging, not deployment.

We agree on the described limitations of macro-phase-monitoring and will include them and the above clarifications in the paper.
We would further like to elaborate that, e.g., DVFS-based voltage/current changes commonly span thousands of instructions/clocks (i.e., within reach of our tool).

SCLK is driven by the microcontroller and (two-stage-)synchronized to the PL. PL/PS synchronization is handled by the Zynq's internal CDC-FIFOs.

Review D:
---------
We considered this very option of (mis?)using ITM+PTM/PFT+FTM initially, but opted for SPIB and an external measurement system because
- using xTM and/or capturing high-rate ADCs increases energy consumption (thus interfering with application-level measurements), and
- the small-size ETB necessitates TPIU-based storage offloading, either externally (MIO) - or to EMIO/PL, then affecting energy *and* (DDRx/RAM-)latencies.

Although exploring the tracing parameter space indeed is a promising direction, our parameterization was primarily driven by Ethernet/VSS properties, as monitoring individual Ethernet frames implied analog sampling rates around 200kSPS (due to bitrate, MTU, interframe spacing, etc.). We, however, already explored feasible ratios of state/analog samples (on implementation/storage-level), as shown in {Fig13}.

Comment @A16 by Shepherd
---------------------------------------------------------------------------
Summary:
There is a general consensus among the reviewers that the paper fits well within the scope of the venue and that what is proposed is both practical and technically insightful. The problem of debugging complex software workloads using a variety of heterogeneous resources across a modern SoC is a tough one. Thus, there is certainly value in the attempt performed by the authors in tackling the many challenges in this space.

Revisions:
Two main weaknesses were raised by the reviewers during the discussion phase.

(1) The first problem is a fundamental lack of refinement in the presentation of the work. In the current version, key design-level considerations do not stand out very well.
Rather, they are blurred by many technical considerations and implementation details. These, albeit important, should not stand in the way of understanding the general design principles. In light of this, the authors need to perform a general restructuring of the paper following a top-down approach. The recommendation is to provide design principles first and only afterwards dive into implementation details and low-level considerations. One way of going about this is to divide the current Section III into an initial section containing an overview of the system with a block-level description of that presented in Figure 2; then have a section on implementation details where individual blocks are better described.

(2) The authors need to better clarify how the proposed solution is useful for debugging real-world applications and in a way that is more cost-effective than what is offered by state-of-the-art research- or commercial-grade debugging tools. The authors can start by integrating the additional evaluation and considerations provided as part of their rebuttal. The authors are also encouraged to use the additional space available to them for the final version of the paper.

Apart from the two main points above, a few additional aspects need attention:
(1) The introduction needs to be significantly pruned to be much shorter than 2 full pages. The section shall also clearly highlight two things: (i) what work is being proposed early on, and (ii) what contribution can be claimed by the authors and why;
(2) Address explicit recommendations provided by the reviewers about the need for supporting references as stated;
(3) Perform a general pass on the text to get rid of repetitions; and
(4) Better integrate (a variation of) Figure 1 in the description of what is being proposed.

I look forward to reading an improved draft of the manuscript following the provided guidelines no later than Feb 7, 2020.
The paper will need to be deemed in full compliance with the requested revisions by Feb 17, 2020. I strongly recommend that the authors provide a tentative version of the new draft a couple of weeks before the aforementioned deadline. This would allow making an additional iteration on the revisions if necessary.

Best regards,
Your friendly shepherd.