NewRegressionFramework

We'd like to revamp the regression tests by moving to a new framework. This page is intended to host a discussion of features and design for the new framework.

Ali's plan for a new implementation

Use pytest
It has, by far, the best documentation of any of the Python testing frameworks and seems to be the most actively developed
Seems to be completely extensible via Python plugins and hooks
Supports outputting JUnit XML in case we want to use a continuous integration solution such as Jenkins or Hudson
The pytest-xdist plugin supports running tests on multiple CPUs or multiple machines
Good collection of talks and tutorials

How things would work

Marks may be assigned to tests either with Python decorators or, if we want to stay Python 2.5 compatible, with a class attribute
- The decorators would probably include cpu model, memory system, ISA, mode, and run length.
- We might want to use pytest_addoption to add command-line options that take a list of values for each of the decorators, and then generate only the tests that match those lists
- Alternatively, we could use pytest-markfiltration, although its syntax can be rather contrived
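
The selection logic described above can be sketched without any framework. This is only an illustration: the tag names, the test names, and the `select` helper are hypothetical. In pytest the tags would be marks, the accepted values would come from options registered through pytest_addoption, and the filter would probably live in a collection hook such as pytest_collection_modifyitems.

```python
# Hypothetical test metadata; in pytest these would be marks on each test.
TESTS = {
    "linux_boot_o3": {"cpu": "o3", "isa": "arm", "length": "long"},
    "hello_atomic": {"cpu": "atomic", "isa": "x86", "length": "quick"},
    "hello_timing": {"cpu": "timing", "isa": "arm", "length": "quick"},
}


def select(tests, **wanted):
    """Return the names of tests whose tags match every requested list.

    Each keyword maps a tag name to the list of accepted values, for
    example isa=["arm"]. A tag that is not mentioned places no restriction.
    """
    selected = []
    for name, tags in sorted(tests.items()):
        if all(tags.get(key) in values for key, values in wanted.items()):
            selected.append(name)
    return selected
```

For example, `select(TESTS, isa=["arm"], length=["quick"])` keeps only the quick ARM test, while `select(TESTS)` keeps everything.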

Outstanding Questions

How would we do test discovery?
- pytest will search .py files looking for tests
- Files, classes within files, and functions can each be matched against a pattern
- or it can match only things that inherit from Python's unittest.TestCase
Should we use xunit-style or funcargs-style setups?
Should we have a class that inherits from unittest.TestCase and does the heavy lifting, or should we have a completely separate class that does the heavy lifting and use a factory class to create a bunch of instances of the separate class?
Should gem5 be called as a library or on the command line?
How should we store output files? The same way we do now? Should each directory just have an __init__.py so that the tests can be referred to by dotted names such as long.linux_boot.arm.linux.o3?
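
As one possible answer to the naming question, the sketch below derives dotted test names from a directory tree in which every test directory contains an __init__.py. The `discover` function and its definition of a test (a package directory with no sub-packages) are assumptions made for illustration, not an agreed design.

```python
import os


def discover(root):
    """Return dotted names for every leaf package found under root.

    A directory counts as a package if it contains __init__.py; a leaf is
    a package that has no sub-packages. The root itself is not named.
    """
    names = []
    for dirpath, dirnames, _filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        # Descend only into sub-packages; os.walk honours in-place edits.
        dirnames[:] = [
            d for d in dirnames
            if os.path.isfile(os.path.join(dirpath, d, "__init__.py"))
        ]
        if dirpath != root and not dirnames:
            rel = os.path.relpath(dirpath, root)
            names.append(rel.replace(os.sep, "."))
    return names
```

With a tree such as long/linux_boot/arm/linux/o3 under the root, this yields the name long.linux_boot.arm.linux.o3.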

Desirable features

Ability to add regressions via EXTRAS
- For example, move eio tests into eio module so we don't try to run them when it's not compiled in
Ability to not run regressions for which binaries or other inputs aren't available
- With maybe some nice semi-automated way of downloading binaries when they're publicly available
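
A minimal sketch of skipping tests whose inputs are missing, using the standard unittest module. The `require_inputs` decorator and the file path are hypothetical, and unittest.skipIf needs Python 2.7 or later, so this would not satisfy a Python 2.5 requirement.

```python
import os
import unittest


def require_inputs(*paths):
    """Skip the decorated test unless every listed input file exists."""
    missing = [p for p in paths if not os.path.exists(p)]
    return unittest.skipIf(missing, "missing inputs: " + ", ".join(missing))


class ExampleTest(unittest.TestCase):
    # Hypothetical path, used only to show the skip behaviour.
    @require_inputs("/nonexistent/hypothetical-binary")
    def test_needs_binary(self):
        self.fail("should have been skipped")
```

A skipped test is reported separately from a failure, so a missing binary does not look like a regression.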
Better categorization of tests, and ability to run tests by category, e.g.:
- by CPU model
- by ISA
- by Ruby protocol
- by length
More directed tests that cover specific functionality and complete faster. Running SPEC benchmarks is important but spends a lot of time doing the same thing over and over. Those should be only one component of our testing, not almost all of it as they are now. This is a desirable feature of our testing strategy, not necessarily something that impacts the regression framework.
Better checkpoint testing
- some of this doesn't really depend on the regression framework, just needs new tests
- e.g., integrating util/checkpoint-tester.py
Support for random testing (e.g., for background testing processes)
- Random latencies?
- Random testing a la memory testers but with different seeds, longer intervals
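
For random testing to be useful, a failing run has to be reproducible. One way to get that, sketched here under the assumption that each test accepts a seed, is to derive all per-test seeds from a single recorded master seed. The `seeds_for_run` helper is hypothetical.

```python
import random


def seeds_for_run(master_seed, count):
    """Derive `count` per-test seeds from one master seed.

    Logging only the master seed is enough to replay the whole run.
    """
    rng = random.Random(master_seed)
    return [rng.randrange(2 ** 32) for _ in range(count)]
```

A background testing process could pick a fresh master seed for each run and log it alongside the results.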
Decouple from SCons
- Avoid having scons dependency bugs force unnecessary re-running of tests, particularly for update-refs
- Don't rely on scons to run jobs. Running scons -j8 with a bunch of tests and a batch queuing system means that 8 CPUs are consumed, even if there is only one job running.
- Either make scons able to submit the jobs, or have something else manage the jobs and their completion status
Easy support for running separate tests where only the input parameters differ
- For example, several protocols utilize different state transitions depending on configuration flags. It would be great if we could test these without having to create new directories and tests.
- Similarly, we could/should test topologies this way as well.
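
One way to get parameter-only variants without new directories is to expand a single test definition into one named variant per combination of option values. The sketch below is an illustration only: the `expand` helper, the naming scheme, and the option values are all hypothetical.

```python
import itertools


def expand(base_name, **options):
    """Return one (test_name, params) pair per combination of option values."""
    keys = sorted(options)
    variants = []
    for values in itertools.product(*(options[k] for k in keys)):
        params = dict(zip(keys, values))
        suffix = "-".join("%s=%s" % (k, params[k]) for k in keys)
        variants.append(("%s[%s]" % (base_name, suffix), params))
    return variants


# Placeholder option values; real protocol and topology names would go here.
VARIANTS = expand("memtest", protocol=["proto_a", "proto_b"],
                  topology=["topo_x", "topo_y"])
```

Each variant carries its own name, so results and reference outputs can still be stored and reported per configuration.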
Automated way to use nightly regressions as a basis for updating "m5-stable"
- How do you identify the last working revision? (from Ali)
- Maybe need a bug-tracking system so we could record facts like "changeset Y fixes a bug introduced in changeset X" then we could automatically exclude changesets between X and Y, but we don't have that. (from stever)
Better definitions of success criteria.
- E.g., distinguish "stats changed but the output is still correct" from a simple pass or fail, giving three outcomes: passed, stats diffs, failed
- For example, you could say that a change in the terminal output, or in stdout and the SPEC binary outputs, is a failure, but a 1% difference in stats is a stats difference, which needs to be addressed
- I envision this as providing reasonable certainty that if you create a change you know will modify the stats, you have a quick verification that nothing broke horribly before updating the stats.
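
The three-way outcome could be computed along the following lines. This is a sketch under assumptions: the `classify` function is hypothetical, stats are taken to be already parsed into dictionaries, and any stats change at all is reported as a stats difference rather than being compared against a threshold.

```python
def classify(outputs_match, ref_stats, new_stats):
    """Return (status, changed) for one test run.

    status is "failed" if the program outputs differ, "stats diffs" if the
    outputs match but some statistic changed, and "passed" otherwise.
    changed maps each differing stat name to its (reference, new) values.
    """
    if not outputs_match:
        return "failed", {}
    changed = {}
    for name in sorted(set(ref_stats) | set(new_stats)):
        ref, new = ref_stats.get(name), new_stats.get(name)
        if ref != new:
            changed[name] = (ref, new)
    return ("stats diffs" if changed else "passed"), changed
```

Someone making a change that is expected to alter the stats could then check for a "stats diffs" result, rather than "failed", before updating the references.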

Implementation ideas

Just ideas... no definitive decisions have been made yet.

Use Python's unittest module, or something that extends it such as nose
Use SCons to manage dependencies between binaries/test inputs and test results, but in a different SCons invocation (i.e., in its own SConstruct/SConscript)
