Experiment setup for good skeleton tracking

Hi Guys,
I have some questions about skeleton tracking performance in my lab. My camera is a D435.

1/ If I want to improve skeleton tracking precision, should subjects wear tight-fitting clothes?

2/ I found that the feet merge with the box under them (see figure_1). Is there any method to solve this problem?
Fig1

3/ Do the reflective markers around the subjects' joints affect skeleton tracking? I remember skeleton tracking with the Kinect was terrible around reflective markers.

4/ The foot joints seem to be the most wobbly joints during skeleton tracking. Is there any approach to improve foot joint tracking?

The video Nuitrack published on YouTube (https://www.youtube.com/watch?v=gMPtV4NXtUo) shows unsteady foot tracking too.

5/ What are your suggestions for choosing the parameters of the spatial/subsample filters for a specific experiment environment (I saw the config file contains these two filters)?

In other words, what is the relationship between skeleton tracking performance and the filters' parameters? For example, I suspect the down-sample filter will reduce tracking quality if the factor is above 2.
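Nuitrack doesn't document exactly what its subsample filter does internally, so treat this as back-of-envelope geometry only: a down-sample factor of f merges f×f depth pixels, so each surviving pixel covers f times more lateral width. A sketch with assumed D435-like figures (90° HFOV, 848-pixel depth width, subject at 2m):

```python
import math

def footprint_mm(distance_mm, hfov_deg, xres, decimation=1):
    """Approximate lateral width covered by one depth pixel after
    down-sampling by the given decimation factor."""
    scene_width = 2 * distance_mm * math.tan(math.radians(hfov_deg) / 2)
    return scene_width / (xres / decimation)

# Assumed D435-like numbers: 90 deg HFOV, 848 px wide, subject at 2 m
for factor in (1, 2, 4):
    print(factor, round(footprint_mm(2000, 90, 848, factor), 1), "mm/pixel")
```

At a factor of 4, each depth sample covers roughly 19mm at 2m - wider than the precision you would want around an ankle - which supports the suspicion that factors above 2 start to hurt.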

6/ BTW, I am very curious about the D435 configuration and environment setup you used in that YouTube video (https://www.youtube.com/watch?v=gMPtV4NXtUo). There is much less noise than I have in my lab (figure_2 is my lab environment).
Fig2

Thanks!!

FWIW, the Intel D435 is NOT a good sensor for skeletal tracking, from our testing.

Its RMS error is too high for accurate depth sensing beyond around 1m - this leads to very poor tracking results and poor repeatability from frame to frame.

The Nuitrack system uses only the depth data to approximate where the skeletal joints are.

On the D435, that depth data is generated inside the sensor by a stereo imaging system that basically uses a lot of math to guesstimate the depth at each point.
This guesstimate is then passed through a boatload of post-processing before Nuitrack attempts to process it.

The D415 is a far better sensor for skeletal tracking, from our current testing.

ALSO as an aside - the POSE in your figure, with hands across the chest, would likely cause all sorts of grief with a D435 - all the post-processing, added to RMS errors of at least ±5cm at that distance, would likely produce a mess of incorrect depth data that fails to even see the fingers or hands reliably.

If you do want to persevere with the D435 - I would suggest manually changing the default depth map settings in nuitrack.config - or, more ideally, in code - so that RawWidth and RawHeight, as well as ProcessWidth and ProcessHeight, are 848x480 - which seems to be the optimum size.
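For reference - those settings live in the RealSense section of nuitrack.config, something roughly like the fragment below (key names from memory and possibly different across Nuitrack versions, so verify against your own copy of the file):

```json
"Realsense2Module": {
    "Depth": {
        "FPS": 30,
        "RawWidth": 848,
        "RawHeight": 480,
        "ProcessWidth": 848,
        "ProcessHeight": 480
    }
}
```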

Westa

Hi Westa,
I appreciate your suggestions! I have the following questions:

1/ I have two reasons to hesitate about buying the D415. One is the D415's narrower FOV. I know we can't use two cameras simultaneously in Nuitrack.

The other concern is the rolling shutter. It seems it can't capture high-speed motion very well. But if it can handle jumping speed without noticeable image quality reduction, the D415 will definitely be a better option.

So I have a request: since you have experience with both cameras, could you make a very simple comparison in which a subject does a countermovement jump? I just want to know whether the D415 can handle it and how much less noisy the D415 sensor is.

If the D415 looks good, our team would like to compare it further with the marker-based Vicon system. I will publish more results here to help people know the accuracy levels of the two cameras. Before that, I have to persuade others that the D415 at least looks promising.

2/ I actually only need the lower-body skeleton - can Nuitrack track only the lower body?

3/ Based on the Nuitrack YouTube demo of the D435, I think we need to find out why they get such decent skeleton tracking quality.

I think rotating the D435 90 degrees would let subjects get closer - they could stay just 1.5~2m from the camera.

I also think brighter lighting and an external IR projector might improve the depth map.

BTW, wearing tight-fitting, colorful clothes could help too. The filters can reduce the RMS further, but I don't know how to tune the parameters properly without overshooting.
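A toy sketch of that trade-off (all numbers assumed: a 100mm step standing in for a foot against the floor, 9mm depth noise, and simple exponential smoothing standing in for the real spatial filter):

```python
import math
import random

random.seed(0)

def spatial_smooth(row, alpha):
    """One-directional exponential smoothing - a stand-in for the
    smooth-alpha knob of a real spatial filter (lower alpha = stronger
    smoothing)."""
    out = [row[0]]
    for z in row[1:]:
        out.append(alpha * z + (1 - alpha) * out[-1])
    return out

# Synthetic depth row: floor at 2000 mm with a 100 mm step (a foot) in the
# middle, plus ~9 mm Gaussian depth noise.
truth = [2000.0] * 40 + [1900.0] * 20 + [2000.0] * 40
noisy = [z + random.gauss(0, 9) for z in truth]

results = {}
for alpha in (1.0, 0.5, 0.1):
    sm = spatial_smooth(noisy, alpha)
    flat_rms = math.sqrt(sum((s - t) ** 2 for s, t in zip(sm[:40], truth[:40])) / 40)
    edge_lag = abs(sm[45] - truth[45])  # how far the foot edge got smeared
    results[alpha] = (flat_rms, edge_lag)
    print(f"alpha={alpha}: flat RMS {flat_rms:.1f} mm, foot-edge error {edge_lag:.1f} mm")
```

Strong smoothing cuts the noise on flat regions but drags the foot edge tens of millimetres toward the floor - that is the overshoot to watch for.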

All in all, I want to give the D435 another few tries.

Thanks!

The thing to understand about the D435 is that NOISE is not the real issue - it's the size of the error in each depth point.

That is, for any given frame, it's how inaccurately each pixel reports depth - and with the D435, at any distance over 1.0m this can start to become problematic.

At 1.5 meters, any individual point could be reporting the distance incorrectly by an unworkable amount, and by 2 meters it's unusable for many systems like skeletal trackers.

The other thing to understand is that while the D435 has a wider field of view, it still resolves the same number of pixels - meaning each point is actually reporting a larger volume of space.

And when you are trying to work out where a finger is - this larger volume means you are seeing BLOBS of depth as opposed to details.

The net result of all this is that the D415 has an RMS error of less than half that of the D435 - and up to twice the resolution of detail for any given point at the same distance from the sensor.
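Those two ratios can be sanity-checked with ballpark spec numbers (all assumed here: D435 ≈ 90° HFOV, 848px hardware-optimal width, 50mm baseline; D415 ≈ 65° HFOV, 1280px, 55mm baseline):

```python
import math

def focal_px(hfov_deg, xres):
    """Focal length in pixels implied by horizontal FOV and image width."""
    return (xres / 2) / math.tan(math.radians(hfov_deg) / 2)

def footprint_mm(distance_mm, hfov_deg, xres):
    """Lateral width of the scene covered by one depth pixel."""
    return 2 * distance_mm * math.tan(math.radians(hfov_deg) / 2) / xres

# Assumed ballpark specs
d435_hfov, d435_xres, d435_base = 90, 848, 50
d415_hfov, d415_xres, d415_base = 65, 1280, 55

# Stereo depth RMS scales as 1 / (focal_px * baseline) at fixed subpixel error
rms_ratio = (focal_px(d415_hfov, d415_xres) * d415_base) / (
    focal_px(d435_hfov, d435_xres) * d435_base)
detail_ratio = footprint_mm(2000, d435_hfov, d435_xres) / footprint_mm(
    2000, d415_hfov, d415_xres)
print(f"D435 RMS ~{rms_ratio:.1f}x the D415's; each D435 pixel ~{detail_ratio:.1f}x wider")
```

Roughly 2.6× the RMS and 2.4× the pixel footprint - consistent with "less than half the error" and "around twice the detail".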

With regard to rolling versus global shutter - UNLESS your sensor is physically moving, as in panning left to right, it is unlikely you will ever see depth reporting errors noticeably worse than the errors the D435 introduces as distance from the sensor increases.

Understand that systems like Nuitrack are making calculated guesses (using smart AI systems) that can recognise a human and work out where that body's skeleton points likely are.

To do this it needs the best quality data possible to get the best possible results - and for our money the D435 data is just too poor in quality under most circumstances.

Westa

Hi,

Please take a look at our recommendations for environment conditions when using Nuitrack.

Hi,
Thanks for your suggestions! I actually sent your team a letter explaining that you may need to refresh this setup guide. For example, the D435 can actually work outdoors. But I think your suggestions are very helpful anyway.

Hi Westa, I appreciate your detailed response! I decided to buy a D415. But I don't fully understand the following:

From the equation in the Best Known Methods for Tuning RealSense D4xx Cameras white paper, page 3:

Depth RMS error (mm) = Plane fit RMS error (mm) = Distance(mm)^2 x Subpixel RMS error / (Focal length(pixels) x Baseline(mm))

If I plug the following D435 parameters (taken from the RealSense Viewer) into the equation:
(1280x720) HFOV = 90deg, Xres = 1280, subpixel = 0.09, baseline = 50mm, distance = 1111mm. The plane fit RMS error is 3.47mm (the Depth Quality Tool indicated 3.32mm).

(848x480) HFOV = 90deg, Xres = 848, subpixel = 0.05, baseline = 50mm, distance = 1111mm. The plane fit RMS error is 2.91mm (the Depth Quality Tool indicated 2.71mm).

But when I remove the textured paper in front of the camera, both the plane fit and subpixel RMS errors go up.

If we assume we have enough texture on the subject and the background, and replace the ground-truth distance with 2 meters, we get a plane fit RMS error of 11.25mm for Xres = 1280 and 9.43mm for Xres = 848.
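The arithmetic checks out; here is a small sketch of the white-paper formula, using the focal length implied by the HFOV (focal_px = (Xres/2) / tan(HFOV/2), which is simply Xres/2 at a 90° HFOV):

```python
import math

def plane_fit_rms_mm(distance_mm, subpixel, hfov_deg, xres, baseline_mm):
    """Depth RMS error from the Intel D4xx tuning white paper:
    RMS = distance^2 * subpixel / (focal_length_px * baseline)."""
    focal_px = (xres / 2) / math.tan(math.radians(hfov_deg) / 2)
    return distance_mm ** 2 * subpixel / (focal_px * baseline_mm)

# Reproduce the figures quoted above (D435: 90 deg HFOV, 50 mm baseline)
print(round(plane_fit_rms_mm(1111, 0.09, 90, 1280, 50), 2))  # ~3.47 mm
print(round(plane_fit_rms_mm(1111, 0.05, 90, 848, 50), 2))   # ~2.91 mm
print(round(plane_fit_rms_mm(2000, 0.09, 90, 1280, 50), 2))  # ~11.25 mm
print(round(plane_fit_rms_mm(2000, 0.05, 90, 848, 50), 2))   # ~9.43 mm
```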

My question is: Are these depth RMS errors at 2m calculated above unacceptable for the AI algorithm to generate a reliable skeleton?

Best,
Jake

Hi Jake,

There is really no simple answer to these questions - it has a lot to do with your usage case:
what level of accuracy you require, what level of repeatability you require, and what level of detail you require.

But for what it's worth - with the D435, 848x480 is the highest recommended pixel width - this is basically the hardware-resolved width. Any wider setting gives you a digitally interpolated result from the sensor, which according to Intel results in higher RMS.

Have a look on the RealSense community forum - there is a lot of detail on the performance of each sensor, and a lot of explanations are starting to be forthcoming from Intel about the limitations of the D435. It would have been nice to see these before we all purchased D435s because they looked useful on paper.

The way the Nuitrack system works is by using algorithms (AI-like) to recognise that there is a body in the field of view, and then using further algorithms to calculate where parts of the skeleton for that body "likely are".

I use the term LIKELY ARE since it's all a set of approximations based on knowledge of anatomy, human movement, and inverse-kinematics-type math. For this all to work optimally, you need access to reliable and repeatable depth data.

As the errors increase in the depth data - the accuracy of the reporting of the location of the points on the skeleton becomes less accurate from frame to frame which leads to jitter and jerky movements.

Intel uses a lot of post-processing to attempt to clean up the data and remove holes and such - but this also has the potential to increase the number of frame-to-frame "errors", in the sense that a hole cleaned up by post-processing in one frame may not be a hole in the next frame - it instead reports a depth - which can result in increased jitter in the tracking.

This is further compounded at contact points such as the floor and occlusion points for example where the hand occludes the elbow or the shoulder.

In the case of contact points such as the floor, the errors result in the reported location of the ankle and foot showing quite high jitter and inaccuracy, as the exact location of the contact point between the foot and the floor becomes blurred from frame to frame.

If you examine the depth data around the floor/foot contact point, with an RMS error of almost a centimeter, the floor and the foot locations are effectively blending together by up to half an inch - so you get a lot of guesswork errors.
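A toy simulation of that blending (all numbers assumed: a foot edge 30mm in front of the floor at 2m, 9mm RMS per pixel, and a midpoint threshold deciding which side of the edge a pixel belongs to):

```python
import random

random.seed(1)

FOOT_MM, FLOOR_MM, RMS_MM, FRAMES = 1970.0, 2000.0, 9.0, 10000
threshold = (FOOT_MM + FLOOR_MM) / 2  # midpoint classifier: nearer = foot

# For each simulated frame, take one foot pixel and one floor pixel at the
# contact edge, each perturbed by the sensor's depth error
swaps = 0
for _ in range(FRAMES):
    foot = random.gauss(FOOT_MM, RMS_MM)
    floor = random.gauss(FLOOR_MM, RMS_MM)
    if foot > threshold or floor < threshold:
        swaps += 1

print(f"{100 * swaps / FRAMES:.1f}% of frames misplace the foot/floor edge")
```

Even in this generous model the edge flips on roughly one frame in ten - which is exactly what shows up as ankle jitter.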

Compare that with the D415 at the same distance, which would likely be around 3.5-4.5mm, without doing the exact math.

The other side of all this is the virtual size of a PIXEL of depth at this distance. Since we are told the physical hardware can only work optimally at 848x480 for the D435, but with a significantly wider field of view, the effective size of each depth point is much larger for the D435 than for the D415. Intel is telling us that this resolves to approximately HALF the level of detail for the same point in space for the D435 versus the D415 at any given distance - once you get out past around 1m to 1.5m.

The lower level of detail combined with the higher RMS compounds the poor quality of the data being handed to Nuitrack - with the end result being less than optimal tracking - at least for all of our internal usage cases.

For example - we require solid and accurate repeatable tracking of the entire skeleton - and especially the hands - regardless of where they are positioned. Right now at 2m for our needs the skeleton reporting is not reliable or accurate enough with the D435 for what we would consider commercial use.

And frankly I doubt you will ever get results approaching those of a marker-based system with the current D435 sensor technology - though I'm very happy to be proved wrong if someone can show us a set of settings that works accurately, reliably, and repeatably.

FWIW the D415 is getting close - but it's still not time-of-flight, which still feels like the holy grail. Who knows - maybe if Nuitrack manages to get their LiPS ToF sensor implementation working, we may start to get close to a marker-based system.

Westa

Hi Westa, very interesting and thoughtful explanation about D435!

I have no experience with the Kinect V2, which uses ToF technology. Do you think ToF is way better than the active/passive stereo techniques used by the D435?

Jake

Hi Jake,

Just my personal opinion based on our research - but it's all about usage and needs. Only time will tell which solution becomes the commercially practical baseline.

Though yes, based on our current research we suspect that time-of-flight is likely to produce more repeatable results - but that too will greatly depend on how the software in the sensor is coded and implemented.

Nuitrack have been saying for some time that they are working on support for new sensors, at least one of which is likely to be ToF - time will tell there too.

One issue, however, is that ToF sensors tend to be more expensive and harder to get right.

The Kinect 2.0 was really a $1500-value sensor at the time it was released - MS effectively gave it away to generate console sales.

I would expect any ToF sensor to be around double the price of the current Intel and Orbbec sensors. I have heard the DL model from LiPS is around US$300 at sample-unit prices.

Westa

Thanks! You are right - time will tell which one wins.