Full Version: Troubleshooting Segmentation Faults
flyvholm
I've compiled the latest version of gPhoto (2.2.0, libgphoto 2.2.1) on my Zaurus, but it crashes with a segmentation fault. Aided by gPhoto developers I've found that the crash happens because the following assignment fails:
CODE
struct submenu *cursub = menus[menuno].submenus+submenuno;

The structure contains pointers to two functions, but the above statement assigns invalid pointers, resulting in the seg fault when a function is called through one of them. The relevant source code is in the libgphoto package, file camlibs/ptp2/library.c; the crash happens at line 3727. The gPhoto developers suggested that it is an ARM architecture related problem and couldn't help me further.

Isn't it possible to find out what causes such problems so the code can be modified (or compiled differently) to work on the Zaurus? Or does everybody just give up at this point???

Any troubleshooting advice would be highly appreciated. I really really want this to work.
miskinis
Is it easy to make it crash, or does it take some work? I have not used gPhoto in many years, but offhand I would have to say: yes, it is possible to modify the code, if indeed the code is at fault. I'm curious as to why the code would not compile and run on the Z. If there are "ARM architecture problems" that affect C coding, I want to know about them!
Which Zaurus model is this failing on (C1000, as in your sig)? Did you get any compilation warnings?
flyvholm
Yes, it's on the C1000.
I have only run into one command that triggers the crash:
gphoto2 --set-config capture=on
Without this I can't use the camera for remote capture, which was the whole plan with my Z.
It works fine with Ubuntu 6.06 (Dapper) on my laptop, so the code is OK, except when compiled on the Z.
An example of ARM architecture issues:
http://www.arm.com/support/faqdev/1228.html (See last paragraph, 'Porting code...')

But the compile toolchain could be the problem too? Here's an example of what can happen:
http://www.arm.com/support/faqdev/1247.html
I know this applies to a different toolchain, but the symptoms appear to be exactly what I'm seeing. Debugging with GDB I can't trace the code all the way to the actual crash, and backtraces fail, running into null pointers in the stack.

One suggestion I got from the gPhoto people was trying to use different compile flags. I haven't had success, but maybe it could resolve issues for others. Here's a list of ARM specific compile flags:
http://gcc.gnu.org/onlinedocs/gcc-3.4.5/gcc/ARM-Options.html

Finally I have a question. I just stumbled across GNU ARM which is a somewhat newer toolchain than the zgcc-3.4.5 I'm using now. Is it possible (and not overly complicated) to use this to compile programs for pdaXrom on the Z??
merli
I am not a developer, but ARM has a problem with structure alignment, which differs from the x86 architecture, so some structures can be misinterpreted.
I have this issue in my port of Duke Nukem 3D: it runs fine, except that save/load does not work because it tries to write game structures directly to a file. I have a similar problem with my port of Descent 2, which freezes when you get hit and some bitmap structures are about to appear on screen.
You can try using __attribute__((packed)) when you define structures, or compiling with the -fpack-struct option, but I tried both methods with no success.

Maybe some of the core Zaurus and ARM kernel developers could say more about this problem and propose a way to make code portable to ARM.
This URL tries to explain it, but I don't yet understand what to do with the code: http://netwinder.osuosl.org/users/b/brianb.../alignment.html

I would be glad if this thread leads to a solution that fixes the problem with gPhoto and with my other ports. If someone really understands what's going on, please help us.
damiandixon
Try and rebuild all source with:

-mstructure-size-boundary=32

This uses a lot more memory, but everything ends up aligned on a 32-bit boundary.

This may also work but I have never used this:

-malignment-traps

Basically, what the comments in the thread are saying is that data is being accessed misaligned.

Use one or the other compiler options.

The documentation for -malignment-traps explains the problem quite well.

Regards
Damian
flyvholm
I've tried rebuilding with -mstructure-size-boundary=32, -malignment-traps and, for that matter, -mno-alignment-traps. It crashes just the same.

However, according to Merli's link it could likely be an alignment issue. Here is the structure in which the crash happens:

CODE
struct submenu {
    char      *label;
    char  *name;
    uint16_t    propid;
    uint16_t    vendorid;
    uint16_t    type;
    get_func    getfunc;
    put_func    putfunc;
};

struct menu {
    char  *label;
    char  *name;
    struct    submenu    *submenus;
};


It is the get_func and put_func pointers that are corrupted. I tried putting some __attribute__((packed)) in there, and it did change the invalid values of the pointers, but only to other invalid values. Maybe I'm doing it wrong - can anybody tell me how to apply __attribute__((packed)) to these structures to get a working alignment?
miskinis
This might be a stupid shot in the dark, and just a workaround, but try adding
uint16_t filler1;
after
uint16_t type;

The basis of this theory is that it will push the two function pointers up to a word boundary.
Serge
You may also want to check this page from maemo wiki:
https://maemo.org/maemowiki/PortingFromX86ToARM

About structures, there are two issues: one is struct member alignment, the other is struct size. Sample code that can be used to demonstrate the problems can be found here: http://www.internettablettalk.com/forums/s...read.php?t=2668

I use the following code to solve packing issues:
CODE
#pragma pack(1)
typedef struct s
{
   char x;
   int y;
}
#ifdef __GNUC__
__attribute__((packed))
#endif
S;
#pragma pack()


Hope this helps.
flyvholm
Good news: I got rid of my seg fault, thanks to miskinis' "shot in the dark". Instead of adding a filler (which would require me to add an element everywhere the structure is used), I changed submenu->type to a 4-byte element of type uint. Voilà, alignment achieved, and I can finally, finally do remote capture with my Z!

Bad news: This issue is very confusing, and it certainly was a pain to troubleshoot. Look at the structure:
CODE
struct submenu {
    char      *label;
    char      *name;
    uint16_t   propid;
    uint16_t   vendorid;
    uint16_t   type;
    get_func   getfunc;
    put_func   putfunc;
};

If the function pointers are unaligned, other things would appear to be so as well - the character arrays can be uneven # of bytes, and if propid happens to be aligned, vendorid isn't! For the same reason I did consider miskinis' suggestion very unlikely to work and was close to not even trying. Is it only some types of objects, such as pointers, that need alignment?

It appears that compilers by default will add padding in some places to achieve alignment, and in other places they won't. Inconsistent, but perhaps a compromise between memory efficiency and avoiding crashes. What I really don't understand are the following two things:
1) How is "packing" going to help? Doesn't that tell the compiler not to add any padding, increasing the likelihood of unaligned accesses? I did try all the packing suggestions anyway and, indeed, couldn't make any of them work.
2) How the heck do I make the compiler align the offending elements? You'd think that's what e.g. __attribute__((aligned(4))) is for, but I've applied it both to the function pointers and, to be sure, to the element just before them (submenu->type). Still unaligned!! Finally, the flag -Wcast-align is supposed to warn of possible bad alignments. It did produce warnings, just not where the alignment problem was.

In the end, only a shot in the dark worked. This doesn't seem right. Do we have a buggy toolchain (zgcc-3.4.5)? Which leads me back to asking: Can the newer GNU ARM toolchain be used to compile programs for the Z and pdaXrom without major difficulties?

P.S. To make sure the post doesn't appear ungrateful I'd like to thank everybody who made suggestions.
Serge
To flyvholm:

OK, let's start with a step-by-step tutorial, if you don't mind. First we need a test program to experiment with. You can provide a sample of your own, but I still suggest starting with this one:

CODE
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>

#pragma pack(1)
typedef struct s
{
   char x;
   int y;
} S;
#pragma pack()

int main(void)
{
   int i;
   char *buffer = (char *)malloc(16);
   if (buffer == NULL) return 1;
   for (i = 0; i < 16; i++) buffer[i] = i;

   /* Deliberately unaligned read: this is the non-portable access under test. */
   printf("reading unaligned value from the buffer at offset 1: %08X\n",
       *(int *)(buffer + 1));
   /* offsetof and sizeof yield size_t, so cast before printing with %d. */
   printf("offsetof(S, y)=%d\n", (int)offsetof(S, y));
   printf("sizeof(S)=%d\n", (int)sizeof(S));

   free(buffer);
   return 0;
}


The goal is to make it produce identical output for both x86 and arm. You can use 'aligned' and 'packed' attributes to learn how they work.

By the way, if the compiler complained about other parts of the code with the -Wcast-align option, those parts are quite likely to have bugs too.

edit: The bottom line: it is the nonportable code that has problems on ARM; the compiler is not buggy. The same nonportable code just happens to work on x86, giving you the impression that the ARM architecture is at fault. If you need more explanation of why alignment works this way on ARM, I can try to provide it here (but the links already posted in this topic should be enough to get all the information).
miskinis
Hi,

I'm glad I could offer a working solution. I actually expected it to work, as long as other code did not "play tricks" with the structure, such as bypassing normal (proper) C structure access and using pointer arithmetic to work with data residing in the structure.

As for your statement about the "character arrays": only pointers to the character arrays are stored in the structure, so their length does not matter in any way. Technically, a pointer to the first character is stored.

Since you mentioned that the crash occurred when the function pointer was read and then called, I assumed that the uint16_t members were being accessed OK, or at least were not causing a crash. So I just began by trying to address the root cause of the crash.

EDIT: Oh yeah, you could have just added the filler like I suggested and not worried about doing anything with it, right?

John
flyvholm
Serge:
I appreciate your effort to explain. But the two things I didn't understand were how packing (removing padding) can help when alignment (adding padding) is what is needed, and why the compiler wouldn't align elements when specifically told to.

Well, it turned out that earlier in the code (1000s of lines, so I didn't look through it all!) someone had added a #pragma pack(1) without doing a #pragma pack() later, so the compiler had counterinstructions to pack everything regardless of my efforts to align it! Adding a #pragma pack() after the structure to be packed (so that all subsequent structures were NOT packed) solved the problem too.

Packing code can be necessary for some purposes, but it is not the solution when you're looking to align elements to avoid program crashes. Quite the opposite: it is a source of misalignment. So look out for #pragma pack and __attribute__((packed)) in your code if it's crashing. Removing them, and perhaps using __attribute__((aligned(4))) or compiler flags to help ensure alignment, could solve your issues (with the risk of creating others).

Miskinis:
Thanks for clarifying. Adding the filler did work - it was just more cumbersome because I also had to add an element in many other lines of code where the structure is used.
Serge
QUOTE(flyvholm @ Nov 1 2006, 10:17 PM)
Serge:
I appreciate your effort to explain. But the two things I didn't understand were how packing (removing padding) can help when alignment (adding padding) is what is needed.

If you have a clear understanding of what your program needs, that is fine. Either packing or alignment can be desired, depending on the program. Most such problems arise from data structures that are loaded from or saved to disk directly. In general this operation is not portable, and you have to deal with endianness (not an issue for arm, which is little-endian here just like x86) and with different packing and alignment rules. The C/C++ standards do not specify exactly how struct members get aligned and packed, so the result depends on the compiler and the platform where the code is used. To force a specific layout, compiler-specific pragmas and attributes are used. In the x86 world, #pragma pack(1) is generally used to force structure packing, but it is not supported on arm (as you can see in the example I posted), so it can be a source of problems. If you find such code in sources that you also want to run on arm, you had better check and fix it. It is a good idea to insert plenty of asserts that check 'offsetof' and 'sizeof' for critical data; this way you can catch many problems and make sure that the pragmas and other mumbo-jumbo were actually accepted by the compiler and that it understood what you wanted.

QUOTE
Plus, why wouldn't the compiler align elements when specifically told so.

Please post a complete testcase (a program that can be compiled and run), so we can look at it and try to figure out what's wrong.

QUOTE
Packing code can be necessary for some purposes, but it is not the solution when you're looking to align elements to avoid program crashes. Quite the opposite: it is a source of misalignment. So look out for #pragma pack and __attribute__((packed)) in your code if it's crashing. Removing them, and perhaps using __attribute__((aligned(4))) or compiler flags to help ensure alignment, could solve your issues (with the risk of creating others).

Packing is not the source of crashes on arm. The source is unaligned memory access, which the arm hardware does not support. If you use __attribute__((packed)) in your code, you explicitly tell the compiler that this data is unaligned, and the compiler generates code that accesses it a byte at a time and combines the bytes to get the correct result. This carries a noticeable performance penalty, but the code works as expected. Because of this potential penalty, compilers for the arm architecture try much harder to avoid unaligned data; that is why #pragma pack is not respected and the compiler applies its own alignment policy. You can also get into trouble if you explicitly convert between incompatible pointer types, as in '*(int *)(buffer + 1)', which can easily result in an unaligned memory access, although the compiler can warn about such conversions.

By the way, did you check the information about /proc/cpu/alignment at http://www.nslu2-linux.org/wiki/Info/Alignment ? It could also be useful when debugging your code.
flyvholm
QUOTE
QUOTE
Plus, why wouldn't the compiler align elements when specifically told so.

Please post a complete testcase (a program that can be compiled and run), so we can look at it and try to figure out what's wrong.

Somebody had left a #pragma pack(1) earlier in the code without a matching #pragma pack() later, so the compiler was instructed to pack all subsequent code regardless of my efforts to align it.

You've said that packing is not at fault and ARM architecture is not at fault. But for a fact, running packed code on ARM devices is a source of crashes. Of course, in the end the programmer is at fault for writing non-portable code, or the user is at fault for trying to run non-portable code on ARM. But knowing that human error is to blame doesn't help you much when troubleshooting.

Thanks for the link. It clarified another thing I hadn't understood - why the uint16_t elements did not need to be on a 4-byte boundary (2-byte elements on 2-byte boundaries are ok).
Serge
QUOTE(flyvholm @ Nov 2 2006, 12:11 AM)
Somebody had left a #pragma pack(1) earlier in the code without a matching #pragma pack() later, so the compiler was instructed to pack all subsequent code regardless of my efforts to align it.

I'm sorry for repeating myself, but #pragma pack(1) works differently on x86 and arm, so it results in different program behaviour, and if you consider the x86 behaviour correct, it is no surprise that this code does not work on arm the way you expected. You have been warned; it is up to you what to do with this information. In your place, I would try to modify the code so that it works the same on both.

QUOTE
You've said that packing is not at fault and ARM architecture is not at fault. But for a fact, running packed code on ARM devices is a source of crashes. Of course, in the end the programmer is at fault for writing non-portable code, or the user is at fault for trying to run non-portable code on ARM. But knowing that human error is to blame doesn't help you much when troubleshooting.

I'm sorry, I certainly did not mean to offend you. Running packed code on arm devices is a source of crashes only because the code was initially developed and tested on x86. If it had been developed on arm and you were porting it to x86, maybe you would be blaming x86 woes right now. Anyway, fixing the code so that it is portable and works correctly everywhere is the way to go. Once you have a clear understanding of what's happening (and we'll try to help you with that), it will be relatively easy to fix. Good luck!

One last thing: making random changes to the real code based on empirical guesses and (mis)interpretation of information found on the net is not the best way to fix it. I still suggest that you run some tests with smaller samples first.
flyvholm
x86 and ARM were designed for different purposes; one is not more right than the other. But it just happens that most of the applications we're compiling for the Z were written and tested on x86. So if one crashes on the Z, we need to find out why so we can fix it.

As I understand it:
1) When legit x86 code is ported to an ARM device and crashes, unaligned access is likely the issue.
2) Compilers usually align things in memory. Packing code is telling the compiler not to do this, so if the x86 code you are trying to port includes packing, this could well be a reason for unaligned access. Removing the packing from the code could cure the problem.

The reason I want to point this out is that packing the code has been presented as a way to eliminate unaligned access rather than as a source of it. At least both Merli and I have been adding pack statements to our code in the hope of getting rid of unaligned access. Maybe I misunderstood you, but when you posted code showing how to pack a structure, I took it as a suggestion for something to try to solve my problem. In fact, a #pragma pack statement in the code was the whole problem.

Anyway, the code is now fixed and working as intended, so I'm happy.