1. Introduction: "File format"

  • Expression data must be supplied in comma-separated values (CSV) format, including the header column labels and the expression values. The header column is the first column in the data file and contains the category of information listed in the subsequent columns. The top row (starting with Column B) lists the individual protein specimens. The remaining rows contain the protein expression values for each specimen and category. See Figure 1 below for details.

    Figure 1: Dataset Example

  • Comma-separated values (CSV): Wikipedia [http://en.wikipedia.org/wiki/Comma-separated_values]

2. Introduction: "Datafile Check"

  • Check the submitted file to confirm that the category fields and variable name fields are filled. If the fields are not filled, the error messages depicted in Figure 2 and Figure 3 will result.
    Figure 2: Error Due to Incomplete Variable Name Field
    Figure 3: Error Due to Incomplete Category Field

  • Note that a datasheet with null (empty cell) values will still be processed, but the result will be biased (Figure 4) because null values are treated as zero (0).

    Figure 4: Warning about Null Values
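
The effect of treating empty cells as zero can be illustrated with a small sketch. The protein name, specimen labels, values, and the row layout used here are hypothetical and chosen only for illustration; this is not PanelComposer's own code.

```python
import csv
import io
from statistics import mean

# Hypothetical data: one protein measured in four specimens, one cell left empty.
raw = "Protein,S1,S2,S3,S4\nProteinX,2.0,,4.0,6.0\n"

row = list(csv.reader(io.StringIO(raw)))[1]
cells = row[1:]  # drop the name field, keep the expression values

# Empty cells counted as 0 (the behaviour described above).
as_zero = [float(c) if c.strip() else 0.0 for c in cells]
# Empty cells left out of the calculation, for comparison.
observed = [float(c) for c in cells if c.strip()]

print(mean(as_zero))   # 3.0
print(mean(observed))  # 4.0
```

In this toy case a single empty cell lowers the mean from 4.0 to 3.0, which is the kind of bias the warning refers to. Filling in or removing incomplete specimens before upload avoids it.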

3. Introduction: "Setting Positive and Negative"

  • Select the positive group and the negative group for assessing a protein's performance as a biomarker (that is, the protein can be validated if it efficiently distinguishes between these two groups). 'Positive' and 'Negative' denote the conditions or states (such as disease and non-disease, cancer and normal, or moderate case and severe case) that will be used to evaluate each candidate as a biomarker. One or more categories can be selected for each of the positive and negative groups.

  • Case 1. HCC vs. Normal. In this example, HCC is the positive group, Normal is the negative group, and other categories are not used (Figure 5).
    Figure 5: HCC (Positive) vs. Normal (Negative)

  • Case 2. HCC vs. Cirrhosis and Hepatitis. In this example, HCC is the positive group, Cirrhosis and Hepatitis are treated as the negative group, and other categories are not used. Two or more categories can be merged into one group (Figure 6).
    Figure 6: HCC (Positive) vs. Cirrhosis, Hepatitis (Negative)

  • Case 3. Cancer vs. Liver Disease. In this example, HCC, Cholangiocarcinoma, Stomach cancer, and Pancreatic cancer are the positive group, Cirrhosis and Hepatitis are the negative group, and normal is not used. Two or more categories can be merged into one group (Figure 7).
    Figure 7: Cancers (Positive) vs. Liver Disease (Negative)
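
The three cases above amount to mapping category names onto a binary label and discarding unused categories. A minimal sketch of that mapping follows; the function name `assign_labels` and the sample category list are assumptions made for illustration and are not part of PanelComposer.

```python
def assign_labels(categories, positive, negative):
    """Return (specimen_index, label) pairs: 1 = positive, 0 = negative.

    Specimens whose category belongs to neither group are skipped.
    """
    positive, negative = set(positive), set(negative)
    if not positive or not negative:
        raise ValueError("both a positive and a negative group must be selected")
    if positive & negative:
        raise ValueError("a category cannot be in both groups")
    labels = []
    for index, category in enumerate(categories):
        if category in positive:
            labels.append((index, 1))
        elif category in negative:
            labels.append((index, 0))
    return labels


# Hypothetical category column for five specimens.
categories = ["HCC", "Normal", "Cirrhosis", "Hepatitis", "HCC"]

# Case 2: HCC (positive) vs. Cirrhosis and Hepatitis (negative); Normal is not used.
case2 = assign_labels(categories, positive=["HCC"], negative=["Cirrhosis", "Hepatitis"])
print(case2)  # [(0, 1), (2, 0), (3, 0), (4, 1)]
```

Merging several categories into one group is just a larger set on one side, and a missing group is rejected before any analysis runs.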

4. Introduction: "Parameter Check"

  • Check the positive and negative group selections to confirm that they are appropriate for analysis (Figure 8). Click the "<< Back" button to re-select the positive and negative groups.
    Figure 8: Parameter Check

  • If either 'positive' or 'negative' is not selected, the message shown in Figure 9 will appear.
    Figure 9: Error Message, Positive or Negative Group not Selected

5. Introduction: "Set Cross-validation option"

  • Set the cross-validation options as shown in Figure 10.
    Figure 10: Cross-Validation Options
  • Note that at least 15 data points in each of the positive and negative groups are recommended for cross-validation (CV) and logistic regression analysis.
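
As a rough illustration of why group size matters, the sketch below checks the 15-per-group recommendation and splits specimens into folds that keep the positive/negative ratio. The text does not state which cross-validation scheme PanelComposer uses, so the stratified k-fold split and the names `group_sizes_ok` and `stratified_folds` are assumptions.

```python
MIN_PER_GROUP = 15  # recommendation quoted in the text above


def group_sizes_ok(labels, minimum=MIN_PER_GROUP):
    """True if both the positive (1) and negative (0) groups reach the minimum size."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    return n_pos >= minimum and n_neg >= minimum


def stratified_folds(labels, k):
    """Deal specimen indices into k folds, group by group, in round-robin order.

    Each fold therefore keeps roughly the original positive/negative ratio.
    """
    folds = [[] for _ in range(k)]
    for group in (1, 0):
        members = [i for i, y in enumerate(labels) if y == group]
        for position, index in enumerate(members):
            folds[position % k].append(index)
    return folds


# Hypothetical dataset: 15 positive and 20 negative specimens.
labels = [1] * 15 + [0] * 20
folds = stratified_folds(labels, k=5)
print(group_sizes_ok(labels), [len(f) for f in folds])
```

With 15 positives and 5 folds, each held-out fold contains only 3 positive specimens, so smaller groups would leave very little data for estimating performance in each fold.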

6. Introduction: "AUC list"

  • The AUC list (or result table) contains gene or protein names, P-values, area under the curve (AUC) values, and the confidence intervals (CI) of the AUC values, listed in descending order of AUC value (Figure 11).
    Figure 11: Example AUC list

    • AUC is the most popular measure of accuracy derived from the receiver operating characteristic (ROC) curve. For a useful classifier, AUC values range from 0.5 (no better than chance) to 1 (perfect separation); good candidates with better specificity or sensitivity have larger AUC values (close to 1).
    • The p-value from the Mann-Whitney U test is listed to help users judge whether each protein's level differs between the groups (positive vs. negative). For example, in Figure 12, at a significance level (alpha) of 0.05, the level of Apolipoprotein A-1 differs significantly between HCC and Cirrhosis (P=0.03), but the level of Ceruloplasmin does not (P=1.171).

    • Confidence interval (CI): Wikipedia [http://en.wikipedia.org/wiki/Confidence_Interval]
    • Mann-Whitney U test: Wikipedia [http://en.wikipedia.org/wiki/Mann-Whitney_U_test]
    • P-value: Wikipedia [http://en.wikipedia.org/wiki/P-value]

  • Click the "<< Back" button to return to the parameter setting page.

  • To make a protein panel (Figure 12):
    1. Click to select the check boxes for two or more proteins in the AUC list.
    2. Click the "Panel (Manual)" button to make and assess a biomarker panel that includes the selected proteins.
    3. Click the "Panel (AUTO)" button to make and assess a biomarker panel that PanelComposer recommends.
      Figure 12: Create a Protein Panel
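
The AUC values in the list are closely related to the Mann-Whitney U statistic: AUC equals U divided by the number of positive/negative pairs. The sketch below computes it by direct pair counting; the function name `auc_pairwise` and the expression values are hypothetical, and the code makes no claim about how PanelComposer computes its own values.

```python
def auc_pairwise(pos, neg):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive specimen has the higher value; ties count as half a pair.

    This equals the Mann-Whitney U statistic divided by len(pos) * len(neg).
    """
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical expression values for one protein.
positive_values = [5.1, 6.3, 7.0, 4.8]
negative_values = [3.2, 4.9, 4.0]

auc = auc_pairwise(positive_values, negative_values)
# A protein that is lower in the positive group gives a value below 0.5;
# reporting max(auc, 1 - auc) keeps the result in the 0.5 to 1 range.
direction_free_auc = max(auc, 1.0 - auc)
print(round(auc, 4))  # 0.9167
```

Here 11 of the 12 pairs are ordered correctly, so the AUC is 11/12.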

7. Introduction: "ROC curve"

  • The receiver operating characteristic (ROC) curve is a general statistical method for assessing the performance of a binary classifier, that is, one that distinguishes between two categories. The ROC curve plots a test's 100-specificity (%) on the horizontal axis against its sensitivity (%) on the vertical axis.
  • The AUC value is the primary measure of performance. A good classifier has a larger AUC value than a poor classifier. However, it is sometimes necessary to examine sensitivity and 100-specificity (%) in more detail and to compare classifiers on an ROC graph.
  • In the example in Figure 13, Apolipoprotein A-1 appears to be the best classifier based on AUC values, but this does not hold under some constraints (e.g., when the specificity of the classifier must be at least 90% or 95%). A panel with Apolipoprotein A-1 and Vitamin D-binding protein is better than Apolipoprotein A-1 alone in the 90% specificity region, because the panel has a sensitivity of 69.05% whereas Apolipoprotein A-1 has a sensitivity of 23.81% under the same specificity limit.
    Figure 13: ROC curve Comparison
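
Comparing classifiers at a fixed specificity limit can be sketched as follows. The code builds ROC points by sweeping a threshold over the observed values and then reads off the best sensitivity that still meets the limit. The function names and the data are hypothetical, and the rule "call a specimen positive when its value is at or above the threshold" is an assumption.

```python
def roc_points(pos, neg):
    """Return (100 - specificity, sensitivity) pairs in percent, one per threshold."""
    thresholds = sorted(set(pos) | set(neg), reverse=True)
    points = [(0.0, 0.0)]  # threshold above every value: nothing called positive
    for t in thresholds:
        true_pos = sum(1 for v in pos if v >= t)
        false_pos = sum(1 for v in neg if v >= t)
        points.append((100.0 * false_pos / len(neg), 100.0 * true_pos / len(pos)))
    return points


def sensitivity_at_specificity(pos, neg, min_specificity):
    """Highest sensitivity (%) among ROC points whose specificity (%) meets the limit."""
    best = 0.0
    for false_pos_rate, sensitivity in roc_points(pos, neg):
        if 100.0 - false_pos_rate >= min_specificity:
            best = max(best, sensitivity)
    return best


# Hypothetical scores for four positive and four negative specimens.
positive_scores = [5.1, 6.3, 7.0, 4.8]
negative_scores = [3.2, 4.9, 4.0, 2.5]

print(sensitivity_at_specificity(positive_scores, negative_scores, 90))  # 75.0
print(sensitivity_at_specificity(positive_scores, negative_scores, 75))  # 100.0
```

Two classifiers with similar AUC values can give quite different results from `sensitivity_at_specificity` at a 90% limit, which is the situation illustrated in Figure 13.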

8. Appendix

  1. Accuracy: Wikipedia [http://en.wikipedia.org/wiki/Accuracy]

  2. Comma-separated values (CSV): Wikipedia [http://en.wikipedia.org/wiki/Comma-separated_values]

  3. Confidence interval (CI): Wikipedia [http://en.wikipedia.org/wiki/Confidence_Interval]

  4. Cross-validation (CV): Wikipedia [http://en.wikipedia.org/wiki/Cross-validation_(statistics)]

  5. Logistic regression:
    • Wikipedia [http://en.wikipedia.org/wiki/Logistic_regression]
    • Logistic regression makes no assumptions about the independent variables, such as being normally distributed, linearly related, or of equal variance within each group. However, the explanatory variables should not be highly correlated with one another, because this can cause problems with estimation.

  6. Mann-Whitney U test: Wikipedia [http://en.wikipedia.org/wiki/Mann-Whitney_U_test]

  7. P-value: Wikipedia [http://en.wikipedia.org/wiki/P-value]

  8. Pearson's correlation coefficient (PCC):
    • Wikipedia [http://en.wikipedia.org/wiki/Pearson's_correlation]
    • Pearson's correlation coefficient value is a measure of the correlation (linear dependence) between two variables, giving a value between +1 and -1 inclusive.
      • A value of 0 implies that there is no linear correlation between the variables.
      • A value near 0 implies that there is weak linear correlation between the variables.
      • A value near +1 or -1 implies that there is strong linear correlation between the variables.

  9. Receiver operating characteristic (ROC): Wikipedia [http://en.wikipedia.org/wiki/Receiver_operating_characteristic]

  10. Sensitivity and Specificity: Wikipedia [http://en.wikipedia.org/wiki/Sensitivity_and_specificity]

  11. Youden index (Y-index):

#424, YPRC/BPRC, Industry-University Research Center, Yonsei Univ., Seodaemun-gu, Seoul, Korea, 120-749
Tel: +82-2-2123-6626, Fax: +82-2-393-6589
2011-2023 (C) Yonsei Proteome Research Center